OpenCB development - A Big Data analytics and visualisation platform for the Omics revolution
Ignacio Medina, Paul Calleja, John Taylor (University of Cambridge, UIS, HPC Service (HPCS))
Abstract

The advent of Next Generation Sequencing (NGS) techniques in Computational Biology is revolutionising the practice of clinical medicine. These techniques produce vast amounts of data which, when combined with other clinical data, drastically increase the amount of data processing power required to infer clinical meaning from the data. In fact, today the bottleneck in whole genome sequencing is not the sequencing itself but the analysis of the data. As time moves on and more varieties of medical data become available, the problem grows, and if we are to realise the potential of the omics revolution to advance patient outcomes, new bioinformatics Big Data solutions are needed.

The open-source software for Computational Biology project (OpenCB) is just such a new solution. It is an open-source collaborative initiative which is developing High-Performance Computing (HPC) and Big Data software for storage, analysis, sharing and visualisation of big data in genomics. It takes emerging technologies from the web-scale and Cloud industries, such as Apache Hadoop and Spark, together with NoSQL databases. With this technology, significant functionality and performance have been achieved for a broad range of genomics analysis and visualisation tasks, providing the breakthrough in performance that is needed to truly take advantage of the current NGS revolution. Some of the projects developed offer a 3x speedup, such as the NGS read aligner, while others, such as the HBase implementation of OpenCGA Storage, show a 12x speedup when loading and indexing data. Depending on the tools used in different pipelines, the overall performance can be significantly improved.
This performance increase, the scalable architecture and the commodity solution space provided by Dell and Intel mean that the data analysis bottleneck has now been removed, unlocking the potential of large-scale next generation genomics data to drive improvements in patient health. The OpenCB platform consists of different projects that can be deployed independently, such as CellBase or HPG BigData, or integrated together in a major application called OpenCGA, which allows for efficient processing, indexing and visualisation of hundreds of TBytes of genomic data. OpenCB development is being led from the University of Cambridge with the active participation of several other leading bioinformatics institutions. The source code is open and freely available on GitHub.
Introduction

Over the last few years, biology has experienced a revolution as a result of the introduction of new DNA sequencing technology, known as Next-Generation Sequencing (NGS), that makes it possible to sequence the whole genomic DNA or RNA transcriptome in days instead of years. These recent high-throughput sequencers produce data at unprecedented rates and scale, and their decreasing costs and increasing throughput have popularised their use in many fields of the life sciences and in the clinic. Whole genome DNA resequencing allows us to find and catalogue genomic variants or mutations, which is helpful for discovering new disease-related mutations in clinical research. RNA sequencing (RNA-seq) has also become a crucial analysis for biological and clinical research, as it can help to determine and quantify the expression of genes, that is, the RNA transcripts that are activated or repressed in different diseases or phenotypes. It therefore provides an unbiased profile of a transcriptome that helps to understand the aetiology of a disease.

Current NGS technologies can sequence short DNA or RNA fragments, usually between 75 and 300 nucleotides (nts) long, and some new sequencers with longer fragment sizes are being developed. Primary data produced by NGS sequencers consists of hundreds of millions or even billions of short DNA or RNA fragments, which are called reads. The first step in NGS data processing in many comparative genomic experiments, including genome resequencing or RNA-seq, involves mapping the NGS reads onto a reference genome in order to locate the genomic coordinates from which these fragments originate. The mapping process is particularly difficult for RNA-seq, as genes in eukaryotes may be split into small regions called exons, which are separated by intron regions composed of thousands of nucleotides. Once the exons are transcribed to RNA, they are brought together to form the transcripts in a process called splicing.
Thus, when mapping reads from RNA transcripts onto a reference genome, it must be taken into account that these reads may contain a splice junction and therefore span different exons, so that in practice their parts may lie thousands of nucleotides apart on the genome. This situation is referred to as a gapped alignment. The mapping step is a very expensive process from the computational point of view. Furthermore, sensitivity is also a serious concern at this point, given that natural variation or sequencing errors may occur, yielding frequent mismatches between reads and the reference genome, which increases the computational complexity of the procedure. In order to ensure that such techniques become inexpensive, there is a need to optimise this processing to take advantage of the new processing techniques made available in modern compute architectures.

The HPG Aligner project from OpenCB provides a High-Performance Computing (HPC) implementation of DNA and RNA-seq NGS aligners. This implementation is based on multi-threading and SIMD vectorization targeting Intel Advanced Vector Extensions 2 (AVX2), an instruction set extension found in Intel processors, and aims to provide very high sensitivity and performance. HPG Aligner shows excellent sensitivity, even with a high rate of mutations, and remarkable parallel performance for both short and long DNA and RNA-seq reads. In HPG Aligner, reads are aligned using a combination of mapping with Suffix Arrays (SA) and local alignment with the Smith-Waterman algorithm (SWA).
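The combination of suffix-array mapping and Smith-Waterman local alignment described above can be illustrated with a minimal sketch. It is not HPG Aligner's implementation: the naive suffix array construction, the scoring values (match 2, mismatch -1, gap -2), the seed length and the function names are all assumptions chosen for readability, and there is no multi-threading or vectorization.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Naive construction, fine for a short illustrative reference.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, suffix_array, seed):
    """Return the sorted positions in text where seed occurs exactly."""
    k = len(seed)
    lo, hi = 0, len(suffix_array)
    # Binary search for the first suffix whose k-length prefix is >= seed.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + k] < seed:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # All suffixes starting with the seed are contiguous in the suffix array.
    while lo < len(suffix_array) and text[suffix_array[lo]:suffix_array[lo] + k] == seed:
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a against b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def align_read(reference, read, seed_len=6):
    """Seed with an exact suffix-array match, then score each candidate locally."""
    suffix_array = build_suffix_array(reference)
    results = []
    for pos in find_exact(reference, suffix_array, read[:seed_len]):
        window = reference[pos:pos + len(read)]
        results.append((pos, smith_waterman(read, window)))
    return results


# Hypothetical toy data: the read differs from reference[9:22] by one base.
reference = "ACGTTGCATGCAAGTCCGATAGCTA"
read = "GCAAGTCGGATAG"
print(align_read(reference, read))
```

The exact seed lookup narrows the search to a few candidate positions, so the quadratic Smith-Waterman step only runs on short windows and still tolerates the mismatches that variation or sequencing errors introduce.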
The advances in high-throughput technologies have also produced an unprecedented growth in the number and size of public biological databases and repositories, but unfortunately the current status of many of these repositories is far from optimal. For example, this information is spread out over many small databases implementing different standards. Furthermore, data size is increasingly becoming an obstacle when accessing or storing biological data. All these issues make it very difficult to extract and integrate information from different sources, to analyse experiments or to access and query this information in a programmatic way. The CellBase project from OpenCB addresses the growing need for integration by easing access to biological data. CellBase implements a set of RESTful web services that query a high-performance NoSQL database containing the most relevant biological data sources, accounting for several TBytes.

Another step of NGS data processing is variant calling. During this process, hundreds of millions of genomic variants are identified from the mapped reads. Current genomic and clinical projects are sequencing and calling variants from thousands of samples, producing hundreds of TBytes and making it extremely difficult, if not impossible, for researchers to store and analyse these big datasets with current bioinformatics tools. For example, the Genomics England (GEL) project aims to sequence 100,000 genomes from rare disease and cancer patients in the UK NHS, producing about 400 TBytes of compressed data. For the analysis of these data, both variants and samples need to be highly annotated. The OpenCGA project from OpenCB allows not only the storage and indexing of these big datasets using different NoSQL databases or big data frameworks such as Hadoop, but also variant annotation from CellBase and sample annotation using a built-in component called OpenCGA Catalog. Other OpenCB projects include HPG BigData and Genome Maps.
HPG BigData aims to provide a scalable solution for NGS in a Hadoop environment. The most common data processing and analysis tasks are being implemented using the Hadoop MapReduce and Spark execution engines. Genome Maps is a high-performance web-based genome browser that can render CellBase data and remote NGS experiments from OpenCGA.

Many of the existing software solutions in bioinformatics are not designed to work at these data volumes, which inhibits scalability, limits efficiency and consequently makes it very difficult for researchers to store, analyse, share and visualise data in a secure and collaborative manner. As time moves on this problem will get more severe, since the cost of computing is falling in line with Moore's Law [1] whereas the cost of sequencing a full genome is falling substantially quicker (see Figure 2). In order to close this growing Omics / Moore's Law gap, new Omics analytics platforms are required that combine new computational methods with new leading-edge hardware and software technologies.

[1] Interpreted here as the cost per operation halving every two years.
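The footnote's reading of Moore's Law, a cost that halves every two years, can be written as cost(t) = cost(0) * 0.5^(t/2). The sketch below works through that formula and shows how a gap factor against a faster-falling cost could be computed. The starting cost and the observed cost are hypothetical placeholders, not NHGRI figures, and the function names are illustrative.

```python
def moores_law_cost(initial_cost, years, halving_period_years=2.0):
    """Cost after `years` if it halves every `halving_period_years`."""
    return initial_cost * 0.5 ** (years / halving_period_years)


def gap_factor(moores_law_value, observed_value):
    """How many times cheaper the observed cost is than the Moore's Law projection."""
    return moores_law_value / observed_value


# Hypothetical example: a cost of 1000 units followed over 10 years.
projected = moores_law_cost(1000.0, 10)

# Hypothetical observed cost after the same 10 years (placeholder, not real data).
observed = 1.0

print(projected, gap_factor(projected, observed))
```

A gap factor above 1 means the observed cost has fallen faster than a two-year halving would predict, which is the divide the text describes between sequencing cost and computing cost.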
Figure 2 The increasing divide between Cost per Base of DNA vs. Moore's Law. Source: Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). Accessed Nov 2015.

Here we introduce the OpenCB initiative, which offers a new capability in genomics analytics: an extensible platform which aims to address these challenges and provide a complete stack for big data in genomics. OpenCB is implemented using state-of-the-art High-Performance Computing (HPC) and Big Data technologies from Dell and Intel, and is actively being developed at the University of Cambridge and other research institutes. With the OpenCB platform, additional challenges relating to the integration of diverse omics data can also be addressed, providing even more insight for clinicians and practitioners.

OpenCB Overview

The OpenCB initiative was launched in 2012 by Ignacio Medina, now Head of the Computational Biology Lab at UIS Cambridge, to provide biological and clinical researchers with a scalable, high-performance and high-quality software environment for genome-scale data analysis. CellBase is now used by many projects and research institutes. Currently OpenCB is being actively developed by more than 12 researchers from the University of Cambridge, EMBL-EBI and Genomics England, among others.

OpenCB consists of different projects that solve different problems in current genomics. Each of these projects constitutes a standalone solution that can be easily imported into existing projects. The projects have been designed to provide a scalable and high-performance solution for storing, processing, analysing, sharing and visualising big data in genomics and the clinic in a secure and efficient manner.
To achieve this, OpenCB uses the most advanced computing technologies in HPC (such as task- and data-parallelism with AVX2 or GPUs) and Big Data (Hadoop MapReduce, Spark) for data processing and analysis; NoSQL databases (MongoDB, HBase) for data indexing; and HTML5 for interactive data visualisation.
Fig 3 OpenCB architecture. The server side stores, indexes and executes all the analyses. Client-side HTML5 applications and the CLI use RESTful web services to interact with the server.

OpenCB consists of a number of projects, as listed below.

High-Performance Genomics (HPG)

HPG projects make use of standard HPC and big data technologies to provide a scalable and efficient solution for several genomic analyses. The main HPG subprojects are:

a) HPG Aligner is an ultra-fast and sensitive HPC NGS read aligner for DNA and RNA-seq. It combines advanced data structures and novel algorithms implemented with multi-threading and vectorization using AVX2. Work is currently under way at Cambridge to explore the Intel Xeon Phi Coprocessor as a platform.
b) HPG Variant is HPC software to process and analyse genomic variant data. Several algorithms have been developed and implemented.
c) HPG BigData is a Hadoop MapReduce and Spark implementation of several genomic tools and analyses for working with genome-scale data.
CellBase

CellBase constitutes the knowledge base for all OpenCB projects. CellBase is a high-performance and scalable NoSQL database that integrates the most relevant biological repositories. Among the most significant data are genomic features, proteins, gene expression, regulatory elements, functional annotation, genomic variation and systems biology information. Its knowledge base relies on the most relevant repositories, such as Ensembl, UniProt, ClinVar, COSMIC and IntAct, among others. CellBase also implements a fast built-in variant annotation component that provides an Ensembl VEP compatible annotation. All data is available through a command line or through RESTful web services.

OpenCGA

OpenCGA provides a scalable and high-performance platform for big data analysis and visualisation in a shared environment. OpenCGA integrates some of the OpenCB projects and implements, in addition, other components:

a) A Storage Engine framework to store and index alignments and genomic variants in different NoSQL databases such as MongoDB or Hadoop HBase. The current implementation can efficiently store thousands of gVCF files while remaining responsive when querying data.
b) A Catalog which keeps track of users, projects, files, sample annotations, etc. and also provides authentication and authorization capabilities.
c) An Analysis engine to execute genomic analyses in a traditional HPC cluster or in Hadoop.

OpenCGA provides a command line and RESTful web services to manage and query all the data.

Visualisation with Genome Maps and CellMaps

Finally, a high-performance HTML5 web-based genome browser called Genome Maps and a systems biology tool called CellMaps provide a Big Data scientific visualisation capability to OpenCB. Genome Maps can interactively display CellBase data and OpenCGA indexed data such as BAM and VCF files. Users can also easily extend Genome Maps to display their own data and formats.
In addition, OpenCB projects are compliant with the new GA4GH data models and formats.
Fig 4 An overview of main OpenCB components. Some OpenCB projects and tools are shown in ovals.

Who is using it

Many projects within research institutes around the world are using OpenCB technologies, demonstrating the success of this initiative. For instance, ICGC, EMBL-EBI and Genomics England are using and contributing to some of these projects. The source code is open and freely available on GitHub.

Benefits of moving to HPC and Hadoop

The DNA and RNA-seq HPG Aligners show very high performance while having the highest sensitivity in most scenarios, such as short and longer reads or different mutation rates. Results are especially good in the case of RNA-seq, where HPG Aligner is the fastest and most sensitive aligner at different read lengths and mutation rates when compared to reference RNA-seq aligners (Fig 5).
Fig 5. Wall clock times for different simulated read lengths ranging from 50 to 400 nucleotides. Three different mutation rates were studied for four different RNA-seq aligners. As can be seen, HPG Aligner is the fastest and most sensitive in all scenarios.

HPG BigData implements the most common NGS data processing and analysis tools. Some benchmarks have been run on a modest 8-node Hadoop cluster from Dell (see Appendix I) running Cloudera. Simulating a big variant data set of 500 million variants and 100 samples takes 40 minutes and generates a few TBytes of data. Loading this variant data set into Hadoop HBase reaches more than 500,000 variants/second, a 12x speedup over a similar implementation using other NoSQL databases such as MongoDB. Many of the HBase queries implemented run in sub-second time when using the row key index and in a few seconds for table scans.
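The difference between a row key lookup and a table scan can be illustrated with a small sketch in plain Python. The key layout below (zero-padded chromosome and position, then reference and alternate alleles) is a hypothetical design chosen for illustration, not the schema used by OpenCGA Storage, and a sorted list with binary search stands in for the ordered row keys of an HBase table.

```python
from bisect import bisect_left


def row_key(chrom, pos, ref, alt):
    """Build a hypothetical row key; zero padding makes string order follow position."""
    return f"{chrom.zfill(2)}:{pos:010d}:{ref}:{alt}"


def region_scan(sorted_keys, chrom, start, end):
    """Return the keys of variants with start <= pos <= end using key order only."""
    first = bisect_left(sorted_keys, f"{chrom.zfill(2)}:{start:010d}")
    last = bisect_left(sorted_keys, f"{chrom.zfill(2)}:{end + 1:010d}")
    # Only the slice between the two boundaries is touched, not the whole table.
    return sorted_keys[first:last]


# Hypothetical variants: (chromosome, position, reference allele, alternate allele).
variants = [("1", 100, "A", "T"), ("1", 250, "G", "C"),
            ("2", 100, "C", "G"), ("1", 180, "T", "A")]
keys = sorted(row_key(*v) for v in variants)

print(region_scan(keys, "1", 100, 200))
```

Because the keys are stored in sorted order, a region query becomes two boundary lookups and a short contiguous read, whereas a filter on anything not encoded in the key has to visit every row, which is consistent with the sub-second versus few-seconds behaviour reported above.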
The current version of CellBase is based on a MongoDB NoSQL database and contains eleven collections accounting for about 1 TByte of data. Some of these collections have more than 100 million complex documents or a few billion data values. All data models and collections have been designed to offer low-latency and high-performance query execution. Most of the queries, even complex aggregations, run in milliseconds. A main component of CellBase is the built-in variant annotation tool, which has been implemented using a multithreaded and asynchronous approach to speed up performance. The current version can annotate more than 1,200 genomic variants per second when connecting directly to MongoDB, and about 800 variants/second when using the RESTful web services.

OpenCGA integrates several projects from OpenCB and aims to provide a complete solution for genomic big data analysis and visualisation. The OpenCGA Catalog has been implemented using MongoDB and can load millions of file metadata records and sample annotations in just a few minutes, with complex queries and aggregations running in milliseconds. The built-in OpenCGA Storage framework can normalize and transform data into binary formats using a multithreaded implementation. For example, processing a gVCF file with about 400 million records takes less than 2 minutes on a standard server. Processed data can be loaded and indexed in MongoDB or HBase NoSQL databases. HBase is about 12x faster than MongoDB, reaching more than 500,000 variants loaded and indexed per second on an 8-node Hadoop cluster. Complex queries and aggregations execute in seconds, outperforming the other existing solutions tested.

Future Developments

In order to ensure OpenCB remains a cutting-edge platform for large-scale genomics processing, access to state-of-the-art technologies is critical.
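The throughput figures quoted above can be turned into wall-clock estimates with simple arithmetic. The sketch below uses only rates stated in the text; the 500 million variant data set size is the simulated benchmark set, the MongoDB loading rate is merely implied by the 12x ratio rather than reported directly, and since most rates are given as "more than", the resulting times are upper bounds.

```python
def seconds_to_process(n_items, items_per_second):
    """Wall-clock seconds needed to process n_items at a constant rate."""
    return n_items / items_per_second


N_VARIANTS = 500_000_000          # simulated benchmark data set
HBASE_LOAD_RATE = 500_000         # variants/second loaded and indexed in HBase
HBASE_SPEEDUP = 12                # reported HBase vs MongoDB ratio
ANNOTATION_RATE_DIRECT = 1_200    # variants/second, direct MongoDB connection
ANNOTATION_RATE_REST = 800        # variants/second, RESTful web services

# Loading rate implied for the MongoDB implementation (derived, not reported).
implied_mongodb_rate = HBASE_LOAD_RATE / HBASE_SPEEDUP

hbase_load_seconds = seconds_to_process(N_VARIANTS, HBASE_LOAD_RATE)
mongodb_load_seconds = seconds_to_process(N_VARIANTS, implied_mongodb_rate)
annotation_days_direct = seconds_to_process(N_VARIANTS, ANNOTATION_RATE_DIRECT) / 86_400
annotation_days_rest = seconds_to_process(N_VARIANTS, ANNOTATION_RATE_REST) / 86_400

print(f"HBase load: {hbase_load_seconds:.0f} s")
print(f"MongoDB load (implied): {mongodb_load_seconds:.0f} s")
print(f"Annotation, direct: {annotation_days_direct:.1f} days")
print(f"Annotation, REST: {annotation_days_rest:.1f} days")
```

At 500,000 variants/second the full data set loads in at most 1,000 seconds, roughly 17 minutes, while single-stream annotation of the same set at the quoted rates would take days, which shows that annotation rather than loading dominates at this scale.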
At Cambridge, the OpenCB team is exploring future processing capabilities offered by Intel Xeon Phi Coprocessors, FPGA and GPU technologies. In addition, new non-volatile RAM solutions are being investigated as a means to provide larger in-memory processing capability, as well as enhanced MapReduce capability to ensure that scalability is maintained. These technologies will be coupled with enhanced statistical analysis techniques, such as those provided by Dell Statistica, to provide practitioners with even more insight into omics. The University of Cambridge, under the auspices of the UIS, is working with a number of industry partners in this respect. In particular, the University will work with Dell and Intel to increase the performance of the solution and also to shrink-wrap the solution onto a well-tested hardware platform to produce a turnkey Next Gen Sequence Analytics Appliance, extending the specification detailed below in Appendix I.
Appendix I - Big Data Platform

The current OpenCB Development Platform consists of the following:

MongoDB solution

A replica set of three servers connected to a storage solution. Specifications:

Server      Function              Specification
Server 1-3  Database replica set  Dell PowerEdge R630, 2x Intel Xeon Processor E5-2560v3, 256 GBytes RDIMM
Storage     Storage               Dell MD, x4TB SAS 7.2K RPM HDD, 5x800GB SSD Read Intensive

Hadoop

A development cluster consisting of 8 nodes. Specifications:

Server      Function              Specification
Master      Cloudera Manager      Dell PowerEdge R720, 2x Intel Xeon Processor E5-2560v2, 128 GBytes RDIMM, 6x600GB SAS 15K RPM HDD
Server 1-8  Data node             Dell PowerEdge R720xd, 2x Intel Xeon Processor E5-2667v2, 64 GBytes RDIMM, 24x500GB SAS 7.2K RPM HDD
Network                           Dell Force10 10GbE
Software                          Cloudera, Dell Statistica, Spark 1.3, Java 1.8.0_11
More informationAttacking the Biobank Bottleneck
Attacking the Biobank Bottleneck Professor Jan-Eric Litton BBMRI-ERIC BBMRI-ERIC Big Data meets research biobanking Big data is high-volume, high-velocity and highvariety information assets that demand
More informationAccelerate > Converged Storage Infrastructure. DDN Case Study. ddn.com. 2013 DataDirect Networks. All Rights Reserved
DDN Case Study Accelerate > Converged Storage Infrastructure 2013 DataDirect Networks. All Rights Reserved The University of Florida s (ICBR) offers access to cutting-edge technologies designed to enable
More informationGeneProf and the new GeneProf Web Services
GeneProf and the new GeneProf Web Services Florian Halbritter florian.halbritter@ed.ac.uk Stem Cell Bioinformatics Group (Simon R. Tomlinson) simon.tomlinson@ed.ac.uk December 10, 2012 Florian Halbritter
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
More informationLarge-Scale Data Processing
Large-Scale Data Processing Eiko Yoneki eiko.yoneki@cl.cam.ac.uk http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase
More informationMaximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
More informationG E N OM I C S S E RV I C ES
GENOMICS SERVICES THE NEW YORK GENOME CENTER NYGC is an independent non-profit implementing advanced genomic research to improve diagnosis and treatment of serious diseases. capabilities. N E X T- G E
More informationUsing In-Memory Computing to Simplify Big Data Analytics
SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed
More informationOn a Hadoop-based Analytics Service System
Int. J. Advance Soft Compu. Appl, Vol. 7, No. 1, March 2015 ISSN 2074-8523 On a Hadoop-based Analytics Service System Mikyoung Lee, Hanmin Jung, and Minhee Cho Korea Institute of Science and Technology
More informationAccelerating Enterprise Big Data Success. Tim Stevens, VP of Business and Corporate Development Cloudera
Accelerating Enterprise Big Data Success Tim Stevens, VP of Business and Corporate Development Cloudera 1 Big Opportunity: Extract value from data Revenue Growth x = 50 Billion 35 ZB Cost Savings Margin
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationBig Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
More informationOracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
More informationTowards Integrating the Detection of Genetic Variants into an In-Memory Database
Towards Integrating the Detection of Genetic Variants into an 2nd International Workshop on Big Data in Bioinformatics and Healthcare Oct 27, 2014 Motivation Genome Data Analysis Process DNA Sample Base
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationEfficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing James D. Jackson Philip J. Hatcher Department of Computer Science Kingsbury Hall University of New Hampshire Durham,
More informationHadoop. Bioinformatics Big Data
Hadoop Bioinformatics Big Data Paolo D Onorio De Meo Mattia D Antonio p.donoriodemeo@cineca.it m.dantonio@cineca.it Big Data Too much information! Big Data Explosive data growth proliferation of data capture
More informationProcessing NGS Data with Hadoop-BAM and SeqPig
Processing NGS Data with Hadoop-BAM and SeqPig Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3
More informationBig Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect
Big Data & QlikView Democratizing Big Data Analytics David Freriks Principal Solution Architect TDWI Vancouver Agenda What really is Big Data? How do we separate hype from reality? How does that relate
More informationBenchmarking Couchbase Server for Interactive Applications. By Alexey Diomin and Kirill Grigorchuk
Benchmarking Couchbase Server for Interactive Applications By Alexey Diomin and Kirill Grigorchuk Contents 1. Introduction... 3 2. A brief overview of Cassandra, MongoDB, and Couchbase... 3 3. Key criteria
More informationHow In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time
SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first
More informationCluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer
Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Stan Posey, MSc and Bill Loewe, PhD Panasas Inc., Fremont, CA, USA Paul Calleja, PhD University of Cambridge,
More informationBenchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
More informationAccelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
More informationData deluge (and it s applications) Gianluigi Zanetti. Data deluge. (and its applications) Gianluigi Zanetti
Data deluge (and its applications) Prologue Data is becoming cheaper and cheaper to produce and store Driving mechanism is parallelism on sensors, storage, computing Data directly produced are complex
More informationHadoopTM Analytics DDN
DDN Solution Brief Accelerate> HadoopTM Analytics with the SFA Big Data Platform Organizations that need to extract value from all data can leverage the award winning SFA platform to really accelerate
More informationConstructing a Data Lake: Hadoop and Oracle Database United!
Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.
More informationOverview of HPC Resources at Vanderbilt
Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources
More informationPARALLELS CLOUD SERVER
PARALLELS CLOUD SERVER Performance and Scalability 1 Table of Contents Executive Summary... Error! Bookmark not defined. LAMP Stack Performance Evaluation... Error! Bookmark not defined. Background...
More informationDriving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA
WHITE PAPER April 2014 Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA Executive Summary...1 Background...2 File Systems Architecture...2 Network Architecture...3 IBM BigInsights...5
More informationNavigating the Big Data infrastructure layer Helena Schwenk
mwd a d v i s o r s Navigating the Big Data infrastructure layer Helena Schwenk A special report prepared for Actuate May 2013 This report is the second in a series of four and focuses principally on explaining
More informationOracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
More informationSAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES
SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES AWS GLOBAL INFRASTRUCTURE 10 Regions 25 Availability Zones 51 Edge locations WHAT
More informationHigh Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates
High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of
More informationAn Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
More informationSAP HANA. SAP HANA Performance Efficient Speed and Scale-Out for Real-Time Business Intelligence
SAP HANA SAP HANA Performance Efficient Speed and Scale-Out for Real-Time Business Intelligence SAP HANA Performance Table of Contents 3 Introduction 4 The Test Environment Database Schema Test Data System
More informationIntegrated Rule-based Data Management System for Genome Sequencing Data
Integrated Rule-based Data Management System for Genome Sequencing Data A Research Data Management (RDM) Green Shoots Pilots Project Report by Michael Mueller, Simon Burbidge, Steven Lawlor and Jorge Ferrer
More informationArchitectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
More informationGo where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe
Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe Go where the biology takes you. To published results faster With proven scalability To the forefront of discovery To limitless applications
More informationHIV NOMOGRAM USING BIG DATA ANALYTICS
HIV NOMOGRAM USING BIG DATA ANALYTICS S.Avudaiselvi and P.Tamizhchelvi Student Of Ayya Nadar Janaki Ammal College (Sivakasi) Head Of The Department Of Computer Science, Ayya Nadar Janaki Ammal College
More informationLaurence Liew General Manager, APAC. Economics Is Driving Big Data Analytics to the Cloud
Laurence Liew General Manager, APAC Economics Is Driving Big Data Analytics to the Cloud Big Data 101 The Analytics Stack Economics of Big Data Convergence of the 3 forces Big Data Analytics in the Cloud
More informationHadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
More informationHigh Performance Computing and Big Data: The coming wave.
High Performance Computing and Big Data: The coming wave. 1 In science and engineering, in order to compete, you must compute Today, the toughest challenges, and greatest opportunities, require computation
More informationTableau Server 7.0 scalability
Tableau Server 7.0 scalability February 2012 p2 Executive summary In January 2012, we performed scalability tests on Tableau Server to help our customers plan for large deployments. We tested three different
More informationHow To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5
Big Data Visualization using Apache Spark and Zeppelin Prajod Vettiyattil, Software Architect, Wipro Agenda Big Data and Ecosystem tools Apache Spark Apache Zeppelin Data Visualization Combining Spark
More informationWHITEPAPER. A Technical Perspective on the Talena Data Availability Management Solution
WHITEPAPER A Technical Perspective on the Talena Data Availability Management Solution BIG DATA TECHNOLOGY LANDSCAPE Over the past decade, the emergence of social media, mobile, and cloud technologies
More informationEuro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences
Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences WP11 Data Storage and Analysis Task 11.1 Coordination Deliverable 11.3 Selected Standards
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationhigh-performance computing so you can move your enterprise forward
Whether targeted to HPC or embedded applications, Pico Computing s modular and highly-scalable architecture, based on Field Programmable Gate Array (FPGA) technologies, brings orders-of-magnitude performance
More informationFederated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian Tzolov @christzolov
Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA by Christian Tzolov @christzolov Whoami Christian Tzolov Technical Architect at Pivotal, BigData, Hadoop, SpringXD,
More information