New solutions for Big Data Analysis and Visualization

From HPC to cloud-based solutions
Barcelona, February 2013

Nacho Medina
imedina@cipf.es
http://bioinfo.cipf.es/imedina
Head of the Computational Biology Unit, Computational Medicine Institute
Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain

Index
- Introduction
- Next steps
- Conclusions

Introduction: Big data in Biology, a new scenario
- Next-Generation Sequencing (NGS) technology is changing the way researchers perform experiments. Many new experiments are being conducted by sequencing: exome re-sequencing, RNA-seq, Meth-seq, ChIP-seq, ...
- NGS is allowing researchers to:
  - Find exome and genomic variants responsible for diseases
  - Study the whole transcriptome of a phenotype
  - Establish the methylation state of a condition
  - Locate DNA binding proteins
- But experiments have increased data size by 1000x compared with microarrays, i.e. from MB to hundreds of GB in transcriptomics
- Data processing and analysis are becoming a bottleneck and a nightmare: from days or weeks with microarrays to months with NGS, and it will get worse as more data become available

Introduction: Big data challenges and solutions
- Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications
- Big data is not a new scenario for other science areas: meteorology, physics, internet search, finance, business, ...
- What are the main Big data challenges? Curation, search, sharing, storage, analysis and visualization
- We need to study and use the new computational technologies available:
  - High-Performance Computing (HPC): multi-core CPUs, SSE/AVX, GPUs
  - Distributed computing: Apache Hadoop MapReduce, MPI
  - Distributed and NoSQL databases: Apache Cassandra, HBase, ...
  - Web apps: HTML5 (SVG, WebGL, ...), JavaScript, RESTful WS, ...
  - Clouds: Amazon AWS, Google Cloud, Microsoft Azure, ...
  - Science: machine learning, data mining, clustering, probabilistic graphical models, visualization, ...
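
As a hedged illustration of the MapReduce model named above, the following in-process Python sketch counts alignments per reference sequence. It is not Hadoop code: the function names (map_read, shuffle, reduce_counts) and the toy SAM lines are assumptions made for this example, and a real Hadoop job would run the map and reduce steps on many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_read(sam_line):
    """Map step: emit (reference name, 1) for one SAM alignment line."""
    fields = sam_line.split("\t")
    # Column 3 of a SAM record is the reference sequence name (RNAME).
    return [(fields[2], 1)]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one key."""
    return key, sum(values)


def count_reads_per_reference(sam_lines):
    mapped = chain.from_iterable(map_read(line) for line in sam_lines)
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


# Hypothetical toy input: three header-less SAM records.
toy_sam = [
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr2\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print(count_reads_per_reference(toy_sam))
```

Because each map call looks at a single record and each reduce call at a single key, both steps can be spread over many machines, which is what makes the model attractive for files of hundreds of GB.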

Introduction: Brief presentation of the Computational Biology Unit
- Part of the Institute of Computational Medicine, http://bioinfo.cipf.es
- Computational Biology Unit: http://bioinfo.cipf.es/compbio
- Big data in biology. New software and solutions are needed. Efficiency matters. This specialized unit brings together computer scientists (scientific programmers, HPC developers, and web and distributed developers) to produce efficient software
- Joaquin Dopazo's Group: more than 10 years of experience in genomic data analysis
  - Many methodologies and applications developed: GEPAS, Babelomics (FatiGO, SNOW, ...), NetworkMiner, ...
  - More than 10 NAR publications and many other papers, with thousands of users and more than 250 jobs executed daily
- Today's computational requirements in biology call for more advanced computing solutions, so the Computational Biology Unit has been created:
  - Genomic variant analysis, NGS data analysis, computational systems biology, databases
  - But also HPC computing, machine learning, cloud-based solutions, ...

Introduction: Motivation and goals
- Poor performance: software in bioinformatics is in general slow and, unlike in other science fields, not designed to run in clusters or clouds
- NGS pushes computing needs; new solutions and technologies are needed: more analysis, performance, scalability, cloud
- As a data analysis group we focus on analysis: It's analysis, stupid!
  - ~1-2 weeks for sample sequencing
  - ~2-4 weeks for preprocessing and read mapping; this needs to be improved, but improving it is not enough!
  - Several months of data analysis, because of the lack of software for biological analysis, e.g. which genes with a coverage >30x are differentially expressed with a p-value<0.05 in a dataset of 50 BAMs accounting for 500GB of data?
- Goals: to develop a new generation of bioinformatics software that is:
  - Very fast, with a low memory footprint and an integrated analysis toolkit
  - Able to handle and analyze TB of data, and to store data efficiently so it can be queried
  - Designed to distribute computation, not data, and to make use of cloud computing
- Software must exploit current hardware and be fast and efficient on a standard workstation, an HPC cluster or in a cloud environment

NGS pipeline, an HPC implementation
HPG suite (High-Performance Genomics). More info at: http://bioinfo.cipf.es/docs/compbio/projects/hpg/doku.php
- NGS sequencer
- Fastq file, up to hundreds of GB per run
  - QC and preprocessing: QC stats, filtering and preprocessing options
- HPG Aligner, short read aligner
  - Double mapping strategy: Burrows-Wheeler Transform (GPU, Nvidia CUDA) + Smith-Waterman (CPU, OpenMP+SSE/AVX)
- SAM/BAM file
  - QC and preprocessing: QC stats, filtering and preprocessing options
- Variant calling analysis
  - GATK and SAM mpileup, HPC implementation. Statistics, genomic tests
- Other analysis (HPC4Genomics consortium)
  - RNA-seq (mRNA sequenced), DNA assembly (not a real analysis), Meth-seq, Copy Number, Transcript isoform, ...
- VCF file
  - QC and preprocessing: QC stats, filtering and preprocessing options
- HPG Variant, variant analysis
  - Consequence type, GWAS, regulatory variants and systems biology information
- Variant VCF viewer
  - HTML5+SVG web-based viewer
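
The QC and filtering stage of the pipeline can be pictured with a minimal Python sketch that streams a Fastq file and drops reads with a low mean base quality. This is only an illustration, not the HPG suite's implementation: the function names, the Phred+33 quality encoding and the threshold of 20 are assumptions made for the example.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record Fastq stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Mean base quality, assuming Phred+33 encoded quality characters."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(handle, min_mean_quality=20.0):
    """Return the names of reads that pass the filter and the number dropped."""
    kept, dropped = [], 0
    for name, _sequence, quality in read_fastq(handle):
        if quality and mean_phred(quality) >= min_mean_quality:
            kept.append(name)
        else:
            dropped += 1
    return kept, dropped


# Hypothetical toy input: three reads with mean qualities 40, 0 and 20.
toy_fastq = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\n!!!!\n"
    "@read3\nACGT\n+\n5555\n"
)
kept, dropped = filter_reads(toy_fastq)
print(kept, dropped)
```

Reading one record at a time keeps memory use constant regardless of file size, which matters when a single run produces hundreds of GB.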

HPG Aligner, the first HPC DNA and RNA-seq aligner
- Current read aligner software tends to fall into one of these groups:
  - Very fast, but not too sensitive: no gaps, no indels, RNA-seq, ...
  - Slow, but very sensitive: up to 1 day per sample
- Current aligners perform badly with long reads
- Current read aligner algorithms:
  - Burrows-Wheeler Transform (BWT): very fast, but not sensitive
  - Smith-Waterman (SW): very sensitive, but very slow
- Hybrid approach (papers in preparation):
  - HPG-BWT implemented with OpenMP and Nvidia CUDA
  - HPG-SW implemented using OpenMP and SSE (~26x in 8-core)
- Architecture:
  - Fastq file -> HPG-BWT (OpenMP+GPU)
  - Reads with < 2 errors -> SAM file
  - Reads with >= 2 errors -> find CALs with HPG-BWT (seeds) -> HPG-SW (OpenMP+SSE) -> SAM file
  - HPG-SW: 10x compared to BFAST, ~50x with SSE
  - Testing BWT-GPU
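
The idea of combining a fast BWT search with a sensitive Smith-Waterman fallback can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the HPG Aligner code: here the BWT step only finds exact matches, the fallback aligns the read against the whole reference instead of seed-selected regions, and all function names and scoring values (match 2, mismatch -1, gap -2) are assumptions.

```python
def bwt_index(reference):
    """Build a naive suffix array, BWT string and C table for the reference."""
    text = reference + "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    c_table, total = {}, 0
    for char in sorted(set(text)):
        c_table[char] = total  # characters in text strictly smaller than char
        total += text.count(char)
    return suffix_array, bwt, c_table


def backward_search(read, suffix_array, bwt, c_table):
    """Return sorted reference positions where the read matches exactly."""
    low, high = 0, len(bwt)
    for char in reversed(read):
        if char not in c_table:
            return []
        low = c_table[char] + bwt[:low].count(char)
        high = c_table[char] + bwt[:high].count(char)
        if low >= high:
            return []
    return sorted(suffix_array[low:high])


def smith_waterman(read, reference, match=2, mismatch=-1, gap=-2):
    """Return (best local score, 1-based end position in the reference)."""
    previous = [0] * (len(reference) + 1)
    best, best_end = 0, 0
    for i in range(1, len(read) + 1):
        current = [0] * (len(reference) + 1)
        for j in range(1, len(reference) + 1):
            step = match if read[i - 1] == reference[j - 1] else mismatch
            current[j] = max(0, previous[j - 1] + step,
                             previous[j] + gap, current[j - 1] + gap)
            if current[j] > best:
                best, best_end = current[j], j
        previous = current
    return best, best_end


def align(read, reference, index):
    """Try the fast exact BWT search first, then fall back to Smith-Waterman."""
    hits = backward_search(read, *index)
    if hits:
        return "bwt", hits
    return "sw", smith_waterman(read, reference)


# Hypothetical toy reference and reads.
reference = "ACGTACGTTAGC"
index = bwt_index(reference)
exact = align("ACGT", reference, index)        # exact match, handled by BWT
inexact = align("ACGTTTGC", reference, index)  # one mismatch, handled by SW
print(exact)
print(inexact)
```

The split reflects the trade-off listed above: the index lookup costs time proportional to the read length, while the dynamic programming step costs time proportional to read length times reference length, so it pays to reserve it for the reads the fast path cannot place.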

HPG Aligner results
First results show an amazing performance and the best sensitivity

DNA, 2M simulated datasets

Program                  100nt %mapped  Time(min)  150nt %mapped  Time(min)  250nt %mapped  Time(min)
HPG Aligner (dna mode)   96.22%         1.26       96.98%         1.9        97.83%         3.7
BWA 0.6.2                93.58%         4.3        92.6%          6.3        98.0%          9.7
Bowtie 0.12.8            79.95%         1.8        60.11%         2.42       -              -
Bowtie2 2.0.0            94.71%         2.48       96.75%         3.25       98%            5.75

Notes:
- SE with 2M reads datasets, 5% random reads
- Tests in a 12-core machine, no GPU used
- INDELS supported
- GEM, Soap2 and BFAST outperformed
- Similar results are obtained with real datasets

RNA-seq, 1M simulated datasets

Program                  75nt %mapped   Time(min)  100nt %mapped  Time(min)  150nt %mapped  Time(min)
HPG Aligner (rna mode)   96.5%          1.38       97.1%          2.41       97.7%          5.91
TopHat 2.0.4             74.4%          15.6       62.1%          20.1       37.1%          36.2

10x faster!!

Notes:
- Max errors allowed
- Similar results are obtained with real datasets
- TopHat doubles disk space and has big memory needs
- Papers in preparation

HPG Aligner, RNA-seq results
- Comparative results with TopHat2 using both Bowtie and Bowtie2
- Hardware scalability tests

HPG Variant, a suite of tools for HPC-based genomic variant analysis
- Three tools are already implemented: vcf, gwas and effect. Implemented using OpenMP, Nvidia CUDA and MPI for large clusters
- VARIANT = VARIant ANalysis Tool
- VCF: C library and tool that allows large VCF files to be analyzed with a low memory footprint: stats, filter, split, merge, ... (paper in preparation)
  - Example: hpg-variant vcf stats vcf-file ceu.vcf
- GWAS: suite of tools for GWAS variant analysis (~Plink): association, TDT, Hardy-Weinberg, LOH, epistasis
  - Example: hpg-variant gwas tdt vcf-file tumor.vcf
- EFFECT: a cloud-based genomic variant effect predictor, available as a CLI and a web application (http://variant.bioinfo.cipf.es, published in NAR)
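
To show how VCF statistics can be computed with a low memory footprint, here is a minimal Python sketch that reads the file once, line by line, and keeps only counters. It does not reproduce the output or the options of hpg-variant; the function name, the counter names and the toy records are assumptions made for this example.

```python
import io
from collections import Counter


def vcf_stats(handle):
    """Stream a VCF once and accumulate counters without loading the file."""
    stats = Counter()
    per_chromosome = Counter()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header and empty lines
        chrom, _pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        per_chromosome[chrom] += 1
        stats["variants"] += 1
        for allele in alt.split(","):
            # Single-base REF and single-base ALT: count as an SNV.
            if len(ref) == 1 and allele in ("A", "C", "G", "T"):
                stats["snvs"] += 1
            else:
                stats["indels_or_other"] += 1
    return stats, per_chromosome


# Hypothetical toy VCF: one SNV, one deletion, one multi-allelic SNV site.
toy_vcf = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\trs1\tA\tG\t50\tPASS\t.\n"
    "1\t200\t.\tAT\tA\t40\tPASS\t.\n"
    "2\t300\t.\tC\tT,G\t60\tPASS\t.\n"
)
stats, per_chromosome = vcf_stats(toy_vcf)
print(dict(stats))
print(dict(per_chromosome))
```

Since only the counters stay in memory, the footprint does not grow with the number of variant lines, and independent chunks of the file could be counted in parallel and summed afterwards.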

CellBase, an integrative database and RESTful WS API
- Project: http://bioinfo.cipf.es/compbio/cellbase
- Wiki: http://docs.bioinfo.cipf.es/projects/cellbase/wiki
- Based on CellBase (published in NAR), a comprehensive integrative database and RESTful Web Services API, with more than 250GB of data and 90 SQL tables exported in TXT and JSON:
  - Core features: genes, transcripts, exons, cytobands, proteins (UniProt), ...
  - Variation: dbSNP and Ensembl SNPs, HapMap, 1000Genomes, Cosmic, ...
  - Functional: 40 OBO ontologies (Gene Ontology), Interpro, ...
  - Regulatory: TFBS, miRNA targets, conserved regions, ...
  - Systems biology: Interactome (IntAct), Reactome database, co-expressed genes, ...
- NoSQL-based version coming, to scale to TB
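
A RESTful API of this kind addresses each data set through a structured URL. The Python sketch below builds such a URL without making any network call. The host (example.org), the path layout, the parameter names and the example identifiers are all hypothetical and chosen only for illustration; they are not CellBase's documented interface, which is described in the project wiki.

```python
from urllib.parse import quote


def build_query_url(base_url, species, category, subcategory, ids, resource,
                    version="latest"):
    """Compose a resource-oriented URL from its parts (hypothetical layout)."""
    # Several identifiers are sent in one request as a comma-separated list.
    id_list = ",".join(quote(identifier, safe="") for identifier in ids)
    parts = [base_url.rstrip("/"), version, species, category, subcategory,
             id_list, resource]
    return "/".join(parts)


# Hypothetical host and path segments, used only to show the pattern.
url = build_query_url("http://example.org/cellbase/rest", "hsa", "feature",
                      "gene", ["BRCA2", "TP53"], "transcript")
print(url)
```

Encoding the query in the path keeps requests stateless and cacheable, so the same call works from a script, a browser or another web application.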

Genome Maps, an HTML5+SVG data visualization
- Genome-scale data visualization is an important part of data analysis: heavy data issue! Do not move data!
- Features of Genome Maps (www.genomemaps.org, under review):
  - First 100% HTML5 web based: HTML5+SVG (Google Maps inspired)
  - Always updated, no browser plugins or installation, USA Amazon AWS
  - Data taken from CellBase, remote NGS data, local files and DAS servers: genes, transcripts, exons, SNPs, TFBS, miRNA targets, ...
  - Other features: multi-species, API oriented, integrable, plugin framework, ...
- Project: http://bioinfo.cipf.es/compbio/genomemaps

Next steps: cloud-based and open solutions
- Cloud-based environment, integration ready, codename: GASC
  - Storage: efficient storage and retrieval of ~TB of data, transparent connection to other clouds such as Amazon AWS or Microsoft Azure
  - Analysis: many tools ready to use (aligners, GATK, ...), users can upload their own tools to extend functionality, SGE queue, ...
  - Search and access: data is indexed and can be queried efficiently, RESTful WS allow users to access data and analysis programmatically
  - Sharing: users can share their data and analysis, public and private data
  - Visualization: HTML5+SVG based web applications to visualize data
- Open development initiative
  - HPG project, CellBase, Genome Maps and GASC released as an open source development initiative
  - Source code controlled with Git, hosted freely on GitHub
  - Scientists are encouraged to collaborate and extend functionality; an HPC4G consortium of universities has already been created

Conclusion
- High-throughput technologies such as NGS are pushing Biology into Big Data
- Bioinformatics must learn how to deal with this huge amount of data
- This new scenario demands new solutions; new computational technologies must be used
- An open development model allows researchers to join forces and build better solutions