HPC pipeline and cloud-based solutions for Next Generation Sequencing data analysis




HPC pipeline and cloud-based solutions for Next Generation Sequencing data analysis
HPC4NGS 2012, Valencia
Ignacio Medina (imedina@cipf.es)
Scientific Computing Unit, Bioinformatics and Genomics Department
Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain

Index
- Introduction
- High Performance Computing (HPC)
- Distributed and cloud computing
- NGS data analysis pipeline
- Next steps
- Conclusions

Introduction: A new scenario in Biology
Next Generation Sequencing (NGS) technology is changing the way researchers perform experiments. Many new experiments are being conducted by sequencing: exome re-sequencing, RNA-seq, Meth-seq, ChIP-seq, ...
NGS is allowing researchers to:
- Find exome variants responsible for diseases
- Study the whole transcriptome of a phenotype
- Establish the methylation state of a condition
- Locate DNA-binding proteins
Fantastic! But these new experiments have increased data size by 1000x compared with microarrays, i.e. from MB to GB in expression analysis.
Data processing and analysis are becoming a bottleneck and a nightmare, going from days or weeks with microarrays to months with NGS, and it will get worse as more data become available.

Introduction: The NGS data
NGS throughput is amazing: 30x human genomes in less than a week, and it's getting faster! A single run can generate hundreds of GB of data; a typical NGS experiment today can produce 1 TB of data in 1 week. In just 10 years!
The good and bad news: prices are falling quickly!
The good news: parallelization is possible! Much of the analysis is short-read independent.
This new scenario demands more advanced software solutions that exploit current computer hardware properly. We need new technologies such as HPC or cloud-based applications.

Introduction: Current computer hardware
Multi-core CPUs:
- Currently: quad-core and hexa-core processors, with hyper-threading
- Coming soon: Intel MIC (Many Integrated Core) architecture with more than 50 processing cores
GPU computing solutions:
- Desktop: Nvidia GTX580, Fermi architecture (512 cores performing 1.5 TFlops)
- HPC: Nvidia Tesla M2090, Fermi architecture (6 GB memory, 512 cores, 1.3 TFlops SP, 665 GFlops DP, 177 GB/sec memory bandwidth)
AVX, a new version of SSE:
- SIMD extended to 256 bits, with a new instruction set and registers
- Designed to support 512- or 1024-bit SIMD registers
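As a minimal illustration of the SIMD idea above (my own sketch, not code from the talk), the C function below sums an array of quality scores eight floats at a time with 256-bit AVX intrinsics; representing qualities as floats is an assumption made for clarity. Compile with gcc -mavx.

#include <immintrin.h>  /* AVX intrinsics */
#include <stddef.h>

/* Sum quality scores 8 floats per iteration using 256-bit AVX registers. */
float sum_qualities(const float *qual, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(qual + i));
    float lane[8];
    _mm256_storeu_ps(lane, acc);
    float sum = lane[0] + lane[1] + lane[2] + lane[3]
              + lane[4] + lane[5] + lane[6] + lane[7];
    for (; i < n; i++)  /* scalar tail for the remaining elements */
        sum += qual[i];
    return sum;
}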

Introduction: Poor software performance
Software in Bioinformatics is in general slow and not designed to run on clusters or clouds; even new NGS software is still being developed in Perl, R and Java. (Disclaimer: I usually code in Java, R and Perl, but they are not recommendable for NGS data analysis.)
New NGS technologies push computing needs; new solutions and technologies become necessary: more analysis, performance, scalability, cloud.
Computer performance vs data analysis:
- As a computer scientist I want high-performance software: fast, very fast! Speed matters, and so does a low memory footprint! Scalable solutions: clusters, cloud.
- As a biologist I need useful data analysis tools: efficient and sensitive, able to extract relevant information.
It's not a choice! We need both of them, e.g. Google Search.

Introduction: Motivation and goals
As a data analysis group we focus on analysis: it's the analysis, stupid!
- ~1-2 weeks of sample sequencing
- ~2-4 weeks of preprocessing and read mapping; improving this is necessary, but not enough!
- Several months of data analysis, with a lack of software for biological analysis, e.g.: which genes with coverage >30x are differentially expressed, with a p-value < 0.05, in a dataset of 50 BAMs accounting for 500 GB of data?
Goal: to develop a new generation of bioinformatics software that is:
- Fast, actually very fast, with a low memory consumption!
- Able to handle and analyze TB of data, storing data efficiently for querying
- Able to distribute computation, not data; focused on and useful in analysis!
Software must exploit current hardware and be fast and efficient on a single computer or on an HPC cluster.

High Performance Computing (HPC): Technologies overview
- Streaming SIMD Extensions (SSE): SIMD instruction set extension to the x86 architecture
- AVX (Advanced Vector Extensions): an advanced version of SSE; SIMD registers extended from 128 bits to 256 bits, with new instructions
- OpenMP: multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures
- Message Passing Interface (MPI): the de facto standard for communication in a parallel program running on a distributed-memory system; a flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer
- GPU computing frameworks: Nvidia CUDA, Khronos OpenCL, OpenACC (in development)
- FPGAs: integrated circuits configurable by the programmer

High Performance Computing (HPC): Software architecture
Common mistakes of new HPC developers:
- Not all algorithms or problems are suitable for GPUs
- Multi-core CPUs must also be exploited, including hyper-threading
- SSE/AVX SIMD instructions can also improve algorithms, e.g. a 6-8x speed-up of Smith-Waterman (Farrar, 2007)
Hybrid approach, heterogeneous HPC computation in shared memory:
- Single-node execution: CPU (OpenMP + SSE/AVX) + GPU (Nvidia CUDA)
- Cluster execution, for hundreds of samples and TB of data: MPI or Hadoop MapReduce
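As a toy sketch of the CPU side of this hybrid approach (my own illustration, not the talk's code), the loop below distributes a per-read quality filter across all cores with OpenMP; the read record and threshold are hypothetical. Compile with gcc -fopenmp.

#include <stddef.h>

typedef struct { float mean_quality; int passed; } read_t;  /* hypothetical read record */

/* Flag reads whose mean quality falls below a threshold; iterations are
 * independent, so OpenMP can split them statically across threads. */
void filter_reads(read_t *reads, size_t n, float min_quality) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        reads[i].passed = (reads[i].mean_quality >= min_quality);
}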

Distributed and cloud computing: Apache Hadoop framework
Apache Hadoop (http://hadoop.apache.org/) is a Java framework that allows the distributed processing of large data sets across clusters of computers using a simple programming model. The project develops open-source software for reliable, scalable, distributed computing.
Apache Hadoop framework overview (architecture diagram).

Distributed and cloud computing: Apache Hadoop framework
MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Who is using it in Bioinformatics? GATK, Crossbow, 1000 Genomes (http://aws.amazon.com/1000genomes/)... and not many more, yet.
HBase is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, a column key and a timestamp; each value in the map is an uninterpreted array of bytes. There is no SQL schema; a varying number of columns per row is normal, which makes it very flexible.
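Hadoop mappers and reducers are usually written in Java, but Hadoop Streaming lets any executable play those roles, which suits C-centric HPC code. As a hypothetical sketch (not a tool from the talk), the C program below could run as a streaming mapper that emits one count per mapped chromosome from SAM records on stdin; a reducer would then sum the counts per key.

#include <stdio.h>
#include <string.h>

/* Hadoop Streaming mapper: reads SAM lines from stdin and emits
 * "chromosome<TAB>1" pairs; the framework groups the pairs by key. */
int main(void) {
    char line[65536];
    while (fgets(line, sizeof line, stdin)) {
        if (line[0] == '@') continue;               /* skip SAM header lines */
        char *save = NULL;
        strtok_r(line, "\t", &save);                /* QNAME */
        strtok_r(NULL, "\t", &save);                /* FLAG  */
        char *rname = strtok_r(NULL, "\t", &save);  /* RNAME = chromosome */
        if (rname && strcmp(rname, "*") != 0)
            printf("%s\t1\n", rname);
    }
    return 0;
}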

Distributed and cloud computing: Amazon AWS (http://aws.amazon.com/)
In 2006, Amazon Web Services (AWS) began offering IT infrastructure services to businesses in the form of web services -- now commonly known as cloud computing. One of the key benefits of cloud computing is the opportunity to replace up-front capital infrastructure expenses with low variable costs that scale with your business.

NGS data analysis pipeline: Global overview
HPG suite (High Performance Genomics); more info: http://docs.bioinfo.cipf.es/projects/ngsgpu-pipeline/wiki
- NGS sequencer → Fastq file, up to hundreds of GB per run
- QC and preprocessing: QC stats, filtering and preprocessing options
- Short read aligner, with a double mapping strategy: Burrows-Wheeler Transform (GPU, Nvidia CUDA) + Smith-Waterman (CPU, OpenMP + SSE/AVX) → SAM/BAM file
- BAM QC and preprocessing: QC stats, filtering and preprocessing options
- Variant calling analysis: GATK and SAMtools mpileup HPC implementations, plus statistics and genomic tests → VCF file, with an HTML5+SVG web-based VCF viewer
- VCF QC and preprocessing, then variant effect analysis: consequence type, regulatory variants and systems biology information
- Other analyses (HPC for Genomics Consortium): RNA-seq (mRNA sequencing), DNA assembly (not a real analysis), Meth-seq, copy number, transcript isoforms, ...

NGS data analysis pipeline: HPG-Fastq tool
Features:
- QC stats: read length and quality averages, nucleotide percentages, GC content, quality per nucleotide, k-mers, ...
- Filtering: by length range, by quality range, number of nucleotides out of range, N's, ...
- Preprocessing: quality-based end trimming, cloning vector removal, ...
Hybrid approach implementation: Nvidia CUDA + OpenMP
Example: hpg-fastq qc fq fastq_file.fastq -t -v
Paired-end Fastq files of 15 GB each are QC'ed and preprocessed in a single execution in less than 4 minutes on a standard workstation: an Intel quad-core with 4 GB RAM and an Nvidia GTX580.

NGS data analysis pipeline: HPG-aligner, an HPC DNA short read aligner
Current read aligner software tends to fit in one of these groups:
- Very fast, but not too sensitive: no gaps, no indels, no RNA-seq, ...
- Slow, but very sensitive: up to 1 day per sample
Read aligner algorithms:
- Burrows-Wheeler Transform (BWT): very fast, but not sensitive
- Smith-Waterman (SW): very sensitive, but very slow
Hybrid approach (paper submitted):
- BWT with up to 1 error, implemented in Nvidia CUDA
- SW, implemented using OpenMP and SSE/AVX
Architecture: Fastq file → BWT-GPU for reads with < 2 errors; for reads with >= 2 errors, find CALs (seeds) with BWT-GPU, then SW with OpenMP/SSE → SAM file
Testing: ~10x speed-up compared to BFAST, ~50x with SSE
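For readers unfamiliar with the SW stage, here is a minimal scalar Smith-Waterman scoring routine in C (a plain reference version of the classic algorithm, not the talk's implementation, which vectorizes the same recurrence with SSE/AVX; the match/mismatch/gap scores are arbitrary assumptions).

#include <string.h>

/* Scalar Smith-Waterman: returns the best local alignment score of s1 vs s2
 * with a linear gap penalty. Illustrative scores; sequences up to MAXLEN. */
int smith_waterman(const char *s1, const char *s2) {
    enum { MATCH = 2, MISMATCH = -1, GAP = -2, MAXLEN = 1024 };
    static int H[MAXLEN + 1][MAXLEN + 1];  /* DP matrix; row 0 and column 0 stay 0 */
    int n = (int)strlen(s1), m = (int)strlen(s2), best = 0;
    if (n > MAXLEN || m > MAXLEN) return -1;
    memset(H, 0, sizeof H);
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int h = H[i-1][j-1] + (s1[i-1] == s2[j-1] ? MATCH : MISMATCH);
            if (H[i-1][j] + GAP > h) h = H[i-1][j] + GAP;   /* gap in s2 */
            if (H[i][j-1] + GAP > h) h = H[i][j-1] + GAP;   /* gap in s1 */
            if (h < 0) h = 0;                               /* local alignment floor */
            H[i][j] = h;
            if (h > best) best = h;
        }
    }
    return best;
}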

NGS data analysis pipeline: HPG-BAM tool
Features:
- BAM sort implemented in Nvidia CUDA (2x compared to SAMtools)
- Quality control: % of reads with N errors, % of reads with multiple mappings, strand bias, paired-end insert size, ...
- Coverage implemented in OpenMP (2x faster): by gene, by region, ...
- Filtering: by number of errors, number of hits, ...
- Comparator: stats, intersection, ...
On a standard workstation it works at the minute scale. Not a huge performance improvement, but many new analysis features.

NGS data analysis pipeline: HPG-variant, a suite of tools for variant analysis
- HPG-VCF tools: to analyze large VCF files with a low memory footprint: stats, filter, split, merge, ...
- HPG-Variant: a suite of tools for genome-scale variant analysis
- Variant calling is a critical step; two HPC implementations are being developed: GATK and SAMtools mpileup
- Genomic analysis: association, TDT, Hardy-Weinberg, LOH, ... (~PLINK)
- VCF web viewer and the VARIANT software (see later), with more coming

NGS data analysis pipeline: CellBase, an integrative database and RESTful WS API
Based on CellBase (accepted in NAR), a comprehensive integrative database and RESTful Web Services API: more than 120 GB of data and 100 SQL tables, exported in TXT and JSON:
- Core features: genes, transcripts, exons, cytobands, proteins (UniProt), ...
- Variation: dbSNP and Ensembl SNPs, HapMap, 1000 Genomes, COSMIC, ...
- Functional: 40 OBO ontologies (Gene Ontology), InterPro, ...
- Regulatory: TFBS, miRNA targets, conserved regions, ...
- Systems biology: interactome (IntAct), the Reactome database, co-expressed genes, ...
http://docs.bioinfo.cipf.es/projects/cellbase/wiki
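Querying a RESTful service like this needs no special client. As a hedged sketch in C, the program below fetches a JSON document with libcurl; the endpoint path is a made-up placeholder, not a documented CellBase resource (see the wiki above for the real API paths). Link with -lcurl.

#include <stdio.h>
#include <curl/curl.h>

/* Fetch a JSON document from a RESTful web service with libcurl.
 * The URL is a hypothetical placeholder, not a real CellBase endpoint. */
int main(void) {
    CURL *curl = curl_easy_init();
    if (!curl) return 1;
    curl_easy_setopt(curl, CURLOPT_URL,
                     "http://example.org/cellbase/api/gene/BRCA2/info");
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    /* With no write callback set, libcurl prints the response body to stdout. */
    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));
    curl_easy_cleanup(curl);
    return res == CURLE_OK ? 0 : 1;
}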

NGS data analysis pipeline: Genome Maps, an HTML5+SVG genome browser
Visualization is an important part of data analysis, and data volume is the issue: do not move the data!
Features of Genome Maps (www.genomemaps.org, paper submitted):
- The first 100% web-based genome browser: HTML5+SVG (Google Maps inspired)
- Always updated; no browser plugins or installation
- Data from the CellBase database and DAS servers: genes, transcripts, exons, SNPs, TFBS, miRNA targets, ...
- Other features: multi-species, API-oriented, integrable, plugin framework, ...

NGS data analysis pipeline: VARIANT, a genomic VARIant ANalysis Tool
A genomic variant effect predictor tool with a recursive acronym: VARIANT = VARIant ANalysis Tool (http://variant.bioinfo.cipf.es, accepted in NAR)
- Combines the CellBase Web Services and Genome Maps to build a cloud variant effect predictor: no installation, always updated
- Variant effects: genomic location (intronic, non-synonymous, upstream, ...) and regulatory location (TFBS, miRNA targets, conserved regions, ...)
- Backend: MySQL clusters with +120 GB of data behind a fast Java RESTful WS API (GZip compression, multi-threaded)
- Three interfaces, with three clients available: RESTful API, web application (AJAX, in the browser) and a CLI program

NGS data analysis pipeline: Moving the data analysis pipeline forward
HPC for Genomics Consortium: CIPF and Bull lead a consortium with the UV, UPV and UJI universities; more than 15 HPC computational researchers and developers have joined.
Extending the data analysis pipeline to:
- DNA assembly: a new assembly algorithm and HPC implementation are being developed
- RNA-seq: RNA mapping, quantification, differential expression analysis, isoform detection, ...
- Meth-seq: a new bisulfite-seq mapping algorithm, differentially methylated region analysis, ...
- Copy number: detection of amplifications and deletions, LOH, ...
Next sub-projects:
- ChIP-seq
- Data analysis integration: eQTL (variant + expression), expression + methylation
- Distributing computation: MPI-CUDA, rCUDA, Hadoop MapReduce

Next steps: Variant NoSQL database
Motivation: variant analysis! Thousands of VCFs are currently being generated, and VCF files are being stored in repositories... which is not enough. Biological questions remain hard to answer easily, e.g.: which variants with more than 20x coverage are shared in a single gene in more than 80% of the disease samples?
There are too many variants for a relational SQL database; we need new solutions, and new technology is available: NoSQL distributed databases.
A NoSQL database, with Java and RESTful WS APIs to store and analyze terabytes of variant data, is being implemented using the Hadoop framework; release in a few months.
Architecture: clients (web browser; Perl, Python, Java, ...) → RESTful WS API → Variant Analysis Java API → Hadoop Java API

Next steps: Browsing NGS data, an HTML5 web-based solution
Postulate: as data size increases, people tend to develop desktop software for visualization and analysis, e.g. GBrowse, Cytoscape, snpEff, etc.
We took the opposite direction (the Google and cloud approach):
- Moving high volumes of data (~TB) is not scalable
- Cloud-distributed and collaborative solutions instead
Based on the Genome Maps visualization technology:
- Browse mapped reads
- Examine variants from a family
- Integrated analysis widgets
Architecture: thousands of files, ~TBs of data; SAM/BAM data through the SAMtools API and VCF/NoSQL data through a home-made VCF API → RESTful WS API (a specification, with a Java implementation provided) → JSON → client

Next steps: HPC for Genomics Consortium
HPC for Genomics Consortium: CIPF and Bull lead a consortium with the UV, UPV and UJI universities; more than 15 HPC computational researchers and developers have joined. In recent weeks, computational researchers from Germany and from the Chinese BGI have also shown interest in the project.
Motivation:
- Computation is a bottleneck, and the current development model, based on many small solutions, does not solve big problems
- As sequencing throughput improves and prices go down, this problem will only get worse
- Focus on analysis: many more analyses are needed
So, bioinformaticians with a strong background in computation are invited to participate and to contact us.

Next steps: Future Clinic project
The Future Clinic project aims to introduce personalized medicine into the Valencian health system. The consortium is formed by Indra, Bull, the Valencian Health System and the Universitat Politècnica, and is led by CIPF.
Specific goals:
- Develop algorithms and software to process, store and analyze genomic information
- Integrate this information into the health system: the Digital Clinical History
- Develop software and technology to aid diagnosis
This project is funded by the Generalitat Valenciana and FEDER.

Conclusions
- NGS technology is pushing computing needs; new solutions are necessary
- The huge data volume demands faster and more efficient software solutions; HPC can be applied
- A new architecture model, with more cloud solutions
- It's not only a hardware problem: we need better software
- Analysis software to handle this data is being developed

Acknowledgements
Head of Department: Joaquín Dopazo
HPC developers: Joaquín Tárraga, Cristina Y. González, Víctor Requena (BULL)
Web and Java developers: Francisco Salavert, Ruben Sánchez
Other developers: Alejandro de María, Roberto Alonso, Jose Carbonell
And the rest of the department!!
HPC Consortium: Ignacio Blanquer (UPV), José Salavert (UPV), Jose Duato (UPV), Juan M. Orduña (UV), Vicente Arnau (UV), Enrique Quintana (UJI), Mario Gómez (BULL), ...
Bull Extreme Computing (BUX)