HPC pipeline and cloud-based solutions for Next Generation Sequencing data analysis HPC4NGS 2012, Valencia Ignacio Medina imedina@cipf.es Scientific Computing Unit Bioinformatics and Genomics Department Centro de Investigación Príncipe Felipe (CIPF) Valencia, Spain
Index Introduction High Performance Computing (HPC) Distributed and cloud computing (cloud) NGS Data Analysis Pipeline Next Steps Conclusions
Introduction New scenario in Biology Next Generation Sequencing (NGS) technology is changing the way researchers perform experiments. Many new experiments are being conducted by sequencing: exome re-sequencing, RNA-seq, Meth-seq, ChIP-seq,... NGS is allowing researchers to: Find exome variants responsible for diseases Study the whole transcriptome of a phenotype Establish the methylation state of a condition Locate DNA-binding proteins Fantastic! But these new experiments have increased data size by 1000x compared with microarrays, i.e. from MB to GB in expression analysis Data processing and analysis are becoming a bottleneck and a nightmare: from days or weeks with microarrays to months with NGS, and it will get worse as more data become available
Introduction The NGS data NGS throughput is amazing: a 30x human genome in less than a week, and it's getting faster! A single run can generate hundreds of GB of data; a typical NGS experiment today can produce 1TB of data in 1 week The good and bad news: prices are falling quickly! The good news: parallelization is possible! Many of the analyses are short-read independent This new scenario demands more advanced software solutions that exploit current computer hardware properly. We need new technologies such as HPC or cloud-based applications In just 10 years!
Introduction Current computer hardware
- Multi-core CPUs. Currently: quad-core and hexa-core processors, with hyper-threading. Coming soon: Intel MIC (Many Integrated Core) architecture with more than 50 processing cores
- GPU computing solutions. Desktop: Nvidia GTX580, Fermi architecture (512 cores performing 1.5 TFlops). HPC: Nvidia Tesla M2090, Fermi architecture (6GB memory, 512 cores, 1.3 TFlops SP, 665 GFlops DP, memory bandwidth 177GB/sec)
- AVX, a new version of SSE: SIMD extended to 256 bits, with a new instruction set and registers, designed to support 512- or 1024-bit SIMD registers
Introduction Poor software performance But software in Bioinformatics is in general slow and not designed to run on clusters or clouds; even new NGS software is still being developed in Perl, R and Java (Disclaimer: I usually code in Java, R and Perl, but they are not recommendable for NGS data analysis) New NGS technologies push computing needs; new solutions and technologies become necessary: more analyses, more performance, scalability, cloud Computer performance vs data analysis As a computer scientist I want high-performance software: fast, very fast! Speed matters, low memory footprint! Scalable solutions: clusters, cloud As a biologist I need useful data analysis tools: efficient and sensitive, extracting relevant information It's not a choice! We need both of them, e.g. Google Search
Introduction Motivation and goals As a data analysis group we focus on analysis: It's analysis, stupid! ~1-2 weeks of sample sequencing ~2-4 weeks of preprocessing and read mapping; improving this is necessary, but not sufficient! Several months of data analysis, with a lack of software for biological analysis, e.g.: which genes with coverage >30x are differentially expressed in a dataset of 50 BAMs accounting for 500GB of data, with a p-value <0.05? Goal: to develop a new generation of bioinformatics software that is fast, actually very fast! With low memory consumption! Able to handle and analyze TBs of data Stores data efficiently for querying Distributes computation, not data Focused on and useful in analysis! Software must exploit current hardware and be fast and efficient on a single computer or in an HPC cluster
High Performance Computing (HPC) Technologies Overview
- OpenMP: multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures; a flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer
- Streaming SIMD Extensions (SSE): SIMD instruction set extension to the x86 architecture. AVX (Advanced Vector Extensions) is an advanced version of SSE, with SIMD registers extended from 128 bits to 256 bits and new instructions
- Message Passing Interface (MPI): the de facto standard for communication in a parallel program running on a distributed-memory system
- GPU computing. Frameworks: Nvidia CUDA, Khronos OpenCL, OpenACC (in development)
- FPGAs: integrated circuits configurable by the programmer
High Performance Computing (HPC) Software architecture Common mistakes of new HPC developers: not all algorithms or problems are suitable for GPUs; multi-core CPUs must also be exploited (hyper-threading); SSE/AVX SIMD instructions can also improve algorithms, e.g. a 6-8x speed-up of Smith-Waterman (Farrar, 2007) Hybrid approach, heterogeneous HPC computation in shared memory: CPU (OpenMP + SSE/AVX) + GPU (Nvidia CUDA) for single-node execution; MPI or Hadoop MapReduce for cluster execution over hundreds of samples and TBs of data
Distributed and cloud computing Apache Hadoop framework Hadoop is a Java framework that allows the distributed processing of large data sets across clusters of computers using a simple programming model http://hadoop.apache.org/ The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing Apache Hadoop framework overview:
Distributed and cloud computing Apache Hadoop framework The MapReduce implementation is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner Who is using it in Bioinformatics? GATK, Crossbow, 1000genomes http://aws.amazon.com/1000genomes/ ... not many more, yet HBase is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key and a timestamp; each value in the map is an uninterpreted array of bytes. No SQL schema; a different number of columns per row is normal, very flexible
Distributed and cloud computing Amazon AWS http://aws.amazon.com/ In 2006, Amazon Web Services (AWS) began offering IT infrastructure services to businesses in the form of web services -- now commonly known as cloud computing. One of the key benefits of cloud computing is the opportunity to replace up-front capital infrastructure expenses with low variable costs that scale with your business
NGS data analysis pipeline Global overview
- NGS sequencer: produces a Fastq file, up to hundreds of GB per run
- HPG suite (High Performance Genomics): QC and preprocessing (QC stats, filtering and preprocessing options); short read aligner with a double mapping strategy: Burrows-Wheeler Transform (GPU, Nvidia CUDA) + Smith-Waterman (CPU, OpenMP+SSE/AVX), producing a SAM/BAM file. More info: http://docs.bioinfo.cipf.es/projects/ngsgpu-pipeline/wiki
- Other analyses (HPC for Genomics consortium): variant calling analysis (GATK and SAMtools mpileup, HPC implementation, statistical genomic tests, producing a VCF file), RNA-seq (mRNA sequencing), DNA assembly (not a real analysis), Meth-seq, Copy Number, transcript isoforms
- Variant effect analysis: consequence type, regulatory variants and systems biology information
- VCF viewer: HTML5+SVG web-based viewer
NGS data analysis pipeline HPG-Fastq tool Features: Filtering: filter by length range, filter by quality range, number of nucleotides out of range, N's,... Preprocessing: quality end trimming, removal of cloning vectors,... Hybrid approach implementation: Nvidia CUDA + OpenMP QC stats: read length and quality averages, nucleotide percentages, GC% content, quality per nucleotide, k-mers,... example: hpg-fastq qc fq fastq_file.fastq -t -v Paired-end Fastq files of 15GB each are QC'ed and preprocessed in a single execution in less than 4 minutes on a standard workstation: Intel quad-core with 4GB RAM and an Nvidia GTX580
NGS data analysis pipeline HPG-aligner, an HPC DNA short read aligner Current read aligner software tends to fall into one of two groups: very fast but not too sensitive (no gaps, no indels, no RNA-seq...), or slow but very sensitive (up to 1 day per sample) Read aligner algorithms: Burrows-Wheeler Transform (BWT): very fast, but not sensitive Smith-Waterman (SW): very sensitive, but very slow Hybrid approach (paper submitted): BWT with 1 error implemented in Nvidia CUDA; SW implemented using OpenMP and SSE/AVX Architecture: from the Fastq file, reads with fewer than 2 errors are mapped with BWT on the GPU; for reads with 2 or more errors, CALs (seeds) are found with BWT-GPU and then aligned with SW (OpenMP/SSE); the output is a SAM file Testing: 10x speed-up compared to BFAST, ~50x with SSE
NGS data analysis pipeline HPG-BAM tool Features BAM sort implemented in Nvidia CUDA (2x faster than SAMtools) Quality control: % of reads with N errors, % of reads with multiple mappings, strand bias, paired-end insert size,... Coverage implemented in OpenMP (2x faster): by gene, region,... Filtering: by number of errors, number of hits,... Comparator: stats, intersection,... On a standard workstation it works on a minute scale Not a huge performance improvement, but many new analysis features
NGS data analysis pipeline HPG-variant, a suite of tools for variant analysis HPG-VCF tools: to analyze large VCF files with a low memory footprint: stats, filter, split, merge,... HPG-Variant: a suite of tools for genome-scale variant analysis Variant calling is a critical step; two HPC implementations are being developed: GATK and SAMtools mpileup Genomic analysis: association, TDT, Hardy-Weinberg, LOH,... (~Plink) VCF web viewer and VARIANT software (later) More coming
NGS data analysis pipeline CellBase, an integrative database and RESTful WS API Based on CellBase (accepted in NAR), a comprehensive integrative database and RESTful Web Services API: more than 120GB of data and 100 SQL tables, exported in TXT and JSON Core features: genes, transcripts, exons, cytobands, proteins (UniProt),... Variation: dbSNP and Ensembl SNPs, HapMap, 1000Genomes, COSMIC,... Functional: 40 OBO ontologies (Gene Ontology), InterPro,... Regulatory: TFBS, miRNA targets, conserved regions,... Systems biology: interactome (IntAct), Reactome database, co-expressed genes,... http://docs.bioinfo.cipf.es/projects/cellbase/wiki
NGS data analysis pipeline Genome Maps, an HTML5+SVG genome browser Visualization is an important part of data analysis: heavy data issue! Do not move the data! Features of Genome Maps (www.genomemaps.org, paper submitted): The first 100% web-based genome browser: HTML5+SVG (Google Maps inspired) Always updated, no browser plugins or installation Data from the CellBase database and DAS servers: genes, transcripts, exons, SNPs, TFBS, miRNA targets,... Other features: multi-species, API oriented, integrable, plugin framework,...
NGS data analysis pipeline VARIANT, a genomic VARIant ANalysis Tool A genomic variant effect predictor tool, recursive acronym: VARIANT = VARIant ANalysis Tool (http://variant.bioinfo.cipf.es, accepted in NAR) VARIANT combines CellBase Web Services and Genome Maps to build a cloud variant effect predictor: no installation, always updated Variant effects: Genomic location: intronic, non-synonymous, upstream,... Regulatory location: TFBS, miRNA targets, conserved regions,... Backend: MySQL clusters with +120GB of data and a fast Java RESTful WS API (GZip, multi-threaded) Three clients available: web browser (AJAX), web application and a CLI program
NGS data analysis pipeline Moving the data analysis pipeline forward HPC for Genomics Consortium: CIPF and Bull leading a consortium with the UV, UPV and UJI universities; more than 15 HPC computational researchers and developers have joined Extending the data analysis pipeline to: DNA assembly: a new assembly algorithm and HPC implementation are being developed RNA-seq: RNA mapping, quantification, differential expression analysis, isoform detection,... Meth-seq: a new bisulfite-seq mapping algorithm, differentially methylated region analysis,... Copy Number: detection of amplifications and deletions, LOH,... Next sub-projects: ChIP-seq Data analysis integration: eQTL (variant+expression), expression+methylation Distributing: MPI-CUDA, rCUDA, Hadoop MapReduce
Next steps Variant NoSQL database Motivation: variant analysis! Currently thousands of VCFs are being generated, and VCF files are being stored in repositories... not enough Biological questions remain that should be answerable easily, e.g.: which variants with more than 20x coverage are shared in a single gene in more than 80% of disease samples? Too many variants for a relational SQL database; we need new solutions, and new technology is available: NoSQL distributed databases A NoSQL database with Java and RESTful WS APIs to store and analyze terabytes of variant data is being implemented using the Hadoop framework; release in a few months Clients: web browser; Perl, Python, Java,... via the RESTful WS API; Variant Analysis Java API; Hadoop Java API
Next steps Browsing NGS data, an HTML5 web-based solution Postulate: as data sizes increase, people tend to develop desktop software for visualization and analysis, e.g. GBrowse, Cytoscape, snpEff, etc. We took the opposite direction (the Google and cloud approach): moving high volumes of data is not scalable (~TBs of data); cloud, distributed and collaborative solutions Based on Genome Maps visualization technology: browse mapped reads, examine variants from a family, integrated analysis widgets Backend for thousands of files, ~TBs of data: SAM/BAM data via the SAMtools API; VCF/NoSQL data via a homemade VCF API; RESTful WS API specification with a Java implementation provided; JSON client
Next steps HPC for Genomics Consortium HPC for Genomics Consortium: CIPF and Bull leading a consortium with the UV, UPV and UJI universities; more than 15 HPC computational researchers and developers have joined In the last weeks some computational researchers from Germany and the Chinese BGI have shown interest in this project Motivation: Computation is a bottleneck, and the current development model based on many small solutions does not solve big problems As sequencing throughput improves and prices go down, this problem will become much worse Focus on analysis: many more analyses are needed So, bioinformaticians with a strong background in computation are invited to participate and contact us
Next Steps Future Clinic project The Future Clinic project aims to introduce personalized medicine into the Valencian health system Consortium formed by Indra, Bull, the Valencian Health System and the Universitat Politecnica, led by CIPF Specific goals: Develop algorithms and software to process, store and analyze genomic information Integrate this information into the health system: Digital Clinic History Develop software and technology to aid diagnosis This project is funded by the Generalitat Valenciana and FEDER
Conclusions NGS technology is pushing computing needs; new solutions are necessary The huge data volumes demand faster and more efficient software solutions; HPC can be applied New architecture models, more cloud solutions It's not only a hardware problem; we need better software Analysis software to handle this data is being developed
Acknowledgements Head of Department Joaquín Dopazo HPC developers Joaquín Tárraga Cristina Y. González Víctor Requena (BULL) Web and Java developers Francisco Salavert Ruben Sánchez Other developers Alejandro de María Roberto Alonso Jose Carbonell And the rest of the department!! HPC Consortium: - Ignacio Blanquer (UPV) - José Salavert (UPV) - Jose Duato (UPV) - Juan M. Orduña (UV) - Vicente Arnau (UV) - Enrique Quintana (UJI) - Mario Gómez (BULL) -... Bull Extreme Computing (BUX)