Open source analytics for Big Data in Big Pharma




Applications in next generation sequencing data
Miika Ahdesmaki, AstraZeneca
Cambridge Wireless Big Data SIG, 23 April 2015

Crash course in molecular biology
Central dogma
DNA is the ~static part
RNA is the dynamic middle man
  - Only 1% of DNA is protein-coding (or "exonic")
Proteins are involved in virtually all cell functions
We can sequence DNA and RNA using ultra high throughput sequencing (3rd gen Next Generation Sequencing)
Figure: "Centraldogma nodetails" by Narayanese at English Wikipedia, own work. Licensed under Public Domain via Wikimedia Commons. http://commons.wikimedia.org/wiki/file:centraldogma_nodetails.png#/media/file:centraldogma_nodetails.png

Why NGS?
Personalised medicine:
  - One drug for all patients is no longer realistic (especially in oncology)
  - Different demographics have different variations of risks
  - Understanding patient-specific needs will help guide their individual medication
Cancer is a genetic disease, most often the result of spurious mutations in DNA
  - Understanding changes in cancer DNA can help defeat the disease
Next generation high throughput sequencing offers genome DNA analyses in days and for under $10k

What is next generation sequencing? Sequencing
NGS: massively parallel DNA sequencing
Oncology is the biggest consumer of NGS at AZ
We sequence RNA and DNA e.g. from
  - Clinical samples
  - Cell lines
  - Xenografts / explants

What is next generation sequencing? Sequencing
The DNA/RNA is pre-processed and fragmented, and the short fragments are sequenced (in random order)

What is next generation sequencing? Alignment
The short fragments are aligned to a reference sequence, such as the human reference hg19
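
As a toy illustration of the alignment step, the sketch below places each short read at the reference offset with the fewest mismatches. It is not how production aligners work (they rely on indexed search and handle gaps); `REFERENCE`, the reads and the helper `align_read` are made-up names for illustration only.

```python
"""Toy short-read alignment by exhaustive mismatch counting (illustrative only)."""


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read: str, reference: str):
    """Return (offset, mismatches) of the best ungapped placement of read.

    Ties are resolved in favour of the leftmost offset.
    """
    if len(read) > len(reference):
        raise ValueError("read is longer than the reference")
    best_offset, best_mismatches = 0, len(read) + 1
    for offset in range(len(reference) - len(read) + 1):
        mismatches = hamming(read, reference[offset:offset + len(read)])
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


# Made-up reference and reads; a real human reference is billions of bases long.
REFERENCE = "ACGTTAGCCGATAGGCTTAACG"
READS = ["TAGCCG", "GATAGG", "CTTTAC"]

if __name__ == "__main__":
    for read in READS:
        offset, mismatches = align_read(read, REFERENCE)
        print(f"{read} -> offset {offset}, {mismatches} mismatch(es)")
```

The exhaustive scan costs time proportional to reference length times read length per read, which is why real tools index the reference first.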

What is next generation sequencing? Downstream processing (variants, expression)
The alignments are further processed to answer the following questions
  - How are the alignments different from the reference (SNPs, indels)?
  - Which genes are expressed?
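
The first question above (how do the alignments differ from the reference?) can be sketched as a simple pileup: count the bases observed at each reference position and report positions where a well-supported base disagrees with the reference. This is a minimal sketch only; the thresholds `min_depth` and `min_fraction`, the helper `call_snps` and all sequences are assumptions for illustration, and real variant callers use statistical models and also handle indels.

```python
"""Toy SNP detection from ungapped alignments (illustrative only)."""
from collections import Counter, defaultdict


def call_snps(reference, alignments, min_depth=2, min_fraction=0.8):
    """Return {position: (ref_base, alt_base)} for well-supported mismatches.

    alignments is an iterable of (offset, read) pairs, each read placed
    ungapped at offset on the reference. Thresholds are arbitrary assumptions.
    """
    pileup = defaultdict(Counter)  # position -> counts of observed bases
    for offset, read in alignments:
        for i, base in enumerate(read):
            pileup[offset + i][base] += 1

    snps = {}
    for pos, counts in sorted(pileup.items()):
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        if (base != reference[pos] and depth >= min_depth
                and support / depth >= min_fraction):
            snps[pos] = (reference[pos], base)
    return snps


# Made-up data: three reads support a G-to-A change at position 6,
# while a single read carries an unsupported mismatch at position 12.
REFERENCE = "ACGTTAGCCGATAGG"
ALIGNMENTS = [
    (2, "GTTAACC"),
    (4, "TAACCGA"),
    (5, "AACCGAT"),
    (8, "CGATTGG"),
]

if __name__ == "__main__":
    print(call_snps(REFERENCE, ALIGNMENTS))
```

The depth threshold is what separates the supported variant from the lone mismatch, which is treated as noise.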

Uses of NGS
[Diagram relating sample types, NGS data types and applications]
Samples: explants, tumours (FFPE), tumours (fresh frozen), cell lines, clinical samples
NGS data: RNA-Seq; DNA-Seq (targeted, whole exome, whole genome)
Readouts: expression, fusions, variants (coding variants; coding and noncoding variants)
Applications: patient stratification; biomarkers for prognosis, drug response and safety; new target ID; mechanism of drug action; mechanism of disease; mechanisms of resistance

Data generation and volumes
AZ: mix of outsourced sequencing and internal data generation
Typical size of files per sample:
  - Whole genome: 60-180GB
  - Exome DNA-seq: 10-20GB
  - RNA-seq: 10-15GB
  - Single gene targeted: 100-200MB
In oncology, individuals are often studied in pairs (tumour/normal, parental/daughter), doubling the data volumes
Typical study sizes: 100GB - 1TB raw compressed data
One of our most frequent Big Data problems
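
The per-sample sizes above translate into study sizes by simple multiplication. The sketch below is a back-of-envelope estimate only: the size ranges are the ones quoted on the slide, while the function name `study_size_gb` and the example cohort of 20 individuals are assumptions for illustration.

```python
"""Back-of-envelope study storage estimate from the quoted per-sample sizes."""

# Per-sample raw compressed size ranges in GB, as quoted on the slide.
SAMPLE_SIZE_GB = {
    "whole_genome": (60, 180),
    "exome": (10, 20),
    "rna_seq": (10, 15),
    "targeted_gene": (0.1, 0.2),
}


def study_size_gb(assay, individuals, paired=True):
    """Return the (low, high) raw data volume in GB for one study.

    paired=True models tumour/normal (or parental/daughter) pairs,
    i.e. two samples per individual.
    """
    low, high = SAMPLE_SIZE_GB[assay]
    samples = individuals * (2 if paired else 1)
    return samples * low, samples * high


if __name__ == "__main__":
    low, high = study_size_gb("exome", individuals=20)
    print(f"20 paired exomes: {low:.0f}-{high:.0f} GB")
```

A hypothetical cohort of 20 paired exomes lands inside the typical study range of 100GB - 1TB, whereas a handful of paired whole genomes already exceeds it.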

Data generation and volumes
Over the past 3-4 years we accumulated ~400TB of sequencing data via
  - Acquiring public data sets (TCGA, ICGC)
  - Vendor sequencing (major)
  - Internal sequencing (minor)
Over 2015-2016 we expect
  - Internal sequencing to become the major data generation source (5 new sequencers in 2015 to accompany 2 sequencers in 2013-2014)
  - 1PB of sequencing data by mid 2016
Long term prediction of volumes is difficult
3-tiered storage for processing, short term storage and long term storage
  - Amazon Glacier strongly considered for long term storage

Partnering with the leaders
Illumina Announces Strategic Partnerships with AstraZeneca, Janssen and Sanofi to Redefine Companion Diagnostics for Oncology
  - http://investor.illumina.com/phoenix.zhtml?c=121127&p=irolnewsarticle&id=1960007
  - Illumina, Inc. announced it has formed collaborative partnerships with leading pharmaceutical companies to develop a universal NGS-based oncology test system
  - The system will be used for clinical trials of targeted cancer therapies with a goal of developing and commercializing a multi-gene panel for therapeutic selection, resulting in a more comprehensive tool for precision medicine

Pipelines and analytics

Production: Dealing with the complexity
The number of NGS tools increases daily:
annotatebed bcbio_nextgen.py ctest hash_tar plot_roc.r srf_info vcffilter vcfrandom append_sff bcftools cuffcompare index_tar plot-vcfstats srf_list vcffixup vcfrandomsample bam12auxmerge bed12tobed6 cuffdiff interpolate_sam.pl prep_reads STAR vcfflatten vcfregionreduce bam12split bedgraphtobigwig cufflinks intersectbed psl2sam.pl subtractbed vcfgeno2alleles vcfregionreduce_and_cut bam12strip bedpetobam cuffmerge io_lib-config qualimap tabix vcfgeno2haplo vcfregionreduce_pipe bam2fastx bedpetobed12 dbilogstrip isnovoindex randombed tabtk vcfgenosamplenames vcfregionreduce_uncompressed bamadapterclip bedpetovcf dbiprof juncs_db rtg tagbam vcfgenosummarize vcfremap bamadapterfind bedtobam dbiproxy kmerprob s3cmd tophat vcfgenotypecompare vcfremoveaberrantgenotypes bamauxsort bedtobigbed expandcols liftover sam2vcf.pl tophat2 vcfgenotypes vcfremovenonatgc bamcat bedtoigv export2sam.pl linksbed sambamba tophat-fusion-post vcfglbound vcfremovesamples bamchecksort bed_to_juncs extract_fastq long_spanning_reads samblaster tophat_reports vcfglxgt vcfroc bamclipreinsert bedtools extract_qual lumpy sam_juncs trace_dump vcfgtcompare.sh vcfsample2info bamcollate bgzip extract_seq makescf samtools twobitinfo vcfhetcount vcfsamplediff bamcollate2 bigbedinfo facount map2gtf samtools.pl twobittofa vcfhethomratio vcfsamplenames bamdownsamplerandom bigbedsummary fasize mapbed scalpel unionbedgraphs vcfindelproximity vcfsitesummarize bamfilteraux bigbedtobed fastafrombed maq2sam-long scf_dump variant_effect_predictor.pl vcfindels vcfsnps bamfilterflags bigwiginfo fastqc maq2sam-short scf_info vcf2fasta vcfindex vcfsom bamfilterheader bigwigsummary fastqtobam maskfastafrombed scf_update vcf2sqlite.py vcfintersect vcfsort bamfilterrg bigwigtobedgraph fatotwobit md5fa scramble vcf2tsv vcfkeepgeno vcfstats bamfixmateinformation bigwigtowig featurecounts md5sum-lite scram_flagstat vcfaddinfo vcfkeepinfo
vcfstreamsort bamindex blast2sam.pl fetchchromsizes mergebed scram_merge vcfafpath vcfkeepsamples vcf_strip_extra_headers bamleftalign bowtie2 filter_vep.pl multibamcov scram_pileup vcfallelicprimitives vcfleftalign vcftobedpe bammapdist bowtie2-align fix_map_ordering multiintersectbed segment_juncs vcfaltcount vcflength vcfuniq bammarkduplicates bowtie2-build flankbed mutect-1.1.6.jar seqtk vcfannotate vcfmultiallelic vcfuniqalleles bammarkduplicates2 bowtie2-inspect freebayes normalisefasta shufflebed vcfannotategenotypes vcfmultiway vcfutils.pl bammaskflags bowtie2sam.pl gatk-framework novo2paf slopbed vcfbiallelic vcfmultiwayscripts vcfvarstats bammdnm brew GenomeAnalysisTK.jar novo2sam.pl snpeff vcfbreakmulti vcfnobiallelicsnps vep_convert_cache.pl bammerge bwa genomecoveragebed novoalign soap2sam.pl vcfcat vcfnoindels vep_install.pl bam_merge ccmake get_comment novoaligncs SomaticAnalysisTK.jar vcfcheck vcfnosnps vt bamrank closestbed getoverlap novoaligncsmpi sortbed vcfclassify vcfnulldotslashdot wgsim bamrecompress clusterbed gffread novoalignmpi speedseq vcfcleancomplex vcfnumalt wgsim_eval.pl bamreset cmake glia novobarcode speedseq.config vcfclearid vcfoverlay wigtobigwig bamseqchksum complementbed grabix novoindex splitreadsamtobedpe vcfclearinfo vcfparsealts windowbed bamsort contig_to_chr_coords groupby novomethyl splittertobreakpoint vcfcombine vcfplotaltdiscrepancy.r windowmaker bamsplit convert_trace gtf_juncs novope2bed.pl sra_to_solid vcfcommonsamples vcfplotaltdiscrepancy.sh xmlwf bamsplitdiv coveragebed gtf_to_fasta novorun.pl srf2fasta vcfcomplex vcfplotsitediscrepancy.r zoom2sam.pl bamtobed cpack gtftogenepred novosort srf2fastq vcfcountalleles vcfplottstv.sh ztr_dump bamtofastq cpanm gtf_to_sam novoutil srf_dump_all vcfcreatemulti vcfprimers bamtofastq cram_dump hash_exp nucbed srf_extract_hash vcfdistance vcfprintaltdiscrepancy.r bamtools cram_index hash_extract pairtobed srf_extract_linear vcfecho vcfprintaltdiscrepancy.sh bamtools-2.3.0 
cramtools hash_list pairtopair srf_filter vcfentropy vcfqual2info bamzztoname crc32 hash_sff platypus srf_index_hash vcfevenregions vcfqualfilter
300+ (OSS) tools within our production framework
Infinite number of combinations to get it wrong

Production: Overcoming the Complexity
Scalability, Reproducibility, Flexibility, Accessibility
Forced to use open source tools and OS (Linux), no closed source alternatives exist
  - Integration is challenging
  - Variant calling and expression analysis are very much open research questions, with rapidly changing code
  - No licensing costs, but costs in internal and external consulting
Bcbio-nextgen
  - An open source Python toolkit providing best practice pipelines for fully automated NGS analysis
  - Main developer Brad Chapman (HSPH)
  - Unit tested, version controlled, development on GitHub https://github.com/chapmanb/bcbio-nextgen
  - Scalable across different clusters, schedulers, Amazon cloud
AZ is an active, recognised contributor and collaborator to HSPH and bcbio-nextgen

Production: Overcoming the Complexity
Bcbio-nextgen overview
The user writes/modifies a high level configuration file specifying inputs and analysis parameters
  - Very few tuning parameters -> Given the same data, two analysts will produce the same results

Getting it right
Given the rapid changes in the individual analysis tools, how do we know the pipeline gets it right?
Solution: reference standards
For germline sequencing, the Genome in a Bottle Consortium established a gold standard for an individual (NA12878)
  - Samples from NA12878 can be bought off the shelf
  - Compare sequencing and analytics results to the gold standard, establish sensitivity and PPV of variant calls, compare to other people's results
For tumour sequencing, several standards exist
  - Horizon Diagnostics tumour standard
  - ICGC-TCGA DREAM Mutation Calling challenge
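
Sensitivity and PPV against a gold standard reduce to counting set overlaps: sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP). The sketch below assumes variants are represented as (chrom, pos, ref, alt) tuples; the function name `benchmark` and all variants shown are made up and are not real NA12878 calls. Real comparisons typically also need variant normalisation and restriction to the regions where the standard is considered confident, which this sketch ignores.

```python
"""Sensitivity and PPV of a call set against a gold standard (illustrative)."""


def benchmark(called, truth):
    """Compare two collections of variants keyed as (chrom, pos, ref, alt)."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)  # called and present in the standard
    fp = len(called - truth)  # called but absent from the standard
    fn = len(truth - called)  # present in the standard but missed
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
    }


# Made-up variants for illustration.
TRUTH = {
    ("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"), ("chr3", 10, "T", "C"),
}
CALLED = {
    ("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"), ("chr2", 900, "T", "G"),
}

if __name__ == "__main__":
    print(benchmark(CALLED, TRUTH))
```

Re-running the same comparison after each tool upgrade gives a simple regression check on the pipeline.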

Processing and managing the data
NGS HPC clusters on 4 main R&D sites
  - UK (SGE, ~200 cores, GPFS)
  - Sweden (SLURM, >500 cores, Lustre)
  - China (SGE, >100 cores, GPFS)
  - US (UGE, >200 cores, GPFS)
Data generated or received in one place is processed locally by the NGS Production Team (each member has access to all HPC clusters)
  - Processed data is handed over to disease area bioinformaticians in a controlled manner
Quick pipes between the sites allow data sharing when required
Cloud computing

NGS + Cloud
NGS is suited to using Cloud
Large scale storage needs
High computational power that can continue to scale
Inherently (embarrassingly) parallel, easily ported
Peaks and valleys in compute needs, so burst into cloud as needed instead of large investment upfront
Launch-able computing centre utilising Amazon EC2
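
"Embarrassingly parallel" here means that each sample can be processed without reference to any other, so work fans out trivially. A minimal sketch of that pattern follows, using a local thread pool as a stand-in for a cluster scheduler or cloud nodes; `process_sample` is a hypothetical placeholder and does not call any real pipeline.

```python
"""Per-sample fan-out: samples are independent, so they can run concurrently."""
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample_id):
    """Hypothetical stand-in for one sample's pipeline run."""
    # A real step would launch external tools; here we only return a label.
    return f"{sample_id}.processed"


def run_batch(sample_ids, max_workers=4):
    """Process samples concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_sample, sample_ids))


if __name__ == "__main__":
    print(run_batch(["tumour_01", "normal_01", "tumour_02", "normal_02"]))
```

Because no sample waits on another, adding workers (or bursting onto more cloud nodes) shortens wall-clock time until shared disk I/O becomes the limit.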

StarCluster from MIT with our pipeline
[Diagram: cluster of 32-core nodes, "320 SSD" local disks and a 40 TB GlusterFS volume (/ngs)]

Why not Hadoop?
We use a large number of mostly academic open source tools, 99.9% of which are not written for Hadoop
No pipeline implements wrapping of the above tools in a Hadoop framework
Disk I/O is admittedly the bottleneck in current parallel file system architectures for NGS analytics
  - GPFS locally at AZ
  - Lustre in AWS, local scratch SSD

Visualising the data: JBrowse genome browser
The most popular genome analysis viewer is the Integrative Genomics Viewer (IGV, Broad Institute), a Java-based standalone program
  - Requires a Java app
  - Requires configuration
JBrowse, a web browser based genome viewer, is inherently easier for non-tech-savvy people: point your browser to it and it just works
  - Physical location of data is less important, only the part that is shown is transferred
Data of interest, such as genomic variants, can be annotated by a URL to JBrowse
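
Annotating a variant with a browser link amounts to building a URL that encodes the genomic location. The sketch below assumes a JBrowse installation that accepts `loc` (as `chrom:start..end`) and `tracks` query parameters; check these against the installed version. The base address, track names and the helper `variant_url` are hypothetical.

```python
"""Build a link that opens a genome browser at a variant (illustrative)."""
from urllib.parse import urlencode

# Hypothetical JBrowse address; replace with a real installation.
JBROWSE_BASE = "https://jbrowse.example.org/index.html"


def variant_url(chrom, pos, tracks=(), flank=50, base=JBROWSE_BASE):
    """Return a URL showing `flank` bases either side of a 1-based position."""
    start = max(1, pos - flank)
    params = {"loc": f"{chrom}:{start}..{pos + flank}"}
    if tracks:
        params["tracks"] = ",".join(tracks)
    # Keep ':' and ',' unescaped so the link stays human-readable.
    return f"{base}?{urlencode(params, safe=':,')}"


if __name__ == "__main__":
    print(variant_url("chr13", 32900050, tracks=("DNA", "variants")))
```

Such a link can be stored next to each variant in a results table, so a reviewer opens the supporting reads with one click.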

JBrowse BRCA2 gene screenshot
[Screenshot; annotated elements: reference DNA sequence and amino acids, BRCA2 alternative exons, detected gene variant (G to A mutation), evidence in the data for the variant, noise in the data]

Summary

NGS data is accumulating faster and faster
Analysing and interpreting the data is I/O intensive (+CPU and RAM)
Easily parallelised using SMP and simple schedulers (SGE, Slurm)
Current challenges in integrating all the processed data (in e.g. NoSQL databases)
Long term storage (due to e.g. regulatory requirements) in e.g. Amazon Glacier
