Open source analytics for Big Data in Big Pharma

1 Open source analytics for Big Data in Big Pharma
Applications in next generation sequencing data
Miika Ahdesmaki, AstraZeneca
Cambridge Wireless Big Data SIG, 23 April 2015

2 Crash course to molecular biology
Central dogma
DNA is the ~static part
RNA is the dynamic middle man
- Only 1% of DNA is protein-coding (or exonic)
Proteins are involved in virtually all cell functions
We can sequence DNA and RNA using ultra high throughput sequencing (3rd gen Next Generation Sequencing)
Image: "Centraldogma nodetails" by Narayanese at English Wikipedia - Own work. Licensed under Public Domain via Wikimedia Commons

3 Why NGS?
Personalised medicine:
- One drug for all patients is no longer realistic (especially in oncology)
- Different demographics have different variations of risks
- Understanding patient-specific needs will help guide their individual medication
Cancer is a genetic disease, most often the result of spurious mutations in DNA
- Understanding changes in cancer DNA can help defeat the disease
Next generation high throughput sequencing offers genome DNA analyses in days and for under $10k

4 What is next generation sequencing? Sequencing
NGS: massively parallel DNA sequencing
Oncology is the biggest consumer of NGS at AZ
We sequence RNA and DNA e.g. from
- Clinical samples
- Cell lines
- Xenografts / explants

5 What is next generation sequencing? Sequencing
The DNA/RNA is pre-processed and fragmented, and the short fragments are sequenced (in random order)

6 What is next generation sequencing? Alignment
The short fragments are aligned to a reference sequence, such as the human reference HG19
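As a toy illustration of the alignment step, the sketch below places short reads on a reference by exact string matching. It is only a minimal sketch: the sequences and the function name `align_exact` are made up for this example, and production aligners (such as the bwa and bowtie2 tools listed later in the deck) use indexed, mismatch-tolerant search rather than a linear scan.

```python
def align_exact(reference: str, reads: list[str]) -> dict[str, list[int]]:
    """Map each read to every 0-based reference position where it matches exactly."""
    hits = {}
    for read in reads:
        positions = []
        start = reference.find(read)
        while start != -1:
            positions.append(start)
            # Continue searching one base further to catch overlapping hits.
            start = reference.find(read, start + 1)
        hits[read] = positions
    return hits


# Hypothetical toy data: a 14-base "reference" and three short reads.
reference = "ACGTTGCAACGGTA"
reads = ["GCAAC", "ACG", "TTTT"]
alignments = align_exact(reference, reads)
```

A read can map to one place, to several places, or nowhere; handling the ambiguous and unmapped cases is a large part of what real aligners do.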

7 What is next generation sequencing? Downstream Processing (variants, expression)
The alignments are further processed to answer the following questions
- How are the alignments different from the reference (SNPs, indels)?
- Which genes are expressed?
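The two downstream questions can be sketched on toy data as follows. This is a simplified illustration under stated assumptions: reads are already aligned without gaps, the gene intervals and read positions are invented, and the function names are hypothetical. Real variant callers and expression counters also model base quality, indels and read depth.

```python
def find_mismatches(reference, read, offset):
    """Return (position, ref_base, read_base) wherever an aligned read differs from the reference."""
    return [
        (offset + i, reference[offset + i], base)
        for i, base in enumerate(read)
        if reference[offset + i] != base
    ]


def count_reads_per_gene(genes, read_starts):
    """Count reads whose start position falls inside each gene's half-open interval [start, end)."""
    counts = {name: 0 for name in genes}
    for pos in read_starts:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts


# Hypothetical toy data.
reference = "ACGTTGCAACGGTA"
snps = find_mismatches(reference, "GCAGC", offset=5)  # read aligned at position 5

genes = {"geneA": (0, 6), "geneB": (6, 14)}
counts = count_reads_per_gene(genes, [0, 5, 8, 8, 13])
```

Here `snps` holds a single candidate substitution (A to G at position 8), and `counts` gives a crude per-gene expression measure.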

8 Uses of NGS
Samples: Explants, Tumors-FFPE, Tumors fresh frozen, Cell lines, Clinical samples
Assays: RNA-Seq, DNA-Seq (Targeted, Whole exome, Whole genome)
NGS Data: Expression, Fusions, Variants (Coding variants, Coding and noncoding variants)
Uses:
- Patient stratification
- Biomarkers for prognosis, drug response, safety
- New Target ID
- Mechanism of drug action
- Mechanism of disease
- Mechanisms of resistance

9 Data generation and volumes
AZ: Mix of outsourced sequencing and internal data generation
Typical size of files per sample:
- Whole genome: GB
- Exome DNA-seq: 10-20GB
- RNA-seq: 10-15GB
- Single gene targeted: MB
In oncology, individuals are often studied in pairs (tumour/normal, parental/daughter), doubling the data volumes
Typical study sizes: 100GB - 1TB raw compressed data
One of our most frequent Big Data problems
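The per-sample sizes and the pairing rule combine into a simple back-of-the-envelope estimate. The sketch below uses the exome figure from the slide (10-20GB per sample); the cohort size of 25 individuals and the function name are assumptions chosen only for illustration.

```python
def study_size_gb(n_individuals, gb_per_sample, paired=True):
    """Estimate raw data volume; paired designs (e.g. tumour/normal) double the sample count."""
    samples = n_individuals * (2 if paired else 1)
    return samples * gb_per_sample


# Hypothetical cohort of 25 individuals, exome DNA-seq at 10-20GB per sample.
low_gb = study_size_gb(25, 10)
high_gb = study_size_gb(25, 20)
```

Under these assumptions a paired exome study of 25 individuals lands at 500-1000GB, in the upper half of the typical study size range quoted on the slide.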

10 Data generation and volumes
Over the past 3-4 years we accumulated ~400TB of sequencing data via
- Acquiring public data sets (TCGA, ICGC)
- Vendor sequencing (major)
- Internal sequencing (minor)
Over we expect
- Internal sequencing to become the major data generation source (5 new sequencers in 2015 to accompany 2 sequencers in )
- 1PB of sequencing data by mid 2016
Long term prediction of volumes difficult
3 tiered storage for processing, short term storage and long term storage
- Amazon Glacier strongly considered for long term storage

11 Partnering with the leaders
Illumina Announces Strategic Partnerships with AstraZeneca, Janssen and Sanofi to Redefine Companion Diagnostics for Oncology
Illumina, Inc. announced it has formed collaborative partnerships with leading pharmaceutical companies to develop a universal NGS-based oncology test system
- The system will be used for clinical trials of targeted cancer therapies with a goal of developing and commercializing a multi-gene panel for therapeutic selection, resulting in a more comprehensive tool for precision medicine

12 Pipelines and analytics

13 Production Dealing with the complexity
Number of NGS tools increases daily
[Slide shows a multi-column listing of the command-line tools installed in the production framework, including bcbio_nextgen.py, bcftools, bedtools, bowtie2, bwa, cufflinks, fastqc, featurecounts, freebayes, GenomeAnalysisTK.jar, lumpy, mutect, novoalign, platypus, qualimap, sambamba, samblaster, samtools, scalpel, snpeff, speedseq, STAR, tabix, tophat2, variant_effect_predictor.pl, vt and the many bam*, bed*, srf*, and vcf* utilities]
300+ (OSS) tools within our production framework
Infinite number of combinations to get it wrong

14 Production Overcoming the Complexity
Scalability, Reproducibility, Flexibility, Accessibility
Forced to use open source tools and OS (Linux), no closed source alternatives exist
- Integration challenging
- Variant calling and expression analysis are very much open research questions, with rapidly changing code
- No licensing costs, but costs in internal and external consulting
Bcbio-nextgen
- An open source Python toolkit providing best practice pipelines for fully automated NGS analysis
- Main developer Brad Chapman (HSPH)
- Unit tested, version controlled, development on GitHub
- Scalable across different clusters, schedulers, Amazon cloud
AZ is an active, recognised contributor and collaborator to HSPH and bcbio-nextgen

15 Production Overcoming the Complexity Bcbio-nextgen overview
The user writes/modifies a high level configuration file specifying inputs and analysis parameters
- Very few tuning parameters -> Given the same data, two analysts will produce the same results
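To make the idea of a high level configuration concrete, the sketch below builds a small sample description as a Python dictionary and derives a fingerprint from it. The key names are modelled on bcbio-nextgen's YAML sample configuration but are assumptions here and should be checked against its documentation; the file names and the `config_fingerprint` helper are hypothetical and not part of bcbio-nextgen.

```python
import hashlib
import json

# Assumed, bcbio-like sample description; file names are placeholders.
config = {
    "details": [
        {
            "description": "NA12878",
            "files": ["NA12878_1.fastq.gz", "NA12878_2.fastq.gz"],
            "genome_build": "hg19",
            "analysis": "variant2",
            "algorithm": {"aligner": "bwa", "variantcaller": "freebayes"},
        }
    ],
}


def config_fingerprint(cfg):
    """Hash a canonical (key-sorted) JSON rendering so equal configs give equal fingerprints."""
    canonical = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


fingerprint = config_fingerprint(config)
```

Recording such a fingerprint next to the results is one possible way to check that two analysts really did run the same inputs and parameters.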

16 Getting it right
Given the rapid changes in the individual analysis tools, how do we know the pipeline gets it right?
Solution: reference standards
For germline sequencing, the Genome in a Bottle Consortium established a gold standard for an individual (NA12878)
- Samples from NA12878 can be bought off the shelf
- Compare sequencing and analytics results to the gold standard, establish sensitivity and PPV of variant calls, compare to other people's results
For tumour sequencing, several standards exist
- Horizon Diagnostics tumour standard
- ICGC-TCGA DREAM Mutation Calling challenge
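Comparing calls against a gold standard reduces to counting true positives, false negatives and false positives. The sketch below computes sensitivity, TP / (TP + FN), and positive predictive value, TP / (TP + FP), for variants represented as (chromosome, position, ref, alt) tuples. The variants are invented for illustration, and exact tuple matching is a simplification: real comparisons also have to normalise how equivalent variants are represented.

```python
def sensitivity_and_ppv(called, truth):
    """Return (sensitivity, PPV) of a call set against a gold-standard set of variants."""
    true_pos = len(called & truth)
    false_neg = len(truth - called)
    false_pos = len(called - truth)
    sensitivity = true_pos / (true_pos + false_neg) if truth else float("nan")
    ppv = true_pos / (true_pos + false_pos) if called else float("nan")
    return sensitivity, ppv


# Hypothetical gold standard and pipeline output.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr2", 90, "T", "C"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr2", 40, "G", "A"),
    ("chr2", 90, "T", "C"),
    ("chr3", 7, "C", "G"),
}
sensitivity, ppv = sensitivity_and_ppv(called, truth)
```

In this toy case three of four true variants are found and three of four calls are correct, so both metrics are 0.75. Re-running the comparison after each tool upgrade shows whether the pipeline has regressed.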

17 Processing and managing the data
NGS HPC clusters on 4 main R&D sites
- UK (SGE, ~200 cores, GPFS)
- Sweden (SLURM, >500 cores, Lustre)
- China (SGE, >100 cores, GPFS)
- US (UGE, >200 cores, GPFS)
Data generated or received in one place is processed locally by the NGS Production Team (each member has access to all HPC clusters)
- Processed data handed over to disease area bioinformaticians in a controlled manner
Quick pipes between the sites allow data sharing when required
Cloud computing

18 NGS + Cloud
NGS Suited to using Cloud
Large scale storage needs
High computational power that can continue to scale
Inherently (embarrassingly) parallel, easily ported
Peaks and valleys in compute needs, so burst into cloud as needed instead of large investment upfront
Launch-able computing centre utilising Amazon EC2
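"Embarrassingly parallel" means each sample can be processed without talking to any other, so the work maps directly onto a pool of workers. The sketch below shows the pattern with Python's standard `concurrent.futures`. The per-sample task (GC fraction of a few toy reads) is a stand-in chosen for this example, not a step from the actual pipeline, which would instead submit jobs to a scheduler or to cloud nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_fraction(reads):
    """Stand-in per-sample task: fraction of G/C bases across a sample's reads."""
    bases = "".join(reads)
    return sum(base in "GC" for base in bases) / len(bases)


def process_all(samples, workers=2):
    """Run the per-sample task independently for every sample and collect results by name."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(gc_fraction, samples.values())
        return dict(zip(samples.keys(), results))


# Hypothetical toy samples.
samples = {"tumour": ["GGCC", "ATGC"], "normal": ["ATAT", "GCAT"]}
results = process_all(samples)
```

Because no task depends on another, adding workers (or cloud nodes) scales throughput until storage I/O becomes the limit.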

19 StarCluster from MIT with our pipeline
[Diagram: 32-core compute nodes with "320 SSD" local storage, sharing a 40 TB GlusterFS file system mounted at /ngs]

20 Why not Hadoop?
We rely on a large number of mostly academic open source tools, 99.9% of which are not written for Hadoop
No existing pipeline wraps these tools in a Hadoop framework
Disk I/O is admittedly the bottleneck in current parallel file system architectures for NGS analytics
- GPFS locally at AZ
- Lustre in AWS, local scratch SSD

21 Visualising the data JBrowse genome browser
The most popular genome analysis viewer is the Integrative Genomics Viewer (IGV, Broad Institute), a Java based standalone program
- Requires a Java app
- Requires configuration
JBrowse, a web browser based genome viewer, is inherently easier for non-tech savvy people: point your browser to it and it just works
- Physical location of data is less important, only the part that is shown is transferred
Data of interest, such as genomic variants, can be annotated by a URL to JBrowse
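Annotating a variant with a JBrowse link can be as simple as generating a URL that opens the browser on a window around the variant. The sketch below assumes a JBrowse instance that accepts `loc` and `tracks` query parameters; the base URL, track names, coordinates and the `jbrowse_link` helper are placeholders for illustration and should be adapted to the actual installation.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def jbrowse_link(base_url, chrom, pos, flank=50, tracks=()):
    """Build a link that opens the browser on a window of +/- flank bases around a variant."""
    start = max(1, pos - flank)
    end = pos + flank
    params = {"loc": f"{chrom}:{start}..{end}"}
    if tracks:
        params["tracks"] = ",".join(tracks)
    return f"{base_url}?{urlencode(params)}"


# Hypothetical server, position and track names.
link = jbrowse_link(
    "https://jbrowse.example.org/index.html",
    "chr13",
    32900000,
    tracks=["DNA", "variants"],
)
query = parse_qs(urlparse(link).query)  # decoded view of the generated parameters
```

Storing such a link alongside each variant in a report lets a reader jump from a table row straight to the supporting reads.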

22 JBrowse BRCA2 gene screenshot
Screenshot annotations:
- Reference DNA sequence and amino acids
- BRCA2 alternative exons
- Detected gene variant (G to A mutation)
- Evidence in the data for the variant
- Noise in the data

23 Summary

24 Summary
NGS data is accumulating faster and faster
Analysing and interpreting the data is I/O intensive (+CPU and RAM)
Easily parallelised using SMP and simple schedulers (SGE, Slurm)
Current challenges in integrating all the processed data (in e.g. NoSQL databases)
Long term storage (due to e.g. regulatory requirements) in e.g. Amazon Glacier

25 Confidentiality Notice
This file is private and may contain confidential and proprietary information. If you have received this file in error, please notify us and remove it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the contents of this file is not permitted and may be unlawful.
AstraZeneca PLC, 2 Kingdom Street, London, W2 6BD, UK, T: +44(0) , F: +44 (0) ,

New solutions for Big Data Analysis and Visualization

New solutions for Big Data Analysis and Visualization New solutions for Big Data Analysis and Visualization From HPC to cloud-based solutions Barcelona, February 2013 Nacho Medina [email protected] http://bioinfo.cipf.es/imedina Head of the Computational Biology

More information

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center Computational Challenges in Storage, Analysis and Interpretation of Next-Generation Sequencing Data Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center Next Generation Sequencing

More information

Data Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute

Data Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute Data Analysis & Management of High-throughput Sequencing Data Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute Current Issues Current Issues The QSEQ file Number files per

More information

A leader in the development and application of information technology to prevent and treat disease.

A leader in the development and application of information technology to prevent and treat disease. A leader in the development and application of information technology to prevent and treat disease. About MOLECULAR HEALTH Molecular Health was founded in 2004 with the vision of changing healthcare. Today

More information

Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow. Barry Bolding. Cray Inc Seattle, WA

Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow. Barry Bolding. Cray Inc Seattle, WA Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow Barry Bolding Cray Inc Seattle, WA 1 CUG 2013 Paper Genomic Applications on Cray supercomputers: Next Generation Sequencing

More information

OpenCB a next generation big data analytics and visualisation platform for the Omics revolution

OpenCB a next generation big data analytics and visualisation platform for the Omics revolution OpenCB a next generation big data analytics and visualisation platform for the Omics revolution Development at the University of Cambridge - Closing the Omics / Moore s law gap with Dell & Intel Ignacio

More information

Hadoopizer : a cloud environment for bioinformatics data analysis

Hadoopizer : a cloud environment for bioinformatics data analysis Hadoopizer : a cloud environment for bioinformatics data analysis Anthony Bretaudeau (1), Olivier Sallou (2), Olivier Collin (3) (1) [email protected], INRIA/Irisa, Campus de Beaulieu, 35042,

More information

G E N OM I C S S E RV I C ES

G E N OM I C S S E RV I C ES GENOMICS SERVICES THE NEW YORK GENOME CENTER NYGC is an independent non-profit implementing advanced genomic research to improve diagnosis and treatment of serious diseases. capabilities. N E X T- G E

More information

ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013

ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013 ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE October 2013 Introduction As sequencing technologies continue to evolve and genomic data makes its way into clinical use and

More information

Data Sharing Initiative: International Cancer Genome Consortium

Data Sharing Initiative: International Cancer Genome Consortium Data Sharing Initiative: International Cancer Genome Consortium Tom Hudson, MD President and Scientific Director Ontario Institute for Cancer Research 1 Sharing Data Sharing BIG Genome Initiative: DATA

More information

UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production

UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production Page 1 of 6 UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production February 05, 2010 Newsletter: BioInform BioInform - February 5, 2010 By Vivien Marx Scientists at the department

More information

Delivering the power of the world s most successful genomics platform

Delivering the power of the world s most successful genomics platform Delivering the power of the world s most successful genomics platform NextCODE Health is bringing the full power of the world s largest and most successful genomics platform to everyday clinical care NextCODE

More information

Personalized Medicine and IT

Personalized Medicine and IT Personalized Medicine and IT Data-driven Medicine in the Age of Genomics www.intel.com/healthcare/bigdata Ketan Paranjape General Manager, Life Sciences Intel Corp. @Portlandketan 1 The Central Dogma of

More information

Introduction to NGS data analysis

Introduction to NGS data analysis Introduction to NGS data analysis Jeroen F. J. Laros Leiden Genome Technology Center Department of Human Genetics Center for Human and Clinical Genetics Sequencing Illumina platforms Characteristics: High

More information

Preparing the scenario for the use of patient s genome sequences in clinic. Joaquín Dopazo

Preparing the scenario for the use of patient s genome sequences in clinic. Joaquín Dopazo Preparing the scenario for the use of patient s genome sequences in clinic Joaquín Dopazo Computational Medicine Institute, Centro de Investigación Príncipe Felipe (CIPF), Functional Genomics Node, (INB),

More information

CHALLENGES IN NEXT-GENERATION SEQUENCING

CHALLENGES IN NEXT-GENERATION SEQUENCING CHALLENGES IN NEXT-GENERATION SEQUENCING BASIC TENETS OF DATA AND HPC Gray s Laws of data engineering 1 : Scientific computing is very dataintensive, with no real limits. The solution is scale-out architecture

More information

High Throughput Sequencing Data Analysis using Cloud Computing

High Throughput Sequencing Data Analysis using Cloud Computing High Throughput Sequencing Data Analysis using Cloud Computing Stéphane Le Crom ([email protected]) LBD - Université Pierre et Marie Curie (UPMC) Institut de Biologie de l École normale supérieure

More information

The NGS IT notes. George Magklaras PhD RHCE

The NGS IT notes. George Magklaras PhD RHCE The NGS IT notes George Magklaras PhD RHCE Biotechnology Center of Oslo & The Norwegian Center of Molecular Medicine University of Oslo, Norway http://www.biotek.uio.no http://www.ncmm.uio.no http://www.no.embnet.org

More information

Genomic Medicine The Future of Cancer Care. Shayma Master Kazmi, M.D. Medical Oncology/Hematology Cancer Treatment Centers of America

Genomic Medicine The Future of Cancer Care. Shayma Master Kazmi, M.D. Medical Oncology/Hematology Cancer Treatment Centers of America Genomic Medicine The Future of Cancer Care Shayma Master Kazmi, M.D. Medical Oncology/Hematology Cancer Treatment Centers of America Personalized Medicine Personalized health care is a broad term for interventions

More information

Challenges associated with analysis and storage of NGS data

Challenges associated with analysis and storage of NGS data Challenges associated with analysis and storage of NGS data Gabriella Rustici Research and training coordinator Functional Genomics Group [email protected] Next-generation sequencing Next-generation sequencing

More information

Big data in cancer research : DNA sequencing and personalised medicine

Big data in cancer research : DNA sequencing and personalised medicine Big in cancer research : DNA sequencing and personalised medicine Philippe Hupé Conférence BIGDATA 04/04/2013 1 - Titre de la présentation - nom du département émetteur et/ ou rédacteur - 00/00/2005 Deciphering

More information

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik Leading Genomics Diagnostic harma Discove Collab Shanghai Cambridge, MA Reykjavik Global leadership for using the genome to create better medicine WuXi NextCODE provides a uniquely proven and integrated

More information

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect on AWS Services Overview Bernie Nallamotu Principle Solutions Architect \ So what is it? When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze

More information

Cloud-based Analytics and Map Reduce

Cloud-based Analytics and Map Reduce 1 Cloud-based Analytics and Map Reduce Datasets Many technologies converging around Big Data theme Cloud Computing, NoSQL, Graph Analytics Biology is becoming increasingly data intensive Sequencing, imaging,

More information

Single-Cell Whole Genome Sequencing on the C1 System: a Performance Evaluation

Single-Cell Whole Genome Sequencing on the C1 System: a Performance Evaluation PN 100-9879 A1 TECHNICAL NOTE Single-Cell Whole Genome Sequencing on the C1 System: a Performance Evaluation Introduction Cancer is a dynamic evolutionary process of which intratumor genetic and phenotypic

More information

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille Journées SUCCES Stéphane Le Crom (UPMC IBENS) [email protected] Paris November 2013 The Sanger DNA sequencing method Sequencing

More information

Cloud-Based Big Data Analytics in Bioinformatics

Cloud-Based Big Data Analytics in Bioinformatics Cloud-Based Big Data Analytics in Bioinformatics Presented By Cephas Mawere Harare Institute of Technology, Zimbabwe 1 Introduction 2 Big Data Analytics Big Data are a collection of data sets so large

More information

Worldwide Collaborations in Molecular Profiling

Worldwide Collaborations in Molecular Profiling Worldwide Collaborations in Molecular Profiling Lillian L. Siu, MD Director, Phase I Program and Cancer Genomics Program Princess Margaret Cancer Centre Lillian Siu, MD Contracted Research: Novartis, Pfizer,

More information

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community Ntinos Krampis Asst. Professor J. Craig Venter Institute [email protected] http://www.jcvi.org/cms/about/bios/kkrampis/

More information

GC3 Use cases for the Cloud

GC3 Use cases for the Cloud GC3: Grid Computing Competence Center GC3 Use cases for the Cloud Some real world examples suited for cloud systems Antonio Messina Trieste, 24.10.2013 Who am I System Architect

More information

Digital Health: Catapulting Personalised Medicine Forward STRATIFIED MEDICINE

Digital Health: Catapulting Personalised Medicine Forward STRATIFIED MEDICINE Digital Health: Catapulting Personalised Medicine Forward STRATIFIED MEDICINE CRUK Stratified Medicine Initiative Somatic mutation testing for prediction of treatment response in patients with solid tumours:

More information

Practical Solutions for Big Data Analytics

Practical Solutions for Big Data Analytics Practical Solutions for Big Data Analytics Ravi Madduri Computation Institute ([email protected]) Paul Dave ([email protected]) Dinanath Sulakhe ([email protected]) Alex Rodriguez ([email protected])

More information

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Focusing on results not data comprehensive data analysis for targeted next generation sequencing Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes

More information

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community Ntinos Krampis Asst. Professor J. Craig Venter Institute [email protected] http://www.jcvi.org/cms/about/bios/kkrampis/

More information

Summary of Discussion on Non-clinical Pharmacology Studies on Anticancer Drugs

Summary of Discussion on Non-clinical Pharmacology Studies on Anticancer Drugs Provisional Translation (as of January 27, 2014)* November 15, 2013 Pharmaceuticals and Bio-products Subcommittees, Science Board Summary of Discussion on Non-clinical Pharmacology Studies on Anticancer

More information

Computational infrastructure for NGS data analysis. José Carbonell Caballero Pablo Escobar

Computational infrastructure for NGS data analysis. José Carbonell Caballero Pablo Escobar Computational infrastructure for NGS data analysis José Carbonell Caballero Pablo Escobar Computational infrastructure for NGS Cluster definition: A computer cluster is a group of linked computers, working

More information

Next Generation Sequencing: Adjusting to Big Data. Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa 29.10.2013

Next Generation Sequencing: Adjusting to Big Data. Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa 29.10.2013 Next Generation Sequencing: Adjusting to Big Data Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa 29.10.2013 Outline Human Genome Project Next-Generation Sequencing Personalized Medicine

More information

OpenCB development - A Big Data analytics and visualisation platform for the Omics revolution

OpenCB development - A Big Data analytics and visualisation platform for the Omics revolution OpenCB development - A Big Data analytics and visualisation platform for the Omics revolution Ignacio Medina, Paul Calleja, John Taylor (University of Cambridge, UIS, HPC Service (HPCS)) Abstract The advent

More information

Genetic diagnostics the gateway to personalized medicine

Genetic diagnostics the gateway to personalized medicine Micronova 20.11.2012 Genetic diagnostics the gateway to personalized medicine Kristiina Assoc. professor, Director of Genetic Department HUSLAB, Helsinki University Central Hospital The Human Genome Packed

More information

Putting Genomes in the Cloud with WOS TM. ddn.com. DDN Whitepaper. Making data sharing faster, easier and more scalable

Putting Genomes in the Cloud with WOS TM. ddn.com. DDN Whitepaper. Making data sharing faster, easier and more scalable DDN Whitepaper Putting Genomes in the Cloud with WOS TM Making data sharing faster, easier and more scalable Table of Contents Cloud Computing 3 Build vs. Rent 4 Why WOS Fits the Cloud 4 Storing Sequences

More information

Comparing Methods for Identifying Transcription Factor Target Genes

Comparing Methods for Identifying Transcription Factor Target Genes Comparing Methods for Identifying Transcription Factor Target Genes Alena van Bömmel (R 3.3.73) Matthew Huska (R 3.3.18) Max Planck Institute for Molecular Genetics Folie 1 Transcriptional Regulation TF

More information

Large-scale Research Data Management and Analysis Using Globus Services. Ravi Madduri Argonne National Lab University of Chicago @madduri

Large-scale Research Data Management and Analysis Using Globus Services. Ravi Madduri Argonne National Lab University of Chicago @madduri Large-scale Research Data Management and Analysis Using Globus Services Ravi Madduri Argonne National Lab University of Chicago @madduri Outline Who we are Challenges in Big Data Management and Analysis

More information

BioHPC Web Computing Resources at CBSU

BioHPC Web Computing Resources at CBSU BioHPC Web Computing Resources at CBSU 3CPG workshop Robert Bukowski Computational Biology Service Unit http://cbsu.tc.cornell.edu/lab/doc/biohpc_web_tutorial.pdf BioHPC infrastructure at CBSU BioHPC Web

More information

Text file One header line meta information lines One line : variant/position

Text file One header line meta information lines One line : variant/position Software Calling: GATK SAMTOOLS mpileup Varscan SOAP VCF format Text file One header line meta information lines One line : variant/position ##fileformat=vcfv4.1! ##filedate=20090805! ##source=myimputationprogramv3.1!

More information

Installation Guide for Windows

Installation Guide for Windows Installation Guide for Windows Overview: Getting Ready Installing Sequencher Activating and Installing the License Registering Sequencher GETTING READY Trying Sequencher: Sequencher 5.2 and newer requires

More information

Building Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT

Building Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT Building Bioinformatics Capacity in Africa Nicky Mulder CBIO Group, UCT Outline What is bioinformatics? Why do we need IT infrastructure? What e-infrastructure does it require? How we are developing this

More information

Analysis of NGS Data

Analysis of NGS Data Analysis of NGS Data Introduction and Basics Folie: 1 Overview of Analysis Workflow Images Basecalling Sequences denovo - Sequencing Assembly Annotation Resequencing Alignments Comparison to reference

More information
