SAP HANA Enabling Genome Analysis Joanna L. Kelley, PhD Postdoctoral Scholar, Stanford University Enakshi Singh, MSc HANA Product Management, SAP Labs LLC
Outline: Use cases; Genomics review; Challenges in genomic analysis; SAP HANA as the solution; Rethinking the genomics pipeline; Vision for the future
Use Case 1: Clinician. Identify Clinically Actionable Genetic Variants (e.g. Causing Tumor Formation) in Order to Deliver Personalized Medical Treatment. Needs: Real-Time Comparison of Variants to Assess Causal Ones; Access to all Patient-Specific Data Anytime and Anywhere
Use Case 2: Researcher. Identify Causal Variants or Mutations in Cohorts (> 10,000 Individuals) Suffering from Diseases of Interest, e.g. Autism. Needs: Comparison of Variants in Diseased and Healthy Cohorts; Flexible Queries to Verify Hypotheses in Real-Time
What is the Genome? GENOMICS. Today: ~3500 known diseases caused by DNA changes
Genetic Variants: Chromosomes are like chapters in a book. Genes are like sentences in a chapter. Genetic variations (or mutations) are like misspelled words or missing sentences. Any two individuals are 99.9% identical in their DNA. The 0.1% of unique DNA is what makes us different.
Human Genome: The entire set of genetic information is called our genome. A single genome consists of 3.2 billion base-pairs of DNA, spread across 23 chromosome pairs. [Figure: sex chromosomes]
Why is this a big data problem? (Sboner et al., Genome Biology 2011)
Genomics Data & SAP HANA. How big? Tens to hundreds of billions of records; tens to hundreds of terabytes of genome sequence data. Need for speed? Clinical Environments: Premature & New-born babies; Researchers: Interactive speed. Why HANA? Speed: In-Memory, Column Store; Efficient Caching, Compression, Late Materialization. Versatility: SQL Language; Application Builder
SAP HANA: Due to the Power of Mathematics and Distributed Computing, SAP HANA can Predictably Complete any Information Processing Task, However Complex, Within a Given Time-Window. Scanning: 3 MB/msec/core; Inserting: 1.5M Records/sec; Aggregating: 12.5M Records/sec/core
SAP HANA +
Genomics Pipeline. Roles: Sequencing Service/Lab (e.g. Biologist); Computational Pipeline (e.g. Bioinformatician); Computational Analysis (e.g. Clinicians and Researchers). Steps: Sequencing, Alignment, Variant Calling, Annotation and Analysis. Data: Patient Samples, Raw DNA Reads, Mapped Genome, Discovered Variants, Follow-up and Validation
Genomics Pipeline, Alignment: Numerous open-source/commercial tools for alignment. Common tools: BWA-SW & SOAP. Raw DNA reads aligned to human reference genome. Algorithms must be tolerant to slight variations in reads (from the reference). SLOW process.
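To illustrate what "tolerant to slight variations" means for alignment, here is a minimal sketch of mismatch-tolerant read placement. It is a naive sliding-window scan and is not the algorithm used by BWA-SW, SOAP, or SAP HANA; the function name `best_alignment`, the `max_mismatches` parameter, and the short sequences are hypothetical and chosen only for illustration.

```python
def best_alignment(read, reference, max_mismatches=2):
    """Return (offset, mismatches) of the best placement of read, or None.

    Naive approach: slide the read along the reference and count
    mismatching bases (Hamming distance) at every offset.
    """
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        # Keep the first placement with the fewest mismatches.
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (offset, mismatches)
    return best


# Hypothetical toy data: the read differs from the reference by one base.
REFERENCE = "ACGTTGCAAGGCTTACGATC"
READ = "AAGGATTAC"

print(best_alignment(READ, REFERENCE))                    # tolerant search
print(best_alignment(READ, REFERENCE, max_mismatches=0))  # exact match only
```

An exact-match search finds no placement for this read, while allowing a small number of mismatches does. The scan costs time proportional to read length times reference length, which hints at why alignment against a 3.2 billion base-pair genome is slow without indexing.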
Genomics Pipeline, Alignment. Faster: BWA-SW 28.3h vs. SAP HANA 3.6h. Higher Accuracy: BWA-SW 0.53% misaligned vs. SAP HANA 0.35% misaligned; BWA-SW 0.34% unaligned vs. SAP HANA 0.14% unaligned
Genomics Pipeline, Annotation and Analysis: Identifying and analyzing frequency of variants. Identifying variant leading to condition of interest. Searching through literature databases for info on variant of interest.
Annotation & Analysis: Output of variant calling commonly stored in Variant Call Format (VCF). Contains: Positions and states of variants identified; Quality score of each variant; Additional meta-data for each variant. Annotation, common query: Report SNPs (Single Nucleotide Polymorphisms) Failing Quality Control. Common tool: UCSC Genome Browser
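The "report SNPs failing quality control" query can be sketched over VCF-style text as follows. This is a minimal illustration, not the UCSC Genome Browser or SAP HANA implementation. It assumes the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, ...) and treats a FILTER value other than `PASS` or `.` as a failure. The three records, their IDs, and the `q10` filter name are made up.

```python
# Hypothetical VCF snippet (tab-separated); not data from the talk.
VCF_TEXT = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t100500\trs_demo1\tA\tG\t50\tPASS\t.
1\t150000\trs_demo2\tC\tT\t8\tq10\t.
1\t160000\trs_demo3\tAT\tA\t5\tq10\t.
"""


def failing_snps(vcf_text):
    """Return (chrom, pos, id, filter) for SNPs whose FILTER is not PASS."""
    failing = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip meta-information and header lines
        chrom, pos, vid, ref, alt, _qual, flt = line.split("\t")[:7]
        # A SNP has a single-base REF and single-base ALT allele(s).
        is_snp = len(ref) == 1 and all(
            len(a) == 1 and a in "ACGTN" for a in alt.split(",")
        )
        failed_qc = flt not in ("PASS", ".")
        if is_snp and failed_qc:
            failing.append((chrom, int(pos), vid, flt))
    return failing


print(failing_snps(VCF_TEXT))
```

The third record also fails its filter but is a deletion rather than a SNP, so only the second record is reported.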
Annotation & Analysis: Analysis, common queries: Compute the alternative allele frequency for each variant in a genomic region (Chromosome 1, positions 100,000-200,000); Compute the total number of missing genotypes for each individual. Common tool: VCFtools
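As a rough sketch of the first query, the alternative allele frequency of a variant is the number of non-reference alleles divided by the number of called (non-missing) alleles. The code below assumes VCF-style genotype strings, where `0` is the reference allele, `1` an alternate allele, and `.` a missing call. The variant records and function names are hypothetical, and this is neither VCFtools' nor SAP HANA's implementation.

```python
# Hypothetical variants: (chromosome, position, genotypes per individual).
VARIANTS = [
    ("1", 99_000, ["0/0", "0/1", "1/1"]),
    ("1", 120_000, ["0/1", "0/0", "./."]),
    ("1", 180_000, ["1/1", "0/1", "0/1"]),
    ("2", 150_000, ["0/1", "0/1", "0/0"]),
]


def alt_allele_frequency(genotypes):
    """Fraction of called alleles that are non-reference (None if no calls)."""
    alleles = [
        allele
        for gt in genotypes
        for allele in gt.replace("|", "/").split("/")
        if allele != "."  # ignore missing calls
    ]
    if not alleles:
        return None
    return sum(allele != "0" for allele in alleles) / len(alleles)


def frequencies_in_region(variants, chrom, start, end):
    """Map position -> alternative allele frequency within chrom:start-end."""
    return {
        pos: alt_allele_frequency(gts)
        for c, pos, gts in variants
        if c == chrom and start <= pos <= end
    }


# The region from the slide: Chromosome 1, positions 100,000-200,000.
print(frequencies_in_region(VARIANTS, "1", 100_000, 200_000))
```

Only the two variants on chromosome 1 inside the interval are reported, and the missing genotype at position 120,000 is excluded from the denominator.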
Genomics Pipeline, Annotation and Analysis. Report SNPs (Single Nucleotide Polymorphisms) Failing Quality Control: UCSC 102.47 sec vs. SAP HANA 1.25 sec (82x faster). Compute the Alternative Allele Frequency for Each Variant in a Genomic Region (Chromosome 1, Positions 100,000-200,000): VCFtools 259 sec vs. SAP HANA 0.43 sec (600x faster). Compute the Total Number of Missing Genotypes for Each Individual: VCFtools 548 sec vs. SAP HANA 2 sec (270x faster)
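The speedup factors quoted on this slide follow directly from the timings. A short check, using only the numbers on the slide (the dictionary keys are abbreviated query names):

```python
# (baseline tool seconds, SAP HANA seconds), as given on the slide.
TIMINGS = {
    "SNPs failing QC (UCSC)": (102.47, 1.25),
    "Alt. allele frequency (VCFtools)": (259.0, 0.43),
    "Missing genotypes (VCFtools)": (548.0, 2.0),
}

# Speedup = baseline runtime divided by SAP HANA runtime.
speedups = {query: base / hana for query, (base, hana) in TIMINGS.items()}

for query, factor in speedups.items():
    print(f"{query}: {factor:.1f}x")
```

The ratios come out at roughly 82, 602, and 274, which the slide rounds to 82x, 600x, and 270x.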
Total # of Missing Genotypes
Sample #   SNP Genotypes
1          A/A  A/T  G/G  A/A
2          A/A  ./.  ./.  A/A
3          A/C  ./.  ./.  A/A
4          A/A  ./.  ./.  A/A
5          A/C  ./.  ./.  C/C
6          ?    ./.  ./.  C/C
7          C/C  A/T  G/G  C/C
8          C/C  A/T  G/T  C/C
9          C/C  ./.  ./.  C/C
10         C/C  ./.  ./.  A/C
./. = Missing Genotype. SNPs with high rate of missingness: potential problem. Compute the Total Number of Missing Genotypes for Each Individual: VCFtools 548 sec vs. SAP HANA 2 sec (270x faster). Source: Shaila Musharoff / Bustamante Lab / Stanford
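The missing-genotype count can be worked through on the ten samples shown on this slide. The sketch below transcribes the genotypes as they appear, counts `./.` entries per individual, and also computes the missingness rate per SNP. The `?` for sample 6 is kept as it appears in the source and, as an assumption, is not counted as missing. The variable names are hypothetical.

```python
MISSING = "./."

# Genotypes per sample, transcribed from the slide (four SNPs each).
GENOTYPES = {
    1: ["A/A", "A/T", "G/G", "A/A"],
    2: ["A/A", "./.", "./.", "A/A"],
    3: ["A/C", "./.", "./.", "A/A"],
    4: ["A/A", "./.", "./.", "A/A"],
    5: ["A/C", "./.", "./.", "C/C"],
    6: ["?", "./.", "./.", "C/C"],  # "?" as in the source; not counted as missing
    7: ["C/C", "A/T", "G/G", "C/C"],
    8: ["C/C", "A/T", "G/T", "C/C"],
    9: ["C/C", "./.", "./.", "C/C"],
    10: ["C/C", "./.", "./.", "A/C"],
}

# Total number of missing genotypes for each individual.
missing_per_sample = {
    sample: calls.count(MISSING) for sample, calls in GENOTYPES.items()
}

# Fraction of individuals with a missing call at each SNP (column).
n_samples = len(GENOTYPES)
n_snps = len(next(iter(GENOTYPES.values())))
missing_rate_per_snp = [
    sum(calls[i] == MISSING for calls in GENOTYPES.values()) / n_samples
    for i in range(n_snps)
]

print(missing_per_sample)
print(missing_rate_per_snp)
```

On this data, seven of the ten individuals have two missing genotypes each, and the second and third SNPs are missing in 70% of individuals, the kind of high-missingness SNP the slide flags as a potential problem.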
Moving Forward: SAP HANA to contribute to all parts of the pipeline
The Future. Enable Clinicians to: Make Evidence-Based Therapy Decisions at the Patient's Bed; Supervise High-Risk Patients to Prevent Emergencies. Enable Researchers to: Investigate the Genomes of Millions of High-Risk Patients on a Cluster < 10M USD; Analyze the Results in Real-Time
Pathway: A molecular pathway is a signaling cascade in a cell with proteins as key components. Drug: Compound designed to cure diseases. METABOLOMICS, PROTEOMICS, TRANSCRIPTOMICS, GENOMICS
Vision for Personalized Medicine: Information and Feedback within the Window of Opportunity. Patients, Doctors, Insurers, Researchers. Real-Time Data Capture and Analysis. SAP HANA Healthcare Platform: Genomics, Electronic Medical Records, Annotations, ... All Relevant Medical Information
THANK YOU FOR PARTICIPATING. Please provide feedback on this session by completing a short survey via the event mobile application. SESSION CODE: 3503. For ongoing education on this area of focus, visit www.asug.com