SAP HANA Enabling Genome Analysis




SAP HANA Enabling Genome Analysis
Joanna L. Kelley, PhD, Postdoctoral Scholar, Stanford University
Enakshi Singh, MSc, HANA Product Management, SAP Labs LLC

Outline
Use cases
Genomics review
Challenges in genomic analysis
SAP HANA as the solution
Rethinking the genomics pipeline
Vision for the future

Use Case 1: Clinician
Identify clinically actionable genetic variants (e.g., those causing tumor formation) in order to deliver personalized medical treatment.
Needs:
Real-time comparison of variants to assess causal ones
Access to all patient-specific data anytime and anywhere

Use Case 2: Researcher
Identify causal variants or mutations in cohorts (> 10,000 individuals) suffering from diseases of interest, e.g., autism.
Needs:
Comparison of variants in diseased and healthy cohorts
Flexible queries to verify hypotheses in real time

What is the Genome?
Genomics today: ~3,500 known diseases caused by DNA changes

Genetic Variants
Chromosomes are like chapters in a book.
Genes are like sentences in a chapter.
Genetic variations (or mutations) are like misspelled words or missing sentences.
Any two individuals are 99.9% identical in their DNA.
The 0.1% of unique DNA is what makes us different.

Human Genome
The entire set of genetic information is called our genome.
A single genome consists of 3.2 billion base pairs of DNA, spread across 23 chromosome pairs.
(Figure label: Sex chromosomes)

Why is this a big data problem?
(Figure source: Sboner et al., Genome Biology, 2011)

Genomics Data & SAP HANA
How big?
  Tens to hundreds of billions of records
  Tens to hundreds of terabytes of genome sequence data
Need for speed?
  Clinical environments: premature and newborn babies
  Researchers: interactive speed
Why HANA?
  Speed: in-memory, column store, efficient caching, compression, late materialization
  Versatility: SQL language, application builder
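To make the "column store" and "compression" bullets concrete, here is a minimal sketch of dictionary encoding, a common column-store technique. It is a generic illustration and not a description of SAP HANA internals; the function names and the sample genotype column are made up for this example.

```python
from typing import List, Tuple


def dictionary_encode(column: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each value with a small integer code plus a lookup dictionary."""
    dictionary: List[str] = []   # code -> value
    index = {}                   # value -> code
    codes: List[int] = []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def count_equal(dictionary: List[str], codes: List[int], value: str) -> int:
    """Count matches by scanning the integer codes, never the strings."""
    if value not in dictionary:
        return 0
    code = dictionary.index(value)
    return sum(c == code for c in codes)


# Hypothetical genotype column for one SNP across five samples.
column = ["A/A", "A/T", "A/A", "./.", "A/A"]
dictionary, codes = dictionary_encode(column)
print(dictionary, codes, count_equal(dictionary, codes, "A/A"))
```

Genotype columns have very few distinct values, so the codes are small and scans touch only integers, which is one reason columnar layouts suit this kind of data.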

SAP HANA
Due to the power of mathematics and distributed computing, SAP HANA can predictably complete any information processing task, however complex, within a given time window.
Scanning: 3 MB/msec/core
Inserting: 1.5M records/sec
Aggregating: 12.5M records/sec/core
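A back-of-the-envelope sketch of what these per-core figures would imply for a full scan or aggregation. It assumes perfect linear scaling across cores, which real systems rarely achieve, and the core count and data sizes are hypothetical.

```python
# Throughput figures quoted on the slide.
SCAN_MB_PER_MS_PER_CORE = 3.0         # scanning: 3 MB/msec/core
AGG_RECORDS_PER_S_PER_CORE = 12.5e6   # aggregating: 12.5M records/sec/core


def scan_seconds(data_mb: float, cores: int) -> float:
    """Ideal time to scan data_mb megabytes (assumes perfect scaling)."""
    milliseconds = data_mb / (SCAN_MB_PER_MS_PER_CORE * cores)
    return milliseconds / 1000.0


def aggregate_seconds(records: float, cores: int) -> float:
    """Ideal time to aggregate a number of records (assumes perfect scaling)."""
    return records / (AGG_RECORDS_PER_S_PER_CORE * cores)


cores = 40  # hypothetical core count
print(f"scan 1 TB: {scan_seconds(1_000_000, cores):.1f} s")           # 1 TB = 1e6 MB
print(f"aggregate 10 billion records: {aggregate_seconds(10e9, cores):.1f} s")
```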


Genomics Pipeline
Roles: sequencing service/lab (e.g., biologist); computational pipeline (e.g., bioinformatician); computational analysis (e.g., clinicians and researchers)
Steps: Sequencing -> Alignment -> Variant Calling -> Annotation and Analysis
Data: patient samples -> raw DNA reads -> mapped genome -> discovered variants -> follow-up and validation

Genomics Pipeline: Alignment
Numerous open-source and commercial tools exist for alignment.
Common tools: BWA-SW and SOAP
Raw DNA reads are aligned to the human reference genome.
Algorithms must tolerate slight variations in the reads (relative to the reference).
Slow process
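The requirement that alignment tolerate slight variations can be illustrated with a deliberately naive sketch: slide the read along the reference and accept the placement with the fewest mismatches, up to a threshold. This is not how BWA-SW, SOAP, or SAP HANA align reads (production aligners use indexes and also handle insertions and deletions); the function names and sequences are invented for illustration.

```python
from typing import Optional, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(reference: str, read: str,
               max_mismatches: int = 2) -> Optional[Tuple[int, int]]:
    """Return (position, mismatches) of the best placement, or None.

    Brute force: O(len(reference) * len(read)), substitutions only.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        distance = hamming(reference[pos:pos + len(read)], read)
        if distance <= max_mismatches and (best is None or distance < best[1]):
            best = (pos, distance)
    return best


reference = "ACGTACGTTAGC"                 # toy reference sequence
print(align_read(reference, "ACGTTAGC"))   # exact match
print(align_read(reference, "ACGTTAGG"))   # one substitution tolerated
print(align_read(reference, "TTTTTTTT"))   # too different: unaligned
```

The brute-force scan also hints at why alignment is the slow step: a real reference has billions of positions and a sequencing run produces a very large number of reads.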

Genomics Pipeline: Alignment results
Metric               BWA-SW    SAP HANA
Run time (faster)    28.3 h    3.6 h
Misaligned reads     0.53%     0.35%
Unaligned reads      0.34%     0.14%

Genomics Pipeline: Annotation and Analysis
Identifying and analyzing the frequency of variants
Identifying the variant leading to the condition of interest
Searching literature databases for information on the variant of interest

Annotation & Analysis
Output of variant calling is commonly stored in Variant Call Format (VCF).
Contains:
  Positions and states of the variants identified
  Quality score of each variant
  Additional metadata for each variant
Annotation, common query: report SNPs (single nucleotide polymorphisms) failing quality control
Common tool: UCSC Genome Browser
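As a rough sketch of the "SNPs failing quality control" query, the snippet below reads tab-separated VCF data lines and keeps single-base variants whose FILTER column is neither `PASS` nor `.` (filters not applied). The sample records, the filter name `q10`, and the function names are invented for illustration; the talk does not specify which QC criteria were used.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: str
    filt: str  # VCF FILTER column


def parse_vcf(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per data line, skipping '#' header/meta lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        yield Variant(f[0], int(f[1]), f[3], f[4], f[5], f[6])


def is_snp(v: Variant) -> bool:
    """True when REF and every ALT allele are single bases."""
    bases = set("ACGT")
    return v.ref in bases and all(a in bases for a in v.alt.split(","))


def snps_failing_qc(lines: Iterable[str]) -> List[Variant]:
    return [v for v in parse_vcf(lines)
            if is_snp(v) and v.filt not in ("PASS", ".")]


# Made-up example records.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT\t10\tq10\t.",
    "1\t300\t.\tAT\tA\t5\tq10\t.",   # deletion, not a SNP
]
for v in snps_failing_qc(example):
    print(v.chrom, v.pos, v.ref, v.alt, v.filt)
```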

Annotation & Analysis
Analysis, common queries:
  Compute the alternative allele frequency for each variant in a genomic region (chromosome 1, positions 100,000-200,000)
  Compute the total number of missing genotypes for each individual
Common tool: VCFtools
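The first query can be sketched as follows, assuming genotypes are given in VCF `GT` notation (`0` for the reference allele, nonzero for an alternative allele, `.` for missing). The in-memory tuple layout, function names, and sample data are assumptions made for this example, not the schema used in the talk.

```python
from typing import Dict, List, Optional, Tuple

# (chromosome, position, genotype per individual), a hypothetical layout
VariantRow = Tuple[str, int, List[str]]


def alt_allele_frequency(genotypes: List[str]) -> Optional[float]:
    """Fraction of called alleles that are non-reference; None if none called."""
    alt = called = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":      # missing allele: excluded from the denominator
                continue
            called += 1
            if allele != "0":
                alt += 1
    return alt / called if called else None


def region_frequencies(variants: List[VariantRow], chrom: str,
                       start: int, end: int) -> Dict[int, Optional[float]]:
    """Alternative allele frequency per variant position within a region."""
    return {pos: alt_allele_frequency(gts)
            for c, pos, gts in variants
            if c == chrom and start <= pos <= end}


variants = [
    ("1", 150_000, ["0/0", "0/1", "1/1", "./."]),
    ("1", 250_000, ["0/1"]),          # outside the region
    ("2", 150_000, ["1/1"]),          # different chromosome
]
print(region_frequencies(variants, "1", 100_000, 200_000))
```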

Genomics Pipeline: Annotation and Analysis results
Query                                                                  Existing tool          SAP HANA    Speedup
Report SNPs (single nucleotide polymorphisms) failing quality control  UCSC: 102.47 sec       1.25 sec    82x faster
Compute the alternative allele frequency for each variant in a
genomic region (chromosome 1, positions 100,000-200,000)               VCFtools: 259 sec      0.43 sec    600x faster
Compute the total number of missing genotypes for each individual      VCFtools: 548 sec      2 sec       270x faster
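The speedup figures follow directly from the quoted timings; the short check below recomputes them and shows that the slide rounds the ratios (for example, 548 / 2 = 274 is reported as 270x). The dictionary keys are shortened labels chosen for this sketch.

```python
# (existing tool seconds, SAP HANA seconds) as quoted on the slide
timings = {
    "SNPs failing QC (UCSC)": (102.47, 1.25),
    "Alt allele frequency (VCFtools)": (259.0, 0.43),
    "Missing genotypes (VCFtools)": (548.0, 2.0),
}


def speedup(baseline_s: float, hana_s: float) -> float:
    """How many times faster the second timing is than the first."""
    return baseline_s / hana_s


for query, (baseline_s, hana_s) in timings.items():
    print(f"{query}: {speedup(baseline_s, hana_s):.0f}x")
```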

Total # of Missing Genotypes
Sample #   SNP genotypes
1          A/A   A/T   G/G   A/A
2          A/A   ./.   ./.   A/A
3          A/C   ./.   ./.   A/A
4          A/A   ./.   ./.   A/A
5          A/C   ./.   ./.   C/C
6          ?     ./.   ./.   C/C
7          C/C   A/T   G/G   C/C
8          C/C   A/T   G/T   C/C
9          C/C   ./.   ./.   C/C
10         C/C   ./.   ./.   A/C
./. = missing genotype
SNPs with a high rate of missingness are a potential problem.
Compute the total number of missing genotypes for each individual: VCFtools 548 sec, SAP HANA 2 sec (270x faster)
Source: Shaila Musharoff / Bustamante Lab / Stanford
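The missing-genotype count can be sketched directly on the table from this slide. Only `./.` is counted as missing; the `?` entry for sample 6 is kept as it appears and is not counted, which is an assumption of this sketch. Function and variable names are illustrative.

```python
from typing import Dict, List

MISSING = "./."

# Genotype table as shown on the slide (sample number -> genotype per SNP).
GENOTYPES: Dict[int, List[str]] = {
    1: ["A/A", "A/T", "G/G", "A/A"],
    2: ["A/A", "./.", "./.", "A/A"],
    3: ["A/C", "./.", "./.", "A/A"],
    4: ["A/A", "./.", "./.", "A/A"],
    5: ["A/C", "./.", "./.", "C/C"],
    6: ["?", "./.", "./.", "C/C"],
    7: ["C/C", "A/T", "G/G", "C/C"],
    8: ["C/C", "A/T", "G/T", "C/C"],
    9: ["C/C", "./.", "./.", "C/C"],
    10: ["C/C", "./.", "./.", "A/C"],
}


def missing_per_individual(table: Dict[int, List[str]]) -> Dict[int, int]:
    """Row-wise count: missing genotypes for each sample."""
    return {sample: sum(gt == MISSING for gt in gts)
            for sample, gts in table.items()}


def missing_per_snp(table: Dict[int, List[str]]) -> List[int]:
    """Column-wise count: samples with a missing genotype at each SNP."""
    return [sum(gt == MISSING for gt in column)
            for column in zip(*table.values())]


print(missing_per_individual(GENOTYPES))
print(missing_per_snp(GENOTYPES))
```

The column-wise count is the one that flags SNPs with a high rate of missingness: here the second and third SNPs are missing in 7 of 10 samples.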

Moving Forward
SAP HANA to contribute to all parts of the pipeline: sequencing, alignment, variant calling, and annotation and analysis.

The Future
Enable clinicians to:
  Make evidence-based therapy decisions at the patient's bed
  Supervise high-risk patients to prevent emergencies
Enable researchers to:
  Investigate the genomes of millions of high-risk patients on a cluster (< 10M USD)
  Analyze the results in real time

Pathway: a molecular pathway is a signaling cascade in a cell with proteins as key components.
Drug: a compound designed to cure diseases.
Layers: metabolomics, proteomics, transcriptomics, genomics

Vision for Personalized Medicine
Information and feedback within the window of opportunity
Patients, doctors, insurers, researchers
Real-time data capture and analysis
SAP HANA Healthcare Platform: genomics, electronic medical records, annotations, ... all relevant medical information

THANK YOU FOR PARTICIPATING
Please provide feedback on this session by completing a short survey via the event mobile application.
SESSION CODE: 3503
For ongoing education on this area of focus, visit www.asug.com