Towards Integrating the Detection of Genetic Variants into an In-Memory Database

2nd International Workshop on Big Data in Bioinformatics and Healthcare, Oct 27, 2014

Motivation: Genome Data Analysis Process

Process steps: DNA Sample → Base Sequencing → Read Alignment → Variant Calling → Data Annotation → Analysis Results

- Next-generation sequencing (NGS) requires an adapted analysis workflow:
  - Higher error rates
  - Shorter reads
- The base sequencing step produces output within a few hours
- Subsequent processing steps take days, up to several weeks

Motivation: The Next-Generation Sequencing Data Deluge

- The NGS growth pattern is more remarkable than Moore's law
  → Addressing the data deluge with more computing power is not an option
- For variant calling, there are still options to improve data processing:
  - Single-threaded processing
  - Data stored in files on disk

[Figure: main memory cost per megabyte and sequencing cost per megabase; y-axis: cost in USD (0.001 to 10000); x-axis: date (01/12/01 to 01/12/13)]

IMDB Building Blocks

- Combined column and row store
- Map/Reduce
- Single and multi-tenancy
- Insert only for time travel
- Real-time replication
- Working on integers
- Active/passive data store
- Minimal projections
- Group key
- Dynamic multithreading
- Bulk load of data
- Object-relational mapping
- No aggregate tables
- Data partitioning
- Any attribute as index
- On-the-fly extensibility
- Analytics on historical data
- Multi-core/parallelization
- Lightweight compression
- SQL interface on columns and rows
- Reduction of software layers
- Text retrieval and extraction engine
- No disk

Different Types of Genetic Variants

- AACTG vs. ATCTG: Single Nucleotide Polymorphism (SNP)
- AACTG vs. AA_TG: Insertion or Deletion (InDel)
- AACTG vs. GTCAA: Structural Variations (SV)

Different calling strategies exist for variant types of increasing complexity:
- SNP calling (single-/multi-sample)
- InDel calling
→ Focus here is on single-sample SNP calling
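The first two examples on this slide can be told apart by a column-wise comparison of two pre-aligned sequences. The following is a minimal sketch of that idea only; the function name `classify_site_differences` and the use of `_` as the gap symbol are assumptions taken from the slide's notation, not part of the authors' implementation.

```python
def classify_site_differences(reference: str, sample: str, gap: str = "_"):
    """Compare two pre-aligned, equal-length sequences column by column.

    Returns (1-based position, reference base, sample base, kind) tuples.
    A column containing the gap symbol is reported as an InDel, any other
    mismatch as an SNP. Hypothetical helper for illustration only.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    differences = []
    for position, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1):
        if ref_base == alt_base:
            continue
        kind = "InDel" if gap in (ref_base, alt_base) else "SNP"
        differences.append((position, ref_base, alt_base, kind))
    return differences


snp_example = classify_site_differences("AACTG", "ATCTG")
indel_example = classify_site_differences("AACTG", "AA_TG")
print(snp_example)
print(indel_example)
```

A column-wise comparison like this cannot recognize structural variations: for the third example (AACTG vs. GTCAA) it would simply report several independent mismatches, which is one reason the more complex variant types need their own calling strategies.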

Our Contribution: Integrating SNP Calling into an In-Memory Database

- SNP calling is implemented as a core component of the database
- SNP calling is invoked via a stored procedure call:

    CALL "_SYS_AFL"."CALL_SNPS"(SAMIMPORT.NA19240, REFERENCE.HG19CHR1, 'chr1', 20, 20, 30, 40, VARIANTS.OUTPUT);

- Built-in parallel scheduling and resource management of the distinct SNP calling steps
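The slide does not say how the database schedules the SNP calling steps. As a rough illustration of the general idea of partitioned, parallel processing, the sketch below splits a position range into sub-ranges and processes them with a thread pool. Everything here is an assumption made for illustration: the helpers `split_region`, `call_region`, and `call_parallel` are hypothetical, and the per-region "caller" is a trivial comparison against a consensus string rather than a statistical SNP caller.

```python
from concurrent.futures import ThreadPoolExecutor


def split_region(start: int, end: int, chunks: int):
    """Split the inclusive range [start, end] into up to `chunks` near-equal sub-ranges."""
    size, remainder = divmod(end - start + 1, chunks)
    ranges, lo = [], start
    for i in range(chunks):
        hi = lo + size + (1 if i < remainder else 0) - 1
        if hi >= lo:  # skip empty sub-ranges when chunks exceeds the range length
            ranges.append((lo, hi))
        lo = hi + 1
    return ranges


def call_region(reference: str, consensus: str, region):
    """Placeholder per-region step: report 1-based positions where the strings differ."""
    lo, hi = region
    return [
        (pos, reference[pos - 1], consensus[pos - 1])
        for pos in range(lo, hi + 1)
        if reference[pos - 1] != consensus[pos - 1]
    ]


def call_parallel(reference: str, consensus: str, workers: int = 4):
    """Process all sub-ranges concurrently and merge the results in position order."""
    regions = split_region(1, len(reference), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda r: call_region(reference, consensus, r), regions))
    return [hit for part in parts for hit in part]


calls = call_parallel("AACTGAACTG", "ATCTGAACTA", workers=3)
print(calls)
```

In CPython, threads do not speed up CPU-bound work like this; the sketch only shows the partition-and-merge structure, which a database engine would map onto its own worker threads and resource management.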

Our Contribution: SNP Calling Data Artifacts

- Reference Genome
  - Base sequence for comparison
  - Stored position-wise
- Read Alignments
  - Reads mapped to the reference genome
  - Table conforming to the SAM format
- Variant/SNP Calls
  - Detected SNPs
  - Table conforming to the VCF format
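One way to picture these three artifacts as tables is sketched below. This is not the authors' schema: SQLite's in-memory mode stands in for the in-memory database, the table names are hypothetical, and the columns simply mirror the eleven mandatory SAM fields and the eight fixed VCF fields (with the VCF `ID` and `FILTER` fields renamed to `variant_id` and `filter_status`).

```python
import sqlite3

# SQLite in-memory database used purely as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reference genome, stored position-wise: one row per base.
CREATE TABLE reference_genome (
    chrom TEXT NOT NULL,
    pos   INTEGER NOT NULL,
    base  TEXT NOT NULL,
    PRIMARY KEY (chrom, pos)
);
-- Read alignments: one column per mandatory SAM field.
CREATE TABLE read_alignments (
    qname TEXT, flag INTEGER, rname TEXT, pos INTEGER, mapq INTEGER,
    cigar TEXT, rnext TEXT, pnext INTEGER, tlen INTEGER, seq TEXT, qual TEXT
);
-- Variant calls: one column per fixed VCF field.
CREATE TABLE variant_calls (
    chrom TEXT, pos INTEGER, variant_id TEXT, ref TEXT, alt TEXT,
    qual REAL, filter_status TEXT, info TEXT
);
""")

# Toy data: the reference AACTG and one read carrying the SNP from the earlier slide.
conn.executemany(
    "INSERT INTO reference_genome VALUES ('chr1', ?, ?)",
    list(enumerate("AACTG", start=1)),
)
conn.execute(
    "INSERT INTO read_alignments VALUES "
    "('read1', 0, 'chr1', 1, 60, '5M', '*', 0, 0, 'ATCTG', 'IIIII')"
)
conn.execute(
    "INSERT INTO variant_calls VALUES ('chr1', 2, '.', 'A', 'T', 30.0, 'PASS', '.')"
)

# Join a variant call back to the position-wise reference.
row = conn.execute(
    "SELECT r.base, v.alt FROM variant_calls v "
    "JOIN reference_genome r ON r.chrom = v.chrom AND r.pos = v.pos"
).fetchone()
print(row)
```

Storing the reference one base per row makes the position the join key between all three tables, which is what a position-wise SNP caller needs.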

Our Contribution: Genotype Calling Formula

- Genotype calling = deriving the actual genotype at a particular position
- Assign a probability to every possible genotype, depending on the given data

Symbols:
- P(G_i) = prior probability, uniform for all genotypes G_i, i.e. 1 divided by the number of genotypes
- D_j = all base occurrences at a particular position j
- G_i = genotype for which to calculate the probability
- H_l = haploid part of genotype G_i
- b_{j,k} = base quality score of the particular base d_{j,k}

→ Formula applied by GATK's UnifiedGenotyper
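The formula itself is not preserved in this transcript. The sketch below shows a common diploid Bayesian genotype model that fits the symbols listed on the slide. It should be read as an assumption, not as the exact expression from the slide: the posterior of genotype G_i is its uniform prior times the product, over all observed bases d_{j,k}, of the average likelihood under the two haploid parts H_1 and H_2, normalized over all genotypes. The error probability is derived from the Phred-scaled quality b_{j,k} as 10^(-b/10), and splitting it equally across the three other bases is a further assumption.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"


def phred_to_error(quality: int) -> float:
    """Convert a Phred-scaled base quality score into an error probability."""
    return 10 ** (-quality / 10)


def base_likelihood(observed: str, allele: str, quality: int) -> float:
    """P(observed base | haploid allele); the error is split evenly (assumption)."""
    error = phred_to_error(quality)
    return 1 - error if observed == allele else error / 3


def genotype_posteriors(pileup):
    """Posterior probability of each diploid genotype at one position.

    pileup: list of (base, phred_quality) tuples, i.e. the bases d_{j,k} with
    qualities b_{j,k} observed at position j.
    """
    genotypes = list(combinations_with_replacement(BASES, 2))  # 10 unordered pairs
    prior = 1 / len(genotypes)  # uniform prior P(G_i)
    weighted = {}
    for h1, h2 in genotypes:
        likelihood = 1.0
        for base, quality in pileup:
            likelihood *= 0.5 * base_likelihood(base, h1, quality) \
                        + 0.5 * base_likelihood(base, h2, quality)
        weighted[h1 + h2] = prior * likelihood
    total = sum(weighted.values())
    return {genotype: value / total for genotype, value in weighted.items()}


# Five reads supporting A and five supporting T, all with base quality 30.
het_posteriors = genotype_posteriors([("A", 30)] * 5 + [("T", 30)] * 5)
hom_posteriors = genotype_posteriors([("A", 30)] * 10)
print(max(het_posteriors, key=het_posteriors.get))
print(max(hom_posteriors, key=hom_posteriors.get))
```

For long pileups, a production implementation would sum log-likelihoods instead of multiplying probabilities to avoid floating-point underflow.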

Our Contribution: Experiment Results

Data: 68.8M chr1 read alignments from the 1000 Genomes Project

[Figure: duration in seconds (0 to 10000) over covered positions on chromosome 1 in millions (0 to 40), for GATK and IMDB]

- Performance speedup of up to 22x for IMDB-based SNP calling
- GATK's runtime depends on the system's I/O capabilities
- Lower bound for our approach is around 369 s

Conclusion

- Running SNP calling within an in-memory database satisfies expectations:
  - Main memory availability
  - Built-in parallelization strategies
  → Memory access is the new bottleneck
- SNP calling runtime improves by up to a factor of 22 compared to GATK
- Further evaluations on runtime performance and result set quality
- Extension of the statistical formula to incorporate other aspects

Keep in contact with us.

Cindy Fähnrich, M. Sc.
cindy.faehnrich@hpi.de

Dr. schapranow@hpi.de

http://we.analyzegenomes.com/

Hasso Plattner Institute
Enterprise Platform & Integration Concepts
August-Bebel-Str. 88
14482 Potsdam, Germany