Towards Integrating the Detection of Genetic Variants into an In-Memory Database

2nd International Workshop on Big Data in Bioinformatics and Healthcare, Oct 27, 2014

Motivation: Genome Data Analysis Process
Process: DNA Sample -> Base Sequencing -> Read Alignment -> Variant Calling -> Data Annotation -> Analysis Results
- Next-generation sequencing (NGS) requires adapted analysis workflow
  - Higher error rates
  - Shorter reads
- Base sequencing step produces output within a few hours
- Subsequent processing steps take days up to several weeks

Motivation: The Next-Generation Sequencing Data Deluge
- NGS growth pattern more remarkable than Moore's law
  -> Addressing data deluge with more computing power no option
- For variant calling: still options to improve data processing
  - Single-threaded processing
  - Data stored in files on disk
[Figure: main memory cost per megabyte and sequencing cost per megabase; y-axis: cost in USD (0.001 to 10000); x-axis: date (01/12/01 to 01/12/13)]

IMDB Building Blocks
- Combined column and row store
- Map/Reduce
- Single and multi-tenancy
- Insert only for time travel
- Real-time replication
- Working on integers
- Active/passive data store
- Minimal projections
- Group key
- Dynamic multithreading
- Bulk load of data
- Object-relational mapping
- No aggregate tables
- Data partitioning
- Any attribute as index
- On-the-fly extensibility
- Analytics on historical data
- Multi-core/parallelization
- Lightweight compression
- SQL interface on columns and rows
- Reduction of software layers
- Text retrieval and extraction engine
- No disk

Different Types of Genetic Variants
- AACTG vs. ATCTG: Single Nucleotide Polymorphism (SNP)
- AACTG vs. AA_TG: Insertion or Deletion (InDel)
- AACTG vs. GTCAA: Structural Variations (SV)
- Different calling strategies for variant types with increasing complexity
  - SNP calling (single-/multi-sample)
  - InDel calling
-> Focus here on single-sample SNP calling
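The three toy sequence pairs can be told apart mechanically. The following is a minimal sketch, assuming two already aligned, equal-length strings with "_" marking a gap; the function name `classify_variant` and its return labels are hypothetical and only mirror the slide's examples. Real structural-variant detection needs far more evidence than a mismatch count, so the last label is deliberately generic.

```python
def classify_variant(reference: str, sample: str, gap: str = "_") -> str:
    """Classify the difference between two aligned, equal-length sequences.

    Toy heuristic for illustration only, not an actual variant caller.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")

    # A gap symbol on either side indicates an inserted or deleted base.
    if gap in reference or gap in sample:
        return "InDel"

    # Count the positions at which the two sequences disagree.
    mismatches = sum(r != s for r, s in zip(reference, sample))
    if mismatches == 0:
        return "no variant"
    if mismatches == 1:
        return "SNP"
    # Several differing bases: treated here only as a hint at a larger change.
    return "multi-base difference"


if __name__ == "__main__":
    for sample in ("ATCTG", "AA_TG", "GTCAA"):
        print("AACTG vs.", sample, "->", classify_variant("AACTG", sample))
```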

Our Contribution: Integrating SNP Calling into an In-Memory Database
- SNP calling implemented as core component of the database
- Invocation of SNP calling via stored procedure call:
    CALL "_SYS_AFL"."CALL_SNPS"(SAMIMPORT.NA19240, REFERENCE.HG19CHR1, 'chr1', 20, 20, 30, 40, VARIANTS.OUTPUT);
- Built-in parallel scheduling and resource management of distinct SNP calling steps
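How the database schedules SNP calling steps internally is not described on the slide. As a rough illustration of the general idea only, the sketch below splits the reference positions into independent chunks, processes them in a worker pool, and merges the results in order. Everything in it is an assumption: the names `call_chunk` and `call_snps_parallel`, the 0-based positions, and the naive majority-base rule, which stands in for a proper statistical genotype model.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def call_chunk(reference, pileups, start, end):
    """Naive SNP calls for positions [start, end): majority base != reference."""
    calls = []
    for pos in range(start, end):
        bases = pileups.get(pos)
        if not bases:
            continue  # no read covers this position
        top_base, _ = Counter(bases).most_common(1)[0]
        if top_base != reference[pos]:
            calls.append((pos, reference[pos], top_base))
    return calls


def call_snps_parallel(reference, pileups, chunk_size=2, workers=4):
    """Partition the reference into chunks and call each chunk in a worker."""
    ranges = [
        (start, min(start + chunk_size, len(reference)))
        for start in range(0, len(reference), chunk_size)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the chunk order, so merged calls stay sorted by position.
        parts = pool.map(lambda r: call_chunk(reference, pileups, *r), ranges)
        return [call for part in parts for call in part]


if __name__ == "__main__":
    # Position -> observed bases from the reads covering that position.
    print(call_snps_parallel("AACTG", {1: "TTTA", 3: "TTT"}))
```

Threads in CPython only show the structure of the partitioning; they do not give real CPU parallelism for this kind of work.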

Our Contribution: SNP Calling Data Artifacts
- Reference Genome
  - Base sequence for comparison
  - Stored position-wise
- Read Alignments
  - Reads mapped to the reference genome
  - Table conforming to SAM format
- Variant/SNP Calls
  - Detected SNPs
  - Table conforming to VCF format
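The three data artifacts can be pictured as three tables. The sketch below uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in; it is not the in-memory database used in this work. The table names and column types are assumptions. The column names follow the mandatory fields of the SAM format and the fixed fields of the VCF format.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create illustrative tables for reference, alignments, and variants."""
    conn.executescript(
        """
        -- Reference genome, one row per position (1-based, as in SAM/VCF).
        CREATE TABLE reference (
            chrom TEXT, pos INTEGER, base TEXT,
            PRIMARY KEY (chrom, pos)
        );
        -- Read alignments: the eleven mandatory SAM fields.
        CREATE TABLE alignments (
            qname TEXT, flag INTEGER, rname TEXT, pos INTEGER, mapq INTEGER,
            cigar TEXT, rnext TEXT, pnext INTEGER, tlen INTEGER,
            seq TEXT, qual TEXT
        );
        -- Variant calls: the eight fixed VCF fields.
        CREATE TABLE variants (
            chrom TEXT, pos INTEGER, id TEXT, ref TEXT, alt TEXT,
            qual REAL, "filter" TEXT, info TEXT
        );
        """
    )


def load_reference(conn: sqlite3.Connection, chrom: str, sequence: str) -> None:
    """Store a reference sequence position-wise."""
    rows = [(chrom, pos, base) for pos, base in enumerate(sequence, start=1)]
    conn.executemany("INSERT INTO reference VALUES (?, ?, ?)", rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    load_reference(connection, "chr1", "AACTG")
    print(connection.execute("SELECT * FROM reference ORDER BY pos").fetchall())
```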

Our Contribution: Genotype Calling Formula
- Genotype calling = deriving the actual genotype at a particular position
- Assign probability to all possible genotypes depending on given data
- Symbols:
  - P(G_i) = uniform for all genotypes G_i, i.e. 1
  - D_j = all base occurrences at a particular position j
  - G_i = genotype for which to calculate the probability
  - H_l = haploid part of genotype G_i
  - b_{j,k} = base quality score of the particular base d_{j,k}
-> Formula applied by GATK's UnifiedGenotyper
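The formula itself did not survive on this slide, so the following is only a minimal sketch of a diploid genotype model of the kind commonly described for this purpose, not the authors' implementation. It makes several assumptions: a uniform prior over the ten unordered diploid genotypes; a Phred-scaled base quality b converted to an error probability 10^(-b/10); an error spread equally over the three other bases; and each observed base drawn with probability 1/2 from either haploid part. All function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod

BASES = "ACGT"


def phred_to_error(quality: float) -> float:
    """Convert a Phred-scaled base quality score into an error probability."""
    return 10 ** (-quality / 10)


def base_likelihood(observed: str, allele: str, quality: float) -> float:
    """P(observed base | haploid allele), assuming errors hit other bases equally."""
    error = phred_to_error(quality)
    return 1 - error if observed == allele else error / 3


def genotype_posteriors(pileup):
    """Posterior probability of each diploid genotype at a single position.

    pileup: list of (base, phred_quality) pairs observed at that position.
    """
    genotypes = list(combinations_with_replacement(BASES, 2))
    prior = 1 / len(genotypes)  # uniform prior over all genotypes
    unnormalized = {}
    for h1, h2 in genotypes:
        # Each read base comes from either haploid part with probability 1/2.
        likelihood = prod(
            0.5 * base_likelihood(b, h1, q) + 0.5 * base_likelihood(b, h2, q)
            for b, q in pileup
        )
        unnormalized[(h1, h2)] = prior * likelihood
    total = sum(unnormalized.values())
    return {g: p / total for g, p in unnormalized.items()}


if __name__ == "__main__":
    posteriors = genotype_posteriors([("A", 30)] * 5 + [("G", 30)] * 5)
    print(max(posteriors, key=posteriors.get))
```

For deep pileups, the plain product of probabilities underflows; a production implementation would work in log space instead.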

Our Contribution: Experiment Results
- Data: 68.8M chr1 read alignments from the 1000 Genomes Project
- Performance speedup by up to 22x for IMDB-based SNP calling
- GATK's runtime depends on system's I/O capabilities
- Lower boundary for our approach around 369s
[Figure: duration (seconds, 0 to 10000) over covered positions on chromosome 1 (millions, 0 to 40) for GATK and IMDB]

Conclusion
- Running SNP calling within an in-memory database satisfies expectations
  - Main memory availability
  - Built-in parallelization strategies
  -> Memory access is the new bottleneck
- SNP calling runtime improves by up to a factor of 22 compared to GATK
- Further evaluations on runtime performance and result set quality
- Extension of statistical formula to incorporate other aspects

Keep in contact with us.
Cindy Fähnrich, M. Sc., cindy.faehnrich@hpi.de
Dr. schapranow@hpi.de
http://we.analyzegenomes.com/
Hasso Plattner Institute
Enterprise Platform & Integration Concepts
August-Bebel-Str. 88
14482 Potsdam, Germany