Accelerating variant calling

Similar documents

Towards Integrating the Detection of Genetic Variants into an In-Memory Database

Analysis of NGS Data

Delivering the power of the world s most successful genomics platform

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Practical Guideline for Whole Genome Sequencing

Evaluation of CUDA Fortran for the CFD code Strukti

Turbomachinery CFD on many-core platforms experiences and strategies

A Brief Introduction to Apache Tez

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

Direct GPU/FPGA Communication Via PCI Express

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

HPC with Multicore and GPUs

Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration. Sina Meraji sinamera@ca.ibm.com

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

Binary search tree with SIMD bandwidth optimization using SSE

Introduction. Xiangke Liao 1, Shaoliang Peng, Yutong Lu, Chengkun Wu, Yingbo Cui, Heng Wang, Jiajun Wen

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

Stream Processing on GPUs Using Distributed Multimedia Middleware

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

New solutions for Big Data Analysis and Visualization

Accelerating Data-Intensive Genome Analysis in the Cloud

Texture Cache Approximation on GPUs

Accelerating CFD using OpenFOAM with GPUs

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Cloud-Based Big Data Analytics in Bioinformatics

Overview of HPC Resources at Vanderbilt

NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS

Parallel Computing. Benson Muite. benson.

NVIDIA GeForce GTX 580 GPU Datasheet

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

GeoImaging Accelerator Pansharp Test Results

Cornell University Center for Advanced Computing

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Parallel Computing with MATLAB

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

CFD Implementation with In-Socket FPGA Accelerators

GPU-BASED TUNING OF QUANTUM-INSPIRED GENETIC ALGORITHM FOR A COMBINATORIAL OPTIMIZATION PROBLEM

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

GPUs for Scientific Computing

GOBII. Genomic & Open-source Breeding Informatics Initiative

Introduction to GPGPU. Tiziano Diamanti

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

Embedded Systems: map to FPGA, GPU, CPU?

Debugging in Heterogeneous Environments with TotalView. ECMWF HPC Workshop 30 th October 2014

L20: GPU Architecture and Models

A Design of Resource Fault Handling Mechanism using Dynamic Resource Reallocation for the Resource and Job Management System

Porting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka

Introduction to GPU Programming Languages

Copy Number Variation: available tools

Clustering Billions of Data Points Using GPUs

Oracle Big Data SQL Technical Update

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

NVIDIA VIDEO ENCODER 5.0

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

Intel Xeon +FPGA Platform for the Data Center

Computer Graphics Hardware An Overview

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky

How-To: SNP and INDEL detection

FLIX: Fast Relief for Performance-Hungry Embedded Applications

Parallel Prefix Sum (Scan) with CUDA. Mark Harris

Personalized Medicine and IT

Large-Data Software Defined Visualization on CPUs

Case Study on Productivity and Performance of GPGPUs

Real-time Visual Tracker by Stream Processing

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Graphic Processing Units: a possible answer to High Performance Computing?

Data-parallel Acceleration of PARSEC Black-Scholes Benchmark

High Performance Computing in CST STUDIO SUITE

DynamicCloudSim: Simulating Heterogeneity in Computational Clouds

Infrastructure Matters: POWER8 vs. Xeon x86

Building Highly-Optimized, Low-Latency Pipelines for Genomic Data Analysis

PassMark - G3D Mark High End Videocards - Updated 12th of November 2015

NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X

QCD as a Video Game?

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Next Generation GPU Architecture Code-named Fermi

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Speeding Up RSA Encryption Using GPU Parallelization

Introduction to GPU hardware and to CUDA

Bringing Big Data Modelling into the Hands of Domain Experts

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Transcription:

Accelerating variant calling Mauricio Carneiro GSA Broad Institute Intel Genomic Sequencing Pipeline Workshop Mount Sinai 12/10/2013

This is the work of many Genome sequencing and analysis team Mark DePristo Eric Banks Stacey Gabriel David Altshuler

Scope and schema of the Best Practices More involved support for A brand downstream new data analysis processing (e.g. posterior pipeline genotype likelihoods tool, diabetes Entirely new portal) alignment pipeline More New tools algorithms for data for analysts and recalibration statistical geneticists and to realignment make better use of the genotyping likelihoods (often PCRFree ignored NA12878 in current 80xWGS non- is successfully GATK analysis processed software) from fastq => analysis ready bam in A new under engine 20 minutes framework (plus for data intense alignment repetitive ~4-7h). analysis steps over Hadoop

We are here in the Best Practices workflow CALLING VARIANTS

GATK s Haplotype Caller is replacing the seasoned Unified Genotyper Haplotype Caller is ready today for a small number of samples, but needs to scale better

Artifact SNPs and small indels caused by large indel is only recovered by local assembly NA12878 original read data Multiple caller artifacts that are hard to filter out, since they are well supported by read data Haplotype Caller (validated) 6

Haplotype Caller is more accurate than the Unified Genotyper

However the Haplotype Caller is more CPU intensive h R ]] r H 1. Active region traversal identifies the regions that need to be reassembled 3. Pair-Hmm evaluation of all reads against all haplotypes (scales exponentially) 2. Local de-novo assembly builds the most likely haplotypes for evaluation 4. Genotyping using the exact model

Pair-HMM is the biggest culprit for the low performance Stage Time Runtime % Assembly 2,598s 13% Pair-HMM 14,225s 70% Traversal + Genotyping 3,379s 17% NA12878 80xWGS chromosome 20 haplotype caller run Chr20 time: 5.6 hours WGS time: 7.6 days

How we can improve performance? 1. Distributed parallelism: Queue/MapReduce 2. Alternative way to calculate likelihoods. 3. Heterogeneous parallel compute: GPU, FPGA and Vectorization. 4. Joint-calling with incremental single sample discovery.

In memory parallelism + map/reduce parallelism with queue PARALLELISM SUPPORT IN THE GATK TODAY

The GATK is actually two different beasts Engine Takes care of the input and output. Preprocess and organizes a traversal system for the walkers Walkers Sees the genome in an organized fashion and applies an algorithm to it

The three ways to parallelize the GATK Spawn many Parallelizing at the Parallelizing at the instances of the GATK engine level to walker level to speed to work on separate process different up the processing of (arbitrary) parts of the parts of the genome each individual region genome at the same at the same time. of the genome. time. -nt -nct Queue/MapReduce This is not really a solution, but a last resort that we use routinely to make calls today

How we can improve performance? 1. Distributed parallelism: Queue/MapReduce 2. Alternative way to calculate likelihoods. 3. Heterogeneous parallel compute: GPU, FPGA and Vectorization. 4. Joint-calling with incremental single sample discovery.

Reducing the number of times we need to run the pair-hmm GRAPH BASED LIKELIHOODS

Calculating genotype likelihoods straight from the assembly graph reduces pair-hmm usage Mapping each read to the haplotype assembly graph we can constrain the underlying pair-hmm to avoid quasi-zero likelihood unrealistic alignments. ALT REF r 3 Lk = 0 ]] r 1 h 1 r 2 ]] ]] r 2 h 2 R r 3 h 3 r 1 ]] Lk = 0 ALT REF h 1 h 2 h 3 H The resulting sub-problems become modules that can be reused across haplotypes that share sub-paths in the graph.

Speed-up can be quite significant depending on data size and variation present. Variation Civar PairHMM (ms) GraphBased (ms) Invariants: kmersize = 10, readcount = 10000, readlength = 100, regionsize = 300 Speed-up 1 SNP *1T* 8139 493 16x 1 short ins. *3I* 10785 485 22x 1 long ins. *30I* 11249 522 21x 1 short del. *3D* 10649 490 21x 1 long del. *30D* 10212 546 18x 1 SNP 1 ins. close by *1T*3=3I* 21552 584 36x 1 SNP 1 ins. far away *1T*3I* 21420 626 34x 5 close by SNPs *1T8=1T8=1T8=1T8=1T* 83035 3064 27x 5 far away SNPs *1T*1T*1T*1T*1T* 57346 1332 43x First implementation just 4x speed-up but can (and will) be improved

How we can improve performance? 1. Distributed parallelism: Queue/MapReduce 2. Alternative way to calculate likelihoods. 3. Heterogeneous parallel compute: GPU, FPGA and Vectorization. 4. Joint-calling with incremental single sample discovery.

GPUs, FPGA and vectorized implementations of the pair-hmm ALTERNATIVE PLATFORMS

Alternative implementations in the GATK FPGA (Convey Computer) Implemented by Scott Thibault from Green Mountain Computing Systems Fast I/O with bi-directional data bus, somewhat large shared memory Highly parallelizable (hundreds of processing elements) GPU (NVidia CUDA) Implemented in collaboration with Diego Nehab from IMPA-RJ Fast I/O but limited memory with thousands of cores in one board Especially fast float precision calculations Vectorized (AVX) Implemented in collaboration with Intel Corp In processor, no extra I/O, already available in most computers up to 32 512-bit registers per CPU

A parallelized version of the Pair-HMM A A C A G C A G T C A G T C C Read Haplotype A A C A G T C C match/insertion/deletion

Sandbox Performance comparison Data: NA12878 80xWGS chromosome 20 TECH Hardware Runtime (seconds) Improvement (fold) AVX Intel Xeon 24-core* 15 720x GPU NVidia Tesla K40 160 67x GPU NVidia GeForce GTX Titan 161 67x GPU NVidia GeForce GTX 480 190 56x GPU NVidia GeForce GTX 680 274 40x GPU NVidia GeForce GTX 670 288 38x AVX Intel Xeon 1-core* 309 35x FPGA Convey Computers HC2 834 13x - C++ (baseline) 1,267 9x - Java (gatk) 10,800 - * Reported by Intel - unreleased hardware

GATK engine is not ready to leverage the massive parallelism Full Haplotype Caller run on 80xWGS PCR-Free NA12878: 13 days on single CPU (java) 3.5 days on Convey HC-2 Human Chromosome 1 3.0 2.81 2.5 2.0 3.7x improvement instead of the expected 13x fold improvement Jobs per Day 1.5 GATK engine does not keep the pipe full for the alternative implementations of the pair-hmm to shine 1.0 0.5 1.16 A different approach is necessary (GATK 3.x target release) 0.0 12C x86 HC-2

Synchronous traversal is to blame h R ]] r H 1. Active region traversal identifies the regions that need to be reassembled 3. Pair-Hmm evaluation of all reads against all haplotypes (scales exponentially) 2. Local de-novo assembly builds the most likely haplotypes for evaluation 4. Genotyping using the exact model

Summary The GSA team is continuously revising and updating the best practices for variant calling in order to enable the large scale of tomorrow s medical genetics needs. Assembly genotyping will work in combination with any other accelerations made to the pair-hmm. GPUs and AVX are the most promising platforms for large scale projects or pipelines in need of very fast turnaround (e.g. diagnostics) The GATK will need an asynchronous engine to enable the full potential of these platforms Available in the next major version release of the GATK

THE GATK TEAM NEEDS YOU Talk to me for more information or email downing@broadinstitute.org