GENOME ANALYTICS. Performance in-situ DDN BPGW15. Hanif Khalak September 22, 2015 Cambridge, UK





Weill-Cornell in Qatar
Medical Education
- Pre-medical (2-yr)
  - Math & Science
- WCMC-Q Medical (4-yr MD)
  - Identical to NY curriculum
  - Cross-registrations
  - 100% USMLE success and residency in US
Biomedical Research
- Human Genetics & Genomics
- Proteomics & Metabolomics
- Biostatistics & Epidemiology
- Molecular & Cell Biology
- Stem Cells & Tumor Microenvironment
- Biophysics & Physiology
- Global & Public Health

Genomic Big Data @ WCMC-Q
Whole Genome (30X+ coverage)
- 200GB+, block gzipped
- 1G+ sequence objects
- Derived variants = 20GB+ compressed, ~5M features
Whole Exome (50X+ coverage)
- 15GB+, block gzipped
- 50M+ sequence objects
- Derived variants = 500MB+ compressed, 0.5M features
Genome Repository
- 1000+ genomes, exomes across multiple studies
- Great value in meta- and re-analysis
HPC Infrastructure
- 30 mixed nodes (1K cores, 3TB RAM)
- 1PB DDN GRIDScaler (GPFS)
- Upgrade to 3x capacity by end 2015
Software API
- C/C++: samtools, bamtools
- Java: Picard
- Perl: Bio::DB::SAM
- Python: pysam
Databases
- postgres, gemini

Analytics on Genome Data in situ
Big Data
- 200GB BAM compressed file → up to 1B seq reads per genome
- 500TB genome repository → 2PB+ in HDFS, RDB, MongoDB
- Significant storage and compute resources required with most solutions
Analytics API
- Next-gen analysis is still in early stages → iterative development
- Skill gap: molecular genomics (science) and analysis informatics (programming)
- Hadoop/Spark/etc. still difficult for scientists, even bioinformaticians
- SQL skills are common and easier to pick up
Performance
- Interactivity for data access and query response times

Gemini: HPC DB for Genome Variants http://gemini.readthedocs.org

BAM Data API options
Access mode  | Files                                            | RDBMS with ETL   | NoSQL                      | SQL, no ETL
Direct (API) | SDKs (samtools, Hadoop-BAM)                      | CloverETL        | MongoDB, Cassandra         | N/A
SQL          | Hive / Drill / Impala / HAWQ, HDFS connectors    | ODBC, JDBC, DBIx | Simba (ODBC for Cassandra) | PostgreSQL FDW (multicorn, citusdb, PG-Strom), Apache Drill

SQL options in situ
Postgres FDW
- FDW = foreign data wrapper
- API for pluggable storage back-ends as foreign tables
- Multicorn: Python FDW framework
- Recent work on accelerated offloading with GPUs
  - PG-Strom (OpenCL)
  - MapD (CUDA)
Apache Drill
- SQL-on-streams framework open-sourced by MapR
- Limited stream formats supported
- Only recently added support for gzipped streams
- Next project, TBD
Any approach would benefit from accelerated I/O with data files

Use Case: Qatar Genome Browser (QGB)

Genomic Big Data Query
Input: list of genome files/ids and regions of interest
Output: JSON-style object(s)
Query
- Single exome (coding regions): ~5GB file
- Single request: return read info for 100 chromosome intervals
Performance using SDK from CLI
- ~5.5s clock time
- <100MB RAM
- Scaling of query size, complexity and number of data files???
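The query shape on this slide (regions of interest in, JSON-style objects out) can be sketched in plain Python. This is only an illustration, not the QGB implementation: the `Read` tuple, `reads_in_regions`, and the sample records are hypothetical, and a real version would pull reads from an indexed BAM file through an SDK such as samtools or pysam instead of scanning an in-memory list.

```python
import json
from typing import Iterable, NamedTuple


class Read(NamedTuple):
    """Hypothetical stand-in for one aligned read (0-based, half-open)."""
    name: str
    contig: str
    start: int
    end: int


def reads_in_regions(reads: Iterable[Read], regions):
    """Group reads by the regions they overlap.

    `regions` is a list of (contig, start, end) tuples. A linear scan is used
    for clarity; an indexed BAM would avoid touching unrelated reads.
    """
    reads = list(reads)
    result = []
    for contig, start, end in regions:
        hits = [
            r._asdict()
            for r in reads
            # half-open intervals overlap when each starts before the other ends
            if r.contig == contig and r.start < end and r.end > start
        ]
        result.append({"contig": contig, "start": start, "end": end, "reads": hits})
    return result


# Illustrative data only.
sample_reads = [
    Read("r1", "chr1", 100, 200),
    Read("r2", "chr1", 250, 350),
    Read("r3", "chr2", 100, 200),
]
response = reads_in_regions(sample_reads, [("chr1", 150, 260), ("chr2", 300, 400)])
print(json.dumps(response, indent=2))
```

Returning one object per requested region keeps the response aligned with the request, which is convenient for a browser that draws one track segment per interval.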

Multithreading using CLI
[Figure: runtime (s), % CPU, and speedup (x)]

PostgreSQL Foreign Data Wrapper (FDW)
- Foreign table API for PostgreSQL
- Many drivers available (SQL, NoSQL, CSV, gzip, HDFS)
- https://wiki.postgresql.org/wiki/foreign_data_wrappers
Multicorn
- Third-party FDW framework to write custom data source drivers in Python
- E.g.: RSS, IMAP, Google, Hive, VCF
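A custom Multicorn driver is a Python class that subclasses `ForeignDataWrapper` and yields rows from `execute(quals, columns)`. The sketch below is a minimal, hypothetical example: `RegionFdw`, its in-memory `ROWS`, and the `Qual` namedtuple are assumptions made for illustration, and the fallback base class exists only so the code runs outside PostgreSQL.

```python
from collections import namedtuple

try:
    # Real base class when Multicorn is installed inside PostgreSQL.
    from multicorn import ForeignDataWrapper
except ImportError:
    class ForeignDataWrapper:
        """Stand-in so the sketch runs without Multicorn."""

        def __init__(self, options, columns):
            pass


class RegionFdw(ForeignDataWrapper):
    """Hypothetical wrapper exposing in-memory alignment rows as a foreign table."""

    # Illustrative rows; a real driver would read them from a data file.
    ROWS = [
        {"contig": "chr1", "reference_start": 100, "reference_end": 200},
        {"contig": "chr1", "reference_start": 500, "reference_end": 600},
        {"contig": "chr2", "reference_start": 100, "reference_end": 200},
    ]

    def __init__(self, options, columns):
        super().__init__(options, columns)
        self.columns = columns

    def execute(self, quals, columns):
        """Yield one dict per matching row, restricted to the requested columns."""
        for row in self.ROWS:
            if all(self._matches(row, q) for q in quals):
                yield {col: row[col] for col in columns}

    @staticmethod
    def _matches(row, qual):
        # Only equality is filtered in this sketch; other operators pass through
        # and are left for PostgreSQL to re-check.
        if qual.operator == "=":
            return row[qual.field_name] == qual.value
        return True


# Outside PostgreSQL, a namedtuple can mimic the qual objects Multicorn passes in.
Qual = namedtuple("Qual", "field_name operator value")
fdw = RegionFdw({}, {"contig": None, "reference_start": None, "reference_end": None})
rows = list(fdw.execute([Qual("contig", "=", "chr1")], ["contig", "reference_start"]))
print(rows)
```

Filtering inside `execute` is an optimization rather than a requirement, since the quals are passed as hints and PostgreSQL applies the WHERE clause to whatever rows come back.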

MySQL: FEDERATED Storage Engine (MySQL only)
MSSQL: Text File Driver (CSV only)
Firebird: External Table (CSV only)
DB2: Complete SQL/MED implementation
PostgreSQL: FDW

FDW / multicorn for BAM data

    CREATE OR REPLACE FUNCTION bammeta() RETURNS SETOF bam_core AS
    $BODY$
    DECLARE
        -- one row of the exome target-region table
        crow called_exome_targets_20110225%rowtype;
    BEGIN
        FOR crow IN SELECT * FROM called_exome_targets_20110225 LOOP
            -- return the alignments that lie entirely within this target region
            RETURN QUERY
                SELECT * FROM bam_core
                WHERE bam_core.contig = crow.contig
                  AND bam_core.reference_start >= crow.start
                  AND bam_core.reference_end <= crow."end";
        END LOOP;
    END;
    $BODY$ LANGUAGE plpgsql;

Optimization: move the loop to Python → parallel offload

FDW / multicorn performance
[Figure: query performance for 100, 1000, and 10K regions and the whole exome]

Future Work
Software Methods
- CitusDB
  - Distributed query engine
  - Modify their approach: offload instead of clustering query
- PostgreSQL + PG-Strom
  - OpenCL-based extension of FDW for query-on-accelerator (GPU)
- Apache Drill
  - Java adapter for .bam
I/O System
- DDN IME!

Storage Performance with DDN IME A touching storey, full of tiers

Storage Performance vs. Capacity
[Figure: storage latency (log scale). Source: Seagate, 2015]

Touch Rate (Steve Hetzler, IBM Architect)
- A scale-free metric to evaluate storage performance
- Definition: the proportion of a storage device or system's total content that can be accessed per unit time (e.g. year)
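As a worked example of the definition, touch rate can be approximated as sustained throughput times the time window, divided by total capacity. The sketch below makes a simplifying assumption that throughput is the only limit (per-object access time is ignored), and the device figures are hypothetical numbers chosen only to show why the metric is scale-free.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000


def touch_rate_per_year(capacity_bytes: float, throughput_bytes_per_s: float) -> float:
    """Fraction of total content that can be accessed in one year.

    Simplifying assumption: sustained throughput is the only limit, so
    per-object access time is ignored.
    """
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    return throughput_bytes_per_s * SECONDS_PER_YEAR / capacity_bytes


# Hypothetical systems: one is 1000x larger in both capacity and throughput.
small = touch_rate_per_year(capacity_bytes=4e12, throughput_bytes_per_s=200e6)
large = touch_rate_per_year(capacity_bytes=4e15, throughput_bytes_per_s=200e9)
print(round(small), round(large))
```

Both systems get the same value because capacity and throughput grow by the same factor, which is what makes the metric comparable across a single drive and a full storage tier.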


Hetzler & Coughlin, 2015


Hybrid Storage: Tiering Up FLASH! $$$!! System Design??? Qlogic, 2015

~1% tiered software intermediation → DDN IME (Hetzler & Coughlin, 2015)

DDN IME: Infinite Memory Engine
[Diagram: I/O stack (POSIX, MPI-IO, GPFS, Lustre, ...) before IME and after IME]

DDN IME: I/O and Application Acceleration
- A natural platform for high-performance data APIs
- On-appliance BAM file processing: TBD
[Diagram labels: PostgreSQL, FDW, DDN IME software-defined storage, BAM files]

Acknowledgements
DDN BPGW15
- George Vacek, DDN
- Laurent Thiers, DDN
- Sanger Centre
Collaborations
- Will Schepp, EMC/Pivotal
- Gaurav Kaul, Intel
- Utku Azman, CitusDB
WCMC-Q
- Jillian Rowe, Greg Smith - HPC
- Karsten Suhre, Bioinformatics
- Khaled Fakhro, Genetic Medicine
- Alice Aleem, Human Genetics
- Shahzad Jafri, CIO

Genomic Big Data - Options
Standard Data Technologies
- SQL, NoSQL, HDFS, ...
- Require replication
  - High cost of additional (slow) storage
- More analysts with SQL skills than MapReduce / Hadoop
Data API Goals: in-place data repository
- Ease of integrated queries: raw + metadata
- High performance queries
  - Threads, cores, RAM, I/O, ...

Genomic Clinical Decision System (CDS) Intel, 2014

WCMC-Q Data Engine: Scale to Cloud
[Architecture diagram, recovered labels]
- Clients (behind FW): CLI; Galaxy, web apps; R, MATLAB, ...; AWS, Google, Rackspace
- Query: ID#, SQL, JSON, ...
- Response: TSV, JSON, XML, ...
- Virtual Data Engine: slurm, CLI, Hadoop, Spark, yarn
- Stores: SQL, NoSQL, HDFS, CEPH
- 400TB+ GPFS: Omics (.BAM, .VCF, .BED), Annotation Files, Other Data Files
- Remote Sites (FTP, ...): NCBI, UCSC, EMBL (bandwidth?)

Data Federation vs. Virtualization

No Shortage of NoSQL Big Data Analysis Platforms!
[Stack diagram, recovered labels by layer]
- Query/Scripting Language: SCOPE, AQL, Meteor, PigLatin, Jaql, Sawzall, Dremel SQL
- High-Level API and Compiler/Optimizer: SCOPE, DryadLINQ, Algebricks, Spark, Sopremo, Java/Scala, Pig, Cascading, Jaql, FlumeJava, Dremel SQL
- Low-Level API and Execution Engine: Dryad, Hyracks, RDDs, Spark, Nephele, PACT, Tez, MapReduce, Hadoop MapReduce, Google MapReduce, Dremel Dataflow Processor
- Data Store: Cosmos, TidyFS, Hyracks LSM Storage, HBase, HDFS, GFS, Bigtable, Relational Row/Column Storage
- Resource Management: Quincy, Mesos, YARN, Omega

Example Query
Query
- Depth of coverage along genome
  - What percent of sites have depth 0-10, 11-20, 21-30, and so on up to 100?
  - What percentage of SNP variants fall in each of these bins?
- Within specific regions on genome
  - Sizes: 100, 1000, 10K regions, whole exome
  - 176,715 exon regions (human genome b37/hg19)
Target
- Sample whole exome BAM file from the 1000genomes project
- HG00096.mapped.illumina.mosaik.GBR.exome.20110411.bam
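The depth-of-coverage question reduces to a histogram over per-site depths. The following is a minimal sketch under stated assumptions: `depth_bin` and `percent_per_bin` are hypothetical helper names, the overflow bin for depths above 100 is an addition not mentioned on the slide, and the input depths are made up rather than computed from the BAM file.

```python
from collections import Counter


def depth_bin(depth: int, width: int = 10, max_depth: int = 100) -> str:
    """Label the bin a depth falls in: 0-10, 11-20, ..., 91-100, >100."""
    if depth > max_depth:
        return f">{max_depth}"  # overflow bin (assumption, not on the slide)
    if depth <= width:
        return f"0-{width}"
    lower = ((depth - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"


def percent_per_bin(depths):
    """Return {bin label: percent of sites} for an iterable of per-site depths."""
    counts = Counter(depth_bin(d) for d in depths)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical per-site depths; real values would come from a pileup over the
# regions of interest.
site_depths = [0, 5, 10, 11, 20, 21, 35, 100, 101, 250]
result = percent_per_bin(site_depths)
print(result)
```

Passing only the depths observed at SNP positions to the same function would answer the second question on the slide.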

Genomic Big Data Query Issues
Scaling
- Many files (1000+)
- Batched queries (e.g. visualization)
- Parallel requests (e.g. many users)
Locality: central GPFS store vs. distributed FS
Network speed: bandwidth, latency
Ease of query integration
- SQL vs. R vs. Hadoop vs. Spark vs. Pig vs. ...

Comparison Methods
Perl-MCE
- Multithreaded, shared memory parallelism
- Queries are limited by samtools API
PostgreSQL/multicorn
- Multiprocess
- Arbitrary SQL, in theory
- Uncertain performance

Alignment Metadata
Per BAM file, retrieve alignment metadata for region(s)
- Fields: seq_id, start, end, strand, cigar_str, query.start, query.end, dna, query.dna, qscore, qual, tagpaired
- a. 100, 1000, 10,000 regions
- b. Whole exome, 176,715 regions

MCE: Many-core processing with Perl
- Threading with shared memory; workers use callback functions
- Chunking can reduce IPC overhead and the likelihood that workers finish tasks at the same time
- Serial I/O is better than random I/O from workers, especially with caching
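The chunking idea carries over to other runtimes. The sketch below is a Python analogue, not MCE itself: `chunked`, `process_chunk`, `run_chunked`, and the per-region task are hypothetical, and a thread pool is used only to keep the example self-contained (CPU-bound work in CPython would normally use a process pool instead).

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("chunk size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def region_length(region):
    """Hypothetical per-region task; a real worker would query a BAM file."""
    contig, start, end = region
    return end - start


def process_chunk(chunk):
    """Handle a whole chunk in one task so dispatch cost is paid once per chunk."""
    return [region_length(region) for region in chunk]


def run_chunked(regions, chunk_size=100, workers=4):
    """Process regions chunk by chunk on a worker pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = pool.map(process_chunk, chunked(regions, chunk_size))
        return [value for chunk_result in per_chunk for value in chunk_result]


# Illustrative regions of slightly different lengths.
regions = [("chr1", i * 1000, i * 1000 + 150 + i) for i in range(1000)]
lengths = run_chunked(regions)
print(len(lengths), lengths[:3])
```

With 1000 regions and a chunk size of 100, the pool receives 10 tasks instead of 1000, which is the overhead reduction the slide refers to; the right chunk size depends on how expensive each region is.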

MCE: Many-Core Engine
[Diagram labels: channels; separate threads; communication; serialized input (scatter / gather); serialized output; queue; sync (event, term)]
Benchmark: Net::Ping + MCE w/ 30 event loops, 25K IPs/sec

PostgreSQL (FDW) / multicorn
Load

    CREATE SERVER alchemy_srv FOREIGN DATA WRAPPER multicorn OPTIONS (
        wrapper 'multicorn.sqlalchemyfdw.SqlAlchemyFdw'
    );

Attach data source

    CREATE FOREIGN TABLE mysql_datatable (
        id integer,
        created_at timestamp without time zone,
        updated_at timestamp without time zone
    ) SERVER alchemy_srv OPTIONS (
        tablename 'datatable',
        db_url 'mysql://root:password@127.0.0.1/testing'
    );

FDW / multicorn (2)
- Rewrite as a PostgreSQL function which calls multicorn code to pull BAM data
- For the SQL user, becomes SELECT * FROM input(contig, start, end)
- Parallelize basic data pull in Python
- Much faster

Lessons
- In-situ queries on genome data files are possible and can be parallelized
- Reliance on samtools API limits query options
- Using FDW framework, files can be queried with SQL
- Basic queries have similar performance to CLI
- Can populate temporary tables and continue analytics in pure SQL
- Need to accelerate joins