GENOME ANALYTICS. Performance in-situ DDN BPGW15. Hanif Khalak September 22, 2015 Cambridge, UK

Transcription

1 GENOME ANALYTICS Performance in-situ DDN BPGW15 Hanif Khalak September 22, 2015 Cambridge, UK

2 Weill-Cornell in Qatar. Medical Education: Pre-medical (2-yr): Math & Science; WCMC-Q Medical (4-yr MD): identical to NY curriculum, cross-registrations, 100% USMLE success and residency in US. Biomedical Research: Human Genetics & Genomics; Proteomics & Metabolomics; Biostatistics & Epidemiology; Molecular & Cell Biology; Stem Cells & Tumor Microenvironment; Biophysics & Physiology; Global & Public Health

3 Genomic Big Data at WCMC-Q. Whole Genome (30X+ coverage): 200GB+, block gzipped; 1G+ sequence objects; derived variants = 20GB+ compressed, ~5M features. Whole Exome (50X+ coverage): 15GB+, block gzipped; 50M+ sequence objects; derived variants = 500MB+ compressed, 0.5M features. Genome Repository: genomes, exomes across multiple studies; great value in meta- and re-analysis. HPC Infrastructure: 30 mixed nodes (1K cores, 3TB RAM); 1PB DDN GRIDScaler (GPFS); upgrade to 3x capacity by end 2015. Software API: C/C++: samtools, bamtools; Java: Picard; Perl: Bio::DB::Sam; Python: pysam. Databases: postgres, gemini

4 Analytics on Genome Data in situ. Big Data: 200GB BAM compressed file → up to 1B seq reads per genome; 500TB genome repository → 2PB+ in HDFS, RDB, MongoDB; significant storage and compute resources required with most solutions. Analytics API: next-gen analysis is still in early stages → iterative development; skill gap: molecular genomics (science) and analysis informatics (programming); Hadoop/Spark/etc. still difficult for scientists, even bioinformaticians; SQL skills are common and easier to pick up. Performance: interactivity for data access and query response times

5 Gemini: HPC DB for Genome Variants

6 BAM Data API options. Access mode by data store (Files | RDBMS with ETL | NoSQL | SQL, no ETL). Direct (API): SDKs (samtools, Hadoop-BAM) | CloverETL | MongoDB, Cassandra | N/A. SQL: Hive / Drill / Impala / HAWQ, HDFS connectors | ODBC, JDBC, DBIx | Simba (ODBC for Cassandra) | PostgreSQL FDW (multicorn, citusdb, PGStrom), Apache Drill!

7 SQL options in situ. Postgres FDW: FDW = foreign data wrapper; API for pluggable storage back-ends as foreign tables; Multicorn: Python FDW framework; recent work on accelerated offloading with GPUs: PG-Strom (OpenCL), MapD (CUDA). Apache Drill: SQL-on-streams framework open-sourced by MapR; limited stream formats supported; only recently added support for gzipped streams; next project, TBD. Any approach would benefit from accelerated I/O with data files

8 Use Case: Qatar Genome Browser (QGB)

9 Genomic Big Data Query. Input: list of genome files/ids and regions of interest. Output: JSON-style object(s). Query: single exome (coding regions), ~5GB file; single request: return read info for 100 chromosome intervals. Performance using SDK from CLI: ~5.5s clock time; <100MB RAM. Open question: scaling of query size, complexity and number of data files?
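The request/response shape on this slide can be sketched in Python. This is a minimal illustration only: the in-memory `demo_files` dictionary is a hypothetical stand-in for indexed BAM files, and `query_reads` is an assumed name, not the SDK call used in the talk.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for an indexed BAM file:
# file id -> list of (contig, start, end, read_name) alignments.
Read = Tuple[str, int, int, str]
Region = Tuple[str, int, int]


def query_reads(files: Dict[str, List[Read]], file_ids: List[str],
                regions: List[Region]) -> str:
    """Return a JSON string listing the reads that overlap each region."""
    results = []
    for file_id in file_ids:
        for contig, start, end in regions:
            hits = [
                {"name": name, "start": r_start, "end": r_end}
                for r_contig, r_start, r_end, name in files[file_id]
                # Half-open interval overlap test
                if r_contig == contig and r_start < end and r_end > start
            ]
            results.append({"file": file_id, "contig": contig,
                            "start": start, "end": end, "reads": hits})
    return json.dumps(results)


demo_files = {"exome1": [("chr1", 100, 200, "readA"),
                         ("chr1", 500, 600, "readB"),
                         ("chr2", 100, 200, "readC")]}
demo_json = query_reads(demo_files, ["exome1"], [("chr1", 150, 550)])
```

A real implementation would replace the list scan with an indexed fetch per region, which is where the file-level I/O cost discussed in the talk comes in.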

10 Multithreading using CLI [Chart: runtime (s), % CPU and speedup (x)]

11 PostgreSQL Foreign Data Wrapper (FDW). Foreign table API for PostgreSQL; many drivers available (SQL, NoSQL, CSV, gzip, HDFS). Multicorn: 3rd-party FDW framework to write custom data source drivers in Python, e.g. RSS, IMAP, Google, Hive, VCF
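A custom Multicorn driver is a Python class that subclasses `ForeignDataWrapper` and yields rows from `execute(quals, columns)`. The sketch below is an assumption-laden illustration: `RegionFdw`, its hard-coded `ROWS` and the `Qual` namedtuple are hypothetical, and the fallback base class exists only so the example runs outside PostgreSQL.

```python
from collections import namedtuple

try:
    # Available inside PostgreSQL when the Multicorn extension is installed
    from multicorn import ForeignDataWrapper
except ImportError:
    class ForeignDataWrapper:  # minimal stand-in so the sketch runs standalone
        def __init__(self, options, columns):
            pass


class RegionFdw(ForeignDataWrapper):
    """Hypothetical wrapper exposing a small list of intervals as a foreign table."""

    ROWS = [
        {"contig": "chr1", "start": 100, "end": 200},
        {"contig": "chr2", "start": 300, "end": 400},
    ]

    def __init__(self, options, columns):
        super().__init__(options, columns)
        self.columns = columns

    def execute(self, quals, columns):
        # Multicorn passes WHERE-clause restrictions as quals; only equality
        # quals are applied here, all other operators are ignored.
        for row in self.ROWS:
            if all(row[q.field_name] == q.value
                   for q in quals if q.operator == "="):
                yield {name: row[name] for name in columns}


# Stand-in for Multicorn's qual objects (field_name, operator, value)
Qual = namedtuple("Qual", "field_name operator value")
fdw = RegionFdw({}, {"contig", "start", "end"})
demo_rows = list(fdw.execute([Qual("contig", "=", "chr1")], ["contig", "start"]))
```

A BAM driver would follow the same pattern, with `execute` translating contig and position quals into indexed region fetches instead of scanning a list.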

12 MySQL: FEDERATED Storage Engine (MySQL only). MSSQL: Text File Driver (CSV only). Firebird: External Table (CSV only). DB2: complete SQL/MED implementation. PostgreSQL: FDW

13 FDW / multicorn for BAM data
    CREATE OR REPLACE FUNCTION bammeta() RETURNS SETOF bam_core AS $BODY$
    DECLARE
      crow called_exome_targets_%rowtype;
      brow bam_core%rowtype;
    BEGIN
      -- For each called exome target, pull the alignments contained in it.
      -- PERFORM runs the query and discards the rows.
      FOR crow IN SELECT * FROM called_exome_targets_ LOOP
        PERFORM * FROM bam_core
         WHERE bam_core.contig = crow.contig
           AND bam_core.reference_start >= crow.start
           AND bam_core.reference_end <= crow.end;
      END LOOP;
    END;
    $BODY$ LANGUAGE 'plpgsql';
Optimization: move the loop to Python → parallel offload

14 FDW / multicorn performance Whole exome 10K regions

15 Future Work. Software Methods: CitusDB: distributed query engine; modify their approach: offload instead of clustering query. PostgreSQL + PG-Strom: OpenCL-based extension of FDW for query-on-accelerator (GPU). Apache Drill: Java adapter for .bam. I/O System: DDN IME!

16 Storage Performance with DDN IME A touching storey, full of tiers

17 Storage Performance vs. Capacity Storage latency (log-scale) Seagate, 2015

18 Touch Rate (Steve Hetzler, IBM Architect). Touch rate: a scale-free metric to evaluate storage performance. Definition: the proportion of a storage device's or system's total content that can be accessed per unit time (e.g. per year)
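As a worked example of the definition, the sketch below computes a yearly touch rate from capacity and sustained bandwidth. It assumes purely sequential access and ignores object size and random-access costs; the device figures are illustrative and do not come from the talk.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def touch_rate_per_year(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Multiples of total capacity that can be read in one year at sustained bandwidth."""
    return bandwidth_bytes_per_s * SECONDS_PER_YEAR / capacity_bytes


# Illustrative figures only (assumptions, not from the talk):
# a 4 TB device at 150 MB/s, and a 1 PB system at 10 GB/s aggregate.
disk = touch_rate_per_year(4e12, 150e6)
system = touch_rate_per_year(1e15, 10e9)
```

Because the result is a ratio of capacity read to capacity held, devices and systems of very different sizes can be compared on the same axis, which is what makes the metric scale-free.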

20 Hetzler & Coughlin, 2015

25 Hybrid Storage: Tiering Up FLASH! $$$!! System Design??? Qlogic, 2015

26 ~1% tiered software intermediation → DDN IME Hetzler & Coughlin, 2015

27 DDN IME (Infinite Memory Engine) [Diagram: I/O path before IME and after IME; interfaces: POSIX, MPI-IO, GPFS, Lustre]

28 DDN IME: I/O and Application Acceleration postgresql DDN IME software-defined storage FDW A natural platform for high-performance data APIs On-appliance BAM file processing TBD BAM files

29 Acknowledgements DDN BPGW15 Collaborations George Vacek, DDN Laurent Thiers, DDN Sanger Centre Will Schepp, EMC/Pivotal WCMC-Q Gaurav Kaul, Intel Utku Azman, CitusDB Jillian Rowe, Greg Smith - HPC Karsten Suhre, Bioinformatics Khaled Fakhro, Genetic Medicine Alice Aleem, Human Genetics Shahzad Jafri, CIO

31 Genomic Big Data - Options. Standard Data Technologies: SQL, NoSQL, HDFS; require replication, with high cost of additional (slow) storage; more analysts with SQL skills than MapReduce / Hadoop. Data API Goals: in-place data repository; ease of integrated queries: raw + metadata + high performance queries (threads, cores, RAM, I/O)

32 Genomic Clinical Decision System (CDS) Intel, 2014

33 WCMC-Q Data Engine: Scale to Cloud FW CLI Galaxy, web apps R, MATLAB, AWS, Google, Rackspace Query ID#, SQL, JSON, Response TSV, JSON, XML, Virtual Data Engine slurm, CLI, Hadoop, Spark, yarn SQL NoSQL HDFS, CEPH Omics (.BAM, .VCF, .BED) 400TB+ GPFS Annotation Files Other Data Files bandwidth? Remote Sites (FTP, ) NCBI UCSC EMBL

34 Data Federation vs. Virtualization

35 No Shortage of NoSQL Big Data Analysis Platforms! Query/Scripting Language SCOPE AQL Meteor PigLatin Jaql Sawzall Dremel SQL High-Level API Compiler/Optimizer SCOPE DryadLINQ Algebricks Spark Sopremo Java/Scala Pig Cascading Cascading Jaql FlumeJava FlumeJava Dremel SQL Low-Level API Execution Engine Dryad Hyracks RDDs Spark Nephele PACT Tez MapReduce Hadoop MapReduce Google MapReduce Dremel Dataflow Processor Data Store Cosmos TidyFS Hyracks LSM Storage HBase HDFS GFS Bigtable Relational Row/Column Storage Resource Management Quincy Mesos YARN Omega

36 Example Query. Query: depth of coverage along genome. What percent of sites have depth 0-10, 11-20, 21-30, and so on up to 100? What percentage of SNP variants fall in each of these bins? Within specific regions on genome; sizes: 100, 1000, 10K, whole exome (176,715 exon regions, human genome b37/hg19). Target: sample whole exome BAM file from 1000genomes project: HG00096.mapped.illumina.mosaik.GBR.exome bam
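The binning step of this query can be sketched as follows. This is a minimal illustration that assumes per-site depths are already available as integers; the function names and the extra `>100` overflow bin are assumptions, not part of the talk.

```python
from collections import Counter
from typing import Dict, Iterable


def bin_label(depth: int) -> str:
    """Map a per-site depth to '0-10', '11-20', ..., '91-100', or '>100'."""
    if depth > 100:
        return ">100"
    if depth <= 10:
        return "0-10"
    low = ((depth - 1) // 10) * 10 + 1
    return f"{low}-{low + 9}"


def percent_by_bin(depths: Iterable[int]) -> Dict[str, float]:
    """Percentage of sites falling in each depth bin (empty bins are omitted)."""
    counts = Counter(bin_label(d) for d in depths)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


demo = percent_by_bin([0, 5, 10, 11, 20, 21, 30, 100, 150, 95])
```

The same function applies to the SNP question by passing only the depths at variant sites.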

37 Genomic Big Data Query Issues. Scaling: many files (1000+); batched queries (e.g. visualization); parallel requests (e.g. many users). Locality: central GPFS store vs. distributed FS. Network speed: bandwidth, latency. Ease of query integration: SQL vs. R vs. Hadoop vs. Spark vs. Pig vs.

38 Comparison Methods. Perl-MCE: multithreaded, shared-memory parallelism; queries are limited by samtools API. PostgreSQL/multicorn: multiprocess; arbitrary SQL, in theory; uncertain performance

39 Alignment Metadata Per bam file, retrieve alignment metadata for region(s) seq_id, start, end, strand, cigar_str, query.start, query.end, dna, query.dna, qscore, qual, tagpaired a. 100, 1000, 10,000 regions b. Whole exome, 176,715 regions

40 MCE: Many-core processing with Perl. Threading: shared memory; workers use callback functions. Chunking can reduce IPC overhead and the likelihood that workers finish tasks at the same time. Serial I/O is better than random I/O by workers, esp. with caching
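The chunking idea can be illustrated in Python (MCE itself is a Perl module, so this is an analogy rather than the talk's code). The sketch hands each worker a whole chunk of regions per task, so there is one dispatch per chunk instead of one per region; `chunked`, `map_in_chunks` and the demo regions are hypothetical names and data.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], size: int) -> List[Sequence[T]]:
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_in_chunks(func: Callable[[T], R], items: Sequence[T],
                  chunk_size: int, workers: int) -> List[R]:
    """Apply func to every item, handing each worker a whole chunk at a time."""
    def run_chunk(chunk: Sequence[T]) -> List[R]:
        return [func(item) for item in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with items
        chunk_results = pool.map(run_chunk, chunked(items, chunk_size))
        return [r for chunk in chunk_results for r in chunk]


regions = [("chr1", i * 100, i * 100 + 50) for i in range(10)]
lengths = map_in_chunks(lambda r: r[2] - r[1], regions, chunk_size=4, workers=3)
```

Python threads suit I/O-bound region fetches; for CPU-bound work a process pool would be the closer analogue, at the price of more IPC, which is the overhead chunking is meant to amortize.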

41 MCE: Many-Core Engine. Channels: separate threads. Communication: serialized input, scatter / gather, serialized output; queue; sync (event, term). Benchmark: Net::Ping + MCE w/ 30 event loops: 25K IPs/sec

42 PostgreSQL (FDW) / multicorn
Load:
    CREATE SERVER alchemy_srv FOREIGN DATA WRAPPER multicorn OPTIONS (
      wrapper 'multicorn.sqlalchemyfdw.SqlAlchemyFdw'
    );
Attach data source:
    CREATE FOREIGN TABLE mysql_datatable (
      id integer,
      created_at timestamp without time zone,
      updated_at timestamp without time zone
    ) SERVER alchemy_srv OPTIONS (
      tablename 'datatable',
      db_url 'mysql://root:password@ /testing'
    );

43 FDW / multicorn (2). Rewrite as a PL/pgSQL function which calls multicorn code to pull BAM data. For the SQL user, the query becomes SELECT * FROM input(contig, start, end). Parallelize the basic data pull in Python: much faster

44 Lessons. In-situ queries on genome data files are possible and can be parallelized. Reliance on samtools API limits query options. Using the FDW framework, files can be queried with SQL. Basic queries have similar performance to CLI. Can populate temporary tables and continue analytics in pure SQL. Need to accelerate joins
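The "populate temporary tables and continue in pure SQL" pattern can be sketched with Python's built-in sqlite3 module. SQLite stands in for PostgreSQL here so the example is self-contained; the table and column names are hypothetical, and in the setup described in the talk the `reads` rows would come from the FDW rather than literal inserts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical temporary tables: alignments pulled from a file, and target regions
conn.execute("CREATE TEMP TABLE reads (contig TEXT, ref_start INTEGER, ref_end INTEGER)")
conn.executemany("INSERT INTO reads VALUES (?, ?, ?)",
                 [("chr1", 100, 200), ("chr1", 150, 250), ("chr2", 300, 400)])
conn.execute("CREATE TEMP TABLE targets (contig TEXT, t_start INTEGER, t_end INTEGER)")
conn.executemany("INSERT INTO targets VALUES (?, ?, ?)",
                 [("chr1", 90, 260), ("chr2", 350, 500)])

# Count the reads fully contained in each target region, using plain SQL
per_target = conn.execute("""
    SELECT t.contig, t.t_start, t.t_end, COUNT(r.contig) AS n_reads
    FROM targets AS t
    LEFT JOIN reads AS r
      ON r.contig = t.contig
     AND r.ref_start >= t.t_start
     AND r.ref_end <= t.t_end
    GROUP BY t.contig, t.t_start, t.t_end
    ORDER BY t.contig
""").fetchall()
```

The interval join in this query is the kind of operation the last lesson refers to: it is easy to express in SQL but becomes the bottleneck as the number of reads and regions grows.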
