GENOME ANALYTICS: Performance In Situ
DDN BPGW15
Hanif Khalak
September 22, 2015, Cambridge, UK
Weill-Cornell in Qatar (WCMC-Q)

Medical Education
- Pre-medical (2-yr): math & science
- WCMC-Q Medical (4-yr MD): identical to the NY curriculum; cross-registrations
- 100% USMLE success and residency placement in the US

Biomedical Research
- Human Genetics & Genomics
- Proteomics & Metabolomics
- Biostatistics & Epidemiology
- Molecular & Cell Biology
- Stem Cells & Tumor Microenvironment
- Biophysics & Physiology
- Global & Public Health
Genomic Big Data @ WCMC-Q

Whole Genome (30X+ coverage)
- 200GB+, block-gzipped; 1G+ sequence objects
- Derived variants: 20GB+ compressed, ~5M features

Whole Exome (50X+ coverage)
- 15GB+, block-gzipped; 50M+ sequence objects
- Derived variants: 500MB+ compressed, ~0.5M features

Genome Repository
- 1000+ genomes and exomes across multiple studies
- Great value in meta- and re-analysis

HPC Infrastructure
- 30 mixed nodes (1K cores, 3TB RAM)
- 1PB DDN GridScalar (GPFS); upgrade to 3x capacity by end of 2015

Software API
- C/C++: samtools, bamtools
- Java: Picard
- Perl: Bio::DB::SAM
- Python: pysam
- Databases: PostgreSQL, gemini
Analytics on Genome Data In Situ

Big Data
- 200GB compressed BAM file → up to 1B sequence reads per genome
- 500TB genome repository → 2PB+ if replicated into HDFS, an RDBMS, or MongoDB
- Most solutions demand significant additional storage and compute resources

Analytics API
- Next-gen analysis is still in its early stages → iterative development
- Skill gap between molecular genomics (science) and analysis informatics (programming)
- Hadoop/Spark/etc. remain difficult for scientists, even bioinformaticians
- SQL skills are common and easier to pick up

Performance
- Interactivity: data access and query response times
Gemini: HPC DB for Genome Variants http://gemini.readthedocs.org
BAM Data API options

| Access Mode  | Files                                        | RDBMS with ETL   | NoSQL                      | SQL, no ETL                                                    |
|--------------|----------------------------------------------|------------------|----------------------------|----------------------------------------------------------------|
| Direct (API) | SDKs (samtools, Hadoop-BAM)                  | CloverETL        | MongoDB, Cassandra         | N/A                                                            |
| SQL          | Hive / Drill / Impala / HAWQ HDFS connectors | ODBC, JDBC, DBIx | Simba (ODBC for Cassandra) | PostgreSQL FDW (multicorn, CitusDB, PG-Strom); Apache Drill ← |
SQL Options In Situ

PostgreSQL FDW
- FDW = foreign data wrapper: an API for exposing pluggable storage back-ends as foreign tables
- Multicorn: Python FDW framework
- Recent work on accelerated offloading to GPUs: PG-Strom (OpenCL), MapD (CUDA)

Apache Drill
- SQL-on-streams framework open-sourced by MapR
- Limited stream formats supported; gzipped streams only recently added
- Next project, TBD

Any approach would benefit from accelerated I/O against the data files.
Use Case: Qatar Genome Browser (QGB)
Genomic Big Data Query

- Input: list of genome files/IDs and regions of interest
- Output: JSON-style object(s)

Query: single exome (coding regions), ~5GB file
- Single request: return read info for 100 chromosome intervals
- Performance using the SDK from the CLI: ~5.5s wall-clock time, <100MB RAM
- Open question: how does this scale with query size, query complexity, and number of data files? (see the sketch below)
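To make the shape of that single request concrete, here is a minimal pysam sketch that batch-fetches reads for 100 intervals from one exome BAM and times it. The file name, interval list, and timing harness are assumptions for illustration; the benchmark above may have used a different SDK (e.g. Bio::DB::SAM from a Perl CLI script).

# Minimal sketch: batch-fetch reads for 100 intervals from an indexed BAM.
# Assumptions: pysam is installed; "sample.exome.bam" and the intervals are
# placeholders, not the files used in the benchmark above.
import time
import pysam

intervals = [("1", 1_000_000 + i * 10_000, 1_000_000 + i * 10_000 + 500)
             for i in range(100)]   # 100 hypothetical chromosome intervals

t0 = time.time()
n_reads = 0
with pysam.AlignmentFile("sample.exome.bam", "rb") as bam:
    for contig, start, end in intervals:
        # fetch() uses the BAM index, so only the requested blocks are read
        for read in bam.fetch(contig, start, end):
            n_reads += 1
print(f"{n_reads} reads from {len(intervals)} intervals "
      f"in {time.time() - t0:.2f}s")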
Multithreading using CLI
[Chart: runtime (s), % CPU, and speedup (x) vs. number of threads]
PostgreSQL Foreign Data Wrapper (FDW)

- Foreign table API for PostgreSQL
- Many drivers available (SQL, NoSQL, CSV, gzip, HDFS):
  https://wiki.postgresql.org/wiki/foreign_data_wrappers

Multicorn
- Third-party FDW framework for writing custom data-source drivers in Python
- Examples: RSS, IMAP, Google, Hive, VCF
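For orientation, a Multicorn wrapper for BAM data might look like the minimal sketch below. The class name and the bam_path option are hypothetical and this is not the wrapper used in this work; it only illustrates the shape of the Multicorn API (an __init__ receiving options/columns and an execute generator yielding row dicts).

# Hypothetical sketch of a Multicorn FDW exposing BAM alignments as rows.
# Assumes the multicorn and pysam packages; not the actual wrapper behind bam_core.
from multicorn import ForeignDataWrapper
import pysam


class BamFdw(ForeignDataWrapper):
    def __init__(self, options, columns):
        super(BamFdw, self).__init__(options, columns)
        self.bam_path = options["bam_path"]   # hypothetical FDW option
        self.columns = columns

    def execute(self, quals, columns):
        # quals carry the WHERE-clause predicates pushed down by PostgreSQL;
        # a real wrapper would translate contig/start/end quals into an
        # indexed fetch() instead of scanning the whole file.
        with pysam.AlignmentFile(self.bam_path, "rb") as bam:
            for read in bam.fetch():
                yield {
                    "contig": read.reference_name,
                    "reference_start": read.reference_start,
                    "reference_end": read.reference_end,
                    "cigar_str": read.cigarstring,
                }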
Comparable mechanisms in other RDBMSs:
- MySQL: FEDERATED storage engine (remote MySQL tables only)
- MSSQL: Text File Driver (CSV only)
- Firebird: External Table (CSV only)
- DB2: complete SQL/MED implementation
FDW / multicorn for BAM data

CREATE OR REPLACE FUNCTION bammeta()
  RETURNS SETOF bam_core AS
$BODY$
DECLARE
  crow called_exome_targets_20110225%rowtype;
BEGIN
  FOR crow IN SELECT * FROM called_exome_targets_20110225 LOOP
    RETURN QUERY
      SELECT * FROM bam_core
      WHERE bam_core.contig = crow.contig
        AND bam_core.reference_start >= crow."start"
        AND bam_core.reference_end <= crow."end";
  END LOOP;
  RETURN;
END;
$BODY$ LANGUAGE plpgsql;

Optimization: move the loop into Python → parallel offload
FDW / multicorn performance
[Chart: query runtime for 100, 1000, and 10K regions and the whole exome]
Future Work

Software Methods
- CitusDB: distributed query engine; modify their approach to offload instead of clustering the query
- PostgreSQL + PG-Strom: OpenCL-based extension of FDW for query-on-accelerator (GPU)
- Apache Drill: Java adapter for .bam

I/O System
- DDN IME!
Storage Performance with DDN IME
A touching storey, full of tiers
Storage Performance vs. Capacity
[Chart: storage performance vs. capacity, with storage latency on a log scale (Seagate, 2015)]
Touch Rate (Steve Hetzler, IBM)

- Touch rate: a scale-free metric for evaluating storage performance
- Definition: the proportion of a storage device's or system's total content that can be accessed per unit time (e.g. per year)
[Charts: touch-rate analysis of storage systems (Hetzler & Coughlin, 2015)]
Hybrid Storage: Tiering Up
[Figure: hybrid flash/disk tiering; flash is costly, and the system design is the open question (Qlogic, 2015)]
[Chart: ~1% flash tier with software intermediation → DDN IME (Hetzler & Coughlin, 2015)]
DDN IME: Infinite Memory Engine
[Diagram: application I/O path (POSIX, MPI-IO, GPFS, Lustre) before IME vs. after IME]
DDN IME: I/O and Application Acceleration

[Diagram: PostgreSQL with FDW layered over DDN IME software-defined storage, backed by BAM files]

- A natural platform for high-performance data APIs
- On-appliance BAM file processing: TBD
Acknowledgements

DDN BPGW15 / Collaborations
- George Vacek, DDN
- Laurent Thiers, DDN
- Sanger Centre
- Will Schepp, EMC/Pivotal
- Gaurav Kaul, Intel
- Utku Azman, CitusDB

WCMC-Q
- Jillian Rowe, Greg Smith (HPC)
- Karsten Suhre, Bioinformatics
- Khaled Fakhro, Genetic Medicine
- Alice Aleem, Human Genetics
- Shahzad Jafri, CIO
Genomic Big Data: Options

Standard Data Technologies (SQL, NoSQL, HDFS, ...)
- Require replication: high cost of additional (slow) storage
- More analysts have SQL skills than MapReduce/Hadoop skills

Data API Goals
- Query the data repository in place
- Ease of integrated queries: raw + metadata + ...
- High-performance queries: threads, cores, RAM, I/O, ...
Genomic Clinical Decision System (CDS)
[Figure: genomic clinical decision system pipeline (Intel, 2014)]
WCMC-Q Data Engine: Scale to Cloud

[Architecture diagram: clients (CLI, Galaxy, web apps, R, MATLAB, ...) and clouds (AWS, Google, Rackspace, ...) send queries (ID#, SQL, JSON, ...) to a virtual data engine (slurm, CLI, Hadoop, Spark, YARN) and receive responses (TSV, JSON, XML, ...). The engine fronts SQL, NoSQL, and HDFS/CEPH layers over the omics repository (.BAM, .VCF, .BED; 400TB+ GPFS), annotation files, and other data files, with bandwidth-limited links to remote sites (FTP, ...) such as NCBI, UCSC, and EMBL.]
Data Federation vs. Virtualization
No Shortage of NoSQL Big Data Analysis Platforms!

[Stack diagram: query/scripting languages (SCOPE, AQL, Meteor, PigLatin, Jaql, Sawzall, Dremel SQL); high-level APIs and compilers/optimizers (SCOPE, DryadLINQ, Algebricks, Spark, Sopremo, Pig, Cascading, Jaql, FlumeJava, Dremel SQL); low-level execution engines (Dryad, Hyracks, Spark RDDs, Nephele/PACT, Tez, Hadoop MapReduce, Google MapReduce, Dremel dataflow); data stores (Cosmos, TidyFS, Hyracks LSM storage, HBase, HDFS, GFS, Bigtable, relational row/column storage); resource management (Quincy, Mesos, YARN, Omega)]
Example Query

Query
- Depth of coverage along the genome
- What percent of sites have depth 0-10, 11-20, 21-30, and so on up to 100?
- What percentage of SNP variants fall into each of these bins?
- Restricted to specific regions of the genome
- Sizes: 100, 1000, 10K regions, and the whole exome (176,715 exon regions; human genome b37/hg19)
- A coverage-binning sketch follows below

Target
- Sample whole-exome BAM file from the 1000 Genomes Project:
  HG00096.mapped.illumina.mosaik.GBR.exome.20110411.bam
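A rough sketch of the coverage-binning part of this query using pysam against an indexed copy of the target BAM; the two regions are placeholders for the b37 exon list, and the SNP half of the query (which would join against a VCF) is omitted.

# Minimal sketch of the depth-of-coverage binning (assumptions: pysam installed,
# indexed BAM; the regions below stand in for the full exon BED file).
from collections import Counter
import pysam

regions = [("1", 861321, 861393), ("1", 865534, 865716)]   # placeholder exons
bins = Counter()

with pysam.AlignmentFile(
        "HG00096.mapped.illumina.mosaik.GBR.exome.20110411.bam", "rb") as bam:
    for contig, start, end in regions:
        # count_coverage returns four per-base arrays (A, C, G, T counts)
        a, c, g, t = bam.count_coverage(contig, start, end)
        for i in range(end - start):
            depth = a[i] + c[i] + g[i] + t[i]
            bins[min(max(depth - 1, 0) // 10, 10)] += 1   # 0-10, 11-20, ..., >100

total = sum(bins.values())
for b in sorted(bins):
    label = "0-10" if b == 0 else (">100" if b == 10 else f"{b*10+1}-{b*10+10}")
    print(f"{label:>7}: {100.0 * bins[b] / total:.1f}%")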
Genomic Big Data Query Issues

Scaling
- Many files (1000+)
- Batched queries (e.g. visualization)
- Parallel requests (e.g. many users)

Other factors
- Locality: central GPFS store vs. distributed FS
- Network speed: bandwidth, latency
- Ease of query integration: SQL vs. R vs. Hadoop vs. Spark vs. Pig vs. ...
Comparison Methods

Perl MCE
- Multithreaded, shared-memory parallelism
- Queries are limited by the samtools API

PostgreSQL / multicorn
- Multiprocess
- Arbitrary SQL, in theory
- Uncertain performance
Alignment Metadata

Per BAM file, retrieve alignment metadata for the requested region(s):
- seq_id, start, end, strand, cigar_str, query.start, query.end, dna, query.dna, qscore, qual, tag (paired)

Query sizes:
a. 100, 1000, 10,000 regions
b. Whole exome: 176,715 regions
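For reference, the pysam equivalents of these fields look roughly like the sketch below. The Perl side uses Bio::DB::SAM accessors, so the names map only approximately, and the BAM path and region are placeholders.

# Sketch: extracting the listed alignment metadata for one region with pysam.
# ("dna", the reference sequence, would additionally need the reference FASTA
#  via pysam.FastaFile and is left out here.)
import pysam

def region_metadata(bam_path, contig, start, end):
    rows = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for r in bam.fetch(contig, start, end):
            rows.append({
                "seq_id": r.reference_name,
                "start": r.reference_start,          # 0-based
                "end": r.reference_end,
                "strand": "-" if r.is_reverse else "+",
                "cigar_str": r.cigarstring,
                "query_start": r.query_alignment_start,
                "query_end": r.query_alignment_end,
                "query_dna": r.query_sequence,
                "qual": r.mapping_quality,           # mapping quality
                "qscore": r.query_qualities,         # per-base qualities
                "paired": r.is_paired,
            })
    return rows

# Example: rows = region_metadata("sample.exome.bam", "1", 861321, 861393)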
MCE: Many-Core Processing with Perl

- Threading with shared memory; workers use callback functions
- Chunking reduces IPC overhead and the likelihood that workers finish their tasks at the same time
- Serial I/O is better than random I/O from the workers, especially with caching
MCE: Many-Core Engine

[Diagram: MCE channels run in separate threads and handle communication: serialized input scatter/gather, serialized output, a queue, and sync (event, term)]

Benchmark: Net::Ping + MCE with 30 event loops → ~25K IPs/sec
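MCE itself is a Perl module; purely as an analogy, the chunked scatter/gather pattern it describes (one serial reader, chunked work units, worker callbacks) can be sketched in Python as below. The chunk size, region list, and worker body are placeholders, not part of the original benchmark.

# Analogous chunked scatter/gather in Python (not MCE; just the same pattern).
from multiprocessing import Pool

CHUNK = 25   # larger chunks mean less IPC overhead but coarser load balancing

def process_chunk(regions):
    # placeholder worker callback; in the BAM case this would open the file
    # once per worker and fetch() each region
    return [len(r) for r in regions]

def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

if __name__ == "__main__":
    regions = [("1", s, s + 500) for s in range(0, 1_000_000, 10_000)]
    with Pool(processes=8) as pool:
        # imap preserves input order, so the gathered output stays serialized
        results = [x for chunk_result in pool.imap(process_chunk,
                                                   chunks(regions, CHUNK))
                   for x in chunk_result]
    print(len(results), "regions processed")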
PostgreSQL (FDW) / multicorn

Load the wrapper:

CREATE SERVER alchemy_srv
  FOREIGN DATA WRAPPER multicorn
  OPTIONS (wrapper 'multicorn.sqlalchemyfdw.SqlAlchemyFdw');

Attach a data source:

CREATE FOREIGN TABLE mysql_datatable (
  id         integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone
) SERVER alchemy_srv OPTIONS (
  tablename 'datatable',
  db_url    'mysql://root:password@127.0.0.1/testing'
);
FDW / multicorn (2)

- Rewrite as a PL/pgSQL function that calls the multicorn code to pull the BAM data
- For the SQL user this becomes: SELECT * FROM input(contig, start, end)
- Parallelize the basic data pull in Python (see the sketch below): much faster
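A rough sketch of what the parallel Python data pull could look like; the BAM path, region list, and worker layout are assumptions for illustration rather than the multicorn-side code used in the talk.

# Sketch of a parallel BAM pull in Python (assumptions: pysam, an indexed BAM,
# and a region list; each worker opens its own file handle).
from multiprocessing import Pool
from functools import partial
import pysam

def pull_region(bam_path, region):
    contig, start, end = region
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return [(r.query_name, r.reference_start, r.reference_end)
                for r in bam.fetch(contig, start, end)]

def parallel_pull(bam_path, regions, workers=8):
    with Pool(processes=workers) as pool:
        # results come back in the same order as the input regions
        return pool.map(partial(pull_region, bam_path), regions)

# Example:
# rows = parallel_pull("sample.exome.bam", [("1", 861321, 861393)], workers=4)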
Lessons

- In-situ queries on genome data files are possible and can be parallelized
- Reliance on the samtools API limits query options
- Using the FDW framework, the files can be queried with SQL
- Basic queries perform similarly to the CLI
- Temporary tables can be populated and the analysis continued in pure SQL
- Joins still need to be accelerated