Processing NGS Data with Hadoop-BAM and SeqPig

Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3

1 Helsinki Institute for Information Technology HIIT and Department of Computer Science, Aalto University (firstname.lastname@aalto.fi)
2 International Computer Science Institute, Berkeley, CA, USA
3 CRS4 Center for Advanced Studies, Research and Development, Italy
4 CSC IT Center for Science
Next Generation Sequencing and Big Data
- The amount of NGS data worldwide is predicted to double every 5 months
- This growth is much faster than:
  - Moore's law for the growth rate of computing (historically, transistor counts have doubled every 18-24 months)
  - Kryder's law for the growth of storage capacity (historically doubling approx. every 13 months)
  - Butters' law for the growth of optical communications bandwidth (historically doubling approx. every 9 months)
- Without increased investment in distributed computing methods, genomics research will hit computational limits
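As a quick sanity check on these doubling times, the sketch below compares how much each quantity grows over a five-year horizon, using the doubling periods quoted above (the 60-month horizon and the 18-month figure for Moore's law are illustrative choices, not from the slides):

```python
# Growth factor after a fixed horizon for a quantity that doubles
# every `doubling_months` months.
def growth_factor(doubling_months: float, horizon_months: float) -> float:
    return 2.0 ** (horizon_months / doubling_months)

HORIZON = 60  # five years, in months (illustrative horizon)

ngs_data = growth_factor(5, HORIZON)    # NGS data volume
compute  = growth_factor(18, HORIZON)   # Moore's law (18-month doubling assumed)
storage  = growth_factor(13, HORIZON)   # Kryder's law
network  = growth_factor(9, HORIZON)    # Butters' law

# NGS data grows 2^12 = 4096x in five years, while compute grows only
# ~10x, so the gap between data volume and single-machine capacity
# widens by roughly two orders of magnitude.
print(f"data: {ngs_data:.0f}x, compute: {compute:.0f}x, "
      f"storage: {storage:.0f}x, network: {network:.0f}x")
```

The point of the arithmetic: even under the most favorable of the three hardware laws, capacity falls behind data volume by a large factor within a few years, which is what motivates scaling out rather than up.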
No Processor Clock Speed Increases Ahead
- Herb Sutter: "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software." Dr. Dobb's Journal, 30(3), March 2005 (updated graph in August 2009)
Implications of the End of the Free Lunch
- The clock speeds of microprocessors are not going to improve much in the foreseeable future
- Gains in single-threaded performance will be only moderate
- The number of transistors in a microprocessor is still growing at a high rate
- One of the main uses of those transistors has been to increase the number of computing cores per processor
- The number of cores in a low-end workstation (such as those employed in large-scale datacenters) will keep growing steadily
- Programming models need to change to exploit all the available concurrency efficiently; scalability to high numbers of cores/processors will need to be a major focus
Tape is Dead, Disk is Tape, RAM Locality is King
- Figure: Trends of RAM, SSD, and HDD prices. From: H. Plattner and A. Zeier: In-Memory Data Management: An Inflection Point for Enterprise Applications
Tape is Dead, Disk is Tape, RAM Locality is King
- RAM (and SSDs) are radically faster than HDDs: one should use RAM/SSDs whenever possible
- RAM today costs roughly what HDDs cost a decade earlier
- Workloads that were viable on hard disks a decade ago are now viable in RAM
- Hard-disk-based storage should only be used for datasets that are not yet economically viable in RAM (or SSD)
- In-memory distributed filesystems such as Tachyon are needed for temporary files!
- Big Data applications (HDD-based massive storage) should therefore be applications that were not economically feasible a decade ago even on HDDs
Hadoop - the Linux of Big Data
- Hadoop = an open-source distributed operating system distribution for Big Data
- Based on Google's architecture design
- Cheap commodity hardware for storage
- Fault-tolerant distributed filesystems: HDFS, Tachyon
- Batch processing systems: Hadoop MapReduce, Apache Hive, and Apache Pig (HDD); Apache Spark (RAM)
- Parallel SQL implementations for analytics: Apache Hive, Cloudera Impala, Apache Shark, Facebook Presto
- Fault-tolerant distributed database: HBase
- Distributed machine learning libraries, text indexing & search, etc.
- Project web page: http://hadoop.apache.org/
- Hadoop MapReduce is just one example application on top of the Hadoop open-source distribution!
Commercial Hadoop Support
- Cloudera: probably the largest Hadoop distributor, partially owned by Intel (740 million USD investment for an 18% share). Available from: http://www.cloudera.com/
- Hortonworks: a Yahoo! spin-off from their large Hadoop development team: http://www.hortonworks.com/
- MapR: a rewrite of much of Apache Hadoop in C++, including a new filesystem; API-compatible with Apache Hadoop. http://www.mapr.com/
Hadoop-BAM
- A library to interface NGS data formats with both Hadoop and Spark
- Includes tools for, e.g., sorting of reads, as needed when merging the results of parallel read aligners
- Supported file formats: BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF
- Some file formats, such as BAM, are notoriously badly designed for parallel processing
- First released in Dec 2010; Hadoop-BAM is now at version 7.0: http://sourceforge.net/projects/hadoop-bam/
- 2800+ downloads of the library
- Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., and Heljanko, K.: Hadoop-BAM: Directly Manipulating Next Generation Sequencing Data in the Cloud. Bioinformatics 28(6):876-877, 2012. (http://dx.doi.org/10.1093/bioinformatics/bts054)
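Why is BAM hard to process in parallel? A BAM file is a series of BGZF (blocked gzip) compressed blocks, so a worker assigned an arbitrary byte range of the file must first locate the next block boundary before it can read anything. The sketch below is illustrative only (it is not Hadoop-BAM's actual code): it scans a byte buffer for candidate BGZF block starts by matching the gzip magic bytes plus the "BC" extra subfield that the BGZF specification mandates:

```python
# BGZF block header prefix: gzip magic (0x1f 0x8b), deflate method (0x08),
# and the FEXTRA flag (0x04); the extra field contains subfield id "BC".
BGZF_MAGIC = bytes([0x1F, 0x8B, 0x08, 0x04])

def find_bgzf_block_starts(data: bytes) -> list[int]:
    """Return offsets in `data` that look like BGZF block headers.

    A real splitter (as in Hadoop-BAM) would additionally validate the
    header and try decompressing the block before trusting a match.
    """
    offsets = []
    pos = data.find(BGZF_MAGIC)
    while pos != -1:
        # In a genuine BGZF header, XLEN sits at byte offsets 10-11 and
        # the "BC" subfield identifier follows at offsets 12-13.
        if data[pos + 12:pos + 14] == b"BC":
            offsets.append(pos)
        pos = data.find(BGZF_MAGIC, pos + 1)
    return offsets
```

Finding a compressed block boundary is only half the job: the splitter must then also find the start of a BAM record inside the decompressed block, which is why naive byte-range splitting (as Hadoop does for text files) does not work for BAM.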
Figure: Mean speedup on a 50 GB input, sorted (left panel) and summarized (right panel), for B=2,4,8,16,32. Each panel plots mean speedup (y-axis, up to 16) against the number of workers (x-axis, 1-15), breaking down input file import, sorting/summarizing, output file export, and total elapsed time against the ideal speedup.
SeqPig
- A parallel scripting language for NGS data sets based on the Apache Pig language
- Compiles into Java, executed by Hadoop MapReduce
- SQL-like functionality with helper functions for NGS data: filtering data, computing aggregate statistics, doing joins
- Supported file formats: BAM, SAM, FASTQ, QSEQ, and FASTA
- Schumacher, A., Pireddu, L., Niemenmaa, M., Kallio, A., Korpelainen, E., Zanetti, G., and Heljanko, K.: SeqPig: Simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics 30(1):119-120, 2014. (http://dx.doi.org/10.1093/bioinformatics/btt601)
- See also the supplement: http://bioinformatics.oxfordjournals.org/content/suppl/2013/10/17/btt601.dc1/supplement.pdf
SeqPig Use Case Examples
Automatically parallelized Pig example scripts for:
- File format conversion
- Filtering out unmapped reads and PCR or optical duplicates
- Filtering out reads with low mapping quality
- Filtering by regions (samtools syntax)
- Sorting BAM files
- Computing read coverage
- Computing base frequencies (counts) for each reference coordinate
- Pileup
- Collecting read-mapping-quality statistics
- Collecting per-base statistics of reads
- ...
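To make concrete the kind of computation these scripts parallelize, here is a plain-Python sketch (not SeqPig syntax) of base-frequency counting per reference coordinate: each mapped read emits (coordinate, base) pairs, which are then grouped and counted — exactly the map/group/aggregate shape of work that SeqPig hands to MapReduce:

```python
from collections import Counter, defaultdict

# Toy mapped reads: (reference start position, sequence). Ungapped
# alignments are assumed here; real reads would need the CIGAR string.
reads = [
    (100, "ACGT"),
    (101, "CGTA"),
    (102, "GTAC"),
]

# "Map" phase: emit one (coordinate, base) pair per aligned base.
pairs = [(start + i, base) for start, seq in reads for i, base in enumerate(seq)]

# "Reduce" phase: group by coordinate and count base frequencies.
freq: dict[int, Counter] = defaultdict(Counter)
for coord, base in pairs:
    freq[coord][base] += 1

print(freq[102])  # coordinate covered by all three toy reads
```

In SeqPig the same shape is expressed declaratively (load, flatten to per-base tuples, GROUP, COUNT), and the framework distributes the grouping across the cluster.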
Scalability of SeqPig
Figure: Scalability of SeqPig vs. sequential FastQC: speedup (up to ~60x) against the number of worker nodes when computing statistics (average read quality, read length, base-quality stats, GC content, and all at once) on a 61.4 GB input file, on a Hadoop cluster of up to 63 machines.
SeqPig Benefits and Drawbacks
Benefits:
- Automatic parallelization of data processing scripts
- Easy-to-learn scripting language with the full power of MapReduce
- Most scripts are at most tens of lines of code vs. hundreds to thousands of lines of Java
- Also allows calling user-defined functions written in Java/Python
- Implements SQL-like functionality
Drawbacks:
- MapReduce has a 10+ second startup delay: not suitable for interactive use
- A specialized language instead of a standard like SQL
Apache Spark
- Apache Spark is a fast, general-purpose, in-memory cluster computing system
- Spark can cache data sets and has a much more flexible DAG execution scheme
- More suitable than Hadoop MapReduce for iterative algorithms
- Can be up to 100 times faster than Apache Hadoop
- Runs standalone, on YARN, Mesos, and EC2
- Can use HDFS, HBase, Cassandra, and any Hadoop data source (including Hadoop-BAM)
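The combination that makes Spark fast for iterative algorithms is lazy evaluation plus explicit caching: transformations only record a recipe, and caching materializes an intermediate result once so that later passes reuse it instead of recomputing the whole lineage. The toy class below is a plain-Python stand-in for this idea, not the actual Spark API:

```python
# A minimal stand-in for Spark's lazy datasets: transformations build a
# recipe, actions run it, and cache() keeps the materialized result in
# memory so iterative passes do not recompute the whole lineage.
class LazyDataset:
    def __init__(self, compute):
        self._compute = compute        # zero-arg function producing a list
        self._cached = None

    def map(self, f):                  # transformation: nothing runs yet
        return LazyDataset(lambda: [f(x) for x in self._materialize()])

    def filter(self, pred):            # transformation: nothing runs yet
        return LazyDataset(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):
        self._cached = self._compute() # materialize once, keep in memory
        return self

    def _materialize(self):
        return self._cached if self._cached is not None else self._compute()

    def collect(self):                 # action: actually runs the recipe
        return self._materialize()

# An "expensive" base dataset is cached once; two later passes reuse it
# instead of re-running the filter each time.
base = LazyDataset(lambda: list(range(10))).filter(lambda x: x % 2 == 0).cache()
print(base.map(lambda x: x * x).collect())
print(base.map(lambda x: x + 1).collect())
```

MapReduce offers no equivalent of `cache()`: every job rereads its input from HDFS, which is exactly the cost iterative algorithms pay repeatedly.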
Spork
- Can run existing Pig scripts with minimal changes
- Compiles Pig scripts to Spark jobs instead of Hadoop jobs
- Extends the Pig CACHE operator
- Open source, but at an alpha stage
- Codebase: https://github.com/sigmoidanalytics/spork
- Some open issues: https://issues.apache.org/jira/browse/PIG-4059
Pig vs. Spork Execution Path
(Figure: diagram comparing the Pig and Spork execution paths)
SeqSpork
- Can run existing SeqPig scripts with minimal changes
- Extends Spork to read genomics file types
- Extends Spork with some genomics functionality
- Tries to tune Spark parameters to the needs of the genomics domain
- Compiles into Apache Spark jobs
- Currently up to 2 times faster than SeqPig
- Codebase: https://github.com/ridvandongelci/seqpig/tree/spork-1.2.0
SeqSpork Advantages & Disadvantages
Advantages:
- More flexible and faster than SeqPig
- Provides caching functionality
- No separate MapReduce jobs, thus no temporary files between jobs
- Data processing is done mostly in memory
Disadvantages:
- Alpha-quality software, somewhat unstable
- Still imperative like SeqPig, not declarative like SQL
- The startup delay for interactive jobs still exists
Future of SeqSpork
- Increase software quality and stability
- Address performance issues in group and join operations
- Implement more genomics functionality
- Parallel variant detection needs to be integrated with the pipeline
- Use data sources other than HDFS, such as object storage systems or an in-memory filesystem
Moving from BAM Files to an SQL Data Warehouse
A proper data warehouse system:
- Can evaluate parallel queries over petabytes of data very efficiently
- Allows efficient compression and indexing, including indexing several BAM files in a single table
- Allows riding on the Hadoop+Spark software investments
Many new analytics SQL implementations on top of Hadoop are designed for handling petabyte-class datasets:
- Apache Hive
- Cloudera Impala
- Spark SQL
- Presto from Facebook
Hadoop-BAM Integrated with Hive
- We have interfaced Hive with Hadoop-BAM, allowing Hive to read BAM files directly
- Hive can then convert the NGS data into columnar formats such as RCFile or Parquet for improved compression and query performance
- Using RCFile or Parquet storage also allows the contents of BAM files to be queried by Spark SQL (and Shark), Impala, and Presto
- For details, see the Master's thesis: Matti Niemenmaa: Analysing sequencing data in Hadoop: The road to interactivity via SQL, 2013
- The Hive interfacing code is available on request
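The payoff of this integration is that read-level analysis becomes ordinary SQL over a reads table. The sketch below runs an illustrative quality-control aggregate with SQLite standing in for Hive/Impala; the table layout and column names are made up for the example and are not the actual Hadoop-BAM/Hive schema:

```python
import sqlite3

# Hypothetical reads table mirroring a few SAM/BAM fields.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE reads (
    qname TEXT, rname TEXT, pos INTEGER, mapq INTEGER, flags INTEGER)""")
conn.executemany(
    "INSERT INTO reads VALUES (?, ?, ?, ?, ?)",
    [
        ("r1", "chr1", 100, 60, 0),
        ("r2", "chr1", 150, 10, 0),
        ("r3", "chr2", 200, 60, 4),  # flag 0x4: read is unmapped
        ("r4", "chr2", 250, 45, 0),
    ],
)

# Per-chromosome read counts and mean mapping quality, skipping
# unmapped reads -- the shape of QC query one would run in Hive.
rows = conn.execute("""
    SELECT rname, COUNT(*) AS n, AVG(mapq) AS mean_mapq
    FROM reads
    WHERE flags & 4 = 0      -- exclude unmapped reads
    GROUP BY rname
    ORDER BY rname
""").fetchall()
print(rows)  # [('chr1', 2, 35.0), ('chr2', 1, 45.0)]
```

On the real stack the same query text would run unchanged over a Hive table backed by RCFile or Parquet, with the engine parallelizing the scan and aggregation across the cluster.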
Initial Hive, Shark, and Impala SQL Benchmarks
- Benchmarks of SQL queries on BAM file contents, including quality control, joining with a BED file, etc.
- The RCFile columnar format was used (BAM reading is currently only supported by Hive); frameworks as of the end of 2013: Hive 0.11, Shark 0.7, Impala 1.0.1
- Hive was stable but never faster than 10 seconds, making it unsuitable for interactive use
- The newer frameworks do really well on small queries; on large data sets and node counts, Hive catches up
- Both Shark and Impala had some stability and/or correctness issues at the end of 2013
Hive, Shark, Impala Benchmarks (Linear Scale)
Hive, Shark, Impala Benchmarks (Log Scale)
Future Plans
- Porting all our tools from Hadoop over to Spark
- Parallel variant detection using black-box variant callers needs to be integrated with the pipeline (work in progress)
- Releasing a tool with SeqPig-like functionality on a parallel SQL data warehouse, exploiting all available query engines: Spark SQL, Impala, Presto, Hive
- Using standard columnar data formats like Parquet and RCFile for storage of NGS data
- Interactive ad-hoc queries of Big Data in genomics
- In-memory filesystem and object storage integration
- Parallel cloud datastore (HBase) for parallelized database functionality
- We are open to working with you on parallel next-generation data processing on the Hadoop and Spark ecosystems