Hadoop-BAM and SeqPig




Keijo Heljanko (1), André Schumacher (1,2), Ridvan Döngelci (1), Luca Pireddu (3), Matti Niemenmaa (1), Aleksi Kallio (4), Eija Korpelainen (4), and Gianluigi Zanetti (3)

(1) Department of Computer Science and Engineering and Helsinki Institute for Information Technology HIIT, School of Science, Aalto University, firstname.lastname@aalto.fi
(2) International Computer Science Institute, Berkeley, CA, USA
(3) CRS4 Center for Advanced Studies, Research and Development in Sardinia, Italy
(4) CSC IT Center for Science

20.1-2015

Next Generation Sequencing and Big Data
- The amount of NGS data worldwide is predicted to double every 5 months
- This growth is much faster than:
  - Moore's law for the growth rate of computing (historically, transistor counts have doubled every 18-24 months)
  - Kryder's law for the growth of storage capacity (historically doubling approximately every 13 months)
  - Butters' law for the growth in optical communications bandwidth (historically doubling approximately every 9 months)
- Without increased expenditure in distributed computing methods, genomics research will hit computational limits
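The doubling times quoted above are easier to compare once converted to growth per year. The following is a minimal sketch of that arithmetic; the function name and dictionary labels are illustrative choices, and the only inputs are the doubling times from the slide.

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Return the multiplicative growth over 12 months for a given doubling time."""
    return 2.0 ** (12.0 / doubling_months)


# Doubling times in months, as quoted on the slide.
doubling_times = {
    "NGS data": 5,
    "Optical bandwidth (Butters' law)": 9,
    "Storage capacity (Kryder's law)": 13,
    "Transistor counts (Moore's law, 18 months)": 18,
    "Transistor counts (Moore's law, 24 months)": 24,
}

for name, months in doubling_times.items():
    print(f"{name}: x{annual_growth_factor(months):.2f} per year")
```

Under these figures, NGS data grows by roughly a factor of 5.3 per year, while transistor counts grow by a factor of about 1.4 to 1.6 per year.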

No Processor Clock Speed Increases Ahead
Herb Sutter: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal, 30(3), March 2005 (updated graph in August 2009).

Implications of the End of Free Lunch
- The clock speeds of microprocessors are not going to improve much in the foreseeable future
  - The efficiency gains in single-threaded performance are going to be only moderate
- The number of transistors in a microprocessor is still growing at a high rate
  - One of the main uses of transistors has been to increase the number of computing cores the processor has
  - The number of cores in a low-end workstation (such as those employed in large-scale datacenters) is going to keep growing steadily
- Programming models need to change to efficiently exploit all the available concurrency: scalability to a high number of cores/processors will need to be a major focus

Tape is Dead, Disk is Tape, RAM locality is King
Figure: Trends of RAM, SSD, and HDD prices. From: H. Plattner and A. Zeier: In-Memory Data Management: An Inflection Point for Enterprise Applications

Tape is Dead, Disk is Tape, RAM locality is King
- RAM (and SSDs) are radically faster than HDDs: one should use RAM/SSDs whenever possible
- RAM is roughly the same price as HDDs were a decade earlier
  - Workloads that were viable with hard disks a decade ago are now viable in RAM
  - One should only use hard-disk-based storage for datasets that are not yet economically viable in RAM (or SSD)
  - In-memory distributed filesystems such as Tachyon are needed for temp files!
- The Big Data applications (HDD-based massive storage) should consist of applications that were not economically feasible a decade ago using HDDs

Hadoop - Linux of Big Data
- Hadoop = Open Source Distributed Operating System Distribution for Big Data
  - Based on Google's architecture design
  - Cheap commodity hardware for storage
  - Fault-tolerant distributed filesystems: HDFS, Tachyon
  - Batch processing systems: Hadoop MapReduce, Apache Hive, and Apache Pig (HDD); Apache Spark (RAM)
  - Parallel SQL implementations for analytics: Apache Hive, Cloudera Impala, Apache Shark, Facebook Presto
  - Fault-tolerant distributed database: HBase
  - Distributed machine learning libraries, text indexing & search, etc.
- Project Web page: http://hadoop.apache.org/
- Hadoop MapReduce is just one example application on top of the Hadoop Open Source distribution!

Commercial Hadoop Support
- Cloudera: Probably the largest Hadoop distributor, partially owned by Intel (740 million USD investment for an 18% share). Available from: http://www.cloudera.com/
- Hortonworks: Yahoo! spin-off from their large Hadoop development team: http://www.hortonworks.com/
- MapR: A rewrite of much of Apache Hadoop in C++, including a new filesystem. API-compatible with Apache Hadoop. http://www.mapr.com/

Hadoop-BAM
- A library to interface NGS data formats with both Hadoop and Spark
- Includes tools for, e.g., sorting of reads, as needed when merging the results of parallel read aligners
- Supported file formats: BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF
- Some file formats, like BAM, are notoriously badly designed for parallel processing
- Version 7.0 of Hadoop-BAM released: http://sourceforge.net/projects/hadoop-bam/
- 2700+ downloads of the library
- Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., and Heljanko, K.: Hadoop-BAM: Directly Manipulating Next Generation Sequencing Data in the Cloud. Bioinformatics 28(6):876-877, 2012. (http://dx.doi.org/10.1093/bioinformatics/bts054)

SeqPig
- Parallel scripting language for NGS data sets based on the Apache Pig language
- Compiles into Java, executed by Hadoop MapReduce
- SQL-like functionality with helper functions for NGS data: filtering data, computing aggregate statistics, doing joins
- Supported file formats: BAM, SAM, FASTQ, QSEQ, and FASTA
- Schumacher, A., Pireddu, L., Niemenmaa, M., Kallio, A., Korpelainen, E., Zanetti, G., and Heljanko, K.: SeqPig: Simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics 30(1):119-120, 2014. (dx.doi.org/10.1093/bioinformatics/btt601)
- See also supplement: http://bioinformatics.oxfordjournals.org/content/suppl/2013/10/17/btt601.dc1/supplement.pdf

SeqPig Use Case Examples
Automatically parallelizing Pig example scripts for:
- File format conversion
- Filtering out unmapped reads and PCR or optical duplicates
- Filtering out reads with low mapping quality
- Filtering by regions (samtools syntax)
- Sorting BAM files
- Computing read coverage
- Computing base frequencies (counts) for each reference coordinate
- Pileup
- Collecting read-mapping-quality statistics
- Collecting per-base statistics of reads
- ...
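To make the filtering use cases concrete, here is a minimal single-machine sketch in Python of the per-read logic (it is not SeqPig code and does not use the SeqPig or Hadoop-BAM APIs). It relies on the SAM flag bits 0x4 (unmapped) and 0x400 (PCR or optical duplicate) and on MAPQ being the fifth SAM column; the function names, the MAPQ threshold of 20, and the sample records are assumptions made for illustration only.

```python
FLAG_UNMAPPED = 0x4      # SAM flag bit: segment unmapped
FLAG_DUPLICATE = 0x400   # SAM flag bit: PCR or optical duplicate


def keep_read(sam_line: str, min_mapq: int = 20) -> bool:
    """Return True if the read is mapped, not a duplicate, and has MAPQ >= min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: FLAG
    mapq = int(fields[4])   # column 5: MAPQ
    if flag & (FLAG_UNMAPPED | FLAG_DUPLICATE):
        return False
    return mapq >= min_mapq


def filter_sam(lines, min_mapq: int = 20):
    """Yield header lines unchanged and only those alignment lines that pass keep_read."""
    for line in lines:
        if line.startswith("@") or keep_read(line, min_mapq):
            yield line


# Hypothetical records, made up for this example.
records = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",     # kept
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",            # unmapped
    "r3\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",  # duplicate
    "r4\t0\tchr1\t200\t5\t4M\t*\t0\t0\tACGT\tIIII",      # low mapping quality
]

kept = list(filter_sam(records))
for line in kept:
    print(line)
```

Because each read is tested independently of all others, this kind of filter is straightforward to parallelize across the splits of a large input file.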

Scalability of SeqPig
Figure: Scalability of SeqPig vs. sequential FastQC. Speedup versus FastQC (y-axis) as a function of the number of worker nodes (x-axis); plotted series: avg readqual, read length, basequal stats, GC contents, all at once. Computing statistics on a 61.4 GB input file with a Hadoop cluster of up to 63 computers.

SeqPig Benefits and Drawbacks
Benefits:
- Automatic parallelization of data processing scripts
- Easy-to-learn scripting language with the full power of MapReduce
- Most scripts are at most tens of lines of code vs. hundreds to thousands of lines of Java
- Also allows calling user-defined functions written in Java/Python
- Implements SQL-like functionality
Drawbacks:
- MapReduce has a 10+ second startup delay: not for interactive use
- A specialized language instead of a standard like SQL

Move from BAM files to SQL Data Warehouse
- A proper data warehouse system:
  - Can very efficiently evaluate parallel queries over petabytes of data
  - Allows for efficient compression and indexing
  - Makes it possible to build on the Hadoop+Spark software investments
- Many new analytics SQL implementations on top of Hadoop are designed for handling petabyte-class datasets:
  - Apache Hive
  - Cloudera Impala
  - Spark SQL
  - Presto from Facebook
- We have interfaced Hive with NGS file formats using Hadoop-BAM, see Master's Thesis: Matti Niemenmaa: Analysing sequencing data in Hadoop: The road to interactivity via SQL, 2013.
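As a rough illustration of what querying reads through SQL can look like, the sketch below uses Python's built-in sqlite3 module as a small, single-machine stand-in for a parallel SQL engine such as Hive. The table name, column names, and rows are hypothetical and are not the schema used by the Hive integration mentioned above; the SQL dialect of a real warehouse may also differ.

```python
import sqlite3

# In-memory database standing in for a SQL data warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reads (name TEXT, refname TEXT, pos INTEGER, mapq INTEGER, flags INTEGER)"
)

# Hypothetical alignment records, made up for this example.
rows = [
    ("r1", "chr1", 100, 60, 0),
    ("r2", "chr1", 150, 40, 0),
    ("r3", "chr2", 10, 30, 16),
    ("r4", "*", 0, 0, 4),  # flag bit 0x4 set: unmapped
]
conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?, ?)", rows)

# Per-reference read count and mean mapping quality over mapped reads only.
query = """
SELECT refname, COUNT(*) AS n_reads, AVG(mapq) AS mean_mapq
FROM reads
WHERE (flags & 4) = 0
GROUP BY refname
ORDER BY refname
"""
result = conn.execute(query).fetchall()
for refname, n_reads, mean_mapq in result:
    print(refname, n_reads, mean_mapq)
```

The same filter-and-aggregate pattern is what a SeqPig script expresses in Pig; stating it in standard SQL is what lets an existing warehouse engine plan and parallelize it.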

Future plans
- Porting all our tools over to Spark from Hadoop
- Parallel variant detection using black-box variant callers needs to be integrated with the pipeline (work in progress)
- Releasing a tool with SeqPig-like functionality on a parallel SQL data warehouse
  - Interactive ad-hoc queries of Big Data in genomics
- In-memory filesystem and object storage integration
- We are open to working with you to help with your Big Data problems!