SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

André Schumacher, Luca Pireddu, Matti Niemenmaa, Aleksi Kallio, Eija Korpelainen, Gianluigi Zanetti and Keijo Heljanko

Abstract

Summary: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing data sets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing data sets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts.

Availability and Implementation: Available under the open source MIT license at http://sourceforge.net/projects/seqpig/

Contact: andre.schumacher@yahoo.com

Note: This is the author's version of the article that appeared in Bioinformatics 30(1):119-120, 2014, and is available online via open access at http://bioinformatics.oxfordjournals.org/content/30/1/119.abstract.

1 Introduction

Novel computational approaches are required to cope with the increasing data volumes of large-scale sequencing projects, since the growth in processing power and storage access speed is unable to keep pace with them [11, 4]. Several innovative tools and technologies have been proposed to tackle these challenges. Some are based on MapReduce, a distributed computing paradigm built on the idea of splitting the input data into chunks that can be processed largely independently (via a Map function); the resulting partial results are then grouped and merged (by a Reduce function). MapReduce permits automatic parallelization and scalable data distribution across many computers. The most popular open-source implementation is Apache Hadoop, which also comes with its own distributed filesystem. The validity of Hadoop as a data processing platform is demonstrated by its level of adoption in major data-intensive companies such as Twitter, Facebook and Amazon.
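
As a minimal illustration of this pattern, consider counting records per group with Apache Pig, the scripting layer used by SeqPig and introduced in the next paragraphs: the per-record work corresponds to the Map function and the per-group aggregation to the Reduce function. The input file and its column layout in this sketch are hypothetical.

    -- Hypothetical tab-separated input: one line per aligned read, with the
    -- reference (chromosome) name in the first column and the position in the second.
    alignments = LOAD 'alignments.tsv' USING PigStorage('\t')
                 AS (refname:chararray, pos:int);

    -- Map phase: each record is examined independently; Reduce phase: records
    -- are grouped by reference name and each group is aggregated.
    per_ref = FOREACH (GROUP alignments BY refname)
              GENERATE group AS refname, COUNT(alignments) AS read_count;

    DUMP per_ref;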

There is an increasing number of Hadoop-based tools for processing sequencing data [12], ranging from quality control [9] and alignment [3, 8] to SNP calling [3], variant annotation [7] and structural variant detection [13], including general-purpose workflow management [10]. Note also the recent publication of independent and complementary work in [6].

While Hadoop does simplify writing scalable, distributed software, it does not make it trivial. Such a task still requires specialized skills and a significant amount of work, particularly if the solution involves sequences of MapReduce jobs. This effort can be reduced significantly by using high-level tools such as Apache Pig, which implements an SQL-like scripting language that is automatically translated into a sequence of MapReduce jobs. Given its flexibility and simplicity for developing data processing pipelines, it is not surprising that a large fraction of computing jobs in contemporary Hadoop deployments originate from Apache Pig or similar high-level tools [2].

SeqPig brings the benefits of Apache Pig to sequencing data analysis. It allows users to integrate their own analysis components with existing MapReduce programs in order to create full NGS pipelines based on Hadoop.

2 Methods

SeqPig extends Pig with a number of features and functionalities specialized for processing sequencing data. Specifically, it provides: 1) data input and output components, 2) functions to access fields and transform data, and 3) a collection of scripts for frequent tasks (e.g., pile-up, QC statistics).

Apache Pig provides an extension mechanism through the definition of new library functions, implemented in one of several supported programming languages (Java, Python, Ruby, JavaScript); these functions can then be called from Pig scripts. SeqPig uses this feature to augment the set of operators provided by plain Pig with a number of custom sequencing-specific functions.

SeqPig supports ad hoc (scripted and interactive) distributed manipulation and analysis of large sequencing datasets so that processing speed scales with the number of available computing nodes. It provides import and export functions for file formats commonly used for sequencing data: Fastq, Qseq, FASTA, SAM and BAM. These components, implemented with the help of Hadoop-BAM [5], allow the user to load and export sequencing data in the Pig environment. All available fields, such as optional BAM/SAM read attributes, can then be accessed and modified from within Pig. SeqPig also includes functions to access SAM flags, split reads by base (for computing base-level statistics), reverse-complement reads, calculate read reference positions in a mapping (for pile-ups and extracting SNP positions), and more. It comes packaged with scripts that calculate various statistics and manipulations on read data, which also serve as examples. The growing library of functions and scripts is documented in the SeqPig manual. Contributions from the community are welcome and encouraged.
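
From the user's point of view, the only setup a script needs is to register the SeqPig library with Pig. The sketch below illustrates this; the jar path and the fully qualified class name are assumptions that depend on the local installation, while REGISTER and DEFINE are standard Pig statements.

    -- Make the SeqPig load/store functions and operators visible to the script.
    -- The jar location is an assumption; adjust it to the local installation.
    REGISTER /opt/seqpig/seqpig.jar;

    -- Optionally bind a short alias to a fully qualified UDF class; the package
    -- name below is a placeholder for the actual SeqPig package.
    DEFINE FastqLoader fi.aalto.seqpig.io.FastqLoader();

    -- After this, the loaders can be used as in the example scripts below.
    reads = LOAD 'in.fq' USING FastqLoader();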

Figures 1 and 2 show script examples. For a more detailed list of features and more examples please see the project web site: http://seqpig.sourceforge.net/.

    reads = LOAD 'in.qseq' USING QseqLoader();
    STORE reads INTO 'out.fq' USING FastqStorer();

Figure 1: Converting Qseq into Fastq; the data set is simply read and then written using the appropriate load/store functions.

    reads = LOAD 'in.fq' USING FastqLoader();
    read_bases = FOREACH reads GENERATE UnalignedReadSplit(sequence, quality);
    read_gc = FOREACH read_bases {
        only_gc = FILTER $0 BY readbase == 'G' OR readbase == 'C';
        GENERATE COUNT(only_gc) AS count;
    }
    gc_counts = FOREACH (GROUP read_gc BY count) GENERATE group AS gc_cnt, COUNT($1) AS cnt;
    DUMP gc_counts;

Figure 2: Script that generates a histogram of the GC content of reads. The script loads a Fastq file, splits each read into separate bases and, for each read, keeps only the bases that are either G or C. The filtered bases are then counted per read and the reads are grouped by their GC count. Finally, the script prints records that contain the GC count and the number of reads with that GC count.

To evaluate SeqPig, we implemented a script that calculates most of the read quality statistics collected by the popular FastQC tool [1]. The script is included with the examples (fast_fastqc.pig). We ran a set of experiments that measured the speed-up gained by using SeqPig on Hadoop clusters of different sizes compared to a single-node FastQC run. We used a set of Illumina reads as input (read length: 101 bases; file size: 61.4 GB; format: Fastq). Software versions were as follows: FastQC 0.10.1, Hadoop 1.0.4 and Pig 0.11.1. All tests were run on nodes equipped with dual quad-core Intel Xeon CPUs at 2.83 GHz, 16 GB of RAM and one 250 GB SATA disk available to Hadoop; the nodes are connected via Gigabit Ethernet. FastQC read its data from a high-performance shared parallel file system by DDN, while SeqPig used the Hadoop file system (HDFS), which uses each node's local disk drive.

We first ran five different SeqPig read statistics for varying numbers of computing nodes: the sample distribution of a) the average base quality of the reads; b) the length of reads; c) bases by position inside the reads; d) the GC contents of reads. Finally, we combined them into a single script. Each of the executions results in a single MapReduce job and thus a single scan through the data. All runs were repeated three times and averaged (deviation from average < 7%).
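
As an illustration of how compactly such per-read statistics can be expressed, the sketch below computes the read-length distribution using the FastqLoader from Figure 2 together with the standard Pig operators SIZE, GROUP and COUNT; it is a simplified example written under these assumptions, not the fast_fastqc.pig script itself.

    reads = LOAD 'in.fq' USING FastqLoader();

    -- 'sequence' is the read-sequence field exposed by the loader (cf. Figure 2);
    -- SIZE returns the number of bases in each read.
    read_len = FOREACH reads GENERATE SIZE(sequence) AS len;

    -- Group the reads by length and count how many reads fall into each bin.
    len_hist = FOREACH (GROUP read_len BY len)
               GENERATE group AS len, COUNT(read_len) AS cnt;

    DUMP len_hist;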

[Figure 3 (plot): speedup versus FastQC (y-axis, 0-60) against the number of worker nodes (x-axis, 0-70); curves: avg readqual, read length, basequal stats, GC contents, all at once.]

Figure 3: Results of an experiment with an input file of 61.4 GB and a varying number of Hadoop worker nodes. The script does not currently implement all FastQC statistics (we expect the missing ones to scale similarly in SeqPig), whereas the per-cycle quality distribution is not computed by FastQC.

From Figure 3 one can see that it is possible to achieve a significant speedup by exploiting the parallelism in read and base statistics computation using Hadoop. Further, the total runtime of the script that computes all statistics is mostly determined by the slowest of the individual ones, since the complete script is compiled into a single Map-only job. Another observation is that for most of the computed statistics we achieve a close to linear speedup compared to FastQC up to 48 nodes. We assume that the levelling off is due to the Hadoop job overhead eventually dominating the speedup gained from parallelization, depending on the input file size.

SeqPig enables simple and scalable manipulation and analysis of sequencing data on the Hadoop platform. At CRS4, SeqPig is already used routinely for several steps in the production workflow; in addition, it has been successfully used for ad hoc investigations into data quality issues, comparison of alignment tools, and reformatting and packaging of data. We have also tested SeqPig on Amazon's Elastic MapReduce service, where users may rent computing time in the cloud to run their SeqPig scripts and even share their S3 storage buckets with other cloud-enabled software. Instructions are provided in the supplementary material.

Acknowledgements

This work was supported by the Cloud Software and D2I Programs of the Finnish Strategic Centre for Science, Technology and Innovation DIGILE; Academy of Finland [139402]; the Sardinian (Italy) Regional Grant [L7-2010/COBIK]; and COST Action BM1006 "Next Generation Sequencing Data Analysis Network", SeqAhead. Computational resources were provided by CRS4 and the ELIXIR node hosted at CSC – IT Center for Science.

References

[1] S. Andrews. FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc, 2010.

[2] Y. Chen, S. Alspaugh, et al. Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads. Proceedings of the VLDB Endowment, 5:1802–1813, Aug. 2012.

[3] B. Langmead, M. Schatz, et al. Searching for SNPs with cloud computing. Genome Biology, 10(11):R134, 2009.

[4] V. Marx. Biology: the big challenges of big data. Nature, 498:255–260, June 2013.

[5] M. Niemenmaa, A. Kallio, et al. Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics, 28(6):876–877, 2012.

[6] H. Nordberg, K. Bhatia, et al. BioPig: a Hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics, 2013. Epub ahead of print.

[7] B. O'Connor, B. Merriman, et al. SeqWare Query Engine: storing and searching sequence data in the cloud. BMC Bioinformatics, 11(Suppl 12):S2, 2010.

[8] L. Pireddu, S. Leo, et al. SEAL: a distributed short read mapping and duplicate removal tool. Bioinformatics, 27(15):2159–2160, 2011.

[9] T. Robinson, S. Killcoyne, et al. SAMQA: error classification and validation of high-throughput sequenced read data. BMC Genomics, 12(1):419, 2011.

[10] S. Schönherr, L. Forer, et al. Cloudgene: a graphical execution platform for MapReduce programs on private and public clouds. BMC Bioinformatics, 13(1), 2012.

[11] L. Stein. The case for cloud computing in genome informatics. Genome Biology, 11(5):207, 2010.

[12] R. Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics, 11(Suppl 12):S1, 2010.

[13] C. W. Whelan, J. Tyner, et al. Cloudbreak: accurate and scalable genomic structural variation detection in the cloud with MapReduce. arXiv:1307.2331, 2013.