Scalable Cloud Computing Solutions for Next Generation Sequencing Data




Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Matti Niemenmaa (1), Aleksi Kallio (2), André Schumacher (1), Petri Klemelä (2), Eija Korpelainen (2), and Keijo Heljanko (1)

(1) Department of Information and Computer Science, School of Science, Aalto University, firstname.lastname@aalto.fi
(2) CSC IT Center for Science, firstname.lastname@csc.fi

Next Generation Sequencing

The amount of NGS data worldwide is predicted to grow very rapidly, and the datasets are large: the 1000 Genomes project (http://www.1000genomes.org) is a freely available 50 TB+ data set of human genomes (as of June 2010). Without scalable cloud computing and storage methods, genomics research will hit computing infrastructure and/or budget limits.

Google MapReduce

A scalable batch processing framework developed at Google for Big Data: computing the Web index. It uses commodity PC hardware to create a massively parallel storage subsystem. Each node is usually a Linux PC with 4-12 hard disks. Data is by default stored on three nodes for very high availability. Adding a new server not only improves CPU capacity but also adds storage capacity and bandwidth.

Google MapReduce (cont.)

Much cheaper per terabyte than traditional SAN storage. The MapReduce framework takes care of all issues related to parallelization, synchronization, load balancing, and fault tolerance; all these details are hidden from the programmer.
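The programming model described above can be illustrated with a minimal in-process sketch: the programmer supplies only a map function and a reduce function, while the framework (simulated here by a single-machine shuffle/sort) handles everything in between. This is word counting, the classic MapReduce example; it is not code from the paper.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    """Map: emit (word, 1) for every word in one input line."""
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    """Reduce: sum the counts emitted for one word."""
    return (key, sum(values))

def mapreduce(records):
    # The framework normally performs this shuffle/sort across many
    # machines; here it is simulated in a single process.
    pairs = [kv for record in records for kv in map_phase(record)]
    pairs.sort(key=itemgetter(0))
    return dict(reduce_phase(k, (v for _, v in group))
                for k, group in groupby(pairs, key=itemgetter(0)))

counts = mapreduce(["the web index", "the big data"])
print(counts)  # {'big': 1, 'data': 1, 'index': 1, 'the': 2, 'web': 1}
```

Everything outside `map_phase` and `reduce_phase` is the framework's job, which is exactly why parallelization, synchronization, and fault tolerance can be hidden from the programmer.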

MapReduce Diagram

(Diagram: the user program forks a master and worker processes; the master assigns map and reduce tasks. Map workers read input splits 0-4 and write intermediate results to local disk; reduce workers remotely read the intermediate files and write output files 0-1. Phases: input files, map phase, intermediate files on local disks, reduce phase, output files.)

Figure: J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004.

Apache Hadoop

An open source implementation of the MapReduce framework, heavily used by e.g. Yahoo! and Facebook. "Moving computation is cheaper than moving data": ship code to data, not data to code. Map and Reduce workers are also storage nodes for the underlying distributed filesystem: job allocation is first tried on a node holding a copy of the data, and if that fails, on a node in the same rack (to maximize network bandwidth). Project Web page: http://hadoop.apache.org/

Apache Hadoop (cont.)

Builds reliable systems out of unreliable commodity hardware by replicating data and, in case of node failures, automatically rerunning computations. Tuned for large files (gigabytes of data) and designed to handle very large, 1 PB+ data sets. Designed for streaming data access in batch processing: high bandwidth rather than low latency. For scalability it is NOT a POSIX filesystem, so not all applications port to this storage system without modifications. A good match for upcoming NGS data storage needs.
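The storage side of the points above can be sketched numerically: HDFS splits a file into fixed-size blocks (64 MB was the classic default) and places three replicas with rack awareness. The cluster layout and node names below are made up purely for illustration; the placement rule is a rough approximation of the HDFS default policy, not a reimplementation of it.

```python
import math
import random

BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default block size
REPLICAS = 3

# Hypothetical two-rack cluster (names invented for this sketch).
CLUSTER = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}

def split_into_blocks(file_size):
    """Number of HDFS blocks a file of `file_size` bytes occupies."""
    return math.ceil(file_size / BLOCK_SIZE)

def place_block(writer_node="node1"):
    """Approximate HDFS-style placement: first replica on the writer's
    node, and the remaining two on distinct nodes in a different rack,
    so a whole-rack failure cannot lose all copies."""
    local_rack = next(r for r, ns in CLUSTER.items() if writer_node in ns)
    remote_rack = random.choice([r for r in CLUSTER if r != local_rack])
    second, third = random.sample(CLUSTER[remote_rack], 2)
    return [writer_node, second, third]

# A 50 GB BAM file (as in the experiments later) becomes 800 blocks,
# each stored on three nodes.
print(split_into_blocks(50 * 1024**3))  # 800
```

Because every block lives on three nodes, the scheduler usually finds a node that holds a local copy of the data it assigns, which is what makes "ship code to data" work.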

Two Large Hadoop Installations

Yahoo! (2009): 4000 nodes, 16 PB raw disk, 64 TB RAM, 32K cores.
Facebook (2010): 2000 nodes, 21 PB storage, 64 TB RAM, 22.4K cores.

A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, H. Liu: Data warehousing and analytics infrastructure at Facebook. SIGMOD Conference 2010: 1013-1020. http://doi.acm.org/10.1145/1807167.1807278

Hadoop-BAM

An open source library that enables Hadoop to natively process BAM (Binary Alignment/Map) files in parallel. Based on parallelizing parts of Picard, the Java implementation of samtools, on top of Hadoop. As a proof of concept it has so far been used for sorting and visualizing BAM files (coverage summaries) using up to 180 cores in parallel. It can easily be extended to do other BAM file processing in parallel; many avenues for future research.

Hadoop-BAM (cont.)

Technical challenges: a BAM file needs to be split into 64 MB parts that can be processed independently in parallel by different machines. This is difficult because the BAM file format is compressed: we must first find a compressed block boundary in the binary. Furthermore, compression block boundaries are not aligned with record boundaries. We find the correct record boundaries by speculatively parsing the file starting at different offsets until a correct offset is found. The BAM format is quite difficult, but not impossible, to work with in parallel. http://sourceforge.net/projects/hadoop-bam/
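The speculative-parsing idea can be sketched on a toy record format instead of real BAM: try each candidate byte offset, attempt to parse a few consecutive records from it, and accept the first offset where all of them look valid. The length-prefixed records and the sanity bound of 64 bytes below are invented for illustration; the real Hadoop-BAM checks are much richer (reference IDs, read-name lengths, and so on).

```python
import struct

def parse_record(buf, off):
    """Try to parse one toy length-prefixed record at `off`.
    Returns the offset just past the record, or None if the bytes
    there do not look like a valid record."""
    if off + 4 > len(buf):
        return None
    (length,) = struct.unpack_from("<i", buf, off)
    if not (1 <= length <= 64) or off + 4 + length > len(buf):
        return None
    return off + 4 + length

def find_record_start(buf, guess, lookahead=3):
    """Speculative parsing: starting from `guess`, try each byte offset
    and accept the first one from which `lookahead` consecutive records
    parse successfully."""
    for off in range(guess, len(buf)):
        pos, ok = off, 0
        while pos is not None and ok < lookahead:
            pos = parse_record(buf, pos)
            if pos is not None:
                ok += 1
        if ok == lookahead:
            return off
    return None
```

Requiring several consecutive records to parse makes a false positive inside arbitrary bytes unlikely, which is what lets each 64 MB split locate its first record without any coordination with the other splits.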

CSC Genome Browser

Bioinformatics is the largest customer group of CSC (in user numbers). Interactive browsing of a BAM file with zooming in and out of read coverage summaries, Google Earth style. Single BAM files can be 100 GB+, with interactive visualization at different zoom levels. Preprocessing with the new Hadoop-BAM library computes summary data for the higher zoom levels. Benchmarked to scale up to 15 x 12 = 180 cores running in parallel for the visualization.

Genome Browser GUI

(Screenshot of the genome browser user interface.)

Read aggregation: problem (cont.)

(Illustration of the read aggregation problem.)

Triton Cloud Testbed

- Cloud computing testbed at Aalto University
- 112 AMD Opteron 2.6 GHz compute nodes with 12 cores and 32-64 GB memory each, totalling 1344 cores
- InfiniBand and 1 Gbit Ethernet networks
- 30 TB+ local disk
- 40 TB+ fileserver work space

Experiments

One 50 GB (compressed) BAM input file from 1000 Genomes, run on the Triton cluster with 1-15 compute nodes of 12 cores each. Four repetitions for each number of worker nodes. Two types of runs: sorting according to starting position ("sorted") and read aggregation using five summary-block sizes at once ("summarized").
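The speedup reported in the next figure is computed in the usual way: the mean elapsed time with one worker divided by the mean elapsed time with n workers, averaged over the repeated runs. The runtimes below are made up purely to show the calculation; the real measurements are in the figure.

```python
# Hypothetical elapsed times in minutes, keyed by worker count.
# (The paper uses four repetitions per worker count; two suffice here.)
runtimes = {1: [120.0, 118.0], 2: [63.0, 61.0], 4: [33.0, 31.0]}

def mean(xs):
    return sum(xs) / len(xs)

def mean_speedup(runtimes, baseline_workers=1):
    """Speedup at n workers = mean T(baseline) / mean T(n)."""
    t_base = mean(runtimes[baseline_workers])
    return {n: t_base / mean(ts) for n, ts in runtimes.items()}

print(mean_speedup(runtimes))
```

Ideal (linear) speedup at n workers is n itself, which is why the figure includes an "ideal" line to compare the measured curves against.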

Mean Speedup

(Figure: mean speedup versus number of workers (1, 2, 4, 8, 15), in two panels: the 50 GB sorted run and the 50 GB summarized run with B = 2, 4, 8, 16, 32. Each panel plots ideal speedup, input file import, sorting/summarizing, output file export, and total elapsed time, on a speedup axis from 0 to 16.)

Conclusions

Parallel processing does not improve runtimes if data transfer in and out of a single fileserver does not scale. The large cloud players use commodity PCs plus distributed, scalable, fault-tolerant filesystems to get storage performance up and keep prices down. The NGS storage and processing requirements are a good match for MapReduce and its open source Apache Hadoop implementation. Integrating alignment tools, filtering of BAM files, and parallel ad hoc database queries are on our TODO list. Version 3.0 of Hadoop-BAM is available: http://sourceforge.net/projects/hadoop-bam/ We are looking for challenges in parallelizing NGS data processing tasks.