Evaluating MapReduce and Hadoop for Science
Lavanya Ramakrishnan (LRamakrishnan@lbl.gov), Lawrence Berkeley National Lab
Computation and data are critical parts of the scientific process. The traditional three pillars of science (theory, experiment, computation) are now joined by a fourth paradigm: data.

Advanced Light Source data rates:
  2009: 65 TB/yr
  2011: 312 TB/yr
  2013: 1900 TB/yr
Internet big data led to the evolution of MapReduce and Hadoop.
A central component of the MapReduce model is its file system. HDFS compared with parallel file systems (GPFS, Lustre):

  Property                   HDFS                    GPFS / Lustre
  Typical replication        3                       1
  Storage location           Compute nodes           Servers
  Access model               Custom (except FUSE)    POSIX
  Stripe size                64 MB                   1 MB
  Concurrent writes          No                      Yes
  Scales with                # of compute nodes      # of servers
  Scale of largest systems   O(10k) nodes            O(100) servers
  User/kernel space          User                    Kernel
Evaluating the hype against reality:
  - MapReduce / Hadoop on HPC clusters
  - Hadoop on virtual machines in the cloud
  - NoSQL: MongoDB + Hadoop
Streaming adds a performance overhead.

[Figure: comparison of native and streaming Hadoop performance; lower is better. See Evaluating Hadoop for Science, IEEE Cloud 2012.]
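To make the streaming overhead concrete, here is a minimal Hadoop Streaming-style word-count sketch in Python. The function names and the sample data are illustrative, not from the study's code; the point is that streaming pipes every record through an external process as tab-separated text over stdin/stdout, which is where the overhead relative to native Java MapReduce comes from.

```python
# Illustrative Hadoop Streaming mapper/reducer logic. In a real job these
# would be two scripts launched with hadoop-streaming.jar; here both stages
# run in-process on a sample input to show the text-based data flow.
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated (word, 1) pair per line of output text."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    """Sum counts per word; input must arrive sorted by key, which is
    what Hadoop's shuffle/sort stage guarantees between the two scripts."""
    keyed = (pair.split("\t") for pair in sorted_pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

if __name__ == "__main__":
    sample = ["hadoop streaming hadoop", "streaming overhead"]
    shuffled = sorted(mapper(sample))  # stands in for the shuffle/sort stage
    for out in reducer(shuffled):
        print(out)  # hadoop 2, overhead 1, streaming 2
```

Every key and value crosses a process boundary as serialized text, once per stage, which is the cost the figure above measures.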
High-performance parallel file systems can be used with Hadoop at small to medium concurrency.

[Figure: Teragen (1 TB) time in minutes (0 to 12) vs. number of maps (0 to 3000), for HDFS and GPFS, with linear and exponential trend lines for each; lower is better.]
We evaluate three data-intensive operations, on public data sets, with different testbed configurations:
  - Filter
  - Merge
  - Reorder
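The three operations stress different parts of the I/O path. A pure-Python sketch of what each does (function names and data are illustrative stand-ins, not the study's benchmark code):

```python
# Illustrative versions of the three benchmark operations. In the actual
# study these run as Hadoop jobs over terabyte-scale data.

def filter_op(records, predicate):
    """Filter: a map-only pass that keeps matching records (read-heavy,
    output much smaller than input)."""
    return [r for r in records if predicate(r)]

def reorder_op(records, key):
    """Reorder: rearranges every record, exercising the shuffle/sort stage
    and writing output as large as the input (read- and write-heavy)."""
    return sorted(records, key=key)

def merge_op(*record_sets):
    """Merge: combines several inputs into one output, so writes dominate
    relative to any single input stream."""
    merged = []
    for rs in record_sets:
        merged.extend(rs)
    return merged

records = ["b:2", "a:1", "c:3"]
print(filter_op(records, lambda r: r > "a:2"))  # ['b:2', 'c:3']
print(reorder_op(records, key=lambda r: r[0]))  # ['a:1', 'b:2', 'c:3']
```

Because each operation has a different read/write balance, they separate the file-system effects seen in the following slides.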
Data operations impact the performance differences across file systems.

[Figure: Wikipedia (2 TB) processing time (in 1000s of seconds, 0 to 15) broken into read, processing, and write time; lower is better.]
Read-intensive applications benefit from HDFS.

[Figure: processing time (s, 0 to 800) vs. data size (0 to 3 TB) for HDFS and GPFS; lower is better.]
Scientific ensembles have similarities with the MapReduce structure: a large number of loosely coupled tasks, each with its own internal parallelism. Riding the Elephant: Managing Ensembles with Hadoop, MTAGS 2011
All ensemble patterns could be implemented in Hadoop, but with varying levels of difficulty.
There are challenges when using Hadoop for scientific applications (high-throughput workflows, scaling up from desktops):
  - File system: non-POSIX
  - Language: Java
  - Input and output formats: mostly line-oriented text
  - Streaming mode: restrictive input/output model
  - Data locality: what happens with multiple inputs?
  - File permissions: jobs run as user "hadoop"
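One common workaround for the line-oriented text restriction above is to base64-encode binary scientific records so that each becomes a single newline-free line that streaming can pass safely. The function names and sample data here are illustrative:

```python
# Hypothetical encode/decode helpers for shipping binary records through
# line-oriented Hadoop Streaming; any newline or binary byte in a raw
# record would otherwise break the record framing.
import base64

def encode_record(blob: bytes) -> str:
    """Turn an arbitrary binary record into one newline-free text line."""
    return base64.b64encode(blob).decode("ascii")

def decode_record(line: str) -> bytes:
    """Inverse transform, run inside the streaming mapper."""
    return base64.b64decode(line.strip())

blob = b"\x00\x01binary detector frame\n\xff"
line = encode_record(blob)
assert "\n" not in line           # now safe for line-oriented streaming
assert decode_record(line) == blob
```

The cost is roughly a 33% size inflation plus encode/decode time, which compounds the streaming overhead already discussed.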
Tigres: design templates for common patterns of parallelism, implemented as a library in an existing language. Base Tigres templates are specialized into domain templates; users create and debug an application (e.g., "LightSrc-1"), share it, and scale it up (e.g., "LightSrc-2"). http://tigres.lbl.gov
Templates:

  Sequence(name, task_array, input_array)
    e.g., output[] = Sequence("my seq", task_array_12, input_array_12)
  Parallel(name, task_array, input_array)
    e.g., output[] = Parallel("abc", task_array_12, input_array_12)
  Split(split_task, split_input_values, task_array, task_array_in)
    e.g., Split(task_x1, input_value_1, spl_t_arr, spl_i_arr)
  Merge(task_array, input_array, merge_task, merge_input_values)
    e.g., Merge(syn_t_arr, syn_i_arr, task_x1, input_value_1)
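A minimal, hypothetical Python rendering of how such templates might compose. This only mirrors the signatures above with plain callables as tasks; the real Tigres library's semantics (documented at http://tigres.lbl.gov) may differ.

```python
# Hypothetical stand-ins for the four Tigres templates; each "task" here
# is a plain Python callable.

def sequence(name, task_array, input_array):
    """Run tasks one after another, feeding each task's output to the next."""
    value = input_array
    for task in task_array:
        value = task(value)
    return value

def parallel(name, task_array, input_array):
    """Run each task on its corresponding input independently."""
    return [task(inp) for task, inp in zip(task_array, input_array)]

def split(split_task, split_input, task_array, task_inputs):
    """Fan out: one task's output seeds a parallel stage."""
    seed = split_task(split_input)
    return [task(seed, inp) for task, inp in zip(task_array, task_inputs)]

def merge(task_array, input_array, merge_task, merge_input):
    """Fan in: a parallel stage feeds one combining task."""
    partials = parallel("pre-merge", task_array, input_array)
    return merge_task(partials, merge_input)

# Toy pipeline: square three inputs in parallel, then combine in a merge step.
out = merge([lambda x: x * x] * 3, [1, 2, 3],
            lambda parts, base: base + sum(parts), 10)
print(out)  # 24
```

The point of the template approach is that the fan-out/fan-in control flow lives in the library, so the application only supplies the tasks.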
Evaluating the hype against reality: NoSQL (MongoDB + Hadoop)
Reorder and Merge: writes to MongoDB can be expensive.

[Figures: sharded MongoDB vs. HDFS on an 8-node Hadoop cluster, 4.6 million input records; processing time (s, 0 to 800) broken into read, processing, and write time, for Reorder (R = W) and Merge (R < W); lower is better.]
Filter: Hadoop MapReduce provides a way to scale up analysis on MongoDB.

[Figure: processing time (min, 50 to 150) vs. number of input records (4.6, 9.3, 18.6, 37.2 million) for Hadoop, MongoDB MapReduce (2 workers), and MongoDB MapReduce; lower is better.]
Data analysis with Hadoop and MongoDB: offload the MapReduce writes to HDFS.
  - Moving output data to HDFS helps
  - Sharding helps

[Figure compares writing to MongoDB with reading from MongoDB; lower is better.]
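The offload pattern can be sketched as a map-side pass that reads documents from MongoDB but emits results as text destined for the job's HDFS output path, rather than writing them back into the database. Here a plain list stands in for a pymongo cursor, and the collection field names are hypothetical:

```python
# Sketch of offloading MapReduce output to HDFS. Reading from MongoDB is
# the cheap direction in the results above; the expensive write-back is
# avoided by emitting plain text that Hadoop lands in HDFS.

def filter_from_mongo(cursor, min_count):
    """Map-side filter over Mongo documents; each yielded line goes to the
    job's HDFS output instead of back into MongoDB."""
    for doc in cursor:
        if doc.get("count", 0) >= min_count:
            yield f"{doc['term']}\t{doc['count']}"

fake_cursor = [{"term": "hadoop", "count": 7},
               {"term": "mongo", "count": 2}]
for line in filter_from_mongo(fake_cursor, min_count=5):
    print(line)  # hadoop	7
```

This keeps MongoDB in the read path, where sharding helps, while the bulk writes go to the file system that handles them well.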
Evaluating the hype against reality: Hadoop on virtual machines in the cloud
Teragen and Terasort take longer on virtual machines.

[Figures: execution time (s) for Teragen (0 to 600) and Terasort (0 to 3000) at 100 to 500 GB, physical vs. virtual; lower is better.]
Reorder on virtual machines is faster (still investigating).

[Figure: Wikibench reorder execution time (s, 0 to 2000) at 34, 74, and 111 GB, physical vs. virtual; lower is better.]
Physical and virtual machines have different power profiles, but power correlates with the map and reduce phases.

[Figures: Wikibench reorder (37 GB) power consumption (kW, 5 to 8) over time (s, 0 to 800) on physical and virtual machines, overlaid with the percentage of maps and reduces remaining; lower is better.]
Hadoop on virtual machines can benefit from tuning the mix of data and compute nodes.

[Figure: time (s, 0 to 9000) for Filter, Reorder, and Merge across configurations of data (D) and compute (C) nodes: 30D/30C, 30D/80C, 30D/130C, 80D/30C, 80D/80C, 130D/30C; lower is better.]
Reorder (virtual) needs more compute nodes than data nodes.

[Figure: Wikibench reorder on VMs; performance/power (0 to 0.2) across data-compute collocations 30-30, 80-30, 130-30, 30-80, 80-80, 30-130, at 37, 74, and 111 GB; higher is better.]
Filter (virtual) can benefit from more data nodes.

[Figure: Wikibench filter on VMs; performance/power (0 to 1.4) across data-compute collocations 30-30, 80-30, 130-30, 30-80, 80-80, 30-130, at 37, 74, and 111 GB; higher is better.]
FRIEDA: storage and data management on VMs. http://frieda.lbl.gov
Summary:
  - MapReduce and the Hadoop ecosystem are powerful paradigms for science
  - But they may not be out-of-the-box solutions
  - It is possible to run Hadoop in nontraditional configurations to enable use in existing environments
Questions? Email: LRamakrishnan@lbl.gov

Collaborators: Shane Canon, Elif Dede, Zacharia Fadika, Madhu Govindaraju, Daniel Gunter, Eugen Feller, Christine Morin