Evaluating MapReduce and Hadoop for Science

1 Evaluating MapReduce and Hadoop for Science. Lavanya Ramakrishnan, Lawrence Berkeley National Lab

2 Computation and Data are critical parts of the scientific process. Three Pillars of Science: Theory, Experiment, Computation; plus Data (Fourth Paradigm). [Figure: Advanced Light Source data rates, in TB/yr]

3 Internet Big Data led to the MapReduce and Hadoop evolution. [Figure: Map, Reduce]

4 A central component of the MapReduce model is its file system

                           HDFS                       GPFS and Lustre
Storage Location           Compute Node               Servers
Access Model               Custom (except with FUSE)  POSIX
Typical Replication        3                          1
Stripe Size                64 MB                      1 MB
Concurrent Writes          No                         Yes
Scales with                # of Compute Nodes         # of Servers
Scale of Largest Systems   O(10k) Nodes               O(100) Servers
User/Kernel Space          User                       Kernel
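
The replication and stripe-size rows can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes binary units (1 TB = 2^40 bytes, 1 MB = 2^20 bytes) and plugs in the typical values from the comparison above for a 1 TB dataset. The function name `storage_footprint` is made up for this example.

```python
MB = 2**20  # assumption: binary megabyte
TB = 2**40  # assumption: binary terabyte


def storage_footprint(data_bytes, replication, stripe_bytes):
    """Return (raw bytes stored, number of blocks or stripes) for one dataset."""
    raw_bytes = data_bytes * replication
    # Integer ceiling division: a partial final block still counts as a block.
    units = -(-data_bytes // stripe_bytes)
    return raw_bytes, units


# Typical HDFS values from the comparison: replication 3, 64 MB blocks.
hdfs_raw, hdfs_blocks = storage_footprint(1 * TB, replication=3, stripe_bytes=64 * MB)
# Typical GPFS/Lustre values from the comparison: replication 1, 1 MB stripes.
pfs_raw, pfs_stripes = storage_footprint(1 * TB, replication=1, stripe_bytes=1 * MB)

print(f"HDFS: {hdfs_raw // TB} TB raw, {hdfs_blocks} blocks")
print(f"GPFS/Lustre: {pfs_raw // TB} TB raw, {pfs_stripes} stripes")
```

Under these assumptions, 1 TB of data occupies 3 TB of raw disk in HDFS, spread over 16384 blocks, while a parallel file system stores 1 TB across 1048576 much smaller stripes.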

5 Evaluating the Hype from Reality. [Diagram: MapReduce evaluated in three settings: HPC (Hadoop on HPC clusters), NoSQL (MongoDB + Hadoop), and Cloud (Hadoop on VM)]

6 Streaming adds a performance overhead. [Figure; source: Evaluating Hadoop for Science, IEEE Cloud]

7 High performance parallel file systems can be used with Hadoop for small to medium concurrency. [Figure: Teragen (1 TB); time (minutes) vs. number of maps for HDFS and GPFS, each with linear and exponential trend lines]

8 We evaluate three data-intensive operations with different testbed configurations: Filter, Merge, and Reorder. Public data sets.
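
The slides name the three operations without defining them. The sketch below is one plausible reading, not the benchmark code. It assumes that Filter writes less than it reads, Reorder writes as much as it reads (the R=W label used later), and Merge writes more than it reads (the R<W label used later). The function names, the toy records, and the side-data lookup are all hypothetical.

```python
def filter_op(records, predicate):
    """Filter: keep only matching records, so output is at most input size."""
    return [r for r in records if predicate(r)]


def reorder_op(records, key):
    """Reorder: same records in a new order, so output size equals input size."""
    return sorted(records, key=key)


def merge_op(records, side_data, key):
    """Merge: attach side data to each record, so each output record grows."""
    return [(r, side_data.get(key(r))) for r in records]


# Toy (title, length) records standing in for a real data set.
pages = [("b", 20), ("a", 5), ("c", 12)]

long_pages = filter_op(pages, lambda r: r[1] > 10)
by_title = reorder_op(pages, key=lambda r: r[0])
with_lang = merge_op(pages, {"a": "en", "b": "fr"}, key=lambda r: r[0])

print(long_pages)
print(by_title)
print(with_lang)
```

The point of the distinction is the read/write balance: an operation that mostly discards data stresses reads, while one that enlarges records stresses writes.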

9 Data operations impact the performance differences across file systems. [Figure: Wikipedia (2 TB); processing time (1000s), broken down into read time, processing time, and write time]

10 Read-intensive applications benefit from HDFS. [Figure: processing time (s) vs. size (TB), HDFS vs. GPFS]

11 Scientific Ensembles have similarities with MapReduce structure: a large number of loosely coupled tasks, each with their own internal parallelism. [Source: Riding the Elephant: Managing Ensembles with Hadoop, MTAGS]

12 All patterns could be implemented in Hadoop, but with varying levels of difficulty. [Figure: patterns on a difficulty scale from low to high; source: Riding the Elephant: Managing Ensembles with Hadoop, MTAGS 2011]

13 There are challenges when using Hadoop for scientific applications: high throughput workflows; scaling up from desktops; file system: non-POSIX; language: Java; input and output formats: mostly line-oriented text; streaming mode: restrictive input and output model; data locality: what happens when there are multiple inputs?; file permissions: jobs run as user hadoop.
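
The line-oriented, restrictive streaming model can be illustrated with a word count. In Hadoop Streaming, a mapper reads text lines on standard input and writes tab-separated key/value lines on standard output, and the reducer receives those lines sorted by key. The sketch below simulates that contract in memory rather than running on a cluster. The function names are hypothetical, and in a real job each function would sit in its own script looping over `sys.stdin`.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input text."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Simulate the framework: map, sort by key, then reduce.
mapped = list(map_lines(["to be or", "not to be"]))
shuffled = sorted(mapped)
reduced = list(reduce_lines(shuffled))

for line in reduced:
    print(line)
```

Everything crossing the mapper/reducer boundary is a line of text, which is why binary scientific formats and jobs with several distinct inputs fit this model poorly.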

14 Tigres: Design templates for common patterns of parallelism. Implement templates as a library in an existing language. [Diagram labels: Application "LightSrc-1" (Create and Debug); Application "LightSrc-2" (Create and Debug, Scale up); Share; "LightSrc" Domain templates; Base Tigres templates]

15 Templates

Sequence ( name, task_array, input_array )
  e.g., output[ ] = Sequence ( my seq, task_array_12, input_array_12)
Parallel ( name, task_array, input_array )
  e.g., output[ ] = Parallel( abc, task_array_12, input_array_12)
Split ( split_task, split_input_values, task_array, task_array_in )
  e.g., Split( task_x1, input_value_1, spl_t_arr, spl_i_arr)
Merge ( task_array, input_array, merge_task, merge_input_values)
  e.g., Merge( syn_t_arr, syn_i_arr, task_x1, input_value_1)
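
A minimal Python sketch of how such templates could look as a library is given below. It is not the Tigres API. The argument lists mirror the slide, but the lowercase function names and the semantics are assumptions: each task is a one-argument callable paired with one input, Split runs a head task before a parallel stage, and Merge runs a parallel stage before a final combining task.

```python
from concurrent.futures import ThreadPoolExecutor


def sequence(name, task_array, input_array):
    """Run tasks one after another; task i receives input_array[i]."""
    return [task(arg) for task, arg in zip(task_array, input_array)]


def parallel(name, task_array, input_array):
    """Run tasks concurrently; results come back in task order."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(t, a) for t, a in zip(task_array, input_array)]
        return [f.result() for f in futures]


def split(split_task, split_input_values, task_array, task_array_in):
    """Run one head task, then fan out to a parallel stage."""
    head = split_task(split_input_values)
    return head, parallel("split", task_array, task_array_in)


def merge(task_array, input_array, merge_task, merge_input_values):
    """Run a parallel stage, then combine its outputs with one task."""
    partial = parallel("merge", task_array, input_array)
    return merge_task(partial, merge_input_values)


def square(x):
    return x * x


seq_out = sequence("my seq", [square, square], [2, 3])
par_out = parallel("abc", [square, square], [2, 3])
split_out = split(square, 4, [square, square], [2, 3])
merge_out = merge([square, square], [2, 3], lambda parts, offset: sum(parts) + offset, 1)
print(seq_out, par_out, split_out, merge_out)
```

Keeping the templates as plain functions over arrays of tasks and inputs matches the stated goal of implementing them as a library in an existing language rather than as a new workflow language.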

16 Evaluating the Hype from Reality. [Diagram: MapReduce evaluated in three settings: HPC (Hadoop on HPC clusters), NoSQL (MongoDB + Hadoop), and Cloud (Hadoop on VM)]

17 Reorder and Merge: Writes to Mongo can be expensive. [Figures: Reorder (R=W) and Merge (R<W); processing time (s), broken down into read time, processing time, and write time, for MongoDB and HDFS with 4.6 million input records. *Sharded MongoDB vs HDFS on an 8-node Hadoop cluster]

18 Filter: Hadoop MapReduce provides a way to scale up analysis on MongoDB. [Figure: processing time (min) vs. number of input records (million); legend: Hadoop MongoDB MapReduce (2 workers) MongoDB MapReduce]

19 Data analysis with Hadoop and MongoDB: Offload the MapReduce writes to HDFS. [Figure annotations: Move data to HDFS; Sharding helps; Writing to MongoDB; Reading from MongoDB]

20 Evaluating the Hype from Reality. [Diagram: MapReduce evaluated in three settings: HPC (Hadoop on HPC clusters), NoSQL (MongoDB + Hadoop), and Cloud (Hadoop on VM)]

21 Teragen and Terasort take longer on virtual machines. [Figures: Teragen performance and Terasort performance; execution time (sec) vs. data size (up to 500 GB), physical vs. virtual]

22 Reorder on virtual machines is faster (still investigating). [Figure: Wikibench reorder performance; execution time (sec) vs. data size (up to 111 GB), physical vs. virtual]

23 Physical and virtual have different power profiles but correlate with maps and reduces. [Figures: Wikibench reorder power consumption, physical and virtual, 37 GB; power (kW) and left percentage (%) of map and reduce vs. time (sec)]

24 Configuring Hadoop on Virtual Machines can benefit from different configurations. [Figure: time (seconds) for Filter, Reorder, and Merge across different configurations, including 30D 80C, 30D 130C, 80D 30C, 80D 80C, and 130D 30C]

25 Reorder (virtual) needs more compute nodes than data nodes. [Figure: Wikibench on VMs, reorder; labels: collocation, performance/power; data sizes 37 GB, 74 GB, 111 GB]

26 Filter (virtual) can benefit from more data nodes. [Figure: Wikibench on VMs, filter; labels: collocation, performance/power; data sizes 37 GB, 74 GB, 111 GB]

27 FRIEDA: Storage and Data Management on VMs

28 Summary: The MapReduce and Hadoop ecosystem offers powerful paradigms for science, but may not be an out-of-the-box solution. It is possible to run Hadoop in nontraditional configurations to enable use in existing environments.

29 Questions? Collaborators: Shane Canon, Elif Dede, Zacharia Fadika, Madhu Govindaraju, Daniel Gunter, Eugen Feller, Christine Morin
