Brown. Andrew Pavlo October 7, 2009


1 Brown Andrew Pavlo October 7, 2009

2 Today's Talk: MapReduce Overview, Hadoop Overview, Demo, Advanced Topics, Sweet Spots

3 MapReduce Overview Massively parallel data processing. Programming Model vs. Execution Platform. Programs consist of only two functions: Map(k1, v1) → (k2, list(v2)); Reduce(k2, list(v2)) → (k3, list(v3)).

4 MapReduce Example Calculate total order amount per day after Jan 1st.

    Map(key, value) {
        if (key >= 2009-01-01) {
            output(key, value);
        }
    }

    Reduce(key, values) {
        sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        output(key, sum);
    }

[Figure: a DATE/AMOUNT table of orders passes through the Map Workers; the Reduce Output lists one summed amount per date.]
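The Map and Reduce functions on this slide can be tried out without a cluster. The following is a minimal single-process sketch, not Hadoop code: `run_mapreduce`, `map_order`, `reduce_orders`, and the sample `ORDERS` are hypothetical names and data chosen for illustration, and amounts are held in integer cents to avoid floating-point rounding.

```python
from collections import defaultdict
from datetime import date


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group the output by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):  # sorted only to make the output order stable
        results.extend(reduce_fn(out_key, groups[out_key]))
    return results


CUTOFF = date(2009, 1, 1)  # same comparison as the slide: key >= 2009-01-01


def map_order(order_date, amount_cents):
    """Pass an order through only if it falls on or after the cutoff date."""
    if order_date >= CUTOFF:
        yield order_date, amount_cents


def reduce_orders(order_date, amounts):
    """Sum all amounts that share one date."""
    yield order_date, sum(amounts)


# Hypothetical sample input: (order date, amount in cents).
ORDERS = [
    (date(2008, 12, 31), 4400),
    (date(2009, 3, 1), 1000),
    (date(2009, 3, 1), 2500),
    (date(2009, 3, 2), 5300),
]

totals = run_mapreduce(ORDERS, map_order, reduce_orders)
for day, cents in totals:
    print(day.isoformat(), f"${cents / 100:.2f}")
```

In a real deployment the grouping step is the shuffle performed by the framework between the map and reduce workers; only the two small functions are written by the programmer.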

5 Hadoop Yahoo/Apache's open-source framework: Core framework: Java. Native libraries: C/C++. MR applications written in any language. Provides much of Google's known middleware: MapReduce → Hadoop, GFS → HDFS, BigTable → HBase.
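One way applications get written in languages other than Java is Hadoop Streaming, which runs ordinary executables as the mapper and reducer and exchanges tab-separated key/value lines with them over standard input and output, with the reducer's input sorted by key. The sketch below assumes a hypothetical input layout of `DATE<TAB>AMOUNT` per line and writes the two stages as functions over iterables of lines so they can be exercised locally; in an actual streaming job each would read `sys.stdin` and print its output.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'date<TAB>amount' for every order on or after 2009-01-01."""
    for line in lines:
        day, amount = line.rstrip("\n").split("\t")
        if day >= "2009-01-01":  # ISO dates compare correctly as strings
            yield f"{day}\t{amount}"


def reducer(sorted_lines):
    """Sum the amounts of consecutive lines that share a date (input must be sorted)."""
    pairs = (line.split("\t") for line in sorted_lines)
    for day, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(float(amount) for _, amount in group)
        yield f"{day}\t{total:.2f}"


def run_locally(lines):
    """Stand-in for the framework: map, sort by key, reduce."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical sample input.
sample = [
    "2008-12-31\t44.00\n",
    "2009-03-01\t10.00\n",
    "2009-03-02\t53.00\n",
    "2009-03-01\t25.00\n",
]
for row in run_locally(sample):
    print(row)
```

The `sorted` call plays the role of the framework's sort between the two stages, which is why the reducer can rely on equal keys arriving next to each other.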

6 Hadoop Architecture MapReduce engine: Central job tracker (master). Separate task tracker on each node (slaves). Distributed filesystem: Central filesystem name node (master). Separate daemons to serve blocks (slaves).

7 Fault Tolerance Task-level Fault Tolerance: Map/Reduce tasks are pinged periodically. Unresponsive tasks are re-executed. Slow tasks are speculatively re-executed on available nodes. HDFS Name Node & Job Tracker: Fail over to a backup daemon.
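The task-level policy can be illustrated with a toy scheduler check. This is a sketch under stated assumptions, not Hadoop's actual heuristic: `TaskStatus`, `plan_reexecution`, the 600-second timeout, and the rule "slow means more than 0.2 behind the mean progress of live tasks" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskStatus:
    task_id: str
    last_heartbeat: float  # seconds on the tracker's clock
    progress: float        # fraction complete, 0.0 to 1.0


def plan_reexecution(tasks, now, heartbeat_timeout=600.0, slow_margin=0.2):
    """Return (unresponsive_ids, slow_ids).

    Unresponsive tasks have missed their heartbeat window and are rerun outright.
    Slow tasks are still alive but lag the others, so a speculative copy is started.
    """
    unresponsive = [t.task_id for t in tasks
                    if now - t.last_heartbeat > heartbeat_timeout]
    alive = [t for t in tasks if now - t.last_heartbeat <= heartbeat_timeout]
    if not alive:
        return unresponsive, []
    mean_progress = sum(t.progress for t in alive) / len(alive)
    slow = [t.task_id for t in alive if t.progress < mean_progress - slow_margin]
    return unresponsive, slow


# Hypothetical snapshot of four map tasks at time 1000.
snapshot = [
    TaskStatus("m0", last_heartbeat=995.0, progress=0.9),
    TaskStatus("m1", last_heartbeat=990.0, progress=0.8),
    TaskStatus("m2", last_heartbeat=100.0, progress=0.5),
    TaskStatus("m3", last_heartbeat=998.0, progress=0.2),
]
unresponsive, slow = plan_reexecution(snapshot, now=1000.0)
print("re-execute:", unresponsive, "speculate:", slow)
```

Re-running a task is safe here because map and reduce tasks read immutable input and write their own output, so a duplicate attempt can simply be discarded once one copy finishes.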

8 Writing MapReduce Applications Create Mapper and Reducer objects that implement the proper interfaces. Create a JobConf/Configuration object and set the appropriate parameters: Map/Reduce Classes Input/Output Paths Input/Output Data Types Call the run job method.

9 Brown Public cluster running in ilab: $HADOOP_HOME=/pro/hadoop/ 16 nodes / ~1.5 TB HDFS Running Hadoop Submit jobs on any department machine. Head nodes: HDFS Master Job Tracker

10 Demo Get the total order amount for each day after a given date: Load input data into HDFS. Write Map/Reduce methods. Create execution jar. Deploy in Hadoop. View/download results.

11 Today's Talk: MapReduce Overview, Hadoop Overview, Demo, Advanced Topics, Sweet Spots

12 Input File Formats An input format defines how the input data is interpreted, partitioned, and transmitted to Map tasks. TextInputFormat (default): Key is the byte offset of the line, value is the line contents. SequenceFile: Flat files, binary key/value pairs. KeyValueTextInputFormat: Plain text, human readable, parameterized types, split by tabs/spaces.
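The difference between the two text formats is easiest to see on the same bytes. The sketch below mimics what each format hands to the map function; it is an illustration in Python, not Hadoop's implementation, and the function names are hypothetical. It relies on TextInputFormat keying each line by its byte offset in the file and on KeyValueTextInputFormat splitting each line at the first tab by default.

```python
def text_input_records(data: bytes):
    """Yield (byte offset of the line, line contents), as TextInputFormat would."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_text_records(data: bytes, separator: str = "\t"):
    """Yield (key, value) by splitting each line at the first separator."""
    for raw in data.splitlines():
        key, _, value = raw.decode("utf-8").partition(separator)
        yield key, value


# Hypothetical file contents.
data = b"2009-03-01\t10.00\n2009-03-02\t53.00\n"

print(list(text_input_records(data)))
print(list(key_value_text_records(data)))
```

With the default text format the map function has to parse the date out of the value itself; with the key/value format the date already arrives as the key.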

13 Compression Data in HDFS can be compressed to speed up transfer rates: Block-level compression. Record-level compression. Only SequenceFiles can be compressed. JNI calls into native codecs (e.g., zlib, gzip, lzo). Trade-off: Data throughput vs. CPU performance.
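The gap between record-level and block-level compression can be measured with Python's standard `zlib` module. This is a simplified sketch: the records are synthetic, the block size of 100 records is an arbitrary assumption, and whole records are compressed here rather than reproducing the SequenceFile on-disk layout.

```python
import zlib


def record_level_size(records):
    """Total bytes when every record is compressed on its own."""
    return sum(len(zlib.compress(record)) for record in records)


def block_level_size(records, records_per_block=100):
    """Total bytes when records are batched into blocks before compression."""
    total = 0
    for start in range(0, len(records), records_per_block):
        block = b"".join(records[start:start + records_per_block])
        total += len(zlib.compress(block))
    return total


# Synthetic, repetitive order records (hypothetical data).
records = [
    f"2009-03-{day % 28 + 1:02d}\t{day * 7 % 100}.00\n".encode()
    for day in range(1000)
]
raw_size = sum(len(record) for record in records)

print("raw bytes:         ", raw_size)
print("record-level bytes:", record_level_size(records))
print("block-level bytes: ", block_level_size(records))
```

Short records carry too little redundancy to compress well individually, while a block lets the codec exploit repetition across records; the cost is that a reader must decompress a whole block to reach one record.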

14 Optimizations Reusable JVM: By default Hadoop fires up a new JVM for each task invocation; JVM reuse lets consecutive tasks of a job share one. Distributed cache: Deploy static data files and libraries to all nodes before map tasks begin executing. Rack awareness: Tasks will try to fetch blocks from data nodes in the same rack.
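Rack awareness amounts to ranking the replicas of a block by network distance from the reading task. The following is a minimal sketch of that idea, assuming each node is identified by a hypothetical `(rack, host)` pair and using three distance levels (same node, same rack, other rack); it is not Hadoop's topology code.

```python
def pick_replica(reader, replicas):
    """Choose the replica closest to the reader.

    reader and each replica are (rack, host) tuples.
    Distance 0: same host. Distance 1: same rack. Distance 2: different rack.
    """
    def distance(replica):
        if replica == reader:
            return 0
        if replica[0] == reader[0]:
            return 1
        return 2

    return min(replicas, key=distance)


# Hypothetical cluster layout: one block with three replicas.
replicas = [("rack2", "node7"), ("rack1", "node4"), ("rack2", "node8")]

print(pick_replica(("rack1", "node3"), replicas))  # a reader in rack1
print(pick_replica(("rack2", "node8"), replicas))  # a reader holding a replica
```

Reading within a rack keeps traffic off the links between racks, which are typically the scarcer resource in a cluster.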

15 Additional Hadoop Technologies Avro Serialization: Hadoop's version of Protocol Buffers. Data is stored according to a schema. Hive Data Warehouse: Facebook's SQL layer on top of Hadoop. Pig: High-level language that compiles down into Hadoop programs.

16 Additional Hadoop Technologies Amazon Elastic MapReduce: Direct support for dynamically deploying MR nodes in EC2 cloud. Cloudera Desktop: Improved web-interface for managing HDFS and running jobs. HadoopDB (Yale): Replace HDFS with MySQL/Postgres at each node.

17 Today's Talk: MapReduce Overview, Hadoop Overview, Demo, Advanced Topics, Sweet Spots

18 Extract-Transform-Load "Read once" data sets: Read data from several different sources. Parse and clean. Perform complex transformations. Decide what attribute data to store. Load the information into a DBMS. Allows for quick-and-dirty data analysis.
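The parse, clean, and load steps can be sketched end to end on one machine. In this illustration the input lines, the `orders` table, and the helper names are hypothetical, and an in-memory SQLite database stands in for the target DBMS; in a MapReduce setting the `parse` function is the kind of work a map task would do.

```python
import sqlite3


def parse(line):
    """Parse 'date, customer, $amount'; return None for lines that cannot be cleaned."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 3:
        return None
    day, customer, amount = parts
    try:
        cents = round(float(amount.lstrip("$")) * 100)
    except ValueError:
        return None
    return day, customer.lower(), cents  # normalise customer names to lower case


def etl(lines, conn):
    """Load every parseable line into the orders table; return the number loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (day TEXT, customer TEXT, cents INTEGER)"
    )
    rows = [row for row in map(parse, lines) if row is not None]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Hypothetical raw input, including two lines that fail cleaning.
raw_lines = [
    "2009-03-01, Alice, $10.00",
    "2009-03-01, BOB, $25.50",
    "garbage line",
    "2009-03-02, carol, n/a",
]

conn = sqlite3.connect(":memory:")
loaded = etl(raw_lines, conn)
daily = conn.execute(
    "SELECT day, SUM(cents) FROM orders GROUP BY day ORDER BY day"
).fetchall()
print(loaded, daily)
```

Once the cleaned rows are in the database, further analysis is plain SQL, which is the division of labour this slide describes.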

19 Semi-Structured Data MapReduce systems can easily store semi-structured data since no schema is needed: Typically key/value records with a varying number of attributes. Awkward to store in a relational DBMS: Wide tables with many nullable attributes. Column stores fare better.
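The wide-table problem shows up as soon as records with different attributes are forced into one schema. A small sketch, using hypothetical records and helper names, that flattens key/value records into a single table and measures how much of it is NULL:

```python
# Hypothetical semi-structured records: each carries only the attributes it has.
records = [
    {"id": 1, "name": "alice", "email": "a@example.com"},
    {"id": 2, "name": "bob", "phone": "555-0100", "city": "Providence"},
    {"id": 3, "referrer": "search"},
]


def to_wide_table(records):
    """Build one column per attribute seen anywhere; missing values become None."""
    columns = sorted({key for record in records for key in record})
    rows = [[record.get(column) for column in columns] for record in records]
    return columns, rows


def null_fraction(rows):
    """Fraction of cells in the table that are None."""
    cells = [cell for row in rows for cell in row]
    return sum(cell is None for cell in cells) / len(cells)


columns, rows = to_wide_table(records)
print(columns)
for row in rows:
    print(row)
print("fraction of NULL cells:", null_fraction(rows))
```

A row store still reserves a slot for every nullable column in every row, whereas a column store keeps each attribute separately and can represent the missing values compactly.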

20 Limited Budget Operations MapReduce frameworks: Community supported and driven. Attractive for projects with modest budgets and requirements. Parallel DBMSs are expensive: No open-source option.

21 More Information How to execute MR jobs in the CS department: SIGMOD '09 MR vs. DBMS benchmark: Questions/Comments?
