Distributed Computing CRS4 http://www.crs4.it EuroPython 2011
Acknowledgments
Part of the MapReduce tutorial is based upon: J. Zhao, J. Pjesivac-Grbovic, "MapReduce: The Programming Model and Practice", SIGMETRICS '09 Tutorial, 2009. http://research.google.com/pubs/pub36249.html
"The Free Lunch is Over" is a well-known article by Herb Sutter, available online at http://www.gotw.ca/publications/concurrency-ddj.htm
Pygments rules!
Intro: 1. The Free Lunch is Over
CPU clock speed reached saturation around 2004
Multi-core architectures: everyone must go parallel
Moore's law reinterpreted: the number of cores per chip doubles every 2 years, while clock speed remains fixed or decreases
We must rethink the design of our software
www.gotw.ca
Intro: 2. The Data Deluge
Data-intensive applications: satellites, high-energy physics, DNA sequencing, social networks
1 high-throughput sequencer: several TB/week
Hadoop to the rescue!
Intro: 3. Python and Hadoop
Hadoop: a distributed computing framework for data-intensive applications
Open source Java implementation of Google's MapReduce and GFS
Pydoop: an API for writing Hadoop programs in Python
Comparison with other solutions; performance
Outline
1 MapReduce and Hadoop
  The MapReduce Programming Model
  Hadoop: Open Source MapReduce
2
3
What is MapReduce?
A programming model for large-scale distributed data processing
  Inspired by map and reduce in functional programming
  Map: map a set of input key/value pairs to a set of intermediate key/value pairs
  Reduce: apply a function to all values associated with the same intermediate key; emit output key/value pairs
An implementation of a system to execute such programs
  Fault-tolerant (as long as the master stays alive)
  Hides internals from users
  Scales very well with dataset size
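The functional-programming roots can be made concrete with a toy, single-process sketch (not from the talk; the variable names `pairs`, `grouped` and `counts` are ours) that mimics the three phases on a two-line input:

```python
from functools import reduce  # in Python 2, reduce is a builtin
from itertools import groupby
from operator import itemgetter

lines = ["the quick brown fox", "ate the lazy green fox"]

# Map phase: each input record yields intermediate (key, value) pairs.
pairs = [(w, 1) for line in lines for w in line.split()]

# Shuffle & sort: bring all values for the same key together
# (groupby only groups adjacent items, so sorting comes first).
pairs.sort(key=itemgetter(0))
grouped = [(k, [v for _, v in g]) for k, g in groupby(pairs, key=itemgetter(0))]

# Reduce phase: fold the value list of each key into a single count.
counts = dict((k, reduce(lambda a, b: a + b, vs)) for k, vs in grouped)
print(counts["fox"], counts["the"])  # 2 2
```

In the real framework the three phases run on different machines; here they are just three consecutive statements.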
MapReduce's Hello World: Wordcount
Input: "the quick brown fox / ate the lazy green fox", one line per mapper
Map: each mapper emits (word, 1) for every word in its input, e.g. (the, 1), (quick, 1), (brown, 1), (fox, 1), ...
Shuffle & Sort: all pairs with the same key are routed to the same reducer
Reduce: each reducer sums the values for its keys: (ate, 1), (brown, 1), (fox, 2), (green, 1), (lazy, 1), (quick, 1), (the, 2)
Wordcount: Pseudocode

map(String key, String value):
  // key: does not matter in this case
  // value: a subset of input words
  for each word w in value:
    Emit(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: all values associated to key
  int wordcount = 0;
  for each v in values:
    wordcount += ParseInt(v);
  Emit(key, AsString(wordcount));
Mock Implementation: mockmr.py

from itertools import groupby
from operator import itemgetter

def _pick_last(it):
    for t in it:
        yield t[-1]

def mapreduce(data, mapf, redf):
    buf = []
    for line in data.splitlines():
        for ik, iv in mapf("foo", line):
            buf.append((ik, iv))
    buf.sort()
    for ik, values in groupby(buf, itemgetter(0)):
        for ok, ov in redf(ik, _pick_last(values)):
            print ok, ov
Mock Implementation: mockwc.py

from mockmr import mapreduce

DATA = """the quick brown fox
ate the lazy green fox
"""

def map_(k, v):
    for w in v.split():
        yield w, 1

def reduce_(k, values):
    yield k, sum(v for v in values)

if __name__ == "__main__":
    mapreduce(DATA, map_, reduce_)
MapReduce: Execution Model
(1) The user program forks the master and the worker processes
(2) The master assigns map and reduce tasks to workers
(3) Each mapper reads its input split from the input (DFS)
(4) Mappers write intermediate files to their local disks
(5) Reducers remotely read the intermediate files
(6) Reducers write the output to the DFS
Adapted from Zhao et al., "MapReduce: The Programming Model and Practice", 2009 (see acknowledgments)
MapReduce vs Alternatives 1
Two axes of comparison:
  Programming model: from declarative (DBMS/SQL) to procedural (MapReduce, MPI)
  Data organization: from flat raw files (MapReduce, MPI) to structured (DBMS/SQL)
Adapted from Zhao et al., "MapReduce: The Programming Model and Practice", 2009 (see acknowledgments)
MapReduce vs Alternatives 2

                   MPI                   MapReduce                DBMS/SQL
programming model  message passing       map/reduce               declarative
data organization  no assumption         files split into blocks  organized structures
data type          any                   (k,v) string/protobuf    tables with rich types
execution model    independent nodes     map/shuffle/reduce       transaction
communication      high                  low                      high
granularity        fine                  coarse                   fine
usability          steep learning curve  simple concept           runtime: hard to debug
key selling point  run any application   huge datasets            interactive querying

There is no one-size-fits-all solution: choose according to your problem's characteristics
Adapted from Zhao et al., "MapReduce: The Programming Model and Practice", 2009 (see acknowledgments)
MapReduce Implementations / Similar Frameworks
Google MapReduce (C++, Java, Python)
  Based on proprietary infrastructure (MapReduce, GFS, ...) and some open source libraries
Hadoop (Java)
  Open source, top-level Apache project
  GFS -> HDFS
  Used by Yahoo, Facebook, eBay, Amazon, Twitter, ...
DryadLINQ (C# + LINQ)
  Not MR; DAG model: vertices = programs, edges = channels
  Proprietary (Microsoft); academic release available
The small ones: Starfish (Ruby), Octopy (Python), Disco (Python + Erlang)
Hadoop: Overview
Scalable: thousands of nodes; petabytes of data over 10M files; single files from gigabytes to terabytes
Economical: open source; COTS hardware (but master nodes should be reliable)
Well-suited to bag-of-tasks applications (many bio apps)
Files are split into blocks and distributed across nodes
High-throughput access to huge datasets
WORM (write once, read many) storage model
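As a back-of-the-envelope illustration of block-based storage (a sketch of ours, not Hadoop code; 64 MB is the HDFS default block size in the 0.20 series used here):

```python
# How many blocks does a file occupy? Each block typically feeds one
# map task, which the framework tries to schedule near a block replica.
DEFAULT_BLOCK_SIZE = 64 * 2**20  # 64 MB

def num_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Ceiling division: blocks needed to hold file_size bytes."""
    return (file_size + block_size - 1) // block_size

print(num_blocks(2**30))  # a 1 GB file -> 16 blocks
print(num_blocks(100))    # a tiny file still occupies one block
```

This is also why Hadoop favors a few huge files over many small ones: each block costs namenode metadata and a map task.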
Hadoop
Masters: Namenode (HDFS) and Job Tracker (MapReduce)
Slaves: Datanodes (HDFS) and Task Trackers (MapReduce)
The client sends a job request to the Job Tracker
The Job Tracker queries the Namenode about physical data block locations
The input stream is split among the desired number of map tasks
Map tasks are scheduled closest to where the data reside
Hadoop Distributed File System (HDFS)
To read a file, the client asks the Namenode for the block IDs and datanode locations, then fetches each block directly from a datanode
Secondary Namenode: namespace checkpointing; helps the namenode with its logs; not a namenode replacement
Each block is replicated n times (3 by default)
One replica on the same rack, the others on different racks; you have to provide the network topology
Wordcount: (part of) Java Code

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Other Optional MapReduce Components
Combiner: a local Reducer, applied to map outputs before the shuffle
RecordReader: translates the byte-oriented view of input files into the record-oriented view required by the Mapper; directly accesses HDFS files; processing unit: InputSplit (filename, offset, length)
Partitioner: decides which Reducer receives which key; typically uses a hash function of the key
RecordWriter: writes key/value pairs output by the Reducer; directly accesses HDFS files
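To make the combiner's and partitioner's roles concrete, here is a stand-alone sketch of ours, not the Hadoop API (`zlib.crc32` stands in for Hadoop's hash partitioning, which actually uses the key's Java hashCode):

```python
import zlib
from collections import Counter

def combine(pairs):
    """Combiner = local reducer: pre-aggregate the mapper's output so
    that less data crosses the network during the shuffle."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())

def partition(key, num_reducers):
    """Hash partitioner: the same key always lands on the same reducer,
    no matter which mapper emitted it (crc32 keeps this reproducible)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

map_output = [("the", 1), ("fox", 1), ("the", 1)]
combined = combine(map_output)  # [('fox', 1), ('the', 2)]
targets = [partition(k, 4) for k, _ in combined]
```

For wordcount the combiner can simply be the reducer itself, since summing is associative and commutative; that is what turns nine shuffled pairs into seven in the earlier example.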
Hadoop on your Laptop in 10 Minutes
Download from www.apache.org/dyn/closer.cgi/hadoop/core
Unpack to /opt, then set a few vars:
  export HADOOP_HOME=/opt/hadoop-0.20.2
  export PATH=$HADOOP_HOME/bin:${PATH}
Set up passphraseless ssh:
  ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
  cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
In $HADOOP_HOME/conf/hadoop-env.sh, set JAVA_HOME to the appropriate value for your machine
Additional Tweaking: Use as Non-Root
Assumption: the user is in the users group
  # mkdir /var/tmp/hdfs /var/log/hadoop
  # chown :users /var/tmp/hdfs /var/log/hadoop
  # chmod 770 /var/tmp/hdfs /var/log/hadoop
Edit $HADOOP_HOME/conf/hadoop-env.sh:
  export HADOOP_LOG_DIR=/var/log/hadoop
Edit $HADOOP_HOME/conf/hdfs-site.xml:
  <property>
    <name>dfs.name.dir</name>
    <value>/var/tmp/hdfs/nn</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/tmp/hdfs/data</value>
  </property>
Additional Tweaking: MapReduce
Edit $HADOOP_HOME/conf/mapred-site.xml:
  <property>
    <name>mapred.system.dir</name>
    <value>/var/tmp/hdfs/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/var/tmp/hdfs/tmp</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
Start your Pseudo-Cluster
hadoop namenode -format   # required only on first use
start-all.sh
firefox http://localhost:50070 &   # HDFS web interface
firefox http://localhost:50030 &   # MapReduce web interface
Web Interface HDFS
Web Interface MapReduce
Run the Java Word Count Example
Wait until HDFS is ready for work:
  hadoop dfsadmin -safemode wait
Copy input data to HDFS:
  wget http://www.gutenberg.org/cache/epub/11/pg11.txt
  hadoop fs -put pg11.txt alice.txt
Run Word Count:
  hadoop jar $HADOOP_HOME/*examples*.jar wordcount alice.txt output
Copy output back to the local fs:
  hadoop fs -get output{,}
  sort -rn -k2 output/part-r-00000 | head -n 3
  the 1664
  and 780
  to 773
  ls output/_logs/history
  localhost_1307814843760_job_201106111954_0001_conf.xml
  localhost_1307814843760_job_201106111954_0001_simleo_word+count
Cool! I Want to Develop my own MR Application!
The easiest path for beginners is Hadoop Streaming
  A Java package included in Hadoop
  Use any executable as the mapper or reducer
  Mappers and reducers read key/value pairs from standard input and write them to standard output
  Text protocol: records are serialized as k\tv\n

hadoop jar \
  $HADOOP_HOME/contrib/streaming/*streaming*.jar \
  -input myinputdirs \
  -output myoutputdir \
  -mapper my_mapper \
  -reducer my_reducer \
  -file my_mapper \
  -file my_reducer \
  -jobconf mapred.map.tasks=2 \
  -jobconf mapred.reduce.tasks=2
WC with Streaming and Python Scripts: Mapper

#!/usr/bin/env python
import sys

for line in sys.stdin:
    for word in line.split():
        print "%s\t1" % word
WC with Streaming and Python Scripts: Reducer

#!/usr/bin/env python
import sys

def serialize(key, value):
    return "%s\t%d" % (key, value)

def deserialize(line):
    key, value = line.split("\t", 1)
    return key, int(value)

def main():
    prev_key, out_value = None, 0
    for line in sys.stdin:
        key, value = deserialize(line)
        if key != prev_key:
            if prev_key is not None:
                print serialize(prev_key, out_value)
            out_value = 0
            prev_key = key
        out_value += value
    if prev_key is not None:
        print serialize(prev_key, out_value)

if __name__ == "__main__":
    main()
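Streaming scripts like these can be sanity-checked without a cluster by reproducing the mapper | sort | reducer pipeline in-process. This is a sketch with helper names of our own (`map_records`, `reduce_lines`), not part of Hadoop:

```python
def map_records(lines):
    # Same logic as the streaming mapper, as a generator of "k\tv" lines.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reduce_lines(sorted_lines):
    # Same logic as the streaming reducer: sum each run of equal keys.
    prev_key, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t", 1)
        if key != prev_key:
            if prev_key is not None:
                yield "%s\t%d" % (prev_key, total)
            prev_key, total = key, 0
        total += int(value)
    if prev_key is not None:
        yield "%s\t%d" % (prev_key, total)

data = ["the quick brown fox", "ate the lazy green fox"]
# sorted() plays the role of Hadoop's shuffle & sort between the phases.
out = list(reduce_lines(sorted(map_records(data))))
print(out[-1])  # the	2
```

On a real cluster the equivalent check is `cat input | ./mapper | sort | ./reducer`, which is also a quick way to debug protocol mistakes such as a missing tab.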
MapReduce Development with Hadoop
Java: native
C/C++: APIs for both MR and HDFS, supported by Hadoop Pipes and included in the Hadoop distribution
Python: several solutions, but do they meet all of the requirements of nontrivial apps?
  Reuse of existing modules, including C/C++ extensions
  NumPy / SciPy for numerical computation
  Specialized components (RecordReader/Writer, Partitioner)
  HDFS access
Python MR: Hadoop-Integrated Solutions
Hadoop Streaming
  Awkward programming style
  Can only write mapper and reducer scripts (no RecordReader, etc.)
  No HDFS access
  Can only process text data streams (lifted in 0.21+)
Jython
  Incomplete standard library
  Most third-party packages are only compatible with CPython
  Cannot use C/C++ extensions
  Typically one or more releases behind CPython
Python MR: Third Party Solutions
Hadoop-based: Hadoopy, Dumbo (on Hadoop Streaming), Happy (on Jython) — same limitations as Streaming/Jython, except for ease of use
Non-Hadoop MR: Disco, Octopy
Other implementations: not as mature/widespread (some stale — last updates Aug 2009 and Apr 2008)
Python MR: Our Solution — Pydoop
http://pydoop.sourceforge.net
Access to most MR components, including RecordReader, RecordWriter and Partitioner
Get configuration, set counters and report status
Programming model similar to the Java one: you define classes, the MapReduce framework instantiates them and calls their methods
CPython: use any module, including C/C++ extensions
HDFS API
Summary of Features

              Streaming  Jython   Pydoop
C/C++ Ext     Yes        No       Yes
Standard Lib  Full       Partial  Full
MR API        No*        Full     Partial
Java-like FW  No         Yes      Yes
HDFS          No         Yes      Yes

(*) you can only write the map and reduce parts as executable scripts.
Hadoop Pipes
The application runs as a separate child process; communication with the Java framework is via persistent sockets
On the Java side, PipesMapRunner and PipesReducer talk to the C++ application over a downward protocol (framework calls) and an upward protocol (results)
The map-side child process runs the record reader, mapper, combiner and partitioner; the reduce-side child runs the reducer and record writer
The C++ app provides a factory used by the framework to create MR components
Integration of Pydoop with the C/C++ API
Integration with Pipes (C++):
  Virtual method invocations flow from the Hadoop Java framework through C++ Pipes and the _pipes extension module (Boost.Python), ultimately reaching user-defined methods in pure Python modules
  Results are wrapped by Boost.Python and returned to the framework
Integration with libhdfs (C, a JNI wrapper):
  Function calls are initiated by Pydoop through the _hdfs extension module
  Results are wrapped and returned as Python objects to the app
Python Wordcount, Full Program Code

#!/usr/bin/env python
import pydoop.pipes as pp

class Mapper(pp.Mapper):
    def map(self, context):
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")

class Reducer(pp.Reducer):
    def reduce(self, context):
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))

if __name__ == "__main__":
    pp.runTask(pp.Factory(Mapper, Reducer))
Status Reports and Counters

class Mapper(pp.Mapper):

    def __init__(self, context):
        super(Mapper, self).__init__(context)
        context.setStatus("initializing")
        self.input_words = context.getCounter("WORDCOUNT", "INPUT_WORDS")

    def map(self, context):
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")
        context.incrementCounter(self.input_words, len(words))
Status Reports and Counters: Web UI
Optional Components: Record Reader

import struct
import pydoop.hdfs as hdfs

class Reader(pp.RecordReader):

    def __init__(self, context):
        super(Reader, self).__init__(context)
        self.isplit = pp.InputSplit(context.getInputSplit())
        self.file = hdfs.open(self.isplit.filename)
        self.file.seek(self.isplit.offset)
        self.bytes_read = 0
        if self.isplit.offset > 0:
            discarded = self.file.readline()  # read by previous split reader
            self.bytes_read += len(discarded)

    def next(self):  # return: (have_a_record, key, value)
        if self.bytes_read > self.isplit.length:  # end of input split
            return (False, "", "")
        key = struct.pack(">q", self.isplit.offset + self.bytes_read)
        value = self.file.readline()
        if value == "":  # end of file
            return (False, "", "")
        self.bytes_read += len(value)
        return (True, key, value)

    def getProgress(self):
        return min(float(self.bytes_read) / self.isplit.length, 1.0)
Optional Components: Record Writer, Partitioner

import sys
import pydoop.utils as pu
import pydoop.hdfs as hdfs

class Writer(pp.RecordWriter):

    def __init__(self, context):
        super(Writer, self).__init__(context)
        jc = context.getJobConf()
        pu.jc_configure_int(self, jc, "mapred.task.partition", "part")
        pu.jc_configure(self, jc, "mapred.work.output.dir", "outdir")
        pu.jc_configure(self, jc, "mapred.textoutputformat.separator",
                        "sep", "\t")
        self.outfn = "%s/part-%05d" % (self.outdir, self.part)
        self.file = hdfs.open(self.outfn, "w")

    def emit(self, key, value):
        self.file.write("%s%s%s\n" % (key, self.sep, value))

class Partitioner(pp.Partitioner):

    def partition(self, key, numOfReduces):
        return (hash(key) & sys.maxint) % numOfReduces
The HDFS Module

>>> import pydoop.hdfs as hdfs
>>> f = hdfs.open("alice.txt")
>>> f.fs.host
'localhost'
>>> f.fs.port
9000
>>> f.name
'hdfs://localhost:9000/user/simleo/alice.txt'
>>> print f.read(50)
Project Gutenberg's Alice's Adventures in Wonderla
>>> print f.readline()
nd, by Lewis Carroll
>>> f.close()
HDFS by Block Size

import collections
import pydoop.hdfs as hdfs

def treewalker(fs, root_info):
    yield root_info
    if root_info["kind"] == "directory":
        for info in fs.list_directory(root_info["name"]):
            for item in treewalker(fs, info):
                yield item

def usage_by_bs(fs, root):
    usage = collections.Counter()
    root_info = fs.get_path_info(root)
    for info in treewalker(fs, root_info):
        if info["kind"] == "file":
            usage[info["block_size"]] += info["size"]
    return usage

def main():
    fs = hdfs.hdfs("default", 0)
    root = "%s/%s" % (fs.working_directory(), "tree_test")
    for bs, tot_size in usage_by_bs(fs, root).iteritems():
        print "%.1f\t%d" % (bs / float(2**20), tot_size)
    fs.close()

if __name__ == "__main__":
    main()
Comparison: vs Jython and Text-Only Streaming
Bar chart: completion time (s), with and without a combiner, for java, c++, pydoop, jython and streaming
48 nodes, 2x 1.8 GHz dual-core Opterons, 4 GB RAM per node
App: Wordcount on 20 GB of random English text
Dataset: uniform sampling from a spell checker list
Java/C++ included for reference
Comparison: vs Dumbo (Binary Streaming)
Bar chart: completion time (s), with and without a combiner, for java, pydoop and dumbo
24 nodes, 2x 1.8 GHz dual-core Opterons, 4 GB RAM per node
App: Wordcount on 20 GB of random English text
Dataset: uniform sampling from a spell checker list
Java included for reference
Pydoop at CRS4
Core: computational biology applications for analyzing data generated by our Sequencing and Genotyping Platform; the vast majority of the code is written in Python
Bio Suite (biodoop-seal.sourceforge.net):
  Short Read Alignment: > 4 TB / week
  Genotype Calling: 7K individuals in < 1 day
  Flow Cytometry: in development
Summary MapReduce is a big deal :) Strengths: large datasets, scalability, ease of use Weaknesses: overhead, lower raw performance MapReduce vs more traditional models MR: low communication, coarse-grained, data-intensive Threads/MPI: high communication, fine-grained, CPU-intensive As with any set of tools, choose according to your problem Solid open source implementation available (Hadoop) Full-fledged Python/HDFS API available (Pydoop)
Appendix: For Further Reading I
H. Sutter, "The Free Lunch is Over: a Fundamental Turn Toward Concurrency in Software", Dr. Dobb's Journal 30(3), 2005.
J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", in OSDI 2004: Sixth Symposium on Operating System Design and Implementation, 2004.
http://hadoop.apache.org
http://pydoop.sourceforge.net
Appendix: For Further Reading II
S. Leo and G. Zanetti, "Pydoop: a Python MapReduce and HDFS API for Hadoop", in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC 2010), pages 819-825, 2010.
S. Leo, F. Santoni, and G. Zanetti, "Biodoop: Bioinformatics on Hadoop", in The 38th International Conference on Parallel Processing Workshops (ICPPW 2009), pages 415-422, 2009.