Mrs: MapReduce for Scientific Computing in Python




Mrs: MapReduce for Scientific Computing in Python
Andrew McNabb, Jeff Lund, and Kevin Seppi
Brigham Young University
November 16, 2012

Large-scale problems require parallel processing. Communication in parallel processing is hard. MapReduce abstracts away interprocess communication: the user only has to identify which parts of the problem are embarrassingly parallel.
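The division of labor just described can be sketched in a few lines of plain Python. This toy serial runner is an illustration of the MapReduce model, not Mrs's actual implementation: the framework owns the shuffle (grouping by key) between the user's map and reduce functions, and each map call is independent of the others.

```python
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    """Toy serial MapReduce: map, shuffle (group by key), reduce."""
    # Map phase: each input record is processed independently,
    # which is what makes this part embarrassingly parallel.
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: one call per intermediate key.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

def word_map(line_num, line_text):
    for word in line_text.split():
        yield word, 1

def word_reduce(word, counts):
    return sum(counts)

result = run_mapreduce(word_map, word_reduce, enumerate(["a b a", "b a"]))
```

The user writes only `word_map` and `word_reduce`; everything else is the framework's responsibility, including (in a real system) distributing the map calls across machines.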

WordCount

wordcount.py:

import mrs

class WordCount(mrs.MapReduce):
    def map(self, line_num, line_text):
        for word in line_text.split():
            yield (word, 1)

    def reduce(self, word, counts):
        yield sum(counts)

if __name__ == '__main__':
    mrs.main(WordCount)

Iterative MapReduce

Hadoop
Hadoop is the most widely used open source MapReduce implementation
Hadoop was designed for big data, not scientific computing
Requires the use of HDFS and a dedicated cluster

MapReduce in Scientific Computing
What does an ideal MapReduce implementation look like in the context of scientific computing?

Ease of Development
Rapid prototyping
Testability
Debuggability

Ease of Development

WordCount.java:

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Ease of Deployment
Dedicated cluster vs. supercomputers and private clusters
Work with any filesystem
Work with any scheduler

Ease of Deployment

pbs-hadoop.sh:

# Step 1: Find the network address.
ADDR=$(/sbin/ip -o -4 addr list $INTERFACE | sed -e 's;^.* inet \(.*\)/.*$;\1;')

# Step 2: Set up the Hadoop configuration.
export HADOOP_LOG_DIR=$JOBDIR/log
mkdir $HADOOP_LOG_DIR
export HADOOP_CONF_DIR=$JOBDIR/conf
cp -R $HADOOP_HOME/conf $HADOOP_CONF_DIR
sed -e "s/MASTER_IP_ADDRESS/$ADDR/g" \
    -e "s@HADOOP_TMP_DIR@$JOBDIR/tmp@g" \
    -e "s/MAP_TASKS/$MAP_TASKS/g" \
    -e "s/REDUCE_TASKS/$REDUCE_TASKS/g" \
    -e "s/TASKS_PER_NODE/$TASKS_PER_NODE/g" \
    <$HADOOP_HOME/conf/hadoop-site.xml \
    >$HADOOP_CONF_DIR/hadoop-site.xml

# Step 3: Start daemons on the master.
HADOOP="$HADOOP_HOME/bin/hadoop"
$HADOOP namenode -format  # format the HDFS
$HADOOP_HOME/bin/hadoop-daemon.sh start namenode
$HADOOP_HOME/bin/hadoop-daemon.sh start jobtracker

# Step 4: Start daemons on the slaves.
ENV=". $HOME/.bashrc; export HADOOP_CONF_DIR=$HADOOP_CONF_DIR; export HADOOP_LOG_DIR=$HADOOP_LOG_DIR"
pbsdsh -u bash -c "$ENV; $HADOOP datanode" &
pbsdsh -u bash -c "$ENV; $HADOOP tasktracker" &
sleep 15

# Step 5: Run the user program.
$HADOOP dfs -put $INPUT $HDFS_INPUT
$HADOOP jar $PROGRAM "${ARGS[@]}"
$HADOOP dfs -get $HDFS_OUTPUT $OUTPUT

# Step 6: Stop daemons on the slaves and master.
kill %2  # kill tasktracker
kill %1  # kill datanode
$HADOOP_HOME/bin/hadoop-daemon.sh stop jobtracker
$HADOOP_HOME/bin/hadoop-daemon.sh stop namenode

Other Issues
Iterative performance
Fault tolerance
Interoperability

What is Mrs?
Aims to be a simple-to-use framework
Implemented in pure Python
Designed with scientific computing in mind

Why Python?
Python is nearly ubiquitous
Mrs needs no dependencies outside the standard library
Familiarity and readability
Easy interoperability
Debugging and testing
One downside: the GIL

Iterative MapReduce

Automatic Serialization
Serialization happens every time a task communicates with another machine
Mrs automatically handles this with pickle
Hadoop requires Writable classes everywhere
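The convenience of pickle-based serialization is easy to see with the standard library alone: any picklable Python value crosses the wire without a Writable-style wrapper class. The key/value pair below is a made-up example:

```python
import pickle

# A (key, value) pair as a map task might emit it: arbitrary,
# nested Python objects rather than framework-specific wrapper types.
pair = ("doc42", {"tokens": ["map", "reduce"], "weights": (0.5, 1.5)})

wire = pickle.dumps(pair)       # bytes as sent to another machine
restored = pickle.loads(wire)   # reconstructed on the receiving end

assert restored == pair
```

The trade-off is that pickle is Python-specific, which matters less for a pure-Python framework like Mrs than it would for a polyglot system.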

Debugging: Run Modes
Serial
Mock Parallel
Parallel

Debugging: Random Number Generators
Seeding random number generators makes results reproducible
Each task needs a different seed
Mrs has a random function which lets you create a random number generator from an arbitrary number of offset parameters, e.g. rand = self.random(id, iter)
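The idea behind offset-based seeding can be sketched as follows. This illustrates the concept, not Mrs's exact implementation; the `task_random` name and the hashing scheme are assumptions for the example:

```python
import random

def task_random(base_seed, *offsets):
    """Return a random.Random seeded reproducibly from a base seed
    plus arbitrary offset parameters (e.g. task id and iteration).

    hash() over a tuple of ints is deterministic across runs, so the
    same (base_seed, offsets) always yields the same stream."""
    return random.Random(hash((base_seed,) + offsets))

# Same (id, iter) gives the same stream; a different task id gives
# an independent stream, so parallel runs stay reproducible.
a = task_random(42, 3, 7).random()
b = task_random(42, 3, 7).random()
c = task_random(42, 4, 7).random()
```

Here `a == b` (identical offsets) while `c` comes from a different stream, which is exactly the property needed to rerun and debug a single task in isolation.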

Performance and Case Studies
Interpreter overhead does not preclude good performance for Mrs. We demonstrate this on three different problems:
Halton Sequence: CPU-bound benchmark
Particle Swarm Optimization: CPU-bound application
Walk Analysis: I/O-bound application

Performance and Case Studies
Optimization story:
Make sure you have the right algorithm
Profile carefully
Run with PyPy
Rewrite the critical path in C
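The "profile carefully" step needs nothing beyond the standard library, in keeping with Mrs's no-dependency philosophy. A minimal cProfile session; `inner_loop` is a made-up stand-in for a hot map function:

```python
import cProfile
import io
import pstats

def inner_loop(n):
    # Stand-in for the computational kernel of a map task.
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
inner_loop(100000)
profiler.disable()

# Render the top entries by cumulative time into a string report.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
```

A report like this is what tells you whether a rewrite in C (or a switch to PyPy) is even worth attempting: the hot spot has to dominate the profile first.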

Monte Carlo Pi Estimation
Monte Carlo algorithm for computing the value of π by generating random points in a square
Very little data, but computationally intense
We can control how much computation each map task performs
[Figure: Halton sequence points in the square from -0.5 to 0.5 on both axes]
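The estimator itself is only a few lines of Python. This sketch uses a seeded pseudorandom generator over the unit square rather than the Halton sequence from the slide, but the computation per point is the same:

```python
import random

def estimate_pi(num_points, seed=0):
    """Monte Carlo pi: the fraction of random points in the unit
    square that fall inside the inscribed quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_points

est = estimate_pi(100000)
```

Each map task would run this loop over its own batch of points and emit its `inside` count; a single reduce sums the counts, which is why the problem moves almost no data while staying CPU-bound.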

Monte Carlo Pi Estimation
Mrs using pure Python
[Figure: time in seconds vs. points per task (10^0 to 10^10) for Hadoop (Java), Mrs (PyPy), and Mrs (CPython)]

Monte Carlo Pi Estimation
Python with the inner loop in C (using ctypes)
[Figure: time in seconds vs. points per task (10^0 to 10^11) for Hadoop (Java) and Mrs (CPython)]

Particle Swarm Optimization
Inspired by simulations of flocking birds
Particles interact while exploring
Map: motion and function evaluation
Reduce: communication
CPU-bound problem
[Figure: particles exploring a 2-D objective function]
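A minimal sketch of the particle update at the heart of PSO, minimizing a simple quadratic. The constants are typical textbook values, not necessarily those used in the talk, and this serial version omits the map/reduce split:

```python
import random

def pso_minimize(f, dim=1, particles=10, iters=100, seed=1):
    """Minimal PSO: each particle remembers its own best position,
    and all particles are pulled toward their own best and the
    swarm-wide best."""
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and attraction weights
    pos = [[rng.uniform(-10, 10) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    pbest = [p[:] for p in pos]
    gbest = min(pbest, key=f)[:]
    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if f(pos[i]) < f(pbest[i]):
                pbest[i] = pos[i][:]
                if f(pos[i]) < f(gbest):
                    gbest = pos[i][:]
    return gbest

best = pso_minimize(lambda x: x[0] ** 2)
```

The per-particle motion and function evaluation in the inner loop is the independent, CPU-bound work suited to map tasks; updating the swarm-wide best is the communication step suited to reduce.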

Particle Swarm Optimization
Convergence plots for the Rosenbrock-250 function
[Figure: best value vs. time in minutes for serial and parallel runs]

Walk Analyzer
Involves analyzing random walks in a graph
Heavily I/O bound
Average Hadoop time: 1:06:53
Average Mrs time: 52:55

Where to Find Mrs
Mrs homepage with links to source, documentation, mailing list, etc.: http://code.google.com/p/mrs-mapreduce
In case you forget the URL, just google "mrs mapreduce" :)

Other Cool Features I Neglected to Mention...
Asynchronous MapReduce
Concurrent convergence checks
Memory logging
Merge sort dataset
Custom serializers