Getting Started with Hadoop with Amazon's Elastic MapReduce
Scott Hendrickson
scott@drskippy.net
http://drskippy.net/projects/emr-hadoopmeetup.pdf
Boulder/Denver Hadoop Meetup, 8 July 2010
Agenda
1. Amazon Web Services
2. Interlude: Solving problems with Map and Reduce
3. Running MapReduce on Amazon Elastic MapReduce
   - Example 1: Streaming Work Flow with AWS Management Console
   - Example 2: Word count (slightly more useful)
   - Example 3: elastic-mapreduce command line tool
4. References and Notes
What is Amazon Web Services?

For a first Hadoop project on AWS, use these services:
- Elastic Compute Cloud (EC2)
- Amazon Simple Storage Service (S3)
- Elastic MapReduce (EMR)

For future projects, AWS is much more: SimpleDB, Relational Database Service (RDS), Simple Queue Service (SQS), Simple Notification Service (SNS), Alexa, Mechanical Turk...
Signing up for AWS
1. Create an AWS account - http://aws.amazon.com/
2. Sign up for EC2 cloud compute services - http://aws.amazon.com/ec2/
3. Set up Security Credentials (under the Account > Security Credentials menu). There are three kinds of credentials; create an Access Key and use it to access S3 storage.
4. Sign up for S3 storage services - http://aws.amazon.com/s3/
5. Sign up for EMR - http://aws.amazon.com/elasticmapreduce/
What are S3 buckets?

Streaming EMR projects use Simple Storage Service (S3) buckets for data, code, logging and output.

Bucket: a container for objects stored in Amazon S3. Every object is contained in a bucket. Bucket names must be globally unique.

Object: an entity stored in Amazon S3, consisting of object data and metadata. Metadata is a set of key-value pairs; object data is opaque.

Object key: an object is uniquely identified within a bucket by a key (name) and a version ID.
Accessing objects in S3 buckets

Want to:
1. Move data into and out of S3 buckets
2. Set access privileges

Tools:
- The S3 console in your AWS control panel is adequate for managing S3 buckets and objects one at a time
- Other browser options, good for multiple file upload/download: Firefox S3 (https://addons.mozilla.org/en-us/firefox/addon/3247/), or, minimally, the S3 plug-in for Chrome (https://chrome.google.com/extensions/detail/appeggcmoaojledegaonmdaakfhjhchf)
- Programmatic options, via web services (both SOAP-y and REST-ful): wget, curl, Python, Ruby, Java...
S3 Bucket Example 1 - RESTful GET

Example - Image object
Bucket: bsi-test
Key: image.jpg
Object: JPEG structured data (the data from image.jpg)
RESTful GET access, use URL: http://s3.amazonaws.com/bsi-test/image.jpg

Example - Text file object
Bucket: bsi-test
Key: foobar
Object: text
RESTful GET access, use URL: http://s3.amazonaws.com/bsi-test/foobar
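The URL pattern above is S3's path-style REST addressing, which can be sketched in a line of Python. (s3_url is a hypothetical helper for illustration, not part of any AWS SDK; virtual-hosted style, http://<bucket>.s3.amazonaws.com/<key>, also works.)

```python
# Build the path-style RESTful URL for an S3 object from bucket and key.
# (Illustrative helper only; assumes the object is publicly readable.)
def s3_url(bucket, key):
    return "http://s3.amazonaws.com/%s/%s" % (bucket, key)

url = s3_url("bsi-test", "image.jpg")
# url == "http://s3.amazonaws.com/bsi-test/image.jpg"
```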
S3 Bucket Example 2 - Python, Boto, Metadata

    from boto.s3.connection import S3Connection

    conn = S3Connection('key-id', 'secret-key')
    bucket = conn.get_bucket('bsi-test')

    k = bucket.get_key('image.jpg')
    print "Value for key x-amz-meta-s3fox-modifiedtime is:"
    print k.get_metadata('s3fox-modifiedtime')
    k.get_contents_to_filename('deleteme.jpg')

    k = bucket.get_key('foobar')
    print "Object value for key foobar is:"
    print k.get_contents_as_string()
    print "Value for key x-amz-meta-example-key is:"
    print k.get_metadata('example-key')
S3 Bucket Example 2 - Python, Boto, Metadata - Output

    scott@mowgli-ubuntu:~/dropbox/hadoop$ ./botoexample.py
    Value for key x-amz-meta-s3fox-modifiedtime is:
    1273869756000
    Object value for key foobar is:
    This is a test of S3
    Value for key x-amz-meta-example-key is:
    This is an example value.
What is Elastic MapReduce?

Hadoop: a hosted Hadoop framework running on EC2 and S3.
Job Flow: the processing steps EMR runs on a specified dataset using a set of Amazon EC2 instances.
S3 Bucket(s): input data, output, scripts, jars, logs.
Controlling Job Flows

Want to:
1. Configure jobs
2. Start jobs
3. Check status or stop jobs

Tools:
- AWS Management Console: https://console.aws.amazon.com/elasticmapreduce/home
- Command Line Tools (requires Ruby: sudo apt-get install ruby libopenssl-ruby): http://developer.amazonwebservices.com/connect/entry.jspa?externalid=2264&categoryid=262
- API calls defined by the service (REST-ful and SOAP-y)
EMR Example 1 - Running a simple Work Flow from the AWS Management Console

Hold up a minute...! What problem are we solving?
Interlude: Solving problems with Map and Reduce
Central MapReduce Ideas

Operate on key-value pairs. The data scientist provides map and reduce:

    (input) <k1, v1> -> map -> <k2, v2> -> combine, sort -> <k2, [v2, ...]> -> reduce -> <k3, v3> (output)

- Optional: a combine step provided in the map may significantly reduce bandwidth between workers.
- An efficient sort is provided by the MapReduce library; this implies an efficient compare(k2_a, k2_b).
- Implicit parallelization: splitting and distributing data, starting maps and reduces, collecting output.
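The key-value flow above can be sketched in a few lines of single-process Python. (map_reduce is a hypothetical helper for illustration only; Hadoop distributes each stage across many workers.)

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    # map: each (k1, v1) record -> a list of intermediate (k2, v2) pairs
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))
    # sort: bring equal k2 together (the library-provided "efficient sort")
    intermediate.sort(key=itemgetter(0))
    # reduce: each (k2, [v2, ...]) group -> one (k3, v3) output pair
    return [reducer(k2, [v for _, v in group])
            for k2, group in groupby(intermediate, key=itemgetter(0))]

# Word count expressed in this framework:
result = map_reduce(
    [("file1", "Hello World Bye World")],
    lambda k, v: [(word, 1) for word in v.split()],
    lambda k, values: (k, sum(values)))
# result == [("Bye", 1), ("Hello", 1), ("World", 2)]
```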
Key components of a MapReduce framework (Wikipedia, http://en.wikipedia.org/wiki/MapReduce)

The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:
1. input reader
2. Map function
3. partition function
4. compare function
5. Reduce function
6. output writer
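Item 3, the partition function, decides which reducer receives each intermediate key. Hadoop's default (HashPartitioner) hashes the key modulo the number of reducers, so every value for a given key lands on the same reducer. A sketch of the idea, using Python's built-in hash as a stand-in for Java's hashCode:

```python
# Route an intermediate key to one of num_reducers reducers.
# (Illustrative stand-in for Hadoop's HashPartitioner; Python's hash()
# differs from Java's hashCode, but the routing idea is the same.)
def default_partition(key, num_reducers):
    return hash(key) % num_reducers

keys = ["Hello", "World", "Hello", "Bye"]
buckets = [default_partition(k, 2) for k in keys]
# Repeated keys always map to the same bucket.
```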
Google Tutorial View
1. The MapReduce library shards the input files and starts up many copies on a cluster.
2. The master assigns work to workers. There are map and reduce tasks.
3. Workers assigned map tasks read the contents of an input shard, parse key-value pairs and pass them to the map function. Intermediate key-value pairs produced by the map function are buffered in memory.
4. Periodically, buffered pairs are written to disk, partitioned into regions. Locations of the buffered pairs on local disk are passed to the master.
5. When a reduce worker has read all intermediate data, it sorts by the intermediate keys. All occurrences of a key are grouped together.
6. Reduce workers pass a key and the corresponding set of intermediate values to the reduce function.
7. Output of the reduce function is appended to a final output file for each reduce partition.
MapReduce Example 1 - Word Count - Data (from the Apache Hadoop tutorial)

file1: Hello World Bye World
file2: Hello Hadoop Goodbye Hadoop
MapReduce Example 1 - Word Count - Map

The first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>

The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
MapReduce Example 1 - Word Count - Sort and Combine

The combined output of the first map:
< Bye, 1>
< Hello, 1>
< World, 2>

The combined output of the second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
MapReduce Example 1 - Word Count - Sort and Reduce

The Reducer method sums up the values for each key. The output of the job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
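The whole job can be checked in plain Python: a Counter per file plays the role of map plus combine (local counts), and merging the Counters is the reduce. This is a verification sketch only, not Hadoop code.

```python
from collections import Counter

files = {
    "file1": "Hello World Bye World",
    "file2": "Hello Hadoop Goodbye Hadoop",
}
# map + combine: one Counter of local word counts per file
per_file = [Counter(text.split()) for text in files.values()]
# reduce: merge the per-file counts into job totals
totals = sum(per_file, Counter())
# totals matches the job output above:
# Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2
```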
What problems is MapReduce good at solving?

Themes: identify, transform, aggregate, filter, count, sort...
- The need for global knowledge of the data is (a) occasional, relative to the cost of the maps, and (b) confined to ordinality
- Discovery tasks (vs. high repetition of similar transactional tasks with many reads)
- Unstructured data (vs. tabular data with indexes!)
- Continuously updated data (indexing cost)
- Many, many, many machines (fault tolerance)
What problems is MapReduce good at solving?

Memes:
- MapReduce vs. SQL (read the comments too): http://www.data-miners.com/blog/2008/01/mapreduce-and-sql-aggregations.html
- MapReduce vs. Message Passing Interface (MPI): MPI is good for task parallelism and Hadoop is good for data parallelism (finite differences, finite elements, particle-in-cell...)
- MapReduce vs. column-oriented DBs: tabular data, indexes (cantankerous old farts!): http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/ and http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/
- MapReduce vs. relational DBs: http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php
Running MapReduce on Amazon Elastic MapReduce
Example 1: Streaming Work Flow with AWS Management Console
Example 1 - Add up integers

Data (one integer per line):

    3
    4
    -1
    4
    -3
    1
    1
    ...

Map:

    import sys

    for line in sys.stdin:
        print '%s%s%d' % ("sum", '\t', int(line))
Example 1 - Add up integers - Reduce

    import sys

    sum_of_ints = 0
    for line in sys.stdin:
        # key is always the same
        key, value = line.split('\t')
        try:
            sum_of_ints += int(value)
        except ValueError:
            pass
    try:
        print "%s%s%d" % (key, '\t', sum_of_ints)
    except NameError:
        # No items processed
        pass
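The reducer's logic can be exercised without Hadoop by feeding it tab-separated lines in memory. (reduce_lines is an illustrative stand-in that mirrors the streaming reducer above; it is not part of the job itself.)

```python
# Mirror of the streaming reducer: sum the values for the single "sum" key,
# skipping malformed values, and return nothing if no lines were processed.
def reduce_lines(lines):
    key = None
    total = 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        try:
            total += int(value)
        except ValueError:
            pass
    return None if key is None else (key, total)

result = reduce_lines(["sum\t3", "sum\t4", "sum\t-1"])
# result == ("sum", 6)
```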
Example 1 - Add up integers - Shell test

    cat ./input/ints.txt  | ./mapper.py >  ./inter
    cat ./input/ints1.txt | ./mapper.py >> ./inter
    cat ./input/ints2.txt | ./mapper.py >> ./inter
    cat ./input/ints3.txt | ./mapper.py >> ./inter
    echo "Intermediate output:"
    cat ./inter
    cat ./inter | sort | \
        ./reducer.py > ./output/cmdlineoutput.txt
Example 1 - Add up integers - Combiner in map

What was that comment earlier about an optional combiner? Summing locally in the mapper emits one pair per input split instead of one per line:

    import sys

    sum_of_ints = 0
    for line in sys.stdin:
        sum_of_ints += int(line)
    print '%s%s%d' % ("sum", '\t', sum_of_ints)
Example 1 - Add up integers - Combiner shell test

    cat ./input/ints.txt  | ./mapper_combine.py >  ./inter
    cat ./input/ints1.txt | ./mapper_combine.py >> ./inter
    cat ./input/ints2.txt | ./mapper_combine.py >> ./inter
    cat ./input/ints3.txt | ./mapper_combine.py >> ./inter
    echo "Intermediate output:"
    cat ./inter
    cat ./inter | sort | \
        ./reducer.py > ./output/cmdlinecomboutput.txt
Example 1 - Add up integers - AWS Console
1. Upload the onecount directory with the Firefox S3 add-on (FFS3)
2. Create a New Job Flow
   Name: onecount
   Job Flow: Run own app
   Job Type: Streaming
3. Input: bsi-test/onecount/input
   Output: bsi-test/onecount/outputconsole (must not exist)
   Mapper: bsi-test/onecount/mapper.py
   Reducer: bsi-test/onecount/reducer.py
   Extra Args: none
Example 1 - Add up integers - AWS Console (continued)
4. Instances: 4
   Type: small
   Keypair: No (Yes allows ssh to the Hadoop master)
   Log: yes
   Log Location: bsi-test/onecount/log
   Hadoop Debug: no
5. No bootstrap actions
6. Start it, and wait...
Example 2 - Word count (Slightly more useful)
Example 2 - Word count - Map

    import sys
    import string

    def read_input(file):
        for line in file:
            yield line.split()

    def main(separator='\t'):
        data = read_input(sys.stdin)
        for words in data:
            for word in words:
                lword = word.lower().strip(string.punctuation)
                print '%s%s%d' % (lword, separator, 1)
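The mapper's normalization step can be checked in isolation: lowercasing and stripping leading/trailing punctuation makes "Hello," and "hello" count as the same word. (normalize is a hypothetical helper extracted from the mapper for illustration.)

```python
import string

# Normalize a token the way the mapper does: lowercase, then strip
# leading and trailing punctuation.
def normalize(word):
    return word.lower().strip(string.punctuation)

# normalize("Hello,") == "hello"
# normalize("World")  == "world"
```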
Example 2 - Word count - Reduce

    import sys
    from itertools import groupby
    from operator import itemgetter

    def read_mapper_output(file, separator='\t'):
        for line in file:
            yield line.rstrip().split(separator, 1)

    def main(separator='\t'):
        data = read_mapper_output(sys.stdin, separator=separator)
        for current_word, group in groupby(data, itemgetter(0)):
            try:
                total_count = sum(int(count) for current_word, count in group)
                print "%s%s%d" % (current_word, separator, total_count)
            except ValueError:
                pass
Example 2 - Word count - Shell test

    echo "foo foo quux labs foo bar quux" | ./mapper.py
    echo "foo foo quux labs foo bar quux" | ./mapper.py \
        | sort | ./reducer.py
    cat ./input/alice.txt | ./mapper.py \
        | sort | ./reducer.py > ./output/cmdlineoutput.txt
Example 2 - Word count - AWS Console
1. Upload the mywordcount directory with the Firefox S3 add-on (FFS3)
2. Create a New Job Flow
   Name: mywordcount
   Job Flow: Run own app
   Job Type: Streaming
3. Input: bsi-test/mywordcount/input
   Output: bsi-test/mywordcount/outputconsole (must not exist)
   Mapper: bsi-test/mywordcount/mapper.py
   Reducer: bsi-test/mywordcount/reducer.py
   Extra Args: none
Example 2 - Word count - AWS Console (continued)
4. Instances: 4
   Type: small
   Keypair: No (Yes allows ssh to the Hadoop master)
   Log: yes
   Log Location: bsi-test/mywordcount/log
   Hadoop Debug: no
5. No bootstrap actions
6. Start it, and wait...
Example 3 - elastic-mapreduce command line tool
Example 3 - elastic-mapreduce command line tool

Word count (again, only better):

    /usr/local/emr-ruby/elastic-mapreduce --create \
      --stream \
      --num-instances 2 \
      --name from-elastic-mapreduce \
      --input s3n://bsi-test/mywordcount/input \
      --output s3n://bsi-test/mywordcount/outputrubytool \
      --mapper s3n://bsi-test/mywordcount/mapper.py \
      --reducer s3n://bsi-test/mywordcount/reducer.py \
      --log-uri s3n://bsi-test/mywordcount/log

    /usr/local/emr-ruby/elastic-mapreduce --list
References and Notes
MapReduce Concepts Links
- Google MapReduce Tutorial: http://code.google.com/edu/parallel/mapreduce-tutorial.html
- Apache Hadoop tutorial: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
- Google Code University presentation on MapReduce: http://code.google.com/edu/submissions/mapreduce/listing.html
- MapReduce framework paper: http://labs.google.com/papers/mapreduce-osdi04.pdf
Amazon Web Services Links
- EMR Getting Started documentation: http://aws.amazon.com/documentation/elasticmapreduce/
- Getting started with Amazon S3: http://docs.amazonwebservices.com/amazons3/2006-03-01/gsg/
- Pig on EMR: http://s3.amazonaws.com/awsvideos/amazonelasticmapreduce/ElasticMapReduce-PigTutorial.html
- Boto Python library (multiple Amazon services): http://code.google.com/p/boto/
Machine Learning
- Linear speedup (with processor number) for locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, Gaussian discriminant analysis (GDA), EM, and backpropagation (NN): http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
- Mahout framework: http://mahout.apache.org/
Examples Links
- Wordcount example/tutorial: http://www.michael-noll.com/wiki/writing_an_hadoop_mapreduce_program_in_python
- CouchDB and MapReduce (interesting examples of MR implementations for common problems): http://wiki.apache.org/couchdb/view_snippets
- This presentation: http://drskippy.net/projects/emr-hadoopmeetup.pdf
- Presentation source, example files etc.: http://drskippy.net/projects/emr-hadoopmeetup.zip