
Elastic Map Reduce. Shadi Khalifa, Database Systems Laboratory (DSL), khalifa@cs.queensu.ca

The Amazon Web Services Universe (layered diagram): Cross Service Features, Management Interface, Platform Services, Infrastructure Services

Infrastructure Services: EC2 (http://aws.amazon.com/ec2/), VPC (http://aws.amazon.com/vpc/), S3 (http://aws.amazon.com/s3/), EBS (http://aws.amazon.com/ebs/)

Amazon Simple Storage Service (S3). Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. You can write, read, and delete objects containing from 1 byte to 5 terabytes of data each; the number of objects you can store is unlimited. Each object is stored in a bucket and retrieved via a unique, developer-assigned key. A bucket can be stored in one of several Regions. You can choose a Region to optimize for latency, minimize costs, or address regulatory requirements. Objects stored in a Region never leave it unless you transfer them out. Authentication mechanisms are provided to keep data secure from unauthorized access: objects can be made private or public, and rights can be granted to specific users. S3 charges per GB-month of storage, per I/O request, and per data modification request.
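
To make the bucket/key model concrete, here is a minimal sketch of a put and a get using the AWS SDK for Java (v1); the bucket name my-dsl-bucket and the key names are hypothetical, and the SDK jar is assumed to be on the classpath:

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

public class S3Sketch {
    public static void main(String[] args) {
        // Credentials and region are picked up from the default provider chain
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Write: store a local file under a developer-assigned key in a bucket
        s3.putObject("my-dsl-bucket", "input/UserDetails.txt", new File("UserDetails.txt"));

        // Read: retrieve the same object back by bucket + key
        S3Object obj = s3.getObject("my-dsl-bucket", "input/UserDetails.txt");
        System.out.println("Content type: " + obj.getObjectMetadata().getContentType());
    }
}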


Platform Services: EMR (http://aws.amazon.com/elasticmapreduce/), RDS (http://aws.amazon.com/rds/), DynamoDB (http://aws.amazon.com/dynamodb/), Beanstalk (http://aws.amazon.com/elasticbeanstalk/)

Amazon Elastic MapReduce (EMR). Amazon EMR is a web service that makes it easy to quickly and cost-effectively process vast amounts of data using Hadoop. Amazon EMR distributes the data and processing across a resizable cluster of Amazon EC2 instances. With Amazon EMR you can launch a persistent cluster that stays up indefinitely, or a temporary cluster that terminates after the analysis is complete. Amazon EMR supports a variety of Amazon EC2 instance types and Amazon EC2 pricing options (On-Demand, Reserved, and Spot). When launching an Amazon EMR cluster (also called a "job flow"), you choose how many and what type of Amazon EC2 instances to provision. The Amazon EMR price is in addition to the Amazon EC2 price. Amazon EMR is used in a variety of applications, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.
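
Besides the console walkthrough later in these slides, a job flow can also be launched programmatically. Below is a minimal sketch using the AWS SDK for Java (v1); the jar location, step arguments, instance types, release label, and IAM role names are hypothetical placeholders, not values from the original slides:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.*;

public class LaunchJobFlow {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // One step: run our custom jar with its three runtime arguments
        StepConfig step = new StepConfig()
            .withName("SMS Reports")
            .withActionOnFailure("TERMINATE_CLUSTER")
            .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("s3://my-dsl-bucket/SmsReports.jar")
                .withArgs("s3://my-dsl-bucket/input/user",
                          "s3://my-dsl-bucket/input/delivery",
                          "s3://my-dsl-bucket/output"));

        // A temporary cluster: it terminates once the step completes
        RunJobFlowRequest request = new RunJobFlowRequest()
            .withName("EMR demo")
            .withReleaseLabel("emr-5.20.0")
            .withSteps(step)
            .withLogUri("s3://my-dsl-bucket/logs/")
            .withServiceRole("EMR_DefaultRole")
            .withJobFlowRole("EMR_EC2_DefaultRole")
            .withInstances(new JobFlowInstancesConfig()
                .withInstanceCount(3)
                .withMasterInstanceType("m4.large")
                .withSlaveInstanceType("m4.large")
                .withKeepJobFlowAliveWhenNoSteps(false));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Cluster id: " + result.getJobFlowId());
    }
}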


What is Hadoop?

Example: Word Count. Objective: count the number of occurrences of each word in the provided input files. How it works: in the Map phase the text is tokenized into words, and for each word we emit a key-value pair where the key is the word itself and the value is set to 1. In the Reduce phase the pairs are grouped by key and the values for the same key are added. A minimal sketch of the two classes follows.
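
As a sketch (the original slide shows this only as a diagram), the mapper and reducer could look like this, using the same org.apache.hadoop.mapred API as the join example later in the deck:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    // Map phase: tokenize each input line and emit <word, 1>
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                output.collect(new Text(token), ONE);
            }
        }
    }
}

class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Reduce phase: values for the same word arrive grouped; add them up
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}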


Example 2: Joins with MapReduce. A retailer has a customer database and needs to run some promotions. He chooses bulk SMS as the promotion channel, which is handled by a third party. Once the SMS is sent, the SMS provider returns the delivery status back to the retailer. We have 2 input files:

UserDetails.txt (every record is of the format mobile number, customer name):
123 456, Jim
456 123, Tom
789 123, Harry
789 456, Richa

DeliveryDetails.txt (every record is of the format mobile number, delivery status):
123 456, Delivered
456 123, Pending
789 123, Failed
789 456, Resend

Objective: associate the customer name with the delivery status. Expected output:
Jim, Delivered
Tom, Pending
Harry, Failed
Richa, Resend

http://kickstarthadoop.blogspot.ca/2011/09/joins-with-plain-map-reduce.html

Formulating as a MapReduce: since we have 2 input files, we need 2 Map functions (one for each input file).

UserDetails.txt mapper output (Key: mobile number; Value: CD tag for the customer details file + customer name):
<123 456, CD~Jim>
<456 123, CD~Tom>
<789 123, CD~Harry>
<789 456, CD~Richa>

DeliveryDetails.txt mapper output (Key: mobile number; Value: DR tag for the delivery report file + status):
<123 456, DR~Delivered>
<456 123, DR~Pending>
<789 123, DR~Failed>
<789 456, DR~Resend>

At the reducer, every key has two values (for simplicity): one with prefix CD and the other with prefix DR. From CD get the customer name corresponding to the cell number (the input key), and from DR get the status.

Reduce input:
<123 456, CD~Jim> <123 456, DR~Delivered>
<456 123, CD~Tom> <456 123, DR~Pending>

Reduce output (Key: customer name; Value: status message):
<Jim, Delivered>
<Tom, Pending>

Tools you will need:
Eclipse IDE for Java EE Developers: http://www.eclipse.org/downloads/packages/eclipse-ide-java-ee-developers/keplersr1
hadoop-core-1.2.1.jar: http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1
Amazon Web Services account: https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1

Create a new Java Project

Add the Hadoop jar to the project: create a lib folder in the project, copy and paste the Hadoop jar into the lib folder, and add the jar to the project build path.

Map classes: 1) UserFileMapper

Sample input (UserDetails.txt) and mapper output:
123 456, Jim   -> <123 456, CD~Jim>
456 123, Tom   -> <456 123, CD~Tom>
789 123, Harry -> <789 123, CD~Harry>
789 456, Richa -> <789 456, CD~Richa>

public class UserFileMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    //variables to process Consumer Details
    private String cellNumber, customerName, fileTag = "CD~";

    /* map method that processes ConsumerDetails.txt and frames the initial key-value pairs
       Key (Text): mobile number
       Value (Text): an identifier indicating the source of input (CD for the customer details file) + customer name */
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        //taking one line/record at a time and parsing it into a key-value pair
        String line = value.toString();
        String splitArray[] = line.split(",");
        cellNumber = splitArray[0].trim();
        customerName = splitArray[1].trim();

        //sending the key-value pair out of the mapper
        output.collect(new Text(cellNumber), new Text(fileTag + customerName));
    }
}

Map classes: 2) DeliveryFileMapper

Sample input (DeliveryDetails.txt) and mapper output:
123 456, Delivered -> <123 456, DR~Delivered>
456 123, Pending   -> <456 123, DR~Pending>
789 123, Failed    -> <789 123, DR~Failed>
789 456, Resend    -> <789 456, DR~Resend>

public class DeliveryFileMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    //variables to process the delivery report
    private String cellNumber, deliveryCode, fileTag = "DR~";

    /* map method that processes DeliveryReport.txt and frames the initial key-value pairs
       Key (Text): mobile number
       Value (Text): an identifier indicating the source of input (DR for the delivery report file) + status code */
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        //taking one line/record at a time and parsing it into a key-value pair
        String line = value.toString();
        String splitArray[] = line.split(",");
        cellNumber = splitArray[0].trim();
        deliveryCode = splitArray[1].trim();

        //sending the key-value pair out of the mapper
        output.collect(new Text(cellNumber), new Text(fileTag + deliveryCode));
    }
}

Reduce Class

Reduce input:                               Reduce output:
<123 456, CD~Jim> <123 456, DR~Delivered>   <Jim, Delivered>
<456 123, CD~Tom> <456 123, DR~Pending>     <Tom, Pending>

public class SmsReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

    //variables to aid the join process
    private String customerName, deliveryReport;

    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        while (values.hasNext()) {
            String currValue = values.next().toString();
            String valueSplitted[] = currValue.split("~");

            /* identifying the record source that corresponds to a cell number
               and parsing the values accordingly */
            if (valueSplitted[0].equals("CD")) {
                customerName = valueSplitted[1].trim();
            } else if (valueSplitted[0].equals("DR")) {
                deliveryReport = valueSplitted[1].trim();
            }
        }
        output.collect(new Text(customerName), new Text(deliveryReport));
    }
}

Driver Class

Since we have 2 map functions and 2 input files, MultipleInputs is used to specify which input goes into which mapper.

public class SmsDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        //get the configuration parameters and assign a job name
        JobConf conf = new JobConf(getConf(), SmsDriver.class);
        conf.setJobName("SMS Reports");

        //setting key and value types for the mapper and reducer outputs
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        //specifying the custom reducer class
        conf.setReducerClass(SmsReducer.class);

        // If only one mapper existed:
        // conf.setMapperClass(Mapper.class);
        // FileInputFormat.addInputPath(conf, new Path(args[0]));

        //specifying the input directories (at runtime) and mappers independently for inputs from multiple sources
        MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, UserFileMapper.class);
        MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, DeliveryFileMapper.class);

        //specifying the output directory at runtime
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new SmsDriver(), args);
        System.exit(res);
    }
}
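
Once exported as a jar (next slide), the driver takes its three paths as runtime arguments; the jar, class, and bucket names below are hypothetical, and the paths may be local, HDFS, or S3 depending on where the job runs:

hadoop jar SmsReports.jar SmsDriver s3://my-dsl-bucket/input/user s3://my-dsl-bucket/input/delivery s3://my-dsl-bucket/output

On EMR, these same three arguments are supplied as the step arguments when configuring the job flow.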

Export Project to Jar

Uploading the Project Jar and Data to S3: https://console.aws.amazon.com/s3/home?region=us-east-1

Hadoop Execution Overview

Using Amazon EMR: https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1

Using Amazon EMR: make sure that the output folder (the 3rd argument in our example) does NOT already exist on S3; MapReduce will create it.
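
When re-running a job, one common guard (a sketch, not from the original slides) is to delete a stale output directory in the driver before submitting, using the standard Hadoop FileSystem API:

// inside SmsDriver.run(), before JobClient.runJob(conf)
// requires: import org.apache.hadoop.fs.FileSystem;
Path outputPath = new Path(args[2]);
FileSystem fs = outputPath.getFileSystem(conf);
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true);  // true = recursive delete
}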

Using Amazon EMR: set the number and type of the EC2 instances used to process your application.

Using Amazon EMR: remember to ENABLE debugging and provide the log path.

Performance: using Amazon CloudWatch (the Monitoring tab), you can check the performance of the job.

Debugging the Job: for detailed information on the MapReduce progress, click on the syslog link.

Accessing the Results on S3

Elastic Map Reduce: Questions?