A Tutorial Introduction to Big Data: Hands-On Data Analytics over EMR
Robert Grossman, University of Chicago / Open Data Group
Collin Bennett, Open Data Group
November 12, 2012
Amazon AWS Elastic MapReduce allows MapReduce jobs to run over the Amazon elastic cloud infrastructure with minimal setup.
Running Custom Jobs
Elastic MapReduce allows different types of jobs to be run. Streaming accepts anything that handles standard I/O, which lets you mix and match languages, shell scripts, etc.
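To illustrate that contract, here is a minimal sketch (our illustration, not part of the tutorial code) of the simplest possible streaming step, an identity mapper. Anything that reads records on standard input and writes lines to standard output can serve as a mapper or reducer.

    #!/usr/bin/env python
    # Minimal sketch: any executable with this shape can be a streaming step.
    import sys

    for line in sys.stdin:
        sys.stdout.write(line)  # identity mapper: pass each record through unchanged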
Choose a Language (or Languages)
The streaming interface of Elastic MapReduce lets us use a different language for each step of the job. Choose wisely: pick a language that fits the task. For the examples, Python is used for both maps; one reducer is in R, the other in Python.
tutorials.opendatagroup.com
Create a Job
Example 1
A MapReduce job to read in data and build plots to help with Exploratory Data Analysis. The data is already in S3; the output will be written to HDFS.
Example 2
A MapReduce job to read in data and build models in PMML. The data is already in S3; the output will be written to HDFS.
Amazon Hadoop Streaming Job Application
Set up Job Parameters
Values
INPUT: tutorials.opendatagroup.com/sc12/data
OUTPUT: tutorials.opendatagroup.com/sc12/out
MAPPER: tutorials.opendatagroup.com/sc12/emr/mapper.py
REDUCER: tutorials.opendatagroup.com/sc12/emr/reducer-plots.py
EXTRA ARGS:
Instance Types and Count
Machine Access and S3 Logs
SSH Access
If you do not specify an EC2 keypair, you cannot log into the nodes. If everything works, this is usually not necessary.
Logging
If you specify an Amazon S3 log path, the standard Hadoop logs will be written to the S3 bucket of your choice. This directory must exist. Logging is helpful even if everything works, as you can learn things about the job.
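The steps above use the AWS console, but the same job flow can also be created programmatically. The sketch below uses the boto library's EMR support; the bucket and keypair names are hypothetical placeholders, standing in for the values listed in the job parameters above.

    from boto.emr.connection import EmrConnection
    from boto.emr.step import StreamingStep

    # Hypothetical names; substitute your own credentials, bucket, and keypair.
    conn = EmrConnection("<aws_access_key_id>", "<aws_secret_access_key>")

    step = StreamingStep(
        name="EDA plots",
        mapper="s3n://<your-bucket>/sc12/emr/mapper.py",
        reducer="s3n://<your-bucket>/sc12/emr/reducer-plots.py",
        input="s3n://<your-bucket>/sc12/data",
        output="s3n://<your-bucket>/sc12/out")

    jobid = conn.run_jobflow(
        name="SC12 EDA tutorial",
        log_uri="s3n://<your-bucket>/logs",   # enables the S3 logging described above
        ec2_keyname="<your-ec2-keypair>",     # enables SSH access to the nodes
        master_instance_type="m1.large",
        slave_instance_type="m1.large",
        num_instances=3,
        steps=[step])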
Bootstrap
Bootstrap
If you do not specify a bootstrap, you get the vanilla version of EMR. If you want to add any packages, run any initialization scripts, etc., you have to do it with a bootstrap action. EMR offers canned bootstraps; we run a custom action.
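Continuing the boto sketch above, a custom bootstrap action can be attached when the job flow is created; the script path here is hypothetical.

    from boto.emr.bootstrap_action import BootstrapAction

    # Hypothetical bootstrap script that installs extra packages on each node.
    install = BootstrapAction(
        "install-packages",                   # display name of the action
        "s3n://<your-bucket>/bootstrap.sh",   # script run on every node at startup
        None)                                 # arguments for the script, if any

    # Pass it to run_jobflow alongside the streaming step:
    # conn.run_jobflow(..., bootstrap_actions=[install], steps=[step])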
Job Summary
Let the Magic Happen
Check on Instances under EC2
Machine Access
You can ssh into the master node if you specified an EC2 keypair during configuration. To access a slave node:
1. scp the EC2 keypair to the master node
2. ssh into the master node
3. ssh into the slave node using the EC2 keypair: ssh -i <path_to_key>/<key> hadoop@<ip>
Text-browser JobTracker Access from the Master Node
Job Output
The job we ran is the Exploratory Data Analysis (EDA) step. It generates plots of the data as SVG files. The SVG plots are written to HDFS. SVG images are XML, not binary. (With the same bootstrap and mapper, we can run a job to build models in PMML.)
Each reducer produces a part-* file
Images are in the Files
Extracting Images
Depending on how many keys and reducers you have, there will be 0, 1, or more SVG plots in each output file.
1. Download each part-* file.
2. Check how many images are in a file: grep '<svg' part-00000 | wc -l
If there is only one, rename the file to part-00000.svg. If there is more than one, split the file up in your favorite text editor.
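Instead of a text editor, a short script can do the splitting. This is a sketch of our own (not part of the tutorial); it assumes each SVG document in the part file ends with a closing </svg> tag.

    #!/usr/bin/env python
    # Split a part-* file holding one or more concatenated SVG documents
    # into one .svg file per image. Usage: python split_svgs.py part-00000
    import sys

    def split_svgs(path):
        current, count = [], 0
        for line in open(path):
            current.append(line)
            if "</svg>" in line:          # assumes each image ends with </svg>
                out = open("%s-%d.svg" % (path, count), "w")
                out.write("".join(current))
                out.close()
                current, count = [], count + 1
        return count

    if __name__ == "__main__":
        print "wrote %d image(s)" % split_svgs(sys.argv[1])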
View in a Web Browser
EDA Job Details While we wait for your job to finish
Shuffle and Sort
Map output across all mappers is shuffled so that like keys are sent to a single reducer. Map output is sorted so that all output with a key k is seen as a contiguous group. This is done behind the scenes for you by the MapReduce framework.
Map
1. Read line-by-line.
2. Parse each record: prune unwanted fields, perform data transformations, and select the fields to be the key and value.
Values are sent out over standard I/O, so everything is a string.
Reduce
1. Aggregate records by key.
2. Perform any reduction steps: compute running statistics or necessary metadata, store them in an efficient data structure, and perform the analytics on the aggregated group.
Values are sent out over standard I/O to HDFS, so everything is a string.
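A minimal reducer skeleton (our sketch, not the tutorial's reducer) shows how these steps fit the shuffle-and-sort guarantee: because all lines for a key arrive contiguously, itertools.groupby can aggregate records by key without holding more than one group's state at a time.

    #!/usr/bin/env python
    import sys
    from itertools import groupby

    if __name__ == "__main__":
        # Keys arrive in sorted, contiguous runs, so groupby sees each key once.
        for key, group in groupby(sys.stdin, key=lambda line: line.split("\t", 1)[0]):
            count = 0
            for line in group:
                count += 1   # a real reducer updates running statistics per record here
            sys.stdout.write("%s\t%d\n" % (key, count))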
Code
You do not need to set up job configuration in the code; all of that is handled for you by the framework. This means that very little code is necessary.
Code - Map

#!/usr/bin/env python
import sys
import time

if __name__ == "__main__":
    # Iterate over each line
    for line in sys.stdin:
        # Parse the line
        route, date, daytype, rides = line.rstrip().split(",")
        # Transform the data: derive the day of the week from the date
        weekday = time.strftime("%a", time.strptime(date, "%m/%d/%y"))
        # Emit a key, value pair: the key is route-weekday, the value is date,rides
        sys.stdout.write("%s-%s\t%s,%s\n" % (route, weekday, date, rides))
Reduce - Plots
1. Aggregate events.
2. Calculate the mean. We calculate a running mean so that the events do not have to be held in memory.
Trade-off: does the amount of RAM required to hold all events for one key push the available limits? Can running statistics be safely computed?
Build the SVG plot using Cassius. Values are sent to HDFS over standard I/O.
Reduce - Models
1. Aggregate events.
2. Calculate the mean. We calculate a running mean so that the events do not have to be held in memory.
Trade-off: does the amount of RAM required to hold all events for one key push the available limits? Can running statistics be safely computed?
Build the PMML model using Augustus. Values are sent to HDFS over standard I/O.
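On the question of whether running statistics can be safely computed: mean and variance are among those that can. As an illustration (ours, not the tutorial's code, which uses the exponentially weighted update shown later), Welford's method maintains both exactly in constant memory per key:

    class RunningStats(object):
        """Exact running mean and variance in O(1) memory (Welford's method)."""
        def __init__(self):
            self.count = 0
            self.mean = 0.0
            self.m2 = 0.0               # sum of squared deviations from the mean

        def update(self, x):
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

        def variance(self):
            # unbiased sample variance; 0.0 until there are at least two values
            return self.m2 / (self.count - 1) if self.count > 1 else 0.0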
Model
MapReduce partition = model segment. Events and statistics collected in the reducer are used to construct a model describing the segment. Each bus route and day-of-the-week combination gets a Gaussian distribution with a mean and variance to predict rider volume.
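To make the prediction concrete: scoring an observation against a segment's Gaussian reduces to a z-value, the test statistic named in the PMML template below. A hypothetical scoring function:

    import math

    def z_value(rides, mean, variance):
        # How unusual is this rider count under the segment's Gaussian baseline?
        return (rides - mean) / math.sqrt(variance)

    # e.g. a count three standard deviations above the segment mean scores 3.0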
PMML Template for our Model

from augustus.core.xmlbase import load
import augustus.core.pmml41 as pmml

# Use Augustus to validate the template: the segment is validated as PMML on load.
# The template is a hard-coded string; the segment predicate is the partition key.
segment = load("""
<Segment>
    <SimplePredicate field="segment" operator="equal" value="zero"/>
    <BaselineModel functionName="regression">
        <MiningSchema>
            <MiningField usageType="active" name="rides"/>
        </MiningSchema>
        <TestDistributions field="rides" testStatistic="zValue">
            <Baseline>
                <GaussianDistribution mean="0" variance="1"/>
            </Baseline>
        </TestDistributions>
    </BaselineModel>
</Segment>
""", pmml.pmml)
Code Reducer PMML: Accumulate Step

def doany(v, date, rides):
    v["count"] += 1
    diff = rides - v["mean"]
    incr = alpha * diff                  # alpha: weight for the running statistics
    v["mean"] += incr
    v["varn"] = (1. - alpha) * (v["varn"] + diff * incr)
Code Reducer PMML: Write out the Model

def dolast(v):
    # Calculate values to fill in the template
    if v["count"] > 1:
        variance = v["varn"] * v["count"] / (v["count"] - 1.)
    else:
        variance = v["varn"]
    # Fill them in
    v["gaussiandistribution"]["mean"] = v["mean"]
    v["gaussiandistribution"]["variance"] = variance
    v["partialsum"].attrib = {"COUNT": v["count"], "RUNMEAN": v["mean"], "RUNSN": v["varn"]}
    # Write the model to HDFS (over standard output)
    print v["segment"].xml()
PMML
PMML is the leading standard for statistical and data mining models. Version 4.1 includes support for multiple models, such as segmented models and ensembles of models. It allows models to be expressed as XML-compliant, portable documents.
PMML in the HDFS output
Questions?
For the most recent version of these slides, please see tutorials.opendatagroup.com