A Tutorial Introduction to Big Data. Hands On Data Analytics over EMR. Robert Grossman University of Chicago Open Data Group

Transcription

1 A Tutorial Introduction to Big Data: Hands On Data Analytics over EMR. Robert Grossman, University of Chicago, Open Data Group. Collin Bennett, Open Data Group. November 12,

2 Amazon AWS Elastic MapReduce allows MapReduce jobs to run over the Amazon Elastic Cloud Infrastructure with minimal setup.

3 Running Custom Jobs: Elastic MapReduce allows different types of jobs to be run. Streaming allows anything which handles Standard I/O. It lets you mix and match languages, shell scripts, etc.

4 Choose a Language(s): The Streaming Interface of Elastic MapReduce lets us use a different language for each step of the job. Choose wisely: pick a language that fits the task. For the examples, both maps are in Python. One reducer is in R, the other in Python.

5 tutorials.opendatagroup.com

6 Create a Job

7 Example 1: MapReduce job to read in data and build plots to help with Exploratory Data Analysis. Data is already in S3. Output will be written to HDFS.

8 Example 2: MapReduce job to read in data and build models in PMML. Data is already in S3. Output will be written to HDFS.

9 Amazon Hadoop Streaming Job Application

10 Set up Job Parameters

11 Values
INPUT: tutorials.opendatagroup.com/sc12/data
OUTPUT: tutorials.opendatagroup.com/sc12/out
MAPPER: tutorials.opendatagroup.com/sc12/emr/mapper.py
REDUCER: tutorials.opendatagroup.com/sc12/emr/reducer-plots.py
EXTRA ARGS: (none listed)
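The four values above correspond to the standard Hadoop Streaming options -input, -output, -mapper, and -reducer. As a rough sketch only, that mapping can be written down as a small Python helper. The function name streaming_args is hypothetical, and the paths are copied from the slide as-is, without any URI scheme that the EMR console may expect.

```python
def streaming_args(input_path, output_path, mapper, reducer, extra_args=()):
    """Return a Hadoop Streaming option list for one job (hypothetical helper)."""
    args = [
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]
    # Extra args are appended unchanged; the slide lists none.
    args.extend(extra_args)
    return args


# Paths copied from the slide; no scheme prefix is assumed here.
BASE = "tutorials.opendatagroup.com/sc12"
eda_job = streaming_args(
    BASE + "/data",
    BASE + "/out",
    BASE + "/emr/mapper.py",
    BASE + "/emr/reducer-plots.py",
)
print(" ".join(eda_job))
```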

12 Instance Types and Count

13 Machine Access and S3 Logs

14 SSH Access: If you do not specify an EC2 keypair, then you cannot log into the nodes. If everything works, logging in is usually not necessary.

15 Logging: If you specify the Amazon S3 log path, then the standard Hadoop logging will be written to the S3 bucket of your choice. This directory must exist. This is helpful even if everything works, as you can learn things about the job.

16 Bootstrap

17 Bootstrap: If you do not specify a bootstrap, you get the vanilla version of EMR. If you want to add any packages, run any initialization scripts, etc., you have to do it with a bootstrap action. EMR offers canned bootstraps. We run a custom action.

18 Job Summary

19 Let the Magic Happen

20 Check on Instances under EC2

21 Machine Access: You can ssh into the master node if you specified an EC2 keypair during configuration. To access a slave node: 1. scp the EC2 keypair to the master node. 2. ssh into the master node. 3. ssh into the slave node using the EC2 keypair: ssh -i <path_to_key>/<key> hadoop@<ip>

22 Text-browser JobTracker Access from Master Node

23 Job Output: The job we ran is the Exploratory Data Analysis (EDA) step. It generates plots of the data as SVG files. The SVG plots are written to HDFS. SVG images are XML, not binary. (With the same bootstrap and mapper, we can run a job to build models in PMML.)

24 Each reducer produces a part-* file

25 Images are in the Files

26 Extracting Images: Depending on how many keys and reducers you have, there will be 0, 1, or more SVG plots in each output file. 1. Download each part-* file. 2. To check how many images are in a file: grep '<svg' <part_file> | wc -l. If there is only one, rename the file to <part_file>.svg. If there is more than one, split the file up in your favorite text editor.
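As an alternative to splitting a file by hand, the steps above can be sketched in Python. This is only an illustration under stated assumptions: every plot is one complete, non-nested <svg ...> ... </svg> element, and anything between elements (keys, tabs, newlines) can be discarded. The function names split_svgs and write_svgs and the output naming scheme are hypothetical.

```python
import re

# Assumption: each image is one complete, non-nested <svg ...>...</svg> element.
SVG_PATTERN = re.compile(r"<svg\b.*?</svg>", re.DOTALL)


def split_svgs(text):
    """Return every SVG element found in the text of one part-* file."""
    return SVG_PATTERN.findall(text)


def write_svgs(part_path):
    """Write each image in part_path to its own file; return the image count."""
    with open(part_path, encoding="utf-8") as handle:
        images = split_svgs(handle.read())
    for index, image in enumerate(images):
        # Hypothetical naming scheme: <part_path>-0.svg, <part_path>-1.svg, ...
        with open("%s-%d.svg" % (part_path, index), "w", encoding="utf-8") as out:
            out.write(image)
    return len(images)


# Made-up sample standing in for the contents of a part-* file.
sample = '<svg width="10"><rect/></svg>\t\n<svg width="20"><circle/></svg>\n'
print(len(split_svgs(sample)))
```

The count returned by split_svgs plays the same role as the grep check: 0 means the reducer emitted no plot, 1 means the file can simply be renamed.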

27 View in a Web Browser

28 EDA Job Details While we wait for your job to finish

29 Shuffle and Sort: Map output across all mappers is shuffled so that like keys are sent to a single reducer. Map output is sorted so that all output with a key k is seen as a contiguous group. This is done behind the scenes for you by the MapReduce framework.
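To make the two guarantees above concrete, here is a small local simulation. It is a sketch, not the framework's implementation: zlib.crc32 is a stand-in for the real partitioner, the function name shuffle_and_sort is hypothetical, and the sample records are made up.

```python
import zlib
from itertools import groupby


def shuffle_and_sort(pairs, num_reducers):
    """Route (key, value) pairs to reducers, then sort each reducer's input by key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        # Stand-in partitioner: deterministic hash of the key modulo reducer count,
        # so every record with the same key lands in the same bucket.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        buckets[index].append((key, value))
    # Sorting makes all records for one key contiguous within a bucket.
    return [sorted(bucket, key=lambda pair: pair[0]) for bucket in buckets]


# Made-up map output in the key/value shape used by the tutorial's mapper.
map_output = [
    ("3-Mon", "01/01/01,100"),
    ("4-Tue", "01/02/01,50"),
    ("3-Mon", "01/08/01,300"),
    ("4-Tue", "01/09/01,70"),
]

for reducer_id, bucket in enumerate(shuffle_and_sort(map_output, 2)):
    for key, group in groupby(bucket, key=lambda pair: pair[0]):
        print(reducer_id, key, [value for _, value in group])
```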

30 Map: 1. Read line by line. 2. Parse each record: prune unwanted fields, perform data transformations, select fields to be the key and the value. Values are sent out over Standard I/O, so everything is a string.

31 Reduce: 1. Aggregate records by key. 2. Perform any reduction steps: compute running statistics or necessary metadata, store in an efficient data structure, perform the analytics on the aggregated group. Values are sent out over Standard I/O to the HDFS, so everything is a string.
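The reduce steps above can be sketched as a generic streaming reducer. This is a minimal illustration, not the tutorial's reducer: it assumes input lines of the form key, tab, date, comma, rides that arrive already sorted by key, and it only counts and sums rides per key. The function name reduce_lines and the sample records are hypothetical.

```python
from itertools import groupby


def reduce_lines(lines):
    """Yield one 'key<TAB>count,total' string per key from sorted input lines."""
    records = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby relies on the framework's sort: equal keys must be contiguous.
    for key, group in groupby(records, key=lambda record: record[0]):
        count = 0
        total = 0
        for _, value in group:
            date, rides = value.split(",")
            count += 1
            total += int(rides)  # everything arrives as a string
        yield "%s\t%d,%d" % (key, count, total)


# Made-up, already sorted input; in a streaming job this would be sys.stdin.
sample = ["3-Mon\t01/01/01,100\n", "3-Mon\t01/08/01,300\n", "4-Tue\t01/02/01,50\n"]
for line in reduce_lines(sample):
    print(line)
```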

32 Code: You do not need to set up job configuration in the code; all of that is handled for you by the framework. This means that very little code is necessary.

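33 Code - Map
#!/usr/bin/env python
# Python 2 script (xreadlines was removed in Python 3)
import sys
import time
if __name__ == "__main__":
    # Iterate over each line of standard input
    for line in sys.stdin.xreadlines():
        # Parse the line into its four comma-separated fields
        route, date, daytype, rides = line.rstrip().split(",")
        # Transform the date into a weekday abbreviation such as Mon
        weekday = time.strftime("%a", time.strptime(date, "%m/%d/%y"))
        # Emit key (route-weekday), a tab, then value (date,rides)
        sys.stdout.write("%s-%s\t%s,%s\n" % (route, weekday, date, rides))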

34 Code - Map (code as on slide 33). Iterate over each line: the for loop over sys.stdin.xreadlines().

35 Code - Map (code as on slide 33). Parse the line: line.rstrip().split(",") yields route, date, daytype, and rides.

36 Code - Map (code as on slide 33). Transform data: time.strptime and time.strftime turn the date into a weekday abbreviation.

37 Code - Map (code as on slide 33). Emit a key, value pair: sys.stdout.write prints the key route-weekday, a tab, then the value date,rides.
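For readers who want to try the mapper logic locally, the following is a Python 3 sketch of the same four steps, wrapped in a function so it can be called without Hadoop. The function name map_line, the choice to skip unparseable lines (such as a header row), and the sample record are assumptions, not part of the slides.

```python
import time


def map_line(line):
    """Return 'route-weekday<TAB>date,rides' for one CSV record, or None if it does not parse."""
    try:
        # Parse: four comma-separated fields, as in the slide's mapper.
        route, date, daytype, rides = line.rstrip().split(",")
        # Transform: date string to weekday abbreviation.
        weekday = time.strftime("%a", time.strptime(date, "%m/%d/%y"))
    except ValueError:
        # Assumption: lines that do not parse (for example a header) are skipped.
        return None
    # Emit: key and value separated by a tab.
    return "%s-%s\t%s,%s" % (route, weekday, date, rides)


# Made-up record in the route,date,daytype,rides layout.
print(map_line("3,01/01/01,U,7354\n"))
```

In a streaming job the function would be applied to each line of sys.stdin and every non-None result written to sys.stdout followed by a newline.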

38 Reduce - Plots: 1. Aggregate events. 2. Calculate the mean. We calculate a running mean so that the events do not have to be held in memory. Trade-off: Does the amount of RAM required to hold all events for one key push the available limits? Can running statistics be safely computed? Build the SVG plot using Cassius. Values are sent to HDFS over Standard I/O.

39 Reduce - Models: 1. Aggregate events. 2. Calculate the mean. We calculate a running mean so that the events do not have to be held in memory. Trade-off: Does the amount of RAM required to hold all events for one key push the available limits? Can running statistics be safely computed? Build the PMML model using Augustus. Values are sent to HDFS over Standard I/O.
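The memory trade-off above can be illustrated with a plain incremental mean, which keeps only a count and the current mean instead of the full list of events. This is a generic, unweighted running mean offered as a sketch; it is not necessarily the exact update used by the tutorial's reducers, and the name running_mean is hypothetical.

```python
def running_mean(values):
    """Return (count, mean) using constant memory, one value at a time."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        # Move the mean toward the new value by 1/count of the difference.
        mean += (value - mean) / count
    return count, mean


# Works on any iterator, so the events never need to be stored in a list.
print(running_mean(iter([100, 300, 200])))
```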

40 Model: A MapReduce partition corresponds to a model segment. Events and statistics collected in the reducer are used to construct a model describing the segment. Each bus route and day-of-week combination gets a Gaussian distribution with a mean and variance to predict rider volume.
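One way such a per-segment Gaussian can be used is to score a new ride count by its z-value, the number of standard deviations it lies from the segment mean. The sketch below shows only that arithmetic; the function name z_value and the sample numbers are hypothetical, and it does not use Augustus.

```python
import math


def z_value(rides, mean, variance):
    """Return how many standard deviations `rides` lies from the segment mean."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return (rides - mean) / math.sqrt(variance)


# Made-up segment: mean 100 rides, variance 225 (standard deviation 15).
print(z_value(130, 100, 225))
```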

41 PMML Template for our Model
from augustus.core.xmlbase import load
import augustus.core.pmml41 as pmml
# the segment is validated as PMML on load
segment = load("""
<Segment>
    <SimplePredicate field="segment" operator="equal" value="zero"/>
    <BaselineModel functionName="regression">
        <MiningSchema>
            <MiningField usageType="active" name="rides" />
        </MiningSchema>
        <TestDistributions field="rides" testStatistic="zValue">
            <Baseline>
                <GaussianDistribution mean="0" variance="1" />
            </Baseline>
        </TestDistributions>
    </BaselineModel>
</Segment>
""", pmml.pmml)

42 Code Reducer PMML (code as on slide 41). Use Augustus to validate the template: load(..., pmml.pmml) validates the segment as PMML on load.

43 Code Reducer PMML (code as on slide 41). Template is a hard-coded string: the triple-quoted XML passed to load.

44 Code Reducer PMML (code as on slide 41). Segment predicate is the Partition Key: the SimplePredicate element on the field "segment".

45 Code Reducer PMML: Accumulate Step
def doany(v, date, rides):
    # alpha is not defined on this slide; it is set elsewhere in the reducer
    v["count"] += 1
    diff = rides - v["mean"]
    incr = alpha * diff
    v["mean"] += incr
    v["varn"] = (1. - alpha)*(v["varn"] + diff*incr)

46 Code Reducer PMML: Write out Model
def dolast(v):
    # Calculate the values that fill in the template
    if v["count"] > 1:
        variance = v["varn"] * v["count"] / (v["count"] - 1.)
    else:
        variance = v["varn"]
    # Fill them in
    v["gaussiandistribution"]["mean"] = v["mean"]
    v["gaussiandistribution"]["variance"] = variance
    v["partialsum"].attrib = {"COUNT": v["count"], "RUNMEAN": v["mean"], "RUNSN": v["varn"]}
    # Python 2 print statement; standard output goes to HDFS
    print v["segment"].xml()

47 Code Reducer PMML (code as on slide 46). Calculate values to fill in the template: the variance computed from v["varn"] and v["count"].

48 Code Reducer PMML (code as on slide 46). Fill them in: the assignments to v["gaussiandistribution"] and v["partialsum"].attrib.

49 Code Reducer PMML (code as on slide 46). Write model to HDFS: print v["segment"].xml() sends the segment to standard output, which the framework writes to HDFS.
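The accumulate and write-out arithmetic from the reducer slides can be tried without Augustus using plain dictionaries. The following Python 3 sketch copies the update formulas from doany and dolast; the names new_state, accumulate, and finalize, the zero initial state, and passing alpha as a parameter are assumptions made for illustration, since the slides do not show how alpha or the initial values are set.

```python
def new_state():
    """Assumed initial state; the slides do not show the starting values."""
    return {"count": 0, "mean": 0.0, "varn": 0.0}


def accumulate(state, rides, alpha):
    """Same update as doany on the slides, with alpha passed in explicitly."""
    state["count"] += 1
    diff = rides - state["mean"]
    incr = alpha * diff
    state["mean"] += incr
    state["varn"] = (1.0 - alpha) * (state["varn"] + diff * incr)


def finalize(state):
    """Same variance correction as dolast; returns (mean, variance)."""
    if state["count"] > 1:
        variance = state["varn"] * state["count"] / (state["count"] - 1.0)
    else:
        variance = state["varn"]
    return state["mean"], variance


# Made-up events for one key, with a hypothetical alpha of 0.5.
state = new_state()
for rides in [10, 10]:
    accumulate(state, rides, alpha=0.5)
print(finalize(state))
```

With a fixed alpha, recent events weigh more than old ones, so the result depends on both the chosen alpha and the starting state.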

50 PMML: PMML is the leading standard for statistical and data mining models. Version 4.1 includes support for multiple models, such as segmented models and ensembles of models. It allows models to be expressed as XML-compliant, portable documents.

51 PMML in the HDFS output

52 Questions? For the most recent version of these slides, please see tutorials.opendatagroup.com