Practice and Applications of Data Management CMPSCI 345. Lecture 19-20: Amazon Web Services


2 Extra credit: project part 3
- Open-ended additional features.
- Presentations on Dec 7
- Need to sign up by Nov 30

3 This week
- No class on Wednesday (enjoy Thanksgiving)
- Office hours on Tuesday 2-3pm.

4 Map-Reduce Summary
- Hides scheduling and parallelization details
- However, very limited queries
  - Difficult to write more complex tasks
  - Need multiple map-reduce operations
- Solution: use MapReduce as a runtime for higher-level languages
  - Pig (Yahoo, now an Apache project): SQL-like operators
  - Hive (Apache project): SQL
  - Scope (MS): SQL, but proprietary
  - DryadLINQ (MS): LINQ, but also proprietary

5 Homework assignment
- Amazon Web Services
  - You need to sign up
- Practice large-scale unstructured data processing on Hadoop
- This (and next) week:
  - Overview of AWS in class
  - Guiding through the first steps of the assignment.

6 Amazon Web Services (AWS)
- A cloud computing platform

7 Why cloud computing?
[Figure: three options compared ("vs ... vs"); images not recovered]

8 What will we learn?
- Analyze search logs

Sample log entries (user, query):
BED EBD0C           yahoo chat
824F413FA37520BF    garter belts
824F413FA37520BF    lingerie
824F413FA37520BF    spiderman
824F413FA37520BF    tommy hilfiger
824F413FA37520BF    calgary
824F413FA37520BF    calgary
824F413FA37520BF    exhibitionists

9 What is Pig?
- An engine for executing programs on top of Hadoop
- It provides a language, Pig Latin, to specify these programs
- An Apache open source project
- http://hadoop.apache.org/pig/

10 Why use Pig?
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18 to 25.

Dataflow:
1. Load Users, then filter by age
2. Load Pages
3. Join on name
4. Group on url
5. Count clicks
6. Order by clicks
7. Take top 5
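
To make the dataflow concrete before comparing MapReduce and Pig Latin, here is a minimal in-memory sketch of the same steps using plain Java streams. It is only an illustration, not how Hadoop or Pig executes the query: the `User` and `PageView` classes, the `topSites` method, and the sample data are all hypothetical, and the join is simplified to a set lookup, which assumes user names are unique.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class TopSitesSketch {
    // Hypothetical in-memory stand-ins for the two input files.
    static final class User {
        final String name;
        final int age;
        User(String name, int age) { this.name = name; this.age = age; }
    }

    static final class PageView {
        final String user;
        final String url;
        PageView(String user, String url) { this.user = user; this.url = url; }
    }

    static List<Map.Entry<String, Long>> topSites(List<User> users, List<PageView> views, int limit) {
        // Load Users, filter by age (18 to 25 inclusive).
        Set<String> eligible = users.stream()
                .filter(u -> u.age >= 18 && u.age <= 25)
                .map(u -> u.name)
                .collect(Collectors.toSet());

        // Join on name (simplified to a set lookup, assuming unique user names),
        // then group on url and count clicks.
        Map<String, Long> clicks = views.stream()
                .filter(v -> eligible.contains(v.user))
                .collect(Collectors.groupingBy(v -> v.url, Collectors.counting()));

        // Order by clicks descending (ties broken by url), then take the top N.
        return clicks.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only.
        List<User> users = Arrays.asList(new User("ann", 20), new User("bob", 40));
        List<PageView> views = Arrays.asList(
                new PageView("ann", "a.example"), new PageView("ann", "a.example"),
                new PageView("ann", "b.example"), new PageView("bob", "c.example"));
        for (Map.Entry<String, Long> e : topSites(users, views, 5)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

This only works because everything fits in one machine's memory; the point of MapReduce and Pig is to run the same logical steps over data that does not.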

11 In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and store it
            // accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outVal = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outVal));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {
        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url (the second comma-separated field)
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma + 1);
            String key = line.substring(firstComma + 1, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {
        public void reduce(Text key, Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {
        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable) val, (Text) key);
        }
    }

    public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;

        public void reduce(LongWritable key, Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}

170 lines of code, 4 hours to write

12 In Pig Latin

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';

9 lines of code, 15 minutes to write

13 But how good is it?

14 Essence of Pig
- Map-Reduce is too low a level to program, SQL too high
- Pig Latin, a language intended to sit between the two:
  - Imperative
  - Provides standard relational transforms (join, sort, etc.)
  - Schemas are optional, used when available, can be defined at runtime
  - User Defined Functions are first class citizens
  - Opportunities for advanced optimizer but optimizations by programmer also possible

15 Multi-store script

A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by (age, gender);
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';

Dataflow: load users, then filter nulls, then two branches:
- group by age, gender; apply UDFs; store into 'bydemo'
- group by state; apply UDFs; store into 'bystate'

16 What are people doing with Pig
- At Yahoo ~70% of Hadoop jobs are Pig jobs
- Being used at Twitter, LinkedIn, and other companies
- Available as part of Amazon EMR web service and Cloudera Hadoop distribution
- What users use Pig for:
  - Search infrastructure
  - Ad relevance
  - Model training
  - User intent analysis
  - Web log processing
  - Image processing
  - Incremental processing of large data sets

17 What will we learn?
- Analyze search logs
- Analyze small search logs

Sample log entries (user, query):
BED EBD0C           yahoo chat
824F413FA37520BF    garter belts
824F413FA37520BF    lingerie
824F413FA37520BF    spiderman
824F413FA37520BF    tommy hilfiger
824F413FA37520BF    calgary
824F413FA37520BF    calgary
824F413FA37520BF    exhibitionists

18 AWS assignment
- Information on Pig, Hadoop, and AWS
- Help with getting set up
- Actual assignment

19 Running Hadoop on your machines
Setting up Part A:
1. Extract hw3.zip
2. Extract pigtmp.zip
3. Extract hadoop zip

20 Setting up
Make sure hadoop is executable:
$ chmod u+x ~/hw3/hadoop /bin/hadoop

21 Setting up
Set environment variables:
$ export PIGDIR=~/hw3/pigtmp
$ export HADOOP=~/hw3/hadoop
$ export HADOOPSITEPATH=~/hw3/hadoop /conf/
$ export PATH=$HADOOP/bin/:$PATH
In Windows:
$ set PIGDIR=~/hw3/pigtmp
etc.

22 Setting up
The variable JAVA_HOME should be set to point to your system's Java directory. This is system dependent.
In OS X:
$ export JAVA_HOME=$(/usr/libexec/java_home)
In Windows, it should point to your JDK folder. (You should have that from project part 2.)

23 The data: search query logs
Excite: old search engine (something like Google)

24 The data
- Take a peek inside excite-small.log

Each record has three fields: user, time (format YYMMDDHHMMSS), query.

user                query
BED EBD0C           yahoo chat
824F413FA37520BF    garter belts
824F413FA37520BF    lingerie
824F413FA37520BF    spiderman
824F413FA37520BF    tommy hilfiger
824F413FA37520BF    calgary
824F413FA37520BF    calgary
824F413FA37520BF    exhibitionists
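
As a rough illustration of the record layout, the following Java sketch splits one log line into its three fields and pulls the hour out of the YYMMDDHHMMSS time. It assumes the fields are tab-separated, which matches the PigStorage('\t') loader used in the script. The class and method names are hypothetical, and the timestamp in `main` is made up for the example, since the real time values are not shown here.

```java
public class ExciteRecordSketch {
    final String user;
    final String time;   // expected format: YYMMDDHHMMSS
    final String query;

    ExciteRecordSketch(String user, String time, String query) {
        this.user = user;
        this.time = time;
        this.query = query;
    }

    // Split a line into user, time, query. The limit of 3 keeps any
    // further tab characters inside the query field.
    static ExciteRecordSketch parse(String line) {
        String[] fields = line.split("\t", 3);
        if (fields.length != 3 || fields[1].length() != 12) {
            throw new IllegalArgumentException("malformed record: " + line);
        }
        return new ExciteRecordSketch(fields[0], fields[1], fields[2]);
    }

    // In YYMMDDHHMMSS the hour occupies characters 6 and 7 (zero-based).
    String hour() {
        return time.substring(6, 8);
    }

    public static void main(String[] args) {
        // Hypothetical record: the timestamp is invented for illustration.
        ExciteRecordSketch r = parse("824F413FA37520BF\t970916105432\tgarter belts");
        System.out.println(r.user + " searched for '" + r.query + "' during hour " + r.hour());
    }
}
```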

25 script1-local.pig
- Objective:
  - Find query phrases that occur with high frequency during certain times of day
- Open script1-local.pig

26 script1-local.pig

-- Register the jar to access UDFs
REGISTER ./tutorial.jar;
-- Load the raw data
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
-- Remove records where the query is empty or a URL
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
-- Change the query to lower case
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
...

27 script1-local.pig

...
-- Extract the hour
houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
-- Generate n-grams from the query string
ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
-- Get unique n-grams
ngramed2 = DISTINCT ngramed1;
-- Group by n-gram and hour
hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
...
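
For readers unfamiliar with n-grams, here is a small Java sketch of what generating word n-grams from a query string can look like. It is not the tutorial jar's NGramGenerator: the class name, the `ngrams` method, and the `maxWords` parameter are hypothetical, and the maximum n-gram length used by the real UDF is not stated in these slides. Returning a set mirrors the DISTINCT step for a single query.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class NGramSketch {
    // Return every run of 1..maxWords consecutive words in the query,
    // without duplicates, in the order they are first produced.
    static Set<String> ngrams(String query, int maxWords) {
        Set<String> result = new LinkedHashSet<>();
        String trimmed = query.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        for (int n = 1; n <= maxWords; n++) {
            for (int start = 0; start + n <= words.length; start++) {
                result.add(String.join(" ", Arrays.copyOfRange(words, start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // With maxWords = 2 this yields the two single words and the one word pair.
        System.out.println(ngrams("tommy hilfiger", 2));
    }
}
```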

28 script1-local.pig

...
-- Count the occurrences of each n-gram
hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
-- Group by n-gram
uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;
-- Use a UDF to compute a popularity score for the n-gram
uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));
-- Assign names to the fields
uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
...
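
The slides do not define the popularity score, so the Java sketch below should be read as one plausible interpretation rather than the UDF's actual logic: it assumes the score measures how far an hour's count lies above the n-gram's mean hourly count, in units of standard deviation. The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

public class ScoreSketch {
    // Input: for a single n-gram, the number of occurrences in each hour.
    // Output: hour -> score, only for hours whose count exceeds the mean.
    // Assumed formula: score = (count - mean) / standard deviation.
    static Map<String, Double> scores(Map<String, Long> countsByHour) {
        Map<String, Double> result = new TreeMap<>();
        if (countsByHour.isEmpty()) {
            return result;
        }
        double sum = 0;
        for (long count : countsByHour.values()) {
            sum += count;
        }
        double mean = sum / countsByHour.size();

        double squaredDeviations = 0;
        for (long count : countsByHour.values()) {
            squaredDeviations += (count - mean) * (count - mean);
        }
        double sd = Math.sqrt(squaredDeviations / countsByHour.size());
        if (sd == 0) {
            return result; // every hour has the same count, nothing stands out
        }
        for (Map.Entry<String, Long> e : countsByHour.entrySet()) {
            if (e.getValue() > mean) {
                result.put(e.getKey(), (e.getValue() - mean) / sd);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up counts for one n-gram across three hours.
        Map<String, Long> counts = new TreeMap<>();
        counts.put("07", 1L);
        counts.put("08", 1L);
        counts.put("09", 4L);
        System.out.println(scores(counts));
    }
}
```

Under this reading, an n-gram that is equally common in every hour gets no score at all, while one that spikes in a single hour scores highly for that hour.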

29 script1-local.pig

...
-- Keep frequency scores higher than 2
filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;
-- Sort the records by hour and score
ordered_uniq_frequency = ORDER filtered_uniq_frequency BY hour, score;
-- Store the results
STORE ordered_uniq_frequency INTO 'script1-local-results.txt' USING PigStorage();

30 Execute your Pig script
$ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local script1-local.pig
$ ls -l script1-local-results.txt
$ cat script1-local-results.txt

31 Explore what happens
Start grunt:
$ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local
grunt>
- Copy and paste commands from the script
- Explore the created tables with the commands describe and dump

32 Sign in to the AWS management console
- https://console.aws.amazon.com

33 Check your S3 storage
- https://console.aws.amazon.com

34 Go to Elastic MapReduce
- https://console.aws.amazon.com


40 Starting the job
- The job may take a few minutes to start

41 Cluster list
- Monitors elapsed time

42
- Terminates the job
- SSH instructions
- DNS name of Master node

43 Connecting to the Master
Find your Master's DNS from the console
$ ssh -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>
Use the name of the master, and the path to your EC2 key pair

44 On the Master
Create a directory on the HDFS system:
% hadoop dfs -mkdir /user/hadoop

45 Edit script1-hadoop.pig

...
raw = LOAD 'excite.log.bz2' USING PigStorage('\t') AS (user, time, query);
...
STORE ordered_uniq_frequency INTO 'script1-hadoop-results' USING PigStorage();
...

- Change the location of the data to the one on your S3 bucket: s3n://<name_of_your_bucket>/excite.log.bz2
- Change the location of the output: /user/hadoop/script1-hadoop-results

46 Upload files to the Master
$ scp -i </path/to/saved/keypair/file.pem> script1-hadoop.pig hadoop@<master.public-dns-name.amazonaws.com>:~/.
$ scp -i </path/to/saved/keypair/file.pem> tutorial.jar hadoop@<master.public-dns-name.amazonaws.com>:~/.
Again, use the name of the master, and the path to your EC2 key pair

47 On the Master
Execute the script:
% pig -l . script1-hadoop.pig

48 Instructions to enable monitoring connections

49 Monitoring job flows
In a new terminal window:
$ ssh -i </path/to/saved/keypair/file.pem> -ND 8157 hadoop@<master.public-dns-name.amazonaws.com>
- Starts a proxy listening on port 8157
- Use the name of the master, and the path to your EC2 key pair

50 Enable FoxyProxy on the browser

51 Monitoring jobs
- Access monitoring URLs

52 Load the jobtracker

53 Retrieving results
On the Master:
% hadoop dfs -copyToLocal /user/hadoop/script1-hadoop-results script1-hadoop-results
On your machine:
$ scp -i </path/to/saved/keypair/file.pem> -r hadoop@<master.public-dns-name.amazonaws.com>:~/script1-hadoop-results .

54 Terminate all jobs when you are done
If you leave jobs running, costs will rack up. You are responsible for your usage.

55 Relational DB on AWS
- https://console.aws.amazon.com


57 Pick a name and a description


66 Connect to the cloud database
psql --host=<your_rds_instance> --port= --username=<username> --password --dbname=cloud_db
- Use the DB instance address from your console
- Type the username you chose
- Type the command in a single line

67 Import data to RDS
In your phpexample code:
psql -f initialize.sql --host=<your_rds_instance> --port= --username=<username> --password --dbname=cloud_db

68 Update your configuration file
Enter the proper values in config.php

69 Start a local HTTP server
E.g., with PHP 5.4: php -S localhost:

70 Remember to delete your instance when you no longer need it.