Hadoop Pig: Introduction, Basics, Exercise


1 Your Name

2 Hadoop Pig: Introduction, Basics, Exercise

3

4 A set of files. A database. A single file.

5 Modern systems have to deal with far more data than was the case in the past. Yahoo!: over 170 PB of data. Facebook: over 30 PB of data. We need a system to handle large-scale computation!

6 Characteristics. Distributed systems are groups of networked computers, allowing developers to use multiple machines for a single job. At compute time, data is copied to the compute nodes. Programming for traditional distributed systems is complex: data exchange requires synchronization; only finite bandwidth is available; temporal dependencies are complicated; and it is difficult to deal with partial failures of the system.

7

8 Distribute the data as it is initially stored in the system. Individual nodes can work on data local to those nodes. Users can focus on developing applications.

9 Open Source Software + Hardware Commodity Zookeeper Sqoop/ Oozie Pig Hive Hue Mahout Flume MapReduce HBase Ganglia/Nagios Hadoop Distributed File System (HDFS)

10 Open Source Software + Hardware Commodity Zookeeper Sqoop/ Oozie Pig Hive Hue Mahout Flume MapReduce HBase Ganglia/Nagios Hadoop Distributed File System (HDFS)

11 HDFS (Hadoop Distributed File System): distributed, scalable, fault tolerant, high performance. MapReduce: a software framework for parallel computation over distributed data. Mainly written in Java.

12

13 A platform for analyzing large data sets. Consists of a high-level language for expressing data analysis, called Pig Latin, combined with the power of Hadoop and the MapReduce framework.

14 Q: Find users who tend to visit good pages. Input tables: Visits, Pages.

15 Conceptual dataflow:
Load Visits(user, url, time); canonicalize urls.
Load Pages(url, pagerank).
Join on url = url.
Group by user.
Compute average pagerank.
Filter avgpr > 0.5.
Result.

16 import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {

    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc, Reporter reporter)
                throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc, Reporter reporter)
                throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc, Reporter reporter)
                throws IOException {
            // For each value, figure out which file it's from and
            // store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();
            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }
            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    // Slide note: low-level programming.
    public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {
        public void map(Text k, Text val,
                OutputCollector<Text, LongWritable> oc, Reporter reporter)
                throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // Drop the rest of the record; we don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {
        public void reduce(Text key, Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {
        public void map(WritableComparable key, Writable val,
                OutputCollector<LongWritable, Text> oc, Reporter reporter)
                throws IOException {
            oc.collect((LongWritable)val, (Text)key);
        }
    }

    public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {
        int count = 0;
        public void reduce(LongWritable key, Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc, Reporter reporter)
                throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        // Slide note: requires more effort to understand and maintain --
        // you write the join code yourself.
        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);
        // Slide note: effort to manage multiple MapReduce jobs.

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        // ... (slide truncated here)
    }
}

17 Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
UserPageranks = foreach UserVisits generate user, AVG(VP.pagerank) as avgpr;
GoodUsers = filter UserPageranks by avgpr > 0.5;
store GoodUsers into '/data/good_users';
Reference (Twitter): typically a Pig script is 5% of the code of native map/reduce, written in about 5% of the time.

18 Ease of programming: a scripting language (Pig Latin). Pig is not SQL. Increases productivity: in one test, 10 lines of Pig Latin did the work of 200 lines of Java, and what took 4 hours to write in Java took 15 minutes in Pig Latin. Provides common functionality (such as join, group, sort, etc.). Extensibility: user-defined functions (UDFs).

19 Pig Basic

20 You can run Pig (execute Pig Latin statements and Pig commands) using various modes.

                  Local Mode   MapReduce Mode
Interactive Mode  Yes          Yes
Batch Mode        Yes          Yes

21 Local mode. Definition: all files are installed and run using your local host and file system. How: use the -x flag: pig -x local. MapReduce mode. Definition: requires access to a Hadoop cluster and HDFS installation. How: this is the default mode; you can, but don't need to, specify it using the -x flag: pig or pig -x mapreduce.

22 By Pig command:
/* local mode */ $ pig -x local ...
/* mapreduce mode */ $ pig ... or $ pig -x mapreduce ...
By Java command:
/* local mode */ $ java -cp pig.jar org.apache.pig.Main -x local ...
/* mapreduce mode */ $ java -cp pig.jar org.apache.pig.Main ... or $ java -cp pig.jar org.apache.pig.Main -x mapreduce ...

23 Interactive mode. Definition: using the Grunt shell. How: invoke the Grunt shell using the "pig" command, then enter your Pig Latin statements and Pig commands interactively at the command line. Example:
grunt> A = load 'passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
$ pig -x local
$ pig
$ pig -x mapreduce

24 Batch mode. Definition: run Pig in batch mode. How: Pig scripts + the pig command. Example:
/* id.pig */
a = LOAD 'passwd' USING PigStorage(':');
b = FOREACH a GENERATE $0 AS id;
STORE b INTO 'id.out';
$ pig -x local id.pig

25 Pig scripts: place Pig Latin statements and Pig commands in a single file. Not required, but it is good practice to identify the file using the *.pig extension. Scripts allow parameter substitution and support comments: for single-line comments use --; for multi-line comments use /* ... */.
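
For instance, a minimal sketch of a script combining both comment styles with parameter substitution (the file name report.pig and all field names here are assumed for illustration):

/* report.pig: count records per user */
visits = LOAD '$input' USING PigStorage(',') AS (user:chararray, url:chararray);
-- group and count per user
grouped = GROUP visits BY user;
counts = FOREACH grouped GENERATE group, COUNT(visits);
STORE counts INTO '$output';

Such a script could then be run with, e.g., pig -param input=/data/visits -param output=/data/counts report.pig.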

26 Pig Basic

27 exec: run a Pig script. Syntax: exec [-param param_name = param_value] [-param_file file_name] [script]
grunt> cat myscript.pig
a = LOAD 'student' AS (name, age, gpa);
b = ORDER a BY name;
STORE b INTO '$out';
grunt> exec -param out=myoutput myscript.pig;

28 run: run a Pig script in the current Grunt shell context (so aliases defined before and inside the script remain accessible). Syntax: run [-param param_name = param_value] [-param_file file_name] script
grunt> cat myscript.pig
b = ORDER a BY name;
c = LIMIT b 10;
grunt> a = LOAD 'student' AS (name, age, gpa);
grunt> run myscript.pig
grunt> d = LIMIT c 3;
grunt> DUMP d;
(alice,20,2.47)
(alice,27,1.95)

29 File commands: cat, cd, copyFromLocal, copyToLocal, cp, ls, mkdir, mv, pwd, rm, rmf, exec, run. Utility commands: help; quit; kill jobid; set debug [on|off]; set job.name 'jobname'.
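
A short Grunt session illustrating a few of these utility commands might look like the following (the job id shown is hypothetical):

grunt> set job.name 'count-visits';
grunt> set debug on;
grunt> ls /data
grunt> kill job_201204251259_0001
grunt> quit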

30 Pig Basic

31 Simple data types:

Type       Description                                       Example
int        Signed 32-bit integer                             10
long       Signed 64-bit integer                             10L
float      32-bit floating point                             10.5F
double     64-bit floating point                             10.5
chararray  Character array (string) in Unicode UTF-8 format  Hello world
bytearray  Byte array (blob)
boolean    true/false (case insensitive)                     true

32 Type: tuple. Description: an ordered set of fields. Example: (20, true). Syntax: ( field [, field ...] ). A tuple is just like a row with one or more fields. Each field can be any data type, and any field may or may not have data.

33 Type: bag. Description: a collection of tuples. Example: {(1, 2), (3, 4)}. Syntax: { tuple [, tuple ...] }. A bag contains multiple tuples (1 to n). Outer bag / inner bag.

34 Type: map. Description: a set of key-value pairs. Example: [open#trendmicro]. Syntax: [ key#value [, key#value ...] ]. As in other languages, keys must be unique within a map.
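
Values are looked up in a map with the # operator. A small sketch based on the example above (the file 'data' is assumed to hold the map shown):

a = LOAD 'data' AS (m:map[]);
b = FOREACH a GENERATE m#'open';  -- yields trendmicro for [open#trendmicro]
DUMP b;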

35 Tuple / Field / Relation (Outer Bag):
Tom Male 10
Mary Female 20
John Male 30
Susan Female
Sam Male 60

36 Conclusion: A field is a piece of data. A tuple is an ordered set of fields. A bag is a collection of tuples. A relation is a bag (more specifically, an outer bag). PIG LATIN STATEMENTS WORK WITH RELATIONS!

37 Case sensitive: names (aliases) of relations and fields; built-in function names. Case insensitive: parameter names; keywords such as LOAD and STORE in Pig Latin (they can also be written as load, store).
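
For example (a sketch; the data file is assumed to exist):

a = LOAD 'data' AS (f1:int);
b = foreach a generate f1;  -- OK: keywords are case insensitive
DUMP A;                     -- error: aliases are case sensitive, so 'A' is undefined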

38 Schemas. Defined using the AS keyword/clause with the LOAD, STREAM, or FOREACH operators. Enable better parse-time error checking and more efficient code execution. Optional, but their use is encouraged.

39 Legal formats: include both the field name and field type; include the field name only (no field type); or do not define a schema at all.

40 John 18 true
Mary 19 false
Bill 20 true
Joe 18 true
a = LOAD 'data' AS (name:chararray, age:int, male:boolean);
a = LOAD 'data' AS (name, age, male);
a = LOAD 'data';

41 (3,8,9) (mary,19)
(1,4,7) (john,18)
(2,5,8) (joe,18)
a = LOAD 'data' AS (F:tuple(f1:int, f2:int, f3:int), T:tuple(t1:chararray, t2:int));
a = LOAD 'data' AS (F:(f1:int, f2:int, f3:int), T:(t1:chararray, t2:int));

42 {(3,8,9)} {(1,4,7)} {(2,5,8)} a = LOAD 'data' AS (B: bag {T: tuple(t1:int, t2:int, t3:int)}); a = LOAD 'data' AS (B: {T: (t1:int, t2:int, t3:int)});

43 What if the field name or field type is not defined? Lacks a field name: you can only refer to the field using positional notation ($0, $1, ...). Lacks a field type: the value is treated as bytearray; you can change the default type using the cast operators.
John 18 true
Mary 19 false
Bill 20 true
Joe 18 true
grunt> a = LOAD 'data';
grunt> b = FOREACH a GENERATE $0;
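
Building on this example, a sketch of casting the untyped (bytearray) fields to usable types:

a = LOAD 'data';
b = FOREACH a GENERATE (chararray)$0 AS name, (int)$1 AS age;
c = FILTER b BY age >= 19;  -- keeps Mary and Bill in the sample above
DUMP c;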

44 Pig Basic

45 Loading Data, Working with Data, Storing Intermediate Results, Storing Final Results, Debugging Pig Latin

46 LOAD 'data' [USING function] [AS schema]; Use the LOAD operator to load data from the file system. The default load function is PigStorage. Example:
a = LOAD 'data' AS (name:chararray, age:int);
a = LOAD 'data' USING PigStorage(':') AS (name, age);

47 Transform data into what you want. Major operators: FOREACH: work with columns of data. FILTER: work with tuples or rows of data. GROUP: group data in a single relation. JOIN: perform an inner join of two or more relations based on common field values. UNION: merge the contents of two or more relations. SPLIT: partition a relation into two or more relations.

48 The default location for intermediate results is /tmp. It can be configured with the pig.temp.dir property.
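
For example (the path is hypothetical), the property can be passed as a Java system property on the command line, and on recent Pig versions it can also be set from Grunt:

$ pig -Dpig.temp.dir=/user/me/pigtmp script.pig
grunt> set pig.temp.dir '/user/me/pigtmp';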

49 STORE relation INTO 'directory' [USING function]; Use the STORE operator to write output to the file system. The default store function is PigStorage. If the directory already exists, the STORE operation will fail. The output files will be named part-nnnnn. Example:
STORE a INTO 'myoutput' USING PigStorage('*');

50 DUMP: display results on your terminal screen (useful). DESCRIBE: review the schema of a relation (useful). EXPLAIN: view the logical, physical, or MapReduce execution plans used to compute a relation. ILLUSTRATE: view the step-by-step execution of a series of statements.

51 18 (John, Mary)
22 (Austin, Sunny)
78 (Richard, Alice)
87 (Johnny,)
I want to find couple(s) both alive and with age > 75.
a = LOAD 'data' AS (age:int, couple:(husband:chararray, wife:chararray));
b = FILTER a BY (age > 75) AND ((couple.husband is not null) AND (couple.husband != '') AND (couple.wife is not null) AND (couple.wife != ''));
c = FOREACH b GENERATE couple AS lucky_couple:(old_man:chararray, old_women:chararray);
STORE c INTO 'population';
DESCRIBE a;
a: {age: int, couple: (husband: chararray, wife: chararray)}

52 (Same data and script as above.)
DUMP a;
(18,(John, Mary))
(22,(Austin, Sunny))
(78,(Richard, Alice))
(87,(Johnny,))

53 (Same data and script as above.)
DUMP b;
(78,(Richard, Alice))

54 (Same data and script as above.)
DESCRIBE c;
c: {lucky_couple: (old_man: chararray, old_women: chararray)}

55 (Same data and script as above.)
Folder population:
(Richard, Alice)

56 Pig Basic

57 Arithmetic operators:

Operator        Symbol
addition        +
subtraction     -
multiplication  *
division        /
modulo          %
bincond         ?:

x = FOREACH a GENERATE f2, f1 + f2;
x = FOREACH a GENERATE f2, (f2 == 1 ? 1 : COUNT(b));

58 Boolean operators:

Operator  Symbol
AND       AND
OR        OR
NOT       NOT

x = FILTER a BY (f1 == 8) OR (NOT (f2 + f3 > f1));

59 Cast operators:

Operator                                Symbol
(simple data type you want to cast to)  (data_type) src_field
to tuple                                (tuple(data_type)) src_field
to bag                                  (bag{tuple(data_type)}) src_field
to map                                  (map[]) src_field

b = FOREACH a GENERATE (int)$0 + 1;
x = FOREACH b GENERATE group, (chararray)COUNT(a) AS total;

60 Cast to tuple: (tuple(data_type)) src_field
(1,2,3)
(4,2,1)
a = LOAD 'data' AS fld;
b = FOREACH a GENERATE (tuple(int,int,float))fld;
DESCRIBE a; DESCRIBE b; DUMP b;

61 Cast to bag: (bag{tuple(data_type)}) src_field
{(1234L)}
{(5678L)}
{(9012)}
a = LOAD 'data' AS fld:bytearray;
b = FOREACH a GENERATE (bag{tuple(long)})fld;
DESCRIBE a; DESCRIBE b; DUMP b;

62 Cast to map: (map[]) src_field
[open#apache]
[apache#hadoop]
[hadoop#pig]
a = LOAD 'data' AS fld;
b = FOREACH a GENERATE (map[])fld;
DESCRIBE a; DESCRIBE b; DUMP b;

63 Comparison operators:

Operator                  Symbol
equal                     ==
not equal                 !=
greater than              >
greater than or equal to  >=
less than                 <
less than or equal to     <=
pattern matching          matches

x = FILTER a BY (f2 == 'apache');
x = FILTER a BY (f1 matches '.*apache.*');

64 Type construction operators:

Operator           Symbol  Note
tuple constructor  ()      Equivalent to TOTUPLE
bag constructor    {}      Equivalent to TOBAG
map constructor    []      Equivalent to TOMAP

65 Tuple constructor () example:
Joe 20
Amy 22
Leo 18
a = LOAD 'students' AS (name:chararray, age:int);
b = FOREACH a GENERATE (name, age);
STORE b INTO 'results';
(Joe,20)
(Amy,22)
(Leo,18)

66 Bag constructor {} example:
Joe 20 3.5
Amy 22 3.2
Leo 18 2.1
a = LOAD 'data' AS (name:chararray, age:int, gpa:float);
b = FOREACH a GENERATE {(name, age)}, {name, age};
STORE b INTO 'results';
{(Joe,20)}  {(Joe),(20)}
{(Amy,22)}  {(Amy),(22)}
{(Leo,18)}  {(Leo),(18)}

67 Map constructor [] example:
Joe 20 3.5
Amy 22 3.2
Leo 18 2.1
a = LOAD 'students' AS (name:chararray, age:int, gpa:float);
b = FOREACH a GENERATE [name, gpa];
STORE b INTO 'results';
[Joe#3.5]
[Amy#3.2]
[Leo#2.1]

68 Null operators:

Operator     Symbol
is null      is null
is not null  is not null

x = FILTER a BY f1 is not null;

69 Load data from the file system. The default load function is PigStorage; data can also be loaded with a customized UDF. LOAD 'data' [USING function] [AS schema];
a = LOAD 'myfile.txt' USING PigStorage('\t');
a = LOAD 'myfile.txt' AS (f1:int, f2:int, f3:int);

70 Save results to the file system. The default store function is PigStorage; it can be customized with a UDF. If the output directory already exists, the operation will fail. STORE alias INTO 'directory' [USING function];
STORE a INTO 'myoutput' USING PigStorage('*');

71 Groups the data in one or more relations. The result of a GROUP operation is a relation that includes one tuple per group. The tuple contains two fields: the first field, named group, has the same type as the group key; the second field takes the name of the original relation and is of type bag.

72 alias = GROUP alias { ALL | BY expression } [, alias { ALL | BY expression } ...];
John F Mary F Bill F Joe F
a = LOAD 'data' AS (name:chararray, age:int, gpa:float);
b = GROUP a BY age;
c = FOREACH b GENERATE group, COUNT(a);
DESCRIBE b; DUMP b; DUMP c;

73 Basically, the GROUP and COGROUP operators are identical. For readability GROUP is used in statements involving one relation and COGROUP is used in statements involving two or more relations.

74 alias = COGROUP alias { ALL | BY expression } [, alias { ALL | BY expression } ...];
data1 (owner, pet):   data2 (friend1, friend2):
Alice turtle          Cindy Alice
Alice goldfish        Mark Alice
Alice cat             Paul Bob
Bob dog               Paul Jane
Bob cat
a = LOAD 'data1' AS (owner:chararray, pet:chararray);
b = LOAD 'data2' AS (friend1:chararray, friend2:chararray);
x = COGROUP a BY owner, b BY friend2;
DESCRIBE x; DUMP x;

75 Selects tuples from a relation based on some condition. alias = FILTER alias BY expression; x = FILTER a BY (f1 == 8) OR (NOT (f2+f3 > f1));

76 Removes duplicate tuples in a relation. You cannot use DISTINCT on a subset of fields; to do this, use FOREACH with a nested block to first select the fields and then apply DISTINCT (see the sketch below). alias = DISTINCT alias;
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
x = DISTINCT a;
(1,2,3)
(4,3,3)
(8,3,4)
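
A sketch of that nested-block workaround (field names are assumed):

a = LOAD 'data' AS (a1:int, a2:int, a3:int);
b = GROUP a BY a1;
x = FOREACH b {
    fields = a.(a2, a3);      -- select the subset of fields
    uniq = DISTINCT fields;   -- de-duplicate within the nested block
    GENERATE group, COUNT(uniq);
}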

77 Sorts a relation based on one or more fields. Currently, ordering is supported on fields with simple types or by the tuple designator (*). You cannot order on fields with complex types or by expressions.
alias = ORDER alias BY { * [ASC|DESC] | field [ASC|DESC] [, field [ASC|DESC] ...] };
x = ORDER a BY a3 DESC;

78 Performs an inner join of two or more relations based on common field values. Inner joins ignore null keys, so it makes sense to filter them out before the join (see the sketch below).
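
A minimal sketch of that null-filtering pattern, using the relations from the next slide:

a = LOAD 'data1' AS (a1:int, a2:int, a3:int);
b = LOAD 'data2' AS (b1:int, b2:int);
a_nn = FILTER a BY a1 is not null;  -- drop null join keys up front
b_nn = FILTER b BY b1 is not null;
x = JOIN a_nn BY a1, b_nn BY b1;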

79 alias = JOIN alias BY field (, alias BY field); a = LOAD 'data1' AS (a1:int,a2:int,a3:int); b = LOAD 'data2' AS (b1:int,b2:int); x = JOIN a BY a1, b BY b1; DESCRIBE x; DUMP x;

80 Un-nests a tuple or bag, changing its structure. FLATTEN(tuple/bag)

81 18 (John, Mary)
22 (Austin, Sunny)
78 (Richard, Alice)
87 (Johnny,)
a = LOAD 'data' AS (age:int, couple:(husband:chararray, wife:chararray));
b = FOREACH a GENERATE age, couple;
c = FOREACH a GENERATE age, FLATTEN(couple);
DESCRIBE b; DUMP b; DESCRIBE c; DUMP c;

82 18 (John, Mary)
22 (Austin, Sunny)
22 (Tom, Rosa)
78 (Richard, Alice)
87 (Johnny,)
a = LOAD 'data' AS (age:int, couple:(husband:chararray, wife:chararray));
b = GROUP a BY age;
c = FOREACH b GENERATE FLATTEN(a);
DESCRIBE b; DUMP b; DESCRIBE c; DUMP c;

83 Computes the union of two or more relations. Does not eliminate duplicate tuples. Does not ensure (as databases do) that all tuples adhere to the same schema; in the simplest case, you should ensure that the tuples in the input relations have the same schema.

84 alias = UNION alias, alias [, alias ...];
a = LOAD 'data1' AS (a1:int, a2:int, a3:int);
b = LOAD 'data2' AS (b1:int, b2:int, b3:int);
x = UNION a, b;
DESCRIBE x; DUMP x;

85 Partitions a relation into two or more relations. Result: a tuple may be assigned to more than one relation, or may not be assigned to any relation.

86 SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression ...] [, alias OTHERWISE];
a = LOAD 'data1' AS (a1:int, a2:int, a3:int);
SPLIT a INTO x IF a1 == 7, y IF (a1 < 5 AND a2 < 6), z OTHERWISE;
DUMP x; DUMP z; DUMP y;

87 alias = LIMIT alias n; Limits the number of output tuples. Note: there is no guarantee which n tuples will be returned, and the tuples that are returned can change from one run to the next.
a = LOAD 'data1' AS (a1:int, a2:int, a3:int);
x = LIMIT a 2;
DUMP x;

88 alias = FOREACH alias GENERATE expression [AS schema] [, expression [AS schema] ...]; Generates data transformations based on columns of data.

89 a = LOAD 'data' AS (a1:int, a2:int, a3:int);
x = FOREACH a GENERATE a1 + a2 AS f1:int, (a3 == 3 ? TOTUPLE(1,1) : TOTUPLE(0,0));
DUMP x;

90 FOREACH can work with nested blocks. FOREACH statements can be nested to two levels only. Valid nested operators: CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY. The GENERATE keyword must be the last statement within the nested block.

91 MARY,22,female
RICHARD,26,male
AUSTIN,28,male
JOHN,52,male
TOM,40,male
JIJI,38,female
a = LOAD 'data' USING PigStorage(',') AS (name:chararray, age:int, gender:chararray);
b = GROUP a BY gender;
c = FOREACH b {
    c0 = FILTER a BY (age > 20) AND (age < 30);
    GENERATE group, COUNT(c0) AS young;
}
DESCRIBE b; DESCRIBE c; DUMP b; DUMP c;

92 MARY,female,LA
RICHARD,male,Taipei
AUSTIN,male,Taipei
TOM,male,Taichung
JOHN,male
JIJI,female
a = LOAD 'data' USING PigStorage(',') AS (name:chararray, gender:chararray, location:chararray);
b = GROUP a BY gender;
c = FOREACH b {
    c0 = FILTER a BY (location is not null) AND (location != '');
    c1 = ORDER c0 BY name;
    GENERATE group, c1;
}
DESCRIBE c; DUMP c;

93 Goal: become familiar with basic Pig Latin usage, including: being able to run a Pig script; load and store data using Pig; parse data using Pig; and apply simple operations to data.

94


More information

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')

More information

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

Hadoop Lab Notes. Nicola Tonellotto November 15, 2010

Hadoop Lab Notes. Nicola Tonellotto November 15, 2010 Hadoop Lab Notes Nicola Tonellotto November 15, 2010 2 Contents 1 Hadoop Setup 4 1.1 Prerequisites........................................... 4 1.2 Installation............................................

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Running Hadoop on Windows CCNP Server

Running Hadoop on Windows CCNP Server Running Hadoop at Stirling Kevin Swingler Summary The Hadoopserver in CS @ Stirling A quick intoduction to Unix commands Getting files in and out Compliing your Java Submit a HadoopJob Monitor your jobs

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

Big Data Course Highlights

Big Data Course Highlights Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

More information

Hadoop Hands-On Exercises

Hadoop Hands-On Exercises Hadoop Hands-On Exercises Lawrence Berkeley National Lab July 2011 We will Training accounts/user Agreement forms Test access to carver HDFS commands Monitoring Run the word count example Simple streaming

More information

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到

More information