
1 BIWA 2015 Big Data Lab Java MapReduce WordCount/Table JOIN Big Data Loader Arijit Das Greg Belli Erik Lowney Nick Bitto

2 Introduction
NPS Introduction
Hadoop File System Background
WordCount & modifications (Lab Hour 1)
Table JOIN & modifications (Lab Hour 2)
Bonus Lab: Big Data Loader
References

3 Naval Postgraduate School
Graduate school run by the Navy.
MS/PhD and research programs.
Active Duty, Civilians & Contractors.
Presenters are from the CS Dept.
The project is sponsored by the Marine Corps.
We also work with the private sector.
Located in Monterey, CA.

4 Current Work at NPS
The NPS team has been working on Big Data.
Apache Hadoop was the first effort.
The Navy has a license with Oracle.
The team has been looking at Cloudera HDFS.
Future work will involve the Big Data Appliance.

5 HDFS Introduction
Apache Hadoop: open source
Large-scale data processing
Cluster environment
Commodity hardware
Global community: contributors and users

6 HDFS Introduction
Based on papers: Google's MapReduce and the Google File System (GFS).
HDFS is fault tolerant: hardware failures are common, and the software handles the failures.

7 HDFS Introduction
The Apache Hadoop framework is composed of the following modules:
Hadoop Common
Hadoop Distributed File System (HDFS)
Hadoop YARN
Hadoop MapReduce

8 Basic HDFS Architecture

9 HDFS Basic Interaction

10 HDFS/Oracle DB/Middleware

11 Patching Hadoop Nodes

12 Lab: Getting Started
Putty (ssh): Linux command line.
VNC viewer (remote desktop for Linux).
Custom install the viewer (client) only.
Do NOT install the server (not needed).
Do NOT log out from the VNC. Kill it from the Task Manager instead.

13 Lab: Getting Started (Connect to the VM)
Putty (ssh), IP address (command line).
VNC viewer, IP address:1 (GUI).
Open 2 terminal windows side by side.
cd /home/cloudera/nocoug (in both terminals)
View Hadoop_Commands.txt (in one terminal).
Run the commands (in the other terminal); for repeat execution, change the output to run2, run3, run4, ...
View output in VNC (browser): click on "Browse the filesystem".
Look in: NoCoug/Output/WordCount.. or NoCoug/Output/StudentJoin..
Look in file: part-r...

14 Lab: Getting Started
Working folder (/home/cloudera/nocoug):
WordCount1.java (original)
WordCount2.java (same as WordCount1; use to make changes)
StudentJoin1.java (original)
StudentJoin2.java (same as StudentJoin1; use to make changes)
StudentJoin3.java (same as StudentJoin1; use to make changes)
Solution folder (/home/cloudera/nocoug/solutions):
WordCount2.java (solution code)
StudentJoin2.java (solution code)
StudentJoin3.java (solution code)
Data folder (/home/cloudera/nocoug/sampledata):
Data has already been moved to HDFS (WordCount & StudentJoin).
pg1661.txt (WordCount1.java & WordCount2.java)
grades.txt (StudentJoin1.java/StudentJoin2.java/StudentJoin3.java)
students.txt (StudentJoin1.java/StudentJoin2.java/StudentJoin3.java)
hometown.txt (StudentJoin3.java)

15 WordCount

16 WordCount
WordCount counts all words in a file.
Run WordCount1, review the output.
WordCount Lab 1:
Remove all special characters from the output.
Make sure case does not affect counts.
Print only the words that have a count > 10.
Where do you do these changes? Map class? Reduce class? Both? (A sketch of one possible approach follows.)
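
Here is a minimal sketch of one way to do it, not the official solution (which is in /home/cloudera/nocoug/solutions/WordCount2.java): normalize each token in the mapper, and filter by count in the reducer. It assumes the WordCount1 mapper and reducer shown later in this deck.

    // Sketch only. Mapper change: strip special characters and ignore case.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().split(" ");
        for (String str : words) {
            String cleaned = str.replaceAll("[^a-zA-Z0-9]", "").toUpperCase();
            if (!cleaned.isEmpty()) {      // skip tokens that were all punctuation
                word.set(cleaned);
                context.write(word, one);
            }
        }
    }

    // Reducer change: only emit words seen more than 10 times.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable val : values) {
            total++;
        }
        if (total > 10) {
            context.write(key, new IntWritable(total));
        }
    }

Note where each change lives: normalization belongs in the Map class, so identical words share one key, while the count filter belongs in the Reduce class, since only there is the final count known.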

17 WordCount
WordCount1 output:
"'A 1
"'About 1
"'Absolute 1
"'Ah!' 2
"'Ah, 2
"'Ample.' 1
"'And 10
"'Arthur!' 1
"'As 1
"'At 1
"'Because 1
"'Breckinridge, 1
"'But 1
"'But, 1
"'But,' 1
"'Certainly 2
"'Certainly,' 1
WordCount2 output:
A 2681
ABLE 31
ABOUT 176
ABOVE 27
ABSOLUTE 14
ABSOLUTELY 27
ACCOUNT 20
ACQUAINTANCE 11
ACROSS 37
ACTION 12
ADDRESS 26
ADLER 16
ADVENTURE 25
ADVENTURES 11
ADVERTISEMENT 19
ADVICE 19
AFFAIR 14
AFRAID 21
AFTER 99
AFTERNOON 15
AFTERWARDS 18
AGAIN 66
AGAINST 53
AGE 14
AGO 27

18 StudentJoin
Joins 2 tables: students & grades.
Counts the number of classes taken.
Calculates the average GPA.
Lab 1: Add the student level (Freshman or ...).
Lab 2: Do a 3-table join and show the student hometown. Table 3 is in the file hometown.txt (see the sketch after this slide).
Where do you do these changes? Map class? Reduce class? Both?
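
Since the StudentJoin source lives only on the lab VM, here is a hypothetical sketch of the Lab 2 change, assuming StudentJoin follows the same tag-and-join pattern as the ReduceJoin walkthrough later in this deck: add a third mapper that tags records from hometown.txt, register it in main with MultipleInputs, and handle the new tag in the reducer.

    // Hypothetical third mapper, tagging hometown.txt records.
    public static class HometownRecordMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            // key: student id, value: "hometown" tag plus the town
            context.write(new Text(parts[0]), new Text("hometown\t" + parts[1]));
        }
    }

    // In main(), register the extra input alongside the other two:
    //   MultipleInputs.addInputPath(job, new Path(args[2]),
    //       TextInputFormat.class, HometownRecordMapper.class);
    // In the reducer loop, add one more branch:
    //   else if (parts[0].equals("hometown")) { hometown = parts[1]; }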

19 StudentJoin
StudentJoin1 output:
Joe Smith
Jane Miller
John Roberts
Stephanie Jones
StudentJoin2 output:
Joe Smith  Junior
Jane Miller  Sophomore
John Roberts  Senior
Stephanie Jones  Freshman
StudentJoin3 output:
Joe Smith  Monterey, CA
Jane Miller  San Jose, CA
John Roberts  Oakland, CA
Stephanie Jones  Santa Cruz, CA

20 WordCount Outline

import java.io.IOException;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;

public class WordCount1 {
    public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> { ... }
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> { ... }
    public static void main(String[] args) throws Exception { ... }
}

21 Main Function

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount1.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

22 Main Function
"word count" sets the name of the job.
setJarByClass names the class file that will run the job.
setMapperClass and setReducerClass name the classes that will run the mapper and reducer. In this case the mapper and reducer are nested classes inside the job class.
setOutputKeyClass and setOutputValueClass set the classes for the key field and the value field. The key field will be text and the value field will be an integer. This makes sense in a word count, since the word is the key and how often it appears is the value.

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount1.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

23 Main Function
The word count application has two arguments. The first argument is the input file that we will run the word count on. The second is the filepath where the word count application will write out the output text.
The last line will start the job and then block until the job is complete. If the job completes successfully the application will exit with a return code of 0; if there is an error it will exit with a return code of 1.

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount1.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

24 Mapper Function

public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().split(" ");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}

25 Mapper Function
This is the WordCountMapper class declaration. It extends the Mapper class, which requires four type arguments: KEYIN, VALUEIN, KEYOUT, VALUEOUT. This is where the input and output datatypes for the keys and values are set.
The KEYIN field is set to Object; by default Hadoop makes the key the byte offset of the line in the input text file. The VALUEIN field is set to Text and will contain the line of text from the input file.
The KEYOUT field is set to Text; in the word count application the output keys will be the individual words. The VALUEOUT field is set to IntWritable and will contain a count of how often the word appears.
Text is comparable to String, and IntWritable is comparable to Integer. The difference is that the Text and IntWritable classes implement interfaces that are used by Hadoop to run the MapReduce.

public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().split(" ");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}

26 Mapper Function
Create a constant variable named one and set it equal to 1. The word count works by splitting up each line by word and assigning each word the value of 1. In the reduce phase the application will count the similar words.
Create a Text variable named word to hold each individual word once the input Text line is split.
The map function is run when the Mapper is called on the Data Node. Key is the byte offset and is never used in the word count job. Value is the Text line from the input file. Context is a class that allows the task to write data out.

public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().split(" ");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}

27 Mapper Function
A string array of every word in the input line is created by splitting the Text at every space character. Each word in the array is cast from a String into a Text.
The Context class is used to write out the intermediate data. The KEYOUT is the Text word that was set in the previous line. The VALUEOUT is the IntWritable that was set to 1.
The end result of the Map phase is a collection of key-value pairs in the form [word, 1]. Multiple instances of the same word will be added up in the Reduce phase.

public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().split(" ");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}

28 Reducer Function

public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable val : values) {
            total++;
        }
        context.write(key, new IntWritable(total));
    }
}

29 Reducer Function
The reduce function is called on the individual datanodes after the map phase has completed. The function has three parameters. The first parameter is the key that the datanode will process. The second parameter is an iterable list of values for that key. The third parameter is the context, which is used to write out the result.
The for loop counts how many values were found for the given key. In the word count example, this means counting how many times an individual word appeared. The value in each key-value pair was set to one, so instead of adding the value in the IntWritable, we can just increment total once for each object in the iterable.

public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable val : values) {
            total++;
        }
        context.write(key, new IntWritable(total));
    }
}
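
An aside on the total++ shortcut: it is correct here because every map output value is 1, but it would break if a combiner pre-aggregated the counts. A variant that also tolerates a combiner sums the values instead:

    // Combiner-safe variant (not the lab code): sum the IntWritable values
    // so pre-aggregated partial counts are still added up correctly.
    for (IntWritable val : values) {
        total += val.get();
    }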

30 ReduceJoin Outline

import java.io.IOException;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;

public class ReduceJoin {
    public static class SalesRecordMapper extends Mapper<Object, Text, Text, Text> { ... }
    public static class AccountRecordMapper extends Mapper<Object, Text, Text, Text> { ... }
    public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text> { ... }
    public static void main(String[] args) throws Exception { ... }
}

31 ReduceJoin Purpose
Join the Accounts table (account number, name) with the Sales table (account number, sale amount) on the account number.
Accounts:
001  John Allen
002  Abigail Smith
003  April Stevens
004  Nasser Hafez

32 Main Function

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Reduce-side join");
    job.setJarByClass(ReduceJoin.class);
    job.setReducerClass(ReduceJoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, SalesRecordMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, AccountRecordMapper.class);
    Path outputPath = new Path(args[2]);
    FileOutputFormat.setOutputPath(job, outputPath);
    outputPath.getFileSystem(conf).delete(outputPath);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

33 Main Function
The main function is similar to the WordCount program.

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Reduce-side join");
    job.setJarByClass(ReduceJoin.class);
    job.setReducerClass(ReduceJoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, SalesRecordMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, AccountRecordMapper.class);
    Path outputPath = new Path(args[2]);
    FileOutputFormat.setOutputPath(job, outputPath);
    outputPath.getFileSystem(conf).delete(outputPath);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

34 Main Function
Differences:
Multiple file inputs.
Separate Mapper class for each input file.
The program deletes any previously created file in the output path. (Hadoop will not start a job whose output directory already exists, so it must be removed first.)

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Reduce-side join");
    job.setJarByClass(ReduceJoin.class);
    job.setReducerClass(ReduceJoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, SalesRecordMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, AccountRecordMapper.class);
    Path outputPath = new Path(args[2]);
    FileOutputFormat.setOutputPath(job, outputPath);
    outputPath.getFileSystem(conf).delete(outputPath);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

35 Sales Record Mapper
The Sales Record Mapper takes each line from the Sales file, casts it as a String, then uses the split function to separate the line into a String array based on where the tabs are in the line.

public static class SalesRecordMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        String[] parts = record.split("\t");
        context.write(new Text(parts[0]), new Text("sales\t" + parts[1]));
    }
}

36 Sales Record Mapper
The Sales Record Mapper then outputs the data. The Key is a Text datatype containing the account number. The Value is a Text datatype containing the word sales concatenated with the price.

public static class SalesRecordMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        String[] parts = record.split("\t");
        context.write(new Text(parts[0]), new Text("sales\t" + parts[1]));
    }
}

37 Sales Record Mapper
An example output from the Sales Record Mapper is: [001, sales\t35.99]. There is a tab character between sales and 35.99. Keep the concatenation in mind, as it will be important in the Reduce phase.

public static class SalesRecordMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        String[] parts = record.split("\t");
        context.write(new Text(parts[0]), new Text("sales\t" + parts[1]));
    }
}

38 Account Record Mapper
The Account Record Mapper takes each line from the Accounts file, casts it as a String, then uses the split function to separate the line into a String array based on where the tabs are in the line. A name in the file contains a space, but since the split is on tabs, the full name is considered one element.

public static class AccountRecordMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        String[] parts = record.split("\t");
        context.write(new Text(parts[0]), new Text("accounts\t" + parts[1]));
    }
}

001  John Allen
002  Abigail Smith
003  April Stevens
004  Nasser Hafez

39 Account Record Mapper
The Account Record Mapper then outputs the data. The Key is a Text datatype containing the account number. The Value is a Text datatype containing the word accounts concatenated with the person's name.

public static class AccountRecordMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        String[] parts = record.split("\t");
        context.write(new Text(parts[0]), new Text("accounts\t" + parts[1]));
    }
}

001  John Allen
002  Abigail Smith
003  April Stevens
004  Nasser Hafez

40 Account Record Mapper
An important point to note is that both the SalesRecordMapper and the AccountRecordMapper output the same field as the key. This is what will facilitate the join. Also note the concatenation in the AccountRecordMapper. It is similar to the SalesRecordMapper.

public static class AccountRecordMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        String[] parts = record.split("\t");
        context.write(new Text(parts[0]), new Text("accounts\t" + parts[1]));
    }
}

001  John Allen
002  Abigail Smith
003  April Stevens
004  Nasser Hafez

41 ReduceJoin Reducer
Data sent to an individual datanode during the Reduce phase is partitioned based on the key field. Since the Sales data and the Accounts data both use the account number as the key field, each datanode will receive all of the data pertaining to a single salesperson. In the reduce phase we want to process and output data about each salesperson.

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int count = 0;
    for (Text t : values) {
        String parts[] = t.toString().split("\t");
        if (parts[0].equals("sales")) {
            count++;
            total += Float.parseFloat(parts[1]);
        } else if (parts[0].equals("accounts")) {
            name = parts[1];
        }
    }
    String str = String.format("%d\t%f", count, total);
    context.write(new Text(name), new Text(str));
}

42 ReduceJoin Reducer
Three variables are created to store the salesperson's data. The name field will contain the salesperson's name. Total will contain the total cost of the items sold. Count will contain the total number of sales by that salesperson.

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int count = 0;
    for (Text t : values) {
        String parts[] = t.toString().split("\t");
        if (parts[0].equals("sales")) {
            count++;
            total += Float.parseFloat(parts[1]);
        } else if (parts[0].equals("accounts")) {
            name = parts[1];
        }
    }
    String str = String.format("%d\t%f", count, total);
    context.write(new Text(name), new Text(str));
}

43 ReduceJoin Reducer
Loop through every value for the key assigned to the datanode. In the ReduceJoin program, the values will be in one of two forms: sales\t35.99 or accounts\tJohn Allen.

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int count = 0;
    for (Text t : values) {
        String parts[] = t.toString().split("\t");
        if (parts[0].equals("sales")) {
            count++;
            total += Float.parseFloat(parts[1]);
        } else if (parts[0].equals("accounts")) {
            name = parts[1];
        }
    }
    String str = String.format("%d\t%f", count, total);
    context.write(new Text(name), new Text(str));
}

44 ReduceJoin Reducer
The value is cast to a String and then split based on the tab character. Doing this allows the reducer to process the value differently depending on whether we concatenated sales or accounts to the value during the Map phase.

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int count = 0;
    for (Text t : values) {
        String parts[] = t.toString().split("\t");
        if (parts[0].equals("sales")) {
            count++;
            total += Float.parseFloat(parts[1]);
        } else if (parts[0].equals("accounts")) {
            name = parts[1];
        }
    }
    String str = String.format("%d\t%f", count, total);
    context.write(new Text(name), new Text(str));
}

45 ReduceJoin Reducer
Values concatenated with sales contain the sales data for the salesperson. Each value processed indicates a sale, so the program will increment the count variable and add the sale amount to the running total.

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int count = 0;
    for (Text t : values) {
        String parts[] = t.toString().split("\t");
        if (parts[0].equals("sales")) {
            count++;
            total += Float.parseFloat(parts[1]);
        } else if (parts[0].equals("accounts")) {
            name = parts[1];
        }
    }
    String str = String.format("%d\t%f", count, total);
    context.write(new Text(name), new Text(str));
}

46 ReduceJoin Reducer
Values concatenated with accounts contain the salesperson's name. The name is assigned to the name variable.

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int count = 0;
    for (Text t : values) {
        String parts[] = t.toString().split("\t");
        if (parts[0].equals("sales")) {
            count++;
            total += Float.parseFloat(parts[1]);
        } else if (parts[0].equals("accounts")) {
            name = parts[1];
        }
    }
    String str = String.format("%d\t%f", count, total);
    context.write(new Text(name), new Text(str));
}

47 ReduceJoin Reducer
The sales data is formatted and output using the Context class. The format is a key-value pair: [John Allen, 3\t124.93].

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int count = 0;
    for (Text t : values) {
        String parts[] = t.toString().split("\t");
        if (parts[0].equals("sales")) {
            count++;
            total += Float.parseFloat(parts[1]);
        } else if (parts[0].equals("accounts")) {
            name = parts[1];
        }
    }
    String str = String.format("%d\t%f", count, total);
    context.write(new Text(name), new Text(str));
}

48 Hadoop Commands
Listing the files in HDFS:
hdfs dfs -ls /NoCoug/Output
Removing files in HDFS:
hdfs dfs -rm -r -f /NoCoug/Output/WordCount2_run1
Full reference:

49 Oracle Loader for Hadoop
Parallel load, optimized for Hadoop.
Automatic load balancing.
Convert to Oracle format on Hadoop; save database CPU.
Input formats: text, Avro, Parquet, Sequence files, Hive, log files, JSON, compressed files, and more.
Load specific Hive partitions.
Kerberos authentication.
Load directly into an In-Memory table.

50 Oracle Loader
Unzip OraLoader.
Set OLH_HOME.
The loader runs as a MapReduce job.
The job configuration file is an .xml file that contains data mappings and database connection information.

hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader -conf job.xml

51 Oracle Loader
Start up the database:
sqlplus / as sysdba
startup
exit
Start the listener:
lsnrctl start
SQL Developer: find the table WORDCOUNT, right-click on the table, and select Truncate.
SQL*Plus:
sqlplus ORACLELOADER/Biwa2015
truncate table WORDCOUNT

52 Job Configuration file (job.xml)

<!-- Input settings -->
<property>
  <name>mapreduce.inputformat.class</name>
  <value>oracle.hadoop.loader.lib.input.DelimitedTextInputFormat</value>
</property>
<property>
  <name>mapred.input.dir</name>
  <value>/NoCoug/WordCount2_run3</value>
</property>
<property>
  <name>oracle.hadoop.loader.input.fieldTerminator</name>
  <value>\u0009</value>
</property>
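
The slide shows only the input settings. For orientation, here is a sketch of the output and connection sections a complete OLH job.xml also carries; the property names follow the Oracle Loader for Hadoop documentation and should be checked against your OLH release, and the connection values below are placeholders built from the credentials and table used elsewhere in this lab.

<!-- Output settings (sketch; verify property names against your OLH version) -->
<property>
  <name>mapreduce.outputformat.class</name>
  <value>oracle.hadoop.loader.lib.output.JDBCOutputFormat</value>
</property>
<property>
  <name>oracle.hadoop.loader.loaderMap.targetTable</name>
  <value>WORDCOUNT</value>
</property>
<!-- Connection settings (placeholder values) -->
<property>
  <name>oracle.hadoop.loader.connection.url</name>
  <value>jdbc:oracle:thin:@//localhost:1521/orcl</value>
</property>
<property>
  <name>oracle.hadoop.loader.connection.user</name>
  <value>ORACLELOADER</value>
</property>
<property>
  <name>oracle.hadoop.loader.connection.password</name>
  <value>Biwa2015</value>
</property>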


55 HDFS References
Java MapReduce class documentation:
HDFS command documentation:
Book: Hadoop Beginner's Guide by Garry Turkington [Packt Publishing]
Wikipedia (5 steps of MapReduce):
NPS publications repository:

56 Conclusion
Questions?
Contact: Arijit Das
