
1 BIWA 2015 Big Data Lab Java MapReduce WordCount/Table JOIN Big Data Loader Arijit Das Greg Belli Erik Lowney Nick Bitto

2 Introduction
NPS Introduction
Hadoop File System Background
WordCount & modifications (Lab Hour 1)
Table JOIN & modifications (Lab Hour 2)
Bonus Lab: Big Data Loader
References

3 Naval Postgraduate School
Graduate school run by the Navy.
MS/PhD and research programs.
Active Duty, Civilians & Contractors.
Presenters are from the CS Dept.
The project is sponsored by the Marine Corps.
We also work with the private sector.
Located in Monterey, CA.

4 Current Work at NPS
The NPS team has been working on Big Data.
Apache Hadoop was the first effort.
The Navy has a license with Oracle.
The team has been looking at Cloudera HDFS.
Future work will involve the Big Data Appliance.

5 HDFS Introduction
Apache Hadoop: open source
Large-scale data processing
Cluster environment
Commodity hardware
Global community: contributors and users

6 HDFS Introduction
Based on papers: Google's MapReduce and the Google File System (GFS).
HDFS is fault tolerant: hardware failures are common, and the software handles the failures.

7 HDFS Introduction
The Apache Hadoop framework is composed of the following modules:
Hadoop Common
Hadoop Distributed File System (HDFS)
Hadoop YARN
Hadoop MapReduce

8 Basic HDFS Architecture

9 HDFS Basic Interaction

10 HDFS/Oracle DB/Middleware

11 Patching Hadoop Nodes

12 Lab: Getting Started
Putty (ssh): Linux command line.
VNC viewer (remote desktop for Linux).
Custom install the viewer (client) only.
Do NOT install the server (not needed).
Do NOT log out from the VNC. Kill it from the Task Manager instead.

13 Lab: Getting Started (Connect to the VM)
Putty (ssh), IP address (command line).
VNC viewer, IP address:1 (GUI).
Open 2 terminal windows side by side.
cd /home/cloudera/nocoug (in both terminals)
View Hadoop_Commands.txt (in one terminal).
Run the commands (in the other terminal); for repeat execution, change the output to run2, run3, run4, ...
View output in VNC (browser): click on "Browse the filesystem".
Look in: NoCoug/Output/WordCount.. or NoCoug/Output/StudentJoin..
Look in file: part-r...

14 Lab: Getting Started
Working folder (/home/cloudera/nocoug):
WordCount1.java (original)
WordCount2.java (same as WordCount1; use to make changes)
StudentJoin1.java (original)
StudentJoin2.java (same as StudentJoin1; use to make changes)
StudentJoin3.java (same as StudentJoin1; use to make changes)
Solution folder (/home/cloudera/nocoug/solutions):
WordCount2.java (solution code)
StudentJoin2.java (solution code)
StudentJoin3.java (solution code)
Data folder (/home/cloudera/nocoug/sampledata):
Data has already been moved to HDFS (WordCount & StudentJoin).
pg1661.txt (WordCount1.java & WordCount2.java)
grades.txt (StudentJoin1.java/StudentJoin2.java/StudentJoin3.java)
students.txt (StudentJoin1.java/StudentJoin2.java/StudentJoin3.java)
hometown.txt (StudentJoin3.java)

15 WordCount

16 WordCount
WordCount counts all words in a file.
Run WordCount1, review the output.
WordCount Lab 1:
Remove all special characters from the output.
Make sure case does not affect counts.
Print only the words that have a count > 10.
Where do you do these changes? Map class? Reduce class? Both? (A sketch of one possible approach follows.)
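
Here is a minimal sketch of one way to do it, not the official solution (which is in /home/cloudera/nocoug/solutions/WordCount2.java): normalize each token in the mapper, and filter by count in the reducer. It assumes the WordCount1 mapper and reducer shown later in this deck.

    // Sketch only. Mapper change: strip special characters and ignore case.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().split(" ");
        for (String str : words) {
            String cleaned = str.replaceAll("[^a-zA-Z0-9]", "").toUpperCase();
            if (!cleaned.isEmpty()) {      // skip tokens that were all punctuation
                word.set(cleaned);
                context.write(word, one);
            }
        }
    }

    // Reducer change: only emit words seen more than 10 times.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable val : values) {
            total++;
        }
        if (total > 10) {
            context.write(key, new IntWritable(total));
        }
    }

Note where each change lives: normalization belongs in the Map class, so identical words share one key, while the count filter belongs in the Reduce class, since only there is the final count known.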

17 WordCount
WordCount1 output:
"'A 1
"'About 1
"'Absolute 1
"'Ah!' 2
"'Ah, 2
"'Ample.' 1
"'And 10
"'Arthur!' 1
"'As 1
"'At 1
"'Because 1
"'Breckinridge, 1
"'But 1
"'But, 1
"'But,' 1
"'Certainly 2
"'Certainly,' 1
WordCount2 output:
A 2681
ABLE 31
ABOUT 176
ABOVE 27
ABSOLUTE 14
ABSOLUTELY 27
ACCOUNT 20
ACQUAINTANCE 11
ACROSS 37
ACTION 12
ADDRESS 26
ADLER 16
ADVENTURE 25
ADVENTURES 11
ADVERTISEMENT 19
ADVICE 19
AFFAIR 14
AFRAID 21
AFTER 99
AFTERNOON 15
AFTERWARDS 18
AGAIN 66
AGAINST 53
AGE 14
AGO 27

18 StudentJoin
Joins 2 tables: students & grades.
Counts the number of classes taken.
Calculates the average GPA.
Lab 1: Add the student level (Freshman or ...).
Lab 2: Do a 3-table join and show the student hometown. Table 3 is in the file hometown.txt (see the sketch after this slide).
Where do you do these changes? Map class? Reduce class? Both?
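
Since the StudentJoin source lives only on the lab VM, here is a hypothetical sketch of the Lab 2 change, assuming StudentJoin follows the same tag-and-join pattern as the ReduceJoin walkthrough later in this deck: add a third mapper that tags records from hometown.txt, register it in main with MultipleInputs, and handle the new tag in the reducer.

    // Hypothetical third mapper, tagging hometown.txt records.
    public static class HometownRecordMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            // key: student id, value: "hometown" tag plus the town
            context.write(new Text(parts[0]), new Text("hometown\t" + parts[1]));
        }
    }

    // In main(), register the extra input alongside the other two:
    //   MultipleInputs.addInputPath(job, new Path(args[2]),
    //       TextInputFormat.class, HometownRecordMapper.class);
    // In the reducer loop, add one more branch:
    //   else if (parts[0].equals("hometown")) { hometown = parts[1]; }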

19 StudentJoin
StudentJoin1 output:
Joe Smith
Jane Miller
John Roberts
Stephanie Jones
StudentJoin2 output:
Joe Smith  Junior
Jane Miller  Sophomore
John Roberts  Senior
Stephanie Jones  Freshman
StudentJoin3 output:
Joe Smith  Monterey, CA
Jane Miller  San Jose, CA
John Roberts  Oakland, CA
Stephanie Jones  Santa Cruz, CA

20 WordCount Outline

import java.io.IOException;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;

public class WordCount1 {
    public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> { ... }
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> { ... }
    public static void main(String[] args) throws Exception { ... }
}

21 Main Function

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount1.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

22 Main Function
"word count" sets the name of the job.
setJarByClass names the class file that will run the job.
setMapperClass and setReducerClass name the classes that will run the mapper and reducer. In this case the mapper and reducer are nested classes inside the job class.
setOutputKeyClass and setOutputValueClass set the classes for the key field and the value field. The key field will be text and the value field will be an integer. This makes sense in a word count, since the word is the key and how often it appears is the value.

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount1.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

23 Main Function
The word count application has two arguments. The first argument is the input file that we will run the word count on. The second is the filepath where the word count application will write out the output text.
The last line will start the job and then block until the job is complete. If the job completes successfully the application will exit with a return code of 0; if there is an error it will exit with a return code of 1.

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount1.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

24 Mapper Function

public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().split(" ");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}

25 Mapper Function
This is the WordCountMapper class declaration. It extends the Mapper class, which requires four type arguments: KEYIN, VALUEIN, KEYOUT, VALUEOUT. This is where the input and output datatypes for the keys and values are set.
The KEYIN field is set to Object; by default Hadoop makes the key the byte offset of the line in the input text file. The VALUEIN field is set to Text and will contain the line of text from the input file.
The KEYOUT field is set to Text; in the word count application the output keys will be the individual words. The VALUEOUT field is set to IntWritable and will contain a count of how often the word appears.
Text is comparable to String, and IntWritable is comparable to Integer. The difference is that the Text and IntWritable classes implement interfaces that are used by Hadoop to run the MapReduce.

public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().split(" ");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}

26 Mapper Function
Create a constant variable named one and set it equal to 1. The word count works by splitting up each line by word and assigning each word the value of 1. In the reduce phase the application will count the similar words.
Create a Text variable named word to hold each individual word once the input Text line is split.
The map function is run when the Mapper is called on the Data Node. Key is the byte offset and is never used in the word count job. Value is the Text line from the input file. Context is a class that allows the task to write data out.

public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().split(" ");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}

27 Mapper Function
A string array of every word in the input line is created by splitting the Text at every space character. Each word in the array is cast from a String into a Text.
The Context class is used to write out the intermediate data. The KEYOUT is the Text word that was set in the previous line. The VALUEOUT is the IntWritable that was set to 1.
The end result of the Map phase is a collection of key-value pairs in the form [word, 1]. Multiple instances of the same word will be added up in the Reduce phase.

public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] words = value.toString().split(" ");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}

28 Reducer Function

public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable val : values) {
            total++;
        }
        context.write(key, new IntWritable(total));
    }
}

29 Reducer Function
The reduce function is called on the individual datanodes after the map phase has completed. The function has three parameters. The first parameter is the key that the datanode will process. The second parameter is an iterable list of values for that key. The third parameter is the context, which is used to write out the result.
The for loop counts how many values were found for the given key. In the word count example, this means counting how many times an individual word appeared. The value in each key-value pair was set to one, so instead of adding the value in the IntWritable, we can just increment total once for each object in the iterable.

public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable val : values) {
            total++;
        }
        context.write(key, new IntWritable(total));
    }
}
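
An aside on the total++ shortcut: it is correct here because every map output value is 1, but it would break if a combiner pre-aggregated the counts. A variant that also tolerates a combiner sums the values instead:

    // Combiner-safe variant (not the lab code): sum the IntWritable values
    // so pre-aggregated partial counts are still added up correctly.
    for (IntWritable val : values) {
        total += val.get();
    }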

30 ReduceJoin Outline

import java.io.IOException;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;

public class ReduceJoin {
    public static class SalesRecordMapper extends Mapper<Object, Text, Text, Text> { ... }
    public static class AccountRecordMapper extends Mapper<Object, Text, Text, Text> { ... }
    public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text> { ... }
    public static void main(String[] args) throws Exception { ... }
}

31 ReduceJoin Purpose
Join the Accounts table (account number, name) with the Sales table (account number, sale amount) on the account number.
Accounts:
001  John Allen
002  Abigail Smith
003  April Stevens
004  Nasser Hafez

32 Main Function

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Reduce-side join");
    job.setJarByClass(ReduceJoin.class);
    job.setReducerClass(ReduceJoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, SalesRecordMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, AccountRecordMapper.class);
    Path outputPath = new Path(args[2]);
    FileOutputFormat.setOutputPath(job, outputPath);
    outputPath.getFileSystem(conf).delete(outputPath);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

33 Main Function
The main function is similar to the WordCount program.

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Reduce-side join");
    job.setJarByClass(ReduceJoin.class);
    job.setReducerClass(ReduceJoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, SalesRecordMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, AccountRecordMapper.class);
    Path outputPath = new Path(args[2]);
    FileOutputFormat.setOutputPath(job, outputPath);
    outputPath.getFileSystem(conf).delete(outputPath);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

34 Main Function
Differences:
Multiple file inputs.
Separate Mapper class for each input file.
The program deletes any previously created file in the output path. (Hadoop will not start a job whose output directory already exists, so it must be removed first.)

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Reduce-side join");
    job.setJarByClass(ReduceJoin.class);
    job.setReducerClass(ReduceJoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, SalesRecordMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, AccountRecordMapper.class);
    Path outputPath = new Path(args[2]);
    FileOutputFormat.setOutputPath(job, outputPath);
    outputPath.getFileSystem(conf).delete(outputPath);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

35 Sales Record Mapper
The Sales Record Mapper takes each line from the Sales file, casts it as a String, then uses the split function to separate the line into a String array based on where the tabs are in the line.

public static class SalesRecordMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        String[] parts = record.split("\t");
        context.write(new Text(parts[0]), new Text("sales\t" + parts[1]));
    }
}

36 Sales Record Mapper
The Sales Record Mapper then outputs the data. The Key is a Text datatype containing the account number. The Value is a Text datatype containing the word sales concatenated with the price.

public static class SalesRecordMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        String[] parts = record.split("\t");
        context.write(new Text(parts[0]), new Text("sales\t" + parts[1]));
    }
}

37 Sales Record Mapper
An example output from the Sales Record Mapper is: [001, sales\t35.99]. There is a tab character between sales and 35.99. Keep the concatenation in mind, as it will be important in the Reduce phase.

public static class SalesRecordMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        String[] parts = record.split("\t");
        context.write(new Text(parts[0]), new Text("sales\t" + parts[1]));
    }
}

38 Account Record Mapper
The Account Record Mapper takes each line from the Accounts file, casts it as a String, then uses the split function to separate the line into a String array based on where the tabs are in the line. A name in the file contains a space, but since the split is on tabs, the full name is considered one element.

public static class AccountRecordMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        String[] parts = record.split("\t");
        context.write(new Text(parts[0]), new Text("accounts\t" + parts[1]));
    }
}

001  John Allen
002  Abigail Smith
003  April Stevens
004  Nasser Hafez

39 Account Record Mapper
The Account Record Mapper then outputs the data. The Key is a Text datatype containing the account number. The Value is a Text datatype containing the word accounts concatenated with the person's name.

public static class AccountRecordMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        String[] parts = record.split("\t");
        context.write(new Text(parts[0]), new Text("accounts\t" + parts[1]));
    }
}

001  John Allen
002  Abigail Smith
003  April Stevens
004  Nasser Hafez

40 Account Record Mapper
An important point to note is that both the SalesRecordMapper and the AccountRecordMapper output the same field as the key. This is what will facilitate the join. Also note the concatenation in the AccountRecordMapper. It is similar to the SalesRecordMapper.

public static class AccountRecordMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        String[] parts = record.split("\t");
        context.write(new Text(parts[0]), new Text("accounts\t" + parts[1]));
    }
}

001  John Allen
002  Abigail Smith
003  April Stevens
004  Nasser Hafez

41 ReduceJoin Reducer
Data sent to an individual datanode during the Reduce phase is partitioned based on the key field. Since the Sales data and the Accounts data both use the account number as the key field, each datanode will receive all of the data pertaining to a single salesperson. In the reduce phase we want to process and output data about each salesperson.

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int count = 0;
    for (Text t : values) {
        String parts[] = t.toString().split("\t");
        if (parts[0].equals("sales")) {
            count++;
            total += Float.parseFloat(parts[1]);
        } else if (parts[0].equals("accounts")) {
            name = parts[1];
        }
    }
    String str = String.format("%d\t%f", count, total);
    context.write(new Text(name), new Text(str));
}

42 ReduceJoin Reducer
Three variables are created to store the salesperson's data. The name field will contain the salesperson's name. Total will contain the total cost of the items sold. Count will contain the total number of sales by that salesperson.

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int count = 0;
    for (Text t : values) {
        String parts[] = t.toString().split("\t");
        if (parts[0].equals("sales")) {
            count++;
            total += Float.parseFloat(parts[1]);
        } else if (parts[0].equals("accounts")) {
            name = parts[1];
        }
    }
    String str = String.format("%d\t%f", count, total);
    context.write(new Text(name), new Text(str));
}

43 ReduceJoin Reducer
Loop through every value for the key assigned to the datanode. In the ReduceJoin program, the values will be in one of two forms: sales\t35.99 or accounts\tJohn Allen.

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int count = 0;
    for (Text t : values) {
        String parts[] = t.toString().split("\t");
        if (parts[0].equals("sales")) {
            count++;
            total += Float.parseFloat(parts[1]);
        } else if (parts[0].equals("accounts")) {
            name = parts[1];
        }
    }
    String str = String.format("%d\t%f", count, total);
    context.write(new Text(name), new Text(str));
}

44 ReduceJoin Reducer
The value is cast to a String and then split based on the tab character. Doing this allows the reducer to process the value differently depending on whether we concatenated sales or accounts to the value during the Map phase.

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int count = 0;
    for (Text t : values) {
        String parts[] = t.toString().split("\t");
        if (parts[0].equals("sales")) {
            count++;
            total += Float.parseFloat(parts[1]);
        } else if (parts[0].equals("accounts")) {
            name = parts[1];
        }
    }
    String str = String.format("%d\t%f", count, total);
    context.write(new Text(name), new Text(str));
}

45 ReduceJoin Reducer
Values concatenated with sales contain the sales data for the salesperson. Each value processed indicates a sale, so the program will increment the count variable and add the sale amount to the running total.

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int count = 0;
    for (Text t : values) {
        String parts[] = t.toString().split("\t");
        if (parts[0].equals("sales")) {
            count++;
            total += Float.parseFloat(parts[1]);
        } else if (parts[0].equals("accounts")) {
            name = parts[1];
        }
    }
    String str = String.format("%d\t%f", count, total);
    context.write(new Text(name), new Text(str));
}

46 ReduceJoin Reducer
Values concatenated with accounts contain the salesperson's name. The name is assigned to the name variable.

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int count = 0;
    for (Text t : values) {
        String parts[] = t.toString().split("\t");
        if (parts[0].equals("sales")) {
            count++;
            total += Float.parseFloat(parts[1]);
        } else if (parts[0].equals("accounts")) {
            name = parts[1];
        }
    }
    String str = String.format("%d\t%f", count, total);
    context.write(new Text(name), new Text(str));
}

47 ReduceJoin Reducer
The sales data is formatted and output using the Context class. The format is a key-value pair: [John Allen, 3\t124.93].

public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    String name = "";
    double total = 0.0;
    int count = 0;
    for (Text t : values) {
        String parts[] = t.toString().split("\t");
        if (parts[0].equals("sales")) {
            count++;
            total += Float.parseFloat(parts[1]);
        } else if (parts[0].equals("accounts")) {
            name = parts[1];
        }
    }
    String str = String.format("%d\t%f", count, total);
    context.write(new Text(name), new Text(str));
}

48 Hadoop Commands
Listing the files in HDFS:
hdfs dfs -ls /NoCoug/Output
Removing files in HDFS:
hdfs dfs -rm -r -f /NoCoug/Output/WordCount2_run1
Full reference:

49 Oracle Loader for Hadoop
Parallel load, optimized for Hadoop.
Automatic load balancing.
Convert to Oracle format on Hadoop; save database CPU.
Input formats: text, Avro, Parquet, Sequence files, Hive, log files, JSON, compressed files, and more.
Load specific Hive partitions.
Kerberos authentication.
Load directly into an In-Memory table.

50 Oracle Loader
Unzip OraLoader.
Set OLH_HOME.
The loader runs as a MapReduce job.
The job configuration file is an .xml file that contains data mappings and database connection information.

hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader -conf job.xml

51 Oracle Loader
Start up the database:
sqlplus / as sysdba
startup
exit
Start the listener:
lsnrctl start
SQL Developer: find the table WORDCOUNT, right-click on the table, and select Truncate.
SQL*Plus:
sqlplus ORACLELOADER/Biwa2015
truncate table WORDCOUNT

52 Job Configuration file (job.xml)

<!-- Input settings -->
<property>
  <name>mapreduce.inputformat.class</name>
  <value>oracle.hadoop.loader.lib.input.DelimitedTextInputFormat</value>
</property>
<property>
  <name>mapred.input.dir</name>
  <value>/NoCoug/WordCount2_run3</value>
</property>
<property>
  <name>oracle.hadoop.loader.input.fieldTerminator</name>
  <value>\u0009</value>
</property>
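
The slide shows only the input settings. For orientation, here is a sketch of the output and connection sections a complete OLH job.xml also carries; the property names follow the Oracle Loader for Hadoop documentation and should be checked against your OLH release, and the connection values below are placeholders built from the credentials and table used elsewhere in this lab.

<!-- Output settings (sketch; verify property names against your OLH version) -->
<property>
  <name>mapreduce.outputformat.class</name>
  <value>oracle.hadoop.loader.lib.output.JDBCOutputFormat</value>
</property>
<property>
  <name>oracle.hadoop.loader.loaderMap.targetTable</name>
  <value>WORDCOUNT</value>
</property>
<!-- Connection settings (placeholder values) -->
<property>
  <name>oracle.hadoop.loader.connection.url</name>
  <value>jdbc:oracle:thin:@//localhost:1521/orcl</value>
</property>
<property>
  <name>oracle.hadoop.loader.connection.user</name>
  <value>ORACLELOADER</value>
</property>
<property>
  <name>oracle.hadoop.loader.connection.password</name>
  <value>Biwa2015</value>
</property>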


55 HDFS References
Java MapReduce class documentation:
HDFS command documentation:
Book: Hadoop Beginner's Guide by Garry Turkington [Packt Publishing]
Wikipedia (5 steps of MapReduce):
NPS publications repository:

56 Conclusion
Questions?
Contact: Arijit Das
