Hadoop Configuration and First Examples



Hadoop Configuration and First Examples Big Data 2015

Hadoop Configuration In ~/.bash_profile, export all needed environment variables
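For instance, the exports might look like the following sketch; the install paths below are assumptions, not taken from the slides, so adjust them to your machine:

```shell
# Sketch of ~/.bash_profile entries for a Hadoop 1.2.1 install.
# Both paths are assumed locations, not prescribed by the deck.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk   # assumed JDK location
export HADOOP_HOME=$HOME/hadoop-1.2.1          # assumed unpack location
export PATH=$PATH:$HADOOP_HOME/bin             # so the hadoop command is on PATH
```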

Hadoop Configuration Allow remote login

Hadoop Configuration Download the binary release of Apache Hadoop: hadoop-1.2.1-bin.tar.gz

Hadoop Configuration In the conf directory of the Hadoop home directory, edit the following files: core-site.xml, hadoop-env.sh, hdfs-site.xml, mapred-site.xml

Hadoop Configuration In the hadoop-env.sh file you have to set JAVA_HOME to your JDK location

Hadoop Configuration core-site.xml hdfs-site.xml mapred-site.xml
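A minimal pseudo-distributed setup for Hadoop 1.x typically contains entries like the following sketch; the property values (ports, replication factor) are common defaults and assumptions, not taken from the slides:

```xml
<!-- core-site.xml: URI of the default filesystem (assumed port) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node, so keep one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: JobTracker address (assumed port) -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```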

Hadoop Configuration & Running Final configuration. Once Hadoop is running, the NameNode web UI is at http://localhost:50070 and the JobTracker web UI at http://localhost:50030

Hadoop: browse HDFS at http://localhost:50070 via the Browse the filesystem link

Architecture [diagram slides]

Hadoop Configuration & Running Check all running daemons in Hadoop using the command jps:
$ jps
30773 JobTracker
30861 TaskTracker
30707 SecondaryNameNode
30613 DataNode
30524 NameNode
30887 Jps

Hadoop: commands on HDFS
$ hadoop-*/bin/hadoop dfs <command> <parameters>
Create a directory in HDFS:
$ hadoop-*/bin/hadoop dfs -mkdir input
Copy a local file into HDFS:
$ hadoop-*/bin/hadoop dfs -copyFromLocal /tmp/example.txt input
Delete a directory in HDFS:
$ hadoop-*/bin/hadoop dfs -rmr input

Hadoop: execute an MR application
$ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: Word Count in MapReduce
$ hadoop-*/bin/hadoop dfs -mkdir input
$ hadoop-*/bin/hadoop dfs -mkdir output
$ hadoop-*/bin/hadoop dfs -copyFromLocal /tmp/parole.txt input
$ hadoop-*/bin/hadoop jar /tmp/word.jar WordCount input/parole.txt output/result
Here input/parole.txt is the path on HDFS to reach the file, and output/result is the HDFS directory to generate to store the result.

Let's start with some examples!

Install Pseudo-Distributed Mode Good luck! https://github.com/h4ml3t/introducing-hadoop

Example: Temperature We collected information from https://www.ncdc.noaa.gov/ (National Climatic Data Center) about the atmosphere. For each year (starting from 1763), for each month, day, and hour, and for each existing meteorological station in the world, we have an entry in a file that specifies: date, time, id, air temperature, pressure, elevation, latitude, longitude...

Example: Temperature What if we want to know the max temperature in each year?

Temperature - data format File entries example:
...
0057332130999991958010103004+51317+028783FM-12+017199999V0203201N00721004501CN0100001N9-00211-01391102681
0057332130999991959010103004+51317+028783FM-12+017199999V0203201N00721004501CN0100001N9+00651-01391102681
0057332130999991960010103004+51317+028783FM-12+017199999V0203201N00721004501CN0100001N9+00541-01391102681
...
Each record contains, at fixed offsets, the year (characters 15-19) and the air temperature in tenths of a degree Celsius, with its sign (characters 87-92).
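As a quick plain-Java check of this fixed-offset format (no Hadoop needed), the parsing can be sketched as follows; the offsets and sample record come from this deck, while the class and method names are illustrative, not from the slides:

```java
// Sketch: parse year and air temperature from an NCDC record at fixed
// offsets, as the MaxTemperature mapper does. Names are illustrative.
public class NcdcRecordParser {

    // The year occupies characters 15..18 of the record.
    public static String year(String line) {
        return line.substring(15, 19);
    }

    // The temperature has its sign at offset 87 and digits up to 92;
    // the value is in tenths of a degree Celsius.
    public static int airTemperature(String line) {
        if (line.charAt(87) == '+') {
            return Integer.parseInt(line.substring(88, 92));
        }
        return Integer.parseInt(line.substring(87, 92));
    }

    public static void main(String[] args) {
        String rec = "0057332130999991958010103004+51317+028783FM-12"
                + "+017199999V0203201N00721004501CN0100001N9-00211-01391102681";
        // Prints "1958 -21", i.e. -2.1 degrees Celsius in 1958.
        System.out.println(year(rec) + " " + airTemperature(rec));
    }
}
```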

Temperature logical data flow
INPUT: the raw lines of the file
MAP input (byte offset, line): (0, 0067011990...), (106, 0043011990...), (212, 0043011990...), (318, 0043012650...), (424, 0043012650...)
MAP output: (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)
SHUFFLE & SORT: (1949, [111, 78]), (1950, [0, 22, -11])
REDUCE output: (1949, 111), (1950, 22)

Temperature Mapper: map (k1, v1) -> [(k2, v2)]

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') {
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    if (airTemperature != MISSING) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}

Temperature Reducer: reduce (k2, [v2]) -> [(k3, v3)]

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, DoubleWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, DoubleWritable> output, Reporter reporter)
      throws IOException {
    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new DoubleWritable(((double) maxValue) / 10));
  }
}

Temperature - Job

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class MaxTemperature {

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");
    FileInputFormat.addInputPath(conf, new Path("temperature.txt"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(DoubleWritable.class);
    JobClient.runJob(conf);
  }
}

Hadoop: execute the MaxTemperature application
$ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: MaxTemperature in MapReduce
$ hadoop-*/bin/hadoop dfs -mkdir input
$ hadoop-*/bin/hadoop dfs -mkdir output
$ hadoop-*/bin/hadoop dfs -copyFromLocal /example_data/temperature.txt input
$ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar temperature.MaxTemperature input/temperature.txt output/result_temperature
Here input/temperature.txt is the path on HDFS to reach the file, and output/result_temperature is the HDFS directory to generate to store the result.

Hadoop: clean up the MaxTemperature application
$ hadoop-*/bin/hadoop dfs -rmr input
$ hadoop-*/bin/hadoop dfs -rmr output

Word Count logical data flow
INPUT: sopra la panca la capra campa, sotto la panca la capra crepa
MAP input (offset, word): (0, sopra), (6, la), (9, panca), (15, la), (18, capra), (24, campa,), ...
MAP output: (sopra, 1), (la, 1), (panca, 1), (la, 1), (capra, 1), (campa,, 1), ...
SHUFFLE & SORT: (campa,, [1]), (capra, [1,1]), (crepa, [1]), (la, [1,1,1,1]), ...
REDUCE output: (campa,, 1), (capra, 2), (crepa, 1), (la, 4), ...
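This flow can be simulated in plain Java without Hadoop. The class below is an illustrative sketch (its name is not part of the deck's code): it groups (word, 1) pairs by key and then sums them, exactly as the shuffle and the reducer do:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Sketch: simulate the word-count map / shuffle & sort / reduce flow
// in plain Java (no Hadoop). The class name is illustrative.
public class WordCountFlow {

    // MAP + SHUFFLE & SORT: emit (word, 1) for each token and group
    // the ones by key; TreeMap keeps keys sorted like the shuffle does.
    public static TreeMap<String, List<Integer>> mapAndShuffle(String input) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        StringTokenizer tok = new StringTokenizer(input); // same tokenization as the mapper
        while (tok.hasMoreTokens()) {
            grouped.computeIfAbsent(tok.nextToken(), k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // REDUCE: sum each key's list of ones.
    public static TreeMap<String, Integer> reduce(TreeMap<String, List<Integer>> grouped) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        String input = "sopra la panca la capra campa, sotto la panca la capra crepa";
        System.out.println(reduce(mapAndShuffle(input)));
    }
}
```

Note that, as with StringTokenizer in the real mapper, punctuation stays attached to the word, so "campa," is counted as its own token.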

Word Count Mapper

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}

Word Count Reducer

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

Word Count Job

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "WordCount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.waitForCompletion(true);
  }
}

Hadoop: execute the WordCount application
$ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: WordCount in MapReduce
$ hadoop-*/bin/hadoop dfs -mkdir input
$ hadoop-*/bin/hadoop dfs -mkdir output
$ hadoop-*/bin/hadoop dfs -copyFromLocal /example_data/words.txt input
$ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar wordcount.WordCount input/words.txt output/result_words
Here input/words.txt is the path on HDFS to reach the file, and output/result_words is the HDFS directory to generate to store the result.

Hadoop: clean up the WordCount application
$ hadoop-*/bin/hadoop dfs -rmr input
$ hadoop-*/bin/hadoop dfs -rmr output

Word Count: Without Combiner
MAP 1 OUTPUT: (wordA,1) (wordA,1) (wordB,1) (wordB,1) (wordB,1)
MAP 2 OUTPUT: (wordA,1) (wordA,1) (wordA,1) (wordB,1) (wordC,1)
REDUCER INPUT: (wordA, [1,1,1,1,1]) (wordB, [1,1,1,1]) (wordC, [1])
OUTPUT: (wordA, 5) (wordB, 4) (wordC, 1)

Word Count: With Combiner
MAP 1 OUTPUT: (wordA,1) (wordA,1) (wordB,1) (wordB,1) (wordB,1)
MAP 2 OUTPUT: (wordA,1) (wordA,1) (wordA,1) (wordB,1) (wordC,1)
COMBINER OUTPUT: map 1: (wordA, 2) (wordB, 3); map 2: (wordA, 3) (wordB, 1) (wordC, 1)
REDUCER INPUT: (wordA, [2,3]) (wordB, [3,1]) (wordC, [1])
OUTPUT: (wordA, 5) (wordB, 4) (wordC, 1)
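The combiner's effect can be checked in plain Java: summing each map's output locally and then summing again gives the same totals as summing all the raw pairs (note that wordB totals 4: three ones from the first map plus one from the second). The class below is an illustrative sketch, not part of the deck's code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: running the reducer's sum locally as a combiner does not
// change the final word counts. The class name is illustrative.
public class CombinerDemo {

    // Sum (word, count) pairs by key: this is what both the combiner
    // and the reducer do in the word-count job.
    public static TreeMap<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            out.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> map1 = List.of(
                Map.entry("wordA", 1), Map.entry("wordA", 1),
                Map.entry("wordB", 1), Map.entry("wordB", 1), Map.entry("wordB", 1));
        List<Map.Entry<String, Integer>> map2 = List.of(
                Map.entry("wordA", 1), Map.entry("wordA", 1), Map.entry("wordA", 1),
                Map.entry("wordB", 1), Map.entry("wordC", 1));

        // Without combiner: the reducer sees every raw (word, 1) pair.
        List<Map.Entry<String, Integer>> all = new ArrayList<>(map1);
        all.addAll(map2);
        TreeMap<String, Integer> direct = sumByKey(all);

        // With combiner: each map's output is pre-summed locally first.
        TreeMap<String, Integer> c1 = sumByKey(map1); // {wordA=2, wordB=3}
        TreeMap<String, Integer> c2 = sumByKey(map2); // {wordA=3, wordB=1, wordC=1}
        List<Map.Entry<String, Integer>> combined = new ArrayList<>(c1.entrySet());
        combined.addAll(c2.entrySet());
        TreeMap<String, Integer> withCombiner = sumByKey(combined);

        System.out.println(direct);       // {wordA=5, wordB=4, wordC=1}
        System.out.println(withCombiner); // {wordA=5, wordB=4, wordC=1}
    }
}
```

This works because summation is associative; the combiner only cuts down the data moved across the network during the shuffle.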

Combiner In the job configuration:
job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);
job.setReducerClass(WordCountReducer.class);
Watch the types! The combiner's input and output key/value types must both match the map output types, because the framework may run it any number of times between map and reduce.

Bigram Count A bigram is a pair of two consecutive words. In this sentence there are 8 bigrams: "A bigram", "bigram is", "is a", "a pair", ...

Bigram Count Example: sopra la panca la capra campa, sotto la panca la capra crepa
campa, sotto 1
capra campa, 1
capra crepa 1
la capra 2
la panca 2
panca la 2
sopra la 1
sotto la 1
To represent a bigram we want to use a custom Writable type (BigramWritable).
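The sliding-window counting itself can be tried out in plain Java before wrapping it in a mapper; the sketch below (class name illustrative, not from the slides) uses the same prev/cur StringTokenizer approach as the BigramCountMapper:

```java
import java.util.StringTokenizer;
import java.util.TreeMap;

// Sketch: count bigrams of a sentence in plain Java (no Hadoop),
// mirroring the prev/cur sliding window of the bigram mapper.
public class BigramDemo {

    public static TreeMap<String, Integer> bigramCounts(String line) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        String prev = null;
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            String cur = itr.nextToken();
            if (prev != null) {
                // The pair of the previous and current word is one bigram.
                counts.merge(prev + " " + cur, 1, Integer::sum);
            }
            prev = cur;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(bigramCounts(
            "sopra la panca la capra campa, sotto la panca la capra crepa"));
    }
}
```

The 12-word example sentence yields 11 bigram occurrences over 8 distinct bigrams, matching the table above.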

Bigram Count

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.*;

public class BigramWritable implements WritableComparable<BigramWritable> {

  private Text leftBigram;
  private Text rightBigram;

  public BigramWritable() { }  // no-arg constructor, needed by Hadoop for deserialization

  public BigramWritable(Text left, Text right) {
    this.leftBigram = left;
    this.rightBigram = right;
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    leftBigram = new Text(in.readUTF());
    rightBigram = new Text(in.readUTF());
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(leftBigram.toString());
    out.writeUTF(rightBigram.toString());
  }

Bigram Count (BigramWritable, continued)

  public void set(Text prev, Text cur) {
    leftBigram = prev;
    rightBigram = cur;
  }

  @Override
  public int hashCode() {
    return leftBigram.hashCode() + rightBigram.hashCode();
  }

  @Override
  public boolean equals(Object o) {
    if (o instanceof BigramWritable) {
      BigramWritable bigram = (BigramWritable) o;
      return leftBigram.equals(bigram.leftBigram)
          && rightBigram.equals(bigram.rightBigram);
    }
    return false;
  }

Bigram Count (BigramWritable, continued)

  @Override
  public int compareTo(BigramWritable tp) {
    int cmp = leftBigram.compareTo(tp.leftBigram);
    if (cmp != 0) {
      return cmp;
    }
    return rightBigram.compareTo(tp.rightBigram);
  }
}

Bigram Count - Mapper

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

public class BigramCountMapper
    extends Mapper<LongWritable, Text, BigramWritable, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private static final BigramWritable BIGRAM = new BigramWritable();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String prev = null;
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      String cur = itr.nextToken();
      if (prev != null) {
        BIGRAM.set(new Text(prev), new Text(cur));
        context.write(BIGRAM, ONE);
      }
      prev = cur;
    }
  }
}

Bigram Count - Reducer

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

public class BigramCountReducer
    extends Reducer<BigramWritable, IntWritable, Text, IntWritable> {

  private final static IntWritable SUM = new IntWritable();

  @Override
  public void reduce(BigramWritable key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    Iterator<IntWritable> iter = values.iterator();
    while (iter.hasNext()) {
      sum += iter.next().get();
    }
    SUM.set(sum);
    context.write(new Text(key.toString()), SUM);
  }
}

Bigram Count - Job

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class BigramCount extends Configured implements Tool {

  private BigramCount() { }

  public int run(String[] args) throws Exception {
    Job job = new Job(getConf());
    job.setJobName(BigramCount.class.getSimpleName());
    job.setJarByClass(BigramCount.class);
    // (mapper, reducer, key/value classes and input/output paths are set here)
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new BigramCount(), args);
  }
}

Hadoop: execute the BigramCount application
$ hadoop-*/bin/hadoop jar <path-jar> <jar-mainclass> <jar-parameters>
Example: BigramCount in MapReduce
$ hadoop-*/bin/hadoop dfs -mkdir input
$ hadoop-*/bin/hadoop dfs -mkdir output
$ hadoop-*/bin/hadoop dfs -copyFromLocal /example_data/words.txt input
$ hadoop-*/bin/hadoop jar /example_jar/bg-example1-1.jar bigram.BigramCount -input input/words.txt -numReducers 1 -output output/result_bigram

Hadoop: clean up the BigramCount application
$ hadoop-*/bin/hadoop dfs -rmr input
$ hadoop-*/bin/hadoop dfs -rmr output

Stopping the Hadoop DFS Stop Hadoop:
$ hadoop-*/bin/stop-all.sh
That's all! Happy Map Reducing!

Resources
