How to program a MapReduce cluster




TF-IDF step by step
Ján Súkeník
xsukenik@is.stuba.sk
sukenik08@student.fiit.stuba.sk

TF-IDF
For TF-IDF we need:
- for each document, its total word count
- the frequency of each word in it
- for each word, the number of documents in which it appears
In MapReduce we split the computation into 3 phases.

Phase 1
Map: (document, [words]) → (word, document, 1)
Reduce: (word, document, [1, 1, ...]) → (word, document, word-occurrence-count n)
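The tokenization the phase-1 mapper performs can be tried outside Hadoop. This is a minimal sketch: the slide's `shouldSkipWord()` is never defined, so a hard-coded stop-word set stands in for it here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Phase1Sketch {
    // Hypothetical stop-word set standing in for the undefined shouldSkipWord()
    static final Set<String> STOP_WORDS = Set.of("the", "a", "and");

    static Map<String, Integer> countWords(String document, String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Same regex the mapper uses: runs of word characters, lowercased
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            String word = m.group().toLowerCase();
            if (STOP_WORDS.contains(word)) continue;
            // Same composite key the mapper emits: word@document
            counts.merge(word + "@" + document, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("doc1.txt", "Do cats eat bats? Do bats eat cats?"));
    }
}
```

In the real job the counting is done by the reducer over the emitted 1s; here `merge` collapses both steps for a single document.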

Mapper

// Hadoop imports (org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.*,
// org.apache.hadoop.mapreduce.lib.input.FileSplit, java.util.regex.*) are omitted on the slides.
public class FirstMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable lineNumber, Text lineContent, Context context)
            throws IOException, InterruptedException {
        String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
        Matcher m = Pattern.compile("\\w+").matcher(lineContent.toString());
        while (m.find()) {
            String word = m.group().toLowerCase();
            if (shouldSkipWord(word)) continue;
            context.write(new Text(word + "@" + filename), new IntWritable(1));
        }
    }
}

Reducer

public class FirstReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text wordAtDocument, Iterable<IntWritable> ones, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable one : ones) {
            sum++;
        }
        context.write(wordAtDocument, new IntWritable(sum));
    }
}

Driver

public class FirstDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "TF-IDF phase 1");
        job.setJarByClass(FirstDriver.class);
        job.setMapperClass(FirstMapper.class);
        job.setReducerClass(FirstReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new FirstDriver(), args));
    }
}

Phase 2
Map: (word, document, n) → (document, word, n)
Reduce: (document, [word1, n1, word2, n2, ...]) → (word, document, n, total-words-in-document N)
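The phase-2 mapper does no counting: it only re-keys a phase-1 output line (`word@document<TAB>count`) so that all words of one document meet in the same reducer. The parsing can be checked standalone; the input line here is taken from the sample phase-1 output shown later in the talk.

```java
public class Phase2KeySwap {
    // Turn a phase-1 output line "word@document<TAB>count"
    // into the (document, word=count) pair the phase-2 mapper emits
    static String[] reKey(String line) {
        String[] parts = line.split("\t");
        String[] key = parts[0].split("@");
        String word = key[0];
        String document = key[1];
        String count = parts[1];
        return new String[] { document, word + "=" + count };
    }

    public static void main(String[] args) {
        String[] kv = reKey("shark@pg2701.txt\t30");
        System.out.println(kv[0] + " -> " + kv[1]); // prints "pg2701.txt -> shark=30"
    }
}
```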

Mapper

public class SecondMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable lineNumber, Text lineContent, Context context)
            throws IOException, InterruptedException {
        String[] line = lineContent.toString().split("\t");
        String[] key = line[0].split("@");
        String word = key[0];
        String document = key[1];
        String count = line[1];
        context.write(new Text(document), new Text(word + "=" + count));
    }
}

Reducer

public class SecondReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text document, Iterable<Text> wordCounts, Context context)
            throws IOException, InterruptedException {
        int sumOfWordsInDocument = 0;
        Map<String, Integer> tempCounter = new HashMap<String, Integer>();
        for (Text wordCountStr : wordCounts) {
            String[] wordCount = wordCountStr.toString().split("=");
            String word = wordCount[0];
            int count = Integer.parseInt(wordCount[1]);
            tempCounter.put(word, count);
            sumOfWordsInDocument += count;
        }
        for (Entry<String, Integer> entry : tempCounter.entrySet()) {
            context.write(
                new Text(entry.getKey() + "@" + document.toString()),
                new Text(entry.getValue() + "/" + sumOfWordsInDocument));
        }
    }
}

Driver

public class SecondDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "TF-IDF phase 2");
        job.setJarByClass(SecondDriver.class);
        job.setMapperClass(SecondMapper.class);
        job.setReducerClass(SecondReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new SecondDriver(), args));
    }
}

Phase 3
Map: (word, document, n, N) → (word, document, n, N)
Reduce: (word, [d1, n1, N1, d2, n2, N2, ...]) → (word, document, tf-idf)
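The score the phase-3 reducer computes is tf × idf, with tf = n/N and idf = log10(D/d), where D is the total number of documents and d the number of documents containing the word. A standalone sketch of just that arithmetic (D = 37 matches the reducer's hard-coded NUMBER_OF_DOCS; the sample numbers are illustrative):

```java
public class TfIdfSketch {
    // tf-idf exactly as in phase 3: tf = n / N, idf = log10(D / d)
    static double tfIdf(int count, int totalWords, int totalDocs, int docsWithWord) {
        double tf = (double) count / (double) totalWords;
        double idf = Math.log10((double) totalDocs / (double) docsWithWord);
        return tf * idf;
    }

    public static void main(String[] args) {
        // e.g. a word occurring 30 times in a 153640-word document, found in 5 of 37 documents
        System.out.println(tfIdf(30, 153640, 37, 5));
    }
}
```

Note that a word occurring in every document gets idf = log10(1) = 0, which is why common words like "have" score 0 in the phase-3 output shown later.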

Mapper

public class ThirdMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable lineNumber, Text lineContent, Context context)
            throws IOException, InterruptedException {
        String[] line = lineContent.toString().split("\t");
        String[] key = line[0].split("@");
        String word = key[0];
        String document = key[1];
        String countAndTotal = line[1];
        context.write(new Text(word), new Text(document + "=" + countAndTotal));
    }
}

Reducer

public class ThirdReducer extends Reducer<Text, Text, Text, Text> {
    private static final DecimalFormat DF = new DecimalFormat("###.########");
    private static final int NUMBER_OF_DOCS = 37;

    @Override
    public void reduce(Text word, Iterable<Text> documentCounts, Context context)
            throws IOException, InterruptedException {
        int numberOfDocumentsWithWord = 0;
        Map<String, String> tempFrequencies = new HashMap<String, String>();
        for (Text documentCountStr : documentCounts) {
            String[] documentCount = documentCountStr.toString().split("=");
            String document = documentCount[0];
            String count = documentCount[1];
            numberOfDocumentsWithWord++;
            tempFrequencies.put(document, count);
        }
        for (Entry<String, String> entry : tempFrequencies.entrySet()) {
            String[] countTotal = entry.getValue().split("/");
            int count = Integer.parseInt(countTotal[0]);
            int total = Integer.parseInt(countTotal[1]);
            double tf = (double) count / (double) total;
            double idf = Math.log10((double) NUMBER_OF_DOCS / (double) numberOfDocumentsWithWord);
            double tfIdf = tf * idf;
            context.write(
                new Text(word + "@" + entry.getKey()),
                new Text(DF.format(tfIdf)));
        }
    }
}

Driver

public class ThirdDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf, "TF-IDF phase 3");
        job.setJarByClass(ThirdDriver.class);
        job.setMapperClass(ThirdMapper.class);
        job.setReducerClass(ThirdReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ThirdDriver(), args));
    }
}

Done. How do I run it?

Preparing the input on HDFS
Create an HDFS directory:
$ hadoop dfs -mkdir /pewe
Copy the files to HDFS:
$ hadoop dfs -put ~/input /pewe

Running it
Build a jar with the compiled classes, then run the three drivers in order:
$ hadoop jar tfidf.jar package.FirstDriver
$ hadoop dfs -ls /pewe
/pewe/input
/pewe/first-output
$ hadoop jar tfidf.jar package.SecondDriver
$ hadoop dfs -ls /pewe
/pewe/input
/pewe/first-output
/pewe/second-output
$ hadoop jar tfidf.jar package.ThirdDriver
$ hadoop dfs -ls /pewe
/pewe/input
/pewe/first-output
/pewe/second-output
/pewe/third-output

Input
Down, down, down. There was nothing else to do, so Alice soon began talking again. 'Dinah'll miss me very much to-night, I should think!' (Dinah was the cat.) 'I hope they'll remember her saucer of milk at tea-time. Dinah my dear! I wish you were down here with me! There are no mice in the air, I'm afraid, but you might catch a bat, and that's very like a mouse, you know. But do cats eat bats, I wonder?' And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, 'Do cats eat bats? Do cats eat bats?' and sometimes, 'Do bats eat cats?' for, you see, as she couldn't answer either question, it didn't much matter which way she put it. She felt that she was dozing off, and had just begun to dream that she was walking hand in hand with Dinah, and saying to her very earnestly, 'Now, Dinah, tell me the truth: did you ever eat a bat?' when suddenly, thump! thump! down she came upon a heap of sticks and dry leaves, and the fall was over. Presently she began again. 'I wonder if I shall fall right THROUGH the earth! How funny it'll seem to come out among the people that walk with
— Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll

Output after phase 1
yard@pg1322.txt 10
yases@pg2265.txt 1
yawning@pg135.txt 14
yevgeney@pg1399.txt 1
yo@pg2000.txt 2077
yonder@pg1112.txt 6
yours@pg1502.txt 4
yourself@pg236.txt 1
youssouf@pg35830.txt 4
youthful@pg14591.txt 3
yuliya@pg35830.txt 14
zabljak@pg35830.txt 2
zalec@pg35830.txt 2
zip@pg236.txt 1
zlatibor@pg35830.txt 2
zrnovci@pg35830.txt 2
zuzemberk@pg35830.txt 2

Output after phase 2
directing@pg2701.txt 3/153640
olassen@pg2701.txt 1/153640
sharp@pg2701.txt 42/153640
shark@pg2701.txt 30/153640
peremptorily@pg2701.txt 4/153640
inventor@pg2701.txt 1/153640
women@pg2701.txt 11/153640
godly@pg2701.txt 3/153640
compendious@pg2701.txt 1/153640
unforseen@pg2701.txt 1/153640
hannibal@pg2701.txt 1/153640
massive@pg2701.txt 5/153640
popping@pg2701.txt 1/153640
districts@pg2701.txt 3/153640
claimed@pg2701.txt 2/153640
malignantly@pg2701.txt 1/153640
crippled@pg2701.txt 3/153640

Output after phase 3
have@pg1080.txt 0
have@pg1101.txt 0
have@pg1112.txt 0
have@pg11.txt 0
have@pg1232.txt 0
have@pg1257.txt 0
...
ver@pg2000.txt 0.00118238
italy@pg25717.txt 0.00142321
sperm@pg2701.txt 0.00173278
scallops@pg20776.txt 0.00196741
clifford@pg1101.txt 0.0019727
constantius@pg25717.txt 0.00247444
kotick@pg236.txt 0.00315428
toomai@pg236.txt 0.00343735
shere@pg236.txt 0.00351823
clifford@pg1502.txt 0.00417993

Other Hadoop-related projects... or can it be done even more easily?

Mahout
A library for machine learning and data mining: Hadoop versions of well-known algorithms
- classification
- clustering
- pattern mining
- regression
- dimension reduction
- evolutionary algorithms
- collaborative filtering
- vector similarity
Still under active development.

Hive and Pig
Languages that abstract away writing Hadoop MapReduce in Java, for example over HDFS files.
Hive: an SQL-like language, born at a Facebook hackathon.
Pig: a procedural language, created at Yahoo! Research.

HBase and Cassandra
Distributed database systems, open-source counterparts of Google BigTable.
HBase: built specifically for Hadoop and HDFS.
Cassandra: created at Facebook; originally it had nothing to do with Hadoop, but today the two work well together, and Hive and Pig can be used with it.

Thank you. Questions?

Sources
1. http://hadoop.apache.org/common/docs/r0.17.2/
2. http://cassandra.apache.org/
3. http://hbase.apache.org/
4. http://hive.apache.org/
5. http://pig.apache.org/
6. http://mahout.apache.org/
7. http://marcellodesales.wordpress.com/2009/12/31/tf-idf-in-hadoop-part-1-word-frequency-in-doc/
8. http://www.gutenberg.org/