Introduction To Hadoop

Kenneth Heafield
Google Inc
January 14, 2008

Example code from Hadoop 0.13.1, used under the Apache License Version 2.0 and modified for presentation. Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.
Outline

1. Mapper, Reducer, Main
2. How it Works: Serialization, Data Flow
3. Lab
Mapper

Input:  < wikipedia.org, The Free >
Output: < The, 1 >, < Free, 1 >

public void map(WritableComparable key, Writable value,
    OutputCollector output, Reporter reporter) throws IOException {
  String line = ((Text)value).toString();
  StringTokenizer itr = new StringTokenizer(line);
  Text word = new Text();
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    output.collect(word, new IntWritable(1));
  }
}
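To see the map logic in isolation, here is a minimal plain-Java sketch of the same tokenize-and-emit step. It drops Hadoop's Text/IntWritable wrappers and collects pairs into a list; the class name MapSketch is invented for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;
import java.util.StringTokenizer;

public class MapSketch {
    // Mimics map(): split the line into tokens and emit <word, 1> per token.
    public static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> output = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            output.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return output;
    }

    public static void main(String[] args) {
        // Two tokens in, two <word, 1> pairs out.
        System.out.println(map("The Free"));
    }
}
```

Note that the mapper does no counting at all; every occurrence of a word emits its own < word, 1 > pair, and the summing is left to the reducer.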
Reducer

Input:  < The, 1 >, < The, 1 >
Output: < The, 2 >

public void reduce(WritableComparable key, Iterator values,
    OutputCollector output, Reporter reporter) throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum += ((IntWritable) values.next()).get();
  }
  output.collect(key, new IntWritable(sum));
}
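The reduce step is just a sum over one key's values. A plain-Java sketch without the IntWritable wrappers (ReduceSketch is an invented name):

```java
import java.util.Iterator;
import java.util.List;

public class ReduceSketch {
    // Mimics reduce(): sum all the counts that arrived for a single key.
    public static int reduce(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Three <The, 1> pairs collapse to <The, 3>.
        System.out.println(reduce(List.of(1, 1, 1).iterator())); // prints 3
    }
}
```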
Main

public static void main(String[] args) throws IOException {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(ReduceClass.class);
  conf.setReducerClass(ReduceClass.class);
  conf.setNumMapTasks(new Integer(40));
  conf.setNumReduceTasks(new Integer(30));
  conf.setInputPath(new Path("/shared/wikipedia_small"));
  conf.setOutputPath(new Path("/user/kheafield/word_count"));
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  JobClient.runJob(conf);
}
How it Works: Serialization

Types

Purpose
  Simple serialization for keys, values, and other data

Interface
  Writable
    Read and write binary format
    Convert to String for text formats
  WritableComparable adds a sort order for keys

Example Implementations
  ArrayWritable implements only Writable, not WritableComparable
  BooleanWritable
  IntWritable sorts in increasing order
  Text holds a String
How it Works: Serialization

A Writable

public class IntPairWritable implements Writable {
  public int first;
  public int second;

  public void write(DataOutput out) throws IOException {
    out.writeInt(first);
    out.writeInt(second);
  }

  public void readFields(DataInput in) throws IOException {
    first = in.readInt();
    second = in.readInt();
  }

  public int hashCode() {
    return first + second;
  }

  public String toString() {
    return Integer.toString(first) + "," + Integer.toString(second);
  }
}
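The write()/readFields() pair is an ordinary java.io DataOutput/DataInput round trip, so it can be exercised without Hadoop. A sketch (PairRoundTrip is an invented name) that serializes two ints to a byte array and reads them back:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class PairRoundTrip {
    public int first;
    public int second;

    // Same contract as IntPairWritable.write(): two big-endian ints.
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Same contract as IntPairWritable.readFields().
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Serialize to bytes, deserialize into a fresh object, return its fields.
    public static int[] roundTrip(int a, int b) {
        try {
            PairRoundTrip w = new PairRoundTrip();
            w.first = a;
            w.second = b;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            w.write(new DataOutputStream(bytes));
            PairRoundTrip r = new PairRoundTrip();
            r.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return new int[]{r.first, r.second};
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen with in-memory streams
        }
    }

    public static void main(String[] args) {
        int[] back = roundTrip(3, 7);
        System.out.println(back[0] + "," + back[1]); // prints 3,7
    }
}
```

This is the whole serialization contract: whatever write() emits, readFields() must consume in the same order.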
How it Works: Serialization

WritableComparable Method

public int compareTo(Object other) {
  IntPairWritable o = (IntPairWritable)other;
  if (first < o.first) return -1;
  if (first > o.first) return 1;
  if (second < o.second) return -1;
  if (second > o.second) return 1;
  return 0;
}
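This compareTo defines the order reducers see keys in. A plain-Java sketch of the same first-then-second ordering, using java.lang.Comparable so it can be sorted with Collections.sort (SortDemo and Pair are invented names):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortDemo {
    static class Pair implements Comparable<Pair> {
        int first, second;
        Pair(int f, int s) { first = f; second = s; }

        // Same ordering as the slide's compareTo: by first, then by second.
        public int compareTo(Pair o) {
            if (first != o.first) return first < o.first ? -1 : 1;
            if (second != o.second) return second < o.second ? -1 : 1;
            return 0;
        }

        public String toString() { return first + "," + second; }
    }

    public static String sortDemo() {
        List<Pair> pairs = new ArrayList<>(
            List.of(new Pair(2, 1), new Pair(1, 9), new Pair(1, 2)));
        Collections.sort(pairs); // uses compareTo
        return pairs.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortDemo()); // prints [1,2, 1,9, 2,1]
    }
}
```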
How it Works: Data Flow

Data Flow

Default Flow
1. Mappers read from HDFS
2. Map output is partitioned by key and sent to Reducers
3. Reducers sort input by key
4. Reduce output is written to HDFS

[Diagram: Input (HDFS) > Map > Sort > Reduce > Output (HDFS), repeated across parallel tasks]
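The four steps above can be sketched in plain Java: map each line to words, partition each word by hash across reducers, keep each partition sorted, and sum per key. FlowSketch is an invented name, and the sketch reduces eagerly as records arrive rather than in a separate sorted pass:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class FlowSketch {
    // One TreeMap per reducer: TreeMap keeps keys sorted (the sort step),
    // and merge() sums counts per key (the reduce step).
    public static List<TreeMap<String, Integer>> run(List<String> lines,
                                                     int numReducers) {
        List<TreeMap<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) partitions.add(new TreeMap<>());
        for (String line : lines) {
            for (String word : line.split("\\s+")) {          // map step
                int p = (word.hashCode() & Integer.MAX_VALUE)
                        % numReducers;                        // partition step
                partitions.get(p).merge(word, 1, Integer::sum);
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("The Free", "The"), 2));
    }
}
```

Hash partitioning is what guarantees every occurrence of a given word lands at the same reducer, so each key is summed exactly once.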
How it Works: Data Flow

Combiners

Concept
  Add counts at the Mapper before sending to the Reducer.
  Word count runs in 6 minutes with combiners and 14 minutes without.

Implementation
  Mapper caches output and periodically calls the Combiner
  Input to Combine may come from Map or from a previous Combine
  Combiner uses the same interface as Reducer

[Diagram: Input > Map > Cache > Combine > Sort > Reduce > Output]
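The speedup comes from shrinking what crosses the network: combining the cached < word, 1 > pairs locally ships one record per distinct word instead of one per occurrence. A small sketch of that effect (CombinerSketch is an invented name):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Plays the combiner: collapse the mapper's cached <word, 1> pairs
    // into one <word, count> record per distinct word.
    public static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> cached = List.of("the", "the", "the", "free");
        Map<String, Integer> combined = combine(cached);
        System.out.println(cached.size() + " records before, "
            + combined.size() + " after"); // prints 4 records before, 2 after
    }
}
```

Because word count's reduce is an associative sum, the same class can serve as both combiner and reducer, which is exactly what Main does with setCombinerClass(ReduceClass.class).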
Lab

Exercises

Recommended: Word Count
  Get word count running.
Bigrams
  Count bigrams and unigrams efficiently.
Capitalization
  With what probability is a word capitalized?
Indexer
  In what documents does each word appear? Where in the documents?
Lab

Instructions

1. Log in to the cluster (and set your password).
2. Install Eclipse so you can build Java code.
3. Install the Hadoop plugin for Eclipse so you can deploy jobs to the cluster.
4. Set up your Eclipse workspace from a template that we provide.
5. Run the word count example over the Wikipedia data set.