Map-Reduce and Hadoop




Introduction to Map-Reduce

Map Reduce operations
- Input data are (key, value) pairs
- 2 operations available: map and reduce
- Map
  - Takes a (key, value) pair and generates other (key, value) pairs
- Reduce
  - Takes a key and all associated values
  - Generates (key, value) pairs
- A map-reduce algorithm requires a mapper and a reducer

Map Reduce example
- Compute the average grade of students
- For each course, the professor provides us with a text file
- Text file format: one "student grade" pair per line
- Algorithm (non map-reduce)
  - For each student, collect all grades and compute the average
- Algorithm (map-reduce)
  - Mapper
    - Assume the input file is parsed as (student, grade) pairs
    - So do nothing!
  - Reducer
    - Compute the average of all values for a given key

Map Reduce example
- Input
  - Course 1: Fabrice 20, Brian 10, Paul 15
  - Course 2: Fabrice 15, Brian 20, Paul 10
  - Course 3: Fabrice 10, Brian 15, Paul 20
- Map output
  - (Fabrice, 20) (Brian, 10) (Paul, 15)
  - (Fabrice, 15) (Brian, 20) (Paul, 10)
  - (Fabrice, 10) (Brian, 15) (Paul, 20)
- Reduce input (values grouped by key)
  - (Fabrice, [20, 15, 10]) (Brian, [10, 15, 20]) (Paul, [15, 20, 10])
- Reduce output
  - (Fabrice, 15) (Brian, 15) (Paul, 15)
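The three phases of this example can be sketched in plain Java, without Hadoop. This is only an illustration: the class name `AverageGrades` and the helper methods `map`, `group`, `reduce` and `run` are made up for this sketch, and the grouping step stands in for the work a map-reduce framework does between the mapper and the reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the average-grade example (hypothetical names, no Hadoop).
class AverageGrades {

    // Map phase: the identity mapper, every (student, grade) pair is emitted unchanged.
    static List<Map.Entry<String, Integer>> map(List<Map.Entry<String, Integer>> input) {
        return new ArrayList<>(input);
    }

    // Grouping phase: collect every grade under its student key.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> mapped) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: average all values received for one key.
    static double reduce(List<Integer> grades) {
        return grades.stream().mapToInt(Integer::intValue).average().orElse(0.0);
    }

    // Runs the three phases on the grades of the three course files.
    static Map<String, Double> run() {
        List<Map.Entry<String, Integer>> input = List.of(
            Map.entry("Fabrice", 20), Map.entry("Brian", 10), Map.entry("Paul", 15),
            Map.entry("Fabrice", 15), Map.entry("Brian", 20), Map.entry("Paul", 10),
            Map.entry("Fabrice", 10), Map.entry("Brian", 15), Map.entry("Paul", 20));
        Map<String, Double> result = new TreeMap<>();
        group(map(input)).forEach((student, grades) -> result.put(student, reduce(grades)));
        return result;
    }

    public static void main(String[] args) {
        run().forEach((student, avg) -> System.out.println("(" + student + ", " + avg + ")"));
    }
}
```

Only `map` and `reduce` carry the logic of the algorithm; `group` is the part a framework such as Hadoop would provide.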

Map Reduce example - too easy
- OK, this was easy because
  - We didn't care about technical details like reading inputs
  - All grades count equally: no weighted average
- Now can we do something more complicated?
- Let's compute a weighted average
  - Course 1 has weight 5
  - Course 2 has weight 2
  - Course 3 has weight 3
- What is the problem now?

Map Reduce example
- Input
  - Course 1: Fabrice 20, Brian 10, Paul 15
  - Course 2: Fabrice 15, Brian 20, Paul 10
  - Course 3: Fabrice 10, Brian 15, Paul 20
- Map output
  - (Fabrice, 20) (Brian, 10) (Paul, 15)
  - (Fabrice, 15) (Brian, 20) (Paul, 10)
  - (Fabrice, 10) (Brian, 15) (Paul, 20)
- Reduce input (values grouped by key)
  - (Fabrice, [20, 15, 10]) (Brian, [10, 15, 20]) (Paul, [15, 20, 10])
  - The reducer should be able to discriminate between values
- Reduce output
  - (Fabrice, 15) (Brian, 15) (Paul, 15)

Map Reduce example - advanced
- How to discriminate between values for a given key?
  - We can't, unless the values look different
- New reducer
  - Input: (Name, [course1_grade1, course2_grade2, course3_grade3])
  - Strip the course indication from the values and compute the weighted average
- So we need to change the input of the reducer, which comes from the mapper
- New mapper
  - Input: (Name, Grade)
  - Output: (Name, coursename_grade)
  - The mapper needs to be aware of the input file

Map Reduce example - 2
- Input
  - Course 1: Fabrice 20, Brian 10, Paul 15
  - Course 2: Fabrice 15, Brian 20, Paul 10
  - Course 3: Fabrice 10, Brian 15, Paul 20
- Map output
  - (Fabrice, C1_20) (Brian, C1_10) (Paul, C1_15)
  - (Fabrice, C2_15) (Brian, C2_20) (Paul, C2_10)
  - (Fabrice, C3_10) (Brian, C3_15) (Paul, C3_20)
- Reduce input (values grouped by key)
  - (Fabrice, [C1_20, C2_15, C3_10]) (Brian, [C1_10, C2_20, C3_15]) (Paul, [C1_15, C2_10, C3_20])
- Reduce output (weights 5, 2, 3)
  - (Fabrice, 16) (Brian, 13.5) (Paul, 15.5)
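The tagging mapper and the weighted reducer can be sketched in plain Java as follows. The class name `WeightedGrades`, the `WEIGHTS` table and the `C1`/`C2`/`C3` tags are assumptions made for this sketch; the grades and the weights 5, 2 and 3 come from the course files and weights given earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the weighted-average example (hypothetical names, no Hadoop).
class WeightedGrades {

    // Course weights: Course 1 -> 5, Course 2 -> 2, Course 3 -> 3.
    static final Map<String, Integer> WEIGHTS = Map.of("C1", 5, "C2", 2, "C3", 3);

    // Mapper: tag a grade with the course file it was read from, e.g. ("C1", 20) -> "C1_20".
    static String map(String course, int grade) {
        return course + "_" + grade;
    }

    // Reducer: strip the course tag from each value and compute the weighted average.
    static double reduce(List<String> taggedGrades) {
        double weightedSum = 0;
        int totalWeight = 0;
        for (String tagged : taggedGrades) {
            String[] parts = tagged.split("_");
            int weight = WEIGHTS.get(parts[0]);
            weightedSum += weight * Integer.parseInt(parts[1]);
            totalWeight += weight;
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // Each row is (course, student, grade), taken from the three course files.
        String[][] input = {
            {"C1", "Fabrice", "20"}, {"C1", "Brian", "10"}, {"C1", "Paul", "15"},
            {"C2", "Fabrice", "15"}, {"C2", "Brian", "20"}, {"C2", "Paul", "10"},
            {"C3", "Fabrice", "10"}, {"C3", "Brian", "15"}, {"C3", "Paul", "20"}};
        // Grouping by student stands in for what the framework does between map and reduce.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] row : input) {
            grouped.computeIfAbsent(row[1], k -> new ArrayList<>())
                   .add(map(row[0], Integer.parseInt(row[2])));
        }
        grouped.forEach((student, values) ->
            System.out.println("(" + student + ", " + reduce(values) + ")"));
    }
}
```

Encoding the course inside the value as a string keeps the key unchanged, so grouping by student still works; a composite value type would be a cleaner alternative to string parsing.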

Introduction to Hadoop
F. Huet, Oasis Seminar, 07/07/2010

What is Hadoop?
- A set of software developed by Apache for distributed computing
- Many different projects
  - MapReduce
  - HDFS: Hadoop Distributed File System
  - HBase: Distributed Database
  - ...
- Written in Java
- Can be deployed on any cluster easily

Hadoop Job
- A Hadoop job is composed of a map operation and (possibly) a reduce operation
- Map and reduce operations are implemented in a Mapper subclass and a Reducer subclass
- Hadoop will start many instances of Mapper and Reducer
  - The number of instances is decided at runtime but can be specified
  - Each Mapper instance works on a subset of the input records, called a split

Map-Reduce workflow
[Figure: Map-Reduce workflow. Source: Hadoop: The Definitive Guide]

Mapper
- Extend default class Mapper<K1, V1, K2, V2>
  - K1, V1: type of input (key, value)
  - K2, V2: type of output (key, value)
- Implement
    public void map(K1 key, V1 value, Context context) throws IOException, InterruptedException
- Output of values is done using context.write

Reducer
- Extend default class Reducer<K1, V1, K2, V2>
  - K1, V1: type of input (key, [values])
  - K2, V2: type of output (key, value)
- Implement
    public void reduce(K1 key, Iterable<V1> values, Context context) throws IOException, InterruptedException
  - The values are received as an Iterable<V1>
- Output of values is done using context.write

Input/Output
- Hadoop abstracts data format and I/O away from the map/reduce process
- InputFormat
  - Validates the data input format (user specified)
  - Splits up the input file into splits
  - Provides a RecordReader to read records from the splits
  - Default: TextInputFormat to read text files (the key is the offset of the line in the file, the value is the line)
- OutputFormat
  - Validates the data output format
  - Provides a RecordWriter to write records to the file system
  - Default: TextOutputFormat to write plain text files
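To make the (offset, line) records of the default text input concrete, here is a minimal plain-Java sketch that produces the same kind of pairs from an in-memory string. It does not use Hadoop: the class `OffsetLineRecords` and its `records` method are hypothetical, and the sketch assumes UTF-8 text with lines terminated by a single `\n`.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of (byte offset, line) records, as described for the default text input.
class OffsetLineRecords {

    // Returns one (offset of the line start, line content) record per line of text.
    static List<Map.Entry<Long, String>> records(String text) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            out.add(Map.entry(offset, line));
            // Advance by the encoded length of the line plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> r : records("Fabrice 20\nBrian 10\nPaul 15\n")) {
            System.out.println("(" + r.getKey() + ", " + r.getValue() + ")");
        }
    }
}
```

A mapper reading grade files this way would ignore the key and parse the value, which is why the earlier examples could treat the input as already parsed (student, grade) pairs.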

Hadoop Job example
    Configuration config = new Configuration();
    Job job = new Job(config, "filesplittest");
    // Input is read as text, output keys and values are Text
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(SingleTextOutputFormat.class);
    // Input and output locations
    Path outputdir = new Path(output);
    Path inputpath = new Path(input);
    FileInputFormat.setInputPaths(job, inputpath);
    FileOutputFormat.setOutputPath(job, outputdir);
    // Mapper and reducer implementations
    job.setMapperClass(MapSingleSortedFile.class);
    job.setReducerClass(Reducer.class);

HDFS
- Hadoop Distributed File System
  - Aggregates local storage
  - Used by Hadoop workers to read input, store temporary data and write final output
- Can be accessed using the CLI
  - $> hadoop fs -command
  - -put: copy a local file to HDFS
  - -get: copy an HDFS file to a local directory
- Suitable for large files
  - 64 MB blocks
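As a rough illustration of why the block size matters, the following sketch counts how many 64 MB blocks a file of a given size is divided into. The class `HdfsBlocks` and its `blockCount` method are hypothetical helpers, not part of Hadoop, and the 64 MB figure is simply the block size quoted above.

```java
// Back-of-the-envelope sketch: number of 64 MB blocks needed for a file.
class HdfsBlocks {

    // Block size quoted above: 64 MB.
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    // Number of blocks needed to hold fileSizeBytes (ceiling division).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long oneGigabyte = 1024L * 1024 * 1024;
        System.out.println("1 GB file: " + blockCount(oneGigabyte) + " blocks");
        System.out.println("1 KB file: " + blockCount(1024) + " block");
    }
}
```

A large file spreads over many blocks that can be processed in parallel, whereas many small files each need at least one block of their own to be tracked.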

Demo

Scenario
- Input: a text file made of RDF data (subject, predicate, object)
- Output: 3 files containing the input data sorted by subject, predicate or object
- Hadoop cluster eon 2-4 with HDFS
  - Only need Hadoop conf files to use this cluster
  - Monitor computation using the web interface on eon2
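One way to think about this scenario in map-reduce terms is to have the mapper emit each triple keyed by the field to sort on, and to rely on keys being delivered to the reducer in sorted order. The plain-Java sketch below imitates that idea with a sorted map. It is not the demo's code: the class `SortTriples`, the method `sortBy` and the sample triples are made up, and it assumes one triple per line with whitespace-separated terms that contain no spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of sorting triples by one field, map-reduce style (hypothetical names, no Hadoop).
class SortTriples {

    // field: 0 = subject, 1 = predicate, 2 = object.
    static List<String> sortBy(List<String> triples, int field) {
        // "Map": key each triple by the chosen field. The TreeMap keeps keys sorted,
        // standing in for the sort performed between map and reduce.
        Map<String, List<String>> byKey = new TreeMap<>();
        for (String triple : triples) {
            String key = triple.trim().split("\\s+")[field];
            byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(triple);
        }
        // "Reduce": identity, write the triples out in key order.
        List<String> sorted = new ArrayList<>();
        byKey.values().forEach(sorted::addAll);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> triples = List.of("s2 p1 o3", "s1 p3 o2", "s3 p2 o1");
        System.out.println("by subject:   " + sortBy(triples, 0));
        System.out.println("by predicate: " + sortBy(triples, 1));
        System.out.println("by object:    " + sortBy(triples, 2));
    }
}
```

With several reducers, each one would only see a sorted slice of the keys, so producing a single sorted file per field would need either one reducer or a partitioning scheme that preserves key order.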