# Massive Distributed Processing using Map-Reduce



1 Massive Distributed Processing using Map-Reduce. Dawid Weiss, Institute of Computing Science, Poznań University of Technology, 01/2007

2 1 Introduction 2 Map Reduce 3 Open Source Map-Reduce: Hadoop 4 Experiments at the Institute 5 Conclusions

3 Massive distributed processing problems: numerous processing units, large input, relatively simple computation, large output, distributed input data. Computations are most often very simple; data instances are huge. The input can be fragmented into contiguous 'splits'.

4 Examples of MDP problems: search/scan problems (grep), counting problems (URL access counts), indexing problems (reverse link graphs, inverted indices), sorting problems.

6 The overhead of custom solutions. Parallelization is never easy: job scheduling, failure detection and recovery, job progress/status tracking. The simplicity of the original computation is lost.

8 Map Reduce (Jeffrey Dean, Sanjay Ghemawat; Google Inc.): a technique for automatic parallelization of computations by enforcing a restricted programming model derived from functional languages. Inspiration: the map and reduce operations in Lisp. Hide the messy details, keep the programmer happy. Achieve scalability, robustness and fault-tolerance by adding processing units.

9 The programming model. 1. The input is parcelled into keys and associated values. 2. The map function takes (in_key, in_value) pairs and produces (out_key, im_value) pairs: map(in_key, in_value) -> [(out_key, im_value)]. 3. All values for identical keys are grouped. 4. The reduce function reduces the list of values for a single key to a shorter list of results (typically one or zero): reduce(out_key, [im_value_1, im_value_2, ...]) -> [out_value].
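The four steps above can be sketched as a toy in-memory pipeline in plain Java. This is purely illustrative code, not the Hadoop API; all class and method names below are ours:

```java
import java.util.*;

// A minimal, in-memory sketch of the model: run map over every input
// pair, group the intermediate pairs by out_key, then reduce each group.
public class MapReduceSketch {

    // map(in_key, in_value) -> [(out_key, im_value)]
    // Here: emit (word, 1) for every word in the line.
    static List<Map.Entry<String, Integer>> map(String key, String value) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : value.split("\\s+"))
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        return out;
    }

    // reduce(out_key, [im_value_1, im_value_2, ...]) -> out_value
    // Here: sum the list of ones into a single count.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // The "framework" part: map all inputs, group by key, reduce each group.
    static Map<String, Integer> run(Map<String, String> input) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> in : input.entrySet())
            for (Map.Entry<String, Integer> im : map(in.getKey(), in.getValue()))
                grouped.computeIfAbsent(im.getKey(), k -> new ArrayList<>())
                       .add(im.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : grouped.entrySet())
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> input = Map.of("doc1", "to be or not to be");
        System.out.println(run(input)); // {be=2, not=1, or=1, to=2}
    }
}
```

The grouping step is the only part the programmer never writes; in a real system it is the distributed shuffle between the map and reduce phases.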

12 Example: word counting [Dean and Ghemawat, 2004]

```java
// Map function
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

// Reduce function
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
```

13 More examples. Distributed grep: map: (--, line) -> (line); reduce: identity. Reverse Web link graph: map: (source-url, html-content) -> (target-url, source-url); reduce: (target-url, [source-urls]) -> (target-url, concat(source-urls)). Inverted index of documents: map: (doc-id, content) -> (word, doc-id); reduce: (word, [doc-ids]) -> (word, concat(doc-ids)). More complex tasks are achieved by combining Map-Reduce jobs (the indexing process at Google uses more than 20 MR tasks!).
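The inverted-index job above can be simulated in memory in a few lines. Again this is an illustrative sketch with names of our own choosing, not framework code:

```java
import java.util.*;

// Inverted index as a map-reduce job, collapsed into one method:
// map emits (word, doc-id), the grouping phase collects the doc-ids
// per word, and reduce concatenates them into a posting list.
public class InvertedIndex {
    static Map<String, String> build(Map<String, String> docs) {
        // map + group: word -> sorted set of doc-ids containing it
        Map<String, TreeSet<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet())
            for (String word : doc.getValue().split("\\s+"))
                grouped.computeIfAbsent(word, w -> new TreeSet<>())
                       .add(doc.getKey());
        // reduce: concat(doc-ids) per word
        Map<String, String> index = new TreeMap<>();
        grouped.forEach((word, ids) -> index.put(word, String.join(",", ids)));
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
            "d1", "map reduce hadoop",
            "d2", "hadoop cluster");
        System.out.println(build(docs));
        // {cluster=d2, hadoop=d1,d2, map=d1, reduce=d1}
    }
}
```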

14 Example reduce phase (indexing at Google).

15 Further improvements. Combiners (avoid too much intermediate traffic). Speculative execution (anticipate invalid/broken nodes). Load balancing (split the input into possibly many map tasks). Data access optimizations (keep processing close to the input).
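Why a combiner reduces intermediate traffic can be shown with a toy word-counting calculation (the names below are ours, not Hadoop's): without a combiner, a mapper ships one (word, 1) pair per word occurrence; with one, counts are pre-summed locally, so at most one pair per distinct word leaves each split:

```java
import java.util.*;

// Counts how many intermediate (word, count) pairs a single map split
// would emit over the network, with and without local pre-aggregation.
public class CombinerEffect {
    static int pairsWithoutCombiner(String split) {
        // one (word, 1) pair per token
        return split.split("\\s+").length;
    }
    static int pairsWithCombiner(String split) {
        // one (word, localSum) pair per distinct word
        return new HashSet<>(Arrays.asList(split.split("\\s+"))).size();
    }
    public static void main(String[] args) {
        String split = "the cat and the dog and the bird";
        System.out.println(pairsWithoutCombiner(split)); // 8
        System.out.println(pairsWithCombiner(split));    // 5
    }
}
```

On real corpora, where common words repeat thousands of times per split, the saving is far larger than in this toy example.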

17 The Hadoop project. Mike Cafarella, Doug Cutting and others. Originally part of the Apache Lucene codebase. Impressively dynamic growth as a Lucene sub-project. Apache license.

18 The open source MapReduce environment (layered stack): Hadoop provides the Distributed File System (DFS) and MapReduce; Lucene adds generic indexing and query parsers/searching; Nutch adds the crawler and the Web front-end.

19 HDFS assumptions. HDFS is inspired by GFS (the Google File System). Design goals: expect hardware failures (processes, disks, nodes); streaming data access; large files (TB of data); a simple coherence model (one writer, many readers); optimization of computation in MapReduce (locality); a single master (name node), multiple slaves (data nodes).

20 Hadoop requirements. Installation/operation: Java 1.5.x or higher, preferably from Sun; Linux or Windows (under Cygwin). MapReduce jobs: preferably implemented in Java; Hadoop Streaming (arbitrary shell commands); C/C++ APIs to HDFS.

21 Example: word counting

```java
/**
 * Counts the words in each line. For each line of input,
 * break the line into words and emit them as (word, 1).
 */
public static class MapClass
    extends MapReduceBase implements Mapper {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(WritableComparable key, Writable value,
      OutputCollector output, Reporter reporter) throws IOException {
    final String line = ((Text) value).toString();
    final StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
```

22 Example: word counting

```java
/**
 * A reducer class that just emits the sum of the input values.
 */
public static class Reduce
    extends MapReduceBase implements Reducer {

  public void reduce(WritableComparable key, Iterator values,
      OutputCollector output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += ((IntWritable) values.next()).get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
```

23 Example: word counting

```java
public static void main(String[] args) throws IOException {
  final JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  // The keys are words (strings).
  conf.setOutputKeyClass(Text.class);
  // The values are counts (ints).
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  // [...]

  conf.setInputPath(new Path(input));
  conf.setOutputPath(new Path(output));

  // Uncomment to run locally in a single process
  // conf.set("mapred.job.tracker", "local");

  JobClient.runJob(conf);
}
```

24 The trickery of Hadooping... Windows installation often broken (scripts, paths). Documentation scarce and not up-to-date. Real setup of a distributed cluster requires some initial work (account setup, moving distributions around).

27 Requirements: Linux-based systems; shell access and password-less SSH access between the cluster's nodes; certain open ports within the cluster (for DFS, the trackers, the Web interface). Conclusion: at the moment, setting up a Hadoop installation at lab-45 is problematic.

28 Test installation at lab-142/lab-143. Installation profile: an out-of-the-box installation of Hadoop; a cluster of 7, then 28 machines; code/configuration distribution provided over NFS; one master (name node, job tracker), multiple data nodes (DFS/MR). Simple experiments performed: DFS performance, the word counting example, the sorting example.

31 DFS performance. Setup: nodes: 7; replication factor: 3; block size: 64 MB; local file size: 1.6 GB (the entire Rzeczpospolita corpus, concatenated). Results: copy to DFS: 5 min 25 s at 5-8 MB/s (network- and local-machine-bound); random write from within a Map-Reduce job, 28 DFS nodes: 2.73 GB in 1 min 20 s.
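As a quick sanity check on these figures (our own arithmetic, assuming 1 GB = 1024 MB), copying 1.6 GB in 5 min 25 s averages roughly 5 MB/s, consistent with the observed 5-8 MB/s range:

```java
// Average copy throughput for a file of `gigabytes` GB
// transferred in `minutes` min `seconds` s.
public class DfsThroughput {
    static double throughputMBps(double gigabytes, int minutes, int seconds) {
        return gigabytes * 1024 / (minutes * 60 + seconds);
    }
    public static void main(String[] args) {
        System.out.printf("%.2f MB/s%n", throughputMBps(1.6, 5, 25)); // ~5.04 MB/s
    }
}
```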

36 Word counting. Setup: nodes: initially 7, repeated for 28; replication factor: 3; block size: 64 MB; input: a 1.6 GB DFS file (Rzepus); maps: 67; reduces: 7. Results: 7 nodes: 5 min 31 s (note: a full cycle of read, word count, write); 28 nodes: 2 min 21 s; 28 nodes (281 maps, 29 reduces): 2 min 31 s.

37 Screenshots from the Web interface and computation progress: Map-Reduce job progress, the MR cluster administration interface, and the output.

43 Plot: MR progress in the word counting task, showing completion % of the Map and Reduce phases over time (21:03:30 to 21:07:00).

44 Screenshots: failure handling (robustness). Broken node?

49 Sorting. Setup: nodes: 28; replication factor: 3; block size: 64 MB; input: DFS files totaling 2.73 GB (random byte sequences); maps: 280; reduces: 7. Results: read-sort-write time: 4 min 18 s.

51 Conclusions. Map-Reduce is an interesting programming paradigm: automatic parallelism, scalability, fault-tolerance. Hadoop provides a cost-effective option for experiments with Map-Reduce. Documentation is lacking, but the source code is available.

52 References: Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI '04). Apache Lucene (2007), on-line. Apache Nutch (2007), on-line. Apache Hadoop (2007), on-line.

54 Thank you.
