# Cloud Computing. Lectures 7 and 8 Map Reduce


## Transcription


### Up until now

- Introduction
- Definition of Cloud Computing
- Grid Computing
- Content Distribution Networks
- Cycle-Sharing

### Outline

- MapReduce: What is it?
- Concepts
- An Example: PageRank

### MapReduce: What is it?

Typical challenges in parallel processing:

- How do we assign tasks to the workers?
- What if we have more tasks than workers?
- What if the workers need to share partial results?
- How do we aggregate partial results?
- How do we know whether the workers have finished?
- What if the workers fail?

MapReduce = Functional Programming + Distributed Processing Platform

### Origins: Functional Programming

An old idea from the 1950s.

What is functional programming?

- Computing as the composition and sequencing of a set of functions.
- Its theoretical foundation is the lambda calculus.

How does it differ from imperative programming?

- The concepts of data and instructions are blended: functions are values that can be passed around like any other data.
- The flow of data is implicit.
- Execution order is less relevant to the design of the program.

### For example, in Scheme

```scheme
(define (foo x y)
  (sqrt (+ (* x x) (* y y))))
(foo 3 4)    ; => 5

(define (bar f x) (f (f x)))
(define (baz x) (* x x))
(bar baz 2)  ; => 16
```

### So... What does this have to do with MapReduce?

Scheme and Lisp are strongly modelled on list processing. They use two basic concepts from functional programming:

- Map: apply the same operation to all the elements of a list.
- Fold: use an operator to combine all the elements of a list.

### Map

Map is a second-order function: it receives another function as a parameter.

It works by:

- Applying the parameter function to each element of a list.
- Thereby generating a new list.

(Figure: the function f applied independently to each element of the list.)

### Reduce

Fold (Reduce) is also a second-order function.

It works by:

- Initializing an accumulator.
- Applying the parameter function to the accumulator and the first element of the list.
- Storing the result in the accumulator.
- Repeating the operation for each remaining element of the list.
- Returning the final value of the accumulator as the result.

(Figure: f applied to each element in turn, threading the accumulator from the initial value to the final value.)
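The two second-order functions described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name `MapFold` and the method names `map` and `fold` are hypothetical and do not come from the lecture or from Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical helper class; not part of the lecture or of any library.
class MapFold {
    // map: apply f to every element, producing a new list of the same length.
    static <A, B> List<B> map(Function<A, B> f, List<A> xs) {
        List<B> out = new ArrayList<>();
        for (A x : xs) {
            out.add(f.apply(x));
        }
        return out;
    }

    // fold: start from an initial accumulator and combine it with each element in turn.
    static <A, B> B fold(BiFunction<B, A, B> f, B initial, List<A> xs) {
        B acc = initial;
        for (A x : xs) {
            acc = f.apply(acc, x);
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Integer> v = List.of(1, 2, 3, 4, 5);
        List<Integer> squares = map((Integer x) -> x * x, v);
        Integer sum = fold((Integer acc, Integer x) -> acc + x, 0, squares);
        System.out.println(squares + " " + sum); // prints "[1, 4, 9, 16, 25] 55"
    }
}
```

Each call to `f` inside `map` is independent of the others, which is what makes the map step easy to parallelize; `fold` carries an accumulator from one element to the next, so its steps are ordered.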

### Map/Fold Example

Simple map example:

```scheme
(map (lambda (x) (* x x)) '(1 2 3 4 5))  ; => '(1 4 9 16 25)
```

Simple fold examples:

```scheme
(fold + 0 '(1 2 3 4 5))  ; => 15
(fold * 1 '(1 2 3 4 5))  ; => 120
```

Sum of squares:

```scheme
(define (squares-sum v)
  (fold + 0 (map (lambda (x) (* x x)) v)))
(squares-sum '(1 2 3 4 5))  ; => 55
```

### MapReduce

Map + fold over lists of `<key, value>` pairs.

- Map: operates on `<key1, value1>` pairs, producing lists of `<key2, value2>` pairs.
- Reduce: receives all `<key2, value2>` pairs for a specific `key2` and generates `<key3, value3>` pairs.
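The key/value flow above can be imitated on a single machine. The following is a minimal sketch, assuming word counting as the job; the class `MiniMapReduce` and its `run` driver are hypothetical names, and the grouping step stands in for the shuffle that a real framework performs across machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of the MapReduce data flow.
class MiniMapReduce {
    // Map: (lineNumber, line) -> list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(int lineNumber, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total).
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return new SimpleEntry<>(word, sum);
    }

    // Driver: map every record, group the intermediate pairs by key, reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (Map.Entry<String, Integer> kv : map(i, lines.get(i))) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            Map.Entry<String, Integer> kv = reduce(g.getKey(), g.getValue());
            result.put(kv.getKey(), kv.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "{dog=1, end=1, fox=1, lazy=1, quick=1, the=3}"
        System.out.println(run(List.of("the quick fox", "the lazy dog", "the end")));
    }
}
```

Here `key1` is the line number, `key2` and `key3` are both the word, and the values are counts.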

### MapReduce

(Figure: input data, the master, and the partitioned output.)

### Good and Bad Examples

MapReduce is good for:

- Log indexing.
- Ordering large amounts of data.
- Analysing images.

MapReduce is bad for:

- Calculating digits of π.
- Calculating sequences of Fibonacci numbers.
- Replacing a relational database.

### Real Examples

- Implementing scalable learning algorithms.
- Graph algorithms, e.g. travelling salesman.
- Gathering and analysing medical information.
- Detecting face similarities in large sets of images.
- Web crawling.

### Hadoop: Map/Reduce

- Hadoop: a FLOSS Apache project that reimplements several of Google's cloud components, for example MapReduce.
- Example: the "Hello World" of distributed processing, word counting.

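### Wordcount: Map

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Written against the generic org.apache.hadoop.mapred API.
public class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every token of the input line.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
```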

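### Wordcount: Reduce

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Written against the generic org.apache.hadoop.mapred API.
public class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Sums all the counts received for one word and emits (word, total).
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
```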

### Datatypes

- `Writable`: defines a serialization protocol. All datatypes are `Writable`.
- `WritableComparable`: defines an ordering criterion. All keys must be of this type; values need not be.
- `IntWritable`, `LongWritable`, `Text`: concrete classes for the classic datatypes.

### Basic Datatypes

- `IntWritable`
- `DoubleWritable`
- `FloatWritable`
- `BooleanWritable`
- `ArrayWritable`
- `BytesWritable`
- `MapWritable`
- `VLongWritable`
- `VIntWritable`

### Complex Datatypes

The easy way:

- Encode them as text, e.g. (a, b) = `a:b`.
- Use regular expressions to parse and extract the data.
- It works, but it is bad software engineering.

The not so easy way:

- Define an implementation of `WritableComparable`.
- You must implement `readFields`, `write` and `compareTo`.
- Computationally efficient.
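As an illustration of "the easy way", a pair can be written as `a:b` text and parsed back with a regular expression. This is a minimal sketch, assuming both fields are integers; the class `TextPair` and its methods are hypothetical and are not part of Hadoop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a pair of integers carried inside a text value.
class TextPair {
    // Assumed encoding: two (possibly negative) integers separated by a colon, e.g. "3:42".
    private static final Pattern PAIR = Pattern.compile("(-?\\d+):(-?\\d+)");

    static String encode(int a, long b) {
        return a + ":" + b;
    }

    // Returns {a, b}; rejects any text that is not exactly of the form a:b.
    static long[] decode(String text) {
        Matcher m = PAIR.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an a:b pair: " + text);
        }
        return new long[] { Long.parseLong(m.group(1)), Long.parseLong(m.group(2)) };
    }

    public static void main(String[] args) {
        String encoded = encode(3, 42L);
        long[] decoded = decode(encoded);
        System.out.println(encoded + " -> " + decoded[0] + ", " + decoded[1]); // prints "3:42 -> 3, 42"
    }
}
```

Every record pays for string formatting and regex matching, and a field that itself contains the separator silently breaks the format, which is why the slide calls this approach bad software engineering.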

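### Writable

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class MyWritable implements Writable {
    private int counter;
    private long timestamp;

    // Serializes the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    // Reads the fields back in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    // Convenience factory: creates an instance and fills it from the stream.
    public static MyWritable read(DataInput in) throws IOException {
        MyWritable w = new MyWritable();
        w.readFields(in);
        return w;
    }
}
```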

### WritableComparable

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class MyWritableComparable
        implements WritableComparable<MyWritableComparable> {
    private int counter;
    private long timestamp;

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }

    // Orders instances by counter only; timestamp does not take part in the ordering.
    public int compareTo(MyWritableComparable w) {
        int thisValue = this.counter;
        int thatValue = w.counter;
        return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
    }
}
```

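### Wordcount: Main

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

// Member of class WordCount. TokenizerMapper and IntSumReducer are the mapper
// and reducer classes of the org.apache.hadoop.mapreduce version of the example.
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
```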

### PageRank: Random Walks in the Web

- PageRank is Google's famous web page ranking algorithm.
- Rationale: if a user starts on a random web page and keeps clicking on links, what is the probability that they will reach a given page?
- A high probability signals an important page.
- This is the basic principle of PageRank: pages that are pointed to more often get more points.

### PageRank: Visually

(Figure: illustration of PageRank on a small graph. Source: en.wikipedia.org)

### PageRank: Formula

Given a page $A$ and other pages $T_1$ up to $T_n$ with links to $A$, PageRank is defined as:

$$PR(A) = (1-d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$$

- $C(P)$ is the cardinality of $P$ (its number of outbound links).
- $d$ is the damping factor (normally 0.85). It is the probability that the surfer reaches the next page by following a link; $1-d$ is the probability of choosing a page without coming from a link.
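The formula can be evaluated directly for a single page. The sketch below is only a worked example: the class `PageRankFormula`, its method, and the input values are assumptions chosen for illustration.

```java
// Hypothetical helper that evaluates the PageRank formula for one page A.
class PageRankFormula {
    // PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
    // inboundRanks[i] is PR(Ti); outboundCounts[i] is C(Ti).
    static double pageRank(double d, double[] inboundRanks, int[] outboundCounts) {
        if (inboundRanks.length != outboundCounts.length) {
            throw new IllegalArgumentException("One outbound count is needed per inbound page");
        }
        double sum = 0.0;
        for (int i = 0; i < inboundRanks.length; i++) {
            sum += inboundRanks[i] / outboundCounts[i];
        }
        return (1 - d) + d * sum;
    }

    public static void main(String[] args) {
        // Assumed example: T1 has rank 1.0 and 2 outbound links, T2 has rank 1.0 and 1 outbound link.
        double pr = pageRank(0.85, new double[] {1.0, 1.0}, new int[] {2, 1});
        System.out.println("PR(A) = " + pr);
    }
}
```

With these assumed inputs the sum of fragments is 1.0/2 + 1.0/1 = 1.5, so PR(A) = 0.15 + 0.85 × 1.5 = 1.425, up to floating-point rounding.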

### PageRank: Intuition

- The calculation is iterative: $PR_{i+1}$ is based on $PR_i$.
- Each page distributes its $PR_i$ among all the pages it links to.
- The pages that receive fragments of PR add up all the received fragments to generate their $PR_{i+1}$.

### PageRank: Problems

- Will PageRank converge?
- How fast?
- What is the correct value of $d$, and how sensitive is the algorithm to this value?
- What is the most efficient algorithm for PageRank?

### PageRank: First Implementation

- Create two tables (`current` and `next`) with the PageRank of each page.
- Fill the `current` table with the initial values of PR and the adjacent pages.
- Traverse the graph, distributing each page's `current` PR among the `next` PR entries of the pages it links to.
- `current := next; next := new_table();`
- Iterate again or stop.
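The steps above can be sketched as a sequential Java program. Several choices here are assumptions that the slides do not fix: every page starts with rank 1.0, pages without outbound links simply distribute nothing, and the loop runs for a fixed number of iterations. The class name `SequentialPageRank` is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sequential version of the two-table PageRank iteration.
class SequentialPageRank {
    // links maps each page to the list of pages it points to.
    static Map<String, Double> iterate(Map<String, List<String>> links, double d, int iterations) {
        Map<String, Double> current = new HashMap<>();
        for (String page : links.keySet()) {
            current.put(page, 1.0); // assumed initial PR
        }
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String page : links.keySet()) {
                next.put(page, 0.0);
            }
            // Distribute each page's current PR evenly among its link targets.
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    continue; // assumption: dangling pages distribute nothing
                }
                double fragment = current.get(e.getKey()) / targets.size();
                for (String t : targets) {
                    next.merge(t, fragment, Double::sum);
                }
            }
            // Apply the damping factor to the summed fragments.
            for (Map.Entry<String, Double> e : next.entrySet()) {
                e.setValue((1 - d) + d * e.getValue());
            }
            current = next; // current := next; a fresh next table is created on the next pass
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("A", List.of("B", "C"));
        links.put("B", List.of("C"));
        links.put("C", List.of("A"));
        System.out.println(iterate(links, 0.85, 20));
    }
}
```

A convergence test, for example stopping when no rank changes by more than a small threshold, could replace the fixed iteration count.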

### Algorithm Distribution

Parallelization intuition:

- Each row of the `next` table depends on the `current` table, but not on other rows of the `next` table.
- Individual rows of the adjacency table (the table of neighbours) can be processed in parallel.
- The adjacency matrix is sparse: each row holds few entries compared to the number of nodes (pages).

### Algorithm Distribution

Consequences of this approach:

- We can map each row of the `current` table, thereby generating PR fragments to be distributed to all the pages pointed to by the page.
- Fragments can be reduced (added) to a single value of PR.

- Map step: break the PageRank into even fragments to distribute to the link targets.
- Reduce step: add the fragments together into the next PageRank.
- Iterate for the next step...

### Phase 1: Reading the HTML

- The map task reads `(URL, page-content)` and transforms it into `(URL, (PR_init, url-list))`.
- `PR_init` is the initial PageRank value for the particular URL.
- `url-list` contains all the pages to which the page points.
- In this iteration, the reducer is the identity function (pass-through reducer).
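The extraction of `url-list` from the page content might look like the sketch below. It is a deliberately crude assumption: links are taken to be double-quoted `href` attributes found with a regular expression, whereas a real crawler would use an HTML parser and resolve relative URLs. The class `LinkExtractor` and the constant `INITIAL_RANK` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Phase 1 helper: page content in, list of link targets out.
class LinkExtractor {
    // Assumed initial PageRank for every page (PR_init in the slide).
    static final double INITIAL_RANK = 1.0;

    // Crude assumption: every link is a double-quoted href attribute.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://a.example/\">a</a> <A HREF=\"http://b.example/\">b</A>";
        // prints "(1.0, [http://a.example/, http://b.example/])"
        System.out.println("(" + INITIAL_RANK + ", " + extractLinks(page) + ")");
    }
}
```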

### Phase 2: PageRank Distribution (i)

- Map reads `(URL, (current_pr, url-list))`.
- For each `u` in `url-list`, it outputs `(u, current_pr / |url-list|)`.
- It also outputs `(URL, url-list)` to pass the graph arcs on to the next iterations.

$$PR(A) = (1-d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$$

### Phase 2: PageRank Distribution (ii)

- Reduce receives `(URL, url-list)` and many `(URL, PR_frag)` pairs.
- It adds up all the fragments and adjusts the result with $d$.
- It generates `(URL, (new_pr, url-list))`.

$$PR(A) = (1-d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$$

A non-parallel component decides whether the algorithm has converged. (Fixed number of iterations? Comparison of critical values?)
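One Phase 2 iteration can be imitated in memory to see how the two kinds of intermediate values travel together. This is a sketch under assumptions: the classes `PageRankStep`, `Node` and `Message` are hypothetical, and a plain `HashMap` plays the role of the framework's shuffle.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process imitation of one PageRank MapReduce iteration.
class PageRankStep {
    // The (PR, url-list) value attached to each URL.
    static final class Node {
        final double rank;
        final List<String> links;
        Node(double rank, List<String> links) { this.rank = rank; this.links = links; }
    }

    // An intermediate value: a PR fragment (links == null) or the url-list itself.
    static final class Message {
        final double fragment;
        final List<String> links;
        Message(double fragment, List<String> links) { this.fragment = fragment; this.links = links; }
    }

    static void emit(Map<String, List<Message>> shuffle, String key, Message value) {
        shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Map: send an even fragment to each link target, and pass the url-list along.
    static void map(String url, Node node, Map<String, List<Message>> shuffle) {
        for (String u : node.links) {
            emit(shuffle, u, new Message(node.rank / node.links.size(), null));
        }
        emit(shuffle, url, new Message(0.0, node.links));
    }

    // Reduce: add up the fragments, apply d, and reattach the url-list.
    static Node reduce(String url, List<Message> values, double d) {
        double sum = 0.0;
        List<String> links = new ArrayList<>();
        for (Message m : values) {
            if (m.links != null) {
                links = m.links;
            } else {
                sum += m.fragment;
            }
        }
        return new Node((1 - d) + d * sum, links);
    }

    static Map<String, Node> step(Map<String, Node> graph, double d) {
        Map<String, List<Message>> shuffle = new HashMap<>();
        for (Map.Entry<String, Node> e : graph.entrySet()) {
            map(e.getKey(), e.getValue(), shuffle);
        }
        Map<String, Node> next = new HashMap<>();
        for (Map.Entry<String, List<Message>> e : shuffle.entrySet()) {
            next.put(e.getKey(), reduce(e.getKey(), e.getValue(), d));
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Node> graph = new HashMap<>();
        graph.put("A", new Node(1.0, List.of("B", "C")));
        graph.put("B", new Node(1.0, List.of("C")));
        graph.put("C", new Node(1.0, List.of("A")));
        for (Map.Entry<String, Node> e : step(graph, 0.85).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue().rank + " " + e.getValue().links);
        }
    }
}
```

The output of `step` has the same shape as its input, so a driver can call it repeatedly; deciding when to stop is the non-parallel part mentioned above.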

### The Original Challenges

- Scheduling: assigns map and reduce tasks to the workers.
- Distribution: workers are moved to the data. (Will be important later on.)
- Synchronization: gathers, orders and distributes intermediate results.
- Fault tolerance: detection and restart of failed tasks (next time).
- All of this relies on a distributed file system (next lecture).

### Next time

MapReduce: systems perspective and other applications.


### INFO5011. Cloud Computing Semester 2, 2011 Lecture 7, MapReduce (II)

INFO5011 Cloud Computing Semester 2, 2011 Lecture 7, MapReduce (II) COMMONWEALTH OF Copyright Regulations 1969 WARNING This material has been reproduced and communicated to you by or on behalf of the university

### The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org

The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache

### MapReduce. Course NDBI040: Big Data Management and NoSQL Databases. Practice 01: Martin Svoboda

Course NDBI040: Big Data Management and NoSQL Databases Practice 01: MapReduce Martin Svoboda Faculty of Mathematics and Physics, Charles University in Prague MapReduce: Overview MapReduce Programming

### Distributed computing: index building and use

Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

### Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

### Hadoop Basics with InfoSphere BigInsights

An IBM Proof of Technology Hadoop Basics with InfoSphere BigInsights Unit 2: Using MapReduce An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government Users Restricted Rights

### Outline of Tutorial. Hadoop and Pig Overview Hands-on

Outline of Tutorial Hadoop and Pig Overview Hands-on 1 Hadoop and Pig Overview Lavanya Ramakrishnan Shane Canon Lawrence Berkeley National Lab October 2011 Overview Concepts & Background MapReduce and