map/reduce connected components

find connected components with an algorithm analogous to the one for MST:
- map: assign edges randomly to partitions (k subgraphs of n nodes each)
- reduce: in each partition, remove edges so that only a tree remains
  - connectivity is not affected
  - each reducer returns at most n edges
- the resulting graph has fewer edges (at most k·n)
- number of iterations: analogous to MST
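One filtering round as described above can be sketched in plain Java, without Hadoop. This is only an illustration under assumptions: the class name `ComponentFilter`, the union-find helper and the in-memory "partitions" are hypothetical stand-ins for mappers and reducers, and nodes are assumed to be numbered 0..n-1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class ComponentFilter {
    // Union-find "find" with path halving.
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Reduce step: keep only edges that join two different components,
    // i.e. a spanning forest of the given edge set.
    static List<int[]> spanningForest(int n, List<int[]> edges) {
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        List<int[]> kept = new ArrayList<>();
        for (int[] e : edges) {
            int a = find(parent, e[0]);
            int b = find(parent, e[1]);
            if (a != b) {
                parent[a] = b;
                kept.add(e);
            }
        }
        return kept;
    }

    // One filtering round: map edges randomly to k partitions,
    // reduce each partition to a spanning forest, recombine the results.
    static List<int[]> filterRound(int n, List<int[]> edges, int k, Random rnd) {
        List<List<int[]>> parts = new ArrayList<>();
        for (int p = 0; p < k; p++) parts.add(new ArrayList<>());
        for (int[] e : edges) parts.get(rnd.nextInt(k)).add(e);
        List<int[]> result = new ArrayList<>();
        for (List<int[]> part : parts) result.addAll(spanningForest(n, part));
        return result;
    }

    // Number of connected components: n minus the size of a spanning forest.
    static int countComponents(int n, List<int[]> edges) {
        return n - spanningForest(n, edges).size();
    }
}
```

Calling `filterRound` repeatedly until the edge list fits on one machine, and then `countComponents` on the remainder, mirrors the iteration on the slide; the number of components is the same before and after each round.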

map/reduce: generalizing the filtering approach
- partition randomly into subproblems
- reduce the size of each subproblem independently
- recombine into a (reduced) problem
- repeat until the problem size is small enough for a single node
- solve the small/sparse instance on a single node
for instantiations, two guarantees are needed:
- parts filtered out in the subproblems do not affect the optimal solution
- the number of iterations is limited

map/reduce: performance of algorithms
map/reduce algorithms vary in a number of behavioral parameters that characterize them:
- performance (total computation time)
- incurred workload (total work involved on all machines)
- space consumption (memory)
these can be formalized in two groups [1]:
- key complexity: resource consumption on individual nodes
- sequential complexity: overall resource consumption
[1] Goel, Munagala, 2012

map/reduce: performance of algorithms
key complexity
- maximum size of any key/value pair
- maximum running time of any mapper or reducer on any key/value pair
- maximum memory consumption of any mapper or reducer on any key/value pair
- nodes must be capable of executing individual map/reduce operations
sequential complexity
- size of all key/value pairs input and output by mappers/reducers
- total running time of all mappers/reducers
- the whole system must be capable of the execution (e.g. sufficient number of machines)
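The difference between the two groups can be illustrated with a small Java sketch. It covers only the size aspect, and it assumes a hypothetical map from each key to the number of values its reducer receives; the class and method names are made up for this example.

```java
import java.util.Map;

class ComplexityMeasures {
    // Key complexity (size part): the largest input any single reducer call receives.
    static long keyComplexity(Map<String, Integer> valuesPerKey) {
        long max = 0;
        for (int count : valuesPerKey.values()) max = Math.max(max, count);
        return max;
    }

    // Sequential complexity (size part): the total input over all reducer calls.
    static long sequentialComplexity(Map<String, Integer> valuesPerKey) {
        long sum = 0;
        for (int count : valuesPerKey.values()) sum += count;
        return sum;
    }
}
```

A single very frequent key drives the key complexity up even when the sequential complexity is moderate, which is why the two are tracked separately.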

map/reduce: complexity classes
- efficiency of algorithms is measured by their complexity
- complexity (time/space) is defined as a function of the size of the input problem
- problems are distinguished by the (possible) existence of efficient algorithms
- example: P [2]
  P = { decision problems solvable in time $O(p(n))$, where $p$ is a polynomial in the input size $n$ }
- problems are tractable (efficiently solvable) if a polynomial-time algorithm exists
- is there a comparable definition for map/reduce?
[2] This is not the exact definition, but an illustration.

complexity for parallel algorithms: NC
PRAM setting: many cores, common memory
- parallel algorithms can use arbitrarily many cores
- these are usually not available (have to be emulated)
NC: Nick's class
A decision problem is in NC if there exists an algorithm solving it which
- uses a polynomial number of cores, $O(n^k)$
- solves the problem in polylogarithmic time, $O((\log n)^c)$
with $k$ and $c$ constant and $n$ being the input size.
- considered to be the class of efficiently parallelizable problems
- variants: $NC^i$, in time $O((\log n)^i)$ (often denoted $O(\log^i n)$)

complexity for map/reduce algorithms: $MRC^i$
Let $\epsilon > 0$ be a fixed value. Let the input be a finite sequence of pairs $\langle k_i, v_i \rangle$ with total size (in bits) $n$. An algorithm $A$ consists of $R$ map ($\mu$) and reduce ($\rho$) steps $\mu_1, \rho_1, \mu_2, \rho_2, \ldots, \mu_R, \rho_R$.
$A$ is in $MRC^i$ if it outputs the correct answer with probability at least 3/4 and, for input size $n$:
- each $\mu_r$/$\rho_r$ is a randomized mapper/reducer with run time polynomial in $n$, memory consumption $O(n^{1-\epsilon})$ and word length $O(\log n)$
- the total space consumption of the key/value pairs resulting from any $\mu_r$ is in $O(n^{2-2\epsilon})$ (note: $n^{2-2\epsilon} = (n^{1-\epsilon})^2$)
- the number of rounds is $R \in O(\log^i n)$
deterministic variant: $DMRC^i$, with probability 1
source: Karloff, Suri, Vassilvitskii, A Model of Computation for MapReduce

interpretation of $MRC^i$
- each individual task (map/reduce) has polynomial run time
- space consumption is $O(n^{1-\epsilon})$, e.g. $O(\sqrt{n})$ for $\epsilon = 0.5$; $\epsilon$ should be maximized
- a mapper should not produce more than a quadratic amount of output: $O(n^{1-\epsilon})$ in memory, $O(n^{2-2\epsilon})$ output
- total memory on all nodes is limited to $O(n^{2-2\epsilon})$, which needs $O(n^{1-\epsilon})$ nodes
- note: P and NC are classes of problems, MRC is a class of algorithms
- open question: how to distribute (shuffle) $O(n^{2-2\epsilon})$ key/value pairs in memory $O(n^{1-\epsilon})$?
  Graham's Greedy Algorithm (Graham 1966)
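To make these bounds concrete, the following worked example plugs in $\epsilon = 1/2$ and an input of $n = 10^6$ bits. Both values are chosen only for illustration, and the constants hidden by the O-notation are ignored.

```latex
\begin{align*}
\text{memory per mapper/reducer:}\quad & n^{1-\epsilon} = (10^{6})^{1/2} = 10^{3} \\
\text{total space of key/value pairs:}\quad & n^{2-2\epsilon} = (10^{6})^{1} = 10^{6} \\
\text{number of nodes:}\quad & \frac{n^{2-2\epsilon}}{n^{1-\epsilon}} = n^{1-\epsilon} = 10^{3}
\end{align*}
```

With this choice of $\epsilon$, roughly a thousand nodes with a thousand bits of memory each hold the whole intermediate data, and no single node could hold the input on its own.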

MST complexity
note: the notations of MRC and MST collide; here $n$ and $\epsilon$ are taken from MST
- input size: $n^{1+c}$
- steps: $c/\epsilon$, $O(\log n)$ (log n = 1 + c)
- nodes in MST need memory $n^{1+\epsilon}$
- memory: $n^{1+\epsilon} < n^{1+c}$, otherwise direct solution on a single node

Hadoop/HDFS

introduction
- up to here, only the MongoDB implementation of map/reduce was considered
  - this is not a full implementation and comes with limitations, e.g. no guarantee that all keys end up at one reducer
  - its behavior cannot be influenced
  - it has the advantage of simple setup and simple usage
- an example of a map/reduce implementation with more features and possibilities is Hadoop/HDFS
  - allows implementations in Java

introduction
two main components:
- HDFS: the Hadoop distributed file system
  - runs on top of the OS file system
  - provides a view on real files stored on the actual hardware
  - fixed block size (64 MB)
  - optimized for write once, read often
- Hadoop: the execution layer
  - implements the map/reduce execution
  - handles failed tasks (retry and give up)
  - handles distribution of tasks
- both (storage and execution layer) run on the same nodes

typical architecture
- a network of nodes is connected to form a Hadoop cluster
- one master node
  - NameNode: addresses data (which block on which node)
  - JobTracker: execution management
- slave nodes
  - DataNode: data storage
  - TaskTracker: execution of tasks
  - both processes run on the same (physical) node
- clients send jobs to the JobTracker
- the JobTracker distributes tasks among the slave nodes

HDFS: overview
NameNode
- distributes blocks to nodes
- ensures redundancy
- constant contact: check-in from slaves
- keeps data organized as files and directories
- is a single point of failure
- handles only metadata; nodes and clients communicate directly
optimized for streaming access
- no random file access
- no appending of data
organized like Unix file systems
- can be mounted (i.e. blended into the general file system)

setting up an example installation
requirements
- Hadoop/HDFS use ssh and rsync for communication
  - ssh: secure shell (remote login), needs client and server (sshd)
  - rsync: remote synchronization (data transfer)
- jobs are implemented in Java, so a Java Runtime Environment is needed
installation from tarball [3]
- use the latest stable version (2.4.1)
- local installation: standalone mode
- extract the tarball, change into the directory, test: run bin/hadoop
[3] source: https://hadoop.apache.org/releases.html

test job execution
- the Hadoop distribution provides example jobs in the hadoop-examples jar
- jobs need an input and an output directory
- create an input directory and copy some files into it:
    $ mkdir input
    $ cp conf/*.xml input
- execute an example job (replace <version> with the version of the distribution; the pattern is quoted so that the shell does not expand it):
    $ bin/hadoop jar hadoop-examples-<version>.jar \
        grep input output 'dfs[a-z.]+'
- result: lots of logging output, files in output:
    $ ls output/
    _SUCCESS  part-00000

job implementation
- a job is defined in an arbitrary Java class
  - Hadoop will start the main() method
  - the main method configures and runs the actual job
- job configuration
  - name
  - input/output format: special classes providing verification and reading/writing methods for I/O operations
  - output key/value classes: type specifications as Java classes
  - mapper/combiner/reducer
    - mapper and reducer as usual, but as Java classes
    - combiner: analogous to the reducer, but used to combine mapper output before sending it to other nodes
  - input/output paths for file access
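What a combiner saves can be sketched in plain Java, independent of Hadoop. The class `CombinerSketch` and its method are hypothetical names for this illustration; the input list stands for the words that a single mapper would emit as (word, 1) pairs.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
    // Combiner: pre-aggregate the (word, 1) pairs of a single mapper,
    // so that only one pair per distinct word has to leave the node.
    static Map<String, Integer> combine(List<String> emittedWords) {
        Map<String, Integer> local = new TreeMap<>();
        for (String word : emittedWords) {
            local.merge(word, 1, Integer::sum);
        }
        return local;
    }
}
```

For the six emitted words "to be or not to be", the combiner passes on four pairs instead of six; the reducer then sums these partial counts exactly as it would sum the original ones.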

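putting a job together
    // imports at the top of WordCount.java (old org.apache.hadoop.mapred API):
    // org.apache.hadoop.fs.Path, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
    public static void main(final String[] args) throws Exception {
        final JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        // types of the key/value pairs written by the job
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // the reducer doubles as combiner
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // input and output directories come from the command line
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }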

the mapper class
- implements the interface Mapper
  - has type parameters: input key, input value, output key, output value
- class MapReduceBase provides empty implementations for functions
- map function implemented as public void map()
  - key and value as input (according to the type parameters)
  - OutputCollector used to emit key/value pairs
  - Reporter for logging and progress reports
- example: word count

    // additionally needs java.io.IOException and java.util.StringTokenizer
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        // emits (word, 1) for every token of the input line
        public void map(final LongWritable key, final Text value,
                final OutputCollector<Text, IntWritable> output,
                final Reporter reporter) throws IOException {
            final String line = value.toString();
            final StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

the reducer class
- again, type parameters for input and output
- function reduce() for the actual task
- reporting and output collection analogous to the mapper class
- both classes need access to the Hadoop library
  - the hadoop-core jar can be found in the main directory of the distribution

    // additionally needs java.io.IOException and java.util.Iterator
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // sums all counts that were emitted for one word
        public void reduce(final Text key, final Iterator<IntWritable> values,
                final OutputCollector<Text, IntWritable> output,
                final Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
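The data flow through the mapper and reducer above can be tried out without a Hadoop installation. The following sketch only simulates it in memory, under assumptions: the class `WordCountSim` and its methods are hypothetical, plain Java types replace `Text` and `IntWritable`, and a sorted map stands in for the shuffle phase.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class WordCountSim {
    // Map step: emit (word, 1) for every token of one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Reduce step: sum all values that belong to one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Driver: run map on every line, group by key (shuffle), then reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

In the real job, Hadoop performs the grouping between the two steps and may run many mappers and reducers on different nodes; the simulation keeps only the logical order map, shuffle, reduce.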

compile and execute
- before execution, classes have to be compiled and packed into a jar file
- Eclipse export is possible; alternatively, in the distribution directory (replace <version> with the version of the distribution):
    mkdir wordcount_classes
    javac -classpath hadoop-core-<version>.jar -d wordcount_classes \
        WordCount.java
    jar -cvf wordcount.jar -C wordcount_classes .
- without configuration, HDFS reads directly from the local file system:
    bin/hadoop dfs -ls /tmp
    bin/hadoop dfs -cat /tmp/hadoop/input/test.txt

compile and execute
- run with /tmp/hadoop/input as input and /tmp/hadoop/output as output directory
- class WordCount in package my.pack
- here: the jar lies directly in the main directory of the Hadoop distribution; otherwise, provide the full path to the jar
    bin/hadoop jar wordcount.jar my.pack.WordCount \
        /tmp/hadoop/input /tmp/hadoop/output
- TextInputFormat reads all files in the input directory
  - uses the byte offset of each line within its file as key (not the line number), the line as value
- TextOutputFormat writes all key/value pairs as plain text to the output directory
- for more details, cf. hadoop.apache.org/docs/stable/mapred_tutorial.html#Inputs+and+Outputs

using Hadoop with Python
- Hadoop supports the execution of scripts in arbitrary languages
- interface: input and output via the system in/out streams
  - scripts read input from stdin and write output to stdout
    #!/bin/bash
    /opt/hadoop/bin/hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-*streaming*.jar \
        -mapper mapper.py -reducer reducer.py \
        -input pg4300.txt -output pg4300.out
- cf. example from: writing-an-hadoop-mapreduce-program-in-python/
- input/output comes as text/csv files (tab delimited)
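The stream protocol itself is language independent. The sketch below shows it in Java only to stay with the language of the earlier examples; scripts such as mapper.py and reducer.py would follow the same pattern. The class `StreamingSketch` and its methods are hypothetical, and the reducer assumes that its input lines arrive sorted by key.

```java
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.Scanner;

class StreamingSketch {
    // Mapper side: read text lines, write one "word<TAB>1" line per token.
    static void mapper(Reader in, Writer out) {
        Scanner scanner = new Scanner(in);
        PrintWriter writer = new PrintWriter(out);
        while (scanner.hasNextLine()) {
            for (String word : scanner.nextLine().trim().split("\\s+")) {
                if (!word.isEmpty()) writer.print(word + "\t1\n");
            }
        }
        writer.flush();
    }

    // Reducer side: input lines are "word<TAB>count", sorted by word,
    // so equal keys are adjacent and can be summed in a single pass.
    static void reducer(Reader in, Writer out) {
        Scanner scanner = new Scanner(in);
        PrintWriter writer = new PrintWriter(out);
        String current = null;
        int sum = 0;
        while (scanner.hasNextLine()) {
            String[] parts = scanner.nextLine().split("\t", 2);
            if (parts.length < 2) continue; // skip malformed lines
            if (!parts[0].equals(current)) {
                if (current != null) writer.print(current + "\t" + sum + "\n");
                current = parts[0];
                sum = 0;
            }
            sum += Integer.parseInt(parts[1].trim());
        }
        if (current != null) writer.print(current + "\t" + sum + "\n");
        writer.flush();
    }
}
```

In an actual streaming job, both methods would be wired to `System.in` and `System.out` (wrapped in an `InputStreamReader` and an `OutputStreamWriter`), one executable for the mapper and one for the reducer.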


More information

Extreme computing lab exercises Session one

Extreme computing lab exercises Session one Extreme computing lab exercises Session one Michail Basios (m.basios@sms.ed.ac.uk) Stratis Viglas (sviglas@inf.ed.ac.uk) 1 Getting started First you need to access the machine where you will be doing all

More information

Introduction to Map/Reduce

Introduction to Map/Reduce Introduction to Map/Reduce Christoph bases on Yahoo! Hadoop Tutorial (Module 4) http://public.yahoo.com/gogate/hadoop-tutorial/html/module4.html Agenda What is Map/Reduce? The Building Blocks: mapping

More information

Extreme computing lab exercises Session one

Extreme computing lab exercises Session one Extreme computing lab exercises Session one Miles Osborne (original: Sasa Petrovic) October 23, 2012 1 Getting started First you need to access the machine where you will be doing all the work. Do this

More information

Data Science Analytics & Research Centre

Data Science Analytics & Research Centre Data Science Analytics & Research Centre Data Science Analytics & Research Centre 1 Big Data Big Data Overview Characteristics Applications & Use Case HDFS Hadoop Distributed File System (HDFS) Overview

More information

Building a distributed search system with Apache Hadoop and Lucene. Mirko Calvaresi

Building a distributed search system with Apache Hadoop and Lucene. Mirko Calvaresi Building a distributed search system with Apache Hadoop and Lucene Mirko Calvaresi a Barbara, Leonardo e Vittoria 2 Index Preface... 5 1 Introduction: the Big Data Problem... 6 1.1 Big data: handling the

More information

Top 60 Hadoop & MapReduce Interview Question

Top 60 Hadoop & MapReduce Interview Question Top 60 Hadoop & MapReduce Interview Question 1) What is Hadoop Map Reduce? For processing large data sets in parallel across a hadoop cluster, Hadoop MapReduce framework is used. Data analysis uses a two-step

More information

Lab 0 - Introduction to Hadoop/Eclipse/Map/Reduce CSE 490h - Winter 2007

Lab 0 - Introduction to Hadoop/Eclipse/Map/Reduce CSE 490h - Winter 2007 Lab 0 - Introduction to Hadoop/Eclipse/Map/Reduce CSE 490h - Winter 2007 To Do 1. Eclipse plug in introduction. 2. Read this hand out. 3. Get Eclipse set up on your machine. 4. Write the word counter example

More information

Distributed Recommenders. Fall 2010

Distributed Recommenders. Fall 2010 Distributed Recommenders Fall 2010 Distributed Recommenders Distributed Approaches are needed when: Dataset does not fit into memory Need for processing exceeds what can be provided with a sequential algorithm

More information

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013 Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables

More information

Single Node Setup. Table of contents

Single Node Setup. Table of contents Table of contents 1 Purpose... 2 2 Prerequisites...2 2.1 Supported Platforms...2 2.2 Required Software... 2 2.3 Installing Software...2 3 Download...2 4 Prepare to Start the Hadoop Cluster... 3 5 Standalone

More information

Lab 10: NHẬ P MO N ẬPẬCHE HẬDOOP

Lab 10: NHẬ P MO N ẬPẬCHE HẬDOOP Lab 10: NHẬ P MO N ẬPẬCHE HẬDOOP Biên soạn: ThS. Nguyễn Quang Hùng E-mail: hungnq2@cse.hcmut.edu.vn 1. Giới thiệu: Hadoop Map/Reduce là một khung nền (software framework) mã nguồn mở, hỗ trợ người lập

More information

Tutorial- Counting Words in File(s) using MapReduce

Tutorial- Counting Words in File(s) using MapReduce Tutorial- Counting Words in File(s) using MapReduce 1 Overview This document serves as a tutorial to setup and run a simple application in Hadoop MapReduce framework. A job in Hadoop MapReduce usually

More information

HADOOP SDJ INFOSOFT PVT LTD

HADOOP SDJ INFOSOFT PVT LTD HADOOP SDJ INFOSOFT PVT LTD DATA FACT 6/17/2016 SDJ INFOSOFT PVT. LTD www.javapadho.com Big Data Definition Big data is high volume, high velocity and highvariety information assets that demand cost

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

Big Data 2012 Hadoop Tutorial

Big Data 2012 Hadoop Tutorial Big Data 2012 Hadoop Tutorial Oct 19th, 2012 Martin Kaufmann Systems Group, ETH Zürich 1 Contact Exercise Session Friday 14.15 to 15.00 CHN D 46 Your Assistant Martin Kaufmann Office: CAB E 77.2 E-Mail:

More information

TP1: Getting Started with Hadoop

TP1: Getting Started with Hadoop TP1: Getting Started with Hadoop Alexandru Costan MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify development of web

More information

Download and install Download virtual machine Import virtual machine in Virtualbox

Download and install Download virtual machine Import virtual machine in Virtualbox Hadoop/Pig Install Download and install Virtualbox www.virtualbox.org Virtualbox Extension Pack Download virtual machine link in schedule (https://rmacchpcsymposium2015.sched.org/? iframe=no) Import virtual

More information

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex

More information

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 (Using HDFS on Discovery Cluster for Discovery Cluster Users email n.roy@neu.edu if you have questions or need more clarifications. Nilay

More information

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems

and HDFS for Big Data Applications Serge Blazhievsky Nice Systems Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted

More information

Hadoop Installation MapReduce Examples Jake Karnes

Hadoop Installation MapReduce Examples Jake Karnes Big Data Management Hadoop Installation MapReduce Examples Jake Karnes These slides are based on materials / slides from Cloudera.com Amazon.com Prof. P. Zadrozny's Slides Prerequistes You must have an

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Setup Hadoop On Ubuntu Linux. ---Multi-Node Cluster

Setup Hadoop On Ubuntu Linux. ---Multi-Node Cluster Setup Hadoop On Ubuntu Linux ---Multi-Node Cluster We have installed the JDK and Hadoop for you. The JAVA_HOME is /usr/lib/jvm/java/jdk1.6.0_22 The Hadoop home is /home/user/hadoop-0.20.2 1. Network Edit

More information
