Cloud Computing. Lectures 10 and 11 Map Reduce: System Perspective

Size: px
Start display at page:

Download "Cloud Computing. Lectures 10 and 11 Map Reduce: System Perspective 2014-2015"

Transcription

1 Cloud Computing Lectures 10 and 11 Map Reduce: System Perspective

2 MapReduce in More Detail 2

3 Master (i) Execution is controlled by the master process: Input data are split into 64MB blocks. Tasks are assigned to the worker processes dynamically. In the reduction phase there are as many calls to reduce as the output files of the map phase (but only 1 reducer process by default). A typical Google run: Mappers=200,000; Reducers=4,000; Workers=2,000. The master assigns each map task (1 split) to one worker: Worker reads the input ideally from the local disk. And produces R local files with <key,value> pairs. 3

4 Master (ii) Master assigns each reduce task to a reducer: The reducer reads the intermediate files from the mappers. The reducer orders the <key,value> pairs and applies the reduce function. The user may specify partitions to control which values each reducer gets. 4

5 Fault Tolerance Worker faults: Faults are detected by periodic pings. Ready or ongoing tasks that fail are redone. Ongoing reduce tasks that fail are redone. Finished tasks are reported to the master. Master failure: The master state is checkpointed in a distributed file system. When a failed master is restarted, he reads the saved state and continues from that point on. 5

6 Splits Input data is provided to the mappers in splits. A split is a block of input data. Default HDFS are 64MB big (by default). In particular, Hadoop tries to create splits containing data local only to one node. A mapper then runs on the node where the split was created. 6

7 Input Formats Controlled by conf.setinputformat FileInputFormat (default option): 64MB blocks (for files 64MB) per split (and per map). CombineFileInputFormat: groups files on the same node into the same split. KeyValueTextInputFormat NLineInputFormat 7

8 Output Formats Controlled by conf.setoutputformat. TextOutputFormat: one file per reducer (default option). SequenceFileOutputFormat: compressed output. Normally used to feed other MapReduce cycles. MultipleOutputFormat: allows control over the amount and name of output files. It is also possible to output to databases. 8

9 Number of Tasks The number of mappers is determined based on the size of the input data. By default there is only 1 reducer process. Both are configurable: conf.setnummaptasks(int num); conf.setnumreducetasks(int num); 9

10 Partitions Partitioners partition the intermediate data before being submitted to the reducers. By default, the partition is determined by the hashcode function of the intermediate (map output) key class. You can change it by reimplementing the key s hashcode() method or by replacing the partitioner with a new one. 11

11 Example of a Partitioner public class HashPartitioner extends Partitioner<K2, V2> { public void configure(jobconf job) {} public int getpartition(k2 key, V2 value, int numpartitions) { return (key.hashcode() & Integer.MAX_VALUE) % } } numpartitions; 12

12 Combiner They reduce the amount of data at the mappers output by aggregating intermediate results from one node. In the Wordcount example (last lecture), we could aggregate the mappers output before shuffling them into the reducers. By default there is no combiner. Combiners are implemented like a reducer class and are added to the workflow by calling conf.setcombinerclass(); 13

13 Main Hadoop MapReduce Parameters 14

14 Hadoop MapReducer Servers Jobtracker Java class. It s the central manager of a MapReduce job. Tasktracker Java class. It s the process responsible for starting local map and reduce workers. Periodically notifies the jobtracker that he s still alive. 15

15 Execution of MapReduce Programs 16

16 Worker Faults Situations where the workers notify the jobtracker: Exceptions; Abrupt termination of the JVM. A worker is failed if it spends longer than mapred.task.timeout without reporting to the jobtracker. Map output files are in the local file system and not HDFS. It is not efficient to replicate output data (mappers can be repeated). Failed workers are retried 4 times on different nodes. It s important for mappers and reducer to be able to deal with corrupted data. After the 2 nd retry, they change to skipping mode: All input records that cause errors are skipped. 17

17 Server Faults If the tasktracker fails or is too slow reporting to the jobtracker, its jobs are restarted at other nodes. If the jobtracker fails, all work is lost. It s the main Hadoop problem. 18

18 Scheduling Hadoop MapReduce scheduling has evolved to ensure greater evenness: Originally, Hadoop scheduled MapReduce jobs in FIFO order. Later, FIFO with priorities (VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW) without preemption. Then, the scheduler was transformed into a plugin. Example: FairScheduler (Facebook) Jobs are grouped (pool; by Unix use, Unix group, labels). Each group reserves a minimum amount of map and reduce slots with a maximum amount of running jobs. Free slots are distributed evenly between all groups. 19

19 hadoop-site.xml <property> <name>mapred.jobtracker.taskscheduler</name> <value>org.apache.hadoop.mapred.fairscheduler</val ue> </property> <property> <name>mapred.fairscheduler.allocation.file</name> <value>/path/to/pools.xml</value> </property> 20

20 pools.xml <?xml version="1.0"?> <allocations> <pool name="ads"> <minmaps>10</minmaps> <minreduces>5</minreduces> <maxrunningjobs>3</maxrunningjobs> </pool> <user name= smith"> <maxrunningjobs>1</maxrunningjobs> </user> <usermaxjobsdefault>10</usermaxjobsdefault> </allocations> 21

21 Scheduling (2) CapacityScheduler (Yahoo): Keeps a set of queues. One queue by user, group, organization. Fairness is guaranteed between queues like in the Facebook FairScheduler: Slot reservations + Even distribution of the excess slots. Within each queue, scheduling is FIFO with priorities and with a minimum quantum before preemption. 22

22 A MapReduce Graph Application Performing calculation over a graph requires processing each node. Each node has specific information as well as arcs to other nodes. The calculation must go through the graph and run the computational step. How can we traverse the graph using MapReduce? How can we represent the graph in MapReduce? 23

23 Dijkstra Algorithm: Shortest Path to a node in a graph 1 function Dijkstra(Graph, source): 2 for each vertex v in Graph: 3 dist[v] := infinity 4 previous[v] := undefined 5 dist[source] := 0 6 Q := the set of all nodes in Graph 7 while Q is not empty: 8 u := vertex in Q with smallest dist[] 9 if dist[u] = infinity: /* only unreachable nodes left */ 10 break 11 remove u from Q /* u is processed */ 12 for each neighbor v of u: 13 alt := dist[u] + dist_between(u, v) 14 if alt < dist[v]: 15 dist[v] := alt 16 previous[v] := u 17 return dist[] 24

24 Breadth First Search Breadth first search is an iterative search algorithm within a graph. On each step, the boundary moves one step further with respect to the origin. 25

25 BFS & MapReduce Problem: There is a mismatch between BFS and MapReduce. Solution: Iterative traversal using MapReduce map some nodes, change the border and run MapReduce again. 26

26 BFS & MapReduce Problem: It s too expensive to send the whole graph to all of map tasks. Solution: Create a new graph representation. 27

27 Graph Representation The most straightforward representation is describing each node by the connections to its neighbors. 28

28 Adjacency Matrix Another classic graph representation: M[i][j]= 1 means that there is an arc from i to j

29 Direct References Iterating through the graph requires a list of all the graph. This solution requires that the graph be shared and therefore synchronized! More complex structures are more difficult to serialize. class Node { Object data; Vector<GraphNode> arcs; Node next; } 30

30 Adjacency Matrix: Sparse Representation An adjacency matrix for a large graph (e.g. the web) will be full mostly with zeros. Each line will be very long. A sparse representation is needed that contains only non-null elements. 31

31 Sparse Representation 1: (3, 1), (18, 1), (200, 1) 2: (6, 1), (12, 1), (80, 1), (400, 1) 3: (1, 1), (14, 1) 32

32 Shortest Path: Intuition Let s define the solution by induction, assuming only distances of 1: DistanceTo(InitialNode) = 0 For all nodes n reachable from the InitialNode DistanceTo(n) = 1. For all other nodes m reachable from a set of nodes S, DistanceTo(m) = 1 + min(distanceto(n), n S) 33

33 From the Intuition to the Algorithm A map worker gets a node n as the key and (D, points) as the value: D is the distance from n to the origin. Points is the list of nodes that are reachable from n. p points, generate <p, (D+1,?)> The reduce tasks aggregates all distance to a given point p and selects the shortest one. 34

34 Which leads to This MapReduce step advances the search border one step. Yet to do all the BFS, MapReduce must be fed with the whole graph. Where is the list of points gone? Map workers must output the arcs <n, (?, points)> again. This algorithm is less efficient than Dijkstra s but much more scalable. 35

35 With Weights Most interesting graphs have arcs with values (different than 1). The algorithm is still valid but the point list must include the weight of the arcs between them. Mappers output (p, D+weight p ) instead of (p, D+1) for each node p. 36

36 Conclusions MapReduce makes us think differently. But allows is possible to do quite sophisticated parallelizations. It is fundamental to think about the amount of transmitted information. 37

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated

More information

Hadoop Design and k-means Clustering

Hadoop Design and k-means Clustering Hadoop Design and k-means Clustering Kenneth Heafield Google Inc January 15, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise

More information

Lecture 3 Hadoop Technical Introduction CSE 490H

Lecture 3 Hadoop Technical Introduction CSE 490H Lecture 3 Hadoop Technical Introduction CSE 490H Announcements My office hours: M 2:30 3:30 in CSE 212 Cluster is operational; instructions in assignment 1 heavily rewritten Eclipse plugin is deprecated

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing. Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,

More information

Hadoop Parallel Data Processing

Hadoop Parallel Data Processing MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

More information

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Lecture 5 Programming Hadoop I Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Outline MapReduce basics A closer look at WordCount MR Anatomy of MapReduce

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Data-intensive computing systems

Data-intensive computing systems Data-intensive computing systems Hadoop Universtity of Verona Computer Science Department Damiano Carra Acknowledgements! Credits Part of the course material is based on slides provided by the following

More information

The Internet: A Remarkable Story. Inside the Net: A Different Story. Networks are Hard to Manage. Software Defined Networking Concepts

The Internet: A Remarkable Story. Inside the Net: A Different Story. Networks are Hard to Manage. Software Defined Networking Concepts The Internet: A Remarkable Story Software Defined Networking Concepts Based on the materials from Jennifer Rexford (Princeton) and Nick McKeown(Stanford) Tremendous success From research experiment to

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Programming with Hadoop. 2009 Cloudera, Inc.

Programming with Hadoop. 2009 Cloudera, Inc. Programming with Hadoop Overview How to use Hadoop Hadoop MapReduce Hadoop Streaming Some MapReduce Terminology Job A full program - an execution of a Mapper and Reducer across a data set Task An execution

More information

Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk

Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless

More information

University of Maryland. Tuesday, February 2, 2010

University of Maryland. Tuesday, February 2, 2010 Data-Intensive Information Processing Applications Session #2 Hadoop: Nuts and Bolts Jimmy Lin University of Maryland Tuesday, February 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

How MapReduce Works 資碩一 戴睿宸

How MapReduce Works 資碩一 戴睿宸 How MapReduce Works MapReduce Entities four independent entities: The client The jobtracker The tasktrackers The distributed filesystem Steps 1. Asks the jobtracker for a new job ID 2. Checks the output

More information

Hadoop SNS. renren.com. Saturday, December 3, 11

Hadoop SNS. renren.com. Saturday, December 3, 11 Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

6. How MapReduce Works. Jari-Pekka Voutilainen

6. How MapReduce Works. Jari-Pekka Voutilainen 6. How MapReduce Works Jari-Pekka Voutilainen MapReduce Implementations Apache Hadoop has 2 implementations of MapReduce: Classic MapReduce (MapReduce 1) YARN (MapReduce 2) Classic MapReduce The Client

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing MapReduce and Hadoop 15 319, spring 2010 17 th Lecture, Mar 16 th Majd F. Sakr Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

Hadoop Scheduler w i t h Deadline Constraint

Hadoop Scheduler w i t h Deadline Constraint Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

Task Scheduling in Hadoop

Task Scheduling in Hadoop Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

More information

Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Processing Data with Map Reduce

Processing Data with Map Reduce Processing Data with Map Reduce Allahbaksh Mohammedali Asadullah Infosys Labs, Infosys Technologies 1 Content Map Function Reduce Function Why Hadoop HDFS Map Reduce Hadoop Some Questions 2 What is Map

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

Getting to know Apache Hadoop

Getting to know Apache Hadoop Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the

More information

PASS4TEST. IT Certification Guaranteed, The Easy Way! We offer free update service for one year

PASS4TEST. IT Certification Guaranteed, The Easy Way!  We offer free update service for one year PASS4TEST IT Certification Guaranteed, The Easy Way! \ http://www.pass4test.com We offer free update service for one year Exam : CCD-410 Title : Cloudera Certified Developer for Apache Hadoop (CCDH) Vendor

More information

Hadoop Fair Scheduler Design Document

Hadoop Fair Scheduler Design Document Hadoop Fair Scheduler Design Document October 18, 2010 Contents 1 Introduction 2 2 Fair Scheduler Goals 2 3 Scheduler Features 2 3.1 Pools........................................ 2 3.2 Minimum Shares.................................

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process

More information

CS 378 Big Data Programming. Lecture 5 Summariza9on Pa:erns

CS 378 Big Data Programming. Lecture 5 Summariza9on Pa:erns CS 378 Big Data Programming Lecture 5 Summariza9on Pa:erns Review Assignment 2 Ques9ons? If you d like to use guava (Google collec9ons classes) pom.xml available for assignment 2 Includes dependency for

More information

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

Hadoop. History and Introduction. Explained By Vaibhav Agarwal Hadoop History and Introduction Explained By Vaibhav Agarwal Agenda Architecture HDFS Data Flow Map Reduce Data Flow Hadoop Versions History Hadoop version 2 Hadoop Architecture HADOOP (HDFS) Data Flow

More information

Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış

Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış Istanbul Şehir University Big Data Camp 14 Hadoop Map Reduce Aslan Bakirov Kevser Nur Çoğalmış Agenda Map Reduce Concepts System Overview Hadoop MR Hadoop MR Internal Job Execution Workflow Map Side Details

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Fair Scheduler. Table of contents

Fair Scheduler. Table of contents Table of contents 1 Purpose... 2 2 Introduction... 2 3 Installation... 3 4 Configuration...3 4.1 Scheduler Parameters in mapred-site.xml...4 4.2 Allocation File (fair-scheduler.xml)... 6 4.3 Access Control

More information

Hadoop in Action. Justin Quan March 15, 2011

Hadoop in Action. Justin Quan March 15, 2011 Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop

More information

MapReduce Job Processing

MapReduce Job Processing April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Load Balancing and Termination Detection

Load Balancing and Termination Detection Chapter 7 slides7-1 Load Balancing and Termination Detection slides7-2 Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT

More information

Extending Hadoop beyond MapReduce

Extending Hadoop beyond MapReduce Extending Hadoop beyond MapReduce Mahadev Konar Co-Founder @mahadevkonar (@hortonworks) Page 1 Bio Apache Hadoop since 2006 - committer and PMC member Developed and supported Map Reduce @Yahoo! - Core

More information

Cloudera Certified Developer for Apache Hadoop

Cloudera Certified Developer for Apache Hadoop Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Social Media Mining. Graph Essentials

Social Media Mining. Graph Essentials Graph Essentials Graph Basics Measures Graph and Essentials Metrics 2 2 Nodes and Edges A network is a graph nodes, actors, or vertices (plural of vertex) Connections, edges or ties Edge Node Measures

More information

Load balancing Static Load Balancing

Load balancing Static Load Balancing Chapter 7 Load Balancing and Termination Detection Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination detection

More information

MAPREDUCE Programming Model

MAPREDUCE Programming Model CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source

More information

Load Balancing and Termination Detection

Load Balancing and Termination Detection Chapter 7 Load Balancing and Termination Detection 1 Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination detection

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

More information

A bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18

A bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 A bit about Hadoop Luca Pireddu CRS4Distributed Computing Group March 9, 2012 luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 Often seen problems Often seen problems Low parallelism I/O is

More information

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current

More information

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012 Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012 1 Market Trends Big Data Growing technology deployments are creating an exponential increase in the volume

More information

Distributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez (200950006) Sameer Kumar (200950031)

Distributed Image Processing using Hadoop MapReduce framework. Binoy A Fernandez (200950006) Sameer Kumar (200950031) using Hadoop MapReduce framework Binoy A Fernandez (200950006) Sameer Kumar (200950031) Objective To demonstrate how the hadoop mapreduce framework can be extended to work with image data for distributed

More information

MapReduce. Introduction and Hadoop Overview. 13 June 2012. Lab Course: Databases & Cloud Computing SS 2012

MapReduce. Introduction and Hadoop Overview. 13 June 2012. Lab Course: Databases & Cloud Computing SS 2012 13 June 2012 MapReduce Introduction and Hadoop Overview Lab Course: Databases & Cloud Computing SS 2012 Martin Przyjaciel-Zablocki Alexander Schätzle Georg Lausen University of Freiburg Databases & Information

More information

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Storage, Management and challenges. Ahmed Ali-Eldin Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big

More information

Evaluating partitioning of big graphs

Evaluating partitioning of big graphs Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist fhallb@kth.se, candef@kth.se, mickeso@kth.se Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

Jordan Boyd-Graber University of Maryland. Tuesday, February 10, 2011

Jordan Boyd-Graber University of Maryland. Tuesday, February 10, 2011 Data-Intensive Information Processing Applications! Session #2 Hadoop: Nuts and Bolts Jordan Boyd-Graber University of Maryland Tuesday, February 10, 2011 This work is licensed under a Creative Commons

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

BIG DATA Giraph. Felipe Caicedo December-2012. Cloud Computing & Big Data. FIB-UPC Master MEI

BIG DATA Giraph. Felipe Caicedo December-2012. Cloud Computing & Big Data. FIB-UPC Master MEI BIG DATA Giraph Cloud Computing & Big Data Felipe Caicedo December-2012 FIB-UPC Master MEI Content What is Apache Giraph? Motivation Existing solutions Features How it works Components and responsibilities

More information

Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200

Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200 Hadoop Learning Resources 1 Hadoop Certification (Developer, Administrator HBase & Data Science) CCD-410, CCA-410 and CCB-400 and DS-200 Author: Hadoop Learning Resource Hadoop Training in Just $60/3000INR

More information

L1: Introduction to Hadoop

L1: Introduction to Hadoop L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General

More information

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Sabeur Aridhi Aalto University, Finland Sabeur Aridhi Frameworks for Big Data Analytics 1 / 59 Introduction Contents 1 Introduction

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Data Science in the Wild

Data Science in the Wild Data Science in the Wild Lecture 3 Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 Data Science and Big Data Big Data: the data cannot

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

MapReduce Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

MapReduce Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Google Field Trip Map Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Distributed Indexing - Architecture

More information

Google Bing Daytona Microsoft Research

Google Bing Daytona Microsoft Research Google Bing Daytona Microsoft Research Raise your hand Great, you can help answer questions ;-) Sit with these people during lunch... An increased number and variety of data sources that generate large

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Working With Hadoop. Important Terminology. Important Terminology. Anatomy of MapReduce Job Run. Important Terminology

Working With Hadoop. Important Terminology. Important Terminology. Anatomy of MapReduce Job Run. Important Terminology Working With Hadoop Now that we covered the basics of MapReduce, let s look at some Hadoop specifics. Mostly based on Tom White s book Hadoop: The Definitive Guide, 3 rd edition Note: We will use the new

More information

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations

More information

Outline. What is Big Data? Hadoop HDFS MapReduce

Outline. What is Big Data? Hadoop HDFS MapReduce Intro To Hadoop Outline What is Big Data? Hadoop HDFS MapReduce 2 What is big data? A bunch of data? An industry? An expertise? A trend? A cliche? 3 Wikipedia big data In information technology, big data

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Map Reduce / Hadoop / HDFS

Map Reduce / Hadoop / HDFS Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview

More information

H2O on Hadoop. September 30, 2014. www.0xdata.com

H2O on Hadoop. September 30, 2014. www.0xdata.com H2O on Hadoop September 30, 2014 www.0xdata.com H2O on Hadoop Introduction H2O is the open source math & machine learning engine for big data that brings distribution and parallelism to powerful algorithms

More information

Job Oriented Instructor Led Face2Face True Live Online I.T. Training for Everyone Worldwide

Job Oriented Instructor Led Face2Face True Live Online I.T. Training for Everyone Worldwide H2kInfosys H2K Infosys provides online IT training and placement services worldwide. www.h2kinfosys.com USA- +1-(770)-777-1269, UK (020) 33717615 Training@H2KINFOSYS.com / H2KInfosys@gmail.com DISCLAIMER

More information

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Load Balancing and Termination Detection

Load Balancing and Termination Detection Chapter 7 Slide 1 Slide 2 Load Balancing and Termination Detection Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13 How to properly misuse Hadoop Marcel Huntemann NERSC tutorial session 2/12/13 History Created by Doug Cutting (also creator of Apache Lucene). 2002 Origin in Apache Nutch (open source web search engine).

More information

International Journal of Innovative Research in Computer and Communication Engineering

International Journal of Innovative Research in Computer and Communication Engineering Optimized Mapreduce across Datacenter s Using Dache: a Data Aware Caching for Big- Data Applications S.Venkata Krishna Kumar, D.Veera Bhuvaneswari Associate Professor, Dept. of Computer Science, PSG College

More information

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information