Cloud Computing. Lectures 10 and 11. MapReduce: System Perspective
1 Cloud Computing Lectures 10 and 11 MapReduce: System Perspective
2 MapReduce in More Detail
3 Master (i)
Execution is controlled by the master process:
Input data are split into 64MB blocks.
Tasks are assigned to the worker processes dynamically.
In the reduce phase there are as many reduce tasks as there are output files (but only 1 reducer process by default).
A typical Google run: mappers = 200,000; reducers = 4,000; workers = 2,000.
The master assigns each map task (one split) to a worker:
The worker reads the input, ideally from its local disk, and produces R local files with <key,value> pairs, one per reduce task.
4 Master (ii)
The master assigns each reduce task to a reducer:
The reducer reads the intermediate files from the mappers.
The reducer sorts the <key,value> pairs by key and applies the reduce function.
The user may specify a partition function to control which keys each reducer gets.
5 Fault Tolerance
Worker faults:
Faults are detected by periodic pings.
Idle or in-progress tasks on a failed worker are re-executed. Finished map tasks are also redone, since their output sits on the failed node's local disk; finished reduce tasks are not, since their output is already in the distributed file system.
Re-executed tasks are reported to the master.
Master failure:
The master state is checkpointed in a distributed file system.
When a failed master is restarted, it reads the saved state and continues from that point on.
6 Splits
Input data is provided to the mappers in splits.
A split is a block of input data; HDFS blocks are 64MB by default.
In particular, Hadoop tries to create splits whose data is local to a single node.
A mapper then runs on the node where its split's data resides. The relevant configuration properties are shown below.
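Block and split sizes are ordinary configuration properties; a sketch using the classic (pre-0.20) property names assumed throughout these slides, with the default values:

<property>
  <name>dfs.block.size</name>
  <value>67108864</value> <!-- HDFS block size: 64MB -->
</property>
<property>
  <name>mapred.min.split.size</name>
  <value>0</value> <!-- raise this to force splits larger than one block -->
</property>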
7 Input Formats
Controlled by conf.setInputFormat().
FileInputFormat (default option): one split (and one map task) per 64MB block, for files larger than 64MB.
CombineFileInputFormat: groups small files on the same node into the same split.
KeyValueTextInputFormat
NLineInputFormat
8 Output Formats
Controlled by conf.setOutputFormat().
TextOutputFormat: one text file per reducer (default option).
SequenceFileOutputFormat: compressed binary output, normally used to feed further MapReduce cycles.
MultipleOutputFormat: allows control over the number and names of output files.
It is also possible to output to databases.
9 Number of Tasks
The number of mappers is determined by the size of the input data.
By default there is only 1 reducer process.
Both are configurable:
conf.setNumMapTasks(int num);
conf.setNumReduceTasks(int num);
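Putting slides 7-9 together, a minimal job setup might look as follows. This is a sketch against the classic org.apache.hadoop.mapred API used in these slides; the identity mapper/reducer and the command-line paths stand in for a real job's classes:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class JobSetup {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(JobSetup.class);
    conf.setJobName("format-demo");

    // Input/output formats (slides 7 and 8).
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Task counts (slide 9). The map count is only a hint:
    // the actual number follows from the input splits.
    conf.setNumMapTasks(10);
    conf.setNumReduceTasks(2);

    // TextInputFormat produces <LongWritable offset, Text line> pairs,
    // which the identity mapper/reducer pass through unchanged.
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}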
10 Partitions
Partitioners partition the intermediate data before it is submitted to the reducers.
By default, the partition is determined by the hashCode() of the intermediate (map output) key class.
You can change it by reimplementing the key's hashCode() method or by replacing the partitioner with a new one.
11 Example of a Partitioner
public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {
  public void configure(JobConf job) {}
  public int getPartition(K2 key, V2 value, int numPartitions) {
    // Mask off the sign bit so the result is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
12 Combiner
Combiners reduce the amount of data at the mappers' output by aggregating intermediate results on one node.
In the WordCount example (last lecture), we could aggregate each mapper's output before shuffling it to the reducers; see the sketch below.
By default there is no combiner.
Combiners are implemented like a reducer class and are added to the workflow by calling conf.setCombinerClass();
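As a sketch (assuming the WordCount job from the last lecture, with Text words and IntWritable counts), the same summing reducer class can serve as the combiner:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Sum the partial counts for this word.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

// Registered both as combiner and as reducer:
// conf.setCombinerClass(SumReducer.class);
// conf.setReducerClass(SumReducer.class);

Using the reducer as a combiner is safe here because summing is associative and commutative; not every reduce function has that property.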
13 Main Hadoop MapReduce Parameters
14 Hadoop MapReduce Servers
JobTracker: Java class. It is the central manager of a MapReduce job.
TaskTracker: Java class. It is the process responsible for starting local map and reduce workers. It periodically notifies the JobTracker that it is still alive.
15 Execution of MapReduce Programs
16 Worker Faults
Situations where the workers notify the JobTracker: exceptions; abrupt termination of the JVM.
A worker is considered failed if it spends longer than mapred.task.timeout without reporting to the JobTracker.
Map output files are kept in the local file system, not in HDFS: it is not efficient to replicate intermediate data, since mappers can simply be re-run.
Failed tasks are retried up to 4 times, on different nodes.
It is important for mappers and reducers to be able to deal with corrupted data. After the 2nd retry, they change to skipping mode: all input records that cause errors are skipped.
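Both the timeout and the retry count are ordinary configuration properties; a sketch with their default values:

<property>
  <name>mapred.task.timeout</name>
  <value>600000</value> <!-- milliseconds without progress before a task is declared failed -->
</property>
<property>
  <name>mapred.map.max.attempts</name>
  <value>4</value> <!-- attempts per map task before the job is failed -->
</property>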
17 Server Faults
If a TaskTracker fails or is too slow reporting to the JobTracker, its tasks are restarted on other nodes.
If the JobTracker fails, all work is lost. This single point of failure is the main Hadoop problem.
18 Scheduling
Hadoop MapReduce scheduling has evolved to ensure greater fairness:
Originally, Hadoop scheduled MapReduce jobs in FIFO order.
Later, FIFO with priorities (VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW), without preemption.
Then the scheduler was turned into a plugin. Example: FairScheduler (Facebook):
Jobs are grouped into pools (by Unix user, Unix group, or labels).
Each pool reserves a minimum number of map and reduce slots and has a maximum number of running jobs.
Free slots are distributed evenly between all pools.
19 hadoop-site.xml
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/path/to/pools.xml</value>
</property>
20 pools.xml
<?xml version="1.0"?>
<allocations>
  <pool name="ads">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxRunningJobs>3</maxRunningJobs>
  </pool>
  <user name="smith">
    <maxRunningJobs>1</maxRunningJobs>
  </user>
  <userMaxJobsDefault>10</userMaxJobsDefault>
</allocations>
21 Scheduling (2)
CapacityScheduler (Yahoo!):
Keeps a set of queues, one per user, group, or organization.
Fairness is guaranteed between queues as in the Facebook FairScheduler: slot reservations plus even distribution of the excess slots.
Within each queue, scheduling is FIFO with priorities and with a minimum quantum before preemption.
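A sketch of the corresponding configuration, mirroring the FairScheduler example above (the queue names and capacity percentages are illustrative):

<!-- hadoop-site.xml -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
  <name>mapred.queue.names</name>
  <value>default,ads</value>
</property>

<!-- capacity-scheduler.xml -->
<property>
  <name>mapred.capacity-scheduler.queue.default.capacity</name>
  <value>70</value> <!-- percent of cluster slots reserved for this queue -->
</property>
<property>
  <name>mapred.capacity-scheduler.queue.ads.capacity</name>
  <value>30</value>
</property>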
22 A MapReduce Graph Application
Performing a calculation over a graph requires processing each node.
Each node holds specific information as well as arcs to other nodes.
The calculation must walk through the graph and run a computational step at each node.
How can we traverse the graph using MapReduce?
How can we represent the graph in MapReduce?
23 Dijkstra's Algorithm: Shortest Path to a Node in a Graph
function Dijkstra(Graph, source):
    for each vertex v in Graph:
        dist[v] := infinity
        previous[v] := undefined
    dist[source] := 0
    Q := the set of all nodes in Graph
    while Q is not empty:
        u := vertex in Q with smallest dist[]
        if dist[u] = infinity:
            break          /* only unreachable nodes left */
        remove u from Q    /* u is processed */
        for each neighbor v of u:
            alt := dist[u] + dist_between(u, v)
            if alt < dist[v]:
                dist[v] := alt
                previous[v] := u
    return dist[]
24 Breadth-First Search
Breadth-first search is an iterative search algorithm over a graph.
On each step, the frontier moves one hop further from the origin.
25 BFS & MapReduce
Problem: there is a mismatch between BFS and MapReduce.
Solution: iterative traversal using MapReduce: map the frontier nodes, advance the frontier, and run MapReduce again.
26 BFS & MapReduce
Problem: it is too expensive to send the whole graph to every map task.
Solution: create a new graph representation.
27 Graph Representation
The most straightforward representation describes each node by the connections to its neighbors (an adjacency list).
28 Adjacency Matrix
Another classic graph representation: M[i][j] = 1 means that there is an arc from i to j.
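For example, a small made-up graph with arcs 1→2, 1→3, and 2→3 would be represented as:

      1  2  3
  1 [ 0  1  1 ]
  2 [ 0  0  1 ]
  3 [ 0  0  0 ]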
29 Direct References
Iterating through the graph requires a reference to the whole node list.
This solution requires that the graph be shared, and therefore synchronized!
More complex structures are also harder to serialize.

class Node {
    Object data;
    Vector<Node> arcs;  // direct references to neighbor nodes
    Node next;
}
30 Adjacency Matrix: Sparse Representation
An adjacency matrix for a large graph (e.g. the web) will be filled mostly with zeros, and each row will be very long.
A sparse representation is needed that stores only the non-zero elements.
31 Sparse Representation
Each line lists a node followed by its (neighbor, weight) pairs:
1: (3, 1), (18, 1), (200, 1)
2: (6, 1), (12, 1), (80, 1), (400, 1)
3: (1, 1), (14, 1)
32 Shortest Path: Intuition
Let's define the solution by induction, assuming all arcs have distance 1:
DistanceTo(InitialNode) = 0.
For all nodes n directly reachable from InitialNode, DistanceTo(n) = 1.
For all other nodes m reachable from a set of predecessor nodes S, DistanceTo(m) = 1 + min(DistanceTo(n) : n in S).
33 From the Intuition to the Algorithm
A map worker gets a node n as the key and (D, points) as the value:
D is the distance from the origin to n.
points is the list of nodes that are reachable from n.
For each p in points, it emits <p, (D+1, ?)>.
The reduce task aggregates all distances proposed for a given point p and selects the shortest one.
34 Which Leads To...
This MapReduce step advances the search frontier one hop. To do the full BFS, MapReduce must be fed the whole graph again on each iteration.
Where has the list of points gone? Map workers must output the arcs <n, (?, points)> again, so that the graph structure survives to the next iteration; a sketch follows.
This algorithm is less efficient than Dijkstra's, but much more scalable.
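A sketch of one such iteration in the classic mapred API. The textual record format "node<TAB>distance|neighbor,neighbor,..." is an assumption of this sketch (the slides do not fix an encoding); unreached nodes carry the distance "inf":

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class BfsIteration {

  public static class BfsMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter rep) throws IOException {
      String[] nodeAndValue = line.toString().split("\t");
      String node = nodeAndValue[0];
      String[] distAndArcs = nodeAndValue[1].split("\\|", -1);
      String dist = distAndArcs[0];
      String arcs = distAndArcs.length > 1 ? distAndArcs[1] : "";

      // Re-emit the graph structure <n, (?, points)> so it survives the iteration.
      out.collect(new Text(node), new Text("?|" + arcs));

      // Propagate the frontier: each neighbor is one hop further away.
      if (!dist.equals("inf")) {
        out.collect(new Text(node), new Text(dist + "|"));
        int d = Integer.parseInt(dist);
        for (String p : arcs.split(",")) {
          if (!p.isEmpty()) {
            out.collect(new Text(p), new Text((d + 1) + "|"));
          }
        }
      }
    }
  }

  public static class BfsReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text node, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter rep) throws IOException {
      int best = Integer.MAX_VALUE;
      String arcs = "";
      while (values.hasNext()) {
        String[] distAndArcs = values.next().toString().split("\\|", -1);
        if (!distAndArcs[0].equals("?")) {
          // A distance proposal: keep the minimum.
          best = Math.min(best, Integer.parseInt(distAndArcs[0]));
        }
        if (distAndArcs.length > 1 && !distAndArcs[1].isEmpty()) {
          arcs = distAndArcs[1]; // recover the adjacency list
        }
      }
      String dist = (best == Integer.MAX_VALUE) ? "inf" : Integer.toString(best);
      out.collect(node, new Text(dist + "|" + arcs));
    }
  }
}

The output format matches the input format, so the job can be run repeatedly (once per BFS hop) until the distances stop changing.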
35 With Weights
Most interesting graphs have weighted arcs (weights other than 1).
The algorithm is still valid, but the adjacency list must include the weight of the arc to each neighbor.
Mappers then output (p, D + weight(n, p)) instead of (p, D + 1) for each neighbor p of n.
36 Conclusions
MapReduce makes us think differently.
But it makes quite sophisticated parallelizations possible.
It is fundamental to think about the amount of transmitted information.