MAPREDUCE Programming Model

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "MAPREDUCE Programming Model"

Transcription

1 CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce Framework Map Reduce Data Flow MapReduce Execution Model MapReduce Jobtrackers and Tasktrackers Input Data Splits Scheduling and Synchronization Speculative Execution Partitioners and Combiners Scaling Data Intensive Application Example WordCount Program I I can not do everything, but still I can do something; and because I cannot do everything, I will not refuse to do something I can do Word Count And 1 Because 1 But 1 Can 4 Do 4 Everything 2 I 5 Not 3 Refuse 1 Something 2 Still 1 To 1 Will 1 Define WordCount as Multiset; For Each Document in DocumentSet { T = tokenize(document); For Each Token in T { WordCount[token]++; } } Display(WordCount); Program Does NOT Scale for Large Number of Documents 1

2 WordCount Program II A two-phased program can be used to speed up execution by distributing the work over several machines and combining the outcome from each machine into the final word count Phase I Document Processing Each machine will process a fraction of the document set Phase II Count Aggregation Partial word counts from individual machines are combined into the final word count WordCount Program II Phase I Define WordCount as Multiset; For Each Document in DocumentSubset { T = tokenize(document); For Each Token in T { WordCount[token]++; } } SendToSecondPhase( wordcount); Phase II Define TotalWordCount as Multiset; For each WordCount Received from Phase I { MultisetAdd (TotalWordCount, WordCount); } WordCount Program II Limitations The program does not take into consideration the location of the documents Storage server can become a bottleneck, if enough bandwidth is not available Distribution of documents across multiple machines removes the central server bottleneck Storing WordCount and TotalWordCount in the memory is a flaw When processing large document sets, the number of unique words can exceed the RAM capacity In Phase II, the aggregation machine becomes the bottleneck WordCount Program Solution The aggregation phase must execute in a distributed fashion on a cluster of machines that can run independently To achieve this, functionalities must be added Store files over a cluster of processing machines. Design a disk-based hash table permitting processing without being limited by RAM capacity. Partition intermediate data across multiple machines Shuffle the partitions to the appropriate machines 2

3 Divide and Conquer Work Partition W 1 W 2 W 3 How to Scale Up? Worker Worker Worker R 1 R 2 R 3 Final Output Combine Parallelization Challenges How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers fail? Parallelization Challenges Parallelization problems arise from in several ways Communication between workers, asynchronously Workers need to exchange information about their states Concurrent access to shared resources, while preserving state consistency Workers need to manipulate data, concurrently Cooperation requires synchronization and interprocess communication mechanisms 3

4 Distributed Workers Coordination Coordinating a large number of workers in a distributed environment is challenging The order in which workers run may be unknown The order in which workers interrupt each other may be unknown The order in which workers access shared data may be unknown Failures further compound the problem! Classic Models Computational Models Master and Slaves Producers and Consumers Readers and Writers IPC Models Shared Memory Threads Message Passing To ensure correct execution several mechanisms are needed Semaphores (lock, unlock), Conditional variables (wait, notify, broadcast), Barriers, Address deadlock, livelock, race conditions,... Makes it difficult to debug parallel execution on clusters of distribute processors MapReduce Data-Intensive Programming Model MapReduce is a programming model for processing large sets Users specify the computation in terms of a map() and a reduce() function, Underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, and Underlying system also handles machine failures, efficient communications, and performance issues. MapReduce is inspired by the map() and fold() functions commonly used in functional programming Typical Large-Data Problem At a high-level of abstraction, MapReduce codifies a generic recipe for processing large data set Iterate over a large number of records Extract something of interest from each Map Shuffle and sort intermediate results Aggregate intermediate results Reduce Generate final output Basic Tenet of MapReduce is Enabling a Functional Abstraction for the Map() and Reduce() operations (Dean and Ghemawat, OSDI 2004) 4

5 MAPREDUCE Functional Programming Paradigm Imperative Languages and Functional Languages The design of the imperative languages is based directly on the von Neumann architecture Efficiency is the primary concern, rather than the suitability of the language for software development The design of the functional languages is based on mathematical functions A solid theoretical basis that is also closer to the user, but relatively unconcerned with the architecture of the machines on which programs will run Fundamentals of Functional Programming Languages The basic process of computation is fundamentally different in a FPL than in an imperative language In an imperative language, operations are done and the results are stored in variables for later use Management of variables is a constant concern and source of complexity for imperative programming FPL takes a mathematical approach to the concept of a variable Variables are bound to values, not memory locations A variable s value cannot change, which eliminates assignment as a possible operation Characteristics of Pure FPLs Pure FP languages tend to Have no side-effects Have no assignment statements Often have no variables! Be built on a small, concise framework Have a simple, uniform syntax Be implemented via interpreters rather than compilers Be mathematically easier to handle 5

6 Importance of FP FPLs encourage thinking at higher levels of abstraction It enables programmers to work in units larger than statements of conventional languages FPLs provide a paradigm for parallel computing Absence of assignment provide basis for independence of evaluation order Ability to operate on entire data structures FPL and IPL Example Summing the integers 1 to 10 in IPL The computation method is variable assignment total = 0; for (i = 1; i 10; ++i) total = total+i; Summing the integers 1 to 10 in FPL The computation method is function application. sum [1..10] 22 Lambda Calculus The lambda calculus is a formal mathematical system to investigate functions, function application and recursion. A lambda expression specifies the parameter(s) and the mapping of a function in the following form λ x. x * x * x for the function cube (x) = x * x * x Lambda expressions describe nameless functions Lambda expressions are applied to parameter(s) by placing the parameter(s) after the expression (λ x. x * x * x) 3 => 3*3*3 => 27 (λ x,y. (x-y)*(y-x)) (3,5) => (3-5)*(5-3) => -4 MAPREDUCE Functional Programming 6

7 FPL Map and Fold map and fold FPL higher-order functions (map f list1 [list2 list3 ]) (map square ( )) ( ) (fold f list [ ]) (fold + ( )) 30 (fold + (map square (map list1 list2)))) Roots in Functional Programming Map Fold f f f f f g g g g g What is MapReduce? Programming model for expressing distributed computations at a massive scale Execution framework for organizing and performing such computations Open-source implementation called Hadoop Mappers And Reducers A mapper is a function that takes as input one ordered (key; value) pair of binary strings. As output the mapper produces a finite multiset of new (key, value) pairs. Mappers operates on ONE (key; value) pair at a time A reducer is a function that takes as input a binary string k which is the key, and a sequence of values v 1, v 2,, v n, which are also binary strings. As output, the reducer produces a multiset of pairs of binary strings (k,v k,1 ), (k,v k,2 ), (k,v k,3 ), (k,v k,n ) Key in output tuples is identical to the key in input tuple. Consequence Mappers can manipulate keys arbitrarily, but Reducers cannot change the keys at all 7

8 MapReduce Framework Programmers specify two functions: map (k, v) <k, v >* reduce (k, v ) <k, v >* All values, v, with the same key are sent to the same reducer The execution framework supports a computational runtime environment to handle all issues related to coordinating the parallel execution of a data-intensive computation in a large-scale environment Breaking up the problem into smaller tasks, coordinating workers executions, aggregating intermediate results, dealing with failures and softeare errors, MapReduce Runtime Basic Functions Handles scheduling Assigns workers to map and reduce tasks Handles data distribution Moves processes to data, not data to processes Handles synchronization among workers Gathers, sorts, and shuffles intermediate data Handles errors and faults, dynamically Detects worker failures and restarts K 1 V 1 K 2 V 2 K 3 V 3 K 4 V 4 K 5 V 5 K 6 V 6 MapReduce Data Flow Input Records k 1 v 1 k 1 v 1 Output Records map map map map k 21 v 23 k 1 v 3 a 1 b 2 c 3 c 6 a 5 c 2 b 7 c 8 Shuffle and Sort: aggregate values by keys a 1 5 b 2 7 c Split k 12 v 32 Local QSort k 1 v 5 Reduce Reduce Reduce Split k 21 v 45 k 12 v 54 k 2 v 2 k 2 v 4 r 1 s 1 r 2 s 2 r 3 s 3 8

9 MapReduce Programmers specify two functions: map (k, v) <k, v >* reduce (k, v ) <k, v >* All values with the same key are reduced together The execution framework handles everything else! Not quite Usually, programmers also specify: partition (k, number of partitions) partition for k Often a simple hash of the key, e.g., hash(k ) mod N Divides up key space for parallel reduce operations combine (k, v ) <k, v >* Mini-reducers that run in memory after the map phase Used as an optimization to reduce network traffic MapReduce Design Issues Barrier between map and reduce phases To enhance performance the process of copying intermediate data can start early Keys arrive at each reducer in sorted order No enforced ordering across reducers MapReduce Implementations Google has a proprietary implementation in C++ Bindings in Java, Python Hadoop is an open-source implementation in Java Development led by Yahoo, used in production Now an Apache project Rapidly expanding software ecosystem Lots of custom research implementations For GPUs, cell processors, etc. Distributed Execution Overview Input data Split 0 Split 1 Split 2 Read Worker Worker Worker User Program Fork Fork Fork Assign Map Local Write Master Remote Read, Sort Assign Reduce Worker Worker Write Output File 0 Output File 1 9

10 Map-Reduce Framework for Word Count Input: a collection of keys and their values User- Written Map Function Each input (k,v) mapped to set of intermediate key-value pairs Sort All Key-value Pairs By Key Hello World : Word Count Each list of values is reduced to a single value for that key User- Written Reduce Function One list of intermediate values for each key: (k, [v 1,,v n ]) WordCount PseudoCode Map Reduce map(string filename, String document) { List<String> T = tokenize(document); for each token in T { emit ((String)token, (Integer) 1); } } reduce(string token, List<Integer> values) { Integer sum = 0; for each value in values { sum = sum + value; } emit ((String)token, (Integer) sum); } Implementation of WordCount() A program forks a master process and many worker processes. Input is partitioned into some number of splits. Worker processes are assigned either to perform Map on a split or Reduce for some set of intermediate keys. 10

11 Execution of WordCount Responsibility of the Master From Other Map Processes (w,1), (w,1), One Reduce Process (w,200), Note: typically one Reduce will handle many words. 1. Assign Map and Reduce tasks to Workers. 2. Check that no Worker has died (because its processor failed). 3. Communicate results of Map to the Reduce tasks. (w,1), (w,1), One Chunk of Docs One Map Process Distribute By Word (w,1), (x,1), (w,1), (y,1), Communication from Map to Reduce Select a number R of reduce tasks. Divide the intermediate keys into R groups, Use an efficient hashing function Each Map task creates, at its own processor, R files of intermediate key-value pairs, one for each Reduce task. MAP REDUCE Execution Framework Dr. Taieb Znati Computer Science Department University of Pittsburgh 11

12 MAPREDUCE Execution Framework MapReduce Execution Model MapReduce Jobtrackers and Tasktrackers Input Data Splits MapReduce Execution Issues Scheduling and Synchronization Speculative Execution Partitioners and Combiners MapReduce Data Flow A MapReduce job is a unit of work to be performed Job consists of the MapReduce Program, the Input data and the Configuration Information The MapReduce job is divided it into two types of tasks map tasks and reduce tasks It is not uncommon for MapReduce jobs to have thousands of individual tasks to be assigned to cluster nodes The Input data is divided into fixed-size pieces called splits One map task is created for each split The user-defined map function is run on each split Configuration information indicates where the input lies, and the output is stored Lifecycle of a MapReduce Job Map function Reduce function Job Execution Jobtrackers and Takstrackers Two types of nodes control the job execution process Jobtraccker and Tasktrackers A jobtracker coordinates all the jobs run on the system by scheduling tasks to run on Tasktrackers Tasktrackers run tasks and send progress reports to the jobtracker Jobtracker keeps a record of the overall progress of each job If a task fails, the jobtracker can reschedule it on a different Tasktracker 12

13 Scheduling and Synchronization Jobtracker Scheduling and Coordination In large jobs, the total number of tasks may exceed the number of tasks that can be run on the cluster concurrently, The Jobtracker must maintain a task queue and assign nodes to waiting tasks as the nodes become available. Another aspect of Jobtracker s responsibiliies involves coordination among tasks belonging to different jobs Jobs from different users, for example Designing a large-scale, shared resource to support several users simultaneously in a predictable, transparent and policy-driven fashion is challening! MapReduce Stragglers The speed of a MapReduce job is sensitive to the stragglers performance tasks that take an usually long time to complete The map phase of a job is only as fast as the slowest map task. The running time of the slowest reduce task determines the completion time of a job Stragglers may result from unreliable hardware A machine recovering from frequent hardware errors may become significantly slower The barrier between the map and reduce tasks further compounds the problem MapReduce Speculative Execution Speculative execution is an optimization technique to improve job running times, in the presence of stragglers Base on speculative execution, an identical copy of the same task is executed on a different machine, The result of the task that finishes first is used Google has reported that speculative execution can achieve 44% performance improvement 13

14 MapReduce Speculative Execution Both map and reduce tasks can be speculatively executed, but the technique is better suited to map tasks than reduce tasks This is due to the fact that each copy of the reduce task needs to pull data over the network. Speculative execution cannot adequately address cases where stragglers are caused by a skew in the distribution of values associated with intermediate keys In these cases, tasks responsible for processing the most frequent elements run much longer than the typical task More efficient local aggregation may be required MapReduce Synchronization In MapReduce, synchronization is needed when mappers and reduces exchange intermediate output and state information Intermediate key-value pairs must be grouped by key, which requires the execution a distributed sort process involving all the nodes that executed map tasks and all the nodes that will execute reduce tasks The shuffle and sort process involves copying intermediate data over the network MapReduce Synchronization A MapReduce job with M mappers and R reducers may involves up to M R distinct copy operations Each mapper intermediate output goes to every reducer No reducer can start until all the mappers have finished emitting key-value pairs and all intermediate key-value pairs have been shuffled and sorted Necessary to guarantee that all values associated with the same key have been gathered. This is an important departure from functional programming, where aggregation can begin as soon as values are available. For improvement start copying intermediate key-value pairs over the network to the nodes running the reducers as soon as each mapper finishes Data Locality Optimization 14

15 Input Division Split Size Fine-grained splits increase parallelism and improves fault-tolerance Small splits reduce the processing time of each split and allows faster machines to process proportionally more splits over the course of the job than slower machines Load-balancing can be achieved more efficiently with small splits The impact of failure, when combined with load-balancing, can be reduced significantly with fine-grained splits Too small splits increases the overhead of managing splits Map task creation dominates the total job execution time. Data Locality Input Data Data locality Optimization The map task should be run on a node where the input data resides The optimal split size is the same as the largest size of input that can be guaranteed to be stored on a single node. If the split is larger than what one node can store, data transfer on across the network to the node running the map task is required May result in significant communication overhead, and reduces efficiency Data Locality Map Output Output produced by map tasks should be stored locally, NOT at a distributed storage Map output is intermediate Processed by reduce tasks to produce the final output Map output is no needed upon completion of the job Map output should NOT be replicated to overcome failure It is more efficient to restart the map task upon failure than replicating the output produced by map tasks Data Locality Reduce Tasks Reduce tasks cannot typically take advantage of data locality Input to a single reduce task is normally the output from all mappers. The sorted map outputs have to be transferred across the network to the node where the reduce task is running The outputs are merged and then passed to the user-defined reduce function 15

16 Reduce Task Output Replication Multiple replicas of the reduce output are normally stored reliably, First replica is stored on the local node, Other replicas being stored on off-rack nodes. To increase efficiency, the writing of the reduce output must reduce the amount of network bandwidth consumed The replication process must be streamlined and efficient Partitioners and Combiners Number of Reduce Tasks The number of map reducers is typically specified independently It is not only governed by the size of the input, but also the type of the application MapReduce data flow can be specified in different ways No reduce tasks Singe reduce task Multiple reduce tasks Data Flow No Reduce Tasks Input Split 0 Map Output Part o Split 1 Map Part 1 Split 2 Map Part 2 Replication Replication Replication 16

17 Data Flow Single Reduce Task Input Split 0 Map Sort Copy Output Merge Split 1 Map Sort Reduce Part 0 Replication Data Flow Multiple Reduce Tasks Input Sort Split 0 Map Split 1 Map Output Copy Merge Reduce Part 0 Replication Merge Reduce Part 1 Replication Split 1 Map Sort Split 1 Map Data Flow Multiple Reduce Tasks Map tasks partition their output, each creating one partition for each reduce task There can be many keys and their associate values in each partition The recodes for every key are all in a single partition The partition can be controlled by user-defined partition function The use of a hash function typically works well The Shuffle, data flow between map and reduce, is a complicated process whose tuning can have a big impact on the job execution MapReduce Combiners Combiner functions can be used to minimize the data transferred between map and reduce tasks Combiners are particularly useful when MapReduce jobs are limited by the bandwidth available on the cluster Combiners are user-specified functions Combiner functions run on the map output The combiner s is then fed to the reduce function Since it is an optimization function, there is no guarantee how many times combiners are called for a particular map output record, if at all Calling the combiner zero, one, or many times should produce the same output from the reducer. 17

18 Combiner Example Max Temperature Assume that the mappers produce the following output Mapper 1 (1950, 0), (1950, 20), (1950, 10) Mapper 2 (1950, 25), (1950, 15) Reduce function is called with a list of all the values (1950, [0, 20, 10, 25, 15]) with output (1950, 25) Using a combiner for each map output results in: Combiner 1 (1950, 20) Combiner 2 (1950, 25) Reduce function is called with (1950, [20,25]) with output (1950, 25) Combiner Property The combiner function calls can be expressed as follows: Max(0, 20, 10, 25, 15) = Max(Max(0, 20, 10), Max(25, 15)) = Max(20, 25)=25 Max() is commonly referred to as distributive Not all function exhibit distributive property Mean(0, 20, 10, 25, 15) = ( )/5=14 Mean(Mean(0, 20, 10), Mean(25, 15)) = Mean (10, 20) = 15 Combiners do not replace producers Producers are still needed to process recorders with the same key from different maps Conclusion Part I Conclusion Part II MapReduce Execution Framework MapReduce Jobtrackers and Tasktrackers Input Data Splits MapReduce Execution Issues Scheduling and Synchronization Speculative Execution Partitioners and Combiners Scaling Data Intensive Application MapReduce Framework MapReduce Overview Map Reduce Data Flow Map Function Partition Function Compare Function Reduce Function 18

19 Reference Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer. MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun-TakLeung, 19

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

What is cloud computing?

What is cloud computing? Introduction to Clouds and MapReduce Jimmy Lin University of Maryland What is cloud computing? With mods by Alan Sussman This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

MapReduce (in the cloud)

MapReduce (in the cloud) MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Introduction to DISC and Hadoop

Introduction to DISC and Hadoop Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan sriram@sdsc.edu Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Introduction to Hadoop

Introduction to Hadoop 1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process

More information

MapReduce: Algorithm Design Patterns

MapReduce: Algorithm Design Patterns Designing Algorithms for MapReduce MapReduce: Algorithm Design Patterns Need to adapt to a restricted model of computation Goals Scalability: adding machines will make the algo run faster Efficiency: resources

More information

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Storage, Management and challenges. Ahmed Ali-Eldin Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

More information

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013 Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

Advanced Data Management Technologies

Advanced Data Management Technologies ADMT 2015/16 Unit 15 J. Gamper 1/53 Advanced Data Management Technologies Unit 15 MapReduce J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements: Much of the information

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Hadoop WordCount Explained! IT332 Distributed Systems

Hadoop WordCount Explained! IT332 Distributed Systems Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,

More information

Map Reduce / Hadoop / HDFS

Map Reduce / Hadoop / HDFS Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

Hadoop Design and k-means Clustering

Hadoop Design and k-means Clustering Hadoop Design and k-means Clustering Kenneth Heafield Google Inc January 15, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise

More information

The MapReduce Framework

The MapReduce Framework The MapReduce Framework Luke Tierney Department of Statistics & Actuarial Science University of Iowa November 8, 2007 Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 1 / 16 Background

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

A programming model in Cloud: MapReduce

A programming model in Cloud: MapReduce A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value

More information

The Hadoop Framework

The Hadoop Framework The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale

More information

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

MapReduce Systems. Outline. Computer Speedup. Sara Bouchenak

MapReduce Systems. Outline. Computer Speedup. Sara Bouchenak MapReduce Systems Sara Bouchenak Sara.Bouchenak@imag.fr http://sardes.inrialpes.fr/~bouchena/teaching/ Lectures based on the following slides: http://code.google.com/edu/submissions/mapreduceminilecture/listing.html

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:

More information

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich

Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich First, an Announcement There will be a repetition exercise group on Wednesday this week. TAs will answer your questions on SQL, relational

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

More information

Research Laboratory. Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze

Research Laboratory. Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze Research Laboratory Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze 1. Java Web Crawler Description Java Code 2. MapReduce Overview Example of mapreduce

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

BIG DATA, MAPREDUCE & HADOOP

BIG DATA, MAPREDUCE & HADOOP BIG, MAPREDUCE & HADOOP LARGE SCALE DISTRIBUTED SYSTEMS By Jean-Pierre Lozi A tutorial for the LSDS class LARGE SCALE DISTRIBUTED SYSTEMS BIG, MAPREDUCE & HADOOP 1 OBJECTIVES OF THIS LAB SESSION The LSDS

More information

Data-intensive computing systems

Data-intensive computing systems Data-intensive computing systems Hadoop Universtity of Verona Computer Science Department Damiano Carra Acknowledgements! Credits Part of the course material is based on slides provided by the following

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2013/14 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2013/14 Lecture XI: MapReduce & Hadoop The new world of Big Data (programming model) Big Data Buzzword for challenges occurring

More information

Comparison of Different Implementation of Inverted Indexes in Hadoop

Comparison of Different Implementation of Inverted Indexes in Hadoop Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk

Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless

More information

University of Maryland. Tuesday, February 2, 2010

University of Maryland. Tuesday, February 2, 2010 Data-Intensive Information Processing Applications Session #2 Hadoop: Nuts and Bolts Jimmy Lin University of Maryland Tuesday, February 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Mobile Cloud Computing for Data-Intensive Applications

Mobile Cloud Computing for Data-Intensive Applications Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Cloud Computing. Chapter 8. 8.1 Hadoop

Cloud Computing. Chapter 8. 8.1 Hadoop Chapter 8 Cloud Computing In cloud computing, the idea is that a large corporation that has many computers could sell time on them, for example to make profitable use of excess capacity. The typical customer

More information

A bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18

A bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 A bit about Hadoop Luca Pireddu CRS4Distributed Computing Group March 9, 2012 luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 Often seen problems Often seen problems Low parallelism I/O is

More information

High Performance Computing MapReduce & Hadoop. 17th Apr 2014

High Performance Computing MapReduce & Hadoop. 17th Apr 2014 High Performance Computing MapReduce & Hadoop 17th Apr 2014 MapReduce Programming model for parallel processing vast amounts of data (TBs/PBs) distributed on commodity clusters Borrows from map() and reduce()

More information

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

Principles of Software Construction: Objects, Design, and Concurrency. Distributed System Design, Part 2. MapReduce. Charlie Garrod Christian Kästner

Principles of Software Construction: Objects, Design, and Concurrency. Distributed System Design, Part 2. MapReduce. Charlie Garrod Christian Kästner Principles of Software Construction: Objects, Design, and Concurrency Distributed System Design, Part 2. MapReduce Spring 2014 Charlie Garrod Christian Kästner School of Computer Science Administrivia

More information

Hadoop and Map-reduce computing

Hadoop and Map-reduce computing Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

<Insert Picture Here> Oracle and/or Hadoop And what you need to know

<Insert Picture Here> Oracle and/or Hadoop And what you need to know Oracle and/or Hadoop And what you need to know Jean-Pierre Dijcks Data Warehouse Product Management Agenda Business Context An overview of Hadoop and/or MapReduce Choices, choices,

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Big Application Execution on Cloud using Hadoop Distributed File System

Big Application Execution on Cloud using Hadoop Distributed File System Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

MapReduce and Hadoop Distributed File System V I J A Y R A O

MapReduce and Hadoop Distributed File System V I J A Y R A O MapReduce and Hadoop Distributed File System 1 V I J A Y R A O The Context: Big-data Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009) Google collects 270PB data in a month (2007), 20000PB

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing MapReduce and Hadoop 15 319, spring 2010 17 th Lecture, Mar 16 th Majd F. Sakr Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

MapReduce. from the paper. MapReduce: Simplified Data Processing on Large Clusters (2004)

MapReduce. from the paper. MapReduce: Simplified Data Processing on Large Clusters (2004) MapReduce from the paper MapReduce: Simplified Data Processing on Large Clusters (2004) What it is MapReduce is a programming model and an associated implementation for processing and generating large

More information