Map Reduce / Hadoop / HDFS




Chapter 3: MapReduce / Hadoop / HDFS

Overview / Outline (covered this week and next)
- Distributed File Systems (re-visited)
- MapReduce: Motivation, Programming Model, Example Applications
- Big Data in Apache Hadoop: HDFS in Hadoop, YARN

Overview: Distributed File Systems
- Difference to RDBMS
- Parallel Computing Architecture

Distributed File Systems
Past: most computing was done on a single processor, with
- one main memory
- one cache
- one local disk
New challenges:
- Files must be stored redundantly: if one node fails, all of its files would be unavailable until the node is replaced (see File Management)
- Computations must be divided into tasks: a task can be restarted without affecting other tasks (see MapReduce)
- Use of commodity hardware

Distributed File Systems - Drawbacks of RDBMS
- Database systems are difficult to scale
- Database systems are difficult to configure and maintain
- Diversification in available systems complicates selection
- Peak provisioning leads to unnecessary costs
Advantages of NoSQL systems:
- Elastic scaling
- Less administration
- Better economics
- Flexible data models

Distributed File Systems - Parallel Computing Architecture
- Also referred to as cluster computing
- Physical organisation:
  - compute nodes are mounted on racks (8-64 nodes per rack)
  - nodes within a rack are connected by a network, typically gigabit Ethernet
  - racks are connected to one another via switches
[Figure: racks of servers, with switches at the top, at Google's Mayes County, Oklahoma data center; source: extremetech.com]

Distributed File Systems - Large-Scale File-System Organisation
- Characteristics:
  - files are often several terabytes in size (Facebook's daily logs: 60 TB; 1000 Genomes Project: 200 TB; Google web index: 10+ PB)
  - files are rarely updated
  - reads and appends are common
- Exemplary distributed file systems:
  - Google File System (GFS)
  - Hadoop Distributed File System (HDFS, Apache)
  - CloudStore
  - HDF5
  - S3 (Amazon)
  - ...

Distributed File Systems - Large-Scale File-System Organisation
- Files are divided into chunks (typically 16-64 MB in size)
- Chunks are replicated n times (e.g. default in HDFS: n = 3) on n different nodes; optimally, replicas are placed on different racks to improve fault tolerance
- How to find files?
  - a master node holds a meta-file (directory) with the locations of all copies of a file
  - all participants using the DFS thus know where the copies are located
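To make the rack-aware replication rule concrete, here is a minimal, hypothetical Python sketch (not HDFS's actual block-placement code; the function and cluster names are made up for illustration). It picks n nodes for one chunk so that the replicas span at least two racks:

import random

def place_replicas(nodes_by_rack, n=3):
    # nodes_by_rack: dict mapping rack id -> list of node names (assumed cluster layout)
    racks = [r for r, nodes in nodes_by_rack.items() if nodes]
    if len(racks) < 2 or n < 2:
        # degenerate case: place the replicas wherever possible
        all_nodes = [nd for nodes in nodes_by_rack.values() for nd in nodes]
        return random.sample(all_nodes, min(n, len(all_nodes)))
    first_rack, second_rack = random.sample(racks, 2)
    replicas = [random.choice(nodes_by_rack[first_rack])]   # first replica on one rack
    remaining = list(nodes_by_rack[second_rack])             # remaining replicas on a different rack
    random.shuffle(remaining)
    replicas += remaining[:n - 1]
    return replicas

# example: 3 racks with 4 nodes each
cluster = {f"rack{r}": [f"node{r}-{i}" for i in range(4)] for r in range(3)}
print(place_replicas(cluster, n=3))

HDFS's actual default policy is similar in spirit: one replica on the writer's node (or a random node), and the other two on two different nodes of another rack.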

Overview: MapReduce
- Motivation
- Programming Model
- Recap: Functional Programming
- Examples

Motivation - Comparison to Other Systems: MapReduce vs. RDBMS

             MapReduce                     RDBMS
Data size    Petabytes                     Gigabytes
Access       Batch                         Interactive and batch
Updates      Write once, read many times   Read & write many times
Structure    Dynamic schema                Static schema
Integrity    Low                           High (normalized data)
Scaling      Linear                        Non-linear

Motivation - Comparison to Other Systems: MapReduce vs. Grid Computing
- Accessing large data volumes becomes a problem in high-performance computing (HPC), as the network bandwidth is the bottleneck <-> data locality in MapReduce
- In HPC, programmers have to explicitly handle the data flow <-> MapReduce operates at a higher level, i.e. the data flow is implicit
- Handling partial failures <-> MapReduce uses a shared-nothing architecture (no dependence between tasks); it detects failures and reschedules the missing operations

Motivation - Large-Scale Data Processing in General
- MapReduce can be used to manage large-scale computations in a way that is tolerant of hardware faults
- The system itself manages automatic parallelisation and distribution, I/O scheduling, and coordination of the tasks implemented in map() and reduce(), and it copes with unexpected system failures or stragglers
- Several implementations exist: Google's internal implementation, the open-source implementation Hadoop (using HDFS), ...

Programming Model - General Processing
- Input & output: each a set of key/value pairs
- The programmer specifies two functions:
  map (in_key, in_value) -> list(out_key, intermediate_value)
  - processes one input key/value pair; one Map() call for every pair
  - produces a set of intermediate pairs
  reduce (out_key, list(intermediate_value)) -> list(out_value)
  - combines all intermediate values for a particular key; one Reduce() call per unique key
  - produces a set of merged output values (usually just one output value)

Programming Model - Recap: Functional Programming
- MapReduce is inspired by similar primitives in LISP, SML, Haskell and other languages
- The general idea of higher-order functions (map and fold) in functional programming (FP) languages is carried over into the MapReduce environment:
  - map in MapReduce <-> map in FP
  - reduce in MapReduce <-> fold in FP

Programming Model - Recap: Functional Programming
- MAP:
  - 2 parameters: applies a function to each element of a list
  - the type of the elements in the result list can differ from the type of the input list
  - the size of the result list remains the same
In Haskell:
  map :: (a -> b) -> [a] -> [b]
  map f []     = []
  map f (x:xs) = f x : map f xs
Example:
  *Main> map (\x -> (x,1)) ["Big", "Data", "Management", "and", "Analysis"]
  [("Big",1),("Data",1),("Management",1),("and",1),("Analysis",1)]

Programming Model - Recap: Functional Programming
- FOLD:
  - 3 parameters: traverses a list and applies a function f() to each element plus an accumulator; f() returns the next accumulator value
  - in functional programming: foldl and foldr
In Haskell (foldr is analogous):
  foldl :: (b -> a -> b) -> b -> [a] -> b
  foldl f acc []     = acc
  foldl f acc (x:xs) = foldl f (f acc x) xs
Example:
  *Main> foldl (\acc (key,value) -> acc + value) 0 [("Big",1), ("Big",1), ("Big",1)]
  3

Programming Model - General Processing of MapReduce
1. Chunks from a DFS are assigned to Map tasks, which turn each chunk into a sequence of key-value pairs.
2. The key-value pairs are collected by a master controller and sorted by key. The keys are divided among all Reduce tasks.
3. Reduce tasks work on each key separately and combine all the values associated with that key.
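To make the three steps concrete, here is a minimal single-machine sketch in Python (an illustration only, not Hadoop's or Google's implementation; the function name map_reduce and the toy input are made up). It applies a user-supplied map() to every input pair, groups the intermediate pairs by key (the "sort by key" step), and then applies reduce() once per unique key:

from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    # inputs: iterable of (in_key, in_value) pairs
    intermediate = defaultdict(list)
    # 1. Map phase: one map_fn call per input pair
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            intermediate[out_key].append(value)
    # 2. Group by key ("shuffle/sort"): done here by the dict plus sorted()
    # 3. Reduce phase: one reduce_fn call per distinct key
    return {key: reduce_fn(key, values)
            for key, values in sorted(intermediate.items())}

# tiny usage example (word count on one document)
docs = [("doc1", "big data big")]
print(map_reduce(docs,
                 lambda k, v: [(w, 1) for w in v.split()],
                 lambda k, vs: sum(vs)))   # {'big': 2, 'data': 1}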

Programming Model - High-level diagram [figure]


Programming Model - General Processing
- Programmer's task: specify map() and reduce()
- The MapReduce environment takes care of:
  - partitioning the input data
  - scheduling
  - shuffle and sort (performing the group-by-key step)
  - handling machine failures and stragglers
  - managing the required inter-machine communication

Programming Model - General Processing: Partitioning the Input Data
- Data files are divided into blocks (default in GFS/HDFS: 64 MB) and replicas of each block are stored on different nodes
- The master schedules map() tasks in close proximity to the data:
  - a map() task is executed on the same machine where one replica of its input block is stored (or at least on the same rack, so communication stays behind one network switch)
- Goal: conserve network bandwidth (cf. grid computing)
  - this lets tasks read input data at local-disk speed rather than being limited by the bandwidth of the rack switches

Programming Model - General Processing: Scheduling
- One master, many workers
  - the input data is split into M map tasks
  - the reduce phase is partitioned into R tasks
  - tasks are assigned to workers dynamically
- The master assigns each map task to a free worker
  - considers the proximity of the data to the worker
  - the worker reads the task input (optimally: from its local disk)
  - output: files containing intermediate (key,value)-pairs, sorted by key
- The master assigns each reduce task to a free worker
  - the worker reads the intermediate (key,value)-pairs
  - the worker merges them and applies the reduce() function to produce the output
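A minimal sketch of the locality-aware assignment described above (hypothetical Python, not the actual Hadoop or Google scheduler; all names are illustrative): given the replica locations of a map task's input block, the master prefers a free worker that already holds a replica, then a worker on the same rack, and only then any free worker.

def assign_map_task(replica_nodes, free_workers, rack_of):
    # replica_nodes: nodes holding a replica of the task's input block
    # free_workers:  currently idle workers
    # rack_of:       dict node -> rack id (assumed cluster metadata)
    # 1. data-local: a free worker that stores a replica
    for w in free_workers:
        if w in replica_nodes:
            return w
    # 2. rack-local: a free worker on the same rack as some replica
    replica_racks = {rack_of[n] for n in replica_nodes}
    for w in free_workers:
        if rack_of[w] in replica_racks:
            return w
    # 3. otherwise: any free worker (the data must cross the rack switches)
    return next(iter(free_workers), None)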

Programming Model - General Processing: Shuffle and Sort (the group-by-key step)
- The input to every reducer is sorted by key
- Shuffle: sort the map outputs and transfer them to the reducers as input
  - mappers need to separate output intended for different reducers
  - reducers need to collect their data from all(!) mappers
- The keys at each reducer are processed in order
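How a mapper decides which of the R reducers a key belongs to is typically a hash partition; the sketch below mirrors the common default (Hadoop's HashPartitioner hashes the key modulo the number of reduce tasks), but the concrete hash function here is only illustrative.

def partition(key, num_reducers):
    # route an intermediate key to one of R reducers (hash partitioning);
    # a real system needs a hash that is deterministic across machines
    # (Python's built-in hash() for strings is randomized per process)
    return hash(key) % num_reducers

# every mapper applies the same function, so all values for a given key
# end up at the same reducer, no matter which mapper produced them
print(partition("Big", 4), partition("Data", 4))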

Programming Model - General Processing: Shuffle and Sort [figure] (Source: O'Reilly, Hadoop: The Definitive Guide, 3rd Edition, May 2012)

Programming Model - General Processing: Handling Machine Failures
- In general: the master pings the workers periodically to detect failures
- Map worker failure
  - map tasks completed or in progress at the failed worker are reset to idle (completed map output resides on the failed worker's local disk and is lost)
  - all reduce workers are notified of any re-execution
- Reduce worker failure
  - only in-progress tasks at the failed worker are re-executed, since completed reduce output is stored in the global file system
- Master failure
  - the master node is itself replicated; a backup master recovers the last updated log files (meta-file) and continues
  - if there is no backup master, the MapReduce job is aborted and the client is notified

Programming Model - Handling Machine Failures [figure]

Programming Model - General Processing: Handling Stragglers
- Stragglers: slow workers lengthen the completion time of a job
- Close to completion, backup copies of the remaining in-progress tasks are created
- Causes: hardware degradation, software misconfiguration, ...
- If a task is running slower than expected, another equivalent task is launched as a backup -> speculative execution of tasks
- When a task completes successfully, any duplicate tasks are killed

Programming Model - Handling Stragglers [figure]

Programming Model - General Processing: Managing the Required Inter-Machine Communication
- Task status: idle, in-progress, or completed
- Idle tasks get scheduled as workers become available
- On completion of a map task, the worker sends the locations and sizes of its intermediate files to the master
- The master pushes this information to the reducers
- Fault tolerance: the master pings the workers periodically to detect failures

Programming Model - General Processing: Workflow
Workflow of MapReduce (as originally implemented by Google):
1. Initiate the MapReduce environment on a cluster of machines.
2. One machine is the master; the rest are workers that are assigned tasks by the master.
3. A map worker reads the contents of an input split and passes them to the MAP function. The results are buffered in memory.
4. The buffered (key,value)-pairs are written to local disk. The locations of these (intermediate) files are passed back to the master.
5. A reduce worker that has been notified by the master uses remote procedure calls to read the buffered data.
6. The reduce worker iterates over the sorted intermediate (key,value)-pairs and passes them to the REDUCE function.
On completion of all tasks, the master notifies the user program.

Programming Model - Low-level diagram [figure]

Example #1: WordCount
- Setting: text documents, e.g. web server logs
- Task: count the occurrences of distinct words appearing in the input files, e.g. find popular URLs in server logs
- Challenges:
  - the file is too large to fit in a single machine's memory
  - parallel execution is needed
- Solution: apply MapReduce

Example #1: WordCount
- Goal: count word occurrences in a set of documents
- Input: "Wer A sagt, muss auch B sagen! Wer B sagt, braucht B nicht nochmal sagen!"

map (k1, v1) -> list(k2, v2)
  map(String key, String value):
    // key: document name
    // value: content of the document
    for each word w in value do:
      EmitIntermediate(w, "1")

reduce (k2, list(v2)) -> list(v2)
  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values do:
      result += ParseInt(v);
    Emit(result.toString())
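The same example as runnable Python, reusing the toy map_reduce() driver sketched after the "General Processing of MapReduce" slide (the driver, the document names, and the lack of punctuation handling are all illustrative simplifications, not part of the original slides):

def wc_map(doc_name, text):
    # emit one (word, 1) pair per word occurrence (no punctuation stripping, for brevity)
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    # sum all counts emitted for this word
    return sum(counts)

# assumes map_reduce() from the driver sketch above
docs = [("doc1", "Wer A sagt, muss auch B sagen!"),
        ("doc2", "Wer B sagt, braucht B nicht nochmal sagen!")]
print(map_reduce(docs, wc_map, wc_reduce))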

Example #1: WordCount in a parallel environment [figure]

Example #2: k-means
- Randomly initialize k centers
- Classify: assign each point to its nearest center
- Recenter: each center becomes the centroid of its points
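Written out in standard k-means notation (the formulas on the original slide were images; this is the usual formulation with centers \mu_1,\dots,\mu_k and points x_i), one iteration consists of:

\text{Classify:}\quad z_i \leftarrow \arg\min_{j \in \{1,\dots,k\}} \lVert \mu_j - x_i \rVert_2^2
\qquad
\text{Recenter:}\quad \mu_j \leftarrow \frac{1}{\lvert \{i : z_i = j\} \rvert} \sum_{i : z_i = j} x_i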

Example #2: k-means - MapReduce Scheme [figure]

Example #2: k-means - Classification Step as Map
Classify: assign each point to its nearest center
Map:
- Input:
  - the subset of d-dimensional objects handled by each mapper
  - the initial set of centroids
- Output:
  - a list of objects assigned to their nearest centroid; this list is later read by the reducer program

Example #2: k-means - Classification Step as Map
Classify: assign each point to its nearest center

for all x_i in M do
  bestcentroid <- null
  mindist <- inf
  for all c in C do
    dist <- l2dist(x_i, c)
    if bestcentroid == null or dist < mindist then
      mindist <- dist
      bestcentroid <- c
    endif
  endfor
  outputlist << (bestcentroid, x_i)
endfor
return outputlist

Example #2: k-means - Recenter Step as Reduce
Recenter: each center becomes the centroid of its points
Note: this is equivalent to averaging its points!
Reduce:
- Input:
  - a list of (key,value)-pairs, where key = bestcentroid and value = the objects assigned to this centroid
- Output:
  - (key,value)-pairs, where key = oldcentroid and value = newbestcentroid, the new centroid calculated for that cluster

Example #2: k-means - Recenter Step as Reduce
Recenter: each center becomes the centroid of its points

assignmentlist <- outputlists   // lists from the mappers are merged together (shuffle)
for all (key, values) in assignmentlist do
  newcentroid <- null
  sumofobjects <- 0
  numofobjects <- 0
  for all obj in values do
    sumofobjects += obj
    numofobjects ++
  endfor
  newcentroid <- (sumofobjects / numofobjects)
  newcentroidlist << (key, newcentroid)
endfor
return newcentroidlist
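Putting the two steps together, here is a compact runnable sketch of one k-means iteration in Python (illustrative only; it reuses the toy map_reduce() driver from earlier, broadcasts the current centroids to every mapper through the input value, and uses plain lists instead of a DFS):

import math

def kmeans_map(_, point_and_centroids):
    # classification step: emit (index of the nearest centroid, point)
    point, centroids = point_and_centroids
    dists = [math.dist(point, c) for c in centroids]
    return [(dists.index(min(dists)), point)]

def kmeans_reduce(centroid_idx, points):
    # recenter step: the new centroid is the mean of the assigned points
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

# one k-means iteration over a toy 2-D data set with k = 2
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
inputs = [(i, (p, centroids)) for i, p in enumerate(points)]
new_centroids = map_reduce(inputs, kmeans_map, kmeans_reduce)
print(new_centroids)   # {0: (1.25, 1.5), 1: (8.5, 7.75)}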