Principles of Software Construction: Objects, Design, and Concurrency. Distributed System Design, Part 2. MapReduce. Charlie Garrod Christian Kästner



Similar documents
MapReduce (in the cloud)

Introduction to Parallel Programming and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce

MAPREDUCE Programming Model

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Introduction to Hadoop

Cloud Computing at Google. Architecture

Map Reduce / Hadoop / HDFS

Big Data With Hadoop

Introduction to Hadoop

The MapReduce Framework

Big Data and Apache Hadoop s MapReduce

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

Parallel Processing of cluster by Map Reduce

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Big Data Processing with Google s MapReduce. Alexandru Costan

Hadoop and Map-Reduce. Swati Gore

The Hadoop Framework

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

Lecture Data Warehouse Systems

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

High Performance Computing MapReduce & Hadoop. 17th Apr 2014

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Performance and Energy Efficiency of. Hadoop deployment models

Data-Intensive Computing with Map-Reduce and Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Introduction to DISC and Hadoop

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Can the Elephants Handle the NoSQL Onslaught?

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

MapReduce. from the paper. MapReduce: Simplified Data Processing on Large Clusters (2004)

CSE-E5430 Scalable Cloud Computing Lecture 2

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

CS 378 Big Data Programming. Lecture 2 Map- Reduce

The Performance Characteristics of MapReduce Applications on Scalable Clusters

Project DALKIT (informal working title)

How To Write A Mapreduce Program On An Ipad Or Ipad (For Free)

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

- Behind The Cloud -

Apache Flink Next-gen data analysis. Kostas

Hadoop IST 734 SS CHUNG

Hadoop Parallel Data Processing

Hadoop WordCount Explained! IT332 Distributed Systems

A programming model in Cloud: MapReduce

The Stratosphere Big Data Analytics Platform

CS 378 Big Data Programming

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Big Data Management. Big Data Management. (BDM) Autumn Povl Koch September 16,

Distributed Computing and Big Data: Hadoop and MapReduce

Advanced Data Management Technologies

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Implement Hadoop jobs to extract business value from large and varied data sets

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics

Apache Hadoop. Alexandru Costan

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Bigtable is a proven design Underpins 100+ Google services:

Cloud Computing. Lectures 10 and 11 Map Reduce: System Perspective

Scalable Internet Services and Load Balancing

Disco: Beyond MapReduce

Fast Data in the Era of Big Data: Twitter s Real-

In Memory Accelerator for MongoDB

Lecture 10 - Functional programming: Hadoop and MapReduce

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

Developing MapReduce Programs

Big Data Analytics* Outline. Issues. Big Data

Hadoop and Map-reduce computing

PassTest. Bessere Qualität, bessere Dienstleistungen!

Data Management in the Cloud MAP/REDUCE. Map/Reduce. Programming model Examples Execution model Criticism Iterative map/reduce

Big Data Technology Core Hadoop: HDFS-YARN Internals

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Future Internet Technologies

Development of nosql data storage for the ATLAS PanDA Monitoring System

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Big Fast Data Hadoop acceleration with Flash. June 2013

Enabling High performance Big Data platform with RDMA

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Sriram Krishnan, Ph.D.

16.1 MAPREDUCE. For personal use only, not for distribution. 333

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

A Brief Outline on Bigdata Hadoop

Storage Architectures for Big Data in the Cloud

Log Mining Based on Hadoop s Map and Reduce Technique

Introduction to Hadoop

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

From GWS to MapReduce: Google s Cloud Technology in the Early Days

NoSQL and Hadoop Technologies On Oracle Cloud

Transcription:

Principles of Software Construction: Objects, Design, and Concurrency Distributed System Design, Part 2. MapReduce Spring 2014 Charlie Garrod Christian Kästner School of Computer Science

Administrivia Homework 5c due tonight Homework 6 coming tomorrow 2

Road map from last time Application-level communication protocols Frameworks for simple distributed computation Remote Procedure Call (RPC) Java Remote Method Invocation (RMI) Common patterns of distributed system design Complex computational frameworks e.g., distributed map-reduce 3

Today: Distributed system design, part 2 Introduction to distributed systems Motivation: reliability and scalability Replication for reliability Partitioning for scalability MapReduce: A robust, scalable framework for distributed computation on replicated, partitioned data 4

5

Aside: The robustness vs. redundancy curve robustness? redundancy 6

Metrics of success Reliability Often in terms of availability: fraction of time system is working 99.999% available is "5 nines of availability" Scalability Ability to handle workload growth 7

A case study: Passive primary-backup replication Architecture before replication: client front-end client front-end database server: {alice:90, bob:42, } Problem: Database server might fail 8

A case study: Passive primary-backup replication Architecture before replication: client front-end client front-end database server: {alice:90, bob:42, } Problem: Database server might fail Solution: Replicate data onto multiple servers client client front-end front-end primary: {alice:90, bob:42, } backup: {alice:90, bob:42, } backup: {alice:90, bob:42, 9

Partitioning for scalability Partition data based on some property, put each partition on a different server CMU server: client front-end {cohen:9, bob:42, } client front-end Yale server: {alice:90, pete:12, } MIT server: {deb:16, reif:40, } 10

Master/tablet-based systems Dynamically allocate range-based partitions Master server maintains tablet-to-server assignments Tablet servers store actual data client client Front-ends cache tablet-to-server assignments Master: Tablet server 1: {a-c:[2], d-g:[3,4], k-z: h-j:[3], {pete:12, k-z:[1]} reif:42} front-end front-end Tablet server 2: a-c: {alice:90, bob:42, cohen:9} Tablet server 3: d-g: {deb:16} h-j:{ } Tablet server 4: d-g: {deb:16} 11

Today: Distributed system design, part 2 Introduction to distributed systems Motivation: reliability and scalability Replication for reliability Partitioning for scalability MapReduce: A robust, scalable framework for distributed computation on replicated, partitioned data 12

Map from a functional perspective map(f, x[0 n-1])! Apply the function f to each element of list x! E.g., in Python: def square(x): return x*x! map(square, [1, 2, 3, 4]) would return [1, 4, 9, 16] Parallel map implementation is trivial What is the work? What is the depth? map/reduce images src: Apache Hadoop tutorials 13

Reduce from a functional perspective reduce(f, x[0 n-1])! Repeatedly apply binary function f to pairs of items in x, replacing the pair of items with the result until only one item remains One sequential Python implementation: def reduce(f, x):! if len(x) == 1: return x[0]! return reduce(f, [f(x[0],x[1])] + x[2:])! e.g., in Python: def add(x,y): return x+y! reduce(add, [1,2,3,4])! would return 10 as reduce(add, [1,2,3,4])! reduce(add, [3,3,4])! reduce(add, [6,4])! reduce(add, [10]) -> 10! 14

Reduce with an associative binary function If the function f is associative, the order f is applied does not affect the result 1 + ((2+3) + 4) 1 + (2 + (3+4)) (1+2) + (3+4) Parallel reduce implementation is also easy What is the work? What is the depth? 15

Distributed MapReduce The distributed MapReduce idea is similar to (but not the same as!):!!reduce(f2, map(f1, x)) Key idea: a "data-centric" architecture Send function f1 directly to the data Execute it concurrently Then merge results with reduce Also concurrently Programmer can focus on the data processing rather than the challenges of distributed systems 16

MapReduce with key/value pairs (Google style) Master Assign tasks to workers Ping workers to test for failures Map workers Map for each key/value pair Emit intermediate key/value pairs the shuffle: Reduce workers Sort data by intermediate key and aggregate by key Reduce for each key 17

MapReduce with key/value pairs (Google style) E.g., for each word on the Web, count the number of times that word occurs For Map: key1 is a document name, value is the contents of that document For Reduce: key2 is a word, values is a list of the number of counts of that word f1(string key1, String value):! for each word w in value:! EmitIntermediate(w, 1);!! f2(string key2, Iterator values):! int result = 0;! for each v in values:! result += v;! Emit(key2, result);! Map: (key1, v1) à (key2, v2)* Reduce: (key2, v2*) à (key3, v3)* MapReduce: (key1, v1)* à (key3, v3)* MapReduce: (docname, doctext)* à (word, wordcount)* 18

MapReduce architectural details Usually integrated with a distributed storage system Map worker executes function on its share of the data Map output usually written to worker's local disk Shuffle: reduce worker often pulls intermediate data from map worker's local disk Reduce output usually written back to distributed storage system 2: 1: 3: 19

Handling server failures with MapReduce Map worker failure: Re-map using replica of the storage system data Reduce worker failure: New reduce worker can pull intermediate data from map worker's local disk, re-reduce Master failure: Options: Restart system using new master Replicate master 2: 1: 3: 20

The beauty of MapReduce Low communication costs (usually) The shuffle (between map and reduce) is expensive MapReduce can be iterated Input to MapReduce: key/value pairs in the distributed storage system Output from MapReduce: key/value pairs in the distributed storage system 21

Another MapReduce example E.g., for person in a social network graph, output the number of mutual friends they have For Map: key1 is a person, value is the list of her friends For Reduce: key2 is???, values is a list of??? f1(string key1, String value):!!! f2(string key2, Iterator values):! MapReduce: (person, friends)* à (pair of people, count of mutual friends)* 22

Another MapReduce example E.g., for person in a social network graph, output the number of mutual friends they have For Map: key1 is a person, value is the list of her friends For Reduce: key2 is a pair of people, values is a list of 1s, for each mutual friend that pair has f1(string key1, String value):!! for each pair of friends!in value:! EmitIntermediate(pair, 1);! f2(string key2, Iterator values):! int result = 0;! for each v in values:! result += v;! Emit(key2, result);! MapReduce: (person, friends)* à (pair of people, count of mutual friends)* 23

And another MapReduce example E.g., for each page on the Web, create a list of the pages that link to it For Map: key1 is a document name, value is the contents of that document For Reduce: key2 is???, values is a list of??? f1(string key1, String value):!!! f2(string key2, Iterator values):! MapReduce: (docname, doctext)* à (docname, list of incoming links)* 24

Thursday More distributed systems.. 25