STOCHASTIC ANALYTICS Chris Jermaine Rice University cmj4@cs.rice.edu
What Is Meant by Stochastic Analytics?
What Is Meant by Stochastic Analytics? It is actually a relatively little-used term (to date!). It means tackling (Big Data) analytic tasks using stochastic models. Will now define stochastic models and (Big Data) analytics.
First: What Is a Stochastic Model? A model for some slice of reality which has a random component. Here random means probabilistic: there is typically a distribution over possible inputs and/or outcomes. Why utilize randomness?
What Is a Stochastic Model? Why utilize randomness? Randomness provides a way to model uncertainty. Ex: have a data record describing J. Smith, and both John Smith and Jane Smith are people in your data set; record a 50/50 chance of J. Smith referring to John/Jane.
Randomness also provides a way to model missing data. Ex: missing a person's gender? 52% of people in the data set are women... so represent the gender via a distribution (52% female, 48% male).
Randomness provides for a principled way to talk about beliefs. Ex: I am 99.5% sure that this is gonna be a great presentation!
One great thing about randomness: the theory already exists. Can leverage 100s of years of probability theory.
What Are Big Data Analytics? The process of extracting useful (actionable) knowledge from a very large data set (or sets)
What Are Big Data Analytics? Examples: Segmenting/clustering customers for data reduction/viz. Learning rules from data: beer implies diapers. Learning to recognize the Higgs boson in the aftermath of a collision.
What Are Big Data Analytics? Back in the day (that is, the 1990s), related buzzwords were: Data Warehousing: building a large database where operational data go to retire. On-Line Analytic Processing (OLAP): a combination data model / viz tool / query interface for a multidimensional database. Data Mining: the application of scalable intelligent knowledge discovery algorithms to data. One of the pioneers in data mining (Usama Fayyad) worked at JPL.
What Are Big Data Analytics? Before the 2000s, many tools for Big Data analytics were DB-centric. That is, they worked in concert with or were based upon RDBMS technology.
What Are Big Data Analytics? What changed in the early 2000s?
What changed in the early 2000s? The rise of the big Internet company, esp. Google and Yahoo! The big Internet companies were never enamored of RDBMS tech.
They felt it was too inflexible: difficult to implement things like PageRank, didn't lend itself to analyzing sequence-based user data, bad for analyzing graphs, many other annoyances.
Also too tied to proprietary hardware, and too slow: didn't scale to 1000s of machines. By the early-mid 2000s... the early Internet companies began to publicize the infrastructure they'd built. The most influential idea was MapReduce; see http://dl.acm.org/citation.cfm?id=1327492 for the seminal 2004 paper on MR. In the last 5 years or so, an entire NoSQL ecosystem for BDA has sprung up. Tools include: Hadoop, Cassandra, HBase, Pig, Hive...
Soooo... We've defined stochastic models... We've defined Big Data analytics. What is meant by Stochastic Analytics? The application of stochastic models/methods to problems in Big Data analytics.
With the wrinkle that the result of the analysis itself is stochastic/random. Of key importance, because it gives the analyst an idea of the risk/uncertainty of the result.
I'll be illustrating this with an example. But first, a quick crash course on Bayesian statistics; important because the Bayesian approach underlies much of stochastic analytics.
Bayesian Statistics Imagine that we come up with an exam for a college course. Want to run a trial with a few students. Goal is to see if this exam is too hard or too easy. That is, want to say something about what we think the mean exam score will be, and what the spread (or variance) of the score will be.
So How Would a Bayesian Do This? First imagine a stochastic generative process for the data. For the ith student, we might imagine score_i ~ Normal(μ, σ²), meaning the score is sampled from an RV with a normal distribution; the ultimate goal is to guess the value of μ. Why Normal? Models typical bell-shaped data. Assume μ and σ² are also generated by sampling from RVs, and are unknown.
So How Would a Bayesian Do This? How is the mean generated? We imagine μ ~ Normal(75, 100). Why Normal(75, 100)? Allows for all possible, reasonable exam averages.
So How Would a Bayesian Do This? How is the variance (spread) generated? We imagine σ² ~ InverseGamma(1, 1). Why InverseGamma(1, 1)? Allows a really large range of positive values for σ².
Thus, Our Generative Process Is: Step 1: σ² ~ InverseGamma(1, 1). Step 2: μ ~ Normal(75, 100). Step 3: for each i, score_i ~ Normal(μ, σ²).
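To make the process concrete, here is a minimal sketch of Steps 1 through 3 in Python/numpy. The seed and the number of students are arbitrary choices; the InverseGamma(1, 1) draw is taken as the reciprocal of a Gamma(1, 1) draw.

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_scores(n_students):
        # Step 1: sigma^2 ~ InverseGamma(1, 1), drawn as the reciprocal
        # of a Gamma(shape=1, scale=1) sample
        sigma2 = 1.0 / rng.gamma(shape=1.0, scale=1.0)
        # Step 2: mu ~ Normal(75, 100); numpy takes the standard
        # deviation, so we pass sqrt(100) = 10
        mu = rng.normal(loc=75.0, scale=10.0)
        # Step 3: each student's score ~ Normal(mu, sigma^2)
        scores = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=n_students)
        return mu, sigma2, scores

    mu, sigma2, scores = generate_scores(6)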
Why Have a Generative Process? Given the gen process, we can use the PDF to measure how likely a given (μ, σ²) pair is. Specifically, p(μ, σ²) = InverseGamma(σ² | 1, 1) × Normal(μ | 75, 100). So given any μ, σ² combo, we can say how likely we think it is. This is our prior on the mean and variance. Note: we multiply because the two values are independent.
Bayesian Inference To a Bayesian, inference is then the process of updating prior expectations after seeing the data (here, the test scores). After we see the data: p(μ, σ² | data) = p(data | μ, σ²) p(μ, σ²) / p(data). This equation follows directly from Bayes' Theorem. Note: this is also a distribution, known as the posterior distribution.
Pictorially You have a prior distribution on mean and variance [figure: prior density over μ (average of test scores) and σ² (variance of test scores)]
Pictorially You see some test scores [figure: scores for students 1 through 6, which seem to average in the high 70s]
Pictorially Then use Bayes to get a posterior distribution on mean and variance [figure: posterior density over μ and σ², much narrower after seeing the data!]
That's the Bayesian Approach Come up with a generative process, which includes prior distributions on the quantities we'd like to estimate (in our example, the variance and especially the mean). See some data. Use Bayes' Theorem and the data to update the priors; this gives you a posterior distribution. The posterior is your estimate. Why attractive for use in Big Data analytics? The answer is a distribution, so it gives you some idea of the uncertainty/risk of inferences made using the data.
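As a concrete (if toy) illustration of the update, here is a minimal sketch that evaluates the exam-example posterior on a grid using scipy. The six scores and the grid bounds are invented for illustration; only the priors come from the slides.

    import numpy as np
    from scipy import stats

    # Hypothetical observed scores (illustrative only)
    scores = np.array([78.0, 81.0, 76.0, 79.0, 77.0, 80.0])

    # Grid over (mu, sigma^2)
    mu_grid = np.linspace(40, 110, 200)
    s2_grid = np.linspace(0.5, 45, 200)
    MU, S2 = np.meshgrid(mu_grid, s2_grid)

    # Log-prior: mu ~ Normal(75, 100), sigma^2 ~ InverseGamma(1, 1)
    log_prior = (stats.norm.logpdf(MU, loc=75, scale=10)
                 + stats.invgamma.logpdf(S2, a=1, scale=1))

    # Log-likelihood: each score ~ Normal(mu, sigma^2), independently
    log_lik = sum(stats.norm.logpdf(s, loc=MU, scale=np.sqrt(S2)) for s in scores)

    # Unnormalized log-posterior, then normalize over the grid
    log_post = log_prior + log_lik
    post = np.exp(log_post - log_post.max())
    post /= post.sum()

Plotting post against the prior reproduces the "much narrower after seeing data!" picture above.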
Now Let's Look At an Example We have archived billions of sales records and want to know: What would my profits have been in '08 if I'd cut all of my margins by 10%? How might we use archived data to guess this answer? Need to guess each customer's demand at the new price. Use this to build a hypothetical, revised version of our table storing sales in 2008. Finally, query this hypothetical table. Here's one possibility... utilizing the Bayesian approach.
Step 1: A Stochastic Demand Model First, define a customer demand model... For a given customer, demand is a linear function of price (I know, could do better than linear...) [figure: quantity purchased vs. price of part A]
Step 1: A Stochastic Demand Model The demand curve is generated via samples from twin Gamma distributions: D_0 ~ Gamma(k_D, θ_D) and P_0 ~ Gamma(k_P, θ_P) [figure: linear demand curve with intercept D_0 on the quantity axis and intercept P_0 on the price axis]
Step 1: A Stochastic Demand Model The distribution over D_0, P_0 defines a whole family of possible demand curves... [figure: successive frames show one possible demand curve, then another, and so on]
Step 1: A Stochastic Demand Model The PDF F(f) = Gamma(D_0 | k_D, θ_D) × Gamma(P_0 | k_P, θ_P) is our prior distribution over demand curves
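Here is a minimal sketch of sampling curves from this prior. The hyperparameter values and the exact parameterization f(price) = D_0 (1 - price / P_0) (the line through the two intercepts, floored at zero) are assumptions for illustration, not values from the talk.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical hyperparameters; in Step 2 these would be learned
    k_D, theta_D = 2.0, 50.0   # prior on D_0, demand at price zero
    k_P, theta_P = 2.0, 20.0   # prior on P_0, price where demand hits zero

    def sample_demand_curve():
        # Draw the two intercepts from their twin Gamma priors
        D0 = rng.gamma(shape=k_D, scale=theta_D)
        P0 = rng.gamma(shape=k_P, scale=theta_P)
        # Linear demand through (0, D0) and (P0, 0), floored at zero
        return lambda price, D0=D0, P0=P0: max(0.0, D0 * (1.0 - price / P0))

    f = sample_demand_curve()
    print(f(10.0))   # this sampled customer's demand at price 10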
Step 2: Learn the Model To apply this model, we need to learn the prior. That is, choose k_P, k_D, θ_P, θ_D to be realistic for sales in 2008. Reasonable tactic: use the warehoused data to perform an MLE; that is, find the set of params that best describes all of the 2008 sales. Do this by issuing computations over the warehouse.
Step 3: Apply the Model - the Theory Now we have a prior model (PDF) for the demand function f: F(f | k_P, k_D, θ_P, θ_D). Problem: the actual demand curve for each customer is unseen. But we can use observed demand to obtain a posterior demand model. Ex: for the ith sale, a customer bought d units at price p. Then the posterior demand model is given by: F(f_i | k_P, k_D, θ_P, θ_D, f_i(p) = d).
Step 3: Apply the Model - the Theory Intuitively, F(f_i | f_i(p) = d) gives non-zero weight to all demand functions through the point (p, d)
[figure: successive frames show demand curves passing through (p, d), each of which F accepts, and one that misses the point, which F rejects] Those demand curves that are given non-zero weight are ordered exactly as the prior would order them
To guess the customer's demand at a new price p': Sample a possible f_i from the PDF F(f_i | k_P, k_D, θ_P, θ_D, f_i(p) = d). Compute f_i(p'), and we have a new demand!
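One simple (if inefficient) way to draw from that conditional PDF is rejection sampling: propose curves from the prior and keep one that passes through (p, d). This is only a sketch, under the assumption that "through the point" is relaxed to "within a tolerance"; the hyperparameters are the hypothetical ones from the earlier sketch.

    import numpy as np

    rng = np.random.default_rng(1)
    k_D, theta_D = 2.0, 50.0   # hypothetical hyperparameters, as before
    k_P, theta_P = 2.0, 20.0

    def sample_posterior_demand(p, d, tol=2.0, max_tries=100_000):
        # Rejection sampling: accept only prior draws near (p, d).
        # Accepted curves keep their prior ordering, as the slide says.
        for _ in range(max_tries):
            D0 = rng.gamma(shape=k_D, scale=theta_D)
            P0 = rng.gamma(shape=k_P, scale=theta_P)
            f = lambda price, D0=D0, P0=P0: max(0.0, D0 * (1.0 - price / P0))
            if abs(f(p) - d) <= tol:
                return f
        raise RuntimeError("no curve accepted; widen tol")

    f_i = sample_posterior_demand(p=10.0, d=40.0)   # hypothetical sale
    print(f_i(9.0))   # sampled demand at a new, lower price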
Step 3: Apply the Model - the Practice To actually compute the overall profit under the new prices: I'd need to sample an f_i from F(f_i | f_i(p) = d) for every sale from 2008, evaluate f_i(p') for all those sales, and compute the new profits.
Step 3: Apply the Model - the Practice Issue: the profits are actually random. You do this once, you get one answer; you do this again, you get another answer. How to handle this? Redo the computation many times (Monte Carlo) to obtain a distribution of results [figure: histogram of additional profits, with both negative and positive outcomes... don't do it!]
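A minimal sketch of that Monte Carlo loop, continuing the rejection-sampling sketch above (it reuses sample_posterior_demand). The three sales records, the unit costs, and the reading of "cut margins by 10%" as new_price = cost + 0.9 (old_price - cost) are all assumptions for illustration.

    # (price paid, units bought, unit cost) for each 2008 sale -- invented data
    sales = [(10.0, 40.0, 6.0), (12.0, 25.0, 7.0), (9.0, 55.0, 5.0)]

    def one_trial():
        total = 0.0
        for p, d, cost in sales:
            f_i = sample_posterior_demand(p, d)   # posterior draw per sale
            p_new = cost + 0.9 * (p - cost)       # margin cut by 10%
            total += f_i(p_new) * (p_new - cost)  # profit at the new price
        return total

    # Repeating the whole computation gives a distribution, not one number
    profits = [one_trial() for _ in range(1000)]

The fraction of negative draws in profits is exactly the kind of risk estimate the histogram above depicts.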
Illustrative of Stochastic Analytics Begins with a stochastic model. The model is applied at very fine granularity to a big data set, in contrast to the summarize-then-analyze approach. The result of the analysis is a distribution, not a single result, often obtained using Bayesian techniques.
Why Is It Important to Get a Distribution? Gives the final consumer a picture of uncertainty/risk. Especially when applying the Bayesian approach... lack of data comes through as very wide tails. People are increasingly aware of the pitfalls of not quantifying risk [figure: profit distribution with mean -$20M and a long tail reaching a $1B loss... would YOU choose this?]
Systematic Quant. of Uncertainty/Risk A worthwhile polemic is Sam Savage's (Stanford) The Flaw of Averages, a book on why we underestimate risk. Savage describes two f/laws: Flaw of averages (weak form): the mean is correct, but the variance is ignored. Flaw of averages (strong form): the mean itself is wrong, because f(E[X]) ≠ E[f(X)].
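A tiny numeric illustration of the strong form, with an invented capacity-capped sales function: plugging the average demand into f overstates the true average outcome.

    import numpy as np

    rng = np.random.default_rng(0)

    # X: uncertain demand; f: sales capped by capacity (nonlinear!)
    X = rng.normal(50.0, 30.0, size=1_000_000)
    f = lambda x: np.minimum(x, 50.0)

    print(f(X.mean()))   # f(E[X]): about 50 -- the "plug in the average" answer
    print(f(X).mean())   # E[f(X)]: about 38 -- the true average outcome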
Another Example: Linear Regression
Another Example: Linear Regression In linear regression, we have a number of regressors and outcomes. Linear regression attempts to fit a line to these [figure: scatter of income (outcome) vs. age (regressor), with a fitted line]
So as to predict the outcome associated with a new data point [figure: given a known age, read the predicted income off the fitted line]
Linear regression is widely used in practice, esp. in Big Data analytics. But typically the least-squares (LS) version is applied, not the stochastic/Bayesian variant.
Bayesian Linear Regression Imagine the following stochastic generative process. First, generate the variability of the regression coefs, b ~ InvGamma(1, 1). Then, for each regressor, generate the regression coef c_i ~ Laplace(b). Then generate the variance of the prediction error, σ² ~ InvGamma(1, 1). Finally, for a data point (x_0 = 1, x_1, x_2, ..., x_n), generate the outcome y ~ Normal(Σ_i x_i c_i, σ²).
Bayesian Linear Regression [figure frames: in the income-vs.-age picture, the intercept of the line is c_0 and the slope is c_1; given a regressor value x_1, y is normally distributed around the corresponding point on the line; we imagine the entire data set is generated in this way...]
Bayesian Linear Regression As an aside: why use the Laplace distribution? (One standard answer: a Laplace prior concentrates mass near zero, so, like L1/lasso regularization, it favors sparse coefficient vectors.)
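A minimal sketch of this generative process, following the reconstruction above. The age range, the number of points, and the use of a single regressor are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(4)

    def generate_blr_data(n_points, n_regressors):
        # Variability of the regression coefs: b ~ InvGamma(1, 1)
        b = 1.0 / rng.gamma(shape=1.0, scale=1.0)
        # One coef per regressor plus the intercept, each Laplace(b)
        c = rng.laplace(loc=0.0, scale=b, size=n_regressors + 1)
        # Variance of the prediction error: sigma^2 ~ InvGamma(1, 1)
        sigma2 = 1.0 / rng.gamma(shape=1.0, scale=1.0)
        # Data points with x0 = 1 for the intercept
        X = np.column_stack([np.ones(n_points),
                             rng.uniform(20, 65, size=(n_points, n_regressors))])
        y = rng.normal(loc=X @ c, scale=np.sqrt(sigma2))
        return X, y, c, sigma2

    X, y, c, sigma2 = generate_blr_data(n_points=1000, n_regressors=1)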
Bayesian Linear Regression Then, imagine we have a huge data set [figure: dense scatter of income (outcome) vs. age (regressor)]... which we imagine was generated using unknown params. Use Bayes' rule to get a posterior distribution on the coefs (challenging for high-d, Big Data!). Then we can use that distribution to generate a distribution of outcomes given a new data point.
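Putting those two steps together, here is a minimal sketch that approximates the posterior with a random-walk Metropolis sampler (one simple MCMC method; the slides do not prescribe an inference algorithm) and then forms a distribution of predicted incomes at age 40. The data, the fixed Laplace scale b = 10 (rather than giving b its own prior), the step sizes, and the chain length are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(5)

    # Hypothetical data: income (in $1000s) vs. age, invented for illustration
    age = rng.uniform(20, 65, size=500)
    income = 20 + 0.9 * age + rng.normal(0, 5, size=500)
    X = np.column_stack([np.ones_like(age), age])   # x0 = 1 (intercept term)
    y = income

    def log_post(c, sigma2, b=10.0):
        # Unnormalized log-posterior for the generative process above
        if sigma2 <= 0:
            return -np.inf
        resid = y - X @ c
        log_lik = -0.5 * len(y) * np.log(sigma2) - resid @ resid / (2 * sigma2)
        log_prior_c = -np.abs(c).sum() / b               # Laplace(b) on coefs
        log_prior_s2 = -2 * np.log(sigma2) - 1 / sigma2  # InvGamma(1, 1)
        return log_lik + log_prior_c + log_prior_s2

    # Random-walk Metropolis over theta = (c0, c1, sigma2)
    theta = np.array([0.0, 1.0, 10.0])
    samples = []
    for _ in range(20_000):
        proposal = theta + rng.normal(0.0, [1.0, 0.02, 1.0])
        if np.log(rng.uniform()) < (log_post(proposal[:2], proposal[2])
                                    - log_post(theta[:2], theta[2])):
            theta = proposal
        samples.append(theta)
    samples = np.array(samples[10_000:])   # discard burn-in

    # Distribution of predicted income at age 40: one draw per posterior sample
    preds = (samples[:, 0] + samples[:, 1] * 40.0
             + rng.normal(0.0, np.sqrt(samples[:, 2])))

The spread of preds, not just its mean, is the deliverable: it is what tells the analyst how uncertain the prediction is.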
Bayesian Linear Regression Why is this Stochastic Analytics? We are using a stochastic model, in conjunction with a very large data set, to perform a (toy) analytic task: predicting the income of a person given an age. And we obtain information regarding the uncertainty of our prediction.
Soooo... We've seen two examples illustrating stochastic analytics. Many more exist; almost any analytic task can be made stochastic by applying an appropriate model. Likely to grow in importance as the application of ML to Big Data becomes common. Why? The Bayesian approach is common in modern ML... natural synergy between SA and the Bayesian approach.
Epilogue: How Might One Implement This? Well, you could use our system: MCDB/SimSQL. MCDB/SimSQL is itself built on Hadoop/MapReduce; that is, a high-level spec in MCDB/SimSQL ends up running on Hadoop. Apache Hadoop is the de-facto open-source implementation of MapReduce.
What Is MapReduce? It is a simple Big Data processing paradigm. To process a data set, you have two pieces of user-supplied code: a map code and a reduce code. These are run (potentially over a very large compute cluster built using commodity hardware) using three data processing phases: a map phase, a shuffle phase, and a reduce phase.
The Map Phase Assume that the input data are stored in a huge file. This file contains a simple list of pairs of type (key1, value1). And assume we have a user-supplied function of the form map(key1, value1) that outputs a list of pairs of the form (key2, value2). In the map phase of the MapReduce computation, this map function is called for every record in the input data set. Instances of map run in parallel all over the compute cluster.
The Reduce Phase Assume we have a user-supplied function of the form reduce(key2, list<value2>) that outputs a list of value2 objects. In the reduce phase of the MapReduce computation, this reduce function is called for every key2 value output by the shuffle (described next). Instances of reduce run in parallel all over the compute cluster, and the output of all of those instances is collected in a (potentially) huge output file.
The Shuffle Phase The shuffle phase accepts all of the (key2, value2) pairs from the map phase, and it groups them together, so that all of the pairs, from all over the cluster, having the same key2 value are merged into a single (key2, list<value2>) pair. Called a shuffle because this is where a potential all-to-all data transfer happens.
Why Do People Like MapReduce? Can specify a computation (such as WordCount) in a few lines of code (Java, in the case of Hadoop). The MapReduce platform takes care of distributing it to 1000s of cores. Great for data-parallel tasks, those where we have a lot of data and can move the computation to the data. Codes can be much simpler than codes in other DP paradigms. A toy WordCount in this style is sketched below.
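To show the shape of the paradigm, here is a minimal single-machine Python simulation of the three phases applied to WordCount. This is not Hadoop's actual Java API; the function names and the in-memory "shuffle" are invented for illustration.

    from collections import defaultdict

    # map: (key1, value1) -> list of (key2, value2); here value1 is a line of text
    def word_map(_, line):
        return [(word, 1) for word in line.split()]

    # reduce: (key2, list<value2>) -> list of outputs; here, sum the counts
    def word_reduce(word, counts):
        return [(word, sum(counts))]

    def map_reduce(records, map_fn, reduce_fn):
        # Map phase: call map_fn on every input record
        mapped = [kv for k1, v1 in records for kv in map_fn(k1, v1)]
        # Shuffle phase: group all pairs that share the same key2
        groups = defaultdict(list)
        for k2, v2 in mapped:
            groups[k2].append(v2)
        # Reduce phase: call reduce_fn once per key2
        return [out for k2, vs in groups.items() for out in reduce_fn(k2, vs)]

    lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
    print(map_reduce(lines, word_map, word_reduce))
    # [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]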
Questions?