STOCHASTIC ANALYTICS. Chris Jermaine Rice University

Transcription

1 STOCHASTIC ANALYTICS Chris Jermaine Rice University 1

2 What Is Meant by Stochastic Analytics? 2

3 What Is Meant by Stochastic Analytics? It is actually a relatively little-used term (to date!) It means tackling (Big Data) analytic tasks using stochastic models Will now define stochastic models and (Big Data) analytics 3

4 First: What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? 4

5 What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? Randomness provides a way to model uncertainty Have a data record describing J. Smith Both John Smith and Jane Smith are people in your data set Record a 50/50 chance of J. Smith referring to John/Jane 5

6 What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? Randomness provides a way to model uncertainty Provides a way to model missing data Missing a person s gender? 52% of people in the data set are women... Represent the gender via a distribution (52% female, 48% male) 6

7 What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? Randomness provides a way to model uncertainty Provides a way to model missing data Provides for a principled way to talk about beliefs I am 99.5% sure that this is gonna be a great presentation! 7

8 What Is a Stochastic Model? A model for some slice of reality which has a random component Here random means probabilistic There is typically a distribution over possible inputs and/or outcomes Why utilize randomness? Randomness provides a way to model uncertainty Provides a way to model missing data Provides for a principled way to talk about beliefs One great thing about randomness: The theory already exists Can leverage 100s of years of probability theory 8

9 What Are Big Data Analytics? The process of extracting useful (actionable) knowledge from a very large data set (or sets) 9

10 What Are Big Data Analytics? The process of extracting useful (actionable) knowledge from a very large data set (or sets) Examples: Segmenting/clustering customers for data reduction/viz Learning rules from data: beer implies diapers Learning to recognize Higgs boson in the aftermath of a collision 10

11 What Are Big Data Analytics? The process of extracting useful (actionable) knowledge from a very large data set (or sets) Back in the day (that is, the 1990 s), related buzzwords were: Data warehousing Building a large database where operational data go to retire On-Line Analytic Processing (OLAP) Combination data model / viz tool / query interface for a multidimensional database Data Mining The application of scalable intelligent knowledge discovery algorithms to data One of the pioneers in data mining (Usama Fayyad) worked at JPL 11

12 What Are Big Data Analytics? Before 2000s, many tools for Big Data analytics were DB-centric That is, they worked in concert with or were based upon RDBMS technology 12

13 What Are Big Data Analytics? What changed in early 2000s? 13

14 What Are Big Data Analytics? What changed in early 2000s? The rise of the big Internet company, esp. Google, Yahoo! The big Internet companies were never enamored of RDBMS tech 14

15 What Are Big Data Analytics? What changed in early 2000s? The rise of the big Internet company, esp. Google, Yahoo! The big Internet companies were never enamored of RDBMS tech Felt it was too inflexible Difficult to implement things like PageRank Didn t lend itself to analyzing sequence-based user data Bad for analyzing graphs Many other annoyances 15

16 What Are Big Data Analytics? What changed in early 2000s? The rise of the big Internet company, esp. Google, Yahoo! The big Internet companies were never enamored of RDBMS tech Felt it was too inflexible Too tied to proprietary hardware Too slow, didn t scale to 1000 s of machines By early-mid 2000s......the early Internet companies began to publicize the infrastructure they d built Most influential idea was MapReduce See for the seminal 2004 paper on MR In last 5 years or so, an entire NoSQL ecosystem for BDA has sprung up Tools include: Hadoop, Cassandra, HBase, Pig, Hive... 16

17 Soooo... We ve defined stochastic models... We ve defined Big Data analytics What is meant by Stochastic Analytics? The application of stochastic models/methods to problems in Big Data analytics 17

18 Soooo... We ve defined stochastic models... We ve defined Big Data analytics What is meant by Stochastic Analytics? The application of stochastic models/methods to problems in Big Data analytics With the wrinkle that the result of the analysis itself is stochastic/random Of key importance because it gives analyst an idea of risk/uncertainty of result 18

19 Soooo... We ve defined stochastic models... We ve defined Big Data analytics What is meant by Stochastic Analytics? The application of stochastic models/methods to problems in Big Data analytics With the wrinkle that the result of the analysis itself is stochastic/random Of key importance because it gives analyst an idea of risk/uncertainty of result I ll be illustrating this with an example But first a quick crash course on Bayesian statistics Important because the Bayesian approach underlies much of stochastic analytics 19

20 Bayesian Statistics Imagine that we come up with an exam for a college course Want to run a trial with a few students Goal is to see if this exam is too hard or too easy That is, want to say something about what we think the mean exam score will be And what the spread (or variance) of the score will be 20

21 So How Would a Bayesian Do This? First imagine a stochastic generative process for the data For the ith student, we might imagine score i ~ Normal ( μ, σ 2 ) ultimate goal is to guess the value of mu Means is sampled from a RV with a normal dist. Why Normal? Models typical bell shaped data Assume μ, σ 2 are also generated by sampling from RVs, and are unknown 21

22 So How Would a Bayesian Do This? How is the mean generated? We imagine μ ~ Normal (70, 100) Why Normal (70, 100)? Allows for all possible, reasonable exam averages 22

23 So How Would a Bayesian Do This? How is the variance (spread) generated? We imagine ~ InverseGamma (1, 1) σ 2 Why InverseGamma (1, 1)? Allows a really large range of positive vals σ 2 23

24 Thus, Our Generative Process Is Step 1: ~ InverseGamma (1, 1) σ 2 Step 2: μ ~ Normal (75, 100) Step 3: for each i, score i ~ Normal ( μ, σ 2 ) 24

25 Why Have a Generative Process? Given gen process, can use PDF to measure how likely a given pair is Specifically, p( μ, σ 2 ) = IG ( σ 2 1, 1) Normal ( μ 75, 100) So given any μ, σ 2 combo, we can say how likely we think it is This is our prior on mean and variance Note times cause the two values are indep. 25

26 Bayesian Inference To a Bayesian, inference is then the process of updating prior expectations after seeing the data After we see the data: p( μ, σ 2 data) p( data μσ, 2 ) p( μσ, 2 ) p( data) This is equation follows directly from Bayes Theorem Note: this is also a distribution = Known as a posterior distribution test scores 26

27 Pictorially You have a prior distribution on mean and variance: sigma 2 (variance of test scores) mu (average of test scores) 27

28 Pictorially You see some test scores sigma 2 (variance of test scores) mu (average of test scores) seem to average in the high 70 s student 1 student 2 student 3 student 4 student 5 student 6 28

29 Pictorially Then use Bayes to get a posterior distribution on mean and var: sigma 2 (variance of test scores) mu (average of test scores) much narrower after seeing data! 29

30 That s the Bayesian Approach Come up with a generative process Which includes prior distributions on the quantities like to est In our example, the variance and especially the mean See some data Use Bayes Theorem and data to update the priors This gives you a posterior dist The posterior is your estimate Why attractive for use in Big Data analytics? Answer is a dist So it gives you some idea of uncertainty/risk of inferences made using the data 30

31 Now Let s Look At an Example We have archived billions of sales records and want to know: What would my profits have been in 08 if I d cut all of my margins by 10%? How might we use archived data to guess this answer? Need to guess each customer s demand at new price Use to build a hypothetical, revised version our table storing sales in 2008 Finally, query this hypothetical table Here s one possibility......utilizing the Bayesian approach 31

32 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased For a given customer, demand is a linear function (I know, could do better than linear...) Price of part A 32

33 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) Demand curve is generated via samples from twin Gamma distributions 33

34 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) Distribution over D 0, P 0 defines a whole family of possible demand curves... 34

35 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) Here s a new, possible demand curve... 35

36 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) And another... 36

37 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) And so on... 37

38 Step 1: A Stochastic Demand Model First, define a customer demand model... Quantity purchased D 0 Gamma( k D, θ D ) P 0 Price of part A Gamma( k P, θ P ) The PDF Ff (.) = Gamma( D 0 k D, θ D ) Gamma( P 0 k P, θ P ) is our prior distribution over demand curves 38

39 Step 2: Learn the Model To apply this model, we need to learn the prior That is, choose k P, k D, θ P, θ D to be realistic for sales in 2008 Reasonable tactic: use the warehoused data to perform an MLE That is, find the set of params that best describes all of the 2008 sales Do this by issuing computations over the warehouse 39

40 Step 3: Apply the Model - the Theory Now we have a prior model (PDF) for demand function f: Ff ( k P, k D, θ P, θ D ) Problem: the actual demand curve for each customer is unseen But can use observed demand to obtain a posterior demand model Ex: for ith sale, a customer bought d units at price p Then posterior demand model is given by: Ff ( i k P, k D, θ P, θ D, f i ( p) = d) 40

41 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) 41

42 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) So F accepts this one... Price of part A 42

43 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) And this one... Price of part A 43

46 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) But not this one Price of part A 46

47 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) But not this one Price of part A Those demand curves that are given non-zero weight are ordered exactly as the prior would order them 47

48 Step 3: Apply the Model - the Theory Intuitively, F(f i f i (p) = d) gives non-zero weight to all demand functions thru the point (p, d) Quantity purchased (p, d) But not this one Price of part A Those demand curves that are given non-zero weight are ordered exactly as the prior would order them To guess the customer s demand at new price p : Sample a possible f i from the PDF Ff ( i k P, k D, θ P, θ D, f i ( p) = d) Compute f i ( p ), and we have a new demand! 48

49 Step 3: Apply the Model - the Practice To actually compute the overall profit under new prices: I d need to sample an f i from F(f i f i (p) = d) for every sale from 2008 Evaluate f i ( p ) for all those sales Compute the new profits 49

50 Step 3: Apply the Model - the Practice Issue: the profits are actually random You do this once, you get one answer You do this again, you get another answer How to handle this? Redo the computation many times (Monte Carlo) to obtain a distribution of results number of obs negative positive additional profits 0 1M don t do it! 50

51 Illustrative of Stochastic Analytics Begins with a stochastic model Model is applied at very fine granularity to big data set In contrast to summarize-then-analyze approach Result of analysis is a distribution, not a single result Often obtained using Bayesian techniques 51

52 Why Is It Important to Get a Distribution? Gives the final consumer a picture of uncertainty/risk Especially when applying Bayesian approach... lack of data comes through as very wide tails People increasingly aware of pitfalls of not quantifying risk mean:-$20m... would YOU choose this? -100M 0 100M... 1B loss 52

53 Systematic Quant. of Uncertainty/Risk Worthwhile polemic is Sam Savage (Stanford) Flaw of Averages Savage describes two f/laws : Flaw of averages (weak form): Flaw of averages (strong form): Mean correct, Variance ignored Sam Savage s book (why we underestimate risk) Wrong value of mean: f(e[x]) E[f(X)] 21 53

54 Another Example: Linear Regression 54

55 Another Example: Linear Regression In linear regression, have a number of regressors and outcomes Linear regression attempts to fit a line to these income (outcome) age (regressor) 55

56 Another Example: Linear Regression In linear regression, have a number of regressors and outcomes Linear regression attempts to fit a line to these income (outcome) age (regressor) 56

57 Another Example: Linear Regression In linear regression, have a number of regressors and outcomes Linear regression attempts to fit a line to these income (outcome) age (regressor) known age predicted income So as to predict the outcomes associated with a new data point 57

58 Another Example: Linear Regression In linear regression, have a number of regressors and outcomes Linear regression attempts to fit a line to these income (outcome) age (regressor) known age predicted income So as to predict the outcomes associated with a new data point Widely used in practice, esp. in Big Data analytics But typically LS version is applied; not the stochastic/bayesian variant 58

59 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i 59

60 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i income (outcome) intercept is c slope is c 0 1 age (regressor) 60

61 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i income (outcome) given x 1 age (regressor) 61

62 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i income (outcome) y is normally distributed around this point age (regressor) 62

63 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 i income (outcome) We imagine entire data set is generated in this way... age (regressor) 63

64 Bayesian Linear Regression Imagine the following stochastic generative process First, generate the variability of regression coefs, For each regressor, the regression coef c i ~ Laplace( b) v ~ InvGamma( 1, 1) Then generate the variance of the prediction error, σ 2 ~ InvGamma( 1, 1) Finally, for a data point x 0 = 1, x 1, x 2, x n, outcome y ~ Normal x i c i, σ 2 As an aside: why use Laplace distribution? i 64

65 Bayesian Linear Regression Then, imagine we have a huge data set... income (outcome)...we imagine generated using unknown params age (regressor) 65

66 Bayesian Linear Regression Then, imagine we have a huge data set... income (outcome) age (regressor) use Bayes rule to get posterior distribution on coefs (Challenging for high-d, Big Data!) 66

67 Bayesian Linear Regression Then, imagine we have a huge data set... income (outcome) given x 1 age (regressor) can use that distribution to generate a distribution of outcomes given a new data point 67

68 Bayesian Linear Regression income (outcome) age (regressor) Why is this Stochastic Analytics? We are using a stochastic model In conjunction with a very large data set To perform a (toy) analytic task: predicting income of a person given an age And we obtain information re the uncertainty of our prediction 68

69 Soooo... We ve seen two examples illustrating stochastic analytics Many more exist Almost any analytic task can be made stochastic by applying appropriate model Likely to grow in importance as application of ML to Big Data becomes common Why? The Bayesian approach is common in modern ML......natural synergy between SA and the Bayesian approach 69

70 Epilogue: How Might One Implement This? Well, you could use our system: MCDB/SimSQL MCDB/SimSQL is itself built on Hadoop/MapReduce That is, a high level spec in MCDB/SimSQL ends up running on Hadoop Apache Hadoop is the de-facto open-source implementation of MapReduce 70

71 What Is MapReduce? It is a simple Big Data processing paradigm To process a data set, you have two pieces of user-supplied code: A map code And a reduce code These are run (potentially over a very large compute cluster built using commodity hardware) using three data processing phases A map phase A shuffle phase And a reduce phase 71

72 The Map Phase Assume that the input data are stored in a huge file This file contains a simple list of pairs of type (key1,value1) And assume we have a user-supplied function of the form map (key1,value1) That outputs a list of pairs of the form (key2, value2) In the map phase of the MapReduce computation this map function is called for every record in the input data set Instances of map run in parallel all over the compute cluster 72

73 The Reduce Phase Assume we have a user-supplied function of the form reduce (key2,list <value2>) That outputs a list of value2 objects In the reduce phase of the MapReduce computation this reduce function is called for every key2 value output by the shuffle Instances of reduce run in parallel all over the compute cluster The output of all of those instances is collected in a (potentially) huge output file 73

74 The Shuffle Phase The shuffle phase accepts all of the (key2, value2) pairs from the map phase And it groups them together So that all of the pairs From all over the cluster Having the same key2 value Are merged into a single (key2, list <value2>) pair Called a shuffle because this is where a potential all-to-all data transfer happens 74

75 Why Do People Like MapReduce? Can specify a computation (such as WordCount) In a few lines of code (Java in the case of Hadoop) MapReduce platform takes care of distributing to 1000 s of cores Great for data parallel tasks Those where have a lot of data and can move computation to data Codes can be much simpler than codes in other DP paradigms 75

76 Questions? 76