Big Data, Statistics, and the Internet


Steven L. Scott (Google), April 2014

Summary
Big data live on more than one machine. Computing takes place in the MapReduce / Hadoop paradigm, where communication is expensive. Methods that treat the data as a single object don't work. We need to treat MapReduce as a basic assumption.

Outline of the talk
Big data, statistics, and the Internet: big data are real, and sometimes necessary; Bayes is important; experiments in the Internet age.
MapReduce: a case study in how to do it wrong.
Consensus Monte Carlo.
Examples: binomial, logistic regression, hierarchical models, BART.
Conclusion.

Dan Ariely, January 2013 Facebook post:
"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it..."

Big data, machine learning, and statistics
Statistics: models summarize data so humans can better understand them. Does spending more money on ads make my business more profitable? Is treatment A better than treatment B? Sampling/aggregation works just fine. (Humans can't handle the complexity anyway.)
Machine learning: models allow machines to make decisions. Google search, Amazon product recommendations, Facebook news feed, ... Need big data to match diverse users to diverse content. The canonical problem in Silicon Valley is personalization.

Personalization is a big logistic regression
Response = convert (yes / no). Data are structured...
Search queries: search query, search history (long term), session history, demographics, ...
Search results: individual pages, one boxes, nested in sites, links between sites, ...
Ads: ad creative, keywords, landing page quality, nested in campaigns, ...
... but the structure is often ignored.
Many billions of queries and potential results, a billion+ users, many millions of ads (at any given time). There are content × user interactions.

Sparsity
Sparsity plays an important role in modeling internet data. Models are big because of a small number of factors with many levels. Big data problems are often big collections of small data problems.

Bayesian ideas remain important in big data
Bayesian themes:
Prediction: average over unknowns, don't maximize.
Uncertainty: probability coherently represents uncertainty.
Combining information: hierarchical models combine information from multiple sources.
The importance of better predictions is obvious. Now for an example of the others...

Classical vs internet experiments
Classical experiments: results take a long time; planning is important; Type I errors are costly.
Internet experiments: results come quickly and sequentially; all cost is opportunity cost; the cost of a Type I error is 0.

Multi-armed bandits
Entirely driven by uncertainty.
Problem statement: there are several potential actions available. Rewards come from a distribution f_a(y | θ). If you knew θ you could choose the best a. Run a sequential experiment to find the best a by learning θ.
Tradeoff: Explore (experiment in case your model is wrong) vs. Exploit (do what your model says is best).
Thompson sampling heuristic: given current data y, compute
  w_a = Pr(action a is best | y) = ∫ Pr(action a is best | θ) p(θ | y) dθ,
then assign action a with probability w_a.
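To make the heuristic concrete, here is a minimal Thompson sampling sketch for Bernoulli rewards with Beta(1, 1) priors. It is not from the talk; the arm success rates in `true_rates` are made-up values.

```python
# Thompson sampling for Bernoulli arms with Beta(1, 1) priors.  Each round we
# draw one theta per arm from its posterior and play the arm with the largest
# draw, so arm a is played with probability w_a = Pr(arm a is best | data).
import numpy as np

rng = np.random.default_rng(0)
true_rates = np.array([0.04, 0.05, 0.06])   # unknown in practice (made up here)
successes = np.ones(3)                      # Beta alpha parameters
failures = np.ones(3)                       # Beta beta parameters

for t in range(10_000):
    theta = rng.beta(successes, failures)   # one posterior draw per arm
    a = int(np.argmax(theta))               # Thompson sampling choice
    reward = rng.random() < true_rates[a]   # observe a Bernoulli reward
    successes[a] += reward
    failures[a] += 1 - reward

print("posterior means:", successes / (successes + failures))
print("plays per arm:  ", successes + failures - 2)
```

Over time the plays concentrate on the best arm while the worse arms keep getting occasional exploratory traffic.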

Application: Website optimization
For details, search Google for [google analytics multi-armed bandit].
Each website is run as an independent experiment. Too easy for a big data problem.

What about ad optimization?
The font, text size, and background color of the ad should work well with the font, text size, and color scheme of the domain.
There are 600,000 domains in the (proof of concept) data. Many have too little traffic to support an independent experiment.
Need uncertainty + information pooling.

Test case: hierarchical logistic regression
logit Pr(y_di = 1 | β_d) = β_d^T x_di,   β_d ~ N(µ, Σ),   p(µ, Σ) = normal-inverse-Wishart.
d is one of 600,000 internet domains (blah.com).
y_di indicates whether the ad on domain d was clicked at impression i.
x_di contains details about ad characteristics: fonts, colors, etc. It has a relatively small number of dimensions.
NOTE: the model is (somewhat) pedagogical. More sophisticated shrinkage is needed.
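For intuition, here is a toy simulation from this kind of model (my sketch, not from the talk); the sizes are made up, with a few dozen domains and five binary ad features standing in for the 600,000-domain problem.

```python
# Toy version of the hierarchical logistic regression: beta_d ~ N(mu, Sigma),
# click indicators drawn with probability logistic(beta_d' x_di).
import numpy as np

rng = np.random.default_rng(1)
n_domains, n_pred, n_per_domain = 50, 5, 200

mu = rng.normal(0.0, 1.0, n_pred)                       # global coefficient mean
Sigma = 0.25 * np.eye(n_pred)                           # global covariance
beta = rng.multivariate_normal(mu, Sigma, n_domains)    # beta_d ~ N(mu, Sigma)

X = rng.integers(0, 2, (n_domains, n_per_domain, n_pred)).astype(float)
logits = np.einsum("dip,dp->di", X, beta)               # beta_d' x_di
y = rng.random((n_domains, n_per_domain)) < 1.0 / (1.0 + np.exp(-logits))
print("overall click rate:", y.mean())
```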

The obvious algorithm
Alternate between the following steps many times.
Map: each worker simulates p(β_d | y_d, µ, Σ).
Reduce: the master node simulates p(µ, Σ | β).
Theoretically identical to the standard single-machine MCMC. Works great on HPC/GPU. It works terribly in MapReduce.
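A rough sketch of this alternating structure (my illustration, not the talk's code), shown on a conjugate normal-normal hierarchy instead of logistic regression so that both conditional draws are exact. The map step, which MapReduce would run on one worker per shard, is simulated with a Python loop; all sizes and variances are made up. The point to notice is that every iteration requires a full map and reduce round.

```python
# Alternating "Map" / "Reduce" Gibbs sampler for y_di ~ N(beta_d, sigma2),
# beta_d ~ N(mu, tau2), flat prior on mu, with sigma2 and tau2 known.
import numpy as np

rng = np.random.default_rng(2)
sigma2, tau2 = 1.0, 0.5
data = [rng.normal(loc, np.sqrt(sigma2), size=100)
        for loc in rng.normal(0, np.sqrt(tau2), 20)]    # 20 shards of 100 obs

mu = 0.0
for step in range(500):                                  # one MapReduce round per step
    # "Map": each worker draws beta_d | y_d, mu  (conjugate normal update)
    beta = []
    for y_d in data:
        prec = len(y_d) / sigma2 + 1.0 / tau2
        mean = (y_d.sum() / sigma2 + mu / tau2) / prec
        beta.append(rng.normal(mean, np.sqrt(1.0 / prec)))
    beta = np.array(beta)
    # "Reduce": the master draws mu | beta under the flat prior
    mu = rng.normal(beta.mean(), np.sqrt(tau2 / len(beta)))

print("final draw of mu:", mu)
```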

SuperStep times
With more workers there is less time per step drawing the β_d and the same amount of work drawing µ, Σ. But the µ, Σ draw takes more time! The extra time is spent coordinating the extra machines.
[Table: number of workers vs. hours per run.] Both configurations are too slow!

The MapReduce paradox
We started off compute constrained. We parallelized. We wound up using essentially 0% of the compute resources on the worker machines.
Conclusion: for Bayesian methods to work in a MapReduce / Hadoop environment, we need algorithms that require very little communication.

Consensus Monte Carlo
Embarrassingly parallel: only one communication.
Partition the data y into shards y_1, ..., y_S.
Run a separate Monte Carlo on each shard-level posterior p(θ | y_s). (MCMC, SMC, QMC, MC, ...)
Combine the draws to form a consensus posterior distribution. (Like doing a meta-analysis.)
This should work great, because most Bayesian problems can be written
  p(θ | y) ∝ ∏_{s=1}^{S} p(θ | y_s).
Complication: you have draws from p(θ | y_s), not the distribution itself.
Benefits: no communication overhead; you can use existing software.
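A minimal sketch of the three steps, assuming a normal-mean model with known variance and a flat prior so that the shard-level "Monte Carlo" can be exact sampling; in real use each shard would run its own MCMC. The data sizes are made up, and the precision weights anticipate the averaging rule discussed two slides later.

```python
# Consensus Monte Carlo skeleton: partition, sample each shard-level posterior,
# combine the draws.  Only one communication: shipping the draws back.
import numpy as np

rng = np.random.default_rng(3)
S, n_draws = 10, 5000
y = rng.normal(2.0, 1.0, 100_000)              # all the data; sigma2 = 1 known
shards = np.array_split(y, S)                  # step 1: partition into shards

draws, weights = [], []
for y_s in shards:                             # step 2: sample each shard posterior
    post_var = 1.0 / len(y_s)                  # flat prior: theta | y_s ~ N(mean(y_s), 1/n_s)
    draws.append(rng.normal(y_s.mean(), np.sqrt(post_var), n_draws))
    weights.append(1.0 / post_var)             # precision weights

draws, weights = np.array(draws), np.array(weights)
consensus = (weights[:, None] * draws).sum(axis=0) / weights.sum()   # step 3: combine

print("consensus mean/sd:", consensus.mean(), consensus.std())
print("exact mean/sd:    ", y.mean(), np.sqrt(1.0 / len(y)))
```

In this Gaussian case the precision-weighted average of the shard draws reproduces the exact full-data posterior, which is the motivation for the averaging rule below.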

Two issues
What to do about the prior? Fractionate: p_s(θ) = p(θ)^(1/S) (reasonable, but beware impropriety and consider shrinkage), or adjust by multiplying / dividing working priors.
How to form the consensus? Average [Scott et al. (2013)], [Wang and Dunson (2013)]. Other methods: importance resampling (SFS?) [Huang and Gelman (2005)], kernel estimates [Neiswanger et al. (2013)], parametric approximation [Huang and Gelman (2005)] [machine learning].
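As a small worked example of fractionation (my illustration, not from the talk): raising a Beta(a, b) prior to the power 1/S gives another Beta kernel, so each shard can keep using a proper Beta prior.

```python
# Beta(a, b)^(1/S) is proportional to theta^((a-1)/S) (1-theta)^((b-1)/S),
# i.e. a Beta kernel with parameters (a - 1)/S + 1 and (b - 1)/S + 1.
def fractionated_beta(a: float, b: float, S: int) -> tuple[float, float]:
    return (a - 1.0) / S + 1.0, (b - 1.0) / S + 1.0

print(fractionated_beta(3.0, 7.0, 10))   # -> (1.2, 1.6)
```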

Averaging individual draws? The Gaussian case
If p_1 and p_2 are normal densities, then the product p_1 · p_2 behaves like prior × likelihood (and is again proportional to a normal density).
If z ~ p_1 p_2, then
  z ~ N( (µ_1/σ_1² + µ_2/σ_2²) / (1/σ_1² + 1/σ_2²),  1 / (1/σ_1² + 1/σ_2²) ).
If x ~ p_1 and y ~ p_2, then w_1 x + w_2 y ~ p_1 p_2, where w_i ∝ 1/σ_i² (normalized to sum to 1).
This requires knowing the weights to use in the average, but:
1. Sometimes you know them.
2. We've got a full Monte Carlo sample for each shard, so we can easily estimate the within-shard variances.
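A quick numerical check of this identity (my sketch, with arbitrary means and variances):

```python
# If x ~ N(mu1, s1^2) and y ~ N(mu2, s2^2) independently, the precision-weighted
# average w1*x + w2*y (w_i proportional to 1/s_i^2, summing to 1) has the same
# mean and variance as the normal proportional to the product density p1 * p2.
import numpy as np

rng = np.random.default_rng(4)
mu1, s1, mu2, s2 = 0.0, 1.0, 3.0, 2.0
w1, w2 = 1 / s1**2, 1 / s2**2
w1, w2 = w1 / (w1 + w2), w2 / (w1 + w2)

x = rng.normal(mu1, s1, 1_000_000)
y = rng.normal(mu2, s2, 1_000_000)
avg = w1 * x + w2 * y

prod_var = 1.0 / (1 / s1**2 + 1 / s2**2)
prod_mean = prod_var * (mu1 / s1**2 + mu2 / s2**2)
print("averaged draws: ", avg.mean(), avg.var())
print("product density:", prod_mean, prod_var)
```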

Averaging vs. other methods
Averaging is robust: not susceptible to infinite resampling weights or the curse of dimensionality. But it can't capture multi-modality or discrete spaces.
Kernel estimates will break down in high dimensions, despite claims of being asymptotically exact. Beware the constant in O(·).
Importance resampling is probably the long-term answer, but details still need to be worked out. What to use for importance weights? Worker-level posteriors can be widely separated and fail to overlap the true posterior, in which case resampling would fail.
Parametric models: variational Bayes gets the marginal histograms right, but not the joint distribution. Parametric assumptions are reasonable if they're reasonable.

Binomial
Bernoulli outcomes containing only a single success, divided across workers.
[Figure: posterior density of the success probability — single machine, consensus, and another single-machine run.]
This beta distribution is highly skewed (non-Gaussian), but consensus still works.

Unbalanced shards are okay
5 workers with very unequal numbers of observations. Weights (proportional to n) handle the information / uncertainty correctly.
[Figure: posterior densities — single machine, consensus, another single machine.]

Be careful with the prior
Beta prior: distribute the prior information ("fake data") evenly across shards. The notion of a natural scale is still a mystery.
If S copies of the prior contain meaningful information, then you need to be careful. If not, then you can afford to be sloppy.
[Figure: posterior densities — single machine, consensus, another single machine.]

Logistic regression
Test data have 5 binary predictor variables x_1, ..., x_5, each appearing with its own frequency and carrying its own true coefficient. The last one is rare, but highly predictive when it occurs.

Data
The data are stored as success counts y out of n trials for each observed combination of the binary predictors x_1, ..., x_5.
[Table: y, n, and x_1–x_5 for each combination.]

[Figure: pairwise scatterplots of posterior draws for the logistic regression coefficients, comparing the overall (single-machine) posterior with matrix-weighted, scalar-weighted, and equally weighted consensus; a small number of observations per worker.]

Why equal weighting fails
Even though workers have identical sampling distributions, the worker-level posteriors for β_5 differ greatly: x_5 is rare, so most workers see few or no cases with x_5 = 1 and their posteriors carry little information about β_5.
[Figure: worker-level posterior densities for β_5.]

[Figure: the same pairwise comparison (overall vs. matrix-weighted and scalar-weighted consensus) with on the order of a hundred observations per worker.]

[Figure: the same pairwise comparison (overall vs. matrix-weighted and scalar-weighted consensus) with on the order of a thousand observations per worker.]

Hierarchical Poisson regression
4 million observations (ad impressions). Predictors include a quality score and indicators for 6 different ad formats. Coefficients vary by advertiser (on the order of a thousand advertisers here).
Data sharded by advertiser: 867 shards. No shard is very large; the median shard size is 7,000 observations.

Hierarchical Poisson regression
Posterior for the hyperparameters given the data shards.
[Figure: marginal and pairwise posterior distributions for the intercept, quality score, and ad-format coefficients AF1–AF6 — overall vs. consensus.]
Very close, even on the joint distribution. The advertiser-level parameters are embarrassingly parallel given the marginal for the hyperparameters, so their draws can be obtained in one more Map step.

Compute time for hierarchical models

BART [Chipman, George, and McCulloch (2010)]
y_i = Σ_j f_j(x_i) + ε_i,   ε_i ~ iid N(0, σ²).
Each f_j is a tree. Priors keep the trees small. The sum-of-weak-learners idea is akin to boosting. Trees have means in the leaves.
Can't average trees, but we can average f̂(x).

Consensus BART vs BART
Fits vs. Friedman's test function; the data are divided across shards.
[Figure: consensus BART vs. single-machine BART; columns show the posterior mean and the lower and upper quantiles.]
Top row: same prior (nearly perfect match). Bottom row: fractionated prior p(θ)^(1/S) (overdispersed posterior).
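A hedged sketch of the "average f̂(x), not the trees" idea, using scikit-learn decision trees as stand-ins for shard-level BART fits and Friedman's test function as the truth; this does not reproduce the talk's actual experiment.

```python
# Each shard fits its own model; the consensus prediction averages the
# per-shard predictions at common x values rather than merging the trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, (20_000, 5))
y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
     + 10 * X[:, 3] + 5 * X[:, 4] + rng.normal(0, 1, len(X)))   # Friedman's test function

shards = np.array_split(np.arange(len(X)), 10)                  # 10 shards of indices
models = [DecisionTreeRegressor(max_depth=6).fit(X[idx], y[idx]) for idx in shards]

X_new = rng.uniform(0, 1, (5, 5))
consensus_pred = np.mean([m.predict(X_new) for m in models], axis=0)
print(consensus_pred)
```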

Conclusions
Bayesian ideas are important in big data problems for the same reasons they're important in small data problems.
Naive parallelizations of single-machine algorithms are not effective in a MapReduce environment (too much coordination).
Consensus Monte Carlo:
Good: requires only one communication; a reasonably good approximation; can use existing code.
Bad: no theoretical guarantees yet (in the non-Gaussian case); averaging only works where averaging makes sense; not good for discrete parameters, spike-and-slab, etc.; still need to work out the resampling theory.

References
Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics 4, 266-298.
Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI '04: Sixth Symposium on Operating System Design and Implementation.
Huang, Z. and Gelman, A. (2005). Sampling for Bayesian computation with large datasets. Unpublished manuscript.
Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. (2010). Pregel: A system for large-scale graph processing. In SIGMOD, 135-146.
Neiswanger, W., Wang, C., and Xing, E. (2013). Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint arXiv:1311.4780.
Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2013). Bayes and big data: The consensus Monte Carlo algorithm. http://research.google.com/pubs/pub4849.html.
Wang, X. and Dunson, D. B. (2013). Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605.