Big Data, Statistics, and the Internet

Size: px
Start display at page:

Download "Big Data, Statistics, and the Internet"

Transcription

1 Big Data, Statistics, and the Internet Steven L. Scott April, 4 Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39

2 Summary Big data live on more than one machine. Computing takes place in the MapReduce / Hadoop paradigm, where communicating is expensive. Methods that treat the data as a single object don t work. We need to treat MapReduce as a basic assumption. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39

3 Outline of the talk Big data, Statistics, and the Internet Big data are real, and sometimes necessary Bayes is important Experiments in the Internet age MapReduce: A case study in how to do it wrong Consensus Monte Carlo Examples Binomial Logistic regression Hierarchical models BART Conclusion Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 3 / 39

4 Dan Ariely January 3 Facebook post Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it... Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 4 / 39

5 Big data, machine learning, and statistics Statistics Models summarize data so humans can better understand it. Does spending more money on ads make my business more profitable? Is treatment A better than treatment B? Sampling/aggregation works just fine. (Humans can t handle the complexity anyway.) Machine learning Models allow machines to make decisions. Google search, Amazon product recommendations, Facebook news feed,.... Need big data to match diverse users to diverse content. The canonical problem in Silicon Valley is personalization. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 5 / 39

6

7

8

9 Personalization is a big logistic regression Response = convert (yes / no) Data are structured... Search queries Search query Search history (long term) Session history Demographics... Search results Individual pages One boxes Nested in sites Links between sites but structure is often ignored Ads Ad creative Keywords Landing page quality Nested in campaigns... Many billions of queries and potential results, billion+ users, many millions of ads (at any given time). There are content user interactions. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 9 / 39

10 Sp rsi y Sparsity plays an important role in modeling internet data Models are big because of a small number of factors with many levels. Big data problems are often big collections of small data problems. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39

11 Bayesian ideas remain important in big data Bayesian themes: Prediction Average over unknowns, don t maximize. Uncertainty Probability coherently represents uncertainty. Combine information Hierarchical models combine information from multiple sources. The importance of better predictions is obvious. Now for an example of the others... Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39

12 Classical vs internet experiments Classical experiments Results take a long time. Planning is important. Type I errors costly. Internet experiments Results come quickly and sequentially. All cost is opportunity cost. Cost of Type I error is 0. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39

13 Multi-armed bandits Entirely driven by uncertainty Problem statement: There are several potential actions available. Rewards come from a distribution fa (y θ). If you knew θ you could choose the best a. Run a sequential experiment to find the best a by learning θ. Tradeoff: Explore Experiment in case your model is wrong. Exploit Do what your model says is best. Thompson sampling heuristic: Given current data y, compute w a = p(action a is best y) = p(action a is best θ)p(θ y) dθ Assign action a with probability wa. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 3 / 39

14 Application: Website optimization For details search Google for [Google analytics multi-armed bandit] Each website is run as an independent experiment. Too easy for a big data problem. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 4 / 39

15 What about ad optimization? The font, text size, and background color of the ad should work well with the font, text size, and color scheme of the domain. There are 600, 000 domains in the (proof of concept) data. Many have too little traffic to support an independent experiment. Need uncertainty + information pooling. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 5 / 39

16 Test case: hierarchical logistic regression logitpr(y di = β d ) = β T d x di β d N (µ, Σ) p(µ, Σ) = NIW d is one of 600,000 internet domains (blah.com) y di indicates whether the ad on domain d was clicked at impression i. x di contains details about ad characteristics: fonts, colors, etc. x di has roughly dimensions. NOTE: The model is (somewhat) pedagogical. More sophisticated shrinkage is needed. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 6 / 39

17 The obvious algorithm Algorithm Alternate between the following steps many times. Map Each worker simulates p(β d y d, µ, Σ). Reduce Master node simulates p(µ, Σ β). Theoretically identical to the standard single machine MCMC. Works great on HPC/GPU. It works terribly in MapReduce. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 7 / 39

18 SuperStep Times 0 worker job has Less time per step drawing β d Same amount of work drawing µ, Σ. But the µ, Σ draw takes more time! Extra time spent coordinating the extra machines. Workers Hours Both too slow! Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 8 / 39

19 The MapReduce paradox We started off compute constrained. We parallelized. We wound up using essentially 0% of the compute resources on the worker machines. Conclusion For Bayesian methods to work in a MapReduce / Hadoop environment, we need algorithms that require very little communication. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 9 / 39

20 Consensus Monte Carlo Embarassingly parallel: only one communication Partition the data y into shards y,..., y S. Run a separate Monte Carlo on each shard-level posterior p(θ y s ). (MCMC, SMC, QMC, MC,...) Combine the draws to form a consensus posterior distribution. (Like doing a meta-analysis) This should work great, because most Bayesian problems can be written p(θ y) = S p(θ y s ) Complication: You have draws from p(θ y s ), not the distribution itself. Benefits: No communication overhead. You can use existing software. s= Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39

21 Two issues What to do about the prior? Fractionate: ps (θ) = p(θ) /S (reasonable, but beware impropriety and consider shrinkage) Adjust by multiplying / dividing working priors. How to form the consensus? Average [Scott et. al (3)], [Wang and Dunson (3)] Other methods: Importance resampling (SFS?) [Huang and Gelman (5)] Kernel estimates [Neiswanger, et. al (3)] Parametric approximation [Huang and Gelman (5)] [Machine Learning] Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39

22 Averaging individual draws? The Gaussian case If p and p are normal, then p p = prior likelihood. If z p p then z N µ σ + µ σ σ + σ, + σ σ If x p and y p, then w x + w y p p (where w i /σ i ). This requires knowing the weights to use in the average, but. Sometimes you know them.. We ve got a full MC for each shard, so we can easily estimate within-shard variances. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39

23 Averaging vs. other methods. Advertising is robust. Not susceptible to infinite resampling weights or the curse of dimensionality. Can t capture multi-modality or discrete spaces. Kernel estimates will break down in high dimensions Despite claims of being asymptotically exact. Beware the constant in O( ). Importance resampling is probably the long term answer Details still need to be worked out. What to use for importance weights? Worker level posteriors can be widely separated, and not overlap the true posterior, in which case resampling would fail. Parametric models Variational Bayes gets the marginal histograms right, but not the joint distribution. Parametric assumptions are reasonable if they re reasonable. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 3 / 39

24 Binomial 00 Bernoulli outcomes, with one success, 0 workers. Density Single Machine Consensus Another Single Machine Consensus N = 000 Bandwidth = Single Machine This beta distribution is highly skewed (non-gaussian) but it still works. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 4 / 39

25 Unbalanced is okay Density Single Machine Consensus Another Single Machine 5 workers (0,,, 70, 0) observations. Weights (proportional to n) handle information uncertainty correctly N = 000 Bandwidth = Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 5 / 39

26 Be careful with the prior Single Machine Consensus Another Single Machine Density Beta prior Distribute information ( fake data ) evenly across shards. Notion of a natural scale is still a mystery N = 00 Bandwidth = If S copies of the prior contain meaningful information then you need to be careful. If not, then you can afford to be sloppy. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 6 / 39

27 Logistic regression Test data has 5 binary predictor variables x x x 3 x 4 x 5 frequency coefficient Last one is rare, but highly predictive when it occurs. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 7 / 39

28 Data y n x x x 3 x 4 x Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 8 / 39

29 6.5 overall matrix scalar equal x 6.5 x x x x x x x3 0 obs/worker 0 workers x x

30 Why equal weighting fails Even though workers have identical sampling distributions. Worker-level posteriors for β Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39

31 x x x x x5 x x x x x5 overall matrix scalar 00 obs / worker 0 workers

32 overall matrix scalar x x x3 x4 x x x x3,000 obs / worker 0 workers x4 x

33 Hierarchical Poisson regression 4 million observations (ad impressions) Predictors include a quality score and indicators for 6 different ad formats. Coefficients vary by advertiser (roughly,000 advertisers here). Data sharded by advertiser. 867 shards. No shard has more than thousand observations. Median shard size is 7,000 observations. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 33 / 39

34 Hierarchical Poisson regression Posterior for hypers given 6 data shards overall consensus Intercept Quality AF AF AF 3 AF 4 AF AF AF 6 AF 5 AF 4 AF 3 AF AF Quality Intercept Very close, even on joint distribution. Parameters are embarassingly parallel, given marginal for hypers. Can get s in one more Map step.

35 Compute time for hierarchical models Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 35 / 39

36 BART [Chipman, George, McCulloch ()] y i = j f j (x i ) + ɛ i ɛ i iid N ( 0, σ ) Each f i is a tree. Priors keep the trees small. Sum of weak learners idea akin to boosting. Trees have means in the leaves. Can t average trees, but we can average ˆf (x). Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 36 / 39

37 Consensus Bart vs Bart Fits vs Friedman s test function. shards, K total observations. consensus bart, no prior adjustment 5 5 consensus bart, no prior adjustment 5 5 consensus bart, no prior adjustment bart mean 5 5 bart 5% bart 95% consensus bart, prior adjustment 5 consensus bart, prior adjustment consensus bart, prior adjustment bart mean bart 5% Columns: mean, and (.05,.975) quantiles. Consensus vs. single machine. Top row: same prior (nearly perfect match) Bottom row: prior / (fractionated prior overdispersed posterior) bart 95%

38 Conclusions Bayesian ideas are important in big data problems for the same reasons they re important in small data problems. Naive parallelizations of single machine algorithms are not effective in a MapReduce environment (too much coordination). Consensus Monte Carlo Good Requires only one communication. Reasonably good approximation. Can use existing code. Bad No theoretical guarantees yet (in non-gaussian case). Averaging only works where averaging makes sense. Not good for discrete s, spike-and-slab, etc. Need to work out resampling theory. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 38 / 39

39 References Chipman, H. A., George, E. I., and McCulloch, R. E. (). Bart: Bayesian additive regression trees. The Annals of Applied Statistics 4, Dean, J. and Ghemawat, S. (4). Mapreduce: Simplied data processing on large clusters. In OSDI 04: Fifth Symposium on Operating System Design and Implementation. Huang, Z. and Gelman, A. (5) Sampling for Bayesian computation with large datasets (unpublished) Malewicz, G., Austern, Matthew H. Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. (). Pregel: A system for large-scale graph processing. In SIGMOD, Neiswanger, W, Wang, C, and Xing, E (3). Asymptotically Exact, Embarrassingly Parallel MCMC arxiv preprint arxiv: Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (3) Bayes and Big Data: The Consensus Monte Carlo Algorithm Wang, X. and Dunson, D. B. (3) Parallel MCMC via Weierstrass Sampler arxiv preprint arxiv: Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 39 / 39

Bayes and Big Data: The Consensus Monte Carlo Algorithm

Bayes and Big Data: The Consensus Monte Carlo Algorithm Bayes and Big Data: The Consensus Monte Carlo Algorithm Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George 3, and Robert E. McCulloch 4 Google, Inc. Acadia University

More information

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian

More information

Computational Statistics for Big Data

Computational Statistics for Big Data Lancaster University Computational Statistics for Big Data Author: 1 Supervisors: Paul Fearnhead 1 Emily Fox 2 1 Lancaster University 2 The University of Washington September 1, 2015 Abstract The amount

More information

Evaluating partitioning of big graphs

Evaluating partitioning of big graphs Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist fhallb@kth.se, candef@kth.se, mickeso@kth.se Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed

More information

PS 271B: Quantitative Methods II. Lecture Notes

PS 271B: Quantitative Methods II. Lecture Notes PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.

More information

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction

More information

Distributed Bayesian Posterior Sampling via Moment Sharing

Distributed Bayesian Posterior Sampling via Moment Sharing Distributed Bayesian Posterior Sampling via Moment Sharing Minjie Xu 1, Balaji Lakshminarayanan 2, Yee Whye Teh 3, Jun Zhu 1, and Bo Zhang 1 1 State Key Lab of Intelligent Technology and Systems; Tsinghua

More information

Bayesian Statistics in One Hour. Patrick Lam

Bayesian Statistics in One Hour. Patrick Lam Bayesian Statistics in One Hour Patrick Lam Outline Introduction Bayesian Models Applications Missing Data Hierarchical Models Outline Introduction Bayesian Models Applications Missing Data Hierarchical

More information

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: (jpineau@cs.mcgill.ca) TAs: Pierre-Luc Bacon (pbacon@cs.mcgill.ca) Ryan Lowe (ryan.lowe@mail.mcgill.ca)

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014 Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

More information

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit

More information

Campaign and Ad Group Management. Google AdWords Fundamentals

Campaign and Ad Group Management. Google AdWords Fundamentals Campaign and Ad Group Management Google AdWords Fundamentals Question: When a Campaign is Pending what does this mean? Question: When a Campaign is Pending what does this mean? ANSWER: IT MEANS THE CAMPAIGN

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

More information

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

More information

Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone

Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it Dan Ariely MYSQL AND HBASE ECOSYSTEM

More information

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12 MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

Parallelization Strategies for Multicore Data Analysis

Parallelization Strategies for Multicore Data Analysis Parallelization Strategies for Multicore Data Analysis Wei-Chen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management

More information

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

More information

BSPCloud: A Hybrid Programming Library for Cloud Computing *

BSPCloud: A Hybrid Programming Library for Cloud Computing * BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Logistic Regression (1/24/13)

Logistic Regression (1/24/13) STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

More information

10-601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html

10-601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html 10-601 Machine Learning http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html Course data All up-to-date info is on the course web page: http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html

More information

2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)

2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR) 2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came

More information

HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

More information

Invited Applications Paper

Invited Applications Paper Invited Applications Paper - - Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK THOREG@MICROSOFT.COM JOAQUINC@MICROSOFT.COM

More information

Towards running complex models on big data

Towards running complex models on big data Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation

More information

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

More information

LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.

LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D. LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD Dr. Buğra Gedik, Ph.D. MOTIVATION Graph data is everywhere Relationships between people, systems, and the nature Interactions between people, systems,

More information

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Lecture 2: Descriptive Statistics and Exploratory Data Analysis Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals

More information

Applications of R Software in Bayesian Data Analysis

Applications of R Software in Bayesian Data Analysis Article International Journal of Information Science and System, 2012, 1(1): 7-23 International Journal of Information Science and System Journal homepage: www.modernscientificpress.com/journals/ijinfosci.aspx

More information

Imperfect Debugging in Software Reliability

Imperfect Debugging in Software Reliability Imperfect Debugging in Software Reliability Tevfik Aktekin and Toros Caglar University of New Hampshire Peter T. Paul College of Business and Economics Department of Decision Sciences and United Health

More information

Marketing Mix Modelling and Big Data P. M Cain

Marketing Mix Modelling and Big Data P. M Cain 1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored

More information

Predictive Modeling Techniques in Insurance

Predictive Modeling Techniques in Insurance Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics

More information

Big data challenges for physics in the next decades

Big data challenges for physics in the next decades Big data challenges for physics in the next decades David W. Hogg Center for Cosmology and Particle Physics, New York University 2012 November 09 punchlines Huge data sets create new opportunities. they

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data

A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data Faming Liang University of Florida August 9, 2015 Abstract MCMC methods have proven to be a very powerful tool for analyzing

More information

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

More information

Graph Processing and Social Networks

Graph Processing and Social Networks Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph

More information

1 Prior Probability and Posterior Probability

1 Prior Probability and Posterior Probability Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

The primary goal of this thesis was to understand how the spatial dependence of

The primary goal of this thesis was to understand how the spatial dependence of 5 General discussion 5.1 Introduction The primary goal of this thesis was to understand how the spatial dependence of consumer attitudes can be modeled, what additional benefits the recovering of spatial

More information

Bootstrapping Big Data

Bootstrapping Big Data Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

More information

1 Choosing the right data mining techniques for the job (8 minutes,

1 Choosing the right data mining techniques for the job (8 minutes, CS490D Spring 2004 Final Solutions, May 3, 2004 Prof. Chris Clifton Time will be tight. If you spend more than the recommended time on any question, go on to the next one. If you can t answer it in the

More information

Dirichlet Processes A gentle tutorial

Dirichlet Processes A gentle tutorial Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 36 Outline

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Markov Chain Monte Carlo Simulation Made Simple

Markov Chain Monte Carlo Simulation Made Simple Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Logistic Regression (a type of Generalized Linear Model)

Logistic Regression (a type of Generalized Linear Model) Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge

More information

Big Data Big Deal? Salford Systems www.salford-systems.com

Big Data Big Deal? Salford Systems www.salford-systems.com Big Data Big Deal? Salford Systems www.salford-systems.com 2015 Copyright Salford Systems 2010-2015 Big Data Is The New In Thing Google trends as of September 24, 2015 Difficult to read trade press without

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

From the help desk: Bootstrapped standard errors

From the help desk: Bootstrapped standard errors The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution

More information

Analysis of Bayesian Dynamic Linear Models

Analysis of Bayesian Dynamic Linear Models Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main

More information

Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni

Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni 1 Web-based Supplementary Materials for Bayesian Effect Estimation Accounting for Adjustment Uncertainty by Chi Wang, Giovanni Parmigiani, and Francesca Dominici In Web Appendix A, we provide detailed

More information

Bayesian Statistics: Indian Buffet Process

Bayesian Statistics: Indian Buffet Process Bayesian Statistics: Indian Buffet Process Ilker Yildirim Department of Brain and Cognitive Sciences University of Rochester Rochester, NY 14627 August 2012 Reference: Most of the material in this note

More information

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the

More information

Role of Social Networking in Marketing using Data Mining

Role of Social Networking in Marketing using Data Mining Role of Social Networking in Marketing using Data Mining Mrs. Saroj Junghare Astt. Professor, Department of Computer Science and Application St. Aloysius College, Jabalpur, Madhya Pradesh, India Abstract:

More information

Classification Problems

Classification Problems Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

More information

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

More information

Healthcare data analytics. Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw

Healthcare data analytics. Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw Healthcare data analytics Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw Outline Data Science Enabling technologies Grand goals Issues Google flu trend Privacy Conclusion Analytics

More information

Statistical Challenges with Big Data in Management Science

Statistical Challenges with Big Data in Management Science Statistical Challenges with Big Data in Management Science Arnab Kumar Laha Indian Institute of Management Ahmedabad Analytics vs Reporting Competitive Advantage Reporting Prescriptive Analytics (Decision

More information

Outline. Dispersion Bush lupine survival Quasi-Binomial family

Outline. Dispersion Bush lupine survival Quasi-Binomial family Outline 1 Three-way interactions 2 Overdispersion in logistic regression Dispersion Bush lupine survival Quasi-Binomial family 3 Simulation for inference Why simulations Testing model fit: simulating the

More information

NoSQL. Thomas Neumann 1 / 22

NoSQL. Thomas Neumann 1 / 22 NoSQL Thomas Neumann 1 / 22 What are NoSQL databases? hard to say more a theme than a well defined thing Usually some or all of the following: no SQL interface no relational model / no schema no joins,

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

Sampling for Bayesian computation with large datasets

Sampling for Bayesian computation with large datasets Sampling for Bayesian computation with large datasets Zaiying Huang Andrew Gelman April 27, 2005 Abstract Multilevel models are extremely useful in handling large hierarchical datasets. However, computation

More information

The Artificial Prediction Market

The Artificial Prediction Market The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

More information

Challenges of Cloud Scale Natural Language Processing

Challenges of Cloud Scale Natural Language Processing Challenges of Cloud Scale Natural Language Processing Mark Dredze Johns Hopkins University My Interests? Information Expressed in Human Language Machine Learning Natural Language Processing Intelligent

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Exploratory Data Analysis

Exploratory Data Analysis Exploratory Data Analysis Johannes Schauer johannes.schauer@tugraz.at Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Introduction

More information

DECISION MAKING UNDER UNCERTAINTY:

DECISION MAKING UNDER UNCERTAINTY: DECISION MAKING UNDER UNCERTAINTY: Models and Choices Charles A. Holloway Stanford University TECHNISCHE HOCHSCHULE DARMSTADT Fachbereich 1 Gesamtbibliothek Betrtebswirtscrtaftslehre tnventar-nr. :...2>2&,...S'.?S7.

More information

Basic Bayesian Methods

Basic Bayesian Methods 6 Basic Bayesian Methods Mark E. Glickman and David A. van Dyk Summary In this chapter, we introduce the basics of Bayesian data analysis. The key ingredients to a Bayesian analysis are the likelihood

More information

Collaborative Filtering. Radek Pelánek

Collaborative Filtering. Radek Pelánek Collaborative Filtering Radek Pelánek 2015 Collaborative Filtering assumption: users with similar taste in past will have similar taste in future requires only matrix of ratings applicable in many domains

More information

Machine Learning over Big Data

Machine Learning over Big Data Machine Learning over Big Presented by Fuhao Zou fuhao@hust.edu.cn Jue 16, 2014 Huazhong University of Science and Technology Contents 1 2 3 4 Role of Machine learning Challenge of Big Analysis Distributed

More information

data visualization and regression

data visualization and regression data visualization and regression Sepal.Length 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 I. setosa I. versicolor I. virginica I. setosa I. versicolor I. virginica Species Species

More information

Bayesian networks - Time-series models - Apache Spark & Scala

Bayesian networks - Time-series models - Apache Spark & Scala Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)

More information

Anomaly detection for Big Data, networks and cyber-security

Anomaly detection for Big Data, networks and cyber-security Anomaly detection for Big Data, networks and cyber-security Patrick Rubin-Delanchy University of Bristol & Heilbronn Institute for Mathematical Research Joint work with Nick Heard (Imperial College London),

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre Pinto alexcp@mlsecproject.org @alexcpsec @MLSecProject

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre Pinto alexcp@mlsecproject.org @alexcpsec @MLSecProject Defending Networks with Incomplete Information: A Machine Learning Approach Alexandre Pinto alexcp@mlsecproject.org @alexcpsec @MLSecProject Agenda Security Monitoring: We are doing it wrong Machine Learning

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Inference of Probability Distributions for Trust and Security applications

Inference of Probability Distributions for Trust and Security applications Inference of Probability Distributions for Trust and Security applications Vladimiro Sassone Based on joint work with Mogens Nielsen & Catuscia Palamidessi Outline 2 Outline Motivations 2 Outline Motivations

More information

How To Understand The Theory Of Probability

How To Understand The Theory Of Probability Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL

More information

Statistics 104: Section 6!

Statistics 104: Section 6! Page 1 Statistics 104: Section 6! TF: Deirdre (say: Dear-dra) Bloome Email: dbloome@fas.harvard.edu Section Times Thursday 2pm-3pm in SC 109, Thursday 5pm-6pm in SC 705 Office Hours: Thursday 6pm-7pm SC

More information

Un point de vue bayésien pour des algorithmes de bandit plus performants

Un point de vue bayésien pour des algorithmes de bandit plus performants Un point de vue bayésien pour des algorithmes de bandit plus performants Emilie Kaufmann, Telecom ParisTech Rencontre des Jeunes Statisticiens, Aussois, 28 août 2013 Emilie Kaufmann (Telecom ParisTech)

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information