Big Data, Statistics, and the Internet
|
|
- Bruce Preston
- 7 years ago
- Views:
Transcription
1 Big Data, Statistics, and the Internet Steven L. Scott April, 4 Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39
2 Summary Big data live on more than one machine. Computing takes place in the MapReduce / Hadoop paradigm, where communicating is expensive. Methods that treat the data as a single object don t work. We need to treat MapReduce as a basic assumption. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39
3 Outline of the talk Big data, Statistics, and the Internet Big data are real, and sometimes necessary Bayes is important Experiments in the Internet age MapReduce: A case study in how to do it wrong Consensus Monte Carlo Examples Binomial Logistic regression Hierarchical models BART Conclusion Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 3 / 39
4 Dan Ariely January 3 Facebook post Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it... Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 4 / 39
5 Big data, machine learning, and statistics Statistics Models summarize data so humans can better understand it. Does spending more money on ads make my business more profitable? Is treatment A better than treatment B? Sampling/aggregation works just fine. (Humans can t handle the complexity anyway.) Machine learning Models allow machines to make decisions. Google search, Amazon product recommendations, Facebook news feed,.... Need big data to match diverse users to diverse content. The canonical problem in Silicon Valley is personalization. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 5 / 39
6
7
8
9 Personalization is a big logistic regression Response = convert (yes / no) Data are structured... Search queries Search query Search history (long term) Session history Demographics... Search results Individual pages One boxes Nested in sites Links between sites but structure is often ignored Ads Ad creative Keywords Landing page quality Nested in campaigns... Many billions of queries and potential results, billion+ users, many millions of ads (at any given time). There are content user interactions. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 9 / 39
10 Sp rsi y Sparsity plays an important role in modeling internet data Models are big because of a small number of factors with many levels. Big data problems are often big collections of small data problems. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39
11 Bayesian ideas remain important in big data Bayesian themes: Prediction Average over unknowns, don t maximize. Uncertainty Probability coherently represents uncertainty. Combine information Hierarchical models combine information from multiple sources. The importance of better predictions is obvious. Now for an example of the others... Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39
12 Classical vs internet experiments Classical experiments Results take a long time. Planning is important. Type I errors costly. Internet experiments Results come quickly and sequentially. All cost is opportunity cost. Cost of Type I error is 0. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39
13 Multi-armed bandits Entirely driven by uncertainty Problem statement: There are several potential actions available. Rewards come from a distribution fa (y θ). If you knew θ you could choose the best a. Run a sequential experiment to find the best a by learning θ. Tradeoff: Explore Experiment in case your model is wrong. Exploit Do what your model says is best. Thompson sampling heuristic: Given current data y, compute w a = p(action a is best y) = p(action a is best θ)p(θ y) dθ Assign action a with probability wa. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 3 / 39
14 Application: Website optimization For details search Google for [Google analytics multi-armed bandit] Each website is run as an independent experiment. Too easy for a big data problem. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 4 / 39
15 What about ad optimization? The font, text size, and background color of the ad should work well with the font, text size, and color scheme of the domain. There are 600, 000 domains in the (proof of concept) data. Many have too little traffic to support an independent experiment. Need uncertainty + information pooling. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 5 / 39
16 Test case: hierarchical logistic regression logitpr(y di = β d ) = β T d x di β d N (µ, Σ) p(µ, Σ) = NIW d is one of 600,000 internet domains (blah.com) y di indicates whether the ad on domain d was clicked at impression i. x di contains details about ad characteristics: fonts, colors, etc. x di has roughly dimensions. NOTE: The model is (somewhat) pedagogical. More sophisticated shrinkage is needed. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 6 / 39
17 The obvious algorithm Algorithm Alternate between the following steps many times. Map Each worker simulates p(β d y d, µ, Σ). Reduce Master node simulates p(µ, Σ β). Theoretically identical to the standard single machine MCMC. Works great on HPC/GPU. It works terribly in MapReduce. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 7 / 39
18 SuperStep Times 0 worker job has Less time per step drawing β d Same amount of work drawing µ, Σ. But the µ, Σ draw takes more time! Extra time spent coordinating the extra machines. Workers Hours Both too slow! Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 8 / 39
19 The MapReduce paradox We started off compute constrained. We parallelized. We wound up using essentially 0% of the compute resources on the worker machines. Conclusion For Bayesian methods to work in a MapReduce / Hadoop environment, we need algorithms that require very little communication. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 9 / 39
20 Consensus Monte Carlo Embarassingly parallel: only one communication Partition the data y into shards y,..., y S. Run a separate Monte Carlo on each shard-level posterior p(θ y s ). (MCMC, SMC, QMC, MC,...) Combine the draws to form a consensus posterior distribution. (Like doing a meta-analysis) This should work great, because most Bayesian problems can be written p(θ y) = S p(θ y s ) Complication: You have draws from p(θ y s ), not the distribution itself. Benefits: No communication overhead. You can use existing software. s= Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39
21 Two issues What to do about the prior? Fractionate: ps (θ) = p(θ) /S (reasonable, but beware impropriety and consider shrinkage) Adjust by multiplying / dividing working priors. How to form the consensus? Average [Scott et. al (3)], [Wang and Dunson (3)] Other methods: Importance resampling (SFS?) [Huang and Gelman (5)] Kernel estimates [Neiswanger, et. al (3)] Parametric approximation [Huang and Gelman (5)] [Machine Learning] Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39
22 Averaging individual draws? The Gaussian case If p and p are normal, then p p = prior likelihood. If z p p then z N µ σ + µ σ σ + σ, + σ σ If x p and y p, then w x + w y p p (where w i /σ i ). This requires knowing the weights to use in the average, but. Sometimes you know them.. We ve got a full MC for each shard, so we can easily estimate within-shard variances. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39
23 Averaging vs. other methods. Advertising is robust. Not susceptible to infinite resampling weights or the curse of dimensionality. Can t capture multi-modality or discrete spaces. Kernel estimates will break down in high dimensions Despite claims of being asymptotically exact. Beware the constant in O( ). Importance resampling is probably the long term answer Details still need to be worked out. What to use for importance weights? Worker level posteriors can be widely separated, and not overlap the true posterior, in which case resampling would fail. Parametric models Variational Bayes gets the marginal histograms right, but not the joint distribution. Parametric assumptions are reasonable if they re reasonable. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 3 / 39
24 Binomial 00 Bernoulli outcomes, with one success, 0 workers. Density Single Machine Consensus Another Single Machine Consensus N = 000 Bandwidth = Single Machine This beta distribution is highly skewed (non-gaussian) but it still works. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 4 / 39
25 Unbalanced is okay Density Single Machine Consensus Another Single Machine 5 workers (0,,, 70, 0) observations. Weights (proportional to n) handle information uncertainty correctly N = 000 Bandwidth = Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 5 / 39
26 Be careful with the prior Single Machine Consensus Another Single Machine Density Beta prior Distribute information ( fake data ) evenly across shards. Notion of a natural scale is still a mystery N = 00 Bandwidth = If S copies of the prior contain meaningful information then you need to be careful. If not, then you can afford to be sloppy. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 6 / 39
27 Logistic regression Test data has 5 binary predictor variables x x x 3 x 4 x 5 frequency coefficient Last one is rare, but highly predictive when it occurs. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 7 / 39
28 Data y n x x x 3 x 4 x Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 8 / 39
29 6.5 overall matrix scalar equal x 6.5 x x x x x x x3 0 obs/worker 0 workers x x
30 Why equal weighting fails Even though workers have identical sampling distributions. Worker-level posteriors for β Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39
31 x x x x x5 x x x x x5 overall matrix scalar 00 obs / worker 0 workers
32 overall matrix scalar x x x3 x4 x x x x3,000 obs / worker 0 workers x4 x
33 Hierarchical Poisson regression 4 million observations (ad impressions) Predictors include a quality score and indicators for 6 different ad formats. Coefficients vary by advertiser (roughly,000 advertisers here). Data sharded by advertiser. 867 shards. No shard has more than thousand observations. Median shard size is 7,000 observations. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 33 / 39
34 Hierarchical Poisson regression Posterior for hypers given 6 data shards overall consensus Intercept Quality AF AF AF 3 AF 4 AF AF AF 6 AF 5 AF 4 AF 3 AF AF Quality Intercept Very close, even on joint distribution. Parameters are embarassingly parallel, given marginal for hypers. Can get s in one more Map step.
35 Compute time for hierarchical models Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 35 / 39
36 BART [Chipman, George, McCulloch ()] y i = j f j (x i ) + ɛ i ɛ i iid N ( 0, σ ) Each f i is a tree. Priors keep the trees small. Sum of weak learners idea akin to boosting. Trees have means in the leaves. Can t average trees, but we can average ˆf (x). Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 36 / 39
37 Consensus Bart vs Bart Fits vs Friedman s test function. shards, K total observations. consensus bart, no prior adjustment 5 5 consensus bart, no prior adjustment 5 5 consensus bart, no prior adjustment bart mean 5 5 bart 5% bart 95% consensus bart, prior adjustment 5 consensus bart, prior adjustment consensus bart, prior adjustment bart mean bart 5% Columns: mean, and (.05,.975) quantiles. Consensus vs. single machine. Top row: same prior (nearly perfect match) Bottom row: prior / (fractionated prior overdispersed posterior) bart 95%
38 Conclusions Bayesian ideas are important in big data problems for the same reasons they re important in small data problems. Naive parallelizations of single machine algorithms are not effective in a MapReduce environment (too much coordination). Consensus Monte Carlo Good Requires only one communication. Reasonably good approximation. Can use existing code. Bad No theoretical guarantees yet (in non-gaussian case). Averaging only works where averaging makes sense. Not good for discrete s, spike-and-slab, etc. Need to work out resampling theory. Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 38 / 39
39 References Chipman, H. A., George, E. I., and McCulloch, R. E. (). Bart: Bayesian additive regression trees. The Annals of Applied Statistics 4, Dean, J. and Ghemawat, S. (4). Mapreduce: Simplied data processing on large clusters. In OSDI 04: Fifth Symposium on Operating System Design and Implementation. Huang, Z. and Gelman, A. (5) Sampling for Bayesian computation with large datasets (unpublished) Malewicz, G., Austern, Matthew H. Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. (). Pregel: A system for large-scale graph processing. In SIGMOD, Neiswanger, W, Wang, C, and Xing, E (3). Asymptotically Exact, Embarrassingly Parallel MCMC arxiv preprint arxiv: Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (3) Bayes and Big Data: The Consensus Monte Carlo Algorithm Wang, X. and Dunson, D. B. (3) Parallel MCMC via Weierstrass Sampler arxiv preprint arxiv: Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 39 / 39
Bayes and Big Data: The Consensus Monte Carlo Algorithm
Bayes and Big Data: The Consensus Monte Carlo Algorithm Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George 3, and Robert E. McCulloch 4 Google, Inc. Acadia University
More informationSampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data
Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian
More informationComputational Statistics for Big Data
Lancaster University Computational Statistics for Big Data Author: 1 Supervisors: Paul Fearnhead 1 Emily Fox 2 1 Lancaster University 2 The University of Washington September 1, 2015 Abstract The amount
More informationEvaluating partitioning of big graphs
Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist fhallb@kth.se, candef@kth.se, mickeso@kth.se Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed
More informationPS 271B: Quantitative Methods II. Lecture Notes
PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.
More informationBayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
More informationWhy Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationSoftware tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team
Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction
More informationDistributed Bayesian Posterior Sampling via Moment Sharing
Distributed Bayesian Posterior Sampling via Moment Sharing Minjie Xu 1, Balaji Lakshminarayanan 2, Yee Whye Teh 3, Jun Zhu 1, and Bo Zhang 1 1 State Key Lab of Intelligent Technology and Systems; Tsinghua
More informationBayesian Statistics in One Hour. Patrick Lam
Bayesian Statistics in One Hour Patrick Lam Outline Introduction Bayesian Models Applications Missing Data Hierarchical Models Outline Introduction Bayesian Models Applications Missing Data Hierarchical
More informationCOMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: (jpineau@cs.mcgill.ca) TAs: Pierre-Luc Bacon (pbacon@cs.mcgill.ca) Ryan Lowe (ryan.lowe@mail.mcgill.ca)
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
More informationProbabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014
Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about
More informationThe Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia
The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit
More informationCampaign and Ad Group Management. Google AdWords Fundamentals
Campaign and Ad Group Management Google AdWords Fundamentals Question: When a Campaign is Pending what does this mean? Question: When a Campaign is Pending what does this mean? ANSWER: IT MEANS THE CAMPAIGN
More informationBasics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar
More informationSpatial Statistics Chapter 3 Basics of areal data and areal data modeling
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data
More informationBig data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone
Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it Dan Ariely MYSQL AND HBASE ECOSYSTEM
More informationMapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12
MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationIBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
More informationParallelization Strategies for Multicore Data Analysis
Parallelization Strategies for Multicore Data Analysis Wei-Chen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationBSPCloud: A Hybrid Programming Library for Cloud Computing *
BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,
More informationIBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
More informationLogistic Regression (1/24/13)
STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationMachine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories
More information10-601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html
10-601 Machine Learning http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html Course data All up-to-date info is on the course web page: http://www.cs.cmu.edu/afs/cs/academic/class/10601-f10/index.html
More information2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)
2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came
More informationHT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
More informationInvited Applications Paper
Invited Applications Paper - - Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK THOREG@MICROSOFT.COM JOAQUINC@MICROSOFT.COM
More informationTowards running complex models on big data
Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation
More informationAuxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives
More informationLARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.
LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD Dr. Buğra Gedik, Ph.D. MOTIVATION Graph data is everywhere Relationships between people, systems, and the nature Interactions between people, systems,
More informationLecture 2: Descriptive Statistics and Exploratory Data Analysis
Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals
More informationApplications of R Software in Bayesian Data Analysis
Article International Journal of Information Science and System, 2012, 1(1): 7-23 International Journal of Information Science and System Journal homepage: www.modernscientificpress.com/journals/ijinfosci.aspx
More informationImperfect Debugging in Software Reliability
Imperfect Debugging in Software Reliability Tevfik Aktekin and Toros Caglar University of New Hampshire Peter T. Paul College of Business and Economics Department of Decision Sciences and United Health
More informationMarketing Mix Modelling and Big Data P. M Cain
1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored
More informationPredictive Modeling Techniques in Insurance
Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics
More informationBig data challenges for physics in the next decades
Big data challenges for physics in the next decades David W. Hogg Center for Cosmology and Particle Physics, New York University 2012 November 09 punchlines Huge data sets create new opportunities. they
More informationAdvanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
More informationA Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data
A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data Faming Liang University of Florida August 9, 2015 Abstract MCMC methods have proven to be a very powerful tool for analyzing
More informationlarge-scale machine learning revisited Léon Bottou Microsoft Research (NYC)
large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven
More informationGraph Processing and Social Networks
Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph
More information1 Prior Probability and Posterior Probability
Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which
More informationOptimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
More informationThe primary goal of this thesis was to understand how the spatial dependence of
5 General discussion 5.1 Introduction The primary goal of this thesis was to understand how the spatial dependence of consumer attitudes can be modeled, what additional benefits the recovering of spatial
More informationBootstrapping Big Data
Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu
More information1 Choosing the right data mining techniques for the job (8 minutes,
CS490D Spring 2004 Final Solutions, May 3, 2004 Prof. Chris Clifton Time will be tight. If you spend more than the recommended time on any question, go on to the next one. If you can t answer it in the
More informationDirichlet Processes A gentle tutorial
Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 36 Outline
More informationData Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
More informationData Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationMarkov Chain Monte Carlo Simulation Made Simple
Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical
More informationD-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
More informationLogistic Regression (a type of Generalized Linear Model)
Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge
More informationBig Data Big Deal? Salford Systems www.salford-systems.com
Big Data Big Deal? Salford Systems www.salford-systems.com 2015 Copyright Salford Systems 2010-2015 Big Data Is The New In Thing Google trends as of September 24, 2015 Difficult to read trade press without
More informationThe Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
More informationPrinciples of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
More informationFrom the help desk: Bootstrapped standard errors
The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution
More informationAnalysis of Bayesian Dynamic Linear Models
Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main
More informationWeb-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni
1 Web-based Supplementary Materials for Bayesian Effect Estimation Accounting for Adjustment Uncertainty by Chi Wang, Giovanni Parmigiani, and Francesca Dominici In Web Appendix A, we provide detailed
More informationBayesian Statistics: Indian Buffet Process
Bayesian Statistics: Indian Buffet Process Ilker Yildirim Department of Brain and Cognitive Sciences University of Rochester Rochester, NY 14627 August 2012 Reference: Most of the material in this note
More informationAcknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues
Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the
More informationRole of Social Networking in Marketing using Data Mining
Role of Social Networking in Marketing using Data Mining Mrs. Saroj Junghare Astt. Professor, Department of Computer Science and Application St. Aloysius College, Jabalpur, Madhya Pradesh, India Abstract:
More informationClassification Problems
Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems
More informationMISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group
MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could
More informationHealthcare data analytics. Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw
Healthcare data analytics Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw Outline Data Science Enabling technologies Grand goals Issues Google flu trend Privacy Conclusion Analytics
More informationStatistical Challenges with Big Data in Management Science
Statistical Challenges with Big Data in Management Science Arnab Kumar Laha Indian Institute of Management Ahmedabad Analytics vs Reporting Competitive Advantage Reporting Prescriptive Analytics (Decision
More informationOutline. Dispersion Bush lupine survival Quasi-Binomial family
Outline 1 Three-way interactions 2 Overdispersion in logistic regression Dispersion Bush lupine survival Quasi-Binomial family 3 Simulation for inference Why simulations Testing model fit: simulating the
More informationNoSQL. Thomas Neumann 1 / 22
NoSQL Thomas Neumann 1 / 22 What are NoSQL databases? hard to say more a theme than a well defined thing Usually some or all of the following: no SQL interface no relational model / no schema no joins,
More informationLinear Classification. Volker Tresp Summer 2015
Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong
More informationRevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
More informationSampling for Bayesian computation with large datasets
Sampling for Bayesian computation with large datasets Zaiying Huang Andrew Gelman April 27, 2005 Abstract Multilevel models are extremely useful in handling large hierarchical datasets. However, computation
More informationThe Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
More informationChallenges of Cloud Scale Natural Language Processing
Challenges of Cloud Scale Natural Language Processing Mark Dredze Johns Hopkins University My Interests? Information Expressed in Human Language Machine Learning Natural Language Processing Intelligent
More informationSTATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
More informationExploratory Data Analysis
Exploratory Data Analysis Johannes Schauer johannes.schauer@tugraz.at Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Introduction
More informationDECISION MAKING UNDER UNCERTAINTY:
DECISION MAKING UNDER UNCERTAINTY: Models and Choices Charles A. Holloway Stanford University TECHNISCHE HOCHSCHULE DARMSTADT Fachbereich 1 Gesamtbibliothek Betrtebswirtscrtaftslehre tnventar-nr. :...2>2&,...S'.?S7.
More informationBasic Bayesian Methods
6 Basic Bayesian Methods Mark E. Glickman and David A. van Dyk Summary In this chapter, we introduce the basics of Bayesian data analysis. The key ingredients to a Bayesian analysis are the likelihood
More informationCollaborative Filtering. Radek Pelánek
Collaborative Filtering Radek Pelánek 2015 Collaborative Filtering assumption: users with similar taste in past will have similar taste in future requires only matrix of ratings applicable in many domains
More informationMachine Learning over Big Data
Machine Learning over Big Presented by Fuhao Zou fuhao@hust.edu.cn Jue 16, 2014 Huazhong University of Science and Technology Contents 1 2 3 4 Role of Machine learning Challenge of Big Analysis Distributed
More informationdata visualization and regression
data visualization and regression Sepal.Length 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 I. setosa I. versicolor I. virginica I. setosa I. versicolor I. virginica Species Species
More informationBayesian networks - Time-series models - Apache Spark & Scala
Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
More informationCHAPTER 2 Estimating Probabilities
CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationBNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I
BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential
More informationInformation Management course
Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)
More informationAnomaly detection for Big Data, networks and cyber-security
Anomaly detection for Big Data, networks and cyber-security Patrick Rubin-Delanchy University of Bristol & Heilbronn Institute for Mathematical Research Joint work with Nick Heard (Imperial College London),
More informationMapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
More informationDefending Networks with Incomplete Information: A Machine Learning Approach. Alexandre Pinto alexcp@mlsecproject.org @alexcpsec @MLSecProject
Defending Networks with Incomplete Information: A Machine Learning Approach Alexandre Pinto alexcp@mlsecproject.org @alexcpsec @MLSecProject Agenda Security Monitoring: We are doing it wrong Machine Learning
More informationStatistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
More informationInference of Probability Distributions for Trust and Security applications
Inference of Probability Distributions for Trust and Security applications Vladimiro Sassone Based on joint work with Mogens Nielsen & Catuscia Palamidessi Outline 2 Outline Motivations 2 Outline Motivations
More informationHow To Understand The Theory Of Probability
Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL
More informationStatistics 104: Section 6!
Page 1 Statistics 104: Section 6! TF: Deirdre (say: Dear-dra) Bloome Email: dbloome@fas.harvard.edu Section Times Thursday 2pm-3pm in SC 109, Thursday 5pm-6pm in SC 705 Office Hours: Thursday 6pm-7pm SC
More informationUn point de vue bayésien pour des algorithmes de bandit plus performants
Un point de vue bayésien pour des algorithmes de bandit plus performants Emilie Kaufmann, Telecom ParisTech Rencontre des Jeunes Statisticiens, Aussois, 28 août 2013 Emilie Kaufmann (Telecom ParisTech)
More informationNCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
More information