Big Data, Statistics, and the Internet


1 Big Data, Statistics, and the Internet Steven L. Scott April, 4 Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39
2 Summary
Big data live on more than one machine.
Computing takes place in the MapReduce / Hadoop paradigm, where communicating is expensive.
Methods that treat the data as a single object don't work.
We need to treat MapReduce as a basic assumption.
3 Outline of the talk
Big data, Statistics, and the Internet
  Big data are real, and sometimes necessary
  Bayes is important
  Experiments in the Internet age
MapReduce: A case study in how to do it wrong
Consensus Monte Carlo
Examples
  Binomial
  Logistic regression
  Hierarchical models
  BART
Conclusion
4 Dan Ariely, January 3 Facebook post
"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it..."
5 Big data, machine learning, and statistics
Statistics
  Models summarize data so humans can better understand it.
  Does spending more money on ads make my business more profitable?
  Is treatment A better than treatment B?
  Sampling/aggregation works just fine. (Humans can't handle the complexity anyway.)
Machine learning
  Models allow machines to make decisions.
  Google search, Amazon product recommendations, Facebook news feed, ...
  Need big data to match diverse users to diverse content.
The canonical problem in Silicon Valley is personalization.
9 Personalization is a big logistic regression
Response = convert (yes / no)
Data are structured...
  Search queries: search query, search history (long term), session history, demographics, ...
  Search results: individual pages, one boxes, nested in sites, links between sites
  Ads: ad creative, keywords, landing page quality, nested in campaigns, ...
...but structure is often ignored.
Many billions of queries and potential results, billion+ users, many millions of ads (at any given time).
There are content-user interactions.
10 Sparsity
Sparsity plays an important role in modeling internet data.
Models are big because of a small number of factors with many levels.
Big data problems are often big collections of small data problems.
11 Bayesian ideas remain important in big data
Bayesian themes:
  Prediction: Average over unknowns, don't maximize.
  Uncertainty: Probability coherently represents uncertainty.
  Combine information: Hierarchical models combine information from multiple sources.
The importance of better predictions is obvious. Now for an example of the others...
12 Classical vs internet experiments
Classical experiments
  Results take a long time.
  Planning is important.
  Type I errors are costly.
Internet experiments
  Results come quickly and sequentially.
  All cost is opportunity cost.
  Cost of Type I error is 0.
13 Multiarmed bandits
Entirely driven by uncertainty.
Problem statement:
  There are several potential actions available.
  Rewards come from a distribution $f_a(y \mid \theta)$.
  If you knew $\theta$ you could choose the best $a$.
  Run a sequential experiment to find the best $a$ by learning $\theta$.
Tradeoff:
  Explore: Experiment in case your model is wrong.
  Exploit: Do what your model says is best.
Thompson sampling heuristic:
  Given current data $y$, compute
  $w_a = p(\text{action } a \text{ is best} \mid y) = \int p(\text{action } a \text{ is best} \mid \theta)\, p(\theta \mid y)\, d\theta$
  Assign action $a$ with probability $w_a$.
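The Thompson sampling heuristic can be sketched for the simplest case of binary rewards. This is a minimal illustration, not the method used in any production system: it assumes Bernoulli rewards with independent Beta(1, 1) priors on each arm's success rate, and the function names and the example rates are hypothetical. Drawing one value of theta from the posterior and playing the arm that looks best under that draw assigns each arm with probability equal to its chance of being best, so the integral never has to be computed explicitly.

```python
import numpy as np


def best_arm_probabilities(successes, failures, n_draws=10_000, rng=None):
    """Monte Carlo estimate of w_a = P(arm a is best | y), assuming Beta(1, 1) priors."""
    rng = np.random.default_rng() if rng is None else rng
    successes = np.asarray(successes)
    failures = np.asarray(failures)
    # Posterior draws of theta: one row per draw, one column per arm.
    theta = rng.beta(1 + successes, 1 + failures, size=(n_draws, len(successes)))
    # For each draw, record which arm has the largest success rate.
    winners = theta.argmax(axis=1)
    return np.bincount(winners, minlength=len(successes)) / n_draws


def run_bandit(true_rates, n_rounds=2000, seed=0):
    """Simulate Thompson sampling against hypothetical true success rates."""
    rng = np.random.default_rng(seed)
    k = len(true_rates)
    successes = np.zeros(k, dtype=int)
    failures = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        # One posterior draw per arm; playing the argmax selects arm a
        # with probability w_a without computing w_a explicitly.
        theta = rng.beta(1 + successes, 1 + failures)
        a = int(theta.argmax())
        reward = rng.random() < true_rates[a]
        successes[a] += reward
        failures[a] += not reward
    return successes, failures
```

Calling `run_bandit([0.05, 0.10, 0.30])` and passing the resulting counts to `best_arm_probabilities` shows the traffic concentrating on the best arm as the posterior sharpens, while the weaker arms still receive occasional exploratory plays.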
14 Application: Website optimization
For details search Google for [Google analytics multiarmed bandit].
Each website is run as an independent experiment.
Too easy for a big data problem.
15 What about ad optimization?
The font, text size, and background color of the ad should work well with the font, text size, and color scheme of the domain.
There are 600,000 domains in the (proof of concept) data.
Many have too little traffic to support an independent experiment.
Need uncertainty + information pooling.
16 Test case: hierarchical logistic regression
$\operatorname{logit} \Pr(y_{di} = 1 \mid \beta_d) = \beta_d^T x_{di}$
$\beta_d \sim N(\mu, \Sigma)$
$p(\mu, \Sigma) = \text{NIW}$ (normal inverse Wishart)
$d$ is one of 600,000 internet domains (blah.com).
$y_{di}$ indicates whether the ad on domain $d$ was clicked at impression $i$.
$x_{di}$ contains details about ad characteristics: fonts, colors, etc.
$x_{di}$ has roughly [...] dimensions.
NOTE: The model is (somewhat) pedagogical. More sophisticated shrinkage is needed.
17 The obvious algorithm
Algorithm: Alternate between the following steps many times.
  Map: Each worker simulates $p(\beta_d \mid y_d, \mu, \Sigma)$.
  Reduce: Master node simulates $p(\mu, \Sigma \mid \beta)$.
Theoretically identical to the standard single machine MCMC.
Works great on HPC/GPU.
It works terribly in MapReduce.
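The alternating structure of the obvious algorithm can be sketched on a deliberately simplified stand-in. The code below is not the hierarchical logistic regression from the talk: it assumes a Gaussian model, $y_{di} \sim N(\beta_d, 1)$ with $\beta_d \sim N(\mu, \tau^2)$, a known $\tau^2$, and a flat prior on $\mu$, chosen only because both conditional draws are then available in closed form. The names `map_step`, `reduce_step`, `gibbs`, and `TAU2` are hypothetical. The point is the shape of the loop: every iteration needs one synchronization between the workers and the master.

```python
import numpy as np

TAU2 = 1.0  # assumed known between-group variance (hypothetical value)


def map_step(y_d, mu, rng):
    """Worker: draw beta_d from p(beta_d | y_d, mu) for one group."""
    precision = len(y_d) + 1.0 / TAU2
    mean = (y_d.sum() + mu / TAU2) / precision
    return rng.normal(mean, np.sqrt(1.0 / precision))


def reduce_step(betas, rng):
    """Master: draw mu from p(mu | beta) under a flat prior on mu."""
    return rng.normal(betas.mean(), np.sqrt(TAU2 / len(betas)))


def gibbs(groups, n_iter=500, seed=0):
    """Alternate Map and Reduce draws; returns the sampled values of mu."""
    rng = np.random.default_rng(seed)
    mu = 0.0
    mu_draws = np.empty(n_iter)
    for t in range(n_iter):
        # Map: on a cluster, each call would run on a different worker.
        betas = np.array([map_step(y_d, mu, rng) for y_d in groups])
        # Reduce: one round of communication per iteration.
        mu = reduce_step(betas, rng)
        mu_draws[t] = mu
    return mu_draws
```

On a single machine the loop is cheap. In a MapReduce setting the same loop pays a coordination cost at every one of the `n_iter` iterations, which is the overhead the talk identifies.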
18 SuperStep Times
The job with more workers has:
  Less time per step drawing $\beta_d$.
  Same amount of work drawing $\mu, \Sigma$.
But the $\mu, \Sigma$ draw takes more time! Extra time is spent coordinating the extra machines.
[Table: Workers, Hours]
Both too slow!
19 The MapReduce paradox
We started off compute constrained.
We parallelized.
We wound up using essentially 0% of the compute resources on the worker machines.
Conclusion: For Bayesian methods to work in a MapReduce / Hadoop environment, we need algorithms that require very little communication.
20 Consensus Monte Carlo
Embarrassingly parallel: only one communication.
  Partition the data $y$ into shards $y_1, \ldots, y_S$.
  Run a separate Monte Carlo on each shard-level posterior $p(\theta \mid y_s)$. (MCMC, SMC, QMC, MC, ...)
  Combine the draws to form a consensus posterior distribution. (Like doing a meta-analysis.)
This should work great, because most Bayesian problems can be written
  $p(\theta \mid y) \propto \prod_{s=1}^{S} p(\theta \mid y_s)$
Complication: You have draws from $p(\theta \mid y_s)$, not the distribution itself.
Benefits: No communication overhead. You can use existing software.
21 Two issues
What to do about the prior?
  Fractionate: $p_s(\theta) = p(\theta)^{1/S}$ (reasonable, but beware impropriety and consider shrinkage).
  Adjust by multiplying / dividing working priors.
How to form the consensus?
  Average [Scott et al. (2013)], [Wang and Dunson (2013)]
  Other methods:
    Importance resampling (SFS?) [Huang and Gelman (2005)]
    Kernel estimates [Neiswanger et al. (2013)]
    Parametric approximation [Huang and Gelman (2005)] [Machine Learning]
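Why fractionating the prior lets the shard-level posteriors multiply back to the full posterior can be seen in a short derivation. It assumes the shards are conditionally independent given $\theta$, which the slides do not state explicitly:

```latex
\begin{align*}
p(\theta \mid y)
  &\propto p(\theta) \prod_{s=1}^{S} p(y_s \mid \theta) \\
  &= \prod_{s=1}^{S} \left[ p(y_s \mid \theta)\, p(\theta)^{1/S} \right] \\
  &\propto \prod_{s=1}^{S} p(\theta \mid y_s),
  \qquad \text{where } p(\theta \mid y_s) \propto p(y_s \mid \theta)\, p_s(\theta),
  \quad p_s(\theta) = p(\theta)^{1/S}.
\end{align*}
```

If each shard instead used the full prior $p(\theta)$, the product would contain $S$ copies of the prior, which matters whenever the prior carries meaningful information.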
22 Averaging individual draws? The Gaussian case
If $p_1$ and $p_2$ are normal, then $p_1 p_2$ behaves like prior $\times$ likelihood.
If $z \sim p_1 p_2$ then
  $z \sim N\left( \frac{\mu_1/\sigma_1^2 + \mu_2/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2},\ \frac{1}{1/\sigma_1^2 + 1/\sigma_2^2} \right)$
If $x \sim p_1$ and $y \sim p_2$, then $w_1 x + w_2 y \sim p_1 p_2$ (where $w_i \propto 1/\sigma_i^2$, normalized so that $w_1 + w_2 = 1$).
This requires knowing the weights to use in the average, but
  1. Sometimes you know them.
  2. We've got a full MC for each shard, so we can easily estimate within-shard variances.
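The weighted average of draws can be sketched for a scalar parameter as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `consensus_average` is hypothetical, the parameter is one-dimensional, and the weights are estimated from the within-shard Monte Carlo variances as the second point above suggests.

```python
import numpy as np


def consensus_average(shard_draws):
    """Combine per-shard draws of a scalar theta by precision-weighted averaging.

    shard_draws: array of shape (S, G), with G draws from each of S
    shard-level posteriors.
    """
    shard_draws = np.asarray(shard_draws, dtype=float)
    # Weight each shard by the inverse of its estimated posterior variance.
    weights = 1.0 / shard_draws.var(axis=1, ddof=1)
    weights /= weights.sum()
    # The g-th consensus draw is the weighted average of the g-th draw
    # from every shard.
    return weights @ shard_draws


# Example: Gaussian data with known unit variance and a flat prior on the
# mean, so the shard posterior is N(ybar_s, 1 / n_s) and no prior
# adjustment is needed. Shard sizes are deliberately unequal.
rng = np.random.default_rng(0)
shards = [rng.normal(3.0, 1.0, size=n) for n in (50, 200, 750)]
draws = np.array(
    [rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=20_000) for y in shards]
)
consensus = consensus_average(draws)
y_all = np.concatenate(shards)
```

In this Gaussian example the consensus draws should have mean close to `y_all.mean()` and variance close to `1 / len(y_all)`, which is the single-machine posterior. Outside the Gaussian case the average is only an approximation.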
23 Averaging vs. other methods
Averaging is robust.
  Not susceptible to infinite resampling weights or the curse of dimensionality.
  Can't capture multimodality or discrete spaces.
Kernel estimates will break down in high dimensions.
  Despite claims of being asymptotically exact. Beware the constant in $O(\cdot)$.
Importance resampling is probably the long term answer.
  Details still need to be worked out. What to use for importance weights?
  Worker-level posteriors can be widely separated, and not overlap the true posterior, in which case resampling would fail.
Parametric models
  Variational Bayes gets the marginal histograms right, but not the joint distribution.
  Parametric assumptions are reasonable if they're reasonable.
24 Binomial
00 Bernoulli outcomes, with one success, 0 workers.
[Figure: density plots comparing Single Machine, Consensus, and Another Single Machine]
This beta distribution is highly skewed (non-Gaussian) but it still works.
25 Unbalanced is okay
[Figure: density plots comparing Single Machine, Consensus, and Another Single Machine]
5 workers with (0,,, 70, 0) observations.
Weights (proportional to n) handle information uncertainty correctly.
26 Be careful with the prior
[Figure: density plots comparing Single Machine, Consensus, and Another Single Machine]
Beta prior: Distribute information ("fake data") evenly across shards.
Notion of a natural scale is still a mystery.
If S copies of the prior contain meaningful information then you need to be careful. If not, then you can afford to be sloppy.
27 Logistic regression
Test data has 5 binary predictor variables.
[Table: frequency and coefficient for $x_1$, $x_2$, $x_3$, $x_4$, $x_5$]
Last one is rare, but highly predictive when it occurs.
28 Data
[Table: columns $y$, $n$, $x_1$, $x_2$, $x_3$, $x_4$, $x_5$]
29 [Figure: panels for $x_1$ to $x_5$ comparing overall, matrix, scalar, and equal weighting; 0 obs/worker, 0 workers]
30 Why equal weighting fails
Even though workers have identical sampling distributions.
Worker-level posteriors for $\beta$
31 [Figure: panels for $x_1$ to $x_5$ comparing overall, matrix, and scalar weighting; 00 obs / worker, 0 workers]
32 [Figure: panels for $x_1$ to $x_5$ comparing overall, matrix, and scalar weighting; ,000 obs / worker, 0 workers]
33 Hierarchical Poisson regression
4 million observations (ad impressions).
Predictors include a quality score and indicators for 6 different ad formats.
Coefficients vary by advertiser (roughly [...],000 advertisers here).
Data sharded by advertiser. 867 shards.
No shard has more than [...] thousand observations. Median shard size is 7,000 observations.
34 Hierarchical Poisson regression
Posterior for hypers given 6 data shards.
[Figure: overall vs. consensus posterior for Intercept, Quality, and the ad format (AF) coefficients]
Very close, even on joint distribution.
Parameters are embarrassingly parallel, given marginal for hypers. Can get the parameters in one more Map step.
35 Compute time for hierarchical models
36 BART [Chipman, George, McCulloch (2010)]
$y_i = \sum_j f_j(x_i) + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
Each $f_j$ is a tree.
Priors keep the trees small.
"Sum of weak learners" idea akin to boosting.
Trees have means in the leaves.
Can't average trees, but we can average $\hat{f}(x)$.
37 Consensus BART vs BART
Fits vs Friedman's test function. [...] shards, [...]K total observations.
[Figure: consensus BART plotted against single-machine BART (mean, 5%, 95%), with and without prior adjustment]
Columns: mean, and (.05, .975) quantiles. Consensus vs. single machine.
Top row: same prior (nearly perfect match).
Bottom row: fractionated prior (fractionated prior gives overdispersed posterior).
38 Conclusions
Bayesian ideas are important in big data problems for the same reasons they're important in small data problems.
Naive parallelizations of single machine algorithms are not effective in a MapReduce environment (too much coordination).
Consensus Monte Carlo
  Good
    Requires only one communication.
    Reasonably good approximation.
    Can use existing code.
  Bad
    No theoretical guarantees yet (in non-Gaussian case).
    Averaging only works where averaging makes sense. Not good for discrete parameters, spike-and-slab, etc.
    Need to work out resampling theory.
39 References
Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics 4.
Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI '04: Sixth Symposium on Operating System Design and Implementation.
Huang, Z. and Gelman, A. (2005). Sampling for Bayesian computation with large datasets (unpublished).
Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. (2010). Pregel: A system for large-scale graph processing. In SIGMOD.
Neiswanger, W., Wang, C., and Xing, E. (2013). Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint.
Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2013). Bayes and big data: The consensus Monte Carlo algorithm.
Wang, X. and Dunson, D. B. (2013). Parallel MCMC via Weierstrass sampler. arXiv preprint.
Bayes and Big Data: The Consensus Monte Carlo Algorithm
Bayes and Big Data: The Consensus Monte Carlo Algorithm Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George 3, and Robert E. McCulloch 4 Google, Inc. Acadia University
More informationSampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data
Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian
More informationEvaluating partitioning of big graphs
Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist fhallb@kth.se, candef@kth.se, mickeso@kth.se Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed
More informationPS 271B: Quantitative Methods II. Lecture Notes
PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.
More informationComputational Statistics for Big Data
Lancaster University Computational Statistics for Big Data Author: 1 Supervisors: Paul Fearnhead 1 Emily Fox 2 1 Lancaster University 2 The University of Washington September 1, 2015 Abstract The amount
More informationParametric Models Part I: Maximum Likelihood and Bayesian Density Estimation
Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015
More informationBayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
More informationWhy Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
More informationSoftware tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia Antipolis SCALE (exoasis) Team
Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia Antipolis SCALE (exoasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationBayesian Statistics in One Hour. Patrick Lam
Bayesian Statistics in One Hour Patrick Lam Outline Introduction Bayesian Models Applications Missing Data Hierarchical Models Outline Introduction Bayesian Models Applications Missing Data Hierarchical
More informationCOMP 598 Applied Machine Learning Lecture 21: Parallelization methods for largescale machine learning! Big Data by the numbers
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for largescale machine learning! Instructor: (jpineau@cs.mcgill.ca) TAs: PierreLuc Bacon (pbacon@cs.mcgill.ca) Ryan Lowe (ryan.lowe@mail.mcgill.ca)
More informationDistributed Bayesian Posterior Sampling via Moment Sharing
Distributed Bayesian Posterior Sampling via Moment Sharing Minjie Xu 1, Balaji Lakshminarayanan 2, Yee Whye Teh 3, Jun Zhu 1, and Bo Zhang 1 1 State Key Lab of Intelligent Technology and Systems; Tsinghua
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation:  Feature vector X,  qualitative response Y, taking values in C
More informationLab 8: Introduction to WinBUGS
40.656 Lab 8 008 Lab 8: Introduction to WinBUGS Goals:. Introduce the concepts of Bayesian data analysis.. Learn the basic syntax of WinBUGS. 3. Learn the basics of using WinBUGS in a simple example. Next
More informationProbabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014
Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about
More informationBig data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone
Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it Dan Ariely MYSQL AND HBASE ECOSYSTEM
More information2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)
2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationBasics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar
More informationThe Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia
The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit
More information1 Choosing the right data mining techniques for the job (8 minutes,
CS490D Spring 2004 Final Solutions, May 3, 2004 Prof. Chris Clifton Time will be tight. If you spend more than the recommended time on any question, go on to the next one. If you can t answer it in the
More informationLogistic Regression (1/24/13)
STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used
More informationSpatial Statistics Chapter 3 Basics of areal data and areal data modeling
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data
More informationTowards running complex models on big data
Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation
More informationMachine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? Web data (web logs, click histories) ecommerce applications (purchase histories) Retail purchase histories
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationParallelization Strategies for Multicore Data Analysis
Parallelization Strategies for Multicore Data Analysis WeiChen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management
More informationInvited Applications Paper
Invited Applications Paper   Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK THOREG@MICROSOFT.COM JOAQUINC@MICROSOFT.COM
More informationCampaign and Ad Group Management. Google AdWords Fundamentals
Campaign and Ad Group Management Google AdWords Fundamentals Question: When a Campaign is Pending what does this mean? Question: When a Campaign is Pending what does this mean? ANSWER: IT MEANS THE CAMPAIGN
More information10601. Machine Learning. http://www.cs.cmu.edu/afs/cs/academic/class/10601f10/index.html
10601 Machine Learning http://www.cs.cmu.edu/afs/cs/academic/class/10601f10/index.html Course data All uptodate info is on the course web page: http://www.cs.cmu.edu/afs/cs/academic/class/10601f10/index.html
More informationMapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12
MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:
More informationBSPCloud: A Hybrid Programming Library for Cloud Computing *
BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,
More informationThe primary goal of this thesis was to understand how the spatial dependence of
5 General discussion 5.1 Introduction The primary goal of this thesis was to understand how the spatial dependence of consumer attitudes can be modeled, what additional benefits the recovering of spatial
More informationLARGESCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.
LARGESCALE GRAPH PROCESSING IN THE BIG DATA WORLD Dr. Buğra Gedik, Ph.D. MOTIVATION Graph data is everywhere Relationships between people, systems, and the nature Interactions between people, systems,
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationHT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
More informationIBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
More informationMarkov Chain Monte Carlo Simulation Made Simple
Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical
More informationBayesian networks  Timeseries models  Apache Spark & Scala
Bayesian networks  Timeseries models  Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup  November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
More informationBig data challenges for physics in the next decades
Big data challenges for physics in the next decades David W. Hogg Center for Cosmology and Particle Physics, New York University 2012 November 09 punchlines Huge data sets create new opportunities. they
More informationOptimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
More informationBig Data Big Deal? Salford Systems www.salfordsystems.com
Big Data Big Deal? Salford Systems www.salfordsystems.com 2015 Copyright Salford Systems 20102015 Big Data Is The New In Thing Google trends as of September 24, 2015 Difficult to read trade press without
More informationApplications of R Software in Bayesian Data Analysis
Article International Journal of Information Science and System, 2012, 1(1): 723 International Journal of Information Science and System Journal homepage: www.modernscientificpress.com/journals/ijinfosci.aspx
More informationMarketing Mix Modelling and Big Data P. M Cain
1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored
More informationImperfect Debugging in Software Reliability
Imperfect Debugging in Software Reliability Tevfik Aktekin and Toros Caglar University of New Hampshire Peter T. Paul College of Business and Economics Department of Decision Sciences and United Health
More informationIBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
More informationData Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
More informationAdvanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
More informationNETWORK. Google Display ADVERTISING. PPC Associates. Impression Sculpting: How to Master
Impression Sculpting: How to Master Google Display NETWORK ADVERTISING By Will Lin, CoFounder & CEO, PPC Associates PPC Associates h o l i s t i c o n l i n e m a r k e t i n g Will Lin CoFounder, PPC
More informationPredictive Modeling Techniques in Insurance
Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics
More informationStatistical Challenges with Big Data in Management Science
Statistical Challenges with Big Data in Management Science Arnab Kumar Laha Indian Institute of Management Ahmedabad Analytics vs Reporting Competitive Advantage Reporting Prescriptive Analytics (Decision
More information1 Prior Probability and Posterior Probability
Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which
More informationA Bootstrap MetropolisHastings Algorithm for Bayesian Analysis of Big Data
A Bootstrap MetropolisHastings Algorithm for Bayesian Analysis of Big Data Faming Liang University of Florida August 9, 2015 Abstract MCMC methods have proven to be a very powerful tool for analyzing
More informationClassification Problems
Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems
More informationAuxiliary Variables in Mixture Modeling: 3Step Approaches Using Mplus
Auxiliary Variables in Mixture Modeling: 3Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives
More informationFrom the help desk: Bootstrapped standard errors
The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution
More informationGraph Processing and Social Networks
Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph