Bootstrapping Big Data

Ariel Kleiner    Ameet Talwalkar    Purnamrita Sarkar    Michael I. Jordan
Computer Science Division, University of California, Berkeley
{akleiner, ameet, psarkar, jordan}@cs.berkeley.edu

Abstract

The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets, the computation of bootstrap-based quantities can be prohibitively computationally demanding. As an alternative, we introduce the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a more computationally efficient, though still robust, means of quantifying the quality of estimators. BLB shares the generic applicability and statistical efficiency of the bootstrap and is furthermore well suited for application to very large datasets using modern distributed computing architectures, as it uses only small subsets of the observed data at any point during its execution. We provide both empirical and theoretical results which demonstrate the efficacy of BLB.

1 Introduction

Assessing the quality of estimates based upon finite data is a task of fundamental importance in data analysis. For example, when estimating a vector of model parameters given a training dataset, it is useful to be able to quantify the uncertainty in that estimate (e.g., via a confidence region), its bias, and its risk. Such quality assessments provide far more information than a simple point estimate itself and can be used to improve human interpretation of inferential outputs, perform hypothesis testing, do bias correction, make more efficient use of available resources (e.g., by ceasing to process data when the confidence region is sufficiently small), perform active learning, and do feature selection, among many other potential uses.

Accurate assessment of estimate quality has been a longstanding concern in statistics. A great deal of classical work in this vein has proceeded via asymptotic analysis, which relies on deep study of particular classes of estimators in particular settings to obtain analytic approximations to particular quality measures (such as confidence regions) [4]. While this approach ensures asymptotic correctness and allows analytic computation, it is limited to cases in which the relevant asymptotic analysis is tractable and has actually been performed. In contrast, recent decades have seen greater focus on more automatic methods, which generally require significantly less analysis, at the expense of doing more computation. The bootstrap [2, 3] is perhaps the best known and most widely used among these methods, due to its simplicity and generic applicability. Efforts to ameliorate statistical shortcomings of the bootstrap in turn led to the development of related methods such as the m out of n bootstrap and subsampling [1, 4].

Despite the computational demands of this more automatic methodology, advancements have been driven primarily by a desire to remedy shortcomings in statistical correctness, while largely ignoring computational considerations such as processing time, space, communication, and parallelism. However, with the advent of increasingly large datasets and diverse sets of often complex and exploratory queries, computational considerations and automation (i.e., lack of reliance on deep analysis of the specific estimator and setting of interest) are increasingly important.
Indeed, even as the amount of available data grows, the number of parameters to be estimated and the number of potential sources of bias often also grow, leading to a need to be able to tractably assess estimator quality in the setting of large data. Thus, unlike previous work on estimator quality assessment, here we directly study both the accuracy and the computational costs of state-of-the-art automatic methods such as the bootstrap and its relatives. The bootstrap, despite its simplicity of implementation and very general statistical correctness, has relatively high computational costs. We also find that its relatives, such as the m out of n bootstrap and subsampling, have lesser computational costs, as expected, but have correctness which is sensitive to hyperparameters (such as the number of subsampled data points) that must be specified a priori; additionally, these methods generally require the use of more prior information (such as rates of convergence of estimators) than the bootstrap.

Motivated by these observations, we introduce a new procedure, the Bag of Little Bootstraps (BLB), which functions by combining the results of bootstrapping multiple small subsets of a larger original dataset. As we demonstrate below, BLB has a significantly more favorable computational profile than the bootstrap while being more robust than existing alternatives such as the m out of n bootstrap and subsampling. Additionally, our procedure maintains the generic applicability and favorable statistical properties of the bootstrap and is well suited to implementation on modern distributed and parallel computing architectures.

2 Setting, Notation, and a Bit of Related Work

We assume that we observe a sample X_1, ..., X_n drawn i.i.d. from some true (unknown) underlying distribution P. Based only on this observed data, we estimate a quantity θ̂_n = θ(X_1, ..., X_n) = θ(P_n), where P_n = (1/n) Σ_{i=1}^n δ_{X_i} is the empirical distribution of X_1, ..., X_n. The true (unknown) population value of the estimated quantity is θ(P). For example, θ might compute a measure of correlation, a classifier's parameters, or the prediction accuracy of a trained model. Noting that θ̂_n is a random quantity because it is based on n random observations, we define Q_n(P) ∈ Q as the true underlying distribution of θ̂_n, which is determined by both P and the form of θ. Our end goal is the computation of some metric ξ(Q_n(P)), where ξ : Q → Ξ for Ξ a vector space, which informatively summarizes Q_n(P). For instance, ξ might compute a confidence region, a standard error, or a bias. In practice, we do not have direct knowledge of P or Q_n(P), and so we must estimate ξ(Q_n(P)) itself based only on the observed data and knowledge of θ.

Using this notation, the bootstrap [2, 3] simply forms the data-driven plugin approximation ξ(Q_n(P)) ≈ ξ(Q_n(P_n)). While ξ(Q_n(P_n)) cannot be computed exactly in most cases, it is generally amenable to straightforward Monte Carlo approximation via the following simple algorithm: repeatedly resample n points i.i.d. from P_n, compute θ on each resample, form the empirical distribution Q*_n of the computed θ's, and approximate ξ(Q_n(P)) ≈ ξ(Q*_n). Though conceptually simple and powerful, this procedure requires repeated computation of θ on resamples having size comparable to that of the original dataset. If the original dataset is large, then this repeated computation can be prohibitively costly.
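To make this Monte Carlo approximation concrete, here is a minimal Python sketch (not from the original paper): it assumes NumPy, takes the estimator θ as a function theta, and takes ξ to be a component-wise standard error; all of these choices are illustrative.

import numpy as np

def bootstrap_xi(data, theta, num_resamples=100, seed=None):
    """Monte Carlo approximation of xi(Q_n(P_n)) for the bootstrap.

    xi is taken here to be the standard error of theta, estimated from
    num_resamples size-n resamples drawn i.i.d. from the empirical
    distribution P_n (i.e., rows sampled with replacement).
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = [theta(data[rng.integers(0, n, size=n)])
                 for _ in range(num_resamples)]
    return np.std(estimates, axis=0, ddof=1)

# Example: standard error of the mean of 100,000 observations.
data = np.random.default_rng(0).normal(size=100_000)
print(bootstrap_xi(data, np.mean, seed=1))

Note that each call to theta here receives a full size-n resample; this is exactly the cost that BLB, described next, avoids.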
3 Bag of Little Bootstraps (BLB)

The Bag of Little Bootstraps (Algorithm 1) functions by averaging the results of bootstrapping multiple small subsets of X_1, ..., X_n. As a result, it inherits the statistical correctness of the bootstrap, while in most cases having the reduced computational costs of the m out of n bootstrap and subsampling. Additionally, as we show in experiments below, BLB is significantly more robust than these procedures to the choice of subset size. BLB proceeds by repeatedly subsampling b < n points from X_1, ..., X_n. Each subsample defines an empirical distribution P^(j) based on those b points, which is used to compute a Monte Carlo approximation to ξ(Q_n(P^(j))) via repeated resampling of n points from P^(j). Having obtained one such Monte Carlo approximation per subsample, we then average them to obtain a final, improved estimate of ξ(Q_n(P)). As we show both empirically and theoretically, BLB's output converges (generally quickly) as we increase s, the number of subsamples (or chunks); also, fairly modest values of r, the number of Monte Carlo resamples, seem to be sufficient (e.g., 100 in our experiments).

Algorithm 1: Bag of Little Bootstraps (BLB)

Input: data X_1, ..., X_n
       θ: estimator of interest
       ξ: estimator quality assessment
       b: subsample size
       s: number of subsamples
       r: number of Monte Carlo iterations
Output: an estimate of ξ(Q_n(P))

for j = 1 to s do
    // Subsample the data
    Randomly sample a set I of b indices from {1, ..., n} without replacement
    [or, choose I to be a disjoint chunk of size b from a predefined random
    partition of {1, ..., n}]
    P^(j) ← (1/b) Σ_{i ∈ I} δ_{X_i}
    // Approximate ξ(Q_n(P^(j)))
    for k = 1 to r do
        Sample X_{k,1}, ..., X_{k,n} i.i.d. from P^(j)
        θ̂_{n,k} ← θ(X_{k,1}, ..., X_{k,n})
    end
    Q_{n,j} ← (1/r) Σ_{k=1}^r δ_{θ̂_{n,k}}
    ξ_{n,j} ← ξ(Q_{n,j})
end
// Average the estimates of ξ(Q_n(P)) computed on the different subsamples
return (1/s) Σ_{j=1}^s ξ_{n,j}

Similarly to the bootstrap, each outer iteration of BLB computes an approximation to ξ(Q_n(P)) based on a plugin estimate ξ(Q_n(P^(j))). In contrast to the bootstrap, the plugin estimator used by each outer iteration of BLB is based on P^(j), which is more compact and hence generally less computationally demanding than the full empirical distribution P_n. The empirical distribution P^(j) is, however, inferior to P_n as an approximation to the true underlying distribution P. Therefore, BLB averages across multiple different realizations of P^(j) to improve the quality of the final result.
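The following Python sketch implements Algorithm 1 under illustrative assumptions that are ours rather than the paper's: NumPy is used throughout, each size-n resample from P^(j) is drawn as a multinomial count vector over the b distinct subsampled points (so it occupies only O(b) space, as discussed below), and θ is assumed to accept such a weighted representation.

import numpy as np

def blb_xi(data, theta, xi, b, s, r, seed=None):
    """Bag of Little Bootstraps (Algorithm 1), sketched with NumPy.

    theta(points, counts) computes the estimate on a weighted dataset;
    xi(estimates) maps the r estimates from one subsample to a quality
    assessment. The returned value averages xi over the s subsamples.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    xi_values = []
    for _ in range(s):
        # Subsample b points without replacement to form P^(j).
        subsample = data[rng.choice(n, size=b, replace=False)]
        estimates = []
        for _ in range(r):
            # n i.i.d. draws from P^(j), stored as counts over the b points.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates.append(theta(subsample, counts))
        xi_values.append(xi(estimates))
    # Average the estimates of xi(Q_n(P)) from the different subsamples.
    return np.mean(xi_values, axis=0)

# Example: standard error of a weighted mean, with b = n^0.6.
rng = np.random.default_rng(0)
data = rng.normal(size=20_000)
se = blb_xi(data,
            theta=lambda x, w: np.average(x, weights=w),
            xi=lambda ests: np.std(ests, ddof=1),
            b=int(20_000 ** 0.6), s=10, r=100, seed=1)
print(se)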
Delving more deeply into the computational requirements of BLB, note that each size-n resample from P^(j) contains at most b distinct data points, each of which may be replicated multiple times. Thus, we can represent each resample using only space O(b) by simply maintaining the b < n distinct points with associated counts. If θ can compute directly on this weighted representation (i.e., if θ's computational requirements scale only in the number of distinct data points presented to it), then we achieve corresponding savings in required space and computation time. This property does indeed hold for many if not most commonly used θ's, including M-estimators such as linear and kernel regression, logistic regression, and Support Vector Machines, among many others.
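As an illustration of computing on this weighted representation, the sketch below fits θ (here a logistic regression) on only the b distinct points of a size-n BLB resample by passing the replication counts as weights; the use of scikit-learn and its sample_weight argument is our illustrative choice, not something specified in the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, b, d = 20_000, 400, 10

# b distinct subsampled points: features and binary labels.
X_sub = rng.normal(size=(b, d))
y_sub = (X_sub @ rng.normal(size=d) + 0.5 * rng.normal(size=b) > 0).astype(int)

# One size-n resample from P^(j), represented as counts over the b points.
counts = rng.multinomial(n, np.full(b, 1.0 / b))

# The fit touches only b rows, yet its weighted objective matches that of
# fitting on the n replicated points.
model = LogisticRegression(max_iter=1000).fit(X_sub, y_sub, sample_weight=counts)
print(model.coef_)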

As we show in our experiments and analysis, b can generally be chosen to be much smaller than n: for example, we might take b = n^γ where γ ∈ [0.5, 1]. As a result, each computation of θ by BLB can be significantly faster than each computation of θ by the bootstrap. Indeed, a simple and standard calculation [3] shows that each resample generated by the bootstrap contains approximately 0.632n distinct points, which is quite large if n is large. More concretely, if, for instance, n = 1,000,000, then a bootstrap resample would contain approximately 632,000 distinct points; in contrast, with b = n^0.6, each BLB subsample and resample would contain at most 3,981 distinct points. Assuming a data point size of 1 MB, the full dataset would occupy 1 TB of storage space, a bootstrap resample would occupy approximately 632 GB, and each BLB subsample or resample would occupy approximately 4 GB. Thus, BLB only requires repeated computation on small subsets of the original dataset and avoids the bootstrap's crippling need for repeated computation of θ on resamples having size comparable to that of the original dataset. Due to its much smaller subsample and resample space requirements, BLB is also significantly more amenable to distribution of different subsamples and resamples and their associated computations to independent compute nodes or racks of nodes, thus allowing for simple distributed and parallel implementations.

4 Experiments

We investigate the empirical performance characteristics of BLB and compare to existing methods via experiments on simulated data. Use of simulated data is necessary here because it allows knowledge of Q_n(P) and hence of ξ(Q_n(P)) (i.e., of ground truth). For different datasets and estimation tasks, we study the convergence properties of BLB as well as of the bootstrap, the m out of n bootstrap, and subsampling. We consider two different tasks and forms for θ: regression and classification. For both settings, the data have the form X_i = (X̃_i, Y_i) ~ P, i.i.d. for i = 1, ..., n, where X̃_i ∈ R^d; Y_i ∈ R for regression, whereas Y_i ∈ {0, 1} for classification. We use n = 20,000 for the plots shown, and d is set to 100 for regression and 10 for classification.

Figure 1: Relative error vs. processing time for BLB, the bootstrap, and the b out of n bootstrap (BOFN) on two different problems. For both BLB and BOFN, b = n^γ, with the value of γ for each trajectory given in the legend. (Two lefthand plots) Regression setting with a linear data generating distribution and Gamma X̃_i distribution; the leftmost plot shows BLB, the other BOFN. (Two righthand plots) Classification setting with a linear data generating distribution and Gamma X̃_i distribution; the lefthand plot shows BLB, the other BOFN.

In each case, θ estimates a linear parameter vector in R^d (via either least squares or logistic regression) for a model of the mapping between X̃_i and Y_i. We define ξ as computing a set of marginal 95% confidence intervals, one for each element of the estimated parameter vector (averaging across ξ's consists of averaging element-wise interval boundaries; a sketch of one such realization appears after the results discussion below). To evaluate the various quality assessment procedures on a given estimation task and true underlying data distribution P, we first compute the ground truth ξ(Q_n(P)) based on 2,000 realizations of datasets of size n from P. Then, for an independent dataset realization of size n from the true underlying distribution, we run each quality assessment procedure and record the estimate of ξ(Q_n(P)) produced after each iteration (e.g., after each bootstrap resample or BLB subsample is processed), as well as the cumulative time required to produce that estimate. Every such estimate is evaluated based on the average (across dimensions) relative deviation of its component-wise confidence interval widths from the corresponding true widths. We repeat this process on five independent dataset realizations of size n and average the resulting relative errors and corresponding times across these five datasets to obtain a trajectory of relative error versus time for each quality assessment procedure (the trajectories' variances are relatively small). To maintain consistency of notation, we henceforth refer to the m out of n bootstrap as the b out of n bootstrap. For BLB, the b out of n bootstrap, and subsampling, we consider subsample sizes b = n^γ where γ ∈ {0.5, 0.6, 0.7, 0.8, 0.9}; we use r = 100 in all runs of BLB.

Figure 1 shows results for both the regression and classification settings. Here the true underlying distribution P generates each coordinate of each X̃_i independently from a different Gamma distribution, and it generates Y_i from X̃_i via a noisy linear model. As seen in the figure, in the regression setting BLB (first plot) succeeds in converging to low relative error significantly more quickly than the bootstrap, for all values of b considered. In contrast, the b out of n bootstrap (second plot) fails to converge to low relative error for smaller values of b. In the classification setting, BLB (third plot) converges to relative error comparable to that of the bootstrap for b > n^0.6, while converging to higher relative errors for the smallest values of b considered. For larger values of b, which are still significantly smaller than n, we again converge to low relative error more quickly than the bootstrap. We are also once again more robust than the b out of n bootstrap (fourth plot), which fails to converge to low relative error for b ≤ n^0.7. In fact, even for b ≤ n^0.6, BLB's performance is superior to that of the b out of n bootstrap.
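For concreteness, one realization of the ξ and error metric used in these experiments is sketched below; the use of percentile intervals is our assumption, since the text does not specify here how the marginal intervals are formed.

import numpy as np

def xi_marginal_ci(estimates, alpha=0.05):
    """Marginal CIs: a (2, d) array of lower/upper bounds per coordinate."""
    est = np.asarray(estimates)  # shape (r, d): r estimates of a vector in R^d
    return np.percentile(est, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)

def average_xi(xi_values):
    """Average element-wise interval boundaries across the s subsamples."""
    return np.mean(xi_values, axis=0)

def relative_error(ci, ci_true):
    """Average (across dimensions) relative deviation of interval widths."""
    widths, true_widths = ci[1] - ci[0], ci_true[1] - ci_true[0]
    return np.mean(np.abs(widths - true_widths) / true_widths)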
For the aforementioned cases in which BLB does not match the relative error of the bootstrap, additional empirical results (not shown here) and our theoretical analysis indicate that this discrepancy in relative error diminishes as n increases. Identical evaluation of subsampling (plots not shown) shows that it performs strictly worse than the b out of n bootstrap. Qualitatively similar results also hold in both the regression and classification settings when P generates X̃_i from either Normal or Student's t distributions, and when P uses a non-linear noisy mapping between X̃_i and Y_i (so that θ learns a misspecified model). These experiments demonstrate the improved computational efficiency of BLB relative to the bootstrap, the fact that BLB maintains statistical correctness, and the improved robustness of BLB to the choice of b. Note that all speedups observed for BLB over the bootstrap in our experiments are achieved without parallelization, though better parallelizability is indeed another significant advantage of BLB. See the extended version of this paper for theoretical results showing that BLB is consistent and shares the higher-order correctness (i.e., favorable statistical convergence rate) of the bootstrap, additional experimental results, and further discussion of BLB and related prior work.

References

[1] P. Bickel, F. Götze, and W. van Zwet. Resampling fewer than n observations: gains, losses and remedies for losses. Statistica Sinica, 1997.

[2] B. Efron. Bootstrap methods: another look at the jackknife. Annals of Statistics, 1979.

[3] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.

[4] D. Politis, J. Romano, and M. Wolf. Subsampling. Springer, 1999.
