Bootstrapping Big Data


Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, Michael I. Jordan
Computer Science Division, University of California, Berkeley
{akleiner, ameet, psarkar,

Abstract

The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets, the computation of bootstrap-based quantities can be prohibitively computationally demanding. As an alternative, we introduce the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a more computationally efficient, though still robust, means of quantifying the quality of estimators. BLB shares the generic applicability and statistical efficiency of the bootstrap and is furthermore well suited for application to very large datasets using modern distributed computing architectures, as it uses only small subsets of the observed data at any point during its execution. We provide both empirical and theoretical results which demonstrate the efficacy of BLB.

1 Introduction

Assessing the quality of estimates based upon finite data is a task of fundamental importance in data analysis. For example, when estimating a vector of model parameters given a training dataset, it is useful to be able to quantify the uncertainty in that estimate (e.g., via a confidence region), its bias, and its risk. Such quality assessments provide far more information than a simple point estimate itself and can be used to improve human interpretation of inferential outputs, perform hypothesis testing, do bias correction, make more efficient use of available resources (e.g., by ceasing to process data when the confidence region is sufficiently small), perform active learning, and do feature selection, among many more potential uses.

Accurate assessment of estimate quality has been a longstanding concern in statistics.
A great deal of classical work in this vein has proceeded via asymptotic analysis, which relies on deep study of particular classes of estimators in particular settings to obtain analytic approximations to particular quality measures (such as confidence regions) [4]. While this approach ensures asymptotic correctness and allows analytic computation, it is limited to cases in which the relevant asymptotic analysis is tractable and has actually been performed. In contrast, recent decades have seen greater focus on more automatic methods, which generally require significantly less analysis, at the expense of doing more computation. The bootstrap [2, 3] is perhaps the best known and most widely used among these methods, due to its simplicity and generic applicability. Efforts to ameliorate statistical shortcomings of the bootstrap in turn led to the development of related methods such as the $m$ out of $n$ bootstrap and subsampling [1, 4].

Despite the computational demands of this more automatic methodology, advancements have been driven primarily by a desire to remedy shortcomings in statistical correctness, while largely ignoring computational considerations such as processing time, space, communication, and parallelism. However, with the advent of increasingly large datasets and diverse sets of often complex and exploratory queries, computational considerations and automation (i.e., lack of reliance on deep analysis of the specific estimator and setting of interest) are increasingly important. Indeed, even as the amount of available data grows, the number of parameters to be estimated and the number of potential sources of bias often also grow, leading to a need to be able to tractably assess estimator quality in the setting of large data.

Thus, unlike previous work on estimator quality assessment, here we directly study both the accuracy and the computational costs of state-of-the-art automatic methods such as the bootstrap and its relatives. The bootstrap, despite its simplicity of implementation and very general statistical correctness, has relatively high computational costs. We also find that its relatives, such as the $m$ out of $n$ bootstrap and subsampling, have lesser computational costs, as expected, but have correctness which is sensitive to hyperparameters (such as the number of subsampled data points) that must be specified a priori; additionally, these methods generally require the use of more prior information (such as rates of convergence of estimators) than the bootstrap.

Motivated by these observations, we introduce a new procedure, the Bag of Little Bootstraps (BLB), which functions by combining the results of bootstrapping multiple small subsets of a larger original dataset. As we demonstrate below, BLB has a significantly more favorable computational profile than the bootstrap while being more robust than existing alternatives such as the $m$ out of $n$ bootstrap and subsampling. Additionally, our procedure maintains the generic applicability and favorable statistical properties of the bootstrap and is well suited to implementation on modern distributed and parallel computing architectures.

2 Setting, Notation, and a Bit of Related Work

We assume that we observe a sample $X_1, \ldots, X_n$ drawn i.i.d. from some true (unknown) underlying distribution $P$. Based only on this observed data, we estimate a quantity $\hat{\theta}_n = \theta(X_1, \ldots, X_n) = \theta(\mathbb{P}_n)$, where $\mathbb{P}_n = n^{-1} \sum_{i=1}^{n} \delta_{X_i}$ is the empirical distribution of $X_1, \ldots, X_n$. The true (unknown) population value of the estimated quantity is $\theta(P)$. For example, $\theta$ might compute a measure of correlation, a classifier's parameters, or the prediction accuracy of a trained model. Noting that $\hat{\theta}_n$ is a random quantity because it is based on $n$ random observations, we define $Q_n(P) \in \mathcal{Q}$ as the true underlying distribution of $\hat{\theta}_n$, which is determined by both $P$ and the form of $\theta$.
Our end goal is the computation of some metric $\xi(Q_n(P)) : \mathcal{Q} \to \Xi$, for $\Xi$ a vector space, which informatively summarizes $Q_n(P)$. For instance, $\xi$ might compute a confidence region, a standard error, or a bias. In practice, we do not have direct knowledge of $P$ or $Q_n(P)$, and so we must estimate $\xi(Q_n(P))$ itself based only on the observed data and knowledge of $\theta$.

Using this notation, the bootstrap [2, 3] simply forms the data-driven plug-in approximation $\xi(Q_n(P)) \approx \xi(Q_n(\mathbb{P}_n))$. While $\xi(Q_n(\mathbb{P}_n))$ cannot be computed exactly in most cases, it is generally amenable to straightforward Monte Carlo approximation via the following simple algorithm: repeatedly resample $n$ points i.i.d. from $\mathbb{P}_n$, compute $\theta$ on each resample, form the empirical distribution $\mathbb{Q}_n$ of the computed $\theta$'s, and approximate $\xi(Q_n(P)) \approx \xi(\mathbb{Q}_n)$. Though conceptually simple and powerful, this procedure requires repeated computation of $\theta$ on resamples having size comparable to that of the original dataset. If the original dataset is large, then this repeated computation can be prohibitively costly.

3 Bag of Little Bootstraps (BLB)

The Bag of Little Bootstraps (Algorithm 1) functions by averaging the results of bootstrapping multiple small subsets of $X_1, \ldots, X_n$. As a result, it inherits the statistical correctness of the bootstrap, while in most cases having the reduced computational costs of the $m$ out of $n$ bootstrap and subsampling. Additionally, as we show in experiments below, BLB is significantly more robust than these procedures to the choice of subset size.

BLB proceeds by repeatedly subsampling $b < n$ points from $X_1, \ldots, X_n$. Each subsample defines an empirical distribution $\mathbb{P}^{(j)}$ based on those $b$ points, which is used to compute a Monte Carlo approximation to $\xi(Q_n(\mathbb{P}^{(j)}))$ via repeated resampling of $n$ points from $\mathbb{P}^{(j)}$. Having obtained one such Monte Carlo approximation per subsample, we then average them to obtain a final, improved estimate of $\xi(Q_n(P))$.
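The procedure just described can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the estimator `weighted_mean`, the quality measure `ci_width` (width of a 95% percentile interval), the function name `blb`, and the default values of `gamma`, `s`, and `r` are all hypothetical choices made for the example. Each size-$n$ resample is stored as multinomial counts over the $b$ subsampled points rather than as $n$ explicit draws.

```python
import numpy as np


def weighted_mean(points, counts):
    """Hypothetical estimator theta: a mean computed from points and their counts."""
    return np.average(points, weights=counts)


def ci_width(estimates, level=0.95):
    """Hypothetical quality measure xi: width of a percentile confidence interval."""
    lower, upper = np.quantile(estimates, [(1 - level) / 2, (1 + level) / 2])
    return upper - lower


def blb(data, theta, xi, gamma=0.6, s=20, r=100, rng=None):
    """Average xi over s subsamples of size b = n**gamma, using r resamples each."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    b = int(round(n ** gamma))
    assessments = []
    for _ in range(s):
        # Subsample b of the n points without replacement.
        subsample = data[rng.choice(n, size=b, replace=False)]
        estimates = np.empty(r)
        for k in range(r):
            # A size-n resample drawn i.i.d. from the subsample's empirical
            # distribution is summarized by multinomial counts over its b points.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates[k] = theta(subsample, counts)
        assessments.append(xi(estimates))
    # Average the per-subsample assessments to get the final estimate.
    return float(np.mean(assessments))


rng = np.random.default_rng(0)
data = rng.normal(size=20_000)
width = blb(data, weighted_mean, ci_width, rng=rng)
# For the mean of unit-variance data, normal theory gives a 95% interval
# width of roughly 2 * 1.96 / sqrt(n), printed alongside for comparison.
print(width, 2 * 1.96 / np.sqrt(len(data)))
```

The estimator only ever touches `b` points and a count vector, which is where the space and time savings discussed in this section would come from.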
As we show both empirically and theoretically, BLB's output converges (generally quickly) as we increase $s$, the number of subsamples (or chunks); also, fairly modest values of $r$, the number of Monte Carlo resamples, seem to be sufficient (e.g., 100 in our experiments).

Similarly to the bootstrap, each outer iteration of BLB computes an approximation to $\xi(Q_n(P))$ based on a plug-in estimate $\xi(Q_n(\mathbb{P}^{(j)}))$. In contrast to the bootstrap, the plug-in estimator used by each outer iteration of BLB is based on $\mathbb{P}^{(j)}$, which is more compact and hence generally less computationally demanding than the full empirical distribution $\mathbb{P}_n$. The empirical distribution $\mathbb{P}^{(j)}$ is, however, inferior to $\mathbb{P}_n$ as an approximation to the true underlying distribution $P$. Therefore, BLB averages across multiple different realizations of $\mathbb{P}^{(j)}$ to improve the quality of the final result.

Delving more deeply into the computational requirements of BLB, each size-$n$ resample from $\mathbb{P}^{(j)}$ contains at most $b$ distinct data points, each of which may be replicated multiple times. Thus, we can represent each resample using only space $O(b)$ by simply maintaining the at most $b < n$ distinct points with associated counts. If $\theta$ can compute directly on this weighted representation (i.e., if $\theta$'s computational requirements scale only in the number of distinct data points presented to it), then we achieve corresponding savings in required space and computation time. This property does indeed hold for many if not most commonly used $\theta$'s, including M-estimators such as linear and kernel regression, logistic regression, and Support Vector Machines, among many others.

Algorithm 1: Bag of Little Bootstraps (BLB)

  Input:  Data $X_1, \ldots, X_n$
          $\theta$: estimator of interest
          $\xi$: estimator quality assessment
          $b$: subsample size
          $s$: number of subsamples
          $r$: number of Monte Carlo iterations
  Output: An estimate of $\xi(Q_n(P))$

  for $j \leftarrow 1$ to $s$ do
      // Subsample the data
      Randomly sample a set $I$ of $b$ indices from $\{1, \ldots, n\}$ without replacement
      [or, choose $I$ to be a disjoint chunk of size $b$ from a predefined random partition of $\{1, \ldots, n\}$]
      $\mathbb{P}^{(j)} \leftarrow b^{-1} \sum_{i \in I} \delta_{X_i}$
      // Approximate $\xi(Q_n(\mathbb{P}^{(j)}))$
      for $k \leftarrow 1$ to $r$ do
          Sample $X_{k,1}, \ldots, X_{k,n} \sim \mathbb{P}^{(j)}$ i.i.d.
          $\hat{\theta}_{n,k} \leftarrow \theta(X_{k,1}, \ldots, X_{k,n})$
      end
      $\mathbb{Q}_{n,j} \leftarrow r^{-1} \sum_{k=1}^{r} \delta_{\hat{\theta}_{n,k}}$
      $\xi_{n,j} \leftarrow \xi(\mathbb{Q}_{n,j})$
  end
  // Average estimates of $\xi(Q_n(P))$ computed on different subsamples
  return $s^{-1} \sum_{j=1}^{s} \xi_{n,j}$

As we show in our experiments and analysis, $b$ can generally be chosen to be much smaller than $n$: for example, we might take $b = n^{\gamma}$ where $\gamma \in [0.5, 1]$. As a result, each computation of $\theta$ by BLB can be significantly faster than each computation of $\theta$ by the bootstrap. Indeed, a simple and standard calculation [3] shows that each resample generated by the bootstrap contains approximately $0.632n$ distinct points, which is quite large if $n$ is large. More concretely, if, for instance, $n$ = 1,000,000, then a bootstrap resample would contain approximately 632,000 distinct points; in contrast, with $b = n^{0.6}$, each BLB subsample and resample would contain at most 3,981 distinct points.
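The counts quoted above can be checked directly. A given point is missed by all $n$ draws of a bootstrap resample with probability $(1 - 1/n)^n \approx e^{-1}$, so the expected fraction of distinct points is about $1 - e^{-1} \approx 0.632$. The short sketch below evaluates this for $n$ = 1,000,000 and computes $b = n^{0.6}$; the function and variable names are ours.

```python
import math


def expected_distinct_fraction(n):
    """Expected fraction of the n original points present in a size-n bootstrap resample."""
    # Each point is absent from all n independent draws with probability (1 - 1/n)**n.
    return 1.0 - (1.0 - 1.0 / n) ** n


n = 1_000_000
bootstrap_distinct = n * expected_distinct_fraction(n)  # close to 0.632 * n
blb_distinct_max = round(n ** 0.6)                      # b = n**0.6, an upper bound for BLB

print(f"limit 1 - 1/e           = {1 - math.exp(-1):.6f}")
print(f"bootstrap distinct pts  ~ {bootstrap_distinct:,.0f}")
print(f"BLB distinct pts (max)  = {blb_distinct_max:,}")
```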
Assuming a data point size of 1 MB, the full dataset would occupy 1 TB of storage space, a bootstrap resample would occupy approximately 632 GB, and each BLB subsample or resample would occupy approximately 4 GB. Thus, BLB only requires repeated computation on small subsets of the original dataset and avoids the bootstrap's crippling need for repeated computation of $\theta$ on resamples having size comparable to that of the original dataset.

Due to its much smaller subsample and resample space requirements, BLB is also significantly more amenable to distribution of different subsamples and resamples and their associated computations to independent compute nodes or racks of nodes, thus allowing for simple distributed and parallel implementations.

4 Experiments

We investigate the empirical performance characteristics of BLB and compare to existing methods via experiments on simulated data. Use of simulated data is necessary here because it allows knowledge of $Q_n(P)$ and hence $\xi(Q_n(P))$ (i.e., ground truth). For different datasets and estimation tasks, we study the convergence properties of BLB as well as the bootstrap, the $m$ out of $n$ bootstrap, and subsampling.

We consider two different tasks and forms for $\theta$: regression and classification. For both settings, the data has the form $X_i = (\tilde{X}_i, Y_i) \sim P$, i.i.d. for $i = 1, \ldots, n$, where $\tilde{X}_i \in \mathbb{R}^d$; $Y_i \in \mathbb{R}$ for regression, whereas $Y_i \in \{0, 1\}$ for classification. We use $n$ = 20,000 for the plots shown, and $d$ is set to 100 for regression and 10 for classification.
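The data layout just described can be mocked up as follows. The Gamma-distributed covariates and noisy linear response follow the paper's own description of its simulated data (see the Figure caption and the discussion of results); the specific Gamma shape parameters, the all-ones coefficient vector, the noise scale, the standardized logistic link, and the function name `simulate` are assumptions made only for illustration and should not be read as the authors' exact settings.

```python
import numpy as np


def simulate(n, d, task, rng):
    """Draw n points (x_i, y_i) with Gamma covariates and a noisy linear signal."""
    # Hypothetical: coordinate j of the covariates uses Gamma shape 1 + (j mod 5), unit scale.
    shapes = 1.0 + (np.arange(d) % 5)
    x = rng.gamma(shape=shapes, scale=1.0, size=(n, d))
    coefficients = np.ones(d)  # hypothetical true parameter vector
    signal = x @ coefficients
    if task == "regression":
        # Real-valued response: linear signal plus Gaussian noise.
        y = signal + rng.normal(scale=1.0, size=n)
    elif task == "classification":
        # Binary response in {0, 1}: Bernoulli draw through a logistic link
        # applied to the standardized signal (standardizing is our choice).
        z = (signal - signal.mean()) / signal.std()
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-z)))
    else:
        raise ValueError("task must be 'regression' or 'classification'")
    return x, y


rng = np.random.default_rng(0)
x_reg, y_reg = simulate(1000, 10, "regression", rng)      # arbitrary small sizes
x_cls, y_cls = simulate(1000, 10, "classification", rng)
print(x_reg.shape, y_reg.dtype, np.unique(y_cls))
```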
Figure 1: Relative error vs. processing time for BLB, the bootstrap, and the $b$ out of $n$ bootstrap (BOFN) on two different problems. For both BLB and BOFN, $b = n^{\gamma}$ with the value of $\gamma$ for each trajectory given in the legend. (Two left-hand plots) Regression setting with linear data generating distribution and Gamma $\tilde{X}_i$ distribution. Leftmost plot shows BLB with the bootstrap; other plot shows BOFN. (Two right-hand plots) Classification setting with linear data generating distribution and Gamma $\tilde{X}_i$ distribution. Left-hand plot shows BLB with the bootstrap; other plot shows BOFN.

In each case, $\theta$ estimates a linear parameter vector in $\mathbb{R}^d$ (via either least squares or logistic regression) for a model of the mapping between $\tilde{X}_i$ and $Y_i$. We define $\xi$ as computing a set of marginal 95% confidence intervals, one for each element of the estimated parameter vector (averaging across $\xi$'s consists of averaging element-wise interval boundaries).

To evaluate the various quality assessment procedures on a given estimation task and true underlying data distribution $P$, we first compute the ground truth $\xi(Q_n(P))$ based on 2,000 realizations of datasets of size $n$ from $P$. Then, for an independent dataset realization of size $n$ from the true underlying distribution, we run each quality assessment procedure and record the estimate of $\xi(Q_n(P))$ produced after each iteration (e.g., after each bootstrap resample or BLB subsample is processed), as well as the cumulative time required to produce that estimate. Every such estimate is evaluated based on the average (across dimensions) relative deviation of its component-wise confidence intervals' widths from the corresponding true widths.
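The evaluation criterion above can be written compactly. The sketch below is one reading of it, assuming each set of marginal intervals is stored as a `(d, 2)` array of lower and upper boundaries and that "relative deviation" means absolute difference in width divided by the true width; the function names are ours. The second helper mirrors the element-wise averaging of interval boundaries used to combine $\xi$'s.

```python
import numpy as np


def average_intervals(interval_sets):
    """Combine several (d, 2) interval arrays by averaging boundaries element-wise."""
    return np.mean(np.asarray(interval_sets, dtype=float), axis=0)


def relative_width_error(estimated_intervals, true_intervals):
    """Mean over dimensions of |estimated width - true width| / true width."""
    est = np.asarray(estimated_intervals, dtype=float)
    true = np.asarray(true_intervals, dtype=float)
    est_width = est[:, 1] - est[:, 0]
    true_width = true[:, 1] - true[:, 0]
    return float(np.mean(np.abs(est_width - true_width) / true_width))


# Two-dimensional toy example: the first interval is 10% too wide, the second is exact.
estimated = [[0.0, 1.1], [-1.0, 1.0]]
truth = [[0.0, 1.0], [-1.0, 1.0]]
print(relative_width_error(estimated, truth))
```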
We repeat this process on five independent dataset realizations of size $n$ and average the resulting relative errors and corresponding times across these five datasets to obtain a trajectory of relative error versus time for each quality assessment procedure (the trajectories' variances are relatively small). To maintain consistency of notation, we henceforth refer to the $m$ out of $n$ bootstrap as the $b$ out of $n$ bootstrap. For BLB, the $b$ out of $n$ bootstrap, and subsampling, we consider subsample sizes $b = n^{\gamma}$ where $\gamma \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$; we use $r$ = 100 in all runs of BLB.

Figure 1 shows results for both the regression and classification settings. Here the true underlying distribution $P$ generates each coordinate of each $\tilde{X}_i$ independently from a different Gamma distribution, and it generates $Y_i$ from $\tilde{X}_i$ via a noisy linear model. As seen in the figure, in the regression setting BLB (first plot) succeeds in converging to low relative error significantly more quickly than the bootstrap, for all values of $b$ considered. In contrast, the $b$ out of $n$ bootstrap (second plot) fails to converge to low relative error for smaller values of $b$.

In the classification setting, BLB (third plot) converges to relative error comparable to that of the bootstrap for $b > n^{0.6}$, while converging to higher relative errors for the smallest values of $b$ considered. For larger values of $b$, which are still significantly smaller than $n$, we again converge to low relative error more quickly than the bootstrap. We are also once again more robust than the $b$ out of $n$ bootstrap (fourth plot), which fails to converge to low relative error for $b \le n^{0.7}$. In fact, even for $b \le n^{0.6}$, BLB's performance is superior to that of the $b$ out of $n$ bootstrap. For the aforementioned cases in which BLB does not match the relative error of the bootstrap, additional empirical results (not shown here) and our theoretical analysis indicate that this discrepancy in relative error diminishes as $n$ increases.
Identical evaluation of subsampling (plots not shown) shows that it performs strictly worse than the $b$ out of $n$ bootstrap. Qualitatively similar results also hold in both the regression and classification settings when $P$ generates $\tilde{X}_i$ from either Normal or Student-t distributions, and when $P$ uses a nonlinear noisy mapping between $\tilde{X}_i$ and $Y_i$ (so that $\theta$ learns a misspecified model).

These experiments demonstrate the improved computational efficiency of BLB relative to the bootstrap, the fact that BLB maintains statistical correctness, and the improved robustness of BLB to the choice of $b$. Note that all speedups observed for BLB over the bootstrap in our experiments are achieved without parallelization, though better parallelizability is indeed another significant advantage of BLB. See the extended version of this paper for theoretical results showing that BLB is consistent and shares the higher-order correctness (i.e., favorable statistical convergence rate) of the bootstrap, additional experimental results, and further discussion of BLB and related prior work.
References

[1] P. Bickel, F. Götze, and W. van Zwet. Resampling fewer than n observations: gains, losses and remedies for losses. Statistica Sinica, 1997.
[2] B. Efron. Bootstrap methods: another look at the jackknife. Annals of Statistics, 1979.
[3] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.
[4] D. Politis, J. Romano, and M. Wolf. Subsampling. Springer,
More informationCalculating Interval Forecasts
Calculating Chapter 7 (Chatfield) Monika Turyna & Thomas Hrdina Department of Economics, University of Vienna Summer Term 2009 Terminology An interval forecast consists of an upper and a lower limit between
More informationREVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
More informationParallelization Strategies for Multicore Data Analysis
Parallelization Strategies for Multicore Data Analysis WeiChen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management
More informationREPORT DOCUMENTATION PAGE
REPORT DOCUMENTATION PAGE Form Approved OMB NO. 07040188 Public Reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,
More informationTHE HYBRID CARTLOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell
THE HYBID CATLOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most datamining projects involve classification problems assigning objects to classes whether
More informationLearning outcomes. Knowledge and understanding. Competence and skills
Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges
More informationGetting Even More Out of Ensemble Selection
Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise
More informationWhy do statisticians "hate" us?
Why do statisticians "hate" us? David Hand, Heikki Mannila, Padhraic Smyth "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data
More informationAppendix A: Sampling Methods
Appendix A: Sampling Methods What is Sampling? Sampling is used in an @RISK simulation to generate possible values from probability distribution functions. These sets of possible values are then used to
More informationPerformance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
More informationHow can we discover stocks that will
Algorithmic Trading Strategy Based On Massive Data Mining Haoming Li, Zhijun Yang and Tianlun Li Stanford University Abstract We believe that there is useful information hiding behind the noisy and massive
More informationComputing with Finite and Infinite Networks
Computing with Finite and Infinite Networks Ole Winther Theoretical Physics, Lund University Sölvegatan 14 A, S223 62 Lund, Sweden winther@nimis.thep.lu.se Abstract Using statistical mechanics results,
More informationSimulation Exercises to Reinforce the Foundations of Statistical Thinking in Online Classes
Simulation Exercises to Reinforce the Foundations of Statistical Thinking in Online Classes Simcha Pollack, Ph.D. St. John s University Tobin College of Business Queens, NY, 11439 pollacks@stjohns.edu
More informationChecklists and Examples for Registering Statistical Analyses
Checklists and Examples for Registering Statistical Analyses For welldesigned confirmatory research, all analysis decisions that could affect the confirmatory results should be planned and registered
More information2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)
2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came
More informationFairfield Public Schools
Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity
More informationA Perspective on Statistical Tools for Data Mining Applications
A Perspective on Statistical Tools for Data Mining Applications David M. Rocke Center for Image Processing and Integrated Computing University of California, Davis Statistics and Data Mining Statistics
More informationInstitute of Actuaries of India Subject CT3 Probability and Mathematical Statistics
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in
More informationFalse Discovery Rates
False Discovery Rates John D. Storey Princeton University, Princeton, USA January 2010 Multiple Hypothesis Testing In hypothesis testing, statistical significance is typically based on calculations involving
More informationSpatial Statistics Chapter 3 Basics of areal data and areal data modeling
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data
More informationBayesian sample size determination of vibration signals in machine learning approach to fault diagnosis of roller bearings
Bayesian sample size determination of vibration signals in machine learning approach to fault diagnosis of roller bearings Siddhant Sahu *, V. Sugumaran ** * School of Mechanical and Building Sciences,
More informationComparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
More informationEmployer Health Insurance Premium Prediction Elliott Lui
Employer Health Insurance Premium Prediction Elliott Lui 1 Introduction The US spends 15.2% of its GDP on health care, more than any other country, and the cost of health insurance is rising faster than
More informationTesting for the martingale hypothesis in Asian stock prices: evidence from a new joint variance ratio test
Testing for the martingale hypothesis in Asian stock prices: evidence from a new joint variance ratio test Jae H. Kim Department of Econometrics and Business Statistics Monash University, Caulfield East,
More informationMicrosoft Azure Machine learning Algorithms
Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql Tomaz.kastrun@gmail.com http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation
More informationBasics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar
More informationStatistics Graduate Courses
Statistics Graduate Courses STAT 7002Topics in StatisticsBiological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
More information1.2 Statistical testing by permutation
Statistical testing by permutation 17 Excerpt (pp. 1726) Ch. 13), from: McBratney & Webster (1981), McBratney et al. (1981), Webster & Burgess (1984), Borgman & Quimby (1988), and FrançoisBongarçon (1991).
More informationHT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
More informationMultiple Discriminant Analysis of Corporate Bankruptcy
Multiple Discriminant Analysis of Corporate Bankruptcy In this paper, corporate bankruptcy is analyzed by employing the predictive tool of multiple discriminant analysis. Using several firmspecific metrics
More informationMINITAB ASSISTANT WHITE PAPER
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. OneWay
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS1332014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationCurrent Standard: Mathematical Concepts and Applications Shape, Space, and Measurement Primary
Shape, Space, and Measurement Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two and threedimensional shapes by demonstrating an understanding of:
More informationLearning bagged models of dynamic systems. 1 Introduction
Learning bagged models of dynamic systems Nikola Simidjievski 1,2, Ljupco Todorovski 3, Sašo Džeroski 1,2 1 Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan
More informationA Study Of Bagging And Boosting Approaches To Develop MetaClassifier
A Study Of Bagging And Boosting Approaches To Develop MetaClassifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet524121,
More informationClass #6: Nonlinear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Nonlinear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Nonlinear classification Linear Support Vector Machines
More informationChapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )
Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples
More informationClassifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang
Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical microclustering algorithm ClusteringBased SVM (CBSVM) Experimental
More informationMachine Learning. Term 2012/2013 LSI  FIB. Javier Béjar cbea (LSI  FIB) Machine Learning Term 2012/2013 1 / 34
Machine Learning Javier Béjar cbea LSI  FIB Term 2012/2013 Javier Béjar cbea (LSI  FIB) Machine Learning Term 2012/2013 1 / 34 Outline 1 Introduction to Inductive learning 2 Search and inductive learning
More informationInsurance Analytics  analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics  analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
More informationGraduate Programs in Statistics
Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL
More information