Automated Learning and Data Visualization




Automated Learning and Data Visualization (slide 1)
William S. Cleveland
Department of Statistics and Department of Computer Science, Purdue University

Methods of Statistics, Machine Learning, and Data Mining (slide 2)
Mathematical methods:
- summary statistics
- models fitted to data
- algorithms that process the data
Visualization methods:
- displays of raw data
- displays of the output of mathematical methods

Mathematical Methods (slide 3)
Critical in all phases of the analysis of data: from initial checking and cleaning to final presentation of results
Automated learning methods:
- adapt themselves to systematic patterns in data
- can carry out predictive tasks
- can describe the patterns in a way that provides fundamental understanding
Different patterns require different methods, even when the task is the same

Visualization Methods (slide 4)
Critical in all phases of the analysis of data: from initial checking and cleaning to final presentation of results
Allow us to learn which patterns occur, out of an immensely broad collection of possible patterns

Visualization Methods in Support of Mathematical Methods (slide 5)
Typically not feasible to carry out all of the mathematical methods necessary to cover the broad collection of patterns that could have been seen by visualization
Visualization provides immense insight into appropriate mathematical methods, even when the task is just prediction
Automatic selection of the best mathematical method (model selection criteria, training-test framework) risks finding the best from a group of poor performers

Mathematical Methods in Support of Visualization (slide 6)
Typically not possible to understand the patterns in a data set just by displaying the raw data
Must also carry out mathematical methods and then visualize their output, and the remaining variation in the data after adjusting for that output

Tactics and Strategy (slide 7)
Mathematical methods exploit the tactical power of the computer: an ability to perform massive mathematical computations with great speed
Visualization methods exploit the strategic power of the human: an ability to reason using input from the immensely powerful human visual system
The combination provides the best chance to retain the information in the data

Machine Learning Algorithm Deep Blue vs. Human Kasparov (slide 8)
Why was Kasparov so distressed about the possibility that IBM was cheating by allowing a human to assist the algorithm?

IBM Machine Learning System Deep Blue vs. Human Kasparov (slide 9)
Why was Kasparov so distressed about the possibility that IBM was cheating by allowing a human to assist the algorithm?
He knew he had no chance to beat a human-machine combination:
- the immense tactical power of the IBM machine learning system
- the strategic power, much less than his own, of a grand master

Conclusion (slide 10)
MATHEMATICAL METHODS & VISUALIZATION METHODS ARE SYMBIOTIC

Plan for this Talk (slide 11)
Visualization Databases for Large Complex Datasets
- just a few minutes in this talk
- paper in Journal of Machine Learning Research (AISTATS 2009 Proceedings)
- web site with live examples: ml.stat.purdue.edu/vdb/
- an approach to visualization that fosters comprehensive analysis of large complex datasets

Plan for this Talk (slide 12)
The ed Method for Nonparametric Density Estimation & Diagnostic Checking
- current work, described here to make the case for a tight coupling of a mathematical method and visualization methods
- addresses the 50-year-old topic of nonparametric density estimation: 50 years of kernel density estimates, and very little visualization for diagnostic checking to see if the density patterns are faithfully represented
- a new mathematical method built, in part, to enable visualization
- results: much more faithful following of density patterns in data, visualization methods for diagnostic checking, and simple finite-sample statistical inference

Visualization Databases for Large Complex Datasets (slide 13)
Saptarshi Guha, Paul Kidwell, Ryan Hafen, William Cleveland
Department of Statistics, Purdue University

Visualization Databases for Large Complex Datasets (slide 14)
Comprehensive analysis of a large complex dataset that preserves the information in the data is greatly enhanced by a visualization database (VDB): many displays, some with many pages, often with many panels per page
A large multi-page display for a single display method results from parallelizing the data:
- partition the data into subsets
- sample the subsets
- apply the visualization method to each subset in the sample, typically one per panel
The time of the analyst is not increased by choosing a large sample over a small one: display viewers can be designed to allow rapid scanning (animation with punctuated stops), and often it is not necessary to view every page of a display
Display design can enhance rapid scanning: attention to effortless gestalt formation that conveys information, rather than focused viewing

Visualization Databases for Large Complex Datasets (slide 15)
Already successful just with off-the-shelf tools and simple concepts
Can be greatly improved by research in visualization methods that targets VDBs and large displays
Our current research projects:
- subset sampling methods
- automation algorithms for choosing basic display elements
- display design for gestalt formation

RHIPE (slide 16)
Our approach to VDBs allows embarrassingly parallel computation
Large amounts of computer time can be saved by distributed computing environments
One is RHIPE (ml.stat.purdue.edu/rhipe), by Saptarshi Guha, Purdue Statistics
- R-Hadoop Integrated Processing Environment
- Greek for "in a moment"; pronounced "hree-pay"
- a recent merging of the R interactive environment for data analysis (www.r-project.org) and the Hadoop distributed file system and compute engine (hadoop.apache.org)
- public domain
- a remarkable achievement that has had a dramatic effect on our ability to compute with large data sets

The ed Method for Nonparametric Density Estimation & Diagnostic Checking (slide 17)
Ryan Hafen and William S. Cleveland
Department of Statistics, Purdue University

Normalized British Income Data, 7201 observations (slide 18)
[Figure: British Income vs. f value]

Kernel Density Estimation (KDE): The Most-Used Method (slide 19)
Let x_j, for j = 1 to m, be the observations, ordered from smallest to largest
Fixed-bandwidth kernel density estimate with bandwidth h and kernel K, where K(u) >= 0 and integral K(u) du = 1:
hat f(x) = (1 / (m h)) * sum_{j=1}^{m} K((x - x_j) / h)
K is often the unit normal probability density, so h = standard deviation
An x_j closer to x adds more to hat f(x) than a farther x_j
As the bandwidth h increases, hat f(x) gets smoother
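As a concrete illustration of the formula on this slide, here is a minimal sketch of the fixed-bandwidth estimator with a unit-normal kernel (an illustrative implementation, not the software used in the talk):

```python
import numpy as np

def kde(x_grid, data, h):
    """Fixed-bandwidth kernel density estimate:
    hat f(x) = (1/(m h)) * sum_j K((x - x_j)/h), K = unit normal density."""
    data = np.asarray(data, dtype=float)
    m = data.size
    # pairwise scaled distances between grid points and observations
    u = (np.asarray(x_grid, dtype=float)[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # unit normal kernel
    return K.sum(axis=1) / (m * h)
```

Since K integrates to 1, the estimate integrates to 1 as well, and increasing h visibly smooths hat f.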

KDE for Incomes: Gaussian Kernel, Sheather-Jones h (slide 20)
[Figure: density vs. British Income]

Three Problems of Nonparametric Density Estimation: Problem 1 (slide 21)
We want an estimate that faithfully follows the density patterns in the data, and tools that convince us this is happening with our current set of data
There are a few reasonable things:
- plot estimates with different bandwidths, small to large
- model selection criteria
- SiZer (Chaudhuri and Marron)
Problem 1: There does not exist a set of comprehensive diagnostic tools.

Kernel Density Estimates with Gaussian Kernel for Income Data (slide 22)
[Figure: trellis of density vs. British Income for six bandwidths: 0.018, 0.033, 0.061, 0.111, 0.202, 0.368]

Bandwidth Selection Criterion (slide 23)
Treated as if the criterion automatically provides a good fit to the patterns
[Figure: density vs. British Income for the Silverman bandwidth and for the cross-validation bandwidth]
There are many criteria; choosing h is replaced by choosing a criterion
Useful, but only a small part of comprehensive diagnosis

Three Problems of Nonparametric Density Estimation: Problem 2 (slide 24)
KDEs are simple and can be made to run very fast, which makes us want to use them
There is a price for the simplicity, discussed extensively in the literature

Three Problems of Nonparametric Density Estimation: Problem 2 (slide 25)
E(hat f(x)) = integral f(x - u) K(u) du, with K(u) >= 0
This expected value can be far from f(x):
- chops peaks
- fills in valleys
- underestimates the density at data boundaries when there is a sharp cutoff in density (e.g., the world's simplest density, a uniform)
General assessment: remedies such as changing h or K with x do not fix the problems, and introduce others
Problem 2: KDEs have much trouble faithfully following the density patterns in data.
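The boundary problem on this slide can be made concrete. For f uniform on [0, 1] and a Gaussian kernel with bandwidth h, the convolution E(hat f(x)) reduces to a difference of normal CDFs; the sketch below (illustrative, not from the talk) shows that the expected value is about half the true density at the sharp cutoff:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_kde_uniform(x, h):
    """E(hat f(x)) for a Gaussian kernel with bandwidth h and f = Uniform(0, 1):
    integral_0^1 (1/h) phi((x - t)/h) dt = Phi((1 - x)/h) - Phi(-x/h)."""
    return normal_cdf((1.0 - x) / h) - normal_cdf(-x / h)
```

In the interior the expected value is essentially the true density 1; at x = 0 it is about 0.5 for small h, so the KDE loses roughly half the density at the boundary no matter how h is chosen.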

The Three Problems of Nonparametric Density Estimation: Problem 3 (slide 26)
Problem 3: Very little technology for statistical inference.

Regression Analysis (Function Estimation) (slide 27)
Not the impoverished situation of nonparametric density estimation
For example, the following model:
y_i = g(x_i) + epsilon_i
- y_i, for i = 1 to n, are measurements of a response
- x_i is a p-tuple of measurements of p explanatory variables
- epsilon_i are error terms: independent, identically distributed with mean 0
Regression analysis has a wealth of models, mathematical methods, visualization methods for diagnostic checking, and inference technology

The ed Method for Density Estimation (slide 28)
Take a model-building approach
Turn density estimation into a regression analysis, to exploit the rich environment for analysis
In addition, seek to make the regression simple by fostering epsilon_i that have a normal distribution
In fact, it is even more powerful than most regression settings because we know the error variance

British Income Data: The Modeling Begins Right Away (slide 29)
[Figure: British Income vs. f value]

Log British Income Data: The Modeling Begins Right Away (slide 30)
[Figure: log base 2 British Income vs. f value]

British Income: The Modeling Begins Right Away (slide 31)
Observations used for nonparametric density estimation: income less than 3.75 and log base 2 income larger than -2.75; this reduces the number of observations by 39, to 7162
Not realistic to suppose we can get good relative density estimates in these tails by nonparametric methods
Incomes: x_j in normalized pounds sterling (nps), for j = 1 to m = 7162, ordered from smallest to largest

Histogram of British Incomes (slide 32)
Consider a histogram interval with length g
The estimate of the density for the interval is kappa / (m g): the fraction of observations per nps
Here g is fixed and we think of kappa, the count in the interval, as a random variable

Order Statistics and Their Gaps (slide 33)
x_j in normalized pounds sterling (nps), for j = 1 to m = 7162, ordered from smallest to largest
Order statistic kappa-gaps:
g^(kappa)_1 = x_{kappa+1} - x_1
g^(kappa)_2 = x_{2 kappa+1} - x_{kappa+1}
g^(kappa)_3 = x_{3 kappa+1} - x_{2 kappa+1}
...
For kappa = 10:
g^(10)_1 = x_11 - x_1
g^(10)_2 = x_21 - x_11
g^(10)_3 = x_31 - x_21
...
Gaps have units nps; the number of observations in each interval is kappa

Balloon Densities (slide 34)
Gaps: g^(kappa)_i = x_{i kappa+1} - x_{(i-1) kappa+1}, i = 1, 2, ..., n
b^(kappa)_i = (kappa / m) / g^(kappa)_i: the fraction of observations in the gap interval per nps
b^(kappa)_i is positioned at the midpoint of the gap interval [x_{(i-1) kappa+1}, x_{i kappa+1}]:
x^(kappa)_i = (x_{i kappa+1} + x_{(i-1) kappa+1}) / 2 nps
Now kappa is fixed and we think of g^(kappa)_i as a random variable
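The gaps and balloon densities are a few lines of code; this is an illustrative sketch of the definitions above (any observations past the last full gap are simply dropped):

```python
import numpy as np

def balloon_density(x, kappa):
    """kappa-gaps of the order statistics and the raw balloon densities:
    g_i = x_{i*kappa+1} - x_{(i-1)*kappa+1}  (1-based indexing),
    b_i = (kappa / m) / g_i, positioned at the gap-interval midpoint."""
    x = np.sort(np.asarray(x, dtype=float))
    m = x.size
    edges = x[::kappa]                     # x_1, x_{kappa+1}, x_{2*kappa+1}, ...
    gaps = np.diff(edges)                  # g_1, g_2, ...
    mids = (edges[:-1] + edges[1:]) / 2.0  # midpoints of the gap intervals
    return mids, (kappa / m) / gaps
```

For roughly uniform data, each gap has about the same length, so the b_i hover near the true constant density.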

A Very Attractive Property of the Log Balloon Estimate (slide 35)
y^(kappa)_i = log(b^(kappa)_i), i = 1, ..., n
Distributional properties (the theory): approximately independent, and distributed like a constant plus the log of a chi-squared distribution with 2 kappa degrees of freedom
E(y^(kappa)_i) = log f(x^(kappa)_i) + log kappa - psi_0(kappa)
Var(y^(kappa)_i) = psi_1(kappa)
psi_0 = digamma function, psi_1 = trigamma function
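The constants psi_0(kappa) and psi_1(kappa) come from the moments of a log chi-squared variable: a chi-squared with 2 kappa degrees of freedom divided by 2 is Gamma(kappa, 1), and if G ~ Gamma(kappa, 1) then E(log G) = psi_0(kappa) and Var(log G) = psi_1(kappa). A quick numerical check of this (illustrative, not from the talk):

```python
import numpy as np
from scipy.special import digamma, polygamma

def log_balloon_moments(kappa):
    """Theoretical shift and variance of y_i = log(b_i) relative to log f:
    E(y_i) - log f = log(kappa) - psi_0(kappa),  Var(y_i) = psi_1(kappa)."""
    return float(np.log(kappa) - digamma(kappa)), float(polygamma(1, kappa))
```

For kappa = 10 this gives Var = psi_1(10), about 0.105, which is exactly the known error variance quoted later for the income fit.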

ed Step 1: Log Balloon Densities (slide 36)
Start with the log balloon densities as the raw data
Two considerations in the choice of kappa:
(1) small enough that there is as little distortion of the density as possible by the averaging that occurs
(2) large enough that y^(kappa)_i is approximately normal
kappa = 10 is quite good and kappa = 20 nearly perfect (in theory); we can give this up and even take kappa = 1, but the next steps are more complicated

ed: Log Balloon Densities vs. Gap Midpoints for Income with kappa = 10 (slide 37)
[Figure: log density vs. British Income]

ed Step 2: Smooth Log Balloon Densities Using Nonparametric Regression (slide 38)
[Figure: log density vs. British Income with the loess fit]

ed Step 2 (slide 39)
Smooth y^(kappa)_i as a function of x^(kappa)_i using nonparametric regression: loess
- fit polynomials locally of degree delta in a moving fashion, like a moving average of a time series
- bandwidth parameter 0 < alpha <= 1
- the fit at x uses the [alpha n] closest points to x, the neighborhood of x
- weighted least-squares fitting, where the weights decrease to 0 as the distances of neighborhood points increase to the neighborhood boundary
Loess possesses all of the statistical-inference technology of parametric fitting for linear models
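The local fitting step can be sketched in bare-bones form: at each x_i, fit a weighted least-squares polynomial of degree delta to the ceil(alpha * n) nearest points with tricube weights. Real loess adds much more (robustness iterations, interpolation, inference), so this is only an illustration under those assumptions:

```python
import numpy as np

def loess_fit(x, y, alpha=0.3, degree=2):
    """Minimal loess-style smoother: local weighted polynomial fits with
    tricube weights that fall to 0 at the neighborhood boundary."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    q = max(degree + 2, int(np.ceil(alpha * n)))        # neighborhood size
    fit = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:q]                          # q nearest neighbors
        w = (1.0 - (d[idx] / d[idx].max()) ** 3) ** 3    # tricube weights
        t = x[idx] - x[i]                                # basis centered at x[i]
        X = np.vander(t, degree + 1, increasing=True)
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y[idx] * sw, rcond=None)
        fit[i] = beta[0]             # local polynomial evaluated at x[i]
    return fit
```

A degree-2 local fit reproduces a quadratic trend exactly, which is the sense in which delta = 2 or 3 "reaches up to the tops of peaks and the bottoms of valleys".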

ed: Three Tuning Parameters (slide 40)
kappa: gap length
alpha: bandwidth
delta: degree of polynomial in local fitting

Notational Change (slide 41)
Midpoints of gap intervals: x^(kappa)_i becomes x_i
Log balloon densities: y^(kappa)_i becomes y_i

ed Step 2: A Model Building Approach (slide 42)
A starter model for the log balloon densities y_i as a function of the gap midpoints x_i, based on theory; hope for a good approximation to the underlying patterns in the data:
y_i = y(x_i) + epsilon_i
- y(x) = log(f(x)) = the log density, well approximated by polynomials of degree delta locally, in neighborhoods of x determined by alpha; expect delta to be 2 or 3 to reach up to the tops of peaks and the bottoms of valleys
- epsilon_i independent, identically distributed with mean 0, distribution well approximated by the normal
Use the comprehensive set of tools of regression diagnostics to investigate the assumptions

Model Diagnostics (slide 43)
Important quantities used in carrying out diagnostics:
- hat y(x) = ed log density fit from the nonparametric regression
- hat y_i = hat y(x_i) = fitted values at the gap midpoints x_i
- hat epsilon_i = y_i - hat y_i = residuals
- hat f(x) = exp(hat y(x)) = ed density estimate
- hat sigma^2_epsilon = estimate of the error variance; in theory, the variance is psi_1(kappa), which is 0.105 for kappa = 10
- nu = equivalent degrees of freedom of the fit (number of parameters)
Loess possesses all of the statistical-inference technology of parametric linear regression

ed Log Raw Density Estimate for Income: Fit vs. Income (slide 44)
[Figure: trellis of the fitted log density vs. British Income for six values of the bandwidth parameter]

ed Log Density Estimate for Income: Residuals vs. Income (slide 45)
[Figure: trellis of residuals vs. British Income for six values of the bandwidth parameter]

ed Density Estimate for Income: Fit vs. Income (slide 46)
[Figure: trellis of the fitted density vs. British Income for six values of the bandwidth parameter]

ed Log Density Estimate for Income: Residual Variance vs. Degrees of Freedom (slide 47)
[Figure: residual variance estimate vs. nu for local-fitting degrees 1, 2, and 3]

Mallows Cp Model Selection Criterion (slide 48)
A visualization tool for showing the trade-off of bias and variance, which is far more useful than just a criterion that one minimizes or optimizes
M: an estimate of the mean-squared error; nu: degrees of freedom of the fit
M estimates (Bias Sum of Squares) / sigma^2 + (Variance Sum of Squares) / sigma^2
M = sum_i hat epsilon_i^2 / hat sigma^2_epsilon - (n - nu) + nu
E(M) is approximately nu when the fit follows the pattern in the data
The amount by which M exceeds nu is an estimate of the bias
Cp plot: M vs. nu
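The statistic on this slide is essentially one line of code; a minimal sketch, assuming residuals and the known error variance are in hand:

```python
import numpy as np

def mallows_m(residuals, sigma2, nu):
    """Cp-style statistic from the slide:
    M = sum(e_i^2) / sigma2 - (n - nu) + nu.
    E(M) is about nu for an unbiased fit, so M - nu estimates the bias
    sum of squares in units of sigma2."""
    e = np.asarray(residuals, dtype=float)
    n = e.size
    return float(e @ e) / sigma2 - (n - nu) + nu
```

Computing M over a grid of (alpha, delta) choices and plotting it against nu gives the Cp plot of the next slide.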

ed Log Density Estimate for Income: Cp Plot (slide 49)
[Figure: M vs. nu for local-fitting degrees 1, 2, and 3]

Model Selection for British Incomes (slide 50)
From the Cp plot, plots of residuals, and plots of fits:
- gap size: kappa = 10
- polynomial degree: delta = 2
- bandwidth parameter: alpha = 0.16
- equivalent degrees of freedom: nu = 19

ed Log Density Estimate for Income: Fit vs. Income (slide 51)
[Figure: fitted log density vs. British Income]

ed Log Density Estimate for Income: Residuals vs. Income (slide 52)
[Figure: residuals vs. British Income]

ed Log Density Estimate for Income: Absolute Residuals vs. Income (slide 53)
[Figure: absolute residuals vs. British Income]

ed Log Density Estimate for Income: Lag 1 Residual Correlation Plot (slide 54)
[Figure: residual (i-1) vs. residual (i)]

ed Log Density Estimate for Income: Normal Quantile Plot of Residuals (slide 55)
[Figure: residuals vs. normal quantiles]

ed Density Estimate and KDE for Income: Fits vs. Income (slide 56)
[Figure: density vs. British Income for ed and the Sheather-Jones KDE]

ed Log Density Estimate and Log KDE for Income: Residuals vs. Income (slide 57)
[Figure: residuals vs. British Income for ed and the Sheather-Jones KDE]

ed Log Density Estimate for Income: 99% Pointwise Confidence Intervals (slide 58)
[Figure: log density with 99% pointwise confidence intervals vs. British Income]

Queuing Delays of Internet Voice-over-IP Traffic (slide 59)
264,583 observations
From a study of quality of service for different traffic loads
Semi-empirical model for generating traffic; simulated queueing on a router

Silverman Bandwidth KDE for Delay: Fit vs. Delay (slide 60)
[Figure: density vs. delay (ms)]

Silverman Bandwidth Log KDE for Delay: Residuals vs. Delay (slide 61)
[Figure: residuals vs. delay (ms)]

ed Log Raw Density Estimates for Delay: kappa = 100 Because of Ties (slide 62)
[Figure: log density vs. delay (ms)]

ed Log Density Estimates for Delay: 3 Independent Fits for Three Intervals (slide 63)
Polynomial degrees delta and bandwidths alpha for the loess fits:
- interval 1: delta = 1, alpha = 0.75
- interval 2: delta = 2, alpha = 0.5
- interval 3: delta = 1, alpha = 1.5

ed Log Density Estimates for Delay: 3 Fits vs. Delay (slide 64)
[Figure: fitted log density vs. delay (ms)]

ed Log Density Estimates, 3 Fits for Delay: Residuals vs. Delay (slide 65)
[Figure: residuals vs. delay (ms)]

Silverman Bandwidth KDE for Delay: 3 Fits (slide 66)
Unit normal kernel, so h = standard deviation:
- interval 1: h = 0.00346
- interval 2: h = 0.00532
- interval 3: h = 0.00971

Silverman Bandwidth KDE for Delay: 3 Fits vs. Delay (slide 67)
[Figure: log density vs. delay (ms)]

Silverman Bandwidth Log KDE, 3 Fits for Delay: Residuals vs. Delay (slide 68)
[Figure: residuals vs. delay (ms)]

Computation (slide 69)
We use existing algorithms for loess to get faster computations for ed than direct computation
Still, are the ed computations fast enough, especially for extensions to higher dimensions?
Alex Gray and collaborators have done some exceptional work in algorithms for fast computation of KDEs and nonparametric regression:
Gray, A. G. and Moore, A. W. Nonparametric Density Estimation: Toward Computational Tractability. SIAM International Conference on Data Mining, 2003. Winner of the Best Algorithm Paper Prize.
How can we tailor this to our needs here?

RHIPE (slide 70)
ed presents embarrassingly parallel computation
Large amounts of computer time can be saved by distributed computing environments
One is RHIPE (ml.stat.purdue.edu/rhipe), by Saptarshi Guha, Purdue Statistics
- R-Hadoop Integrated Processing Environment
- Greek for "in a moment"; pronounced "hree-pay"
- a recent merging of the R interactive environment for data analysis (www.r-project.org) and the Hadoop distributed file system and compute engine (hadoop.apache.org)
- public domain
- a remarkable achievement that has had a dramatic effect on our ability to compute with large data sets