Data Mining with Regression
Teaching an old dog some new tricks

Bob Stine
Department of Statistics, The Wharton School of the University of Pennsylvania
March 31, 2006

Acknowledgments
Colleagues: Dean Foster in Statistics, Lyle Ungar in Computer Science.

Overview
A familiar regression model, but adapted to the context of data mining:
- Scope: borrow from machine learning
- Search: heuristic sequential strategies
- Selection: alpha-investing rules
- Estimation: adaptive empirical Bayes
- Structure: calibration
Does it work? Numerical comparisons to other methods using reference data sets.

Data Mining Context
Predictive modeling of wide data. Modern data sets have
- n: thousands to millions of rows
- m: hundreds to thousands of columns
No matter how large n becomes, one can conceive of models with m > n once derived features (e.g., interactions) are included.
Consequence: cannot fit the saturated model to estimate σ², and cannot assume the true model lies in the fitted class.

Wide Data

  Application   Rows        Columns
  Credit        3,000,000   350
  Faces         10,000      1,400
  Genetics      1,000       10,000
  CiteSeer      500         —

Lots of Data
- Credit scoring: millions of credit card users; past use of credit, economics, transactions.
- Text: documents to be classified into categories; a large corpus of marked documents, and even more that have not been marked.
- Images: millions of images from video surveillance; all those pixel patterns become features.

Experience
Model for bankruptcy: stepwise regression selecting from more than 67,000 predictors.
Successful: better classifications than C4.5.
But:
- Fit dominated by interactions; linear terms hidden.
- Know we missed some things, even with 67,000 predictors.
- Unable to exploit domain knowledge.
- Not the fastest code to run.

Why use regression?
- Familiarity: reduces the chances for pilot error.
- Well-defined classical inference: IF you know the predictors, inference is easy.
- A linear approximation is good enough, even if the right answer is nonlinear.
- Good diagnostics: residual analysis is helpful, even with millions of cases.
- A framework for studying other methods.

Key Challenge
Which features should go in the model? We cannot use them all:
- Too many; over-fitting.
- May need transformations.
- Even if we did use them all, we may not find the best model.
Model averaging? Too slow. Save for later, along with bagging.

The approach:
- Reproducing kernel Hilbert space (from SVMs)
- Search and selection methods: an auction
- Estimation: adaptive shrinkage improves the testimator
- Structure of the model: calibration

Reproducing Kernel Hilbert Space: Larger Scope
Lesson from the analysis of bankruptcy: interactions can be very useful, but they dominate if all predictors are treated as one monolithic group (m linear terms, roughly m² second-order terms).
Question: how to incorporate useful quadratic interactions and other transformations? Particularly hard to answer in genetic settings with very wide data sets for which m >> n.

Reproducing Kernels
Some history: introduced in statistics by Parzen and Wahba, then adopted by the machine learning community for use in support vector machines.
Use in regression:
- Find interesting directions in feature space.
- Avoid explicit calculation of the points in the very high-dimensional feature space.

Example of RKHS
Bulls-eye pattern: a non-linear boundary between the cases in the two groups.
To linearize the boundary, add X1² and X2² to the basis. This does not generalize easily (too many terms to add).

Alternative using RKHS
- Define a new feature space x ↦ φ(x), possibly of much higher dimension than m.
- The inner product between points x1 and x2 in the new space is <φ(x1), φ(x2)>.
- A reproducing kernel K evaluates this inner product without forming φ(x) explicitly:
      K(x1, x2) = <φ(x1), φ(x2)>
- Industry is inventing kernel functions; e.g., the Gaussian kernel (aka radial basis):
      K(x1, x2) = c exp(−‖x1 − x2‖²)

Generate several new features
- Compute the Gram matrix in feature space indirectly using the kernel K:
      G = [ K(xi, xj) ],  an n × n matrix
- Find the leading singular vectors of G, as in a principal components analysis.
- These become directions in the model.
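A minimal sketch of this construction on the bulls-eye example, using numpy; the kernel constant c, the number of components k, and the simulated data are illustrative choices rather than values from the talk, and the Gram matrix is left uncentered for simplicity.

```python
import numpy as np

def gaussian_kernel_gram(X, c=1.0):
    """Gram matrix G[i, j] = c * exp(-||x_i - x_j||^2), computed without forming phi(x)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return c * np.exp(-np.maximum(sq_dists, 0.0))

def rkhs_features(X, k=2, c=1.0):
    """Leading k singular vectors of the Gram matrix, used as candidate regression features."""
    G = gaussian_kernel_gram(X, c)
    U, s, _ = np.linalg.svd(G)          # G is symmetric PSD, so this is also its eigendecomposition
    return U[:, :k] * s[:k]             # scale each direction by its singular value

if __name__ == "__main__":
    # Bulls-eye data: class 0 inside a small circle, class 1 in a surrounding ring.
    rng = np.random.default_rng(0)
    r = np.concatenate([rng.uniform(0, 1, 200), rng.uniform(2, 3, 200)])
    theta = rng.uniform(0, 2 * np.pi, 400)
    X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
    y = (r > 1.5).astype(int)

    Z = rkhs_features(X, k=2, c=1.0)
    # The groups, inseparable by a line in (X1, X2), tend to separate along the leading kernel directions.
    print("mean of first component, inner vs outer:", Z[y == 0, 0].mean(), Z[y == 1, 0].mean())
```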

Example of RKHS (continued)
For the bulls-eye, the leading two singular vectors convert the circular boundary to a hyperplane.
[Figure: panels "Original" and "Gaussian kernel" showing the bulls-eye cases before and after the kernel mapping.]

Recap: expand with components from the RKHS; search and selection methods, with experts recommending features to an auction.

Auction-Based Search
Lesson from the analysis of bankruptcy: interactions help, but all interactions? Must we consider every interaction, or just those among predictors already in the model?
Further motivation:
- Substantive experts reveal missing features.
- In some applications, the scope of the search depends on the state of the model. Examples: citations in CiteSeer, genetics.
- Streaming features.

Expert Feature Auction
An expert is a strategy that recommends a candidate feature to add to the model. Examples:
- PCA of the original data
- RKHS using various kernels
- Interactions
- Parasitic experts
- Substantive transformations
Experts bid for the opportunity to recommend a feature (or bundle).

Feature Auction
An expert is rewarded if it is correct. Experts hold wealth:
- If the recommended feature is accepted into the model, the expert earns ω additional wealth.
- If the recommended feature is refused, the expert loses its bid.
As the auction proceeds, it rewards experts that offer useful features, allowing them to recommend more X's, and eliminates experts whose features are not accepted.

Alpha-Investing
Wealth = Type I error. Each expert begins the auction with a nominal level to spend, say W_0 = 0.05.
At step j of the auction:
- The expert bids 0 ≤ α_j ≤ W_{j-1} to recommend X (assume this is the largest bid).
- The model assigns a p-value p to X.
- If p ≤ α_j: add X and set W_j = W_{j-1} + (ω − p).
- If p > α_j: do not add X and set W_j = W_{j-1} − α_j.

Discussion of Alpha-Investing
Similar to the alpha-spending rules used in clinical trials, but it allows good experts to continue suggesting features: infinitely many tests.
It can imitate various tests of multiple null hypotheses:
- Bonferroni test of H_0(1), ..., H_0(m): set W_0 = α and reward ω = 0; bid α_j = α/m.
- Step-down test: set W_0 = α and reward ω = α. Test all m hypotheses at level α/m; if none are significant, stop. If one is significant, earn α back and test the remaining m − 1 conditional on p_j > α/m.
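A minimal sketch of the alpha-investing bookkeeping just described, for a single expert facing a stream of p-values; the spending schedule α_j = W_{j-1}/(2j) and the simulated p-values are illustrative choices, not rules from the talk.

```python
import numpy as np

def alpha_investing(p_values, w0=0.05, omega=0.05):
    """Sequential testing with the wealth updates from the slide above.

    The bid alpha_j = W_{j-1} / (2 * j) is an illustrative spending schedule.
    Returns a boolean array marking which hypotheses (features) were accepted.
    """
    wealth = w0
    accepted = np.zeros(len(p_values), dtype=bool)
    for j, p in enumerate(p_values, start=1):
        if wealth <= 0:
            break                          # expert is out of the auction
        alpha_j = wealth / (2 * j)         # bid: 0 <= alpha_j <= W_{j-1}
        if p <= alpha_j:
            accepted[j - 1] = True
            wealth += (omega - p)          # feature accepted: earn omega, pay p
        else:
            wealth -= alpha_j              # feature refused: lose the bid
    return accepted

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # First 5 p-values come from real signal, the rest are null (uniform).
    p_vals = np.concatenate([rng.uniform(0, 0.001, 5), rng.uniform(0, 1, 200)])
    print("accepted features:", np.nonzero(alpha_investing(p_vals))[0])
```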

Discussion of Alpha-Investing (continued)
- Alpha-investing can test an infinite sequence of hypotheses; step-down testing allows only a finite collection and must begin with the ordered p-values.
- Alpha-investing is sequential: an expert with good science bids heavily on the hypotheses assumed to be most useful, e.g. spending on the order of α_j ≈ W_0 / j².

Over-fitting?
If an expert receives α back in the feature auction, what is to stop the model from over-fitting?

Excess Discovery Count
The number of correct rejections in excess of, say, 95% of the total rejections.
Terminology:
- S_θ(m) = # correct rejections in m tests
- R(m) = # rejections in m tests
Excess discovery count:
      EDC_{α,γ}(m) = α + E_θ( S_θ(m) − γ R(m) )
A procedure controls EDC if EDC_{α,γ}(m) ≥ 0.
[Figure: the counts E_θ R(m), E_θ S_θ(m), γ E_θ R(m), and the level α as functions of the signal strength θ, from no signal through moderate to strong signal.]
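A small simulation sketch of the quantities in this definition, under a known truth; the Bonferroni-style rejection rule, the effect size, and γ = 0.95 are illustrative assumptions, not the procedure from the talk.

```python
import numpy as np
from scipy.stats import norm

def edc_estimate(n_sims=2000, m=100, n_signal=10, effect=3.0, alpha=0.05, gamma=0.95):
    """Monte Carlo estimate of EDC_{alpha,gamma}(m) = alpha + E_theta[ S_theta(m) - gamma * R(m) ]
    for a naive rule that rejects H0(j) when the two-sided p-value falls below alpha/m."""
    rng = np.random.default_rng(2)
    is_signal = np.arange(m) < n_signal
    excess = np.empty(n_sims)
    for s in range(n_sims):
        z = rng.normal(0, 1, m) + effect * is_signal      # test statistics
        p = 2 * norm.sf(np.abs(z))                        # two-sided p-values
        reject = p <= alpha / m                           # Bonferroni-style rule
        S = np.sum(reject & is_signal)                    # correct rejections
        R = np.sum(reject)                                # all rejections
        excess[s] = S - gamma * R
    return alpha + excess.mean()

if __name__ == "__main__":
    print("estimated EDC:", edc_estimate())   # a value >= 0 indicates EDC is controlled here
```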

Alpha-Investing Controls EDC
Theorem: An alpha-investing rule with initial wealth W_0 ≤ α and payoff ω ≤ (1 − γ) controls EDC. That is, for a sequence of honest tests of H_0(1), ..., H_0(m), ... and any stopping time M,
      inf_M inf_θ E_θ EDC_{α,γ}(M) ≥ 0.

Comparison to FDR
Notation: R(m) = # rejected = S_θ(m) + V_θ(m), where V_θ(m) = # false rejections (Type I errors).
The false discovery rate controls the ratio of false positives to rejections:
      γ ≥ E_θ( V_θ(m) / R(m) | R(m) > 0 )
Control of EDC implies that
      E_θ V_θ(m) / E_θ R(m) ≤ (1 − γ) + α P(R(m) > 0) / E_θ R(m),
a bound on a ratio that closely tracks the FDR.

Recap: expand with components from the RKHS; experts recommend features to an auction; estimation, where adaptive shrinkage improves on the testimator.

Testimators
Estimate the mean μ in a multivariate normal, Y ~ N(μ, σ² I).
Hard thresholding (Donoho & Johnstone, 1994, wavelets):
      μ̂_i = y_i if |y_i| > σ √(2 log m), and 0 otherwise.
- Possesses a certain minimax optimality: it bounds the risk relative to an oracle that knows which variables to include in the model.
- Basically the same as using a Bonferroni rule applied to the p-values.
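A minimal sketch of this hard-thresholding testimator; the dimensions and signal values in the example are illustrative.

```python
import numpy as np

def hard_threshold(y, sigma=1.0):
    """Testimator: keep y_i only when |y_i| exceeds sigma * sqrt(2 log m), else estimate 0."""
    m = len(y)
    cutoff = sigma * np.sqrt(2.0 * np.log(m))
    return np.where(np.abs(y) > cutoff, y, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    mu = np.zeros(1000)
    mu[:20] = 5.0                       # a few strong signals among many zero means
    y = mu + rng.normal(0, 1, 1000)     # Y ~ N(mu, sigma^2 I) with sigma = 1
    mu_hat = hard_threshold(y)
    print("nonzero estimates:", np.count_nonzero(mu_hat))
    print("mean squared error:", np.mean((mu_hat - mu) ** 2))
```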

[Figure: "Adaptive Estimator" — Polyshrink(b) plotted against b over −4 to 4 for signal means 0, 1.00, 1.15, 1.25, 1.35, and 2.00.]

Polyshrink
- Adaptively shrinks the estimator when fitting in higher dimensions.
- About the same as a testimator when fitting one estimate; in higher dimensions the shrinkage varies with the level of signal found.
- Possesses a type of optimality, in the sense of a robust prior.
- Resembles empirical Bayes estimators (e.g., Silverman & Johnstone).

Value in Modeling
- Evaluating one predictor at a time: no real gain over the testimator.
- Evaluating several predictors at once: the shrinkage has some teeth.
Several predictors at once? We generally add one at a time, eschewing the Principle of Marginality. Bundles originate in the RKHS: take the top k components from the feature space.

Recap: expand with components from the RKHS; experts recommend features to an auction; adaptive shrinkage improves on the testimator; structure of the model, where we estimate an empirical link function.

Calibration
A model is calibrated if its predictions are correct on average:
      E(Y | Ŷ) = Ŷ
The link function in a generalized linear model plays a similar role: E(y) = g(x′β). Rather than assume a known link, estimate the link as part of the modeling.
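A minimal sketch of an empirical calibration check along these lines: group cases by their predicted values and compare the average response in each group with the average prediction. The equal-count binning scheme and the simulated miscalibrated predictions are illustrative choices, not the estimator used in the talk.

```python
import numpy as np

def calibration_table(y, y_hat, n_bins=10):
    """Average observed response vs. average prediction within bins of the predictions.
    For a calibrated model the two columns roughly agree: E(Y | Yhat) = Yhat."""
    order = np.argsort(y_hat)
    bins = np.array_split(order, n_bins)           # equal-count bins of the sorted predictions
    return np.array([[y_hat[b].mean(), y[b].mean()] for b in bins])

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    x = rng.normal(size=5000)
    p_true = 1 / (1 + np.exp(-2 * x))              # true probabilities
    y = rng.binomial(1, p_true)                    # binary response
    p_miscal = np.clip(p_true * 0.7 + 0.1, 0, 1)   # a deliberately miscalibrated prediction
    for avg_pred, avg_y in calibration_table(y, p_miscal):
        print(f"avg predicted {avg_pred:.3f}   avg observed {avg_y:.3f}")
```

Replacing each prediction by a smooth of these observed averages (isotonic regression is one common choice) is one way to carry out the kind of before/after re-calibration shown in the plots below.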

[Figure: "Empirical Calibration" — before calibration, average Y versus average predicted value; after calibration, proportion BR versus average prediction, both on a 0–1 scale.]

Recap: expand with components from the RKHS; experts recommend features to an auction; adaptive shrinkage improves on the testimator; estimate an empirical link function.

Problems and Challenges
- Control the proliferation of interactions
- Incorporate expert guidance
- Explore richer spaces of predictors
- Run faster

Computing
Streaming selection is much faster than batch; we have run 1,000,000+ features in applications.

NIPS Data Sets
Comparisons: a competition among 100+ algorithms. The goal is to predict the cases in a hold-back sample; success is measured by the area under the ROC curve. The data sets come from a variety of contexts and are more wide than tall.

Results: 2003 NIPS
Unlike the bankruptcy (BR) data, these have very high signal rates.

  Dataset    n      m        AUC     NIPS*
  Arcene     200    10,000   0.93    0.96
  Dexter     600    20,000   0.99+   0.992
  Dorothea   1150   100,000  ?
  Gisette    7000   5,000    0.995   0.999
  Madelon    2600   500      0.94    0.95

Results: Face Detection
10,000 images, 5,000 with faces and 5,000 without. Type I error at 50% Type II error:

  Method      Type I
  AdaBoost    0.07
  FFS         0.07
  AsymBoost   0.07
  Streaming   0.05

More examples.

What Next?
- Working on a faster version of the software; data formats are a big issue.
- Implement subspace shrinkage; the current implementation uses hard thresholding.
- Improve the expert strategy: the goal of machine learning is a turn-key system, but we prefer the ability to build in expertise.