Decompose Error Rate into components, some of which can be measured on unlabeled data
1 Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance Analysis of Learning Algorithms Effect of Bagging on Bias and Variance Effect of Boosting on Bias and Variance Summary and Conclusion
2 Bias-Variance Analysis in Regression True function is y = f(x) + ε, where ε is normally distributed with zero mean and standard deviation σ. Given a set of training examples {(x_i, y_i)}, we fit a hypothesis h(x) = w·x + b to the data to minimize the squared error Σ_i [y_i − h(x_i)]²
3 Example: 20 points y = x + 2 sin(1.5x) + N(0,0.2)
4 50 fits (20 examples each)
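The following sketch (assuming NumPy; the input range 0–10 and the random seed are arbitrary choices, not taken from the slides) generates the toy data y = x + 2 sin(1.5x) + N(0, 0.2) and refits a straight line to 50 independent 20-point samples, which is what the "50 fits" figures show. The spread of the fitted lines at a query point previews the variance term defined below.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # True function from the slides: f(x) = x + 2 sin(1.5 x)
    return x + 2.0 * np.sin(1.5 * x)

def sample(n=20, sigma=0.2):
    # n training points with additive Gaussian noise N(0, sigma)
    x = rng.uniform(0.0, 10.0, size=n)   # the input range is an assumption
    y = f(x) + rng.normal(0.0, sigma, size=n)
    return x, y

# 50 linear fits h(x) = w*x + b, one per fresh 20-point training sample
fits = [np.polyfit(*sample(), deg=1) for _ in range(50)]

# How much the fitted lines disagree at a query point reflects the variance
x_star = 5.0
preds = np.array([w * x_star + b for w, b in fits])
print("mean prediction:", preds.mean(), " spread (std):", preds.std())
```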
5 Bias-Variance Analysis Now, given a new data point x* (with observed value y* = f(x*) + ε), we would like to understand the expected prediction error E[(y* − h(x*))²]
6 Classical Statistical Analysis Imagine that our particular training sample S is drawn from some population of possible training samples according to P(S). Compute E_P[(y* − h(x*))²] Decompose this into bias, variance, and noise
7 Lemma Let Z be a random variable with probability distribution P(Z). Let Z̄ = E_P[Z] be the average value of Z. Lemma: E[(Z − Z̄)²] = E[Z²] − Z̄² Proof: E[(Z − Z̄)²] = E[Z² − 2 Z Z̄ + Z̄²] = E[Z²] − 2 E[Z] Z̄ + Z̄² = E[Z²] − 2 Z̄² + Z̄² = E[Z²] − Z̄² Corollary: E[Z²] = E[(Z − Z̄)²] + Z̄²
8 Bias-Variance-Noise Decomposition E[(h(x*) − y*)²] = E[h(x*)² − 2 h(x*) y* + y*²] = E[h(x*)²] − 2 E[h(x*)] E[y*] + E[y*²] = E[(h(x*) − h̄(x*))²] + h̄(x*)² (lemma) − 2 h̄(x*) f(x*) + E[(y* − f(x*))²] + f(x*)² (lemma) = E[(h(x*) − h̄(x*))²] [variance] + (h̄(x*) − f(x*))² [bias²] + E[(y* − f(x*))²] [noise]
9 Derivation (continued) E[(h(x*) − y*)²] = E[(h(x*) − h̄(x*))²] + (h̄(x*) − f(x*))² + E[(y* − f(x*))²] = Var(h(x*)) + Bias(h(x*))² + E[ε²] = Var(h(x*)) + Bias(h(x*))² + σ² Expected prediction error = Variance + Bias² + Noise
10 Bias, Variance, and Noise Variance: E[(h(x*) − h̄(x*))²] Describes how much h(x*) varies from one training set S to another Bias: h̄(x*) − f(x*) Describes the average error of h(x*). Noise: E[(y* − f(x*))²] = E[ε²] = σ² Describes how much y* varies from f(x*)
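As a numerical check on the decomposition, the sketch below (same toy problem and assumed settings as above) estimates both sides of E[(y* − h(x*))²] = Variance + Bias² + σ² by Monte Carlo over fresh training sets; the two printed values should nearly coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.2
f = lambda x: x + 2.0 * np.sin(1.5 * x)
x_star = 5.0

sq_errors, preds = [], []
for _ in range(20000):
    # Fresh 20-point training set S and a fresh noisy observation y* at x*
    x = rng.uniform(0.0, 10.0, 20)
    y = f(x) + rng.normal(0.0, sigma, 20)
    w, b = np.polyfit(x, y, 1)
    preds.append(w * x_star + b)
    y_star = f(x_star) + rng.normal(0.0, sigma)
    sq_errors.append((y_star - preds[-1]) ** 2)

preds = np.array(preds)
variance = preds.var()                     # E[(h(x*) - h_bar(x*))^2]
bias2 = (preds.mean() - f(x_star)) ** 2    # (h_bar(x*) - f(x*))^2
print(np.mean(sq_errors), variance + bias2 + sigma**2)   # should agree closely
```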
11 50 fits (20 examples each)
12 Bias
13 Variance
14 Noise
15 50 fits (20 examples each)
16 Distribution of predictions at x=2.0
17 50 fits (20 examples each)
18 Distribution of predictions at x=5.0
19 Measuring Bias and Variance In practice (unlike in theory), we have only ONE training set S. We can simulate multiple training sets by bootstrap replicates: S′ = {x | x is drawn at random with replacement from S}, with |S′| = |S|.
20 Procedure for Measuring Bias and Variance Construct B bootstrap replicates of S (e.g., B = 200): S_1, …, S_B Apply the learning algorithm to each replicate S_b to obtain hypothesis h_b Let T_b = S \ S_b be the data points that do not appear in S_b (out-of-bag points) Compute the predicted value h_b(x) for each x in T_b
21 Estimating Bias and Variance (continued) For each data point x, we will now have the observed corresponding value y and several predictions y_1, …, y_K. Compute the average prediction h̄. Estimate bias as (h̄ − y) Estimate variance as Σ_k (y_k − h̄)²/(K − 1) Assume noise is 0
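A minimal sketch of this out-of-bag procedure, assuming NumPy and two hypothetical callables `fit` and `predict` supplied by the caller (the function name and interface are illustrative, not the lecturer's code). It squares the per-point bias (h̄ − y) so the returned quantities line up with the Bias² and Variance terms of the decomposition, and it assumes zero noise as on the slide.

```python
import numpy as np

def bias_variance_bootstrap(X, y, fit, predict, B=200, seed=0):
    """Out-of-bag estimate of (squared) bias and variance, assuming zero noise.

    `fit(X, y)` returns a model; `predict(model, X)` returns predictions.
    X is an (n, d) array, y an (n,) array.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = [[] for _ in range(n)]               # out-of-bag predictions per point

    for _ in range(B):
        idx = rng.integers(0, n, size=n)         # bootstrap replicate S_b
        oob = np.setdiff1d(np.arange(n), idx)    # T_b = S \ S_b (out-of-bag points)
        model = fit(X[idx], y[idx])
        for i, p in zip(oob, predict(model, X[oob])):
            preds[i].append(p)

    bias2, var = [], []
    for i, p_list in enumerate(preds):
        if len(p_list) < 2:
            continue                             # point never out of bag: skip
        p = np.array(p_list)
        h_bar = p.mean()                         # average prediction for x_i
        bias2.append((h_bar - y[i]) ** 2)        # (h_bar - y)^2, noise assumed 0
        var.append(p.var(ddof=1))                # sum_k (y_k - h_bar)^2 / (K - 1)
    return float(np.mean(bias2)), float(np.mean(var))
```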
22 Approximations in this Procedure Bootstrap replicates are not real data We ignore the noise If we have multiple data points with the same x value, then we can estimate the noise We can also estimate noise by pooling y values from nearby x values
23 Ensemble Learning Methods Given training sample S Generate multiple hypotheses h_1, h_2, …, h_L Optionally: determine corresponding weights w_1, w_2, …, w_L Classify new points according to Σ_l w_l h_l(x) > θ
24 Bagging: Bootstrap Aggregating For b = 1, …, B do S_b = bootstrap replicate of S Apply the learning algorithm to S_b to learn h_b Classify new points by unweighted vote: [Σ_b h_b(x)]/B > 0
25 Bagging Bagging makes predictions according to y = Σ_b h_b(x)/B Hence, bagging's predictions are h̄(x)
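A sketch of bagging as described on the two preceding slides, assuming NumPy and a hypothetical base learner `learn(X, y)` that returns a model with a `.predict()` method; class labels are assumed to be coded as -1/+1 so that the unweighted vote reduces to the sign of the average prediction.

```python
import numpy as np

def bagging_fit(X, y, learn, B=30, seed=0):
    """Bagging: train B hypotheses on bootstrap replicates of the training set.

    `learn` is a hypothetical callable learn(X, y) -> model with .predict();
    labels are assumed to be -1/+1, matching the vote rule on the slide.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap replicate S_b
        models.append(learn(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Unweighted vote: average the individual predictions and take the sign
    avg = np.mean([m.predict(X) for m in models], axis=0)
    return np.where(avg > 0, 1, -1)
```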
26 Estimated Bias and Variance of Bagging If we estimate bias and variance using the same B bootstrap samples, we will have: Bias = (h̄ − y) [same as before] Variance = Σ_k (h̄ − h̄)²/(K − 1) = 0 Hence, according to this approximate way of estimating variance, bagging removes the variance while leaving bias unchanged. In reality, bagging only reduces variance and tends to slightly increase bias
27 Bias/Variance Heuristics Models that fit the data poorly have high bias: inflexible models such as linear regression, regression stumps Models that can fit the data very well have low bias but high variance: flexible models such as nearest neighbor regression, regression trees This suggests that bagging of a flexible model can reduce the variance while benefiting from the low bias
28 Bias-Variance Decomposition for Classification Can we extend the bias-variance decomposition to classification problems? Several extensions have been proposed; we will study the extension due to Pedro Domingos (2000a; 2000b) Domingos developed a unified decomposition that covers both regression and classification
29 Classification Problems: Noisy Channel Model Data points are generated by y_i = n(f(x_i)), where f(x_i) is the true class label of x_i and n(·) is a noise process that may change the true label f(x_i). Given a training set {(x_1, y_1), …, (x_m, y_m)}, our learning algorithm produces a hypothesis h. Let y* = n(f(x*)) be the observed label of a new data point x*. h(x*) is the predicted label. The error ("loss") is defined as L(h(x*), y*)
30 Loss Functions for Classification The usual loss function is 0/1 loss: L(y, ŷ) is 0 if y = ŷ and 1 otherwise. Our goal is to decompose E_P[L(h(x*), y*)] into bias, variance, and noise terms
31 Discrete Equivalent of the Mean: The Main Prediction As before, we imagine that our observed training set S was drawn from some population according to P(S) Define the main prediction to be y_m(x*) = argmin_y E_P[L(y, h(x*))] For 0/1 loss, the main prediction is the most common vote of h(x*) (taken over all training sets S weighted according to P(S)) For squared error, the main prediction is h̄(x*)
32 Bias, Variance, Noise Bias B(x*) = L(y_m, f(x*)) This is the loss of the main prediction with respect to the true label of x* Variance V(x*) = E[L(h(x*), y_m)] This is the expected loss of h(x*) relative to the main prediction Noise N(x*) = E[L(y*, f(x*))] This is the expected loss of the noisy observed value y* relative to the true label of x*
33 Squared Error Loss These definitions give us the results we have already derived for squared error loss L(y, ŷ) = (y − ŷ)² Main prediction y_m = h̄(x*) Bias²: L(h̄(x*), f(x*)) = (h̄(x*) − f(x*))² Variance: E[L(h(x*), h̄(x*))] = E[(h(x*) − h̄(x*))²] Noise: E[L(y*, f(x*))] = E[(y* − f(x*))²]
34 0/1 Loss for 2 classes There are three components that determine whether y* = h(x*) Noise: y* = f(x*)? Bias: f(x*) = y_m? Variance: y_m = h(x*)? Bias is either 0 or 1, because neither f(x*) nor y_m is a random variable
35 Case Analysis of Error Unbiased case, f(x*) = y_m: if y_m = h(x*) and y* = f(x*) → correct; if y_m = h(x*) and y* ≠ f(x*) → error [noise]; if y_m ≠ h(x*) and y* = f(x*) → error [variance]; if y_m ≠ h(x*) and y* ≠ f(x*) → correct [noise cancels variance]. Biased case, f(x*) ≠ y_m: if y_m = h(x*) and y* = f(x*) → error [bias]; if y_m = h(x*) and y* ≠ f(x*) → correct [noise cancels bias]; if y_m ≠ h(x*) and y* = f(x*) → correct [variance cancels bias]; if y_m ≠ h(x*) and y* ≠ f(x*) → error [noise cancels variance cancels bias]
36 Unbiased case Let P(y* ≠ f(x*)) = N(x*) = τ Let P(y_m ≠ h(x*)) = V(x*) = σ If f(x*) = y_m, then we suffer a loss if exactly one of these events occurs: E[L(h(x*), y*)] = τ(1 − σ) + σ(1 − τ) = τ + σ − 2τσ = N(x*) + V(x*) − 2 N(x*) V(x*)
37 Biased Case Let P(y* ≠ f(x*)) = N(x*) = τ Let P(y_m ≠ h(x*)) = V(x*) = σ If f(x*) ≠ y_m, then we suffer a loss if either both or neither of these events occurs: E[L(h(x*), y*)] = τσ + (1 − σ)(1 − τ) = 1 − (τ + σ − 2τσ) = B(x*) − [N(x*) + V(x*) − 2 N(x*) V(x*)]
38 Decomposition for 0/1 Loss (2 classes) We do not get a simple additive decomposition in the 0/1 loss case: E[L(h(x*), y*)] = if B(x*) = 1: B(x*) − [N(x*) + V(x*) − 2 N(x*) V(x*)] if B(x*) = 0: B(x*) + [N(x*) + V(x*) − 2 N(x*) V(x*)] In the biased case, noise and variance reduce error; in the unbiased case, noise and variance increase error
39 Summary of 0/1 Loss A good classifier will have low bias, in which case the expected loss will approximately equal the variance The interaction terms will usually be small, because both noise and variance will usually be < 0.2, so the interaction term 2 V(x*) N(x*) will be < 0.08
40 0/1 Decomposition in Practice In the noise-free case: E[L(h(x*), y*)] = if B(x*) = 1: B(x*) − V(x*) if B(x*) = 0: B(x*) + V(x*) It is usually hard to estimate N(x*), so we will use this formula
41 Decomposition over an entire data set Given a set of test points T = {(x*_1, y*_1), …, (x*_n, y*_n)}, we want to decompose the average loss: L̄ = Σ_i E[L(h(x*_i), y*_i)] / n We will write it as L̄ = B̄ + Vu − Vb, where B̄ is the average bias, Vu is the average unbiased variance, and Vb is the average biased variance (We ignore the noise.) Vu − Vb will be called the net variance
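A sketch of how B, Vu, and Vb can be estimated from the predictions of classifiers trained on different (e.g. bootstrap) training sets; the function name and array layout are illustrative, ties in the vote are broken arbitrarily, and noise is ignored as on the slide.

```python
import numpy as np
from collections import Counter

def zero_one_decomposition(preds, y_true):
    """Average 0/1 loss as B + Vu - Vb (noise ignored), following Domingos.

    `preds` is an (n_models, n_points) array of class labels, one row per
    classifier trained on a different training set; `y_true` holds the
    observed labels of the n_points test points.
    """
    preds = np.asarray(preds)
    n_models, n_points = preds.shape
    B = Vu = Vb = 0.0
    for j in range(n_points):
        votes = preds[:, j]
        y_m = Counter(votes.tolist()).most_common(1)[0][0]  # main prediction
        bias = float(y_m != y_true[j])                      # 0 or 1
        variance = float(np.mean(votes != y_m))             # P(h(x_j) != y_m)
        B += bias
        if bias == 0.0:
            Vu += variance          # unbiased variance adds to the error
        else:
            Vb += variance          # biased variance cancels part of the error
    B, Vu, Vb = B / n_points, Vu / n_points, Vb / n_points
    avg_loss = B + Vu - Vb          # the decomposition on the slide
    return avg_loss, B, Vu, Vb
```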
42 Classification Problems: Overlapping Distributions Model Suppose at each point x, the label is generated according to a probability distribution y ~ P(y|x) The goal of learning is to discover this probability distribution The loss function L(p, h) = KL(p, h) is the Kullback-Leibler divergence between the true distribution p and our hypothesis h.
43 Kullback-Leibler Divergence For simplicity, assume only two classes: y ∈ {0,1} Let p be the true probability P(y=1|x) and h be our hypothesis for P(y=1|x). The KL divergence is KL(p, h) = p log(p/h) + (1 − p) log((1 − p)/(1 − h))
44 Bias-Variance-Noise Decomposition for KL Goal: Decompose E_S[KL(y, h)] into noise, bias, and variance terms Compute the main prediction: h̄ = argmin_u E_S[KL(u, h)] This turns out to be the geometric mean: log(h̄/(1 − h̄)) = E_S[log(h/(1 − h))], i.e. h̄ = (1/Z) exp(E_S[log h])
45 Computing the Noise Obviously the best estimator h would be p. What loss would it receive? E[KL(y, p)] = E[y log(y/p) + (1 − y) log((1 − y)/(1 − p))] = E[y log y − y log p + (1 − y) log(1 − y) − (1 − y) log(1 − p)] = −p log p − (1 − p) log(1 − p) = H(p)
46 Bias, Variance, Noise Variance: E_S[KL(h̄, h)] Bias: KL(p, h̄) Noise: H(p) Expected loss = Noise + Bias + Variance: E[KL(y, h)] = H(p) + KL(p, h̄) + E_S[KL(h̄, h)]
47 Consequences of this Definition If our goal is probability estimation and we want to do bagging, then we should combine the individual probability estimates using the geometric mean: log(h̄/(1 − h̄)) = E_S[log(h/(1 − h))] In this case, bagging will produce pure variance reduction (as in regression)!
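A sketch of this geometric-mean combination rule, assuming NumPy; the clipping constant is an implementation detail added to avoid log(0) and is not part of the slides.

```python
import numpy as np

def geometric_mean_combine(probs, eps=1e-12):
    """Combine bagged estimates of P(y=1|x) by averaging log-odds.

    Implements the slide's rule log(h_bar/(1-h_bar)) = E_S[log(h/(1-h))]:
    the combined estimate is the (normalized) geometric mean of the h's.
    """
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    mean_log_odds = np.mean(np.log(p) - np.log(1.0 - p))
    return 1.0 / (1.0 + np.exp(-mean_log_odds))

# Example: three bagged probability estimates for one test point
print(geometric_mean_combine([0.9, 0.6, 0.7]))   # ~0.76, vs. arithmetic mean 0.733
```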
48 Experimental Studies of Bias and Variance Artificial data: Can generate multiple training sets S and measure bias and variance directly Benchmark data sets: Generate bootstrap replicates and measure bias and variance on separate test set
49 Algorithms to Study K-nearest neighbors: What is the effect of K? Decision trees: What is the effect of pruning? Support Vector Machines: What is the effect of kernel width σ?
50 K-nearest neighbor (Domingos, 2000) Chess (left): Increasing K primarily reduces Vu Audiology (right): Increasing K primarily increases B.
51 Size of Decision Trees Glass (left), Primary tumor (right): deeper trees have lower B, higher Vu
52 Example: 200 linear SVMs (training sets of size 20) Error: 13.7% Bias: 11.7% Vu: 5.2% Vb: 3.2%
53 Example: 200 RBF SVMs σ = 5 Error: 15.0% Bias: 5.8% Vu: 11.5% Vb: 2.3%
54 Example: 200 RBF SVMs σ = 50 Error: 14.9% Bias: 10.1% Vu: 7.8% Vb: 3.0%
55 SVM Bias and Variance Bias-Variance tradeoff controlled by σ Biased classifier (linear SVM) gives better results than a classifier that can represent the true decision boundary!
56 B/V Analysis of Bagging Under the bootstrap assumption, bagging reduces only variance Removing Vu reduces the error rate Removing Vb increases the error rate Therefore, bagging should be applied to low-bias classifiers, because then Vb will be small Reality is more complex!
57 Bagging Nearest Neighbor Bagging first-nearest neighbor is equivalent (in the limit) to a weighted majority vote in which the k-th neighbor receives a weight of exp(−(k−1)) − exp(−k) Since the first nearest neighbor gets more than half of the vote, it will always win this vote. Therefore, bagging 1-NN is equivalent to 1-NN.
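A quick numerical illustration of the reconstructed weight formula (assuming NumPy): the first neighbor's limiting weight is 1 − e^(−1) ≈ 0.632, which exceeds one half, so it always wins the vote.

```python
import numpy as np

# Limiting weight of the k-th nearest neighbor under bagging 1-NN,
# using the formula exp(-(k-1)) - exp(-k) as reconstructed above
k = np.arange(1, 11)
w = np.exp(-(k - 1)) - np.exp(-k)
print(w[0])        # ~0.632 > 0.5, so the first neighbor always wins the vote
print(w.sum())     # the first ten weights already sum to ~1
```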
58 Bagging Decision Trees Consider unpruned trees of depth 2 on the Glass data set. In this case, the error is almost entirely due to bias Perform 30-fold bagging (replicated 50 times; 10-fold cross-validation) What will happen?
59 Bagging Primarily Reduces Bias!
60 Questions Is this due to the failure of the bootstrap assumption in bagging? Is this due to the failure of the bootstrap assumption in estimating bias and variance? Should we also think of Bagging as a simple additive model that expands the range of representable classifiers?
61 Bagging Large Trees? Now consider unpruned trees of depth 10 on the Glass dataset. In this case, the trees have much lower bias. What will happen?
62 Answer: Bagging Primarily Reduces Variance
63 Bagging of SVMs We will choose a low-bias, high-variance SVM to bag: RBF SVM with σ=5
64 RBF SVMs again: σ = 5
65 Effect of 30-fold Bagging: Variance is Reduced
66 Effects of 30-fold Bagging Vu is decreased by 0.010; Vb is unchanged. Bias is increased by 0.005. Error is reduced by 0.005
67 Bagging Decision Trees (Freund & Schapire)
68 Boosting
69 Bias-Variance Analysis of Boosting Boosting seeks to find a weighted combination of classifiers that fits the data well Prediction: Boosting will primarily act to reduce bias
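To make "weighted combination of classifiers" concrete, here is a generic AdaBoost sketch with decision stumps, assuming scikit-learn is available and labels are coded as -1/+1; it is standard AdaBoost, not necessarily the exact variant behind the experiments on the following slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, rounds=50):
    """Minimal AdaBoost sketch with decision stumps (labels in {-1, +1})."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # example weights
    stumps, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)   # weight of this classifier
        w *= np.exp(-alpha * y * pred)            # up-weight misclassified points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Weighted vote of the weak classifiers
    score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(score)
```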
70 Boosting DNA splice (left) and Audiology (right) Early iterations reduce bias. Later iterations also reduce variance
71 Boosting vs Bagging (Freund & Schapire)
72 Review and Conclusions For regression problems (squared error loss), the expected error rate can be decomposed into Bias(x*)² + Variance(x*) + Noise(x*) For classification problems (0/1 loss), the expected error rate depends on whether bias is present: if B(x*) = 1: B(x*) − [V(x*) + N(x*) − 2 V(x*) N(x*)] if B(x*) = 0: B(x*) + [V(x*) + N(x*) − 2 V(x*) N(x*)] or B(x*) + Vu(x*) − Vb(x*) [ignoring noise]
73 Review and Conclusions (2) For classification problems with log loss, the expected loss can be decomposed into noise + bias + variance: E[KL(y, h)] = H(p) + KL(p, h̄) + E_S[KL(h̄, h)]
74 Sources of Bias and Variance Bias arises when the classifier cannot represent the true function that is, the classifier underfits the data Variance arises when the classifier overfits the data There is often a tradeoff between bias and variance
75 Effect of Algorithm Parameters on Bias and Variance k-nearest neighbor: increasing k typically increases bias and reduces variance decision trees of depth D: increasing D typically increases variance and reduces bias RBF SVM with parameter σ: increasing σ increases bias and reduces variance
76 Effect of Bagging If the bootstrap replicate approximation were correct, then bagging would reduce variance without changing bias In practice, bagging can reduce both bias and variance For high-bias classifiers, it can reduce bias (but may increase Vu) For high-variance classifiers, it can reduce variance
77 Effect of Boosting In the early iterations, boosting is primarily a bias-reducing method In later iterations, it appears to be primarily a variance-reducing method