Improving Generalization



Improving Generalization
Introduction to Neural Networks: Lecture 10
John A. Bullinaria, 2004

1. Improving Generalization
2. Training, Validation and Testing Data Sets
3. Cross-Validation
4. Weight Restriction and Weight Sharing
5. Stopping Training Early
6. Regularization and Weight Decay
7. Adding Noise / Jittering
8. Which is the Best Approach?

Improving Generalization

We have seen that, for our networks to generalize well, we need to avoid both under-fitting of the training data (which corresponds to high statistical bias) and over-fitting of the training data (which corresponds to high statistical variance). There are a number of approaches for improving generalization; we can:

1. Arrange to have the optimum number of free parameters (independent connection weights) in our model.
2. Stop the gradient descent training at the appropriate point.
3. Add a regularization term to the error function to smooth out the mappings that are learnt.
4. Add noise to the training patterns to smooth out the data points.

To employ these effectively, we need to be able to estimate from our training data what the generalization is likely to be.

Training, Validation and Testing Data Sets

We have talked about training data sets: the data used for training our networks. The testing data set is the unseen data that is used to test the network's generalization.

We usually want to optimize our network's training procedures to result in the best generalization, but using the testing data to do this would clearly be cheating. What we can do is assume that the training data and testing data are drawn randomly from the same data set, and then any sub-set of the training data that we do not train the network on can be used to estimate what the performance on the testing set will be, i.e. what the generalization will be.

The portion of the data available for training that is withheld from the network training is called the validation data set, and the remainder of the data is called the training data set. This approach is called the hold-out method. The idea is that we split the available data into training and validation sets, train various networks using the training set, test each one on the validation set, and the network which does best on the validation set is likely to provide the best generalization to the testing set.
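
To make the hold-out method concrete, here is a minimal Python/NumPy sketch. It assumes X and y are NumPy arrays; the names hold_out_split, train_network, validation_error and candidate_networks are hypothetical placeholders for illustration, not part of the lecture.

```python
import numpy as np

def hold_out_split(X, y, val_fraction=0.2, seed=0):
    """Randomly withhold a fraction of the available data as a validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Train several candidate networks on the training portion and keep the one
# with the lowest validation error (train_network, validation_error and
# candidate_networks are hypothetical):
# X_tr, y_tr, X_val, y_val = hold_out_split(X, y)
# best = min(candidate_networks,
#            key=lambda net: validation_error(train_network(net, X_tr, y_tr),
#                                             X_val, y_val))
```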

Cross-Validation

Often the availability of training data is limited, and using part of it as a validation set is not practical. An alternative is to use the procedure of cross-validation.

In K-fold cross-validation we divide all the training data at random into K distinct subsets, train the network using K − 1 subsets, and test the network on the remaining subset. The process of training and testing is then repeated for each of the K possible choices of the subset omitted from the training. The average performance on the K omitted subsets is then our estimate of the generalization performance.

This procedure has the advantage that it allows us to use a high proportion of the available training data (a fraction 1 − 1/K) for training, while making use of all the data points in estimating the generalization error. The disadvantage is that we need to train the network K times. Typically K ~ 10 is considered reasonable. If K is made equal to the full sample size, it is called leave-one-out cross-validation.
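
A minimal sketch of K-fold cross-validation, again assuming NumPy arrays; the training routine and error function are passed in as placeholders rather than assumed from the lecture.

```python
import numpy as np

def k_fold_cv_error(X, y, train_network, error, K=10, seed=0):
    """Estimate generalization error by averaging the error on each omitted fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, K)          # K roughly equal-sized subsets
    errors = []
    for k in range(K):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        net = train_network(X[train_idx], y[train_idx])
        errors.append(error(net, X[val_idx], y[val_idx]))
    return np.mean(errors)
```

Setting K equal to the number of data points gives leave-one-out cross-validation.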

Weight Restriction and Weight Sharing

Perhaps the most obvious way to prevent over-fitting in our models (i.e. neural networks) is to restrict the number of free parameters they have. The simplest way we can do this is to restrict the number of hidden units, as this will automatically reduce the number of weights. We can use some form of validation or cross-validation scheme to find the best number for each given problem.

An alternative is to have many weights in the network, but constrain certain groups of them to be equal. If there are symmetries in the problem, we can enforce hard weight sharing by building them into the network in advance. In other problems we can use soft weight sharing, where sets of weights are encouraged to have similar values by the learning algorithm. One way to do this is to add an appropriate term to the error/cost function. This method can then be seen as a particular form of regularization.
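
As a rough illustration of hard weight sharing, the sketch below ties several weight positions to a single shared parameter, so the effective number of free parameters is reduced; gradients at tied positions are summed. The array layout and the grad_at placeholder are assumptions, not part of the lecture.

```python
import numpy as np

# share_index[i] gives the index of the shared parameter used at weight position i.
share_index = np.array([0, 0, 1, 1, 1, 2])   # 6 weight positions, 3 free parameters
shared = np.random.randn(3) * 0.1            # the shared parameters

def expand(shared, share_index):
    """Build the full weight vector from the shared parameters."""
    return shared[share_index]

def shared_gradient(full_grad, share_index, n_shared):
    """Gradient w.r.t. each shared parameter: sum over its tied positions."""
    g = np.zeros(n_shared)
    np.add.at(g, share_index, full_grad)
    return g

# One gradient descent step on the shared parameters (grad_at is a placeholder
# for the usual back-propagated gradient w.r.t. the full weight vector):
# full_w = expand(shared, share_index)
# shared -= eta * shared_gradient(grad_at(full_w), share_index, len(shared))
```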

Stopping Training Early

Neural networks are often set up with more than enough parameters for over-fitting to occur, and so other procedures have to be employed to prevent it.

For the iterative gradient descent based network training procedures we have considered (such as batch back-propagation and conjugate gradients), the training set error will naturally decrease with increasing numbers of epochs of training. The error on the unseen validation and testing data sets, however, will start off decreasing as the under-fitting is reduced, but then it will eventually begin to increase again as over-fitting occurs.

The natural solution to get the best generalization, i.e. the lowest error on the test set, is to use the procedure of early stopping. One simply trains the network on the training set until the error on the validation set starts rising again, and then stops. That is the point at which we expect the generalization error to start rising as well.

Practical Aspects of Stopping Early

One potential problem with the idea of stopping early is that the validation error may go up and down numerous times during training. The safest approach is generally to train to convergence (or at least until it is clear that the validation error is unlikely to fall again), saving the weights at each epoch, and then go back to the weights at the epoch with the lowest validation error.

There is an approximate relationship between stopping early and a particular form of regularization, which indicates that it will work best if the training starts off with very small random initial weights.

There are general practical problems concerning how best to split the available training data into distinct training and validation data sets. For example: What fraction of the patterns should be in the validation set? Should the data be split randomly, or by some systematic algorithm? As so often, these issues are very problem dependent and there are no simple general answers.
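
A minimal sketch of this "train long, then roll back" form of early stopping. The one-epoch training step, the validation error function, and the assumption that the network object exposes a .weights attribute are all placeholders for illustration.

```python
import copy

def train_with_early_stopping(net, train_data, val_data,
                              train_one_epoch, validation_error,
                              max_epochs=1000):
    """Train to (near) convergence, keeping the weights with the lowest validation error."""
    best_error = float("inf")
    best_weights = copy.deepcopy(net.weights)
    for epoch in range(max_epochs):
        train_one_epoch(net, train_data)        # one pass of gradient descent training
        err = validation_error(net, val_data)
        if err < best_error:                    # validation error may rise and fall,
            best_error = err                    # so remember the best weights seen so far
            best_weights = copy.deepcopy(net.weights)
    net.weights = best_weights                  # roll back to the best epoch
    return net
```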

Regularization

The general technique of regularization encourages smoother network mappings by adding a penalty term Ω to the standard (e.g. sum squared error) cost function:

E_{reg} = E_{sse} + \lambda \Omega

where the regularization parameter λ controls the trade-off between reducing the error E_sse and increasing the smoothing. This modifies the gradient descent weight updates so that

\Delta w_{kl}^{(m)} = -\eta \frac{\partial E_{sse}(w_{ij}^{(n)})}{\partial w_{kl}^{(m)}} - \eta \lambda \frac{\partial \Omega(w_{ij}^{(n)})}{\partial w_{kl}^{(m)}}

For example, for the case of soft weight sharing we can use the regularization function

\Omega = -\sum_{k,l,m} \ln \left[ \sum_{j=1}^{M} \frac{\alpha_j}{\sqrt{2\pi\sigma_j^2}} \exp\!\left( -\frac{(w_{kl}^{(m)} - \mu_j)^2}{2\sigma_j^2} \right) \right]

in which the 3M parameters α_j, μ_j, σ_j are optimised along with the weights.
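
To make the soft weight sharing penalty concrete, here is a minimal NumPy sketch that evaluates Ω and its gradient with respect to the weights for fixed mixture parameters α_j, μ_j, σ_j (in practice these would be optimised along with the weights). The function name and array layout are assumptions for illustration.

```python
import numpy as np

def soft_sharing_penalty(w, alpha, mu, sigma):
    """Omega(w) = -sum_i ln sum_j alpha_j N(w_i; mu_j, sigma_j^2), plus its gradient.

    w: flat array of weights; alpha, mu, sigma: arrays of length M (mixture parameters).
    """
    w = w.reshape(-1, 1)                                   # (n_weights, 1)
    var = sigma ** 2
    gauss = np.exp(-(w - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)  # (n, M)
    mix = gauss @ alpha                                    # mixture density at each weight
    omega = -np.sum(np.log(mix))
    resp = (alpha * gauss) / mix[:, None]                  # responsibilities of each Gaussian
    grad = np.sum(resp * (w - mu) / var, axis=1)           # dOmega/dw_i
    return omega, grad
```

The gradient term can then be added, scaled by ηλ, to the usual back-propagated weight update, as in the update rule above.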

Weight Decay

One of the simplest forms of regularization has a regularization function which is just the sum of the squares of the network weights (not including the thresholds):

\Omega = \frac{1}{2} \sum_{k,l,m} \left( w_{kl}^{(m)} \right)^2

In conventional curve fitting this regularizer is known as ridge regression. We can see why it is called weight decay when we observe the extra term in the weight updates:

\Delta w_{kl}^{(m)} = -\eta \lambda \frac{\partial \Omega(w_{ij}^{(n)})}{\partial w_{kl}^{(m)}} = -\eta \lambda \, w_{kl}^{(m)}

In each epoch the weights decay in proportion to their size, i.e. exponentially. Empirically, this leads to significant improvements in generalization. This is because producing over-fitted mappings requires high curvature and hence large weights. Weight decay keeps the weights small and hence the mappings are smooth.
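
A minimal sketch of a gradient descent step with weight decay; grad_E stands for the back-propagated gradient of E_sse and is supplied by the caller.

```python
import numpy as np

def weight_decay_step(w, grad_E, eta=0.1, lam=1e-3):
    """One gradient descent step with a weight decay (ridge) penalty."""
    return w - eta * grad_E - eta * lam * w   # the extra term shrinks each weight towards zero

# With a zero error gradient the weights decay exponentially,
# by a factor (1 - eta*lam) per epoch:
w = np.array([1.0, -2.0, 0.5])
for epoch in range(3):
    w = weight_decay_step(w, grad_E=np.zeros_like(w))
# w is now (1 - 0.1 * 1e-3)**3 times its initial value
```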

Adding Noise / Jittering

Adding noise or jitter to the inputs during training is also found empirically to improve network generalization. This is because the noise will smear out each data point and make it difficult for the network to fit the individual data points precisely, and consequently reduce over-fitting.

Actually, if the added noise is ξ with variance λ, the error function can be written

E_{sse}(\xi) = \frac{1}{2} \sum_p \Big\langle \sum_j \big( y_j^p - net_j(x^p + \xi^p) \big)^2 \Big\rangle_\xi

and after some tricky mathematics one can show that

\lambda \Omega = E_{sse}(\xi) - E_{sse}(0) = \lambda \, \frac{1}{2} \sum_p \sum_i \sum_j \left( \frac{\partial\, net_j(x^p)}{\partial x_i^p} \right)^2

which is a standard Tikhonov regularizer minimising curvature. The gradient descent weight updates can then be performed with an extended back-propagation algorithm.
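
A minimal sketch of input jittering: each epoch the network is trained on a freshly noise-corrupted copy of the inputs, with noise variance λ. The one-epoch training routine is again a placeholder passed in by the caller.

```python
import numpy as np

def train_with_jitter(net, X, y, train_one_epoch, lam=0.01, epochs=100, seed=0):
    """Add fresh zero-mean Gaussian noise of variance lam to the inputs each epoch."""
    rng = np.random.default_rng(seed)
    for epoch in range(epochs):
        X_jittered = X + rng.normal(0.0, np.sqrt(lam), size=X.shape)
        train_one_epoch(net, X_jittered, y)
    return net
```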

Which is the Best Approach?

To get an optimal balance between bias and variance, we need to find a way of optimising the effective complexity of the model, which for neural networks means optimising the effective number of adaptable parameters.

One way to optimise that number is to use the appropriate number of connection weights, which can easily be adjusted by changing the number of hidden units. The other approaches we have looked at can all be seen to be effectively some form of regularization, which involves adding a penalty term to the standard gradient descent error function. The degree of regularization, and hence the effective complexity of the model, is controlled by adjusting the regularization scale λ.

In practice, we find the best number of hidden units, or degree of regularization, using a validation data set or the procedure of cross-validation. The approaches all work well, and which we choose will ultimately depend on which is most convenient for the particular problem in hand. Unfortunately, there is no overall best approach!

Overview and Reading

1. We began by recalling the aim of good generalization.
2. Then the ideas of validation and cross-validation were introduced as convenient methods for estimating generalization using only the available training data.
3. We then went through the main approaches for improving generalization: limiting the number of weights, weight sharing, stopping training early, regularization, weight decay, and adding noise to the inputs.
4. As usual, we concluded that there was no generally optimal approach.

Reading
1. Bishop: Sections 9.2, 9.3, 9.4, 9.8
2. Haykin: Sections 4.12, 4.14
3. Gurney: Section 6.10