Regularization in Neural Networks

Sargur Srihari

Topics in Neural Network Regularization
- What is regularization?
- Methods:
  1. Determining the optimal number of hidden units
  2. Use of a regularizer in the error function
     (linear transformations and consistent Gaussian priors)
  3. Early stopping
- Invariances:
  - Tangent propagation
  - Training with transformed data
  - Convolutional networks
  - Soft weight sharing

What is Regularization?
- In machine learning (as in statistics and inverse problems), regularization means introducing additional information to prevent over-fitting or to solve an ill-posed problem.
- This information usually takes the form of a penalty for complexity, e.g., restrictions on smoothness or bounds on the vector-space norm.
- Theoretical justification: regularization attempts to impose Occam's razor on the solution.
- From a Bayesian point of view, regularization corresponds to imposing prior distributions on the model parameters.

1. Regularization by Determining the Number of Hidden Units
- The numbers of input and output units are determined by the dimensionality of the data set.
- The number of hidden units M is a free parameter, adjusted to give the best predictive performance.
- A possible approach is to obtain a maximum-likelihood estimate of M that balances under-fitting against over-fitting.

Effect of Varying the Number of Hidden Units
- Sinusoidal regression problem: a two-layer network trained on 10 data points with M = 1, 3 and 10 hidden units.
- The sum-of-squares error function is minimized using conjugate gradient descent.
- Generalization error is not a simple function of M, owing to the presence of local minima in the error function.

Using a Validation Set to Determine the Number of Hidden Units
- Plot the validation error against the number of hidden units M, using 30 random starts for each value of M.
- For the polynomial data, the sum-of-squares test error shows the overall best validation-set performance at M = 8.
[Figure: sum-of-squares test error vs. number of hidden units M]
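A minimal sketch of this model-selection procedure. It uses scikit-learn's MLPRegressor as a stand-in for the two-layer network in the slides; the sinusoidal data, the candidate values of M, and the validation split are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)

# Illustrative sinusoidal data: 10 training points, a separate validation set.
x_train = rng.uniform(0, 1, size=(10, 1))
t_train = np.sin(2 * np.pi * x_train).ravel() + 0.1 * rng.randn(10)
x_val = rng.uniform(0, 1, size=(100, 1))
t_val = np.sin(2 * np.pi * x_val).ravel() + 0.1 * rng.randn(100)

best = None
for M in [1, 3, 8, 10]:                 # candidate numbers of hidden units
    for start in range(30):             # 30 random restarts per M
        net = MLPRegressor(hidden_layer_sizes=(M,), activation='tanh',
                           solver='lbfgs', max_iter=2000, random_state=start)
        net.fit(x_train, t_train)
        # Sum-of-squares error on the validation set.
        err = 0.5 * np.sum((net.predict(x_val) - t_val) ** 2)
        if best is None or err < best[0]:
            best = (err, M, start)

print("best validation error %.3f with M=%d (restart %d)" % best)
```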

2. Regularization Using Simple Weight Decay
- Generalization error is not a simple function of M, due to the presence of local minima.
- Network complexity must be controlled to avoid over-fitting: choose a relatively large M and control complexity by adding a regularization term.
- The simplest regularizer is weight decay:
  $\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$
- The effective model complexity is then determined by the choice of the regularization coefficient λ.
- The regularizer is equivalent to a Gaussian prior over the weight vector w. With the model $t = y(\mathbf{x},\mathbf{w}) + \epsilon$, likelihood $p(t\mid\mathbf{x},\mathbf{w},\beta) = \mathcal{N}(t \mid y(\mathbf{x},\mathbf{w}), \beta^{-1})$, sum-of-squares error $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}^2$ and prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \alpha^{-1}\mathbf{I})$ (taking $\mathbf{m}_0 = \mathbf{0}$), the log-posterior is
  $\ln p(\mathbf{w}\mid\mathbf{t}) = -\frac{\beta}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}^2 - \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} + \text{const}$
- Maximizing the posterior is therefore equivalent to minimizing $\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$ with $\lambda = \alpha/\beta$.
- Simple weight decay has a shortcoming: it is not invariant to scaling of the inputs and outputs, as shown next.
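As a concrete illustration of the equivalence above, the following sketch minimizes $E(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$ in closed form for a linear-in-features model and checks that it coincides with the MAP solution when $\lambda = \alpha/\beta$. The data, the Gaussian basis functions, and the values of α and β are illustrative assumptions:

```python
import numpy as np

rng = np.random.RandomState(1)

# Hypothetical data and Gaussian basis functions phi(x).
N, alpha, beta = 25, 2.0, 25.0
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.randn(N) / np.sqrt(beta)
centers = np.linspace(0, 1, 9)
Phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / 0.15) ** 2)

# Minimizer of E(w) + (lambda/2) w^T w, with lambda = alpha / beta.
lam = alpha / beta
w_decay = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

# MAP solution for prior N(w | 0, alpha^-1 I) and likelihood N(t | w^T phi(x), beta^-1):
# maximizing the log-posterior gives the same normal equations, scaled by beta.
w_map = np.linalg.solve(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi,
                        beta * Phi.T @ t)

print(np.allclose(w_decay, w_map))   # True: weight decay == MAP with lambda = alpha/beta
```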

Consistent Gaussian Priors
- Simple weight decay is inconsistent with certain scaling properties of network mappings.
- To show this, consider a multi-layer perceptron with two layers of weights and linear output units, with input variables $\{x_i\}$ and output variables $\{y_k\}$.
- The activations of the hidden units in the first layer have the form
  $z_j = h\left( \sum_i w_{ji} x_i + w_{j0} \right)$
- The activations of the output units are
  $y_k = \sum_j w_{kj} z_j + w_{k0}$
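A minimal numpy sketch of this two-layer mapping (the dimensions and the choice h = tanh are illustrative assumptions):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    z = np.tanh(W1 @ x + b1)   # z_j = h(sum_i w_ji x_i + w_j0), with h = tanh
    return W2 @ z + b2         # y_k = sum_j w_kj z_j + w_k0

rng = np.random.RandomState(0)
D, M, K = 3, 5, 2              # numbers of inputs, hidden units, outputs
W1, b1 = rng.randn(M, D), rng.randn(M)
W2, b2 = rng.randn(K, M), rng.randn(K)
print(forward(rng.randn(D), W1, b1, W2, b2))
```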

Linear Transformations of Input/Output Variables
- Suppose we perform a linear transformation of the input data, $x_i \to \tilde{x}_i = a x_i + b$.
- The mapping performed by the network is unchanged if the weights and biases from the inputs to the hidden units are transformed as
  $w_{ji} \to \tilde{w}_{ji} = \frac{1}{a} w_{ji}$ and $w_{j0} \to \tilde{w}_{j0} = w_{j0} - \frac{b}{a}\sum_i w_{ji}$
- Similarly, a linear transformation of the output variables, $y_k \to \tilde{y}_k = c y_k + d$, can be achieved by transforming the second-layer weights and biases as
  $w_{kj} \to \tilde{w}_{kj} = c\, w_{kj}$ and $w_{k0} \to \tilde{w}_{k0} = c\, w_{k0} + d$
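The compensation can be checked numerically; a small self-contained sketch (the network size and the constants a, b, c, d are illustrative) verifies that the transformed network applied to the transformed inputs reproduces the transformed outputs:

```python
import numpy as np

rng = np.random.RandomState(0)
D, M, K = 3, 5, 2
W1, b1 = rng.randn(M, D), rng.randn(M)
W2, b2 = rng.randn(K, M), rng.randn(K)

def forward(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

a, b, c, d = 2.0, -0.5, 3.0, 1.0          # illustrative transformation constants
x = rng.randn(D)

# Compensate the first layer for x -> a*x + b ...
W1_t = W1 / a
b1_t = b1 - (b / a) * W1.sum(axis=1)
# ... and the second layer for y -> c*y + d.
W2_t, b2_t = c * W2, c * b2 + d

y = forward(x, W1, b1, W2, b2)
y_t = forward(a * x + b, W1_t, b1_t, W2_t, b2_t)
print(np.allclose(y_t, c * y + d))        # True: mapping unchanged up to the output transform
```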

Desirable Invariance Property of the Regularizer
- Suppose we train one network on the original data $\{x_i\}, \{y_k\}$, and another on data whose input and/or target variables are transformed by one of the linear transformations $x_i \to \tilde{x}_i = a x_i + b$, $y_k \to \tilde{y}_k = c y_k + d$.
- Consistency requires that the two networks differ only by the weight transformations given above.
- The regularizer should respect this property; otherwise it arbitrarily favours one solution over an equivalent one.
- Simple weight decay, $\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$, does not have this property, since it treats all weights and biases on an equal footing.
- We therefore look for a regularizer that is invariant under these linear transformations.

Regularizer Invariant Under the Linear Transformations
- A regularizer invariant to re-scaling of the weights and to shifts of the biases is
  $\frac{\lambda_1}{2}\sum_{w \in \mathcal{W}_1} w^2 + \frac{\lambda_2}{2}\sum_{w \in \mathcal{W}_2} w^2$
  where $\mathcal{W}_1$ denotes the weights of the first layer and $\mathcal{W}_2$ those of the second layer (biases excluded).
- This regularizer remains unchanged under the weight transformations provided the coefficients are re-scaled as $\lambda_1 \to a^{1/2}\lambda_1$ and $\lambda_2 \to c^{-1/2}\lambda_2$.
- However, it corresponds to a prior of the form
  $p(\mathbf{w} \mid \alpha_1, \alpha_2) \propto \exp\left( -\frac{\alpha_1}{2}\sum_{w \in \mathcal{W}_1} w^2 - \frac{\alpha_2}{2}\sum_{w \in \mathcal{W}_2} w^2 \right)$
  where $\alpha_1$ and $\alpha_2$ are hyper-parameters.
- Because the biases are left unconstrained, this is an improper prior which cannot be normalized, leading to difficulties in selecting regularization coefficients and in model comparison within the Bayesian framework.
- Instead, include separate priors for the biases, with their own hyper-parameters.
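A small sketch of this group-wise penalty (the weight matrices and coefficients are illustrative; biases are deliberately excluded, which is exactly why separate bias priors are needed):

```python
import numpy as np

def invariant_regularizer(W1, W2, lam1, lam2):
    # lambda1/2 * sum of squared first-layer weights
    # + lambda2/2 * sum of squared second-layer weights; biases are not penalized.
    return 0.5 * lam1 * np.sum(W1 ** 2) + 0.5 * lam2 * np.sum(W2 ** 2)

rng = np.random.RandomState(0)
W1, W2 = rng.randn(5, 3), rng.randn(2, 5)   # illustrative first/second-layer weights
print(invariant_regularizer(W1, W2, lam1=0.1, lam2=0.01))
```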

Effect of Hyperparameters on the Network Function
- Network with a single input (x ranging from −1 to +1), a single linear output (y ranging from −60 to +40), and 12 hidden units with tanh activation functions.
- The prior is governed by four hyper-parameters:
  $\alpha_1^b$: precision of the Gaussian distribution of the first-layer biases
  $\alpha_1^w$: precision for the first-layer weights
  $\alpha_2^b$: precision for the second-layer bias
  $\alpha_2^w$: precision for the second-layer weights
- Five samples are drawn from the prior and the corresponding network functions (input vs. output) are plotted, one colour per sample, for each hyperparameter setting.
- Observations: the vertical scale of the functions is controlled by $\alpha_2^w$, and the horizontal scale by $\alpha_1^w$.
[Figure: input-output plots of network functions sampled from the prior]
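A rough numpy sketch of drawing network functions from such a prior (the hyperparameter values below are illustrative assumptions, not those used to produce the figure):

```python
import numpy as np

rng = np.random.RandomState(3)
xs = np.linspace(-1, 1, 200)                 # single input over (-1, +1)
M = 12                                       # hidden units with tanh activations

# Illustrative precisions for first-layer weights/biases and second-layer weights/bias.
a1w, a1b, a2w, a2b = 1.0, 1.0, 10.0, 10.0

for sample in range(5):                      # five samples from the prior
    W1 = rng.randn(M, 1) / np.sqrt(a1w)      # precision alpha -> standard deviation 1/sqrt(alpha)
    b1 = rng.randn(M) / np.sqrt(a1b)
    W2 = rng.randn(1, M) / np.sqrt(a2w)
    b2 = rng.randn(1) / np.sqrt(a2b)
    ys = (W2 @ np.tanh(W1 @ xs[None, :] + b1[:, None]) + b2[:, None]).ravel()
    print(sample, ys.min(), ys.max())        # or plot xs vs ys to mimic the figure
```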

3. Early Stopping
- An alternative way of controlling network complexity.
- Error measured on an independent validation set typically shows an initial decrease followed by an increase as training proceeds.
- Training is stopped at the point of smallest error on the validation data.
- This effectively limits the network complexity.
[Figures: training-set error and validation-set error vs. iteration step]
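A minimal early-stopping sketch, assuming scikit-learn's MLPRegressor trained incrementally with partial_fit; the sinusoidal data and the patience value are illustrative:

```python
import numpy as np
from copy import deepcopy
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
# Illustrative data, split into training and validation sets.
X = rng.uniform(0, 1, size=(200, 1))
t = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.randn(200)
X_tr, t_tr, X_va, t_va = X[:150], t[:150], X[150:], t[150:]

net = MLPRegressor(hidden_layer_sizes=(10,), activation='tanh',
                   solver='sgd', learning_rate_init=0.05, random_state=0)

best_err, best_net, patience, bad = np.inf, None, 20, 0
for step in range(2000):
    net.partial_fit(X_tr, t_tr)                       # one incremental training pass
    err = np.mean((net.predict(X_va) - t_va) ** 2)    # validation error
    if err < best_err:
        best_err, best_net, bad = err, deepcopy(net), 0
    else:
        bad += 1
        if bad >= patience:        # stop once validation error no longer improves
            break

print("stopped at step", step, "with validation MSE", best_err)
```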

Interpreting the Effect of Early Stopping
- Consider a quadratic error function whose axes in weight space are aligned with the eigenvectors of the Hessian.
- In the absence of weight decay, the weight vector starts at the origin and proceeds towards the minimum $\mathbf{w}_{\mathrm{ML}}$.
- Stopping at a point $\tilde{\mathbf{w}}$ along this path is similar to weight decay with $\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$.
- The effective number of parameters in the network grows during the course of training.
[Figure: contours of constant error, with $\mathbf{w}_{\mathrm{ML}}$ at the minimum]
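For the quadratic case the analogy can be made more precise; the following is a brief sketch of the standard argument (notation follows PRML's treatment, reconstructed here rather than quoted):

```latex
% Gradient descent on the quadratic approximation to E, with learning rate \eta and
% starting from w = 0: after \tau steps, the component of w along the Hessian
% eigenvector u_j (eigenvalue \lambda_j) is
w_j^{(\tau)} = \left\{ 1 - (1 - \eta\lambda_j)^{\tau} \right\} w_j^{\mathrm{ML}},
% which mirrors the weight-decay (MAP) solution
w_j = \frac{\lambda_j}{\lambda_j + \lambda}\, w_j^{\mathrm{ML}} .
% In both cases, components with large curvature \lambda_j approach w^{ML} while the
% rest remain near zero, with (\eta\tau)^{-1} playing the role of the regularization
% coefficient \lambda.
```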

Invariances
- In many classification problems, predictions should be invariant under one or more transformations of the input variables.
- Example: a handwritten digit should be assigned the same classification irrespective of its position in the image (translation) and its size (scale).
- Such transformations produce significant changes in the raw data (e.g., pixel intensities, or nonlinear time-warping along the time axis in speech recognition), yet the classifier must produce the same output.

Simple Approach to Invariance
- Use a training set large enough that all transformations are represented; e.g., for translation invariance, include examples of objects in many different positions.
- This is impractical: the number of samples required grows exponentially with the number of transformations.
- We therefore seek alternative approaches that make an adaptive model exhibit the required invariances.

Approaches to Invariance
1. Augment the training set with transformed training patterns, according to the desired invariances (e.g., shift each image into different positions).
2. Add a regularization term to the error function that penalizes changes in the model output when the input is transformed; this leads to tangent propagation.
3. Build invariance into the pre-processing by extracting features that are invariant under the required transformations.
4. Build the invariance property into the structure of the neural network (convolutional networks), through local receptive fields and shared weights.

Approach 1: Transform Each Input
- Synthetically warp each handwritten digit image before presenting it to the model.
- Easy to implement, but computationally costly.
- The warps are generated from random displacements $\Delta x, \Delta y \in (0,1)$ at each pixel, smoothed by convolutions of width 0.01, 30 and 60 respectively.
[Figure: original image and warped images]
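A rough sketch of such a random warp using scipy (the array shape, displacement scale, and smoothing width are illustrative assumptions, not the exact procedure used for the figure):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def random_warp(image, sigma=3.0, rng=np.random.RandomState(0)):
    """Warp an image with a smoothed random displacement field (illustrative)."""
    h, w = image.shape
    # Uniform random displacements in (0, 1) at each pixel, then smoothed.
    dx = gaussian_filter(rng.uniform(0, 1, (h, w)), sigma)
    dy = gaussian_filter(rng.uniform(0, 1, (h, w)), sigma)
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    coords = np.array([rows + dy, cols + dx])
    return map_coordinates(image, coords, order=1, mode='nearest')

digit = np.zeros((28, 28)); digit[6:22, 12:16] = 1.0   # stand-in "digit" image
warped = random_warp(digit)
print(warped.shape)
```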

Approach 2: Tangent Propagation
- Regularization can also be used to encourage models to be invariant to transformations of the input, through the technique of tangent propagation.
- A continuous transformation applied to an input vector $\mathbf{x}_n$ sweeps out a manifold $\mathcal{M}$ in the D-dimensional input space.
- For a transformation governed by a single parameter ξ, the vector resulting from the transformation is $\mathbf{s}(\mathbf{x}, \xi)$, with $\mathbf{s}(\mathbf{x}, 0) = \mathbf{x}$.
- The tangent to the curve $\mathcal{M}$ at $\mathbf{x}_n$ is
  $\boldsymbol{\tau}_n = \left.\dfrac{\partial \mathbf{s}(\mathbf{x}_n, \xi)}{\partial \xi}\right|_{\xi=0}$
[Figure: two-dimensional input space showing the effect of a continuous transformation with a single parameter ξ]

Tangent Propagation as Regularization
- Under a transformation of the input vector, the output vector will in general change. The derivative of output k with respect to ξ is
  $\left.\dfrac{\partial y_k}{\partial \xi}\right|_{\xi=0} = \sum_{i=1}^{D} \left.\dfrac{\partial y_k}{\partial x_i}\dfrac{\partial x_i}{\partial \xi}\right|_{\xi=0} = \sum_{i=1}^{D} J_{ki}\tau_i$
  where $J_{ki}$ is the (k,i) element of the Jacobian matrix $\mathbf{J}$.
- This result is used to modify the standard error function, so as to encourage local invariance in the neighbourhood of each data point:
  $\tilde{E} = E + \lambda\Omega$
  where λ is a regularization coefficient and
  $\Omega = \dfrac{1}{2}\sum_n\sum_k \left( \left.\dfrac{\partial y_{nk}}{\partial \xi}\right|_{\xi=0} \right)^2 = \dfrac{1}{2}\sum_n\sum_k \left( \sum_{i=1}^{D} J_{nki}\tau_{ni} \right)^2$
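A minimal sketch of the per-data-point contribution to Ω. A toy stand-in network and a random tangent vector are assumed, and the Jacobian-tangent product is approximated by finite differences rather than by the exact tangent-propagation backpropagation equations:

```python
import numpy as np

rng = np.random.RandomState(0)
D, K = 5, 3
W = rng.randn(K, D)

def y(x):
    """Toy network output (stand-in for the trained model)."""
    return np.tanh(W @ x)

def tangent_prop_penalty(y, x, tau, eps=1e-6):
    """Omega contribution for one input: 0.5 * sum_k (sum_i J_ki tau_i)^2,
    with the directional derivative J tau approximated by central differences."""
    J_tau = (y(x + eps * tau) - y(x - eps * tau)) / (2 * eps)
    return 0.5 * np.sum(J_tau ** 2)

x = rng.randn(D)
tau = rng.randn(D)          # tangent vector of the transformation at x
print(tangent_prop_penalty(y, x, tau))
```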

Tangent Vector from Finite Differences
- In a practical implementation, the tangent vector $\boldsymbol{\tau}_n$ is approximated by finite differences: subtract the original vector $\mathbf{x}_n$ from the vector obtained by applying the transformation with a small value of ξ, and divide by ξ.
- Tangent distance, used to build invariance properties into nearest-neighbour classifiers, is a related technique.
[Figure: original image x; tangent vector τ corresponding to a small rotation; the image x + ετ obtained by adding a small contribution from the tangent vector; the true rotated image for comparison]
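A sketch of the finite-difference tangent vector for a small rotation, using scipy.ndimage.rotate (the stand-in image, the angle step ξ, and the value of ε are illustrative):

```python
import numpy as np
from scipy.ndimage import rotate

image = np.zeros((28, 28)); image[6:22, 12:16] = 1.0   # stand-in digit image
xi = 1.0                                               # small rotation angle, in degrees

# tau ~ (s(x, xi) - x) / xi, where s(x, xi) is the image rotated by xi degrees.
rotated = rotate(image, angle=xi, reshape=False, order=1)
tau = (rotated - image) / xi

epsilon = 5.0
approx = image + epsilon * tau    # x + eps*tau: image moved slightly along the manifold
print(approx.shape)
```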

Equivalence of Approaches 1 and 2
- Expanding the training set (Approach 1) is closely related to tangent propagation (Approach 2).
- Training with small transformations of the original input vectors under a sum-of-squares error function can be shown to be equivalent to the tangent propagation regularizer.