Lecture 8 February 4




ICS273A: Machine Learning                                                      Winter 2008

Lecture 8: February 4
Lecturer: Deva Ramanan                                  Scribe: Carlos Agell (Student)

8.1 Neural Nets

8.1.1 Logistic Regression

Recall the logistic function:

    g(x) = \frac{1}{1 + e^{-\theta^T x}}                                        (8.1)

where x ∈ {0, 1}^n. Assume that we are trying to solve problems such as Digit Classification (from a grid of yes/no or 1/0 values, as in figure 8.1).

Figure 8.1. Example of digit recognition, the kind of problem we are trying to solve.

But what is the space of representable hypotheses? Consider x ∈ {0, 1}^n and an arbitrary boolean function y, written in disjunctive normal form:

    y(x) = (x_1 \wedge x_2 \wedge x_3) \vee (x_2 \wedge x_4 \wedge \ldots) \vee \ldots        (8.2)

We can examine the AND (∧) and OR (∨) operations, and consider whether they can be represented by a linear decision boundary. Following figure 8.2, we can see that they can be. By similar inspection, we can determine that XOR (⊕) cannot be represented by a linear decision boundary. However, let us think of a hierarchy of classifiers, where the outputs of classifiers are used as features (inputs) of other classifiers. For example, one can represent an XOR as z_1 ∧ z_2, where z_1 = x_1 ∨ x_2 and z_2 = ¬(x_1 ∧ x_2). This naturally leads to Neural Nets.
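To make the hierarchy-of-classifiers idea concrete, here is a minimal Python sketch (not part of the original notes; the function name is my own). It checks, over the full truth table, that the two linearly separable intermediate classifiers z_1 = x_1 ∨ x_2 and z_2 = ¬(x_1 ∧ x_2), combined with an AND, reproduce XOR.

```python
from itertools import product

def xor_via_two_layers(x1: int, x2: int) -> int:
    """XOR built from linearly separable pieces: z1 = OR, z2 = NAND, out = AND."""
    z1 = int(x1 or x2)             # OR: linearly separable
    z2 = int(not (x1 and x2))      # NAND: linearly separable
    return int(z1 and z2)          # AND of the two intermediate outputs

# Verify against the truth table of XOR.
for x1, x2 in product([0, 1], repeat=2):
    assert xor_via_two_layers(x1, x2) == (x1 ^ x2)
print("XOR reproduced by a two-level composition of linearly separable functions")
```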

Figure 8.2. AND (∧) and OR (∨) are linearly separable while XOR (⊕) is not.

8.1.2 Definition of Neural Net

A logistic regression model can be interpreted as a single-layer Neural Net. We will write the final output as a nonlinearity g applied to a linear combination of input features:

    h(x) = g\left( \sum_i w_i x_i \right)                                       (8.3)

where g(x) is the logistic function defined in equation 8.1. Its graphical representation is shown in figure 8.3.

Figure 8.3. Single-layer Neural Network described in equation 8.3.

We can also consider multiple layers of this network, which allows us to deal with nonlinearities. Let us consider a 2-layer Neural Net (see figure 8.4), where, by convention, w_{ji} corresponds to the weight between the higher node j and a lower node i. The output will be given by:

    \text{out} = g\left( \sum_{j=1}^{3} w_j z_j \right) = g\left( \sum_{j=1}^{3} w_j\, g\left( \sum_{i=1}^{2} w_{ji} x_i \right) \right)        (8.4)

because

    z_j = g\left( \sum_{i=1}^{2} w_{ji} x_i \right)                             (8.5)
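As a minimal sketch of the forward pass in equations (8.4) and (8.5) (not from the original notes; it assumes no bias terms and uses the 3-hidden-unit, 2-input shape implied by the sums, with arbitrary example weights):

```python
import numpy as np

def g(a):
    """Logistic squashing function, eq. (8.1)."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W_hidden, w_out):
    """Forward pass of the 2-layer net in eqs. (8.4)-(8.5).

    x        : input vector, shape (2,)
    W_hidden : hidden-layer weights w_ji, shape (3, 2)
    w_out    : output weights w_j, shape (3,)
    """
    a_hidden = W_hidden @ x          # activations a_j = sum_i w_ji x_i
    z = g(a_hidden)                  # hidden outputs z_j = g(a_j), eq. (8.5)
    return g(w_out @ z)              # out = g(sum_j w_j z_j), eq. (8.4)

# Example with arbitrary weights.
rng = np.random.default_rng(0)
out = forward(np.array([1.0, 0.0]), rng.normal(size=(3, 2)), rng.normal(size=3))
print(out)
```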

Figure 8.4. Example of a 2-layer Neural Net.

8.1.3 Training a Neural Net

Let us define the following squared error cost function:

    J(w) = \frac{1}{2} \sum_i \left( y^{(i)} - \text{out}(x^{(i)}) \right)^2    (8.6)

Although it has no formal probabilistic interpretation, we treat it as an objective function that we wish to minimize. One option is to use regular gradient descent to train w:

    w := w - \alpha \frac{\partial J(w)}{\partial w}                            (8.7)

We can use the backpropagation algorithm to efficiently compute \frac{\partial J(w)}{\partial w}. We can also perform a regular Newton-Raphson update, following the standard formula:

    w := w - \left[ \frac{\partial^2 J(w)}{\partial w^2} \right]^{-1} \frac{\partial J(w)}{\partial w}        (8.8)

(A small numerical sketch of these update rules appears below, after the discussion of their costs.)

8.1.4 Advantages and Disadvantages

Advantages:
- It can represent nonlinear boundaries.
- Fast feed-forward architecture.
- Fact: any boolean function can be represented with a 2-layer neural net (but it may require an arbitrarily large number of nodes).

Disadvantages:
- Gradient descent is not guaranteed to reach a global optimum.
- We do not know the optimal number and size of layers.

Computing the gradient update requires roughly O(n) operations, where n is the number of weights in the neural net. Furthermore, it can also be implemented in a stochastic online fashion by updating w after examining each training point. Computing the Hessian takes roughly O(n^2) operations (and inverting an n x n matrix requires O(n^3) operations), so it can get expensive with a large number of weights. In practice, Newton-Raphson updates work best when we are in the quadratic bowl around a local minimum (rather than on a plateau).

There are many combinations of these two approaches:

- Line search: adjusts the step size along the gradient direction by searching for the minimum of the function along that direction.
- Conjugate gradient.

Consult your favorite optimization textbook for more information; in most cases, these algorithms can be seen as approximations to the full Newton-Raphson update.
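Here is a minimal numerical sketch of the update rules above (not from the original notes; the toy one-dimensional objective, the finite-difference gradient, and the backtracking rule are my own illustrative assumptions). It applies the gradient descent update of equation (8.7) and uses a crude backtracking line search to pick the step size α at each iteration.

```python
import numpy as np

def J(w):
    """Toy scalar objective standing in for the squared-error cost (8.6)."""
    return (w - 3.0) ** 2 + 1.0

def grad_J(w, eps=1e-6):
    """Finite-difference gradient; backpropagation would compute this exactly."""
    return (J(w + eps) - J(w - eps)) / (2 * eps)

def backtracking_step(w, grad, alpha=1.0, shrink=0.5):
    """Shrink alpha until the step actually decreases J (a crude line search)."""
    while J(w - alpha * grad) >= J(w) and alpha > 1e-12:
        alpha *= shrink
    return alpha

w = 10.0
for _ in range(50):
    grad = grad_J(w)
    alpha = backtracking_step(w, grad)
    w = w - alpha * grad          # gradient descent update, eq. (8.7)
print(w)  # converges near the minimizer w = 3
```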

8.1.5 Backpropagation

We will derive the backpropagation equations for (8.6). We consider the gradient due to a particular training example (say, for the online gradient descent setting) and drop the (i) notation:

    \frac{\partial J}{\partial w} = -\left( y - \text{out}(x) \right) \frac{\partial\, \text{out}(x)}{\partial w}        (8.9)

Figure 8.5. Part of the Neural Network, for explaining notation.

Notation: We define the activation a_j of a node j to be the linear combination of its inputs before the nonlinearity is applied:

    a_j = \sum_i w_{ji} z_i                                                     (8.10)

    z_j = g(a_j)                                                                (8.11)

We call g(a) the activation function. For our purposes, we will generally use the logistic function (eq. 8.1) as g, but we could use other squashing functions. Another common choice is g(x) = tanh(x).

Using the chain rule, and defining δ_j = ∂out/∂a_j, we have:

    \frac{\partial\, \text{out}}{\partial w_{ji}} = \frac{\partial\, \text{out}}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} = \delta_j z_i        (8.12)

Therefore we will need to:

- Calculate the z_i's with a forward step. This is equivalent to applying the neural net to the current input x, with the current weights {w}, and computing all the hidden node values from the bottom layer up.
- Calculate the δ_j's. We show below that one can do this recursively from the top layers down to the bottom layers: the backpropagation step.

Recall the chain rule for partial derivatives:

    \frac{\partial f}{\partial w} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial w} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial w}        (8.13)

This can be verified with a simple example: take f(x, y) = xy with x = 2w and y = 3w; then ∂f/∂w = y \cdot 2 + x \cdot 3 = 12w, which matches differentiating f = 6w^2 directly. Applying the chain rule to δ_j gives:

    \delta_j = \frac{\partial\, \text{out}}{\partial a_j} = \sum_k \frac{\partial\, \text{out}}{\partial a_k} \frac{\partial a_k}{\partial z_j} \frac{\partial z_j}{\partial a_j} = \sum_k \delta_k w_{kj}\, g'(a_j) = g'(a_j) \sum_k \delta_k w_{kj}        (8.14)

where g'(a) = g(a)(1 - g(a)) for the sigmoid activation function. To initialize this recursion, we can compute

    \delta_{\text{out}} = \frac{\partial\, \text{out}}{\partial a_{\text{out}}} = g'(a_{\text{out}})        (8.15)

8.1.6 Comments on Neural Nets

What if g(x) = x? The final output will be linear!

Why not make g(x) = I{x > 0} (a step function)? The output is not differentiable, so gradient descent or Newton-Raphson become difficult to apply.
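As a concrete illustration of the forward and backward steps of section 8.1.5, here is a minimal numpy sketch for the 2-layer net of figure 8.4 (not from the original notes; the shapes, the absence of bias terms, and the single-example squared-error setting are my own assumptions). It uses the sigmoid activation, consistent with the comment above that a linear g would collapse the net to a linear model.

```python
import numpy as np

def g(a):
    """Sigmoid activation, eq. (8.1); its derivative is g'(a) = g(a) * (1 - g(a))."""
    return 1.0 / (1.0 + np.exp(-a))

def backprop_single_example(x, y, W_hidden, w_out):
    """Gradients of J = 0.5 * (y - out)^2 for one example, via eqs. (8.9)-(8.15)."""
    # Forward step: compute activations and node outputs from the bottom up.
    a_hidden = W_hidden @ x                  # a_j = sum_i w_ji x_i, eq. (8.10)
    z = g(a_hidden)                          # z_j = g(a_j), eq. (8.11)
    a_out = w_out @ z
    out = g(a_out)

    # Backward step: propagate the deltas from the top layer down.
    delta_out = g(a_out) * (1 - g(a_out))                                  # eq. (8.15)
    delta_hidden = g(a_hidden) * (1 - g(a_hidden)) * (w_out * delta_out)   # eq. (8.14)

    # dJ/dw = -(y - out) * d(out)/dw, eq. (8.9), with d(out)/dw_ji = delta_j * z_i, eq. (8.12).
    grad_w_out = -(y - out) * delta_out * z
    grad_W_hidden = -(y - out) * np.outer(delta_hidden, x)
    return grad_W_hidden, grad_w_out

rng = np.random.default_rng(0)
grads = backprop_single_example(np.array([1.0, 0.0]), 1.0,
                                rng.normal(size=(3, 2)), rng.normal(size=3))
```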

We can apply other loss functions besides squared error. Recall that logistic regression was derived with the cross-entropy loss (eq. 8.16); this may be a more reasonable cost function when y^{(i)} ∈ {0, 1}. For multiclass classification, one can also use a multi-class cross-entropy loss (8.17) when the output of the neural net is a softmax function (8.18):

    J(w) = -\sum_i \log\left( \text{out}(x^{(i)})^{y^{(i)}} \left( 1 - \text{out}(x^{(i)}) \right)^{1 - y^{(i)}} \right)        (8.16)

    J(w) = -\sum_{i=1}^{m} \sum_{k=1}^{9} y_k^{(i)} \log\left( \text{out}_k(x^{(i)}) \right)        (8.17)

    \text{out}_k(x^{(i)}) = \frac{\exp(a_{\text{out},k})}{\sum_j \exp(a_{\text{out},j})}            (8.18)
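A minimal sketch of equations (8.17) and (8.18) follows (not from the original notes; the array shapes, the max-subtraction for numerical stability, and the one-hot label encoding are my own assumptions).

```python
import numpy as np

def softmax(a_out):
    """Softmax outputs out_k = exp(a_out_k) / sum_j exp(a_out_j), eq. (8.18)."""
    a_shifted = a_out - a_out.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(a_shifted)
    return e / e.sum(axis=-1, keepdims=True)

def multiclass_cross_entropy(A_out, Y):
    """Multi-class cross-entropy J(w) = -sum_i sum_k y_k^(i) log out_k(x^(i)), eq. (8.17).

    A_out : output-layer activations, shape (m, K)
    Y     : one-hot labels, shape (m, K)
    """
    Out = softmax(A_out)
    return -np.sum(Y * np.log(Out))

# Tiny example: m = 2 examples, K = 3 classes.
A_out = np.array([[2.0, 0.5, -1.0],
                  [0.1, 0.2, 3.0]])
Y = np.array([[1, 0, 0],
              [0, 0, 1]])
print(multiclass_cross_entropy(A_out, Y))
```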