Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi
Recap: linear classifiers
Logistic regression: maximizes the posterior probability of the class Y conditioned on the input vector X.
Support vector machines: maximize the margin.
  Hard margin: subject to the constraint that no training error is made.
  Soft margin: minimizes slack variables that measure how much each training example violates the classification rule.
Extended to nonlinear classifiers with the kernel trick: input vectors are mapped to a high-dimensional space; a linear classifier in that space is nonlinear in the original space.
Building a nonlinear classifier
Can we build one with a network of logistic units? A single logistic unit is a linear classifier f: X -> Y with
f(X) = 1 / (1 + exp(-(W_0 + Σ_{n=1}^N W_n X_n)))
Graph representation of a logistic unit
Input layer: an input X = (X_1, ..., X_N)
Output: logistic function of the input features
A logistic unit as a neuron
Input layer: an input X = (X_1, ..., X_N)
Activation: weighted sum of the input features, a = W_0 + Σ_{n=1}^N W_n X_n
Activation function: logistic function h applied to the weighted sum
Output: z = h(a)
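A minimal NumPy sketch of the single logistic neuron described above; the input and weight values are arbitrary illustrations, not taken from the slides.

```python
import numpy as np

def sigmoid(a):
    # Logistic activation function h(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def logistic_unit(x, w, w0):
    # Activation: weighted sum of the input features, a = w0 + sum_n w_n * x_n
    a = w0 + np.dot(w, x)
    # Output: z = h(a)
    return sigmoid(a)

# Example with two inputs (weights chosen arbitrarily for illustration)
x = np.array([1.0, 0.0])
w = np.array([2.0, -1.0])
print(logistic_unit(x, w, w0=-0.5))   # a value in (0, 1)
```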
Neural Network: multiple layers of neurons
The output of one layer is the input to the layer above.
An example: a three-layer neural network (inputs x_1, x_2; hidden units z_1, z_2; output y_1)
First (hidden) layer:
a_1 = w_11 x_1 + w_12 x_2 + w_10,   z_1 = h(a_1)
a_2 = w_21 x_1 + w_22 x_2 + w_20,   z_2 = h(a_2)
Second (output) layer, with its own weights:
a = w_11 z_1 + w_12 z_2 + w_10,   y_1 = f(a)
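A short sketch of the forward pass for this 2-2-1 network. Both h and f are taken to be the logistic function here, which is an assumption; the slides leave f generic. The weight values in the example call are made up.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, w2, b2):
    # First (hidden) layer: a_j = sum_n W1[j, n] * x_n + b1[j], z_j = h(a_j)
    z = sigmoid(W1 @ x + b1)
    # Second (output) layer: a = sum_j w2[j] * z_j + b2, y = f(a)
    return sigmoid(w2 @ z + b2)

# Example call with arbitrary weights (illustration only)
x  = np.array([1.0, 0.0])
W1 = np.array([[0.5, -0.2], [0.3, 0.8]]); b1 = np.array([0.1, -0.1])
w2 = np.array([1.0, -1.0]);               b2 = 0.0
print(forward(x, W1, b1, w2, b2))
```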
XOR Problem
It is impossible to linearly separate the two classes: (0,0) and (1,1) belong to one class, while (1,0) and (0,1) belong to the other.
[Figure: the four points (0,0), (1,0), (0,1), (1,1) plotted in the (x_1, x_2) plane]
XOR Problem
The two classes become separable by thresholding the network output y_1 at 0.5.
[Figure: a trained 2-2-1 network with its learned weights labeled on the connections]

Input (x_1, x_2)   Output y_1
(0,0)              0.057
(1,0)              0.949
(0,1)              0.946
(1,1)              0.052
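A sketch showing how a 2-2-1 logistic network can implement XOR. The weights below are hand-picked for illustration (the hidden units roughly compute OR and AND); they are not the learned weights shown in the slide's figure.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hand-picked weights (hypothetical): z1 ~ OR(x1, x2), z2 ~ AND(x1, x2);
# the output fires when z1 is on but z2 is off, i.e. XOR.
W1 = np.array([[20.0, 20.0],    # hidden unit 1
               [20.0, 20.0]])   # hidden unit 2
b1 = np.array([-10.0, -30.0])
w2 = np.array([20.0, -20.0])
b2 = -10.0

for x in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    z = sigmoid(W1 @ np.array(x, dtype=float) + b1)
    y = sigmoid(w2 @ z + b2)
    print(x, round(float(y), 3))   # close to 0, 1, 1, 0 -> thresholding at 0.5 solves XOR
```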
Application: driving a car
Input: real-time video captured by a camera
Output: steering signals for the car, ranging from sharp left through straight ahead to sharp right
Training a Neural Network
Given a training set of M examples {(x^(i), t^(i))}, i = 1, ..., M
Training the neural network amounts to minimizing the squared error between the network output and the target value:
min_w L(w) = (1/2) Σ_{i=1}^M (y^(i) − t^(i))^2
where y^(i) is the network output, which depends on the parameters w.
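A tiny sketch of this training objective; the outputs and targets below are made-up numbers, not results from the slides.

```python
import numpy as np

def squared_error(y, t):
    # L(w) = 1/2 * sum_i (y^(i) - t^(i))^2, where y are the network outputs on the training set
    return 0.5 * np.sum((np.asarray(y) - np.asarray(t)) ** 2)

print(squared_error([0.9, 0.2, 0.7], [1.0, 0.0, 1.0]))   # made-up outputs and targets
```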
Recap: gradient descent method
Gradient descent is an iterative algorithm, analogous to hill climbing but walking downhill toward the lowest point rather than up to the peak.
At each point, compute the gradient
∇_w L = (∂L/∂w_0, ∂L/∂w_1, ..., ∂L/∂w_N)
The gradient is the vector pointing in the steepest ascent (uphill) direction.
At each point, w is updated by moving a step of size λ against the gradient direction:
w ← w − λ ∇_w L(w)
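A minimal sketch of the update rule w ← w − λ∇L(w); the loss function and step size here are chosen only for illustration.

```python
import numpy as np

def grad_L(w):
    # Gradient of an illustrative quadratic loss L(w) = ||w - 3||^2 / 2
    return w - 3.0

w = np.zeros(2)       # initial point
lam = 0.1             # step size (learning rate)
for _ in range(100):
    w = w - lam * grad_L(w)   # move against the gradient
print(w)              # approaches the minimizer [3, 3]
```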
Stochastic Gradient Descent Method
Makes the learning algorithm scalable to big data.
Full loss over all examples:
L(w) = (1/2) Σ_{i=1}^M (y^(i) − t^(i))^2, with gradient ∇_w L = (∂L/∂w_0, ..., ∂L/∂w_N)
Instead, compute the gradient of the squared error for a single example:
L^(i)(w) = (1/2) (y^(i) − t^(i))^2, with gradient ∇_w L^(i) = (∂L^(i)/∂w_0, ..., ∂L^(i)/∂w_N)
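A sketch of stochastic gradient descent for a single logistic unit with squared error; the synthetic data, learning rate, and epoch count are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data (assumed): M examples, N features, binary targets
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = (X[:, 0] + X[:, 1] > 0).astype(float)

w, w0, lam = np.zeros(3), 0.0, 0.1
for epoch in range(20):
    for i in rng.permutation(len(X)):          # visit one example at a time
        y = sigmoid(w0 + w @ X[i])             # network output y^(i)
        delta = (y - t[i]) * y * (1 - y)       # dL^(i)/da = (y - t) * h'(a)
        w  -= lam * delta * X[i]               # per-example gradient step
        w0 -= lam * delta
```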
Boiling down to computing the gradient
Squared loss: L = (1/2) Σ_k (y_k − t_k)^2, with y_k = f(a_k) and a_k = Σ_j w_kj z_j
Derivative with respect to an activation in the second layer:
δ_k ≡ ∂L/∂a_k = (∂L/∂y_k)(∂y_k/∂a_k) = (y_k − t_k) f'(a_k)
Derivative with respect to a parameter in the second layer:
∂L/∂w_ki = (∂L/∂a_k)(∂a_k/∂w_ki) = δ_k z_i
Boiling down to computing the gradient
Computing the derivatives with respect to the parameters in the first layer.
Relation between the activations of the first and second layers: a_k = Σ_j w_kj h(a_j)
By the chain rule:
δ_j ≡ ∂L/∂a_j = Σ_k (∂L/∂a_k)(∂a_k/∂a_j) = h'(a_j) Σ_k w_kj δ_k
Since a_j = Σ_n w_jn x_n, the derivative with respect to a parameter in the first layer is:
∂L/∂w_jn = (∂L/∂a_j)(∂a_j/∂w_jn) = δ_j x_n
Summary: Back-propagation
For each training example (x, t):
For each output unit k: δ_k = (y_k − t_k) f'(a_k)
For each hidden unit j: δ_j = h'(a_j) Σ_k w_kj δ_k
Summary: Back-propagation (continued)
For each training example (x, t):
For each second-layer weight w_ki: ∂L/∂w_ki = δ_k z_i, update w_ki ← w_ki − λ δ_k z_i
For each first-layer weight w_jn: ∂L/∂w_jn = δ_j x_n, update w_jn ← w_jn − λ δ_j x_n
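A sketch of these back-propagation updates for a small network with one hidden layer. Both activation functions are assumed logistic, and the XOR-style data, initialization, and learning rate are illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)   # assumed toy data
T = np.array([0.0, 1.0, 1.0, 0.0])

H = 2                                               # number of hidden units
W1 = rng.normal(scale=0.5, size=(H, 2)); b1 = np.zeros(H)
w2 = rng.normal(scale=0.5, size=H);      b2 = 0.0
lam = 0.5

for epoch in range(10000):
    for x, t in zip(X, T):
        # Forward pass
        z = sigmoid(W1 @ x + b1)                    # hidden outputs z_j
        y = sigmoid(w2 @ z + b2)                    # network output y
        # Backward pass: delta_k = (y - t) f'(a_k), delta_j = h'(a_j) * sum_k w_kj delta_k
        delta_k = (y - t) * y * (1 - y)
        delta_j = z * (1 - z) * (w2 * delta_k)
        # Updates: w_ki <- w_ki - lam * delta_k * z_i, w_jn <- w_jn - lam * delta_j * x_n
        w2 -= lam * delta_k * z;          b2 -= lam * delta_k
        W1 -= lam * np.outer(delta_j, x); b1 -= lam * delta_j

for x, t in zip(X, T):
    y = sigmoid(w2 @ sigmoid(W1 @ x + b1) + b2)
    print(x, t, round(float(y), 3))   # outputs should approach the targets (small nets can occasionally get stuck)
```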
Regularized Squared Error
Add a zero-mean Gaussian prior on the weights: w_ij^(l) ~ N(0, σ^2)
The MAP estimate of w minimizes
L^(i)(w) = (1/2) (y^(i) − t^(i))^2 + (γ/2) Σ_{l,i,j} (w_ij^(l))^2
Summary: Back-propagation with regularization
For each training example (x, t):
For each weight w_ki: ∂L/∂w_ki = δ_k z_i + γ w_ki, update w_ki ← w_ki − λ (δ_k z_i + γ w_ki)
For each weight w_jn: ∂L/∂w_jn = δ_j x_n + γ w_jn, update w_jn ← w_jn − λ (δ_j x_n + γ w_jn)
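A small sketch of the regularized (weight-decay) update; γ, λ, and the numbers in the example are assumed values, not taken from the slides.

```python
import numpy as np

gamma, lam = 1e-3, 0.5          # assumed regularization coefficient and step size

def regularized_update(w, grad_unreg):
    # dL/dw = (unregularized gradient) + gamma * w  ->  w <- w - lam * (grad + gamma * w)
    return w - lam * (grad_unreg + gamma * w)

# Example: updating a second-layer weight vector given delta_k and hidden outputs z
delta_k, z = 0.2, np.array([0.7, 0.1])
w2 = np.array([0.5, -0.3])
print(regularized_update(w2, delta_k * z))
```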
Multiple outputs encoding multiple classes
MNIST: ten classes of digits.
Encode the classes as multiple outputs: the output variable for the true class of an example is set to 1; all other outputs are set to 0.
The posterior probability of an example belonging to class k:
P(Class = k | x) = y_k / Σ_{k'=1}^K y_{k'}
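A short sketch of the one-hot target encoding and the output normalization above; the class label and output values are made up for illustration.

```python
import numpy as np

K = 10                                       # e.g. ten digit classes
label = 3                                    # assumed true class of one example
target = np.zeros(K); target[label] = 1.0    # one output per class, 1 for the true class

y = np.random.default_rng(0).random(K)       # stand-in for the K network outputs
posterior = y / y.sum()                      # P(Class = k | x) = y_k / sum_k' y_k'
print(posterior.argmax(), posterior.sum())   # predicted class; probabilities sum to 1
```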
Overfitting
Tune the number of update iterations on a validation set.
How expressive are neural networks?
Boolean functions: every Boolean function can be represented by a network with a single hidden layer, but this might require an exponential number of hidden units.
Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer; any function can be approximated to arbitrary accuracy by a network with two hidden layers.
Learning feature representations with neural networks
Goal: a compact representation for high-dimensional input vectors, e.g., a large image with thousands of pixels.
High-dimensional input vectors can cause the curse of dimensionality:
they need more examples for training (see Lecture 1), and
they do not capture the intrinsic variations well: an arbitrary point in a high-dimensional space probably does not correspond to a valid real object.
A meaningful low-dimensional space is preferred!
Autoencoder
Set the target output equal to the input.
The hidden layer serves as a feature representation, since it contains sufficient information to reconstruct the input at the output layer.
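A minimal autoencoder sketch: a single hidden layer trained to reproduce its input, with the hidden activations used as features. The data, dimensions, activation choices, and training loop are all assumptions for illustration, not the example from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.random((200, 8))                # assumed data: 200 inputs of dimension 8
D, H, lam = 8, 3, 0.1                   # compress 8 dimensions into a 3-dimensional hidden code

W1 = rng.normal(scale=0.1, size=(H, D)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(D, H)); b2 = np.zeros(D)

for epoch in range(200):
    for x in X:
        z = sigmoid(W1 @ x + b1)        # hidden code (the learned feature representation)
        y = W2 @ z + b2                 # linear reconstruction of the input
        # Back-propagate the reconstruction error 1/2 * ||y - x||^2
        d_out = y - x
        d_hid = (W2.T @ d_out) * z * (1 - z)
        W2 -= lam * np.outer(d_out, z); b2 -= lam * d_out
        W1 -= lam * np.outer(d_hid, x); b1 -= lam * d_hid

features = sigmoid(X @ W1.T + b1)       # hidden-layer codes for all examples
```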
An example
Deep Learning: A Deep Feature Representation
If you build multiple hidden layers that reconstruct the input at the output layer, the stacked hidden layers provide a deep feature representation.
Summary
Neural networks: multiple layers of neurons; each neuron in an upper layer applies an activation function to a weighted sum of the outputs of neurons in the layer below.
Back-propagation training: a stochastic gradient descent method that propagates errors from the output layer down to the hidden and input layers.
Autoencoder: learns a feature representation.