Neural Networks. (Reading: Kuncheva Section 2.5)

Introduction
Inspired by biology, but as used in pattern recognition research, neural networks have little relation to real neural systems (studied in neurology and neuroscience). Kuncheva: the literature on NNs is excessive and continuously growing.
Early work: McCulloch and Pitts (1943).

Introduction, Continued
Black-box view of a neural net: it represents a function f: R^n -> R^c, where n is the dimensionality of the input space and c that of the output space.
Classification: map the feature space to values of c discriminant functions and choose the class with the maximum discriminant value.
Regression: learn continuous outputs directly (e.g., learn to fit the sin function; see the Bishop text).
Training (for classification): minimize the error on the outputs (i.e., maximize the quality of the function approximation) over a training set, most often the squared error

    E = \frac{1}{2} \sum_{j=1}^{N} \sum_{i=1}^{c} \{ g_i(z_j) - I(\omega_i, l(z_j)) \}^2    (2.77)

where I(\omega_i, l(z_j)) is 1 if the true label l(z_j) of z_j is \omega_i and 0 otherwise.
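To make Eq. (2.77) concrete, here is a minimal NumPy sketch (not from Kuncheva; the function and variable names are my own) that evaluates the squared error for a matrix of discriminant outputs and a vector of class labels:

```python
import numpy as np

def squared_error(G, labels, num_classes):
    """Eq. (2.77): G[j, i] = g_i(z_j), labels[j] = index of the true class of z_j."""
    N = G.shape[0]
    # Indicator matrix: I[j, i] = 1 iff z_j belongs to class omega_i
    I = np.zeros((N, num_classes))
    I[np.arange(N), labels] = 1.0
    return 0.5 * np.sum((G - I) ** 2)

# Example: three training objects, two classes
G = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
labels = np.array([0, 0, 1])
print(squared_error(G, labels, num_classes=2))   # the second object is penalized most
```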

Introduction, Continued
Granular representation: a set of interacting elements (neurons, or nodes) maps input values to output values through a structured series of interactions.
Properties:
Unstable: like decision trees, small changes in the training data can alter NN behavior significantly.
Prone to overfitting (also like decision trees): a validation set is often used to decide when to stop training.
Expressive: with proper design and training, an NN can approximate any function to a specified precision.

Expressive Power of NNs
Using squared error for learning classification functions: in the limit of infinite data, the discriminant functions learned by the network approach the true posterior probabilities for each class (for multi-layer perceptron (MLP) and radial basis function (RBF) networks):

    \lim_{N \to \infty} g_i(x) = P(\omega_i \mid x), \quad x \in R^n    (2.78)

Note: this result applies to any classifier that can approximate an arbitrary discriminant function with a specified precision; it is not specific to NNs.

A Single Neuron (Node)
Let u = [u_0, ..., u_q]^T \in R^{q+1} be the input vector to the node and v \in R be its output. We call w = [w_0, ..., w_q]^T \in R^{q+1} a vector of synaptic weights. The processing element implements the function

    v = \phi(\xi), \quad \xi = \sum_{i=0}^{q} w_i u_i    (2.79)

where \phi: R -> R is the activation function and \xi is the net sum. Typical choices for \phi are listed on the next slide.
Fig. 2.14 The NN processing unit.
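As a quick illustration (a sketch under my own naming, not code from the book), Eq. (2.79) is just a dot product followed by an activation:

```python
import numpy as np

def node_output(u, w, phi):
    """One processing unit, Eq. (2.79): v = phi(xi), xi = sum_i w_i * u_i.
    u and w are (q+1)-vectors; by convention u[0] = 1 serves the bias weight w[0]."""
    xi = np.dot(w, u)      # net sum
    return phi(xi)         # activation

# Example with the identity activation
u = np.array([1.0, 0.5, -0.2])
w = np.array([0.1, 0.4, 0.3])
print(node_output(u, w, phi=lambda xi: xi))   # 0.1 + 0.2 - 0.06 = 0.24
```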

Common Activation Functions
The threshold function (2.80):

    \phi(\xi) = \begin{cases} 1, & \text{if } \xi \ge 0, \\ 0, & \text{otherwise.} \end{cases}

The sigmoid function, with its convenient derivative:

    \phi(\xi) = \frac{1}{1 + \exp(-\xi)}, \qquad \phi'(\xi) = \phi(\xi)[1 - \phi(\xi)]

The identity function (used for input nodes):

    \phi(\xi) = \xi

(\xi is the net sum.)
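The three activations (and the sigmoid derivative used later in backpropagation) in a few lines of NumPy; an illustrative sketch rather than library code:

```python
import numpy as np

def threshold(xi):
    """Threshold activation, Eq. (2.80): 1 if xi >= 0, else 0."""
    return np.where(xi >= 0, 1.0, 0.0)

def sigmoid(xi):
    """Sigmoid activation: 1 / (1 + exp(-xi))."""
    return 1.0 / (1.0 + np.exp(-xi))

def sigmoid_derivative(xi):
    """phi'(xi) = phi(xi) * (1 - phi(xi))."""
    s = sigmoid(xi)
    return s * (1.0 - s)

def identity(xi):
    """Identity activation, used for input nodes."""
    return xi
```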

Bias: Offset for Activation Functions
The weight w_0 is used as a bias, and the corresponding input value u_0 is set to 1. Equation (2.79) can be rewritten as

    v = \phi[\zeta - (-w_0)] = \phi\left[ \sum_{i=1}^{q} w_i u_i - (-w_0) \right]    (2.83)

where \zeta is now the weighted sum of the inputs from 1 to q. Geometrically, the equation

    \sum_{i=1}^{q} w_i u_i - (-w_0) = 0    (2.84)

defines a hyperplane in R^q. A node with a threshold activation function (2.80) responds with value 1 to all inputs [u_1, ..., u_q]^T on one side of the hyperplane, and with value 0 to all inputs on the other side.
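A tiny demonstration of Eq. (2.84): with a threshold activation, the node simply reports which side of the hyperplane an input lies on (the weights and points below are invented for illustration):

```python
import numpy as np

w0 = -1.0                    # bias weight (its input u_0 is fixed at 1)
w = np.array([2.0, 1.0])     # weights w_1, ..., w_q

def threshold_node(u):
    """Returns 1 on one side of the hyperplane sum_i w_i*u_i - (-w_0) = 0, else 0."""
    return 1.0 if np.dot(w, u) - (-w0) >= 0 else 0.0

print(threshold_node(np.array([1.0, 0.5])))   # 1.0: one side of the hyperplane
print(threshold_node(np.array([0.0, 0.0])))   # 0.0: the other side
```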

The Perceptron (Rosenblatt, 1962)
Rosenblatt [8] defined the so-called perceptron and its famous training algorithm. The perceptron is implemented as Eq. (2.79) with a threshold activation function

    \phi(\xi) = \begin{cases} 1, & \text{if } \xi \ge 0, \\ -1, & \text{otherwise.} \end{cases}    (2.85)

Update rule (applied when z_j is misclassified):

    w \leftarrow w - v \eta z_j    (2.86)

where v is the output of the perceptron for z_j and \eta is a parameter specifying the learning rate.
Learning algorithm (see the sketch below):
Set all input weights w randomly (e.g., in [0, 1]).
Apply the weight update rule whenever a misclassification is made.
Pass over the training data Z until no errors are made. One pass = one epoch.
Besides its simplicity, perceptron training has the interesting properties listed on the next slide.
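A compact sketch of the training procedure just described, assuming the classes are coded as +1 / -1; the function name and defaults are mine, not the book's:

```python
import numpy as np

def perceptron_train(Z, labels, eta=0.1, max_epochs=100, seed=None):
    """Rosenblatt perceptron training: Eqs. (2.79), (2.85), (2.86).
    Z: (N, q) data matrix; labels: array of +1 / -1 class codes."""
    rng = np.random.default_rng(seed)
    N, q = Z.shape
    X = np.hstack([np.ones((N, 1)), Z])       # prepend the bias input u_0 = 1
    w = rng.uniform(0.0, 1.0, size=q + 1)     # random initial weights in [0, 1]

    for epoch in range(max_epochs):
        errors = 0
        for x, y in zip(X, labels):
            v = 1.0 if np.dot(w, x) >= 0 else -1.0   # threshold output, Eq. (2.85)
            if v != y:                               # update only on a misclassification
                w = w - v * eta * x                  # Eq. (2.86): w <- w - v*eta*z_j
                errors += 1
        if errors == 0:                              # an error-free epoch: stop
            break
    return w
```

For linearly separable classes this loop terminates with zero training error; otherwise it runs until max_epochs, as the next slide notes.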

Properties of Perceptron Learning
Convergence and zero error: if the two classes are linearly separable in feature space, training always converges to a function producing no error on the training set.
Infinite looping and no guarantees: if the classes are not linearly separable, training loops forever; and if stopped early, there is no guarantee that the last function learned is the best one considered during training.

Perceptron Learning
Fig. 2.16 (a) Uniformly distributed two-class data and the boundary found by the perceptron training algorithm. (b) The evolution of the class boundary.

Multi-Layer Perceptron
Nodes: perceptrons. The hidden and output layers have the same activation function (threshold or sigmoid); the input nodes use the identity activation.
Classification is feedforward: compute the activations one layer at a time, from input to output, then decide \omega_i for the maximum g_i(x).
Learning is through backpropagation: the incoming weights of the nodes are updated from the output layer back toward the input layer.
Fig. 2.17 A generic model of an MLP classifier.
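A minimal sketch of the feedforward pass just described (my own layer representation, with sigmoid hidden/output nodes and identity inputs; not code from the book):

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def mlp_forward(x, layers):
    """Feedforward pass: layers is a list of (W, b) pairs, one per hidden/output layer."""
    a = x                              # identity activation at the input layer
    for W, b in layers:
        a = sigmoid(W @ a + b)         # one layer at a time, input to output
    return a                           # discriminant values g_1(x), ..., g_c(x)

def classify(x, layers):
    """Decide omega_i for the maximum discriminant value g_i(x)."""
    return int(np.argmax(mlp_forward(x, layers)))
```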


MLP Properties
Approximating classification regions: the MLP shown on the previous slide, with threshold nodes, can approximate any classification regions in R^n to a specified precision.
Approximating any function: it was later found that an MLP with a single hidden layer of threshold nodes can approximate any function to a specified precision.
In practice: these results tell us what is possible, but not how to achieve it (network structure and training algorithms).

Fig. 2.18 Possible classification regions for an MLP with one, two, and three layers of threshold nodes. (Note that the NN configuration column only indicates the number of hidden layers, not the number of nodes needed to produce the regions in the 'An example' column.)

Backpropagation MLP Training
Output node error (2.91):

    \delta_i^o = \frac{\partial E}{\partial \xi_i^o} = [g_i(x) - I(x, \omega_i)]\, g_i(x)\, [1 - g_i(x)]

Hidden node error (2.96):

    \delta_k^h = \frac{\partial E}{\partial \xi_k^h} = \left( \sum_{i=1}^{c} \delta_i^o w_{ik}^o \right) v_k^h (1 - v_k^h)

Squared error (2.77):

    E = \frac{1}{2} \sum_{j=1}^{N} \sum_{i=1}^{c} \{ g_i(z_j) - I(\omega_i, l(z_j)) \}^2

Stopping criterion: error less than \epsilon OR maximum number of epochs T exceeded.
Output/hidden activation: sigmoid function.
Online training (vs. batch or stochastic).

Fig. 2.19 Backpropagation MLP training:
1. Choose an MLP structure: pick the number of hidden layers, the number of nodes at each layer and the activation functions.
2. Initialize the training procedure: pick small random values for all weights (including biases) of the NN. Pick the learning rate \eta > 0, the maximal number of epochs T and the error goal \epsilon > 0.
3. Set E = \infty, the epoch counter t = 1 and the object counter j = 1.
4. While (E > \epsilon and t <= T) do
   (a) Submit z_j as the next training example.
   (b) Calculate the output of every node of the NN with the current weights (forward propagation).
   (c) Calculate the error term \delta at each node at the output layer by (2.91).
   (d) Calculate recursively all error terms at the nodes of the hidden layers using (2.95) (backward propagation).
   (e) For each hidden and each output node, update the weights by

        w_{new} = w_{old} - \eta \delta u    (2.98)

       using the respective \delta and u.
   (f) Calculate E using the current weights and Eq. (2.77).
   (g) If j = N (a whole pass through Z, i.e., an epoch, is completed), then set t = t + 1 and j = 0. Else, set j = j + 1.
5. End (While)
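A runnable sketch of Fig. 2.19 for sigmoid hidden/output nodes (my own function names and data layout; for brevity, E is recomputed once per epoch rather than after every example, a small deviation from step (f)):

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def backprop_train(Z, labels, layer_sizes, eta=0.1, max_epochs=1000, eps=0.0, seed=0):
    """Online backpropagation, e.g. layer_sizes=[2, 3, 2] for a 2:3:2 MLP.
    Z: (N, n) inputs; labels: class indices 0..c-1."""
    rng = np.random.default_rng(seed)
    c = layer_sizes[-1]
    # Step 2: small random weights and biases for every hidden/output layer
    W = [rng.uniform(0, 1, (m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
    b = [rng.uniform(0, 1, m) for m in layer_sizes[1:]]
    T = np.eye(c)[labels]                     # indicator targets I(omega_i, l(z_j))

    E, t = np.inf, 1                          # step 3
    while E > eps and t <= max_epochs:        # step 4
        for x, target in zip(Z, T):
            # (b) forward propagation, keeping every layer's output
            outs = [x]
            for Wl, bl in zip(W, b):
                outs.append(sigmoid(Wl @ outs[-1] + bl))
            # (c) output error terms, Eq. (2.91)
            delta = (outs[-1] - target) * outs[-1] * (1 - outs[-1])
            # (d)+(e) backward propagation and updates, w_new = w_old - eta*delta*u (2.98)
            for l in range(len(W) - 1, -1, -1):
                W_old = W[l].copy()
                W[l] -= eta * np.outer(delta, outs[l])
                b[l] -= eta * delta                      # bias input u_0 = 1
                if l > 0:                                # hidden error terms
                    delta = (W_old.T @ delta) * outs[l] * (1 - outs[l])
        # (f) squared error over the training set with the current weights, Eq. (2.77)
        G = Z
        for Wl, bl in zip(W, b):
            G = sigmoid(G @ Wl.T + bl)
        E = 0.5 * np.sum((G - T) ** 2)
        t += 1                                           # (g) epoch completed
    return W, b
```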

Nodes 3 and 7 are bias nodes: they always output 1.
Activations: identity for the input nodes, sigmoid for the hidden and output nodes.
Fig. 2.20 A 2:3:2 MLP configuration. Bias nodes are depicted outside the layers and are not counted as separate nodes.

TABLE 2.5 (a) Random set of weights for a 2:3:2 MLP NN; (b) updated weights through backpropagation for a single training example.

(a) Neuron: incoming weights
1: w_31 = 0.4300, w_41 = 0.0500, w_51 = 0.7000, w_61 = 0.7500
2: w_32 = 0.6300, w_42 = 0.5700, w_52 = 0.9600, w_62 = 0.7400
4: w_74 = 0.5500, w_84 = 0.8200, w_94 = 0.9600
5: w_75 = 0.2600, w_85 = 0.6700, w_95 = 0.0600
6: w_76 = 0.6000, w_86 = 1.0000, w_96 = 0.3600

(b) Neuron: incoming weights
1: w_31 = 0.4191, w_41 = 0.0416, w_51 = 0.6910, w_61 = 0.7402
2: w_32 = 0.6305, w_42 = 0.5704, w_52 = 0.9604, w_62 = 0.7404
4: w_74 = 0.5500, w_84 = 0.8199, w_94 = 0.9600
5: w_75 = 0.2590, w_85 = 0.6679, w_95 = 0.0610
6: w_76 = 0.5993, w_86 = 0.9986, w_96 = 0.3607

2:3:2 MLP (see previous slide)
Batch training (weights updated at the end of each epoch). Max epochs: 1000, \eta = 0.1, error goal: 0. Initial weights: random, in [0, 1].
Final training error: 4%. Final test error: 9%.
Fig. 2.21 Squared error and the apparent error rate versus the number of epochs for the backpropagation training of a 2:3:2 MLP on the banana data.
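The batch training used here differs from the online sketch after Fig. 2.19 only in when the updates are applied. A minimal sketch of that difference, assuming hypothetical `forward` and `deltas` helpers that carry out steps (b)-(d) of Fig. 2.19 and return the per-layer outputs and error terms:

```python
import numpy as np

def batch_epoch(W, b, Z, T, eta, forward, deltas):
    """One epoch of batch training: accumulate the per-example weight changes
    and apply them once at the end of the epoch."""
    dW = [np.zeros_like(Wl) for Wl in W]
    db = [np.zeros_like(bl) for bl in b]
    for x, target in zip(Z, T):
        outs = forward(x, W, b)                # outputs of every layer (step b)
        ds = deltas(outs, target, W)           # error terms of every layer (steps c, d)
        for l in range(len(W)):
            dW[l] += np.outer(ds[l], outs[l])  # accumulate instead of updating in place
            db[l] += ds[l]
    for l in range(len(W)):                    # a single update per epoch
        W[l] -= eta * dW[l]
        b[l] -= eta * db[l]
    return W, b
```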

Final Note
Backpropagation algorithms are numerous: many are designed for faster convergence, increased stability, and so on.