IFT3395/6390. Machine Learning: from linear regression to Neural Networks.




Transcription:

IFT3395/6390 — Machine Learning: from linear regression to Neural Networks (Prof. Pascal Vincent)

Goals: introduce machine learning and neural networks (terminology); start with simple statistical models; work up to feed-forward neural networks, specifically Multilayer Perceptrons.

Historical perspective: back to 1957 (Rosenblatt's Perceptron). Artificial neural networks ("connectionism") grew out of two founding disciplines: computer science / artificial intelligence (symbolic A.I.) on one side, and neuroscience on the other. Nowadays machine learning is viewed as drawing on several founding disciplines: computer science and artificial intelligence (symbolic A.I.), statistics, optimization and control theory, information theory, statistical physics, and neuroscience (computational neuroscience and artificial neural networks).

Training Set. Notation: n is the number of examples and d the dimensionality of the input. Raw observations (e.g. images) go through preprocessing and feature extraction to produce input feature vectors x^(1), ..., x^(n), each in R^d, collected as the rows of a matrix X, together with targets t^(1), ..., t^(n). For instance, a horse image might map to the feature vector (3.5, -2, ..., 127, 0, ...) with target +1, a cat image to (-9.2, 32, ..., 24, 1, ...) with target -1, another example to (6.8, 54, ..., 17, -3, ...) with its own target, and so on. For a new test point such as (5.7, -27, ..., 64, 0, ...), the target t is unknown: is it a horse?
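As a small illustration of this notation, a training set can be stored as an n × d input matrix X and a length-n target vector t. This is a minimal sketch with made-up random data, not the dataset from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 5                      # number of examples, dimensionality of the input
X = rng.normal(size=(n, d))        # each row is one feature vector x^(i)
t = rng.choice([-1, +1], size=n)   # one target t^(i) per example (+1 = horse, -1 = not)
```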

Machine learning tasks. Supervised learning: predict a target t from an input x. If t represents a category or class, the task is classification (binary or multiclass); if t is a real value, it is regression. Unsupervised learning: there is no explicit target t; we may model the distribution of x (density estimation) or capture underlying structure in x (dimensionality reduction, clustering, etc.).

The task: predicting t from x, with input x ∈ R^d and target t, given a training set D_n of n examples. (The slide shows a small example table with a few feature columns, e.g. x_3, x_4, x_5, and the corresponding targets t.) We learn a parameterized function f_θ that minimizes a loss: the model produces an output y = f_θ(x), and a loss function L(y, t) measures how far y is from the target t.

Empirical risk minimization. We need to specify: a form for the parameterized function f_θ, and a specific loss function L(y, t). We then define the empirical risk as
R̂(f_θ, D_n) = Σ_{i=1}^n L(f_θ(x^(i)), t^(i)),
i.e. the overall loss over the training set. Learning amounts to finding the optimal parameters:
θ* = arg min_θ R̂(f_θ, D_n).

Linear Regression. We choose a linear mapping f_θ(x) = ⟨w, x⟩ + b (a dot product with a weight vector plus a bias), with parameters θ = {w, b}, w ∈ R^d, b ∈ R, and the squared error loss L(y, t) = (y - t)². A simple learning algorithm: search for the parameters that minimize the overall loss over the training set, θ* = arg min_θ R̂(f_θ, D_n). Simple linear algebra yields an analytical solution.
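A minimal numpy sketch of this linear regression case, assuming the squared-error loss above. The bias b is handled by appending a constant feature, and the analytical solution is obtained from a least-squares solve (the normal equations); function names are mine, not from the slides:

```python
import numpy as np

def fit_linear_regression(X, t):
    """Minimize sum_i (<w, x_i> + b - t_i)^2 analytically."""
    n = X.shape[0]
    Xb = np.hstack([X, np.ones((n, 1))])           # constant feature for the bias b
    theta = np.linalg.lstsq(Xb, t, rcond=None)[0]  # least-squares solution
    return theta[:-1], theta[-1]                   # w, b

def predict(w, b, X):
    return X @ w + b                               # y = <w, x> + b
```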

Linear regression: neural network view. Intuitive understanding of the dot product: each component of x weighs differently on the response,
y = f_θ(x) = w_1 x_1 + w_2 x_2 + ... + w_d x_d + b.
In neural network terminology, the diagram shows one layer of input neurons x_1, ..., x_d connected to a single linear output neuron y; the arrows represent synaptic connections, the w_j are the synaptic weights, and b is the bias.

Regularized empirical risk. It may be necessary to induce a preference for some values of the parameters over others, to avoid overfitting. We define the regularized empirical risk as
R̂_λ(f_θ, D_n) = ( Σ_{i=1}^n L(f_θ(x^(i)), t^(i)) ) + λ Ω(θ),
i.e. the empirical risk plus a regularization term: Ω penalizes certain parameter values more than others, and λ ≥ 0 controls the amount of regularization.

Ridge Regression = linear regression + L2 regularization. We penalize large weights: Ω(θ) = Ω(w, b) = ||w||² = Σ_{j=1}^d w_j². In neural network terminology this is the weight decay penalty. Again, simple linear algebra yields an analytical solution.
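A sketch of the corresponding closed-form ridge solution, under the same conventions as the previous sketch. Since the slide's penalty Ω(w, b) = ||w||² involves only the weights, the bias column is left unregularized (my reading of the intended convention):

```python
import numpy as np

def fit_ridge(X, t, lam):
    """Minimize sum_i (<w, x_i> + b - t_i)^2 + lam * ||w||^2 analytically."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])   # last column handles the bias b
    penalty = lam * np.eye(d + 1)
    penalty[-1, -1] = 0.0                  # do not apply weight decay to the bias
    theta = np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ t)
    return theta[:-1], theta[-1]           # w, b
```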

Logistic Regression. For a binary classification task with t ∈ {0, 1}, we want to estimate the conditional probability P(t = 1 | x). We choose a non-linear mapping
f_θ(x) = f_{w,b}(x) = sigmoid(⟨w, x⟩ + b), where sigmoid(a) = 1 / (1 + e^(-a)),
so the output y ∈ [0, 1] can be read as y ≈ P(t = 1 | x), together with the cross-entropy loss
L(y, t) = -( t ln(y) + (1 - t) ln(1 - y) ).
The logistic sigmoid is the inverse of the logit link function in the terminology of Generalized Linear Models (GLMs). There is no analytical solution, but the optimization problem is convex (a gradient-descent sketch is given below).

Limitations of logistic regression: it only yields a linear decision boundary, a hyperplane, which is inappropriate if the classes are not linearly separable. On the figure, the hyperplane splits the plane into a blue decision region and a red decision region, and necessarily makes mistakes on both sides.

Logistic regression: neural network view. The model is one layer of input neurons x_1, ..., x_d feeding a single sigmoid output neuron y through weights w_1, ..., w_d and bias b. The sigmoid can be viewed as a soft, differentiable alternative to the step function of the original Perceptron (Rosenblatt, 1957), and as a simplified model of the firing-rate response of biological neurons.

How to obtain non-linear decision boundaries? An old technique: map x non-linearly to a feature space, x̃ = φ(x), and find a separating hyperplane in the new space. A hyperplane in the new space corresponds to a non-linear decision surface in the initial x space.
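A minimal sketch of the logistic regression model above, trained by plain batch gradient descent on the (average) cross-entropy loss; targets are assumed to be coded as 0/1, and the learning rate and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, t, lr=0.1, n_steps=1000):
    """Batch gradient descent on the cross-entropy loss.
    X: (n, d) inputs, t: (n,) targets in {0, 1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        y = sigmoid(X @ w + b)         # y approximates P(t = 1 | x)
        grad_w = X.T @ (y - t) / n     # gradient of the average cross-entropy
        grad_b = np.mean(y - t)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```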

Example: using a fixed mapping. Starting from inputs (x_1, x_2) in R², a fixed non-linear mapping produces new features x̃ = φ(x) in a feature space R̃²; a separating hyperplane Ĥ with weight vector ŵ found in that space corresponds to a non-linear decision boundary back in the original R². (The slide shows the mapped points and the hyperplane; the exact formula of the mapping is not legible in this transcription.)

Neural Network: a Multi-Layer Perceptron (MLP) with one hidden layer of 4 neurons.

How to obtain non-linear decision boundaries? Three ways to map x to x̃ = φ(x):
1. Use an explicit fixed mapping (the previous example; a code sketch follows below).
2. Use an implicit fixed mapping (kernel methods: SVMs, kernel logistic regression, ...).
3. Learn a parameterized mapping (multilayer feed-forward neural networks, such as Multilayer Perceptrons, MLPs).

Expressive power of neural networks with one hidden layer. With two layers (no hidden layer), the network reduces to logistic regression, limited to representing a separating hyperplane. With three layers (inputs x ∈ R^d, one hidden layer, output y), much richer decision regions can be represented. Universal approximation property: any continuous function can be approximated arbitrarily well (with a growing number of hidden units).
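To make option 1 concrete, here is a sketch of an explicit fixed mapping φ that appends squared features and pairwise products; the particular choice of features is illustrative, not the one drawn on the slide:

```python
import numpy as np

def phi(X):
    """Explicit fixed mapping: append squares and pairwise products.
    A separating hyperplane in phi(x)-space corresponds to a quadratic
    decision surface in the original x-space."""
    n, d = X.shape
    feats = [X, X ** 2]
    for i in range(d):
        for j in range(i + 1, d):
            feats.append((X[:, i] * X[:, j])[:, None])
    return np.hstack(feats)

# Usage: train a linear model (e.g. the logistic regression sketch above)
# on phi(X) instead of X to obtain a non-linear boundary in the original space.
```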

Neural Network (MLP) with one hidden layer of d̃ neurons. Functional form (parametric):
x̃ = sigmoid(W_hidden x + b_hidden), y = f_θ(x) = sigmoid(⟨w, x̃⟩ + b),
with parameters θ = {W_hidden, b_hidden, w, b}, where W_hidden is a d̃ × d matrix and b_hidden ∈ R^d̃. Optimizing the parameters on the training set (training the network):
θ* = arg min_θ R̂_λ(f_θ, D_n) = arg min_θ [ ( Σ_{i=1}^n L(f_θ(x^(i)), t^(i)) ) + λ Ω(θ) ],
i.e. the empirical risk plus a regularization term (weight decay). A training sketch in code is given at the end of this group of slides.

Hyper-parameters controlling capacity. The network has a set of parameters θ, optimized on the training set using gradient descent. There are also hyper-parameters that control model capacity: the number of hidden units d̃, the regularization strength λ (weight decay), and early stopping of the optimization. These are tuned by a model selection procedure, not on the training set.

Training Neural Networks. We need to optimize the network's parameters: θ* = arg min_θ R̂_λ(f_θ, D_n). Initialize the parameters at random, then perform gradient descent. Either batch gradient descent, REPEAT: θ ← θ - η ∇_θ R̂_λ(f_θ, D_n); or stochastic gradient descent, REPEAT: pick i in 1...n, θ ← θ - η ∇_θ ( L(f_θ(x^(i)), t^(i)) + (λ/n) Ω(θ) ); or another gradient-based technique (conjugate gradient, Newton steps, natural gradient, ...). (The slide illustrates gradient and Newton descent trajectories on a cost surface J(a) over two parameters a_1, a_2.)

Hyper-parameter tuning. Divide the available dataset (x^(1), t^(1)), ..., (x^(N), t^(N)) into three sets: a training set (size n), a validation set (size n'), and a test set (size m). For each considered value of the hyper-parameters: 1) train the model, i.e. find the parameter values that optimize the regularized empirical risk on the training set; 2) evaluate performance on the validation set using the criterion we truly care about. Keep the hyper-parameter values with the best performance on the validation set (possibly retraining on the union of training and validation sets). Finally, evaluate generalization performance on a separate test set never used during training or validation (an unbiased, out-of-sample evaluation). If there are too few examples, use k-fold cross-validation or leave-one-out (the "jackknife").
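A minimal end-to-end sketch of this one-hidden-layer MLP for binary classification, trained by stochastic gradient descent with hand-written backpropagation. The cross-entropy loss and 0/1 targets are carried over from the logistic regression slides; hidden size, learning rate, epoch count and the weight-decay handling are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, t, n_hidden=4, lr=0.1, n_epochs=50, lam=0.0, seed=0):
    """One-hidden-layer MLP, targets in {0, 1}, trained by stochastic
    gradient descent on the cross-entropy loss with L2 weight decay lam."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_hidden, d))  # hidden-layer weights W_hidden
    c = np.zeros(n_hidden)                         # hidden-layer biases  b_hidden
    w = rng.normal(scale=0.1, size=n_hidden)       # output weights
    b = 0.0                                        # output bias

    for _ in range(n_epochs):
        for i in rng.permutation(n):
            x, ti = X[i], t[i]
            h = sigmoid(W @ x + c)          # hidden activations x~
            y = sigmoid(w @ h + b)          # output, approximates P(t = 1 | x)
            # Backpropagation of the cross-entropy loss.
            dy = y - ti                     # gradient at the output pre-activation
            dh = dy * w * h * (1.0 - h)     # gradient at the hidden pre-activations
            # Parameter updates (weight decay's factor 2 is absorbed into lam).
            w -= lr * (dy * h + (lam / n) * w)
            b -= lr * dy
            W -= lr * (np.outer(dh, x) + (lam / n) * W)
            c -= lr * dh
    return W, c, w, b
```

Hyper-parameters such as n_hidden, lam and n_epochs (names chosen here for illustration) would be tuned on the validation set, as described in the hyper-parameter tuning slide above.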

Hyper-parameter tuning (illustration). The slide plots the error on the training set and the error on the validation set as a function of the value of a hyper-parameter: the hyper-parameter value yielding the smallest error on the validation set is 5, whereas it is 1 on the training set.

Summary. Feed-forward neural networks (such as Multilayer Perceptrons, MLPs) are parameterized non-linear functions, or generalized non-linear models, trained using gradient descent techniques. Architectural details and capacity-control hyper-parameters must be tuned with a proper model selection procedure. Data must be preprocessed into a suitable format: standardization, (x - μ) / σ, for continuous variables, and one-hot encoding for categorical variables, e.g. [0, 0, 1, 0] (a small code sketch of both steps appears after the next slide). Note: there are many other types of neural nets.

Neural Networks: why they matter for data mining. Two themes: the advantages of neural networks for data mining, and the motivation for research on learning deep networks.

Advantages of Neural Networks. The power of learnt non-linearity: they automatically extract the necessary features. Flexibility: they can be used for binary classification, multiclass classification, regression, conditional density modeling (a neural net trained to output the parameters of the distribution of t as a function of x), dimensionality reduction, and more; a very adaptable framework (some would say too adaptable).
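A sketch of the two preprocessing steps mentioned in the summary; splitting the columns into continuous and categorical ones, and applying these functions to them, is left to the caller:

```python
import numpy as np

def standardize(X_cont):
    """(x - mu) / sigma per continuous column, statistics taken from the training set."""
    mu = X_cont.mean(axis=0)
    sigma = X_cont.std(axis=0)
    return (X_cont - mu) / sigma, mu, sigma

def one_hot(labels, n_categories):
    """Integer categorical variable -> one-hot vectors, e.g. 2 -> [0, 0, 1, 0]."""
    out = np.zeros((len(labels), n_categories))
    out[np.arange(len(labels)), labels] = 1.0
    return out
```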

Example: using a neural net for dimensionality reduction. The classical auto-encoder framework learns a lower-dimensional representation: a network with D inputs, an auto-associative hidden layer of M < D units that holds a compressed version z of the input, and D outputs trained to reproduce the input (a minimal sketch follows below).

Advantages of Neural Networks (continued). Neural networks scale well: data mining often deals with huge databases, and stochastic gradient descent can handle them, whereas many more recent machine-learning techniques have serious scaling issues (e.g. SVMs and other kernel methods).

Why then have they gone out of fashion in machine learning? They are tricky to train (many hyper-parameters to tune): training your neural net is NOT YET IDIOT PROOF. The optimization is non-convex, so there are local minima: the solution depends on where you start. Convex problems are mathematically nicer and easier, but convexity may be too restrictive; hard real-world problems may require non-convex models. The slide also shows an example of a deep architecture made of multiple layers, solving complex problems.
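A minimal sketch of the auto-encoder just described: D inputs, a hidden layer of M < D units, D outputs trained to reconstruct the input. Tied weights, a linear decoder and the squared reconstruction error are simplifying assumptions of mine; the slide only specifies the D-M-D architecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, M, lr=0.01, n_epochs=20, seed=0):
    """Auto-encoder with one hidden layer of M units and tied weights W.
    Encoder: z = sigmoid(W x + c); decoder: xhat = W.T z + b.
    Trained by stochastic gradient descent on the squared reconstruction error."""
    n, D = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(M, D))
    c = np.zeros(M)
    b = np.zeros(D)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            x = X[i]
            z = sigmoid(W @ x + c)        # low-dimensional code (compressed version of x)
            xhat = W.T @ z + b            # reconstruction of the input
            r = xhat - x                  # reconstruction error
            dz = (W @ r) * z * (1.0 - z)  # backprop through the encoder
            W -= lr * (np.outer(z, r) + np.outer(dz, x))  # tied weights: both paths contribute
            b -= lr * r
            c -= lr * dz
    return W, c, b
```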

The promises of learning deep architectures. Representational power of functional composition: shallow architectures (neural nets with one hidden layer, SVMs, boosting, ...) can be universal approximators, but they may require exponentially more nodes than corresponding deep architectures (see Bengio, 2007). It is therefore statistically more efficient to learn small deep architectures (fewer parameters) than fat shallow architectures. This leads to the notion of level of representation.