# Deep learning via Hessian-free optimization

Save this PDF as:

Size: px
Start display at page:

## Transcription

6 poor differentiation between units Sharing information across iterations A simple enhancement to the HF algorithm which we found improves its performance by an order of magnitude is to use the search direction p n 1 found by CG in the previous HF iteration as the starting point for CG in the current one. There are intuitively appealing reasons why p n 1 might make a good initialization. Indeed, the values of B and f(θ) for a given HF iteration should be similar in some sense to their values at the previous iteration, and thus the optimization problem solved by CG is also similar to the previous one, making the previous solution a potentially good starting point. In typical implementations of HF, the CG runs are initialized with the zero vector, and doing this has the nice property that the initial value of φ will be non-positive (0 in fact). This in turn ensures that the search direction produced by CG will always provide a reduction in q θ, even if CG is terminated after the first iteration. In general, if CG is initialized with a non-zero vector, the initial value of φ can be greater than zero, and we have indeed found this to be the case when using p n 1. However, judging an initialization merely by its φ value may be misleading, and we found that runs initialized from p n 1 rather than 0 consistently yielded better reductions in q θ, even when φ(p n 1 ) 0. A possible explanation for this finding is that p n 1 is wrong within the new quadratic objective mostly along the volatile high-curvature directions, which are quickly and easily discovered and corrected by CG, thus leaving the harder-tofind low-curvature directions, which tend to be more stable over the parameter space, and thus more likely to remain descent directions between iterations of HF CG iteration backtracking While each successive iteration of CG improves the value of p with respect to the 2 nd -order model q θ (p), these improvements are not necessarily reflected in the value of f(θ + p). In particular, if q θ (p) is untrustworthy due to the damping parameter λ being set too low or the current minibatch being too small or unrepresentative, then running CG past a certain number of iterations can actually be harmful. In fact, the dependency of the directions generated by CG on the quality of B should almost certainly increase CG iterates, as the basis from which CG generates directions (called the Krylov basis) expands to include matrix-vector products with increasingly large powers of B. By storing the current solution for p at iteration γ j of CG for each j (where γ > 1 is a constant; 1.3 in our experiments), we can backtrack along them after CG has terminated, reducing j as long as the γ j 1 th iterate of CG yields a lower value of f(x+p) than the γ j th. However, we have observed experimentally that, for best performance, the value of p used to initialize the next CG run (as described in the previous sub-section) should not be backtracked in this manner. The likely explanation for this effect is that directions which are followed too strongly in p due to bad curvature information from an unrepresentative mini-batch, or λ being too small, will be corrected by the next run of CG, since it will use a different mini-batch and possibly a larger λ to compute B Preconditioning CG Preconditioning is a technique used to accelerate CG. It does this by performing a linear change of variables ˆx = Cx for some matrix C, and then optimizing the transformed quadratic objective given by ˆφ(ˆx) = 1 2 ˆx C AC 1ˆx (C 1 b) ˆx. ˆφ may have more forgiving curvature properties than the original φ, depending on the value of the matrix C. To use preconditioned CG one specifies M C C, with the understanding that it must be easy to solve M y = x for arbitrary x. Preconditioning is somewhat of an application specific art, and we experimented with many possible choices. One that we found to be particularly effective was the diagonal matrix: [ ( D ) M = diag f i (θ) f i (θ) + λi i=1 where f i is the value of the objective associated with training-case i, denotes the element-wise product and the exponent α is chosen to be less than 1 in order to suppress extreme values (we used 0.75 in our experiments). The inner sum has the nice interpretation of being the diagonal of the empirical Fisher information matrix, which is similar in some ways to the G matrix. Unfortunately, it is impractical to use the diagonal of G itself, since the obvious algorithm has a cost similar to K backprop operations, where K is the size of the output layer, which is large for auto-encoders (although typically not for classification nets). 5. Random initialization Our HF approach, like almost any deterministic optimization scheme, is not completely immune to bad initializations, although it does tend to be far more robust to these than 1 st -order methods. For example, it cannot break symmetry between two units in the same layer that are initialized with identical weights. However, as discussed in section 2.2 it has a far better chance than 1 st -order methods of doing so if the weights are nearly identical. In our initial investigations we tried a variety of random initialization schemes. The better ones, which were more careful about avoiding issues like saturation, seemed to allow the runs of CG to terminate after fewer iterations, mostly likely because these initializations resulted in more favorable local curvature properties. The best random initialization scheme we found was one of our own design, ] α

7 sparse initialization. In this scheme we hard limit the number of non-zero incoming connection weights to each unit (we used 15 in our experiments) and set the biases to 0 (or 0.5 for tanh units). Doing this allows the units to be both highly differentiated as well as unsaturated, avoiding the problem in dense initializations where the connection weights must all be scaled very small in order to prevent saturation, leading to poor differentiation between units. 6. Related work on 2 nd -order optimization LeCun et al. (1998) have proposed several diagonal approximations of the H and G matrices for multi-layer neural nets. While these are easy to invert, update and store, the diagonal approximation may be overly simplistic since it neglects the very important interaction between parameters. For example, in the nearly identical units scenario considered in section 2.2, a diagonal approximation would not be able to recognize the differentiating direction as being one of low curvature. Amari et al. (2000) have proposed a 2 nd -order learning algorithm for neural nets based on an empirical approximation of the Fisher information matrix (which can be defined for a neural net by casting it as a probabilistic model). Since Schraudolph s approach for computing Gd may be generalized to compute Fd, we were thus able to evaluate the possibility of using F as an alternative to G within our HF approach. The resulting algorithm wasn t able to make significant progress on the deep auto-encoder problems we considered, possibly indicating that F doesn t contain sufficient curvature information to overcome the problems associated with deep-learning. A more theoretical observation which supports the use of G over F in neural nets is that the ranks of F and G are D and DL respectively, where D is the size of the training set and L is the size of the output layer. Another observation is that G will converge to the Hessian as the error of the net approaches zero, a property not shared by F. Building on the work of Pearlmutter (1994), Schraudolph (2002) proposed 2 nd -order method called Stochastic Meta-descent (SMD) which uses an on-line 2 nd -order approximation to f and optimizes it via updates to p which are also computed on-line. This method differs from HF in several important ways, but most critically in the way it optimizes p. The update scheme used by SMD is a form of preconditioned gradient-descent given by: p n+1 = p n + M 1 r n r n f(θ n ) + Bp n where M is a diagonal pre-conditioning matrix chosen to approximate B. Using the previously discussed method for computing Bd products, SMD is able to compute these updates efficiently. However, using gradient-descent instead of CG to optimize q θ (p), even with a good diagonal preconditioner, is an approach likely to fail because q θ (p) will exhibit the same pathological curvature as the objective function f that it approximates. And pathological curvature was Table 1. Experimental parameters NAME SIZE K ENCODER DIMS CURVES MNIST FACES the primary reason for abandoning gradient-descent as an optimization method in the first place. Moreover, it can be shown that in the batch case, the updates computed by SMD lie in the same Krylov subspace as those computed by an equal number of CG iterations, and that CG finds the optimal solution of q θ (p) within this subspace. These 2 nd -order methods, plus others we haven t discussed, have only been validated on shallow nets and/or toy problems. And none of them have been shown to be fundamentally more effective than 1 st -order optimization methods on deep learning problems. 7. Experiments We present the results from a series of experiments designed to test the effectiveness of our HF approach on the deep auto-encoder problems considered by Hinton & Salakhutdinov (2006) (abbr. H&S). We adopt precisely the same model architectures, datasets, loss functions and training/test partitions that they did, so as to ensure that our results can be directly compared with theirs. Each dataset consists of a collection of small grey-scale images of various objects such as hand-written digits and faces. Table 1 summarizes the datasets and associated experimental parameters, where size gives the size of the training set, K gives the size of minibatches used, and encoder dims gives the encoder network architecture. In each case, the decoder architecture is the mirror image of the encoder, yielding a symmetric autoencoder. This symmetry is required in order to be able to apply H&S s pre-training approach. Note that CURVES is the synthetic curves dataset from H&S s paper and FACES is the augmented Olivetti face dataset. We implemented our approach using the GPU-computing MATLAB package Jacket. We also re-implemented, using Jacket, the precise approach considered by H&S, using their provided code as a basis, and then re-ran their experiments using many more training epochs than they did, and for far longer than we ran our HF approach on the same models. With these extended runs we were able to obtain slightly better results than they reported for both the CURVES and MNIST experiments. Unfortunately, we were not able to reproduce their results for the FACES dataset, as each net we pre-trained had very high generalization error, even before fine-tuning. We ran each method until it either seemed to converge, or started to overfit (which happened for MNIST and FACES, but not CURVES). We found that since our method was much better at fitting the training data, it was thus more prone to

8 Table 2. Results (training and test errors) PT + NCG RAND+HF PT + HF CURVES 0.74, , , 0.21 MNIST 2.31, , , 2.46 MNIST* 2.07, , , 2.28 FACES -, , 139 -,- FACES* -,- 60.6, 122 -,- overfitting, and so we ran additional experiments where we introduced an l 2 prior on the connection weights. Table 2 summarizes our results, where PT+NCG is the pre-training + non-linear CG fine-tuning approach of H&S, RAND+HF is our Hessian-free method initialized randomly, and PT+HF is our approach initialized with pretrained parameters. The numbers given in each entry of the table are the average sum of squared reconstruction errors on the training-set and the test-set. The * s indicate that an l 2 prior was used, with strength 10 4 on MNIST and 10 2 on FACES. Error numbers for FACES which involve pre-training are missing due to our failure to reproduce the results of H&S on that dataset (instead we just give the testerror number they reported). Figure 2 demonstrates the performance of our implementations on the CURVES dataset. Pre-training time is included where applicable. This plot is not meant to be a definitive performance analysis, but merely a demonstration that our method is indeed quite efficient. 8. Discussion of results and implications The most important implication of our results is that learning in deep models can be achieved effectively and efficiently by a completely general optimizer without any need for pre-training. This opens the door to examining a diverse range of deep or otherwise difficult-to-optimize architectures for which there are no effective pre-training methods, such as asymmetric auto-encoders, or recurrent neural nets. A clear theme which emerges from our results is that the HF optimized nets have much lower training error, implying that our HF approach does well because it is more effective than pre-training + fine-tuning approaches at solving the under-fitting problem. Because both the MNIST and FACES experiments used early-stopping, the training error numbers reported in Table 2 are actually much higher than can otherwise be achieved. When we initialized randomly and didn t use early-stopping we obtained a training error of 1.40 on MNIST and 12.9 on FACES. A pre-trained initialization benefits our approach in terms of optimization speed and generalization error. However, there doesn t seem to be any significant benefit in regards to under-fitting, as our HF approach seems to solve this problem almost completely by itself. This is in stark contrast to the situation with 1 st -order optimization algorithms, where the main hurdle overcome by pre-training is that of under-fitting, at least in the setting of auto-encoders. error time in seconds (log scale) PT+NCG RAND+HF PT+HF Figure 2. Error (train and test) vs. computation time on CURVES Based on these results we can hypothesize that the way pretraining helps 1 st -order optimization algorithms overcome the under-fitting problem is by placing the parameters in a region less affected by issues of pathological curvature in the objective, such as those discussed in section 2.2. This would also explain why our HF approach optimizes faster from pre-trained parameters, as more favorable local curvature conditions allow the CG runs to make more rapid progress when optimizing q θ (p). Finally, while these early results are very encouraging, clearly further research is warranted in order to address the many interesting questions that arise from them, such as how much more powerful are deep nets than shallow ones, and is this power fully exploited by current pretraining/fine-tuning schemes? Acknowledgments The author would like to thank Geoffrey Hinton, Richard Zemel, Ilya Sutskever and Hugo Larochelle for their helpful suggestions. This work was supported by NSERC and the University of Toronto. REFERENCES Amari, S., Park, H., and Fukumizu, K. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In NIPS, Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, July LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and K., Muller (eds.), Neural Networks: Tricks of the trade. Springer, Mizutani, E. and Dreyfus, S. E. Second-order stagewise backpropagation for hessian-matrix analyses and investigation of negative curvature. Neural Networks, 21(2-3): , Nocedal, J. and Wright, S. J. Numerical Optimization. Springer, Pearlmutter, B. A. Fast exact multiplication by the hessian. Neural Computation, Schraudolph, N. N. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 2002.

### Supporting Online Material for

www.sciencemag.org/cgi/content/full/313/5786/504/dc1 Supporting Online Material for Reducing the Dimensionality of Data with Neural Networks G. E. Hinton* and R. R. Salakhutdinov *To whom correspondence

### (Quasi-)Newton methods

(Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

### CSC321 Introduction to Neural Networks and Machine Learning. Lecture 21 Using Boltzmann machines to initialize backpropagation.

CSC321 Introduction to Neural Networks and Machine Learning Lecture 21 Using Boltzmann machines to initialize backpropagation Geoffrey Hinton Some problems with backpropagation The amount of information

### Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Deep Learning Barnabás Póczos & Aarti Singh Credits Many of the pictures, results, and other materials are taken from: Ruslan Salakhutdinov Joshua Bengio Geoffrey

### Lecture 6. Artificial Neural Networks

Lecture 6 Artificial Neural Networks 1 1 Artificial Neural Networks In this note we provide an overview of the key concepts that have led to the emergence of Artificial Neural Networks as a major paradigm

### Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Pseudo-Label : The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks

Pseudo-Label : The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks Dong-Hyun Lee sayit78@gmail.com Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul,

### How I won the Chess Ratings: Elo vs the rest of the world Competition

How I won the Chess Ratings: Elo vs the rest of the world Competition Yannis Sismanis November 2010 Abstract This article discusses in detail the rating system that won the kaggle competition Chess Ratings:

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Efficient online learning of a non-negative sparse autoencoder

and Machine Learning. Bruges (Belgium), 28-30 April 2010, d-side publi., ISBN 2-93030-10-2. Efficient online learning of a non-negative sparse autoencoder Andre Lemme, R. Felix Reinhart and Jochen J. Steil

### 6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

### General Framework for an Iterative Solution of Ax b. Jacobi s Method

2.6 Iterative Solutions of Linear Systems 143 2.6 Iterative Solutions of Linear Systems Consistent linear systems in real life are solved in one of two ways: by direct calculation (using a matrix factorization,

### Applications to Data Smoothing and Image Processing I

Applications to Data Smoothing and Image Processing I MA 348 Kurt Bryan Signals and Images Let t denote time and consider a signal a(t) on some time interval, say t. We ll assume that the signal a(t) is

### Learning. Artificial Intelligence. Learning. Types of Learning. Inductive Learning Method. Inductive Learning. Learning.

Learning Learning is essential for unknown environments, i.e., when designer lacks omniscience Artificial Intelligence Learning Chapter 8 Learning is useful as a system construction method, i.e., expose

### INTRODUCTION TO NEURAL NETWORKS

INTRODUCTION TO NEURAL NETWORKS Pictures are taken from http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html http://research.microsoft.com/~cmbishop/prml/index.htm By Nobel Khandaker Neural Networks An

### Multilayer Perceptrons

Made wi t h OpenOf f i ce. or g 1 Multilayer Perceptrons 2 nd Order Learning Algorithms Made wi t h OpenOf f i ce. or g 2 Why 2 nd Order? Gradient descent Back-propagation used to obtain first derivatives

### Machine Learning: Multi Layer Perceptrons

Machine Learning: Multi Layer Perceptrons Prof. Dr. Martin Riedmiller Albert-Ludwigs-University Freiburg AG Maschinelles Lernen Machine Learning: Multi Layer Perceptrons p.1/61 Outline multi layer perceptrons

### Cheng Soon Ong & Christfried Webers. Canberra February June 2016

c Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 31 c Part I

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### An Introduction to Neural Networks

An Introduction to Vincent Cheung Kevin Cannons Signal & Data Compression Laboratory Electrical & Computer Engineering University of Manitoba Winnipeg, Manitoba, Canada Advisor: Dr. W. Kinsner May 27,

### Application of Neural Network in User Authentication for Smart Home System

Application of Neural Network in User Authentication for Smart Home System A. Joseph, D.B.L. Bong, D.A.A. Mat Abstract Security has been an important issue and concern in the smart home systems. Smart

### Classifying Chess Positions

Classifying Chess Positions Christopher De Sa December 14, 2012 Chess was one of the first problems studied by the AI community. While currently, chessplaying programs perform very well using primarily

### PMR5406 Redes Neurais e Lógica Fuzzy Aula 3 Multilayer Percetrons

PMR5406 Redes Neurais e Aula 3 Multilayer Percetrons Baseado em: Neural Networks, Simon Haykin, Prentice-Hall, 2 nd edition Slides do curso por Elena Marchiori, Vrie Unviersity Multilayer Perceptrons Architecture

### Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

### GLM, insurance pricing & big data: paying attention to convergence issues.

GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK - michael.noack@addactis.com Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.

### Recurrent Neural Networks

Recurrent Neural Networks Neural Computation : Lecture 12 John A. Bullinaria, 2015 1. Recurrent Neural Network Architectures 2. State Space Models and Dynamical Systems 3. Backpropagation Through Time

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### IBM SPSS Neural Networks 22

IBM SPSS Neural Networks 22 Note Before using this information and the product it supports, read the information in Notices on page 21. Product Information This edition applies to version 22, release 0,

### On Optimization Methods for Deep Learning

Quoc V. Le quocle@cs.stanford.edu Jiquan Ngiam jngiam@cs.stanford.edu Adam Coates acoates@cs.stanford.edu Abhik Lahiri alahiri@cs.stanford.edu Bobby Prochnow prochnow@cs.stanford.edu Andrew Y. Ng ang@cs.stanford.edu

### Bootstrapping Big Data

Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

### Why Does Unsupervised Pre-training Help Deep Learning?

Journal of Machine Learning Research 11 (2010) 625-660 Submitted 8/09; Published 2/10 Why Does Unsupervised Pre-training Help Deep Learning? Dumitru Erhan Yoshua Bengio Aaron Courville Pierre-Antoine Manzagol

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### The QOOL Algorithm for fast Online Optimization of Multiple Degree of Freedom Robot Locomotion

The QOOL Algorithm for fast Online Optimization of Multiple Degree of Freedom Robot Locomotion Daniel Marbach January 31th, 2005 Swiss Federal Institute of Technology at Lausanne Daniel.Marbach@epfl.ch

### Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

### P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition

P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition K. Osypov* (WesternGeco), D. Nichols (WesternGeco), M. Woodward (WesternGeco) & C.E. Yarman (WesternGeco) SUMMARY Tomographic

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Notes on Factoring. MA 206 Kurt Bryan

The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Introduction to time series analysis

Introduction to time series analysis Margherita Gerolimetto November 3, 2010 1 What is a time series? A time series is a collection of observations ordered following a parameter that for us is time. Examples

### IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181 The Global Kernel k-means Algorithm for Clustering in Feature Space Grigorios F. Tzortzis and Aristidis C. Likas, Senior Member, IEEE

### Manifold Learning with Variational Auto-encoder for Medical Image Analysis

Manifold Learning with Variational Auto-encoder for Medical Image Analysis Eunbyung Park Department of Computer Science University of North Carolina at Chapel Hill eunbyung@cs.unc.edu Abstract Manifold

### Visualizing Higher-Layer Features of a Deep Network

Visualizing Higher-Layer Features of a Deep Network Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent Dept. IRO, Université de Montréal P.O. Box 6128, Downtown Branch, Montreal, H3C 3J7,

### large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

### The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method Robert M. Freund February, 004 004 Massachusetts Institute of Technology. 1 1 The Algorithm The problem

### Gradient Methods. Rafael E. Banchs

Gradient Methods Rafael E. Banchs INTRODUCTION This report discuss one class of the local search algorithms to be used in the inverse modeling of the time harmonic field electric logging problem, the Gradient

### Learning Recurrent Neural Networks with Hessian-Free Optimization

James Martens Ilya Sutskever University of Toronto, Canada JMARTENS@CS.TORONTO.EDU ILYA@CS.UTORONTO.CA Abstract In this work we resolve the long-outstanding problem of how to effectively train recurrent

### Neural Networks for Machine Learning. Lecture 13a The ups and downs of backpropagation

Neural Networks for Machine Learning Lecture 13a The ups and downs of backpropagation Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed A brief history of backpropagation

### Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

### Computational Optical Imaging - Optique Numerique. -- Deconvolution --

Computational Optical Imaging - Optique Numerique -- Deconvolution -- Winter 2014 Ivo Ihrke Deconvolution Ivo Ihrke Outline Deconvolution Theory example 1D deconvolution Fourier method Algebraic method

### Solution of Linear Systems

Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start

### Automatic Inventory Control: A Neural Network Approach. Nicholas Hall

Automatic Inventory Control: A Neural Network Approach Nicholas Hall ECE 539 12/18/2003 TABLE OF CONTENTS INTRODUCTION...3 CHALLENGES...4 APPROACH...6 EXAMPLES...11 EXPERIMENTS... 13 RESULTS... 15 CONCLUSION...

### The equivalence of logistic regression and maximum entropy models

The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.win-vector.com/blog/20/09/the-simplerderivation-of-logistic-regression/

### Nonlinear Least Squares Data Fitting

Appendix D Nonlinear Least Squares Data Fitting D1 Introduction A nonlinear least squares problem is an unconstrained minimization problem of the form minimize f(x) = f i (x) 2, x where the objective function

### Lecture 8 February 4

ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt

### Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

Simplified Machine Learning for CUDA Umar Arshad @arshad_umar Arrayfire @arrayfire ArrayFire CUDA and OpenCL experts since 2007 Headquartered in Atlanta, GA In search for the best and the brightest Expert

### IFT3395/6390. Machine Learning from linear regression to Neural Networks. Machine Learning. Training Set. t (3.5, -2,..., 127, 0,...

IFT3395/6390 Historical perspective: back to 1957 (Prof. Pascal Vincent) (Rosenblatt, Perceptron ) Machine Learning from linear regression to Neural Networks Computer Science Artificial Intelligence Symbolic

### CS3220 Lecture Notes: QR factorization and orthogonal transformations

CS3220 Lecture Notes: QR factorization and orthogonal transformations Steve Marschner Cornell University 11 March 2009 In this lecture I ll talk about orthogonal matrices and their properties, discuss

### Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks

This version: December 12, 2013 Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks Lawrence Takeuchi * Yu-Ying (Albert) Lee ltakeuch@stanford.edu yy.albert.lee@gmail.com Abstract We

### Manifold Learning Examples PCA, LLE and ISOMAP

Manifold Learning Examples PCA, LLE and ISOMAP Dan Ventura October 14, 28 Abstract We try to give a helpful concrete example that demonstrates how to use PCA, LLE and Isomap, attempts to provide some intuition

### CSC 321 H1S Study Guide (Last update: April 3, 2016) Winter 2016

1. Suppose our training set and test set are the same. Why would this be a problem? 2. Why is it necessary to have both a test set and a validation set? 3. Images are generally represented as n m 3 arrays,

### Implementation of Neural Networks with Theano. http://deeplearning.net/tutorial/

Implementation of Neural Networks with Theano http://deeplearning.net/tutorial/ Feed Forward Neural Network (MLP) Hidden Layer Object Hidden Layer Object Hidden Layer Object Logistic Regression Object

### Linear Algebra Review. Vectors

Linear Algebra Review By Tim K. Marks UCSD Borrows heavily from: Jana Kosecka kosecka@cs.gmu.edu http://cs.gmu.edu/~kosecka/cs682.html Virginia de Sa Cogsci 8F Linear Algebra review UCSD Vectors The length

### Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

### (Refer Slide Time: 00:00:56 min)

Numerical Methods and Computation Prof. S.R.K. Iyengar Department of Mathematics Indian Institute of Technology, Delhi Lecture No # 3 Solution of Nonlinear Algebraic Equations (Continued) (Refer Slide

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANI-KALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of

### Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

### THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering

### Introduction to Online Learning Theory

Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

### The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training

The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training Anonymous Author Affiliation Address Anonymous Coauthor Affiliation Address Anonymous Coauthor Affiliation Address

### Improving Generalization

Improving Generalization Introduction to Neural Networks : Lecture 10 John A. Bullinaria, 2004 1. Improving Generalization 2. Training, Validation and Testing Data Sets 3. Cross-Validation 4. Weight Restriction

### Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems

Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course G63.2010.001 / G22.2420-001,

### University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 8: Multi-Layer Perceptrons

University of Cambridge Engineering Part IIB Module 4F0: Statistical Pattern Processing Handout 8: Multi-Layer Perceptrons x y (x) Inputs x 2 y (x) 2 Outputs x d First layer Second Output layer layer y

### called Restricted Boltzmann Machines for Collaborative Filtering

Restricted Boltzmann Machines for Collaborative Filtering Ruslan Salakhutdinov rsalakhu@cs.toronto.edu Andriy Mnih amnih@cs.toronto.edu Geoffrey Hinton hinton@cs.toronto.edu University of Toronto, 6 King

### Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network

Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network Anthony Lai (aslai), MK Li (lilemon), Foon Wang Pong (ppong) Abstract Algorithmic trading, high frequency trading (HFT)

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Stochastic Pooling for Regularization of Deep Convolutional Neural Networks

Stochastic Pooling for Regularization of Deep Convolutional Neural Networks Matthew D. Zeiler Department of Computer Science Courant Institute, New York University zeiler@cs.nyu.edu Rob Fergus Department

### Logistic Regression for Spam Filtering

Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an email as a spam or not spam. One of the classic techniques used

### Monotonicity Hints. Abstract

Monotonicity Hints Joseph Sill Computation and Neural Systems program California Institute of Technology email: joe@cs.caltech.edu Yaser S. Abu-Mostafa EE and CS Deptartments California Institute of Technology

### Employer Health Insurance Premium Prediction Elliott Lui

Employer Health Insurance Premium Prediction Elliott Lui 1 Introduction The US spends 15.2% of its GDP on health care, more than any other country, and the cost of health insurance is rising faster than

### Neural networks. Chapter 20, Section 5 1

Neural networks Chapter 20, Section 5 Chapter 20, Section 5 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 20, Section 5 2 Brains 0 neurons of

### t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).

1. Line Search Methods Let f : R n R be given and suppose that x c is our current best estimate of a solution to P min x R nf(x). A standard method for improving the estimate x c is to choose a direction

### Temporal Difference Learning in the Tetris Game

Temporal Difference Learning in the Tetris Game Hans Pirnay, Slava Arabagi February 6, 2009 1 Introduction Learning to play the game Tetris has been a common challenge on a few past machine learning competitions.

### A Learning Algorithm For Neural Network Ensembles

A Learning Algorithm For Neural Network Ensembles H. D. Navone, P. M. Granitto, P. F. Verdes and H. A. Ceccatto Instituto de Física Rosario (CONICET-UNR) Blvd. 27 de Febrero 210 Bis, 2000 Rosario. República

### Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

### Methods and Applications for Distance Based ANN Training

Methods and Applications for Distance Based ANN Training Christoph Lassner, Rainer Lienhart Multimedia Computing and Computer Vision Lab Augsburg University, Universitätsstr. 6a, 86159 Augsburg, Germany

### The Optimality of Naive Bayes

The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most

### Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration

Chapter 6: The Information Function 129 CHAPTER 7 Test Calibration 130 Chapter 7: Test Calibration CHAPTER 7 Test Calibration For didactic purposes, all of the preceding chapters have assumed that the

### AP Physics 1 and 2 Lab Investigations

AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks

### 7 Gaussian Elimination and LU Factorization

7 Gaussian Elimination and LU Factorization In this final section on matrix factorization methods for solving Ax = b we want to take a closer look at Gaussian elimination (probably the best known method

### Nonlinear Iterative Partial Least Squares Method

Numerical Methods for Determining Principal Component Analysis Abstract Factors Béchu, S., Richard-Plouet, M., Fernandez, V., Walton, J., and Fairley, N. (2016) Developments in numerical treatments for

### These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher

### Introduction to Statistics for Computer Science Projects

Introduction Introduction to Statistics for Computer Science Projects Peter Coxhead Whole modules are devoted to statistics and related topics in many degree programmes, so in this short session all I

### Section 1.1. Introduction to R n

The Calculus of Functions of Several Variables Section. Introduction to R n Calculus is the study of functional relationships and how related quantities change with each other. In your first exposure to

### Parallel & Distributed Optimization. Based on Mark Schmidt s slides

Parallel & Distributed Optimization Based on Mark Schmidt s slides Motivation behind using parallel & Distributed optimization Performance Computational throughput have increased exponentially in linear