Understanding Convolutional Neural Networks


Fakultät für Mathematik, Informatik und Naturwissenschaften
Lehr- und Forschungsgebiet Informatik VIII, Computer Vision
Prof. Dr. Bastian Leibe

Seminar Report: Understanding Convolutional Neural Networks

David Stutz
Matriculation Number: ######
August 30, 2014
Advisor: Lucas Beyer

Abstract

This seminar paper focuses on convolutional neural networks and on a visualization technique that allows further insight into their internal operation. After giving a brief introduction to neural networks and the multilayer perceptron, we review both supervised and unsupervised training of neural networks in detail. In addition, we discuss several approaches to regularization. The second section introduces the different types of layers present in recent convolutional neural networks. Based on these basic building blocks, we discuss the architecture of the traditional convolutional neural network as proposed by LeCun et al. [LBD+89] as well as the architecture of recent implementations. The third section focuses on a technique to visualize feature activations of higher layers by backprojecting them to the image plane. This allows us to gain deeper insight into the internal working of convolutional neural networks, so that recent architectures can be evaluated and improved even further.

Contents

1 Motivation
  1.1 Bibliographical Notes
2 Neural Networks and Deep Learning
  2.1 Multilayer Perceptrons
  2.2 Activation Functions
  2.3 Supervised Training
    2.3.1 Error Measures
    2.3.2 Training Protocols
    2.3.3 Parameter Optimization
    2.3.4 Weight Initialization
    2.3.5 Error Backpropagation
  2.4 Unsupervised Training
    2.4.1 Auto-Encoders
    2.4.2 Layer-Wise Training
  2.5 Regularization
    2.5.1 Lp-Regularization
    2.5.2 Early Stopping
    2.5.3 Dropout
    2.5.4 Weight Sharing
    2.5.5 Unsupervised Pre-Training
3 Convolutional Neural Networks
  3.1 Convolution
  3.2 Layers
    3.2.1 Convolutional Layer
    3.2.2 Non-Linearity Layer
    3.2.3 Rectification
    3.2.4 Local Contrast Normalization Layer
    3.2.5 Feature Pooling and Subsampling Layer
    3.2.6 Fully Connected Layer
  3.3 Architectures
    3.3.1 Traditional Convolutional Neural Network
    3.3.2 Modern Convolutional Neural Networks
4 Understanding Convolutional Neural Networks
  4.1 Deconvolutional Neural Networks
    4.1.1 Deconvolutional Layer
    4.1.2 Unsupervised Training
  4.2 Visualizing Convolutional Neural Networks
    4.2.1 Pooling Layers
    4.2.2 Rectification Layers
  4.3 Convolutional Neural Network Visualization
    4.3.1 Filters and Features
    4.3.2 Architecture Evaluation
5 Conclusion

1 Motivation

Artificial neural networks are motivated by the learning capabilities of the human brain, which consists of neurons interconnected by synapses. In fact, at least theoretically, they are able to learn any given mapping up to arbitrary accuracy [HSW89]. In addition, they allow us to easily incorporate prior knowledge about the task into the network architecture. As a result, in 1989, LeCun et al. introduced convolutional neural networks for application in computer vision [LBD+89].

Convolutional neural networks use images directly as input. Instead of relying on handcrafted features, convolutional neural networks automatically learn a hierarchy of features which can then be used for classification purposes. This is accomplished by successively convolving the input image with learned filters to build up a hierarchy of feature maps. The hierarchical approach allows the network to learn more complex, as well as translation- and distortion-invariant, features in higher layers.

In contrast to traditional multilayer perceptrons, where deep learning is considered difficult [Ben09], deep convolutional neural networks can be trained more easily using traditional methods (here, traditional methods refers to gradient descent for parameter optimization combined with error backpropagation as discussed in section 2.3). This property is due to the constrained architecture of convolutional neural networks, which is specific to input for which discrete convolution is defined, such as images; using weight sharing as discussed in section 2.5.4, the actual model complexity is reduced. Nevertheless, deep learning of convolutional neural networks is an active area of research as well.

As with multilayer perceptrons, convolutional neural networks still have some disadvantages when compared to other popular machine learning techniques, such as Support Vector Machines: their internal operation is not well understood [ZF13]. Using the deconvolutional neural networks proposed in [ZKTF10], this problem is addressed in [ZF13]. The approach described in [ZF13] allows the visualization of feature activations in higher layers of the network and can be used to gain further insight into the internal operation of convolutional neural networks.

1.1 Bibliographical Notes

Although this paper briefly introduces the basic notions of neural networks as well as network training, this topic is far too extensive to be covered here in detail. For a detailed discussion of neural networks and their training, several textbooks are available [Bis95, Bis06, Hay05].

The convolutional neural network was originally proposed in [LBD+89] for the task of ZIP code recognition. Both convolutional neural networks and traditional multilayer perceptrons have been applied extensively to character recognition and handwritten digit recognition [LBBH98]. Training was initially based on error backpropagation [RHW86] and gradient descent. The original convolutional neural network is based on weight sharing, which was proposed in [RHW86]. An extension of weight sharing called soft weight sharing is discussed in [NH92]. Recent implementations make use of other regularization techniques, for example dropout [HSK+12].

Although the work by Hinton et al. in 2006 [HO06] can be considered a breakthrough in deep learning, as it allows unsupervised training of neural networks, deep learning is still considered difficult [Ben09]. A thorough discussion of deep learning, including recent research, is given in [Ben09] as well as in [LBLL09, GB10, BL07]. Additional research on this topic includes discussions of activation functions as well as of the effect of unsupervised pre-training [EMB+09, EBC+10, GBB11].

Recent architectural changes of convolutional neural networks are discussed in detail in [JKRL09] and [LKF10]. Recent successes of convolutional neural networks are reported in [KSH12] and [CMS12]. This paper is mainly motivated by the experiments in [ZF13]. Based on deconvolutional neural networks [ZKTF10], the authors of [ZF13] propose a visualization technique that allows them to visualize feature activations of higher layers.

Figure 1: A processing unit consists of a propagation rule mapping all inputs $w_0, x_1, \dots, x_D$ to the actual input $z$, and an activation function $f$ which is applied to the actual input to form the output $y = f(z)$. Here, $w_0$ represents an external input called the bias, and $x_1, \dots, x_D$ are inputs from other units of the network. In a network graph, each unit is labeled according to its output. Therefore, to include the bias $w_0$ as well, a dummy unit (see section 2.1) with value 1 is included.

2 Neural Networks and Deep Learning

An (artificial) neural network comprises a set of interconnected processing units [Bis95]. Given input values $w_0, x_1, \dots, x_D$, where $w_0$ represents an external input and $x_1, \dots, x_D$ are inputs originating from other processing units within the network, a processing unit computes its output as $y = f(z)$. Here, $f$ is called the activation function and $z$ is obtained by applying a propagation rule which maps all the inputs to the actual input $z$. This model of a single processing unit includes the definition of a neuron in [Hay05], where instead of a propagation rule an adder is used to compute $z$ as the weighted sum of all inputs.

Neural networks can be visualized by means of a directed graph called a network graph [Bis95]. (In its most general form, a directed graph is an ordered pair $G = (V, E)$ where $V$ is a set of nodes and $E$ a set of edges connecting the nodes: $(u, v) \in E$ means that a directed edge from node $u$ to node $v$ exists within the graph. In a network graph, given two units $u$ and $v$, a directed edge from $u$ to $v$ means that the output of unit $u$ is used by unit $v$ as input.) Each unit is represented by a node labeled according to its output, and the units are interconnected by directed edges. For a single processing unit this is illustrated in figure 1, where the external input $w_0$ is only added for illustration purposes and is usually omitted [Bis95]. For convenience, we distinguish input units and output units. An input unit computes the output $y := x$, where $x$ is the single input value of the unit. Output units may accept an arbitrary number of input values. Altogether, the network represents a function $y(x)$ whose dimensions are fixed by the number of input units and output units; that is, the input of the network is accepted by the input units and the output units form the output of the network.

2.1 Multilayer Perceptrons

An $(L+1)$-layer perceptron, illustrated in figure 2, consists of $D$ input units, $C$ output units, and several so-called hidden units. (Strictly speaking, an $(L+1)$-layer perceptron consists of $(L+2)$ layers including the input layer; however, as stated in [Bis06], the input layer is not counted as there is no real processing taking place: input units merely compute the identity.) The units are arranged in layers, that is, a multilayer perceptron comprises an input layer, an output layer and $L$ hidden layers [Bis95]. The $i$-th unit within layer $l$ computes the output

$$y_i^{(l)} = f\left(z_i^{(l)}\right) \quad \text{with} \quad z_i^{(l)} = \sum_{k=1}^{m^{(l-1)}} w_{i,k}^{(l)} y_k^{(l-1)} + w_{i,0}^{(l)}, \qquad (1)$$

where $w_{i,k}^{(l)}$ denotes the weighted connection from the $k$-th unit in layer $(l-1)$ to the $i$-th unit in layer $l$, and $w_{i,0}^{(l)}$ can be regarded as an external input to the unit and is referred to as the bias. Here, $m^{(l)}$ denotes the number of units in layer $l$, such that $D = m^{(0)}$ and $C = m^{(L+1)}$. For simplicity, the bias can be regarded as a weight by introducing a dummy unit $y_0^{(l)} := 1$ in each layer:

$$z_i^{(l)} = \sum_{k=0}^{m^{(l-1)}} w_{i,k}^{(l)} y_k^{(l-1)} \quad \text{or} \quad z^{(l)} = w^{(l)} y^{(l-1)}, \qquad (2)$$

where $z^{(l)}$, $w^{(l)}$ and $y^{(l-1)}$ denote the corresponding vector and matrix representations of the actual inputs $z_i^{(l)}$, the weights $w_{i,k}^{(l)}$ and the outputs $y_k^{(l-1)}$, respectively.
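The layer computation of equations (1) and (2) can be made concrete in a few lines of code. The following NumPy sketch of a forward pass is an illustration on our part, not code from the report; the layer sizes, the random weights and the choice of the logistic sigmoid are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    # logistic sigmoid activation, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Forward pass through an (L+1)-layer perceptron.

    weights[l] has shape (m_l, m_{l-1} + 1); the extra column holds the bias
    w_{i,0}, implemented via a dummy unit with constant value 1 (equation (2)).
    """
    y = x
    for W in weights:
        y = np.concatenate(([1.0], y))   # prepend the dummy unit y_0 := 1
        z = W @ y                        # z^(l) = w^(l) y^(l-1)
        y = sigmoid(z)                   # y^(l) = f(z^(l))
    return y

# Example: D = 3 inputs, one hidden layer with 5 units, C = 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3 + 1)), rng.normal(size=(2, 5 + 1))]
print(forward(np.array([0.5, -1.0, 2.0]), weights))
```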

Figure 2: Network graph of an $(L+1)$-layer perceptron with $D$ input units and $C$ output units. The $l$-th hidden layer contains $m^{(l)}$ hidden units.

Overall, a multilayer perceptron represents a function

$$y(\cdot, w) : \mathbb{R}^D \to \mathbb{R}^C, \quad x \mapsto y(x, w), \qquad (3)$$

where the output vector $y(x, w)$ comprises the output values $y_i(x, w) := y_i^{(L+1)}$ and $w$ is the vector of all weights within the network. We speak of deep neural networks when there are more than three hidden layers present [Ben09]. The training of deep neural networks, referred to as deep learning, is considered especially challenging [Ben09].

2.2 Activation Functions

In [Hay05], three types of activation functions are discussed: threshold functions, piecewise-linear functions and sigmoid functions. A common threshold function is the Heaviside function:

$$h(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases} \qquad (4)$$

However, both threshold functions and piecewise-linear functions have drawbacks. First, for network training we may need the activation function to be differentiable. Second, nonlinear activation functions are preferable due to the additional computational power they induce [DHS01, HSW89].

The most commonly used type of activation function is the sigmoid function. As an example, the logistic sigmoid is given by

$$\sigma(z) = \frac{1}{1 + \exp(-z)}. \qquad (5)$$

Its graph is s-shaped, and it is differentiable as well as monotonic. The hyperbolic tangent $\tanh(z)$ can be regarded as a linear transformation of the logistic sigmoid onto the interval $[-1, 1]$. Note that both activation functions are saturating [DHS01, p. 307-308].

When using neural networks for classification (the classification task can be stated as follows: given an input vector $x$ of $D$ dimensions, the goal is to assign $x$ to one of $C$ discrete classes [Bis06]), the softmax activation function is used for the output units in order to interpret the output values as posterior probabilities: the outputs $y_i^{(L+1)}$, $1 \leq i \leq C$, can be interpreted as probabilities because they lie in the interval $[0, 1]$ and sum to 1 [Bis06]. The output of the $i$-th unit in the output layer is then given by

$$\sigma(z^{(L+1)}, i) = \frac{\exp\left(z_i^{(L+1)}\right)}{\sum_{k=1}^{C} \exp\left(z_k^{(L+1)}\right)}. \qquad (6)$$

Experiments in [GB10] show that the logistic sigmoid as well as the hyperbolic tangent perform rather poorly in deep learning. Better performance is reported using the softsign activation function,

$$s(z) = \frac{z}{1 + |z|}. \qquad (7)$$

In [KSH12] a non-saturating activation function is used:

$$r(z) = \max(0, z). \qquad (8)$$

Hidden units using the activation function of equation (8) are called rectified linear units (also abbreviated as ReLUs). Furthermore, in [JKRL09], rectification in addition to the hyperbolic tangent activation function is reported to give good results. Some of the above activation functions are shown in figure 3.

Figure 3: Commonly used activation functions include the logistic sigmoid $\sigma(z)$ of equation (5) and the hyperbolic tangent $\tanh(z)$; more recently used activation functions are the softsign of equation (7) and the rectified hyperbolic tangent.

2.3 Supervised Training

Supervised training is the problem of determining the network weights to approximate a specific target mapping $g$. In practice, $g$ may be unknown, such that the mapping is given by a set of training data. The training set

$$T_S := \{(x_n, t_n) : 1 \leq n \leq N\} \qquad (9)$$

comprises both input values $x_n$ and corresponding desired, possibly noisy, output values $t_n \approx g(x_n)$ [Hay05].
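As an aside to section 2.2, the following NumPy sketch implements the activation functions of equations (5) to (8) and the softmax of equation (6). It is an illustration under assumptions of our own (example inputs, and the max-shift inside the softmax as a standard numerical safeguard), not code from the report.

```python
import numpy as np

def logistic_sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # equation (5)

def softsign(z):
    return z / (1.0 + np.abs(z))       # equation (7)

def relu(z):
    return np.maximum(0.0, z)          # equation (8), rectified linear unit

def softmax(z):
    # equation (6); subtracting the maximum does not change the result
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([-2.0, 0.0, 3.0])
print(logistic_sigmoid(z))
print(np.tanh(z))                      # hyperbolic tangent
print(softsign(z))
print(relu(z))
print(softmax(z))                      # sums to one, cf. posterior interpretation
```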

2.3.1 Error Measures

Training is accomplished by adjusting the weights $w$ of the neural network to minimize a chosen objective function, which can be interpreted as an error measure between the network output $y(x_n)$ and the desired target output $t_n$. Popular choices for classification include the sum-of-squared error measure, given by

$$E(w) = \sum_{n=1}^{N} E_n(w) = \sum_{n=1}^{N} \sum_{k=1}^{C} \left( y_k(x_n, w) - t_{n,k} \right)^2, \qquad (10)$$

and the cross-entropy error measure, given by

$$E(w) = \sum_{n=1}^{N} E_n(w) = -\sum_{n=1}^{N} \sum_{k=1}^{C} t_{n,k} \log\left( y_k(x_n, w) \right), \qquad (11)$$

where $t_{n,k}$ is the $k$-th entry of the target value $t_n$. Details on the choice of error measure and on their properties can be found in [Bis95].

2.3.2 Training Protocols

[DHS01] considers three training protocols:

Stochastic training. An input value is chosen at random and the network weights are updated based on the error $E_n(w)$.

Batch training. All input values are processed and the weights are updated based on the overall error $E(w) = \sum_{n=1}^{N} E_n(w)$.

Online training. Every input value is processed only once and the weights are updated using the error $E_n(w)$.

Further discussion of these protocols can be found in [Bis06] and [DHS01]. A common practice (used, for example, for the experiments in [GBB11] and [GB10]) combines stochastic training and batch training:

Mini-batch training. A random subset $M \subseteq \{1, \dots, N\}$ (a mini-batch) of the training set is processed and the weights are updated based on the cumulative error $E_M(w) := \sum_{n \in M} E_n(w)$.

2.3.3 Parameter Optimization

Considering stochastic training, we seek to minimize $E_n$ with respect to the network weights $w$. The necessary criterion can be written as

$$\frac{\partial E_n}{\partial w} = \nabla E_n(w) \overset{!}{=} 0, \qquad (12)$$

where $\nabla E_n$ is the gradient of the error $E_n$. Due to the complexity of the error $E_n$, a closed-form solution is usually not possible, and we use an iterative approach instead. Let $w[t]$ denote the weight vector in the $t$-th iteration. In each iteration we compute a weight update $\Delta w[t]$ and update the weights accordingly [Bis06]:

$$w[t+1] = w[t] + \Delta w[t]. \qquad (13)$$

From unconstrained optimization, several optimization techniques are available. Gradient descent is a first-order method, which means it uses only information of the first derivative of $E_n$ and can thus be used in combination with error backpropagation as described in section 2.3.5. Newton's method, in contrast, is a second-order method and needs to evaluate the Hessian matrix $H_n$ of $E_n$ (the matrix of second-order partial derivatives, $(H_n)_{r,s} = \frac{\partial^2 E_n}{\partial w_r \partial w_s}$), or an appropriate approximation of it, in each iteration step.
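To make the error measures of section 2.3.1 (equations (10) and (11)) concrete, here is a small numerical sketch; the example outputs and one-hot targets are made up purely for illustration.

```python
import numpy as np

def sum_of_squared_error(Y, T):
    # equation (10): sum over samples n and classes k of (y_k(x_n) - t_{n,k})^2
    return np.sum((Y - T) ** 2)

def cross_entropy_error(Y, T, eps=1e-12):
    # equation (11): -sum over samples and classes of t_{n,k} * log(y_k(x_n));
    # eps guards against log(0) for numerical reasons
    return -np.sum(T * np.log(Y + eps))

# N = 2 samples, C = 3 classes; rows of Y are softmax outputs, T is one-hot.
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
T = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(sum_of_squared_error(Y, T), cross_entropy_error(Y, T))
```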

Figure 4: Illustrated using a quadratic function to be minimized, the idea of gradient descent is to follow the negative gradient at the current position, as it describes the direction of the steepest descent. The learning rate $\gamma$ describes the step size taken in each iteration. Gradient descent is therefore a first-order optimization technique.

Gradient descent. Gradient descent is motivated by the idea of taking a step in the direction of the steepest descent, that is, the direction of the negative gradient, in order to reach a minimum [Bis95]; this principle is illustrated in figure 4. The weight update is therefore given by

$$\Delta w[t] = -\gamma \frac{\partial E_n}{\partial w[t]} = -\gamma \nabla E_n(w[t]), \qquad (14)$$

where $\gamma$ is the learning rate. As discussed in [Bis06], this approach has several difficulties, for example how to choose the learning rate so as to obtain fast learning while at the same time avoiding oscillation. (Oscillation occurs if the learning rate is chosen too large, such that the algorithm successively oversteps the minimum.)

Newton's method. Although some extensions of gradient descent are available, second-order methods promise faster convergence because of their use of second-order information [BL89]. When using Newton's method, the weight update $\Delta w[t]$ is given by

$$\Delta w[t] = -\gamma \left( \frac{\partial^2 E_n}{\partial w[t]^2} \right)^{-1} \frac{\partial E_n}{\partial w[t]} = -\gamma \, H_n(w[t])^{-1} \nabla E_n(w[t]), \qquad (15)$$

where $H_n(w[t])$ is the Hessian matrix of $E_n$ and $\gamma$ again describes the learning rate. The drawback of this method is the evaluation and inversion of the Hessian matrix, which is computationally expensive [BL89]. (An algorithm to evaluate the Hessian matrix based on error backpropagation, as introduced in section 2.3.5, can be found in [Bis92]; the inversion of an $n \times n$ matrix has complexity $O(n^3)$ when using the LU decomposition or similar techniques.)

2.3.4 Weight Initialization

As we use an iterative optimization technique, the initialization of the weights $w$ is crucial. [DHS01] suggests choosing the weights randomly in the range

$$-\frac{1}{\sqrt{m^{(l-1)}}} < w_{i,j}^{(l)} < \frac{1}{\sqrt{m^{(l-1)}}}. \qquad (16)$$

This result is based on the assumption that the inputs of each unit are distributed according to a Gaussian distribution, and it ensures that the actual input is approximately of unity order. Given logistic sigmoid activation functions, this is meant to result in optimal learning [DHS01].

In [GB10] an alternative initialization scheme called normalized initialization is introduced. The weights are chosen randomly in the range

$$-\frac{\sqrt{6}}{\sqrt{m^{(l-1)} + m^{(l)}}} < w_{i,j}^{(l)} < \frac{\sqrt{6}}{\sqrt{m^{(l-1)} + m^{(l)}}}. \qquad (17)$$

The derivation of this initialization scheme can be found in [GB10]. Experimental results in [GB10] demonstrate improved learning when using normalized initialization. An alternative to these weight initialization schemes is given by layer-wise unsupervised pre-training, as discussed in [EBC+10]. We discuss unsupervised training in section 2.4.
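A minimal sketch of the gradient descent update of equation (14) and the normalized initialization of equation (17) might look as follows. The quadratic toy objective, the learning rate and all sizes are illustrative assumptions, not values from the report.

```python
import numpy as np

def gradient_descent_step(w, grad, gamma=0.1):
    # equation (14): Delta w[t] = -gamma * grad E_n(w[t])
    return w - gamma * grad

def normalized_init(m_prev, m_cur, seed=0):
    # equation (17): weights drawn uniformly from
    # (-sqrt(6)/sqrt(m^(l-1) + m^(l)), +sqrt(6)/sqrt(m^(l-1) + m^(l)))
    bound = np.sqrt(6.0) / np.sqrt(m_prev + m_cur)
    return np.random.default_rng(seed).uniform(-bound, bound, size=(m_cur, m_prev))

# Toy example: minimize E(w) = w^2, whose gradient is 2w.
w = 5.0
for _ in range(50):
    w = gradient_descent_step(w, 2.0 * w)
print(round(w, 4))                     # approaches the minimum at 0
print(normalized_init(100, 50).shape)  # (50, 100) weight matrix
```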

2.3.5 Error Backpropagation

Algorithm 1, proposed in [RHW86], is used to evaluate the gradient $\nabla E_n(w[t])$ of the error function $E_n$ in each iteration step. More details as well as a thorough derivation of the algorithm can be found in [Bis95] or [RHW86].

Algorithm 1 (Error Backpropagation)
1. Propagate the input value $x_n$ through the network to obtain the actual input and output of each unit.
2. Calculate the so-called errors $\delta_i^{(L+1)}$ for the output units [Bis06]:
$$\delta_i^{(L+1)} := \frac{\partial E_n}{\partial y_i^{(L+1)}} \, f'\!\left(z_i^{(L+1)}\right). \qquad (18)$$
3. Determine $\delta_i^{(l)}$ for all hidden layers $l$ by using error backpropagation:
$$\delta_i^{(l)} := f'\!\left(z_i^{(l)}\right) \sum_{k=1}^{m^{(l+1)}} w_{k,i}^{(l+1)} \, \delta_k^{(l+1)}. \qquad (19)$$
4. Calculate the required derivatives:
$$\frac{\partial E_n}{\partial w_{i,j}^{(l)}} = \delta_i^{(l)} \, y_j^{(l-1)}. \qquad (20)$$

2.4 Unsupervised Training

In unsupervised training, we are given a training set

$$T_U := \{x_n : 1 \leq n \leq N\} \qquad (21)$$

without desired target values, and the network has to find similarities and regularities within the data by itself. Among other approaches, unsupervised training of deep architectures can be accomplished based on Restricted Boltzmann Machines or on auto-encoders [Ben09] (a brief introduction to Restricted Boltzmann Machines can be found in [Ben09]). We focus on auto-encoders.

2.4.1 Auto-Encoders

Auto-encoders, also called auto-associators [Ben09], are two-layer perceptrons whose goal is to compute a representation of the input in the first layer from which the input can accurately be reconstructed in the output layer. Therefore, no desired target values are needed: auto-encoders are self-supervised [Ben09]. In the hidden layer, consisting of $m := m^{(1)}$ units, an auto-encoder computes a representation $c(x)$ from the input $x$ [Ben09]:

$$c_i(x) = \sum_{k=0}^{D} w_{i,k}^{(1)} x_k. \qquad (22)$$

The output layer tries to reconstruct the input from the representation $c(x)$:

$$\hat{x}_i = d_i(c(x)) = \sum_{k=0}^{m} w_{i,k}^{(2)} c_k(x). \qquad (23)$$

As the output of an auto-encoder should resemble its input, it can be trained as discussed in section 2.3 by replacing the desired target values $t_n$ used in the error measure by the input $x_n$. In the case where $m < D$, the auto-encoder is expected to compute a useful, dimensionality-reducing representation of the input. If $m \geq D$, the auto-encoder could in principle just learn the identity, such that $\hat{x}$ would be a perfect reconstruction of $x$; however, as discussed in [Ben09], in practice this is not a problem.
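Returning to algorithm 1 of section 2.3.5, the steps can be sketched compactly for a small sigmoid network trained with the sum-of-squared error of equation (10). The network sizes and inputs below are illustrative assumptions, and the factor 2 in the output errors stems from differentiating the squared error.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, t, weights):
    """One pass of algorithm 1 for a sigmoid MLP and the error of equation (10);
    returns the gradients dE_n/dW for each layer."""
    # step 1: propagate x through the network, storing all layer outputs
    ys = [np.concatenate(([1.0], x))]          # dummy unit y_0 := 1 per layer
    for W in weights:
        ys.append(np.concatenate(([1.0], sigmoid(W @ ys[-1]))))
    y_out = ys[-1][1:]

    # step 2, equation (18): dE_n/dy = 2(y - t), and f'(z) = y(1 - y) for the sigmoid
    delta = 2.0 * (y_out - t) * y_out * (1.0 - y_out)

    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        # step 4, equation (20): dE_n/dw_{i,j} = delta_i * y_j of the layer below
        grads[l] = np.outer(delta, ys[l])
        if l > 0:
            # step 3, equation (19): backpropagate the errors to the previous layer
            y_hidden = ys[l][1:]
            delta = (weights[l][:, 1:].T @ delta) * y_hidden * (1.0 - y_hidden)
    return grads

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 5))]
print([g.shape for g in backprop(np.array([0.2, -0.4]), np.array([1.0, 0.0]), weights)])
```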

Figure 5: An auto-encoder is mainly a two-layer perceptron with $m := m^{(1)}$ hidden units, whose goal is to compute a representation $c(x)$ in the first layer from which the input can accurately be reconstructed in the output layer.

2.4.2 Layer-Wise Training

As discussed in [LBLL09], the layers of a neural network can be trained in an unsupervised fashion using the following scheme:

For each layer $l = 1, \dots, L+1$: train layer $l$ using the approach discussed above, taking the output of layer $(l-1)$ as input, associating the output of layer $l$ with the representation $c(y^{(l-1)})$, and adding an additional layer to compute $\hat{y}^{(l-1)}$.

2.5 Regularization

It has been shown that multilayer perceptrons with at least one hidden layer can approximate any target mapping up to arbitrary accuracy [HSW89]. Thus, the training data may be overfitted, that is, the training error may be very low on the training set but high on unseen data [Ben09]. Regularization describes the task of avoiding overfitting in order to obtain better generalization performance, meaning that the trained network should also perform well on unseen data [Hay05]. Therefore, the training set is usually split up into an actual training set and a validation set. The neural network is then trained using the new training set, and its generalization performance is evaluated on the validation set [DHS01].

There are different methods to perform regularization. Often, the training set is augmented to introduce certain invariances the network is expected to learn [KSH12]. Other methods add a regularization term to the error measure, aiming to control the complexity and form of the solution [Bis95]:

$$\hat{E}_n(w) = E_n(w) + \eta P(w), \qquad (24)$$

where $P(w)$ influences the form of the solution and $\eta$ is a balancing parameter.

2.5.1 $L_p$-Regularization

A popular example of $L_p$-regularization is $L_2$-regularization (often referred to as weight decay, see [Bis95] for details):

$$P(w) = \|w\|_2^2 = w^T w. \qquad (25)$$

The idea is to penalize large weights, as they tend to result in overfitting [Bis95]. In general, arbitrary $p$ can be used for $L_p$-regularization. Another example sets $p = 1$ to enforce sparsity of the weights, that is, many of the weights should vanish:

$$P(w) = \|w\|_1 = \sum_{k=1}^{W} |w_k|, \qquad (26)$$

where $W$ is the dimension of the weight vector $w$.
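A minimal sketch of the regularized error of equation (24) with the $L_2$ and $L_1$ penalties of equations (25) and (26) could look as follows; the weight vector, the error value and the balancing parameter $\eta$ are arbitrary illustrative values.

```python
import numpy as np

def l2_penalty(w):
    # equation (25): P(w) = ||w||_2^2 = w^T w ("weight decay")
    return float(np.dot(w, w))

def l1_penalty(w):
    # equation (26): P(w) = ||w||_1, encourages many weights to vanish
    return float(np.sum(np.abs(w)))

def regularized_error(error, w, eta=0.01, penalty=l2_penalty):
    # equation (24): E^_n(w) = E_n(w) + eta * P(w)
    return error + eta * penalty(w)

w = np.array([0.5, -2.0, 0.0, 1.5])
print(regularized_error(0.8, w))                      # with the L2 penalty
print(regularized_error(0.8, w, penalty=l1_penalty))  # with the L1 penalty
```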

2.5.2 Early Stopping

While the error on the training set tends to decrease with the number of iterations, the error on the validation set usually starts to rise again once the network starts to overfit the training set. To avoid overfitting, training can be stopped as soon as the error on the validation set reaches a minimum, that is, before the error on the validation set rises again [Bis95]. This method is called early stopping.

2.5.3 Dropout

In [HSK+12] another regularization technique, based on an observation about the human brain, is proposed. Whenever the neural network is given a training sample, each hidden unit is skipped with probability 1/2. This method can be interpreted in different ways [HSK+12]. First, units cannot rely on the presence of other units. Second, this method leads to the training of multiple different networks simultaneously; thus, dropout can be interpreted as model averaging (model averaging tries to reduce the error by averaging the predictions of different models [HSK+12]).

2.5.4 Weight Sharing

The idea of weight sharing was introduced in [RHW86] in the context of the T-C problem (the task of classifying images into those containing a "T" and those containing a "C", independent of position and rotation [RHW86]). Weight sharing means that different units within the same layer use identical weights. This can be interpreted as a regularization method, as the complexity of the network is reduced and prior knowledge may be incorporated into the network architecture. The equality constraint is relaxed when using soft weight sharing, introduced in [NH92]: here, a set of weights is encouraged not to take exactly the same value but similar values. Details can be found in [NH92] and [Bis95].

When using weight sharing, error backpropagation can be applied as usual; however, equation (20) changes to

$$\frac{\partial E_n}{\partial w_{j,i}^{(l)}} = \sum_{k=1}^{m^{(l)}} \delta_k^{(l)} \, y_i^{(l-1)} \qquad (27)$$

when assuming that all units in layer $l$ share the same set of weights, that is, $w_{j,i}^{(l)} = w_{k,i}^{(l)}$ for $j, k \leq m^{(l)}$. Nevertheless, equation (20) still needs to be applied in the case that the errors need to be propagated to preceding layers [Bis06].

2.5.5 Unsupervised Pre-Training

Results in [EBC+10] suggest that layer-wise unsupervised pre-training of deep neural networks can be interpreted as a regularization technique (another interpretation is that unsupervised pre-training initializes the weights in the basin of a good local minimum and can therefore be regarded as an optimization aid [Ben09]). Layer-wise unsupervised pre-training can be accomplished using a scheme similar to the one discussed in section 2.4.2:

1. For each $l = 1, \dots, L+1$: train layer $l$ using the approach discussed in section 2.4.
2. Fine-tune the weights using supervised training as discussed in section 2.3.

A formulation of the effect of unsupervised pre-training as a regularization method is proposed in [EMB+09]: the regularization term punishes weights outside a specific region in weight space with an infinite penalty, such that

$$P(w) = -\log(p(w)), \qquad (28)$$

where $p(w)$ is a prior for the weights which is zero for weights outside this specific region [EBC+10].
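Dropout, as described in section 2.5.3, amounts to multiplying the hidden activations by a fresh random binary mask for every training sample. The following simplified sketch illustrates this; the activations and the drop probability are assumptions, and the rescaling needed when applying the trained network is omitted.

```python
import numpy as np

def dropout(activations, p_drop=0.5, rng=None):
    """Skip each hidden unit with probability p_drop by zeroing its output;
    a new mask is drawn for every training sample."""
    rng = rng or np.random.default_rng(0)
    mask = (rng.random(activations.shape) >= p_drop).astype(activations.dtype)
    return activations * mask

y_hidden = np.array([0.3, 1.2, -0.7, 0.9])
print(dropout(y_hidden))
```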

3 Convolutional Neural Networks

Although neural networks can be applied to computer vision tasks, it is beneficial to incorporate prior knowledge into the network architecture in order to obtain good generalization performance [LeC89]. Convolutional neural networks aim to exploit the spatial information between the pixels of an image. Therefore, they are based on discrete convolution. After introducing discrete convolution, we discuss the basic components of convolutional neural networks as described in [JKRL09] and [LKF10].

3.1 Convolution

For simplicity, we assume a grayscale image to be defined by a function

$$I : \{1, \dots, n_1\} \times \{1, \dots, n_2\} \to W \subseteq \mathbb{R}, \quad (i, j) \mapsto I_{i,j}, \qquad (29)$$

such that the image $I$ can be represented by an array of size $n_1 \times n_2$. (Often, $W$ will be the set $\{0, \dots, 255\}$, representing an 8-bit channel; a color image can then be represented by an array of size $n_1 \times n_2 \times 3$, assuming three color channels, for example RGB.) Given a filter $K \in \mathbb{R}^{(2h_1+1) \times (2h_2+1)}$, the discrete convolution of the image $I$ with the filter $K$ is given by

$$(I * K)_{r,s} := \sum_{u=-h_1}^{h_1} \sum_{v=-h_2}^{h_2} K_{u,v} \, I_{r+u,s+v}, \qquad (30)$$

where the filter $K$ is given by

$$K = \begin{pmatrix} K_{-h_1,-h_2} & \cdots & K_{-h_1,h_2} \\ \vdots & K_{0,0} & \vdots \\ K_{h_1,-h_2} & \cdots & K_{h_1,h_2} \end{pmatrix}. \qquad (31)$$

Note that the behavior of this operation towards the borders of the image needs to be defined properly. (As an example, consider a grayscale image of size $n_1 \times n_2$. When applying an arbitrary filter of size $(2h_1+1) \times (2h_2+1)$ to the pixel at location $(1,1)$, the sum of equation (30) includes pixel locations with negative indices. To solve this problem, several approaches can be considered, for example padding the image in some way, or applying the filter only at locations where the operation is defined properly, resulting in the output array being smaller than the image.)

A commonly used filter for smoothing is the discrete Gaussian filter $K_{G(\sigma)}$ [FP02], which is defined by

$$\left(K_{G(\sigma)}\right)_{r,s} = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{r^2 + s^2}{2\sigma^2} \right), \qquad (32)$$

where $\sigma$ is the standard deviation of the Gaussian distribution [FP02].
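The discrete convolution of equation (30), restricted to the valid region, and the Gaussian filter of equation (32) can be sketched as follows. The image size, the filter size and the final normalization of the filter are illustrative choices of our own, not prescribed by the text.

```python
import numpy as np

def convolve_valid(I, K):
    """Discrete convolution of image I with filter K (equation (30)),
    evaluated only where the filter lies entirely inside the image."""
    h1, h2 = K.shape[0] // 2, K.shape[1] // 2
    n1, n2 = I.shape
    out = np.zeros((n1 - 2 * h1, n2 - 2 * h2))
    for r in range(h1, n1 - h1):
        for s in range(h2, n2 - h2):
            patch = I[r - h1:r + h1 + 1, s - h2:s + h2 + 1]
            out[r - h1, s - h2] = np.sum(K * patch)
    return out

def gaussian_filter(h, sigma=1.0):
    # equation (32), sampled on a (2h+1) x (2h+1) grid; the final division
    # (so that the weights sum to one) is a practical extra step
    r, s = np.mgrid[-h:h + 1, -h:h + 1]
    K = np.exp(-(r ** 2 + s ** 2) / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    return K / K.sum()

I = np.random.default_rng(0).random((8, 8))
print(convolve_valid(I, gaussian_filter(1)).shape)    # (6, 6)
```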

3.2 Layers

We follow [JKRL09] and introduce the different types of layers used in convolutional neural networks. Based on these layers, complex architectures such as those used for classification in [CMS12] and [KSH12] can be built by stacking multiple layers.

3.2.1 Convolutional Layer

Let layer $l$ be a convolutional layer. Then the input of layer $l$ comprises $m_1^{(l-1)}$ feature maps from the previous layer, each of size $m_2^{(l-1)} \times m_3^{(l-1)}$. In the case where $l = 1$, the input is a single image $I$ consisting of one or more channels. This way, a convolutional neural network directly accepts raw images as input. The output of layer $l$ consists of $m_1^{(l)}$ feature maps of size $m_2^{(l)} \times m_3^{(l)}$. The $i$-th feature map in layer $l$, denoted $Y_i^{(l)}$, is computed as

$$Y_i^{(l)} = B_i^{(l)} + \sum_{j=1}^{m_1^{(l-1)}} K_{i,j}^{(l)} * Y_j^{(l-1)}, \qquad (33)$$

where $B_i^{(l)}$ is a bias matrix and $K_{i,j}^{(l)}$ is the filter of size $(2h_1^{(l)}+1) \times (2h_2^{(l)}+1)$ connecting the $j$-th feature map in layer $(l-1)$ with the $i$-th feature map in layer $l$ [LKF10].

Figure 6: Illustration of a single convolutional layer. If layer $l$ is a convolutional layer, the input image (if $l = 1$) or a feature map of the previous layer is convolved by different filters to yield the output feature maps of layer $l$.

As mentioned above, $m_2^{(l)}$ and $m_3^{(l)}$ are influenced by border effects. When applying the discrete convolution only in the so-called valid region of the input feature maps, that is, only for pixels where the sum of equation (30) is defined properly, the output feature maps have size

$$m_2^{(l)} = m_2^{(l-1)} - 2h_1^{(l)} \quad \text{and} \quad m_3^{(l)} = m_3^{(l-1)} - 2h_2^{(l)}. \qquad (34)$$

Often the filters used for computing a fixed feature map $Y_i^{(l)}$ are the same, that is, $K_{i,j}^{(l)} = K_{i,k}^{(l)}$ for $j \neq k$. In addition, the sum of equation (33) may also run over a subset of the input feature maps only.

To relate the convolutional layer and its operation as defined by equation (33) to the multilayer perceptron, we rewrite the above equation. Each feature map $Y_i^{(l)}$ in layer $l$ consists of $m_2^{(l)} \cdot m_3^{(l)}$ units arranged in a two-dimensional array. The unit at position $(r, s)$ computes the output

$$\left(Y_i^{(l)}\right)_{r,s} = \left(B_i^{(l)}\right)_{r,s} + \sum_{j=1}^{m_1^{(l-1)}} \left(K_{i,j}^{(l)} * Y_j^{(l-1)}\right)_{r,s} \qquad (35)$$
$$= \left(B_i^{(l)}\right)_{r,s} + \sum_{j=1}^{m_1^{(l-1)}} \sum_{u=-h_1^{(l)}}^{h_1^{(l)}} \sum_{v=-h_2^{(l)}}^{h_2^{(l)}} \left(K_{i,j}^{(l)}\right)_{u,v} \left(Y_j^{(l-1)}\right)_{r+u,s+v}. \qquad (36)$$

The trainable weights of the network are found in the filters $K_{i,j}^{(l)}$ and the bias matrices $B_i^{(l)}$.

As we will see in section 3.2.5, subsampling is used to decrease the effect of noise and distortions. As noted in [CMM+11], subsampling can also be performed within the convolutional layer using so-called skipping factors $s_1^{(l)}$ and $s_2^{(l)}$. The basic idea is to skip a fixed number of pixels, both in horizontal and in vertical direction, before applying the filter again. With skipping factors as above, the size of the output feature maps is given by

$$m_2^{(l)} = \frac{m_2^{(l-1)} - 2h_1^{(l)}}{s_1^{(l)} + 1} \quad \text{and} \quad m_3^{(l)} = \frac{m_3^{(l-1)} - 2h_2^{(l)}}{s_2^{(l)} + 1}. \qquad (37)$$

3.2.2 Non-Linearity Layer

If layer $l$ is a non-linearity layer, its input is given by $m_1^{(l)}$ feature maps and its output again comprises $m_1^{(l)} = m_1^{(l-1)}$ feature maps, each of size $m_2^{(l-1)} \times m_3^{(l-1)}$, such that $m_2^{(l)} = m_2^{(l-1)}$ and $m_3^{(l)} = m_3^{(l-1)}$, given by

$$Y_i^{(l)} = f\left(Y_i^{(l-1)}\right), \qquad (38)$$

where $f$ is the activation function used in layer $l$ and operates point-wise. (Note the difference between a feature map $Y_i^{(l)}$, comprising $m_2^{(l)} \cdot m_3^{(l)}$ units arranged in a two-dimensional array, and a single unit $y_i^{(l)}$ as used in the multilayer perceptron.) In [JKRL09], additional gain coefficients are added:

$$Y_i^{(l)} = g_i \, f\left(Y_i^{(l-1)}\right). \qquad (39)$$

A convolutional layer including a non-linearity, with hyperbolic tangent activation functions and gain coefficients, is denoted $F_{CSG}$ (C for convolutional layer, S for sigmoid/hyperbolic tangent activation functions and G for gain coefficients; in [JKRL09] the filter size is added as a superscript, such that $F_{CSG}^{7\times7}$ denotes the usage of $7 \times 7$ filters, and the number of filters used is additionally prepended to this notation; we omit the number of filters as we assume full connectivity, such that the number of filters is given by $m_1^{(l)} \cdot m_1^{(l-1)}$). Note that in [JKRL09] this constitutes a single layer, whereas we separate the convolutional layer and the non-linearity layer.
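A rough sketch of a convolutional layer (equation (33)) followed by a point-wise non-linearity (equation (38)) is given below. It uses scipy.signal.correlate2d as a stand-in for the valid-region convolution, replaces the bias matrix $B_i^{(l)}$ by a scalar bias, and assumes full connectivity; all sizes are illustrative assumptions rather than values from the text.

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer(inputs, filters, biases):
    """Feature maps of a convolutional layer (equation (33)):
    Y_i = B_i + sum_j K_{i,j} * Y_j, evaluated in the valid region only."""
    outputs = []
    for K_row, b in zip(filters, biases):
        acc = sum(correlate2d(Y_j, K, mode="valid") for Y_j, K in zip(inputs, K_row))
        outputs.append(acc + b)   # b is a scalar bias here, a simplification
    return outputs

def nonlinearity_layer(maps):
    # equation (38): the activation function operates point-wise on each map
    return [np.tanh(Y) for Y in maps]

rng = np.random.default_rng(0)
inputs = [rng.random((32, 32)) for _ in range(3)]           # m_1^(l-1) = 3 maps
filters = [[rng.normal(size=(5, 5)) for _ in range(3)] for _ in range(8)]
out = nonlinearity_layer(conv_layer(inputs, filters, biases=np.zeros(8)))
# equation (34): each output map has size (32 - 4) x (32 - 4) = 28 x 28
print(len(out), out[0].shape)
```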

3.2.3 Rectification

Let layer $l$ be a rectification layer. Then its input comprises $m_1^{(l-1)}$ feature maps of size $m_2^{(l-1)} \times m_3^{(l-1)}$, and the absolute value of each component of the feature maps is computed:

$$Y_i^{(l)} = \left| Y_i^{(l-1)} \right|, \qquad (40)$$

where the absolute value is taken point-wise, such that the output consists of $m_1^{(l)} = m_1^{(l-1)}$ feature maps unchanged in size. Experiments in [JKRL09] show that rectification plays a central role in achieving good performance. Although rectification could be included in the non-linearity layer [LKF10], we follow [JKRL09] and add this operation as an independent layer. The rectification layer is denoted $R_{abs}$. (Note that equation (40) can easily be applied to fully-connected layers, as introduced in section 3.2.6, as well.)

3.2.4 Local Contrast Normalization Layer

Let layer $l$ be a local contrast normalization layer. The task of such a layer is to enforce local competitiveness between adjacent units within a feature map and between units at the same spatial location in different feature maps. We discuss subtractive normalization as well as brightness normalization; an alternative, called divisive normalization, can be found in [JKRL09] or [LKF10]. Given $m_1^{(l-1)}$ feature maps of size $m_2^{(l-1)} \times m_3^{(l-1)}$, the output of layer $l$ comprises $m_1^{(l)} = m_1^{(l-1)}$ feature maps unchanged in size. The subtractive normalization operation computes

$$Y_i^{(l)} = Y_i^{(l-1)} - \sum_{j=1}^{m_1^{(l-1)}} K_{G(\sigma)} * Y_j^{(l-1)}, \qquad (41)$$

where $K_{G(\sigma)}$ is the Gaussian filter from equation (32).

In [KSH12] an alternative local normalization scheme called brightness normalization is proposed, to be used in combination with rectified linear units. The output of layer $l$ is then given by

$$\left(Y_i^{(l)}\right)_{r,s} = \frac{\left(Y_i^{(l-1)}\right)_{r,s}}{\left(\kappa + \lambda \sum_{j=1}^{m_1^{(l-1)}} \left(Y_j^{(l-1)}\right)_{r,s}^2\right)^{\mu}}, \qquad (42)$$

where $\kappa$, $\lambda$, $\mu$ are hyperparameters which can be set using a validation set [KSH12]. The sum of equation (42) may also run over a subset of the feature maps in layer $(l-1)$. The subtractive and brightness local contrast normalization layers are denoted $N_S$ and $N_B$, respectively.
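The subtractive normalization of equation (41) can be sketched roughly as follows. Here scipy.ndimage.gaussian_filter stands in for the convolution with $K_{G(\sigma)}$ (its boundary handling differs from the strict definition in the text), and the map sizes and $\sigma$ are arbitrary assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def subtractive_normalization(maps, sigma=1.0):
    """Subtractive local contrast normalization (equation (41)): from each
    feature map, the Gaussian-smoothed sum over all input maps is subtracted."""
    smoothed = sum(gaussian_filter(Y, sigma=sigma) for Y in maps)
    return [Y - smoothed for Y in maps]

maps = [np.random.default_rng(i).random((16, 16)) for i in range(4)]
print([m.shape for m in subtractive_normalization(maps)])   # sizes unchanged
```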

3.2.5 Feature Pooling and Subsampling Layer

The motivation for subsampling the feature maps obtained by previous layers is robustness to noise and distortions [JKRL09]. Reducing the resolution can be accomplished in different ways. In [JKRL09] and [LKF10] this is combined with pooling and done in a separate layer, while in the traditional convolutional neural network, subsampling is done by applying skipping factors within the convolutional layers.

Figure 7: Illustration of a pooling and subsampling layer. If layer $l$ is a pooling and subsampling layer and given $m_1^{(l-1)} = 4$ feature maps of the previous layer, all feature maps are pooled and subsampled individually. Each unit in one of the $m_1^{(l)} = 4$ output feature maps represents the average or the maximum within a fixed window of the corresponding feature map in layer $(l-1)$.

Let $l$ be a pooling layer. Its output comprises $m_1^{(l)} = m_1^{(l-1)}$ feature maps of reduced size. In general, pooling operates by placing windows at non-overlapping positions in each feature map and keeping one value per window, such that the feature maps are subsampled. We distinguish two types of pooling:

Average pooling. When using a boxcar filter, the operation is called average pooling and the layer is denoted $P_A$. (Using the notation of section 3.1, the boxcar filter $K_B$ of size $(2h_1+1) \times (2h_2+1)$ is given by $(K_B)_{r,s} = \frac{1}{(2h_1+1)(2h_2+1)}$.)

Max pooling. For max pooling, the maximum value of each window is taken. The layer is denoted $P_M$.

As discussed in [SMB10], max pooling is used to obtain faster convergence during training. Both average and max pooling can also be applied using overlapping windows of size $p \times p$ which are placed $q$ units apart; the windows overlap if $q < p$. This is found to reduce the chance of overfitting the training set [KSH12].

3.2.6 Fully Connected Layer

Let layer $l$ be a fully connected layer. If layer $(l-1)$ is a fully connected layer as well, we may apply equation (1). Otherwise, layer $l$ expects $m_1^{(l-1)}$ feature maps of size $m_2^{(l-1)} \times m_3^{(l-1)}$ as input, and the $i$-th unit in layer $l$ computes

$$y_i^{(l)} = f\left(z_i^{(l)}\right) \quad \text{with} \quad z_i^{(l)} = \sum_{j=1}^{m_1^{(l-1)}} \sum_{r=1}^{m_2^{(l-1)}} \sum_{s=1}^{m_3^{(l-1)}} w_{i,j,r,s}^{(l)} \left(Y_j^{(l-1)}\right)_{r,s}, \qquad (43)$$

where $w_{i,j,r,s}^{(l)}$ denotes the weight connecting the unit at position $(r, s)$ in the $j$-th feature map of layer $(l-1)$ to the $i$-th unit in layer $l$. In practice, convolutional layers are used to learn a feature hierarchy, and one or more fully connected layers are used for classification purposes based on the computed features [LBD+89, LKF10]. Note that a fully-connected layer already includes the non-linearities, while for a convolutional layer the non-linearities are separated into their own layer.

3.3 Architectures

We discuss both the traditional convolutional neural network as proposed in [LBD+89] and a modern variant as used in [KSH12].
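Before turning to these architectures, the following sketch illustrates max pooling (section 3.2.5) and a fully connected layer on feature maps (equation (43)); average pooling would simply replace np.max by np.mean. All sizes and weights are illustrative assumptions of our own.

```python
import numpy as np

def max_pool(Y, p=2, q=2):
    """Max pooling (section 3.2.5): windows of size p x p placed q units apart;
    q == p gives non-overlapping windows, q < p overlapping ones."""
    rows = range(0, Y.shape[0] - p + 1, q)
    cols = range(0, Y.shape[1] - p + 1, q)
    out = np.empty((len(rows), len(cols)))
    for a, r in enumerate(rows):
        for b, s in enumerate(cols):
            out[a, b] = np.max(Y[r:r + p, s:s + p])
    return out

def fully_connected(maps, weights, f=np.tanh):
    """Fully connected layer on feature maps (equation (43)): unit i sums
    w_{i,j,r,s} * (Y_j)_{r,s} over all maps j and all positions (r, s)."""
    x = np.concatenate([Y.ravel() for Y in maps])   # flatten all feature maps
    return f(weights @ x)

rng = np.random.default_rng(0)
maps = [rng.random((8, 8)) for _ in range(4)]
pooled = [max_pool(Y) for Y in maps]                 # 4 maps of size 4 x 4
weights = rng.normal(size=(10, 4 * 4 * 4))           # 10 output units
print(fully_connected(pooled, weights).shape)        # (10,)
```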

3.3.1 Traditional Convolutional Neural Network

In [JKRL09], the basic building block of traditional convolutional neural networks is $F_{CSG} - P_A$, while in [LBD+89] the subsampling is accomplished within the convolutional layers and no gain coefficients are used. In general, the unique characteristic of traditional convolutional neural networks lies in the hyperbolic tangent non-linearities and the weight sharing [LBD+89]. This is illustrated in figure 8, where the non-linearities are included within the convolutional layers.

Figure 8: The architecture of the original convolutional neural network, as introduced in [LBD+89], alternates between convolutional layers (including hyperbolic tangent non-linearities) and subsampling layers. In this illustration, the convolutional layers already include non-linearities and thus a convolutional layer actually represents two layers. The feature maps of the final subsampling layer are then fed into the actual classifier, consisting of an arbitrary number of fully connected layers. The output layer usually uses softmax activation functions.

3.3.2 Modern Convolutional Neural Networks

As an example of a modern convolutional neural network, we explore the architecture used in [KSH12], which gives excellent performance on the ImageNet dataset [ZF13]. The architecture comprises five convolutional layers, each followed by a rectified linear unit non-linearity layer, brightness normalization and overlapping pooling. Classification is done using three additional fully-connected layers. To avoid overfitting, [KSH12] uses dropout as regularization technique. Such a network can be specified as $F_{CR} - N_B - P$, where $F_{CR}$ denotes a convolutional layer followed by a non-linearity layer with rectified linear units. Details can be found in [KSH12].

In [CMS12] the authors combine several deep convolutional neural networks that have a similar architecture as described above and average their classification/prediction results. This architecture is referred to as a multi-column deep convolutional neural network.

4 Understanding Convolutional Neural Networks

Although convolutional neural networks have been used with success for a variety of computer vision tasks, their internal operation is not well understood. While backprojection of feature activations from the first convolutional layer onto the image plane is possible, subsequent pooling and rectification layers hinder us from understanding higher layers as well. As stated in [ZF13], this is highly unsatisfactory when aiming to improve convolutional neural networks. Thus, in [ZF13] a visualization technique is proposed which allows us to visualize the activations of higher layers. This technique is based on an additional model for unsupervised learning of feature hierarchies: the deconvolutional neural network, as introduced in [ZKTF10].

4.1 Deconvolutional Neural Networks

Similar to convolutional neural networks, deconvolutional neural networks are based upon the idea of generating feature hierarchies by convolving the input image with a set of filters at each layer [ZKTF10]. However, deconvolutional neural networks are unsupervised by definition. In addition, deconvolutional neural networks are based on a top-down approach: the goal is to reconstruct the network input from its activations and filters [ZKTF10].

4.1.1 Deconvolutional Layer

Let layer $l$ be a deconvolutional layer. Its input is composed of $m_1^{(l-1)}$ feature maps of size $m_2^{(l-1)} \times m_3^{(l-1)}$. Each such feature map $Y_i^{(l-1)}$ is represented as a sum over $m_1^{(l)}$ feature maps convolved with filters $K_{j,i}^{(l)}$:

$$\sum_{j=1}^{m_1^{(l)}} K_{j,i}^{(l)} * Y_j^{(l)} = Y_i^{(l-1)}. \qquad (44)$$

As with an auto-encoder, it is easy for the layer to learn the identity if there are enough degrees of freedom. Therefore, [ZKTF10] introduces a sparsity constraint for the feature maps $Y_j^{(l)}$, and the error measure for training layer $l$ is given by

$$E^{(l)}(w) = \sum_{i=1}^{m_1^{(l-1)}} \left\| \sum_{j=1}^{m_1^{(l)}} K_{j,i}^{(l)} * Y_j^{(l)} - Y_i^{(l-1)} \right\|_2^2 + \sum_{i=1}^{m_1^{(l)}} \left| Y_i^{(l)} \right|^p, \qquad (45)$$

where $|\cdot|^p$ is the vectorized $p$-norm and can be interpreted as $L_p$-regularization as discussed in section 2.5.1. The difference between a convolutional layer and a deconvolutional layer is illustrated in figure 9. Note that the error measure $E^{(l)}$ is specific to layer $l$; this implies that a deconvolutional neural network with multiple deconvolutional layers is trained layer-wise.

Figure 9: An illustration of the difference between the bottom-up approach of convolutional layers and the top-down approach of deconvolutional layers.

Figure 10: After each convolutional layer, the feature activations of the previous layer are reconstructed using an attached deconvolutional layer. For $l > 1$ the process of reconstruction is iterated until the feature activations are backprojected onto the image plane.

4.1.2 Unsupervised Training

Similar to the unsupervised training discussed in section 2.4, training is performed layer-wise. Equation (45) is optimized by alternately optimizing with respect to the feature maps $Y_i^{(l)}$, given the filters $K_{j,i}^{(l)}$ and the feature maps $Y_i^{(l-1)}$ of the previous layer, and with respect to the filters $K_{j,i}^{(l)}$ [ZKTF10]. Here, the optimization with respect to the feature maps $Y_i^{(l)}$ causes some problems: for example, when using $p = 1$, the optimization problem is poorly conditioned [ZKTF10], and therefore plain gradient descent optimization fails. An alternative optimization scheme is discussed in detail in [ZKTF10]; however, as we do not need to train deconvolutional neural networks here, this is left to the reader.

4.2 Visualizing Convolutional Neural Networks

To visualize and understand the internal operations of a convolutional neural network, a single deconvolutional layer is attached to each convolutional layer. Given input feature maps for layer $l$, the output feature maps $Y_i^{(l)}$ are fed back into the corresponding deconvolutional layer at level $l$. The deconvolutional layer reconstructs the feature maps $Y_i^{(l-1)}$ that gave rise to the activations in layer $l$ [ZF13]. This process is iterated until layer $l = 0$ is reached, resulting in the activations of layer $l$ being backprojected onto the image plane; the general idea is illustrated in figure 10. Note that the deconvolutional layers do not need to be trained, as the filters are already given by the trained convolutional layers and merely have to be transposed. (Given a feature map $Y_i^{(l)} = K_{i,j}^{(l)} * Y_j^{(l-1)}$, where we omit the sum of equation (33) for simplicity, using the transposed filter $\left(K_{i,j}^{(l)}\right)^T$ gives us $Y_j^{(l-1)} = \left(K_{i,j}^{(l)}\right)^T * Y_i^{(l)}$.)

More complex convolutional neural networks may include non-linearity layers, rectification layers as well as pooling layers. While we assume the non-linearities used to be invertible, the use of rectification layers and pooling layers causes some problems.

4.2.1 Pooling Layers

Let layer $l$ be a max pooling layer; then the operation of layer $l$ is not invertible. We need to remember which positions within the input feature map $Y_i^{(l-1)}$ gave rise to the maximum values in order to obtain an approximate inverse [ZF13]. Therefore, as discussed in [ZF13], so-called switch variables are introduced.
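The switch variables of section 4.2.1 can be illustrated with a small sketch: max pooling records, for each window, the position of its maximum, and the approximate inverse places the pooled values back at exactly these positions [ZF13]. The implementation details below (non-overlapping 2x2 windows, zero-filling elsewhere) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def max_pool_with_switches(Y, p=2):
    """Non-overlapping max pooling that also records switch variables, i.e.
    the position within each window that produced the maximum value."""
    H, W = Y.shape[0] // p, Y.shape[1] // p
    pooled = np.empty((H, W))
    switches = np.empty((H, W, 2), dtype=int)
    for a in range(H):
        for b in range(W):
            window = Y[a * p:(a + 1) * p, b * p:(b + 1) * p]
            u, v = np.unravel_index(np.argmax(window), window.shape)
            pooled[a, b] = window[u, v]
            switches[a, b] = (a * p + u, b * p + v)
    return pooled, switches

def unpool(pooled, switches, shape):
    """Approximate inverse of max pooling: place each pooled value back at the
    position stored in the switch variables; all other positions stay zero."""
    Y = np.zeros(shape)
    for a in range(pooled.shape[0]):
        for b in range(pooled.shape[1]):
            u, v = switches[a, b]
            Y[u, v] = pooled[a, b]
    return Y

Y = np.random.default_rng(0).random((4, 4))
pooled, switches = max_pool_with_switches(Y)
print(unpool(pooled, switches, Y.shape))
```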

Figure 11: While the approach described in section 4.2 can easily be applied to convolutional neural networks including non-linearity layers, the use of pooling and rectification layers poses some problems. The max pooling operation is not invertible. Therefore, for each unit in the pooling layer, we remember the position in the corresponding feature map which gave rise to the unit's output value; to accomplish this, so-called switch variables are introduced [ZF13]. Rectification layers can simply be inverted by prepending a rectification layer to the deconvolutional layer.

4.2.2 Rectification Layers

The convolutional neural network may use rectification layers to obtain positive feature maps after each non-linearity layer. To cope with this, a rectification layer is added to each deconvolutional layer, so as to obtain positive reconstructions of the feature maps as well [ZF13]. The incorporation of both pooling layers and rectification layers is illustrated in figure 11.

4.3 Convolutional Neural Network Visualization

The above visualization technique can be used to discuss several aspects of convolutional neural networks. We follow the discussion in [ZF13], which refers to the architecture described in section 3.3.2.

4.3.1 Filters and Features

Backprojecting the feature activations allows a close analysis of the hierarchical nature of the features within the convolutional neural network. Figure 12, taken from [ZF13], shows the activations for three layers with corresponding input images. While the first and second layers comprise filters for edge and corner detection, the filters tend to get more complex and abstract in higher layers. For example, when considering layer 3, the feature activations reflect specific structures within the images: the patterns used in layer 3, row 1, column 1; human contours in layer 3, row 3, column 3. Higher levels show strong invariances to translation and rotation [ZF13]; such transformations usually have a high impact on low-level features. In addition, as stated in [ZF13], it is important to train the convolutional neural network until convergence, as the higher levels usually need more time to converge.

4.3.2 Architecture Evaluation

The visualization of the feature activations across the convolutional layers also allows us to evaluate the effect of filter size as well as filter placement. For example, by analyzing the feature activations of the first and second layers, the authors of [ZF13] observed that the first layer captures only high-frequency and low-frequency information, and that the feature activations of the second layer show aliasing artifacts. By adapting the filter size of the first layer and the skipping factor used within the second layer, performance could be improved. In addition, the visualization shows the advantage of deep architectures, as higher layers are able to learn more complex features that are invariant to low-level distortions and translations [ZF13].

Figure 12: Taken from [ZF13], this figure shows a selection of features across several layers of a fully trained convolutional neural network, obtained using the visualization technique discussed in section 4.2.

5 Conclusion

In the course of this paper we discussed the basic notions of both neural networks in general and the multilayer perceptron in particular. With deep learning in mind, we introduced supervised training using gradient descent and error backpropagation as well as unsupervised training using auto-encoders. We concluded the section with a brief discussion of regularization methods, including dropout [HSK+12] and unsupervised pre-training.

We introduced convolutional neural networks by discussing the different types of layers used in recent implementations: the convolutional layer, the non-linearity layer, the rectification layer, the local contrast normalization layer, and the pooling and subsampling layer. Based on these basic building blocks, we discussed the traditional convolutional neural network [LBD+89] as well as a modern variant as used in [KSH12].

Despite their excellent performance [KSH12, CMS12], the internal operation of convolutional neural networks is not well understood [ZF13]. To get deeper insight into their internal working, we followed [ZF13] and discussed a visualization technique that allows us to backproject the feature activations of higher layers. This makes it possible to further evaluate and improve recent architectures, such as the architecture used in [KSH12].

Nevertheless, convolutional neural networks and deep learning in general remain an active area of research. Although the difficulty of deep learning seems to be understood [Ben09, GB10, EMB+09], learning feature hierarchies is still considered very hard [Ben09]. Here, the possibility of unsupervised pre-training has had a huge impact and allows deep architectures to be trained in reasonable time [Ben09, EBC+10]. Nonetheless, the reason for the good performance of deep neural networks has still not been answered fully.

References

[Ben09] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.
[Bis92] C. Bishop. Exact calculation of the Hessian matrix for the multilayer perceptron. Neural Computation, 4(4):494-501, 1992.
[Bis95] C. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
[Bis06] C. Bishop. Pattern Recognition and Machine Learning. Springer Verlag, New York, 2006.
[BL89] S. Becker and Y. LeCun. Improving the convergence of back-propagation learning with second-order methods. In Connectionist Models Summer School, pages 29-37, 1989.
[BL07] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In Large Scale Kernel Machines. MIT Press, 2007.
[CMM+11] D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Artificial Intelligence, International Joint Conference on, pages 1237-1242, 2011.
[CMS12] D. C. Cireşan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. Computing Research Repository, abs/1202.2745, 2012.
[DHS01] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley-Interscience, New York, 2001.
[EBC+10] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625-660, 2010.
[EMB+09] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent. The difficulty of training deep architectures and the effect of unsupervised pre-training. In Artificial Intelligence and Statistics, International Conference on, pages 153-160, 2009.
[FP02] D. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference, New Jersey, 2002.
[GB10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Artificial Intelligence and Statistics, International Conference on, pages 249-256, 2010.
[GBB11] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Artificial Intelligence and Statistics, International Conference on, pages 315-323, 2011.
[GMW81] P. Gill, W. Murray, and M. Wright. Practical Optimization. Academic Press, London, 1981.
[Hay05] S. Haykin. Neural Networks: A Comprehensive Foundation. Pearson Education, New Delhi, 2005.
[HO06] G. E. Hinton and S. Osindero. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
[HSK+12] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. Computing Research Repository, abs/1207.0580, 2012.
[HSW89] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.
[JKRL09] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Computer Vision, International Conference on, pages 2146-2153, 2009.
[KRL10] K. Kavukcuoglu, M. A. Ranzato, and Y. LeCun. Fast inference in sparse coding algorithms with applications to object recognition. Computing Research Repository, abs/1010.3467, 2010.
[KSH12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[LBBH98] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[LBD+89] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.
[LBLL09] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1-40, 2009.
[LeC89] Y. LeCun. Generalization and network design strategies. In Connectionism in Perspective, 1989.
[LKF10] Y. LeCun, K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. In Circuits and Systems, International Symposium on, pages 253-256, 2010.
[NH92] S. J. Nowlan and G. E. Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4):473-493, 1992.
[RHW86] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Parallel distributed processing: Explorations in the microstructure of cognition, chapter Learning Representations by Back-Propagating Errors, pages 318-362. MIT Press, Cambridge, 1986.
[Ros58] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 1958.
[SMB10] D. Scherer, A. Müller, and S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Artificial Neural Networks, International Conference on, pages 92-101, 2010.
[SSP03] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Document Analysis and Recognition, International Conference on, 2003.
[ZF13] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. Computing Research Repository, abs/1311.2901, 2013.
[ZKTF10] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In Computer Vision and Pattern Recognition, Conference on, pages 2528-2535, 2010.