Recurrent Neural Networks: A Brief Overview


Douglas Eck, University of Montreal. October 1, 2007.

RNNs versus FFNs
- Feed-forward networks (FFNs) can learn static function mappings (an FFN is analogous to an FIR filter).
- Recurrent neural networks (RNNs) use the hidden layer as a memory store in order to learn sequences (an RNN is analogous to an IIR filter).
- RNNs can, in principle at least, exhibit virtually unlimited temporal dynamics.
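To make the FIR/IIR analogy concrete, here is a minimal Python sketch (toy dimensions and weight names are assumptions for illustration, not code from the talk): the feed-forward step maps each input independently, while the recurrent step carries a hidden state h across timesteps.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2
W_in  = rng.normal(scale=0.1, size=(n_hid, n_in))   # input -> hidden
W_rec = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden -> hidden (RNN only)
W_out = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output

def ffn_step(x):
    """Feed-forward: the output depends only on the current input (FIR-like)."""
    return W_out @ np.tanh(W_in @ x)

def rnn_step(x, h):
    """Recurrent: the hidden state carries information across timesteps (IIR-like)."""
    h_new = np.tanh(W_in @ x + W_rec @ h)
    return W_out @ h_new, h_new

# Run both over a short input sequence.
xs = rng.normal(size=(4, n_in))
h = np.zeros(n_hid)
for x in xs:
    y_ff = ffn_step(x)          # memoryless mapping
    y_rnn, h = rnn_step(x, h)   # output depends on the whole past via h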

Several Methods
- SRN: Simple Recurrent Network (Elman, 1990)
- BPTT: Backpropagation Through Time (Rumelhart, Hinton & Williams, 1986)
- RTRL: Real-Time Recurrent Learning (Williams & Zipser, 1989)
- LSTM: Long Short-Term Memory (Hochreiter & Schmidhuber, 1996)

SRN (Elman Net)
- In Simple Recurrent Networks, the hidden layer activations are copied into a context (copy) layer at each timestep.
- Cycles are thereby eliminated, allowing the use of standard backpropagation.
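As a concrete illustration, here is a minimal sketch of the Elman forward pass (variable names and sizes are assumed): the previous hidden activations live in a context layer that is simply concatenated with the current input, so each step is an ordinary feed-forward computation and standard backpropagation applies by treating the copied activations as fixed inputs.

import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 4, 8, 3

# Hidden units see the current input plus a copy of last step's hidden activations.
W_hid = rng.normal(scale=0.1, size=(n_hid, n_in + n_hid))
W_out = rng.normal(scale=0.1, size=(n_out, n_hid))

def srn_forward(x_seq):
    """Forward pass of a simple recurrent (Elman) network over a sequence."""
    context = np.zeros(n_hid)              # copy layer, initialized to zeros
    outputs = []
    for x in x_seq:
        z = np.concatenate([x, context])   # current input + copied activations
        hidden = np.tanh(W_hid @ z)        # ordinary feed-forward layer
        outputs.append(W_out @ hidden)
        context = hidden.copy()            # copy the hidden layer for the next step
    return np.array(outputs)

ys = srn_forward(rng.normal(size=(6, n_in)))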

Some Observations by Elman
- Some problems change nature when expressed in time. Example: temporal XOR (Elman, 1990), e.g. the stream [101101110110000011000].
- The RNN learns frequency detectors.
- Error can give information about the temporal structure of the input.
- Increasing sequential dependencies does not necessarily make the task harder.
- The representation of time is task-dependent.

SRN Strengths
- Easy to train.
- Potential for complex and useful temporal dynamics.
- Can induce hierarchical temporal structure.
- Can learn, e.g., a simple natural-language grammar.

SRN Shortcomings
- We need to accumulate weight changes over every timestep $t_0 < t \le t_1$:
  $$\Delta w_{ij} = \sum_{t=t_0+1}^{t_1} \Delta w_{ij}(t)$$
- Thus at every $t$ we need to compute
  $$\frac{\partial E(t)}{\partial w_{ij}} = \sum_{k \in U} \frac{\partial E(t)}{\partial y_k(t)}\,\frac{\partial y_k(t)}{\partial w_{ij}} = \sum_{k \in U} e_k(t)\,\frac{\partial y_k(t)}{\partial w_{ij}}$$
- Since we know $e_k(t)$, we need only compute $\partial y_k(t)/\partial w_{ij}$.
- The SRN truncates this derivative, so long-timescale (and short-timescale) dependencies between error signals are lost.

SRNs Generalized: BPTT
- Generalize the SRN to remember deeper into the past.
- Trick: unfold the network to represent time spatially (one layer per discrete timestep).
- There are still no cycles, allowing the use of standard backpropagation.

BPTT($\infty$)
For each timestep $t$:
- The current state of the network and the input pattern are added to a history buffer (which stores everything since time $t = 0$).
- The error $e_k(t)$ is injected, and the $\epsilon$s and $\delta$s for times $t_0 < \tau \le t$ are computed:
  $$\epsilon_k(t) = e_k(t), \qquad \delta_k(\tau) = f'_k(s_k(\tau))\,\epsilon_k(\tau), \qquad \epsilon_k(\tau-1) = \sum_{l \in U} w_{lk}\,\delta_l(\tau)$$
- Weight changes are computed as in standard BP:
  $$\frac{\partial E(t)}{\partial w_{ij}} = \sum_{\tau=t_0+1}^{t} \delta_i(\tau)\, x_j(\tau-1)$$
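The following is a minimal sketch of this backward recursion for a single-layer tanh RNN (the architecture, variable names, and the simplification of injecting error only at the final timestep are assumptions); passing a finite history window h instead of None gives the truncated variant discussed on the next slides.

import numpy as np

def bptt_gradients(xs, e_T, states, W_rec, W_in, h=None):
    """Backward pass of BPTT for a tanh RNN y(t) = tanh(W_in x(t) + W_rec y(t-1)).

    xs:     inputs x(1..T), shape (T, n_in)
    e_T:    error e_k(T) injected at the final timestep (a simplification)
    states: activations y(0..T) recorded during the forward pass (the history buffer)
    h:      truncation depth; None means full BPTT(infinity)
    """
    T = len(xs)
    dW_rec, dW_in = np.zeros_like(W_rec), np.zeros_like(W_in)
    eps = e_T.copy()                                  # eps_k(T) = e_k(T)
    t0 = 0 if h is None else max(0, T - h)
    for tau in range(T, t0, -1):
        delta = (1.0 - states[tau] ** 2) * eps        # delta_k = f'(s_k) eps_k, f' = 1 - tanh^2
        dW_rec += np.outer(delta, states[tau - 1])    # delta_i(tau) x_j(tau-1)
        dW_in += np.outer(delta, xs[tau - 1])
        eps = W_rec.T @ delta                         # eps_k(tau-1) = sum_l w_lk delta_l(tau)
    return dW_rec, dW_in

# Tiny usage example with random data.
rng = np.random.default_rng(2)
n, n_in, T = 5, 3, 20
W_rec = rng.normal(scale=0.1, size=(n, n))
W_in = rng.normal(scale=0.1, size=(n, n_in))
xs = rng.normal(size=(T, n_in))
states = [np.zeros(n)]
for x in xs:                                          # forward pass fills the history buffer
    states.append(np.tanh(W_in @ x + W_rec @ states[-1]))
states = np.array(states)
e_T = rng.normal(size=n)                              # error injected at time T
grads_full = bptt_gradients(xs, e_T, states, W_rec, W_in)         # BPTT(infinity)
grads_trunc = bptt_gradients(xs, e_T, states, W_rec, W_in, h=10)  # truncated after h steps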

Truncated/Epochwise BPTT
- When the training data come in epochs, we can limit the size of the history buffer to $h$, the length of an epoch.
- When the training data are not in epochs, we can nonetheless truncate the gradient after $h$ timesteps.
- For $n$ units and $O(n^2)$ weights, epochwise BPTT has space complexity $O(nh)$ and time complexity $O(n^2 h)$.
- This compares favorably to BPTT($\infty$) (substitute $L$, the length of the input sequence, for $h$).

BPTT's Gradient
- Recall that BPTT($\infty$) computes
  $$\frac{\partial E(t)}{\partial w_{ij}} = \sum_{\tau=t_0+1}^{t} \delta_i(\tau)\, x_j(\tau-1)$$
- Thus the error at $t$ takes into account, with equal weight, $\delta$ values from the entire history of the computation.
- Truncated/Epochwise BPTT cuts off the gradient:
  $$\frac{\partial E(t)}{\partial w_{ij}} = \sum_{\tau=t-h}^{t} \delta_i(\tau)\, x_j(\tau-1)$$
- If the data are naturally organized in epochs, this is not a problem.
- If the data are not organized in epochs, this is not a problem either, provided $h$ is big enough.

RTRL
- RTRL = Real-Time Recurrent Learning.
- Instead of unfolding the network backward in time, one can propagate gradient information forward in time.
- Compute directly the sensitivities $p^k_{ij}(t) = \partial y_k(t)/\partial w_{ij}$, updated as
  $$p^k_{ij}(t+1) = f'_k(s_k(t)) \left[ \sum_{l \in U} w_{kl}\, p^l_{ij}(t) + \delta_{ik}\, x_j(t) \right]$$
- Weights are then updated by gradient descent:
  $$\Delta w_{ij} = -\alpha\,\frac{\partial E(t)}{\partial w_{ij}} = -\alpha \sum_{k \in U} e_k(t)\, p^k_{ij}(t)$$
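A minimal sketch of this recursion for a small fully recurrent tanh network (toy sizes, and the convention e_k(t) = d_k(t) - y_k(t) for the error, are assumptions): the sensitivity array p holds the derivative of every unit output with respect to every weight, which is where RTRL's heavy storage cost comes from.

import numpy as np

def rtrl_step(W, y, p, x, d, alpha=0.05):
    """One forward step of a fully recurrent tanh net plus the RTRL updates.

    W: weights, shape (n, n + n_in); y: unit outputs; p[k, i, j] = d y_k / d w_ij.
    """
    n = y.shape[0]
    z = np.concatenate([y, x])                 # recurrent outputs + external input
    y_new = np.tanh(W @ z)
    fprime = 1.0 - y_new**2                    # f'_k(s_k(t))

    # p^k_ij(t+1) = f'_k(s_k(t)) [ sum_l w_kl p^l_ij(t) + delta_ik x_j(t) ]
    p_new = np.einsum('kl,lij->kij', W[:, :n], p)     # recurrent term
    p_new[np.arange(n), np.arange(n), :] += z          # delta_ik x_j(t) term
    p_new *= fprime[:, None, None]

    e = d - y_new                              # e_k(t) = target - output
    W = W + alpha * np.einsum('k,kij->ij', e, p_new)   # gradient-descent weight update
    return W, y_new, p_new

# Tiny usage example with random inputs and targets.
rng = np.random.default_rng(3)
n, n_in = 4, 2
W = rng.normal(scale=0.1, size=(n, n + n_in))
y, p = np.zeros(n), np.zeros((n, n, n + n_in))   # sensitivities: O(n^3) storage
for _ in range(10):
    W, y, p = rtrl_step(W, y, p, rng.normal(size=n_in), rng.normal(size=n))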

RTRL vs. BPTT
- RTRL avoids executing backward dynamics; temporal credit assignment is solved during the forward pass.
- RTRL is painfully slow: for $n$ units and $n^2$ weights, it has $O(n^3)$ space complexity and $O(n^4)$(!) time complexity.
- Because BPTT is faster, RTRL is in general only of theoretical interest.

Training Paradigms
- Epochwise training: reset the system at fixed stopping points.
- Continual training: never reset the system.
- Epochwise training provides a barrier for credit assignment, which is unnatural for many tasks.
- This is distinct from whether the weights are updated batchwise or iteratively.

Teacher Forcing
- Continually-trained networks can move far from the desired trajectory and never return.
- This is especially true if the network enters a region where activations become saturated (gradients go to 0).
- Solution: during training, replace the actual output $y_k(t)$ with the teacher signal $d_k(t)$.
- This is necessary for certain problems in BPTT/RTRL networks (e.g., generating square waves).
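Here is a minimal sketch of the idea for a fully recurrent tanh network whose units double as outputs (an assumed setup, not code from the talk): during training, the fed-back activation is simply overwritten with the teacher signal before the next step.

import numpy as np

def generate_states(W_in, W_rec, xs, targets=None):
    """Run a tanh RNN; if targets are given, teacher-force the fed-back output."""
    n = W_rec.shape[0]
    y = np.zeros(n)
    ys = []
    for t, x in enumerate(xs):
        y = np.tanh(W_in @ x + W_rec @ y)
        ys.append(y)
        if targets is not None:
            y = targets[t]        # replace y_k(t) with the teacher signal d_k(t)
    return np.array(ys)

rng = np.random.default_rng(4)
n, n_in, T = 3, 2, 8
W_in = rng.normal(scale=0.1, size=(n, n_in))
W_rec = rng.normal(scale=0.1, size=(n, n))
xs = rng.normal(size=(T, n_in))
d = np.tanh(rng.normal(size=(T, n)))               # teacher signals d_k(t)

free_run = generate_states(W_in, W_rec, xs)        # network follows its own outputs
forced = generate_states(W_in, W_rec, xs, d)       # teacher-forced trajectory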

Puzzle: How Much is Enough?
- Recall that for truncated BPTT, the true gradient is not being computed.
- How many steps do we need to get close enough?

Puzzle: How Much is Enough?
- Recall that for truncated BPTT, the true gradient is not being computed.
- How many steps do we need to get close enough?
- Answer: certainly not more than 50; probably not more than 10(!)

Credit Assignment is Difficult
- In BPTT, the error gets diluted at every subsequent layer (the credit assignment problem):
  $$\epsilon_k(t) = e_k(t), \qquad \delta_k(\tau) = f'_k(s_k(\tau))\,\epsilon_k(\tau), \qquad \epsilon_k(\tau-1) = \sum_{l \in U} w_{lk}\,\delta_l(\tau)$$
- The logistic sigmoid $1/(1 + e^{-x})$ has a maximum derivative of 0.25, so whenever $|w| < 4.0$ the per-step scaling factor $|f'(s)\,w|$ is always $< 1.0$.
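A quick numeric check of that bound (toy values assumed): the logistic derivative peaks at 0.25, so even in the best case each backward step through a weight of magnitude 2 scales the error by at most 0.5, and the error shrinks geometrically with depth.

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_prime(x):
    y = logistic(x)
    return y * (1.0 - y)          # maximum value is 0.25, attained at x = 0

# Best case for error flow through a single-unit chain: f'(s) = 0.25 at every step.
w = 2.0                            # any |w| < 4 gives a per-step factor below 1
per_step = 0.25 * w                # upper bound on |f'(s) * w|
errors = per_step ** np.arange(0, 21, 5)
print(errors)                      # approx. [1.0, 0.031, 0.00098, 3.1e-05, 9.5e-07]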

Vanishing Gradients
- Bengio, Simard & Frasconi (1994): remembering a bit requires the creation of an attractor basin in state space.
- Two cases: either the system is overly sensitive to noise, or the error gradient vanishes exponentially.
- This is a general problem (HMMs suffer something similar).

Solutions and Alternatives
- Non-gradient learning algorithms (including global search).
- Expectation Maximization (EM) training (e.g., Bengio's IO/HMM).
- Hybrid architectures designed to preserve error signals.

LSTM Memory Block with a Single Cell
[Figure: memory block with input u[t] and output y[t]]
- A hybrid recurrent neural network.
- Make the hidden units linear (the derivative is then 1.0).
- Linear units alone are unstable.
- So place the units in a memory block protected by multiplicative gates.

Inside an LSTM Memory Block
[Figure: cell input $g(net_c)$, internal state $s_c$, squashed output $h(s_c)$, block output $y^c$, and gates $y^{in}$, $y^{\varphi}$, $y^{out}$]
- The gates are standard sigmoidal units.
- The input gate $y^{in}$ protects the linear unit $s_c$ from spurious inputs.
- The forget gate $y^{\varphi}$ allows $s_c$ to empty its own contents.
- The output gate $y^{out}$ allows the block to take itself offline and ignore error.
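A minimal sketch of one forward step through such a block follows. The gate weight names, the use of tanh for both g and h, and the sizes and initialization are assumptions; peephole connections and the learning rule are omitted.

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_block_step(params, z, s_c):
    """One forward step of a single-cell LSTM memory block.

    z:   concatenated inputs to the block (external input + recurrent activations)
    s_c: the cell's internal (linear) state, carried over from the previous step
    """
    w_in, w_phi, w_out, w_c = params
    y_in = logistic(w_in @ z)       # input gate: blocks spurious inputs
    y_phi = logistic(w_phi @ z)     # forget gate: lets the cell empty its contents
    y_out = logistic(w_out @ z)     # output gate: can take the block offline
    g = np.tanh(w_c @ z)            # squashed cell input g(net_c)
    s_c = y_phi * s_c + y_in * g    # linear (unsquashed) state update behind the gates
    y_c = y_out * np.tanh(s_c)      # block output y^c = y^out * h(s_c)
    return y_c, s_c

# Toy usage: a block with 4 inputs, driven for a few steps.
rng = np.random.default_rng(5)
params = [rng.normal(scale=0.1, size=4) for _ in range(4)]
s_c = 0.0
for _ in range(5):
    y_c, s_c = lstm_block_step(params, rng.normal(size=4), s_c)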

LSTM Learning
- For the linear unit $s_c$, a truncated RTRL approach is used (tables of partial derivatives).
- Everywhere else, standard backpropagation is used.
- Rationale: due to the vanishing gradient problem, those errors would decay anyway.

Properties of LSTM
- Very good at finding hierarchical structure.
- Can induce nonlinear oscillation (for counting and timing).
- But error flow among blocks is truncated.
- Difficult to train: the weights into the gates are sensitive.

Formal Grammars
- LSTM solves the Embedded Reber Grammar faster, more reliably, and with a smaller hidden layer than RTRL/BPTT.
- LSTM solves the context-sensitive language $A^n B^n C^n$ better than RTRL/BPTT networks.
- Trained on examples with $n < 10$, LSTM generalized to $n > 1000$; BPTT-trained networks generalize to about $n = 18$ in the best case (Bodén & Wiles, 2001).
[Figure: input and target symbol streams for an $a^n b^n c^n$ string together with the activations of cells $c_1$ and $c_2$]