The LMS Algorithm
CS 295 (UVM), Fall 2013

Outline:
The LMS objective function.
Global solution; the pseudoinverse of a matrix.
Optimization (learning) by gradient descent.
The LMS (Widrow-Hoff) algorithm.
Convergence of the batch LMS rule.

LMS Objective Function

Again we consider the problem of programming a linear threshold unit (with inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_n$, bias $w_0$, and output $y$),

\[ y = \operatorname{sgn}(w^T x + w_0) = \begin{cases} +1, & \text{if } w^T x + w_0 > 0, \\ -1, & \text{if } w^T x + w_0 \le 0, \end{cases} \]

so that it agrees with a given dichotomy of $m$ feature vectors,

\[ X_m = \{ (x_1, \ell_1), \ldots, (x_m, \ell_m) \}, \]

where $x_i \in \mathbb{R}^n$ and $\ell_i \in \{-1, +1\}$, for $i = 1, 2, \ldots, m$.

LMS Objective Function (cont.)

Thus, given a dichotomy $X_m = \{ (x_1, \ell_1), \ldots, (x_m, \ell_m) \}$, we seek a solution weight vector $w$ and bias $w_0$ such that

\[ \operatorname{sgn}(w^T x_i + w_0) = \ell_i, \quad \text{for } i = 1, 2, \ldots, m. \]

Equivalently, using homogeneous coordinates (or augmented feature vectors),

\[ \operatorname{sgn}(\hat{w}^T \hat{x}_i) = \ell_i, \quad \text{for } i = 1, 2, \ldots, m, \]

where $\hat{x}_i = (1, x_i^T)^T$ and $\hat{w} = (w_0, w^T)^T$. Using normalized coordinates,

\[ \operatorname{sgn}(\hat{w}^T \hat{x}_i') = 1, \quad \text{or equivalently} \quad \hat{w}^T \hat{x}_i' > 0, \quad \text{for } i = 1, 2, \ldots, m, \]

where $\hat{x}_i' = \ell_i \hat{x}_i$.
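To make the two coordinate conventions concrete, here is a minimal NumPy sketch (not part of the original slides; the function and array names are illustrative) that builds the augmented vectors $\hat{x}_i$ and the normalized vectors $\hat{x}_i' = \ell_i \hat{x}_i$ from a feature matrix and a label vector.

```python
import numpy as np

def augment(X):
    """Prepend a 1 to each feature vector: x_i -> (1, x_i^T)^T."""
    m = X.shape[0]
    return np.hstack([np.ones((m, 1)), X])

def normalize(X_aug, labels):
    """Fold the label into each augmented vector: x_hat_i -> l_i * x_hat_i."""
    return labels[:, None] * X_aug

# Example usage with two 2-D points and labels +1, -1.
X = np.array([[1.0, 2.0],
              [3.0, 1.0]])
labels = np.array([1.0, -1.0])
X_hat = augment(X)                   # rows are x_hat_i^T
X_prime = normalize(X_hat, labels)   # rows are x_hat_i'^T
print(X_prime)
```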

LMS Objective Function (cont.)

Alternatively, let $b \in \mathbb{R}^m$ satisfy $b_i > 0$ for $i = 1, 2, \ldots, m$. We call $b$ a margin vector. Frequently we will assume that $b_i = 1$. Our learning criterion is certainly satisfied if $\hat{w}^T \hat{x}_i' = b_i$ for $i = 1, 2, \ldots, m$. (Note that this condition is sufficient, but not necessary.) Equivalently, the above is satisfied if the expression

\[ E(\hat{w}) = \sum_{i=1}^m \left( b_i - \hat{w}^T \hat{x}_i' \right)^2 \]

equals zero. We call this expression the LMS objective function, where LMS stands for least mean square. (N.B. we could normalize it by dividing the right-hand side by $m$.)
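As a quick check of the definition, a direct term-by-term translation of $E(\hat{w})$ into NumPy (an illustrative sketch, continuing the arrays from the previous snippet) is:

```python
import numpy as np

def lms_objective(w_hat, X_prime, b):
    """E(w_hat) = sum_i (b_i - w_hat^T x_hat_i')^2, accumulated term by term."""
    total = 0.0
    for x_i, b_i in zip(X_prime, b):
        r = b_i - w_hat @ x_i    # residual for sample i
        total += r * r
    return total

# e.g. lms_objective(np.zeros(X_prime.shape[1]), X_prime, np.ones(len(X_prime)))
```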

LMS Objective Function: Remarks

Converting the original problem of classification (satisfying a system of inequalities) into one of optimization is somewhat ad hoc, and there is no guarantee that we can find a $\hat{w}^\star \in \mathbb{R}^{n+1}$ that satisfies $E(\hat{w}^\star) = 0$. However, this conversion can lead to practical compromises if the original inequalities possess inconsistencies.

Also, there is no unique objective function. The LMS expression,

\[ E(\hat{w}) = \sum_{i=1}^m \left( b_i - \hat{w}^T \hat{x}_i' \right)^2, \]

can be replaced by numerous candidates, e.g.

\[ \sum_{i=1}^m \left| b_i - \hat{w}^T \hat{x}_i' \right|, \]

or analogous sums built from $\operatorname{sgn}(\hat{w}^T \hat{x}_i')$ in place of $\hat{w}^T \hat{x}_i'$, etc. However, minimizing $E(\hat{w})$ is generally an easy task.

Minimizing the LMS Objective Function

Inspection suggests that the LMS objective function

\[ E(\hat{w}) = \sum_{i=1}^m \left( b_i - \hat{w}^T \hat{x}_i' \right)^2 \]

describes a parabolic function. It may have a unique global minimum, or an infinite number of global minima which occupy a connected linear set. (The latter can occur if $m < n + 1$.) Letting

\[ X = \begin{pmatrix} \hat{x}_1'^T \\ \hat{x}_2'^T \\ \vdots \\ \hat{x}_m'^T \end{pmatrix} = \begin{pmatrix} \ell_1 & \hat{x}_{1,1}' & \cdots & \hat{x}_{1,n}' \\ \ell_2 & \hat{x}_{2,1}' & \cdots & \hat{x}_{2,n}' \\ \vdots & \vdots & \ddots & \vdots \\ \ell_m & \hat{x}_{m,1}' & \cdots & \hat{x}_{m,n}' \end{pmatrix} \in \mathbb{R}^{m \times (n+1)}, \]

then

\[ E(\hat{w}) = \sum_{i=1}^m \left( b_i - \hat{x}_i'^T \hat{w} \right)^2 = \left\| \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix} - \begin{pmatrix} \hat{x}_1'^T \hat{w} \\ \vdots \\ \hat{x}_m'^T \hat{w} \end{pmatrix} \right\|^2 = \| b - X\hat{w} \|^2. \]
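The matrix $X$ simply stacks the normalized augmented vectors as rows, and the matrix form of $E$ agrees with the sum of squares. A small NumPy sketch (illustrative only; the random data stands in for the rows $\hat{x}_i'^T$) makes the equivalence easy to verify:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.normal(size=(m, n + 1))      # stands in for rows x_hat_i'^T
b = np.ones(m)                       # margin vector with b_i = 1
w_hat = rng.normal(size=n + 1)

E_sum    = sum((b_i - w_hat @ x_i) ** 2 for x_i, b_i in zip(X, b))
E_matrix = np.linalg.norm(b - X @ w_hat) ** 2
assert np.isclose(E_sum, E_matrix)   # same objective, two notations
```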

Minimizing the LMS Objective Function (cont.)

As an aside, note that

\[ E(\hat{w}) = \| b - X\hat{w} \|^2 = (b - X\hat{w})^T (b - X\hat{w}) = \hat{w}^T X^T X \hat{w} - \hat{w}^T X^T b - b^T X \hat{w} + b^T b = \hat{w}^T X^T X \hat{w} - 2\, b^T X \hat{w} + \| b \|^2, \]

where

\[ X^T X = (\hat{x}_1', \ldots, \hat{x}_m') \begin{pmatrix} \hat{x}_1'^T \\ \vdots \\ \hat{x}_m'^T \end{pmatrix} = \sum_{i=1}^m \hat{x}_i' \hat{x}_i'^T \in \mathbb{R}^{(n+1) \times (n+1)}, \]

and

\[ b^T X = (b_1, \ldots, b_m) \begin{pmatrix} \hat{x}_1'^T \\ \vdots \\ \hat{x}_m'^T \end{pmatrix} = \sum_{i=1}^m b_i \hat{x}_i'^T \in \mathbb{R}^{n+1}. \]

Minimizing the LMS Objective Function (cont.)

Okay, so how do we minimize

\[ E(\hat{w}) = \hat{w}^T X^T X \hat{w} - 2\, b^T X \hat{w} + \| b \|^2 ? \]

Using calculus, we can compute the gradient of $E(\hat{w})$ and algebraically determine a value of $\hat{w}^\star$ which makes each component vanish. That is, solve

\[ \nabla E(\hat{w}) = \begin{pmatrix} \partial E / \partial \hat{w}_0 \\ \partial E / \partial \hat{w}_1 \\ \vdots \\ \partial E / \partial \hat{w}_n \end{pmatrix} = 0. \]

It is straightforward to show that

\[ \nabla E(\hat{w}) = 2\, X^T X \hat{w} - 2\, X^T b. \]

Minimizing the LMS Objective Function (cont.)

Thus, if

\[ \nabla E(\hat{w}) = 2\, X^T X \hat{w} - 2\, X^T b = 0, \]

then

\[ \hat{w}^\star = (X^T X)^{-1} X^T b = X^\dagger b, \]

where the matrix

\[ X^\dagger \stackrel{\text{def}}{=} (X^T X)^{-1} X^T \in \mathbb{R}^{(n+1) \times m} \]

is called the pseudoinverse of $X$. If $X^T X$ is singular, one defines

\[ X^\dagger \stackrel{\text{def}}{=} \lim_{\epsilon \to 0} (X^T X + \epsilon I)^{-1} X^T. \]

Observe that if $X^T X$ is nonsingular,

\[ X^\dagger X = (X^T X)^{-1} X^T X = I. \]
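The following sketch (not from the slides; `lms_global_solution` is an illustrative name) computes the global minimizer $\hat{w}^\star = X^\dagger b$. It forms $(X^T X)^{-1} X^T$ explicitly to mirror the derivation; in practice `np.linalg.pinv` or `np.linalg.lstsq` is numerically preferable and also covers the singular case via the SVD.

```python
import numpy as np

def lms_global_solution(X, b):
    """Minimize ||b - X w||^2 via the pseudoinverse."""
    XtX = X.T @ X
    if np.linalg.matrix_rank(XtX) == XtX.shape[0]:
        X_dagger = np.linalg.inv(XtX) @ X.T       # (X^T X)^{-1} X^T
        return X_dagger @ b
    return np.linalg.pinv(X) @ b                  # SVD-based pseudoinverse
```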

Example

The following example appears in R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Second Edition, Wiley, NY, 2001, p. 241. Given the dichotomy

\[ X_4 = \{ ((1,2)^T, +1),\ ((2,0)^T, +1),\ ((3,1)^T, -1),\ ((2,3)^T, -1) \}, \]

we obtain

\[ X = \begin{pmatrix} \hat{x}_1'^T \\ \hat{x}_2'^T \\ \hat{x}_3'^T \\ \hat{x}_4'^T \end{pmatrix} = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{pmatrix}. \]

Whence,

\[ X^T X = \begin{pmatrix} 4 & 8 & 6 \\ 8 & 18 & 11 \\ 6 & 11 & 14 \end{pmatrix}, \quad \text{and} \quad X^\dagger = (X^T X)^{-1} X^T = \begin{pmatrix} 5/4 & 13/12 & 3/4 & 7/12 \\ -1/2 & -1/6 & -1/2 & -1/6 \\ 0 & -1/3 & 0 & -1/3 \end{pmatrix}. \]

Example (cont.)

Letting $b = (1, 1, 1, 1)^T$, then

\[ \hat{w} = X^\dagger b = \begin{pmatrix} 5/4 & 13/12 & 3/4 & 7/12 \\ -1/2 & -1/6 & -1/2 & -1/6 \\ 0 & -1/3 & 0 & -1/3 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 11/3 \\ -4/3 \\ -2/3 \end{pmatrix}. \]

Whence $w_0 = 11/3$ and $w = (-4/3, -2/3)^T$. [The slide shows the four training points and the resulting decision boundary in the $(x_1, x_2)$ plane.]
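A short NumPy sketch (added for illustration) reproduces these numbers and confirms that the resulting weight vector classifies all four samples correctly:

```python
import numpy as np

# Normalized augmented vectors for the four training points (labels folded in).
X = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])
b = np.ones(4)

X_dagger = np.linalg.inv(X.T @ X) @ X.T
w_hat = X_dagger @ b
print(w_hat)                 # approx [ 3.6667, -1.3333, -0.6667] = (11/3, -4/3, -2/3)
print(np.sign(X @ w_hat))    # all +1: each normalized sample lies on the correct side
```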

Method of Steepest Descent

An alternative approach is the method of steepest descent. We begin by stating Taylor's theorem for functions of more than one variable. Let $x \in \mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}$, so

\[ f(x) = f(x_1, x_2, \ldots, x_n) \in \mathbb{R}. \]

Now let $\delta x \in \mathbb{R}^n$, and consider

\[ f(x + \delta x) = f(x_1 + \delta x_1, \ldots, x_n + \delta x_n). \]

Define $F : \mathbb{R} \to \mathbb{R}$ such that

\[ F(s) = f(x + s\, \delta x). \]

Thus, $F(0) = f(x)$ and $F(1) = f(x + \delta x)$.

Method of Steepest Descent (cont.)

By Taylor's theorem for a single variable,

\[ F(s) = F(0) + \frac{1}{1!} F'(0)\, s + \frac{1}{2!} F''(0)\, s^2 + \frac{1}{3!} F'''(0)\, s^3 + \cdots. \]

Our plan is to set $s = 1$ and replace $F(1)$ by $f(x + \delta x)$, $F(0)$ by $f(x)$, etc. To evaluate $F'(0)$ we invoke the multivariate chain rule, e.g.,

\[ \frac{d}{ds} f(u(s), v(s)) = \frac{\partial f}{\partial u}(u, v)\, u'(s) + \frac{\partial f}{\partial v}(u, v)\, v'(s). \]

Thus,

\[ F'(s) = \frac{dF}{ds}(s) = \frac{d}{ds} f(x_1 + s\, \delta x_1, \ldots, x_n + s\, \delta x_n) = \frac{\partial f}{\partial x_1}(x + s\, \delta x)\, \frac{d}{ds}(x_1 + s\, \delta x_1) + \cdots + \frac{\partial f}{\partial x_n}(x + s\, \delta x)\, \frac{d}{ds}(x_n + s\, \delta x_n) = \frac{\partial f}{\partial x_1}(x + s\, \delta x)\, \delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x + s\, \delta x)\, \delta x_n. \]

Method of Steepest Descent (cont.)

Thus,

\[ F'(0) = \frac{\partial f}{\partial x_1}(x)\, \delta x_1 + \cdots + \frac{\partial f}{\partial x_n}(x)\, \delta x_n = \nabla f(x)^T \delta x. \]

Method of Steepest Descent (cont.)

Thus, it is possible to show that

\[ f(x + \delta x) = f(x) + \nabla f(x)^T \delta x + O\!\left( \| \delta x \|^2 \right) = f(x) + \| \nabla f(x) \| \, \| \delta x \| \cos\theta + O\!\left( \| \delta x \|^2 \right), \]

where $\theta$ is the angle between $\nabla f(x)$ and $\delta x$. If $\| \delta x \| \ll 1$, then

\[ \delta f = f(x + \delta x) - f(x) \approx \| \nabla f(x) \| \, \| \delta x \| \cos\theta. \]

Thus, the greatest reduction in $\delta f$ occurs if $\cos\theta = -1$, that is, if $\delta x = -\rho\, \nabla f$, where $\rho > 0$. We thus seek a local minimum of the LMS objective function by taking a sequence of steps

\[ \hat{w}(t+1) = \hat{w}(t) - \rho\, \nabla E\!\left( \hat{w}(t) \right). \]
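A minimal fixed-step steepest-descent loop in NumPy, given as a sketch (the function name, the step size $\rho$, and the iteration count are illustrative choices, not part of the slides):

```python
import numpy as np

def steepest_descent(grad, w0, rho=0.01, steps=1000):
    """Generic fixed-step steepest descent: w <- w - rho * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - rho * grad(w)
    return w

# For the LMS objective E(w) = ||X w - b||^2, grad E(w) = 2 X^T (X w - b), e.g.:
# w_star = steepest_descent(lambda w: 2 * X.T @ (X @ w - b), np.zeros(X.shape[1]))
```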

Training an LTU using Steepest Descent

We now return to our original problem. Given a dichotomy $X_m = \{ (x_1, \ell_1), \ldots, (x_m, \ell_m) \}$ of $m$ feature vectors $x_i \in \mathbb{R}^n$ with $\ell_i \in \{-1, +1\}$ for $i = 1, \ldots, m$, we construct the set of normalized, augmented feature vectors

\[ \hat{X}_m' = \{ (\ell_i,\ \ell_i x_{i,1},\ \ldots,\ \ell_i x_{i,n})^T \in \mathbb{R}^{n+1} \mid i = 1, \ldots, m \}. \]

Given a margin vector $b \in \mathbb{R}^m$, with $b_i > 0$ for $i = 1, \ldots, m$, we construct the LMS objective function,

\[ E(\hat{w}) = \frac{1}{2} \sum_{i=1}^m \left( \hat{w}^T \hat{x}_i' - b_i \right)^2 = \frac{1}{2} \| X\hat{w} - b \|^2 \]

(the factor of 1/2 is added with some foresight), and then evaluate its gradient,

\[ \nabla E(\hat{w}) = \sum_{i=1}^m \left( \hat{w}^T \hat{x}_i' - b_i \right) \hat{x}_i' = X^T (X\hat{w} - b). \]

Training an LTU using Steepest Descent (cont.)

Substitution into the steepest descent update rule,

\[ \hat{w}(t+1) = \hat{w}(t) - \rho\, \nabla E\!\left( \hat{w}(t) \right), \]

yields the batch LMS update rule,

\[ \hat{w}(t+1) = \hat{w}(t) + \rho \sum_{i=1}^m \left( b_i - \hat{w}(t)^T \hat{x}_i' \right) \hat{x}_i'. \]

Alternatively, one can abstract the sequential LMS, or Widrow-Hoff, rule from the above:

\[ \hat{w}(t+1) = \hat{w}(t) + \rho \left( b - \hat{w}(t)^T \hat{x}'(t) \right) \hat{x}'(t), \]

where $\hat{x}'(t) \in \hat{X}_m'$ is the element of the dichotomy that is presented to the LTU at epoch $t$. (Here, we assume that $b$ is fixed; otherwise, replace it by $b(t)$.)
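Both rules translate directly into NumPy. The sketch below (illustrative names; a fixed learning rate and a fixed epoch count are assumptions) implements the batch rule in its vectorized form and the sequential Widrow-Hoff rule as a per-sample loop:

```python
import numpy as np

def batch_lms(X, b, rho, epochs=100, w0=None):
    """Batch LMS: w <- w + rho * sum_i (b_i - w^T x_i') x_i' = w - rho * X^T (X w - b)."""
    w = np.zeros(X.shape[1]) if w0 is None else np.array(w0, dtype=float)
    for _ in range(epochs):
        w = w + rho * X.T @ (b - X @ w)
    return w

def sequential_lms(X, b, rho, epochs=100, w0=None):
    """Widrow-Hoff rule: update the weights after each sample presentation."""
    w = np.zeros(X.shape[1]) if w0 is None else np.array(w0, dtype=float)
    for _ in range(epochs):
        for x_i, b_i in zip(X, b):
            w = w + rho * (b_i - w @ x_i) * x_i
    return w
```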

Sequential LMS Rule

The sequential LMS rule,

\[ \hat{w}(t+1) = \hat{w}(t) + \rho \left[ b - \hat{w}(t)^T \hat{x}'(t) \right] \hat{x}'(t), \]

resembles the sequential perceptron rule,

\[ \hat{w}(t+1) = \hat{w}(t) + \rho \left[ 1 - \operatorname{sgn}\!\left( \hat{w}(t)^T \hat{x}'(t) \right) \right] \hat{x}'(t). \]

Sequential rules are well suited to real-time implementations, as only the current values of the weights, i.e. the configuration of the LTU itself, need to be stored. They also work with dichotomies of infinite sets.

Convergence of the Batch LMS Rule

Recall, the LMS objective function in the form

\[ E(\hat{w}) = \frac{1}{2} \| X\hat{w} - b \|^2 \]

has as its gradient

\[ \nabla E(\hat{w}) = X^T X \hat{w} - X^T b = X^T (X\hat{w} - b). \]

Substitution into the rule of steepest descent,

\[ \hat{w}(t+1) = \hat{w}(t) - \rho\, \nabla E\!\left( \hat{w}(t) \right), \]

yields

\[ \hat{w}(t+1) = \hat{w}(t) - \rho\, X^T \left( X\hat{w}(t) - b \right). \]

Convergence of the Batch LMS Rule (cont.)

The algorithm is said to converge to a fixed point $\hat{w}^\star$ if, for every finite initial value $\hat{w}(0)$ (i.e., $\| \hat{w}(0) \| < \infty$),

\[ \lim_{t \to \infty} \hat{w}(t) = \hat{w}^\star. \]

The fixed points $\hat{w}^\star$ satisfy $\nabla E(\hat{w}^\star) = 0$, whence

\[ X^T X \hat{w}^\star = X^T b. \]

The update rule becomes

\[ \hat{w}(t+1) = \hat{w}(t) - \rho\, X^T \left( X\hat{w}(t) - b \right) = \hat{w}(t) - \rho\, X^T X \left( \hat{w}(t) - \hat{w}^\star \right). \]

Let $\delta\hat{w}(t) \stackrel{\text{def}}{=} \hat{w}(t) - \hat{w}^\star$. Then

\[ \delta\hat{w}(t+1) = \delta\hat{w}(t) - \rho\, X^T X\, \delta\hat{w}(t) = \left( I - \rho\, X^T X \right) \delta\hat{w}(t). \]

Convergence of the Batch LMS Rule (cont.)

Convergence $\hat{w}(t) \to \hat{w}^\star$ occurs if $\delta\hat{w}(t) = \hat{w}(t) - \hat{w}^\star \to 0$. Thus we require that $\| \delta\hat{w}(t+1) \| < \| \delta\hat{w}(t) \|$. Inspecting the update rule,

\[ \delta\hat{w}(t+1) = \left( I - \rho\, X^T X \right) \delta\hat{w}(t), \]

this reduces to the condition that all the eigenvalues of $I - \rho\, X^T X$ have magnitudes less than 1. We will now evaluate the eigenvalues of this matrix.

Convergence of the Batch LMS Rule (cont.)

Let $S \in \mathbb{R}^{(n+1) \times (n+1)}$ denote the similarity transform that reduces the symmetric matrix $X^T X$ to a diagonal matrix $\Lambda \in \mathbb{R}^{(n+1) \times (n+1)}$. Thus, $S^T S = S S^T = I$, and

\[ S\, X^T X\, S^T = \Lambda = \operatorname{diag}(\lambda_0, \lambda_1, \ldots, \lambda_n). \]

The numbers $\lambda_0, \lambda_1, \ldots, \lambda_n$ are the eigenvalues of $X^T X$. Note that $\lambda_i \ge 0$ for $i = 0, 1, \ldots, n$. Thus,

\[ S\, \delta\hat{w}(t+1) = S \left( I - \rho\, X^T X \right) S^T S\, \delta\hat{w}(t) = \left( I - \rho\, \Lambda \right) S\, \delta\hat{w}(t). \]

Thus convergence occurs if

\[ \| S\, \delta\hat{w}(t+1) \| < \| S\, \delta\hat{w}(t) \|, \]

which occurs if the eigenvalues of $I - \rho\, \Lambda$ all have magnitudes less than one.

Convergence of the Batch LMS Rule (cont.)

The eigenvalues of $I - \rho\, \Lambda$ equal $1 - \rho \lambda_i$, for $i = 0, 1, \ldots, n$. (These are in fact the eigenvalues of $I - \rho\, X^T X$.) Thus, we require that

\[ -1 < 1 - \rho \lambda_i < 1, \quad \text{or} \quad 0 < \rho \lambda_i < 2, \]

for all $i$. Let $\lambda_{\max} = \max_{0 \le i \le n} \lambda_i$ denote the largest eigenvalue of $X^T X$; then convergence requires that

\[ 0 < \rho < \frac{2}{\lambda_{\max}}. \]
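The step-size bound is easy to check numerically. The sketch below (illustrative; it reuses the example's $X$ and $b$, and the choices $\rho = 1/\lambda_{\max}$ and 5000 iterations are arbitrary within the bound) runs the batch LMS rule and confirms that it converges to the pseudoinverse solution $\hat{w}^\star = X^\dagger b$:

```python
import numpy as np

# Normalized augmented samples and margin vector from the earlier example.
X = np.array([[ 1.,  1.,  2.],
              [ 1.,  2.,  0.],
              [-1., -3., -1.],
              [-1., -2., -3.]])
b = np.ones(4)

lam_max = np.linalg.eigvalsh(X.T @ X).max()   # largest eigenvalue of X^T X
rho = 1.0 / lam_max                           # any 0 < rho < 2 / lam_max converges

w = np.zeros(3)
for _ in range(5000):
    w = w - rho * X.T @ (X @ w - b)           # batch LMS step

print(np.allclose(w, np.linalg.pinv(X) @ b))  # True: iterates reach w_star
```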