A Stochastic 3MG Algorithm with Application to 2D Filter Identification




A Stochastic 3MG Algorithm with Application to 2D Filter Identification

Emilie Chouzenoux (1), Jean-Christophe Pesquet (1), and Anisia Florescu (2)
(1) Laboratoire d'Informatique Gaspard Monge, CNRS, Univ. Paris-Est, France
(2) Dunărea de Jos University, Electronics and Telecommunications Dept., Galaţi, România

Sestri Levante, 10 September 2014

EC (UPE) Sestri Levante 10 September 2014 1 / 13

Problem

We want to minimize $F$ over $h \in \mathbb{R}^N$, where

$(\forall h \in \mathbb{R}^N)\quad F(h) = \frac{1}{2}\,\mathsf{E}\big(\|y_n - X_n^\top h\|^2\big) + \Psi(h),$

with
- $(X_n)_{n \ge 1}$ random matrices in $\mathbb{R}^{N \times Q}$,
- $(y_n)_{n \ge 1}$ random vectors in $\mathbb{R}^Q$ such that, for every $n \in \mathbb{N}^*$, $\mathsf{E}(\|y_n\|^2) = \rho$, $\mathsf{E}(X_n y_n) = r$, $\mathsf{E}(X_n X_n^\top) = R$,
- $\Psi \colon \mathbb{R}^N \to \mathbb{R}$ a regularization function.

Problem

We want to minimize over $h \in \mathbb{R}^N$: $F(h) = \frac{1}{2}\,\mathsf{E}\big(\|y_n - X_n^\top h\|^2\big) + \Psi(h)$.

Numerous applications:
- supervised classification
- inverse problems
- system identification, channel equalization
- linear prediction/interpolation
- echo cancellation, interference removal ...

Objective

Solve the problem when the second-order statistics of $(X_n, y_n)_{n \ge 1}$ are estimated on-line.
- Huge literature on the topic: stochastic approximation algorithms in machine learning, sparse adaptive filtering methods.

Employ a Majorize-Minimize strategy allowing the use of a wide range of regularization functions $\Psi$.
- Few works in the stochastic case [Mairal, 2013].

Develop fast algorithms by resorting to subspace acceleration.
- 3MG algorithm for batch optimization [Chouzenoux et al., 2011]
- Connections with stochastic extensions of BFGS [Byrd et al., 2014].

Stochastic approximation of the criterion

At iteration $n \in \mathbb{N}^*$,

$(\forall h \in \mathbb{R}^N)\quad F_n(h) = \frac{1}{2n} \sum_{k=1}^{n} \|y_k - X_k^\top h\|^2 + \Psi(h) = \frac{1}{2}\rho_n - r_n^\top h + \frac{1}{2} h^\top R_n h + \Psi(h),$

where

$\rho_n = \frac{1}{n}\sum_{k=1}^{n} \|y_k\|^2, \qquad r_n = \frac{1}{n}\sum_{k=1}^{n} X_k y_k, \qquad R_n = \frac{1}{n}\sum_{k=1}^{n} X_k X_k^\top.$

Possibility to introduce a forgetting factor.

Related approaches when $\Psi$ is an $\ell_1$ penalty:
- online time-weighted LASSO [Angelosante et al., 2010]
- sparse RLS [Babadi et al., 2010].
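These empirical statistics can be maintained recursively, as done later in the talk; a minimal numpy sketch (our illustration, not the authors' code) checks that the running updates reproduce the batch averages $r_n$ and $R_n$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, Q, n_iter = 4, 3, 50

r = np.zeros(N)          # running estimate of E(X_n y_n)
R = np.zeros((N, N))     # running estimate of E(X_n X_n^T)
Xs, ys = [], []

for n in range(1, n_iter + 1):
    X = rng.standard_normal((N, Q))
    y = rng.standard_normal(Q)
    Xs.append(X)
    ys.append(y)
    # recursive updates: r_n = r_{n-1} + (X_n y_n - r_{n-1}) / n, same for R_n
    r += (X @ y - r) / n
    R += (X @ X.T - R) / n

# the running averages match the batch averages over all n_iter samples
r_batch = sum(X @ y for X, y in zip(Xs, ys)) / n_iter
R_batch = sum(X @ X.T for X in Xs) / n_iter
assert np.allclose(r, r_batch) and np.allclose(R, R_batch)
```

The same recursion with a constant weight in place of $1/n$ gives the forgetting-factor variant mentioned above.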

Form of the regularization function

$(\forall h \in \mathbb{R}^N)\quad \Psi(h) = \underbrace{\frac{1}{2} h^\top V_0 h - v_0^\top h}_{\text{elastic net penalization}} + \sum_{s=1}^{S} \psi_s\big(\|V_s h - v_s\|\big)$

where
- $v_0 \in \mathbb{R}^N$, $V_0 \in \mathbb{R}^{N \times N}$ symmetric positive semi-definite,
- for every $s \in \{1,\ldots,S\}$, $v_s \in \mathbb{R}^{P_s}$, $V_s \in \mathbb{R}^{P_s \times N}$, $\psi_s \colon \mathbb{R} \to \mathbb{R}$
- ability to take into account linear operators.

Assumptions on $(\psi_s)_{1 \le s \le S}$
1. For every $s \in \{1,\ldots,S\}$, $\psi_s$ is an even, lower-bounded function, which is continuously differentiable, and $\lim_{t \to 0,\, t \neq 0} \psi_s'(t)/t \in \mathbb{R}$, where $\psi_s'$ denotes the derivative of $\psi_s$.
2. For every $s \in \{1,\ldots,S\}$, $\psi_s(\sqrt{\cdot})$ is concave on $[0,+\infty[$.
3. There exists $\nu \in [0,+\infty[$ such that $(\forall s \in \{1,\ldots,S\})(\forall t \in [0,+\infty[)$ $0 \le \nu_s(t) \le \nu$, where $\nu_s(t) = \psi_s'(t)/t$.

Examples of functions $(\psi_s)_{1 \le s \le S}$, listing $\lambda_s^{-1}\psi_s(t)$, with $(\lambda_s, \delta_s) \in ]0,+\infty[^2$ and $\kappa_s \in [1,2]$:

Convex:
- $|t| - \delta_s \log(|t|/\delta_s + 1)$ ($\ell_2$-$\ell_1$)
- Huber: $t^2$ if $|t| < \delta_s$, $2\delta_s |t| - \delta_s^2$ otherwise ($\ell_2$-$\ell_1$)
- Green: $\log(\cosh(t))$ ($\ell_2$-$\ell_1$)
- $(1 + t^2/\delta_s^2)^{\kappa_s/2} - 1$ ($\ell_2$-$\ell_{\kappa_s}$)

Nonconvex:
- Welsch: $1 - \exp(-t^2/(2\delta_s^2))$ ($\ell_2$-$\ell_0$)
- Geman-McClure: $t^2/(2\delta_s^2 + t^2)$ ($\ell_2$-$\ell_0$)
- Tukey biweight: $1 - (1 - t^2/(6\delta_s^2))^3$ if $|t| \le \sqrt{6}\,\delta_s$, $1$ otherwise ($\ell_2$-$\ell_0$)
- Hyperbolic tangent: $\tanh(t^2/(2\delta_s^2))$ ($\ell_2$-$\ell_0$)
- Cauchy: $\log(1 + t^2/\delta_s^2)$ ($\ell_2$-$\log$)
- Chouzenoux: $1 - \exp\big(1 - (1 + t^2/(2\delta_s^2))^{\kappa_s/2}\big)$ ($\ell_2$-$\ell_{\kappa_s}$-$\ell_0$)
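As an illustration of the assumptions (not part of the talk), here is a small numpy sketch of two table entries, Welsch and Cauchy, together with their weight functions $\nu_s(t) = \psi_s'(t)/t$; the values of $\lambda$ and $\delta$ are arbitrary:

```python
import numpy as np

lam, delta = 1.0, 0.5  # arbitrary stand-ins for (lambda_s, delta_s)

def psi_welsch(t):
    return lam * (1.0 - np.exp(-t**2 / (2 * delta**2)))

def nu_welsch(t):
    # psi'(t)/t; the t factors cancel, so it extends continuously to t = 0
    return (lam / delta**2) * np.exp(-t**2 / (2 * delta**2))

def psi_cauchy(t):
    return lam * np.log(1.0 + t**2 / delta**2)

def nu_cauchy(t):
    return 2 * lam / (delta**2 + t**2)

t = np.linspace(-5, 5, 1001)
# both psi_s are even, and their weights nu_s(t) are nonnegative and bounded,
# as required by the assumptions above
assert np.allclose(psi_welsch(t), psi_welsch(-t))
assert np.all((nu_welsch(t) >= 0) & (nu_welsch(t) <= lam / delta**2 + 1e-12))
assert np.all((nu_cauchy(t) >= 0) & (nu_cauchy(t) <= 2 * lam / delta**2 + 1e-12))
```

The bound $\nu = \max(\lambda/\delta^2,\, 2\lambda/\delta^2)$ attained here at $t = 0$ is exactly the constant required by assumption 3.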


Optimization recipe

1. Find a tractable surrogate for $F_n$: quadratic tangent majorant

$(\forall h \in \mathbb{R}^N)\quad \Theta_n(h, h_n) = F_n(h_n) + \nabla F_n(h_n)^\top (h - h_n) + \frac{1}{2}(h - h_n)^\top A_n(h_n)(h - h_n),$

where

$A_n(h) = R_n + V_0 + V^\top \mathrm{Diag}\big(b(h)\big) V \in \mathbb{R}^{N \times N},$

$V = [V_1^\top \cdots V_S^\top]^\top \in \mathbb{R}^{P \times N}$, $P = P_1 + \cdots + P_S$, and $b(h) = (b_i(h))_{1 \le i \le P} \in \mathbb{R}^P$ is such that

$(\forall s \in \{1,\ldots,S\})(\forall p \in \{1,\ldots,P_s\})\quad b_{P_1+\cdots+P_{s-1}+p}(h) = \nu_s\big(\|V_s h - v_s\|\big).$
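A small numpy sketch (our illustration, in the simplified case $V_0 = 0$, a single block $S = 1$, and a Cauchy penalty) checks numerically that $\Theta_n(\cdot, h_n)$ built with this $A_n$ majorizes the criterion and is tangent at $h_n$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 5, 7
lam, delta = 1.0, 0.5

R = rng.standard_normal((N, N)); R = R @ R.T      # stand-in for R_n (PSD)
r = rng.standard_normal(N)
V = rng.standard_normal((P, N))
v = rng.standard_normal(P)

psi = lambda t: lam * np.log(1.0 + t**2 / delta**2)   # Cauchy penalty
nu  = lambda t: 2 * lam / (delta**2 + t**2)           # psi'(t)/t

def F(h):
    return 0.5 * h @ R @ h - r @ h + psi(np.linalg.norm(V @ h - v))

def grad_F(h):
    z = V @ h - v
    return R @ h - r + nu(np.linalg.norm(z)) * (V.T @ z)

def theta(h, hn):
    """Quadratic tangent majorant Theta_n(h, hn), with A_n = R + nu * V^T V."""
    A = R + nu(np.linalg.norm(V @ hn - v)) * (V.T @ V)
    d = h - hn
    return F(hn) + grad_F(hn) @ d + 0.5 * d @ A @ d

hn = rng.standard_normal(N)
for _ in range(100):
    h = rng.standard_normal(N)
    assert theta(h, hn) >= F(h) - 1e-9     # majorization holds numerically
assert abs(theta(hn, hn) - F(hn)) < 1e-12  # tangency at hn
```

The majorization follows from the concavity of $\psi_s(\sqrt{\cdot})$ assumed earlier (the classical half-quadratic construction); the sketch only verifies it on random points.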

Optimization recipe

2. Minimize in a subspace:

$(\forall n \in \mathbb{N}^*)\quad h_{n+1} \in \operatorname*{argmin}_{h \in \operatorname{span} D_n} \Theta_n(h, h_n), \quad \text{with } D_n \in \mathbb{R}^{N \times M_n}.$

- $\operatorname{rank}(D_n) = N$ → half-quadratic algorithm
- $M_n$ small → low complexity per iteration.

Typical choice:

$D_n = \begin{cases} [-\nabla F_n(h_n),\, h_n,\, h_n - h_{n-1}] & \text{if } n > 1 \\ [-\nabla F_n(h_1),\, h_1] & \text{if } n = 1 \end{cases}$

→ 3MG algorithm; similar ideas in TwIST, FISTA, ...
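Minimizing the quadratic surrogate over $\operatorname{span} D_n$ reduces to a small $M_n \times M_n$ linear system. A numpy sketch (illustrative only; $A$ and $g$ are random stand-ins for $A_n(h_n)$ and $\nabla F_n(h_n)$, and the third column of $D$ uses an arbitrary previous iterate):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 6, 3

# generic quadratic surrogate: Theta(h) = g^T (h - hn) + 0.5 (h - hn)^T A (h - hn)
A = rng.standard_normal((N, N)); A = A @ A.T + np.eye(N)   # SPD stand-in for A_n(h_n)
g = rng.standard_normal(N)                                 # stand-in for grad F_n(h_n)
hn = rng.standard_normal(N)

# memory-gradient-style subspace [-g, hn, hn - h_prev]
D = np.column_stack([-g, hn, hn - rng.standard_normal(N)])  # N x M

# minimizer over span(D): h = D u with (D^T A D) u = D^T (A hn - g)
B = D.T @ A @ D
u = np.linalg.pinv(B) @ D.T @ (A @ hn - g)
h_next = D @ u

theta = lambda h: g @ (h - hn) + 0.5 * (h - hn) @ A @ (h - hn)
# no other point of span(D) does better
for _ in range(200):
    h = D @ rng.standard_normal(M)
    assert theta(h_next) <= theta(h) + 1e-9
```

Only the small matrix $B = D^\top A D$ ever needs factorizing, which is what makes a small $M_n$ give low per-iteration complexity.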

Optimization recipe

3. Perform recursive updates of the second-order statistics:

$(\forall n \in \mathbb{N}^*)\quad r_n = r_{n-1} + \frac{1}{n}(X_n y_n - r_{n-1}), \qquad R_n = R_{n-1} + \frac{1}{n}(X_n X_n^\top - R_{n-1}).$

Stochastic MM subspace algorithm

$r_0 = 0$, $R_0 = 0$.
Initialize $D_0$, $u_0$; set $h_1 = D_0 u_0$, $D_0^{R} = R_0 D_0 = 0$, $D_0^{V_0} = V_0 D_0$, $D_0^{V} = V D_0$.
For all $n = 1, \ldots$
- $r_n = r_{n-1} + \frac{1}{n}(X_n y_n - r_{n-1})$
- $c_n(h_n) = r_n + v_0 + V^\top \mathrm{Diag}\big(b(h_n)\big) v$, with $v = [v_1^\top \cdots v_S^\top]^\top$
- $D_{n-1}^{A} = (1 - \frac{1}{n}) D_{n-1}^{R} + \frac{1}{n} X_n (X_n^\top D_{n-1}) + D_{n-1}^{V_0} + V^\top \mathrm{Diag}\big(b(h_n)\big) D_{n-1}^{V}$
- $\nabla F_n(h_n) = D_{n-1}^{A} u_{n-1} - c_n(h_n)$
- $R_n = R_{n-1} + \frac{1}{n}(X_n X_n^\top - R_{n-1})$
- Set $D_n$ using $\nabla F_n(h_n)$
- $D_n^{R} = R_n D_n$, $D_n^{V_0} = V_0 D_n$, $D_n^{V} = V D_n$
- $B_n = D_n^\top (D_n^{R} + D_n^{V_0}) + (D_n^{V})^\top \mathrm{Diag}\big(b(h_n)\big) D_n^{V}$
- $u_n = B_n^{\dagger} D_n^\top c_n(h_n)$
- $h_{n+1} = D_n u_n$
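The loop above can be sketched in a strongly simplified setting ($V = I_N$, $v = 0$, $V_0 = 0$, $v_0 = 0$, Cauchy penalty, and the subspace system solved directly rather than through the recursive products $D_n^{R}$, $D_n^{V_0}$, $D_n^{V}$). This is our toy illustration on synthetic data, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
N, Q = 8, 4
h_true = rng.standard_normal(N)

lam, delta = 1e-3, 1.0
nu = lambda t: 2 * lam / (delta**2 + t**2)   # Cauchy weights, psi'(t)/t

r = np.zeros(N)
R = np.zeros((N, N))
h = np.zeros(N)
h_prev = None

for n in range(1, 501):
    # fresh data satisfying the stationarity assumptions
    X = rng.standard_normal((N, Q))
    y = X.T @ h_true + 0.01 * rng.standard_normal(Q)
    r += (X @ y - r) / n
    R += (X @ X.T - R) / n
    b = nu(np.abs(h))                 # curvature weights at h_n (V = I_N)
    A = R + np.diag(b)                # majorant matrix A_n(h_n)
    g = R @ h - r + b * h             # grad F_n(h_n)
    # memory gradient subspace D_n
    cols = [-g, h] if h_prev is None else [-g, h, h - h_prev]
    D = np.column_stack(cols)
    # subspace minimization of the surrogate: u_n solves (D^T A D) u = D^T (A h - g)
    u = np.linalg.pinv(D.T @ A @ D) @ D.T @ (A @ h - g)
    h_prev, h = h, D @ u

assert np.linalg.norm(h - h_true) / np.linalg.norm(h_true) < 0.1
```

With a small penalty weight the stationary point is close to the true filter, so the iterates should approach `h_true` as the statistics $r_n$, $R_n$ converge.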

Computational complexity

If $V = I_N$, number of multiplications per iteration:

$N^2 (4 M_n + Q)/2,$

where $N$ is the length of $h$, $Q$ the column dimension of $X_n$, $M_n$ the subspace dimension, and $N \gg \max\{M_n, M_{n-1}, Q\}$.

If $V \neq I_N$, number of multiplications per iteration:

$N \big( \underbrace{P(M_n + M_{n-1} + 1)}_{\text{upper bound on the complexity induced by linear operators}} + N(4 M_n + Q)/2 \big),$

where $P$ is the row dimension of $V$.

Further complexity reduction is possible by taking into account the structure of $D_n$, e.g. 3MG with $V = I_N$ and $Q = 1$ → complexity of RLS.

Convergence results

Assumptions

Let $(\Omega, \mathcal{F}, \mathsf{P})$ denote the underlying probability space. For every $n \in \mathbb{N}^*$, let $\mathcal{X}_n = \sigma\big((X_k, y_k)_{1 \le k \le n}\big)$ be the sub-sigma-algebra of $\mathcal{F}$ generated by $(X_k, y_k)_{1 \le k \le n}$. We make the following assumptions:
1. $R + V_0$ is a positive definite matrix.
2. $\big((X_n, y_n)\big)_{n \ge 1}$ is a stationary ergodic sequence and, for every $n \in \mathbb{N}^*$, the elements of $X_n$ and the components of $y_n$ have finite fourth-order moments.
3. For every $n \in \mathbb{N}^*$, $\mathsf{E}(\|y_{n+1}\|^2 \mid \mathcal{X}_n) = \rho$, $\mathsf{E}(X_{n+1} y_{n+1} \mid \mathcal{X}_n) = r$, $\mathsf{E}(X_{n+1} X_{n+1}^\top \mid \mathcal{X}_n) = R$.
4. For every $n \in \mathbb{N}^*$, $\{\nabla F_n(h_n), h_n\} \subset \operatorname{span} D_n$.
5. $h_1$ is $\mathcal{X}_1$-measurable and, for every $n \in \mathbb{N}^*$, $D_n$ is $\mathcal{X}_n$-measurable.

Convergence results

Theorem
1. The set of cluster points of $(h_n)_{n \ge 1}$ is almost surely a nonempty compact connected set.
2. Any element of this set is almost surely a critical point of $F$.
3. If the functions $(\psi_s)_{1 \le s \le S}$ are convex, then the sequence $(h_n)_{n \ge 1}$ converges $\mathsf{P}$-a.s. to the unique (global) minimizer of $F$.

Remark: in the deterministic case, the convergence of 3MG to a critical point of the cost function is ensured under the assumption that $F$ satisfies the Kurdyka-Łojasiewicz inequality [Chouzenoux et al., 2013].

Application to 2D filter identification

Observation model: $y = S(h)x + w$, where
- $x \in \mathbb{R}^L$ original image ($L = 1024^2$)
- $y \in \mathbb{R}^L$ degraded image
- $h \in \mathbb{R}^N$ unknown two-dimensional blur kernel ($N = 21^2$)
- $S(h)$ Hankel-block Hankel matrix such that $S(h)x = Xh$
- $w \in \mathbb{R}^L$ realization of white $\mathcal{N}(0, 0.03^2)$ noise (BSNR = 24.8 dB).

[Figures: original image; blurred and noisy image.]

Associated penalized MSE problem:
- $y_n \in \mathbb{R}^Q$ and $X_n \in \mathbb{R}^{Q \times N}$: $Q = 64^2$ lines of $y$ and $X$
- isotropic penalization on the gradient of the blur kernel: $S = N$, $(\forall s \in \{1,\ldots,S\})$ $P_s = 2$, with $\psi_s \colon u \mapsto \lambda \sqrt{1 + u^2/\delta^2}$, $(\lambda, \delta) \in ]0,+\infty[^2$.
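The identity $S(h)x = Xh$ expresses that convolution is linear in the kernel: each row of $X$ is a vectorized patch of $x$. A toy numpy sketch (our illustration; correlation form, boundaries ignored, and the sizes are small stand-ins for the $1024^2$ image and $21^2$ kernel):

```python
import numpy as np

rng = np.random.default_rng(4)
Lx, K = 8, 3                      # tiny stand-ins for the image and kernel sizes
x = rng.standard_normal((Lx, Lx))
h = rng.standard_normal((K, K))

# 'valid' 2D correlation y = S(h) x, written directly
out = Lx - K + 1
y = np.zeros((out, out))
for i in range(out):
    for j in range(out):
        y[i, j] = np.sum(x[i:i+K, j:j+K] * h)   # linear in h

# the same y as X @ vec(h): each row of X is a vectorized patch of x
X = np.array([x[i:i+K, j:j+K].ravel()
              for i in range(out) for j in range(out)])
assert np.allclose(y.ravel(), X @ h.ravel())
```

This is what allows the blocks of $Q = 64^2$ lines of $y$ and $X$ to be fed to the stochastic algorithm as the pairs $(y_n, X_n)$.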

Application to 2D filter identification

[Figures: original blur kernel; estimated blur kernel, relative error 0.087.]

Application to 2D filter identification

[Figure: relative error (log scale, from $10^2$ down to about $10^{-1}$) versus time (s.), over 0-600 s.]

Comparison of the stochastic 3MG algorithm, the SGD algorithm with decreasing stepsize $n^{-1/2}$, and the SAG/RDA algorithms with constant stepsizes.

Open problems

- Kurdyka-Łojasiewicz inequality for a sequence of functions $(F_n)_{n \ge 1}$ converging to a function $F$?
- Quantification of the convergence rate?
- Extension to a non-quadratic cost?
- Extension to nonsmooth minimization? (connections with Silvia's work?)