A Stochastic 3MG Algorithm with Application to 2D Filter Identification


1 A Stochastic 3MG Algorithm with Application to 2D Filter Identification
Emilie Chouzenoux 1, Jean-Christophe Pesquet 1, and Anisia Florescu 2
1 Laboratoire d'Informatique Gaspard Monge, CNRS, Univ. Paris-Est, France
2 Dunărea de Jos University, Electronics and Telecommunications Dept., Galaţi, România
Sestri Levante, 10 September

2 Problem
We want to minimize $F(h)$ over $h \in \mathbb{R}^N$, where
$$(\forall h \in \mathbb{R}^N)\quad F(h) = \frac{1}{2}\,\mathbb{E}\big(\|y_n - X_n^\top h\|^2\big) + \Psi(h),$$
with $(X_n)_{n\ge 1}$ random matrices in $\mathbb{R}^{N\times Q}$, $(y_n)_{n\ge 1}$ random vectors in $\mathbb{R}^{Q}$ satisfying
$$(\forall n \in \mathbb{N}^*)\quad \mathbb{E}(\|y_n\|^2) = \rho,\qquad \mathbb{E}(X_n y_n) = r,\qquad \mathbb{E}(X_n X_n^\top) = R,$$
and $\Psi\colon \mathbb{R}^N \to \mathbb{R}$ a regularization function.

3 Problem
We want to minimize $F(h)$ over $h \in \mathbb{R}^N$,
$$(\forall h \in \mathbb{R}^N)\quad F(h) = \frac{1}{2}\,\mathbb{E}\big(\|y_n - X_n^\top h\|^2\big) + \Psi(h).$$
Numerous applications:
- supervised classification
- inverse problems
- system identification, channel equalization
- linear prediction/interpolation
- echo cancellation, interference removal
- ...

4 Objective
Solve the problem when the second-order statistics of $(X_n, y_n)_{n\ge 1}$ are estimated on-line.
- Huge literature on the topic: stochastic approximation algorithms in machine learning; sparse adaptive filtering methods.
- Employ a Majorize-Minimize strategy allowing the use of a wide range of regularization functions $\Psi$; few works address the stochastic case [Mairal, 2013].
- Develop fast algorithms by resorting to subspace acceleration: 3MG algorithm for batch optimization [Chouzenoux et al., 2011]; connections with stochastic extensions of BFGS [Byrd et al., 2014].

5 Stochastic approximation of the criterion
At iteration $n \in \mathbb{N}^*$,
$$(\forall h \in \mathbb{R}^N)\quad F_n(h) = \frac{1}{2n}\sum_{k=1}^n \|y_k - X_k^\top h\|^2 + \Psi(h) = \frac{1}{2}\rho_n - r_n^\top h + \frac{1}{2} h^\top R_n h + \Psi(h),$$
where
$$\rho_n = \frac{1}{n}\sum_{k=1}^n \|y_k\|^2,\qquad r_n = \frac{1}{n}\sum_{k=1}^n X_k y_k,\qquad R_n = \frac{1}{n}\sum_{k=1}^n X_k X_k^\top.$$
- Possibility to introduce a forgetting factor.
- Related approaches when $\Psi \propto \|\cdot\|_1$: online time-weighted LASSO [Angelosante et al., 2010]; sparse RLS [Babadi et al., 2010].
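To make this concrete, here is a minimal NumPy sketch (hypothetical helper names, not the authors' code) of how the sufficient statistics $\rho_n$, $r_n$, $R_n$ can be accumulated recursively, after which $F_n$ is evaluated for any $h$ without revisiting past samples; `psi` stands for whatever regularization term $\Psi$ is used:

```python
import numpy as np

def update_stats(rho, r, R, X, y, n):
    """Recursive update of the running statistics after observing (X_n, y_n):
    rho ~ (1/n) sum ||y_k||^2,  r ~ (1/n) sum X_k y_k,  R ~ (1/n) sum X_k X_k^T."""
    rho = rho + (np.dot(y, y) - rho) / n
    r = r + (X @ y - r) / n
    R = R + (X @ X.T - R) / n
    return rho, r, R

def F_n(h, rho, r, R, psi):
    """Evaluate F_n(h) = rho_n/2 - r_n^T h + h^T R_n h / 2 + Psi(h)."""
    return 0.5 * rho - r @ h + 0.5 * h @ R @ h + psi(h)
```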

6 Form of the regularization function
$$(\forall h \in \mathbb{R}^N)\quad \Psi(h) = \underbrace{\tfrac{1}{2}\, h^\top V_0 h - v_0^\top h}_{\text{elastic net penalization}} + \sum_{s=1}^{S} \psi_s\big(\|V_s h - v_s\|\big),$$
where $v_0 \in \mathbb{R}^N$, $V_0 \in \mathbb{R}^{N\times N}$ symmetric positive semi-definite, and, for every $s \in \{1,\dots,S\}$, $v_s \in \mathbb{R}^{P_s}$, $V_s \in \mathbb{R}^{P_s\times N}$, $\psi_s\colon \mathbb{R} \to \mathbb{R}$.
→ ability to take into account linear operators.

7 Form of the regularization function (continued)
Assumptions on $(\psi_s)_{1\le s\le S}$:
1. For every $s \in \{1,\dots,S\}$, $\psi_s$ is an even lower-bounded function, continuously differentiable, with $\lim_{t\to 0,\, t\neq 0} \dot\psi_s(t)/t \in \mathbb{R}$, where $\dot\psi_s$ denotes the derivative of $\psi_s$.
2. For every $s \in \{1,\dots,S\}$, $\psi_s(\sqrt{\cdot})$ is concave on $[0,+\infty[$.
3. There exists $\nu \in [0,+\infty[$ such that $(\forall s \in \{1,\dots,S\})\,(\forall t \in [0,+\infty[)\ \ 0 \le \nu_s(t) \le \nu$, where $\nu_s(t) = \dot\psi_s(t)/t$.

8 Form of the regularization function (continued)
Examples of functions $(\psi_s)_{1\le s\le S}$, with $(\lambda_s,\delta_s) \in\ ]0,+\infty[^2$ and $\kappa_s \in [1,2]$:

$\lambda_s^{-1}\psi_s(t)$ | Type | Name
$|t| - \delta_s \log(|t|/\delta_s + 1)$ | $\ell_2-\ell_1$ |
$t^2$ if $|t| < \delta_s$; $2\delta_s |t| - \delta_s^2$ otherwise | $\ell_2-\ell_1$ | Huber
$\log(\cosh t)$ | $\ell_2-\ell_1$ | Green
$(1 + t^2/\delta_s^2)^{\kappa_s/2} - 1$ | $\ell_2-\ell_{\kappa_s}$ |
$1 - \exp(-t^2/(2\delta_s^2))$ | $\ell_2-\ell_0$ | Welsch
$t^2/(2\delta_s^2 + t^2)$ | $\ell_2-\ell_0$ | Geman-McClure
$1 - (1 - t^2/(6\delta_s^2))^3$ if $|t| \le \sqrt{6}\,\delta_s$; $1$ otherwise | $\ell_2-\ell_0$ | Tukey biweight
$\tanh(t^2/(2\delta_s^2))$ | $\ell_2-\ell_0$ | Hyperbolic tangent
$\log(1 + t^2/\delta_s^2)$ | $\ell_2-\log$ | Cauchy
$1 - \exp\big(1 - (1 + t^2/(2\delta_s^2))^{\kappa_s/2}\big)$ | $\ell_2-\ell_{\kappa_s}-\ell_0$ | Chouzenoux
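As an illustration, a sketch (with the $\lambda_s$ factor included; hypothetical names) of two of these potentials together with the weight $\nu_s(t) = \dot\psi_s(t)/t$ that the majorization step relies on. The first is the $\ell_2-\ell_1$ row $(1+t^2/\delta^2)^{\kappa/2}-1$ with $\kappa = 1$, the second is the Welsch row:

```python
import numpy as np

def psi_l2l1(t, lam, delta):
    """Smooth l2-l1 potential: lam * (sqrt(1 + t^2/delta^2) - 1)."""
    return lam * (np.sqrt(1.0 + (t / delta) ** 2) - 1.0)

def nu_l2l1(t, lam, delta):
    """Weight psi'(t)/t = lam / (delta^2 sqrt(1 + t^2/delta^2));
    finite at t = 0 and bounded by lam/delta^2, as Assumption 3 requires."""
    return lam / (delta ** 2 * np.sqrt(1.0 + (t / delta) ** 2))

def psi_welsch(t, lam, delta):
    """Welsch (l2-l0) potential: lam * (1 - exp(-t^2/(2 delta^2)))."""
    return lam * (1.0 - np.exp(-t ** 2 / (2.0 * delta ** 2)))

def nu_welsch(t, lam, delta):
    """Weight psi'(t)/t = (lam/delta^2) exp(-t^2/(2 delta^2))."""
    return (lam / delta ** 2) * np.exp(-t ** 2 / (2.0 * delta ** 2))
```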

9 Optimization recipe

10 Optimization recipe
1 Find a tractable surrogate for $F_n$: quadratic tangent majorant
$$(\forall h \in \mathbb{R}^N)\quad \Theta_n(h,h_n) = F_n(h_n) + \nabla F_n(h_n)^\top (h - h_n) + \frac{1}{2}(h - h_n)^\top A_n(h_n)(h - h_n),$$
where
$$A_n(h) = R_n + V_0 + V^\top \mathrm{Diag}\big(b(h)\big)\, V \in \mathbb{R}^{N\times N},$$
$V = [V_1^\top \dots V_S^\top]^\top \in \mathbb{R}^{P\times N}$ with $P = P_1 + \dots + P_S$, and $b(h) = \big(b_i(h)\big)_{1\le i\le P} \in \mathbb{R}^P$ is such that
$$(\forall s \in \{1,\dots,S\})\,(\forall p \in \{1,\dots,P_s\})\quad b_{P_1+\dots+P_{s-1}+p}(h) = \nu_s\big(\|V_s h - v_s\|\big).$$
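A direct, unoptimized transcription of this curvature matrix (hypothetical names; the blocks $V_s$ are assumed stacked into `V` as in the slide):

```python
import numpy as np

def curvature(R_n, V0, V, v, nu_list, sizes, h):
    """Form A_n(h) = R_n + V0 + V^T Diag(b(h)) V.

    V is the P x N stack of the blocks V_s, v the stacked v_s,
    nu_list[s] the weight function nu_s, sizes[s] = P_s.
    """
    b = np.empty(V.shape[0])
    offset = 0
    for s, Ps in enumerate(sizes):
        res = V[offset:offset + Ps] @ h - v[offset:offset + Ps]
        b[offset:offset + Ps] = nu_list[s](np.linalg.norm(res))
        offset += Ps
    # b[:, None] * V scales row i of V by b_i, so this is V^T Diag(b) V
    return R_n + V0 + V.T @ (b[:, None] * V)
```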

11 Optimization recipe
1 Find a tractable surrogate for $F_n$
2 Minimize in a subspace:
$$(\forall n \in \mathbb{N}^*)\quad h_{n+1} \in \operatorname*{argmin}_{h \in \operatorname{span} D_n} \Theta_n(h,h_n),\qquad D_n \in \mathbb{R}^{N\times M_n}.$$
- $\operatorname{rank}(D_n) = N$ → half-quadratic algorithm
- $M_n$ small → low complexity per iteration
Typical choice (see the sketch below):
$$D_n = \begin{cases} \big[-\nabla F_n(h_n),\ h_n,\ h_n - h_{n-1}\big] & \text{if } n > 1,\\[2pt] \big[-\nabla F_n(h_1),\ h_1\big] & \text{if } n = 1\end{cases}$$
→ 3MG algorithm; similar ideas in TwIST, FISTA, ...
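In code, the memory-gradient choice of $D_n$ might look as follows (a sketch; `grad` is assumed to hold $\nabla F_n(h_n)$):

```python
import numpy as np

def memory_gradient_subspace(grad, h, h_prev=None):
    """Columns of D_n: [-grad F_n(h_n), h_n, h_n - h_{n-1}];
    the last column exists only once a previous iterate is available."""
    if h_prev is None:
        return np.column_stack([-grad, h])
    return np.column_stack([-grad, h, h - h_prev])
```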

12 Optimization recipe
1 Find a tractable surrogate for $F_n$
2 Minimize in a subspace
3 Perform recursive updates of the second-order statistics:
$$(\forall n \in \mathbb{N}^*)\quad r_n = r_{n-1} + \frac{1}{n}\big(X_n y_n - r_{n-1}\big),\qquad R_n = R_{n-1} + \frac{1}{n}\big(X_n X_n^\top - R_{n-1}\big).$$

13 Stochastic MM subspace algorithm
$r_0 = 0$, $R_0 = 0$. Initialize $D_0$, $u_0$; set $h_1 = D_0 u_0$, $D_0^{R} = 0$, $D_0^{V_0} = V_0 D_0$, $D_0^{V} = V D_0$.
For all $n = 1, 2, \dots$:
$$r_n = r_{n-1} + \tfrac{1}{n}\big(X_n y_n - r_{n-1}\big)$$
$$c_n(h_n) = r_n + v_0 + V^\top \mathrm{Diag}\big(b(h_n)\big)\, v$$
$$D_{n-1}^{A_n} = \big(1 - \tfrac{1}{n}\big) D_{n-1}^{R} + \tfrac{1}{n}\, X_n \big(X_n^\top D_{n-1}\big) + D_{n-1}^{V_0} + V^\top \mathrm{Diag}\big(b(h_n)\big)\, D_{n-1}^{V}$$
$$\nabla F_n(h_n) = D_{n-1}^{A_n} u_{n-1} - c_n(h_n)$$
$$R_n = R_{n-1} + \tfrac{1}{n}\big(X_n X_n^\top - R_{n-1}\big)$$
Set $D_n$ using $\nabla F_n(h_n)$
$$D_n^{R} = R_n D_n,\qquad D_n^{V_0} = V_0 D_n,\qquad D_n^{V} = V D_n$$
$$B_n = D_n^\top \big(D_n^{R} + D_n^{V_0}\big) + \big(D_n^{V}\big)^\top \mathrm{Diag}\big(b(h_n)\big)\, D_n^{V}$$
$$u_n = B_n^\dagger D_n^\top c_n(h_n)$$
$$h_{n+1} = D_n u_n$$
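Putting the pieces together, a simplified end-to-end sketch for the special case $V = I_N$, $v = 0$ (so the recursive products $D^R$, $D^{V_0}$, $D^V$ of the slide are not needed and $A_n D_n$ is recomputed directly). All names are hypothetical, `stream` yields $(X_n, y_n)$ pairs, and `nu` is assumed vectorized; this is not the authors' implementation:

```python
import numpy as np

def stochastic_3mg(stream, N, nu, v0=None, V0=None, n_iter=100):
    """Stochastic MM subspace iteration, simplified to V = I_N and v = 0,
    so Psi(h) = h^T V0 h / 2 - v0^T h + sum_i psi(|h_i|)."""
    r = np.zeros(N)
    R = np.zeros((N, N))
    v0 = np.zeros(N) if v0 is None else v0
    V0 = np.zeros((N, N)) if V0 is None else V0
    h, h_prev = np.zeros(N), None
    for n, (X, y) in enumerate(stream, start=1):
        r = r + (X @ y - r) / n                  # recursive statistics
        R = R + (X @ X.T - R) / n
        b = nu(np.abs(h))                        # b(h_n) for V = I_N, v = 0
        A = R + V0 + np.diag(b)                  # majorant curvature A_n(h_n)
        c = r + v0                               # then grad F_n(h_n) = A h - c
        g = A @ h - c
        D = (np.column_stack([-g, h]) if h_prev is None
             else np.column_stack([-g, h, h - h_prev]))
        B = D.T @ A @ D                          # B_n = D_n^T A_n(h_n) D_n
        u = np.linalg.pinv(B) @ (D.T @ c)        # minimize majorant on span(D_n)
        h_prev, h = h, D @ u
        if n >= n_iter:
            break
    return h
```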

14 Computational complexity
If $V = I_N$, number of multiplications per iteration: $N^2(4M_n + Q)/2$, where $N$ is the length of $h$, $Q$ the column dimension of $X_n$, $M_n$ the subspace dimension, and $N \gg \max\{M_n, M_{n-1}, Q\}$.
If $V \neq I_N$, number of multiplications per iteration:
$$N\Big(\underbrace{P\,(M_n + M_{n-1} + 1)}_{\text{upper bound on the complexity induced by linear operators}} + N(4M_n + Q)/2\Big),$$
where $P$ is the row dimension of $V$.
Further complexity reduction is possible by taking into account the structure of $D_n$: e.g. 3MG with $V = I_N$ and $Q = 1$ has the complexity of RLS.
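For a rough sense of scale (illustrative only, plugging in the dimensions of the experiment below: $N = 21^2$, $Q = 64^2$, $M_n = 3$), the two counts can be evaluated directly:

```python
def mults_identity(N, M, Q):
    """V = I_N case: N^2 (4 M_n + Q) / 2 multiplications per iteration."""
    return N ** 2 * (4 * M + Q) // 2

def mults_general(N, P, M, M_prev, Q):
    """General V: N * (P (M_n + M_{n-1} + 1) + N (4 M_n + Q) / 2)."""
    return N * (P * (M + M_prev + 1) + N * (4 * M + Q) // 2)

# e.g. N = 441 (21 x 21 kernel), Q = 4096, M_n = 3
print(mults_identity(441, 3, 4096))
```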

17 Convergence results
Assumptions. Let $(\Omega,\mathcal{F},\mathsf{P})$ denote the underlying probability space. For every $n \in \mathbb{N}^*$, let $\mathcal{X}_n = \sigma\big((X_k,y_k)_{1\le k\le n}\big)$ be the sub-sigma-algebra of $\mathcal{F}$ generated by $(X_k,y_k)_{1\le k\le n}$. We make the following assumptions:
1. $R + V_0$ is a positive definite matrix.
2. $\big((X_n,y_n)\big)_{n\ge 1}$ is a stationary ergodic sequence and, for every $n \in \mathbb{N}^*$, the elements of $X_n$ and the components of $y_n$ have finite fourth-order moments.
3. For every $n \in \mathbb{N}^*$,
$$\mathbb{E}\big(\|y_{n+1}\|^2 \mid \mathcal{X}_n\big) = \rho,\qquad \mathbb{E}\big(X_{n+1} y_{n+1} \mid \mathcal{X}_n\big) = r,\qquad \mathbb{E}\big(X_{n+1} X_{n+1}^\top \mid \mathcal{X}_n\big) = R.$$
4. For every $n \in \mathbb{N}^*$, $\{-\nabla F_n(h_n),\, h_n\} \subset \operatorname{span} D_n$.
5. $h_1$ is $\mathcal{X}_1$-measurable and, for every $n \in \mathbb{N}^*$, $D_n$ is $\mathcal{X}_n$-measurable.

18 Convergence results
Theorem.
1. The set of cluster points of $(h_n)_{n\ge 1}$ is almost surely a nonempty compact connected set.
2. Any element of this set is almost surely a critical point of $F$.
3. If the functions $(\psi_s)_{1\le s\le S}$ are convex, then the sequence $(h_n)_{n\ge 1}$ converges $\mathsf{P}$-a.s. to the unique (global) minimizer of $F$.
Remark: in the deterministic case, the convergence of 3MG to a critical point of the cost function is ensured under the assumption that $F$ satisfies the Kurdyka-Łojasiewicz inequality [Chouzenoux et al., 2013].

20 Application to 2D filter identification
Observation model: $y = S(h)x + w$, with $x \in \mathbb{R}^L$ the original image, $y \in \mathbb{R}^L$ the degraded image, and $h \in \mathbb{R}^N$ the unknown two-dimensional blur kernel ($N = 21^2$).
[Figures: original image (left); blurred and noisy image (right).]

21 Application to 2D filter identification
Observation model $y = S(h)x + w$:
- $x \in \mathbb{R}^L$ original image, $y \in \mathbb{R}^L$ degraded image
- $h \in \mathbb{R}^N$ unknown two-dimensional blur kernel ($N = 21^2$)
- $S(h)$ Hankel-block Hankel matrix such that $S(h)x = Xh$
- $w \in \mathbb{R}^L$ realization of zero-mean white Gaussian noise (BSNR = 24.8 dB).

22 Application to 2D filter identification
Observation model $y = S(h)x + w$ as on the previous slide.
Associated penalized MSE problem:
- $y_n \in \mathbb{R}^Q$ and $X_n \in \mathbb{R}^{Q\times N}$: $Q = 64^2$ rows of $y$ and $X$
- isotropic penalization on the gradient of the blur kernel: $S = N$, $(\forall s \in \{1,\dots,S\})\ P_s = 2$, with
$$\psi_s\colon u \mapsto \lambda\sqrt{1 + u^2/\delta^2},\qquad (\lambda,\delta) \in\ ]0,+\infty[^2.$$
A sketch of this penalty construction is given below.
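A possible construction of these penalty blocks, assuming each $V_s$ stacks the horizontal and vertical first differences of the $21\times 21$ kernel at pixel $s$ (a hypothetical sketch, since the slides do not spell out the discretization):

```python
import numpy as np

def gradient_blocks(K=21):
    """For each kernel pixel s, build V_s in R^{2 x K^2} extracting the
    horizontal and vertical first differences at s, so that ||V_s h|| is the
    isotropic gradient magnitude penalized by psi_s(u) = lam*sqrt(1+u^2/d^2)."""
    N = K * K
    blocks = []
    for i in range(K):
        for j in range(K):
            Vs = np.zeros((2, N))
            s = i * K + j
            if j + 1 < K:                    # horizontal difference
                Vs[0, s], Vs[0, s + 1] = -1.0, 1.0
            if i + 1 < K:                    # vertical difference
                Vs[1, s], Vs[1, s + K] = -1.0, 1.0
            blocks.append(Vs)
    return blocks  # stack with np.vstack(blocks) to form V in R^{2N x N}
```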

23 Application to 2D filter identification
[Figures: original blur kernel; estimated blur kernel, with its relative estimation error.]

24 Application to 2D filter identification
[Figure: relative error versus time (s), logarithmic scale.]
Comparison of the stochastic 3MG algorithm, the SGD algorithm with decreasing stepsize $\propto n^{-1/2}$, and the SAG/RDA algorithms with constant stepsizes.

25 Open problems
- Kurdyka-Łojasiewicz inequality for a sequence of functions $(F_n)_{n\ge 1}$ converging to a function $F$?
- Quantification of the convergence rate?
- Extension to a non-quadratic cost?
- Extension to nonsmooth minimization? (connections with Silvia's work?)
