On sequence kernels for SVM classification of sets of vectors: application to speaker verification

Transcription

1 On sequence kernels for SVM classification of sets of vectors: application to speaker verification
Major part of the Ph.D. work of Jérôme Louradour
In collaboration with Francis Bach (ARMINES) within E-TEAM 11

2 Introduction: Text-independent speaker verification
- A binary classification task: determine whether a speech sequence has been uttered by a target speaker
- Classical approach
[Diagram: training data (target speaker; impostors, "world") and the test sequence go through pre-processing; training produces the classifier (GMM, UBM); the test sequence is scored, then a decision is taken: target speaker or impostor]

3 Introduction: Text-independent speaker verification (continued)
- Classical approach: UBM-GMM system
- Probabilistic GMM modeling (generative)

4 Introduction: Text-independent speaker verification (continued)
→ Motivation: apply SVMs, which are powerful for binary classification

5 Introduction: Support Vector Machines (SVM)
+ Good theoretical foundations
+ Well-mastered learning algorithms
+ Good results in static data classification
In speech processing...
- Extension to dynamic data is not easy
- Size of the training corpus: time and memory consuming
- Poor results when SVMs are applied at the frame level

6 Introduction: Support Vector Machines (SVM) (continued)
→ Design and study of sequence kernels for speaker verification

7 Sequence kernels: Outline
1 Sequence kernels
2 FSNS sequence kernels
3 Experimental evaluation of sequence kernels

8 Sequence kernels: Basics on kernels
- A kernel is a similarity measure
- Mercer property: symmetric, positive definite $\Rightarrow k(x, y) = \phi(x)^\top \phi(y)$
- $\phi$: expansion in a feature space $\mathbb{R}^D$ (dimension $D \le +\infty$)
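As a small numerical illustration of the Mercer property, the sketch below builds the Gram matrix of a Gaussian kernel on random vectors and checks that it is symmetric with non-negative eigenvalues (up to round-off). The names `gaussian_kernel` and `rho` and the random data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def gaussian_kernel(X, Y, rho=1.0):
    """Matrix of k(x, y) = exp(-||x - y||^2 / (2 rho^2)) between rows of X and rows of Y."""
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * rho ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))      # 50 hypothetical vectors in R^3
K = gaussian_kernel(X, X)         # Gram matrix

is_symmetric = np.allclose(K, K.T)
# Smallest eigenvalue of the symmetrized Gram matrix; non-negative up to round-off
min_eig = np.linalg.eigvalsh((K + K.T) / 2.0).min()
print(is_symmetric, min_eig > -1e-10)
```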

9 Sequence kernels: Basics
Three families of sequence kernels
1 Mutual Information kernels: based on an a priori data distribution
2 Kernels between probability densities: sequence distribution
3 Combination of vector kernels: function of kernel values between inter- and intra-sequence elements

10 Sequence kernels: Basics (continued)
Definition of a sequence: variable-length set of acoustic vectors
- same speaker
- same recording session

11 Sequence kernels: Sequence kernels used in speaker verification
Mutual Information kernels
- Exploit the a priori (generative) UBM, whose parameters $\theta_0$ are estimated on unlabeled data
Fisher kernel [Jaakkola and Haussler, 1998]
- Approximation of a Mutual Information kernel [Seeger, 2002]
$$\kappa(X, Y) = \phi(X)^\top S^{-1} \phi(Y)$$
- $\phi(X) = \nabla_\theta \log p(X \mid \theta)\big|_{\theta = \theta_0}$ (Fisher expansion)
- $S$: second moments of $\phi$ (Fisher information matrix)

12 Sequence kernels: Sequence kernels used in speaker verification (continued)
[Figure: illustration with a 2-Gaussian GMM]

13 Sequence kernels: Sequence kernels used in speaker verification (continued)
[Figure: distance between Fisher expansions]
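The Fisher expansion can be sketched as follows for a diagonal-covariance GMM. This is a minimal illustration under several assumptions that the slides do not state: the gradient is taken with respect to the component means only, the sequence-level expansion is the average of frame-level scores, and $S$ is estimated as the empirical second-moment matrix on some background frames `B`. All function names and toy values are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(x, w, mu, var):
    """Return log p(x | theta) and the per-component log terms for one frame x."""
    log_comp = (np.log(w)
                - 0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)
                - 0.5 * np.sum((x - mu) ** 2 / var, axis=1))
    m = log_comp.max()                                  # log-sum-exp for stability
    return m + np.log(np.exp(log_comp - m).sum()), log_comp

def fisher_score_means(x, w, mu, var):
    """Gradient of log p(x | theta) with respect to the means, flattened."""
    ll, log_comp = gmm_log_likelihood(x, w, mu, var)
    gamma = np.exp(log_comp - ll)                       # component posteriors
    return (gamma[:, None] * (x - mu) / var).ravel()

def fisher_kernel(X, Y, B, w, mu, var):
    """kappa(X, Y) = phi(X)^T S^{-1} phi(Y), with phi averaged over the frames."""
    score = lambda Z: np.array([fisher_score_means(z, w, mu, var) for z in Z])
    scores_B = score(B)
    S = scores_B.T @ scores_B / len(B)                  # empirical second moments
    return score(X).mean(axis=0) @ np.linalg.solve(S, score(Y).mean(axis=0))

# Toy 2-component "UBM" in R^2 and random frames (illustrative values only)
rng = np.random.default_rng(0)
w = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [2.0, 1.0]])
var = np.array([[1.0, 0.5], [0.8, 1.2]])
B = rng.normal(size=(500, 2)) + 1.0
X = rng.normal(size=(40, 2))
Y = rng.normal(size=(30, 2)) + 2.0
kappa_xy = fisher_kernel(X, Y, B, w, mu, var)
```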

14 Sequence kernels: Sequence kernels used in speaker verification
Kernels between probability densities
- Generative learning: $X, Y \longrightarrow p_X, p_Y$
Probability product kernels [Jebara and Kondor, 2003]
$$\kappa(X, Y) = \int p_X(z)^q \, p_Y(z)^q \, dz$$
- Analytic form for GMMs with degree $q = 1$ [Lyu, 2005]
- Spherical normalization for more robustness: $\hat\kappa(X, Y) = \kappa(X, Y) / \sqrt{\kappa(X, X)\, \kappa(Y, Y)}$
Exponential embedding of divergences
- Analogue of the Gaussian kernel: $\kappa(X, Y) = \exp\left(-D(p_X, p_Y)^2 / (2\rho^2)\right)$
- $D$: distance between GMMs, approximation of the KL divergence [Do, 2003]
→ GMM supervectors approach [Campbell et al., 2006]
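For degree $q = 1$ the probability product kernel between two GMMs has a closed form, because the integral of the product of two Gaussians is itself a Gaussian density: $\int \mathcal{N}(z; \mu_1, \Sigma_1)\, \mathcal{N}(z; \mu_2, \Sigma_2)\, dz = \mathcal{N}(\mu_1; \mu_2, \Sigma_1 + \Sigma_2)$. The sketch below applies this to diagonal-covariance GMMs and adds the spherical normalization. The GMM parameters and function names are illustrative assumptions.

```python
import numpy as np

def gauss_overlap(mu1, var1, mu2, var2):
    """Integral of the product of two diagonal Gaussians: N(mu1; mu2, var1 + var2)."""
    var = var1 + var2
    return np.exp(-0.5 * np.sum(np.log(2.0 * np.pi * var) + (mu1 - mu2) ** 2 / var))

def ppk_gmm(gmm_x, gmm_y):
    """Probability product kernel (q = 1) between two GMMs given as (weights, means, variances)."""
    (wx, mux, varx), (wy, muy, vary) = gmm_x, gmm_y
    return sum(a * b * gauss_overlap(m, v, n, u)
               for a, m, v in zip(wx, mux, varx)
               for b, n, u in zip(wy, muy, vary))

def ppk_normalized(gmm_x, gmm_y):
    """Spherical normalization: kappa(X, Y) / sqrt(kappa(X, X) kappa(Y, Y))."""
    return ppk_gmm(gmm_x, gmm_y) / np.sqrt(ppk_gmm(gmm_x, gmm_x) * ppk_gmm(gmm_y, gmm_y))

# Two hypothetical one-dimensional GMMs, standing for p_X and p_Y
p_X = (np.array([0.4, 0.6]), np.array([[0.0], [2.0]]), np.array([[1.0], [0.5]]))
p_Y = (np.array([0.7, 0.3]), np.array([[0.5], [3.0]]), np.array([[0.8], [1.5]]))
print(ppk_gmm(p_X, p_Y), ppk_normalized(p_X, p_Y))
```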

15 Sequence kernels: Sequence kernels used in speaker verification
Combination of vector kernels
- Sequences of vectors: $X = \{x_t\}_{t = 1 \dots T_X}$, $Y = \{y_t\}_{t = 1 \dots T_Y}$
- Similarity between vectors, $k(x, y) = \phi(x)^\top \phi(y)$ → similarity between sets of vectors, $\kappa(X, Y)$
- $\kappa(X, Y)$: Mercer kernel, function of $k(x_t, y_{t'})$, $k(x_t, x_{t'})$, $k(y_t, y_{t'})$, ...
Simple example:
$$\kappa(X, Y) = \frac{1}{T_X T_Y} \sum_{t=1}^{T_X} \sum_{t'=1}^{T_Y} k(x_t, y_{t'})$$
- Complexity $O(T^2)$ for each kernel computation
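The simple example above, the average of all pairwise vector-kernel values, can be written in a few lines. This is only a sketch on random data; `mean_sequence_kernel` and the two vector kernels are illustrative names.

```python
import numpy as np

def linear_kernel(X, Y):
    """k(x, y) = x^T y between rows of X and rows of Y."""
    return X @ Y.T

def gaussian_kernel(X, Y, rho=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 rho^2)) between rows of X and rows of Y."""
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * rho ** 2))

def mean_sequence_kernel(X, Y, k):
    """kappa(X, Y) = 1/(T_X T_Y) * sum over t, t' of k(x_t, y_t'); cost O(T_X * T_Y)."""
    return k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))     # T_X = 120 hypothetical acoustic vectors
Y = rng.normal(size=(80, 4))      # T_Y = 80
kappa_lin = mean_sequence_kernel(X, Y, linear_kernel)
kappa_rbf = mean_sequence_kernel(X, Y, gaussian_kernel)
```

With the linear kernel the double sum collapses to the dot product of the two sequence means, which shows that this kernel only compares the average expansions of the two sequences.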

16 Sequence kernels: Sequence kernels used in speaker verification
Combination of vector kernels: GLDS kernel [Campbell, 2001]
$$\kappa(X, Y) = \Big(\frac{1}{T_X} \sum_t \phi_q(x_t)\Big)^\top S_B^{-1} \Big(\frac{1}{T_Y} \sum_t \phi_q(y_t)\Big)$$
- $\phi_q$: polynomial expansion ($\mathbb{R}^d \to \mathbb{R}^D$) into the $D = \frac{(q+d)!}{q!\, d!}$ monomials of degree up to $q$
- $S_B$: matrix of second moments of $\phi_q$, estimated on $B$
Normalization by $S_B$:
- The kernel amounts to scoring $X$ with a discriminative model learned on $Y$
- Same amplitude for each component of the feature space
Explicit expansion of the data:
+ High efficiency at test time (linear SVM model)
- Impossible to use expansions $\phi_q$ of high or infinite dimension (in practice, maximum degree $q = 3$)

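A minimal sketch of the GLDS kernel, assuming a full (non-diagonal) second-moment matrix estimated on random background data. The helper `poly_expand` enumerates all monomials of degree up to $q$, so its output dimension can be compared with $D = (q+d)!/(q!\,d!)$. Names and data are illustrative assumptions.

```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb

def poly_expand(X, q):
    """Explicit expansion phi_q: all monomials of degree <= q of each row of X."""
    n, d = X.shape
    cols = [np.ones(n)]                                    # degree-0 monomial
    for deg in range(1, q + 1):
        for idx in combinations_with_replacement(range(d), deg):
            cols.append(np.prod(X[:, list(idx)], axis=1))  # one monomial per index multiset
    return np.stack(cols, axis=1)

def glds_kernel(X, Y, B, q=2):
    """kappa(X, Y) = mean(phi_q(X))^T S_B^{-1} mean(phi_q(Y))."""
    phi_B = poly_expand(B, q)
    S_B = phi_B.T @ phi_B / len(B)                         # second moments on background data
    mean_X = poly_expand(X, q).mean(axis=0)
    mean_Y = poly_expand(Y, q).mean(axis=0)
    return mean_X @ np.linalg.solve(S_B, mean_Y)

rng = np.random.default_rng(0)
d, q = 3, 2
B = rng.normal(size=(1000, d))           # hypothetical background vectors
X = rng.normal(size=(200, d)) + 0.5
Y = rng.normal(size=(150, d)) - 0.5
D = poly_expand(B, q).shape[1]           # dimension of the expansion
kappa = glds_kernel(X, Y, B, q)
```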

18 FSNS sequence kernels: Outline
1 Sequence kernels
2 FSNS sequence kernels
3 Experimental evaluation of sequence kernels

19 FSNS sequence kernels: Definition
Extension of the GLDS kernel: FSNS kernels
FSNS kernels (Feature Space Normalized Sequence kernels)
$$\kappa(X, Y) = \Big(\frac{1}{T_X} \sum_t \phi(x_t)\Big)^\top (S_B + \varepsilon I)^{-1} \Big(\frac{1}{T_Y} \sum_t \phi(y_t)\Big)$$
- $\phi$: any Mercer expansion
- $S_B$: second-moment matrix of $\phi$
- Regularization $\varepsilon > 0$: necessary if $\phi$ is of high dimension
- Avoid computing $\phi$ explicitly
→ Objective: rewrite the kernel using the Mercer kernel $k = \phi^\top \phi$

20 FSNS sequence kernels: Definition (continued)
FSMS kernels (Feature Space Mahalanobis Sequence kernels)
$$\kappa(X, Y) = \Big(\frac{1}{T_X} \sum_t \phi(x_t)\Big)^\top (\Sigma_B + \varepsilon I)^{-1} \Big(\frac{1}{T_Y} \sum_t \phi(y_t)\Big)$$
- $\Sigma_B$: covariance matrix of $\phi$
- SVMs are invariant to translations in the feature space, so this is the same as FSNS with centering of $\phi$
- The kernel is related to the KL divergence between Gaussians in the feature space

21 FSNS sequence kernels: Dual form of FSNS kernels
- Training (background) data: $B = \{b_i\}_{i = 1 \dots N}$
- Gram matrix: $K = [k(b_i, b_j)]_{i, j = 1 \dots N}$
- Empirical map: $\psi_B(x) = [k(b_1, x), \dots, k(b_N, x)]^\top$
Proposition (without regularization, $\varepsilon = 0$)
$$\kappa(X, Y) = \Big(\frac{1}{T_X} \sum_t \phi(x_t)\Big)^\top S_B^{-1} \Big(\frac{1}{T_Y} \sum_t \phi(y_t)\Big) = \Big(\frac{1}{T_X} \sum_t \psi_B(x_t)\Big)^\top \Big(\frac{1}{N} K^2\Big)^{-1} \Big(\frac{1}{T_Y} \sum_t \psi_B(y_t)\Big)$$

22 FSNS sequence kernels: Dual form of FSNS kernels (continued)
Proposition (with regularization)
$$\kappa(X, Y) = \Big(\frac{1}{T_X} \sum_t \phi(x_t)\Big)^\top (S_B + \varepsilon I)^{-1} \Big(\frac{1}{T_Y} \sum_t \phi(y_t)\Big) = \Big(\frac{1}{T_X} \sum_t \psi_B(x_t)\Big)^\top \Big(\frac{1}{N} K^2 + \varepsilon K\Big)^{-1} \Big(\frac{1}{T_Y} \sum_t \psi_B(y_t)\Big)$$
- Hypothesis: the $\phi(b_i)$ span the training $\phi(x)$

23 FSNS sequence kernels: Dual form of FSNS kernels (continued)
Proposition (with centering)
$$\kappa(X, Y) = \Big(\frac{1}{T_X} \sum_t \phi(x_t)\Big)^\top (\Sigma_B + \varepsilon I)^{-1} \Big(\frac{1}{T_Y} \sum_t \phi(y_t)\Big) = \Big(\frac{1}{T_X} \sum_t \psi_B(x_t)\Big)^\top \Big(\frac{1}{N} K \Pi K + \varepsilon K\Big)^{-1} \Big(\frac{1}{T_Y} \sum_t \psi_B(y_t)\Big)$$
- Hypothesis: the $\phi(b_i)$ span the training $\phi(x)$
- Centering: $\Pi = I - \frac{1}{N} \mathbf{1}_{N \times N}$ (instead of $I$), where $\mathbf{1}_{N \times N}$ is the all-ones matrix
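The equality between the feature-space form and the dual form can be checked numerically when $\phi$ is explicit. The sketch below uses a degree-2 polynomial expansion (an illustrative choice) and draws the sequences $X$ and $Y$ from the background set itself, so that the span hypothesis holds exactly. All names and sizes are assumptions made for the demonstration.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_expand(X, q=2):
    """Explicit expansion phi: all monomials of degree <= q of each row of X."""
    n, d = X.shape
    cols = [np.ones(n)]
    for deg in range(1, q + 1):
        for idx in combinations_with_replacement(range(d), deg):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
N, d, eps = 6, 3, 0.1
B = rng.uniform(-1.0, 1.0, size=(N, d))   # background vectors b_1 ... b_N
X, Y = B[:4], B[2:]                       # sequences taken from B: span hypothesis holds

Phi_B = poly_expand(B)                    # N x D matrix whose rows are phi(b_i), here D = 10
K = Phi_B @ Phi_B.T                       # Gram matrix k(b_i, b_j)
Pi = np.eye(N) - np.ones((N, N)) / N      # centering matrix

def primal(X, Y, P):
    """Feature-space form; P = I gives S_B (FSNS), P = Pi gives Sigma_B (FSMS)."""
    S = Phi_B.T @ P @ Phi_B / N
    mean_X, mean_Y = poly_expand(X).mean(axis=0), poly_expand(Y).mean(axis=0)
    return mean_X @ np.linalg.solve(S + eps * np.eye(S.shape[0]), mean_Y)

def dual(X, Y, P):
    """Dual form: only kernel values between background and sequence vectors are used."""
    psi_X = (Phi_B @ poly_expand(X).T).mean(axis=1)   # mean empirical map of X
    psi_Y = (Phi_B @ poly_expand(Y).T).mean(axis=1)
    return psi_X @ np.linalg.solve(K @ P @ K / N + eps * K, psi_Y)

print(primal(X, Y, np.eye(N)), dual(X, Y, np.eye(N)))
print(primal(X, Y, Pi), dual(X, Y, Pi))
```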

24 FSNS sequence kernels: Computational complexity
Dot product of normalized expansions
(1) $\kappa(X, Y) = \phi(X)^\top M_B\, \phi(Y) = \langle U\phi(X), U\phi(Y) \rangle$
(2) $\kappa(X, Y) = \psi_B(X)^\top R_B\, \psi_B(Y) = \langle V\psi_B(X), V\psi_B(Y) \rangle$

                                 form (1)    form (2)
Pre-computation of U / V         O(D^3)      O(N^3)
Sequence expansion Uφ / Vψ_B     O(T D^2)    O(T N^2)
Dot product computation          O(D)        O(N)

- D: dimension of the feature space (size of φ)
- N: number of background vectors (size of ψ)
+ Possibility to use expansions φ of infinite dimension (Gaussian kernel)
- Complexity problem for large databases

25 FSNS sequence kernels: Computational complexity (continued)
→ Objective: find an appropriate approximation

26 FSNS sequence kernels: Kernel approximation
Goal
1 Reduce the size of the empirical map $\psi$
2 Keep a maximum of information
ICD: Incomplete Cholesky Decomposition [Fine and Scheinberg, 2001]
1 Selection of a sub-population of background vectors:
  - codebook $C = \{b_{p_1}, \dots, b_{p_i}, \dots, b_{p_m}\} \subset B$
  - index $I = \{p_1, \dots, p_i, \dots, p_m\} \subset \{1, \dots, N\}$ (size $m$)
2 Low-rank approximation of the Gram matrix:
  $$K \approx L_I = K(:, I)\, K(I, I)^{-1}\, K(:, I)^\top \quad (\text{rank } m)$$
  $$\min_I \operatorname{tr}(K - L_I) \iff \min_C \sum_i \|\phi(b_i) - \phi_C(b_i)\|^2$$
+ Saves CPU time and memory
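One common way to carry out the two ICD steps is a greedy pivoted Cholesky factorization: at each step the background vector with the largest residual diagonal entry (the worst-approximated one) joins the codebook. The sketch below is a generic version of that procedure, not necessarily the exact variant used in the talk; the data, the spread `rho` and the codebook size are illustrative.

```python
import numpy as np

def gaussian_kernel(X, Y, rho=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 rho^2)) between rows of X and rows of Y."""
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * rho ** 2))

def incomplete_cholesky(K, m):
    """Greedy pivoted incomplete Cholesky.

    Returns the pivot indices I, the N x m factor G with G @ G.T = L_I,
    and the remaining trace tr(K - L_I).
    """
    N = K.shape[0]
    resid = np.diag(K).astype(float).copy()     # diagonal of K - G G^T
    G = np.zeros((N, m))
    I = []
    for j in range(m):
        p = int(np.argmax(resid))               # worst-approximated background vector
        I.append(p)
        G[:, j] = (K[:, p] - G[:, :j] @ G[p, :j]) / np.sqrt(resid[p])
        resid = np.maximum(resid - G[:, j] ** 2, 0.0)
        resid[p] = 0.0                          # a pivot is represented exactly
    return I, G, resid.sum()

rng = np.random.default_rng(0)
B = rng.uniform(0.0, 10.0, size=(300, 2))       # hypothetical background vectors
K = gaussian_kernel(B, B, rho=1.0)
I, G, trace_err = incomplete_cholesky(K, m=15)

# Low-rank approximation written as on the slide: K(:, I) K(I, I)^{-1} K(:, I)^T
L_I = K[:, I] @ np.linalg.solve(K[np.ix_(I, I)], K[I, :])
```

Only the `m` selected columns of `K` are read, so in practice the full Gram matrix does not need to be stored; it is built here only to compare `G @ G.T` with `L_I`.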

27 FSNS sequence kernels: Kernel approximation (continued)
[Figure: training data and the codebook selected by ICD, Gaussian kernel]

28 FSNS sequence kernels: Approximate form of FSNS kernels
Proposition
- ICD: $K \approx K(:, I)\, K(I, I)^{-1}\, K(:, I)^\top$
- $\kappa(X, Y) = \psi_B(X)^\top R_B\, \psi_B(Y) \;\Rightarrow\; \kappa(X, Y) \approx \psi_C(X)^\top R_{BC}\, \psi_C(Y)$
Expansion of size $m \ll N$:
$$\psi_C(X) = \frac{1}{T_X} \sum_t [k(b_{p_1}, x_t), \dots, k(b_{p_m}, x_t)]^\top$$
$$R_{BC} = \Big(\frac{1}{N} K(:, I)^\top\, \Pi\, K(:, I) + \varepsilon K(I, I)\Big)^{-1}$$
Complexity

                           form (1)    form (2)    approx. form
Expansion normalization    O(T D^2)    O(T N^2)    O(T m^2)
Dot product                O(D)        O(N)        O(m)
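The approximate form only needs kernel values between the codebook and the other vectors. A minimal sketch follows, assuming a Gaussian vector kernel and taking the index set `I` as given; a random subset is used here purely as a placeholder for the ICD pivots. Centering is applied by subtracting column means, which is equivalent to multiplying by $\Pi$ because $\Pi$ is symmetric and idempotent.

```python
import numpy as np

def gaussian_kernel(X, Y, rho=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 rho^2)) between rows of X and rows of Y."""
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * rho ** 2))

def fsns_approx(X, Y, B, I, rho=1.0, eps=1e-3, center=False):
    """kappa(X, Y) ~= psi_C(X)^T [K(:,I)^T P K(:,I) / N + eps K(I,I)]^{-1} psi_C(Y)."""
    C = B[I]                                    # codebook
    K_BC = gaussian_kernel(B, C, rho)           # K(:, I), N x m
    K_CC = gaussian_kernel(C, C, rho)           # K(I, I), m x m
    if center:
        K_BC = K_BC - K_BC.mean(axis=0)         # Pi K(:, I)
    M = K_BC.T @ K_BC / len(B) + eps * K_CC
    psi_X = gaussian_kernel(C, X, rho).mean(axis=1)   # expansion of size m
    psi_Y = gaussian_kernel(C, Y, rho).mean(axis=1)
    return psi_X @ np.linalg.solve(M, psi_Y)

rng = np.random.default_rng(0)
B = rng.uniform(0.0, 10.0, size=(400, 2))       # hypothetical background vectors
I = rng.choice(len(B), size=20, replace=False)  # placeholder for the ICD pivots
X = rng.uniform(0.0, 10.0, size=(60, 2))
Y = rng.uniform(0.0, 10.0, size=(50, 2))
kappa_plain = fsns_approx(X, Y, B, I)
kappa_centered = fsns_approx(X, Y, B, I, center=True)
```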

29 Experimental evaluation of sequence kernels: Outline
1 Sequence kernels
2 FSNS sequence kernels
3 Experimental evaluation of sequence kernels

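30 Experimental evaluation of sequence kernels: Data
Speech corpus: NIST Speaker Recognition Evaluation
Development: NIST 2003 and 2004
- Background corpus (1/2)
- Validation corpus (2/2)
  - Hyper-parameters of the kernels
  - Parameter C (SVM learning)
  - Decision threshold
Evaluation: NIST tests, 400 target speakers
$$\mathrm{DCF} = \tau_{FR}\, P_{loc}\, fr\% + \tau_{FA}\, P_{imp}\, fa\%$$
(fr%: false rejection rate on target-speaker trials, fa%: false acceptance rate on impostor trials; loc: target speaker, imp: impostor)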
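The two measures reported in the experiments, EER and minimum DCF, can be computed from lists of target and impostor scores as sketched below. The cost and prior values (`tau_fr = 10`, `tau_fa = 1`, `p_loc = 0.01`) and the relation $P_{imp} = 1 - P_{loc}$ are assumptions chosen for the example; the slides do not give them. The scores are synthetic.

```python
import numpy as np

def error_rates(target_scores, impostor_scores, threshold):
    """False rejection and false acceptance rates when accepting scores >= threshold."""
    fr = np.mean(target_scores < threshold)
    fa = np.mean(impostor_scores >= threshold)
    return fr, fa

def dcf(fr, fa, tau_fr=10.0, tau_fa=1.0, p_loc=0.01):
    """DCF = tau_FR * P_loc * fr + tau_FA * P_imp * fa, assuming P_imp = 1 - P_loc."""
    return tau_fr * p_loc * fr + tau_fa * (1.0 - p_loc) * fa

def eer_and_min_dcf(target_scores, impostor_scores):
    """Sweep every observed score (plus +inf) as a threshold; return (EER, min DCF)."""
    scores = np.sort(np.concatenate([target_scores, impostor_scores]))
    thresholds = np.concatenate([scores, [np.inf]])
    rates = np.array([error_rates(target_scores, impostor_scores, th) for th in thresholds])
    fr, fa = rates[:, 0], rates[:, 1]
    i = np.argmin(np.abs(fr - fa))              # operating point where fr and fa are closest
    return (fr[i] + fa[i]) / 2.0, dcf(fr, fa).min()

rng = np.random.default_rng(0)
target = rng.normal(loc=1.5, size=400)          # synthetic target-speaker scores
impostor = rng.normal(loc=-1.5, size=4000)      # synthetic impostor scores
eer, min_dcf = eer_and_min_dcf(target, impostor)
```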

31 Experimental evaluation of sequence kernels: Data (continued)
Preprocessing
- Acoustic vectors: MFCC + ΔMFCC + Δlog e (SVM); LFCC + ΔLFCC + Δlog e (GMM)
- Silence suppression: clustering of the energy
- Normalization: feature warping; mean and variance normalization (centring-reduction)

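32 Experimental evaluation of sequence kernels: FSNS kernels
Development: centering and regularization?
$$\kappa(X, Y) = \psi_C(X)^\top \Big[\frac{1}{N} K(:, I)^\top P\, K(:, I) + \varepsilon K(I, I)\Big]^{-1} \psi_C(Y)$$
[Table: EER (%) and DCFmin for four settings: centering $P = \Pi$ with two values of $\varepsilon > 0$, $P = \Pi$ with $\varepsilon = 0$, and $P = I$ with $\varepsilon = 0$; the numeric values are not preserved in this transcript]
- No significant gain with centering
- Regularization degrades performance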

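33 Experimental evaluation of sequence kernels: FSNS kernels
Development: choice and tuning of the vector kernel
$$\kappa(X, Y) = \psi_C(X)^\top \Big[\frac{1}{N} K(:, I)^\top P\, K(:, I) + \varepsilon K(I, I)\Big]^{-1} \psi_C(Y)$$
[Figure: EER (%) and DCFmin for several values of the spread $\rho$ relative to $\rho_0$]
- Best results are obtained with a Gaussian kernel $k(x, y) = \exp\left(-\|x - y\|^2 / (2\rho^2)\right)$
- ... using a well-chosen spread parameter $\rho \approx \rho_0 = d\sigma$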

34 Experimental evaluation of sequence kernels: FSNS kernels
Development: size of the codebook
$$\kappa(X, Y) = \psi_C(X)^\top \Big[\frac{1}{N} K(:, I)^\top P\, K(:, I) + \varepsilon K(I, I)\Big]^{-1} \psi_C(Y)$$
[Table: EER (%) and DCFmin for six codebook sizes $m$; the numeric values are not preserved in this transcript]
- Increasing the codebook size improves performance...
- ... up to a certain point ($m \approx 5000$)

35 Experimental evaluation of sequence kernels: FSNS kernels
Evaluation: FSNS kernels vs. classical approaches
[Table: EER (%) and DCFmin on the validation corpus, EER and DCF on the evaluation corpus, for (1) GLDS SVM, (2) FSNS SVM and (3) UBM-GMM; the numeric values are not preserved in this transcript]
- FSNS kernel: better performance than the GLDS kernel
- FSNS-SVM is competitive with UBM-GMM

36 Experimental evaluation of sequence kernels: FSNS kernels
Evaluation: perspectives of fusion
[Table: EER (%) and DCFmin on the validation corpus, EER and DCF on the evaluation corpus, for (1) GLDS SVM, (2) FSNS SVM, (3) UBM-GMM and the (1+2), (1+3) and (2+3) fusions; the numeric values are not preserved in this transcript; the (1+2) fusion is marked "no improvement"]
- Gain in discriminative / generative fusion

37 Experimental evaluation of sequence kernels: All sequence kernels
Evaluation: all sequence kernels
[Table: EER and DCF for the probability product kernel, GLDS kernel, Fisher kernel, FSNS kernel, UBM-GMM (reference) and GMM supervectors; the numeric values are not preserved in this transcript]
- Best results: exploiting GMMs within SVMs

38 Experimental evaluation of sequence kernels: All sequence kernels
Evaluation: fusion
[Table: EER and DCF for (1) FSNS kernel, (2) UBM-GMM, (3) GMM supervectors and the (2+3), (1+2) and (1+3) fusions; the numeric values are not preserved in this transcript; the (2+3) fusion is marked "no improvement"]
- Gain in GMM / SVM fusion

39 Conclusions and future work
Conclusions
- Theoretical and experimental exploration of kernels between variable-length sets
  - Many possible kernels
  - A novel kernel (FSNS)
- Experimental comparison on NIST SRE
  - Best performance: fusion of FSNS and GMM supervectors
Future work: several ways to improve SVMs for speaker verification
- Model adaptation
- String kernels for high-level features (prosody...)
- How to handle SVM scores?