Energy-based multi-speaker voice activity detection with an ad hoc microphone array
Alexander Bertrand, Marc Moonen
Department of Electrical Engineering (ESAT), Katholieke Universiteit Leuven
ICASSP 2010
A. Bertrand, M. Moonen (K.U.Leuven) Multi-speaker VAD with ad hoc array ICASSP 2010

Outline
1 Motivation
  Problem statement
  Data model
2 Solving non-negative BSS (NBSS)
  NBSS with well-grounded sources
  M-NICA
3 Results

Problem statement
[Figure: layout of speech sources and microphones, both axes from 0 to 5]
Goal: individual voice activity detection (VAD) for multiple simultaneous speakers
- Ad hoc microphone array
Assumptions:
- Speakers in near-field (speech power varies over microphones)
- Speakers mutually independent
- Limited noise and limited reverberation

Problem statement
Advantages:
- Array geometry unknown
- Speaker positions unknown
- Energy-based ⇒ low data rate, synchronization of sampling clocks not crucial
- By-product: power of each speaker at each microphone
Applications:
- Binaural hearing aids (head shadow)
- Video conferencing
- Ad hoc acoustic sensor networks, ...

Data model
- $N$ speakers, $J$ microphones, $J \geq N$
- Speech signal $n$: $\tilde{s}_n[t]$
- Microphone signal $j$: $\tilde{y}_j[t]$
- Instantaneous speech power ($L$ = block length, $k$ = frame index):
  $$s_n[k] = \frac{1}{L} \sum_{l=0}^{L-1} \tilde{s}_n[kL + l]^2$$
- Instantaneous microphone signal power:
  $$y_j[k] = \frac{1}{L} \sum_{l=0}^{L-1} \tilde{y}_j[kL + l]^2$$
- Stack $s_n[k]$ and $y_j[k]$ in $\mathbf{s}[k]$ and $\mathbf{y}[k]$, respectively
- Data model: $\mathbf{y}[k] \approx \mathbf{A}\,\mathbf{s}[k]$, $\forall k \in \mathbb{N}$, where $\mathbf{A}$ is a $J \times N$ mixing matrix
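
As a minimal sketch of the block-power definitions above, the following NumPy snippet computes the frame powers of a sampled waveform. The function name `block_power` and the toy signal are illustrative assumptions, not part of the original slides; an incomplete trailing block is simply dropped here.

```python
import numpy as np

def block_power(x, L):
    """Mean squared amplitude over consecutive, non-overlapping blocks of L samples."""
    x = np.asarray(x, dtype=float)
    K = len(x) // L                      # number of complete frames
    frames = x[:K * L].reshape(K, L)     # row k holds samples kL ... kL + L - 1
    return np.mean(frames ** 2, axis=1)

# Toy waveform (hypothetical values), block length L = 2
x = np.array([1.0, -1.0, 2.0, 2.0, 0.0, 3.0, 5.0])
p = block_power(x, L=2)
```

Applying the same function to every microphone waveform and stacking the results row by row would give the columns $\mathbf{y}[k]$ used in the data model.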

Data model
$$\mathbf{y}[k] \approx \mathbf{A}\,\mathbf{s}[k], \quad \forall k \in \mathbb{N}$$
Remarks:
- Assumes independence of sources and no reverberation ⇒ good choice of $L$
- Trade-off in the size of $L$: time resolution vs. model mismatch
- Noise: incorporate in $\mathbf{s}$ or subtract
Goal: find $\mathbf{s}$ (and $\mathbf{A}$) ⇒ track power of each source
= blind source separation problem with non-negative source signals (NBSS)
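
The trade-off in $L$ can be illustrated numerically. The sketch below rests on several assumptions that are not in the slides: white Gaussian noise stands in for speech, propagation is an instantaneous amplitude gain (so the entries of $\mathbf{A}$ are squared gains), and the gain values are hypothetical. It compares the measured microphone power with the linear power model for a short and a long block length.

```python
import numpy as np

rng = np.random.default_rng(0)

def block_power(x, L):
    """Mean squared amplitude over non-overlapping blocks of L samples."""
    K = len(x) // L
    return np.mean(x[:K * L].reshape(K, L) ** 2, axis=1)

def model_mismatch(L, n_frames=200):
    """Average relative gap between microphone power and the linear power model."""
    T = L * n_frames
    s1 = rng.standard_normal(T)          # stand-in for speaker 1 (assumption)
    s2 = rng.standard_normal(T)          # stand-in for speaker 2 (assumption)
    g1, g2 = 0.9, 0.4                    # hypothetical amplitude gains to one microphone
    y = g1 * s1 + g2 * s2                # instantaneous mixture, no reverberation
    measured = block_power(y, L)
    modelled = g1 ** 2 * block_power(s1, L) + g2 ** 2 * block_power(s2, L)
    return float(np.mean(np.abs(measured - modelled) / modelled))

short_err = model_mismatch(L=16)
long_err = model_mismatch(L=1024)
```

The gap comes from the cross term between the two sources, which averages out as the block grows; a longer block therefore fits the model better but gives coarser time resolution.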

NBSS with well-grounded sources
- Exploit non-negativity ⇒ simpler algorithms (compared to classic ICA)
- Exploit well-groundedness of source signals (non-vanishing pdf at zero)
- $\mathbf{s}$: well-grounded due to on-off behavior of speech
- Possible choice of algorithm: non-negative PCA (NPCA) [1]
- Avoid step size search: multiplicative non-negative ICA (M-NICA) [2]

[1] E. Oja and M. Plumbley, "Blind separation of positive sources using non-negative PCA," in Proc. of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), Nara, Japan, 2003.
[2] A. Bertrand and M. Moonen, "Blind separation of non-negative source signals using multiplicative updates and subspace projection," accepted for publication in Signal Processing.

M-NICA
Main idea: an orthogonal mixture of non-negative, well-grounded, independent signals that preserves non-negativity is a permutation of the original signals [M. Plumbley, 2002].
M-NICA idea:
1. Decorrelate + preserve non-negativity
2. Restore signal subspace (projection step)
Multiplicative updating:
- preserves non-negativity
- no user-defined learning rate
Notation:
- $\mathbf{S}$, $\mathbf{Y}$: $M$ samples of $\mathbf{s}[k]$, $\mathbf{y}[k]$ in columns, i.e. $\mathbf{Y} = \mathbf{A}\mathbf{S}$
- $\bar{\mathbf{S}}$: mean of rows of $\mathbf{S}$, i.e. $\bar{\mathbf{S}} = \frac{1}{M}\mathbf{S}\mathbf{1}\mathbf{1}^T$
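
The main idea can be checked on toy data. In the sketch below (an illustration under assumed data, not the authors' experiment), two independent on/off sources are drawn as exponential powers that are exactly zero when inactive, which makes them well-grounded. A permutation keeps the mixture non-negative, whereas a rotation that is not a permutation produces negative values.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 5000

# Well-grounded toy sources: positive power when active, exactly zero otherwise
active = rng.random((2, M)) < 0.5
S = rng.exponential(1.0, size=(2, M)) * active

def rotation(theta):
    """2 x 2 rotation matrix (orthogonal)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

perm = np.array([[0.0, 1.0], [1.0, 0.0]])   # orthogonal and a permutation
rot = rotation(np.pi / 6)                   # orthogonal, but not a permutation

min_perm = float((perm @ S).min())          # stays non-negative
min_rot = float((rot @ S).min())            # goes negative when only source 2 is active
```

Frames in which one source is silent are what expose the rotation: with only source 2 active, the first output of the rotated mixture is a negative multiple of that source's power.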
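
The M-NICA algorithm (batch)
1. Initialize $\mathbf{S} \leftarrow \mathbf{Y}_{1:N,:}$
2. Decorrelation step (preserves non-negativity): $\forall n = 1 \ldots N$, $m = 1 \ldots M$:
   $$[\mathbf{S}]_{nm} \leftarrow [\mathbf{S}]_{nm}\,\frac{\left[\mathbf{S}\bar{\mathbf{S}}^T\mathbf{D}_1^{-1}\mathbf{S} + \mathbf{S}\mathbf{S}^T\mathbf{D}_1^{-1}\bar{\mathbf{S}} + \mathbf{D}_2\mathbf{S}\right]_{nm}}{\left[\mathbf{S}\bar{\mathbf{S}}^T\mathbf{D}_1^{-1}\bar{\mathbf{S}} + \mathbf{S}\mathbf{S}^T\mathbf{D}_1^{-1}\mathbf{S} + \mathbf{D}_2\bar{\mathbf{S}}\right]_{nm}}$$
   ERRATUM: switch numerator and denominator in paper!
3. Signal subspace projection step: $\forall n = 1 \ldots N$, $m = 1 \ldots M$:
   $$[\mathbf{S}]_{nm} \leftarrow \max\left(\left[\mathcal{P}_{\mathrm{rowspan}\{\mathbf{Y}\}}\{\mathbf{S}\}\right]_{nm},\, 0\right)$$
4. Return to step 2.
(PS: batch mode)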
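
The batch loop above can be sketched in NumPy as follows. This is a hedged illustration, not the authors' implementation. The slide does not define $\mathbf{D}_1$ and $\mathbf{D}_2$, so the code assumes $\mathbf{D}_1 = \mathrm{diag}(\mathbf{C})$ and $\mathbf{D}_2 = \mathrm{diag}(\mathbf{C}\mathbf{D}_1^{-1}\mathbf{C}\mathbf{D}_1^{-1})$ with $\mathbf{C} = (\mathbf{S}-\bar{\mathbf{S}})(\mathbf{S}-\bar{\mathbf{S}})^T$, which is what the gradient of a normalized cross-correlation cost yields; the placement of the overbars in the update is likewise an assumption. The projection uses the $N$ dominant right singular vectors of $\mathbf{Y}$, and the names `decorrelation_step` and `m_nica` as well as the toy data are hypothetical.

```python
import numpy as np

def decorrelation_step(S, eps=1e-12):
    """One multiplicative update; non-negative input stays non-negative."""
    M = S.shape[1]
    Sb = np.repeat(S.mean(axis=1, keepdims=True), M, axis=1)   # S-bar = (1/M) S 1 1^T
    Z = S - Sb
    C = Z @ Z.T
    d1_inv = 1.0 / (np.diag(C) + eps)       # assumed D1 = diag(C)
    CD = C * d1_inv                         # C @ D1^{-1} (column scaling)
    d2 = np.diag(CD @ CD)[:, None]          # assumed D2 = diag(C D1^{-1} C D1^{-1})
    SSb = (S @ Sb.T) * d1_inv               # S S-bar^T D1^{-1}
    SS = (S @ S.T) * d1_inv                 # S S^T D1^{-1}
    num = SSb @ S + SS @ Sb + d2 * S
    den = SSb @ Sb + SS @ S + d2 * Sb
    return S * num / (den + eps)

def m_nica(Y, N, n_iter=200):
    """Batch sketch: alternate decorrelation and signal subspace projection."""
    Y = np.asarray(Y, dtype=float)
    S = Y[:N].copy()                                   # step 1: first N rows of Y
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    V = Vt[:N]                                         # basis of the signal subspace
    for _ in range(n_iter):
        S = decorrelation_step(S)                      # step 2
        S = np.maximum((S @ V.T) @ V, 0.0)             # step 3: project, clip at zero
    return S

# Toy data (hypothetical): N = 2 on/off sources observed at J = 3 microphones
rng = np.random.default_rng(0)
N, J, M = 2, 3, 1000
S_true = rng.exponential(1.0, size=(N, M)) * (rng.random((N, M)) < 0.5)
A = rng.uniform(0.1, 1.0, size=(J, N))
S_hat = m_nica(A @ S_true, N)
```

Under these assumptions, rows that are already uncorrelated are a fixed point of `decorrelation_step`, since numerator and denominator then coincide. Any recovered rows match the true source powers only up to permutation and scaling, which would have to be resolved before comparing them with `S_true`.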

Results
[Figure: layout of speech sources and microphones in the room]
- Cubical room: 5 m x 5 m x 5 m
- $L = 480$ (i.e. 30 ms)
- Sliding window with length $K = 200$
Assessment by signal-to-error ratio (SER):
$$\mathrm{SER} = \frac{1}{JN} \sum_{j,n} 10 \log_{10} \frac{\sum_k \left([\mathbf{A}]_{jn}\, s_n[k]\right)^2}{\sum_k \left([\hat{\mathbf{A}}]_{jn}\, \hat{s}_n[k] - [\mathbf{A}]_{jn}\, s_n[k]\right)^2}$$
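
The SER can be computed directly from its definition. The following sketch assumes that the permutation ambiguity between estimated and true sources has already been resolved (the scaling ambiguity cancels in the product of mixing coefficient and source power); the function name `ser_db` and the test data are illustrative assumptions.

```python
import numpy as np

def ser_db(A, S, A_hat, S_hat):
    """Signal-to-error ratio in dB, averaged over all (microphone, source) pairs."""
    ref = A[:, :, None] * S[None, :, :]              # [A]_jn s_n[k], shape (J, N, K)
    est = A_hat[:, :, None] * S_hat[None, :, :]      # estimated counterpart
    num = np.sum(ref ** 2, axis=2)
    den = np.sum((est - ref) ** 2, axis=2)
    return float(np.mean(10.0 * np.log10(num / den)))

# Toy check (hypothetical data): a uniform 10 % overestimate of every source power
rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(3, 2))
S = rng.exponential(1.0, size=(2, 50))
value = ser_db(A, S, A, 1.1 * S)
```

In this toy check the error is one tenth of the reference in every term, so each energy ratio is 100 and the SER comes out at 20 dB.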

Results
[Figure, top: original source energy vs. source energy estimated by M-NICA, over time [s]]
[Figure, bottom: SER [dB] over time [s] for M-NICA and for NPCA with η = 0.5, 1, 1.5, 2]

Results
[Figure: three panels (source 1, source 2, source 3), each showing original source energy vs. source energy estimated by M-NICA]

Results: effect of reverberation
[Figure: SER [dB] vs. reflection coefficient, for L = 480 and L = 960]

Results: effect of residual noise
[Figure: SER [dB] vs. SNR in best microphone [dB]]

Results: reconstruction (limited reverberation, limited noise)
[Figure: three panels, each showing original source energy (source 1) vs. source energy estimated by M-NICA: no reverberation and no residual noise; reflection coefficient = 0.7; residual noise with SNR of 5 dB in best microphone]

Summary
- Track power of simultaneous speakers
- Ad hoc microphone array (unknown geometry)
- Energy-based: near-field, low data rate, weak synchronization constraints
- Solve as non-negative BSS; algorithm: M-NICA