A Study of the Cosine Distance-Based Mean Shift for Telephone Speech Diarization


1 TASL

Mohammed Senoussaoui, Patrick Kenny, Themos Stafylakis and Pierre Dumouchel

Abstract: Speaker clustering is a crucial step for speaker diarization. The short duration of speech segments in telephone speech dialogue and the absence of prior information on the number of clusters dramatically increase the difficulty of this problem in diarizing spontaneous telephone speech conversations. We propose a simple iterative Mean Shift algorithm based on the cosine distance to perform speaker clustering under these conditions. Two variants of the cosine distance Mean Shift are compared in an exhaustive practical study. We report state-of-the-art results as measured by the Diarization Error Rate and the Number of Detected Speakers on the LDC CallHome telephone corpus.

Index Terms: Speaker diarization, clustering, Mean Shift, cosine distance.

I. INTRODUCTION

Speaker diarization consists in splitting an audio stream into homogeneous regions corresponding to speech of participating speakers. As the problem is usually formulated, diarization requires performing two principal steps, namely segmentation and speaker clustering. The aim of segmentation is to find speaker change points in order to form segments (known as speaker turns) that contain speech of a given speaker. The aim of speaker clustering is to link unlabeled segments according to a given metric in order to determine the intrinsic grouping in data. The challenge of speaker clustering increases by virtue of the absence of any prior knowledge about the constituent number of speakers in the stream.

Model selection based on the Bayesian information criterion (BIC) is the most popular method for speaker segmentation [1][]. BIC can also be used to estimate the number of speakers in a recording, and other Bayesian methods have recently been proposed for this purpose [3][4]. Hierarchical Agglomerative Clustering (HAC) is by far the most widespread approach to the speaker clustering problem. Other methods, including hybrid approaches, continue to be developed [4] [5].
Manuscript received June 1, 2013, revised August 31, 2013 and accepted September 4, 2013. This work was supported by the Natural Science and Engineering Research Council of Canada. Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request. M. Senoussaoui is with Centre de recherche informatique de Montréal (CRIM), Montréal, QC, H3A 1B9, Canada and with École de technologie supérieure (ÉTS), Montréal, QC, Canada. P. Kenny and T. Stafylakis are with Centre de recherche informatique de Montréal (CRIM), Montréal, QC, H3A 1B9, Canada. P. Dumouchel is with École de technologie supérieure (ÉTS), Montréal, QC, Canada.

In this work, we focus on the speaker clustering task rather than speaker segmentation. We propose a clustering method which is capable of estimating the number of speakers participating in a telephone conversation, a challenging problem considering that speaker turns are generally of very short duration [4]. The method in question is the so-called Mean Shift (MS) algorithm. This approach is borrowed from the field of computer vision, where it is widely used to detect the number of colors and for image segmentation purposes. The MS algorithm is a nonparametric iterative mode-seeking algorithm introduced by Fukunaga [6]. Despite its first appearance in 1975, MS remained in oblivion except for works such as [7] that aimed to generalize the original version. The MS algorithm reappeared in 2002 with the work of Comaniciu [8] in image processing. Recently, Stafylakis et al. [9][10] have shown how to generalize the basic Euclidean space MS algorithm to non-Euclidean manifolds so that objects other than points in Euclidean space can be clustered. This generalized method was applied to the problem of speaker clustering in a context where speaker turns were characterized by multivariate Gaussian distributions. Our choice of the MS algorithm is mainly motivated by its nonparametric nature.
This characteristic offers the major advantage of not having to make assumptions about the shape of the data distribution, in contrast to conventional probabilistic diarization methods. Recently [11], we presented a new extension of the Euclidean Mean Shift that works with a cosine distance metric. This new algorithm was shown to be very effective for speaker clustering in large populations where each speaker was represented by a whole side of a telephone conversation. This work was motivated by the success of cosine similarity matching in the speaker verification field [1][13][14][15]. Cosine distance has also been successfully tested in speaker diarization of the CallHome telephone corpus [16][17].

In this work, firstly, we propose to test the cosine-based MS algorithm on the diarization of multispeaker wire telephone recordings. We do not assume that the number of participating speakers is given. Secondly, we compare two clustering mechanisms that exploit the cosine-based MS algorithm with respect to diarization performance (as measured by the number of speakers detected and the standard diarization error metric) and with respect to execution times.

Although diarization on telephone conversations is an important and difficult task, there are not, to our knowledge, any published studies on the use of the MS algorithm to solve this problem. Unlike broadcast news speech, the shortness of
the speaker turn duration in telephone speech (typically one second) makes the task of properly representing these segments in a feature space more difficult. In order to deal with this problem, we represent each speaker turn by an i-vector (a representation of speech segments by vectors of fixed dimension, independent of segment durations) [1]. I-vector features have been used successfully not only in speaker recognition [1][13][14][15] and speaker diarization and clustering [16][17][1][11] but also in language recognition [][3]. Although probabilistic classifiers such as Probabilistic Linear Discriminant Analysis have become predominant in applying i-vector methods to speaker recognition, simple cosine distance based classifiers remain competitive [1][13], and we will use this approach in developing the speaker diarization algorithms presented here. (Note that in [4], the authors show that cosine distance provides a better metric than Euclidean distance in GMM-supervector space.)

In [16], the authors introduced a diarization system where i-vectors were used to represent speaker turns and cosine distance based k-means clustering was used to associate speaker turns with individual speakers. Tested on two-speaker conversations, this approach outperformed a BIC-based hierarchical agglomerative clustering system by a wide margin. But in order for k-means clustering to work, the number of speakers in a given conversation needs to be known in advance, so it is not straightforward to extend this approach to the general diarization problem where the number of speakers participating in the conversation needs to be determined (a simple heuristic is presented in [17]). The main contribution of this paper is to show how using the Mean Shift algorithm in place of k-means enables this problem to be dealt with very effectively.

As our test bed we use the CallHome telephone speech corpus (development/test) provided by NIST in the year 2000 speaker recognition evaluation (SRE). This consists of spontaneous telephone conversations involving varying numbers of speakers.
The CallHome dataset has been the subject of several studies [17][18][19][][1].

The rest of this paper is organized as follows. In Section II we first provide some background material on the i-vector feature space that will be used in this work. In Section III, we give some preliminaries on the original version of the Mean Shift algorithm and explain how we include the cosine distance by introducing a simple modification. Two ways of exploiting the MS algorithm for clustering purposes will also be given. In Section IV, we present different methods of normalizing i-vectors for diarization (such normalizations turn out to be very important for our approach). Thereafter, we perform a detailed experimental study and analysis in Sections V and VI before concluding this work in Section VII.

II. I-VECTORS FOR SPEAKER DIARIZATION

The supervector representation has been applied with great success to the field of speaker recognition, especially when it was exploited in the well-known generative model named Joint Factor Analysis (JFA) [5]. In high-dimensional supervector space, JFA attempts to jointly model speaker and channel variabilities using a large amount of background data. When a relatively small amount of speaker data is available (i.e. during enrolment and test stages), JFA enables effective speaker modeling by suppressing channel variability from the speech signal.

[Fig. 1. Skeleton of the Mean Shift i-vector diarization system: segmentation of the speech signal is followed by extracting i-vectors for each segment and then the i-vectors are clustered (using Mean Shift in our case). Blocks: Input: audio signal; Feature extraction / initial segmentation; i-vector extraction; i-vector clustering (Mean Shift); Output: segmentation.]

A major advance in the area of speaker recognition was the introduction of the low-dimensional feature vectors known as i-vectors [1].
We can define an i-vector as the mapping (using a Factor Analysis or a Probabilistic Principal Component Analysis) of a high-dimensional supervector to a low-dimensional space called total variability space (here the word "total" is used to refer to both speaker and channel variabilities). Unlike JFA, which proposes to distinguish between speaker and channel effects in the supervector space, i-vector methods seek to model these effects in a low-dimensional space where many standard pattern recognition methods can be brought to bear on the problem.

Mathematically, the mapping of a supervector X to an i-vector x is expressed by the following formula:

$$X = X_{UBM} + T x \qquad (1)$$

where $X_{UBM}$ is the supervector of the Universal Background Model (UBM) and the rectangular matrix $T$ is the so-called Total Variability matrix. More mathematical details of i-vectors and their estimation can be found in [1][5][6].

I-vectors have successfully been deployed in many fields other than speaker recognition [16][17][1][][3]. Methods successful in one field can often be translated to other fields by identifying the sources of useful and nuisance variability. Thus in speaker recognition, speaker variability is useful but it counts as nuisance variability in language recognition.

In the diarization problem, the speaker turn (represented by an i-vector in our case) is the fundamental representation unit, or what we usually call a sample in Pattern Recognition terminology. Moreover, an aggregation of homogeneous i-vectors within one conversation represents a cluster (a speaker in our case), or what is commonly known as a class. Thus, the diarization problem becomes one of clustering i-vectors [11][16][17][1].
III. THE MEAN SHIFT ALGORITHM

The Mean Shift algorithm can be viewed as a clustering algorithm or as a way of finding the modes in a nonparametric distribution. In this section we present the intuitive idea behind the Mean Shift mode-seeking process as well as the mathematical derivations of this algorithm. Additionally, we present two variants of this algorithm which can be applied for clustering purposes. Finally, the extension of the traditional MS to the cosine-based MS is presented.

A. The intuitive idea behind Mean Shift

The intuitive idea of Mean Shift is quite natural and simple. Starting from a given vector $x_i$ in a set $S = \{x_1, x_2, \ldots, x_n\}$ of unlabeled data (which are i-vectors in our case), we can reach a stationary point called a density mode through the iterative process depicted in Algorithm 1. Note that Algorithm 1 refers to the original Mean Shift process. The mathematical convergence proof of the sequence of successive positions $\{y_i\}_{i=1,2,\ldots}$ is found in [6][8].

Algorithm 1: Mean Shift (intuitive idea)
    i = 1, y_i = x                         // Initialization
    Center a window around y_i
    repeat
        compute µ_h(y_i)                   // estimate the sample mean of data falling within the window
                                           // (i.e. the neighborhood of y_i in terms of Euclidean distance)
        y_{i+1} = µ_h(y_i)
        Move the window from y_i to y_{i+1}
        i = i + 1
    until Stabilization                    // a mode has been found

B. Mathematical development

Mean Shift is a member of the Kernel Density Estimation (KDE) family of algorithms (also known as Parzen windowing). Estimating the probability density function of a distribution using a limited sample of data is a fundamental problem in pattern recognition. The standard form of the estimated kernel density function $\hat{f}(x)$ at a randomly selected point $x$ is given by the following formula (see footnote 1):

$$\hat{f}(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} k\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right) \qquad (2)$$

where $k(x)$ is a kernel function and $h$ is its radial width, referred to as the kernel bandwidth. Ignoring the selection of kernel type, $h$ is the only tunable parameter in the Mean Shift algorithm; its role is to smooth the estimated density function.
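In order to ensure certain properties such as asymptotic unbiasedness and consistency, the kernel and bandwidth $h$ should satisfy some conditions that are discussed in detail in [6]. In general, the purpose of KDE is to estimate the density function, but the Mean Shift procedure is only concerned with locating the modes of the density function $f(x)$ and not the values of the density function at these points.

Footnote 1: Note that for simplicity we ignore some constants in the mathematical derivations.

To find the modes, the Mean Shift algorithm derivation requires calculating the gradient of the density function $f(x)$. The estimate of the gradient of the density function $f(x)$ is given by the gradient of the estimate of the density function $\hat{f}(x)$ as follows [6][8][7][8][9]:

$$\hat{\nabla} f(x) \equiv \nabla \hat{f}(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} \nabla k\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right) = \frac{2}{n h^{d+2}} \sum_{i=1}^{n} (x - x_i)\, k'\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right). \qquad (3)$$

A simple type of kernel is the Epanechnikov kernel given by the following formula:

$$k(x) = \begin{cases} 1 - x & 0 \le x \le 1 \\ 0 & x > 1 \end{cases} \qquad (4)$$

Let $g(x)$ be the uniform kernel:

$$g(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & x > 1 \end{cases} \qquad (5)$$

Note that it satisfies:

$$k'(x) = -c\, g(x) \qquad (6)$$

where $c$ is a constant and the prime is the derivation operator. Then we can write $\nabla \hat{f}(x)$ as:

$$\nabla \hat{f}(x) = \frac{2c}{n h^{d+2}} \sum_{i=1}^{n} (x_i - x)\, g\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right) = \frac{2c}{n h^{d+2}} \left[\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right)\right] \left[\frac{\sum_{i=1}^{n} x_i\, g\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right)} - x\right]. \qquad (7)$$

The expression:

$$m_h(x) = \frac{\sum_{i=1}^{n} x_i\, g\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right)} - x \qquad (8)$$

is what we refer to as the Mean Shift vector $m_h(x)$. Note that the Mean Shift vector $m_h(x)$ is just the difference between the next position, given by the weighted sample mean vector of all data, and the current position (instance vector $x$). Indeed, the weights in the mean formula are given by the binary outputs (i.e. 0 or 1) of the flat kernel $g(x)$.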
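The mode-seeking iteration of Algorithm 1 with the flat kernel can be sketched in a few lines of Python. This is a minimal illustration under assumptions of ours, not the authors' implementation: the function name `mean_shift_mode`, the stopping tolerance `tol` and the iteration cap `max_iter` are hypothetical choices, and NumPy is assumed to be available.

```python
import numpy as np

def mean_shift_mode(x, data, h, tol=1e-6, max_iter=100):
    """Follow the flat-kernel (Euclidean) Mean Shift iteration from x to a mode.

    x    : starting vector, shape (d,)
    data : all samples, shape (n, d)
    h    : bandwidth (window radius)
    tol, max_iter : stopping rule, an assumption of this sketch
    """
    y = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        # Window: samples whose Euclidean distance to y is at most h.
        in_window = np.linalg.norm(data - y, axis=1) <= h
        if not in_window.any():
            break  # empty window: nothing to average
        mu = data[in_window].mean(axis=0)  # sample mean of the window
        shift = mu - y                     # Mean Shift vector m_h(y)
        y = mu                             # move the window to the mean
        if np.linalg.norm(shift) < tol:
            break                          # stabilization: a mode was found
    return y
```

For example, on two well-separated groups of points the iteration started inside one group stops at that group's sample mean, because the window never reaches the other group.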
For simplicity, let us denote the uniform kernel with bandwidth $h$ by $g(x, x_i, h)$ so that:

$$g(x, x_i, h) = \begin{cases} 1 & \|x - x_i\| \le h \\ 0 & \|x - x_i\| > h \end{cases} \qquad (9)$$

In other words, $g(x, x_i, h)$ selects a subset $S_h(x)$ of $n_x$ samples (by analogy with Parzen windows we refer to this subset as a window) in which the Euclidean pairwise distances with $x$ are less than or equal to the threshold (bandwidth) $h$:

$$S_h(x) \equiv \{x_i : \|x - x_i\| \le h\}. \qquad (10)$$

Therefore, we can rewrite the Mean Shift vector as:

$$m_h(x) = \mu_h(x) - x \qquad (11)$$

where $\mu_h(x)$ is the sample mean of the $n_x$ samples of $S_h(x)$:

$$\mu_h(x) = \frac{1}{n_x} \sum_{x_i \in S_h(x)} x_i. \qquad (12)$$

The iterative process of calculating the sample mean followed by data shifting (which produces the sequence $\{y_i\}_{i=1,2,\ldots}$ referred to in Algorithm 1) converges to a mode of the data distribution.

C. Mean Shift for speaker clustering

The Mean Shift algorithm can be exploited to deal with the problem of speaker clustering in the case where the number of clusters (speakers in our case) is unknown, as well as other problems such as the segmentation steps involved in image processing and object tracking [8]. In the following subsections, we present two clustering mechanisms based on the MS algorithm, namely, the Full and the Selective clustering strategies.

Full strategy

One may apply the iterative Mean Shift procedure at each data instance. In general, some of the MS processes will converge to the same density mode. The number of density modes (after pruning) represents the number of detected clusters, and instances that converge to the same mode are deemed to belong to the same cluster (we call these points the basin of attraction of the mode). In this work we refer to this approach as the Full strategy.

Selective strategy

Unlike the Full Mean Shift clustering strategy, we can adapt this strategy to run the MS process on a subset of data only. The idea is to keep track of the number of visits to each data point that occur during the evolution of a Mean Shift process. After the convergence of the first Mean Shift process the samples that have been visited are assigned to the first cluster.
We then run a second process starting from one of the unvisited samples and create a second cluster. We continue to run MS processes one after another until we have no unvisited data samples. Some of the samples may be allocated to more than one cluster by this procedure; majority voting is then needed to reconcile these conflicts.

Note that the computational complexity depends on the number of samples in the Full strategy and it depends only on the number of clusters in the case of the Selective strategy. A MATLAB implementation of the Selective strategy can be found online. In this work the experimental results of the Full and Selective clustering strategies are compared in Section VI.

D. Mean Shift based on cosine distance

The success of the cosine distance in speaker recognition is well known [1][13][14][15]. A rationale for using cosine distance instead of Euclidean distance can be supplied by postulating a normal distribution for the speaker population (as in PLDA [30]). Suppose we are given a pair of i-vectors and we wish to test the hypothesis that they belong to the same speaker cluster against the hypothesis that they belong to different clusters. Because most of the population mass is concentrated in the neighborhood of the origin, speakers in this region are in danger of being confused with each other. In the case of a pair of i-vectors which are close to the origin, the same-speaker hypothesis will only be accepted if the i-vectors are relatively close together. On the other hand, if the i-vectors are far from the origin, they can be relatively far apart from each other without invalidating the same-speaker hypothesis. Hence, in order to incorporate this prior knowledge regarding the distribution of the speaker means into the MS algorithm, we may either use (a) the Euclidean distance and a variable bandwidth that increases with the distance from the origin or (b) a fixed bandwidth and the cosine similarity. The latter approach is evidently preferable.

The cosine distance between two vectors $x$ and $y$ is given by:

$$D(x, y) = 1 - \frac{x \cdot y}{\|x\|\,\|y\|}.$$
(13)

The original Mean Shift algorithm based on a flat kernel relies on the Euclidean distance to find points falling within the window, as shown in (10). In [11] we proposed the use of the cosine metric instead of the Euclidean one to build a new version of the Mean Shift algorithm. Only one modification needs to be introduced in (10); we set

$$S_h(x) \equiv \{x_i : D(x, x_i) \le h\} \qquad (14)$$

where $D(x, x_i)$ is the cosine distance between $x$ and $x_i$ given by formula (13). This corresponds to redefining the uniform kernel as:
$$g(x, x_i, h) = \begin{cases} 1 & D(x, x_i) \le h \\ 0 & D(x, x_i) > h \end{cases} \qquad (15)$$

E. Conversation-dependent bandwidth

It is known from the literature [31] that one of the practical limitations of the Mean Shift algorithm is the need to fix the bandwidth $h$. Using a fixed bandwidth is not generally appropriate, as the local structure of the data to be clustered can vary. We have found that varying the bandwidth from one conversation to another turns out to be useful in diarization based on the Mean Shift algorithm.

In order to deal with the disparity caused by the variable duration of conversations, we adopt a version of the variable bandwidth scheme proposed in [10]. This is designed to smooth the density estimator in the case of short conversations, where the number of segments to be clustered is small. The variable bandwidth is controlled by two parameters, $\lambda$ and the fixed bandwidth $h$. For a conversation $c$, the conversation-dependent bandwidth $h_\lambda(c)$ is given by

$$h_\lambda(c) = 1 - \frac{n_c \lambda (1 - h)}{n_c \lambda + (1 - h)} \qquad (16)$$

where $n_c$ is the number of segments in the conversation. Note that $h_\lambda(c) \ge h$, with equality if $n_c$ is very large.

F. Cluster pruning

An artifact of the Mean Shift algorithm is that there is nothing to prevent it from producing clusters with very small numbers of segments. To counter this tendency, we simply prune clusters containing a small number of samples (less than or equal to a constant $p$) by merging them with their nearest neighbors.

IV. I-VECTOR NORMALIZATION FOR DIARIZATION

By design, i-vectors are intended to represent a wide range of speech variabilities. Hence, raw i-vectors need to be normalized in ways which vary from one application to another. Based on the above definitions of class and sample in relation to our problem (see Section II), we will present in the following sections some methods to normalize i-vectors which are suitable for speaker diarization.

A. Principal components analysis (PCA)

In [16] it was shown that projecting i-vectors onto the conversation-dependent PCA axes with high variance helps to compensate for intra-session variability.
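Before going further into normalization, the cosine-based window of (14), the Full strategy of Section III-C and the conversation-dependent bandwidth of Section III-E can be made concrete with a short Python sketch. It is an illustration under assumptions of ours rather than the authors' code: all function names are hypothetical, `merge_tol` (the cosine distance below which two converged modes are treated as the same mode) and the stopping rule are our choices, and (16) is assumed to read $h_\lambda(c) = 1 - n_c\lambda(1-h) / (n_c\lambda + 1 - h)$, which satisfies $h_\lambda(c) \ge h$ with equality as $n_c$ grows.

```python
import numpy as np

def cosine_distance(x, data):
    """Cosine distance D(x, x_i) = 1 - x.x_i / (|x| |x_i|) to every row of data."""
    sims = data @ x / (np.linalg.norm(data, axis=1) * np.linalg.norm(x))
    return 1.0 - sims

def conversation_bandwidth(h, n_c, lam):
    """Assumed reading of (16): larger bandwidth for short conversations."""
    return 1.0 - (n_c * lam * (1.0 - h)) / (n_c * lam + (1.0 - h))

def cosine_mean_shift(x, data, h, tol=1e-6, max_iter=100):
    """Mean Shift iteration whose window is defined by cosine distance <= h."""
    y = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        in_window = cosine_distance(y, data) <= h
        if not in_window.any():
            break
        mu = data[in_window].mean(axis=0)
        if np.linalg.norm(mu - y) < tol:
            return mu  # stabilization: a mode was found
        y = mu
    return y

def full_strategy(data, h, merge_tol=1e-3):
    """Run Mean Shift from every sample; samples reaching the same mode share a label."""
    modes, labels = [], []
    for x in data:
        mode = cosine_mean_shift(x, data, h)
        # Reuse an existing mode if this one is (almost) the same direction.
        match = next((k for k, m in enumerate(modes)
                      if cosine_distance(mode, m[None, :])[0] < merge_tol), None)
        if match is None:
            modes.append(mode)
            match = len(modes) - 1
        labels.append(match)
    return np.array(labels), np.array(modes)
```

A typical call would be `labels, modes = full_strategy(ivectors, conversation_bandwidth(h, len(ivectors), lam))`, where `ivectors` holds one normalized i-vector per speaker turn. The sketch assumes a bandwidth below 1, so that every vector in a window points into the same half-space as the window center and the window mean cannot vanish.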
A further weighting with the square root of the corresponding eigenvalues was also applied to these axes in order to emphasize their importance. The authors of [16] recommend choosing the PCA dimensionality so as to retain 50% of the data variance. We will denote this quantity by $r$. Ideally, each retained PCA axis represents the variability due to a single speaker in the conversation. Note that this type of PCA is local in the sense that the analysis is done on a file-by-file basis. This has the advantage that no background data is required to implement it.

B. Within Class Covariance Normalization (WCCN)

Normalizing data variances using a Within Class Covariance matrix has become common practice in the Speaker Recognition field [1][13][15]. The idea behind this normalization is to penalize axes with high intra-class variance by rotating data using a decomposition of the inverse of the Within Class Covariance matrix.

C. Between Class Covariance Normalization (BCCN)

By analogy with the WCCN approach, we propose a new normalization method based on the maximization of the directions of between class variance by normalizing the i-vectors with the decomposition of the between class covariance matrix $B$. The between class covariance matrix is given by the following formula:

$$B = \frac{1}{n} \sum_{i=1}^{I} n_i\, (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^{t} \qquad (17)$$

where the sum ranges over $I$ conversation sides in a background training set, $\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_j^{i}$ is the sample mean of the $n_i$ speaker turns within conversation side $i$, and $\bar{x}$ is the sample mean of all i-vectors.

V. IMPLEMENTATION DETAILS

A. CallHome data

We use the CallHome dataset distributed by NIST during the year 2000 speaker recognition evaluation [18]. CallHome is a multilingual (6 languages) dataset of multispeaker telephone recordings of 1 to 10 minutes duration.

[Fig. 2. CallHome development data set broken down by categories representing the number of participating speakers in conversations.]

[Fig. 3. CallHome test set broken down by categories representing the number of participating speakers in conversations.]

Fig. 2 depicts the development part of the dataset, which contains 38 conversations, broken down by the number of speakers: 2 to 4
speakers. The CallHome test set contains 500 conversations, broken down by the number of speakers in Fig. 3. Note that the number of speakers ranges from 2 to 4 in the development set and from 2 to 7 in the test set, so that there is a danger of overtuning on the development set. For our purposes the development set serves to decide which types of i-vector normalization to use, to fix the bandwidth parameter $h$ in (15), (16) and to determine a strategy for pruning sparsely populated clusters. Because there is essentially only one scalar parameter to be tuned, our approach is not at risk of overtuning on the development set.

B. Feature extraction

1) Speech parameterization: Every 10 ms, Mel Frequency Cepstral Coefficients (MFCC) are extracted from a 25 ms Hamming window (19 MFC coefficients + energy). As is traditional in diarization, no feature normalization is applied.

2) Universal background model: We use a gender-independent UBM containing 512 Gaussians. This UBM is trained with the LDC releases of Switchboard II, Phases 2 and 3; Switchboard Cellular, Parts 1 and 2; and NIST SRE (telephone speech only).

3) I-vector extractor: We use a gender-independent i-vector extractor of dimension 100, trained on the same data as the UBM together with data from the Fisher corpus.

C. I-vector normalization

Among the normalization methods presented in Section IV, only the within and the between class covariance matrices need background data to be estimated. In order to estimate them we used telephone speech (whole conversation sides) from the 2004 and 2005 NIST speaker recognition evaluations.

D. Initial segmentation

The focus in this work is speaker clustering rather than segmentation. Following the authors of [4] [16], we uniformly segmented speech intervals found by a voice activity detector into segments of about one second of duration. This naïve approach to speaker turn segmentation is traditional in diarizing telephone speech, where speaker turns tend to be very short and Viterbi resegmentation is generally applied in subsequent processing.
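The uniform segmentation step can be illustrated with a small Python sketch. The paper only states that speech intervals are cut into segments of about one second; how remainders are handled is not specified, so the choice made here (splitting each interval into equal pieces whose length is as close as possible to the target) is an assumption, as are the function name `uniform_segments` and its parameters.

```python
def uniform_segments(speech_intervals, target=1.0):
    """Split each (start, end) speech interval into equal pieces of about `target` seconds.

    speech_intervals : iterable of (start, end) times in seconds, e.g. from a
                       voice activity detector
    target           : desired segment duration in seconds (assumed default: 1.0)
    """
    segments = []
    for start, end in speech_intervals:
        duration = end - start
        if duration <= 0:
            continue  # skip empty or malformed intervals
        # Number of pieces that brings each piece closest to the target length;
        # intervals shorter than the target are kept as a single segment.
        n = max(1, round(duration / target))
        step = duration / n
        for k in range(n):
            segments.append((start + k * step, start + (k + 1) * step))
    return segments
```

Each resulting segment would then be mapped to one i-vector, so the number of samples to cluster grows roughly linearly with the amount of detected speech.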
Note that the results presented in [16] show that using a reference silence detector offers no significant improvement in comparison to their own speech detector.

E. Evaluation protocol

In order to evaluate the performance of the different systems we use the NIST Diarization Error Rate (DER) as the principal measure of system performance. Using the NIST scoring script md-eval-v21.pl, we evaluate the DER of the concatenated .rttm files produced for all conversations in the development and test sets. As is traditional in speaker diarization of telephone speech, we ignore overlapping speech segments and we tolerate errors of less than 250 ms in locating segment boundaries.

In addition to DER, the Number of Detected Speakers (NDS) and its average calculated over all files (ANDS) are also useful performance evaluation metrics in the context of clustering with unknown numbers of speakers. We adopt a graphical illustration of DER vs. NDS to represent system behavior (Figs. 4 and 5). These graphs are obtained by sweeping out the bandwidth parameter $h$. On these graphs, the actual number of speakers is given by the vertical solid line and the estimated number is given by the dashed line.

VI. RESULTS AND DISCUSSIONS

In this section we provide a detailed study of the effect of the i-vector normalization methods described in Section V-C.

A. Parameter tuning on the development set

[Fig. 4. Results on the development set obtained with PCA i-vector normalization: Full Mean Shift performances (DER / number of estimated speakers). The minimum of DER, the corresponding bandwidth (h) and the number of detected speakers (#Spk) are also given for each PCA reduction factor r = 80, 60, 50 and 30.]

[Fig. 5. Results on the development set obtained with PCA i-vector normalization: Selective Mean Shift performances (DER / number of estimated speakers). The minimum of DER, the corresponding bandwidth (h) and the number of detected speakers (#Spk) are also given for each PCA reduction factor r = 80, 60, 50 and 30.]

In order to establish a benchmark we first ran the two
versions of Mean Shift with PCA normalization of i-vectors. Each graph in Figs. 4 and 5 corresponds to a percentage of retained eigenvalues (r = 80, 60, 50 and 30 respectively). In Figs. 4 and 5 we observe that although the results for the two strategies with r = 30 are slightly better than those with r = 50, the graphs are irregular in the former case, so that taking r = 50 as in [16] seems to be the better course. Note that the optimal DER for all configurations is reached with an overestimation of the number of speakers. Fortunately, overestimation is preferable to underestimation, as it can be remedied by pruning sparsely populated clusters.

Impact of length normalization

We began by testing the effect of length normalization of raw i-vectors before applying PCA. Surprisingly, this simple operation improves the DER in absolute terms (row 3, Len. n., in Table 1). With length normalization and r = 50, DER decreases from 11.9 (see Fig. 4) to 10 (Full strategy) and from 1. (see Fig. 5) to 10. (Selective strategy). Furthermore, the number of detected speakers (NDS) in the case of the Selective strategy decreases from 33 to 81, thus approaching the actual value of 103. However, in the case of the Full strategy the detected NDS increases from 177 to 316.

TABLE 1
RESULTS ON THE DEVELOPMENT TEST SET ILLUSTRATING THE EFFECT OF DIFFERENT NORMALIZATION METHODS (DER IS THE DIARIZATION ERROR RATE, NDS THE NUMBER OF DETECTED SPEAKERS, h THE BANDWIDTH AND p THE PRUNING PARAMETER). THE ACTUAL NUMBER OF SPEAKERS IS 103.
Columns: Norm. method | Full MS: DER, NDS, h, p | Selective MS: DER, NDS, h, p
Rows: Len. n. | WCC | BCC | Var. h | Prun.

Impact of within class covariance normalization

In this experiment we first normalize i-vectors using the Cholesky decomposition of the inverse of the WCC matrix, and follow this with length normalization and PCA projection. As we see in row 4 of Table 1, WCC normalization causes performance degradation. The DER increases from 10 to 11.7 in the Full case and from 10. to 11.7 in the Selective case compared to both previous normalization methods, namely PCA and length normalization.
These results were not in line with our expectations derived from our experience in speaker recognition; they may be due to an interaction between the PCA and WCC normalizations.

Impact of between class covariance normalization

We proceeded in a similar way to WCC normalization. We project data using the Cholesky decomposition of the BCC matrix, followed by length normalization and PCA projection. In row 5 of Table 1 we notice a remarkable twofold improvement compared with the length normalization row. On the one hand, we obtain a DER decrease from 10 to 7.6 for the Full strategy case and from 10. to 7.7 for the Selective case. On the other hand, we detect a number of speakers much nearer to the actual value of 103, particularly in the Selective case (189 speakers).

Conversation-dependent bandwidth Mean Shift

We applied the variable bandwidth scheme given in formula (16) to the previous BCC normalization system. In row 6 of Table 1, we observe a slight improvement in DER for both strategies.

Cluster pruning

Although we succeed in reducing the DER from ~12 to ~7 for both strategies, the estimated number of speakers corresponding to the minimum of DER is still higher than the actual value. As discussed in Section III-F, we prune clusters containing a small number of samples (less than or equal to a constant p) in order to counter this tendency. The corresponding results appear in the last row of Table 1. We observe that for the Full strategy, merging clusters having one instance (p = 1) reduces the estimated number of speakers from 300 to 109 while the DER slightly increases from 7.5 to 8.3. For the Selective strategy, with p = 3 we get a clear improvement in the Number of Detected Speakers (111 speakers instead of 3) while the DER is essentially unaffected, decreasing from 7.6 to 7.5.

B. Results on the test set

As we explained when discussing the evaluation protocol (Section V-E), we now present the results obtained on the test set by using the parameters (bandwidth and the pruning factor p) tuned on the development set. Table 2 presents the most important results. The term Fix.
h in row 3 of Table 2 refers to the best system using a fixed bandwidth, presented in row 5 of Table 1 (BCC). In this system we used BCC normalization followed by length normalization and PCA projection with r = 50, as optimized on the development set. In row 4 of Table 2 (Var. h), the system is exactly the same as the previous one (the Fix. h system) but with a variable bandwidth. Finally, the last row of Table 2 (Prun.) shows the impact of cluster pruning on the variable bandwidth system (Var. h).

TABLE 2
RESULTS ON TEST DATA SET USING OPTIMAL PARAMETERS ESTIMATED ON THE DEVELOPMENT SET. THE TOTAL ACTUAL NUMBER OF SPEAKERS IS 183.
Columns: Norm. method | Full MS: DER, NDS, h, p | Selective MS: DER, NDS, h, p
Rows: Fix. h | Var. h | Prun.

From the results in Table 2 we observe the usefulness of the variable bandwidth in reducing the DER from 14.3 to 1.7 in the Full MS case and from 13.9 to 1.6 in the Selective case. Observe also that the number of detected speakers (NDS) is reduced from 3456 to 550 in the Full MS strategy and from 3089 to 310 in the Selective case. Finally, in the test set, cluster pruning leads to a degradation of the
DER from 1.6 to 14.3 using the Selective strategy, in contrast to what is observed on the development set, where the DER showed a slight improvement. However, cluster pruning on the test set for the Full case is surprisingly helpful, to the point that the DER (1.4) coincides perfectly with the one optimized on the development set (see the bandwidth h in row 7 of Table 1). Given that the above results are obtained using the parameters tuned on an independent dataset, this confirms the generalization capability of the cosine-based Mean Shift for both clustering strategies.

Among the publications reporting results on the CallHome dataset [17][18][19][][1], only Vaquero's thesis presents results based on the total DER calculated over all files [1]. He uses speaker factors rather than i-vectors to represent speech segments and he used a multistage system based principally on Hierarchical Agglomerative Clustering (HAC), k-means and Viterbi segmentation. He also estimated tunable parameters on an independent development set consisting solely of two-speaker recordings. However, he was constrained to provide the actual number of speakers as a stopping criterion for HAC in order to achieve a total DER of 13.7 on the CallHome test set. Without this constraint, the performance was ... Compared to his results, we were able to achieve a 37% relative improvement in the total DER (see Table 2).

In summary, we presented some results on development and test sets from which we can draw the following conclusions:

- Length normalization of the raw i-vectors before PCA projection helps in reducing DER.
- PCA with r = 50 offers the best configuration.
- WCC normalization degrades performance.
- BCC normalization, followed by length normalization and PCA, helps to decrease both DER and NDS.
- Variable bandwidth combined with cluster pruning (p = 1), applied after length normalization, PCA projection and BCC normalization, helps in reducing DER and NDS in the Full case.
- Both strategies, namely Full and Selective, perform equivalently well on development and test sets.

In Fig. 6 we depict the best i-vector normalization protocol that we adopt in this study.

[Fig. 6. The best protocol of i-vector normalization for the MS diarization systems: Raw i-vectors, BCC normalization, Length normalization, PCA projection, Diarization system.]

C. Results broken down by the number of speakers

In order to compare our results with those of [17][18][19][] we need to adopt the same convention for presenting diarization results. As mentioned in Section V-E, which describes the evaluation protocol, these works present results broken down by the number of participating speakers. Indeed, the official development set consisted of conversations with just 2 to 4 speakers, so it is hard to avoid tuning on the test set if one wishes to optimize performance on conversations with large numbers of speakers.

TABLE 3
FULL MEAN SHIFT RESULTS ON TEST SET DEPICTED AS A FUNCTION OF THE NUMBER OF PARTICIPATING SPEAKERS.
Columns: Speakers number | h / p
Rows (Dev. param.): DER Fix. h | ANDS (p = 0) | DER Var. h | ANDS (p = 0)
Rows (Test param.): DER Fix. h | ANDS (p = 0) | DER Var. h | ANDS (p = 0)

In Tables 3 and 4 we present results broken down by the number of speakers on the test set for the Full and Selective Mean Shift algorithms, with two tunings, one on the development set (rows 2 to 5) and the other on the test set (rows 6 to 9). Recall that the tunable parameters are the nature of the bandwidth (i.e. fixed or variable), its value (i.e. h) and the pruning factor p. It is apparent from the tables that all of the Mean Shift implementations generalize well from the development set to the test set.

From Table 3 we observe firstly that the Full MS implementation does not need any final cluster pruning (i.e. p = 0) when we optimize taking account of the number of participating speakers (see the last column of Table 3). Second, estimating the number of speakers works better with a fixed bandwidth (see rows 3 and 7 of Table 3), and the DERs are almost comparable to those obtained with a conversation-dependent bandwidth (i.e. Var. h).
Generally speaking, variable bandwidth helps in reducing DER for recordings having a small number of speakers (2, 3, 4 speakers). Finally, the most important observation from Table 3 is the high generalization capability of the Full MS, especially in the fixed bandwidth case. Comparing rows 2 and 3 with rows 6 and 7 of Table 3 we see that the optimal parameters for the test set are the same as those for the development set.

TABLE 4
SELECTIVE MEAN SHIFT RESULTS ON THE TEST DATA SET DEPICTED AS A FUNCTION OF THE NUMBER OF PARTICIPATING SPEAKERS.
[Columns: number of speakers; h / p. Rows: DER and ANDS for Fix. h and Var. h with development-set parameters, then DER and ANDS for Fix. h and Var. h with test-set parameters.]
From the results depicted in Table 4 we observe that the final cluster pruning is necessary in the Selective MS case. Compared to the Full MS results in Table 3, we observe that the DERs are similar but the Selective strategy outperforms the Full one regarding the average number of detected speakers (ANDS). The combination of the variable bandwidth with the final cluster pruning (p = 3) enables us to get the best results, both for DER and for the Average Number of Detected Speakers (see rows 4 and 5 and rows 8 and 9 in Table 4). The ANDS values are in fact very close to the actual numbers (row 6 vs. row 1) with a slight overestimation, except in the case of the 6-speaker files, where there is a slight underestimation (5.8). Finally, we observe that the Full strategy generalizes better than the Selective one, in the sense that we were able to reach the best performance on the test set using parameters tuned on the development set.

Viterbi re-segmentation

Refining segment boundaries between speaker turns using Viterbi re-segmentation is a standard procedure for improving diarization system performance. Results reported in Table 5 show its effectiveness when combined with the Mean Shift algorithms. Note that the results without Viterbi re-segmentation (gray entries in Table 5) are the results presented in the 6th and 8th rows of Tables 3 and 4.

TABLE 5
IMPACT OF VITERBI RE-SEGMENTATION ON THE TEST SET RESULTS USING PARAMETERS ESTIMATED ON TEST DATA, DEPICTED AS A FUNCTION OF THE NUMBER OF PARTICIPATING SPEAKERS AND MEASURED WITH DER.
[Rows: Full MS and Selective MS, each with Fix. h and Var. h, without and with Viterbi re-segmentation.]

Comparison with existing state-of-the-art results

We conclude this section with a comparison between our results and those obtained by other authors on the CallHome data, although there are several factors which make back-to-back comparisons difficult. Contrary to [21] and our work, the authors of [17][19][20] did not use a development set independent of the test set for parameter tuning. Furthermore, in [19] and [20] the authors assumed prior hypotheses about the maximum number of speakers within a slice of speech, and in [19] estimating the number of speakers was done separately from speaker clustering.

We compare graphically in Fig. 7 the results, as measured by DER, of our best configurations with Viterbi re-segmentation of the Full and Selective strategies (i.e. the Full and Selective systems presented in the 4th and 8th rows of Table 5, respectively) with those in [19][20][17]. It is evident that our results as measured by DER are in line with the state of the art. To be clear, since our results were taken from Tables 3 and 4, there was some tuning on the test set, as in Dalmasso et al. [19], Castaldo et al. [20] and Shum et al. [17]. Furthermore, the comparison based on the average number of detected speakers is not possible except in the case of Dalmasso et al. [19]. In Table 6 we compare our best results from Table 4 using this criterion with those of [19]. The results are similar, although the Mean Shift algorithm tends to overestimate the number of speakers.

TABLE 6
COMPARISON WITH DALMASSO RESULTS BASED ON THE AVERAGE NUMBER OF DETECTED SPEAKERS (ANDS).
[Rows: actual number of speakers; Dalmasso et al. [19]; Selective MS.]

Fig. 7 Comparison of Full and Selective Mean Shift clustering algorithms with state-of-the-art results based on DER for each category of CallHome test set recordings having the same number of speakers.

D. Time complexity

Time complexity is not a major concern in this study, but Fig. 8 illustrates the difference between the Full and the Selective strategies in this regard. The average time for the Full case is [...] seconds per file vs. [...] seconds for the Selective case.

Fig. 8 Time complexities of the Full and Selective strategies calculated in seconds on each conversation of the development set. The horizontal lines indicate the processing times averaged over all files.
[Axes: time (s) against recordings of the CallHome development set; curves: Full, Selective.]
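The ANDS figures reported per category in Tables 3, 4 and 6 are averages of the detected speaker counts over all files sharing the same actual number of speakers. A minimal sketch of that aggregation follows; the function name and the per-file records are hypothetical and are not values from the paper.

```python
from collections import defaultdict


def average_detected_speakers(records):
    """Average the detected speaker count per actual speaker count.

    records: iterable of (actual_speakers, detected_speakers) pairs, one per file.
    Returns a dict mapping each actual count to its ANDS, sorted by actual count.
    """
    grouped = defaultdict(list)
    for actual, detected in records:
        grouped[actual].append(detected)
    return {actual: sum(counts) / len(counts)
            for actual, counts in sorted(grouped.items())}


# Hypothetical per-file outcomes, for illustration only.
records = [(2, 2), (2, 3), (3, 3), (3, 4), (3, 3), (6, 5), (6, 6)]
ands = average_detected_speakers(records)
```

An ANDS above the actual count for a category indicates overestimation of the number of speakers, and a value below it indicates underestimation.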
VII. CONCLUSION

This paper provides a detailed study of the application of the non-parametric Mean Shift algorithm to the problem of speaker clustering in diarizing telephone speech conversations, using two variants of the basic clustering algorithm (the Full and Selective versions). We have supplied in the Appendix a convergence proof which justifies our extension of the Mean Shift algorithm from the Euclidean distance metric to the cosine distance metric. We have shown how, together with an i-vector representation of speaker turns, this simple approach to the speaker clustering problem can handle several difficult problems: short speaker turns, varying numbers of speakers and varying conversation durations. With a single-pass clustering strategy (that is, without Viterbi re-segmentation) we were able to achieve a 37% relative improvement as measured by global diarization error rate on the CallHome data, using as a benchmark [21], the only other study that evaluates performance in this way. We have seen how our results using other metrics are similar to the state of the art as reported by other authors [16][17][19][20].

We have seen that refining speaker boundaries with Viterbi re-segmentation is also helpful. Using segment boundaries obtained in this way could serve as a good initialization for a second pass of Mean Shift clustering. An interesting complication that would arise in exploring this avenue is that speaker turns would be of much more variable duration than in the first pass, which is based on the uniform segmentation described in Section V.D. Since the uncertainty entailed in estimating an i-vector is greater in the case of short speaker turns than in the case of long speaker turns, this suggests that taking account of this uncertainty as in [32] would be helpful.

APPENDIX

In this appendix we present the mathematical convergence proof of the cosine distance-based Mean Shift. Indeed, this proof is very similar to that of Theorem 1 presented in [8].

Theorem 1 [8]: if the kernel k has a convex and monotonically decreasing profile, the sequence $\{\hat f(i)\}_{i=1,2,\ldots}$ converges and is monotonically increasing.
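The monotonic behaviour stated in Theorem 1 can be checked numerically on toy data. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes the profile k(x) = max(0, 1 - x), which is convex and non-increasing and whose negative derivative g is a flat kernel, and it assumes the profile is evaluated at the cosine distance divided by h. The helper names, the synthetic unit vectors and the bandwidth h = 0.3 are all hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def density(y, points, h):
    """Unnormalised density estimate: sum_j k((1 - y.x_j) / h), k(x) = max(0, 1 - x)."""
    return sum(max(0.0, 1.0 - (1.0 - dot(y, x)) / h) for x in points)


def mean_shift_step(y, points, h):
    """One update: mean of the points within cosine distance h of y, re-normalised.

    With the assumed profile, g = -k' equals 1 inside the window and 0 outside,
    so the weighted mean reduces to a plain mean over the window.
    """
    window = [x for x in points if 1.0 - dot(y, x) <= h]
    summed = [sum(column) for column in zip(*window)]
    return normalize(summed)


def run(points, start, h, n_iter=30):
    """Iterate from a starting point and record the density after each step."""
    y = start
    trace = [density(y, points, h)]
    for _ in range(n_iter):
        y = mean_shift_step(y, points, h)
        trace.append(density(y, points, h))
    return y, trace


# Hypothetical toy data: two noisy groups of unit vectors in three dimensions.
random.seed(0)
centres = [normalize([1.0, 0.2, 0.0]), normalize([-0.3, 1.0, 0.5])]
points = [normalize([c + random.gauss(0.0, 0.15) for c in centre])
          for centre in centres for _ in range(40)]

# Start from a data point so that the first window is never empty.
mode, trace = run(points, points[0], h=0.3)
```

Under these assumptions, the recorded values in `trace` should never decrease from one iteration to the next, which is the property the theorem asserts for the sequence of density estimates.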
Let us suppose that all vectors in our dataset are constrained to lie on the unit sphere by normalizing their Euclidean norm during the MS convergence process. Then

$$\hat f(i+1) - \hat f(i) = c \sum_{j=1}^{n} \left[ k\!\left(\frac{1 - y_{i+1} \cdot x_j}{h}\right) - k\!\left(\frac{1 - y_i \cdot x_j}{h}\right) \right].$$

Due to the convexity of the profile,

$$k(x_2) \ge k(x_1) + k'(x_1)\,(x_2 - x_1),$$

and since $g(x) = -k'(x)$ from (6), then

$$k(x_2) - k(x_1) \ge g(x_1)\,(x_1 - x_2).$$

We obtain:

$$\hat f(i+1) - \hat f(i) \ge c \sum_{j=1}^{n} g\!\left(\frac{1 - y_i \cdot x_j}{h}\right) \left[ \frac{1 - y_i \cdot x_j}{h} - \frac{1 - y_{i+1} \cdot x_j}{h} \right] = c \sum_{j=1}^{n} g\!\left(\frac{1 - y_i \cdot x_j}{h}\right) \frac{(y_{i+1} - y_i) \cdot x_j}{h}.$$

We know from (8) and (11) that the (i+1)th position $y_{i+1}$ is equal to the weighted mean vector, so

$$\sum_{j=1}^{n} g\!\left(\frac{1 - y_i \cdot x_j}{h}\right) y_{i+1} = \sum_{j=1}^{n} g\!\left(\frac{1 - y_i \cdot x_j}{h}\right) x_j.$$

Thus:

$$\hat f(i+1) - \hat f(i) \ge c \sum_{j=1}^{n} g\!\left(\frac{1 - y_i \cdot x_j}{h}\right) \frac{y_{i+1} \cdot y_{i+1} - y_i \cdot y_{i+1}}{h} = c \sum_{j=1}^{n} g\!\left(\frac{1 - y_i \cdot x_j}{h}\right) \frac{1 - y_{i+1} \cdot y_i}{h} \ge 0,$$

with equality iff $y_{i+1} = y_i$. The sequence $\{\hat f(i)\}_{i=1,2,\ldots}$ is bounded and monotonically increasing, and so is convergent. This argument does not show that $\{y_i\}_{i=1,2,\ldots}$ is convergent (it may be possible to construct pathological examples in which $\{\hat f(i)\}_{i=1,2,\ldots}$ converges but $\{y_i\}_{i=1,2,\ldots}$ does not), but it establishes convergence of the Mean Shift algorithm in the same sense as convergence of the EM algorithm is demonstrated in [33].

ACKNOWLEDGMENT

First we would like to thank the editor as well as the reviewers for their helpful comments. Then, we would like to thank Stephen Shum and Najim Dehak from MIT for their useful discussions and feedback and also for sharing their initial segmentation with us. We would also like to thank our colleague Vishwa Gupta for his help with the Viterbi re-segmentation software and our colleague Pierre Ouellet for his help with other software.

REFERENCES

[1] G. Schwarz, "Estimating the dimension of a model," Ann. Statist., vol. 6.
[2] S. S. Chen and P. Gopalakrishnan, "Clustering via the Bayesian information criterion with applications in speech recognition," in ICASSP '98, vol. 2, Seattle, USA, 1998.
[3] F. Valente, "Variational Bayesian methods for audio indexing," Ph.D. dissertation, Eurecom, Sep. 2005.
[4] P. Kenny, D. Reynolds and F.
Castaldo, "Diarization of Telephone Conversations using Factor Analysis," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 6, Dec. 2010.
[5] M. Kotti, V. Moschou and C. Kotropoulos, "Speaker segmentation and clustering," Signal Processing, vol. 88, no. 5, May 2008.
[6] K. Fukunaga and L. Hostetler, "The estimation of the gradient of a density function, with applications in pattern recognition," IEEE Trans. on Information Theory, vol. 21, no. 1, pp. 32-40, January.
[7] Y. Cheng, "Mean Shift, Mode Seeking, and Clustering," IEEE Trans. PAMI, vol. 17, no. 8.
[8] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, May 2002.
[9] T. Stafylakis, V. Katsouros, and G. Carayannis, "Speaker clustering via the mean shift algorithm," in Odyssey 2010: The Speaker and Language Recognition Workshop, Brno, Czech Republic, June 2010.
[10] T. Stafylakis, V. Katsouros, P. Kenny, and P. Dumouchel, "Mean Shift Algorithm for Exponential Families with Applications to Speaker Clustering," Proc. Odyssey Speaker and Language Recognition Workshop, Singapore, June 2012.
[11] M. Senoussaoui, P. Kenny, P. Dumouchel and T. Stafylakis, "Efficient Iterative Mean Shift based Cosine Dissimilarity for Multi-Recording Speaker Clustering," in Proceedings of ICASSP, 2013.
[12] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, May 2011.
[13] N. Dehak, R. Dehak, J. Glass, D. Reynolds, and P. Kenny, "Cosine Similarity Scoring without Score Normalization Techniques," Proc. IEEE Odyssey Workshop, Brno, Czech Republic, June 2010.
[14] N. Dehak, Z. Karam, D. Reynolds, R. Dehak, W. Campbell, and J. Glass, "A Channel-Blind System for Speaker Verification," Proc. ICASSP, Prague, Czech Republic, May 2011.
[15] M. Senoussaoui, P. Kenny, N. Dehak and P. Dumouchel, "An i-vector Extractor Suitable for Speaker Recognition with both Microphone and Telephone Speech," in Proc. Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, June 2010.
[16] S. Shum, N. Dehak, E. Chuangsuwanich, D. Reynolds, and J. Glass, "Exploiting Intra-Conversation Variability for Speaker Diarization," Proc. Interspeech, Florence, Italy, August 2011.
[17] S. Shum, N. Dehak, and J. Glass, "On the Use of Spectral and Iterative Methods for Speaker Diarization," Proc. Interspeech, Portland, Oregon, September 2012.
[18] A. Martin and M. Przybocki, "Speaker recognition in a multi-speaker environment," in Proceedings of Eurospeech, 2001.
[19] E. Dalmasso, P. Laface, D. Colibro, and C. Vair, "Unsupervised Segmentation and Verification of Multi-Speaker Conversational Speech," Proc. Interspeech, 2005.
[20] F. Castaldo, D. Colibro, E. Dalmasso, P. Laface, and C. Vair, "Stream-based speaker segmentation using speaker factors and eigenvoices," in Proceedings of ICASSP, 2008.
[21] C. Vaquero Avilés-Casco, "Robust Diarization For Speaker Characterization (Diarizacion Robusta Para Caracterizacion De Locutores)," Ph.D. dissertation, Zaragoza University, 2011.
[22] N. Dehak, P. Torres-Carrasquillo, D. Reynolds, and R. Dehak, "Language Recognition via Ivectors and Dimensionality Reduction," Proc. Interspeech, Florence, Italy, August 2011.
[23] D. Martinez, O. Plchot, L. Burget, O. Glembek and P. Matejka, "Language Recognition in iVectors Space," Proceedings of Interspeech, Florence, Italy, August 2011.
[24] H. Tang, S. M. Chu, M. Hasegawa-Johnson and T. S. Huang, "Partially Supervised Speaker Clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 5, pp. 959-971, May 2012.
[25] P. Kenny, "Joint factor analysis of speaker and session variability: theory and algorithms," Technical report CRIM-06/08-14, 2006.
[26] P. Kenny, G. Boulianne, and P. Dumouchel, "Eigenvoice modeling with sparse training data," IEEE Transactions on Speech and Audio Processing, May 2005.
[27] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5.
[28] B. Georgescu, I. Shimshoni, and P. Meer, "Mean shift based clustering in high dimensions: A texture classification example," in Proceedings of International Conference on Computer Vision.
[29] U. Ozertem, D. Erdogmus, and R. Jenssen, "Mean shift spectral clustering," Pattern Recognition, vol. 41, no. 6, June 2008.
[30] D. Garcia-Romero, "Analysis of i-vector length normalization in Gaussian-PLDA speaker recognition systems," in Proceedings of Interspeech, Florence, Italy, Aug. 2011.
[31] D. Comaniciu, V. Ramesh, and P. Meer, "The Variable Bandwidth Mean Shift and Data-Driven Scale Selection," Proc. Eighth Intl. Conf. Computer Vision, vol. I, July 2001.
[32] P. Kenny, T. Stafylakis, P. Ouellet, J. Alam, and P. Dumouchel, "PLDA for Speaker Verification with Utterances of Arbitrary Duration," in Proceedings of ICASSP, Vancouver, Canada, May 2013.
[33] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1-38.

M. Senoussaoui received the Engineer degree in Artificial Intelligence in 2005 and the Magister (Masters) degree in 2007 from Université des Sciences et de la Technologie d'Oran, Algeria. Currently he is a PhD student in the École de technologie supérieure (ÉTS) of Université du Québec, Canada, and also with Centre de recherche informatique de Montréal (CRIM), Canada. His research interests are concentrated on the application of Pattern Recognition and Machine Learning methods to the speaker verification and diarization problems.

P. Kenny received the BA degree in Mathematics from Trinity College, Dublin and the MSc and PhD degrees, also in Mathematics, from McGill University. He was a professor of Electrical Engineering at INRS-Télécommunications in Montreal from 1990 to 1995, when he started up a company (Spoken Word Technologies) to spin off INRS's speech recognition technology. He joined CRIM in 1998, where he now holds the position of principal research scientist. His current research interests are in text-dependent and text-independent speaker recognition with particular emphasis on Bayesian methods such as Joint Factor Analysis and Probabilistic Linear Discriminant Analysis.

T. Stafylakis received the Diploma degree in electrical and computer engineering from the National Technical University of Athens (NTUA), Athens, Greece, and the M.Sc. degree in communication and signal processing from Imperial College London, London, U.K., in 2004 and 2005, respectively. He received his Ph.D. from NTUA on speaker diarization, while working for the Institute for Language and Speech Processing, Athens, as a research assistant. Since 2011, he has been a post-doc researcher at CRIM and ÉTS, under the supervision of Patrick Kenny and Pierre Dumouchel, respectively. His current interests are speaker recognition and diarization, Bayesian modeling and multimedia signal analysis.

P. Dumouchel received the B.Eng. (McGill University), M.Sc. (INRS-Télécommunications) and PhD (INRS-Télécommunications) degrees, and has over 5 years of experience in the field of speech recognition, speaker recognition and emotion detection. Pierre is Chairman and Professor at the Software Engineering and IT Department at École de technologie supérieure (ÉTS) of Université du Québec, Canada.