Disagreement-Based Multi-System Tracking

Dsagreement-Based Mult-System Trackng Quannan L 1, Xnggang Wang 2, We Wang 3, Yuan Jang 3, Zh-Hua Zhou 3, Zhuowen Tu 1 1 Lab of Neuro Imagng, Unversty of Calforna, Los Angeles 2 Huazhong Unversty of Scence and Technology 3 Natonal Key Laboratory for Novel Software Technology, Nanjng Unversty Abstract. In ths paper, we tackle the trackng problem from a fuson angle and propose a dsagreement-based approach. Whle most exstng fuson-based trackng algorthms work on dfferent features or parts, our approach can be bult on top of nearly any exstng trackng systems by explotng ther dsagreements. In contrast to assumng mult-vew features or dfferent tranng samples, we utlze exstng well-developed trackng algorthms, whch themselves demonstrate ntrnsc varatons due to ther desgn dfferences. We present encouragng expermental results as well as theoretcal justfcaton of our approach. On a set of benchmark vdeos, large mprovements (20% 40%) over the state-ofthe-art technques have been observed. 1 Introducton Object trackng has been a long standng problem n vson. Once a tracker gets ntalzed, t starts to track the target n a vdeo by performng two steps: (1) makng a predcton about the locaton of the target, and (2) updatng ts object model (locaton, appearance, and shape) based on the predcton. Ths s n sprt very smlar to the bootstrappng and learnng procedure n a learnng algorthm. Wth the recent success n detecton-based trackng approaches, an ncreasng amount of work has treated the trackng problem as a sem-supervsed learnng problem [1 5]. Pckng a target to track at the begnnng provdes supervsed data; the remanng of the frames for the tracker to explore do not contan label nformaton and thus s unsupervsed. Due to the errors ntroduced n both the predcton and model updatng stage, nearly any tracker wll eventually fal wth the errors beng accumulated over the tme. Dsagreement-based sem-supervsed learnng approaches [6], such as co-tranng or tr-tranng [7, 8], provde a mechansm to allow classfers traned on dfferent vews or data samples to explot unlabeled data. The learnng process s a type of ensemble learnng [9 11]. It nvolves multple classfers whch label the unlabeled data to update and mprove each other [12]. From a dfferent angle, the use of multple classfers can be vewed as a fuson problem and t has been shown that fusng complementary features n a trackng system often leads to enhanced performances [13 15]. However, less efforts have been made n learnng to fuse well-developed exstng algorthms through sem-supervsed learnng; we wll see

2 Quannan L, et.al. later (n both theory and experments) that a dsagreement-based fuson sgnfcantly mproves the performance over drect combnaton of features/systems [14, 16, 17]. In ths dsagreement-based mult-system trackng approach, we seek a balance between the current tracker and the level of agreements among other trackers. Our ntuton s to fnd the locaton where the current tracker s confdent but dsagrees wth other trackers, whle other trackers reach a hgh degree of agreement. We provde both theoretcal and expermental evdence to our approach and show much mproved results over the state-of-the-art technques on benchmark vdeos. 2 Related Work A number of trackng methods have been proposed to perform fuson[13,14,18, 19, 16, 15, 17]. Dfferent from [13, 17] where multple parts were tracked and correlated, we deal wth a sngle target. In [14, 16] multple trackers were fused but these trackers represent dfferent features and they were drectly combned. In [18] the trackng approach was combned va the weghted combnaton of the PDFs. Dfferent from [18], our method does not perform drect multplcaton but seeks a balance between the PDF of one tracker and the degree of agreement by the other trackers; also, n our method, each tracker performs predcton separately mantanng certan ndependence and patches at the agreed postons can be recommended to update the other trackers. In [20], the trackng combnaton method s traned for specfc scenaros. Dfferent from [20], our method s based on the dsagreement-based sem-supervsed learnng and do not requre an off-lne tranng process; also, t can be appled to general vdeos, and performs very well on a farly large benchmark dataset. In [21], mutual nformaton was used for the fuson. Here, the proposed fuson approach s based on the dsagreements among the trackers. The most related work to our approach s [2], where the co-tranng dea was used to retran classfcaton-based trackers. However, [2] followed the standard co-tranng mplementaton usng one specfc type of classfer, SVM. In [19] several trackng algorthms were combned n a Bayesan framework whereas we here emphasze dsagreement-based fuson through sem-supervsed learnng. In dsagreement-based sem-supervsed learnng, much of the work has been focused on usng mult-vew features [7] or dfferent data samples [8]. The sprt of all such knd of approaches [7,22,8, 23] s to tran multple classfers wth dsagreements, and then label the unlabeled nstances for each other to update/mprove the model. [24] provded PAC bounds wth mult-vew features, whle [12] provded a suffcent condton for mult-vew as well as sngle-vew features. Recently, a suffcent and necessary condton was proved for dsagreementbased sem-supervsed learnng, by establshng a connecton between dsagreementbased and graph-based approaches [25]. In ths paper, we emphasze takng advantages of havng varous well-developed trackng algorthms. In the democratc co-learnng framework [26], dfferent al-

Dsagreement-Based Mult-System Trackng 3 gorthms are also used; however, ther approach s for classfcaton and a drect votng of all the methods s used. A man dfference between trackng and classfcaton s that there s no labelng nformaton provded once the trackng process starts (t s a dynamc system), whereas most dsagreement-based semsupervsed learnng algorthms can stll use labeled data n retranng. Notce that the exstence of large dsagreements among the classfers s a premse for the learnng or trackng process to contnue [12], whle the predcton s made by seekng the agreements among the classfers. For example, n [22, 23], classfers are learned so that they not only ft the supervsed data well, but also themselves reach a hgh degree of agreement; the tr-tranng algorthm [8] uses confdent and agreed data from two classfers to help the thrd classfer. Our agreement s used n the predcton stage lke the bootstrappng stage n [26], and we further emphasze the consstency wth the nformaton provded by the current tracker. Our work s also related to the actve learnng lterature [27] but we do not have humans n the loop. 3 Dsagreement-Based Trackng The problem of makng predctons n a trackng system has ts own unque characterstc, and drectly applyng the standard co-tranng formulaton [2] may not necessarly yeld a good soluton. Instead, we take advantages of havng well-developed exstng algorthms, experts, and combne them by explotng the dsagreements among the experts. The dfferences n the ntrnsc desgn of the exstng systems wll naturally lead to a certan amount of bases/varatons, a property the dsagreement-based approaches requres [12]. 3.1 Predcton of Sngle Tracker In ths secton, we frst clarfy our notatons for a sngle tracker. A tracker can be vewed as a learner denoted by h t = (A, f t, X t ) snce t always updates tself. Here, A s a specfc trackng method e.g. mean-shft tracker [28], or partcle flterng [29]; f t s the underlyng appearance model about the target at tme t whch can be represented by a dscrmnatve model [5], generatve model [4], or template matchng [28]; X t s the poston of the target at tme stamp t. Gven a new mage I t+1 at tme stamp t+1, tracker h t makes a predcton, X t+1, about the poston of the target and updates ts underlyng appearance model to f t+1. We can vew trackng a target of a tracker A as computng q t+1 A (x) p A(y x = +1 I t+1 (x), f t ) p(x t+1 = x, X t ) and x q t+1 A (x) = 1 (1) Here y x = +1 ndcates the occurrence of target at locaton x and I t+1 (x) s an mage patch centered at x. Moton coherence s assumed that the predcton on the tme stamp t + 1 s smooth w.r.t. to the predcton on the tme stamp t,

4 Quannan L, et.al. for example, p(x t+1 = x, X t ) can be a constant wthn a neghborhood of X t and zero outsde. Ths corresponds to the local search strategy adopted by most of the trackers. Now that q t+1 A (x) [0, 1] ndcates how lkely x t+1 s the correct poston for the target. For an exstng trackng algorthm, t may not strctly follow the formulaton as n Eq. (1), but we stll can use t so long as t outputs a probablty map for the predcton. 3.2 Dsagreement of Trackers Suppose we have a set of exstng trackers (experts) for makng a predcton n a trackng system S = {h, = 1..n} wth n 3 beng the number of trackers and each h s a tracker traned by trackng algorthm A. Gven an nput I t+1 at a tme stamp t + 1, each tracker h computes a q t+1 (x) to make a predcton of random varable X. Let p t+1 (x) denote the ground probablty map whch ndcates how lkely x s the correct poston, our general objectve s to combne the probablty maps by dfferent trackers to obtan hgh probablty modes n the ground-truth p t+1 (x). A drect way to fuse the multple trackers s by lnearly combnng the probablty maps together [15]. Here, we call t drect tracker fuson (DTF), whch serves as a baselne algorthm: q t+1 (x) = 1 n n =1 q t+1 (x) (2) wth the hope that q t+1 (x) p t+1 (x) as each tracker beng unbased and ndependent. Algorthms lke [15] perform n ths way wth an adapton n the weghtng parameters. The target locaton s retreved by ˇx t+1 = arg max x q t+1 (x). In DTF, at each tme, all trackers use the same predcton, ˇx t+1, and each tracker updates ts appearance model to f t+1 based on ˇx t+1 separately and contnues the trackng process. Fusng trackers leads to mprovement over the orgnal ones (see Secton 3.2 and Secton 4 for theoretcal and emprcal justfcaton respectvely). However, predctng the poston for the target w.r.t. Eq. (2) has a bg drawback,.e., the average performance q t+1 (x) of the n trackers may be degenerated by one bad tracker n the group. Here we gve an example to llustrate ths: suppose there are four trackers f 1, f 2, f 3, f 4 and two canddate postons x and x at t+1, where x s the correct poston for the target. The outputs of the four trackers on the two canddate postons are q 1 (x ) = 0.9, q 2 (x ) = 0.4, q 3 (x ) = 0.4, q 4 (x ) = 0.4 and q 1 (x ) = 0.1, q 2 (x ) = 0.6, q 3 (x ) = 0.6, q 4 (x ) = 0.6. The tracker f 1 s very confdent but dsagrees wth other trackers and makes a wrong predcton. To some extent, ths knd of tracker whch s confdent but dsagrees wth other trackers can be thought of as a outler tracker. If we fuse the four trackers wth drect tracker fuson (DTF), the poston x wll be predcted as the poston for the target accordng to x = arg max x q t+1 (x). Unfortunately,

Dsagreement-Based Mult-System Trackng 5 we get a wrong poston x due to the outler tracker f 1 at t + 1 although three trackers make correct predcton wth confdence larger than 0.5. Let ζ t+1 denote the probablty mass on such an event that the average performance q t+1 (x) s degenerated by some outler tracker f,.e., the other n 1 trackers agree wth each other and predct the correct poston wth hgh confdence whle the DTF n Eq. (2) predct the poston wrongly due to the outler tracker f at t + 1. Next, we gve the formulaton for combnng the multple trackers based on ther dsagreement to avod ths knd of event for the purpose of robustness. Gven n trackers, we stll let each tracker perform predcton separately. If the current tracker s confdent but dsagrees wth other trackers whle other trackers reach a hgh degree of agreement, the current tracker s prone to be drfted to the agreed poston of other trackers to reach more robust predctons. Our ntuton s that we seek a balance between the generated dstrbuton q t+1 (x) of the current tracker and the degree of agreement by the other trackers as (x) = (1 α)q t+1 (x) + α n 1 [ δ( j, qj t+1 (x) TH) n j=1,j q t+1 j (x)] and the specfc locaton by the -th tracker s x t+1 = arg max x (x). TH s a threshold correspondng to a confdence zone. α balances the mportance of each tracker s own predcton and the nfluence from other trackers. The dervaton of TH and α wll be gven n Secton 3.2. Note that the second term s non-zero only when all the other trackers have a hgh-degree agreement; ths s dfferent from the tradtonal fuson-based trackng [15] where weghted sum s performed; n addton, we emphasze that Eq. (3) focuses manly on the places wth hgh probablty and t s not necessary to ft p t+1 (x) at all xs as n the general statstcal learnng; our dsagreement formulaton n Eq. (3) can take advantage of ths property. Eq. (3) can be understood as the followng: f the current tracker dsagree wth the other trackers whle the other trackers are confdent and agree wth each other, the predcton of the current tracker wll be nfluenced towards the agreed locaton (dependng upon the overall probablty map); otherwse, tracker h gves out a predcton as f there were no other trackers. In such a way, the trackers can keep relatve ndependence and also enable confdent nteractons between each other. Ths makes our approach robust to outler trackers. In addton, usng the agreement of other trackers gves the overall system an ablty to be self-aware of when the system starts to drft. Ths happens when all trackers have hgh entropy of q t+1 (x) wth large dsagreement. The overall output s then gven by x t+1 = argmax x (x) and (x) = 1 n n =1 (3) (x) (4)

6 Quannan L, et.al. Note that x t+1 s the output of the overall system but t does not partcpate n the retranng of the ndvdual trackers. The pseudo code of dsagreementbased trackng s shown n Fg. 1. Trackng based on the dsagreements among the trackers shows advantage over usng a drect combnaton and we justfy ths pont both theoretcally and emprcally n the followng sectons. Gven n trackers {h, = 1..n}, each tracker h = (A, f t, X t ) adopts a specfc trackng method A. At the tme stamp t = 0, a target s manually dentfed located at X 0. All trackers start wth the same X 0 and obtan ther appearance model f 0. Gven a new mage I t+1 at tme stamp t + 1, Each tracker h searches a local neghborhood around X t and generate a probablty map q t+1 usng Eq. (1). Fnd modes of x t+1 for (x) as n Eq. (3). x t+1 keeps a balance between the estmaton of the current tracker and the level of agreements among other trackers. Assgn X t+1 = x t+1, sample patches around X t+1 and update the appearance model of each tracker to f t+1 usng the embedded model updatng/learnng rule n A Based on x t+1 = arg max x Qt+1 (x), report the x t+1 as the trackng result for dsagreement-based trackng (DBT). Fg.1. Pseudo code of dsagreement-based trackng. Theoretcal Justfcaton We frst show that a lnear combnaton of multple trackers as n Eq. (2), drect tracker fuson (DTF), gans mprovement over the ndvdual systems. Let p t+1 (x) denote the ground truth whch ndcates how lkely x s the correct poston, and let q t+1 (x) [0, 1] be the output of algorthm A. Lemma 1. If we take an average of the predctons from all the experts: q t+1 (x) = 1 n n =1 qt+1 (x) as n Eq. (2), then the average s bounded n a PAC sense. We suppose that the n trackers are ndependent and unbased: then q t+1 (x) p t+1 (x) as n +. Proof. For any small ǫ > 0, wth Hoeffdng nequalty, we get that P( q t+1 (x) p t+1 (x) ǫ) 2 exp( 2nǫ 2 ). Lemma 1 shows that q t+1 (x) can converge to the ground truth p t+1 (x) exponentally. Let error t+1 denote the error rate of the tracker f at t + 1,.e., the probablty that f predcts a wrong poston for the target at t + 1, error t+1 mn = mn {error t+1 } and errormax t+1 = max {error t+1 }, we gve the followng theorem to show that fusng the multple trackers accordng to Eq. (3) and Eq. (4) wll mprove the performance at least ζ t+1 n(errormax) t+1 n 1, contrastng to the drect tracker fuson (ζ t+1 was defned n the prevous secton).

Dsagreement-Based Mult-System Trackng 7 Theorem 1. If we fuse the multple trackers accordng to Eq. (3) and Eq. (4) where α 2 3 and TH 1 2, contrastng to the drect tracker fuson n Eq. (2), the performance at t+1 can be mproved at least ζ t+1 n(errormax) t+1 n 1, where ζ t+1 s the probablty of the event that the average performance q t+1 (x) s degenerated by some outler tracker. Proof. Let X t+1 denote the set of canddate postons at t + 1. If there s some x X t+1, at whch q t+1 (x ) TH for all {1,...,n}, t s easy to fnd that such x s unque, snce TH 0.5 (Here we neglect the probablty mass on the event that at t + 1 there are two postons x and x at whch q t+1 (x) = q t+1 (x ) = 1 2 ). Consderng Eq. (3) we get Qt+1 (x ) = 1 n n k=1 qt+1 k (x ), and x wll be selected as the trackng result, no matter whether argmax (x) or argmax q t+1 s used. If for any x X t+1 there are less than n 1 trackers wth q t+1 (x) TH, then the second term of Eq. (3) s zero. So for any x X t+1, (x) = 1 α n n k=1 qt+1 k (x). Predctng the tracng result accordng to arg max (x) s equal to predctng accordng to argmax q t+1. Next we analyze the stuaton when there s some x X t+1, at whch q t+1 ( x) < TH and q t+1 j ( x) TH for all j. Obvously, such x s also unque, snce TH 0.5 and n 3. Case 1: x s the correct poston for the target at t + 1. We wll show that even f q t+1 ( x) s very close to 0,.e., tracker f s an outler at t + 1, t wll not degenerate the fuson of the multple trackers due to Eq. (3). We obtan Thus, ( x) = (1 α)q t+1 ( x) + α n 1 j,j n k=1 j q t+1 j ( x) (1 α)q t+1 ( x) + α TH (5) ( x) = (1 α)qt+1 j ( x) (1 α) TH k ( x) (1 α)q t+1 ( x) + α TH + (1 α)(n 1)TH (6) For an ncorrect poston x x, snce x X q t+1 t+1 k (x) = 1, t s easy to see that (x ) = (1 α)q t+1 (x ) (1 α)(1 q t+1 ( x)) Therefore, j,j (x ) = (1 α)qj t+1 (x ) (1 α)(1 q t+1 ( x)) (1 α)(1 TH) (7) n k=1 k (x ) (1 α)(1 q t+1 ( x)) + (1 α)(n 1)(1 TH) (8)

8 Quannan L, et.al. We see that n general n k=1 Qt+1 k ( x) n k=1 Qt+1 k (x ) for α 2 3 and TH 1 2. Ths makes the correct poston more robust,.e., the predcton of the dsagreementbased trackng wll never be nfluenced even f f s a outler tracker. So the mprovement s at least ζ t+1. Case 2: x s not the correct poston for the target at t + 1. Snce q t+1 j ( x) TH for all j and TH 1 2, n 1 trackers predct the wrong poston x as the trackng result at t + 1. Now we bound the probablty of such event. Snce error t+1 j errormax t+1 and the multple trackers are assumed to be ndependent, the probablty mass on the event that n 1 trackers predct the poston mstakenly s at most n(errormax t+1 )n 1. The worst stuaton s that the fuson accordng to Eq. (3) performs worse than the drect tracker fuson n case 2 completely. We get Theorem 1 proved. From Theorem 1 we know that the fuson wll get beneft from Eq. (3) under the stuaton that one bad tracker degenerates the drect tracker fuson. When n (the number of the trackers) s large, t would be dffcult for the remanng n 1 trackers to acheve some agreement (See the experment Non-Relax n Secton 4). In practce, we can relax ths constrant, e.g., when two or more trackers acheve agreement, the agreement term would take effect. Note that the performance of Eq. (2) and Eq. (3) depends on the correlaton between the trackers. The correlaton depends on two factors: (1) the ntrnsc desgn of the trackers; (2) the tranng samples used to tran the trackers. Two trackers wth the same desgn traned on the same set of samples are hghly correlated, and two dfferent types of trackers traned on the same set of samples are more correlated than those traned on dfferent set of samples. If the n trackers are the same, then usng Eq. (3) shows no advantage over Eq. (2). In summary, Lemma 1 suggests that fusng the multple experts drectly mght gan exponental mprovement, contrastng to the sngle tracker; Theorem 1 shows that our dsagreement-based fuson method can provde more robustness to the trackng system, whch motvates the use of Eq. (3) by keepng a balance between the current expert f and the agreement from the other experts. 4 Experments In the experments, we make a comprehensve comparson between the performance of dsagreement-based trackng, drect tracker fuson, and the ndvdual trackers. Four trackers are used and the experment s conducted on 11 commonly tested vdeos (lsted n Table 1). The trackers used are MlTracker [5], the semsupervsed on-lne boostng tracker (semboost)[3], Incremental Vsual Tracker (IVT)[4], and Incremental Vsual Tracker usng edge nformaton (IVTE). Snce the ndvdual trackers perform predcton separately, the computatonal complexty of the proposed method only adds slght overhead over the ndvdual ones wth the mult-core processor and parallel computng. Compared wth the large performance gan, ths computatonal overhead s tolerable. From the experments, we observe that, statstcally, each ndvdual tracker gets sgnfcantly mproved by usng Eq. (3): the average center locaton error

Dsagreement-Based Mult-System Trackng 9 has been reduced by more than 12 pxels and the success rate has ncreased by 20% 40%. The result of DBT (Dsagreement-Based Trackng) also outperforms the system by drectly combnng the orgnal trackers,.e., DTF, wth 3.3 pxels reducton n center locaton error and 4.4% mprovement n success rate. DBT also sgnfcantly outperforms PROST [30]. 4.1 Implementaton detals Whle a wealthy body of trackng papers/systems have been reported, we found a few systems (wth avalable source code) havng decent performance on general vdeos. Here, we provde bref descrptons for these trackers we used wth necessary changes made to them. MlTracker adopts an onlne multple nstance learnng algorthm to tran a dscrmnatve classfer. In order to handle the ambguty of sampled patches, a bag of potentally postve mage patches are extracted. MlTracker mantans a pool of Haar features and the onlne boostng mechansm s adopted. SemBoostTracker also adopts an onlne boostng mechansm and t formulates the update process n a sem-supervsed fashon combned wth a gven pror. Ths helps to allevate the drftng problem. The IVT ncrementally learns a low dmensonal egenspace representaton to model the appearance changes of the object. In IVT model, the target s represented as a vector of gray-scale value, and the moton s modeled by an affne mage warpng. To propagate sample dstrbutons over tme, a partcle flter framework s adopted. Snce both MILTracker and SemBoostTracker do not support affne transformaton, we dsabled the scalng and rotatng ablty of IVT. IVTE s smlar to IVT, except that t uses level set as the feature. The forms of the 4 trackng systems outputs are rather dfferent. MILTracker and SemBoostTracker produce scores on local search regons; IVT and IVTE propagate probabltes va partcles. In the experment, we map the scores of MILTracker and SemBoostTracker to the range [0, 1] to produce probablty maps q MIL and q SBT (The probablty maps are normalzed to make sure that x qt+1 A (x) = 1). For IVT and IVTE, we keep the poston entres (a, b ) of the partcle and use a parzen wndow approach to estmate the probablty for predcton. For a pont x = (a, b) on the mage, ts probablty s calculated as q IV T (x) = M =1 w max{0, 1 (a a ) 2 + (b b ) 2 /L}. In our experment, L s set to 15. A map q IV TE s produced smlarly as q IV T for IVTE. Based on q MIL, q SBT, q IV T, and q IV TE we respectvely compute the correspondng Q MIL, Q SBT, Q IV T, and Q IV TE usng Eq. (3) (Snce 4 trackers are used, we relaxed Eq. (3) that when 2 trackers acheve confdent agreement, the agreement term wll take effect) and thus, each tracker makes ts own predcton separately. As we have mentoned before, x t+1 found by mean shft algorthm [28] can represent multple ponts (modes). For each tracker, e.g., MILTracker, the one mode wth the maxmum value s reported as ts predcton, sgnfcant modes found are used to retran the tracker and as the seeds for further search at the next tme stamp. For the results reported n our experment, α s set to be 0.67 as suggested by the theoretcal secton. The threshold TH n Eq. (3) s

10 Quannan L, et.al. set as 0.8/(Λ/3) (Λ s the sze of the search wndow). For each tracker, 2 modes are kept. 4.2 Quanttatve Results The average center locaton error The average center locaton error s a commonly used metrc to measure the performance of trackng and s defned as the average error between the predcted locatons to the ground truth. In Table 1, we summarze the results of the average center locaton error on all the 11 vdeos. It s clear that our dsagreement based tracker outperforms the ndvdual trackers, Drect tracker fuson, and the co-tranng scheme. For nonrelax (usng Eq. (3) drectly wthout relaxaton), t s less possble for the trackers to acheve some agreements, and the chance of nteracton between trackers s reduced. Stll the result s better than the ndvdual trackers. Table 1. Comparson of Average Center Locaton Error. Non-Relax ndcates to use Eq. (3) drectly wthout relaxaton;co-tranng stands for the results usng co-tranng method. vdeos MlT IVT IVTE SBT DTF Co-Tranng Non-Relax DBT Grl 31.9 25.2 18.1 19.3 20.6 39.8 23.3 13.4 CokeCan 20.5 55.3 11.0 14.9 9.3 49.0 7.9 6.6 Tger1 15.9 71.9 56.6 20.9 37.7 64.1 49.0 31.2 Sylv 10.9 44.0 19.5 16 19.5 31.7 7.3 10.8 StatOcc 27.8 3.3 4.8 74.4 2.5 41.2 26.2 3.0 Davd 22.9 4.9 16.9 26.4 7.0 9.2 7.9 4.1 Clffbar 12.0 31.4 78.6 29.9 27.1 45.6 16.7 8.5 Surfer 9.2 6.7 23.9 67.6 5.1 4.7 5.1 4.8 faceocc2 20.1 14.2 9.1 17 6.5 12.4 12.0 6.1 Indoor 17.2 30.2 193.5 116.5 4.7 61.9 10.7 4.5 faceocc 27.1 11.8 11.3 6.8 8.9 23.3 7.6 9.7 In all 20.8 21.8 22.3 37.3 11.5 32.7 15.8 8.2 The success rate If the locaton error on one frame s less than a prespecfed threshold, the predcton s regarded as a successful predcton. The success rate s defned as the rato of successful predctons over all the predctons. To compute the success rate, we set T as certan rato of the average wdth of the target,.e., T = β (w + h)/4, where, w and h are the wdth and heght of the target respectvely. Conceptually, ths s smlar to the overlap score evaluaton used n [30] and thus, success rate s conceptually smlar to the tracked percentle n [30]. We observe that, the trackers can not track the target precsely at all tmes, f there s no overlap but the predcton of the tracker s not far from the target or wthn the search area of the tracker, t s often possble for the trackng process to recover. Our evaluaton measurement can reflect such a phenomenon. We compare the success rate n Table 2 and from Table 2, the DBT acheves the hghest success rate.

Dsagreement-Based Mult-System Trackng 11 Table 2. Comparson of success rate when β = 0.5. MlT IVT IVTE SBT DTF Non-Relax DBT 0.596 0.733 0.753 0.592 0.862 0.84 0.900 Comparson of probablty maps In Fg. 2, we show how the probablty maps are generated n Eq. (2) and Eq. (3) on a testng vdeo, Tger1. DTF drfts from the 30th frame; on ths frame, the predctons of the trackers are rather dverse. In DTF, however, the ndvdual tracker s predcton s not fully respected and t has to comply wth the voted predcton; ths s the prmary reason for drftng and gettng trapped; usng Eq. (3) however leads to a more robust predcton. As we can see from the second fgure, on the 30th frame, MILTracker and Semboost tracker acheves certan agreement, but IVT and IVTE stll comples wth ts own predcton snce the agreement s weak. On the 35th frame, when MILTracker, IVT and Semboost tracker acheve confdent agreement, IVTE s pulled back to the agreed poston and the four trackers merge agan. The beneft of our dsagreement-based trackng s obvous: the trackers then keep ther relatvely dfferent traces and the rsk of gettng trapped s reduced. (a) (b) (c) Fg. 2. Illustraton of the probablty maps where four trackers (experts) are adopted (the fgures have been scaled for vsualzaton). (a) shows the results by DTF. (b) and (c) dsplay the probablty maps generated by dsagreement-based trackng. The probablty maps nsde the dashed yellow rectangle are shown below the screen shots. Underneath each fgure, from left to rght, the probablty maps are IVT, IVTE, MIL- Tracker, Semboost tracker respectvely (see the dscussons about these trackers n the experments). For DTF n (a), the ffth probablty map s the combned map. For dsagreement-based trackng n (b) and (c), the frst rows shows the orgnal probablty maps, and the second row shows the computed by Eq. (3). Comparson wth other methods PROST [30] s another fuson based trackng algorthm that adopts 3 trackers (a template model, an optcal-flow based mean-shft tracker and an onlne random forest tracker). Table 3 and Table 4 compare the average locaton errors and the tracked percentage (computed usng the overlap score n [30]) wth PROST. From the two tables, we can observe that, our dsagreement-based trackng outperforms PROST and acheves better tracked percentages on most of the vdeos. The best expermental performance of Democratc Integraton n [15] was acheved by usng unform qualtes, whch assgned equal weghts to all the

12 Quannan L, et.al. Table 3. Comparson of average center locaton error wth [30] Method Grl tger1 sylv Davd faceocc faceocc2 [30] 19.0 7.2 10.6 15.3 7.0 17.2 Ours 13.4 31.2 10.8 4.1 9.7 6.1 Table 4. Comparson of tracked percentage wth [30] Method Grl tger1 sylv Davd faceocc faceocc2 [30] 89 79 73 80 100 82 Ours 97 30 83 100 100 100 clues, and corresponded drectly wth DTF. In addton, we mplemented the qualty measure of normalzed salency and the performance was not as good as DBT: ther average center locaton error s 13.8 wth success rate 0.87. We dd not get the mplementaton of [2]. Nevertheless, we dd experment on some vdeos used n [2] and the results of DBT are better than [2] qualtatvely (skpped here due to page lmt). Moreover, we ndeed mplemented co-tranng and reported the result n Table 1 (average center locaton error 32.7), whch s much worse than DBT. Table 5. Performance by varyng α and T H (T H = R/(Λ/3)) R/α 0.8/0.3 0.8/0.67 0.8/0.85 0.7/0.67 0.9/0.67 Average Error 10.0 8.2 9.8 9.95 9.8 Success Rate 0.875 0.90 0.895 0.87 0.89 Robustness by varyng the parameters In table 5, we summarze the performance of dsagreement-based trackng by varyng the parameters α and TH, whch are the two key parameters n Eq. (3). We can see from ths table, by varyng TH and α, the results (especally the success rate) do not change too much. Ths demonstrates the robustness of dsagreement-based trackng. The average center locaton error has relatvely larger change because on portons of the vdeos Tger1 and Indoor, dsagreement-based trackng gets dstracted to postons dstant from the targets. In such cases, the success rate does not vary too much, but the center locaton error s ncreased. Screenshots of the results In Fg. 3, we compare the trackng results on the vdeo Grl. Ths vdeo undergoes several challenges: fast appearance change and occluson. Although both MILTracker and IVTE can track the face of the grl successfully, the trackng process s not very stable. IVT drfts from the face at the 20th frame. From the 391th frame, drect tracker fuson also drfts and get trapped at the background of the mages. As can be seen from both the screen shots and the error plot on the rght of Fg. 3, we fnd that dsagreementbased trackng tracks most robustly and accurately. In Fg. 3, we can also fnd

Dsagreement-Based Mult-System Trackng 13 Fg.3. Comparson of trackng results on the vdeo Grl. The frst row on the left shows the results of dsagreement-based trackng; the second row on the left shows the results of 4 ndvdual trackers and drect tracker fuson; the rght plot shows the comparson of center locaton error (the number of the x axs denotes the number of predctons). The colorng scheme s: dotted green: IVT, dotted black: MILTracker, dotted blue: IVTE, dotted yellow: Semboost tracker, sold Red: dsagreement-based trackng, and sold magenta: Drect tracker fuson. a very nce property of democratc trackng that, the traces of the four trackers are smlar but not exactly the same, thus, they can explore dfferent spaces, recommend confdent samples to other trackers, and thus avod to be trapped at an ncorrect poston. 5 Concluson In ths paper, we have ntroduced a dsagreement-based trackng method whch fuses multple exstng trackng systems n the followng way that seeks a balance between the coherence of the current tracker and the degree of agreements among other trackers. In such a way, t enables the nteracton between trackers and keeps the appealng characterstcs of the trackers at the same tme. As llustrated n the experments, the balance comples wth the characterstc of trackng. Dsagreement-based trackng can be bult on top of varous exstng well-developed trackng systems utlzng ther ntrnsc bases. Adoptng several state-of-the-art trackng algorthms, our approach s able to mprove each of them by a large margn on wdely used benchmark vdeos n the lterature Acknowledgement. Zhuowen Tu and Quannan L are supported by Offce of Naval Research Award N000140910099 and NSF CAREER award IIS-0844566; We Wang, Yuan Jang and Zh-Hua Zhou are supported by Natonal Natural Scence Fund of Chna 60975043. References 1. Avdan, S.: Ensemble trackng. In: CVPR. (2005) 2. Tang, F., Brennan, S., Zhao, Q., Tao, H.: Co-trackng usng sem-supervsed support vector machnes. In: ICCV. (2007) 3. Grabner, H., Lestner, C., Bschof, H.: Sem-supervsed on-lne boostng for robust trackng. In: ECCV. (2008) 4. Lm, J., Ross, D., Ln, R.S., Yang, M.H.: Incremental learnng for vsual trackng. Int l J. of Comp. Vs. (2008)

14 Quannan L, et.al. 5. Babenko, B., Yang, M.H., Belonge, S.: Vsual trackng wth onlne multple nstance learnng. In: CVPR. (2009) 6. Zhou, Z.H., L, M.: Sem-supervsed learnng by dsagreement. Knowledge and Informaton Systems 24 (2010) 415 439 7. Blum, A., Mtchell, T.: Combnng labeled and unlabeled data wth co-tranng. In: COLT. (1998) 92 100 8. Zhou, Z.H., L, M.: Tr-tranng: Explotng unlabeled data usng three classfers. IEEE Trans. on Know. and Data Eng. (2005) 9. Detterch, T.G.: Ensemble methods n machne learnng. In: INTERNATIONAL WORKSHOP ON MULTIPLE CLASSIFIER SYSTEMS, Sprnger-Verlag (2000) 1 15 10. Kuncheva, L.I.: A theoretcal study on sx classfer fuson strateges. IEEE Tran. on PAMI 24 (2002) 281 286 11. Bennett, K.P., Demrz, A., Macln, R.: Explotng Unlabeled Data n Ensemble Methods. In: ACM SIGKDD 02 Edmonton, Alberta CA. (2002) 12. Wang, W., Zhou, Z.H.: Analyzng co-tranng style algorthms. In: ECML. (2007) 13. Wu, Y., Huang, T.: Robust vsual trackng by ntegratng multple cues based on co-nference learnng. In l J. of Comp. Vs (2004) 14. Sebel, N., Maybank, S.: Fuson of multple trackng algorthms for robust people trackng. In: ECCV. (2002) 15. Tresch, J., von der Naksvyrg, C.: Democratc ntegraton: Self-organzed ntegraton of adaptve vsual cues. In: Neuroal Computaton. (2001) 16. Spengler, M., Schele, B.: Towards robust mult-cue ntegraton of vsual trackng. In: MVA. (2003) 17. Kwon, J., Lee, K.M.: Vsual trackng decomposton. In: CVPR. (2010) 18. I Lechter, M Lndenbaum, E.R.: A general framework for combnng vsual trackersthe black boxes approach. It l J. of Comp. Vs (2006) 19. Zhong, B., Yao, H., Chen, S., J, R.R., Yuan, X., Lu, S., Gao, W.: Vsual trackng va weakly supervsed learnng from multple mperfect oracles. In: CVPR. (2010) 20. Stenger, B., Woodley, T., Cpolla, R.: Learnng to track wth multple observers. In: CVPR. (2009) 21. Mundy, J.L., Chang, C.F.: Fuson of ntensty, texture, and color n vdeo trackng based on mutual nformaton. In: AIPR. (2010) 22. Collns, M., Snger, Y.: Unsupervsed models for named entty classfcaton. In: EMNLP. (1999) 23. Leskes, B., Torenvlet, L.: The value of agreement a new boostng algorthm. J. of Comp. and Sys. Sc. (2008) 24. Dasgupta, S., Lttman, M.L., McAllester, D.: Pac generalzaton bounds for cotranng. In: NIPS. (1999) 25. Wang, W., Zhou, Z.H.: A new analyss of co-tranng. In: ICML, Hafa, Israel (2010) 1135 1142 26. Zhou, Y., Goldman, S.: Democratc co-learnng. In: Inte l Conf. on Tools wth Art. Intell. (2004) 27. Dasgupta, S.: Coarse sample complexty bounds for actve learnng. In: NIPS. (2005) 28. Comancu, D., Ramesh, V., Meer, P.: Real-tme trackng of non-rgd objects usng mean shft. In: CVPR. (2000) 29. Isard, M., Blake, A.: Condensaton - condtonal densty propagaton for vsual trackng. Int l J. on Comp. Vs. (1998) 30. Santner, J., Lestner, C., Saffar, A., Pock, T., Bschof, H.: Prost: Parallel robust onlne smple trackng. In: CVPR. (2010)