Bi-label Propagation for Generic Multiple Object Tracking




Wenhan Luo 1, Tae-Kyun Kim 1, Björn Stenger 2, Xiaowei Zhao 1, Roberto Cipolla 3
1 Imperial College London, 2 Toshiba Research Europe, 3 University of Cambridge
{w.luo12,k.kim,x.zhao}@imperial.ac.uk, bjorn@cantab.net, cipolla@eng.cam.ac.uk

Abstract

In this paper, we propose a label propagation framework to handle the multiple object tracking (MOT) problem for a generic object type (cf. pedestrian tracking). Given a target object by an initial bounding box, all objects of the same type are localized together with their identities. We treat this as a problem of propagating bi-labels, i.e. a binary class label for detection and individual object labels for tracking. To propagate the class label, we adopt clustered Multiple Task Learning (cMTL) while enforcing spatio-temporal consistency and show that this improves the performance when given limited training data. To track objects, we propagate labels from trajectories to detections based on affinity using appearance, motion, and context. Experiments on public and challenging new sequences show that the proposed method improves over the current state of the art on this task.

1. Introduction

Multiple Object Tracking (MOT) plays an important role in the computer vision literature. The problem is difficult due to frequent occlusions and appearance similarity between objects. Owing to advances in object detection (especially in pedestrian detection [9, 11]), in some cases the task can be solved efficiently using a tracking-as-detection approach. However, generalizing the task to other objects (see our data sets in Sec. 4) would require training a detector for each new object type, which is impractical. In this paper we deal with the problem of tracking multiple objects of the same generic type given only one initial bounding box [18], and our task is to recover multiple trajectories from image observations. Treating sliding windows as points in a spatio-temporal cuboid and the initial bounding box as a single labeled point, we aim to discover and track new objects by propagating labels to similar candidates.
From this perspective, our problem shares great similarity with semantic video segmentation [3] which aims to label all the pixels in a video given pixel labels in the first frame. However, these two problems have significant differences: labels in video segmentation involve only a fixed number of pre-defined classes, whereas labels in our problem involve both binary classes (object vs. background) and multiple classes (specific object identities). Thus the number of classes in our problem varies as objects appear or disappear. Also, in video segmentation more than one pixel can share the same label while in our case object labels are exclusive.

Figure 1: The proposed framework. Yellow arrows indicate the propagation of class labels within the same frame and white arrows indicate object label propagation over time (best viewed in color).

We treat the labels as a combination of binary class labels and object labels (identities), and we refer to detection responses as an intermediate layer between image observations and trajectory estimations. Furthermore, we propose a sequential label propagation framework (Fig. 1) to propagate class labels and object labels in both spatial and temporal domains. This so-called bi-label propagation framework coincides with a tracking-by-detection strategy: through spatially propagating the class labels (yellow arrows in Fig. 1), we solve the detection problem, discovering the appearance and disappearance of objects; by temporally propagating object labels (white arrows in Fig. 1), we tackle the multi-object tracking problem.

Learning a robust detector from a single training instance is challenging and standard methods tend to either overfit (e.g. using a kernel Support Vector Machine (SVM)) or underfit (e.g. using a linear SVM). To address the generalization issue, we train multiple detectors inspired by ensemble learning. Multiple detectors are inherently related to each other since they are dealing with the same type of objects.

The motivation of Multiple Task Learning (MTL) [10] is to learn multiple related tasks simultaneously rather than independently. Thus, we treat training each of the detectors as one task and adopt clustered MTL (cMTL) [32] to regularize the training process of multiple detectors. In addition, we assume that images and hence detection results do not change drastically from frame to frame. We model this spatio-temporal consistency by integrating it into the cMTL formula during the class label propagation.

Our key contributions are (1) proposing a probabilistic framework for jointly propagating class and object labels in spatial and temporal domains for generic MOT and (2) introducing cMTL for generic object detection and improving it by considering the spatio-temporal consistency.

2. Related work

MOT methods can be grouped into two categories [23]: sequential (or online) approaches, which output trajectories on the fly, and batch (or off-line) approaches, which output results after processing all frames.

Sequential approaches derive a cost function and estimate the lowest cost state based on sophisticated appearance, motion and interaction models. For example, to maintain discrimination of individual objects, Yang et al. [29] fuse multiple components: bags of local features, a head model, and a color model of torso regions. In [6], generic object category and instance-specific information are integrated to track multiple objects in a particle filter framework. Inspired by crowd simulation models, a dynamic model considering social motion patterns is introduced in [21]. Similarly, Yamaguchi et al. [27] develop an agent-based behavior model taking social and environmental factors into account to predict destinations of pedestrians. The work in [14] estimates object motion based on structured crowd patterns and learns spatio-temporal variations using a set of hidden Markov models.

Batch approaches exhibit a delay in outputting results, but they tend to be more robust as they can access all observations simultaneously.
Typical batch approaches [7, 12, 15, 28] cast the problem as a data association problem, linking short-term observations such as single detection responses or tracklets into longer trajectories using methods such as the Hungarian algorithm [25], greedy bipartite matching [24], min-cost network flow [8, 26], K-Shortest Paths [4], or discrete-continuous Conditional Random Fields (CRF) [19].

Methods for generic object detection in video data require either pre-trained detectors [30] or off-line training [1]. Models are adapted to a given input video in order to improve the detection accuracy, e.g. by iterative boosting [1].

The closest work to ours is coupled detection and tracking [16, 26]. However, most work assumes that a detector is available that has been trained off-line. For example, [26] use a dictionary of foreground images for pedestrian detection together with background subtraction. The work in [16] employs off-line trained pedestrian and car detectors. In terms of problem setting, we follow the model-free approaches in [18, 31]. The method of Zhang and van der Maaten requires initialization with bounding boxes of all objects and in contrast to our method does not discover new similar objects [31]. Luo and Kim first train a generic object detector, and subsequently employ the detector to regularize the training of multiple trackers [18]. In contrast to this approach, we learn detection with the help of tracking, i.e. the spatio-temporal consistency, as well as tracking based on detection, in a joint optimization framework.

Figure 2: (a) Our graphical model. (b) Top to bottom: sliding windows X, detection responses Y, and trajectories Z. For sake of display, we only show two trajectories (best viewed in color).

3. Bi-label Propagation

3.1. Bayesian perspective

Let $X$, $Y$ and $Z$ represent sliding windows (image observations), detection responses and trajectories, respectively. Fig. 2(a) shows our graphical model which has three layers: image observation, detection, and trajectory layer, respectively.
The darkly shaded nodes are observed nodes, the transparent nodes are hidden (or latent) nodes, and the lightly shaded nodes ($Y_0$ and $Z_0$) are partly observed as we are given only a single initial bounding box in the first frame. From the image layer to the detection response layer we propagate class labels. From the detection response layer to the trajectory layer we propagate object labels. Solving our problem corresponds to maximizing $P(Z \mid X)$. Introducing variable $Y$, we obtain

$$
\max_{Z} P(Z \mid X) \approx \max_{Z,Y} P(Z \mid X, Y)\, P(Y \mid X)
= \max_{Z,Y} \prod_{t} P(Z_t \mid X_t, Y_t, Z_{0:t-1})\, P(Y_t \mid X_t, Y_{t-1}), \tag{1}
$$

where $P(Y \mid X)$ models class label propagation (detection) and $P(Z \mid X, Y)$ models object label propagation (tracking). We expand it sequentially as

$$
\max_{Z_t, Y_t} P(Z_t \mid X_t, Y_t, Z_{0:t-1})\, P(Y_t \mid X_t, Y_{t-1}), \tag{2}
$$

and solve this estimation problem by decomposition. Taking the negative logarithm of Eq. 2, we rewrite it as:

$$
\min_{W_t, \Theta_t} L_C(W_t) + L_O(\Theta_t), \tag{3}
$$

where $L_C(W_t)$ models class label propagation, $L_O(\Theta_t)$ models object label propagation and $W_t$, $\Theta_t$ are parameters representing the detector and propagation configuration at time $t$. To minimize the function, we (1) fix $\Theta_t$ to minimize $L_C$ via $W_t$; (2) fix $W_t$, minimize $L_O$ via $\Theta_t$; (3) set $t \leftarrow t + 1$ (go to the next frame).

3.2. Class label propagation

Let us review the Bayesian formula of class label propagation $P(Y_t \mid X_t, Y_{t-1})$ in Eq. 2. We want to maximize the likelihood of $Y_t$ conditioned on observations $X_t$ (spatial domain) and the previous estimation $Y_{t-1}$ (temporal domain). Our detection problem differs from the traditional detection problem as we do not have sufficient data to handle large intra-class variation. Fig. 3 illustrates the extent of intra-class variation in three test videos. As training a single classifier leads to underfitting or overfitting, we train multiple detectors and make a decision based on all of them. Moreover, by treating training each detector as one task, we investigate the relationship among multiple detectors and adopt clustered MTL to train these detectors simultaneously, improving the generalization ability.

In the first frame, we add small perturbations to the initial bounding box (slight shift, rotation, scale changes) to augment the positive data. Sliding windows whose overlap (intersection/union) with the positive samples is between 0.2 and 0.3 are negative samples. In the following frames, we collect confident instances as positive samples and augment the training data in the same way. By randomly sampling a subset of instances from the whole training data without replacement $m$ times, we obtain $m$ sets of training data $X_{l,t,i} \in \mathbb{R}^{d \times N_{t,i}}$, $i = 1, \dots, m$, and their labels $Y_{t,i} \in \{1, -1\}^{N_{t,i}}$, where the subscript $l$ means labeled, $d$ is the feature space dimension and $N_{t,i}$ is the number of instances.
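The sample selection described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the helper names `iou`, `select_negatives` and `draw_task_subsets` are hypothetical, and a window's overlap with "the positive samples" is assumed to mean its largest overlap with any single positive. Only the 0.2 to 0.3 overlap band and the idea of drawing m subsets without replacement come from the text.

```python
import random

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def select_negatives(windows, positives, low=0.2, high=0.3):
    """Keep windows whose best overlap with any positive lies in [low, high].

    Assumption: 'overlap with the positive samples' is taken as the maximum
    IoU over all positives; the paper does not spell this out.
    """
    negatives = []
    for w in windows:
        best = max(iou(w, p) for p in positives)
        if low <= best <= high:
            negatives.append(w)
    return negatives

def draw_task_subsets(samples, m, fraction, seed=0):
    """Draw m training subsets, one per detector (task).

    Each subset is sampled without replacement from the full training set;
    `fraction` (the share of samples per task) is an illustrative parameter.
    """
    rng = random.Random(seed)
    size = max(1, int(round(fraction * len(samples))))
    return [rng.sample(samples, size) for _ in range(m)]
```

For instance, with a single positive `(0, 0, 10, 10)`, the window `(6, 0, 16, 10)` has an overlap of 0.25 and would be kept as a negative, whereas `(5, 0, 15, 10)` (overlap 1/3) would not.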
Let the multiple detectors be $W_t = [w_{t,1}, \dots, w_{t,m}] \in \mathbb{R}^{d \times m}$. Using the least square error the data cost term is $\sum_{i=1}^{m} \| X_{l,t,i}^{T} w_{t,i} - Y_{t,i} \|^2$. The detectors are related as they are dealing with objects of the same type. Meanwhile, as a result of data distribution a cluster of instances are more similar to each other compared with others, e.g. some instances exhibit a similar viewpoint while some do not. Consequently, some detectors will be closer to each other in the model parameter space. We therefore assume that the detectors form $k$ clusters $C_j$, $j = 1, \dots, k$, and model the coupling among all detectors following [32]:

$$
\sum_{j=1}^{k} \sum_{v \in C_j} \| w_v - \bar{w}_j \|^2 = \mathrm{tr}(W^{T} W) - \mathrm{tr}(F^{T} W^{T} W F), \tag{4}
$$

where $\bar{w}_j$ is the mean of the detectors within the same cluster, $\mathrm{tr}(\cdot)$ is the trace, and $F \in \mathbb{R}^{m \times k}$ is an orthogonal cluster indicator matrix with $F_{i,j} = 1/\sqrt{n_j}$ if $i \in C_j$ and $F_{i,j} = 0$ otherwise. Along with regularization of each detector, $\sum_{i=1}^{m} \| w_i \|^2 = \mathrm{tr}(W^{T} W)$, we have a regularization term $\mathrm{tr}(W((1 + \eta) I - F F^{T}) W^{T})$, where $\eta$ is a weight parameter. Following the convex relaxation of cMTL [32], this regularization term is relaxed to $\mathrm{tr}(W (\eta I + M)^{-1} W^{T})$, subject to $\mathrm{tr}(M) = k$, $M \preceq I$, $M \in S^{m}_{+}$, where $S^{m}_{+}$ is the set of positive semi-definite (PSD) matrices and $M \preceq I$ means $I - M$ is PSD.

Figure 3: Illustration of intra-class variance. Shown are cropped regions from (a) the Airshow sequence, (b) the Goose sequence and (c) the Hockey sequence.

Traditional MOT applies a detector to every frame independently. By contrast, we find that detection responses in two subsequent frames should not change drastically. To utilize such information, we track confident instances via a weak tracker (KLT in our implementation) from frame to frame, and produce a density map $P_t$ (see an example in Fig. 4(d)) by smoothing the confidence scores with a Gaussian ($\sigma = 5$). Based on $P_t$, sliding windows $X_{u,t} \in \mathbb{R}^{d \times N_t}$ (here the subscript $u$ means unlabeled) can be weakly labeled as $\Psi(P_t)$, which is the summation of the density of pixels close to their centers (within a circle of radius 4).
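To make the weak labeling step more concrete, the following sketch builds a density map from tracked confident instances and sums it around a window center. It is a minimal illustration under assumptions that are not in the paper: tracked instances are given as hypothetical `(row, col, score)` triples, the Gaussian is left unnormalized so that each instance contributes its own score at its location, and the function names `density_map` and `weak_label` are made up for this example. The values of sigma and the radius follow the text.

```python
import math

def density_map(height, width, tracked, sigma=5.0):
    """Density map as a sum of Gaussian bumps, one per tracked instance.

    tracked: list of (row, col, score) for confident instances carried over
    from the previous frame (assumed input format). The kernel is left
    unnormalized, which is an assumption made for this sketch.
    """
    P = [[0.0] * width for _ in range(height)]
    for r0, c0, score in tracked:
        for r in range(height):
            for c in range(width):
                d2 = (r - r0) ** 2 + (c - c0) ** 2
                P[r][c] += score * math.exp(-d2 / (2.0 * sigma ** 2))
    return P

def weak_label(P, center, radius=4):
    """Weak label of one window: sum of the density within `radius` of its center."""
    r0, c0 = center
    total = 0.0
    for r in range(max(0, r0 - radius), min(len(P), r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(len(P[0]), c0 + radius + 1)):
            if (r - r0) ** 2 + (c - c0) ** 2 <= radius ** 2:
                total += P[r][c]
    return total
```

A window centered on a tracked instance receives a large weak label, while a window far from every tracked instance receives a value close to zero, which is the behavior the consistency term relies on.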
The cost term $\frac{1}{m} \sum_{i=1}^{m} \| X_{u,t}^{T} w_{t,i} - \Psi(P_t) \|^2$ can be considered as a weakly supervised term which propagates labels in the temporal domain. Intuitively, it assists the detector to recall more instances; Fig. 4 shows this concept. Yellow boxes indicate the detection results (also positive instances), black boxes are negative instances, and white boxes are unlabeled samples. With the help of spatio-temporal consistency, some candidates have weak labels, indicated by the orange boxes in the frame shown in Fig. 4(e), and the weak labels help to recover missed detections; see the dashed yellow box in the frame shown in Fig. 4(f), which is a missed detection caused by occlusion in Fig. 4(c).

Figure 4: Illustration of how the spatio-temporal consistency guides the detection procedure (best viewed in color).

Based on the terms described above, we have

$$
\begin{aligned}
L_C(W_t) ={}& \underbrace{\alpha\, \mathrm{tr}\big(W_t (\eta I + M_t)^{-1} W_t^{T}\big)}_{\text{regularization}}
+ \underbrace{\frac{\lambda}{2m} \sum_{i=1}^{m} \| X_{u,t}^{T} w_{t,i} - \Psi(P_t) \|^2}_{\text{spatio-temporal consistency}}
+ \underbrace{\sum_{i=1}^{m} \frac{1}{2 N_{t,i}} \| X_{l,t,i}^{T} w_{t,i} - Y_{t,i} \|^2}_{\text{loss}} \\
&\text{s.t. } \mathrm{tr}(M_t) = k,\quad M_t \preceq I,\quad M_t \in S^{m}_{+}.
\end{aligned} \tag{5}
$$

We treat this as a joint convex problem with regard to $W_t$ and $M_t$ [2]. Following [32], we adopt the Accelerated Projected Gradient method to optimize this function. Labels of $X_{u,t}$ are obtained by averaging the scores of all detectors as:

$$
Y_{u,t} = \frac{1}{m} \sum_{i=1}^{m} X_{u,t}^{T} w_{t,i}. \tag{6}
$$

We choose candidates with a score greater than zero and apply non-maximum suppression to output final class labels $Y_{u,t} \in \{1, -1\}^{N_t}$.

3.3. Object label propagation

In the Bayesian formula Eq. 2, object label propagation is $P(Z_t \mid X_t, Y_t, Z_{0:t-1})$, where the estimation of $Z_t$ is conditioned on detection responses $Y_t$ and the history of estimations $Z_{0:t-1}$. Let the $n$ trajectories at time $t-1$ be

$$
T = \{ T_i \mid T_i = \langle T_i^{A}, T_i^{M}, T_i^{C} \rangle,\ i = 1, \dots, n \}, \tag{7}
$$

where $T_i^{A}$, $T_i^{M}$ and $T_i^{C}$ indicate appearance, motion, and context information, and let the $m$ detection responses at time $t$ be

$$
D = \{ D_j \mid D_j = \langle D_j^{A}, D_j^{L}, D_j^{C} \rangle,\ j = 1, \dots, m \}, \tag{8}
$$

where $D_j^{A}$, $D_j^{L}$ and $D_j^{C}$ represent the appearance, location and context information. Tracking is carried out by propagating object labels from trajectories to detection responses via a configuration variable $\Theta_t \in \mathbb{R}^{n \times m}$. Initially, all the elements of $\Theta_t$ are 0. If an element $\Theta_{ij}$ is switched to 1, then the label of trajectory $T_i$ is propagated to detection response $D_j$, and the propagated quantity depends on the affinity $S(T_i, D_j)$ between $T_i$ and $D_j$ (computed by considering $D_j$ as a component of $T_i$ at time $t$), which is determined by appearance, motion and context. Fig. 5 shows this process. Objects are assumed to move smoothly, so we only consider detection responses within $T_i$'s spatio-temporal proximity $\Omega_i$ (a circle with radius $d_{Th}$) and minimize the following energy function:

$$
L_O(\Theta_t) = - \sum_{i} \sum_{j \in \Omega_i} S(T_i, D_j)\, \Theta_{ij}. \tag{9}
$$

Figure 5: Object labels are propagated from trajectories (different colors mean different objects) in one frame to detection responses in the next frame. Note the proximity of a flower indicated by the black dashed circle (best viewed in color).

Appearance Model. We simply consider the intensity cue for appearance affinity. The appearance model $T_i^{A}$ of trajectory $T_i$ consists of the last 5 templates of this object, and the appearance similarity between $D_j$ and $T_i$ is

$$
S_A(T_i, D_j) = \mathrm{med}\big(\mathrm{NCC}(T_i^{A}, D_j^{A})\big), \tag{10}
$$

where $\mathrm{NCC}(\cdot, \cdot)$ is the normalized cross-correlation (NCC) similarity measure and $\mathrm{med}(\cdot)$ is the median.

Motion Model. We maintain the past three displacements and predict a displacement $vec_i$ weighted by $[4/7, 2/7, 1/7]$, where older values are weighted higher. Given $D_j$, the actual displacement $vec_j$ is the difference between $D_j^{L}$ and the most recent location of the object corresponding to $T_i$. The motion affinity is

$$
S_M(T_i, D_j) = \cos(vec_i, vec_j). \tag{11}
$$

Context Model. In modeling context information, we follow the work in [22] and employ 2D histograms of nearby objects to improve the robustness. As shown in Fig. 6, there are (a) five trajectories and (b) four detection responses. To compute a histogram for $T_i$, we divide the neighborhood of $T_i$ into $M$ partitions (here $M = 4$ for sake of display). For each object located in this neighborhood we compute a distance vector relative to $T_i$. According to the distance vector, we accumulate the distance values for each partition. By normalization, we obtain an $M$-bin histogram $H_i$. The context affinity is

$$
S_C(T_i, D_j) = \exp\big(-\mathrm{Bhat}(H_i, H_j)\big). \tag{12}
$$

Figure 6: Context model. Contexts of (a) trajectories and (b) detection responses are modeled by histograms, counting objects within an object's proximity.

Having obtained affinities based on three cues, we combine them as follows:

$$
S(T_i, D_j) = S_A(T_i, D_j)\, S_M(T_i, D_j)\, S_C(T_i, D_j). \tag{13}
$$

We minimize the energy (Eq. 9) by greedy search in an iterative way. First we turn off all propagation switches. We then compute the affinities of all propagation pairs and turn on the propagation switch (say $T_i$ and $D_j$) which most decreases the energy. At the same time, $D_j$ is labeled as the extension of $T_i$. We remove this pair of trajectory and detection response from the search space. This procedure is repeated until there is no further energy decrease. Finally, trajectories outside the search space are updated considering the extended component. The remaining trajectories in the search space are terminated, and new trajectories are initialized based on detection responses in the search space. For clarity, the algorithm is summarized in Algorithm 1.

Algorithm 1: Object label propagation for MOT
Data: T, D, proximity set Ω.
Result: Θ, labels of detection responses.
Initialization: Θ = 0.
while L_O decreases do
    foreach T_i ∈ T and D_j ∈ Ω_i do
        compute the energy decrease of T_i and D_j.
    find T_i and D_j with the greatest decrease via Eq. 9.
    set Θ_ij = 1, propagate the label of T_i to D_j.
    remove T_i and D_j, update the proximity set Ω.
Terminate trajectories in T, initialize trajectories according to detection responses in D.

4. Experiments

4.1. Data sets & Setup

We test our algorithm on eight data sets, Airshow, Goose, Sailing, Zebra, Crab, Antelope, Flower and Hockey. The first three are new sequences obtained from YouTube videos, and the last five are public sequences [18, 20, 31]. These data sets are challenging as they contain (1) crowd scenarios with similar objects, (2) partial or complete occlusions, (3) background clutter, and (4) out-of-plane rotations. Parameters $\lambda$, $\alpha$ and $\eta$ in Eq. 5 are set to be 0.1, 0.001 and 0.001 respectively. The proximity parameter $d_{Th}$ is 20. The number of detectors is 12. For each task, we sample 2/3 of the instances from the whole training data. We extract HoG [9], LBP and colors as features for object detection. The threshold to determine the confident instances is 0.5. Note that for the public data sets, we refer to results reported in [18]. For data sets which are not public, we obtain results by running the authors' code ([13, 31]) or by re-implementing the method ([18] and K-SVM).

Footnote: This sequence is part of the original sequence in [31] (500 frames of the original 2249 frames).

4.2. Generic object detection

We conducted experiments on generic object detection to verify the effectiveness of the proposed cMTL based detection method. Five methods were compared: (1) TLD [13], which uses a detector based on Random Ferns; (2) K-SVM, where we train K independent SVMs on clustered training data from K-means clustering and detect objects by classification. This is a typical way to handle intra-class variance. The number of SVMs is four; we use the same number of clusters in our algorithm; (3) GMOT [18], a framework which handles the same problem with a detector based on a Laplacian SVM; (4) our baseline method BL, which uses cMTL without the spatio-temporal consistency; (5) our full method. Table 1 shows the results. A detection response is defined as true positive if its overlap with the ground truth bounding box is at least 0.5. The results indicate that: (1) TLD only discovers a small portion of objects on some sequences.
We suspect that this is due to limitations of the TLD detector which uses two-pixel comparisons and therefore cannot handle large intra-class variance; (2) K-SVM and GMOT show good performance, and BL generally outperforms these, showing the effectiveness of cMTL to handle intra-class variance; (3) the full method outperforms all other methods; in comparison with BL the recall rate is increased due to the spatio-temporal consistency.

Table 1: Generic object detection results in terms of recall and precision values. The best results are shown in bold, the second best are underlined.

Sequence   Recall                            Precision
           TLD   GMOT  K-SVM  BL    Ours     TLD   GMOT  K-SVM  BL    Ours
Antelope   .29   .74   .88    .77   .89      .57   .66   .71    .76   .77
Goose      .66   .80   .92    .85   .94      .94   .85   .97    .98   .99
Zebra      .60   .80   .66    .74   .82      .92   .97   .88    .9    .91
Crab       .22   .52   .55    .56   .58      .58   .8    .70    .85   .88
Flower     .2    .47   .30    .50   .63      .58   .62   .95    .94   .9
Airshow    .6    .3    .38    .43   .63      .52   .56   .76    .77   .75
Sailing    .60 .63 .56 .67 .84 .93 .99
Hockey     .65   .84   .43    .65   .82      .92   .89   .75    .88   .94
Avg.       .56   .56   .56    .6    .70      .67   .79   .79    .88   .89

Table 2: Comparative results for different values of K (number of SVMs in K-SVM and, correspondingly, number of clusters in our method).

Sequence   Method   Recall                           Precision
                    K=2   K=4   K=6   K=8   Avg.     K=2   K=4   K=6   K=8   Avg.
Antelope   K-SVM    .90   .88   .86   .84   .87      .66   .71   .72   .73   .70
           Ours     .83   .89   .80   .80   .82      .81   .77   .81   .80   .80
Zebra      K-SVM    .66   .66   .70   .70   .68      .88   .88   .89   .87   .88
           Ours     .73   .82   .72   .72   .75      .85   .91   .84   .84   .86

In a separate experiment we vary the number K in K-SVM as well as the corresponding number of clusters in our algorithm. Two representative public sequences (Antelope and Zebra) are used in this experiment. Table 2 shows the results, which demonstrate that our algorithm outperforms K-SVM for most K in terms of recall rate, which is important in our setting. Note that we keep K fixed for the other experiments; a suitable choice of K is beyond the scope of this paper.

In a more extensive comparison of the baseline method with TLD we obtain the precision-recall curves for the Antelope and Zebra sequences, shown in Fig. 7. BL uses a threshold on the score value to determine whether a candidate is an object, and TLD [13] uses the percentage of ferns voting for a positive decision. The results show that our baseline method outperforms TLD consistently.

To test the variation of performance resulting from different initial bounding boxes, we run our algorithm five times on the Goose sequence, each time labelling a different initial object. The recall rates are 0.935 ± 0.006 and the precision rates 0.990 ± 0.004, indicating low dependence on the initialization (see Fig. 8).
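For readers who want to reproduce this kind of evaluation, the sketch below computes precision and recall for one frame, counting a detection as a true positive when its overlap (intersection over union) with a ground-truth box is at least 0.5, as defined in this section. The one-to-one greedy matching and the function names are assumptions made for illustration; the paper does not describe its matching procedure, and detections are assumed to arrive sorted by decreasing score.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(detections, ground_truth, min_overlap=0.5):
    """Precision and recall with greedy one-to-one matching (an assumption).

    Each ground-truth box can be matched by at most one detection; a match
    requires an overlap of at least `min_overlap`.
    """
    unmatched = list(ground_truth)
    tp = 0
    for det in detections:
        best, best_iou = None, 0.0
        for gt in unmatched:
            overlap = iou(det, gt)
            if overlap > best_iou:
                best, best_iou = gt, overlap
        if best is not None and best_iou >= min_overlap:
            unmatched.remove(best)  # each ground-truth box is used once
            tp += 1
    fp = len(detections) - tp       # detections without a matching object
    fn = len(ground_truth) - tp     # objects that were missed
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truth else 0.0
    return precision, recall
```

With two ground-truth boxes and two detections of which only one overlaps an object sufficiently, both precision and recall come out as 0.5.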
Figure 7: Precision-Recall performance of TLD and BL on the Antelope and Zebra sequences.

Figure 8: Performance variation of five different initializations on the Goose sequence (plotted: Recall, Precision, MOTA, MOTP, MT, ML).

4.3. Generic MOT

We carried out experiments to compare our framework with several state-of-the-art methods on the task of detecting and tracking multiple objects. The experiments are presented in three parts: (1) For each sequence we compare with TLD [13] and GMOT [18]. TLD was originally developed for single object tracking, and we extended it to multiple objects by decreasing the threshold to let it detect some similar objects and track them. It is initialized with the same bounding box as other methods. (2) For the Zebra, Crab, Flower, Airshow and Sailing sequences, we apply SPOT [31] to track multiple objects (four in our experiments) in each sequence. To compare the performance, we excerpt results corresponding to these four objects from our whole result in each sequence and evaluate the results. It is worth noting that our algorithm starts with a single bounding box while SPOT [31] starts with all four bounding boxes for each sequence. (3) For the Hockey sequence, we additionally compare with [7, 6, 20] using the results from [18]. Example images are shown in Fig. 9.

We adopt the criteria of Multiple Object Tracking Accuracy (MOTA) and Multiple Object Tracking Precision (MOTP) proposed in [5] as well as Mostly Tracked (MT) trajectories and Mostly Lost (ML) trajectories [17] to give quantitative results. MOTA takes missed detections, false positives and false matches into consideration. MOTP measures the average overlap between the ground truth and the true positives. MT is the ratio of the ground truth trajectories which are covered by tracking results for more than 80% in length. ML is the ratio of the ground truth trajectories which are covered by tracking results for less than 20% in length. In Table 3, the arrows following the criteria indicate the trend of better performance.

Results in Table 3 show that: (1) our method outperforms TLD and GMOT on most sequences; (2) compared with SPOT, our approach achieves better results except on the MOTP metric. We suspect that this is due to SPOT trackers being object-specific, thereby obtaining greater overlap scores, i.e. larger MOTP values; (3) for the Hockey sequence, our method obtains results comparable with methods that use a specific off-line trained human detector.

In order to test the sensitivity to different initializations, we run our algorithm on the Goose sequence five times with different initial bounding boxes. The MOTA, MOTP, MT and ML are 0.935 ± 0.02, 0.660 ± 0.009, 0.750 ± 0.042 and 0.07 ± 0.029 respectively (see Fig. 8), indicating low sensitivity to the initial labeling.

5. Conclusion

This paper proposed a framework for tracking multiple objects of the same general type, where class and object labels are propagated in the spatio-temporal domain. We introduced cMTL for generic object detection and have shown the benefit of including spatio-temporal consistency.
The proposed mehod akes a sequenial approach, enailing he limiaion ha objec rajecories may be more fragmened han when aking a more global view of he daa. Comparaive experimens on eigh sequences (five public and hree new daa ses) confirmed he effeciveness of he proposed mehod. From a pracical viewpoin an advanage of our mehod over mos oher work in he area is he requiremen of labeling jus a single iniial bounding box, hereby providing a muli-objec racker wihou resoring o an off-line rained deecor. Table 3: Generic Muliple Objec Tracking resuls. The able shows resuls in erms of four performance crieria from he lieraure (arrows indicaing direcion of beer performance) on five public and hree new daases. Sequence Mehod MOTA MOTP MT ML TLD [3].088.650.235.765 Anelope GMOT [8].356.633.368.368 Our mehod.622.74.69.77 TLD[3].62.6.286.79 Goose GMOT [8].798.604.643.07 Our mehod.938.649.786.036 TLD [3].587.645.59.420 GMOT [8].777.668.435.304 Zebra Our mehod.743.683.580.246 SPOT [3].66.753.750 0 Our mehod.982.747 0 TLD [3].068.646.049.864 GMOT [8].39.600.097.709 Crab Our mehod.497.692.24.689 SPOT [3].90.766.500.250 Our mehod.924.724 0 TLD [3].053.677 0.632 GMOT [8].86.650.053.42 Flower Our mehod.566.78.36.368 SPOT [3].372.730.500.250 Our mehod.524.737.500 0 TLD [3].03.596 0.733 GMOT [8].028.76 0.867 Airshow Our mehod.45.646 0.067 SPOT [3] -.503.676 0.250 Our mehod.346.650 0 0 TLD [3].403.737.250.083 GMOT [8].548.684.250.083 Sailing Our mehod.89.640.833.083 SPOT [3].554.73.750.250 Our mehod.786.652.750 0 TLD [3].547.647.79.250 GMOT [8].803.69.679.07 Hockey Our mehod.766.736.607.43 Brendel e al. [7].797.600 - - Breiensein e al. [6].765.570 - - Okuma e al. [20].678.50 - - Avg. References TLD [3].279.655.40.602 GMOT [8].40.637.30.427 Our mehod.63.685.482.336 SPOT [3].235.728.500.200 Our mehod.629.703.650 0 [] K. Ali, D. Hasler, and F. Fleure. FlowBoos-appearance learning from sparsely annoaed video. In CVPR, 20. [2] A. Argyriou, M. Ponil, Y. Ying, and C. A. Micchelli. 
A spectral regularization framework for multi-task structure learning. In NIPS, 2007.

Figure 9: Multiple object tracking results shown on frames excerpted from the videos (panels: Zebra, Antelope, Crab, Goose, Airshow, Sailing, Flower, Hockey). Different colors correspond to different objects (we only adopt 8 colors, so some boxes share the same color); the yellow lines represent trajectories.

[3] V. Badrinarayanan, F. Galasso, and R. Cipolla. Label propagation in video sequences. In CVPR, 2010.
[4] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. PAMI, 33(9):1806–1819, 2011.
[5] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008.
[6] M. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In ICCV, 2009.
[7] W. Brendel, M. Amer, and S. Todorovic. Multiobject tracking as maximum weight independent set. In CVPR, 2011.
[8] A. A. Butt and R. T. Collins. Multi-target tracking by Lagrangian relaxation to min-cost network flow. In CVPR, 2013.
[9] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[10] T. Evgeniou and M. Pontil. Regularized multi-task learning. In ACM SIGKDD, 2004.
[11] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32(9):1627–1645, 2010.
[12] J. Henriques, R. Caseiro, and J. Batista. Globally optimal solution to multi-object tracking with merged measurements. In ICCV, 2011.
[13] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. PAMI, 34(7):1409–1422, 2012.
[14] L. Kratz and K. Nishino. Tracking with local spatio-temporal motion patterns in extremely crowded scenes. In CVPR, 2010.
[15] C. Kuo, C. Huang, and R. Nevatia. Multi-target tracking by on-line learned discriminative appearance models. In CVPR, 2010.
[16] B. Leibe, K. Schindler, N. Cornelis, and L. Van Gool. Coupled object detection and tracking from static cameras and moving vehicles. PAMI, 30(10):1683–1698, 2008.
[17] Y. Li, C.
Huang, and R. Nevatia. Learning to associate: HybridBoosted multi-target tracker for crowded scene. In CVPR, 2009.
[18] W. Luo and T.-K. Kim. Generic object crowd tracking by multi-task learning. In BMVC, 2013.
[19] A. Milan, K. Schindler, and S. Roth. Detection- and trajectory-level exclusion in multiple object tracking. In CVPR, 2013.
[20] K. Okuma, A. Taleghani, N. de Freitas, J. Little, and D. Lowe. A boosted particle filter: Multitarget detection and tracking. In ECCV, 2004.
[21] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In ICCV, 2009.
[22] V. Reilly, H. Idrees, and M. Shah. Detection and tracking of large number of targets in wide area surveillance. In ECCV, 2010.
[23] X. Shi, H. Ling, J. Xing, and W. Hu. Multi-target tracking by rank-1 tensor approximation. In CVPR, 2013.
[24] G. Shu, A. Dehghan, O. Oreifej, E. Hand, and M. Shah. Part-based multiple-person tracking with partial occlusion handling. In CVPR, 2012.
[25] B. Song, T. Jeng, E. Staudt, and A. Roy-Chowdhury. A stochastic graph evolution framework for robust multi-target tracking. In ECCV, 2010.
[26] Z. Wu, A. Thangali, S. Sclaroff, and M. Betke. Coupling detection and data association for multiple object tracking. In CVPR, 2012.
[27] K. Yamaguchi, A. Berg, L. Ortiz, and T. Berg. Who are you with and where are you going? In CVPR, 2011.
[28] B. Yang and R. Nevatia. An online learned CRF model for multi-target tracking. In CVPR, 2012.
[29] M. Yang, F. Lv, W. Xu, and Y. Gong. Detection driven adaptive multi-cue integration for multiple human tracking. In ICCV, 2009.
[30] Y. Yang, G. Shu, and M. Shah. Semi-supervised learning of feature hierarchies for object detection in a video. In CVPR, 2013.
[31] L. Zhang and L. van der Maaten. Structure preserving object tracking. In CVPR, 2013.
[32] J. Zhou, J. Chen, and J. Ye. Clustered multi-task learning via alternating structure optimization. In NIPS, 2011.
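
The four evaluation criteria described in the experiments (MOTA, MOTP, MT and ML) can be sketched in code as follows. This is a minimal illustration and not the evaluation code behind Table 3: the function names, the error counts and the per-trajectory coverage values are assumptions made for the example, and the matching between tracker output and ground truth is taken as already done. MOTA follows the CLEAR MOT definition of [5]; MOTP is taken as the mean overlap of the true positives, as described in the text.

```python
from typing import Sequence, Tuple


def mota(misses: int, false_positives: int, mismatches: int, num_gt: int) -> float:
    """MOTA = 1 - (misses + false positives + false matches) / ground-truth objects.

    All counts are summed over every frame of a sequence (hypothetical inputs).
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (misses + false_positives + mismatches) / num_gt


def motp(overlaps: Sequence[float]) -> float:
    """Average bounding-box overlap over all true-positive matches."""
    if not overlaps:
        raise ValueError("at least one true-positive overlap is required")
    return sum(overlaps) / len(overlaps)


def mt_ml(coverage: Sequence[float],
          mt_threshold: float = 0.8,
          ml_threshold: float = 0.2) -> Tuple[float, float]:
    """Return (MT, ML) from per-trajectory coverage ratios in [0, 1].

    MT: share of ground-truth trajectories covered for more than 80% of their length.
    ML: share of ground-truth trajectories covered for less than 20% of their length.
    """
    if not coverage:
        raise ValueError("at least one ground-truth trajectory is required")
    n = len(coverage)
    mt = sum(1 for c in coverage if c > mt_threshold) / n
    ml = sum(1 for c in coverage if c < ml_threshold) / n
    return mt, ml


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(mota(misses=10, false_positives=5, mismatches=5, num_gt=100))
    print(motp([0.6, 0.8]))
    print(mt_ml([0.9, 0.85, 0.5, 0.1]))
```

Because the error counts in MOTA are not bounded by the number of ground-truth objects, the score becomes negative when a tracker makes more errors than there are objects to track, which is how a negative MOTA entry can arise in a results table.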