Detecting Global Motion Patterns in Complex Videos

Min Hu, Saad Ali, Mubarak Shah
Computer Vision Lab, University of Central Florida
{mhu,sali,shah}@eecs.ucf.edu

Abstract

Learning dominant motion patterns or activities from a video is an important surveillance problem, especially in crowded environments like markets, subways etc., where tracking of individual objects is hard if not impossible. In this paper, we propose an algorithm that uses the instantaneous motion field of the video instead of long-term motion tracks for learning the motion patterns. The motion field is a collection of independent flow vectors detected in each frame of the video, where each flow vector is associated with a spatial location. A motion pattern is then defined as a group of flow vectors that are part of the same physical process. Algorithmically, this is accomplished by first detecting the representative modes (sinks) of the motion patterns, followed by construction of super tracks, which are the collective representation of the discovered motion patterns. We also use the super tracks for event-based video matching. The efficacy of the approach is demonstrated on challenging real-world sequences.

1. Introduction

The traditional approach for activity analysis in a video sequence consists of the following steps: i) detection of all the moving objects that are present in the scene; ii) tracking of the detected objects; and iii) analysis of the tracks for event/activity detection. This standard processing pipeline works well in a low density scene where reliable trajectories of moving objects can be obtained, which in turn facilitates the detection of typical motion patterns. However, in real-world situations the assumption of low density does not always hold. For instance, videos depicting events such as marathons, political rallies, city centers etc., usually contain hundreds of objects. Over the years, little attention has been paid to analyzing videos of these situations, especially in terms of learning the activity models and motion patterns hidden in these crowded scenes.
To deal with videos of these challenging settings, we propose a new method to learn the typical motion patterns using only the global motion flow field, instead of long-term trajectories of moving objects. Here, the motion flow field is a set of independent flow vectors representing the instantaneous motion present in a frame of a video. Such instantaneous motion information is readily available in any situation, as it is not affected by the density of objects. The motion flow field is obtained by first using existing optical flow methods to compute the optical flow vectors in each frame, and then combining the optical flow vectors from all frames of the video into a single global motion field. This global motion field does not contain any temporal information, as the flow vectors from all the frames are merged into a single field without maintaining the information about the video frames they came from. Next, from the global motion flow field, we extract the representative modes, which are called the sinks, for each motion pattern. The process of detecting the sinks is referred to as the sink seeking process. After extracting the sinks and sink paths, they are grouped into several clusters, each corresponding to a motion pattern present in the video. To collectively represent the motion pattern, a single super track is generated from the sink paths.

Related Work: Learning of motion paths or patterns by clustering trajectories of moving objects has been attempted before in the literature. For instance, Grimson et al. [12] used the trajectories of moving objects to learn the motion patterns, which are then used for abnormal event detection. Johnson et al. [5] used neural networks to model motion paths from trajectories. In [3], trajectories were iteratively merged into a path. Similarly, Wang et al. [9] used a trajectory similarity measure to cluster trajectories, where each cluster represented a specific dominant activity. Porikli et al. [1] represented the trajectories in the HMM parameter space for activity analysis. Vaswani et al.
[10] modeled the motion of all the moving objects performing the same activity by analyzing the temporal deformation of the shape which was constructed by joining the locations of the objects in each frame. The above-mentioned methods are based on long-term tracks of moving objects and therefore are only applicable to low density scenes. In contrast, we propose a new method to detect motion patterns in challenging crowded scenes where long-term tracks of moving objects are not available or not reliable.

In trajectory analysis, sinks are defined as the endpoints of paths and can be learned from the start and end points of the trajectories [2, 4]. However, fragmented trajectories resulting from occlusions or tracking failures will result in false sinks. To detect sinks in this case, Stauffer [6] defined a transition likelihood matrix and iteratively optimized the matrix for the estimation of sources/sinks. Wang et al. [9] estimated the sinks using the local density velocity map in a trajectory clustering. In this paper, the sinks are defined as the end points of the sink paths. They are the modes of motion patterns and define the number of distinct motion patterns.

Figure 1. Elevator video: (a) flow vectors (yellow arrows) detected at the corresponding frames #1, #101; (b) detected super tracks; (c) the motion flow field; (d) a sink seeking process; (e) sink clustering.

2. Global Motion Flow Field Generation

Given an input video, for each frame we use existing methods to compute sparse optical flow (instantaneous velocities) at interest points ([8]) or dense optical flow for all pixels ([11]). Consider a point $i$ in the given frame. Its flow vector, $Z_i$, includes the location, $X_i = (x_i, y_i)$, and the velocity, $V_i = (v_{x_i}, v_{y_i})$, i.e., $Z_i = (X_i, V_i)$. Note that these flow vectors do not necessarily belong to foreground objects, and no time order or object labels are associated with them. In case trajectories are available but not reliable, e.g., broken trajectories, the flow vectors can be obtained directly from these fragmented pieces of trajectories. All the flow vectors computed from all the frames of the given video then constitute the global motion flow field representing the instantaneous motion field of the video.
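As an illustration of how the global motion flow field is assembled, the following is a minimal sketch and not the implementation used in the experiments. It assumes each flow vector is stored as a tuple (x, y, vx, vy); the function names are hypothetical, and taking the velocity of a tracklet point as the displacement to the next tracked position is an assumption of this sketch.

```python
def flows_from_tracklet(points):
    """Turn one tracklet [(x, y), ...] into flow vectors (x, y, vx, vy).

    Assumption of this sketch: the velocity at a point is the displacement
    to the next tracked position of the same tracklet.
    """
    return [
        (x0, y0, x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


def build_global_flow_field(per_frame_flows):
    """Merge per-frame flow vectors into one global flow field.

    The frame index is dropped on purpose: the resulting field keeps
    no time order and no object labels.
    """
    field = []
    for frame_flows in per_frame_flows:
        field.extend(tuple(float(c) for c in z) for z in frame_flows)
    return field


# Two frames with one and two flow vectors respectively.
frame_1 = [(10, 20, 1.0, 0.0)]
frame_2 = [(11, 20, 1.0, 0.5), (40, 5, 0.0, -2.0)]
field = build_global_flow_field([frame_1, frame_2])
```

The per-frame lists could come from any optical flow method or from tracklets via `flows_from_tracklet`; once merged, the field no longer records which frame a vector came from.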
This flow field may contain thousands of flow vectors, and it is computationally expensive to apply the sink seeking process to such a large amount of data. Moreover, these flow vectors always contain redundant information and noise. In particular, the flow vectors belonging to the background can be considered noise, as they contain little motion information. To reduce the data, we first apply a threshold on the velocity

magnitude to remove the flow vectors that have little motion information. Next, we use Gaussian ART (see [13]) to reduce the number of flow vectors from thousands to hundreds. The reduced set of flow vectors still maintains the geometric structure of the flow field and, therefore, does not affect the results of detecting motion patterns. Fig. 1 shows example flow vectors and the corresponding motion flow field.

Sink Seeking: Suppose $\{Z_1, Z_2, \ldots, Z_n\}$ is the motion flow field, where $Z_i = (X_i, V_i)$. The states of the sink seeking process of each point $i$ are defined as $\tilde{Z}_{i,t} = (\tilde{X}_{i,t}, \tilde{V}_{i,t})$, $t = 1, 2, \ldots$, and computed using:

$$\tilde{Z}_{i,1} = Z_i, \qquad \tilde{X}_{i,t+1} = \tilde{X}_{i,t} + \tilde{V}_{i,t}, \tag{1}$$

$$\tilde{V}_{i,t} = \frac{\sum_{n \in \mathrm{Neighbor}(\tilde{X}_{i,t})} V_n W_{t,n}}{\sum_{n \in \mathrm{Neighbor}(\tilde{X}_{i,t})} W_{t,n}}. \tag{2}$$

The above equations state that the new position of a point depends only on its location and velocity at the last state, while the new velocity, $\tilde{V}_{i,t+1}$, depends not only on the previous velocity but also on the observed velocities of its neighbors. See Fig. 2(b), which shows the motion trend of a group of points in a local neighborhood. In this paper, we employ a kernel based estimation similar to the mean shift approach [14] to incorporate this neighborhood effect using the following equation:

$$W_{t,n} = \exp\left(-\left\|\frac{\tilde{V}_{i,t-1} - V_n}{h_{t-1}}\right\|^2\right), \tag{3}$$

where $h_{t-1}$ is the bandwidth. Note that in mean shift tracking [14], the appearance of pixels in a small neighborhood around the object is used to determine the location of the object in the next frame. In our approach, we use the location and the velocity of neighboring points in the global flow field to determine the next location. A pictorial description of the sink seeking process is presented in Fig. 2(a).

3. Super Track Extraction

After the sinks are obtained, the next task is to cluster the sinks and determine their corresponding sink paths. The clustering algorithm starts by initializing the sink cluster set to an empty set. It takes each sink and attempts to match it with all existing clusters. If a match is found, the sink is assigned to the matched cluster. Otherwise, a new cluster is initialized with the current sink as its center.
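The sink seeking iteration of Eqs. (1)-(3) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the neighborhood is taken to be a square sliding window of half-width `window`, the bandwidth `h` is kept fixed over time, and the stopping rules (no neighbors left, negligible velocity, or `max_steps` reached) are assumptions, since the text does not spell them out. The function name `sink_seek` is hypothetical.

```python
import math


def sink_seek(z, field, window=5.0, h=1.0, max_steps=200):
    """Run the sink seeking process from flow vector z = (x, y, vx, vy).

    Returns the sink path as a list of states (x, y, vx, vy); the last
    state is taken as the sink. `window`, the fixed bandwidth `h`, and
    the stopping rules are assumptions of this sketch.
    """
    x, y, vx, vy = (float(c) for c in z)
    path = [(x, y, vx, vy)]
    for _ in range(max_steps):
        # Eq. (1): the new position depends only on the last position and velocity.
        x, y = x + vx, y + vy
        # Neighbors: flow vectors inside a square window around the new position.
        neighbors = [
            (nvx, nvy)
            for nx, ny, nvx, nvy in field
            if abs(nx - x) <= window and abs(ny - y) <= window
        ]
        if not neighbors:
            break  # assumed stopping rule: no flow supports this position
        # Eq. (3): Gaussian kernel on the difference to the previous velocity.
        weights = [
            math.exp(-((vx - nvx) ** 2 + (vy - nvy) ** 2) / h ** 2)
            for nvx, nvy in neighbors
        ]
        total = sum(weights)
        if total == 0.0:
            break  # all weights underflowed; treat as no support
        # Eq. (2): kernel-weighted mean of the neighbors' observed velocities.
        vx = sum(w * nvx for w, (nvx, _) in zip(weights, neighbors)) / total
        vy = sum(w * nvy for w, (_, nvy) in zip(weights, neighbors)) / total
        path.append((x, y, vx, vy))
        if math.hypot(vx, vy) < 1e-6:
            break  # assumed stopping rule: the state no longer moves
    return path
```

For a toy field of ten vectors placed along the x axis, all moving one unit to the right, the process started at the leftmost vector follows the flow and stops once it steps outside the window of the last vector; the final state of the returned path plays the role of the sink.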
Clusters with a small number of sinks are often caused by the background or noise and, therefore, are discarded. Formally, given a sink $Z_i = (X_i, V_i)$ associated with a sink path $P_{Z_i}$, and a cluster $C_k$, the sink-cluster distances are given by:

i) $D_x(Z_i, C_k) = \max_{Z_j \in C_k} \|X_i - X_j\|$,
ii) $D_v(Z_i, C_k) = \min_{Z_j \in C_k} \frac{\langle V_i, V_j \rangle}{\|V_i\|\,\|V_j\|}$,
iii) $D_p(Z_i, C_k) = \max_{Z_j \in C_k} \mathrm{HausdorffDist}(P_{Z_i}, P_{Z_j})$.

Here all metrics are based on a comparison between the given sink $Z_i$ and every other sink $Z_j$ in the cluster $C_k$. The first metric measures whether the given sink $Z_i$ is spatially close to the cluster $C_k$ or not. The second metric measures the similarity of their directions, and the third measures the Hausdorff distance between their corresponding sink paths, represented by $P_{Z_i}$ and $P_{Z_j}$ respectively. These three metrics ensure that two flow vectors involved in a similar motion pattern have similar sinks and sink paths. Following the clustering of sinks, for each cluster a super track is extracted as the sink path with the maximum arc length to represent the corresponding global motion pattern (see Fig. 1).

Figure 2. Sink seeking process for a given point. (a) sink seeking (red: the states of the flow vector in the sink seeking process; orange: the sink; rectangles: sliding windows; yellow: the sink path); (b) sliding window (solid circle: the flow vector under consideration; rectangle: sliding window; hollow circles: neighboring points; dotted circles: non-neighboring points).
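The clustering of sinks and the selection of a super track can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: a sink is stored as a tuple (X, V, P) with location X, velocity V and sink path P; a sink is taken to match a cluster when all three distances pass the hypothetical thresholds `max_dx`, `min_dv` and `max_dp` (the text does not state how the three metrics are combined); and the Hausdorff distance is computed on the discrete path points.

```python
import math


def hausdorff(p, q):
    """Symmetric Hausdorff distance between two paths given as (x, y) points."""
    def directed(a, b):
        return max(min(math.dist(u, v) for v in b) for u in a)
    return max(directed(p, q), directed(q, p))


def cosine(u, v):
    """Normalized inner product of two velocities (0.0 if either is zero)."""
    norm = math.hypot(*u) * math.hypot(*v)
    return (u[0] * v[0] + u[1] * v[1]) / norm if norm else 0.0


def arc_length(path):
    """Sum of segment lengths along a path."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def cluster_sinks(sinks, max_dx, min_dv, max_dp, min_size=2):
    """Greedy sink clustering; thresholds and match rule are assumptions."""
    clusters = []
    for X, V, P in sinks:
        for members in clusters:
            d_x = max(math.dist(X, Xj) for Xj, _, _ in members)    # metric i)
            d_v = min(cosine(V, Vj) for _, Vj, _ in members)       # metric ii)
            d_p = max(hausdorff(P, Pj) for _, _, Pj in members)    # metric iii)
            if d_x <= max_dx and d_v >= min_dv and d_p <= max_dp:
                members.append((X, V, P))
                break
        else:
            # No existing cluster matched: start a new one with this sink.
            clusters.append([(X, V, P)])
    # Small clusters are attributed to background or noise and discarded.
    return [c for c in clusters if len(c) >= min_size]


def super_track(cluster):
    """Sink path with the maximum arc length in the cluster."""
    return max((P for _, _, P in cluster), key=arc_length)
```

Because each sink is compared against every member of a cluster (a max or min over the cluster), a sink only joins a cluster when it agrees with all sinks already in it, which keeps clusters compact at the cost of some sensitivity to the order in which sinks are visited.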

Figure 3. Generating super tracks for crowd videos. Left Col: Extracted flow vectors (yellow arrows). Center Col: The motion flow field. Right Col: Detected super tracks.

Super Track Matching: Each super track may represent motions of several different objects (people, cars etc.), since they are generated using the global flow field of the whole video. Therefore, super tracks are different from the traditional object tracks, which represent the locations of a single object in different frames. Super tracks can be used in video matching since they effectively reduce the problem of multi-object multi-event video matching to the problem of matching two sets of super tracks. Consider two videos X and Y, and assume X and Y respectively have n and m super tracks $\{x_i\}_{i=1,2,\ldots,n}$ and $\{y_j\}_{j=1,2,\ldots,m}$. We first define the similarity between two super tracks $x_i$ and $y_j$ as

$$p(x_i, y_j) = \frac{(w_i + w_j)\,\exp\{-d(x_i, y_j)\}}{\sum_{i,j}(w_i + w_j)},$$

where $d(x_i, y_j)$ is the shape distance computed by performing dynamic time warping of the directional vectors of $x_i$ and $y_j$ (see [7] for details), and $w_i$ is the reliability weight associated with each track $x_i$, which is given by

$$w_i = \frac{\mathrm{ArcLength}(x_i)}{\sum_{k=1}^{n} \mathrm{ArcLength}(x_k)}.$$

To find the best matching between the two groups $\{x_i\}_{i=1,2,\ldots,n}$ and $\{y_j\}_{j=1,2,\ldots,m}$, we use maximum bipartite graph matching, where each super track is a node in the bipartite graph. The weight of an edge between two nodes is given by the above similarity. Given a bipartite graph $G = (V, E)$, a matching $M$ is a subset of $E$ such that for any two different members $e_i, e_j \in M$, $e_i \cap e_j = \emptyset$. The maximum weight matching is the one that maximizes the sum of the weights.

4. Experiments

Two classes of videos are considered for the experiments: i) Crowd, and ii) Aerial videos. These videos contain groups of people and vehicles moving mostly in an unconstrained setting in the presence of shadows and severe occlusions.

Figure 4. Super tracks in aerial video. (a) Top: Initial tracking results where 6 cars generated 16 broken tracklets. Middle: Trajectories superimposed on the video mosaic. Bottom: Correctly generated single super track. (b) Left: Flow vectors superimposed on the mosaic. Right: Three super tracks. (c) Top: Flow vectors. Bottom: Five super tracks.

Crowd Videos: Fig. 1 shows a crowded scene of a supermarket where crowds of people go up and down through three escalators. Here, we used KLT to extract initial flow vectors, and correctly generated three super tracks corresponding to the motion patterns of the three escalators. Fig. 3 shows results on two other challenging sequences containing dense crowds. In Fig. 3 (top row), the crowd of pilgrims is moving in two opposite directions. The pilgrims are wearing clothes of similar color and are occluded by each other, which makes it very hard to detect and track individual persons. By processing this video through our proposed method, we generated two super tracks which correctly correspond to the two motion patterns: pilgrims going up and pilgrims going down. Fig. 3 (bottom row) demonstrates the strength of our method on a sequence of an outdoor scene containing crowds and shadows. In this case several super tracks were extracted from the motion flow field. Again, they correctly correspond to the running routes and the direction of motion.

Aerial Videos: The aerial videos were taken from DARPA's VIVID data set. Here, the main challenge is

to resolve the issue of broken trajectories resulting from the limited field of view and occlusion of objects due to terrain features. Initial tracklets were generated using a mean-shift tracker in motion-compensated imagery. The point flows are then extracted from these tracklets. The first result is shown in Fig. 4(a), where a super track is extracted from the video showing a group of cars making a U-turn. In this video, six vehicles move on a highway in a convoy formation, but only three or four of them are captured by the camera at any time. Some cars disappear for more than 100 frames and then reappear, which results in trajectories that are broken into many tracklets. It is very difficult for tracking-based approaches to detect the motion pattern from these broken trajectories. In contrast, our method obtains the flow vectors from these tracklets and does not use the labels of objects and, therefore, does not require a complete trajectory. By applying our algorithm, we are able to generate one super track representing the motion patterns hidden in the 16 tracklets of this sequence. Two more results are shown in Fig. 4(b) and (c).

Super Track Matching: We also tested the proposed method for super track based video matching using the VIVID data set consisting of 21 videos. Given a query video, the super tracks were generated using the proposed method. The super tracks of the query video were then compared with the super tracks of each video in the database. Fig. 5 illustrates the video matching results for the sequence shown at the top, which is an IR video. In this video, there was a group of cars making S-turns (see first row in Fig. 5). Fig. 5 shows the three videos with the greatest similarity to the query video. Note that even though there are multiple groups of objects in these three videos and only one group in the query video, all of them contain the same motion pattern, i.e., the S-turn. Despite the imperfect tracking and the variability in path shapes, our method successfully matched the videos with the query video.

Figure 5. First Row: Frames #500, #1000, #1130 of the query video. Second to Fourth Row: Most similar videos with scores of 0.88, 0.76 and 0.75 respectively.

5. Conclusions

We have proposed a new method, based on instantaneous motion information, to detect typical motion patterns in dense crowded scenes. This is achieved by proposing a new construct called the super track.

Acknowledgements: This research was funded by the US Government VACE program.

References

[1] F. M. Porikli, Trajectory Pattern Detection by HMM Parameter Space Features and Eigenvector Clustering, ECCV, 2004.
[2] D. Makris and T. Ellis, Automatic Learning of an Activity-based Semantic Scene Model, AVSBS, 2003.
[3] D. Makris and T. Ellis, Path Detection in Video Sequence, IVC, Vol. 30, 2002.
[4] S. McKenna et al., Learning Spatial Context from Tracking Using Penalised Likelihood Estimation, ICPR, 2004.
[5] N. Johnson et al., Learning the Distribution of Object Trajectories for Event Recognition, IVC, 14, 1996.
[6] C. Stauffer, Estimating Tracking Sources and Sinks, Event Mining Workshop, 2003.
[7] M. Vlachos et al., Rotation Invariant Distance Measures for Trajectories, SIGKDD, 2004.
[8] B. D. Lucas and T. Kanade, An Iterative Image Registration Technique with an Application to Stereo Vision, IJCAI, 1981.
[9] X. Wang et al., Learning Semantic Scene Models by Trajectory Analysis, ECCV, 2006.
[10] N. Vaswani et al., Activity Recognition Using the Dynamics of the Configuration of Interacting Objects, CVPR, 2003.
[11] R. Gurka et al., Computation of Pressure Distribution Using PIV Velocity Data, Workshop on Particle Image Velocimetry, 1999.
[12] W. E. L. Grimson et al., Using Adaptive Tracking to Classify and Monitor Activities in a Site, CVPR, 1998.
[13] J. R. Williamson, Gaussian ARTMAP: A Neural Network for Fast Incremental Learning of Noisy Multidimensional Maps, Neural Netw., 1996.
[14] D. Comaniciu et al., Mean Shift: A Robust Approach Toward Feature Space Analysis, PAMI, 24(5), 2002.