Human behaviour analysis and event recognition at a point of sale

R. Sicre
MIRANE S.A.S., Cenon, France
sicre@labri.fr

H. Nicolas
LaBRI, University of Bordeaux, Talence, France
nicolas@labri.fr

Abstract

This paper presents a new application that aims at improving communication and interactions between digital media and customers at a point of sale. Our system analyzes human behaviour in real time while customers shop. In particular, the system detects a customer's interest in products and interactions such as people grabbing products. The system is based on a behaviour model. A video analysis module detects motion, tracks moving objects, and describes local motion. Specific behaviours are then recognized and sentences are generated. Finally, our approach is tested on real video sequences.

Keywords: Computer vision; Human behaviour analysis; Video-surveillance; Marketing.

I. INTRODUCTION

Computer vision applications are developed in various fields such as surveillance [11], traffic monitoring [10], video games, marketing, etc. The field of marketing has evolved lately, with an extensive use of digital media at points of sale. These media offer a new form of communication with customers that brings along new issues. For example, playing advertising clips one after another does not have a significant impact on customers. Thus, two major pieces of information must be identified: media location and media content.

Nowadays a few software systems exist that help solve these challenges. Concerning media location, some systems track customers to obtain statistical information regarding their habits and displacements inside the shopping mall. Other systems directly calculate the media audience, using face detection, to evaluate media impact.

The study introduced in this paper belongs to the same context and aims at adapting the displayed clips to the behaviour of the customer who is present. We want to improve the interaction between customer and media in order to increase the media's impact. Our system requires a scene analysis that leads to an interpretation of the customer's behaviour. In particular, we detect people grabbing objects in known areas in real time.
The detection of such an event results in playing a clip related to the specific product, for example. After a short review, we present the system: first the behaviour model and the video analysis module, then behaviour recognition and semantic interpretation, see figure 1. We show some results and conclude with future work.

II. PREVIOUS WORK

This section presents behaviour analysis and semantic description of behaviour. Our behaviour analysis and semantic interpretation are based on a video analysis module composed of motion detection and object tracking [15]. For a general overview, the reader can refer to several surveys [5] [12] [10].

A. Human behaviour analysis

Human behaviour can be identified in many different contexts [6] [5] [12]. Humans are considered as deformable objects. The goal of behaviour analysis is to recognize motion samples in order to draw high-level conclusions. Several issues arise because real-world activities must be matched to the outputs perceived by a video processing module. We have to select relevant properties computed by video processing tasks and handle the incompleteness and uncertainty of these properties.

The analysis is usually made in two steps: description and recognition of actions. The first step is to define a model that describes each relevant action in our specific application context. Then there are two main possibilities. First, a training phase uses labeled data, and new data is then classified based on the training set, using learning methods such as Hidden Markov Models, Neural Networks [14], Support Vector Machines (SVM), etc. Second, a logical model is generated, which does not always require a training phase [1] [7]. However, these latter methods are not very flexible because they rely on scene knowledge. In our study we combine a logical model, using a finite state machine, with a learning method based on SVM.

B. Semantic description of behaviour

Many applications require a description of object behaviour in natural language, suitable for non-specialist operators. There are two main categories of behaviour description methods.
Statistical models, like Bayesian network models, interpret events and behaviour as interactions between objects through the analysis of time sequences and statistical modeling [14]. Formalized reasoning represents behaviour patterns using symbol systems, then recognizes and classifies events with reasoning methods [8]. We choose formalized reasoning for its simplicity.

Figure 1. Motion detection on videos from LAB1 (left) and MALL1 (right). Frames are on the first row, rough motion detection on the second, and filtered result on the third.

III. BEHAVIOUR MODEL

This section presents the model that defines human behaviour while shopping. At a point of sale, customers walk around products, look at prices, pick up products, etc. We create six states that correspond to the current behaviour of a person. The chain of states describes the scenario that the person plays.

Enter: A new person appears in the scene.
Exit: The person leaves the scene.
Interested: The person is close to products, i.e. possibly interested.
Interacting: The person interacts with products, i.e. is grabbing products.
Stand by: The person is in the scene but not close to any product area or image boundary. The person can be walking or not.
Inactive: The person has left the scene.

IV. VIDEO ANALYSIS

In order to detect the six states, we require information concerning every person in the scene. For every frame, we need people's locations, contours, etc. Therefore we use a motion detection and object tracking process. The areas where product heaps are located are assumed to be known: during the initialization of the system, the user manually selects the product areas. Then, product grabbing events are identified through a description and classification of a person's local motion, position relative to product areas, and surface overlapping the product area.

A. Motion detection and object tracking

Motion detection and object tracking identify the contours and location of every moving object in the scene for every frame. The method is divided into two phases: first, motion detection finds moving regions that do not belong to the background. Then, these regions are tracked over the frame sequence. Fast methods were selected in order to cope with the real-time constraints of our final application.

Motion detection uses a pixel-based model of the background. A mixture of Gaussians is associated with each pixel in order to characterize the background [13] [16].
This model is updated online. A Gaussian distribution is matched to the current value of each pixel. If this Gaussian belongs to the background, the pixel is classified as background. Otherwise it is considered as foreground, see figure 2. Morphological filters are finally applied to this result.

Figure 2. Motion detection on videos from LAB1 (left) and MALL1 (right). Frames are on the first row, rough motion detection on the second, and filtered result on the third.

In practice, a detected object, or person, can be covered by several disconnected regions, because the algorithm misses part of the person, see figure 2. Thus, we need to merge regions. Our merging process is separated into two phases. The first step checks the regions' bounding boxes for overlap and calculates the overlapping surface areas. If an overlapping area is bigger than a given threshold, the regions are merged; regions are also merged if one region lies inside another region's bounding box. The second step selects regions that are close to one another and regions with small overlapping surface areas. These selected regions are listed as potential merges. We create new regions that are the merge of each pair of selected regions. These new regions are added to the list of newly detected regions. Then the matching process, described in the following paragraph, matches this list of new regions with previously tracked regions. Current regions that are the most similar to previous frame regions are matched and considered as the most relevant.
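The first merging step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the system's actual implementation: regions are reduced to axis-aligned bounding boxes given as (x_min, y_min, x_max, y_max), and the function names and the threshold value are hypothetical.

```python
from typing import List, Tuple

# Hypothetical box representation: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


def overlap_area(a: Box, b: Box) -> int:
    """Surface area shared by two bounding boxes (0 if they are disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def union(a: Box, b: Box) -> Box:
    """Smallest bounding box that covers both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def merge_regions(boxes: List[Box], min_overlap: int) -> List[Box]:
    """Merge boxes whose overlap exceeds `min_overlap` or that are nested.

    The scan restarts after every merge, because a merged box may now
    overlap a box that neither of its parts overlapped before.
    """
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                if (overlap_area(a, b) > min_overlap
                        or contains(a, b) or contains(b, a)):
                    merged[i] = union(a, b)
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```

The second step (listing close regions as potential merges and letting the matching process choose among them) is not covered by this sketch.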

Object tracking is composed of two processes. First, we calculate, for each region, a descriptor based on its position, size, surface area, and first and second order color moments. These descriptors are then matched from one frame to the next, using a voting process that determines the regions that are the most similar in the two sets of regions. Second, we use matched regions to build and update the object list. An object is a region that was tracked over several frames. Newly matched regions are used to update information about objects: location, size, colors, etc. Then unmatched regions are compared to inactive unmatched objects to resolve missed detections. We also detect region splits and merges, which help detect occlusions.

B. Grabbing product event description

This section focuses on a tracked person's interactions with product areas, i.e. people grabbing products. While grabbing a product, a person first reaches out with an arm, then grasps a product, and finally takes the product. These different phases of the product grabbing event correspond to observable local motion of the person over a short period of time. As in [3] and [14], we define a local motion descriptor. However, this descriptor is not robust enough to noise, and since we do not want to use a long training phase, we create an interaction descriptor to help characterize product grabbing events. These descriptors are used in the behaviour recognition phase, see Section V, and are defined as follows.

The local motion descriptor is built for each frame for every tracked person, see figure 3. First, each person's mask is scaled to a standard size of 120x120 pixels, while keeping the aspect ratio. Then, the optical flow is computed using the Lucas-Kanade algorithm [9]. The result of this process is two matrices containing the values of the motion vectors along the x and y axes. We separate negative from positive values in the two matrices and end up with four matrices, then apply a Gaussian blur to each of them to reduce the effects of noise. A fifth matrix is computed that represents the person's silhouette.
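The separation of the two flow matrices into four channels can be illustrated with a short Python sketch. It assumes that the negative parts are stored as magnitudes (half-wave rectification, as is common for this kind of descriptor); the text does not state this detail. The function name is hypothetical, and the Gaussian blur and the silhouette channel are left out.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def split_flow(flow_x: Matrix, flow_y: Matrix) -> Tuple[Matrix, Matrix, Matrix, Matrix]:
    """Split optical flow into four non-negative channels: x+, x-, y+, y-.

    Assumption: negative values are kept as magnitudes, so that
    flow_x == x_pos - x_neg and flow_y == y_pos - y_neg element-wise.
    """
    def positive_part(m: Matrix) -> Matrix:
        return [[v if v > 0 else 0.0 for v in row] for row in m]

    def negative_part(m: Matrix) -> Matrix:
        return [[-v if 0 > v else 0.0 for v in row] for row in m]

    return (positive_part(flow_x), negative_part(flow_x),
            positive_part(flow_y), negative_part(flow_y))


# Tiny 2x2 example; real inputs would be 120x120 flow fields.
fx = [[1.0, -2.0], [0.0, 3.0]]
fy = [[-1.0, 0.5], [2.0, -4.0]]
x_pos, x_neg, y_pos, y_neg = split_flow(fx, fy)
```

Keeping each direction in its own non-negative channel means the blur applied afterwards cannot cancel opposite motions against each other.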
We reduce the dimensionality of these matrices to save computation time. Each matrix is divided into a 2x2 grid. Each grid cell has its values integrated over an 18-bin radial histogram (20 degrees per bin). Each matrix is now represented by a 72 (2x2x18) dimensional vector. The full frame descriptor then has 360 (72x5) dimensions. To take temporal information into account, we use 15 frames around the current frame and split them into three sets of 5 frames: past, current, and future. After applying Principal Component Analysis to each set's descriptors, we keep the first 50 components for the current set, while we only keep the first 10 components for the past and future sets. The motion context descriptor then has 70 (10+50+10) dimensions and is added to each frame descriptor.

The interaction descriptor uses information coming from the object tracking process. Six values are used: the person's surface covering a product area, a Boolean that is set when this covering surface is bigger than a theoretical hand size, the surface of the person, the height of the person's bounding box, and the positions of the top and the bottom of the bounding box along the y axis. The measurements related to the height and position of the bounding box vary meaningfully as a person reaches out for products. We assume that people appear vertically in the scene. The surface tends to increase as a person grasps a product, when products are big enough. These measurements fill the interaction context descriptor, which has 90 (6x15) dimensions because we keep each measurement for the 15 frames.

Figure 3. Diagram representing the descriptor.

V. BEHAVIOUR RECOGNITION

This section presents the behaviour recognition process. Based on the behaviour model, we detect the six states and build a Finite State Machine (FSM). Using video analysis information, for each frame, the state of each object has to be identified among the six predefined states:

Enter is detected when a person appears and is connected to an image boundary.
Exit is detected when a previously tracked person is connected to an image boundary.
Interested is detected when a person's contour connects to a product area.
Stand by is detected when a person is in the scene and not connected to a product area or an image boundary.
Inactive is detected when the system loses track of a person. This event happens when a person has left the scene or is occluded by something in the scene or by another person.
Interact is detected using SVM [2] on the local motion and interaction descriptors.

Data classification has two phases: training and testing. The data corresponds to several instances. An instance is composed of a target value and several attributes. In our case, the target value is 0 or 1 depending on whether a product grabbing event, i.e. the Interact state, occurs or not. The attributes correspond to the values of our descriptor. SVM creates a model that predicts the target

value from the attributes by solving the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$$

subject to

$$y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0,$$

where $y_i$ are the target values, $x_i$ are the attributes, $\xi_i$ is the error on the training set, and vector $w$ and scalar $b$ are the parameters of the hyperplane. Finally, $C > 0$ is the penalty parameter of the error term. Training vectors $x_i$ are mapped into a higher dimensional space by the function $\phi$. In this higher dimensional space, SVM finds a separating hyperplane that maximizes the margin. The system uses a radial basis function (RBF) kernel:

$$K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j) = \exp\left( -\gamma \| x_i - x_j \|^2 \right), \quad \gamma > 0.$$

Although different kernels exist, we choose RBF because it handles nonlinear relations between attributes and target values, unlike the linear kernel. RBF also has fewer hyperparameters and fewer numerical difficulties than polynomial or sigmoid kernels.

The Finite State Machine is used to organize and prioritize the six states [1] [7]. The state machine is synchronous and deterministic. Synchronous means that the machine iterates over each new frame. Based on the previous state, the system calculates a new one by testing each transition condition. If a condition is satisfied, the system moves to the new state. Otherwise, the system stays in the same state. The machine is deterministic because, for each state, there cannot be more than one transition for each possible input. One FSM models the behaviour of one person. We save the person's path through their FSM in order to interpret a high-level scenario. Figure 4 shows a possible path through the FSM.

VI. SEMANTIC INTERPRETATION

This section aims at describing a person's behaviour in natural language and is based on all the previous analysis. The generated sentences summarize the current state of action of a person. Human activities are expressed using case frames [8] [4]. This tool links cases to sentences in natural language.
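Since the state sequence produced in Section V is the input to this interpretation, a minimal Python sketch of a synchronous, deterministic state machine may help fix ideas. It is only an illustration under stated assumptions: the per-frame observation fields, the priority order of the tests, and the choice of Inactive as the initial state are hypothetical, since the complete transition table is not given in the text.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Observation:
    """Hypothetical per-frame cues for one person."""
    tracked: bool      # the tracker currently follows the person
    at_boundary: bool  # the contour touches an image boundary
    at_product: bool   # the contour touches a product area
    grabbing: bool     # the classifier reports a product grabbing event


def next_state(state: str, obs: Observation) -> str:
    """Deterministic transition: one new state per (state, observation)."""
    if not obs.tracked:
        return "Inactive"
    if obs.at_boundary:
        # A person who was not tracked before is entering; otherwise exiting.
        return "Enter" if state in ("Inactive", "Enter") else "Exit"
    if obs.at_product:
        return "Interact" if obs.grabbing else "Interested"
    return "Stand by"


def run(observations: Iterable[Observation]) -> List[str]:
    """Iterate once per frame and record the path of visited states."""
    state = "Inactive"  # assumed initial state: person not yet tracked
    path: List[str] = []
    for obs in observations:
        new_state = next_state(state, obs)
        if new_state != state:
            path.append(new_state)
        state = new_state
    return path


frames = [
    Observation(True, True, False, False),    # appears at the image border
    Observation(True, False, False, False),   # walks into the scene
    Observation(True, False, True, False),    # reaches a product area
    Observation(True, False, True, True),     # grabs a product
    Observation(True, False, False, False),   # moves away from the products
    Observation(True, True, False, False),    # reaches the border again
    Observation(False, False, False, False),  # track is lost
]
path = run(frames)
```

The recorded path is the kind of per-person scenario that the sentence generation consumes: a new entry is appended only when the state changes.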
We use simple case frames composed of three categories, as illustrated in the following example:

[AG: Person 1, PRED: interacts with, LOC: area 1]

AG, PRED, and LOC are the agent, the predicate, and the locus of the described action, respectively. These three cases allow us to describe all the actions we look for. Sentences are built using a hierarchy of actions (HoA) in which each node is represented by a case frame. A parent node derives its children by redefining the verb (the predicate is modified) or the locus (the locus is modified), as shown in figure 5. In particular, by going through the HoA, the case frame is refined until a final node is reached. Final nodes do not have children and contain all the components of the sentence we want. Here are the final node predicates: is walking around - is stopped - enters the scene - exits the scene - is interested in - interacts with - is gone.

Figure 4. Part of the Finite State Machine. Not all transitions are drawn, for clarity.

Figure 5. Sample of the hierarchy of action.

We explain in the following algorithm how case frames are generated from the HoA.

1. Let s be the current final node. For each new frame, case frames are evaluated from the top of the HoA to determine the new final node s'.
2. If the nodes s and s' are identical, no case frame is generated, and we go back to step 1.
3. Otherwise, a case frame associated with the new state s' is generated, and we go back to step 1.

In other words, when a state change occurs, a case frame is generated and a sentence is printed.

VII. RESULTS

A. Data description

We use different datasets taken with the same camera. Some of the datasets were taken in our laboratory (LAB1, LAB2, and LAB3). The others were taken in a real shopping mall (MALL1 and MALL2). The first two datasets (LAB1 and MALL1) contain five and six sequences respectively and include many interactions with products. Two and four different people are shopping respectively, see figure 2. LAB2 is a dataset with three different actors in which there is no interaction with the products.
Objects or products taken by people have different shapes, colors, and sizes in the video sequences. Also, all the products in a heap are identical. LAB3 and MALL2 are two datasets where multiple

people interact together. Two to four people interact simultaneously, see figure 6.

B. Tests on datasets

We run tests on various video sequences. As the system generates a state for each detected object in each frame, we compare these results to a ground truth that was labeled by hand. Then, we calculate the percentage of correct states, see table 1. The dataset LAB2 does not contain interactions with products. This dataset gives better results than LAB1, and we conclude that the system is more precise when there are fewer states to detect.

In order to recognize product grabbing events, we use a cross-validation process, see table 2. In other words, to recognize events in a video, we use all the other sequences of the dataset for training and then calculate recall and precision. Note that with this process we only use a couple of minutes of training data with a few people.

The dataset LAB1 gives a better percentage of correct states than MALL1. This difference is mainly due to noise, which is more significant in MALL1, as well as slight camera motion during the capture. However, MALL1 performs very well on recall and precision for the Interact state due to the position of the camera, located directly above the products and closer to them than in LAB1, see table 2. The camera location is therefore very important for maximizing the accuracy of the system. A position close to the products performs better on Interact state recognition. However, having the camera too close to the products makes us lose information about the customers, since they are only detected when they are near the products.

In preliminary tests, the local motion descriptor alone gave poor results for Interact state detection, due to the small training set and noise. This is why we test this descriptor combined with the interaction context descriptor. In table 2, we compare results using only the interaction context descriptor (ICD) and using both the motion and interaction context descriptors (MID).
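The evaluation protocol described above (train on all other sequences of a dataset, test on the held-out one, then compute recall and precision) can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that precision and recall are counted on per-frame binary labels (1 for Interact, 0 otherwise), which the text does not specify.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def leave_one_out(sequences: List[T]) -> Iterator[Tuple[List[T], T]]:
    """Yield (training sequences, held-out test sequence) pairs."""
    for i, held_out in enumerate(sequences):
        yield sequences[:i] + sequences[i + 1:], held_out


def precision_recall(predicted: Sequence[int],
                     truth: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall of the positive class on aligned label lists."""
    pairs = list(zip(predicted, truth))
    true_pos = sum(1 for p, t in pairs if p == 1 and t == 1)
    false_pos = sum(1 for p, t in pairs if p == 1 and t == 0)
    false_neg = sum(1 for p, t in pairs if p == 0 and t == 1)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Placeholder sequence names and labels, for illustration only.
splits = list(leave_one_out(["video_a", "video_b", "video_c"]))
precision, recall = precision_recall([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

In a full pipeline, each training split would be used to fit the SVM and the held-out sequence would provide the predicted and ground-truth labels passed to `precision_recall`.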
The MID performs as well as the ICD for precision, but offers better recall on the first datasets. On multiple-people datasets, MID performs slightly better than ICD on average; however, local motion description tends to be noisier, due to a few object tracking mismatches. The ICD remains robust in these situations, and the recognition rates on multi-person datasets are as good as on the first datasets, see table 2.

The semantic interpretation offers interesting results, see figure 6. Since this interpretation directly relies on the state recognition, a few errors occur. For example, missed detections occur on the Exit state on the MALL datasets, due to the camera position, see figure 6, second sequence.

C. Computation time

Finally, the application has to generate responses quickly, as soon as a specific event is detected. The program is tested on a computer with a 3 GHz Pentium 4 and 1 GB of RAM. The application can analyze 6 to 10 frames per second for an image resolution of 704x576 or 640x480 respectively. Motion detection is the most computationally expensive process.

VIII. CONCLUSION

This paper presents a novel type of application, using computer vision in the field of marketing, to improve interaction between customers and digital media. The system detects, tracks, and analyzes the behaviours of shoppers regarding their actions, interests, and interactions with products at a point of sale. The system is tested and offers interesting results: 73% of the frames are correctly labeled for sequences taken in a real environment. Interactions with products are well detected, with a precision of 0.79 and a recall of 0.85. This evaluation helped us understand how the system behaves, in order to maximize its efficiency. A prototype will soon be tested over a long period of time. As future work, the method can be improved with algorithms that better manage occlusions. It would also be interesting to look for other behaviours and scenarios that can be characterized and detected using this technique.

REFERENCES

[1] F. Bremond, G. Medioni, "Scenario recognition in airborne video imagery," Int. Work. IVM, pp. 57-64, 1998.
[2] C. Chang and C. Lin, "LIBSVM: a library for support vector machines," http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[3] A. A. Efros, A. C. Berg, G. Mori, and J. Malik, "Recognizing action at a distance," ICCV, 2003.
[4] C. J. Fillmore, "The case for case," E. Bach and R. Harms editors, pp. 1-88, 1968.
[5] W. Hu, T. Tan, L. Wang, S. Maybank, "A survey on visual surveillance of object motion and behaviours," IEEE Transac. on Syst., Man, and Cyb., pp. 334-352, 2004.
[6] Y. Hu, L. Cao, F. Lv, S. Yan, Y. Gong, T. Huang, "Action Detection in Complex Scenes with Spatial and Temporal Ambiguities," Proc. ICCV, 2009.
[7] N. Ikizler, D. Forsyth, "Searching video for complex activities with finite state models," Proc. CVPR, 2007.
[8] A. Kojima, T. Tamura, K. Fukunaga, "Natural language description of human activities from video images based on concept hierarchy of actions," Int. J. Comput. Vis., vol. 50, n. 2, pp. 171-184, 2002.
[9] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. 7th IJCAI, pp. 674-679, 1981.
[10] B. T. Morris and M. M. Trivedi, "A survey of vision-based trajectory learning and analysis for surveillance," Trans. Circ. Syst. Video Tech., v. 18, n. 8, pp. 1114-1127, 2008.
[11] PETS: Performance Evaluation of Tracking and Surveillance, http://winterpets09.net/
[12] R. Poppe, "A survey on vision-based human action recognition," Im. & Vis. Comp. J., v. 28, pp. 976-990, 2010.
[13] C. Stauffer, W. Grimson, "Adaptive background mixture models for real-time tracking," CVPR, pp. 246-252, 1999.
[14] D. Tran, A. Sorokin, "Human activity recognition with metric learning," Euro. Conf. on Computer Vision, 2008.
[15] A. Yilmaz, O. Javed, M. Shah, "Object tracking: a survey," ACM Computing Surveys, 2006.
[16] Z. Zivkovic, F. van der Heijden, "Efficient adaptive density estimation per image pixel for the task of background subtraction," Pattern Recognition Letters, vol. 27, n. 7, 2006.

TABLE I. TABLE REPRESENTING THE PERCENTAGE OF CORRECTLY DETECTED STATES FOR EACH FRAME, ON VIDEOS OF VARIOUS DATASETS.

TABLE II. TABLE REPRESENTING RECALL-PRECISION OF INTERACT STATE DETECTION. TWO DESCRIPTORS ARE TESTED ON VARIOUS VIDEOS: INTERACTION CONTEXT (IC) AND MOTION AND INTERACTION CONTEXT (MI).

Figure 6. Results for MALL 2 video 1.

Figure 7. Results for LAB 1 video 1.