Muli- and Single View Muliperson Tracking for Smar Room Environmens Keni Bernardin 1, Tobias Gehrig 1, and Rainer Siefelhagen 1 Ineracive Sysems Lab Insiu für Theoreische Informaik Universiä Karlsruhe, 76131 Karlsruhe, Germany {keni, gehrig, siefel}@ira.uka.de Absrac. Simulaneous racking of muliple persons in real world environmens is an acive research field and several approaches have been proposed, based on a variey of feaures and algorihms. In his work, we presen 2 mulimodal sysems for racking muliple users in a smar room environmen. One is a muli-view racker based on color hisogram racking and special person region deecors. The oher is a wide angle overhead view person racker relying on foreground segmenaion and model-based racking. Boh sysems are compleed by a join probabilisic daa associaion filer-based source localizaion framework using inpu from several microphone arrays. We also very briefly presen wo inuiive merics o allow for objecive comparison of racker characerisics, focusing on heir precision in esimaing objec locaions, heir accuracy in recognizing objec configuraions and heir abiliy o consisenly label objecs over ime. The rackers are exensively esed and compared, for each modaliy separaely, and for he combined modaliies, on he CLEAR 2006 Evaluaion Daabase. 1 Inroducion and Relaed Work In recen years, here has been a growing ineres in inelligen sysems for indoor scene analysis. Various research projecs, such as he European CHIL or AMI projecs [17, 18] or he VACE projec in he U.S. [19], aim a developing smar room environmens, a faciliaing human-machine and human-human ineracion, or a analyzing meeing or conference siuaions. To his effec, mulimodal approaches ha uilize a variey of far-field sensors, video cameras and microphones, o gain rich scene informaion gain more and more populariy. An essenial building block for complex scene analysis is he deecion and racking of persons in he scene. One of he major problems faced by indoor racking sysems is he lack of reliable feaures ha allow o keep rack of persons in naural, evolving and unconsrained scenarios. The mos popular visual feaures in use are color feaures and foreground segmenaion or movemen feaures [2, 1, 3, 6, 7, 16], each wih heir advanages and drawbacks. Doing e.g. blob racking on background
subracion maps is error-prone, as i requires a clean background and assumes only persons are moving. In real environmens, he foreground blobs are ofen fragmened or merged wih ohers, hey depic only pars of occluded persons or are produced by shadows or displaced objecs. When using color informaion o rack people, he problem is o creae appropriae color hisograms or models. Generic color models are usually sensiive and environmen-specific [4]. If no generic model is used, one mus a some poin decide which pixels in he image belong o a person o iniialize a dedicaed color hisogram [3, 7, 15, 16]. In many cases, his sill requires he cooperaion of he users and/or a clean and relaively saic background. On he acousic side, alhough acual echniques already allow for a high accuracy in localizaion, hey can sill only be used effecively for he racking of one person, and only when his person is speaking. This naurally leads o he developmen of more and more mulimodal echniques. Here, we presen wo mulimodal sysems for he racking of muliple persons in a smar room scenario. A join probabiliy daa associaion filer is used o in conjuncion wih a se of microphone arrays o deermine acive speaker posiions. For he video modaliy, we invesigae he advanages and drawbacks of 2 approaches, one relying on color hisogram racking in several corner camera images and subsequen riangulaion, and one relying on foreground blob racking in wide angle op view images. For boh sysems, he acousic and visual modaliies are fused wih a sae-based selecion and combinaion scheme on he single modaliy racker oupus. The sysems are evaluaed on he CLEAR 06 3D Muliperson Tracking Daabase, and compared using he MOTP and MOTA merics, which will also be briefly decribed. The nex secions inroduce he muli-view and single-view visual rackers, and he jpdaf-based acousic racker. Secion 6 gives a brief explanaion of he used merics. Secion 7 shows he evaluaion resuls on he CLEAR daabase, while secion 8 gives a brief summary and concludes. 2 Muli-View Person Tracking using Color Hisograms and Haar-Classifier Cascades The developed sysem is a 3D racker ha uses several fixed cameras insalled a he room corners [11]. I is designed o funcion wih a variable number of cameras, wih precision increasing as he number of cameras grows. I performs racking firs separaely on each camera image, using color hisogram models. Color racks are iniialized auomaically using a combinaion of foreground maps and special objec deecors. The informaion from several cameras is hen fused o produce 3D hypoheses of he persons posiions. A more deailed explanaion of he sysem s differen componens is given in he following. 2.1 Classifier Cascades and Foreground Segmenaion A se of special objec deecors is used o deec persons in he camera images. They are classifier cascades ha build on haar-like feaures, as decribed in [9,
8]. For our implemenaion, he cascades were aken from he OpenCV [20] library. Two ypes of cascades are used: One rained o recognize fronal views of faces(face), and one o recognize he upper body region of sanding or siing persons (upper body). The image is scanned a differen scales and bounding recangles are obained for regions likely o conain a person. By using hese deecors, we avoid he drawbacks of creaion/deleion zones and are able o iniialize or recover a rack a any place in he room. Furher, o reduce he amoun of false deecor his, a preprocessing sep is made on he image. I is firs segmened ino foreground regions by performing background subracion using an adapive background model. The foreground regions are hen scanned using he classifier cascades. This combined approach offers wo advanages: The cascades, on he one hand, increase robusness o segmenaion errors, as foreground regions no belonging o persons, such as moved chairs, doors, shadows, ec, are ignored. The foreground segmenaion, on he oher hand, helps o decide which of he pixels inside a deecion recangle belong o a person, and which o he background. Knowing exacly which pixels belong o he deeced person is useful o creae accurae color hisograms and improve color racking performance. 2.2 Color Hisogram Tracking and 2D Hypoheses Whenever an objec deecor has found an upper or a full body in he image, a color hisogram of he respecive person region is consruced from he foreground pixels belonging o ha region, and a rack is iniialized. The acual racking is done based only on color feaures by using he meanshif algorihm [5] on hisogram backprojecion images. Care mus be aken when creaing he color hisograms o reduce he negaive effec of background colors ha may have been misakenly included in he person silhouee during he deecion and segmenaion phase. This is done by hisogram division, as proposed in [12]. Several ypes of division are possible (division by a general background hisogram, by he hisogram of he background region immediaely surrounding he person, ec, see Fig. 1). The choice of he bes echnique depends on he condiions a hand and is made auomaically a each rack iniializaion sep, by making a quick predicion of he effec of each echnique on he racking behavior in he nex frame. To ensure coninued racking sabiliy, he hisogram model for a rack is also adaped every ime a classifier cascade produces a deecion hi on ha rack. Tracks ha are no confirmed by a deecion hi for some ime are deleed, as hey are mos likely erroneous. The color based racker, as described above, is used o produce a 2D hypohesis for he posiion of a person in he image. Based on he ype of cascade ha riggered iniializaion of he racker, and he original size of he deeced region, he body cener of he person in he image and he person s disance from he camera are esimaed and oupu as hypohesis. When several ypes of rackers (face and upper body) are available for he same person, a combined oupu is produced.
(a) Deeced (b) body regions map Back- (d) DIV ground Foreground (c) His. proj. Back- (e) DIV (f) DIV Border*Backg Border*Backg 2 (g) DIV (h) Tracker oupu Background 2 Fig. 1. Color hisogram creaion, filering and racking. a) Face, upper and full body deecions (recangles) in one camera view. b) Foreground segmenaion (in whie). Only foreground pixels inside he recangles are used. c) Hisogram backprojecion for he upper body rack of he lefmos person. d), e), f) and g) Effecs of differen ypes of hisogram division. Background: Overall background hisogram. Border: Hisogram of he background region immediaely surrounding he deeced recangle. h) Tracker oupu as seen from anoher view 2.3 Fusion and Generaion of 3D Hypoheses The 2D hypoheses produced for every camera view are riangulaed o produce 3D posiion esimaes. For his, he cameras mus be calibraed and heir posiion relaive o a general room coordinae sysem known. The lines of view (LOV) coming from he opical ceners of he cameras and passing hrough he 2D hypohesis poins in heir respecive image planes are inerseced. When no exac inersecion poin exiss, a residual disance beween LOVs, he riangulaion error, can be calculaed. This error value is used by an inelligen 3D racking algorihm o esablish likely correspondences beween 2D racks (as in [13]). When he riangulaion error beween a se of 2D hypoheses is small enough, hey are associaed o form a 3D rack. Likewise, when i exceeds a cerain hreshold, he 2D hypohesis which conribues mos o he error is dissociaed again and he 3D rack is mainained using he remaining hypoheses. The racker requires a minimum of 2 cameras o produce 3D hypoheses, and becomes more robus as he number of cameras increases. Once a 3D esimae for a person s posiion has been compued, i is furher used o validae 2D racks, o iniiae color hisogram racking in camera views where he person has no ye been deeced, o predic occlusions in a camera view and deacivae he involved 2D rackers, and o reiniialize racking even in he absence of deecor his. The developed muliperson racker draws is srengh from he inelligen fusion of several camera views. I iniializes is racks auomaically, consanly
adaps is color models and verifies he validiy of is racks hrough he use of special objec deecors. I is capable of racking several people, regardless if hey are siing, moving or sanding sill, in a cluered environmen wih uneven lighing condiions. 3 Single-View Model-Based Person Tracking on Panoramic Images In conras o he above presened muli-view sysem, a single-view racker working on wide angle images capured from he op of he room was also designed. The advanage of such images is ha hey reduce he chance of occlusion by objecs or overlap beween persons. The drawback is ha deailed analysis of he racked persons is difficul as person-specific feaures are hard o observe. The racking algorihm is essenially composed of a simple bu fas foreground blob segmenaion followed by a more complex EM algorihm based on person models: Firs, foreground paches are exraced from he images by using a dynamic background model. The background model is creaed on a few iniial images of he room and is consanly adaped wih each new image wih an adapaion facor α. Background subracion and hresholding yield an iniial foreground map, which is morphologically filered. A conneced componen analysis provides he foreground blobs for racking. Blobs below a cerain size are rejeced as segmenaion errors. The subsequen EM racking algorihm ries o find an opimal assignmen of he deeced blobs o a se of acive person models, insaniaing new models or deleing unnecessary ones if need be. A person model, in our case is composed of a posiion (x, y), a velociy (vx, vy), a radius r and a rack ID. In our implemenaion, he model radius was esimaed auomaically using he calibraion informaion for he wide angle camera and rough knowledge abou he room heigh. The procedureis as follows: For all person models M i, verify heir updaed posiions (x, y) Mi. If he overlap beween wo models exceeds a maximum value, fuse hem. For each pixel p in each foreground blob B j, find he person model M k which is closes o p. If he disance is smaller han r Mk, assign p o M k. Ieraively assign blobs o person models: For every foreground blob B j whose pixels were assigned o a mos one model M k, assign B j o M k and use all assigned pixels from B j o compue a posiion updae for M k. Subsequenly, consider all assignmens of pixels in oher blobs o M k as invalid. Repea his sep unil all unambiguous mappings have been made. Posiion updaes are made by calculaing he mean of assigned pixels (x, y) m and seing (x, y) Mk,new = α M (x, y) m + (1 α M ) (x, y) Mk, wih α M he learnrae for model adapaion. For every blob whose pixels are sill assigned o several models, accumulae he pixel posiions assigned o each of hese models. Then make he posiion
updaes based on he respecively assigned pixels only. This is o handle he case ha wo person racks coincide: The foreground blobs are merged bu boh person models sill subsis as long as hey do no overlap oo grealy, and can keep rack of heir respecive persons when hey par again. For each remaining unassigned foreground blob, iniialize a new person model, seing is (x, y) posiion o he blob cener. Make he model acive, only if i subsis for a minimum period of ime. On he oher hand, if a model says unassigned for a cerain period of laency, delee i. Repea he procedure from sep 1. The wo sage approach resuls in a fas racking algorihm ha is able o iniialize and mainain several person racks, even in he even of moderae overlap. Relying solely on foreground maps as feaures, however, makes he sysem relaively sensiive o siuaions wih heavy overlap. This could be improved by including color informaion, or wih e.g. emporal emplaes, as proposed in [1]. By assuming an average heigh of 1m for a person s body cener, and using calibraion informaion for he op camera, he posiions in he world coordinae frame of all N racked persons are calculaed and oupu. The sysem makes no assumpions abou he environmen, e.g. no special creaion or deleion zones, abou he consisency of a person s appearance or he surrounding room. I runs in realime, a 15fps, on a Penium 3GHz machine. 4 A JPDAF Source Localizer for Speaker Tracking In parallel o he visual racking of all room occupans, acousic source localizaion was performed o esimae he posiion of he acive speaker. For his, he sysem relies on he inpu from four T-shaped microphone clusers insalled on he room walls. They allow a precise localizaion in he horizonal plane, as well as heigh esimaion. Two subasks are accomplished: Speech deecion and segmenaion. This is currenly done by hresholding in he power specrum, bu echniques more robus o non-speech noise and cross-alk are already being experimened wih. Speaker localizaion and racking. This is done by esimaing ime delays of arrival beween microphone pairs using he Generalized Cross Correlaion funcion (GCC): R 12 (τ) = 1 2Π X 1 (e jωτ X 2 (e jωτ ) X 1 (e jωτ X 2 (ejωτ ) ejωτ dω where X 1 (ω) and X 2 (ω) are he Fourier ransforms of he signals of a microphone pair in a microphone array. As opposed o oher approaches, Kalman or paricle-filer based, his approach uses a Join Probabilisic Daa Associaion Filer ha direcly receives as inpu he correlaion resuls from he various microphone pairs, and performs he racking in a unified probabilisic way for muliple possible arge hypoheses, hereby achieving more robus and accurae resuls. The deails of he source localizer can be found in [10].
The oupu of he speaker localizaion module is he racked posiion of he acive speaker in he world coordinae frame. This posiion is compared in he fusion module o hose of all visually racked persons in he room and a combined hypohesis is produced. 5 Sae-Based Fusion The fusion of he audio and video modaliies is done a he decision level. Track esimaes coming from he visual and acousic racking sysems are combined using a finie sae machine approach, which considers he relaive srenghs and weaknesses of each modaliy. The visual rackers are generally very accurae a deermining a person s posiion. In muliperson scenarios hey can, however, miss persons compleely because heir faces are oo small or invisible, or because hey are no well discernable from he background by color, shape or moion. The acousic racker on he oher hand can precisely deermine a person s posiion when his person speaks. In he curren implemenaion, i can, however, only rack one acive speaker a a ime can produce no esimaes when several or no persons are speaking. Based on his, he fusion of he acousic and visual racks is made using a finie sae machine weighing he availabiliy or reliabiliy of he single modaliy racks. For mulimodal racking, wo main condiions are o be evaluaed: For condiion A, only he posiion of he acive speaker in a muli-paricipan scenario is o be esimaed. Fo condiion B, on he oher hand, all paricipans have o be racked. Consequenly, he saes for he fusion of modaliies differ slighly depending on he ask condiion. For condiion A, hey are as follows: Sae 1: An acousic esimae is available, for which no overlapping visual esimae exiss. Here, esimaes are considered overlapping if heir disance is smaller han 500mm. In his case, assume he visual racker has missed he speaking person and oupu he acousic hypohesis. Sore he las received acousic esimae and keep oupuing i unil an overlapping visual esimae is found. Sae 2: An acousic esimae is available, and a corresponding visual esimae exiss. In his case, oupu he average of he acousic and visual posiions. Sae 3: Afer an overlapping visual esimae had been found, an acousic esimae is no longer available. In his case, we consider he visual racker has recovered he previously undeeced speaker and keep oupuing he posiion of he las overlapping visual rack. For condiion B, where all paricipans mus be racked, he acousic esimae serves o increase he precision of he closes visual rack, whenever available. The saes are:
Sae 1: An acousic esimae is available, for which no overlapping visual esimae exiss. In his case, assume he visual racker has missed he speaking person and oupu he acousic hypohesis addiionally o he visual ones. Sore he las received acousic esimae and keep oupuing i unil an overlapping visual esimae is found. Sae 2 and Sae 3 are similar o condiion A, wih he excepion ha here, all oher visual esimaes are oupu as well. Using his fusion scheme, wo mulimodal racking sysems were designed: Sysem1, fusing he jpdaf acousic racker wih he single-view visual racker, and Sysem2, fusing i wih he muli-view racker. Boh sysems were evaluaed on condiions A and B, and he resuls compared in secion 7. To allow beer insigh in he evaluaion scores, he following secion firs gives a brief overview of he used merics. 6 Muliple Objec Tracking Merics Defining good measures o express he characerisics of a sysem for coninuous racking of muliple objecs is no a sraighforward ask. Various measures exis and here is no consensus in he lieraure on he bes se o use. Here, we propose a small expressive se of merics and show a sysemaic procedure for heir calculaion. A more deailed discussion of hese merics can be found in [14]. Assuming ha for every ime frame a muliple objec racker oupus a se of hypoheses {h 1... h m } for a se of visible objecs {o 1... o n }, we define he procedure o evaluae is performance as follows: Le he correspondence beween an objec o i and a hypohesis h j be valid only if heir disance dis i,j does no exceed a cerain hreshold T, and le M = {(o i, h j )} be a dynamic mapping of objec-hypohesis pairs. Le M 0 = {}. For every ime frame, 1. For every mapping (o i, h j ) in M 1, verify if i is sill valid. If objec o i is sill visible and racker hypohesis h j sill exiss a ime, and if heir disance does no exceed he hreshold T, make he correspondence beween o i and h j for frame. 2. For all objecs for which no correspondence was made ye, ry o find a maching hypohesis. Allow only one o one maches. To find opimal correspondences ha minimize he overall disance error, Munkre s algorihm is used. Only pairs for which he disance does no exceed he hreshold T are valid. If a correspondence (o i, h k ) is made ha conradics a mapping (o i, h j ) in M 1, replace (o i, h j ) wih (o i, h k ) in M. Coun his as a mismach error and le mme be he number of mismach errors for frame. 3. Afer he firs wo seps, a se of maching pairs for he curren ime frame is known. Le c be he number of maches found for ime. For each of heses maches, calculae he disance d i beween he objec o i and is corresponding hypohesis.
4. All remaining hypoheses are considered false posiives. Similarly, all remaining objecs are considered misses. Le fp and m be he number of false posiives and misses respecively for frame. Le also g be he number of objecs presen a ime. 5. Repea he procedure from sep 1 for he nex ime frame. Noe ha since for he iniial frame, he se of mappings M 0 is empy, all correspondences made are iniial and no mismach errors occur. Based on he maching sraegy described above, wo very inuiive merics can be defined: The M uliple Objec T racking Precision (MOT P ), which shows he racker s abiliy o esimae precise objec posiions, and he M uliple Objec T racking Accuracy (M OT A), which expresses is performance a esimaing he number of objecs, and a keeping consisen rajecories: i, MOT P = d i, c (1) MOT A = 1 (m + fp + mme ) g (2) The MOT A can be seen as composed by 3 error raios: m = m g, fp = fp g, mme = mme g, he raio of misses, false posiives and mismaches in he sequence, compued over he oal number of objecs presen in all frames. Alernaively, o compare sysems for which measuremen of ideniy mismaches is no meaningful, an addiional measure, he A MOT A can be compued, by ignoring mismach errors in he global error compuaion: A MOT A = 1 (m + fp ) g (3) 7 Evaluaion on he CLEAR 06 3D Muliperson Tracking Daabase The above presened sysems for visual and mulimodal racking were evaluaed on he CLEAR 06 3D Muliperson Tacking Daabase. This daabase comprises recordings from 3 differen CHIL smarrooms, involving up o 6 persons in a seminar scenario, for a oal of approx. 60 min. Tables 1 and 2 show he resuls for he Single- and Muli-view based sysems, Sysem1 and Sysem2, for he visual and he muimodal condiions A and B: As Table 1 shows, he single view racker clearly ouperforms he muli-view approach. As he scenario involved mosly people siing around a able and occasionally walking, hey were very clearly disinguishable from a op view, even when using simple feaures such as foreground blobs for racking. The
Table 1. Evaluion resuls for he visual and mulimodal B condiions Sysem MOT P m fp mme MOT A 1:Visual 217mm 27.6% 20.3% 1.0% 51.1% 1:AV CondB 226mm 26.1% 20.8% 1.1% 52.0% 2:Visual 203mm 46.0% 24.9% 2.8% 26.3% 2:AV CondB 223mm 44.4% 25.8% 3.3% 26.4% Table 2. Evaluion resuls for he mulimodal A condiion Sysem MOT P m fp mme MOT A 1:AV CondA 223mm 51.4% 51.4% 2.1% -5.0% 2:AV CondA 179mm 51.4% 51.4% 5.3% -8.2% muli-view approach, on he oher hand, had more moderae resuls, semming from he considerably more difficul video daa. The problems can be summed up in 2 caegories: 2D racking errors: In several seminars, paricipans were only hardly disinguishable from he background using color informaion, or deecable by he face and body deecors, due o low resoluion. This accouns for he relaively high amoun of missed persons. Triangulaion errors: The low angle of view of he corner cameras and he small size of mos recording rooms caused a considerable amoun of occlusion in mos seminars, which could no be compleely resolved by he riangulaion scheme. A more precise disance esimaion, based on he size of deecion his could help avoid many of he occured riangulaion errors, and reduce he false posiive coun. In all cases, he average MOTP error was abou 20cm, making he MOTA he more ineresing meric for comparison. As can also be seen, alhough he addiion of he acousic modaliy could bring a sligh improvemen in racking accuracy, he gain is minimal, as i could only help improve racking performance for he speaking person a each respecive poin in ime. Compared o hese resuls, he scores for condiion A are relaively low. Boh sysems produced a high amoun of miss errors (around 50%), as he correc speaker could no be seleced from he muliple available racks. I is noiceable ha in case he correc speaker was racked, hough, he muli-view Sysem2 achieved a higher precision, reaching 18cm, as compared o 20cm for Sysem1. This sugges ha for he racking of clearly idenifiable persons (such as he presener in he seminars), he muli-view, face and body-deecor based approach does have is advanages.
8 Summary In his work, 2 sysems for mulimodal racking of muliple users is presened. A join probabilisic daa associaion filer for source localizaion is used in conjuncion wih 2 disinc sysems for visual racking: One using muliple camera images, based on color hisogram racking and haar-feaure classifier cascades for upper bodies and faces. The oher using only a wide angle overhead view, and model based racking on foreground segmenaion feaures. A fusion scheme is presened, using a 3-sae finie-sae machine o combine he oupu of he audio and visual rackers. The sysems were exensively esed on he CLEAR 2006 3D Muliperson Tracking Daabase, for he visual and he audio-visual condiions A and B. The resuls show ha under fairly conrolled condiions, as can be expeced of meeing siuaions wih relaively few paricipans, an overhead wide angle view analysis can yield considerable advanages over more elaborae mulicamera sysems, even if only simple feaures, such as foreground blobs are used. Overall, an accuracy of 52% could be reached for he audio-visual ask, wih posiion errors below 23cm. 9 Acknowledgmens The work presened here was parly funded by he European Union (EU) under he inegraed projec CHIL, Compuers in he Human Ineracion Loop (Gran number IST-506909). References 1. Rania Y. Khalaf and Sephen S. Inille, Improving Muliple People Tracking using Temporal Consisency, MIT Dep. of Archiecure House n Projec Technical Repor, 2001. 2. Wei Niu, Long Jiao, Dan Han, and Yuan-Fang Wang, Real-Time Muli-Person Tracking in Video Surveillance, Pacific Rim Mulimedia Conference, Singapore, 2003. 3. A. Mial and L. S. Davis, M2Tracker: A Muli-View Approach o Segmening and Tracking People in a Cluered Scene Using Region-Based Sereo, European Conf. on Compuer Vision, LNCS 2350, pp. 18-33, 2002. 4. Neal Checka, Kevin Wilson, Vibhav Rangarajan, Trevor Darrell, A Probabilisic Framework for Muli-modal Muli-Person Tracking, Workshop on Muli- Objec Tracking (CVPR), 2003. 5. Dorin Comaniciu and Peer Meer, Mean Shif: A Robus Approach Toward Feaure Space Analysis. IEEE PAMI, Vol. 24, No. 5, May 2002. 6. Ismail Hariaoglu, David Harwood and Larry S. Davis, W4: Who? When? Where? Wha? A Real Time Sysem for Deecing and Tracking People. Third Face and Gesure Recogniion Conference, pp. 222 227, 1998. 7. Yogesh Raja, Sephen J. McKenna, Shaogang Gong, Tracking and Segmening People in Varying Lighing Condiions using Colour. 3rd. In. Conference on Face & Gesure Recogniion, pp. 228, 1998.
8. Paul Viola and Michael Jones, Rapid Objec Deecion using a Boosed Cascade of Simple Feaures. IEEE In. Conference On Compuer Vision And Paern Recogniion, 2001. 9. Rainer Lienhar and Jochen Mayd, An Exended Se of Haar-like Feaures for Rapid Objec Deecion. IEEE ICIP 2002, Vol. 1, pp. 900 903, Sep. 2002. 10. T. Gehrig, J. McDonough, A Join-Probabilisic Daa Associaion Filer Based Source Localizaion Technique. CLEAR 2006. 11. Alexander Elbs, Mehrpersonenracking miels Farbe und Deekorkaskaden. Diplomarbei. Insiu für Theoreische Informaik, Universiä Karlsruhe, Augus 2005. 12. Kai Nickel and Rainer Siefelhagen, Poining Gesure Recogniion based on 3Dracking of Face, Hands and Head Orienaion, 5h Inernaional Conference on Mulimodal Inerfaces, Vancouver, Canada, Nov. 2003. 13. Dirk Focken, Rainer Siefelhagen, Towards Vision-Based 3-D People Tracking in a Smar Room, IEEE Inernaional Conference on Mulimodal Inerfaces, Pisburgh, PA, USA, Ocober 14-16, 2002, pp. 400-405. 14. Keni Bernardin, Alexander Elbs and Rainer Siefelhagen, Muliple Objec Tracking Performance Merics and Evaluaion in a Smar Room Environmen, acceped Sixh IEEE Inernaional Workshop on Visual Surveillance, in conjuncion wih ECCV2006, May 13h 2006, Graz, Ausria 15. Hai Tao, Harpree Sawhney and Rakesh Kumar, A Sampling Algorihm for Tracking Muliple Objecs. Inernaional Workshop on Vision Algorihms: Theory and Pracice, pp. 53 68, 1999. 16. Chrisopher Wren, Ali Azarbayejani, Trevor Darrell, Alex Penland, Pfinder: Real-Time Tracking of he Human Body. IEEE Transacions on Paern Analysis and Machine Inelligence, vol 19, no 7, pp. 780 785, July 1997. 17. CHIL - Compuers In he Human Ineracion Loop, hp://chil.server.de 18. AMI - Augmened Mulipary Ineracion, hp://www.amiprojec.org 19. VACE - Video Analysis and Conen Exracion, hp://www.ic-arda.org 20. OpenCV - Open Compuer Vision Library, hp://sourceforge.ne/projecs/opencvlibrary/