AUTOMATED VISUAL TRAFFIC MONITORING AND SURVEILLANCE THROUGH A NETWORK OF DISTRIBUTED UNITS

A. Koutsia a, T. Semertzidis a, K. Dimitropoulos a, N. Grammalidis a and K. Georgouleas b

a Informatics and Telematics Institute, Centre for Research and Technology Hellas, 1st km Thermi-Panorama road, 57001 Thessaloniki, Greece - (koutsia, theosem, dimitrop, ngramm)@iti.gr
b MARAC Electronics, 65 Marias Kiouri & Tripoleos Str, 188 63, Piraeus, Greece - georgouleas@marac.gr

Commission III, WG III/5

KEY WORDS: Computer Vision, Visual Analysis, Fusion, Location Based Services, Calibration, Change Detection, Matching

ABSTRACT:

This work presents an intelligent system for tracking moving targets (such as vehicles or persons) based on a network of distributed autonomous units that capture and process images from one or more pre-calibrated visual sensors. The proposed system, which has been developed within the framework of the TRAVIS (TRAffic VISual monitoring) project, is flexible and scalable and can be applied in a broad field of applications. Two pilot installations have been deployed for initial evaluation and testing: one for traffic control of aircraft parking areas and one for highway tunnels. The various computer vision techniques that were implemented and tested during the development of the project are described and analysed, and multiple background extraction and data fusion algorithms are comparatively evaluated.

1. INTRODUCTION

1.1 Related Work

Traffic control and monitoring using video sensors has drawn increasing attention recently due to significant advances in the field of computer vision. Many commercial and research systems use video processing, aiming to solve specific problems in road traffic monitoring (Kastrinaki, 2003). An efficient application for monitoring and surveillance from multiple cameras is the Reading People Tracker (Le Bouffant, 2002), which was later used as a base for the development of a system called AVITRACK, which monitors airplane servicing operations (Thirde, 2006).
Furthermore, in the FP5 INTERVUSE project, an artificial vision network-based system was developed to monitor the ground traffic at airports (Pavlidou, 2005). That system uses the Autoscope Solo Wide Area Video Vehicle Detection System, which has been successfully deployed worldwide for monitoring and controlling road traffic (Michalopoulos, 1991).

1.2 Motivation and Aims

Robust and accurate detection and tracking of moving objects has always been a complex problem. Especially in the case of outdoor video surveillance systems, the visual tracking problem is particularly challenging due to illumination or background changes, occlusion problems etc. The aim of the TRAVIS project was to determine whether recent advances in the field of Computer Vision can help overcome these problems and support a robust traffic surveillance application. The final system is easily adjustable and parameterised, so as to be suitable for diverse applications related to target tracking. Two prototypes have been installed, each for a different application:

Traffic control of aircraft parking areas (APRON). This application focuses on the graphical display of the ground situation at the APRON. The system calculates the position, velocity and direction of the targets and classifies them according to their type (car, man, long vehicle etc). Alerts are displayed for dangerous situations, such as speeding. This information is accessible to the respective employees even if they are situated in a distant area, with no direct eye contact with the APRON. A pilot installation of this system took place at Macedonia airport of Thessaloniki, Greece.

Traffic control of tunnels at highways. The focus of this application is on the collection of traffic statistics, such as speed and traffic loads per lane. It can also identify dangerous situations, such as falling objects, animals or traffic jams. These results can be sent to traffic surveillance centres or used to activate road signs/warning lights. This prototype was installed at a highway tunnel at Piraeus Harbour, Athens, Greece.

2.
SYSTEM ARCHITECTURE

The proposed system consists of a scalable network of autonomous tracking units (ATUs) that use cameras to capture images, detect moving objects and provide results to a central sensor data fusion (SDF) server. The SDF server is responsible for tracking and visualising moving objects in the scene, as well as for collecting statistics and providing alerts in dangerous situations. The system provides a choice between two modes, each supporting a different data fusion technique: grid mode separates the ground plane into cells and fuses neighbouring observations, while map fusion mode warps greyscale images of foreground objects in order to fuse them. The topology of the ATU network varies in each application, depending on the existing infrastructure, geomorphological factors, and bandwidth and cost limitations. The network architecture is based on a wired or wireless TCP/IP connection, as illustrated in Figure 1. These topologies can be combined to produce a hybrid network of ATUs. Depending on the available network bandwidth, images captured from specific video sensors may
also be coded and transmitted to the SDF server, to allow inspection by a human observer (e.g. a traffic controller).

Figure 1: Basic architecture of the proposed system (sensors feed Autonomous Tracking Units 1..n, which connect to the SDF server over Ethernet or WiFi/WiMAX).

3. THE AUTONOMOUS TRACKING UNITS

Each ATU is a powerful processing unit (PC or embedded PC), which periodically obtains frames from one or more video sensors. The video sensors are standard CCTV cameras, not necessarily of high resolution, equipped with a casing appropriate for outdoor use and telephoto lenses for distant observation. They are also static (fixed field of view) and pre-calibrated. Each ATU consists of the following modules:

- Calibration module (off-line unit to calibrate each video sensor). To obtain the exact position of the targets in the real world, the calibration of each camera is required, so that any point can be converted from image coordinates (measured in pixels from the top left corner of the image) to ground coordinates and vice versa. A calibration technique based on a 3x3 homographic transformation, which uses both point and line correspondences, was used (Dimitropoulos, 2005). The observed targets are small with respect to the distance from the video sensors and move on a ground surface, which can therefore be approximated by a plane. For more accurate results, a calibration tool (Figure 2) has been developed. This tool visualises two camera views, one of which is considered the base view according to which the other camera is calibrated. It then allows the user to dynamically choose corresponding points on the two views before it warps them onto the ground plane. The user can repeat the procedure until the visual results are considered satisfactory.

- Background extraction and update module. Each ATU can automatically deal with background changes (e.g. grass or trees moving in the wind) or lighting changes (e.g.
day, night etc), supporting several robust background extraction algorithms, namely: mixture of Gaussians modelling (KaewTraKulPong, 2001), the Bayes algorithm (Liyuan Li, 2003), the Lluis-Miralles-Bastidas method (Lluis, 2005) and non-parametric modelling (Elgammal, 2000).

- Foreground segmentation module. Connected component labelling is applied to identify individual foreground objects.

- Blob tracking module (optional). The Multiple Hypothesis Tracker (Cox, 1996) was used, although association and tracking of very fast moving objects could be problematic.

Figure 2: Screenshot from the calibration tool

- Blob classification module. A set of classes of moving objects (e.g. person, car etc) is initially defined for each application. Each blob is then classified by calculating its membership probability for each class, using a previously trained backpropagation neural network. Specifically, 9 attributes characteristic of its shape and size are used as input to the neural network: the lengths of the major and minor axes of the blob's ellipse and the 7 Hu moments (Hu, 1962) of the blob, which are invariant to both rotation and translation. The number of outputs of the neural network equals the predefined number of classes; the class is determined by the maximum output value.

- 3-D observation extraction module. It uses the available camera calibration information to estimate the accurate position of targets in the scene. Since the camera calibration is based on homographies, an estimate of the position (xw, yw) of a target in world coordinates can be directly determined from the centre of each blob. Each observation is also associated with a reliability matrix R, which depends on the camera geometry and the observation's position on the camera plane.
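The homography-based mapping from image to world coordinates can be illustrated with a short sketch. This is not the TRAVIS code, and the matrix values are hypothetical; in the real system the 3x3 matrix H comes from the off-line calibration module.

```python
# Illustrative sketch: mapping a blob centre from image coordinates to
# world coordinates through a 3x3 homography H (hypothetical values).

def apply_homography(H, x, y):
    """Map image point (x, y) to world coordinates via H (homogeneous)."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w

# Hypothetical calibration result: a scaling of 0.5 plus a translation.
H = [[0.5, 0.0, 10.0],
     [0.0, 0.5, 20.0],
     [0.0, 0.0, 1.0]]

xw, yw = apply_homography(H, 100.0, 60.0)  # blob centre in pixels
print(xw, yw)  # 60.0 50.0
```

Libraries such as OpenCV provide the same computation through their perspective-transform routines; the sketch only makes the arithmetic explicit.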
This matrix is calculated using the calibration information (Borg, 2005):

R(xw, yw) = J(x, y) Λ J(x, y)^T    (1)

where J = Jacobian matrix of the partial derivatives of the mapping functions between the camera and the world co-ordinate systems, and Λ = measurement covariance at location (x, y) on the camera plane, which is assumed to be a fixed diagonal matrix.

Figure 3: Execution times of ATU modules (background extraction: 59%, 43.56 ms; blob extraction: 38%, 27.80 ms; observation estimation: 3%, 2.55 ms)
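Equation (1) is a standard first-order propagation of the pixel-level measurement covariance into world coordinates. A minimal sketch for the 2x2 case follows; the values of J and Λ are made up for illustration, whereas in the system J is the Jacobian of the image-to-world mapping at the observation and Λ is a fixed diagonal pixel covariance.

```python
# Sketch of equation (1): R(xw, yw) = J Λ J^T for 2x2 matrices.

def matmul2(A, B):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose2(A):
    return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]

J = [[2.0, 0.0],
     [0.5, 1.0]]      # hypothetical Jacobian at (x, y)
Lam = [[0.25, 0.0],
       [0.0, 0.25]]   # hypothetical fixed diagonal pixel covariance

R = matmul2(matmul2(J, Lam), transpose2(J))
print(R)  # [[1.0, 0.25], [0.25, 0.3125]]
```

The result is symmetric positive semi-definite, as a covariance matrix must be, and grows where the Jacobian magnifies pixel errors (i.e. far from the camera).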
Figure 3 shows the absolute and relative execution times per frame of the basic modules of the ATUs. These times were acquired by applying the non-parametric modelling method, working under grid mode. The background extraction module is the most crucial one, as the computational cost of such methods is typically large, causing problems for real-time systems. Therefore, many experiments have been conducted in order to provide both a qualitative and a computational evaluation of these methods.

4. NETWORK COMMUNICATIONS

The final output of each ATU is a small set of parameters (ground coordinates, classification, reliability), which is transmitted to the SDF server over a wired or wireless link. If the foreground map fusion technique is used, a greyscale image is also provided at each polling cycle, indicating the probability of each pixel belonging to the foreground. All these data are transmitted through a wired or wireless IP connection to the server, which performs observation fusion and target tracking. The TCP protocol is used for transmission of the data from the ATUs to the central server, whereas UDP is used for remote control of the ATUs. As an indication, the bandwidth used per ATU when operating under the map fusion mode with a frame rate of 3 fps is about 192 Kbps (3 fps x 8 Kbyte/frame).

The system requires frame synchronisation and a constant frame rate across all ATUs, which are achieved by using the Network Time Protocol (NTP). The systems' clocks synchronise to the central server's clock, and an appointment-time technique (Litos, 2006) is implemented to ensure that frames from all cameras are captured at the same instant despite network latency. A secondary system based on media server software streams video on demand to the central server in order to enable human visual monitoring of the scene. As an alternative, compressed motion JPEG images (JPEG 2000) can be used for streaming.

5.
SENSOR DATA FUSION SERVER

The SDF server collects information from all ATUs using a constant polling cycle, produces fused estimates of the position and velocity of each moving target, and tracks these targets using a multi-target tracking algorithm. It also produces a synthetic ground situation display (Figure 4), collects statistical information about the moving targets and provides alerts when specific situations (e.g. accidents) are detected.

5.1 Data fusion

A target simultaneously present in the field of view of multiple cameras will result in multiple observations, due to the fact that the blob centres of the same object in two different cameras correspond to close but different 3-D points. Two techniques are proposed for grouping together all the observations that correspond to the same target.

5.1.1 Grid-based fusion. A grid that separates the overlap area (in world coordinates) into cells is defined. Optimal values for the cell size are determined considering the application requirements (e.g. maximum distance between vehicles). Each observation is assigned two index values (ix, iy) that indicate its position on the grid:

(ix, iy) = ([xw - xs] mod c, [yw - ys] mod c)    (2)

where xs, ys = world coordinates of the top left corner of the overlap area, xw, yw = world coordinates of the camera-level observation, and c = cell size.

Observations belonging to the same cell or to neighbouring cells are grouped together into a single fused observation. To implement this technique, the grid is expressed as a binary image: cells that have at least one assigned observation are represented by a white pixel, while those with no observations are represented by a black pixel. A connected component labelling algorithm is then used to identify blobs in this image, each corresponding to a single moving target. Fused observations are produced by averaging the parameters of the observations that belong to each group.
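The grouping step can be sketched as follows. The sketch bins observations into cells with floor division by the cell size and groups occupied cells through an 8-neighbour flood fill, which plays the role of the connected component labelling described above; all coordinates and the cell size are illustrative, not taken from the system.

```python
# Illustrative sketch of grid-based fusion: observations in the same or
# adjacent cells are grouped into one fused target.

def cell_index(xw, yw, xs, ys, c):
    """Grid cell of a world-coordinate observation, relative to the
    top-left corner (xs, ys) of the overlap area, cell size c."""
    return int((xw - xs) // c), int((yw - ys) // c)

def group_observations(obs, xs, ys, c):
    """Group observation indices whose cells touch (8-neighbourhood)."""
    cells = {}
    for n, (xw, yw) in enumerate(obs):
        cells.setdefault(cell_index(xw, yw, xs, ys, c), []).append(n)
    # Flood fill over occupied cells, standing in for connected
    # component labelling on the binary grid image.
    groups, seen = [], set()
    for start in cells:
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:
            ix, iy = stack.pop()
            members.extend(cells[(ix, iy)])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (ix + dx, iy + dy)
                    if nb in cells and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        groups.append(sorted(members))
    return groups

obs = [(1.0, 1.0), (2.5, 1.5), (40.0, 40.0)]  # two close targets, one far
print(group_observations(obs, 0.0, 0.0, 2.0))  # [[0, 1], [2]]
```

The first two observations land in adjacent cells and are fused into a single target, while the distant third observation forms its own group.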
More specifically, each fused observation consists of an estimated position in world coordinates, an uncertainty matrix and a classification probability matrix. The position and uncertainty matrices (Z, R) of the fused observation are given by the following equations:

R = ( Σn Rn^-1 )^-1
Z = R Σn Rn^-1 Zn    (3)

where Zn = the position (in world coordinates) of the n-th observation in a group of N, and Rn = the uncertainty matrix of the n-th observation in a group of N.

Figure 4: SDF window with 3 targets on the airport APRON

To calculate the average classification vector, the uncertainty of each observation is taken into account. In this case the larger of
the two axes of the observation uncertainty ellipse is used to specify a weight for the observation in the classification averaging. Depending on the magnitude of this metric, a corresponding weight wn is assigned to each observation. The average classification vector C is then calculated by:

C = ( Σn wn Cn ) / ( Σn wn )    (4)

These parameters (Z, R, C) of each fused observation comprise the input for the tracking unit.

5.1.2 Foreground map fusion. In this technique, each ATU provides the SDF server with one greyscale image per polling cycle, indicating the probability of each pixel belonging to the foreground. The SDF server fuses these maps together by warping them onto the ground plane and multiplying them (Khan, 2006). The fused observations are then generated from these fused maps using connected component analysis, and classification information is computed as in the ATU's blob classification module. Although this technique has increased computational and network bandwidth requirements compared to grid-based fusion, it can very robustly resolve occlusions between multiple views.

5.2 Multiple Target Tracking

The tracking unit is based on the Multiple Hypothesis Tracking (MHT) algorithm, which starts tentative tracks on all observations and uses subsequent data to determine which of these newly initiated tracks are valid. Specifically, MHT (Blackman, 1999) is a deferred decision logic algorithm in which alternative data association hypotheses are formed whenever there are observation-to-track conflict situations. Then, rather than combining these hypotheses, the hypotheses are propagated in anticipation that subsequent data will resolve the uncertainty. Generally, hypotheses are collections of compatible tracks; tracks are defined to be incompatible if they share one or more common observations. MHT is a statistical data association algorithm that integrates the capabilities of:

Track Initiation: Automatic creation of new tracks as new targets are detected.
Track Termination: Automatic termination of a track when the target is no longer visible for an extended period of time.

Track Continuation: Continuation of a track over several frames in the absence of observations.

Explicit Modelling of Spurious Observations.

Explicit Modelling of Uniqueness Constraints: An observation may only be assigned to a single track at each polling cycle, and vice versa.

Specifically, the tracking unit was based on a fast implementation of the MHT algorithm (Cox, 1996). A 2-D Kalman filter is used to track each target, and additional gating computations are performed to discard unlikely observation-track pairs. More specifically, a gate region is defined around each target at each frame, and only observations falling within this region are possible candidates to update the specific track. Accurate modelling of the target motion is very difficult, since a target may stop, move, accelerate, etc. Since only position measurements are available, a simple four-state (position and velocity along each axis) CV (constant velocity) target motion model, in which the target acceleration is modelled as white noise, provides satisfactory results.

Figure 5 shows the absolute and relative values of the execution times per frame of the SDF server modules. These times were acquired by working under the map fusion mode. The data fusion module appears to be the most time-consuming one.

Figure 5: Execution times of SDF modules (data fusion: 70%, 3.70 ms; display: 29%, 3.25 ms; tracker: 1%, 0.22 ms)

6. EXPERIMENTAL RESULTS

In this section, experimental results concerning the most crucial and computationally expensive modules of the ATU and SDF software are presented and discussed. These modules have been identified as the background extraction module for the ATUs and the data fusion module for the SDF server. To decide on the most appropriate background extraction technique for the specific applications, tests have been run on various sequences.
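The constant-velocity motion model of Section 5.2 can be made concrete with a minimal one-axis Kalman filter (position and velocity state, position-only measurements). This is a hedged sketch, not the TRAVIS tracker: the actual system runs a four-state filter (both axes), and the noise parameters q and r here are made up.

```python
# Minimal constant-velocity Kalman filter along one axis, as a sketch of
# the tracker's CV motion model; q and r are illustrative noise values.

def cv_kalman(zs, dt=1.0, q=1e-3, r=0.5):
    x = [zs[0], 0.0]              # state: [position, velocity]
    P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
    out = []
    for z in zs:
        # predict with F = [[1, dt], [0, 1]] (constant velocity)
        x = [x[0] + dt * x[1], x[1]]
        P = [[P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + q,
              P[0][1] + dt * P[1][1]],
             [P[1][0] + dt * P[1][1],
              P[1][1] + q]]
        # update with position measurement z (H = [1, 0])
        S = P[0][0] + r
        K = [P[0][0] / S, P[1][0] / S]
        y = z - x[0]
        x = [x[0] + K[0] * y, x[1] + K[1] * y]
        P = [[(1 - K[0]) * P[0][0], (1 - K[0]) * P[0][1]],
             [P[1][0] - K[1] * P[0][0], P[1][1] - K[1] * P[0][1]]]
        out.append(x[0])
    return out

# Noisy measurements of a target moving at roughly constant speed:
est = cv_kalman([0.1, 1.0, 2.1, 2.9, 4.0, 5.1, 6.0])
print([round(p, 2) for p in est])
```

Gating then amounts to comparing the innovation y against the innovation covariance S before accepting an observation for a track.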
The masks shown in Figure 6 were obtained from the prototype system installation at Macedonia airport in Thessaloniki. Figure 6 (a) shows the original image, while in Figure 6 (b) the three moving objects that need to be detected by the background extraction methods are marked with red circles. As seen in Figure 6 (c), the objects are detected by the mixture of Gaussians method, although the shape of the masks is distorted due to shadows. The results of the Bayes algorithm are shown in Figure 6 (d); this method fails to detect slowly moving objects like the one on the left of the image. The Lluis et al method, shown in Figure 6 (e), produces masks with a low level of connectivity, which are not suitable for the subsequent image processing steps. Finally, the non-parametric modelling method (Figure 6 (f)) yields very accurate results, while coping well with shadows, as it incorporates an additional post-processing step of shadow removal.

Another crucial issue when deciding on the most appropriate background extraction algorithm is its execution time. To evaluate the computational complexity, all four methods were applied to three sequences of different resolutions (320x740 px and two others). The execution times per frame for each of the four methods and three sequences are presented in Figure 7. An Intel Pentium 4 at 3.2 GHz with 1 GB of RAM running Windows XP Pro was used, and all algorithms were implemented in C++ using the open source library OpenCV. Taking into consideration both the qualitative results and the computational complexity of the background extraction methods, non-parametric modelling emerges as the one with the best trade-off between result quality and execution time.
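None of the four evaluated methods is as simple as the following sketch, which is included only to show the two basic ingredients every background extraction scheme shares: a per-pixel background model update and a thresholded foreground mask. The learning rate, threshold and pixel values are all illustrative.

```python
# Minimal running-average background model (far simpler than the
# methods compared above; shown only to illustrate the structure).

def update_background(bg, frame, alpha=0.05):
    """Blend the new frame into the background model, pixel by pixel."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(brow, frow)]
            for brow, frow in zip(bg, frame)]

def foreground_mask(bg, frame, thresh=30):
    """1 where the frame differs from the background by more than thresh."""
    return [[1 if abs(f - b) > thresh else 0 for b, f in zip(brow, frow)]
            for brow, frow in zip(bg, frame)]

bg = [[100.0, 100.0], [100.0, 100.0]]     # learned grey background
frame = [[100.0, 100.0], [100.0, 200.0]]  # a bright object enters
mask = foreground_mask(bg, frame)
print(mask)  # [[0, 0], [0, 1]]
bg = update_background(bg, frame)
```

The evaluated methods replace the single running average with richer per-pixel models (Gaussian mixtures, Bayesian feature statistics, kernel density estimates), which is precisely where their higher computational cost comes from.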
In order to analyse the results acquired by the system, ground truth data had to be made available. For the airport prototype, tests were conducted using cars equipped with GPS devices. For the tunnel prototype, on the other hand, the ground truth data was collected by viewing image sequences through a specially developed manual marking tool (Figure 8), which allows the user to specify the correct position of the moving targets. The ground truth data was correlated with the system results and then inserted into a database, along with other information such as weather conditions. A test analysis tool has also been implemented in order to display various statistics that can be acquired by querying the database.

This test analysis tool was used to provide a qualitative comparison of the two operation modes of the system, the grid and map fusion modes. A very crucial statistic for this kind of evaluation is the absence error: a high value of this statistic means that the system is prone to being led to wrong conclusions with severe consequences, such as failing to identify an emergency situation. As can be seen in Figure 9, absence errors appear more rarely than the other two types of errors (presence and position) for both modes. Especially in the map fusion mode, this statistic is even lower, reaching half the value acquired in grid mode.

Figure 6: a) Original image b) Moving objects c) Mixture of Gaussians mask d) Bayes algorithm mask e) Lluis-Miralles-Bastidas mask f) Non-parametric model mask

Although the map fusion method appears to provide more accurate results, there are other areas where the grid mode shows better performance, such as the utilisation of less bandwidth.
The volume of data transmitted by each ATU at every frame cycle for grid fusion, as measured for the prototype applications, was 1 Kbyte/frame, while for the map fusion mode it was measured at 8 Kbyte/frame. Another characteristic of the two methods worth mentioning is their execution times, both for the ATUs and for the SDF server. As shown in Figure 10, the grid mode is more intensive for the ATUs, while the map fusion mode incurs longer execution times at the SDF server. In other words, the choice of operation mode is a complex decision that should be based on several factors, such as the network capabilities, the available computational power of the SDF server unit and the total number of ATUs used in the particular configuration of the system.

Figure 7: Background extraction methods execution times (Bayes, MoG, Lluis and non-parametric methods at the three tested resolutions)

Figure 9: Percentage of error distribution per mode (absence, presence and position errors, for the grid and map fusion modes)

Figure 8: Screenshot from the Manual Marking tool
Figure 10: ATU and SDF execution times for both modes (ATU: grid 73.92 ms, map fusion 43.57 ms; SDF: grid 4.26 ms, map fusion 45.85 ms)

7. CONCLUSIONS

This paper presented an automated solution for visual traffic monitoring based on a network of distributed tracking units. The system can be easily adjusted and parameterised for use in several traffic monitoring applications, as it was built on results acquired from two diverse pilot installations. The first prototype, installed at an airport APRON, used an outdoor scene with a large field of view, while the second prototype, installed in a highway tunnel, used an indoor scene with smaller distances and more occlusions.

The results presented focused on the two most important modules of the system: the background extraction method and the data fusion technique. After both qualitative and quantitative evaluation of multiple alternatives, the non-parametric modelling method was chosen as the best solution for the background extraction module. On the other hand, both data fusion techniques tested showed satisfactory behaviour under different situations, and the final choice between the two should depend on the specific application demands and infrastructure. An interesting future extension is to take advantage of the low-bandwidth output of the SDF server in order to create a 3-D synthetic representation of the scene under surveillance, which could be rendered on remote 3-D displays.

REFERENCES

Blackman, S., Popoli, R., 1999. Design and Analysis of Modern Tracking Systems. Artech House, Boston, USA.

Borg, M., Thirde, D., Ferryman, J., Fusier, F., Valentin, V., Brémond, F., Thonnat, M., Aguilera, J., Kampel, M., 2005. Visual Surveillance for Aircraft Activity Monitoring. VS-PETS 2005, Beijing, China.

Cox, I.J., Hingorani, S.L., 1996. An Efficient Implementation of Reid's Multiple Hypothesis Tracking Algorithm and its Evaluation for the Purpose of Visual Tracking.
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, pp. 138-150.

Dimitropoulos, K., Grammalidis, N., Simitopoulos, D., Pavlidou, N., Strintzis, M., 2005. Aircraft Detection and Tracking using Intelligent Cameras. IEEE International Conference on Image Processing, Genova, Italy, pp. 594-597.

Elgammal, A., Harwood, D., Davis, L., 2000. Non-parametric Model for Background Subtraction. Computer Vision, ECCV 2000.

Hu, M-K., 1962. Visual pattern recognition by moment invariants. IRE Trans. on Information Theory, IT-8, pp. 179-187.

KaewTraKulPong, P., Bowden, R., 2001. An Improved Adaptive Background Mixture Model for Real-time Tracking with Shadow Detection. In 2nd European Workshop on Advanced Video-based Surveillance Systems, Kingston, UK.

Kastrinaki, V., Zervakis, M., Kalaitzakis, K., 2003. A survey of video processing techniques for traffic applications. Image and Vision Computing, Vol. 21, Issue 4, pp. 359-381.

Khan, S., Shah, M., 2006. A multiview approach to tracking people in crowded scenes using a planar homography constraint. 9th European Conference on Computer Vision, Graz, Austria.

Le Bouffant, T., Siebel, N.T., Cook, S., Maybank, S., 2002. Reading People Tracker Reference Manual (Version 1.2). Technical Report No. RUCS/2002/TR//00/A, Department of Computer Science, University of Reading.

Litos, G., Zabulis, X., Triantafyllidis, G.A., 2006. Synchronous Image Acquisition based on Network Synchronization. IEEE Workshop on Three-Dimensional Cinematography.

Liyuan Li, Weimin Huang, Irene Y.H. Gu, Qi Tian, 2003. Foreground Object Detection from Videos Containing Complex Background. In International Multimedia Conference.

Lluis, J., Miralles, X., Bastidas, O., 2005. Reliable Real-Time Foreground Detection for Video Surveillance Application. In VSSN'05.

Michalopoulos, G., 1991. Vehicle Detection Video Through Image Processing: The Autoscope System. IEEE Transactions on Vehicular Technology, Vol. 40, No. 1.
Pavlidou, N., Grammalidis, N., Dimitropoulos, K., Simitopoulos, D., Gilbert, A., Piazza, E., Herrlich, C., Heidger, R., Strintzis, M., 2005. Using Intelligent Digital Cameras to Monitor Aerodrome Surface Traffic. IEEE Intelligent Systems, Vol. 20, No. 3, pp. 76-81.

Thirde, D., Borg, M., Ferryman, J., Fusier, F., Valentin, V., Bremond, F., Thonnat, M., 2006. A Real-Time Scene Understanding System for Airport Apron Monitoring. In Proceedings of the Fourth IEEE International Conference on Computer Vision Systems.

ACKNOWLEDGEMENTS

This work was supported by the General Secretariat of Research and Technology Hellas under the InfoSoc TRAVIS (TRAffic VISual monitoring) project and by the EC under the FP6 IST Network of Excellence 3DTV: Integrated Three-Dimensional Television - Capture, Transmission, and Display (contract FP6-511568).