Object and Event Extraction for Video Processing and Representation in On-Line Video Applications
INRS-Télécommunications, Institut national de la recherche scientifique

Object and Event Extraction for Video Processing and Representation in On-Line Video Applications

by Aishy Amer

Thesis presented in fulfillment of the requirements for the degree of Philosophiae Doctor (Ph.D.) in Telecommunications.

Examination committee:
External examiner: Dr. D. Nair, National Instruments, Texas
External examiner: Prof. H. Schröder, Universität Dortmund, Germany
Internal examiner: Prof. D. O'Shaughnessy, INRS-Télécommunications
Research co-director: Prof. A. Mitiche, INRS-Télécommunications
Research director: Prof. E. Dubois, Université d'Ottawa

© Aishy Amer, Montréal, December 20, 2001
Acknowledgments

I have enjoyed and benefited from the pleasant and stimulating research environment at INRS-Télécommunications. Thank you to all the members of the centre for all the good times. I am truly grateful to my advisor Prof. Eric Dubois for his wise suggestions and continuously encouraging support during the past four years. I would also like to express my sincere gratitude to Prof. Amar Mitiche for his wonderful supervision and for his active contributions to improving the text of this thesis. I also thank Prof. Konrad for his help during the initial work of this thesis. I am also indebted to my colleagues, in particular Carlos, François, and Souad, for the interesting discussions and for their active help in enhancing this document. My special thanks go to each member of the dissertation jury for agreeing to evaluate such a long thesis. I have sincerely appreciated your comments and questions. Some of the research presented in this text was initiated during my stay at Universität Dortmund, Germany. I would like to thank Prof. Schröder for advising my initial work. I am also grateful to my German friends, former students, and colleagues, in particular Dr. H. Blume, who greatly helped me in Dortmund. Thank you very much. My deep gratitude goes to my sisters, my brothers, my nieces, and my nephews, who have supplied me with love and energy throughout my graduate studies. My special thanks go also to Sara, Alia, Marwan, Hakim, and to all my friends for being supportive despite the distances. I would like to express my warm appreciation to Hassan for caring and being patient during the past four years. His understanding provided the strongest motivation to finish this writing.
Abstract

As the use of video becomes increasingly popular and widespread through, for instance, broadcast services, the Internet, and security-related applications, fast, automated, and effective techniques to represent video based on its content, such as objects and meaning, are important topics of research. In a surveillance application, for instance, object extraction is necessary to detect and classify object behavior, and with video databases, effective retrieval must be based on high-level features and semantics. Automated content representation would significantly facilitate the use and reduce the costs of video retrieval and surveillance by humans. Most video representation systems are based on low-level quantitative features or focus on narrow domains. There are few representation schemes based on semantics; most of these are context-dependent and focus on the constraints of a narrow application, and they therefore lack generality and flexibility. Most systems assume simple environments, for example, without object occlusion or noise. The goal of this thesis is to provide a stable content-based video representation rich in terms of generic semantic features and moving objects. Objects are represented using quantitative and qualitative low-level features. Generic semantic features are represented using events and other high-level motion features. To achieve higher applicability, content is extracted independently of the type and the context of the input video. The proposed system is aimed at three goals: flexible content representation, reliable and stable processing that foregoes the need for precision, and low computational cost. The proposed system targets video of real environments, such as those with object occlusions and artifacts. To achieve these goals, three processing levels are proposed: video enhancement to estimate and reduce noise, video analysis to extract meaningful objects and their spatio-temporal features, and video interpretation to extract context-independent semantics such as events. The system is modular and layered from low level to middle level to high level, with levels exchanging information. The reliability of the proposed system is demonstrated by extensive experimentation on various indoor and outdoor video shots. Reliability is due to noise adaptation and to the correction or compensation of estimation errors at one step by processing at subsequent steps, where higher-level information is available. The proposed system provides a response in real time for applications with a rate of up to 10 frames per second on a shared computing machine. This response is achieved by dividing each processing level into simple but effective tasks and avoiding complex operations.
Object and Event Extraction for Video Processing and Representation in On-Line Video Applications

by Aishy Amer

Résumé

Contents: I. Context and objective; II. Overview of related work; III. Proposed approach and methodology; IV. Results; V. Conclusion; VI. Possible extensions.

I. Context and objective

Visual information has become integrated into every sector of modern communication, even into low-bandwidth services such as mobile communication. Effective techniques for the analysis, description, manipulation, and retrieval of visual information are therefore important and practical research topics. Video is subject to different interpretations by different observers, and content-based video representation can vary with the observer and the application. Many existing systems address these problems by trying to develop a solution that is general for all video applications. Others concentrate on resolving complex situations but assume a simple environment, for example solving a problem in an environment without occlusion and without noise or artifacts.
Research in video processing has considered video data as pixels, blocks, or global structures for representing video content. This, however, is not sufficient for advanced video applications. In a surveillance application, for example, object-related video representation requires the automatic detection of activities. For video databases, retrieval must be based on semantics. Content-based video representation has consequently become a very active field of research. Examples of this activity are multimedia standards such as MPEG-4 and MPEG-7 and various projects in video surveillance and visual database retrieval [129, 130]. Given the growing amount of stored video data, the development of automatic and effective techniques for content-based video representation is a problem of increasing importance. Such a video representation aims at a significant reduction of the amount of video data by transforming a video shot of some hundreds or thousands of images into a small set of information. This data reduction has two advantages: first, a large video database can be searched efficiently based on its content and, second, memory usage is reduced significantly. Developing content-based representation requires the resolution of two key problems: defining the video contents that are of interest, and the attributes suited to representing that content. Studying the properties of the human visual system (HVS) helps answer some of these questions. When viewing a video, the HVS is most sensitive to areas that move and, more generally, to moving objects and their features. The interest in this thesis is first in the high-level features of objects (that is, semantics) and then in the low-level ones (for example, texture). The main question is then: what level of object semantics and which features are the most important for content-based video applications? For example, are high-level intentional descriptions needed? An important observation is that the subject of the majority of video is a moving object [105, 72, 56] that performs activities and acts to create events. Moreover, the HVS is able to search for interesting activities and events by quickly scanning ("flipping" through) a video. Some video representation systems implement such flipping by skipping images on the basis of low-level features (for example, using color-based key-image extraction) or by extracting global cues. This can, however, miss interesting data, because flipping should be based on object activities or events combined with low-level features such as shape.
This would allow flexible video retrieval and surveillance. To effectively represent video based on content such as objects and events, three video processing systems are needed: video enhancement to reduce noise and artifacts, video analysis to extract low-level video features, and video interpretation to describe semantic content.

II. Overview of related work

Video systems incorporating content-based video representations have recently been developed. Most of the related research concentrates on video analysis techniques without integrating them into a functional video system. This leads to interesting techniques that are, however, often disconnected from practical concerns. Furthermore, these systems represent video by basic global features such as global motion or key images. Few video representation systems have addressed the subject using objects; indeed, most of them use only basic features such as motion or shape. The most developed object-based video representations concentrate on narrow domains (for example, soccer scenes or road traffic monitoring). Systems that incorporate object-based video representations use a quantitative description of the video data and objects. Users of advanced video applications such as video retrieval do not know exactly what the video they are looking for looks like. They do not have exact quantitative information about its motion, shape, or texture. Qualitative video representations that are easy to use for video retrieval or surveillance are therefore essential. In video retrieval, for example, most existing retrieval tools ask the user for a sketch or an example view of the video, e.g., the user searches after browsing the database. Browsing large databases can, however, take a lot of time, particularly for complex scenes (that is, real-world scenes). Providing users with the means to describe a video by qualitative descriptors is essential for the success of such applications. There are few representation schemes concerned with the activities and events occurring in video shots. Much of the work on event detection and classification concentrates on how to express events using artificial intelligence techniques such as reasoning and inference. Other event-based video representation systems are developed for specific domains. Despite the large improvement in the quality of modern video acquisition systems, noise remains a problem that complicates video processing algorithms. In addition, various coding artifacts are found in digitally transmitted video.
The reduction of noise and artifacts is therefore still an important task in video applications. Both noise and coding artifacts affect the quality of the video representation and should be taken into account. While noise reduction has been the subject of many publications (few methods deal with real-time constraints), the impact of coding artifacts on the performance of video processing has not been sufficiently studied. Because of progress in micro-electronics, it is possible to include sophisticated video processing techniques in services and devices. However, the real-time aspect of these new techniques is crucial for their general applicability. Many video applications that require a high-level representation of video content operate in a real-time environment and therefore demand real-time performance. Few content-based representation approaches take this constraint into consideration.

III. Proposed approach and methodology

The objective of this thesis is to develop a system for content-based video representation through an automated object extraction system integrated with an event detection system, without user interaction. The aim is to provide a content-based representation rich in terms of generic events and to address a wide range of practical video applications. Objects are represented by quantitative and qualitative low-level features. Events, in turn, are represented by high-level object features such as activities and actions. This study raises three important issues: 1. flexible object representations that are easily searched for video summarization, indexing, and manipulation; 2. reliable and stable interpretation of the video signal that foregoes the need for precision; and 3. low computational cost. This requires the contribution of algorithms that answer these three issues for the realization of content-based, consumer-oriented video applications such as surveillance and video database retrieval. These algorithms must concentrate on the practical issues of video analysis oriented toward the needs of object- and event-based video systems.

Figure 1: Overview of the proposed system (video shot → video enhancement → object-oriented video analysis → event-oriented video interpretation → object and event descriptors).

The proposed system is designed for real situations with object occlusions, illumination changes, noise, and artifacts.
To produce a high-level video representation, the proposed framework involves three stages (see Figure 1): enhancement, analysis, and interpretation. The original video signal is presented at the input of the video enhancement module, and the output is an enhanced version of it. This enhanced video is then processed by the video analysis module, which produces a low-level description of the video. The video interpretation module receives these low-level descriptions and produces a high-level description of the original video signal. The results of one stage are integrated to support the following stages, which in turn correct or support the preceding steps. For example, an object tracked at one stage is supported by the low-level segmentation; the tracking results are in turn integrated into the segmentation to confirm it. This approach, by analogy with the human visual system (HVS), finds objects where partial detection and identification present a new context that in turn supports further identification [103, 3]. The system can be viewed as a framework of methods and algorithms for building automatic dynamic scene interpretation systems. The robustness of the proposed methods will be demonstrated by extensive experimentation on well-known video sequences. Robustness is the result of adaptation to video noise and artifacts, and of processing that takes the errors obtained at one step into account for correction or compensation at the following step. The framework proposed in this thesis is designed for applications where an interpretation of the input video is needed ("what is the sequence about"). This can be illustrated by two examples: video surveillance and video retrieval. In a video surveillance system, an alarm can be activated when the proposed system detects a particular object behavior. In a video retrieval system, users can search for a video by providing a qualitative description, using information such as object attributes (for example, shape), spatial relations (for example, object i is close to object j), location (for example, object i is at the bottom of the image), and semantic or high-level features (for example, object action: object i moves left and is then occluded; event: the removal or deposit of objects). The retrieval system can then find the frames whose content best matches the qualitative description. A desirable property of video representation strategies is to provide answers to simple observation-based questions, such as how to select objects (who is in the scene), describe their action (what is he/she doing), and determine their location (where the action took place) [72, 56]. In the absence of a specific application, a generic model must be adaptable (for example, to new definitions of actions and events).
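As an illustration only, the sketch below shows one way such a qualitative query could be encoded and matched against per-frame object descriptors. The descriptor fields, event names, and matching rule are hypothetical stand-ins for this example and are not the representation developed in this thesis.

```python
# Minimal sketch (not the thesis' implementation): matching a qualitative query
# such as "a small object near the bottom of the image that moves left and is
# then occluded" against per-frame object descriptors.

def matches(descriptor, query):
    """Return True if one object descriptor satisfies every query term."""
    return all(descriptor.get(key) == wanted for key, wanted in query.items())

def find_frames(shot_descriptors, query):
    """shot_descriptors: {frame_no: [descriptor, ...]} -> list of matching frames."""
    return [f for f, objs in shot_descriptors.items()
            if any(matches(o, query) for o in objs)]

# Hypothetical qualitative descriptors produced by the analysis/interpretation stages.
shot = {
    41: [{"id": 2, "size": "small", "location": "bottom", "direction": "left"}],
    42: [{"id": 2, "size": "small", "location": "bottom", "direction": "left",
          "event": "occluded"}],
}
print(find_frames(shot, {"direction": "left", "event": "occluded"}))  # -> [42]
```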
Without real-time considerations, a content-based video representation approach could lose its applicability. Moreover, robustness to noise and coding artifacts is important if the solution is to be usable. The proposed system is designed to achieve a balance between effectiveness, solution quality, and computation time. The system shown in Figure 2 is described as follows. The video enhancement stage first classifies the noise and artifacts in the video and then uses a new method for noise estimation and another for spatial noise reduction (Chapter 2). The proposed noise estimation technique produces reliable estimates in images containing smooth and/or structured regions. It is a block-based method that takes the image structure into consideration and that uses a measure other than the variance to determine whether a block is homogeneous. It uses no thresholds and automates the procedure by which block-based methods average the block variances. The new spatial noise reduction technique uses a low-pass filter of reduced complexity to eliminate uncorrelated spatial noise. The basic idea is to use a set of high-pass filters to detect the most appropriate filtering direction. The proposed filter reduces the noise in the image while preserving its structure, and is adapted to the estimated amount of noise.
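Chapter 2 specifies the actual homogeneity measure and the automated averaging procedure. As a rough illustration of the block-based idea only, the sketch below estimates the noise standard deviation from the variance of the flattest blocks of a frame; the block size, the fraction of blocks kept, and the plain variance ranking are illustrative choices, not the thesis' method.

```python
import numpy as np

def estimate_noise_std(img, block=8, keep_fraction=0.1):
    """Crude block-based noise estimate: average variance of the flattest blocks.

    The thesis replaces the plain variance ranking used here with a dedicated
    homogeneity measure and a threshold-free averaging of block variances.
    """
    h, w = img.shape
    variances = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            variances.append(np.var(img[y:y + block, x:x + block].astype(float)))
    variances.sort()
    n_keep = max(1, int(len(variances) * keep_fraction))
    return float(np.sqrt(np.mean(variances[:n_keep])))

# Synthetic check: a flat image plus Gaussian noise of known standard deviation.
rng = np.random.default_rng(0)
frame = 128.0 + 5.0 * rng.standard_normal((240, 352))
print(estimate_noise_std(frame))  # prints a value close to 5
```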
The video analysis stage is mainly based on the extraction of meaningful objects and of quantitative low-level features from the video. The method proceeds in four steps (Chapters 3-6): object segmentation based on motion detection, object-based motion estimation, region merging, and object tracking based on a non-linear combination of spatio-temporal features. The proposed algorithm extracts the important video objects, which can be used as indices in object-based flexible video representations and analyzed to detect object-related events for semantic representation and interpretation. The proposed object segmentation method classifies the pixels of the video images as belonging to distinct objects based on motion and contour features (Chapter 4). It consists of simple procedures and is carried out in four steps: binarization of the input images based on motion detection; morphological boundary detection; contour analysis and skeletonization; and object labeling. The most critical task is the binarization, which must be reliable throughout the video sequence. The binarization algorithm memorizes previously detected motion to adapt the process. The boundary detection relies on new morphological operations whose computations are significantly reduced. The advantage of the morphological detection is the generation of continuous boundaries one pixel wide. The contour analysis transforms the boundaries into contours and eliminates undesired contours. Small contours, however, are eliminated only if they cannot be associated with previously extracted contours, that is, if a small contour has no corresponding contour in the preceding image. Small contours lying completely inside a large contour are merged with the latter according to homogeneity criteria.
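As a generic illustration of this segmentation chain (the memory-based binarization, adaptive threshold, and dedicated morphological operations of Chapter 4 are not reproduced here), the sketch below binarizes a frame by differencing against a background image, applies a threshold tied to the estimated noise level, cleans the mask with a standard morphological opening, and labels connected components. The library morphology and the numeric choices are stand-ins for the thesis' specialized steps.

```python
import numpy as np
from scipy import ndimage  # generic morphology/labeling, not the thesis' new operators

def segment_moving_objects(frame, background, sigma_n, min_area=50):
    """Background differencing -> threshold -> morphological cleanup -> labeling."""
    diff = np.abs(frame.astype(float) - background.astype(float))
    mask = diff > 3.0 * sigma_n                      # threshold tied to estimated noise
    mask = ndimage.binary_opening(mask, structure=np.ones((3, 3)))
    labels, count = ndimage.label(mask)
    objects = []
    for obj_id in range(1, count + 1):
        ys, xs = np.nonzero(labels == obj_id)
        if ys.size < min_area:                       # drop tiny regions (cf. contour analysis)
            continue
        objects.append({"id": obj_id,
                        "bbox": (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())),
                        "area": int(ys.size)})
    return objects
```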
The motion estimation determines the extent and the direction of the motion of each extracted object (Chapter 5). In the proposed approach, the information extracted from the object (for example, its size, minimum bounding box (MBB), position, and motion direction) is used in a rule-based process with three steps: object matching, MBB motion estimation based on the displacements of the MBB sides, and motion analysis and update. This makes the estimation process independent of the intensity signal and of the type of object motion.
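A toy version of the MBB-side idea follows (the full rule-based procedure, including the matching and update steps, is developed in Chapter 5): from the displacements of the four sides of a matched object's minimum bounding box, an approximate translation and scaling can be read off without touching the intensity signal. The decomposition below is an illustrative simplification.

```python
def mbb_motion(prev_box, curr_box):
    """Approximate object motion from MBB side displacements.

    Boxes are (left, top, right, bottom). Translation is taken from the average
    displacement of opposite sides; scaling from the change in width and height.
    """
    pl, pt, pr, pb = prev_box
    cl, ct, cr, cb = curr_box
    dx = ((cl - pl) + (cr - pr)) / 2.0     # horizontal translation
    dy = ((ct - pt) + (cb - pb)) / 2.0     # vertical translation
    sx = (cr - cl) / float(pr - pl)        # horizontal scaling
    sy = (cb - ct) / float(pb - pt)        # vertical scaling
    return dx, dy, sx, sy

# Example: the box moves right by about 8 pixels and grows by about 10 percent.
print(mbb_motion((100, 50, 140, 110), (106, 49, 150, 115)))
```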
The tracking method tracks and links the moving objects and records their temporal features. It transforms the segmented objects produced by the segmentation process into video objects (Chapter 6). The main problem of tracking systems is their reliability in the case of occlusion, shadowing, and object splitting. The proposed tracking method is based on a non-linear voting system to resolve the problem of multiple correspondences. The occlusion problem is alleviated by a simple detection procedure based on the estimated displacements of the object's MBB, followed by a median-based prediction procedure that provides a reasonable estimate for (partially or completely) occluded objects. Objects are tracked as soon as they enter the scene and also during occlusion, which is very important for activity analysis. Plausibility rules for coherence, error allowance, and control are proposed for effective tracking over long periods. An important contribution at the tracking level is the reliable region merging, which improves the performance of the entire video analysis system. The proposed algorithm has been developed for content-based video applications such as surveillance or indexing and retrieval.
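The actual voting rules, plausibility checks, and occlusion handling are specified in Chapter 6. Purely as an illustration of the idea of feature voting for object correspondence, the sketch below lets a few features vote for the best match; the feature set (area, center position, motion direction), the similarity thresholds, and the minimum vote count are arbitrary choices made for this example.

```python
import math

def feature_votes(tracked, candidate):
    """Each feature casts one vote if the two objects are similar in that feature."""
    votes = 0
    votes += abs(tracked["area"] - candidate["area"]) < 0.3 * tracked["area"]
    votes += math.hypot(tracked["cx"] - candidate["cx"],
                        tracked["cy"] - candidate["cy"]) < 30   # center distance (pixels)
    votes += tracked["direction"] == candidate["direction"]
    return votes

def match_objects(tracked_objects, candidates, min_votes=2):
    """Assign each tracked object to the candidate gathering the most votes."""
    assignments = {}
    for t in tracked_objects:
        best = max(candidates, key=lambda c: feature_votes(t, c), default=None)
        if best is not None and feature_votes(t, best) >= min_votes:
            assignments[t["id"]] = best
    return assignments
```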
The video interpretation stage is mainly concerned with the extraction of qualitative and semantic video features (Chapter 7). Its main objective is to provide representation tools through a combination of low-level features and high-level video data. This integration is essential to cope with the enormous generic visual content contained in a video sequence. The emphasis is therefore on achieving a simple, robust, and automatic generic event detection procedure. To identify events, a qualitative description of the object motion is an important step toward linking low-level features to the retrieval of high-level features. To this end, the motion behavior of the video objects is analyzed to represent events and important actions. This means that low-level features can be combined in ways that bear on high-level ones. First, qualitative descriptions of the objects' low-level features and of the relations between objects are derived. Then, automatic methods for event-based representation of high-level video content are proposed. The purpose of video is, in general, to document the events and activities of objects or sets of objects. Users generally look for video objects that convey a certain message [124], and they capture and retain in memory [72, 56]: 1) events ("what happened"), 2) objects ("who is in the scene"), 3) locations ("where it happens"), and 4) time ("when it happened"). Users are thus attracted to objects and their features and concentrate first on the high-level features related to motion. Consequently, the proposed video analysis is designed to: make decisions on lower-level data to support subsequent processing levels; represent objects qualitatively together with their spatial, temporal, and relational features; extract the semantic object features that are generally useful; and provide a response automatically and efficiently (real-time operation).

Figure 2: Diagram of the proposed framework for object- and event-based video representation (video enhancement: noise estimation and reduction, image stabilization, global feature extraction, background update, global-motion compensation; video analysis, from pixels to video objects: motion-based object segmentation, object-based motion estimation, voting-based object tracking; video interpretation, from video objects to events: spatio-temporal object descriptors, global descriptors, analysis and interpretation of low-level shot descriptors, event detection and classification; results and requests are exchanged with an object- and motion-based application, e.g., event-based decision making). Contributions are shown as gray blocks, interactions between modules are marked by dashed arrows, and σ_n is the standard deviation of the image noise.

IV. Results

In real-time video applications, fast, unsupervised, object-oriented video analysis is needed. Objective and subjective evaluations and comparisons show the robustness of the proposed video analysis method on noisy images as well as on images with illumination changes, while the complexity of the method remains low. The method uses few parameters, and these are automatically adjusted to the noise and to temporal changes in the video sequence (Figure 3 illustrates an example of video analysis for the Autoroute sequence). This thesis proposes an event-oriented video interpretation scheme. To detect events, perceptual descriptions of events common to a wide range of applications are proposed. The detected events include: {enter, appear, exit, disappear, move, stop, occludes/is occluded, removed/was removed, deposited/was deposited, abnormal motion}. To detect events, the proposed system monitors the behavior and the features of each object in the scene. If specific conditions are met, the events associated with those conditions are detected. The event analysis is done on-line, that is, events are detected as they happen. Specific features such as the motion or the size of an object are memorized for each image and compared with the following images of the sequence. Event detection is not based on the geometry of the objects but on their features and relations over time. The thesis proposes approximate but effective models for defining useful events. In various applications these approximate models, even if they are not precise, are adequate. Experiments using well-known video sequences have verified the effectiveness of the proposed approach (for example, Figure 4). The detected events are common enough for a wide range of video applications to support video surveillance and retrieval. For example: 1) the removal or deposit of objects at a surveillance site can be monitored and detected as soon as it happens, 2) the displacement of moving objects can be monitored and announced, and 3) the behavior of customers in stores or underpasses can be monitored. The entire system (video analysis and interpretation) requires on average between 0.12 and 0.35 seconds to process the data between two images. Typically, surveillance video is recorded at a rate of 3 to 15 frames per second. The proposed system produces a real-time response for surveillance applications at rates of 3 to 10 frames per second.
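To make the rule-based, on-line character of this event detection concrete, the sketch below checks two of the events listed above, enter/exit and a simplified deposit, from the per-frame object lists a tracker might produce. The conditions shown, and the descriptor fields (id, cx, cy, speed), are illustrative assumptions, not the definitions used in Chapter 7.

```python
def detect_events(prev_objects, curr_objects):
    """Compare tracked objects in consecutive frames and emit simple events."""
    prev_ids = {o["id"] for o in prev_objects}
    curr_ids = {o["id"] for o in curr_objects}
    events = []
    for oid in curr_ids - prev_ids:
        events.append(("enter", oid))
    for oid in prev_ids - curr_ids:
        events.append(("exit", oid))
    # Simplified 'deposit': a new, stationary object appears close to another object.
    for new in (o for o in curr_objects if o["id"] not in prev_ids):
        for other in curr_objects:
            if other["id"] != new["id"] and new["speed"] == 0 and \
               abs(other["cx"] - new["cx"]) + abs(other["cy"] - new["cy"]) < 40:
                events.append(("deposited", new["id"], "by", other["id"]))
    return events
```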
To increase the performance of the system for applications with higher frame rates, code optimization is necessary. The processing can be accelerated i) by optimizing the implementation of occlusion handling and object separation, ii) by optimizing the implementation of the change detection techniques, and iii) by working with integer values instead of floating-point values (where appropriate) and with additions instead of multiplications.

V. Conclusion

This study has contributed a new model for context-independent, object- and event-based video processing and representation. Object- and event-based video processing and representation are needed for automatic video database retrieval and for video surveillance. The model represents the video sequence in terms of objects and a rich set of generic events that can support content-based, user-oriented video applications. It allows effective and flexible analysis and interpretation of video sequences in real environments where occlusions, illumination changes, noise, and artifacts can occur. In the proposed model, processing is organized in levels, from low level to high level through an intermediate level. Each level is organized modularly and is responsible for a number of specific aspects of the analysis. The processing results of a lower level are integrated to support processing at the higher levels. The three processing levels are as follows. Video enhancement: a new structure-preserving, reduced-complexity method for spatial noise filtering has been developed. This filtering method is supported by a procedure that correctly estimates the noise in the image. The estimated noise is also used to support the subsequent video analysis (Chapter 2). Video analysis: a method for extracting meaningful video objects and their detailed features has been developed. It is based on reliable and computationally efficient segmentation and on object tracking. It is error-tolerant and can detect and correct errors. The system can provide a real-time response for surveillance applications at rates of 3 to 10 frames per second. The tracking method is effective and applicable to a wide class of video sequences. The effectiveness of the video analysis system has been demonstrated by several challenging experiments (Chapters 3-6). Video interpretation: a context-independent video interpretation system has been developed. It allows a video representation that is rich in terms of generic events and qualitative object features, making it usable for a wide range of applications.
Figure 3: Object trajectories in the Autoroute sequence: (a) the object trajectories in the image plane; (b) the trajectories in the horizontal direction; (c) the trajectories in the vertical direction. Figures 3(b) and (c) allow an interpretation of the objects' motion behavior: for example, O2 starts at the left of the image and moves to the image border. Various objects enter the scene repeatedly. Some objects move fast while others are slower. The system tracks all objects consistently.
Figure 4: Key-event images from the Hall sequence. This sequence is an example of an indoor surveillance application. Key event: O6 is deposited by object O1.
Qualitative object descriptors are extracted by quantizing the parametric descriptions of the objects. To extract events, changes in motion and in features are processed continuously. Events are detected when the conditions that define them are met. Experiments using well-known video sequences have demonstrated the effectiveness of the proposed technique (Chapter 7).

VI. Possible extensions

A number of issues should be considered to improve the performance of the proposed system and to broaden its fields of application. Execution time: the implementation can be optimized for faster execution. Object segmentation: in the context of MPEG video coding, motion vectors are available; an immediate extension of the proposed segmentation technique is to integrate the motion information from the MPEG stream to support object segmentation. The aim of this integration would be to improve the segmentation without a significant increase in computational cost. Motion estimation: the proposed motion model can be further improved to allow more precise estimation. A direct extension would be to examine the displacements of the diagonal extents of the object and to adapt the previously estimated motion for greater stability. Highlights and shadows: the system can benefit from the detection of shadows and the compensation of their effects. Image stabilization: image stabilization techniques can be used to allow the analysis of video data acquired with moving cameras and changing backgrounds. Video interpretation: a larger set of events can be considered to serve a larger set of applications. An interface can be designed to facilitate interaction between the system and the user; defining such an interface requires a study of the needs of the users of these video applications. A classification of moving objects versus clutter motion, such as the motion of trees in the wind, can be used to reject events. One possible classification is to differentiate between purposeful motion (that of a vehicle or a person) and purposeless motion.
Contents

Résumé
1 Introduction: Background and objective; Review of related work; Proposed approach and methodology; Contributions; Thesis outline
2 Video Enhancement: Motivation; Noise and artifacts in video signals; Modeling of image noise; Noise estimation (Review of related work; A homogeneity-oriented noise estimation; Evaluation and comparison; Summary); Spatial noise reduction (Review of related work; Fast structure-preserving noise reduction method; Adaptation to image content and noise; Results and conclusions; Summary)
3 Object-Oriented Video Analysis: Introduction; Fundamental issues; Related work; Overview of the proposed approach; Feature selection; Selection criteria; Feature descriptors; Summary and outlook
4 Object Segmentation: Motivation; Overall approach; Motion detection (Related work; A memory-based motion detection method; Results and comparison); Thresholding for motion detection (Introduction; Review of thresholding methods; Artifact-adaptive thresholding; Experimental results); Morphological operations (Introduction; Motivation for new operations; New morphological operations; Comparison and discussion; Morphological post-processing of binary images); Contour-based object labeling (Contour tracing; Object labeling); Evaluation of the segmentation method (Evaluation criteria; Evaluation and comparison); Summary
5 Object-Based Motion Estimation: Introduction; Review of methods and motivation; Modeling object motion; Motion estimation based on object-matching (Overall approach; Initial estimation; Motion analysis and update); Experimental results and discussion (Evaluation criteria; Evaluation and discussion); Summary
6 Voting-Based Object Tracking: Introduction; Review of tracking algorithms; Non-linear object tracking by feature voting (HVS-related considerations; Overall approach; Feature selection; Feature integration by voting; Feature monitoring and correction; Region merging; Feature filtering); Results and discussions; Summary and outlook
7 Video Interpretation: Introduction; Video representation strategies; Problem statement; Related work; Proposed framework; Object-based representation (Spatial features; Temporal features; Object-relation features); Event-based representation; Results and discussions (Event-based video summary; Key-image based video representation); Summary
8 Conclusion: Review of the thesis background; Summary of contributions; Possible extensions
Bibliography
A Applications: A.1 Video surveillance; A.2 Video databases; A.3 MPEG
B Test Sequences: B.1 Indoor sequences; B.2 Outdoor sequences
C Abbreviations
Chapter 1

Introduction

Video is becoming integrated into various personal and professional applications such as entertainment, education, tele-medicine, databases, security applications, and even low-bandwidth wireless applications. As the use of video becomes increasingly popular, automated and effective techniques to represent video based on its content, such as objects and semantic features, are important topics of research. Automated and effective content-based video representation is significant in dealing with the explosion of visual information through broadcast services, the Internet, and security-related applications. For example, it would significantly facilitate the use and reduce the costs of video retrieval and surveillance by humans. This thesis develops a framework for automated content-based video representation rich in terms of object and semantic features. To keep the framework generally applicable, objects and semantic features are extracted independently of the context of a video application. To test the reliability of the proposed framework, both indoor and outdoor real video environments are used.

1.1 Background and objective

Given the ever-increasing amount of video and the related storage, maintenance, and processing needs, developing automatic and effective techniques for content-based video representation is a problem of increasing importance. Such video representation aims at a significant reduction of the amount of video data by transforming a video shot of some hundreds or thousands of images into a small set of information based on its content. This data reduction has two advantages: large video databases can be efficiently searched based on video content, and memory usage is reduced significantly. Despite the many contributions in the field of video and image processing, the scientific community has debated their low impact on applications: video is subject to different interpretation by different observers, and video description can vary
according to observers and applications [141, 103, 78, 73]. Many video processing and representation techniques address problems by trying to develop a solution that is general for all video applications. Some focus on solving complex situations but assume a simple environment, for example, without object occlusion, noise, or artifacts. Video processing and representation research has mainly extracted video data in terms of pixels, blocks, or some global structure to represent video content. This is not sufficient for advanced video applications. In a surveillance application, for instance, object-related video representation is necessary to automatically detect and classify object behavior. With video databases, advanced retrieval must be based on high-level object features and semantic interpretation. Consequently, advanced content-based video representation has become a highly active field of research. Examples of this activity are the setting of multimedia standards such as MPEG-4 and MPEG-7, and various video surveillance and retrieval projects [129, 130].

Developing advanced content-based video representation requires the resolution of two key issues: defining what the interesting video contents are and what features are suitable to represent these contents. Properties of the human visual system (HVS) help in solving some aspects of these issues: when viewing a video, the HVS is, in general, attracted to moving objects and their features; it focuses first on the high-level object features (e.g., meaning) and then on the low-level features (e.g., shape). The main questions are: what level of object features and semantic content is most important and most common for content-oriented video applications? Are high-level intentional descriptions, such as what a person is thinking, needed? Is the context of the video data necessary to extract useful content? An important observation is that the subject of the majority of video is related to moving objects, in particular people, that perform activities and interact, creating object meaning such as events [105, 72, 56]. A second observation is that the HVS is able to search a video by quickly scanning ("flipping") it for activities and interesting events. In addition, to design widely applicable content-based video representations, the extraction of video content independently of the context of the video data is required. It can be concluded that objects and event-oriented semantic features are important and common for a wide range of video applications. To effectively represent video, three video processing levels are required: video enhancement to reduce noise and artifacts, video analysis to extract low-level video features, and video interpretation to describe content in semantic-related terms.
1.2 Review of related work

Recently, video systems supporting content-based video representations have been developed. (Pertinent literature and specific applications of the proposed methods and algorithms are reviewed in the respective sections of the main chapters of this thesis.) Most of these systems focus on video analysis techniques without integrating them into a functional video system. These techniques are interesting but often irrelevant in practice. Furthermore, many systems represent video either by low-level global features such as global motion or by key-images. Some video representation systems implement flipping of video content by skipping some images based on low-level features (e.g., using color-based key-image extraction) or extracting global features. However, this may miss important data. Video flipping based on object activities or related events, combined with low-level features such as shape, allows a more focused yet flexible video representation for retrieval or surveillance.

Few video representation systems are based on objects; most of these use only low-level features such as motion or shape to represent video. In addition, many object-based video representations focus on narrow domains (e.g., soccer games or traffic monitoring). Furthermore, some assume a simple environment, for example, without object occlusion, noise, or artifacts. Moreover, systems that address object-based video representations use a quantitative description of the video data and objects. Users of advanced video applications such as retrieval do not know exactly what the video they are searching for looks like. They do not have (do not memorize) exact quantitative information about the motion, shape, or texture. Therefore, user-friendly qualitative video representations for retrieval or surveillance are essential. In video retrieval, most existing tools ask the user to sketch the shot they are looking for, to select an example of a video shot (e.g., after browsing the database), or to specify quantitative features of the shot. Browsing in large databases can, however, be time-consuming, and sketching is a difficult task, especially for complex scenes (i.e., real-world scenes). Providing users with means to describe a video by qualitative descriptors is essential for the success of such applications.

There are few representation schemes concerning events occurring in video shots. Much of the work on event detection and classification focuses on how to express events using artificial intelligence techniques such as reasoning and inference methods. In addition, most high-level video representation techniques are context-dependent. They focus on the constraints of a narrow application and they lack, therefore, generality and flexibility.

Despite the large improvement of the quality of modern acquisition systems, noise is still a problem that complicates video processing algorithms. In addition, various coding artifacts are introduced in digitally transmitted video, for example, using the MPEG-2 video standard.
Therefore, noise and artifact reduction is still an important task and should be addressed. While noise reduction has been the subject of many publications (where few methods deal with real-time constraints), the impact of coding artifacts on the performance of video processing is not sufficiently studied.

Due to progress in micro-electronics, it is possible to include sophisticated video processing techniques in video services and devices. Still, the real-time aspect of new techniques is crucial for their wide application. Many video applications that need high-level video content representation occur in real-time environments, so that real-time performance is a critical requirement. Few of the content-based representation approaches take this constraint into account.

1.3 Proposed approach and methodology

The objective of this thesis is to develop a modular, automatic, low-complexity functional system for content-based video representation with integrated automated object and event extraction, without user interaction. The goal is to provide a stable representation of video content rich in terms of generic semantic features and moving objects. Objects are represented using quantitative and qualitative low-level features. The emphasis is on stable moving objects rather than on the accuracy of their boundaries. Generic semantic meaning is represented using events and other high-level object motion features, such as trajectory. The system should provide stable video representation for a broad range of practical video applications in indoor and outdoor real environments of different contexts. The proposed end-to-end system is oriented to three requirements: 1. flexible object representations that are easily searched for video summarizing, indexing, and manipulation; 2. reliable, stable processing of video that foregoes the need for precision; and 3. low computational cost. This thesis contributes algorithms that answer these three issues for the realization of content-based and consumer-oriented video applications such as surveillance and video database retrieval. It focuses on practical issues of video analysis oriented to the needs of object- and event-oriented video systems, i.e., it focuses on the so-called original core of the problem as defined in [141]. The proposed processing and representation target video of real environments such as those with object occlusions, illumination changes, noise, or artifacts.

To achieve these requirements, the proposed system involves three processing modules (Fig. 1.1): enhancement, analysis, and interpretation. The input to the video enhancement module is the original video and its output is an enhanced version of it. This enhanced video is then processed by the video analysis module, which outputs low-level descriptions of the enhanced video. The video interpretation module takes these low-level descriptions and produces high-level descriptions of the original video.
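A bare skeleton of this three-module organization might look as follows; the function names, their placeholder bodies, and the per-shot granularity are assumptions made purely for illustration and are not the interfaces of the proposed system.

```python
import numpy as np

def enhance(frames):
    """Placeholder for Chapter 2: estimate the noise level and filter each frame."""
    sigma_n = 5.0                                  # would come from the noise estimator
    return sigma_n, frames                         # filtering omitted in this sketch

def analyze(frames, sigma_n):
    """Placeholder for Chapters 3-6: segmentation, motion estimation, tracking."""
    return {n: [] for n in range(len(frames))}     # per-frame lists of object descriptors

def interpret(objects_per_frame):
    """Placeholder for Chapter 7: qualitative descriptors and detected events."""
    return []

def process_shot(frames):
    """Enhancement -> analysis -> interpretation, mirroring Fig. 1.1."""
    sigma_n, enhanced = enhance(frames)
    objects = analyze(enhanced, sigma_n)
    events = interpret(objects)
    return objects, events

shot = [np.zeros((240, 352)) for _ in range(10)]   # dummy 10-frame shot
print(process_shot(shot))
```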
Figure 1.1: Abstract diagram of the proposed system (video shot → video enhancement → enhanced video and σ_n → object-oriented video analysis → objects and features → event- and object-oriented video interpretation → event and object descriptors). σ_n is the estimated standard deviation of the input image noise.

The proposed system can be viewed as a framework of methods and algorithms to build automatic dynamic scene interpretation and representation. Such interpretation and representation can be used in various video applications. Besides applications such as video surveillance and retrieval, outputs of the proposed framework can be used in a video understanding or a symbolic reasoning system. The proposed system is designed for applications where an interpretation of the input video is needed ("what is this sequence about?"). This can be illustrated by two examples: video surveillance and retrieval. In a video surveillance system, an alarm can be activated in case the proposed system detects a particular behavior of some objects. In a video retrieval system, users can query a video by qualitative description, using information such as object features (e.g., shape), spatial relationships (e.g., object i is close to object j), location (e.g., object i is at the bottom of the image), and semantic or high-level features (e.g., object action: object i moves left and then is occluded; event: removal or deposit of objects, or object j stops and changes direction). The retrieval system can then find the video frames whose contents best match the qualitative description. An advantage of such a representation strategy is that it allows the construction of user-friendly queries, based on the observation that the interpretation of most people is often imprecise. When viewing a video, they mainly memorize objects and related semantic features: for example, who is in the scene, what is he/she doing, and where does the action take place? People do not usually memorize quantitative object features [72, 56]. In the absence of a specific application, such a generic model allows scalability (e.g., by introducing new definitions of object actions or events).

The proposed system is designed to balance demands for effectiveness (solution quality) and efficiency (computational cost). Without real-time consideration, a content-based video representation approach could lose its applicability. Furthermore, robustness to image noise and coding artifacts is important for successful use of the proposed solution. These goals are achieved by adaptation to noise and artifacts, by detection and correction or compensation of estimation errors at the various processing levels, and by dividing the processing system into simple but effective tasks so that complex operations are avoided. In Fig. 1.2, a block diagram of the proposed system is displayed, where contributions are underlaid with gray boxes, module interactions are marked by dashed arrowed lines, R(n) represents the background image
of the video shot, and σ_n is the noise standard deviation. The system modules are described below.

Figure 1.2: The proposed framework for object- and event-based video representation (video enhancement: noise estimation and reduction, image stabilization, global feature extraction, background update, global-motion compensation; video analysis, pixels to video objects: motion-based object segmentation, object-based motion estimation, voting-based object tracking; video interpretation, video objects to events: spatio-temporal object descriptors, global shot descriptors, analysis and interpretation of low-level descriptors, event detection and classification; results and requests are exchanged with an object- and event-based application, e.g., event-based decision-making).

The video enhancement module is based on new methods to estimate the image noise and to reduce it in order to facilitate subsequent processing. Image stabilization is the process of removing unwanted image changes. There are global changes due to camera motion and jitter [53], and local changes due to unwanted object motion (e.g., motion of background objects). Image stabilization facilitates object-oriented video analysis by removing irrelevant changes. It can be performed by global motion compensation or by object update techniques. Global motion can be the result of camera motion or illumination change; the latter can produce apparent motion. Robust estimation techniques aim at estimating accurate motion from an image sequence. Basic camera motions are pan (right/left motion), zoom (focal length change), and tilt (up/down motion). Different parametric motion models can be used to estimate global motion [54, 25, 68, 144]. In practice, as a compromise between complexity and flexibility, 2-D affine motion models are used [54]. Global motion compensation stabilizes the image content by removing camera motion while preserving object motion. Several studies show the effectiveness of using global motion compensation in the context of motion-based segmentation [85, 6, 144, 54, 25, 68]. Also, background update is needed in object segmentation that uses image differencing based on a background image (cf. Section 4.3). In such object segmentation, the background image needs to be updated, for example, when background objects move or when objects are added to or subtracted from the background image. Various studies have addressed background update and shown its usefulness for segmentation [35, 70, 65, 36, 50].
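Background update can take many forms. A common and very simple one, used here purely as an illustration and not as the method adopted in Section 4.3, is a running average that is frozen wherever a pixel currently belongs to a detected moving object.

```python
import numpy as np

def update_background(background, frame, foreground_mask, alpha=0.05):
    """Running-average background update, skipped at detected object pixels.

    background, frame: float arrays of identical shape.
    foreground_mask: boolean array, True where a moving object was detected.
    alpha: adaptation rate; small values change the background slowly.
    """
    updated = (1.0 - alpha) * background + alpha * frame
    return np.where(foreground_mask, background, updated)
```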
The video analysis module extracts video objects and their low-level quantitative features. The method consists of four steps: motion-detection-based object segmentation, object-based motion estimation, region merging, and object tracking based on a non-linear combination of spatio-temporal features. The object segmentation classifies pixels of the video images into objects based on motion and contour features. To focus on meaningful objects, the proposed object segmentation uses a background image, which can be extracted using a background update method. The motion estimation determines the magnitude and direction of the motion, both translational and non-translational, of each extracted object. The tracking method tracks and links objects as they move and registers their temporal features. It transforms the segmented image objects of the object segmentation module into video-wide objects. The main issue in tracking systems is reliability in case of occlusion and object segmentation errors; the proposed method focuses on solutions to these problems. Representations of object and global video features, to be used in a low-level content-based video representation, are the output of the video analysis method.

The video interpretation module extracts semantic-related and qualitative video features. This is done by combining low-level features and high-level video data. Semantic content is detected by integrating analysis and interpretation of video content. Semantic content is represented by generic events, independently of the context of an application. To identify events, a qualitative description of the object motion is an important step towards linking low-level features to high-level feature retrieval. For this purpose, the motion behavior and low-level features of video objects are analyzed to represent important events and actions. The results of this processing step are qualitative descriptions of object features and high-level descriptions of video content based on events.
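To illustrate what a qualitative description of object motion can mean in practice (the actual quantization rules are part of Chapter 7), the sketch below maps a tracked object's per-frame displacement onto coarse labels; the label set and the threshold are assumptions made for this example.

```python
def qualitative_motion(dx, dy, still_thresh=1.0):
    """Map a displacement vector (pixels/frame) to a coarse qualitative label."""
    if abs(dx) < still_thresh and abs(dy) < still_thresh:
        return "stopped"
    horiz = "right" if dx > 0 else "left"
    vert = "down" if dy > 0 else "up"
    return horiz if abs(dx) >= abs(dy) else vert

# Example: per-frame displacements accumulated by the tracker for one object.
trajectory = [(-4, 0), (-3, 1), (0, 0), (2, 5)]
print([qualitative_motion(dx, dy) for dx, dy in trajectory])
# -> ['left', 'left', 'stopped', 'down']
```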
34 8 Introduction subsequent steps that in turn correct or support previous steps. For example, object tracking is supported by low-level segmentation. Results of the tracking are in turn integrated into segmentation to support it. This approach is by analogy to the way the HVS finds objects where partial detection and recognition introduces a new context which in turn supports further recognition ([103], Fig. 3.2, [3]). The robustness of the proposed methods will be demonstrated by extensive experimentation on commonly referenced video sequences. Robustness is the result of adaptation to video noise and artifacts and due to processing that accounts for errors at one step by correction or compensation at the subsequent steps where higher level information is available. The proposed system provides a response in real-time for surveillance applications with a rate of up to 10 frames per second on a multitasking SUN UltraSPARC 360 MHz without specialized hardware. 1.4 Contributions Because of the extensive growth in technical publications in the field of video processing, it is difficult to possess a comprehensive overview of the field and of published methods. The following list states which parts of this thesis are original to the knowledge of the author. A new approach to estimate white noise in an image is proposed. The novelty here is twofold: the first introduces a new homogeneity measure to detect intensity-homogeneous blocks; the second is a new way to automate averaging of estimated noise variances of various blocks. A new enhanced filter for computationally efficient spatial noise reduction in video signals is proposed. The filter is based on the concept of implicitly finding homogeneous image structure to adapt filtering. The novelty is twofold: effective detection of image structure to adapt the filter [74]; effective adaptation of the filter parameters such as window size and weights to the estimated noise variance. A new object segmentation method based on motion and contour data using a memory-based motion detection method, a fast noise-adaptive thresholding method for motion detection, a set of new morphological binary operations, and a robust contour tracing technique. A new efficient object-based motion estimation designed for video applications such as video surveillance. The method aims at a meaningful representation of object motion towards high-level interpretation of video content. The main contribution is the approximate estimation of object scaling, and acceleration.
35 A new object tracking method that solves the correspondence problem effectively for a broad class of image sequences and controls the quality of the segmentation and motion estimation techniques: Voting system: the correspondence or object matching problem is solved based on a voting system by combining feature descriptions non-linearly. Multiple object: the new tracking contributes a solution to tracking objects in the case of multi-object occlusion. The method can simultaneously track various objects, as soon as they enter the field of the camera. Error detection and correction: the proposed tracking process is faulttolerant: it takes into account possible errors from the object segmentation methods and compensates for their effects. Region merging: the proposed tracking process contributes a reliable region merging technique based on geometrical relationships, temporal coherence, and matching of objects rather than on single local features. A new context-independent video interpretation technique which provides a high-level video representation rich in terms of generic events and qualitative object features. This representation is well-chosen because it represents a good compromise between containing too many special operators and being a too small set of generic operators. The interpretation technique consists of an object-and motion based event detection and classification method, a key-image extraction method based on events, and qualitative descriptions of video objects, their features, and their related generic events. In addition, this thesis addresses the targeted applications of the proposed system and contributes definitions of frameworks for advanced event-based video surveillance and retrieval Thesis outline This thesis demonstrates the performance of the proposed three video processing and representation levels at the respective section of each chapter (Video enhancement in Chapter 2, video analysis in Chapters 3-6, and video interpretation in Chapter 7). Pertinent literature and specific applications of the proposed methods are also reviewed at the respective sections. Furthermore, each proposed algorithm is summarized at the respective section. Chapter 2 first classifies noise and artifacts in video into various categories and then uses a novel method for noise estimation and a novel spatial noise reduction method.
36 10 Introduction In Chapter 3, after reviewing related techniques, the proposed approach for video analysis is described. Then possible implementations of an image stabilization algorithm are discussed. Then representations of object and global video features to be used in a low-level content-based video representation are proposed. The steps of the video analysis are proposed in details in Chapter 4 (object segmentation), in Chapter 5 (object-based motion estimation), and in Chapter 6 (object tracking). In Chapter 7, qualitative descriptions of low-level object features and of object relationships are first derived. Then automatic methods for high-level content interpretation based on motion and events are proposed. At the end of the thesis, Chapter 8 reviews the background, goal, and achievements of the thesis. It furthermore summarizes key results and mentions possible extensions. In Appendix A, Section A.1 defines an object-and-event based surveillance system and Section A.2 designs a content-based retrieval system. Section A.3 describes relations to MPEG-7 activities. A detailed description of the test sequences used is given in Appendix B.
37 Chapter 2 Video Enhancement 2.1 Motivation Image enhancement is a fundamental task in various imaging systems such as cameras, broadcast systems, TV and HDTV-receivers, and other multimedia systems [120]. Many enhancement methods have been proposed, which range from sharpness improvement to more complex operations such as temporal image interpolation 1. Each of these methods is important with respect to the imaging system it is used for. For example, noise reduction is usually used in various imaging systems as an enhancement technique. Noise reduction is often a preprocessing step in a video system and it is important that its computational cost stays low while its performance is reliable. For example, preserving image content such as edges, textures, or moving areas to which the HVS is sensitive is an important performance feature. Because of the significant improvement in the quality of modern analogue video acquisition and receiver systems, studies show that TV viewers are more critical even to low noise [97, 107]. In digital cameras, the image noise may increase because of the higher sensitivity of the new CCD cameras and the longer exposure [88]. Noise reduction is, therefore, still a fundamental and important task in image and video processing. It is an attractive feature, especially under sub-optimal reception conditions. This calls for effective noise reduction techniques. The real-time aspect of new techniques will be a very attractive property for consumer devices such as digital cameras and TV-receivers. The focus of a noise reduction technique is not to remove noise completely (which is difficult or impossible to achieve) but to reduce the influence of noise to become almost imperceptible to the HVS. Noise occurs both in analogue and digital devices, such as cameras. Noise is always present in an image. When not visible, it is only masked by the HVS. Its visibility can increase with the camera sensor sensitivity or under low lighting conditions and 1 For a broad overview of video enhancement methods see [120].
especially in images taken at night. Noise can be introduced into an image in many ways (Section 2.2) and can significantly affect the quality of image and video analysis algorithms. Noise reduction techniques attempt to recover an underlying true image from a degraded (noisy) copy. Accordingly, in a noise reduction process, assumptions are made about the actual structure of the true image. The reduction of noise can be performed by both linear and nonlinear operators (Fig. 2.1) which use correlations within and between images. Spatio-temporal noise reduction algorithms have also been devised to exploit both temporal and spatial signal correlations (see Fig. 2.1 and [13, 47, 46, 74]). Spatial methods, which are computationally less costly than temporal methods, are widely used in various video applications. Temporal methods, which are more demanding computationally and require more memory, are mainly used in TV receivers. Temporal noise reduction algorithms that use motion have the disadvantage that little, if any, noise reduction is performed in strongly moving areas. To compensate for this drawback, spatial noise filters can be used [74]. Several noise reduction methods make, implicitly or not, assumptions regarding image formation and image noise which are associated with a particular algorithm, and thus usually perform best for a particular class of images. For example, some methods assume low noise levels or low image variations and fail in the presence of high noise levels or textured image segments.

Figure 2.1: A classification of noise reduction methods into temporal, spatial, and hybrid approaches (temporal: static, motion-adaptive, motion-vector-based, median; spatial: linear, content-adaptive, median, morphological; hybrid: cascade-connected, combined).

Noise reduction techniques make various assumptions, depending on the type of images and the goals of the reduction. Considering that a noise reduction method is a preprocessing step, an effective practical noise reduction method should take the following observations into account (cf. also [87]): it should incorporate few assumptions about the distribution of the corrupting
39 noise. For example, in case of iterative noise reduction, the characteristics of the noise might be modified and any assumption can lose its significance, it should be parallel at the pixel level meaning that the value of the processed pixel is computed in a small window centered on it. Iterations of the same procedure extend, however, the region of influence beyond the small window, it should preserve significant discontinuities. This can be done by adapting the algorithms to conform to image discontinuities and noise, it should take behavior and reaction of the HVS into account. Each element, for example, piece-wise constant gray level areas or textures, of a visual field (an image) has a different influence on human visual perception. The behavior of the HVS is not well understood. Nevertheless, the following list describes some of the known properties of the HVS that can be of interest in designing a noise reduction technique: the HVS adapts to environmental conditions, in high-frequency structures, the HVS is less sensitive to artifacts than in low-frequency structures, the HVS is very sensitive to processing errors at image discontinuities such as object contours, the HVS is not as sensitive to the diagonal orientation as to the horizontal and vertical orientations (oblique effect). Structures in real images are more horizontally and vertically oriented, i.e., the spectra of real image contents are mainly concentrated along the horizontal and vertical frequency axis, and motion of objects can mask some artifacts in an image sequence. Using some of these observations, this thesis develops a noise estimation method (Section 2.4) to support a noise reduction algorithm, and then contributes a novel spatial noise reduction method (Section 2.5) that is fast and adaptive to image content. In [13], a temporal noise filter is proposed that adapts the reduction process to the high and low frequencies of the image content. The methods proposed in this thesis make no assumption on the underlying image model except that the image noise is white Gaussian noise, which is most common in images. In the next Section, a classification of noise and artifacts in video is given. The proposed noise estimation and reduction are described in the following sections Noise and artifacts in video signals An image can be corrupted by noise and artifacts due to image acquisition, recording, processing and transmission (Table 2.1). Other image artifacts can be due to
40 14 Noise and artifacts reflections, blinking lights, shadows, or natural image clutter. Acquisition noise may be generated by signal pick-up in the camera or by film grain especially under bad lightning conditions. Here, different types of noise are added due to the amplifiers and other physical effects in the camera. Furthermore, noise can be added to the signal by transmission over analogue channels, e.g., satellite or terrestrial broadcasting. Further noise is added by image recording devices. In these devices Gaussian noise or, in the case of tape drop-outs, impulse noise is added to the signal. Digital transmission inserts other distortions which also may have a noisy characteristic. Blocking are block structures which become visible in an MPEG-2 image due to the block-based and motion-compensated MPEG-coding scheme. These block structures appear in an image sequence as a Dirty Window. The boundaries of blocking and dirty windows are small details which are located in the high frequencies. The quantization of the DCT-coefficient in MPEG-2 coding causes overshoot on object contours, which is called Ringing. The Mosquito effect on high-frequency image parts is caused by different quantization results in successive images and by faulty block-based motion estimation at object contours. As mentioned earlier, the HVS is very sensitive to abrupt changes and details which are located in the high frequencies, and artifact reduction is needed in modern video applications to reduce the effect of these artifacts. Beside input artifacts in an image, intermediate artifacts and errors (e.g., false motion data or edges) and end result error (e.g., reliability) in video and image analysis are unavoidable and can have a large impact on the end results of analysis. Since corrupted image data may significantly deviate from the assumed image analysis model, the effectiveness of successive image analysis steps can be significantly reduced. This calls for modeling of these artifacts and errors. In many video applications, however, exact modeling of errors is not necessary. Analogue channel, recording, film grain, and CCD-camera noise can be modeled as white Gaussian noise, which is usually of low amplitude and can be reduced by linear operations. High amplitude noise like impulse noise can be generated, for example, through a satellite transmission. It requires different approaches, using adaptive median operators. Because of the various kinds of noise and artifacts, it is important, in a TV receiver or imaging system as in surveillance, to perform a reduction of these noise and artifacts. Attention has to be paid to use methods that do not introduce artifacts on the enhanced images. For example, non-adaptive low-pass filtering would reduce high frequency noise and artifacts but may deteriorate object edges.
Artifact origin | Artifact type | Reduction method
Sampling: camera (CCD), film grain | white noise | temporal, spatial
Recording: video tape noise | white noise | temporal, spatial
Recording: tape drop-out | impulse noise | median
Disturbed image material: film damage | pattern noise | median, edge-based
Disturbed image material: bit error | impulse noise | median
Analogue transmission: cable, terrestrial | white noise | temporal, spatial
Analogue transmission: satellite | FM-noise; pattern ("Fish") noise | temporal, spatial; median, edge-based
Digital coding (MPEG) | Blocking, Dirty Window, Ringing, Mosquito effects | spatial, object-based, temporal
Digital transmission | bit error, image dropout | bit-error protection; block- and object-based error concealment
Processing artifacts | false motion data or edges |

Table 2.1: Noise and artifacts in images.

2.3 Modeling of image noise

The noise signal can be modeled as a stochastic signal which is additive or multiplicative to an image signal. Furthermore, it can be modeled as signal-dependent or signal-independent. Quantization and CCD noise are modeled as additive and signal-independent. Image noise can have different spectral properties; it can be white or colored. Most commonly, the noise signal in images is assumed to be independent identically distributed (iid), additive, and stationary zero-mean noise (i.e., white Gaussian noise):

I(n) = S(n) + η(n)    (2.1)

where S(n) is the original (true) image signal at time instant n, I(n) is the observed noisy image signal at the same time instant, and η(n) is the noise signal. In practice, I(n) and S(n) are defined on an X × Y lattice, and each pixel I(i, j, n) (row i, column j) is an integer value between 0 and 255. The proposed noise estimation and noise reduction methods operate under these assumptions. The main difficulty when estimating or reducing noise arises in images that contain fine structures or textures.
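To make the degradation model of Eq. 2.1 concrete, the following minimal sketch (Python/NumPy; the thesis implementation itself was written in C, and the function name and the flat test image below are illustrative assumptions) corrupts a clean image with additive, signal-independent white Gaussian noise and clips the result to the 8-bit range.

```python
import numpy as np

def add_white_gaussian_noise(S, sigma_n, seed=None):
    """Corrupt a clean image S (uint8, values 0..255) with additive,
    signal-independent white Gaussian noise of standard deviation sigma_n,
    following I(n) = S(n) + eta(n) in Eq. 2.1 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    eta = rng.normal(0.0, sigma_n, size=S.shape)   # iid zero-mean Gaussian noise
    I = S.astype(np.float64) + eta                 # additive degradation model
    # The limited intensity range [0, 255] clips large excursions, so the noise
    # is no longer exactly zero-mean near black and white.
    return np.clip(I, 0, 255).astype(np.uint8)

# Example: a flat gray test image lets the empirical standard deviation of
# I - S be compared directly against the sigma_n that was added.
S = np.full((256, 256), 128, dtype=np.uint8)
I = add_white_gaussian_noise(S, sigma_n=10.0, seed=0)
print(float((I.astype(np.float64) - S).std()))     # close to 10
```

The clipping to [0, 255] in this sketch is what produces the saturation effect discussed later in the evaluation: near reference black and white the added noise is no longer exactly zero-mean.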
To evaluate the quality of a video enhancement technique, different numerical measures such as the Signal-to-Noise Ratio (SNR, defined in [108]) or the Mean Squared Error (MSE) can be used. These measures compare an enhanced (e.g., noise-reduced) image with an original image. The basic idea is to compute a single number that reflects the quality of the enhanced image. Enhanced images with higher SNR are assumed to have a better subjective quality. SNR measures do not, however, necessarily reflect human subjective perception. Several research groups are working on perceptual measures, but no standard measures are known, and signal-to-noise measures are widely used because they reflect image improvements and are easier to compute. A better measure that is less dependent on the input signal is the Peak Signal-to-Noise Ratio (PSNR), as defined in Eq. 2.2. The PSNR is a standard criterion for objective noise measurement in video systems. Here, the image size is X × Y, and I_p(i, j, n) and I_r(i, j, n) denote the pixel amplitudes of the processed and reference image, respectively, at position (i, j):

PSNR = 10 log₁₀ ( 255² / [ (1/(XY)) Σ_{i=1}^{X} Σ_{j=1}^{Y} (I_p[i, j, n] − I_r[i, j, n])² ] ).    (2.2)

Typical PSNRs of TV video signals range between 20 and 40 dB. They are usually reported to two decimal points (e.g., 36.61). A threshold of 0.5 dB PSNR can be used to decide whether a method delivers an image improvement that would be visible (the MPEG committee used this informal threshold). The PSNR indicates the signal-to-noise improvement in dB, but unweighted with respect to visual perception. PSNR measurements should therefore be given together with subjective image comparisons. The use of either SNR or PSNR as a measure of image quality is certainly not ideal, since it generally does not correlate well with perceived image quality. Nevertheless, they are commonly used in the evaluation of filtering and compression techniques, and do provide some measure of relative performance. The PSNR quality measure will be used throughout this thesis.

2.4 Noise estimation

Review of related work

The effectiveness of video processing methods can be significantly reduced in the presence of noise. For example, the performance of compression techniques can decrease due to noise in the image. Intensity variation due to noise may introduce motion [66]
43 estimation errors. Furthermore, the detection of high frequency image content such as edges can be significantly disturbed. When information about the noise becomes available, processing can be adapted to the amount of noise to provide stable processing methods. For instance, edge detection [31], image segmentation [143, 112], motion estimation [91], and smoothing [13, 47, 87, 74] can be significantly improved when the noise variance can be estimated. In current TV receivers the noise is typically estimated in the black lines of the TV signal [47]. In other applications, the noise estimate is provided by the user and a few methods have been proposed for automated robust noise estimation. Noise can be estimated within an image (intra-image estimation) or between two or more successive images (inter-image estimation) of an image sequence. Interimage estimation techniques require more memory (to store one or more images) and are, in general, more computationally demanding [52]. Intra-image noise estimation methods can be classified as smoothing-based or block-based. In the smoothing-based methods the image is first smoothed, for example, using an averaging filter and then the difference of the noisy and enhanced image is assumed to be the noise; noise is then estimated at each pixel where the gradient is larger than a given threshold. These methods have difficulties in images with fine texture and they tend to overestimate the noise variance. In the block-based method, the variance over a set of blocks of the image is calculated and the average of the smallest variances is taken as an estimate. Different implementations of block-based methods exist. In general, they tend to overestimate the noise variance in good quality images and underestimate it in highly noisy images. In some cases, no estimate is even possible [98, 52]. The blockbased method in its basic form is less complex and is several times faster than the smoothing-based method [98, 52]. The main difficulty with the block-based methods is that their estimate may vary significantly depending on the input image and noise level. In [98], an evaluation of noise estimation methods is given. There, the averaging methods were found to perform well with high-noise levels. No techniques were found to perform best for various noise levels and input images. Some noise estimation methods determine the noise variance within the larger context of an image processing system. Such techniques are then adapted to specific needs of the imaging systems (e.g., in the context of coding [77], TV signal processing [47] and image segmentation [126]). Many noise estimation methods have difficulties estimating noise in highly noisy images and in highly textured images [98]. Such a lack of accuracy can be a problem for noise-adaptive image processing methods. Some methods use thresholds, for example, to decide whether an edge is given at a particular image position [98]. The purpose of this section is to introduce a fast noise estimation technique which gives reliable estimates in images with smooth and textured areas. This technique is a 17
block-based method that takes image structure into account and uses a measure other than the variance to determine if a block is homogeneous. It uses no thresholds and automates the way that block-based methods stop the averaging of block variances. The method selects intensity-homogeneous blocks in an image by rejecting blocks with line structure, using newly proposed masks to detect lines.

A homogeneity-oriented noise estimation

The method proposed in this section estimates the noise variance σ_n² from the variances of a set of regions classified as the most intensity-homogeneous regions in the image I(n), i.e., regions showing the lowest variation in structure. The method uses a new homogeneity measure ξ_{B_h} to determine if an image region has uniform intensities, where uniformity is equated to piece-wise constant gray-level pixels. This novel noise estimation operates on the input image data: 1) without any prior knowledge of the image or noise, 2) without context, i.e., it is designed to work for different image processing domains, and 3) without thresholds or user interactions. The only underlying assumption is that in an image there exist neighborhoods (usually chosen as a 2-dimensional (2-D) rectangular window or W×W block) with smooth intensities (i.e., with the proposed homogeneity measure ξ_{B_h} ≈ 0). This assumption is realistic since real-world images have well-defined regions of distinct properties, one of which is smoothness. The proposed noise estimation operates as follows:

- Detection of intensity-homogeneous blocks: the pixels in an intensity-homogeneous block B_h = {I(i, j)}_{(i,j)∈W_ij} are assumed to be independent identically distributed (iid) but not necessarily zero-mean. W_ij denotes the rectangular window of size W×W. These uniform samples {I(i, j)} of the image have variance σ²_{B_h}, which is assumed to represent the variance of the noise. The signal in a homogeneous block is approximately constant and the variation is due to noise. With the iid property, their empirical mean and variance are defined as

  µ_h = (1/(W·W)) Σ_{(i,j)∈W_ij} I(i, j),    σ_h² = (1/(W·W)) Σ_{(i,j)∈W_ij} (I(i, j) − µ_h)².    (2.3)

  With l = W·W and by the law of large numbers,

  lim_{l→∞} σ²_{B_h} = σ_n².    (2.4)

- Averaging: to estimate the image global noise variance σ_n², the local variances of the m most homogeneous blocks {B_h} are averaged to σ_n² = µ_{σ²_{B_h}} = (1/m) Σ_{h=1}^{m} σ²_{B_h}. Since the noise is assumed to be stationary, the average of the variances of the m most homogeneous regions can be taken as representative of the noise in the whole image. To achieve faster noise variance estimation, the homogeneity measure ξ_{B_h} is
calculated for a subset of the image pixels by skipping each s-th pixel of an image row. Simulations are carried out using different skipping steps; they show that a good compromise between efficiency (computational cost) and effectiveness (solution quality) is obtained with s = 5.

- Adaptive averaging: since the most homogeneous blocks could show strongly variable homogeneities, and hence highly variable variances, only blocks whose variances σ²_{B_h} are similar to a reference representative variance σ²_{B_r} are included in the averaging process. This stabilizes the averaging process and adapts the number of blocks to the structure of the image. Therefore, no threshold is needed to stop the averaging process. To decide whether the reference and a current variance are similar, a threshold t_σ is used, i.e., σ²_{B_h} is similar to σ²_{B_r} if |σ²_{B_r} − σ²_{B_h}| < t_σ. This threshold t_σ is relatively easy to define and does not depend on the input image content. It can be seen as the maximal affordable difference (i.e., error) between the true variance and the estimated variance. For example, in noise reduction in TV receivers a t_σ between 3 and 5 is common [13, 46]. In the simulations of this study, t_σ is set to 3.

Detection of homogeneous blocks

The image is first divided into blocks {B_h} of size W×W. In each block B_h a homogeneity measure ξ_{B_h} is then computed using a local image analyzer based on high-pass operators that are able to measure homogeneity in eight different directions, as shown in Fig. 2.2; special masks for corners are also considered, which stabilize the homogeneity estimation. In this local uniformity analyzer, high-pass operators with coefficients {−1, …, −1, (W−1), −1, …, −1} (e.g., if W = 3 the coefficients are {−1, 2, −1}) are applied along all directions for each pixel of the image. If the image intensities are uniform in one direction, then the result of the high-pass operator is close to 0. To calculate the homogeneity measure for all eight directions, all eight quantities are added, and this sum provides a measure ξ_{B_h} for homogeneity. In this thesis, these masks are also proposed to adapt spatial noise reduction, as will be discussed in Section 2.5 (see also [74]). The operation that determines the homogeneity measure can be expressed as a second derivative of the image function I. The following example illustrates this in the horizontal direction:

I_o(i) = −I(i−Δi) + 2 I(i) − I(i+Δi) = −( I′(i) − I′(i−Δi) ),  with I′(i) = I(i+Δi) − I(i).    (2.5)

Therefore, I_o(i) is a second-order finite-difference operator which acts as a high-pass operator. Note that the detection of homogeneity is done along edges and never across edges. Various simulations (Section 2.4.3) show that this proposed homogeneity
measure performs better than one using the variance to decide whether a block has uniform intensities. A variance-based homogeneity measure fails in the presence of fine structures and textures (see Fig. 2.6(c) and 2.6(i)).

Figure 2.2: Directions of the local intensity homogeneity analyzer (eight directional masks, 1-8, around the current pixel).

Defining a reference variance σ²_{B_r}

To stabilize the averaging process, the reference variance is chosen as the median of the variances of the first three most homogeneous blocks (i.e., the blocks with the smallest sum). The first three values are taken because they are most representative of the noise variance, since they are calculated from the three most homogeneous blocks. Higher-order median operators can also be used. Instead of the median, the mean can be used to reduce computation. Simulations show that better estimation is achieved using the 3-tap (i.e., of order 3) median operator. In some cases the difference between the first three variances can be large, and the median then still results in a good estimate of the true reference variance. Further investigation can determine the best order of the median filter, or examine whether there are cases where the mean operator would give better results.

Evaluation and comparison

The new estimator has been tested using images commonly used in the image processing literature. Eight of them are represented in Fig. 2.6 and the ninth image is one with a constant gray value of 128. White additive noise is the most common form of noise in images and has been used in the tests. Typical PSNR values in real-world images range between 20 and 40 dB. To test the reliability of the proposed method, noise giving a PSNR between 20 and 50 dB is added to the nine images. Noise is also estimated in the noiseless case, i.e., in the reference image.
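The estimation procedure of Section 2.4.2 can be summarized in the following sketch (Python/NumPy, illustrative only): homogeneity is approximated here with horizontal and vertical second-difference responses, whereas the thesis uses eight directional masks plus corner masks and a pixel-skipping speed-up; the function names and this simplified homogeneity measure are assumptions.

```python
import numpy as np

def block_homogeneity(B):
    """Sum of absolute second-difference (high-pass) responses inside the block.
    The thesis uses eight directional masks plus corner masks; this sketch keeps
    only the horizontal and vertical directions."""
    h = np.abs(-B[:, :-2] + 2.0 * B[:, 1:-1] - B[:, 2:]).sum()
    v = np.abs(-B[:-2, :] + 2.0 * B[1:-1, :] - B[2:, :]).sum()
    return h + v

def estimate_noise_std(I, W=5, t_sigma=3.0):
    """Homogeneity-oriented block-based noise estimation in the spirit of
    Section 2.4.2 (pixel skipping and corner masks omitted)."""
    I = I.astype(np.float64)
    blocks = []
    for y in range(0, I.shape[0] - W + 1, W):
        for x in range(0, I.shape[1] - W + 1, W):
            B = I[y:y + W, x:x + W]
            blocks.append((block_homogeneity(B), B.var()))
    blocks.sort(key=lambda b: b[0])                          # most homogeneous blocks first
    ref_var = float(np.median([v for _, v in blocks[:3]]))   # reference variance (3-tap median)
    # Adaptive averaging: keep only blocks whose variance is close to the reference.
    similar = [v for _, v in blocks if abs(v - ref_var) < t_sigma]
    return float(np.sqrt(np.mean(similar)))
```

Applied to the flat noisy image from the earlier sketch, the returned value should lie close to the σ_n that was added.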
Due to the limited range of intensities ([0, 255]), saturation effects result in a Gaussian noise signal that does not have exactly zero mean, especially for large noise variances. In this thesis, therefore, attention is paid to this saturation or clipping effect. This has been done according to the CCIR Recommendation for the YCrCb video standard, in which the reference black is represented by 16 and the reference white by 235 within the 8-bit range [0, 255]. Thus, noise is estimated solely in regions within these limits so that clipping effects are excluded from the estimation process. This, however, could limit the performance of the algorithm where the homogeneous regions lie outside these ranges.

To evaluate the performance of the algorithm, the estimation error E_n = σ_n² − σ_e² is first calculated; E_n is the difference between the true and the estimated noise variance. The average µ_{E_n} and the standard deviation σ_{E_n} of the estimation error are then computed from all the measures as a function of the input noise (some studies use the averaged squared error instead of the average absolute error as a quality criterion; the variance of this error among different test images is, however, an important indicator of the stability of the estimation):

µ_{E_n} = (1/N) Σ_{i=1}^{N} E_n(i);    σ_{E_n} = √( (1/N) Σ_{i=1}^{N} (E_n(i) − µ_{E_n})² )    (2.6)

where N is the number of tested images and E_n is the estimation error for a particular noise variance σ_n on a single image. The reliability of a noise estimation method can be measured by the standard deviation σ_{E_n} or the average µ_{E_n} of the estimation error.

Evaluation results are given in Table 2.2. As can be seen, the proposed method is reliable for both high and low input noise levels. In [98], an evaluation of noise estimation methods is given. When our results are compared to those of Table 1 in [98], the comparison suggests that the proposed method outperforms the block-variance-based method, which has been found in [98] to be a good compromise between efficiency and effectiveness. Moreover, the proposed method adapts thresholds to the image, whereas the block-based method requires the specification of the percentage of the image area to be considered (which has been set to 10 in [98]). As noted in [98], the performance of that method can be improved by tuning this parameter value. As Table 2.2 reveals, the estimation errors of the proposed method remain reliable even in the worst case. The method remains suitable for noise-based parameter adaptation, such as in noise reduction or segmentation techniques. For example, in high-quality noise reduction techniques, the adjustment is done in an interval of 2-5 dB [13, 21, 47]. Fig. 2.3(a) reveals that the average estimation error using the proposed method is lower than that of the block-variance method for all input noise variances. Interestingly, the standard deviation of the estimation error using the proposed method is significantly less than that of the block-based method, as shown in Fig. 2.3(b).

Recently, an interesting new averaging-based noise estimation method has been proposed [109]. The main difficulty with this new method is its heavy computational
cost, even when using some optimization procedures. The success of this method seems to depend heavily on many parameters that must be fixed, for example the number of iterations or the shape of the fade-out cosine function used to evaluate the variance histogram (Eqs. 9 and 10 in [109]). Furthermore, no information is given about the fluctuation of the estimation error E_n, i.e., about σ_{E_n}, which is an important criterion when evaluating noise estimation methods.

Table 2.2: The average µ_{E_n} and the standard deviation σ_{E_n} of the estimation error as a function of the input noise σ_n (W = 5); rows: input PSNR (including the noiseless case), σ_n, average µ_{E_n}, and σ_{E_n}.

Figure 2.3: Comparison of the block-based and the proposed method (W = 5): (a) average of the estimation error µ_{E_PSNR} and (b) standard deviation of the estimation error σ_{E_PSNR}, both as functions of the input PSNR (dB).

We have carried out simulations to evaluate the proposed method using different window sizes W = 3, 5, 7, 9, 11. As shown in Fig. 2.4, using a window size of 3×3 results in a better estimation in less noisy images (PSNR > 40 dB), whereas using a window size of 5×5 gives better results in noisy images (the results in Fig. 2.4 are shown for a subset of the test images displayed in Fig. 2.6). This is reasonable since, in noisy images, larger samples are needed to calculate the noise accurately. The choice of the window size can be guided by image information if available. As a compromise between efficiency and effectiveness, a window size of 5×5 is used, which gives good results compared to other estimation methods. If a reduction in
computation cost is required, the proposed noise estimation can be carried out only in the horizontal direction, i.e., along one line, using, for example, a 3×1 or 5×1 window.

Figure 2.4: The performance (µ_{E_PSNR} as a function of the input PSNR in dB) of the proposed method using different window sizes (3×3 and 5×5).

The effectiveness of the proposed estimation method is further confirmed when it is applied to motion images with various motions such as pan and zoom. These simulations show the stability of the algorithm throughout an image sequence. For example, the sequence Train (Fig. 2.12(a)) is overlaid with 30 dB PSNR white noise. Throughout the sequence the PSNR is estimated to be between dB and dB. These results show the stability of the method, which makes it suitable for temporal video applications in which the adjustment of parameters is oriented to the amount of noise in the image (for example, [13, 47]).

Table 2.3 summarizes the performance (effectiveness and complexity) of the proposed, the block-based, and the average methods. As shown, both the average and the standard deviation of the estimation error of the proposed method are significantly better than those of the reference methods. The computational cost of the method was investigated in simulations using images of different sizes and noise levels. The results show (Table 2.3) that the proposed method (without using special optimization techniques) is four times faster than the block-based method, which has been found [98] to be the most computationally efficient among the tested noise estimation methods (the proposed method needs on average 0.02 seconds on a SUN-SPARC MHz).
Table 2.3: Effectiveness and complexity comparison between the proposed method and other methods presented in [98] (columns: average method, block-based method, proposed method; rows: average of µ_{E_n}, average of σ_{E_n}, and the computational time T_c for one image). In terms of T_c, the average method is about 6× slower and the proposed method about 4× faster than the block-based method.

Summary

This thesis contributes a reliable real-time method for estimating the variance of white noise in an image. The method requires a 5×5 mask followed by averaging over blocks of similar variances. The proposed mask for homogeneity measurement is separable and can be implemented using simple FIR filters. The local image analyzer used is based on high-pass operators which allow the automatic, implicit detection of image structure; it measures the high-frequency image components. In the case of noise, the directional filtering compensates for the noise along different directions and stabilizes the selection of homogeneous blocks. The method performs well even in textured images (e.g., Fig. 2.6(i) and Fig. 2.6(f)) and in images with few smooth areas, like the Cosine2 image in Fig. 2.6(c). As shown in Fig. 2.5, for a typical image quality of PSNR between 20 and 40 dB the proposed method outperforms other methods significantly, and the worst-case PSNR estimation error is approximately 3 dB, which is suitable for real video applications such as surveillance or TV signal broadcasts. The method has been applied to estimate white noise from an uncompressed input image. The performance of the method on compressed images, for instance using MPEG-2, has to be studied further. The estimation of the noise after applying a smoothing filter is also an interesting point for further investigation.

2.5 Spatial noise reduction

Review of related work

The introduction of new imaging media such as Radio with Picture or Telephone with Picture makes real-time spatial noise reduction an important research issue. Studies show that with digital cameras image noise may increase because of the higher
sensitivity (as the sensitivity of the camera sensor to light is increased, so is its sensitivity to noise) and the longer exposure of the new CCD cameras [88].

Figure 2.5: The performance (µ_{E_PSNR}, W = 5) of the proposed and the block-based method in the typical PSNR range.

Spatial noise reduction is, therefore, an attractive feature in modern cameras, video recorders, and other imaging systems [88]. Real-time performance will be an attractive property for digital cameras, TV receivers, and other modern image receivers. This thesis develops a spatial noise reduction method with low complexity that is intended for real-time imaging applications. Spatial noise reduction (for a review of spatial noise reduction methods see [125, 122]) is usually a preprocessing step in a video analysis system. It is important that it preserves image structures such as edges and texture. Structure-preserving noise reduction methods (e.g., Gaussian and Sigma filtering) estimate the output image pixel gray value I_o from a weighted average of neighboring pixel gray values as follows:

I_o(i, j) = Σ_{(l,m)∈W_ij} w_{σ_n}(l, m) I(l, m) / Σ_{(l,m)∈W_ij} w_{σ_n}(l, m)    (2.7)

where:
- σ_n is related to the image degradation; in the case of white noise reduction it is the standard deviation of the noise,
- W_ij denotes the neighborhood system, usually chosen as a 2-D rectangular window of size W×W, containing neighbors of the current pixel gray
52 26 Noise reduction (a) Uniform (b) Cosine1 (c) Cosine2 (d) Synthetic (e) Portrait (f) Baboon (g) Aerial (h) Field (i) Trees Figure 2.6: Test images used for noise estimation comparison.
value I(i, j),
- w_{σ_n}(l, m) is the weighting factor, which ranges between 0 and 1 and acts as the probability that two neighboring pixels belong to the same type of image structure,
- I_o(i, j) represents the current noise-reduced pixel gray value, and
- I(l, m) represents a noisy input neighboring pixel gray value.

A large difference between neighboring pixel gray values implies an edge and, therefore, a low probability of belonging to the same structure. These pixel values will then have a minor influence on the estimated value of their neighbors. With a small difference, meaning that two pixels presumably belong to the same structure type, the opposite effect takes place. The weighting factor depends on the parameter σ_n, which quantifies the notions of large and small. Structure-preserving filtering can usually be applied iteratively to reduce noise further.

The Gaussian filter [69] weights neighboring pixel values with a Gaussian distribution as follows:

w_{σ_n}(l, m) = exp{ −[I(l, m) − I(i, j)]² / (4σ_n²) }.    (2.8)

The aim here is to reduce small-scale structures (corresponding to high spatial frequencies) and noise without distorting large-scale structures (corresponding to lower spatial frequencies). Since the Gaussian mask is smooth, it is particularly good at separating high and low spatial frequencies without using information from a larger area of the image than necessary. An increase in noise reduction using linear filters such as the Gaussian filter corresponds, however, to an increase in image blurring, especially at fine details (see [122], Section 4).

The Sigma filter [79] averages neighboring pixel values that have been found to have the same type of structure as follows:

w_{σ_n}(l, m) = 1 if (I(i, j) − 2σ_n) ≤ I(l, m) ≤ (I(i, j) + 2σ_n), and 0 otherwise.    (2.9)

Therefore, the Sigma filter takes an average of only those neighboring pixels whose values lie within 2σ_n of the central pixel value, where σ_n is determined once for the whole image. This attempts to average a pixel with only those neighbors which have values close to it, compared with the image noise standard deviation.

Another well-known structure-preserving noise filter is the anisotropic diffusion filter [106]. This filter uses the local image gradient to perform anisotropic diffusion in which smoothing is prevented from crossing edges. Because pixels on both sides of an edge have a high gradient associated with them, thin lines and corners are degraded by this process. This method is computationally expensive and is usually not considered for real-time video applications.
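As a point of reference for the comparisons reported later in this section, the following sketch implements the Sigma filter of Eqs. 2.7 and 2.9 in a direct, non-optimized form (Python/NumPy). This is one of the reference filters, not the proposed method; the function name and the edge-replication border handling are assumptions.

```python
import numpy as np

def sigma_filter(I, sigma_n, W=3):
    """Direct (non-optimized) sketch of the reference Sigma filter (Eqs. 2.7 and 2.9):
    each pixel is replaced by the mean of those neighbours in a W x W window whose
    values lie within 2*sigma_n of the centre value."""
    I = I.astype(np.float64)
    r = W // 2
    P = np.pad(I, r, mode='edge')        # border handling by edge replication (assumption)
    out = np.empty_like(I)
    for i in range(I.shape[0]):
        for j in range(I.shape[1]):
            win = P[i:i + W, j:j + W]                        # W x W neighbourhood W_ij
            mask = np.abs(win - I[i, j]) <= 2.0 * sigma_n    # binary weights w(l,m), Eq. 2.9
            out[i, j] = win[mask].mean()                     # weighted average, Eq. 2.7
    return out
```

Because σ_n is fixed once for the whole image, the structure-preserving behavior of this filter is global rather than locally adaptive, which is the weakness the proposed method addresses.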
In summary, current structure-preserving spatial noise filters are either computationally expensive, require a kind of manual user intervention, or still blur the image structure, which is critical when the noise filter is a preprocessing step in a multi-step video processing system where robustness along edges is needed.

Fast structure-preserving noise reduction method

In this section, a new technique for spatial noise reduction is proposed which uses a simple low-pass filter with low complexity to eliminate spatially uncorrelated noise from spatially correlated image content. Such a filter can be implemented, e.g., as a horizontal, vertical, or diagonal 3-tap FIR filter with central coefficient c (Fig. 2.7(a)).

Figure 2.7: Spatial noise filtering: (a) a symmetrical 3-tap FIR filter with coefficients (1−c)/2, c, (1−c)/2; (b) the noise reduction gain R in dB as a function of the central coefficient c.

If the input noise variance of the filter is σ_n², the output signal variance σ_o² can be computed by Eq. 2.10:

σ_o² = c² σ_n² + 2 ((1−c)/2)² σ_n².    (2.10)

The noise reduction gain R of this filter (the ratio of the input to the output noise variance) can be computed by Eq. 2.11:

R[dB] = 10 log( σ_n² / σ_o² ) = 10 log( 2 / (3c² − 2c + 1) ).    (2.11)

This noise reduction gain clearly depends on the choice of the central coefficient c. This dependency is depicted in Fig. 2.7(b). For a cos²-shaped filter (i.e., c = 1/2) a noise reduction gain of R = 4.26 dB can be achieved. The maximum of R is achieved
by a mean filter (c = 1/3), which is suitable for homogeneous regions. In structured regions, the coefficient has to be selected adaptively so as not to blur edges and lines of the image. The spatial filter should only be applied along object boundaries or in unstructured areas, but not across edges. To achieve this adaptation, an image-analyzing step has to be applied to control the direction of the low-pass filter, as depicted in Fig. 2.8.

Figure 2.8: Proposed noise- and structure-adaptive spatial noise reduction: the input image I(n) is fed to a structure analyzer (mask selection) and to the noise estimation (σ_n, coefficient control), which together control the noise- and structure-adaptive spatial filter that produces the quality-enhanced image I_o(n).

Adaptation to image content and noise

Adaptation to image content

Several algorithms for effective detection of structure have been proposed [122, 125, 79, 82]. In [82], a comparison of the accuracy of many different edge detectors is given; in Section VI of that work it is shown that precise edge detection is computationally expensive and therefore not suitable for real-time video systems. Other, computationally less expensive algorithms either need some manual tuning (cf. [122, 79]) to adapt to different images or are not precise enough. In this thesis, a computationally efficient method for detecting edge directions is proposed (see also [74]). The basic idea is to use a set of high-pass filters to detect the most suitable direction among a set of eight different directions as defined in Fig. 2.2. Then noise reduction is performed in this direction. To take structure at object corners into account, four additional corner masks are defined (Fig. 2.2) which preserve the sharpness of structure at corners. For each pixel of the image, high-pass filters with coefficients {−1, 2, −1} are first applied along all eight directions. Then the direction with the lowest absolute high-pass output is chosen, and the noise reduction is applied by weighted averaging along this direction. In doing so, the averaging is adapted to the most homogeneous direction and image blurring is thus implicitly avoided.

Adaptation to image noise

For optimal noise reduction, the averaging process along an edge direction should be adapted to the amount of noise in the image. As shown in Fig. 2.9, the gain of the
spatial noise reduction can be roughly doubled when adapting to the estimated noise. Especially in images with higher PSNR, the noise estimator stabilizes the performance of the spatial filter.

Figure 2.9: Noise reduction gain in dB (noise-adaptive versus non-adaptive, as a function of the input PSNR) averaged over three sequences (each of 60 images). The gain can be roughly doubled when adapting to the estimated noise.

Noise adaptation can be done by weighting the central pixel. Assume, for example, that the most homogeneous edge direction is the horizontal one; the weighted average is then:

I_o(i, j) = [ I(i, j−1) + w(σ_n) I(i, j) + I(i, j+1) ] / ( w(σ_n) + 2 ).    (2.12)

This weighting should be low for highly noisy images and high (emphasis on the central pixel) for less noisy images. To keep the implementation cost low, the following adaptation is chosen:

w(σ_n) = a^{σ_n},  a < 1.    (2.13)

Thus the spatial filter automatically adapts to the source noise level, which is estimated by the new method described in Section 2.4. This estimation algorithm measures video noise and can be implemented in simple hardware. Fig. 2.10(a) shows the effect of the weight adaptation to the estimated noise: higher weighting achieves better noise reduction in less noisy images and lower weighting achieves better noise reduction in more noisy images. In addition, higher noise reduction can be achieved if the window size (in terms of the number of taps or the size of the FIR filter [121]) is also adapted to the estimated amount of noise. In the case of highly noisy images, better noise reduction can be achieved if a larger window size is used. A larger window
size means more averaging and a higher noise reduction gain. Fig. 2.10(b) illustrates this discussion and shows the effectiveness of the noise adaptation.

Figure 2.10: Comparison of the proposed method by different weights and windows (gain in dB as a function of the input PSNR): (a) average gain for different weights; higher weights are suitable for less noisy images; (b) average gain for different window sizes; a larger window size is needed in highly noisy images.

Results and conclusions

Quantitative evaluation

In simulations, the new spatial noise reduction achieves an average PSNR gain of dB. The actual gain depends on the contents of the image. In structured areas a higher gain is achieved than in areas of complex structure. With complex structure, the spatial analysis filter may fail to detect edges; in such a case, the mask selection is no longer uncorrelated with the noise. This leads to lower achieved gains compared to the theory (Fig. 2.7(b)). Noise is, however, reduced even in unstructured images. Noise adaptation to the estimated input PSNR achieves a higher noise reduction. This gain is especially notable in images with both high and low noise levels and in structured areas. This adaptation can achieve gains of up to 5 dB (Fig. 2.9). An objective (quantitative) comparison between the Sigma filter with a window size of 3×3 and the proposed filter (Fig. 2.11) shows that higher PSNR can be achieved using the proposed spatial noise filter, especially in strongly noisy images. Further simulations show that with a 5×5 window size the Sigma filter achieves higher PSNR than with a 3×3 window, but this increases the image blur significantly in some image areas. This suggests that parameters of the Sigma filter
need to be tuned manually.

Figure 2.11: Noise reduction gain (in dB, as a function of the input PSNR): proposed filter versus Sigma filter.

Subjective evaluation

To show the performance of the proposed spatial filter, critical images that include both fine structure and smooth areas are used (Fig. 2.12). The performance of the proposed method with and without noise adaptation is shown subjectively in Fig. 2.13, where the mean-square error (MSE) is compared. As can be seen in Fig. 2.13(a), significantly higher mean-square errors are obtained without noise adaptation. The results emphasize the advantage of the proposed noise adaptation in the spatial filter. The proposed method has also been subjectively compared to the Sigma filter. As shown in Figures 2.14(a), 2.15(a), and 2.16(a), the Sigma filter blurs edges and produces granular artifacts both in smooth (Fig. 2.14(a)) and structured (Fig. 2.16(a)) areas, while the proposed filter reduces noise while protecting edges and structure (e.g., Fig. 2.16(b)). The reason for this is that the structure-preserving component of the Sigma filter is global to the image, whereas the proposed filter adapts to local image structure.

Computational efficiency

In general, the Sigma filter requires more computations (Table 2.4) than the proposed method. In addition, the computational cost of the Sigma filter strongly depends on
the size of the window used, while the cost of the proposed filter increases only slightly when using a larger window. The algorithms were coded in C and no special efforts were devoted to accelerating their execution.

Algorithm | Average execution time (s)
Noise estimation | 0.14
Proposed noise filter (3-tap) | 0.22
Sigma filter (window size 3×3) | 0.47
Proposed noise filter (5-tap) | 0.24
Sigma filter (window size 5×5) |

Table 2.4: Average computational cost in seconds on a SUN-SPARC MHz for a PAL image.

Summary

Current structure-preserving spatial noise filters require user intervention, and are either computationally expensive or blur the image structure. The proposed filter is suitable for real-time video applications (e.g., noise reduction in TV receivers or video surveillance). The proposed noise filter reduces image noise while preserving structure and retaining thin lines, without the need to model the image. For example, the filter reduces noise both in moving and non-moving areas, as well as in structured and non-structured ones. The proposed method first applies a local image analyzer along eight directions and then selects a suitable direction for filtering. The filtering process is adapted to the amount of noise by different weights and window sizes. Quantitative and qualitative simulations show that the proposed filtering method is more effective at reducing Gaussian white noise without structure degradation than the reference filters. Therefore, this method is well suited for video preprocessing, for example for video analysis (Chapter 3) or temporal noise reduction [13].
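To make the preceding description concrete, the following minimal sketch (Python/NumPy, illustrative only) shows the direction-adaptive, noise-adaptive averaging of Sections 2.5.2 and 2.5.3. It keeps only four of the eight directions of Fig. 2.2, omits the corner masks and the window-size adaptation, and uses w = a^σ_n as the noise-dependent central weight in the spirit of Eq. 2.13; the exact weight rule, the parameter value a = 0.9, and the border handling are assumptions, not the thesis implementation.

```python
import numpy as np

def directional_denoise(I, sigma_n, a=0.9):
    """Sketch of the direction-adaptive spatial filter (Sections 2.5.2-2.5.3).
    Only the horizontal, vertical and two diagonal 3-tap directions are used;
    the eight masks with corner handling of Fig. 2.2 are omitted."""
    I = I.astype(np.float64)
    P = np.pad(I, 1, mode='edge')                      # border handling (assumption)
    # Neighbour pairs for the four directions: horizontal, vertical, two diagonals.
    offsets = [((0, -1), (0, 1)), ((-1, 0), (1, 0)),
               ((-1, -1), (1, 1)), ((-1, 1), (1, -1))]
    w = a ** sigma_n                                   # noise-adaptive central weight (cf. Eq. 2.13)
    out = np.empty_like(I)
    for i in range(I.shape[0]):
        for j in range(I.shape[1]):
            c = P[i + 1, j + 1]
            best = None
            for (dy1, dx1), (dy2, dx2) in offsets:
                n1 = P[i + 1 + dy1, j + 1 + dx1]
                n2 = P[i + 1 + dy2, j + 1 + dx2]
                hp = abs(-n1 + 2.0 * c - n2)           # {-1, 2, -1} high-pass response
                if best is None or hp < best[0]:       # most homogeneous direction
                    best = (hp, n1, n2)
            _, n1, n2 = best
            out[i, j] = (n1 + w * c + n2) / (w + 2.0)  # weighted average along it (cf. Eq. 2.12)
    return out
```

Averaging only along the direction with the smallest high-pass response is what keeps edges and thin lines sharp, while the weight w shifts emphasis away from the (noisier) central pixel as the estimated noise level grows.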
60 34 Noise reduction (a) Original image. (b) 25 db Gaussian noisy image. Figure 2.12: Images for subjective evaluation.
61 35 (a) MSE without noise-adaptation (inverted). (b) MSE with noise-adaptation (inverted). Figure 2.13: Subjectively, the proposed noise adaptation (in Fig. (b)) produces less MS errors than without noise adaptation (In Fig. (a)).
62 36 Noise reduction (a) Sigma noise filtering: note the granular effects. (b) Proposed noise filtering. Figure 2.14: Proposed noise filter gives subjectively better noise reduction.
63 37 (a) Sigma noise filtering (zoomed). (b) Proposed noise filtering (zoomed). Figure 2.15: Performance in smooth area: Proposed noise filter has higher noise reduction than the Sigma filter.
(a) Sigma noise filtering (zoomed). (b) Proposed noise filtering (zoomed). Figure 2.16: Performance in a structured area: the proposed noise filter protects edges and structure better than the Sigma filter.
65 Chapter 3 Object-Oriented Video Analysis 3.1 Introduction Typically, a video is a set of stories, scenes, and shots (Fig. 3.1). Examples of a video are a movie, a news clip, or a traffic surveillance clip. In movies, each scene is semantically connected to previous and following ones. In surveillance applications, a video does not necessarily have semantic flows. Video Stories Scenes Shots Objects & Meaning Figure 3.1: Video units. A video contains, usually, thousands of shots. To facilitate extraction and analysis of video contents, a video has to be first segmented into shots ([93, 25]). A shot is a (finite) sequence of images recorded contiguously (usually without viewpoint change) and represents a continuous, in time and space, action or event driven by moving objects (e.g., an intruder moving, an object stopping at a restricted site). There is little semantic change in the visual content of a shot, i.e., within a shot there is a short-term temporal consistency. Two shots are separated by a cut, which is a transition at the image boundary between two successive shots. A cut can be thought of as an edge in time. A shot consists, in general, of multiple objects, their semantic interpretation (i.e., objects meaning), their dynamics (i.e., objects movement, activities, action, or related events), and their syntax (i.e., the way objects
66 40 Video analysis are spatially and temporally related, e.g., a person is close to a vehicle ) 1. The demand for shot analysis becomes crucial as video is integrated in various applications. An object-oriented video analysis system aims at extracting objects and their features for object-based video representation that are more searchable than pixel- or block-based representations. Such object-based representations allow advanced video processing and manipulation. Various research results show that with the integration of extracted relevant objects and their features in video processing, high efficiency could be achieved [26, 55, 45]. For example, video coding techniques, such as MPEG-2 and MPEG-4, use video analysis to extract motion and object data from the video to achieve better coding and representation quality. In contentbased video representation, video analysis is an important first step towards highlevel understanding of the video content. Real-time implementation and robustness demands make the video analysis, however, an especially difficult task. Two levels of video analysis can be distinguished: low-level and high-level. In lowlevel analysis, high performance operators use low-level image features such as edges or motion. An example is video coding where the goal is to achieve low bit-rates, and low-level features can support high quality coding. High-level analysis is required to determine the perceptually significant features of the video content. For example, in video retrieval higher-level features of the object are needed for effective results. The goal of this thesis is to develop a high-level modular video analysis system that extracts video objects robustly with respect to noise and artifacts, reliably with respect to the precision needed for surveillance and retrieval applications, and efficiently with regards to computational and memory cost. The focus is on automated fast analysis that foregoes precise extraction of image objects. Retinal processing early or low-level vision processing High-level vision processing Motion Memory Shape Retina Orientation More processing combined processing Color... Figure 3.2: Diagram for human visual processing: a set of parallel processors, each analyzing some particular aspect of the visual stimulus [3]. 1 In the remainder of this thesis, the term video refers to a video shot.
The structure of the proposed video analysis technique is modular: results of the analysis levels are combined to achieve the final analysis. Results of lower-level processing are integrated to support higher-level processing, and higher levels support lower levels through a memory-based feedback loop. This is similar to human visual perception as shown in Fig. 3.2, where visual data are analyzed and simplified to be integrated for higher-level analysis. The HVS finds objects by partial detection and recognition; this partial recognition introduces new context, which in turn supports further recognition.
3.2 Fundamental issues
Video analysis aims at describing the data in successive images of a video in terms of what is in the real scene, where it is located, when it occurred, and what its features are. It is a first step towards understanding the semantic contents of the video. Efficient video analysis remains a difficult task despite progress in the field. This difficulty originates in several issues that complicate the design and evaluation of video analysis methods.
1. Generality
Much research has been concerned with the development of analysis systems that are of general application. Specific applications require, however, specific parameters to be fixed, and even the designers of general systems can have difficulty adapting the system parameters to a specific application. Therefore, it seems more appropriate to develop analysis methods that focus on a well-defined range of applications.
2. Interpretation
Object-oriented video analysis aims at extracting video objects and their spatio-temporal features from a video. To extract objects, the following technical view is usually taken: 1) a video is a finite set of images and each image consists of an array of pixels; 2) the aim of analysis is to give each pixel a label based on some properties; and 3) an object consists of a connected group of pixels that share the same label. The technical definition of an object may not be, however, the one that interpretation needs. For instance, does interpretation consider a vehicle with a driver one object or two? Is a person moving their body parts one object or several? These questions indicate that there is no single object-oriented video analysis method that is valid for all applications. Analysis is subjective: it can vary between observers, and the analysis of one observer can vary over time. This subjectivity cannot always be captured by a precise mathematical definition of an analysis concept that even humans cannot define uniquely. For some applications, the use of heuristics is an unavoidable part of the solution [100, 37].
68 42 Video analysis 3. Feature selection and filtering A key difficulty in selecting features to solve an analysis task, such as segmentation or matching, is to find useful features that stay robust throughout an image sequence. Main causes for inaccuracy are sensitivity to artifacts, object occlusion, object deformation, articulated and non-rigid objects, and view change. The choice of these features and their number varies across applications and within various tasks of the video analysis. In some tasks a small number of features is sufficient, in other tasks a large number of features may be needed. A general rule is to select features that do not significantly change over a video and that can be combined to compensate for each other s weakness. For example, the most significant features can be noisy and thus difficult to analyze, therefore requiring a filtering procedure to exclude these from being used in the analysis. 4. Feature integration Since features can be noisy, incomplete, and variant, the issue is to find ways to effectively combine these features for robustness. The most used methods for feature integration are linear. The HVS performs many vision tasks, however, in a non-linear way. In high-level video analysis HVS oriented integration is needed. Another important issue is to chose an integration method that is task-specific. Chapters 4 and 6 consider two such integration strategies. 5. Trade-off: quality versus efficiency Video analysis is further complicated by various image and object changes, such as noise, artifacts, clutter, illumination changes, and object occlusion. This complicates further two conflicting requirements: precision and simplicity. In a surveillance application, for instance, the emphasis is on the robustness of the features extracted with respect to varying image and object conditions rather than on precise estimation. In object-based retrieval, on the other hand, obtaining pixel-precise objects is not necessary but the representation of objects must have some meaning. Beside robustness, complexity has an impact on the design of analysis systems. Offline applications, such as object-based coding, tolerate analysis algorithms that need processing power and time. Other applications such as surveillance require real-time analysis. In general, however, the wide use of an analysis tool strongly depends on its computational efficiency [61]. 3.3 Related work Video analysis techniques can be classified into contour, region, and motion based methods. An analysis based on a simple feature (such as edge) cannot deal with complex object structures. Various features are often combined to achieve useful object-oriented video analysis.
69 Related work 43 The MPEG-4 oriented analysis method proposed in [89] operates on binary edge images generated with the Canny operator and tracks objects using a generalized Hausdorff distance. It uses enhancements, such as adaptive maintenance of a background image over time and improvement of boundary localization. In the video retrieval system VideoQ [33], an analysis based on the combination of color and edges is used. Region merging is performed using optical flow estimation. Optical flow methods are not applicable in the presence of large motion and object occlusion. Region merging produces regions that have similar motion. An object may consist, however, of regions that move differently. In such a case, an object may be divided into several regions that complicate subsequent processing steps. The retrieval and surveillance AVI system [38] uses motion detection information to extract objects based on a background image. Results show that the motion detection method used in AVI is sensitive to noise and artifact. The system can operate in simple environments where one human is tracked and translational motion is assumed. It is limited to applications of indoor environments and cannot deal with occlusion. In the retrieval system Netra-V [48], a semi-automatic object segmentation (the user has to specify a scale factor to connect region boundaries) is applied based on color, texture, and motion. This is not suitable for large collections of video data. Here, the shot is first divided into image groups; the first of a group of images is spatially segmented into regions, followed by tracking based on a 2-D affine motion model. Motion estimation is done independently in each group of images. The difficulty of this image-group segmentation is to automatically estimate the number of images in a group. In [48], this number is manually fixed. This introduces several artifacts, for example, when the cut is done at images where important object activities are present. Further, objects disappearing (respectively appearing) just before (respectively after) the first image of the group are not processed in that group. In [48], regions are merged based on coherent-motion criteria. A difficulty arises when different objects with the same motion are erroneously merged. This complicates subsequent analysis steps. Recently, and within the framework of the video analysis model of the COST-211- Group, a new state-of-the-art object-oriented video analysis scheme, the COST-AM scheme, has been introduced [85, 6, 60]. The basic steps of its current implementation are camera-motion compensation, color segmentation, and motion detection. Subjective and objective evaluation suggests that this method produces good objectoriented analysis of input video (see [127]). Difficulties arise when the combination of features (here motion and color) fails and strong artifacts are introduced in the resulting object masks (Fig ). Moreover, the method produces outliers at object boundaries where large areas of the background are estimated as belonging to
70 44 Video analysis objects. Big portions of the background are added to the object masks in some cases. In addition, the algorithm fails in some cases to produce temporally reliable results. For example, the method loses some objects and no object mask can be generated. This can be critical in tracking and event-based interpretation of video. Most video analysis techniques are rarely tested in the presence of noise and other artifacts. Also, most of them are tested in a limited set of video shots. 3.4 Overview of the proposed approach The proposed video analysis system is designed for both indoor and outdoor real environments, has a modular structure, and consists of: 1) motion-detection-based object segmentation, 2) object-based motion estimation, 3) feature representation, 4) region merging, and 5) voting-based object tracking (Fig. 3.3). The object segmentation module extracts objects based on motion data. In the motion estimation module, temporal features of the extracted objects are estimated. The feature representation module selects spatio-temporal object features for subsequent analysis steps. The tracking module combines spatial and temporal object features and tracks objects as they move through the video shot. Segmentation may produce objects that are split into sub-regions. Region merging intervenes to improve segmentation based on tracking results. This improvement is used to enhance the tracking. The proposed system performs region merging based on temporal coherence and matching of objects rather than on a single or combined local features. The core of the proposed system is the object segmentation step. Its robustness is crucial for robust analysis. Video analysis may produce incorrect results and much research has been done to enhance given video analysis techniques. This thesis proposes to compensate for possible errors of low-level techniques by higher-level processing because at higher levels more information is available that is quite useful and more reliable for detection and correction of errors. In each module of the proposed system, complex operations are avoided and particular attention is paid to intermediate errors of analysis. The modules cooperate by exchanging estimates of video content, thereby enhancing quality of results. This proposed video analysis system results in significant reduction of the large amount of video data: it transforms a video of hundreds of images into a list of objects described by low-level features (details in Section 3.5). For each extracted object, the system provides the following information between successive images: Identity number to uniquely identify the object throughout the shot, Age to denote its life span, minimum bounding box (MBB) to identify its borders, Area (initial, average, and present), Perimeter (initial, average, and present), Texture (initial, average, and present), Position (initial and present), Motion (initial, average, and present),
and finally Corresponding object to indicate its corresponding object in the next image. This list of features can be extended or modified for other applications. When combined, these object features provide a powerful object description to be used in interpretation (Chapter 7).
Figure 3.3: Video analysis: from pixels to video objects. The modules (change detection and thresholding, object labeling with morphological edge detection and contour analysis, object-based motion estimation and feature extraction, region merging, and voting-based multi-feature object tracking) map pixels to objects, objects to features, and objects to video objects. R(n) is a background image, I(n) and I(n-1) are successive images of the shot, D(n) and D(n-1) are the difference images, and O(n) and O(n-1) are lists of objects and their features in I(n) and I(n-1).
This data reduction has advantages: large video databases can be searched efficiently and memory requirements are significantly reduced. For instance, a video shot of one-second length at 10 frames per second containing three objects is represented by hundreds of bytes rather than megabytes. Extensive experiments (see Sections 4.7, 5.5, and 6.4) using 10 video shots containing a total of 6071 images illustrate the good performance of the proposed video analysis. Indoor and outdoor real environments including noise and coding artifacts are considered.
Algorithmic complexity of video analysis systems is an important issue even with the significant advancements in modern micro-electronic and computing devices. As the power of computing devices increases, larger problems will be addressed. Therefore, the need for faster running algorithms will remain of interest. Research oriented to real-time and robust video analysis is and will stay both important and practical. For example, the large size of video databases requires fast analysis systems and, in surveillance, images must be processed as they arrive.
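To make the per-object representation concrete, a minimal C sketch of one possible record layout is given below. The field names and types are illustrative assumptions; the thesis specifies the list of features above but not a particular data layout.

```c
/* Illustrative per-object record for the features listed above.
 * Field names and types are assumptions, not the thesis' implementation. */
typedef struct {
    float x, y;                               /* image coordinates */
} Point2D;

typedef struct {
    Point2D top_left, bottom_right;           /* minimum bounding box (MBB) */
} BBox;

typedef struct {
    int     id;                               /* identity number, unique within the shot */
    int     age;                              /* life span in images */
    BBox    mbb;                              /* current MBB */
    float   area_init, area_avg, area_cur;    /* area */
    float   perim_init, perim_avg, perim_cur; /* perimeter */
    float   tex_init, tex_avg, tex_cur;       /* texture measure */
    Point2D pos_init, pos_cur;                /* position (centroid) */
    Point2D mot_init, mot_avg, mot_cur;       /* motion (displacement) */
    int     corresponding_id;                 /* corresponding object in the next image, -1 if none */
} ObjectRecord;
```

A shot is then represented by a short list of such records per image pair, which illustrates the data reduction discussed above.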
Computational costs 2 of the proposed analysis modules are given in Table 3.1.
Table 3.1: Computational cost in seconds of the analysis steps (minimum and maximum cost for object segmentation, motion estimation, and tracking).
As shown, the current non-optimized implementation of the proposed video analysis requires on average between 0.11 and 0.35 seconds to analyze the content of two images. This time includes noise estimation, change detection, edge detection, contour extraction, contour filling, feature extraction, motion estimation, object tracking, and region merging. In the presence of severe occlusion, the processing time of the proposed method increases. This is mainly due to the handling of occlusion, especially in the case of multiple occluded objects. Typically, surveillance video is recorded at a rate of 3-15 frames per second. The proposed system provides a response in real-time for surveillance applications with a rate of up to 10 frames per second. To accelerate the system performance for higher frame-rate applications, optimization of the code is needed. Acceleration can be achieved, for example, by optimizing the implementation of the occlusion handling and object separation, working with integer values (where appropriate), and using additions instead of multiplications. The current version (v.4x) of the state-of-the-art reference method, COST-AM [85, 6, 60], takes on average 175 seconds to segment the objects of an image. The COST-AM includes color segmentation, global motion estimation, global motion compensation, scene cut detection, change detection, and motion estimation.
2 Algorithms of this thesis are implemented in C and run on a multitasking SUN UltraSPARC 360 MHz without specialized hardware. Computational cost is measured in seconds and given per CIF (352x288) image. No attention was paid to optimizing the software.
3.5 Feature selection
One of the fundamental challenges in video processing is to select a set of features appropriate to a broad class of applications. In video retrieval and surveillance, for example, it is important that features for matching of objects exploit properties of the HVS as to the perception of similar objects. The objective in this section is to define local and global features that are suitable for real-time applications.
3.5.1 Selection criteria
Features can be divided into low-level and high-level features. Low-level features include texture, shape, size, contour, MBB, center of gravity, and object and camera motion. They are extracted using video analysis methods. High-level features include
73 Feature selection 47 movement (i.e., trajectory), activity (i.e., a statistical sequence of movements such as pitching a ball ), action (i.e., meaning of a movement related to the context of the movement such as following a player ) [22]), event (i.e, a particular behavior of a finite set of objects such as depositing an object ). They are extracted by interpretation of low-level features. High-level features provide a step toward semantic-based understanding of a video shot. Low-level features are generic and relatively easy to extract [81]. By themselves, they are not sufficient for video understanding [81]. High-level features may be independent (e.g., deposit) or dependent (e.g., following or being picked up by a car ) of the context of an application. They are difficult to extract but are an important basis for semantic video description and representation. Low-level features can be further classified into global and local features. Local features, such as shape, are related to objects in an image or between two images. Global features, such as camera motion or dominant objects (e.g., an object related to an important event), refer to the entire video shot. It is now recognized that, in many domains, content-based video representation cannot be carried out using only local or only global features (cf. [124, 115, 8, 5, 142]). A key problem in feature selection is stability throughout a shot. Reasons for feature instability are noise, occlusion, deformation, and articulated movements. This thesis uses the following criteria to select stabile feature descriptions: Uniqueness: Features must be unique to the entities they describe (objects or video shots). Unique features do not change significantly over time. Robustness: Since the image of an object undergoes various changes such as gray level and scale change, it is important to select feature descriptions that are robust or invariant to relevant image and object transformations (e.g., translation, rotation, scale change, or overall lightness change). For example, area and perimeter are invariant to rotation whereas ratios of sizes and some shape properties (see next section) are invariant to geometrical magnification (details in [111]). Object changes such as rotation can be easily modeled. There are changes such as reflection or occlusion that are, however, difficult to model and a good analysis scheme should take these changes into account to ensure robust processing based on the selected feature representation. Completeness: As discussed in the introduction of this thesis, there has been a debate over the usefulness of developing video analysis methods that are general and applicable for all video applications. The same arguments can be used when selecting features. Since the need for object features may differ for different applications a feature that is important for one application can be useless for another application a description of a video object or a video can be only complete when it is chosen based on a specific application. Combination: Since features can be incomplete, noisy, or variant to transformation,
74 48 Video analysis the issue is to find ways to effectively combine these features to solve a broad range of problems. The most widely used technique is a weighted combination of the features. However, the HVS is non-linear. In Chapter 6, a new effective way to combine features based on a non-linear voting system is proposed. Filtering: Selection of good features is still not a guarantee that matching applications would give the expected results. One reason is that even expressive features, such as texture, can get distorted and complicate processing. A second reason is that good features can be occluded. Such features should be detected and excluded from processing. Therefore it is important to monitor and, if needed, temporally filter features that are used in the analysis of an image sequence. For example, occluded features can be ignored (cf. Section 6.3.4). Efficiency: Extraction of feature description can be time consuming. In real-time applications such as video retrieval or surveillance, a fast response is expected and simple descriptions are required Feature descriptors Since the human perception of object and video features is subjective, there is no one single best description of a given feature. A feature description characterizes a given feature from a specific perspective of that feature. In the following paragraphs, lowlevel object and shot feature descriptions are proposed that are easy to compute and match. The proposed descriptors are simple but combined together in an efficient way (as will be shown in Chapter 6) they provide a robust tool for matching of objects. In this section, models for feature representation are proposed that balance the requirements of being effective and efficient for a real-time application. In the following, let O i represent an object of an image I(n). Size: The size of an object is variant to some transformation such as scale change, but combined with other features such as shape it can compensate for errors in these features, for instance, when objects get smaller or when noise is present. Local size descriptors: area A i, i.e, the number of pixels of O i, perimeter P i, i.e., the number of contour (border) points of O i (both area and perimeter are invariant under translation and rotation), width W i, i.e., the maximal horizontal extent of O i, and height H i, i.e., the maximal vertical extent of O i (both width and height are invariant under translation). Global size descriptor: Two descriptors are proposed: 1) the initial and the last size (e.g., area) of the
object across the shot and/or 2) the median of the object sizes across the video shot. The optimal selection depends on the application. This descriptor can be used to query shots based on objects. For example, the absolute value of the difference between the query size and the indexed size can be used to rank the objects in terms of similarity.
Shape: Shape is one of the most important characteristics of an object, and particularly difficult to describe both quantitatively and qualitatively. There is no generally accepted methodology of shape description, especially because it is not known which element of shape is important for the HVS. Shape descriptions can be based on boundaries or on intensity variations of the object. Boundary-based features tend to fail under scale change or when noise is present, whereas region-based features are more stable in these cases. The use of shape in matching is difficult because estimated shapes are rarely exact due to algorithmic error and because few of the known shape feature measures are accurate predictions of human judgments of shape similarity. Shape cannot be characterized by a single measure. This thesis uses the following set of simple measures to describe object shape:
Minimum bounding box (MBB) B_Oi: the MBB of an object is the smallest rectangle that includes all pixels of O_i, i.e., O_i ⊆ B_Oi. It is parameterized by its top-left and bottom-right corners.
Extent ratio: e_i = H_i / W_i, the ratio of height and width.
Compactness: c_i = A_i / (H_i W_i), the ratio of the object area A_i and the MBB area H_i W_i.
Irregularity (also called elongation or complexity): r_i = P_i^2 / (4π A_i). The perimeter P_i is squared to make the ratio independent of the object size. This ratio increases when the shape becomes irregular or when its boundaries become jerky. r_i is invariant to translation, rotation, and scaling [111].
Texture: Texture is a rich object feature that is widely used. The difficulty is how to find texture descriptors that are unique and expressive. Despite many research efforts, no unique definition of texture exists and no texture descriptors are widely applicable. The difficulty of using texture for tracking or similarity comes from the fact that it is blurred by strong motion or noise, and its expressive power thus becomes limited. In such cases, shape features are more reliable. This thesis uses the following simple texture measure, µ_t(O_i), shown in [39] to be as effective as the more computationally demanding co-occurrence matrices. The average grey value difference, µ_g(p), for each pixel p ∈ O_i is defined as
µ_g(p) = (1/W_d) Σ_{l=1}^{L} |I(p) - I(q_{d_l})|    (3.1)
and
µ_t(O_i) = (1/A_i) Σ_{l=1}^{A_i} µ_g(p_l),    (3.2)
where {q_{d_1}, ..., q_{d_L}} is a set of points neighboring p at a distance of d pixels and A_i is the area of O_i. The best choice of d depends on the coarseness of texture within the object, which can vary over the image of the object. In this thesis, d was fixed to one pixel, and the neighborhood size W_d was fixed to be the 4-neighborhood of p.
Spatial homogeneity of an object: A spatial homogeneity measure of an object O_i describes the connectivity of its pixels. In this thesis, the following simple measure is selected [58]:
H(O_i) = (A_i - A_R) / A_i    (3.3)
where A_i is the area of an object O_i and A_R is the total area of all holes (regions) inside O_i. A hole or region R_i is inside an object O_i if R_i is completely surrounded by O_i. H(O_i) = 1 when O_i contains no holes.
Center-of-gravity: Accurate estimation of the center-of-gravity (centroid) of an object is time consuming. The level of accuracy depends on the application. A simple estimate is the center of B_Oi. This estimate is quickly determined and suffers from gross errors only under certain conditions, such as gross segmentation errors.
Location: The spatial position of an image object can be represented by the coordinates of its centroid or by its MBB. The temporal position of an object or event can be given by specifying the start and end image.
Motion: Object motion is an important feature that the HVS uses to detect objects. Motion estimation or detection methods can relate motion information to objects.
Object motion: the motion direction δ = (δ_x, δ_y) and the displacement vector w = (w_x, w_y). The object trajectory is approximated by the motion of the estimated centroid of O_i. A trajectory is a set of tuples {(x_n, y_n): n = 1, ..., N} where (x_n, y_n) is the estimated centroid of O_i in the n-th image of the video shot. The trajectory is estimated via object matching (Chapter 5).
Average absolute value of the object motion throughout a shot: µ_w = (µ_x, µ_y) with µ_x = (1/N) Σ_{n=1}^{N} |w_xn| and µ_y = (1/N) Σ_{n=1}^{N} |w_yn|. This feature is by analogy to HVS motion perception. The HVS integrates local object motion at different positions of an object into a coherent global interpretation. One form of this integration is vector averaging [4].
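As a rough illustration, the following C sketch computes several of the descriptors above (area, perimeter, extent ratio, compactness, irregularity, and the texture measure of Eqs. (3.1)-(3.2) with d = 1 and the 4-neighborhood) from a binary object mask and the corresponding gray-level image. It is a minimal sketch under these assumptions, not the thesis code; the mask and image are assumed to be row-major arrays of size w x h.

```c
#include <math.h>

typedef struct {
    float area, perimeter, extent, compactness, irregularity, texture;
} ShapeTexture;

/* Returns non-zero if (x, y) is inside the image and belongs to the object. */
static int on_object(const unsigned char *mask, int w, int h, int x, int y)
{
    return x >= 0 && x < w && y >= 0 && y < h && mask[y * w + x] != 0;
}

ShapeTexture describe_object(const unsigned char *mask, const unsigned char *img,
                             int w, int h)
{
    const float PI = 3.14159265f;
    const int dx[4] = { -1, 1, 0, 0 }, dy[4] = { 0, 0, -1, 1 };
    ShapeTexture d = { 0, 0, 0, 0, 0, 0 };
    int minx = w, miny = h, maxx = -1, maxy = -1;
    double tex_sum = 0.0;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            if (!mask[y * w + x]) continue;
            d.area += 1.0f;
            if (x < minx) minx = x;
            if (x > maxx) maxx = x;
            if (y < miny) miny = y;
            if (y > maxy) maxy = y;

            /* contour point: at least one 4-neighbour lies outside the object */
            if (!on_object(mask, w, h, x - 1, y) || !on_object(mask, w, h, x + 1, y) ||
                !on_object(mask, w, h, x, y - 1) || !on_object(mask, w, h, x, y + 1))
                d.perimeter += 1.0f;

            /* Eq. (3.1): average gray-level difference to the 4-neighbourhood (d = 1) */
            double diff = 0.0;
            int cnt = 0;
            for (int k = 0; k < 4; k++) {
                int nx = x + dx[k], ny = y + dy[k];
                if (nx >= 0 && nx < w && ny >= 0 && ny < h) {
                    diff += fabs((double)img[y * w + x] - (double)img[ny * w + nx]);
                    cnt++;
                }
            }
            if (cnt > 0) tex_sum += diff / cnt;
        }
    }
    if (d.area > 0.0f) {
        float W = (float)(maxx - minx + 1), H = (float)(maxy - miny + 1);
        d.extent       = H / W;                                            /* e_i = H_i / W_i       */
        d.compactness  = d.area / (W * H);                                 /* c_i = A_i / (H_i W_i) */
        d.irregularity = d.perimeter * d.perimeter / (4.0f * PI * d.area); /* r_i = P_i^2/(4 pi A_i)*/
        d.texture      = (float)(tex_sum / d.area);                        /* mu_t(O_i), Eq. (3.2)  */
    }
    return d;
}
```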
77 Summary 51 Camera motion: Basic camera motions are pan, zoom, tilt. In practice, as a compromise between complexity and flexibility, 2-D affine motion models are used to estimate camera motion (cf. Page 6 of Section 1.3). Why not use color? Some video analysis systems rely heavily on color features when extracting video content. In general, luminance is a better detector of small details and chrominance performs better in rendering coarser structures and areas. In this thesis, the analysis system does not rely on color for the following reasons: first, color processing requires high computation and memory which can be critical for many applications. Second, color data may not be available, as is often the case in video surveillance, especially at night or under low-light conditions. Third, color when objects are small or when color variations are high is not useful. In video retrieval, the user is often asked to specify color features and the use of color causes difficulties: first, the human memory of color is not very accurate and absolute color is difficult to discriminate and describe quantitatively. Second, the perceived color of an object depends on the background, the illumination conditions, and monitor display settings. Third, the judgment of the perceived color is not the same as the color data represented in computers or in the HVS. Therefore, a user can request a perceived color different from the computer-recorded color. 3.6 Summary and outlook Summary This Chapter introduces a method to extract meaningful video objects and their low-level features. The method consists of four steps: motion-detection-based object segmentation (Chapter 4), object-based motion estimation (Chapter 5), and object tracking (Chapter 6) based on a non-linear combination of spatio-temporal features. State-of-the-art studies show that object segmentation and tracking is a difficult task, particularly in the presence of artifacts and articulated objects. The methods of the proposed video analysis system are tested using critical sequences and under different environments using analog, encoded (e.g., MPEG-2), and noisy video (see Section 4.7, 5.5, and 6.4). Evaluations of the algorithms show reliable and stable object and feature extraction throughout video shots. In the presence of segmentation errors, such as object merging or multi-object occlusion, the method accounts for these errors and gives reliable results. The proposed system performs region merging based on temporal coherency and matching of objects rather than on single local features. The proposed method extracts meaningful video objects which can be used to index video based on flexible objects (Section 3.5) or to detect events related to ob-
jects for semantic-oriented video interpretation and representation. In this context, the usefulness of the proposed object and feature extraction will be shown in Chapter 7. To focus on meaningful objects, the proposed video analysis system needs a background image. A background image is available in surveillance applications. In other applications, a background update method has to be used which must adapt to different environments (cf. Page 6 of Section 1.3).
Outlook
In a multimedia network, video data is encoded (e.g., using MPEG-2) and either transmitted to a receiver (e.g., a TV) or stored in a video database. Effective coding, receiver-based post-processing, and retrieval of video data require video analysis techniques. For example, effective coding requires motion and object data [136]. In a receiver, motion information is needed for advanced video post-processing, such as image interpolation [120]. Effective retrieval of large video databases requires effective analysis of video content [129, 8]. In a multimedia network, several models of the use of video analysis are possible (Fig. 3.4): 1) different video analysis for the encoder and the receiver, 2) the same video analysis for both encoder and receiver, and 3) cooperative video analysis. For example, motion information extracted for coding can be used to support retrieval or post-processing techniques. Studies show that MPEG-2 motion vectors are not accurate enough for receiver-based video post-processing [9, 21]. These studies suggest that the use of the second model of Fig. 3.4 may not produce interesting results. In this chapter, we have proposed a video analysis system for the first model of Fig. 3.4. An interesting subject of research is the integration of extracted coding-related video content to support the video analysis for a video retrieval or surveillance application (the third model of Fig. 3.4).
Figure 3.4: Models of video analysis in a multimedia network: 1) separate video analysis for encoding and for processing; 2) the same video analysis for both, e.g., reusing MPEG vectors or segmented objects; 3) combined video analysis for coding and processing, e.g., MPEG-vector-based object segmentation.
79 Chapter 4 Object Segmentation 4.1 Motivation Many advanced video applications require the extraction of objects and their features. Examples are object-based motion estimation, video coding, and video surveillance. Object segmentation is, therefore, an active field of research that has produced a large variety of segmentation methods [119, 129, 11, 128, 127]. Each method has emphasis on different issues. Some methods are computationally expensive but give, in general, accurate results and others have low computation but fail to provide reliable segmentation. Few of the methods are adequately tested particularly on a large number of video shots and are evaluated throughout large shots. Furthermore, many methods work only if the parameters are fine tuned for various sequences by experts. A drawback common to most methods is that they are not tested on noisy images and images with artifacts. An object segmentation algorithm classifies the pixels of a video image into a certain number of classes that are homogeneous with respect to some characteristic (e.g., texture or motion). It aggregates image pixels into objects. Some methods focus on color features and others on motion features. Some methods combine various features aiming at better results. The use of more features does not, however, guarantee better result since some features can become erroneous or noisy and complicate the achievement of a good solution. The objective in this section is to propose an automated modular object segmentation method that stays stable throughout an image sequence. This method uses a small number of features for segmentation, but focuses on their robustness to varying image conditions such as noise. This foregoes precise segmentation such as at object boundaries. This interpretation of segmentation is most appropriate to applications such as surveillance and video database retrieval. In surveillance applications, robustness with respect to varying image and object conditions is of more concern than
80 54 Object segmentation accurate segmentation. In object-based retrieval, the detailed outline of objects is often not necessary but the semantic meaning of these objects is important. 4.2 Overall approach This proposed segmentation method consists of simple but effective tasks, some of which are based on motion and object contour information. Segmentation is realized in four steps (Fig. 4.1): motion-detection-based binarization of the input gray-level images, morphological edge detection, contour analysis and tracking, and object labeling. The critical task is the motion-based binarization which must stay reliable throughout the video shot. Here, the algorithm memorizes previously detected motion to adapt current motion detection. Edge detection is performed by novel morphological operations with a significantly reduced number of computations compared to traditional morphological operations. The advantage of morphological detection is that it produces gap-free and single-pixel-wide edges without need for post-processing. Contour analysis transforms edges into contours and uses data from previous frames to adaptively eliminate noisy and small contours. Small contours are only eliminated if they cannot be matched to previously extracted contours, i.e., if a small contour has no corresponding contour in the previous image. Small contours lying completely inside a large contour are merged with that large contour according to a spatial homogeneity criterion, as will be shown in Section The elimination of small contours is spatially adapted to the homogeneity criterion of an object and temporally to corresponding objects in previous images. This is different from methods that delete small contours and objects based on fixed thresholds (see, for example, [61, 119, 118, 85, 70]). This object segmentation method is evaluated in the presence of MPEG-2-coding artifacts, white and impulsive noise, and illumination changes. Its robustness to these artifacts is shown in various simulations (Section 4.7). The computational cost is low and results are reliable. Few parameters are used; these are adjusted automatically to the amount of noise and to the local image content. The result of the segmentation process is a list of objects with descriptors (Section 3.5) to be used for further object-based video processing. To reduce storage space, object and contour points are compressed using a differential run-length code. 4.3 Motion detection In a real video scene, there are, generally, several objects which are moving differently against a background. Motion plays a fundamental role in segmentation by the
Figure 4.1: Four-step object segmentation. (a) Block diagram: input images are binarized (motion detection), followed by morphological edge detection, contour analysis, and object labeling (contour filling). (b) Original image. (c) Binarization. (d) Edge detection: gap-free edges. (e) Contour analysis: small-contour and noise reduction. (f) Object labeling: objects with unique labels.
HVS. Motion information can be extracted by motion estimation or motion detection. Motion estimation computes motion vectors using successive images, and points with similar motion are grouped into objects. There are several drawbacks to using motion estimation for segmentation. First, most motion estimation techniques tend to fail at object boundaries. Second, motion estimation techniques are, generally, too computationally expensive to serve real-time applications. Motion detection aims at finding which pixels of an image have moved in order to group them into objects. Motion can be detected based on inter-frame differencing followed by thresholding. The problem, however, is that changes between images occur not only due to object motion but also due to local illumination variations, shadows, reflections, coding, and noise or artifacts. The main goal is to detect image changes that are due to object motion only.
Detection of motion using inter-frame differencing is common in many applications. Common applications include object segmentation, coding, video surveillance (e.g., of intruders or vehicles), satellite imaging (e.g., to measure land erosion), and medical imaging (e.g., to measure cell distribution). It is also used for various TV applications such as noise reduction and image interpolation. This thesis develops a fast motion detection method that is adaptive to noise and robust to artifacts and local illumination changes. The proposed method uses a thresholding technique to reduce to a minimum the typical errors of motion detection, for instance, errors associated with shadows and reflections. Performance of the proposed method will be shown against other methods.
4.3.1 Related work
Motion detection methods often use a reference image R(n). R(n) can be a background image or any successive image of a sequence I(n ± k) 1. Assume that the camera is static or the video images are stabilized (cf. Page 6 of Section 1.3). Assume that background changes are much weaker compared to object changes, and that moving objects can be detected by thresholding a difference 2 image D(n) generated by subtracting the current image I(n) from the reference image R(n). The value of a pixel of D(n), D(i, j, n), can be expressed as:
D(i, j, n) = LP_{(i,j) ∈ W} ( |I(i, j, n) - R(i, j, n)| )    (4.1)
where W describes a neighborhood of the current pixel and LP is a low-pass filter. Large values in the difference map indicate locations of significant change. All pixels
1 Depending on the application, motion detection can be performed between images in the short-term (e.g., k = ±1), medium-term (e.g., k = ±3), or long-term (e.g., k = ±10).
2 The difference is a map indicating the amount and sign of changes for each pixel.
above a threshold are classified as changing. This results in a binary image, B(n), representing objects against the background.
In [135], the image difference between two successive images is filtered by a 2-D median filter followed by deletion of small regions. This method is not robust to noise and produces objects whose outline deviates significantly from the real boundaries. In [49], the difference image is low-pass filtered, thresholded, and post-processed by a 5×5 median filter. Changed regions that are too small are removed, and all unchanged regions lying inside a changed region are merged into the changed region so that holes are closed.
Much work based on statistical hypothesis tests and Bayesian formulations has been done to make differencing-based motion detection more robust 3. In [1], a statistical, model-based technique is used. This method computes a global threshold to segment the difference image into moving and non-moving regions. This is done according to a model of the noise probability density function of the difference image. Detection is refined by the maximum a posteriori probability (MAP) criterion. Despite refinement, over-segmented images are often produced. Moreover, MAP-based techniques incur a large computational cost. The method in [145] uses a background image and a statistical motion detection method [1] to segment objects. The parameters used need to be manually adjusted to account for noise, artifacts, and illumination changes. The method in [84] improves on the accuracy of the motion detection method introduced in [135] by a local adaptive relaxation technique that smoothes boundaries. This method considers previous masks and assumes that pixels that were detected as moving in previous images should be classified as moving in the current image. This introduces some sort of temporal coherence. This method misses, however, relevant objects and produces inaccurate detection (see Section 4.7).
Because of scene complexity, illumination variations, noise, and artifacts, accurate motion detection remains a challenge. Three types of detection errors need investigation. The first type of error occurs when pixels are misclassified because of noise. When misclassified pixels lie between objects, so that objects become connected, or when the image is overlaid with strong noise, this misclassification produces errors that complicate subsequent processing steps. The second type occurs when objects have shadows. The third type of error occurs when objects and background have similar gray-level patterns. This thesis develops a motion detection method that aims at reducing these types of errors.
3 For reviews cf. [76, 123, 112].
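To fix ideas, the following C sketch implements the generic differencing scheme of Eq. (4.1) against a reference (background) image with a 3x3 averaging low-pass filter, followed by thresholding into a binary mask. It is a minimal sketch, not the proposed method: the thesis' detector additionally uses a maximum filter and a noise- and temporally-adaptive threshold, as described in the following sections; the fixed threshold T here is an assumption.

```c
#include <stdlib.h>
#include <math.h>

/* Minimal sketch of reference-image differencing as in Eq. (4.1):
 * D(n) = LP(|I(n) - R(n)|) with a 3x3 averaging low-pass filter, then a
 * fixed threshold T produces the binary mask B (1 = changed / object). */
void detect_motion(const unsigned char *I, const unsigned char *R,
                   unsigned char *B, int w, int h, float T)
{
    float *D = (float *)calloc((size_t)w * h, sizeof(float));
    if (!D) return;

    /* absolute difference to the reference (background) image */
    for (int i = 0; i < w * h; i++)
        D[i] = fabsf((float)I[i] - (float)R[i]);

    /* 3x3 averaging and thresholding */
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            float sum = 0.0f;
            int cnt = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int nx = x + dx, ny = y + dy;
                    if (nx >= 0 && nx < w && ny >= 0 && ny < h) {
                        sum += D[ny * w + nx];
                        cnt++;
                    }
                }
            B[y * w + x] = (sum / cnt > T) ? 1 : 0;
        }
    }
    free(D);
}
```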
Figure 4.2: Motion detection schemes: (a) original image, (b) motion detection using a successive image, (c) motion detection using a background image.
4.3.2 A memory-based motion detection method
As discussed earlier, motion can be detected either between successive images or between an image and a background image of the scene. A major difficulty with techniques using consecutive images is that they depend on inter-image motion being present between every image pair. Any moving object (or part of an object in the case of articulated objects) that becomes stationary or uncovered is erroneously merged with the background. Furthermore, temporal changes between two successive images may be detected in areas that do not belong to objects but are close to object boundaries and in uncovered background, as shown in Fig. 4.2(b). In addition, removed or deposited objects cannot be correctly detected using successive images.
This thesis develops an effective fast motion detection method based on image differencing with respect to a background image. The background image can be updated using a background updating technique (cf. Page 6 of Section 1.3). The disadvantage of using a background image is that shadows and reflections of moving objects can be highlighted in the difference image. As will be shown by simulations, the proposed approach successfully reduces the effect of both artifacts.
Global illumination changes can be additive and multiplicative. Assume the images of a video shot are affected by global illumination changes and by Gaussian noise. Then two successive images of the shot are modeled by:
I(n) = S(n) + η(n),  I(n+1) = S(n+1) + η(n+1) + ξ(n+1)    (4.2)
where S(n) and S(n+1) are the projections of the scene into the image plane, η(n) and η(n+1) are additive noise, and ξ(n+1) = a + b S(n+1) represents the additive and multiplicative illumination changes. The constants a and b describe the strength of the global illumination changes. Thus an image difference may include artifacts due to noise and illumination changes.
The basic concept to detect motion between the current image I(n) and the background image R(n) is shown in Fig. 4.3. The method comprises spatial filtering of
the difference image using a 3×3 average filter, a 3×3 maximum filter (Eq. 4.3), and spatio-temporal adaptation based on thresholding.
D(n) = max( LP( |I(n) - R(n)| ) )    (4.3)
where LP is the averaging operator and max the maximum operator.
Figure 4.3: Diagram of the motion detection technique: the absolute difference of I(n) and R(n) passes through a spatial averaging filter and a spatial MAX filter to give D(n); noise estimation (σ_n) and spatio-temporal adaptation using the previous threshold (via a frame delay) produce the threshold T(n) and the binary output B(n).
In real images, the difference I(n) - R(n) includes artifacts, for example, due to noise and illumination changes. To increase spatial accuracy of detection, an average and a maximum filter are used. Averaging causes a linear addition of the correlated true image data, whereas the noise is uncorrelated and is reduced by averaging. Hence, motion detection becomes less sensitive to noise and the difference image becomes smoother. The maximum operator limits motion detection to a neighborhood of the current pixel, causing stability around object boundaries and reducing granular noisy points. To partially compensate for global illumination changes, an adaptation to the difference image is proposed in Section 4.4. To increase temporal stability of detection throughout the video, a memory component is added to the motion detection process as shown next.
Spatio-temporal adaptation
Two main difficulties with traditional motion detection methods which use differencing are 1) they do not distinguish between object motion and other changes, for example, due to background movement as with tree leaves shaking, or illumination changes, and 2) they do not account for changes occurring throughout a long video. Usually a fixed threshold is used for all the images of the video shot. A fixed-threshold method fails when the amount of moving regions changes significantly. To answer these difficulties, this thesis proposes a three-step thresholding method.
1. Adaptation to noise: To adapt the detection to image content and noise, an image-wide spatial threshold, T_n, is estimated using a robust noise-adaptive method (Section 4.4).
2. Quantization: To stabilize thresholding spatio-temporally, this threshold, T_n, is then quantized into m values, giving T_q. This quantization partly compensates for background and local illumination changes and significantly reduces fluctuations of the threshold and hence stabilizes the binary output of the motion detector. Experiments using different quantization levels performed on different video shots suggest that the following three-level quantization is a good choice:
T_q = T_min if T_n ≤ T_min;  T_q = T_mid if T_min < T_n ≤ T_mid;  T_q = T_max otherwise.    (4.4)
Other quantization functions, such as using the middle values of the intervals instead of the limits T_min, T_mid, T_max, can also be used.
3. Temporal integration: To adapt motion detection to temporal changes throughout a video shot, the following temporal integration (memory) function is proposed:
T(n) = T_min if T_q ≤ T_min;  T(n) = T(n-1) if T_q < T(n-1);  T(n) = T_q otherwise.    (4.5)
This function examines whether there has been a significant motion change in the current image, i.e., T_q > T(n-1), and, if so, the current threshold T_q is selected. If no significant temporal change is detected, i.e., T_q < T(n-1), the previous threshold T(n-1) is selected. When no or little motion is detected, i.e., T_q ≤ T_min, T_min is selected. This temporal integration stabilizes the detection of binary object masks throughout a video shot. It favors changes due to strong motion and rejects changes due to small variations or artifacts. Note that other temporal integration functions, such as integration over more than one image, could also be used.
4.3.3 Results and comparison
In this section, results of the proposed motion detection method are given and compared to a statistical motion detection method [145] which builds on the well-known method in [1]. This method determines a global threshold for binarization by a statistical hypothesis test using a noise model. It compares the statistical behavior of a neighborhood of a pixel to an assumed noise probability density function. This comparison is done using a significance test that needs the specification of a threshold α which represents the significance level. This method works very well in images where the noise fits the assumed model. However, various experiments using this method show that parameters of the significance test and noise model need to be tuned for
87 Thresholding 61 different images especially when illumination changes, shadows, and other artifacts are present in the scene. The method provides no indication as to how to adapt these parameters to the input images 4. As can be seen in Fig. 4.4 and 4.5, the proposed method displays better robustness especially in images with local illumination change (for example, when the sun shines, Fig. 4.5, objects enter the scene, Fig. 4.4, and doors are opened, Fig. 4.5). Also, the proposed method remains reliable in the presence of noise because it compensates for noise by adapting its parameter automatically to the amount of noise estimated in the image. Another important factor is the introduction of the temporal adaptation which makes the procedure reliable throughout the whole video shot. An additional advantage of the proposed method is that it has a low computational cost. For example, it requires an average of 0.1 seconds compared to 0.25 seconds for the reference method on a SUN-SPARC MHz. 4.4 Thresholding for motion detection Introduction Thresholding methods 5 for segmentation are useful when separating objects from a background, or discriminating objects from other objects that have distinct gray levels. This is also the case with difference images. Threshold values are critical for motion detection. A low threshold will cause either over-segmentation 6 or noisy segmentation. A high threshold suppresses significant changes due to object motion and causes either under-segmentation or incomplete objects. In both cases the shape of the object can be grossly affected. Therefore, a threshold must be chosen automatically to adapt to image changes. In this section, a non-parametric robust thresholding operator is proposed which adapts to image noise. The proposed method is shown to be robust under various conditions Review of thresholding methods A thresholding method classifies, depending on a threshold T, each pixel D(i, j, n) in a difference image D(n) as belonging to an object and labeled white in a binary 4 In the following experiments, fixed parameters are used, i.e., the significance level α =0.1 being the probability of rejecting the true hypothesis that at a specific pixel there are no moving objects, and the noise standard deviation σ n = For thorough surveys see [117, 137]. 6 Over-segmentation is common to most motion-based segmentation because of the aperture problem, i.e., different physical motions are indistinguishable [4, 83].
Figure 4.4: Motion detection comparison for the Survey sequence: original images I(42) and I(145), background image, proposed method, and reference method.
Figure 4.5: Motion detection comparison for the Stair and Hall sequences: original image, background image, proposed method, and reference method for each sequence.
image B(n) or to the background and labeled black (Eq. 4.6).
B(i, j, n) = 1 if D(i, j, n) > T, and 0 if D(i, j, n) ≤ T.    (4.6)
Thresholding methods can be divided into global, local, and dynamic methods. In global methods, a gray-level image is thresholded based on a single threshold T. In local methods, the image is partitioned into sub-images and each sub-image is thresholded by a single threshold. In dynamic methods, T depends on the spatial coordinates of the pixel to which it is applied. The study in [2] further classifies thresholding methods into parametric and non-parametric. Based on the gray-level distribution of the image, parametric approaches try to estimate the parameters of the image probability density function. Such estimation is computationally expensive. Non-parametric approaches try to find the optimal threshold based on some criteria such as variance or entropy. Such methods have been proven to be more effective than parametric methods [2]. Dynamic and parametric approaches have high computational costs [41, 17]. Local approaches are more sensitive to noise, artifacts, and illumination changes. For effective fast threshold determination, the combination of global and local criteria is needed.
There are several strategies to determine a threshold for binarization of an intensity image [137, 117, 41, 17, 94, 99, 2]. Most methods make assumptions about the histogram of the intensity signal (e.g., some methods assume a Gaussian distribution). The most common thresholding methods are based on histograms. For a bimodal histogram it is easy to fix a threshold between the local maxima. Most real images do not, however, have a bimodal histogram. A difference image, moreover, differs from intensity images, and thresholding methods for intensity images may not be appropriate for difference images. There are few thresholding methods for motion detection. The methods presented in [112] have some drawbacks. First, fine tuning of parameters, such as window size, is required. Second, adaptation to image noise is problematic. In addition, the methods are computationally expensive. Finally, these methods do not consider the problem of adapting the threshold throughout the image sequence. To overcome these problems, a non-parametric thresholding method is proposed which uses both global (block-based) and local (block-histogram-based) decision criteria. In doing so, the threshold is adapted to the image contents and can change throughout the image sequence (e.g., noisy and MPEG-2 images, Fig. 4.22).
4.4.3 Artifact-adaptive thresholding
Fig. 4.6 gives an overview of the proposed thresholding method for motion detection. The image is first divided into K equal blocks of size W × H. For each block,
the histogram is computed and divided into L equal partitions or intervals. For each histogram partition, the most frequent gray level g_pl, l ∈ {1, ..., L}, is fixed. This is done to take small regions, noise, and illumination changes into account. To take global image content into the thresholding function, an average gray level µ_k, k ∈ {1, ..., K}, of each block is calculated. Finally, the threshold T_g is calculated by averaging all the g_pl and all the µ_k for all the K blocks (Eq. 4.7). Simulations show this thresholding function is reliable with respect to image changes:
T_g = [ Σ_{k=1}^{K} ( Σ_{l=1}^{L} g_pl + µ_k ) ] / (K L + K)    (4.7)
with
µ_k = [ Σ_{i=1}^{W} Σ_{j=1}^{H} D(i, j) ] / (W H).    (4.8)
Figure 4.6: Extraction of the image-global threshold: the gray-level image is divided into K blocks (K averages, adapting to global object changes such as contrast change), and each block histogram is divided into L intervals (K×L maxima, the most frequent gray level per interval, adapting to local changes and to small gray-level regions).
Adaptation to image noise
The threshold T_g is adapted to the amount of image noise as follows: if noise is detected, the threshold T_g is set higher accordingly. This adaptation is a function of the estimated noise standard deviation σ_n, taking into consideration that low sensitivity to small σ_n is needed. The following positive-quadratic weighting function (Fig. 4.7) is used:
T_n = T_g + a σ_n^2,    (4.9)
where a < 1 and a depends on the maximum, practically assumed, noise variance (e.g., max(σ_n^2) = 25).
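A minimal C sketch of the block/histogram computation of Eqs. (4.7)-(4.9) is given below. The block size, the number of histogram partitions L, and the constant a are passed as parameters; their values and the exact block partitioning are illustrative assumptions, and σ_n is assumed to come from the system's noise estimator.

```c
/* Minimal sketch of the block/histogram threshold of Eqs. (4.7)-(4.9).
 * D: difference image (values 0..255) of size w x h, scanned in bw x bh blocks;
 * L: number of histogram partitions per block; a, sigma_n: as in Eq. (4.9). */
float adaptive_threshold(const unsigned char *D, int w, int h,
                         int bw, int bh, int L, float a, float sigma_n)
{
    double sum_gp = 0.0, sum_mu = 0.0;   /* sums of the g_pl and of the block means mu_k */
    long   n_gp = 0, n_blocks = 0;

    for (int by = 0; by < h; by += bh) {
        for (int bx = 0; bx < w; bx += bw) {
            long   hist[256] = { 0 };
            double block_sum = 0.0;
            long   block_cnt = 0;

            for (int y = by; y < by + bh && y < h; y++)
                for (int x = bx; x < bx + bw && x < w; x++) {
                    hist[D[y * w + x]]++;
                    block_sum += D[y * w + x];
                    block_cnt++;
                }
            if (block_cnt == 0) continue;

            /* most frequent gray level g_pl in each of the L histogram partitions */
            int step = 256 / L;
            for (int l = 0; l < L; l++) {
                int lo = l * step;
                int hi = (l == L - 1) ? 256 : (l + 1) * step;
                int best = lo;
                for (int g = lo; g < hi; g++)
                    if (hist[g] > hist[best]) best = g;
                sum_gp += best;
                n_gp++;
            }
            sum_mu += block_sum / block_cnt;   /* mu_k, Eq. (4.8) */
            n_blocks++;
        }
    }
    if (n_gp + n_blocks == 0) return 0.0f;
    /* Eq. (4.7): average of all g_pl and all mu_k; Eq. (4.9): noise adaptation */
    float Tg = (float)((sum_gp + sum_mu) / (double)(n_gp + n_blocks));
    return Tg + a * sigma_n * sigma_n;
}
```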
Figure 4.7: Weighting functions T(σ_n) used in implementations: SW represents static, LW linear, PQW positive quadratic, NQW negative quadratic, and IQW inverse quadratic weighting.
Adaptation to local changes by weighting the difference image
To account for local image structure and changes, especially at object borders, the average of the differences in a block k, µ_k as in Eq. 4.7, is weighted using a monotonically increasing function as follows:
µ_k^n = µ_k + θ µ_k,  θ = θ_max - b (µ_k / µ_max)    (4.10)
where µ_max is the maximum average, b < 1, and θ_max < 0.5. When µ_k is high, meaning the image difference is high, the block k is assumed to be inside the object and the threshold T_g should be slightly increased (by setting θ low). When µ_k is low, which can be due to artifacts or when block k is at the object boundary, the threshold T_g should increase (by setting θ high). This means a bias is given to higher differences.
The constants a, b, and θ_max are experimentally determined and were fixed for all the simulations. Different values of these thresholds do not affect the performance of the whole segmentation algorithm. In a few cases, they affect the accuracy of object boundaries, which is not of importance for the intended applications of the proposed binarization.
4.4.4 Experimental results
The proposed thresholding procedure has been compared to the thresholding methods [99, 2] which have been used in various image processing systems (for example, [36]).
Experimental results

The proposed thresholding procedure has been compared to the thresholding methods [99, 2], which have been used in various image processing systems (for example, [36]). They provide a good compromise between quality of binarization and computational cost (see Table III in [137]). Simulations show that the proposed method outperforms the reference methods in the case of noise and illumination changes in the image. For images with no change due to object motion, the proposed method is more stable for motion detection applications (Fig. 4.8). To give a fair comparison, all simulation results in this section do not include temporal adaptation of the threshold. On average, the proposed algorithm needs 0.05 seconds on a SUN-SPARC MHz. The method in [99] needs on average 0.05 seconds and the method in [2] needs 0.67 seconds. Fig. summarizes comparative results of these methods. As can be seen, the proposed method separates the bright background and dark objects. Further, the algorithm was tested on noisy images and MPEG-2 encoded images, showing that it remains robust (Fig. 4.10). The good performance of the proposed thresholding function comes from the fact that it takes into account all areas of the image through its block partition and the division of each block into sub-regions. This takes into account all gray levels and not just the lower or higher gray levels, which stabilizes the algorithm. Furthermore, adaptation to image noise and weighting of the difference signal stabilize the thresholding function.

Since a binary image resulting from thresholding for motion detection may contain some artifacts, many motion-detection-based segmentation techniques have a post-processing step, usually performed by non-linear filters, such as median or morphological opening and closing. The effectiveness of such operations within the proposed object segmentation method will be discussed in Section 4.5.4.

4.5 Morphological operations

Detection of object motion results in binary images which indicate the contours and the object masks. In this section, a fast edge detection method and new operational rules for binary morphological erosion and dilation of reduced complexity are proposed.

Introduction

The basic idea of a morphological operation is to analyze and manipulate the structure of an image by passing a structuring element over the image and marking the locations where it fits. In mathematical morphology, neighborhoods are, therefore, defined by the structuring element, i.e., the shape of the structuring element determines the shape of the neighborhood in the image. Structuring elements are characterized by
Figure 4.8: Thresholding comparison for the Hall sequence (two examples; panels: original image, difference image, proposed method, reference method [99]).
Figure 4.9: Thresholding comparison for the Stair sequence (panels: original image, difference image, proposed method, reference method [99]).

Figure 4.10: Thresholding comparison in the presence of noise. (a) Difference image; (b) reference binarization with noise adaptation [99]; (c) proposed binarization.
a well-defined shape (such as line, segment, or ball), size, and origin. The hardware complexity of implementing morphological operations depends on the size of the structuring elements. The complexity increases even exponentially in some cases. Known hardware implementations of morphological operations are capable of processing structuring elements only up to 3×3 pixels [63]. If higher-order structuring elements are needed, they are decomposed into smaller elements. One decomposition strategy is, for example, to represent the structuring element as successive dilations of smaller structuring elements. This is known as the chain rule for dilation [69]. It should be stated that not all structuring elements can be decomposed.

Figure 4.11: Dilation and erosion of a binary input image by a kernel (note that they are applied here to the black pixels).

The basic morphological operations are dilation and erosion (Fig. 4.11). These operations are expressed by a kernel operating on an input image. Erosion and dilation work conceptually by translating the structuring element to various points in the input image and examining the intersection between the translated kernel coordinates and the image coordinates. When specific conditions are met, the image content is manipulated using the following rules (for set-theoretical definitions see [69, 64]):

Standard dilation: Move a kernel K line-wise over the binary image B. If the origin of K intersects a white pixel in B, then set all pixels covered by K in B to white if the respective pixel in K is set white.

Standard erosion: Move a kernel K line-wise over the binary image B. If the origin of K intersects a white pixel in B and if all pixels of K intersect white pixels in B (i.e., K fits), then keep the pixel of B that intersects the origin of K white. Otherwise set that pixel to black.

The dilation is an expansion operator that enlarges objects into the background. The erosion operation is a thinning operator that shrinks objects. By applying erosion to an image, narrow regions can be eliminated while wider ones are thinned. In order to restore regions, dilation can be applied using a mask of the same size. Erosion and dilation can be combined to solve specific filtering tasks. Widely used combinations are opening, closing, and edge detection.
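Before turning to these combinations, the two standard rules above can be illustrated with a minimal sketch (white = 1, black = 0; NumPy, the kernel layout, and the default origin are assumptions, and the loops are deliberately unoptimized):

    import numpy as np

    def standard_erosion(B, K, origin=(1, 1)):
        # Keep a white pixel only where the kernel K fits entirely on white pixels of B.
        H, W = B.shape; kh, kw = K.shape; oy, ox = origin
        out = np.zeros_like(B)
        for y in range(H):
            for x in range(W):
                fits = bool(B[y, x])
                for dy in range(kh):
                    for dx in range(kw):
                        if K[dy, dx]:
                            yy, xx = y + dy - oy, x + dx - ox
                            if not (0 <= yy < H and 0 <= xx < W and B[yy, xx]):
                                fits = False
                out[y, x] = 1 if fits else 0
        return out

    def standard_dilation(B, K, origin=(1, 1)):
        # Expand every white pixel of B by stamping the kernel K around it.
        H, W = B.shape; kh, kw = K.shape; oy, ox = origin
        out = np.zeros_like(B)
        for y in range(H):
            for x in range(W):
                if B[y, x]:
                    for dy in range(kh):
                        for dx in range(kw):
                            yy, xx = y + dy - oy, x + dx - ox
                            if K[dy, dx] and 0 <= yy < H and 0 <= xx < W:
                                out[yy, xx] = 1
        return out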
Opening (erosion followed by dilation) filters details and simplifies images by rounding corners from inside the object where the kernel used fits. Closing (dilation followed by erosion) protects coarse structures, closes small gaps, and rounds concave corners.

Morphological operations are very effective for detection of edges in a binary image B, where white pixels denote uniform regions and black pixels denote region boundaries [64, 69]. Usually, the following detectors are used:

E = B \setminus E[B, K_{m \times m}], \quad E = D[B, K_{m \times m}] \setminus B, \quad \text{or} \quad E = D[B, K_{m \times m}] \setminus E[B, K_{m \times m}].   (4.11)

B is the binary image in which white pixels denote uniform regions and black pixels denote region boundaries. E is the edge image. E (D) is the erosion (dilation) operator (erosion is often represented by ⊖ and dilation by ⊕). K_{m×m} is the m×m erosion (dilation) kernel used. \setminus denotes set-theoretical subtraction.

Motivation for new operations

Motivation for new erosion and dilation   Standard morphological erosion and dilation are defined around an origin of a structuring element. The position of this origin is crucial for the detection of edges. For each step of an erosion or dilation, one pixel is set (at a time) in B. To achieve precise edges of single-pixel width, 3×3 kernels (defined around the origin) are used (kernel examples are in Fig. 4.12): when a 3×3 cross kernel is used, an incomplete corner detection is obtained (Fig. 4.13); a 3×3 square kernel gives complete edges but requires more computation (which grows rapidly with increased input data, Fig. 4.17(a)); and the use of a 2×2 square kernel produces incomplete edges (Fig. 4.14).

Figure 4.12: A 3×3 square, a 3×3 cross, and a 2×2 square kernel.

To avoid these drawbacks, new operational rules for edge detection by erosion or dilation are proposed. A fixed-size (2×2 square) kernel is used and the rules set all four pixels of this kernel at a time in B. For edge detection based on the new rules, accurate complete edges are achieved and the computational cost is significantly reduced.

Motivation for conditional operations   When extracting binary images from gray-level ones, the binary images are often enhanced by applying morphological
operations which are effective and efficient and, therefore, widely used [51]. Applying standard morphological operations for enhancement, however, can connect some object areas or erode some important information. This thesis contributes basic definitions of conditional morphological operations to solve this problem.

New morphological operations

Erosion

Definition: proposed erosion   Move the 2×2 square kernel line-wise over the binary image B. If at least one of the four pixels inside the kernel is black, then set all four pixels in the output image E to black. If all four pixels inside the 2×2 kernel are white, then set all four pixels in E (at a time) to white if they were not eroded previously.

Set-theoretical formulation   An advantage of the proposed erosion is that it can be formally defined based on set-theoretical intersection, union, and translation in analogy to the formal definitions of the standard erosion [64]. The standard erosion satisfies the following property [64]: the erosion of an image by the union of kernels is equivalent to erosion by each kernel independently and then intersecting the results (Eq. 4.12). So given an image A and kernels B and C in R²,

E_s[A, B \cup C] = E_s[A, B] \cap E_s[A, C]   (4.12)

where E_s denotes the standard erosion. The proposed erosion is then defined as follows:

E_p[A, K_{2 \times 2}] = E_s[A, S_{3 \times 3}] = E_s[A, K^{ul}_{2 \times 2} \cup K^{ur}_{2 \times 2} \cup K^{ll}_{2 \times 2} \cup K^{lr}_{2 \times 2}] = E_s[A, K^{ul}_{2 \times 2}] \cap E_s[A, K^{ur}_{2 \times 2}] \cap E_s[A, K^{ll}_{2 \times 2}] \cap E_s[A, K^{lr}_{2 \times 2}]   (4.13)

where E_p denotes the proposed erosion, S_{3×3} is a 3×3 square kernel, and K^{ul}_{2×2} is a 2×2 kernel with origin at the upper left (equivalently upper right, lower left, lower right) corner (cf. Fig. 4.12). Thus the proposed erosion gives the same results as the standard erosion when using a 3×3 square kernel. However, the proposed erosion is significantly faster. Using a 3×3 cross kernel with the standard erosion accelerates processing but gives incomplete results, especially at corners (Fig. 4.13).

Figure 4.13: Proposed versus standard erosion (standard erosion uses a 3×3 cross kernel). Panels: original image, erosion, proposed detection, standard detection.
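A direct, unoptimized sketch of the proposed erosion rule, assuming white = 1 and black = 0 in a NumPy array; it illustrates the rule as stated above and is not the thesis implementation:

    import numpy as np

    def proposed_erosion(B):
        # Slide a 2x2 window line-wise; where it does not fit (contains a black
        # pixel) all four output pixels are set black, and a pixel once eroded
        # stays black even if a later window would set it white.
        H, W = B.shape
        E = np.zeros_like(B)
        eroded = np.zeros((H, W), dtype=bool)
        for y in range(H - 1):
            for x in range(W - 1):
                sl = (slice(y, y + 2), slice(x, x + 2))
                if B[sl].all():                              # kernel fits: all four white
                    E[sl] = np.where(eroded[sl], 0, 1)
                else:                                        # at least one black pixel
                    E[sl] = 0
                    eroded[sl] = True
        return E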
Dilation

Definition: proposed dilation   Move the 2×2 kernel line-wise over the binary image B. If at least one of the four binary-image pixels inside the kernel is white, then set all four pixels (at a time) in the output image E to white.

Set-theoretical formulation   In analogy to the standard erosion, the standard dilation satisfies the following property [64]: the dilation of an image by the union of kernels corresponds to dilation by each kernel and then taking the union of the resulting images (Eq. 4.14). This means that given an image A and kernels B and C in R²,

D_s[A, B \cup C] = D_s[A, B] \cup D_s[A, C]   (4.14)

where D_s denotes the standard dilation. The proposed dilation is then given by:

D_p[A, K_{2 \times 2}] = D_s[A, S_{3 \times 3}] = D_s[A, K^{ul}_{2 \times 2} \cup K^{ur}_{2 \times 2} \cup K^{ll}_{2 \times 2} \cup K^{lr}_{2 \times 2}] = D_s[A, K^{ul}_{2 \times 2}] \cup D_s[A, K^{ur}_{2 \times 2}] \cup D_s[A, K^{ll}_{2 \times 2}] \cup D_s[A, K^{lr}_{2 \times 2}]   (4.15)

where D_p denotes the new dilation.

Figure 4.14: Proposed versus standard dilation (standard dilation uses a 2×2 kernel with origin at the upper left pixel). Panels: original image, proposed dilation, standard dilation.

Binary edge detection   In this section, the need to use two operations (Eq. 4.11) for binary morphological edge detection is questioned. When detecting binary edges, erosion and subtraction can be performed implicitly. Such an implicit detection is proposed in the next definition to reduce the complexity of morphological edge detection.
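A corresponding sketch of the proposed dilation rule (same assumptions as the erosion sketch above):

    import numpy as np

    def proposed_dilation(B):
        # Wherever the sliding 2x2 window contains a white pixel, all four
        # output pixels of that window are set to white.
        H, W = B.shape
        E = np.zeros_like(B)
        for y in range(H - 1):
            for x in range(W - 1):
                if B[y:y+2, x:x+2].any():
                    E[y:y+2, x:x+2] = 1
        return E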
Definition: proposed edge detection   Move the 2×2 kernel over the binary image B. If at least one of the four pixels of the 2×2 kernel is black, then set the four pixels at the same positions in the output edge image E to white if their equivalent pixels in B are white. Otherwise set the pixels to black.

If the 2×2 kernel fits in a white area it is implicitly eroded, but edges (where the kernel does not fit) are kept. Fig. 4.17(b) gives a complexity comparison of the new binary edge detection, edge detection with the proposed erosion, and edge detection using the standard erosion (with a 3×3 square kernel). As shown, the cost of edge detection is significantly reduced.

Conditional erosion and dilation   Usually object segmentation requires post-processing to simplify binary images. The most popular post-processing filters are the median and morphological filters such as opening or closing, because of their efficiency. The difficulty with these, however, is that they may connect or disconnect objects. To support morphological filters, this thesis suggests conditional dilation and erosion for the purpose of object segmentation. They are topology-preserving filters in the sense that they are applied only if specific conditions are met.

Conditional erosion   Using conditional erosion, a white pixel is eroded only if it has at least three black neighbors. This ensures that objects are not connected. It is performed mainly at object boundaries. The basic idea is that if the majority of the kernel's 2×2 points are black then this is most probably a border point and can be eroded. This is useful when holes inside the object have to be kept.

Conditional dilation   With conditional dilation, a black pixel is set to white if the majority of the 2×2 kernel pixels are white. If this condition is met then it is more likely that this pixel is inside an object and not a border pixel. Conditional dilation sets pixels mainly inside the object and stops at object boundaries to avoid connection of neighboring objects. This condition ensures that objects are not connected in the horizontal and vertical directions. In some cases, however, objects may be connected diagonally, as shown in Fig. 4.15, where both pixels become connected and so do the two object regions.

Figure 4.15: Cases where objects are connected using conditional dilation.
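Sketches of the proposed direct edge detection and of conditional dilation, following the definitions above (white = 1, black = 0, NumPy assumed; unoptimized illustrations rather than the thesis code):

    import numpy as np

    def proposed_edge_detection(B):
        # Where the 2x2 window contains at least one black pixel, its white pixels
        # are marked as edges; windows lying entirely in white areas are implicitly
        # eroded (left black in E).
        H, W = B.shape
        E = np.zeros_like(B)
        for y in range(H - 1):
            for x in range(W - 1):
                win = B[y:y+2, x:x+2]
                if not win.all():          # kernel does not fit
                    E[y:y+2, x:x+2] |= win
        return E

    def conditional_dilation(B):
        # A black pixel becomes white only if the majority (at least three) of the
        # pixels in its 2x2 window are white; decisions are taken on the original
        # image so newly set pixels do not propagate.
        H, W = B.shape
        out = B.copy()
        for y in range(H - 1):
            for x in range(W - 1):
                if B[y:y+2, x:x+2].sum() >= 3:
                    out[y:y+2, x:x+2] = 1
        return out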
Figure 4.16: Edge detection comparison. (a) Binary image; (b) Canny edge detection; (c) classical morphological edge detection using a 3×3 cross kernel; (d) proposed morphological edge detection. Note the shape distortion when using the Canny detector. The proposed detection gives more accurate results than the classical morphological detector using a 3×3 cross kernel.

Comparison and discussion

The proposed edge detectors have been compared to gradient-based methods such as the Canny method [31]. The Canny edge detector is a powerful method that is widely used in various imaging systems. The difficulty of using this method is that its parameters need to be tuned for different applications and images. Compared to the Canny edge detector, the proposed methods show higher detection accuracy resulting in better shapes (Fig. 4.16(d)). Better shape accuracy using the Canny method can be achieved when its parameters are tuned accordingly; this is, however, not appropriate for automated video processing, mainly because the Canny detector uses a smoothing filter. In addition, the proposed edge detectors have lower complexity and produce gap-free edges so that no edge linking is necessary. The proposed edge detectors are also significantly faster and give more accurate results (Fig. 4.16(c)) than the classical morphological edge detector using a 3×3 cross kernel.

The proposed morphological edge detectors have the same performance as standard morphological detectors but significantly reduced complexity, as Fig. 4.17 shows. This is confirmed using various natural image data. Fig. 4.17(a) shows that the computational cost using the standard erosion with a 3×3 square kernel grows rapidly with the amount of input data, while the cost of the proposed erosion stays almost constant. Computations can be further reduced by applying the
novel morphological edge detection with implicit erosion (Fig. 4.17(b)).

Figure 4.17: Computational efficiency comparison (time in seconds versus the amount of data, given as the percentage of white pixels). (a) Proposed versus standard erosion; (b) proposed versus standard detection (standard-erosion-based detection, proposed-erosion-based detection, and proposed direct detection).

Morphological post-processing of binary images

A binary image B resulting from a binarization of a gray-level image may contain artifacts, particularly at object boundaries. Many segmentation techniques that use binarization (e.g., for motion detection as in Section 4.3) have a post-processing step, usually performed by non-linear filters, such as median or morphological opening and closing. Non-linear filters are effective and efficient and, therefore, widely used [51]. This thesis examines the usefulness of applying a post-processing filter to the binary image. Erosion, dilation, closing, opening, and a 3×3 median operation were applied to the binary image and the results were compared. The temporal stability of these filters throughout an image sequence has been tested and evaluated. The following conclusions are drawn:
- Erosion can delete some important details and dilation can connect objects.
- Standard opening with a 3×3 cross kernel smoothes the image, but some significant object details can be removed and objects may get disconnected.
- Standard closing performs better smoothing but may connect objects.
- Conditional closing (see Page 74) is significantly faster than standard closing and is more conservative in smoothing. It may connect objects diagonally, as illustrated in Fig. 4.15.

To compensate for the disadvantages of the discussed operations, two solutions were tested:
- Conditional erosion followed by conditional closing.
- Erosion, a 3×3 median filter, and a conditional dilation.

Erosion before closing does not connect objects but filters many details and may change the object shape. Erosion, median, and conditional dilation perform better by preserving edges and corners.

In conclusion, applying smoothing filters can introduce artifacts, remove significant object parts, or disconnect object parts. This complicates subsequent object-based video processing such as object tracking and object-based motion estimation. These effects are more severe when objects are small or when their parts are thin compared to the morphological or median masks used. Use of the above operations is recommended when objects and their connected parts are large. Such information is, however, rarely known a priori. Therefore, this thesis does not apply an explicit post-processing step but implicitly removes noise within the contour tracing procedure, as will be shown in Section 4.6.1.

4.6 Contour-based object labeling

This section deals with extraction of contours from edges (Section 4.6.1) and with labeling of objects based on contours (Section 4.6.2).

Contour tracing

The proposed morphological edge detection (Section 4.5) gives an edge image, E(n), where edges are marked white and points inside the object or in the background are black. The important advantage of morphological edge detection techniques is that they never produce edges with gaps. This facilitates contour tracing. To identify the object boundaries in E(n), the white points belonging to a boundary have to be grouped into one contour, C. An algorithm that groups these points into a contour is called contour tracing. The result of tracing edges in E(n) is a list, C(n), of contours and their features, such as starting point and perimeter. A contour, C ∈ C(n), is a finite set of points, {p_1, …, p_n}, where for every pair of points p_i and p_j in C there exists a sequence of points s = {p_i, …, p_j} such that i) s is contained in C and ii) every pair of successive points in s are neighbors of each other. In an image defined on a rectangular sampling lattice, two types of neighborhoods are distinguished: the 8-neighborhood and the 4-neighborhood. In an 8-neighborhood all eight neighboring points around a point are considered. In a 4-neighborhood only the four neighboring points, right, left, up, and down, are considered. A contour can be represented by the point coordinates or by a chain code. A list of point coordinates is needed for on-line object analysis. A chain code requires less
memory than coordinate representation and is desirable in applications that require storing the contours for later use [101].

Different contour tracing techniques are reported in the literature ([101, 7, 110]). Many methods are developed for specific applications such as pattern recognition. Some are defined to trace contours of simple structure. A commonly used technique is described in [102]. A drawback of this method is that it ignores contours inside other contours, fails to extract contours containing some 8-neighborhoods, and fails in the case of contours with dead branches.

A procedure for tracing complex contours in real images   The proposed procedure aims at tracing contours of complex structure, such as those containing dead or inner branches, as illustrated in Fig. 4.19. The proposed tracing algorithm uses plausibility rules i) to locate the starting point of a contour, ii) to find neighboring points, and iii) to decide whether to select a contour for subsequent processing (Fig. 4.18). The tracing is done in a clockwise manner and the algorithm looks for a neighboring point of a current point in the 8-neighborhood starting at the rightmost neighbors. This rule forces the algorithm to move around the object by looking for the rightmost point and never inside the object (see Rule 2). The algorithm records both the contour chain code and the point coordinates.

Figure 4.18: Proposed contour tracing (locate a starting point in E(n); find its connected neighboring points to form a contour C; perform contour matching and selection against C(n-1) to produce C(n); continue with the next contour).

In the following, let E(n) be the edge image of the original image I(n), C(n-1) the list of contours of the objects of the original image I(n-1), C(n) the list of contours of the objects of the original image I(n), C_c ∈ C(n) the current contour with starting point p_s, P_c the length of C_c, i.e., the number of points of C_c, p_c the current point, p_i an 8-neighbor of it, and p_p its previous neighbor.

Rule 1 - Locating a starting point   Scan the edge image, E(n), from left to right and from top to bottom until a white point p_w is found. p_w must be unvisited (i.e., not reached before) and have at least one unvisited neighbor. If such a point is found,
i) set p_s = p_w, ii) set p_c = p_s, and iii) perform Rule 2. If no starting point is found, end tracing. In case objects contain other objects, the given scanning direction forces the algorithm to trace first the outward and then the inward contours (Fig. 4.19(a)). Note that due to the image scanning direction (from left to right and from top to bottom) the object boundaries always lie to the left of the tracing direction.

Figure 4.19: Illustrating effects of the proposed tracing rules. (a) Rule 1 (original and traced contours, tracing direction); (b) Rule 3 - dead branch; (c) Rule 4 - p_i visited (points marked as visited).

Rule 2 - Finding a neighboring point   The basic idea is to locate the rightmost neighbor of the current point. This ensures that object contours are traced from the outside and tracing never enters branches inside the object. The definition of the rightmost neighbor of a current point depends on the current direction of tracing as defined by the previous and the current points. If, for example, the previous point lies to the left of the current point, then the algorithm looks for a neighboring point, p_i, within the five neighbors of p_c displayed at the upper left of Fig. 4.20. The other neighbors of p_c were neighbors of p_p and were already visited, so there is no need to consider them. Based on the position of p_p, eight search neighborhoods are defined (Fig. 4.20). Note that the remaining neighbors of p_c that are not considered were already visited when tracing p_p. Since the algorithm is designed to close contours when a visited point is reached (see Rule 4), these points should not be considered. Depending on the position of p_p, look for the next neighboring point p_i of p_c in the respective neighborhood as given in Fig. 4.20. If a p_i is found, i) mark p_c as visited if it is not marked visited, ii) set p_p = p_c, iii) set p_c = p_i, and perform Rule 4. If no p_i is found, perform Rule 3, i.e., delete a dead branch.

Rule 3 - Deleting dead branches   This rule is activated only if p_c has no neighbor except p_p. In this case, p_c is assumed to be at the end of a dead branch of the contour and the following steps are performed: i) eliminate p_c from E(n), ii) set p_c = p_p, iii) set p_p to its previous neighbor (which can be easily derived from the stored chain
code), and iv) perform Rule 2. Dead branches are points at the end of a contour that are not connected to the remaining contour (an example is given in Fig. 4.19(b)). In some rare cases these single-point-wide branches are part of the original object contour. In applications where reliable object segmentation foregoes the need for precision, these points provide no important information and can be deleted. Note that only single-point-wide dead branches are deleted using this rule. The elimination of dead branches facilitates subsequent steps of object-oriented video analysis.

Figure 4.20: Neighborhoods of the current point (the search neighborhood depends on the position of the previous point relative to the current point).

Rule 4 - Closing a contour   Close C_c, eliminate its points from E(n), and perform Rule 5 if p_c = p_s or p_c is marked visited. If p_c is marked visited, eliminate the remaining dead points of C_c from E(n). If C_c is not closed, i) store the coordinates and chain code of p_c and ii) look for the next point, i.e., perform Rule 2. Note that this rule closes a contour even if the starting point is not reached. This is important in case of errors in previous segmentation steps that produce, for instance, dead branches (Fig. 4.19(c)).

Rule 5 - Selecting a contour   Do not add C_c to C(n) if 1) P_c is too small, i.e., P_c < t_pmin1 where t_pmin1 is a threshold, 2) P_c is small (i.e., t_pmin1 < P_c < t_pmin2 where t_pmin2 is a second threshold) and C_c has no corresponding contour in C(n-1), or 3) C_c is inside a previously traced contour C_p such that the spatial homogeneity (Eq. 3.3) of the object of C_p is low. Otherwise add C_c to C(n). In either case, perform Rule 1.

With Rule 5, small contours are assumed to be the result of noise or erroneous thresholding and are, therefore, eliminated if they have no corresponding contours in the previous image. The elimination of small contours (representing small objects) is spatially adapted to the homogeneity criterion of an object and temporally to corresponding objects in previous images.
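The following is a strongly simplified sketch of Rules 1, 2, and 4. It substitutes a standard clockwise Moore-style neighbor ordering for the exact search neighborhoods of Fig. 4.20, and it omits dead-branch deletion (Rule 3) and the full selection criteria of Rule 5 (only a minimum length is checked). It is an illustration under these assumptions, not the thesis algorithm.

    import numpy as np

    # clockwise 8-neighborhood, starting to the east
    DIRS = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]

    def trace_one_contour(E, visited, start):
        # Trace one contour clockwise from `start`; close it when the starting
        # point or an already visited point is reached (simplified Rule 4).
        contour = [start]
        visited[start] = True
        p, backtrack = start, 4          # assume we "came from" the west at the start
        while True:
            found = None
            for k in range(8):
                d = (backtrack + 1 + k) % 8
                q = (p[0] + DIRS[d][0], p[1] + DIRS[d][1])
                if 0 <= q[0] < E.shape[0] and 0 <= q[1] < E.shape[1] and E[q]:
                    found, fdir = q, d
                    break
            if found is None or visited[found]:
                break
            visited[found] = True
            contour.append(found)
            backtrack = (fdir + 4) % 8   # direction pointing back to the previous point
            p = found
        return contour

    def trace_contours(E, min_len=10):
        # Rule 1: scan E left-to-right, top-to-bottom for unvisited white points;
        # very small contours are discarded (placeholder for Rule 5).
        visited = np.zeros(E.shape, dtype=bool)
        contours = []
        for y in range(E.shape[0]):
            for x in range(E.shape[1]):
                if E[y, x] and not visited[y, x]:
                    c = trace_one_contour(E, visited, (y, x))
                    if len(c) >= min_len:
                        contours.append(c)
        return contours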
This is different from many methods that delete small objects based on a fixed threshold (see, for example, [61, 119, 118, 85, 70]).

Object labeling

Object contours, characterized by contour points and their spatial relationship, are not sufficient for further object-based video processing (e.g., object tracking or object manipulation), which is based on data on the positions of the object points. Therefore, extracted contours are filled to reconstruct the object and identify the exterior points of the object. Contour filling attempts to recreate each point of the original object given its contour data [101, 102]. In video analysis this is needed, for example, when profile, area, or region-based shape or size are required. Finding the interior of a region when its contour is given is one of the most common tasks in image analysis. Several methods for contour filling exist. The two most widely used are the seed-based method and the scan-line method [59, 101, 102]. In the seed-based method an interior contour point is needed as a start point, and the contour is then filled in a recursive way. Since this method is not automated it is not suitable for on-line video applications. The scan-line method is an automated technique that fills a contour line by line. This thesis uses an enhanced, efficient version of the scan-line method as described in [7, 110].

4.7 Evaluation of the segmentation method

Evaluation criteria

Evaluation criteria for object segmentation can be divided into two groups: i) Criteria based on implementation and architecture efficiency: implementation efficiency is measured by memory use and computational cost, i.e., the time needed to segment an image. Important parameters are image size, frame rate, and the computing system (e.g., multitasking computers or computers with specialized hardware). Architectural performance is evaluated by the level of human supervision, the level of parallelism, and regularity, which means that similar operations are performed at each pixel. ii) Criteria based on the quality of the segmentation results, i.e., spatial accuracy and temporal stability.

Since object segmentation is becoming integrated in many applications, it is important to be able to evaluate segmentation results using objective numerical measures, similar to the PSNR used in comparing coding and enhancement algorithms. Such a measure would facilitate research and exchange between researchers. It would also reduce the cost of evaluation by humans.
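For the contour filling step described earlier in this section, a strongly simplified even-odd scan-line fill is sketched below. It assumes a single closed contour given as an ordered list of points, skips horizontal edges, and ignores the refinements of the enhanced method of [7, 110]; it is illustrative only.

    import math
    import numpy as np

    def scanline_fill(contour, shape):
        # Even-odd scan-line filling of a closed contour (ordered (row, col) points).
        mask = np.zeros(shape, dtype=np.uint8)
        for r, c in contour:                     # the boundary itself belongs to the object
            mask[r, c] = 1
        n = len(contour)
        for y in range(shape[0]):
            xs = []
            for i in range(n):
                y1, x1 = contour[i]
                y2, x2 = contour[(i + 1) % n]
                if (y1 <= y < y2) or (y2 <= y < y1):      # edge crosses this scan line
                    xs.append(x1 + (y - y1) * (x2 - x1) / (y2 - y1))
            xs.sort()
            for xa, xb in zip(xs[0::2], xs[1::2]):        # fill between pairs of crossings
                mask[y, math.ceil(xa):math.floor(xb) + 1] = 1
        return mask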
Recently, an objective measure for segmentation quality has been introduced [140]. It measures the spatial accuracy, temporal stability, and temporal coherence of the estimated object masks relative to reference masks. The spatial accuracy (sqm (dB)) is measured in terms of the number and location of the differences between estimated and reference masks. The sqm is 0 for an estimated segmentation identical to the reference and grows with deviation from the reference, indicating lower quality of segmentation. The temporal stability (vqm (dB)) is measured in terms of fluctuating spatial accuracy with respect to the reference segmentation. The temporal coherency (vgc (dB)) is measured by the relative variation of the gravity centers of both the reference and estimated masks. Both vqm and vgc are zero for perfect segmentation throughout the image sequence. The higher the temporal stability values vqm and vgc, the less stable the estimated segmentation over time. If all values are zero, then the segmentation is perfect with respect to a reference.

Evaluation and comparison

In this section, simulation results using commonly referenced shots are given and discussed. These results are compared with the current state-of-the-art segmentation method [60, 6, 85], the COST-AM method, based on both objective and subjective evaluation criteria. The reference method is described in Section 3.3.

Automatic operation   The proposed method does not use a priori knowledge, and significant low-level parameters and thresholds are adaptively calculated.

Parallelism and regularity   All the elements of the algorithm have a regular structure: motion detection with filtering and thresholding, morphological edge detection, contour tracing, and filling. A rough analysis shows that motion detection and thresholding can be performed using parallel processing units. Edge detection, contour tracing, and filling are sequential methods.

Table 4.1: Segmentation time in seconds for the proposed method.
    Binarization
    Morphological edge detection    0.01
    Contour analysis                0.01
    Contour filling

Computational cost   The proposed algorithm needs on average 0.15 seconds per image on a SUN-SPARC MHz. As shown in Table 4.1, most of the computational cost is for motion detection and thresholding. The morphological operations and
contour analysis require very little computation. As reported in [127], the reference method, COST-AM, needs on average 45 seconds on a PC-Pentium-II 333 MHz. Simulations on the SUN-SPARC MHz show that the reference method, COST-AM, needs roughly 95 seconds (not including global motion estimation).

Quality of results   In the evaluation, both indoor and outdoor test sequences and small and large objects are considered. All simulations are performed with the same parameter settings.

Performance of the proposed segmentation is evaluated in the presence of MPEG-2 artifacts and noise (cf. 2.2). In Fig. 4.22 the robustness of the proposed method in such environments is demonstrated. The method stays robust in the presence of MPEG-2 artifacts (the MPEG-2 images have an average PSNR of 25 dB, which means that they are strongly compressed and include many artifacts) and noise (Gaussian white noise of 30 dB was added to the original sequence).

The proposed segmentation is objectively compared to the current version (4.x) of the COST-AM method (the evaluation software and the reference masks are courtesy of the COST-211 group). Fig. 4.23 gives the comparison results based on the criteria mentioned in Section 4.7.1. As shown, the proposed segmentation is better with respect to all three criteria; in particular, it yields higher spatial accuracy. The three indicators of performance strongly depend on the reference object segmentation. For example, in the Hall test sequence, the reference segmentation focuses only on the two moving objects and disregards a third object that has been deposited by one of the moving objects. Fig. 4.21 shows how the spatial accuracy sqm (dB) of the proposed method gets higher when the evaluation is focused on the moving objects.

Figure 4.21: Spatial accuracy as a function of the reference object segmentation (sqm [dB] per picture, evaluated against the two moving objects versus all objects).

In Figures 4.24 to 4.27 subjective results are given for sample sequences and compared to results from the reference algorithm COST-AM. The segmented object masks using the proposed method are more accurate than those of the reference method.
In the masks generated by the COST-AM method, parts of moving objects are often not detected, or large background areas are added to the object masks. Both spatial and temporal coherence of the estimated object masks are better for the proposed method than for the COST-AM method. In the case of small objects and outdoor scenes, the proposed method stays stable and segments all objects of the scene. As shown in Fig. 4.27, the reference method loses objects in some images, which is critical for object tracking applications. In a few cases, the COST-AM method results in more accurate object boundaries than the proposed method. This is a result of using color image segmentation in the COST-AM method.

Limitations of the proposed segmentation algorithm
- In the presence of shadows, the proposed method has difficulties in detecting accurate object shape. Some systems apply strategies to reduce shadows [131, 113]. This might, however, increase the computational cost. This thesis proposes to compensate for the shadow artifacts in higher-level processing, as will be shown in Chapters 6 and 7.
- To focus on meaningful objects, the proposed method needs a background image. A background image is available in surveillance applications. In other applications, a background update method has to be used which must adapt to different environments. This limitation can be compensated by motion detection information in successive images as in [85, 89] or by introducing texture and color analysis as in [123, 6, 145]. Such an extension is worthwhile only if it is robust and computationally efficient.

4.8 Summary

In real-time video applications, fast unsupervised object segmentation is required. This chapter has proposed a fast automated object segmentation method which consists of four steps: motion-detection-based binarization, morphological edge detection, contour analysis, and object labeling. The originality of the proposed approach is: 1) the segmentation process is divided into simple but effective tasks so that complex operations are avoided, 2) a fast robust motion detection is proposed which uses a novel memory-based thresholding technique, 3) new morphological operations are introduced that show significantly reduced computations and equal performance compared to standard morphological operations, and 4) a new contour analysis method is introduced that effectively traces contours of complex structure, such as those containing dead or inner branches. Both objective and subjective evaluations and comparisons show the robustness of the proposed methods in noisy images and in images with illumination changes while being of reduced complexity. The segmentation method uses few parameters, and these are automatically adjusted to noise and temporal changes within a video shot.
Figure 4.22: Object segmentation comparison for the Hall test sequence in the case of MPEG-2 decoded (25 dB) and noisy (30 dB) sequences (columns: original objects, MPEG-2 objects, noisy objects). The proposed method is robust with respect to noise and artifacts.
Figure 4.23: Objective evaluation obtained for the Hall test sequence (proposed method versus COST-AM, per picture). (a) Spatial accuracy (sqm [dB]) comparison: the proposed method has better spatial accuracy throughout the sequence; average gain 3.5 dB. (b) Temporal stability (vqm [dB]) comparison: the proposed method has higher temporal stability throughout the sequence; average gain 1.0 dB. (c) Temporal coherency (vgc [dB]) comparison: after the first object enters the scene, the proposed method has higher temporal coherency; average gain 0.5 dB. The proposed segmentation is better with respect to all three criteria.
Figure 4.24: Comparison of results for the indoor Hall test sequence (COST-AM method versus proposed method). Proposed objects have better spatial accuracy.
Figure 4.25: Comparison of results for the outdoor Highway test sequence (COST-AM method versus proposed method). The reference method has lower temporal and spatial stability compared to the proposed method.
Figure 4.26: Comparison of results for the indoor Stair test sequence (COST-AM method versus proposed method). This sequence has strong object and local illumination changes. The proposed method remains stable while the reference method has difficulties in providing reliable object masks.
Figure 4.27: Comparison of results for the outdoor Urbicande test sequence (COST-AM method versus proposed method). The scale of the objects changes across this sequence. The reference method loses some objects and its spatial accuracy is poor. The proposed method remains robust to variable object size and is spatially more accurate.
Chapter 5

Object-Based Motion Estimation

Motion estimation plays a key role in many video applications, such as frame-rate video conversion [120, 44, 19], video retrieval [8, 48, 134], video surveillance [36, 130], and video compression [136, 55]. The key issue in these applications is to define appropriate representations that can efficiently support motion estimation with the required accuracy. This chapter is concerned with the estimation of 2-D object motion from a video using segmented object data. The goal is to propose an object-based motion estimation method that meets the requirements of real-time and content-based applications such as surveillance and retrieval. In these applications, representing object motion in a way that is meaningful for high-level interpretation, such as event detection and classification, is favored over precision of estimation.

5.1 Introduction

Objects can be classified into three major categories: rigid, articulated, and non-rigid [53]. The motion of a rigid object is a composition of a translation and a rotation. An articulated object consists of rigid parts linked by joints. Most video applications, such as entertainment, surveillance, or retrieval, assume rigid objects.

An image acquisition system projects a 3-D world scene onto a 2-D image plane. When an object moves, its projection is animated by a 2-D motion, to be estimated from the space-time image variations. These variations can be divided into global and local. Global variations can be a result of camera motion or global illumination change. Local variations can be due to object motion, local illumination change, and noise. Motion estimation techniques estimate apparent motion, which is due to true motion or to various artifacts, such as noise and illumination change.

2-D object motion can be characterized by the velocity of image points or by their displacement. Let (p_x, p_y) denote the spatial coordinates of an object point in I(n-1)
and (q_x, q_y) its spatial coordinates in I(n). The displacement d_p = (d_x, d_y) of (p_x, p_y) is given by (d_x, d_y) = (q_x - p_x, q_y - p_y). The field of optical velocities over the image lattice is often called optical flow. This field associates a velocity with each image point. The correspondence field is the field of displacements over the image lattice. In video processing (e.g., entertainment, surveillance), estimation of the correspondence field is usually considered. The goal of a motion estimation technique is to assign a motion vector (displacement or velocity) to each pixel in an image.

Motion estimation relies on hypotheses about the nature of the image or object motion, and is often tailored to application needs. Even with good motion models, a practical implementation may not find a correct estimate. In addition, motion vectors cannot always be reliably estimated because of noise and artifacts. Difficulties in motion estimation arise from unwanted camera motion, occlusion, noise, lack of image texture, and illumination changes. Motion estimation is an ill-posed problem which requires regularization. A problem is ill-posed if no unique solution exists or the solution does not continuously depend on the input data [92]. The choice of a motion estimation approach strongly depends on the application and on the nature of the processes that will interpret the estimated motion.

5.2 Review of methods and motivation

Motion estimation methods can be classified into two broad categories: gradient-based and matching methods (for a thorough review see [92]). Both generally assume that objects undergo pure translation and have been widely studied and used in the field of video coding and interpolation. Gradient-based approaches use a relationship between image motion and the spatio-temporal derivatives of image brightness. They use computations localized to small regions of the image. As a result, they are sensitive to occlusion. Their main disadvantage is that they are not applicable to motion of large extent unless an expensive multi-resolution scheme is used. A more reliable approach is the estimation of motion of a larger region of support, such as blocks or arbitrarily shaped regions, based on parametric motion models.

Matching techniques locate and track small, identifiable regions of the image over time. They can estimate motion accurately only in distinguishable image regions. In general, matching techniques are highly sensitive to ambiguity among the structures to be matched. Resolving this ambiguity is computationally costly. Furthermore, it is often computationally impractical to estimate matches for a large number of regions. A matching-based motion estimation method that is frequently used and implemented in hardware is block matching [46, 20, 21]. Here, the motion field is assumed to be
constant over rectangular blocks and represented by a single motion vector in each block. Several refinements of this basic idea have been proposed [53, 43, 18]. In [18], for example, a spatio-temporal update strategy is used to increase the accuracy of the block-matching algorithm. Also, a median-based smoothing of block motion is used. Three advantages of block-matching algorithms are: 1) easy implementation, 2) better quality of the resulting motion vector fields compared to other methods, such as phase correlation and gradient methods, in the presence of large motion, and 3) they can be implemented by regular VLSI architectures. An additional important advantage of block matching is that it does not break down totally.

Block matching has, however, some drawbacks. This is particularly true at object boundaries, where these methods assume an incorrect model and result in erroneous motion vectors, leading to discontinuity in the motion vector fields and causing ripped-boundary artifacts in the case of block-matching-based motion compensation. In motion-compensated images, block structures become visible and object boundaries may split. Another drawback is that the resulting motion vectors inside objects or object regions with a single motion are not homogeneous, producing ripped-region artifacts, i.e., structure inside regions can get split or distorted. Additionally, using a block-based algorithm results in block patterns in the motion vector field, causing blocking artifacts. These patterns often result in block motion artifacts in subsequently processed images. The human visual system is very sensitive to such artifacts (especially abrupt changes). Various studies show that the integration of object information in the process of estimating object motion reduces block-matching artifacts and enhances the motion vector fields [12, 10, 62, 30, 20, 55, 26, 49].

Block-based and pixel-based motion estimation methods have been widely used in the field of coding and image interpolation. The focus in these applications is on accurate motion and less on meaningful representation of object motion. For video surveillance and retrieval, the focus is on extracting a flexible content-based video representation, i.e., on estimation that is reliable and stable throughout an image sequence rather than highly precise. Content-based video processing calls for motion estimation based on objects. In an object-based motion estimation algorithm, motion vectors are estimated using information about the shape or structure of segmented objects. This causes the motion vectors to be consistent within the objects and at object boundaries. In addition, since the number of objects is significantly less than the number of blocks in an image, object-based motion estimation has lower complexity. Furthermore, given the objects, motion models more accurate than pure translation can be used, for instance, models that include rotation and scaling.

Various object-based motion estimation methods have been proposed that require large amounts of computation [119, 49, 55, 32, 132]. This complexity is mainly due to segmentation, which is difficult and complex [143, 55, 118, 132]. Also, although they
generally give good motion estimates, they can fail to interpret object motion correctly or can simply break down. This is due to their dependence on good segmentation. Several methods use region growing for segmentation, or try to minimize a global energy function whose minimum is difficult to find [143]. Furthermore, these methods include several levels of refinement.

In the next sections, a low-complexity object motion estimation technique is introduced that is designed to fit the needs of content-based video representation. It relies on the estimation of the displacements of the sides of the object minimum bounding box. Two motion estimation steps are considered: an initial coarse estimation that finds a single displacement for an object using the four sides of the MBB between two successive images, and the detection of non-translational motion and its estimation.

5.3 Modeling object motion

To describe the 2-D motion of objects, a motion model must be defined. Two broad categories of 2-D motion models are defined: non-parametric and parametric. Non-parametric models are based on a dense local motion field where one motion vector is estimated for each pixel of the image. Parametric models describe the motion of a region in the image by a set of parameters. The motion of rigid objects, for example, can be described by a parametric motion model. Various simplifications of parametric motion models exist [95, 55, 91]. Models have different complexity and accuracy. In practice, as a compromise between complexity and accuracy, 2-D affine or 2-D simplified linear motion models are used.

Assuming a static camera or a camera-motion-compensated video, local object motion can be described adequately as the composition of translation, rotation, and scaling. Changes in object scale occur when the object moves towards or away from the camera. This thesis uses the so-called simplified linear models to describe object motion. Let (p_x, p_y) and (q_x, q_y) be the initial and the final position, respectively, of a point p of an object undergoing motion.

Translation   The translation of p by (d_x, d_y) is given by

q_x = p_x + d_x, \qquad q_y = p_y + d_y.   (5.1)

Scaling   The scale-change transformation of p is defined by

q_x = s (p_x - c_x) + c_x, \qquad q_y = s (p_y - c_y) + c_y   (5.2)

where s is the scaling factor and (c_x, c_y) is the center of scaling.
Rotation   The rotational transformation of p is defined by

q_x = c_x + (p_x - c_x) \cos\varphi - (p_y - c_y) \sin\varphi, \qquad q_y = c_y + (p_x - c_x) \sin\varphi + (p_y - c_y) \cos\varphi   (5.3)

where (c_x, c_y) is the center of rotation and φ the rotation angle.

Composition   If an object O_i is scaled, rotated, and displaced, then the final position of p ∈ O_i is defined by (assuming a small-angle rotation, which gives sin φ ≈ φ and cos φ ≈ 1)

q_x = p_x + d_x + s (p_x - c_x) - \varphi (p_y - c_y), \qquad q_y = p_y + d_y + s (p_y - c_y) + \varphi (p_x - c_x).   (5.4)

5.4 Motion estimation based on object matching

A key issue when designing a motion estimation technique is its degree of efficiency with enough accuracy to serve the purpose of the intended video application. For instance, in object tracking and event detection a trade-off is required between computational cost and quality of object prediction. In video coding applications, accurate motion representation is needed to achieve good coding quality with low bit rate.

The proposed method estimates object motion based on the displacements of the MBB of the object. MBB-based object motion estimation is not a new concept. Usually, MBB-based methods use the displacement of the centroid of the object MBB. This is sensitive to noise and other image artifacts such as occlusion. Most MBB motion estimators assume translational motion, whereas the motion type itself can be important information, for instance in retrieval. The contribution here is the detection of the type of object motion (translation, scaling, composition) and the subsequent estimation of one or more motion values per object depending on the detected motion type. In the case of a composition of these primitive motions, the method estimates the motion without specifying its composition. Special consideration is given to object motion in interlaced video and at the image margin. Analysis of the displacements of the four MBB sides further allows the estimation of more complex image motion, as when objects move towards or away from the camera (Section 5.4.3).

Overall approach

The proposed non-parametric motion estimation method is based on four steps (Fig. 5.1): object segmentation, object matching, MBB-based displacement estimation, and motion analysis and update.
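Before detailing these steps, the simplified linear model of Eqs. 5.1 to 5.4 can be made concrete with a small illustrative sketch (NumPy assumed; not the thesis implementation). In the composed form of Eq. 5.4, s acts as a scale change, so s = 0 and phi = 0 reduce the model to a pure translation.

    import numpy as np

    def apply_simplified_linear_motion(points, d=(0.0, 0.0), s=0.0, phi=0.0, c=(0.0, 0.0)):
        # Apply Eq. 5.4 to an array of (x, y) object points: d is the translation,
        # s the scale change, phi the (small) rotation angle, c the motion center.
        p = np.asarray(points, dtype=float)
        px, py = p[:, 0], p[:, 1]
        cx, cy = c
        qx = px + d[0] + s * (px - cx) - phi * (py - cy)
        qy = py + d[1] + s * (py - cy) + phi * (px - cx)
        return np.stack([qx, qy], axis=1)

    # Example: translate a small object by (2, -1) pixels without scaling or rotation.
    # apply_simplified_linear_motion([(10, 10), (12, 10)], d=(2, -1))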
Figure 5.1: Diagram of the proposed motion estimation method (inputs I(n-1), I(n), and R(n); object segmentation with a motion and object memory produces O(n-1) and O(n); object matching provides the object mapping; initial estimation of the MBB displacements yields initial motion data; MBB-motion analysis, i.e., detection of motion types, outputs the object motion).

Object segmentation has been considered in Chapter 4 and object matching will be discussed in Chapter 6. In the third step (Section 5.4.2), an initial object motion is estimated by considering the displacements of the sides of the MBBs of two corresponding objects (Fig. 5.2), accounting for possible segmentation inaccuracies due to occlusion and splitting of object regions. In the fourth step (Section 5.4.3), the type of the object motion is determined. If the motion is a translation, a single motion vector is estimated. Otherwise, the object MBB is divided into partitions for more precise estimation, and different motion vectors are assigned to the different partitions.

The proposed motion estimation scheme assumes that the shape of moving objects does not change drastically between successive images and that the displacements are within a predefined range (in the implementation the range [-16, +15] was used, but other ranges can easily be adopted). These two assumptions are realistic for most video applications and for the motion of real objects.

Initial estimation

Let I(n) be the observed image at time instant n, defined on an X×Y lattice where the starting pixel, I(1, 1, n), is at the upper-left corner of the lattice. If the motion is estimated forward, between I(n-1) and I(n), then the direction of object motion is defined as follows: horizontal motion to the left is negative and positive to the right; vertical motion down is positive and negative up. The initial estimate of an object motion comes from the analysis of the displacements of the four sides of the MBBs of two corresponding objects (Fig. 5.2), as follows.

Definitions:
- M_i : O_p → O_i, a function that assigns to an object O_p at time n-1 an object O_i at time n.
- w = (w_x, w_y): the current displacement of O_i, between I(n-2) and I(n-1).
- (r_min_p, c_min_p), (r_max_p, c_min_p), (r_min_p, c_max_p), and (r_max_p, c_max_p): the four corners, upper left, lower left, upper right, and lower right, of the MBB of O_p (cf. Fig. 5.2).
- r_min_p and r_max_p: the upper and lower row of O_p.
- c_min_p and c_max_p: the left and right column of O_p.
- r_min_i and r_max_i: the upper and lower row of O_i. If upper occlusion or splitting is detected, then r_min_i = r_min_p + w_y. If lower occlusion or splitting is detected, then r_max_i = r_max_p + w_y.
- c_min_i and c_max_i: the left and right column of O_i. If left occlusion or splitting is detected, then c_min_i = c_min_p + w_x. If right occlusion or splitting is detected, then c_max_i = c_max_p + w_x.
- d_rmin = r_min_i - r_min_p: the vertical displacement of the point (r_min_p, c_min_p).
- d_cmin = c_min_i - c_min_p: the horizontal displacement of the point (r_min_p, c_min_p).
- d_rmax = r_max_i - r_max_p: the vertical displacement of the point (r_max_p, c_max_p).
- d_cmax = c_max_i - c_max_p: the horizontal displacement of the point (r_max_p, c_max_p).
- d_r = d_rmax - d_rmin: the difference of the vertical displacements.
- d_c = d_cmax - d_cmin: the difference of the horizontal displacements.

Figure 5.2: MBB-based displacement estimation (the minimum and maximum rows and columns of O_p, the displacements of the four MBB sides towards O_i, and the resulting object displacement).

The initial displacement, w^1_i = (w^1_{x_i}, w^1_{y_i}), of an object is the mean of the displacements of the horizontal and vertical MBB sides (see the first part of Eqs. 5.5 and 5.6). In case of segmentation errors, the displacements of parallel sides can deviate significantly, i.e., |d_c| > t_d or |d_r| > t_d. The method detects these deviations and corrects the estimate based on the previous estimate (w_x, w_y). This is given in the
second and third part of Eqs. 5.5 and 5.6.

w^1_{x_i} = \begin{cases}
(d_{c_{max}} + d_{c_{min}})/2 & : |d_c| \le t_d \\
(d_{c_{min}} + w_x)/2 & : (|d_c| > t_d) \wedge [((d_{c_{max}} d_{c_{min}} > 0) \wedge (|d_{c_{max}}| > |d_{c_{min}}|)) \vee ((d_{c_{max}} d_{c_{min}} < 0) \wedge (d_{c_{max}} w_x < 0))] \\
(d_{c_{max}} + w_x)/2 & : (|d_c| > t_d) \wedge [((d_{c_{max}} d_{c_{min}} > 0) \wedge (|d_{c_{max}}| \le |d_{c_{min}}|)) \vee ((d_{c_{max}} d_{c_{min}} < 0) \wedge (d_{c_{max}} w_x > 0))]
\end{cases}   (5.5)

w^1_{y_i} = \begin{cases}
(d_{r_{max}} + d_{r_{min}})/2 & : |d_r| \le t_d \\
(d_{r_{min}} + w_y)/2 & : (|d_r| > t_d) \wedge [((d_{r_{max}} d_{r_{min}} > 0) \wedge (|d_{r_{max}}| > |d_{r_{min}}|)) \vee ((d_{r_{max}} d_{r_{min}} < 0) \wedge (d_{r_{max}} w_y < 0))] \\
(d_{r_{max}} + w_y)/2 & : (|d_r| > t_d) \wedge [((d_{r_{max}} d_{r_{min}} > 0) \wedge (|d_{r_{max}}| \le |d_{r_{min}}|)) \vee ((d_{r_{max}} d_{r_{min}} < 0) \wedge (d_{r_{max}} w_y > 0))]
\end{cases}   (5.6)

This estimated displacement may deviate from the correct value due to inaccurate object shape estimation across the image sequence. To stabilize the estimation throughout the image sequence, the first initial estimate w^1_i = (w^1_{x_i}, w^1_{y_i}) is compared to the current estimate w = (w_x, w_y). If they deviate significantly, i.e., |w_x - w^1_{x_i}| > t_m or |w_y - w^1_{y_i}| > t_m for a threshold t_m, acceleration is assumed and the estimate w_i is adapted to the current estimate as given in Eq. 5.7, where a represents the maximal allowable acceleration. This way, the estimated displacement is adapted to the previous displacement to provide stability against inaccuracies in the estimation of the object shape by the object segmentation module.

w_{x_i} = \begin{cases} w^1_{x_i} & : |w^1_{x_i} - w_x| \le t_m \\ w_x + a & : (|w^1_{x_i} - w_x| > t_m) \wedge (w^1_{x_i} > w_x) \\ w_x - a & : (|w^1_{x_i} - w_x| > t_m) \wedge (w^1_{x_i} < w_x) \end{cases} \qquad
w_{y_i} = \begin{cases} w^1_{y_i} & : |w^1_{y_i} - w_y| \le t_m \\ w_y + a & : (|w^1_{y_i} - w_y| > t_m) \wedge (w^1_{y_i} > w_y) \\ w_y - a & : (|w^1_{y_i} - w_y| > t_m) \wedge (w^1_{y_i} < w_y) \end{cases}   (5.7)
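A minimal sketch of Eqs. 5.5 to 5.7 for one motion component; the threshold t_d, the deviation threshold t_m, and the acceleration a below are assumed example values, not those of the thesis.

    def robust_side_average(d_min, d_max, w_prev, t_d=3):
        # Eqs. 5.5/5.6: average the displacements of two parallel MBB sides; if they
        # deviate strongly (segmentation error), keep the side that is consistent
        # with the previous displacement w_prev.
        if abs(d_max - d_min) <= t_d:
            return (d_max + d_min) / 2.0
        if d_max * d_min > 0:                         # same sign: distrust the larger magnitude
            keep = d_min if abs(d_max) > abs(d_min) else d_max
        else:                                         # opposite signs: keep the side agreeing with w_prev
            keep = d_min if d_max * w_prev < 0 else d_max
        return (keep + w_prev) / 2.0

    def temporal_update(w1, w_prev, t_m=4, a=2):
        # Eq. 5.7: limit frame-to-frame changes of the estimate to the maximal
        # allowable acceleration a.
        if abs(w1 - w_prev) <= t_m:
            return w1
        return w_prev + a if w1 > w_prev else w_prev - a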
Motion analysis and update

Often, objects correspond to a large area of the image and a simple translational model for object matching is not appropriate; a more complex motion model must be introduced. To achieve this, the motion of the sides of the MBB is analyzed and motion types are detected based on plausibility rules. This analysis detects four states of object motion changes: translation, scaling, rotation, and acceleration. If non-translational motion is estimated, an object is divided into several partitions that are assigned different motion vectors. The number of regions depends on the magnitude of the estimated non-translational motion. Usually, motion within objects does not contain fine details and motion vectors are spatially consistent, so that large object regions have identical motion vectors. Therefore, the number of regions need not be high.

Detection of translation   This thesis assumes translational object motion if the displacements of the horizontal and vertical sides of the object MBB are nearly identical, i.e.,

\text{Translation} : (|d_r| < t_d) \wedge (|d_c| < t_d).   (5.8)

In this case one motion vector (Eq. 5.7) is assigned to the whole object.

Detection of scaling   This thesis takes the scaling center to be the centroid of the segmented object and assumes object scaling if the displacements of the parallel sides of the MBB are symmetrical and nearly identical in magnitude. This means

\text{Scaling} : ((|d_{r_{min}} + d_{r_{max}}| < t_s) \wedge (d_{r_{min}} d_{r_{max}} < 0)) \wedge ((|d_{c_{min}} + d_{c_{max}}| < t_s) \wedge (d_{c_{min}} d_{c_{max}} < 0))   (5.9)

with a small threshold t_s. For example, if one side is displaced to the right by three pixels, the parallel side is displaced by three pixels to the left. This is illustrated in Fig. 5.3(a).

Figure 5.3: Scaling: symmetrical, (nearly) identical displacements of all MBB sides. (a) Detection of scale change (displacements of the minimum and maximum rows and columns); (b) vertical scaling estimation (the displacements of the minimum and maximum rows are measured, in-between displacements are interpolated).

If scale change is detected, the object is divided into sub-regions, where the number of regions depends on the difference d_r. Each region is then assigned one displacement as follows: the region closest to r_max is assigned d_rmax and the region
The accurate detection of scaling depends on the performance of the segmentation; Eq. 5.9, however, takes possible segmentation errors into account.

Detection of rotation

Rotation about the center can be detected when there is a small difference between the orientations of the horizontal MBB-sides and a small difference between the orientations of the vertical MBB-sides in the current and previous images (Fig. 5.4).

Figure 5.4: Rotation: similar orientations of the MBB-sides of O_i and O_p about the centroid (C_x, C_y).

Detection of general motion

In the case of a composition of motion types, three types of motion are considered: translational motion, non-translational motion, and acceleration. If

(|w_yi − d_rmin| > a) ∨ (|w_yi − d_rmax| > a),        (5.10)

where a is the maximal possible acceleration, then vertical non-translational motion is declared and the object is divided into |d_rmax − d_rmin| + 1 vertical regions. As with scale-change estimation, each region is assigned one displacement as follows: the region closest to r_max is assigned d_rmax and the region closest to r_min is assigned d_rmin. The motion of in-between regions is interpolated by increasing or decreasing d_rmin and d_rmax (Fig. 5.5).

Figure 5.5: Vertical non-translational motion estimation: V1 is the displacement of the minimum row of the object, V4 the displacement of the maximum row, and V2 and V3 are interpolated.
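A minimal sketch of the partitioning and interpolation just described, shown here for the vertical case (the horizontal case is analogous); the function name is an illustrative assumption.

```python
import numpy as np

def partition_displacements(d_min, d_max):
    """Divide an object with non-translational motion into |d_max - d_min| + 1
    bands and interpolate one displacement per band between the displacements
    of the two opposite MBB-sides (cf. Fig. 5.5)."""
    n_regions = int(abs(d_max - d_min)) + 1
    # per-band displacements, ordered from the r_min side to the r_max side
    return np.linspace(d_min, d_max, n_regions)

print(partition_displacements(-2, 3))   # -> [-2. -1.  0.  1.  2.  3.]
```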
If

(|w_xi − d_cmin| > a) ∨ (|w_xi − d_cmax| > a),        (5.11)

then horizontal non-translational motion is declared and the object is divided into |d_cmax − d_cmin| + 1 horizontal regions. Each region is assigned one displacement as follows: the region closest to c_max is assigned d_cmax and the region closest to c_min is assigned d_cmin. The motion of the other regions is interpolated based on d_cmin and d_cmax.

Detection of motion at image margin

MBB-based motion estimation is affected by objects entering or leaving the visual field. Therefore, this condition has to be explicitly detected so that the estimation can be adapted. Motion at image borders is detected by a small motion of the MBB-side that lies at the image border. The motion of the object is then defined based on the motion of the MBB-side that is not at the image border (cf. Fig. 5.6). This consideration is important for event-based representation of video: it enables tracking and monitoring of object activity as soon as objects enter or leave the image.

Figure 5.6: Object motion at the image border: the border defines one MBB-side of O_i, and the horizontal displacement of the object is taken from the side that is not at the border.

Compensation of interlaced artifacts

Analog or digital video can be classified as interlaced or non-interlaced². Interlacing often disturbs vertically aligned image edges. In interlaced video, vertical motion estimation can be distorted by aliasing, because two successive fields lie on different rasters. The effect is that the vertical motion vector fluctuates by ±1 between two fields. To compensate for this fluctuation, the current and previous vertical displacements are compared; if they deviate by only one pixel, the minimal displacement of the two is selected.

² Non-interlaced video is also called progressive scan. Most personal computers use progressive scan: all lines of a frame are displayed in one pass. TV signals are interlaced video: each frame consists of two fields displayed in two passes, and each field contains every other horizontal line of the frame. A TV displays the first field of alternating lines over the entire screen, and then displays the second field to fill in the alternating gaps left by the first field. An NTSC field is displayed approximately every 1/60 of a second and a PAL field every 1/50 of a second.
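The ±1 fluctuation compensation described above can be sketched as follows; interpreting "the minimal displacement" as the one with the smaller magnitude is an assumption of this sketch.

```python
def compensate_interlace(v_curr, v_prev):
    """Suppress the +/-1 fluctuation of the vertical displacement caused by the
    alternating rasters of interlaced fields: if the current and previous
    vertical displacements differ by exactly one pixel, keep the smaller one."""
    if abs(v_curr - v_prev) == 1:
        return min(v_curr, v_prev, key=abs)
    return v_curr

print(compensate_interlace(4, 3))     # -> 3
print(compensate_interlace(-4, -3))   # -> -3
```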
Another (computationally more expensive) method to compensate for the effect of interlaced video is to interpolate the missing lines of the raster so that both fields lie on the same raster. This interpolation shifts each line of a field; therefore, it must be done differently for the two fields. Such an approach has been investigated in [18], which shows that the effect of the interlaced alias can be significantly reduced.

5.5 Experimental results and discussion

Evaluation criteria

Evaluation criteria for motion estimation techniques can be divided into:

1) Accuracy criteria: Two subjective criteria are used to evaluate the accuracy of the estimated motion vectors. The first is to display the estimated vector fields and the original image side by side (Fig. 5.8). The second is based on motion compensation, a non-linear prediction technique where the current image I(n) is predicted from I(n−1) using the motion estimated between these images (Fig. 5.9).

2) Consistency criteria: The second category of evaluation criteria is the consistency of the motion vectors throughout the image sequence. Motion-based object tracking is one way to measure the consistency of an estimated motion vector (Chapter 6).

3) Implementation criteria: An important implementation-oriented criterion is the cost of computing the motion vectors. This criterion is critical in real-time applications, such as video surveillance, video retrieval, or frame-rate conversion, and methods intended for real-time environments should be evaluated against it.

There are also objective criteria, such as the mean square error (MSE), the root mean square error (RMSE), and the PSNR (cf. [43]), to evaluate the accuracy of motion estimation. The selection of an appropriate evaluation criterion depends on the application. In this thesis, the applications are real-time object tracking for video surveillance and real-time object-based video retrieval; for these, objective evaluation criteria are not as appropriate as for coding or noise reduction applications. We have nevertheless carried out objective evaluations using the MSE criterion. These evaluations have shown that the proposed motion estimation method gives a lower MSE than the block-matching technique used for comparison.
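The MSE criterion on motion-compensated prediction can be computed as in the following sketch; the compensation routine is a simplified, illustrative stand-in (whole-region shifts of a grayscale image, no handling of uncovered areas), not the evaluation code used in the thesis.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    return float(np.mean((a - b) ** 2))

def compensate(prev, regions):
    """Predict the current image from the previous one by shifting each region
    (object or block) by its estimated displacement; `regions` is a list of
    (boolean mask, (dy, dx)) pairs."""
    pred = prev.copy()
    h, w = prev.shape
    for mask, (dy, dx) in regions:
        ys, xs = np.nonzero(mask)
        pred[np.clip(ys + dy, 0, h - 1), np.clip(xs + dx, 0, w - 1)] = prev[ys, xs]
    return pred

# Prediction error of the motion-compensated image, e.g.:
# err = mse(current, compensate(previous, object_displacements))
```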
Evaluation and discussion

Block matching is one of the fastest and relatively reliable motion estimation techniques, and it is likely to remain in use in many applications. The proposed method is compared in this section to a state-of-the-art block-matching-based motion estimation that has been implemented in hardware and found to be useful for TV applications, such as noise reduction [43, 46].

Computational costs

Although block matching is a fast technique, the video analysis system presented in this thesis needs faster techniques still. Simulation results show that the computational cost of the object-based motion estimation is about 1/15 of the computational cost of a fast block matching [43]. This block-based method in turn has a complexity about forty times lower than that of a full-search block-matching algorithm as used in various MPEG-2 encoders. Furthermore, regular MBB-based object partition and motion estimation are used (i.e., the same operations are applied for each object). Because of its low computational cost and regular operations, the proposed method is suitable for real-time video applications, such as video retrieval.

Quality of the estimated motion

In the case of partial or complete object occlusion, object-oriented video analysis relies on the estimated motion of the object to predict its position or to find it when it is lost. Thus a relatively accurate motion estimate is required. Block matching gives motion estimates for blocks and not for objects, and it can fail if there is insufficient structure (Fig. 5.8(e)). The proposed method, as will be shown in Chapters 6 and 7, provides good estimates for object prediction in object tracking and in event extraction.

Figs. 5.7, 5.8, and 5.9 show samples of our results. Fig. 5.8 displays the horizontal and vertical components of the motion field between two frames of the sequence Highway, estimated by the block matching and object matching methods. The horizontal and vertical components are encoded separately. Here, the magnitudes were scaled for display purposes, with darker gray levels representing negative motion and lighter gray levels representing positive motion. Another measure of the quality of the estimated motion is given by comparing the object predictions obtained using block matching and object matching. Fig. 5.9 shows that, despite being a simple technique, the proposed method gives good results compared to more sophisticated block matching techniques.
Figure 5.7: Object-matching versus block-matching: the first row shows horizontal block motion vectors obtained with the method in [43] and the corresponding horizontal object motion vectors; the second row shows the mean-square error between the motion-compensated and original images using block vectors and object vectors.

A drawback of block-matching methods is that they deliver non-homogeneous motion vectors inside objects, which affects motion-based video processing techniques such as noise reduction. The proposed cost-effective object-based motion estimation, on the other hand, provides more homogeneous motion vectors inside objects (Fig. 5.7). Using this motion information, the block-matching motion field can be enhanced [12]. Fig. 5.7 displays an example of the incomplete block-matching motion compensation and also shows the better performance of object-matching motion compensation. The integration of block and object motion information is an interesting research topic for applications such as motion-adaptive noise reduction or image interpolation.
Figure 5.8: Object versus block motion. (a) Original image. (b) Horizontal object motion field. (c) Vertical object motion field. (d) Horizontal block motion field. (e) Vertical block motion field. Note the estimated non-translational motion of the left car. Motion is coded as gray levels, where a strong dark level indicates fast motion to the left or up and a strong bright level indicates fast motion to the right or down.
Figure 5.9: Prediction of objects: block-based prediction introduces various artifacts while object-based prediction gives smoother results inside objects and at boundaries. (a) Object-based prediction I(196): the objects are correctly predicted. (b) Block-based prediction I(196): note the artifacts introduced at object boundaries. (c) Object-based prediction I(6) (zoomed in). (d) Block-based prediction I(6) (zoomed in).
133 Summary Summary A new real-time approach to object-based motion estimation between two successive images has been proposed in this Chapter. This approach consists of an explicit matching of arbitrarily-shaped objects to estimate their motion. Two motion estimation steps are considered: estimation of the displacement of an object by calculating the displacement of the mean coordinates of the object and estimation of the displacements of the four MBB-sides. These estimates are compared. If they differ significantly, a non-translational motion is assumed and the different motion vectors are assigned to different image regions. In the proposed approach, extracted object information (e.g., size, MBB, position, motion direction) is used in a rule-based process with three steps: object correspondence, estimation of the MBB motion based on the displacement of the sides of the MBB, i.e., the estimation process is independent of the intensity signal and detecting object motion types (scaling, translation, acceleration) by analyzing the displacements of the four MBB sides and assigning different motion vectors to different regions of the object. Special consideration is given to object motion in interlaced video and at image margin. Various simulations have shown that the proposed method provides good estimates for object tracking, event detection, and high-level video representation as will be given in the following chapters.
135 Chapter 6 Voting-Based Object Tracking Object tracking has various applications. It can be used to facilitate the interpretation of video for high-level object description of the temporal behavior of objects (e.g., activities such as entering, stopping, or exiting a scene). Such high-level descriptions are needed in various content-based video applications such as surveillance or retrieval [27, 70, 35, 133, 130]. While object tracking has been extensively studied for surveillance and video retrieval, limited work has been done to temporally integrate or track objects or regions throughout an image sequence. Tracking can also be used to assist the estimation of coherent motion trajectories throughout time and to support object segmentation ([55], Chapter 5). The goal of this section is to develop a fast, robust, object tracking method that accounts for multiple correspondences and object occlusion. The object tracking module receives input from the motion estimation and object segmentation modules. The main issue in tracking systems is reliability in case of shadows, occlusion, and object split. The proposed method focuses on solutions to these problems. 6.1 Introduction Video analysis methods proposed so far, object segmentation and motion estimation, provide low-level spatial and temporal object features for consecutive images of a video. In various applications, such as video interpretation, low-level locally limited object features are not sufficient to describe moving objects and their behavior throughout an entire video shot. To achieve higher-level object description, objects must be tracked and their temporal features registered as they move. Such tracking and description of objects transforms locally-related objects into video objects. Tracking of objects throughout the image sequence is possible because of spatial and temporal continuity: objects, usually, move smoothly, do not disappear or change direction suddenly. Therefore, the temporal behavior of moving objects is predictable.
136 110 Object tracking Various changes make, however, tracking of objects in real scenes a difficult task: Image changes, such as noise, shadows, light changes, surface reflectance and clutter, can obscure object features to mislead tracking. The presence of multiple moving objects further complicates tracking, especially when objects have similar features, when their paths cross, or when they occlude each other. Non-rigid and articulated objects are yet another factor to confuse tracking because their features vary. Inaccurate object segmentation also obscures tracking. Possible feature changes, e.g., due to object deformation or scale change (e.g., object size can change rapidly) can also confuse the tracking process. Finally, application related requirements, such as real-time processing, limit the design freedom of tracking algorithms. This thesis develops an object tracking method that solves many of these difficulties using a reliable strategy to select features that remain stable over time. It uses a robust detection of occlusion to update features of occluded objects. It further robustly integrates features so that noisy features are filtered or compensated. 6.2 Review of tracking algorithms Applications of object tracking are numerous [80, 42, 71, 16, 65, 75, 55, 50, 40, 70]. Two strategies can be identified: one uses correspondence to match objects between successive images and the other performs explicit tracking using stochastic methods such as MAP approaches [75, 15, 71, 55]. Explicit tracking approaches model occlusion implicitly but have difficulties to detect entering objects without delay and to track multiple object simultaneously. Furthermore, they assume models of the object features that might become invalid [75]. Most methods have high computational costs and are not suitable for real-time applications. Tracking based on correspondence tracks objects, either by estimating their trajectory or by matching their features. In both cases object prediction is needed to define the location of the object along the sequence or to predict occluded objects. Prediction techniques can be based on Kalman filters or on motion estimation and compensation. While the use of a Kalman filter [80, 50, 16, 40] relies on an explicit trajectory model, motion compensation does not require a model of trajectory. In complex scenes, the definition of an explicit trajectory model is difficult and can hardly be generalized for many video sequences [71]. Furthermore, basic Kalman filtering is noise sensitive and can hardly recover its target when lost [71]. Extended Kalman filters can estimate tracks in some occlusion cases but have difficulty when the number of objects and artifacts increase.
137 Review 111 Correspondence-based tracking establishes correspondence between features of one object to features of another object in successive images. Tracking methods can be divided into three categories according to the features they use: Motion-based methods (ex. [80]) track objects based on their estimated motion. They either assume a simple translational motion model or more complex parametric models, e.g., affine models. Although robust motion estimation is not an easy task, motion is a powerful tool for tracking and is used widely. Contour-based methods [71, 16]) represent objects by their boundaries and shapes. Contour-based features are fast to compute and to track and are robust to some illumination changes. Their main disadvantage is sensitivity to noise. Region-based methods [65, 16] aim at representing the objects through their spatial pattern and its variations. Such a representation is, in general, robust to noise and object scaling. It requires, however, large amounts of computation for tracking (e.g., using correlation methods) and is sensitive to illumination changes. Region-based features are useful in case of small objects or with low resolution images. For small objects, contour-based methods are more strongly affected by noise or by low resolution. While earlier tracking algorithms were based on motion and Kalman filtering [80], recent algorithms [50, 65, 70] combine features for more robust tracking. The method in [65] is based on a change detection from multiple image scales, and a Kalman Filter to track segmented objects based on contour and region features. This approach is fast and can track multiple objects. It has, however, a large tracking delay, i.e., objects are detected and tracked after being in the scene for a long time, no object occlusion is considered, the object segmentation has large deviation, and its model is oriented to one narrow application (vehicle tracking). In [16] object tracking is performed using active contour models, region-based analysis and Kalmanbased prediction. This method can track one object and relies heavily on Kalmanfiltering, which is not robust to clutter and artifact. Moreover, the method has high computational cost. The study in [40] uses a change detection, morphological dilation bya5 5 kernel, estimation of the center-of-gravity, and Kalman filters for position estimation. It is able to keep the object of interest in the field of view. No object occlusion is, however, considered and the computational cost is high. The method in [70] is designed to track people that are isolated, move in an upright fashion and are unoccluded. It tracks objects by modeling their body-parts motion and matching objects whose MBB overlap. To continue to track objects after occlusion, statistical features of two persons before occlusion are compared to the features after occlusion to recover objects. A recent object tracking algorithm is proposed in [50]. It consists of various stages: motion estimation, change detection with background adaptation, spatial
138 112 Object tracking clustering (region isolation, merging, filtering, and splitting), and Kalman-filtering based prediction for tracking. The system is optimized to track humans and has a fast response on a workstation with a specialized high performance graphic card. This depends on the contents of the input sequence. It uses a special procedure to detect shadows and reduce their effects. The system can also track objects in the presence of partial occlusion. The weak part of this system is the change detection module which is based on an experimentally-fixed threshold to binarize the image difference signal. This threshold remains fixed throughout the image sequence and for all input sequences. Furthermore, all the thresholds used in the system are optimized by a training procedure which is based on an image sequence sample. As discussed in Section 4.4 and in [112], experimentally selected thresholds are not appropriate for a robust autonomous surveillance system. Thresholds should be calculated dynamically based on the changing image and sequence content, which is especially critical in the presence of noise. Many methods for object tracking contribute to solve some difficulties of the object tracking problem. Few have considered real environments with multiple rigid or/and articulated objects, and limited solutions to the occlusion problem exist (examples are [70, 50]). These methods track objects after, and not during, occlusion. In addition, many methods are designed for specific applications [65, 40, 75] (e.g., tracking based on body parts models or vehicle models) or impose constraints regarding camera or object motion (e.g., upright motion) [70, 50]. In this Chapter, a method to track objects in the presence of multiple rigid or articulated objects is proposed. The algorithm is able to solve the occlusion problem in the presence of multiple crossing paths. It assigns pixels to each object in the occlusion process and tracks objects successfully during and after occlusion. There are no constraints regarding the motion of the objects and on camera position. Sample sequences used for evaluation are taken with different camera positions. Objects can move close to or far from the camera. When objects are close to the camera occlusion is stronger and is harder to resolve. 6.3 Non-linear object tracking by feature voting HVS-related considerations Visual processing ranges from low-level or iconic processes to the high-level or symbolic processes. Early vision is the fist stage of visual processing where elementary properties of images such as brightness are computed. It is generally agreed that early vision involves measurements of a number of basic image features such as color or motion. Tracking can be seen as an intermediate processing step between high-
139 Overall approach 113 level and low-level processing. Tracking is an active field of research that produces many methods, and despite attempts to make object tracking robust to mis-tracking, tracking is far from being solved under large image changes, such as rapid unexpected motions, changes in ambient illumination, and severe occlusions. The HVS can solve the task of tracking under strong ambiguous conditions. The HVS can balance various features and is successful in tracking. Therefore, it is important to orient tracking algorithms to the way or to what is known about how the HVS tracks objects. The study of the movement of the eyes gives some information about the way the HVS tracks objects. The voluntary movement of the human eyes can be classified into three categories. Saccade, smooth pursuit, and vergence [138, 104]. The movement of the eyes when jumping from one fixation point in space to another is called saccade. Saccade brings the image of a new visual target onto the fovea. This movement can be based on intention or reflection. When the human eye maintains a fixation point of a target moving at a moderate speed on the fovea, the movement is called smooth pursuit. The HVS uses multiple clues from the target, such as shape or motion, for a robust tracking of the target. Vergence movement adds depth by adjusting the eyes so that the optical axes keep intersecting on the same target while depth varies. This ensures that both eyes are fixed on the same target. This fixation is helped by disparity clues which play an important role. When viewing an image sequence, the HVS focuses mainly on moving objects and is able to coherently combine both spatial and temporal information to track objects robustly in a non-linear manner. In addition, the HVS is able to recover quickly from mis-tracking and continue successfully to track an object it has lost. The proposed tracking method is oriented to these properties of the HVS by focusing on moving objects,byintegrating multiple clues from the target, such as shape or motion, for a robust target tracking, and by using non-linear feature integration based on a two-step voting scheme. This scheme solves the correspondence problem and uses contour, region, and motion features. The proposed tracking aims at quick recovery in case an object is lost or is partly occluded Overall approach In a video shot, objects in one image are assumed to exist in successive images. In this case, temporally linking or tracking objects of one image to the objects of a subsequent image is possible. In the proposed tracking, objects are tracked based on the similarity of their features in successive images I(n) and I(n 1). This is done in four steps: object segmentation, motion estimation, object matching, and feature monitoring and correction (Fig. 6.1). Object segmentation and motion estimation extract objects and their features and represent them in an efficient form (discussed previously in Chapter 4-5 and Section 3.5). Then, using a voting-based feature integration, each
object O_p of the previous image I(n−1) is matched with an object O_i of the current image I(n), creating a unique correspondence M_i : O_p → O_i. This means that all objects in I(n−1) are matched with objects in I(n) (Fig. 6.2). M_i is a function that assigns to an object O_p at time n−1 an object O_i at time n. This function provides a temporal linkage between objects, which defines the trajectory of each object throughout the video and allows a semantic interpretation of the input video (Chapter 7). Finally, object segmentation errors and object occlusion are detected and corrected, and the new data are used to update the segmentation and motion estimation steps. For example, the error correction steps can produce new objects after detecting occlusions; motion estimation and tracking then need to be performed for these new objects. Each tracked object is assigned a number to represent its identity throughout the sequence. This is important for event detection applications, as will be shown in Chapter 7.

Figure 6.1: Block diagram of the proposed tracking method: object segmentation and motion estimation on I(n) and I(n−1), object matching by feature integration based on voting, and monitoring and correction of object occlusion and segmentation errors (including region merging), which feeds back to update the earlier steps and yields the object trajectories and temporal links.

Solving the correspondence problem in ambiguous conditions is the challenge of object tracking; the important goal is not to lose any objects while tracking. Ambiguities arise in the case of multiple matches, when one object corresponds to several objects, or in the case of a zero match M_0 : O_p → ∅, when an object O_p cannot be matched to any object in I(n) (Fig. 6.2). This can happen, for example, when objects split, merge, or are occluded. Further ambiguity arises when the appearance of an object varies from one image to the next. This can be a result of erroneous segmentation (e.g., holes caused by identical gray levels of background and objects), or of changes in lighting conditions or in viewpoint.

Object correspondence is achieved by matching single object features and then
combining the matches based on a voting scheme. Such feature-based solutions need to answer several questions concerning feature selection, monitoring, correction, integration, and filtering. Feature selection schemes define good features to match. Feature monitoring aims at detecting errors and at adapting the tracking process to these errors. Feature correction aims at compensating for segmentation errors during tracking, especially during occlusion. Feature integration defines ways to efficiently and effectively combine features. Feature filtering is concerned with ways to monitor, and eventually filter out, noisy features during tracking over time (Fig. 6.2).

Figure 6.2: The object correspondence problem: each object O_p of I(n−1) is matched, within a search area, to objects O_i and O_j of I(n); M_i and M_j denote candidate correspondences and M_0 a zero match.

In the following sections, strategies for feature selection, integration, and filtering are proposed, together with techniques to solve problems related to segmentation errors and ambiguities. Many object tracking approaches based on feature extraction assume that the object topology is fixed throughout the image sequence. In this thesis, the object to be tracked can be of arbitrary shape and can gradually change its topology throughout the image sequence. No prior knowledge is assumed and there are no object models. Tracking is activated once an object enters the scene. An entering object is immediately detected by the change detection module. The segmentation and motion estimation modules extract the relevant features for the correspondence module. While tracking objects, the segmentation module keeps looking for new objects entering the scene. Once an object is in the scene, it is assigned a new trajectory; objects that have no corresponding objects are assumed to be new (entering or appearing) and are assigned a new trajectory. Once an object leaves the scene, its trajectory ends. In the case of multiple object occlusion, the occlusion detection module first detects occluded and occluding objects, and then continues to track both types of objects even if they are completely invisible, i.e., their area is zero because no pixel can be assigned to them. This is done because such objects may reappear. Despite attempts to make object tracking robust to mis-tracking (for example, against background distractions), tracking can fail under large image changes, such as rapid unexpected motions, big changes in ambient illumination, and severe occlusions.
Many of these types of failures are unavoidable, and even the HVS cannot track objects under some conditions (for instance, when objects move very quickly). If mis-tracking cannot be avoided, the proposed tracking system is designed to at least recover tracking of an object it has lost.

6.3.3 Feature selection

In this thesis, four selection criteria are defined. First, unique features are used, i.e., features that do not change significantly over time; the choice is between two of the most important unique features of an object, motion and shape. Second, estimation-error-compensated feature representations are selected. It is known [111, 65] that different features are sensitive to different conditions (e.g., noise, illumination changes). For example, object boundaries are known to be insensitive to a range of illumination changes, while some region-based features, e.g., object area, are insensitive to noise [111, 65]. Furthermore, segmentation errors such as holes affect features such as area, but do not significantly affect the perimeter or a ratio such as H/W. Features based on both contours and regions are therefore used in the proposed matching procedure. Third, features within a finite area around the object to be matched are selected; this criterion limits matching errors. Finally, feature representations that balance real-time and effectiveness considerations are selected.

Based on these criteria, the following object features and representations are selected for the matching process (details in Section 3.5). The size and shape tests look at local spatial configurations, and their representations have error-compensated properties. The motion test looks at the temporal configuration and is one of the strongest unique features of objects. The distance test limits the feature selection to a finite area. In the case of multiple matches, the confidence measure helps compensate for matching errors. The representations of the features are fast to compute and to match and, as will be shown in the following, are effective in tracking objects even under multi-object occlusion¹ (a small computational sketch of these descriptors follows the list):

- Motion: direction δ = (δ_x, δ_y) and displacement w = (w_x, w_y).
- Size: area A, height H, and width W.
- Shape: extent ratio e = H/W, compactness c = A/(HW), and irregularity r = P²/(4πA).
- Distance: Euclidean distance d_i between the centroids of two objects O_p and O_i.
- Confidence measure: degree of confidence ζ_i of an established correspondence M_i : O_p → O_i (Eq. 6.2).

¹ For the reasons for not selecting color features, see Section 3.5.2.
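A brief computational sketch of the listed descriptors, assuming the area, height, width, perimeter, and centroid of a segmented object are available; the function names are illustrative.

```python
import math

def shape_features(area, height, width, perimeter):
    """Size and shape descriptors used in the voting tests: extent ratio
    e = H/W, compactness c = A/(H*W), and irregularity r = P^2 / (4*pi*A)."""
    e = height / width
    c = area / (height * width)
    r = perimeter ** 2 / (4.0 * math.pi * area)
    return e, c, r

def centroid_distance(c_p, c_i):
    """Euclidean distance between two object centroids (the distance test)."""
    return math.hypot(c_p[0] - c_i[0], c_p[1] - c_i[1])

print(shape_features(area=600, height=40, width=20, perimeter=130))
print(centroid_distance((120, 64), (117, 60)))
```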
Experimental results show that the use of this set of features gives stable results. Other features, such as texture, color, and spatial homogeneity, were tested but gave no additional stabilization of the algorithm on the tested video sequences.

6.3.4 Feature integration by voting

In a multi-feature-based matching of two entities, an important question is how to combine features for stable tracking. Three requirements are of interest here. First, a linear combination would not take into account the non-linear properties of the HVS. Second, when features are combined, their collective contribution is considered and the distinguishing power of a single feature becomes less effective; it is therefore important to balance the effectiveness of single features and multiple features, which increases the spatial tracking accuracy. Third, the quality of features can vary throughout the sequence; features should therefore be monitored over time and eventually excluded from the matching process. This requirement aims at increasing the temporal stability of the matching process.

The proposed solution (Fig. 6.3) takes these observations into account by combining spatial and temporal features using a non-linear voting scheme with two steps: voting for object matching (object voting) and voting between two object correspondences (correspondence voting). The second step is applied in the case of multiple matches to find the better one. Each voting step is first divided into m sub-matches with m object features. Since features can become noisy or occluded, m varies spatially (over objects) and temporally (throughout the image sequence) depending on a spatial and temporal filtering (cf. Section 6.3.7). Each sub-match m_i is then performed separately using the appropriate test function. Each time a test is passed, a similarity variable s is increased to contribute one or more votes; if a test is failed, a non-similarity variable d is increased to contribute one or more votes. Finally, a majority rule compares the two variables and decides the final vote. The simplicity of this two-step non-linear feature combination, which uses distinctive features oriented on properties of the HVS (Section 6.3.1), provides a good basis for fast and efficient matching, as illustrated in the results (Section 6.4).

In the case of a zero match for O_i, i.e., no object in I(n−1) can be matched to O_i, a new object is declared entering or appearing in the scene, depending on its location. In the case of a reverse zero match for O_p, i.e., O_p cannot be matched to any object in I(n), O_p is declared disappearing from or exiting the scene, depending on its location. Both cases are treated in more detail in Chapter 7.

The voting system requires the definition of some thresholds. These thresholds are important to allow for variations due to feature estimation errors. The thresholds are adapted to the image and object size (see Section 6.3.7).
Figure 6.3: Two-step object matching by voting: for each object O_p of I(n−1) and each object O_i of I(n) lying in its search area, object similarity is computed by voting; if O_p already has a correspondence, a correspondence vote compares the feature deviations of the two candidate correspondences and either keeps the old correspondence or replaces it with the better one.

Object voting

In this step, three main feature tests are used: shape, size, and motion tests. The shape and size tests each include three sub-tests, and the motion test two sub-tests. This design avoids cases where one failing feature causes the tracking to be lost (especially in the case of occlusion).

Definitions:
- O_p: an object of the previous image I(n−1),
- O_i: the i-th object of the current image I(n),
- M_i : O_p → O_i a correspondence, and ¬M_i the non-correspondence between O_p and O_i,
- d_i: the distance between the centroids of O_p and O_i,
- t_r: the radius of a search area around O_p,
- w_i = (w_xi, w_yi): the estimated displacement of O_p relative to O_i,
- w_max: the maximal possible object displacement; this depends on the application, for example 15 < w_max < 32,
- s: the similarity count between O_p and O_i,
- d: the dissimilarity count between O_p and O_i,
- s++: an increase of s, and d++ of d, by one vote.

Then

M_i  : (d_i < t_r) ∧ (|w_xi| < w_max) ∧ (|w_yi| < w_max) ∧ (ζ > t_m)
¬M_i : otherwise,        (6.1)

with the vote confidence ζ = s/d and t_m a real-valued threshold larger than, for example, 0.5. M_i is accepted if O_i lies within the search area of O_p, its displacement is not larger than the maximal displacement, and both objects are similar, i.e., s/d > t_m. This rule is used instead of the majority rule (i.e., s > d) to allow the acceptance of M_i even if s < d. This is important when objects are occluded, where some features are significantly dissimilar, which might otherwise cause the rejection of a good correspondence. Note that this step is followed by a correspondence step, so no error is introduced by accepting correspondences between eventually dissimilar objects.

For each correspondence M_i : O_p → O_i, a confidence measure ζ_i, which measures the degree of certainty of M_i, is used, defined as follows:

ζ_i = (d − s)/v : s/d < t_m
    = (s − d)/v : s/d > t_m,        (6.2)

where v is the total number of feature votes. To compute s and d, the following feature votes are applied, where t_z < 1 and t_s < 1 are functions of the image and object sizes (see Eq. 6.18).

Size vote

Let r_ai = A_p/A_i if A_p ≤ A_i and A_i/A_p if A_p > A_i, r_hi = H_p/H_i if H_p ≤ H_i and H_i/H_p if H_p > H_i, and r_wi = W_p/W_i if W_p ≤ W_i and W_i/W_p if W_p > W_i, where A_i, H_i, and W_i are the area, height, and width of an object O_i (see Section 3.5.2). Then

s++ : (r_ai > t_z) ∧ (r_hi > t_z) ∧ (r_wi > t_z)
d++ : (r_ai ≤ t_z) ∧ (r_hi ≤ t_z) ∧ (r_wi ≤ t_z).        (6.3)
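A minimal sketch of the size vote of Eq. 6.3; the dictionary-based object representation and the value of t_z are illustrative assumptions.

```python
def ratio(a, b):
    """Symmetric similarity ratio in (0, 1]; 1 means identical values."""
    return min(a, b) / max(a, b)

def size_vote(obj_p, obj_i, t_z=0.7):
    """Cast the size vote of the object-voting step (Eq. 6.3): one similarity
    vote if area, height and width are all similar, one dissimilarity vote if
    all clearly differ, and no vote otherwise. Objects are dicts with keys
    'A', 'H', 'W'."""
    r_a = ratio(obj_p['A'], obj_i['A'])
    r_h = ratio(obj_p['H'], obj_i['H'])
    r_w = ratio(obj_p['W'], obj_i['W'])
    if r_a > t_z and r_h > t_z and r_w > t_z:
        return 1, 0          # (similarity votes, dissimilarity votes)
    if r_a <= t_z and r_h <= t_z and r_w <= t_z:
        return 0, 1
    return 0, 0

print(size_vote({'A': 600, 'H': 40, 'W': 20}, {'A': 560, 'H': 38, 'W': 19}))  # -> (1, 0)
```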
Shape vote

Let e_p (e_i), c_p (c_i), r_p (r_i) be, respectively, the extent ratio, compactness, and irregularity of the shape of O_p (O_i) (see Section 3.5.2), d_ei = |e_p − e_i|, d_ci = |c_p − c_i|, and d_ri = |r_p − r_i|. Then

s++ : (d_ei ≤ t_s) ∧ (d_ci ≤ t_s) ∧ (d_ri ≤ t_s)
d++ : (d_ei > t_s) ∧ (d_ci > t_s) ∧ (d_ri > t_s).        (6.4)

Motion vote

Let δ_p = (δ_xp, δ_yp) and δ_c = (δ_xc, δ_yc) be, respectively, the previous and current motion directions of O_p. Then

s++ : (δ_xc = δ_xp) ∧ (δ_yc = δ_yp)
d++ : (δ_xc ≠ δ_xp) ∧ (δ_yc ≠ δ_yp).        (6.5)

Correspondence voting

Recall that all objects of I(n−1) are matched to all objects of I(n): first, each object O_p ∈ I(n−1) is matched to each object O_i ∈ I(n). This may result in multiple matches for one object, for example (M_pi : O_p → O_i and M_pj : O_p → O_j), or (M_pi : O_p → O_i and M_qi : O_q → O_i) with O_p, O_q ∈ I(n−1) and O_i, O_j ∈ I(n). If the final correspondence vote results in a tie, i.e., two objects of I(n) are matched with the same object in I(n−1), or two objects of I(n−1) are matched with the same object in I(n), plausibility rules are applied to resolve this case; these are explained in detail later in this chapter. To decide which of two correspondences is the right one, the following vote is applied. Let s_i (s_j) be a number that describes the suitability of M_i (M_j). Then

M_i : s_i > s_j
M_j : s_i ≤ s_j.        (6.6)

A voting majority rule is applied in this step. To compute s_i and s_j, the following votes are applied, where t^k_z < 1, t^k_s < 1, and t^k_d > 1 are functions of the image and object sizes (see Eq. 6.18). In the following, the index k denotes a vote for correspondence.

Distance vote

Let d_i (d_j) be the distance between O_p and O_i (O_j), and let d^k_d = |d_i − d_j|. Then

s_i++ : (d^k_d > t^k_d) ∧ (d_i < d_j)
s_j++ : (d^k_d > t^k_d) ∧ (d_i > d_j).        (6.7)

The aim of the condition d^k_d > t^k_d is to ensure that the vote is applied only if the two features really differ. If the features do not differ, then neither s_i nor s_j is increased.
Confidence vote

Let d_ζ = |ζ_i − ζ_j|. Then

s_i++ : (d_ζ > t_ζ) ∧ (ζ_i > ζ_j)
s_j++ : (d_ζ > t_ζ) ∧ (ζ_i < ζ_j).        (6.8)

The condition d_ζ > t_ζ ensures that the vote is applied only if the two features differ significantly.

Size vote

Let d^k_a = |r_ai − r_aj|, d^k_h = |r_hi − r_hj|, and d^k_w = |r_wi − r_wj|. Then

s_i++ : ((d^k_a > t^k_z) ∧ (r_ai > r_aj)) ∨ ((d^k_h > t^k_z) ∧ (r_hi > r_hj)) ∨ ((d^k_w > t^k_z) ∧ (r_wi > r_wj))
s_j++ : ((d^k_a > t^k_z) ∧ (r_ai < r_aj)) ∨ ((d^k_h > t^k_z) ∧ (r_hi < r_hj)) ∨ ((d^k_w > t^k_z) ∧ (r_wi < r_wj)).        (6.9)

If the features do not differ, i.e., their difference is less than the threshold, then neither s_i nor s_j is increased.

Shape vote

Let d^k_e = |d_ei − d_ej|, d^k_c = |d_ci − d_cj|, and d^k_r = |d_ri − d_rj|. Then

s_i++ : ((d^k_e > t^k_s) ∧ (d_ei < d_ej)) ∨ ((d^k_c > t^k_s) ∧ (d_ci < d_cj)) ∨ ((d^k_r > t^k_s) ∧ (d_ri < d_rj))
s_j++ : ((d^k_e > t^k_s) ∧ (d_ei > d_ej)) ∨ ((d^k_c > t^k_s) ∧ (d_ci > d_cj)) ∨ ((d^k_r > t^k_s) ∧ (d_ri > d_rj)).        (6.10)

The vote is applied only if the two features differ significantly; otherwise neither s_i nor s_j is increased.

Motion vote

Direction vote: let δ_c = (δ_xc, δ_yc), δ_p = (δ_xp, δ_yp), and δ_u = (δ_xu, δ_yu) be, respectively, the current (i.e., between I(n) and I(n−1)), previous (i.e., between I(n−1) and I(n−2)), and past-previous (i.e., between I(n−2) and I(n−3)) motion directions of O_p. Let δ_i = (δ_xi, δ_yi) (δ_j = (δ_xj, δ_yj)) be the motion direction of O_p if it is matched to O_i (O_j). Then
s_i++ : ((δ_xi = δ_xc) ∨ (δ_xi = δ_xp) ∨ (δ_xi = δ_xu)) ∧ ((δ_yi = δ_yc) ∨ (δ_yi = δ_yp) ∨ (δ_yi = δ_yu))
s_j++ : ((δ_xj = δ_xc) ∨ (δ_xj = δ_xp) ∨ (δ_xj = δ_xu)) ∧ ((δ_yj = δ_yc) ∨ (δ_yj = δ_yp) ∨ (δ_yj = δ_yu)).        (6.11)

Displacement vote: let d_mi (d_mj) be the displacement of O_p relative to O_i (O_j) and d^k_m = |d_mi − d_mj|. Then

s_i++ : (d^k_m > t^k_m) ∧ (d_mi < d_mj)
s_j++ : (d^k_m > t^k_m) ∧ (d_mi > d_mj).        (6.12)

Here d^k_m > t^k_m means that the displacements have to differ significantly to be considered for voting. t^k_m is adapted to detect segmentation errors; for example, in the case of occlusion it is increased. It is also a function of the image and object size (see Eq. 6.18). The motion magnitude test can contribute more than one vote to the matching process: if s_i = s_j and the difference d^k_m is large, then s_i or s_j is increased by 1, 2, or 3 as follows:

s_i + 1 : (d^k_m < t^k_m_min) ∧ (d^k_m > t^k_m) ∧ (d_mi < d_mj)
s_i + 2 : (t^k_m_min < d^k_m < t^k_m_max) ∧ (d^k_m > t^k_m) ∧ (d_mi < d_mj)
s_i + 3 : (d^k_m > t^k_m_max) ∧ (d^k_m > t^k_m) ∧ (d_mi < d_mj)
s_j + 1 : (d^k_m < t^k_m_min) ∧ (d^k_m > t^k_m) ∧ (d_mi > d_mj)
s_j + 2 : (t^k_m_min < d^k_m < t^k_m_max) ∧ (d^k_m > t^k_m) ∧ (d_mi > d_mj)
s_j + 3 : (d^k_m > t^k_m_max) ∧ (d^k_m > t^k_m) ∧ (d_mi > d_mj).        (6.13)
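The graded displacement vote of Eqs. 6.12–6.13 can be sketched as follows; the sketch folds the 1–3 vote grading into one helper, and all threshold values are illustrative placeholders for the size-adaptive thresholds of Eq. 6.18.

```python
def displacement_vote(d_mi, d_mj, t_m=1.0, t_m_min=2.0, t_m_max=6.0):
    """Graded displacement vote between two candidate correspondences: the
    candidate that keeps the object displacement smaller receives 1-3 votes
    depending on how strongly the two displacements differ (Eqs. 6.12-6.13)."""
    s_i = s_j = 0
    diff = abs(d_mi - d_mj)
    if diff <= t_m:                # displacements do not differ enough: no vote
        return s_i, s_j
    if diff < t_m_min:
        votes = 1
    elif diff <= t_m_max:
        votes = 2
    else:
        votes = 3
    if d_mi < d_mj:
        s_i += votes
    else:
        s_j += votes
    return s_i, s_j

print(displacement_vote(2.0, 9.5))   # -> (3, 0)
```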
6.3.5 Feature monitoring and correction

Since achieving perfect object segmentation is a difficult task, it is likely that a segmentation algorithm outputs erroneous results. Robust tracking based on object segmentation should therefore take possible errors into account and try to correct or compensate for their effects. Three types of error are of importance: object merging due to occlusion (Fig. 6.4), object splitting due to various artifacts (Fig. 6.6), and object deformation due to viewpoint changes or other changing conditions. Analysis of the displacements of the four MBB-sides allows the detection and correction of various object segmentation errors. This thesis detects and corrects these types of errors based on plausibility rules and prediction strategies as follows.

Correction of erroneous merging

Detection. Let O_p1, O_p2 ∈ I(n−1), M_i : O_p1 → O_i where O_i results from the occlusion of O_p1 and O_p2 in I(n), d_p12 be the distance between the centroids of O_p1 and O_p2, and w = (w_x, w_y) be the current displacement of O_p1, i.e., between I(n−2) and I(n−1); recall (Section 5.4.2) that d_rmax (d_rmin) is the vertical displacement of the lower (upper) row and d_cmax (d_cmin) the horizontal displacement of the right (left) column of O_p1. Object occlusion is declared if

((|w_y − d_rmax| > t_1) ∧ (d_rmax > 0) ∧ (d_p12 < t_2)) ∨
((|w_y − d_rmin| > t_1) ∧ (d_rmin > 0) ∧ (d_p12 < t_2)) ∨
((|w_x − d_cmax| > t_1) ∧ (d_cmax > 0) ∧ (d_p12 < t_2)) ∨
((|w_x − d_cmin| > t_1) ∧ (d_cmin > 0) ∧ (d_p12 < t_2)),        (6.14)

where t_1 and t_2 are thresholds. If occlusion is detected, both the occluding and occluded objects are labeled with a special flag. This labeling enables the system to continue tracking both objects in the subsequent images even if they are completely invisible; tracking invisible objects is important since they might reappear. The labeling further helps to detect occlusion even when the occlusion conditions in Eq. 6.14 are not met.

Figure 6.4: Object occlusion: large outward displacement of an MBB-side (here of the minimum row) as two objects O_p1 and O_p2 of I(n−1) merge into one object O_i in I(n).

Correction by object prediction. If occlusion is detected, the merged object O_i is split into two objects. This is done by predicting both O_p1 and O_p2 onto I(n) using the following displacement estimates:

d_p1 = (MED(d¹_xc, d¹_xp, d¹_xu), MED(d¹_yc, d¹_yp, d¹_yu))
d_p2 = (MED(d²_xc, d²_xp, d²_xu), MED(d²_yc, d²_yp, d²_yu)),        (6.15)

with MED representing a 3-tap median filter, d¹_xc (d¹_yc), d¹_xp (d¹_yp), and d¹_xu (d¹_yu) denoting, respectively, the current, previous, and past-previous horizontal (vertical) displacements of O_p1, and d²_xc (d²_yc), d²_xp (d²_yp), and d²_xu (d²_yu) the current, previous, and past-previous horizontal (vertical) displacements of O_p2.
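A minimal sketch of the median-based prediction of Eq. 6.15, applied per object and per axis; the function name and the usage values are illustrative.

```python
import statistics

def predict_displacement(dx_history, dy_history):
    """Project an occluded object onto the current image using the 3-tap median
    of its current, previous, and past-previous per-axis displacements (Eq. 6.15)."""
    return (statistics.median(dx_history[-3:]), statistics.median(dy_history[-3:]))

# An outlier in the current estimate (e.g., caused by the merged segment) is suppressed:
print(predict_displacement([3, 4, 11], [1, 1, 0]))   # -> (4, 1)
```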
Figure 6.5: Two examples of tracking two objects during occlusion (zoomed in).

After splitting the occluded and occluding objects, the list of objects of I(n) is updated, for example by adding O_p2. A feedback loop then estimates the correspondences for any newly added objects (Fig. 6.1). Two examples of object occlusion detection and correction are shown in Fig. 6.5. The scene shows two objects moving and then occluding each other. The change detection module provides one segment for both objects, but the tracking module is able to correct the error and track the two objects also during occlusion. Note that in the original images of these examples (Fig. 6.20) the objects appear very small and pixels are missing or misclassified due to some difficulties of the change detection module; nevertheless, most pixels of the two objects are correctly classified and tracked.

Correction of erroneous splitting

Detection. Assume an object O_p ∈ I(n−1) is split in I(n) into two objects O_i1 and O_i2, where O_i2 has no correspondence in I(n−1). Let M_i : O_p → O_i1, d_i12 be the distance between the centroids of O_i1 and O_i2, and w = (w_x, w_y) be the current displacement of O_p, i.e., between I(n−2) and I(n−1). Then object splitting is declared if

((|w_y − d_rmax| > t_1) ∧ (d_rmax < 0) ∧ (d_i12 < t_2)) ∨
((|w_y − d_rmin| > t_1) ∧ (d_rmin < 0) ∧ (d_i12 < t_2)) ∨
((|w_x − d_cmax| > t_1) ∧ (d_cmax < 0) ∧ (d_i12 < t_2)) ∨
((|w_x − d_cmin| > t_1) ∧ (d_cmin < 0) ∧ (d_i12 < t_2)).        (6.16)
If splitting is detected, the two object regions O_i1 and O_i2 are merged into one object O_i (Section 6.3.6). After merging the two regions, the features of O_i and the match M_i : O_p → O_i are updated (Fig. 6.1).

Figure 6.6: Object splitting: large inward displacement of an MBB-side, as an object O_p of I(n−1) is split into two regions O_i1 and O_i2 in I(n).

Compensation of deformation and other errors

Detection. Let M_i : O_p → O_i. If

(|w_y − d_rmax| > t_1) ∨ (|w_y − d_rmin| > t_1) ∨ (|w_x − d_cmax| > t_1) ∨ (|w_x − d_cmin| > t_1)        (6.17)

and neither occlusion nor splitting is detected as described in Eqs. 6.14 and 6.16, then object deformation or another, unknown segmentation error is assumed and the object displacement estimate is adapted as described in Chapter 5.

6.3.6 Region merging

Objects throughout a video exhibit specific homogeneity measures, such as motion or texture. Segmentation methods may nevertheless divide one homogeneous object into several regions. The reason is twofold: optical, i.e., noise, shading, illumination change, and reflection; and physical, when an object includes regions with different features (for example, human body parts have different motion). When a segmentation method assumes homogeneity based on a single feature, or when it does not take optical errors into account, it will fail to extract objects correctly. Region merging is thus an unavoidable step in segmentation methods. It is a process where regions are compared to determine whether they can be merged into one or more objects [61, 118]. It is desirable because sub-regions complicate object-oriented video analysis and interpretation, and merging may reduce the total number of regions, which improves performance if applied correctly.
Regions can be merged based on i) spatial homogeneity features such as texture or color, ii) temporal features such as motion, or iii) geometrical relationships. Examples of such geometrical relationships are inclusion, e.g., one region is included in another region, and size ratio, i.e., one region is significantly larger than the other. If, for example, a region lies inside another region and is significantly smaller, it may be merged with it if the two regions show similar characteristics such as motion.

This thesis develops a different merging strategy that is based on geometrical relationships, temporal coherence, and the matching of objects, rather than on single local features such as motion or size. Assume an object O_p ∈ I(n−1) is split in I(n) into two sub-regions O_i1 and O_i2 (Fig. 6.6), and assume the matching process matches O_p with O_i1. Then O_i1 and O_i2 are merged into O_i if all the following conditions are met:

- Equation 6.16 applies.
- Object voting gives M_i : O_p → O_i with a lower vote confidence ζ, i.e., ζ > t_m_merge with t_m_merge < t_m (Eq. 6.1).
- If a split is found at one side of the MBB (based on Eq. 6.16), the displacements of the three other MBB-sides of O_p do not change significantly when the two regions are merged.
- O_i1 is spatially close to O_i2 and O_i2 to O_p; for example, in the case of a downward split as shown in Fig. 6.7, all the distances d, d_nc, d_xc, and d_xr have to be small.
- The geometrical features (size, height, and width) of the merged object O_i = O_i1 + O_i2 match the geometrical features of O_p; for example, t_min < A_p/A_i < t_max, with thresholds t_min, t_max.
- The motion direction of O_p does not change significantly if it is matched to O_i.

Figure 6.7: Merging example: O_i1 and O_i2 are spatially close to each other and to O_p (distances d_nc, d_xc, d_r, and d_h).

This simple merging strategy has proven to be powerful in various simulations. The good performance is due to the close coupling of the tracking and merging processes: each process supports the other based on restricted rules that aim at limiting the result of the merging.
Figure 6.8: Performance of the proposed region merging. (a) An object in I(60) is split in two. (b) Video objects after matching and merging. (c) Objects in I(191). (d) Video objects in I(191) after matching and merging.

Any false merging remains confined within the object MBB. It is preferable to leave objects unmerged rather than to merge different objects, which would complicate tracking. The advantage of the proposed merging strategy over known merging techniques (cf. [61, 118]) is that it is based on temporal coherence through the tracking process and not on simple features such as motion or texture. Fig. 6.8 shows examples of the good performance of the merging process; the method is also successful when multiple small objects are close to the split object (Fig. 6.8(d)).

6.3.7 Feature filtering

A good object matching technique must take noise and estimation errors into account. Due to various artifacts, the extraction of object features is not perfect (cf. Section 6.1). A key idea in the proposed matching process is to filter features between two images and throughout the image sequence for robust tracking. This is done by ignoring features that become noisy or occluded across images. With such a model it is possible to discriminate between good and noisy feature estimates, and to ignore estimates taken while the object of interest is occluded. This means that the features used for tracking are well-conditioned, i.e., two features cannot differ by several orders of magnitude. The following plausibility filtering rules are applied:

Error allowance: Feature deviations are possible and should be allowed for. The error allowance should, however, be a function of the object size, because small objects are more sensitive to image artifacts than larger ones. If two objects are small, a small error allowance is selected; if the objects are large, a larger error allowance can be tolerated.
This follows the way the HVS perceives differences between objects as a function of object size: in small objects, a difference of a few percent in the number of pixels is significant, whereas in large objects a small deviation may not be perceived as significant. Therefore, the thresholds of the feature tests used above should be a function of the object size. This adaptation to the object size allows a finer distinction at smaller sizes and a stronger matching at larger sizes. The adaptation of the thresholds to the object size is done in a non-linear way (a small sketch of this adaptation is given at the end of this section). For example,

t_s = t_smin : A ≤ A_min
    = f(A)   : A_min < A ≤ A_max
    = t_smax : A > A_max,        (6.18)

where the form of the function f(A) depends on the application. Various forms of f(A) are possible (see Fig. 4.7 in Section 4.4.3); in the simulations a linear f(A) was used. The values of A_min and A_max are determined experimentally, but they can be changed for specific applications. For example, in applications where objects appear small, as in the sequence Urbanicade (Fig. 6.20), these values should be set low. However, in all simulations in this thesis the same parameters were chosen for all test sequences, which shows the stability and good performance of the proposed framework.

Error monitoring: To monitor the quality of a feature over time, the dissimilarity of the feature between two successive images is examined; when it grows large, the feature is not included in the matching process. If two correspondences (M_i and M_j) of the same object O_p have a similar feature deviation, then this feature is excluded from the voting process: for example, the shape irregularity if d_r = |r_ip − r_jp| < t_r, with r_ip = r_p/r_i and r_jp = r_p/r_j (definitions of r_ip and r_jp are given in Section 3.5.2).

Matching consistency: Objects are tracked once they enter the scene and also during occlusion; this is important for activity analysis. Object correspondence is performed only if the estimated motion directions are consistent. If, after applying the correspondence voting scheme, two objects of I(n−1) are matched with the same object in I(n), the match with the oldest object (i.e., with the longer trajectory) is selected. If, due to splitting, two objects of I(n) are matched with the same object in I(n−1), the match with the largest size is selected. If, during the matching process or after object separation due to occlusion, a better correspondence than a previous one is found, the matching is revised, i.e., the previous correspondence is removed and the new one is established (Fig. 6.2). Due to the fault-tolerance and correction strategy integrated into the tracking, objects that split into disjoint regions or change their topology over time can still be tracked and matched with the most similar object. An object consisting of a set of disjoint regions in I(n−1) that becomes connected in I(n) is tracked and matched with the object region most similar to the newly formed object in I(n).
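A small sketch of the size-adaptive threshold of Eq. 6.18 with a linear f(A), as used in the simulations; all numeric constants here are illustrative placeholders rather than the experimentally determined values of the thesis.

```python
def adaptive_threshold(area, a_min=200, a_max=5000, t_min=0.05, t_max=0.25):
    """Size-adaptive test threshold (Eq. 6.18): a small allowance for small
    objects, a large allowance for large objects, and a linear f(A) in between."""
    if area <= a_min:
        return t_min
    if area > a_max:
        return t_max
    # linear interpolation between (a_min, t_min) and (a_max, t_max)
    return t_min + (t_max - t_min) * (area - a_min) / (a_max - a_min)

print(adaptive_threshold(200), adaptive_threshold(2600), adaptive_threshold(9000))
```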
6.4 Results and discussions

Computational cost

The proposed tracking takes between 0.01 and 0.21 seconds to link the objects between two successive images. As can be seen in Table 6.1, the main computational cost goes to detecting and correcting segmentation errors such as occlusion. Object projection and separation require some additional computation in the case of large objects or when multiple objects are occluded.

Table 6.1: Tracking computation cost in seconds on a SUN-SPARC workstation, broken down into object matching, object correction, and motion estimation.

Experimental evaluation

Few of the object tracking methods presented in the literature have considered real environments with multiple rigid and/or articulated objects, and only limited solutions to the occlusion problem exist; these methods track objects after, and not during, occlusion. In addition, many methods are designed for specific applications (e.g., tracking based on body-part models or vehicle models) or impose constraints on camera or object motion (e.g., upright motion). The proposed method is able to solve the occlusion problem in the presence of multiple crossing paths. It assigns pixels to each object during the occlusion and tracks objects successfully during and after occlusion. There are no constraints regarding the motion of the objects or the camera position; the sample sequences used for evaluation are taken with different camera positions.

In the following, simulation results of the proposed tracking method applied to widely used video shots (10 shots containing a total of 6371 images) are presented and discussed. Indoor, outdoor, and noisy real environments are considered. The shown
results illustrate the good performance and robustness of the proposed approach even in noisy images. This robustness is due to the non-linear behavior of the algorithm and to the use of plausibility rules for tracking consistency, object occlusion detection, and other segmentation errors. The good performance of the proposed tracking is shown in three ways:

Tracking in non-successive images: The robustness of the proposed tracking can be clearly demonstrated by tracking objects in non-successive images. As can be seen in Fig. 6.9, the objects are robustly tracked even when five images have been skipped.

Visualization of the trajectory of objects: To illustrate the temporal tracking consistency of the proposed algorithm, the estimated trajectory of each object is plotted as a function of the image number (a plotting sketch is given below). Such a plot illustrates the reliability of both the motion estimation and tracking methods and allows the analysis and interpretation of the behavior of an object throughout the video shot. For example, the trajectories in Fig. 6.10 show that various objects enter the scene at different times. Two objects (O_4 and O_2) are moving quickly (note that their trajectory curves increase rapidly). In Fig. 6.12, the video analysis extracts three objects. Two objects enter the scene in the first image while the third object enters around the 70th image. O_1 moves horizontally to the left and vertically down, O_2 moves horizontally to the right and vertically up, and O_5 moves quickly to the left. While objects undergoing straightforward motion (for example, not stopping or depositing something) are easy to follow and interpret, the motion and behavior of persons performing actions are not. For example, in Fig. 6.14, a person enters the scene and removes an object, and in Fig. 6.13, a person enters, moves, deposits an object, and meanwhile changes direction. As can be seen, the trajectories of these two persons are complex.

Selection of tracking samples throughout the video: Figures 6.17-6.20 show sample tracking results throughout various test sequences. These results show the reliability of the proposed method in the case of occlusion (Figs. 6.19 and 6.20), object scale variations (Fig. 6.20), and local illumination changes and noise (Figs. 6.18 and 6.17).

The three evaluation methods show the reliability of both the motion estimation and the tracking. Their output allows the detection of events by analyzing the behavior of objects throughout a video shot. This can be done in an intuitive and straightforward manner, as shown in Section 7.3.
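Trajectory plots such as those in Figs. 6.10-6.16 can be produced directly from the stored centroid lists. The following is a minimal sketch (assuming matplotlib is available; the data layout is illustrative):

```python
import matplotlib.pyplot as plt

def plot_trajectories(trajectories):
    """trajectories: dict mapping object ID -> list of (image_no, cx, cy) centroids."""
    fig, (ax_xy, ax_x, ax_y) = plt.subplots(1, 3, figsize=(12, 4))
    for obj_id, points in trajectories.items():
        n = [p[0] for p in points]
        cx = [p[1] for p in points]
        cy = [p[2] for p in points]
        ax_xy.plot(cx, cy, label=f"ObjID {obj_id}")   # path in the image plane
        ax_x.plot(n, cx, label=f"ObjID {obj_id}")     # horizontal position vs. image number
        ax_y.plot(n, cy, label=f"ObjID {obj_id}")     # vertical position vs. image number
    ax_xy.set_xlabel("x"); ax_xy.set_ylabel("y")
    ax_x.set_xlabel("Img No."); ax_x.set_ylabel("x")
    ax_y.set_xlabel("Img No."); ax_y.set_ylabel("y")
    ax_xy.legend()
    plt.tight_layout()
    plt.show()
```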
6.5 Summary and outlook

The main issue in tracking systems is reliability in the case of shadows, occlusion, and object splitting. Few of the tracking methods presented in the literature have considered such real environments. Many methods impose constraints regarding camera or object motion (e.g., upright motion). This chapter developed a robust object tracking method. It is based on a non-linear voting system that solves the problem of multiple correspondences. The occlusion problem is alleviated by a simple detection procedure based on the displacements of the objects, followed by a median-based prediction procedure (sketched below), which provides a reasonable estimate for (partially or completely) occluded objects. Objects are tracked once they enter the scene and also during occlusion. This is important for activity analysis. Plausibility rules for consistency, error allowance, and monitoring are proposed for accurate tracking over long periods of time. An important contribution of the proposed tracking is the reliable region merging, which improves the performance of the whole video algorithm. A possible extension of this method is the tracking of objects that move in and out of the scene.

The proposed algorithm has been developed for content-based video applications such as surveillance or indexing and retrieval. Its properties can be summarized as follows:

Both rigid (e.g., vehicles) and articulated (e.g., human) objects can be tracked. Since in real-world scenes articulated objects may contain a large number of rigid parts, estimation of their motion parameters may result in huge computation (e.g., to solve a very large set of non-linear equations).

The algorithm is able to handle several objects simultaneously and to adapt to their occlusion or crossing.

No template or model matching is used; instead, simple rules are used that are largely independent of object appearance (e.g., matching based purely on the object shape and not on its image content). Further, the technique does not require any trajectory model.

A confidence measure is maintained over time until the system is confident about the correct matching (especially in the case of occlusion).

A simple motion estimation guides the tracking without requiring predictive temporal filtering (e.g., Kalman filtering).

The tracking procedure is independent of how objects are segmented. Any advance in object segmentation will enhance the final results of the tracking but will not influence the way the tracking works.

There is no constraint regarding the motion of objects or the camera position. Sample sequences used for evaluation are taken with different camera positions. Objects can move close to or far from the camera.
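The median-based prediction used for occluded objects can be sketched as follows (a minimal illustration; it assumes that at least two recent centroids are stored per object):

```python
import statistics

def predict_occluded_centroid(recent_centroids):
    """Predict the next centroid of an occluded object from the median of its
    recent per-image displacements (a sketch of the median-based prediction idea)."""
    dxs = [b[0] - a[0] for a, b in zip(recent_centroids, recent_centroids[1:])]
    dys = [b[1] - a[1] for a, b in zip(recent_centroids, recent_centroids[1:])]
    last_x, last_y = recent_centroids[-1]
    return (last_x + statistics.median(dxs), last_y + statistics.median(dys))
```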
Figure 6.9: Tracking results of the sequence Highway. To show the robustness of the tracking algorithm, only one in every five images has been used. This shows that the proposed method can track objects that move fast.
Figure 6.10: The trajectories of the objects in the sequence Highway, where StartP represents the starting point of a trajectory. The upper plot gives the trajectories of the objects in the image plane while the two other plots give the trajectories in the vertical and horizontal directions (y and x versus image number) separately. This allows an interpretation of the object motion behavior throughout the sequence. For example, O_2 starts at the left of the image, moves, and stops at the edge of the highway. The plots show how this object is present throughout the whole shot while the other objects start and disappear within the shot. Various vehicles enter the scene at different times. Some objects move quickly while others are slower. Objects are moving in both directions: away from the camera and towards the camera. The system tracks all objects reliably. See also Fig. 6.17.
Figure 6.11: The trajectories of the objects in the sequence Urbicande. Many persons enter and leave the scene. One person is walking around. Persons appear very small and occlude each other. The system is reliable even in the presence of multiple occluding objects. O_1 starts inside the image at (160,250) and moves around within the rectangle (300,160),(150,80). O_5, for example, starts at image 25 and moves left across the shot, reaching the other end of the image. The original sequence is rotated by 90° to the right to comply with the CIF format. See also Fig. 6.20.
Figure 6.12: The trajectories of the objects in the sequence Survey. Three persons enter the scene at different instants. The sequence includes reflections and other local image changes. The system is able to track the three objects before, during, and after occlusion. See also Fig. 6.19.
Figure 6.13: The trajectories of the objects in the sequence Hall. Two persons enter the scene. One of them deposits an object. This sequence contains noise and illumination changes. As can be seen, this shot includes complex object movements. For example, the person on the left side of the image enters, turns left, deposits an object, comes back a little, then moves briefly straight ahead before turning left and disappearing. See also Fig. 6.18.
Figure 6.14: The trajectories of the objects in the sequence Floort. An object is removed. The difficulty of this sequence is that it contains coding and interlacing artifacts. Furthermore, the illumination changes across the trajectories of the objects. The change detection splits some objects, but due to the robust merging based on the tracking results, the algorithm remains stable throughout the sequence and in the presence of errors.
Figure 6.15: The trajectories of the objects in the sequence Floorp. An object is deposited. The shadows are a main concern: they complicate the detection of the correct trajectory, but despite some small deviations the object trajectories are reliable.
Figure 6.16: The trajectories of the objects in the sequence Floor. An object is first deposited and then removed. This is a long sequence with complex movements, interlacing artifacts, illumination changes, and shadows. The trajectory is complex but the system is able to track the object correctly.
Figure 6.17: Tracking results of the Highway sequence. Each object is marked by an ID number and enclosed in its minimum bounding box. This sequence illustrates successful tracking in the presence of noise, scale changes, and illumination changes.
Figure 6.18: Tracking results of the Hall sequence. Each object is marked by an ID number and enclosed in its minimum bounding box. The algorithm works correctly despite the various local illumination changes and object shadows.
Figure 6.19: Tracking results of the Survey sequence. Each object is marked by an ID number and enclosed in its minimum bounding box. Despite the multi-object occlusion (O_1, O_2, and O_5), light changes, and reflections (e.g., on car surfaces), the algorithm stays stable. Because of the static traffic sign, the change detection divides one object into two regions; tracking is recovered properly.
Figure 6.20: Tracking results of the Urbanicade sequence. Each object is marked by an ID number and enclosed in its minimum bounding box. In this scene, objects are very small and experience illumination changes and object occlusions (O_1 & O_6, O_1 & O_8). However, the algorithm continues to track the objects correctly.
Chapter 7

Video Interpretation

7.1 Introduction

Computer-based interpretation of recorded scenes is an important step towards automated understanding and manipulation of scene content. Effective interpretation can be achieved through the integration of object and motion information. The goal in this chapter is to develop a high-level video representation system, useful for a wide range of video applications, that effectively and efficiently extracts semantic information using low-level object and motion features. The proposed system achieves its objective by extracting and using context-independent video features: qualitative object descriptors and events. Qualitative object descriptors are extracted by quantizing the low-level parametric descriptions of the objects. To extract events, the system monitors the change of motion and other low-level features of each object in the scene. When certain conditions are met, events related to these conditions are detected. Both indoor and outdoor real environments are considered.

7.1.1 Video representation strategies

The significant increase of video data in various domains requires effective ways to extract and represent video content. For most applications, manual extraction is not appropriate because it is costly and can vary between different users depending on their perception of the video. The formulation of rich automated content-based representations is, therefore, an important step in many video services. For a video representation to be useful for a wide range of applications, it must describe video content precisely and accurately, independently of context.

In general, a video shot conveys objects, their low-level and high-level features
within a given environment and context.¹ Video representation using solely low-level objects does not fully account for the meaning of a video. To fully represent a video, objects need to be assigned high-level features as well. High-level object features are generally related to the movement of objects and are divided into context-independent and context-dependent features. Features that have context-independent components include object movement, activity, and related events. Here, movement is the trajectory of the object within the video shot and activity is a sequence of movements that are semantically related (e.g., pitching a ball) [22]. Context-dependent high-level features include object action, which is the semantic feature of a movement related to a context (e.g., following a player) [22].

An event expresses a particular behavior of a finite set of objects in a sequence of a small number of consecutive images of a video shot. An event consists of context-dependent and context-independent components associated with a time and location. For example, a deposit event has a fixed semantic interpretation (an object is added to the scene) common to all applications, but the deposit of an object can have a variable meaning in different contexts. In the simplest case, an event is the appearance of a new object in the scene or the exit of an object from the scene. In more complex cases, an event starts when the behavior of objects changes.

An important issue in event detection is the interaction between interpretation and application. Data are subject to a number of different interpretations and the most appropriate one depends upon the requirements of the application. An event-oriented video content representation is complete only if it is developed in a specific context and application.

Figure 7.1: Video representation levels.

¹ A video sequence can also contain audio information. In some applications it is useful to use both audio and visual data to support interpretation. On the other hand, audio data is not always available. For example, in some countries the recording of sound by CCTV surveillance systems is outlawed. Therefore, it is important to be able to handle video retrieval and surveillance based on visual data.
Figure 7.2: Interpretation-based video representation: (a) structural representation; (b) conceptual representation.

To extract object features, three levels of video processing are required² (Fig. 7.1):

The video analysis level aims at the extraction of objects and their spatio-temporal, low-level, quantitative features.

The video interpretation level targets the extraction of qualitative and semantic features independent of context. Significant semantic features are events, which are extracted based on spatio-temporal low-level features.

The video understanding level addresses the recognition of behavior and actions of objects within the context of object motion in the video.

² Video representations based on global content, such as global motion, are needed in some applications. They can be combined with other representations to support different tasks, for example, in video retrieval [24].

Interpretation-based representations can be divided into structural and conceptual representations (Fig. 7.2). Structural representations use spatial, temporal, and relational features of the objects while conceptual representations use object-related events. This chapter addresses video interpretation (both structural and conceptual) for on-line video applications such as video retrieval and surveillance.

7.1.2 Problem statement

There is a debate in the field of video content representation as to whether low-level video representations are sufficient for advanced video applications such as video retrieval or surveillance. Some researchers question the need for high-level representations. For some applications, low-level video representation is an adequate tool and the cost of high-level feature computations can be saved. Studies have shown, however, that low-level features are not sufficient for effective video representation [81]. The main restriction of low-level representations is that they rely on the users to perform the high-level
abstractions [81, 124]. The systems in [116, 90] contribute a solution using relevance feedback mechanisms: they first interact with the user through low-level features and then learn from the user's feedback to enhance the system performance. Relevance feedback mechanisms work well for some applications that require small amounts of high-level data, such as the retrieval of texture images. Most users do not, however, extract video content based on low-level features solely, and relevance feedback mechanisms are not sufficient for effective automated representation in advanced applications.

The main difficulty in extracting high-level video content is the so-called semantic gap. It is the difference between the automatically extracted low-level features and the features extracted by humans in a given situation [124]. Humans look for features that convey a certain message or have some semantic meaning, but automatically extracted features describe the objects quantitatively. To close this gap, methods need to be developed for associating high-level semantic interpretation with extracted low-level data without relying completely on low-level descriptions to take decisions. For many applications, such as surveillance, there is no need to provide a fully semantic abstraction. It is sufficient to provide semantic features that are important to the users and similar to how humans find content.

Extracting semantic features for a wide range of video applications is important for high-level video processing, especially in costly applications where human supervision is needed. High-level content allows users to retrieve a video based on its qualitative description or to take decisions based on the qualitative interpretations of the video. In many video applications, there is a need to extract semantic features from video to enable a video-based system to understand the content of the video. The issue is: what level of video semantic features is appropriate for general video applications? For example, are high-level intentional descriptions, such as what a person is doing or thinking, needed? An important observation is that video context can change over time. It is thus important to provide a content representation that has fixed semantic features which are generally applicable to a wide range of applications. The question here is how to define fixed semantic video content and how to extract features suitable to represent it.

The purpose of video is, in general, to document events and activities made by objects or a group of objects. People usually look for video objects that convey a certain message [124] and they usually focus on and memorize [72, 56]: i) events, i.e., what happened, ii) objects, i.e., who is in the scene, iii) location, i.e., where did it happen, and iv) time, i.e., when did it happen. Therefore, a generally useful video interpretation should be able to:

take decisions on lower-level data to support subsequent processing levels,

qualitatively represent objects and their spatial, temporal, and relational features,
extract object semantic features that are generally useful, and

automatically and efficiently provide a response (e.g., real-time operation).

7.1.3 Related work

As defined, events include semantic primitives. Therefore, event recognition is widely studied in the artificial intelligence literature, where the focus is to develop formal theories and languages for the semantics and inference of actions and events ([23, 67]; for more references see [22, 81]). Dynamic scene interpretation has traditionally been quantitative and typically generates large amounts of temporal qualitative data. Recently, there has been increased interest in higher-level approaches to represent and reason with such data using structural and conceptual approaches. For example, the studies in [57, 34] focus on structural video representations based on qualitative reasoning methods.

Research in the area of detecting, tracking, and identifying people and objects has become a central topic in computer vision and video processing [22, 81, 105]. Research interest has shifted towards the detection and recognition of activities, actions, and events. Narrow-domain systems recognize events and actions, for example, in hand-sign applications or in Smart-Camera-based cooking (see the special section in [130], [81, 139, 22]). In these systems, prior knowledge is usually inserted into the event recognition inference system, and the focus is on the recognition and logical formulation of events and actions.

Some event-based surveillance application systems have also been proposed. In the context-dependent system in [86], the behavior of moving objects in an airborne video is recognized. The system compensates for global motion, tracks moving objects, and defines their trajectories. It uses geo-spatial context information to analyze the trajectories and detect likely scenarios such as passing or avoiding a checkpoint. In the context-dependent system in [14], events such as removal, sitting, or using a terminal are detected in a static room, and precise knowledge of the location of certain objects in the room is needed. Other examples and references to context-dependent video interpretation can be found in [29, 28].

The context-dependent system in [70] tracks several people simultaneously and uses appearance-based models to identify people. It determines whether a person is carrying an object and can segment the object from the person. It also tracks body parts such as the head or hands. The system imposes, however, restrictions on the object movements. Objects are assumed to move upright and with little occlusion. Moreover, it can only detect a limited set of events.

There has been little work on context-independent interpretation. The system in [38] is based on motion detection and tracking using prediction and nearest-neighbor matching. The system is able to detect basic events such as deposit. It can operate in
simple environments where one human is tracked and translational motion is assumed. It is limited to applications in indoor environments, cannot deal with occlusion, and is noise sensitive. Moreover, the definition of events is not widely applicable. For example, the event stop is defined as an object remaining in the same position for two consecutive images.

The interpretation system for indoor surveillance applications in [133] consists of object extraction and event detection modules. The event detection module classifies objects using a neural network. The classification includes: abandoned object, person, and object. The system is limited to a single abandoned-object event in unattended environments. The definition of abandoned object, i.e., remaining in the same position for a long time, is specific to a given application. Besides, the system cannot associate abandoned objects with the person who deposited them. The system is limited to surveillance applications in indoor environments.

7.1.4 Proposed framework

For an automated video interpretation to be generally useful, it must include features of the video that are context-independent and have a fixed semantic meaning. This thesis proposes an interpretation system that focuses on objects and their related events independently of the context of a specific application. The input of the video interpretation system is a low-level description of the video based on objects and the output is a higher-level description of the video based on qualitative object and event descriptions. With the information provided by the video analysis presented in Chapter 3, events are extracted in a straightforward manner. Event detection is performed by integrating object and motion features, i.e., combining trajectory information with spatial features, such as size and location (Fig. 7.3). Objects and their features are represented in temporally linked lists. Each list contains information about the objects. Information is analyzed as it arrives and events are detected as they occur.

An important feature of the proposed system is that it uses a layered approach. It goes from low-level to middle-level to high-level image content analysis to detect events. It integrates results of a lower level to support a higher level, and vice-versa. For example, low-level object segments are used in object tracking, and tracking is used to analyze these segments and eventually correct them.

In many applications, the location of an object or event is significant for decision making. To provide location information relevant for a specific application, a partition of the scene into areas of interest is required. In the absence of a specific application, this thesis uses two types of location specification (see Fig. 7.4). The first type specifies where the border of the scene is. The second type separates the image into nine sectors: center, right, left, up, down, left up, right up, left down, and right down.
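A minimal sketch of this location specification (assuming the nine sectors are obtained by dividing the image into thirds along each axis and that the border is a fixed margin in pixels; both choices are illustrative):

```python
def qualitative_location(cx, cy, width, height, border=10):
    """Map an object centroid to one of the nine sectors of Fig. 7.4 and flag
    centroids lying in the border region of the image."""
    at_border = cx < border or cy < border or cx >= width - border or cy >= height - border
    col = "left" if cx < width / 3 else ("right" if cx >= 2 * width / 3 else "center")
    row = "up" if cy < height / 3 else ("down" if cy >= 2 * height / 3 else "center")
    if col == "center" and row == "center":
        sector = "center"
    elif row == "center":
        sector = col
    elif col == "center":
        sector = row
    else:
        sector = f"{col} {row}"                # e.g., "left up", "right down"
    return sector, at_border
```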
Figure 7.3: Video interpretation: from video objects to events.

As a result, the proposed video interpretation outputs, at each time instant n of a video shot V, a list of objects with their features as follows:

Identity - a tag that uniquely identifies an object throughout the shot.

Low-level feature vector:
Location - where the object appears in the scene (initial and current).
Shape - (initial, average, and current).
Size - (initial, average, and current).
Texture - texture of the object.
Motion - where the object is moving (initial, average, and current).
Trajectory - the set of the estimated centroids of the object throughout the shot.

Object life span or age - the time interval over which the object is tracked.

Event descriptions - the location and behavior of the object.

Spatio-temporal relationships - relations to other objects in space-time.

Global information - global motion or dominant object.

A video shot, V, is thus represented by {(O, P_o), (E, P_e), (G)}, where O is a set of video objects throughout V, P_o is a set of features for each O_i in O, E is a set of events throughout V, P_e is a set of features of each event, and G is a set of global features of the shot.
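One possible way to hold this per-object and per-shot output is sketched below (a minimal illustration; the field set is abridged and the type names are placeholders):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class VideoObject:
    obj_id: int                                     # identity: unique tag throughout the shot
    age: int = 0                                     # life span: number of images tracked
    location: Tuple[float, float] = (0.0, 0.0)       # current centroid
    size: int = 0                                    # current area in pixels
    motion: Tuple[float, float] = (0.0, 0.0)         # current displacement (dx, dy)
    trajectory: List[Tuple[float, float]] = field(default_factory=list)
    events: List[str] = field(default_factory=list)  # event descriptions involving this object

@dataclass
class ShotRepresentation:
    objects: List[VideoObject] = field(default_factory=list)   # (O, P_o)
    events: List[Dict] = field(default_factory=list)            # (E, P_e)
    global_features: Dict = field(default_factory=dict)         # (G)
```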
Figure 7.4: Specification of object locations and directions (the image is divided into left/center/right and up/center/down sectors; the MBB of an object is delimited by its min. row, max. row, min. col, and max. col; a margin along the image edge is marked as the border region).

Objects and events are monitored as the objects enter the scene. Objects are linked to events by describing their relationship to the event. The representation (O, P_o) is defined in Section 7.2 and (E, P_e) in Section 7.3. A video application can then use this information to search or process raw video (as an example, see the query form in Appendix A, Fig. A.3).

In the following, let
V = {I(1), ..., I(N)} be the input video shot of N images,
I(n), I(k), I(l) ∈ V be images at time instant n, k, or l,
O_i, O_j be objects in I(n),
O_p, O_q be objects in I(n-1),
B_i the MBB of O_i,
g_i the age of O_i,
c_i = (c_xi, c_yi) the centroid of O_i,
d_ij the distance between the centroids of O_i and O_j,
r_min the upper row of the MBB of an object,
r_max the lower row of the MBB of an object,
c_min the left column of the MBB of an object, and
c_max the right column of the MBB of an object.
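Using this notation, the MBB (r_min, r_max, c_min, c_max) and the centroid c_i of an object can be computed directly from its pixel support; a minimal sketch (the pixel-list representation is an assumption):

```python
def mbb_and_centroid(pixels):
    """Minimum bounding box (r_min, r_max, c_min, c_max) and centroid c = (c_x, c_y)
    of an object given as a list of (row, col) pixel coordinates."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    r_min, r_max, c_min, c_max = min(rows), max(rows), min(cols), max(cols)
    c_x = sum(cols) / float(len(cols))
    c_y = sum(rows) / float(len(rows))
    return (r_min, r_max, c_min, c_max), (c_x, c_y)
```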
7.2 Object-based representation

In this section, qualitative descriptions of features of moving objects are developed. Spatial, temporal, and relational features are proposed.

7.2.1 Spatial features

Location - the position of object O_i in image I(n).

Qualitative: to permit users to specify qualitative object locations, the image is divided into nine sectors (Fig. 7.4). O_i is declared to be in the center (right, left, top, down, up left, up right, down left, down right, respectively) of the image if its centroid is located in the center (right, left, top, down, up left, up right, down left, down right, respectively) sector.

Quantitative: the position of an object is represented by the coordinates of its centroid c_i.

Size, shape, and texture:

Qualitative:
size descriptors: {small, medium, large}, {tall, short}, and {wide, narrow}.
shape descriptors: {solid, hollow} and {compact, jagged, elongated}. The classification of shape needs to be more precise depending on the application. In some applications, finer categories are needed to differentiate between objects, e.g., between a person and a vehicle or between one person and another.
texture descriptors: {smooth, grainy, mottled, striped}. The categorization of texture is also application-dependent.

Quantitative: see Chapter 3.

7.2.2 Temporal features

Motion is a key low-level feature of video. Suitable motion representation and description play an important role in high-level video interpretation. The interpretation of quantitative object and global motion parameters is needed in video retrieval applications to allow a user to search for objects or shots based on perceived object motion or camera motion.

Global motion

Since video shot databases can be very large, pruning of large video databases is essential for efficient video retrieval. This thesis suggests two methods for pruning:
pruning based on qualitative global motion and pruning based on dominant objects (cf. Section 7.4.2). In video retrieval, the user may be asked to specify the qualitative global motion or the dominant object of the shot. This is useful, since a user can better describe a shot based on its qualitative features than by specifying a parametric description.

Global motion estimation techniques represent global motion by a set of parameters based on a given motion model. Global motion estimation using an affine motion model is used to estimate the dominant global motion. The instantaneous velocity w of a pixel at position p in the image plane is given by a 6-parameter motion model with a = (a_1, a_2, a_3, a_4, a_5, a_6):

w(p) = \begin{pmatrix} a_1 \\ a_4 \end{pmatrix} + \begin{pmatrix} a_2 & a_3 \\ a_5 & a_6 \end{pmatrix} p \qquad (7.1)

In [25, 24], linear combinations of the parameters in a are analyzed to extract qualitative representations. For example, while a_1 and a_4 describe the translational motion, the linear combination (1/2)(a_2 + a_6) determines zoom. Rotation is expressed as the combination (1/2)(a_5 - a_3). If the dominant global motion is a pure pan, the only non-zero parameter is supposed to be a_1. In the case of zoom, the linear combination (1/2)(a_2 + a_6) is assumed to be the only non-zero parameter.

Object motion

The requirements on motion representation accuracy are not decisive in some applications, but low-complexity processing is essential. Retrieval and surveillance applications are examples. The main requirement here is the capture of basic motion characteristics and not the highest possible accuracy of the motion description. For such applications it may be sufficient to classify motion qualitatively (e.g., large translation) or within an interval (e.g., the motion magnitude lies within [4, 7]).

Motion quantization and classification

A classification of object motion includes translation, rotation, scale change, or a mixture of these motions. The motion estimation technique proposed in Chapter 5 classifies object motion into translation and non-translation. Scale change can be approximated easily by temporal analysis of the object size. Motion is represented by direction and speed {δ, w}. The speed w is quantized into four descriptions: {static, slow, moderate, fast}. The directions δ of the object motion are normalized and quantized into eight directions: down left, down right, up left, up right, left, right, down, and up.

Trajectory

For retrieval purposes, the trajectory (or path) of an object is needed to easily query video shots. Some examples are: objects crossing near, objects moving
left to right, and objects moving on the far side right to left. Once an object enters the scene, the tracking method assigns to it a new trajectory. When it leaves, the trajectory ends. Object trajectories are constructed from the coordinates of the centroids of the objects. These trajectories are saved and can be used to identify events or interesting objects, and to support object retrieval or statistical analysis (e.g., frequent use of a specific trajectory).

7.2.3 Object-relation features

Spatial relations

The following spatial object relationships are proposed:

Direction:
O_i is to the left of O_j if c_xi < c_xj.
O_i is to the right of O_j if c_xi > c_xj.
O_i is below O_j if c_yi > c_yj.
O_i is above O_j if c_yi < c_yj.
O_i is to the left of and below O_j if (c_xi < c_xj) ∧ (c_yi > c_yj).
O_i is to the left of and above O_j if (c_xi < c_xj) ∧ (c_yi < c_yj).
O_i is to the right of and below O_j if (c_xi > c_xj) ∧ (c_yi > c_yj).
O_i is to the right of and above O_j if (c_xi > c_xj) ∧ (c_yi < c_yj).

Containment:
O_i is inside O_j if O_i ⊂ O_j.
O_i contains O_j if O_j ⊂ O_i.

Distance:
O_i is near or close to O_j if d_ij < t_d and O_i ⊄ O_j.

Composite spatial relations, such as O_i is inside and to the left of O_j, can be easily detected. Also, features such as O_i is partially inside O_j are easily derived. (A sketch of these tests is given after the temporal relations below.)

Temporal relations

The following temporal object relationships are defined:

O_i starts after O_j if O_i enters or appears in the scene at I(n) and O_j at I(k) with n > k.
O_i starts before O_j if O_i enters or appears in the scene at I(n) and O_j at I(k) with n < k.
O_i and O_j start together if both enter or appear at the same I(n).
O_i and O_j end together if both exit or disappear at the same I(n).
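The direction and distance relations above can be tested directly on the object centroids. A minimal sketch (assuming image coordinates with x increasing to the right and y increasing downwards; containment would additionally require the object supports, which are not represented here):

```python
def spatial_relation(c_i, c_j):
    """Qualitative direction of object i relative to object j from their centroids."""
    (cx_i, cy_i), (cx_j, cy_j) = c_i, c_j
    horiz = "left of" if cx_i < cx_j else ("right of" if cx_i > cx_j else "")
    vert = "below" if cy_i > cy_j else ("above" if cy_i < cy_j else "")
    return " and ".join(p for p in (horiz, vert) if p) or "same position"

def near(c_i, c_j, t_d):
    """Distance-based closeness test between two centroids."""
    return ((c_i[0] - c_j[0]) ** 2 + (c_i[1] - c_j[1]) ** 2) ** 0.5 < t_d
```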
Possible extensions

The following object relations can also be compiled for the needs of video applications:

Closeness: behavior may involve more than one object and typically occurs between objects that are spatially close. The identification of spatial and/or temporal closeness features, such as next to, ahead, adjacent, or behind, is important for some applications. For example, the detection of objects that come close in a traffic scene is important for risk analysis. Objects are generally moving at varying speeds. A static notion of closeness is, therefore, not appropriate. Ideally, the closeness feature should be derived based on the velocities of objects and their distance to the camera.

Collision can be defined as: two objects occlude each other and then the shapes of both change drastically. If no 3-D data are available, real-world object collision can be approximated, for example, by the collision of the MBBs of the objects.

Near miss: objects come close but do not collide.

Estimation of time-to-collision based on the interpretation of the object motion and distance.

Relative direction of motion: same, opposing, or perpendicular.

7.3 Event-based representation

This thesis proposes perceptual descriptions of events that are common to a wide range of applications. Event detection is not based on the geometry of objects but on their features and relations over time. The thesis proposes approximate but efficient world models to define useful events. In many applications, approximate models, even if not accurate, are adequate. To define events, some thresholds are used which can be adapted to a specific application. For example, the event an object enters is declared when an object has been visible in the scene for some time, i.e., its age is larger than a threshold. Some applications require the detection of an enter event as soon as a small portion of the object is visible, while other applications require the detection of the event when the object is completely visible. In some applications, application-specific conditions concerning low-level features, such as size, motion, or age, need to be considered when detecting events. These conditions can be easily added to the proposed system.

To detect events, the proposed system monitors the behavior and features of each object in the scene. If specific conditions are met, events related to these conditions are detected. Analysis of the events is done on-line, i.e., events are detected as they occur. Specific object features, such as motion or size, are stored for each image and compared as the images of a shot arrive. The following low-level object features are
combined to detect events:

Identity (ID) - a tag to uniquely identify an object throughout the video.
Age - the time interval over which the object is tracked.
MBB - (initial, average, and current).
Area - (initial, average, and current).
Location - (initial and current).
Motion - (initial, average, and current).
Corresponding object - a temporal link to the corresponding object.

The following are the definitions of the events that the current system detects automatically. The proposed events are sufficiently broad for a wide range of video applications to assist the understanding of video shots. Other composite events can be compiled using this set of events to allow a more flexible event-based representation adapted to the needs of specific applications.

Enter

An object, O_i, enters the scene at time instant n if all the following conditions are met:
O_i ∈ I(n),
O_i ∉ I(n-1), i.e., zero match M_0: O_i, meaning O_i cannot be matched to any object in I(n-1), and
c_i is at the image border in I(n) (Fig. 7.4).³

Examples are given in Figs. This definition aims at detecting object entrance as soon as a portion of the object becomes visible. In some applications, only entering objects of specific size, motion, or age are of interest. In these applications, additional conditions can be added to refine the event enter.

Appear

An object, O_i, emerges, or appears,⁴ in the scene at time instant n in I(n) if the following conditions are met:
O_i ∈ I(n),
O_i ∉ I(n-1), i.e., zero match in I(n-1): M_0: O_i, and
c_i is not at the image border in I(n).

Examples are given in Figs.

³ This condition should depend on how fast the object is moving, which is an important extension of the proposed event detection method.
⁴ An object can either enter or appear at a given time instant, not both.
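A minimal sketch of the enter/appear distinction above (the zero-match test and the border test are assumed to be provided by the tracking and by the location specification of Fig. 7.4):

```python
def entry_event(matched_in_previous_image, centroid_at_border):
    """Classify a newly detected object: 'enter' if it first becomes visible at the
    image border, 'appear' if it emerges away from the border, None if it was
    already matched in I(n-1)."""
    if matched_in_previous_image:
        return None
    return "enter" if centroid_at_border else "appear"
```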
Exit (leave)

An object, O_p, exits or leaves the scene at time instant n if the following conditions are met:
O_p ∈ I(n-1),
O_p ∉ I(n), i.e., zero match in I(n): M_0: O_p,
c_p is at the image border in I(n-1), and
g_p > t_g, where g_p is the age of O_p and t_g a threshold.

Examples are given in Section 7.4.1.

Disappear

An object, O_p, disappears from the scene at time instant n in I(n) if the following conditions are met:
O_p ∈ I(n-1),
O_p ∉ I(n), i.e., zero match in I(n): M_0: O_p,
c_p is not at the image border in I(n-1), and
g_p > t_g.

Examples are given in Section 7.4.1.

Move

An object, O_i, moves at time instant n in I(n) if the following conditions are met:
O_i ∈ I(n),
M_i: O_p → O_i, where O_p ∈ I(n-1), and
the median of the motion magnitudes of O_i in the last k images is larger than a threshold t_m.⁵

Examples are given in Figs.

Stop

An object, O_i, stops in the scene at time instant n in I(n) if the following conditions are met:
O_i ∈ I(n),
M_i: O_p → O_i, where O_p ∈ I(n-1), and
the median of the motion magnitudes of O_i in the last k images is less than a threshold t_ms.

⁵ Typical values of k are three to five and t_m is one. Note that there is no delay in detecting this event because motion data from previous images are available. Ideally, the value of k should depend on the object size as an approximation of the object's distance from the camera. To reduce computation, a fixed threshold was, however, used.
Examples are given in Figs.

Occlude/occluded

The detection of occlusion is defined in Section 6.3.5, Eq. 6.14. In an occlusion, at least two objects are involved, of which at least one is moving. All objects involved in an occlusion have already entered or appeared. When two objects occlude each other, the object with the larger area is defined as the occluding object, the other as the occluded object. This definition can be adapted to the requirements of particular applications. Examples of occlusion detection are given in Figs. 7.12, 7.6, and 7.7.

Expose/exposed

Exposure is the opposite of occlusion. It is detected when an occlusion ends.

Remove/removed

Let O_i ∈ I(n) and O_p, O_q ∈ I(n-1) with M_i: O_p → O_i. O_p removes O_q if the following conditions are met:
O_p and O_q were occluded in I(n-1),
O_q ∉ I(n), i.e., zero match in I(n): M_0: O_q, and
the area of O_q is smaller than that of O_i, i.e., A_q/A_i < t_a, t_a < 1 being a threshold.

Removal is detected after occlusion. When an occlusion is detected, the proposed tracking technique (Chapter 6) predicts the occluded objects. In the case of removal, the features of the removed object can change significantly and the tracking system may not be able to predict and track the removed object; thus the tracking technique may lose it. In this case, the conditions for removal are checked and, if they are met, removal is declared. The object with the larger area is the remover, the other is the removed object. Removal examples are given in Fig. 7.9.

Deposit/deposited

Let O_p ∈ I(n-1) and O_i, O_j ∈ I(n) with M_i: O_p → O_i. O_i deposits O_j if the following conditions are met:
O_i has entered or appeared,
O_j ∉ I(n-1), i.e., zero match in I(n-1) with M_0: O_j,
A_j/A_i < t_a, t_a < 1 being a threshold,
A_i + A_j ≈ A_p [or (H_i + H_j ≈ H_p) ∨ (W_i + W_j ≈ W_p)], where A_i, H_i, and W_i are the area, height, and width of an object O_i,
O_j is close to a side, s, of the MBB of O_i, where s ∈ {r_min_i, r_max_i, c_min_i, c_max_i} (O_j is then declared the deposited object). Let d_is be the distance between the MBB side s and O_j. O_j is close to the MBB side s if t_cmin < d_is < t_cmax, with thresholds t_cmin and t_cmax, and
O_i changes in height or width between I(n-1) and I(n) at the MBB side s.

If the distance between the MBB side s and O_j is less than the threshold t_cmin, then a split of O_j from O_i is assumed and O_j is merged to O_i. Only if this distance is large is the event deposit considered. This is so because in the real world a depositor moves away from the deposited object, and the deposit detection declares the event after the distance between the two objects has become large. To reduce false alarms, deposit is declared only if the deposited object remains in the scene for some time, e.g., its age is larger than 7. The system differentiates between stopping objects (e.g., a seated person or a stopped car) and deposited objects. The system can also differentiate between deposit events and segmentation errors due to the splitting of objects (see Section 6.3.6). A deposited object remains long in the scene and the distance between the depositor and the deposited object increases. Examples of object deposit are given in Figs. 7.11, 7.9, and 7.8.

Split

An object split can be real (in the case of an object deposit) or due to object segmentation errors. The main difference between deposit and split is that a split object stays close to the splitter, while a depositor moves away from the deposited object so that they become far apart. The conditions for split are defined in Section 6.3.6.

Objects at an obstacle

Often, objects move close to static background objects (called obstacles) that can occlude part of the moving objects. This is particularly frequent in traffic scenes
where objects move close to traffic signs and other road signs. In this case, a change detection module is not able to detect the pixels occluded by the obstacle and objects are split into two or more objects. This is different from an object split because no abrupt, but rather a gradual, change of object size and shape occurs. This thesis develops a method to detect the motion of objects at obstacles. The method monitors the size of each object, O_i, in the scene. If a continuous decrease or increase of the size of O_i is detected (by comparing the areas of two corresponding objects), a flag for O_i is set accordingly.

Let O_q, O_p ∈ I(n-1). Then O_q is at an obstacle if the following conditions are met:
O_q and O_p have appeared or entered,
O_q has no corresponding object in I(n), i.e., zero match in I(n) with M_0: O_q,
A_q was monotonically decreasing in the last k images,
O_q has a close object O_p whose area A_p was continuously increasing in the last k images and which has a corresponding object, i.e., M_i: O_p → O_i with O_i ∈ I(n),
O_q and O_p have some similarity, i.e., the object voting (Section 6.3.4, Eq. 6.1) matches O_q to O_p (and hence to O_i) with a low confidence, and
the motion direction of O_q does not change if matched to O_i.

Examples are given in the result figures; note that while the transition images show two objects, the original object gets its ID back when motion at the obstacle is detected.

Abnormal movements

An abnormal movement occurs when the movement of an object is frequent (e.g., fast motion) or when it is rare (e.g., slow motion or a long stay).

An object, O_i, stays too long in the scene in the following cases (cf. the examples in Fig. 7.6):
g_i > t_gmax, i.e., O_i does not leave the scene after a given time. t_gmax is a function of the frame rate and the minimal allowable speed.
d_i < t_dmin, i.e., the distance, d_i, between the current position of O_i in I(n) and its past position in I(l), with l < n, is less than a threshold t_dmin, which is a function of the frame rate, the object motion, and the image size.

An object, O_i, moves too fast (or too slowly) if the object speed in the last k (for example, five) images is larger (smaller) than a threshold (cf. the example in Fig. 7.6).
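A minimal sketch of these abnormal-movement tests (the field layout, the use of the trajectory end points, and the thresholds are illustrative; in the proposed system the thresholds depend on the frame rate, the object motion, and the image size):

```python
def abnormal_movement(age, trajectory, speeds, t_g_max, t_d_min, v_slow, v_fast, k=5):
    """Flag abnormal movements: a long stay, little net displacement, or moving
    too fast / too slowly over the last k images."""
    flags = []
    if age > t_g_max:
        flags.append("long stay")
    (x0, y0), (x1, y1) = trajectory[0], trajectory[-1]
    if ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 < t_d_min:
        flags.append("small displacement")
    recent = speeds[-k:]
    if recent and min(recent) > v_fast:
        flags.append("moves too fast")
    if recent and max(recent) < v_slow:
        flags.append("moves too slowly")
    return flags
```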
Dominant object

A dominant object is related to a significant event, has the largest size of all objects, has the largest speed, or has the largest age.

Possible extensions

Other events and composite events can be easily extracted based on our representation strategy. Also, application-specific conditions can be easily integrated. For example, approaching a restricted site can be easily extracted when the location of the restricted site is known. The following events can be added to the proposed set of events:

Composite events
Examples are: O_i moves, stops, is occluded, and reverses direction. O_j is exposed, moves, and exits.

Stand/Sit/Walk
Standing and sitting are characterized by a continuous change in the height and width of the object MBB. Sitting is characterized by a continual increase of the width and a decrease of the height. When an object stands up, the height of its MBB continually increases while the width decreases. In both events, the height and width must be compared to their values at the time instant before they started to change. The event walk can be easily detected as continuous moderate movements of a person.

Approaching a restricted site
This is an event that is straightforward to detect. If the location of a restricted site is given, the direction of an object's motion and its distance to the site can be monitored and the event approach a restricted site can eventually be declared.

Object lost/found
At a time instant n, an object is declared lost if it has no corresponding object in the current image and an occlusion was previously reported (but no removal). It is similar to the event disappear. Some applications require the search for lost objects even if they are not in the scene. To allow the system to find lost objects, features of lost objects, such as ID, size, or motion, need to be stored for future reference. If the event object lost was detected and a new object appears in the scene which shows features similar to the lost object, the objects can be matched and the event object found declared.

Changing direction or speed
Based on a registered motion direction, which is registered when the object
completely enters the scene, the motion directions in the last k images previous to I(n) are compared with the registered motion direction. If the current motion direction deviates from the motion direction in each of the k images, a change of direction can be declared. Similarly, a change of speed can be detected.

Normal behavior
Often, a scene contains events which have never occurred before or occur rarely. What is normal is application dependent. In general, normal behavior can be defined as a chain of simple events: for example, an object enters, moves through the scene, and disappears.

Object history
For some video applications, a summary or a detailed description of the spatio-temporal object features is needed. The proposed system can provide such a summary. An object history can include: initial location, trajectory, direction and velocity, significant changes in speed, spatial relations to other objects, and the distance between the current location of the object and a previous location.

7.4 Results and discussions

There are few representation schemes concerned with high-level features such as events. Most high-level video representations are context-dependent or focus on the constraints of a narrow application; they therefore lack generality and flexibility (Section 7.1.3). Extensive experiments using widely referenced video shots have shown the effectiveness and generality of the proposed framework. The technique has been tested on 10 video shots containing a total of 6071 images, including sequences with noise and coding artifacts. Both indoor and outdoor real environments are considered. The performance of the proposed interpretation is shown by an automated textual summary of a video shot (Section 7.4.1) and an automated extraction of key images (Section 7.4.2).

The proposed events are sufficiently broad for a wide range of video applications to assist the surveillance and retrieval of video shots. For example, i) the removal or deposit of objects, such as computing devices, at a surveillance site can be monitored and detected as it happens, ii) the movement of traffic objects can be monitored and reported, and iii) the behavior of customers in stores or subways can be monitored.

The event detection procedure (not including the video analysis system) is fast and needs on average only a fraction of a second on a SUN-SPARC to interpret the data between two images. The whole system, video analysis and interpretation, needs on average between 0.12 and 0.35 seconds to process the data between two images. Typically, surveillance video is recorded at a rate of 3-15 frames per second. The proposed
system provides a response in real time for surveillance applications with a rate of up to 10 frames per second. Speed-up can be achieved, for example, by i) optimizing the implementation of the occlusion detection and object separation, ii) optimizing the implementation of the change detection technique, and iii) working with integers instead of floating-point numbers (where appropriate) and with additions instead of multiplications.

In this thesis, special consideration is given to the processing inaccuracies and errors of a multi-level approach in order to handle specific situations such as false alarms. For example, the system is able to differentiate between deposited objects, split objects, and objects at an obstacle. It also rejects false alarms of entering or disappearing objects due to segmentation errors (cf. Sections 7.3 and 6.3.5).

A critical issue in video surveillance is to differentiate between real moving objects and clutter motion, such as trees blowing in the wind and moving shadows. One way to handle these problems is to look for persistent motion; a second way is to classify motion as motion with purpose (vehicles or people) and motion without purpose (trees). The proposed tracking method can implicitly handle the first solution. Implementations of the second way need to be developed. In addition, the detection of background objects that move during the shot needs to be explicitly processed.
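The shot summaries shown in Section 7.4.1 are produced by logging, for each detected event, the identity and the current low-level features of the objects involved. A minimal formatting sketch (the dictionary keys, the column layout, and the values in the usage example are illustrative):

```python
def summary_line(pic, event, obj):
    """Format one line of an event-based shot summary from an object-feature record."""
    return ("{pic:5d}  {event:22s}  {id:4d}  {age:5d}  {status:6s}  "
            "{p0}/{p1}  {motion}  {s0}/{s1}").format(
        pic=pic, event=event, id=obj["id"], age=obj["age"], status=obj["status"],
        p0=obj["start_pos"], p1=obj["pos"], motion=obj["motion"],
        s0=obj["start_size"], s1=obj["size"])

# Usage example: one "Entering completed" line for a hypothetical object
print(summary_line(8, "Entering completed",
                   {"id": 2, "age": 8, "status": "Move",
                    "start_pos": (344, 205), "pos": (340, 199),
                    "motion": (-1, -1), "start_size": 543, "size": 600}))
```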
191 Video summary Event-based video summary The following tables show shot summaries generated automatically by the proposed system. hall_cif Shot Summary based on Objects and Events; StartPic 1/EndPic 300 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Pic Event ObjID Age Status Position Motion Size start/present present start/present Appearing completed 1 8 Move (68,114)/(88,147) (2,1 ) 3878 / Appearing completed 5 8 Move (234,111)/(224,130) (-1,1 ) 750 / is Deposit by ObjID Stop (117,162)/(117,163) (0,0 ) 298 / Occlusion 6 88 Stop (117,162)/(117,163) (0,0 ) 292 / Occlusion ObjID Move (68,114)/(149,129) (-1,0 ) 6602 / Disappear Disappear (68,114)/(125,128) (-1,1 ) 6602 / road1_cif Shot Summary based on Objects and Events; StartPic 1/EndPic 300 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Pic Event ObjID Age Status Position Motion Size start/present present start/present Appearing completed 1 8 Move (306,216)/(272,167) (-5,-5 ) 1407 / Entering completed 2 8 Move (344,205)/(340,199) (-1,-1 ) 543 / Appearing completed 3 8 Stop (148,39 )/(148,39 ) (0,0 ) 148 / Appearing completed 4 8 Move (142,69 )/(138,74 ) (-1,1 ) 157 / Disappear Disappear (306,216)/(173,41 ) (0,0 ) 809 / Appearing completed 5 8 Move (336,266)/(295,191) (-8,-8 ) 1838 / Move Fast 5 9 Move (336,266)/(288,181) (-8,-8 ) 1588 / Exit 4 53 Exit (142,69 )/(8,273) (-9,14 ) 191 / Occlusion 5 12 Move (336,266)/(271,157) (-5,-8 ) 1103 / Occlusion by ObjID Stop (344,205)/(289,129) (0,0 ) 822 / Exit Exit (148,39 )/(3,230) (-6,7 ) 156 / Abnormal Stop (344,205)/(291,129) (0,0 ) 822 /
192 166 Video interpretation floor Shot Summary based on Objects and Events; StartPic 1/EndPic 826 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Pic Event ObjID Age Status Position Motion Size start/present present start/present Appearing completed 1 8 Move (126,140)/(123,135) (0,-1 ) 320 / is Deposit by ObjID Stop (121,140)/(121,140) (0,0 ) 539 / Occlusion Stop (121,140)/(120,141) (0,0 ) 541 / Occlusion ObjID Move (126,140)/(83,109) (1,0 ) 840 / Removal by ObjID Removal (121,140)/(104,132) (0,0 ) 541 / Appearing completed 18 8 Move (105,68 )/(108,86 ) (0,0 ) 91 / Exit Exit (126,140)/(9,230) (-10,7 ) 840 / floort Shot Summary based on Objects and Events; StartPic 1/EndPic 636 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Pic Event ObjID Age Status Position Motion Size start/present present start/present Entering completed 1 8 Stop (32,136)/(32,136) (0,0 ) 270 / Appearing completed 2 8 Move (55,65 )/(57,78 ) (1,0 ) 814 / Occlusion Move (32,136)/(32,131) (0,-1 ) 269 / Occlusion ObjID Move (55,65 )/(59,102) (-1,1 ) 1267 / Removal by ObjID Removal (32,136)/(33,116) (0,-1 ) 269 / Exit Exit (55,65 )/(277,235) (9,2 ) 1267 / floorp Shot Summary based on Objects and Events; StartPic 1/EndPic 655 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Pic Event ObjID Age Status Position Motion Size start/present present start/present Entering completed 1 8 Move (85,234)/(74,215) (2,-5 ) 3671 / is Deposit by ObjID Stop (32,135)/(32,135) (0,0 ) 266 / Disappear Disappear (85,234)/(51,108) (0,1 ) 3327 /
urbicande_cif Shot Summary based on Objects and Events; StartPic 1/EndPic 300
Pic  Event  ObjID  Age  Status  Position (start/present)  Motion (present)  Size (start/present)
Appearing completed 1 8 Move (246,160)/(249,152) (0,-1) 185/
Entering completed 4 8 Move (331,158)/(322,160) (0,0) 65/
Entering completed 5 8 Stop (337,157)/(337,157) (0,0) 47/
Appearing completed 6 8 Move (235,95)/(240,107) (1,1) 197/
Occlusion by ObjID  Stop (246,160)/(260,129) (0,0) 180/
Occlusion 6 12 Move (235,95)/(243,118) (1,3) 277/
Appearing completed 7 8 Move (120,13)/(138,48) (1,5) 621/
Occlusion by ObjID  Stop (246,160)/(249,117) (0,0) 180/
Occlusion 7 49 Move (120,13)/(245,106) (1,1) 441/
Exit  Exit (331,158)/(30,194) (-6,2) 61/
Entering completed 8 8 Stop (337,157)/(337,157) (0,0) 49/
Exit  Exit (235,95)/(334,172) (0,0) 122/
Occlusion by ObjID  Stop (337,157)/(302,152) (0,0) 47/
Entering completed 9 8 Stop (337,157)/(337,157) (0,0) 49/
Abnormal  Move (246,160)/(183,92) (-1,-1) 180/
Occlusion 9 40 Move (337,157)/(341,158) (1,0) 49/
Occlusion by ObjID  Stop (120,13)/(331,150) (0,0) 441/
Abnormal  Move (337,157)/(226,143) (-1,0) 47/
Entering completed 10 8 Stop (307,193)/(307,193) (0,0) 550/
survey_d Shot Summary based on Objects and Events; StartPic 1/EndPic 979
Pic  Event  ObjID  Age  Status  Position (start/present)  Motion (present)  Size (start/present)
Entering completed 2 8 Move (31,161)/(34,173) (1,5) 7115/
Entering completed 1 8 Move (200,29)/(196,31) (1,0) 2156/
Entering completed 3 8 Move (15,173)/(7,177) (-1,1) 1103/
Entering completed 4 8 Move (81,195)/(82,190) (1,-3) 1967/
Occlusion ObjID  Move (31,161)/(129,154) (1,0) 3593/
Occlusion 1 70 Move (200,29)/(162,48) (-1,1) 2219/
Entering completed 5 8 Move (283,146)/(275,164) (0,1) 3038/
Occlusion 5 18 Move (283,146)/(211,163) (-5,0) 2886/
Occlusion ObjID  Move (31,161)/(153,143) (1,0) 3593/
Occlusion by ObjID  Move (283,146)/(121,164) (-5,0) 2886/
Occlusion ObjID  Move (200,29)/(124,66) (-1,1) 2219/
Exit 5 48 Exit (283,146)/(5,165) (-3,3) 2886/
Exit  Exit (200,29)/(3,118) (-2,5) 2219/
Exit  Exit (31,161)/(309,12) (0,-2) 3593/
Entering completed 7 8 Move (210,27)/(200,29) (-1,1) 1137/
Entering completed 8 8 Move (242,24)/(235,26) (-1,1) 1431/
Exit  Exit (210,27)/(4,103) (-1,3) 1424/
Entering completed 9 8 Move (245,26)/(241,27) (-1,3) 1304/
Exit  Exit (242,24)/(4,167) (-2,3) 1708/
Entering completed 12 8 Move (261,28)/(255,30) (-1,1) 1453/
Exit  Exit (245,26)/(13,171) (-1,1) 601/
Exit  Exit (261,28)/(8,186) (-3,3) 1436/
Entering completed 13 8 Move (233,26)/(226,30) (0,2) 1202/
Exit  Exit (233,26)/(3,123) (-3,1) 1444/
stair_wide_cif Shot Summary based on Objects and Events; StartPic 1/EndPic 1475
Pic  Event  ObjID  Age  Status  Position (start/present)  Motion (present)  Size (start/present)
Entering completed 2 8 Move (312,248)/(308,230) (-2,-2) 5746/
Entering completed 3 8 Move (184,186)/(167,169) (-2,-2) 7587/
Exit 2 69 Exit (312,248)/(337,282) (3,16) /
Appearing completed 4 8 Stop (128,104)/(127,100) (0,0) 211/
Disappear  Disappear (184,186)/(114,67) (0,-1) 6374/
Entering completed 7 8 Move (120,88)/(125,79) (1,1) 2536/
Entering completed 8 8 Move (138,85)/(131,85) (-1,0) 5308/
Exit  Exit (120,88)/(337,282) (3,20) 3432/
Disappear  Disappear (138,85)/(121,92) (0,0) 4334/
Entering completed 9 8 Move (11,151)/(23,159) (3,0) 2866/
Exit  Exit (11,151)/(127,72) (0,0) 4955/
Entering completed 16 8 Stop (123,72)/(128,77) (0,0) 2737/
Exit  Exit (123,72)/(83,154) (-5,1) 3844/

Key-image based video representation

In a surveillance environment, important events may occur only after a long time has passed. During this time, the attention of human operators decreases and significant events may be missed. The proposed system for event detection identifies events of interest as they occur, so that human operators can focus their attention on moving objects and their related events.

This section presents key-images extracted automatically from video shots. Key-images are the subset of images that best represents the content of a video sequence in an abstract manner. Key-image video abstraction transforms an entire video shot into a small number of representative images. In this way, important content is maintained while redundancies are removed. Key-images based on events are appropriate when the system must report specific events as soon as they happen.

The figures in this section show key-images extracted automatically from video shots. Each image is annotated on its upper left and right corners with the image number, object ID, age, and events. Only objects performing the events are annotated in this application. Note that, because of space constraints, no key-images for the events Exit or Disappear are shown in the figures. In Fig. 7.13, no appear, enter, exit, or disappear key-images are displayed.
Events that are not displayed are, however, given in the summary tables of the preceding section. In some applications, a detailed description of events using key-images may be required. The proposed system can provide such details. For example, Fig. 7.5 illustrates detailed information during occlusion.

7.5 Summary

There has been little work on context-independent video interpretation. The system in [38] is limited to indoor environments, cannot deal with occlusion, and is sensitive to noise. Moreover, its definition of events is not widely applicable. The system for indoor surveillance applications in [133] provides only one event, the abandoned object in unattended environments.

This chapter has introduced a new context-independent video interpretation system that provides a video representation rich in terms of generic events and qualitative object features. Qualitative object descriptors are extracted by quantizing the low-level parametric descriptions of the objects. The thesis proposes approximate but efficient world models to define useful events. In many applications, approximate models, even if not precise, are adequate. To extract events, changes of motion and the behavior of low-level features of the scene's objects are continually monitored. When certain conditions are met, the events related to these conditions are detected (a simple illustration of this condition-based detection is sketched at the end of this summary).

The proposed events are sufficiently broad for a wide range of video applications to assist surveillance and retrieval of video shots. Examples are: 1) the removal or deposit of objects, such as computing devices, at a surveillance site can be monitored and detected as it happens, 2) the movement of traffic objects can be monitored and reported, and 3) the behavior of customers in stores or subways can be monitored. The proposed system can be used in two modes: on-line or off-line. In the on-line mode, such as surveillance, the detection of an event can send related information to a human operator. In the off-line mode, the system stores events and object representations in a database.

Experiments on more than 10 indoor and outdoor video shots containing a total of 6371 images, including sequences with noise and coding artifacts, have demonstrated the reliability and the real-time performance of the proposed system.
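As a minimal illustration of this condition-based detection, and not of the exact rules or thresholds used in the thesis, the following Python sketch monitors a few assumed per-object features between two consecutive images and emits generic events when simple conditions are met; all field names and threshold values are hypothetical.

from dataclasses import dataclass

@dataclass
class ObjRecord:
    obj_id: int
    age: int            # images since the object was first detected
    speed: float        # displacement magnitude in pixels per image
    stopped_for: int    # consecutive images with near-zero speed
    at_border: bool     # minimum bounding box touches the image border

def detect_events(prev, curr, min_age=8, fast_speed=6.0, long_stop=50):
    # prev, curr: dicts mapping object ID to ObjRecord for two consecutive images.
    events = []
    for oid, o in curr.items():
        if oid not in prev and o.age == min_age:
            events.append((oid, "Entering" if o.at_border else "Appearing"))
        if o.speed >= fast_speed:
            events.append((oid, "Moving fast"))
        if o.stopped_for == long_stop:
            events.append((oid, "Abnormal: stopped for long"))
    for oid, o in prev.items():
        if oid not in curr:
            events.append((oid, "Exit" if o.at_border else "Disappear"))
    return events

Reporting an event only when a monitored feature first reaches its condition (for example, when stopped_for equals the threshold rather than exceeds it) avoids emitting the same event again in every subsequent image.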
Figure 7.5: Key-images during occlusion. Each image is annotated with its events (upper left-hand corner), and objects are annotated with their MBB and ID. The original sequence is rotated by 90° to the right to comply with the CIF format.
Figure 7.6: Key-event-images of the Highway sequence (300 images). This sequence is characteristic of a traffic-monitoring application. Each image is annotated with events (upper left-hand corner), and objects are annotated with their MBB and ID. Important key events: abnormal movement, O5 moves fast, and O2 stops for long.
Figure 7.7: Key-event-images of the Highway2 sequence (300 images). Important key event: the appearance of a person, O7, on the highway.
Figure 7.8: Key-event-images of the Hall sequence (300 images). This sequence is characteristic of an indoor surveillance application. Important key event: O6 is deposited by object O1.
Figure 7.9: Key-event-images of the Floor sequence (826 images). Important key events: O1 deposits and then removes O3. Both the depositor/remover and the deposited/removed objects are detected.
Figure 7.10: Key-event-images of the FloorT sequence (636 images). Important key event: removal of O1 by O2.

Figure 7.11: Key-event-images of the FloorP sequence (655 images). Both the key event Deposit and the object that performs the key event are correctly recognized.
Figure 7.12: Key-event-images of the Survey sequence (979 images). This sequence is typical of a parking-lot surveillance application. Important key event: occlusion of three objects, O1, O2, and O5.
Figure 7.13: Key-event-images of the Urbicande sequence (300 images), which is characteristic of city (urban) surveillance. Various objects enter and leave the scene. Important key events: O1 and O5 move abnormally and stay in the scene for a long time (see I(253) and I(276)). Note that the original sequence is rotated by 90° to the right to comply with the CIF format.
Figure 7.14: Key-event-images of the Stair sequence (1475 images). This sequence is typical of an entrance surveillance application. The interesting feature of this application is that objects can enter from three different places: the two doors and the stairs. One of the doors is restricted. To detect specific events, such as entering or approaching a restricted site (see image 964), a map of the scene is needed.
Chapter 8

Conclusion

8.1 Review of the thesis background

This thesis has developed a new framework for high-level video content processing and representation based on objects and events. To achieve high applicability, contents are extracted independently of the context of the processed video. The proposed framework targets efficient and flexible representation of video from real (indoor and outdoor) environments where occlusion, illumination changes, and artifacts may occur.

Most video processing and representation systems have dealt with video data mainly in terms of pixels, blocks, or some global structure. This is not sufficient for advanced video applications. In a surveillance application, for instance, object extraction is necessary to automatically detect and classify object behavior. In video databases, advanced retrieval must be based on high-level object features and object meaning. Users are, in general, attracted to moving objects and focus first on their meaning and then on their low-level features. Several approaches to object-based video representation were studied, but they often focus on low-level quantitative features or assume a simple environment, for example, one without object occlusion. There are few representation schemes concerning high-level features of video content such as activities and events. Much of the work on event detection and classification focuses on how to express events using reasoning and inference methods. In addition, most high-level video representations are context-dependent and focus on the constraints of a narrow application; they therefore lack generality and flexibility.

8.2 Summary of contributions

The proposed system is aimed at three goals: flexible object representations, reliable and stable processing that foregoes the need for precision, and low computational cost. The proposed system targets video from real environments such as those with object occlusions or artifacts.
This thesis has achieved these goals through adaptation to noise and image content, through the detection and correction or compensation of estimation errors at the various processing levels, and through the division of the processing system into simple but effective tasks that avoid complex operations. This thesis has demonstrated that, based on such a strategy, quality results of video enhancement, analysis, and interpretation can be achieved. The proposed system provides a response in real-time for surveillance applications with a rate of up to 10 frames per second on a multitasking SUN UltraSPARC 360 MHz without specialized hardware.

The robustness of the proposed methods has been demonstrated by extensive experimentation on more than 10 indoor and outdoor video shots containing a total of 6371 images, including sequences with noise and coding artifacts. The robustness of the proposed system is a result of adaptation to noise and artifacts and of processing that accounts for errors at one step by correction or compensation at subsequent steps where higher-level information is available. This consideration of the inaccuracies and errors of a multi-level approach allows the system to handle specific situations such as false alarms. For example, the system is able to differentiate between deposited objects, split objects, and objects at an obstacle. It also rejects false alarms of entering or disappearing due to segmentation errors.

The proposed system can be viewed as a framework of methods and algorithms for building automatic dynamic scene interpretation and representation. Such interpretation and representation can be used in various video applications. Besides applications such as video surveillance and retrieval, outputs of the proposed framework can be used in a video understanding or symbolic reasoning system.

Contributions of this thesis are made at three processing levels: video enhancement to estimate and reduce noise, video analysis to extract meaningful objects and their spatio-temporal features, and video interpretation to extract context-independent semantic features such as events. The system is modular, and layered from low-level to middle-level to high-level. Results from a lower level are integrated to support higher levels. Higher levels support lower levels through memory-based feedback loops.

Video enhancement

This thesis has developed a spatial noise filter of low complexity which is adaptive to the image structure and the image noise. The proposed method first applies a local image analyzer along eight directions and then selects a suitable direction for filtering. Quantitative and qualitative simulations show that the proposed noise- and structure-adaptive filtering method is more effective at reducing Gaussian white noise without image degradation than the reference filters used.

This thesis has also contributed a reliable and fast method to estimate the variance of white noise. The method first finds homogeneous blocks and then averages the variances of these blocks to determine the noise variance. For typical image quality, with a PSNR between 20 and 40 dB, the proposed method significantly outperforms other methods, and the worst-case PSNR estimation error is approximately 3 dB, which is suitable for video applications such as surveillance or TV signal broadcast.
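The homogeneous-block idea can be illustrated with a short Python sketch. It is not the thesis algorithm: the block size and the homogeneity criterion (simply keeping the lowest-variance blocks) are assumptions made for illustration. In a homogeneous block the local variance is dominated by the noise, so averaging the variances of such blocks approximates the noise variance.

import numpy as np

def estimate_noise_variance(img, block=8, keep_fraction=0.1):
    # img: 2-D grayscale image as a float array.
    h, w = img.shape
    variances = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            variances.append(np.var(img[y:y + block, x:x + block]))
    variances = np.sort(np.asarray(variances))
    n_keep = max(1, int(keep_fraction * len(variances)))   # most homogeneous blocks
    return float(np.mean(variances[:n_keep]))

# Example with synthetic data: a flat image plus Gaussian noise of variance 25.
flat = np.full((288, 352), 128.0)
noisy = flat + np.random.normal(0.0, 5.0, flat.shape)
print(estimate_noise_variance(noisy))   # close to 25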
Video analysis

The proposed video analysis method extracts meaningful video objects and their spatio-temporal low-level features. It is fault-tolerant, can correct inaccuracies, and can recover from errors. The method is primarily based on computationally efficient object segmentation and voting-based object tracking. Segmentation is realized in four steps: motion-detection-based binarization, morphological edge detection, contour analysis, and object labeling. To focus on meaningful objects, the proposed segmentation method uses a background image. The proposed algorithm memorizes previously detected motion data to adapt the current segmentation. The edge detection is performed by novel morphological operations with significantly reduced computations. Edges are gap-free and single-pixel wide. Edges are grouped into contours, and small contours are eliminated if they cannot be matched to previously extracted regions.

The tracking method is based on a non-linear voting system to solve the problem of multiple object correspondences. The occlusion problem is alleviated by a median-based prediction procedure. Objects are tracked once they enter the scene until they exit, including the occlusion period. An important contribution of the proposed tracking is the reliable region merging, which significantly improves the performance of the whole proposed video system.

Video interpretation

This thesis has proposed a context-independent video interpretation system. The implemented system provides a video representation rich in terms of generic events and qualitative object features. Qualitative object descriptors are extracted by quantizing the low-level parametric descriptions of the objects. The thesis proposes approximate but efficient world models to define useful events. In many applications, approximate models, even if not precise, are adequate. To extract events, changes of motion and low-level features in the scene are continually monitored. When certain conditions are met, events related to these conditions are detected. Detection of events is done on-line, i.e., events are detected as they occur. Specific object features, such as motion or size, are stored for each image and compared as the images of a shot come in. Both indoor and outdoor real environments are considered. The proposed events are sufficiently broad for a wide range of video applications to assist surveillance and retrieval of video shots.
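The idea of voting-based correspondence can be illustrated with the following Python sketch. It is not the thesis's voting system: the particular feature votes (predicted position, size ratio, motion direction), their non-linear combination by product, and the thresholds are assumptions for illustration, and conflict resolution between competing matches as well as the median-based occlusion prediction are omitted.

import numpy as np

def match_objects(prev_feats, curr_feats, max_dist=60.0):
    # prev_feats, curr_feats: lists of dicts with keys 'center', 'size', 'motion'
    # (hypothetical structures). Returns a mapping from previous to current index.
    matches = {}
    for i, p in enumerate(prev_feats):
        predicted = np.asarray(p["center"], dtype=float) + np.asarray(p["motion"], dtype=float)
        best_j, best_score = None, 0.0
        for j, c in enumerate(curr_feats):
            d = float(np.linalg.norm(predicted - np.asarray(c["center"], dtype=float)))
            v_pos = max(0.0, 1.0 - d / max_dist)                            # closeness to prediction
            v_size = min(p["size"], c["size"]) / max(p["size"], c["size"])  # size similarity
            v_dir = 1.0 if float(np.dot(p["motion"], c["motion"])) >= 0.0 else 0.5  # direction agreement
            score = v_pos * v_size * v_dir        # non-linear (multiplicative) combination of votes
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            matches[i] = best_j
    return matches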
8.3 Possible extensions

There are a number of issues to consider in order to enhance the performance of the proposed system and extend its applicability.

Time of execution and applications: The motion detection and object occlusion processing modules have the highest computational cost in the proposed modular system. The implementation of their algorithms can be optimized to allow faster execution of the whole system. In addition, the proposed system should be applied to a larger set of video shots and environments.

Object segmentation: In the context of MPEG video coding, motion vectors are available. One immediate extension of the proposed segmentation technique is to integrate motion information from the MPEG stream to support object segmentation. This integration is expected to enhance segmentation without a significant increase in computational cost.

Motion estimation: The proposed motion model can be further refined to allow more accurate estimation. A straightforward extension is to examine the displacements of the diagonal extents of objects and to adapt the estimation to previously estimated motion for greater stability. A possible extension of the proposed tracking method is the tracking of objects that move in and out of the scene.

Highlights and shadows: The system can benefit from the detection of shadows and the compensation of their effects, especially when the source and direction of illumination are known.

Image stabilization: Image stabilization techniques can be used to allow the analysis of video data from moving cameras and changing backgrounds.

Video interpretation: A wider set of events can be considered for the system to serve a larger set of applications. A program interface can be designed to facilitate user-system interaction; the definition of such an interface requires a study of the needs of users of video applications. A classification of moving objects and clutter motion, such as trees blowing in the wind, can be considered to reject spurious events. One possible classification is to label motion as motion with purpose (for example, motion of vehicles or people) or motion without purpose (for example, motion of trees). In addition, the proposed modular framework can be extended to assist context-dependent or higher-level tasks such as video understanding or symbolic reasoning.
211 Bibliography [1] T. Aach, A. Kaup, and R. Mester, Statistical model-based change detection in moving video, Signal Process., vol. 31, no. 2, pp , [2] A. Abutaleb, Automatic thresholding of gray-level pictures using two-dimensional entropy, Comput. Vis. Graph. Image Process., vol. 47, pp , [3] E. Adelson and J. Bergen, The plenoptic function and the elements of early vision, in Computational Models of Visual Processing (M. Landy and J. Movshon, eds.), ch. 1, Cambridge: M.I.T. Press, [4] E. Adelson and J. Movshon, Phenomenal coherence of moving visual patterns, Nature, vol. 300, pp , Dec [5] P. Aigrain, H. Zhong, and D. Petkovic, Content-based representation and retrieval of visual media: A state-of-the-art review, Multimedia tools and applications J., vol. 3, pp , [6] A. Alatan, L. Onural, M. Wollborn, R. Mech, E. Tunceland, and T. Sikora, Image sequence(1) analysis for emerging interactive multimedia services - the European COST 211 Framework, IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp , Nov [7] A. Amer, Motion estimation using object segmentation methods, Master s thesis, Dept. Elect. Eng., Univ. Dortmund, Dec In German. [8] A. Amer, Object-based video retrieval based on motion analysis and description, Tech. Rep , INRS-Télécommunications, June [9] A. Amer and H. Blume, Postprocessing of MPEG-2 decoded image signals, in Proc. 1 st ITG/Deutsche-Telekom Workshop on Multimedia and applications, (Darmstadt, Germany), Oct In German. [10] A. Amer and E. Dubois, Segmentation-based motion estimation for video processing using object-based detection of motion types, in Proc. SPIE Visual Communications and Image Process., vol. 3653, (San Jose, CA), pp , Jan [11] A. Amer and E. Dubois, Image segmentation by robust binarization and fast morphological edge detection, in Proc. Vision Interface, (Montréal, Canada), pp , May
212 186 Bibliography [12] A. Amer and E. Dubois, Object-based postprocessing of block motion fields for video applications, in Proc. SPIE Image and Video Communications and Processing, vol. 3974, (San Jose, CA), pp , Jan [13] A. Amer and H. Schröder, A new video noise reduction algorithm using spatial sub-bands, in Proc. IEEE Int. Conf. Electron., Circuits, and Syst., vol. 1, (Rodos, Greece), pp , Oct [14] D. Ayers and M. Shah, Recognizing human actions in a static room, in Proc. 4th IEEE Workshop on Applications of Computer Vision, (Princeton, NJ), pp , Oct [15] A. Azarbayejani, C. Wren, and A. Pentland, Real-time 3-D tracking of the human body, in Proc. IMAGE COM, (Bordeaux, France), pp , May M.I.T. TR No [16] B. Bascle, P. Bouthemy, R. Deriche, and F. Meyer, Tracking complex primitives in an image sequence, in Proc. IEEE Int. Conf. Pattern Recognition, (Jerusalem), pp , Oct [17] J. Bernsen, Dynamic thresholding of grey-level images, in Proc. Int. Conf. on Pattern Recognition, (Paris, France), pp , Oct [18] H. Blume, Bewegungsschätzung in videosignalen mit parallelen örtlich zeitlichen prädiktoren, in Proc. 5. Dortmunder Fernsehseminar, vol. 0393, (Dortmund, Germany), pp , 29 Sep.- 1 Oct In German. [19] H. Blume, Vector-based nonlinear upconversion applying center weighted medians, in Proc. SPIE Conf. Nonlinear Image Process., (San Jose, CA), pp , Feb [20] H. Blume and A. Amer, Parallel predictive motion estimation using object segmentation methods, in Proc. European Workshop and Exhibition on Image Format Conversion and Transcoding, (Berlin, Germany), pp. C1/1 5, Mar [21] H. Blume, A. Amer, and H. Schröder, Vector-based postprocessing of MPEG-2 signals for digital TV-receivers, in Proc. SPIE Visual Communications and Image Process., vol. 3024, (San Jose, CA), pp , Feb [22] A. Bobick, Movement, activity, and action: the role of knowledge in the perception of motion, Tech. Rep. 413, M.I.T. Media Laboratory, [23] G. Boudol, Atomic actions, Tech. Rep. 1026, Institut National de Recherche en Informatique et en Automatique, May [24] P. Bouthemy and R. Fablet, Motion characterization from temporal co-occurences of local motion-based measures for video indexing, in Proc. IEEE Int. Conf. Pattern Recognition, vol. 1, (Brisbane, IL), pp , Aug [25] P. Bouthemy, M. Gelgon, and F. Ganansia, A unified approach to shot change detection and camera motion characterization, Tech. Rep. 1148, Institut National de Recherche en Informatique et en Automatique, Nov
213 [26] M. Bove, Object-oriented television, SMPTE J., vol. 104, pp , Dec [27] J. Boyd, J. Meloche, and Y. Vardi, Statistical tracking in video traffic surveillance, in Proc. IEEE Int. Conf. Computer Vision, vol. 1, (Corfu, Greece), pp , Sept [28] F. Brémond and M. Thonnat, A context representation for surveillance systems, in Proc. Workshop on Conceptual Descriptions from Images at the European Conf. on Computer Vision, (Cambridge, UK), pp , Apr [29] F. Brémond and M. Thonnat, Issues of representing context illustrated by videosurveillance applications, Int. J. of Human-Computer Studies, vol. 48, pp , Special Issue on Context. [30] M. Busian, Object-based vector field postprocessing for enhanced noise reduction, Tech. Rep. S04 97, Dept. Elect. Eng., Univ. Dortmund, In German. [31] J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Machine Intell., vol. 9, pp , Nov [32] M. Chang, A. Tekalp, and M. Sezan, Simultaneous motion estimation and segmentation, IEEE Trans. Image Process., vol. 6, no. 9, pp , [33] S. Chang, W. Chen, H. Meng, H. Sundaram, and D. Zhong, A fully automatic content-based video search engine supporting multi-object spatio-temporal queries, IEEE Trans. Circuits Syst. Video Techn., vol. 8, no. 5, pp , Special Issue. [34] A. Cohn and S. Hazarika, Qualitative spatial representation and reasoning: An overview, Fundamenta Informaticae, vol. 43, pp. 2 32, [35] R. Collins, A. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto, and O. Hasegawa, A system for video surveillance and monitoring, Tech. Rep. CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, May [36] J. Conde, A. Teuner, and B. Hosticka, Hierarchical locally adaptive multigrid motion estimation for surveillance applications, in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing, (Phoenix, Arizona), pp , May [37] P. Correia and F. Pereira, The role of analysis in content-based video coding and indexing, Signal Process., vol. 66, pp , [38] J. Courtney, Automatic video indexing via object motion analysis, Pattern Recognit., vol. 30, no. 4, pp , [39] A. Cross, D. Mason, and S. Dury, Segmentation of remotely-sensed images by a split-and-merge process, Int. J. Remote Sensing, vol. 9, no. 8, pp , [40] A. Crétual, F. Chaumette, and P. Bouthemy, Complex object tracking by visual servoing based on 2-D image motion, in Proc. IEEE Int. Conf. Pattern Recognition, vol. 2, (Brisbane, IL), pp , Aug
214 188 Bibliography [41] M. Dai, P. Baylou, L. Humbert, and M. Najim, Image segmentation by a dynamic thresholding using edge detection based on cascaded uniform filters, Signal Process., vol. 52, pp , Apr [42] K. Daniilidis, C. Krauss, M. Hansen, and G. Sommer, Real time tracking of moving objects with an active camera, J. Real-Time Imging, vol. 4, pp. 3 20, February [43] G. de Haan, Motion Estimation and Compensation: An Integrated Approach to Consumer Display Field Rate Conversion. PhD thesis, Natuurkundig Laboratorium, Univ. Delft, Sept [44] G. de Haan, IC for motion compensated deinterlacing, noise reduction and picture rate conversion, IEEE Trans. Consum. Electron., vol. 42, pp , Aug [45] G. de Haan, Progress in motion estimation for consumer video format conversion, in Proc. IEEE Digest of the ICCE, (Los Angeles, CA), pp , June [46] G. de Haan, T. Kwaaitaal-Spassova, M. Larragy, and O. Ojo, IC for motion compensated 100 Hz TV with smooth movie motion mode, IEEE Trans. Consum. Electron., vol. 42, pp , May [47] G. de Haan, T. Kwaaitaal-Spassova, and O. Ojo, Automatic 2-D and 3-D noise filtering for high-quality television receivers, in Proc. Int. Workshop on Signal Process. and HDTV, vol. VI, (Turin, Italy), pp , [48] Y. Deng and B. Manjunath, NeTra V: Towards an object-based video representation, IEEE Trans. Circuits Syst. Video Techn., vol. 8, pp , Sept Special Issue. [49] N. Diehl, Object-oriented motion estimation and segmentation in image sequence, Signal Process., Image Commun., vol. 3, pp , Feb [50] S. Dockstader and A. Tekalp, On the tracking of articulated and occluded video object motion, J. Real-Time Imging, vol. 7, pp , Oct [51] E. Dougherty and J. Astola, An Introduction to Nonlinear Image Processing, vol. TT 16. Washington: SPIE Optical Engineering Press, [52] H. Dreßler, Noise estimation in analogue and digital trasmitted video signals, Tech. Rep. S11-96, Dept. Elect. Eng., Univ. Dortmund, Apr [53] E. Dubois and T. Huang, Motion estimation, in The past, present, and future of image and multidimensional signal processing (R. Chellappa, B. Girod, D. Munson, and M. V. M. Tekalp, eds.), pp , IEEE Signal Processing Magazine, Mar [54] F. Dufaux and J. Konrad, Efficient, robust and fast global motion estimation for video coding, IEEE Trans. Image Process., vol. 9, pp , June [55] F. Dufaux and F. Moscheni, Segmentation-based motion estimation for second generation video coding techniques, in Video coding: Second generation approach (L. Torres and M. Kunt, eds.), pp , Kluwer Academic Publishers, 1996.
215 189 [56] M. Ferman, M. Tekalp, and R. Mehrotra, Effective content representation for video, in Proc. IEEE Int. Conf. Image Processing, vol. 3, (Chicago, IL), pp , Oct [57] J. Fernyhough, A. Cohn, and D. Hogg, Constructing qualitative event models automatically from video input, Image and Vis. Comput., vol. 18, pp , [58] J. Flack, On the Interpretation of Remotely Sensed Data Using Guided Techniques for Land Cover Analysis. PhD thesis, EEUWIN Center for Remote Sensing Technologies, Feb [59] J. Foley, A. van Dam, S. Feiner, and J. Hughes, Computer Graphics: Principles and Practice. Reading, MA: Addison-Wesley, Second edition. [60] M. Gabbouj, G. Morrison, F. Alaya-Cheikh, and R. Mech, Redundancy reduction techniques and content analysis for multimedia services - the European COST 211quat Action, in Proc. Workshop on Image Analysis for Multimedia Interactive Services, (Berlin, Germany), pp , May [61] L. Garrido, P. Salembier, and D. Garcia, Extensive operators in partition lattices for image sequence analysis, Signal Process., vol. 66, pp , [62] A. Gasch, Object-based vector analysis for restoration of video signals, Master s thesis, Dept. Elect. Eng., Univ. Dortmund, July In German. [63] A. Gasteratos, Mathematical morphology operations and structuring elements. Computer Vision On-line, [64] C. Giardina and E. Dougherty, Morphological Methods in Image and Signal Processing. New Jersey: Prentice Hall, [65] S. Gil, R. Milanese, and T. Pun, Feature selection for object tracking in traffic scenes, in Proc. SPIE Int. Symposium on Smart Highways, vol. 2344, (Boston, MA), pp , Oct [66] B. Girod, What s wrong with mean squared error?, in Digital Images and Human Vision (A. Watson, ed.), ch. 15, M.I.T. Press, Cambridge, Mar [67] F. Golshani and N. Dimitrova, A language for content-based video retrieval, Multimedia tools and applications J., vol. 6, pp , [68] M. Hansen, P. Anandan, K. Dana, G. van der Wal, and P. Burt, Real-time scene stabilization and mosaic construction, in Proc. DARPA Image Understanding Workshop, vol. 1, (Monterry, CA), pp , Nov [69] R. Haralick and L. Shapiro, Computer and Robot Vision. Reading: Addison-Wesley, [70] I. Haritaoglu, D. Harwood, and L. S. Davis, W 4 : Real-time surveillance of people and their activities, IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp , Aug
216 190 Bibliography [71] M. Isard and A. Blake, Contour tracking by stochastic propagation of conditional density, in Proc. European Conf. Computer Vision, vol. A, pp , [72] R. Jain, A. Pentland, and D. Petkovic, Workshop report, in Proc. NSF-ARPA Workshop on Visual Information Management Systems, (Cambridge, MA), June [73] R. Jain and T. Binford, Dialogue: Ignorance, myopia, and naivete in computer vision systems, Comput. Vis. Graph. Image Process., vol. 53, pp , January [74] K. Jostschulte and A. Amer, A new cascaded spatio-temporal noise reduction scheme for interlaced video, in Proc. IEEE Int. Conf. Image Processing, vol. 2, (Chicago, IL), pp , Oct [75] S. Khan and M. Shah, Tracking people in presence of occlusion, in Proc. Asian Conf. on Computer Vision, (Taipei, Taiwan), pp , Jan [76] J. Konrad, Motion detection and estimation, in Image and Video Processing Handbook (A. Bovik, ed.), ch. 3.8, Academic Press, [77] K. Konstantinides, B. Natarajan, and G. Yovanof, Noise estimation and filtering using block-based singular-value decomposition, IEEE Trans. Image Process., vol. 6, pp , Mar [78] M. Kunt, Comments on dialogue, a series of articles generated by the paper entitled ignorance, myopia, and naivete in computer vision, Comput. Vis. Graph. Image Process., vol. 54, pp , November [79] J. Lee, Digital image smoothing and the sigma filter, Comput. Vis. Graph. Image Process., vol. 24, pp , [80] G. Legters and T. Young, A mathematical model for computer image tracking, IEEE Trans. Pattern Anal. Machine Intell., vol. 4, pp , Nov [81] A. Lippman, N. Vasconcelos, and G. Iyengar, Human interfaces to video, in Proc. 32 nd Asilomar Conf. on Signals, Systems, and Computers, (Asilomar, CA), Nov Invited Paper. [82] E. Lyvers and O. Mitchell, Precision edge contrast and orientation estimation, IEEE Trans. Pattern Anal. Machine Intell., vol. 10, pp , November [83] D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W.H. Freeman and Company, [84] R. Mech and M. Wollborn, A noise robust method for segmentation of moving objects in video sequences, in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing, vol. 4, (Munich, Germany), pp , Apr [85] R. Mech and M. Wollborn, A noise robust method for 2-D shape estimation of moving objects in video sequences considering a moving camera, Signal Process., vol. 66, no. 2, pp , 1998.
217 191 [86] G. Medioni, I. Cohen, F. Brémond, and R. N. S. Hongeng, Event detection and analysis from video streams, IEEE Trans. Pattern Anal. Machine Intell., vol. 23, no. 8, pp , [87] P. Meer, R. Park, and K. Cho, Multiresolution adaptive image smoothing, Graphical Models and Image Process., vol. 44, pp , Mar [88] megapixel.net, Noise: what it is and when to expect it. Monthly digital camera web magazine, [89] T. Meier and K. Ngan, Automatic segmentation of moving objects for video object plane generation, IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp , Sept Invited paper. [90] T. Minka, An image database browser that learns from user interaction, Master s thesis, M.I.T. Media Laboratory, Perceptual Computing Section, [91] A. Mitiche, Computational Analysis of Visual Motion. New York: Plenum Press, [92] A. Mitiche and P. Bouthemy, Computation and analysis of image motion: a synopsis of current problems and methods, Intern. J. Comput. Vis., vol. 19, no. 1, pp , [93] M. Naphade, R. Mehrotra, A. Ferman, J. Warnick, T. Huang, and A. Tekalp, A high performance algorithm for shot boundary detection using multiple cues, in Proc. IEEE Int. Conf. Image Processing, vol. 2, (Chicago, IL), pp , [94] W. Niblack, An introduction to digital image processing. Prentice Hall, [95] H. Nicolas and C. Labit, Motion and illumination variation estimation using a hierarchy of models: Application to image sequence coding, Tech. Rep. 742, IRISA, July [96] M. Nieto, Public video surveillance: Is it an effective crime prevention tool?. CRB California Research Bureau, California State Library, June CRB [97] A. Oliphant, K. Taylor, and N. Mission, The visibility of noise in system-i PAL colour television, Tech. Rep. 12, BBC Research and Development Department, [98] S. Olsen, Estimation of noise in images: An evaluation, Graphical Models and Image Process., vol. 55, pp , July [99] N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Syst., Mach. and Cybern., vol. 9, no. 1, pp , [100] T. Pavlidis, Structural Pattern Recognition. Berlin: Springer Verlag, [101] T. Pavlidis, Contour filling in raster graphics, in Proc. SIGGRAPH, (Dallas, Texas), pp , Aug
218 192 Bibliography [102] T. Pavlidis, Algorithms for Graphics and Image Processing. Maryland: Computer Science Press, [103] T. Pavlidis, Why progress in machine vision is so slow, Pattern Recognit. Lett., vol. 13, pp , [104] J. Peng, A. Srikaew, M. Wilkes, K. Kawamura, and A. Peters, An active vision system for mobile robots, in Proc. IEEE Int. Conf. on Systems, Man and Cybernetics, (Nashville, TN, USA), pp , Oct [105] A. Pentland, Looking at people: Sensing for ubiquitous and wearable computing, IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp , Jan [106] P. Perona and J. Malik, Scale-space and edge detection using ansotropic diffusion, IEEE Trans. Pattern Anal. Machine Intell., vol. 12, pp , July [107] R. Poole, DVB-T transmissions - interference with adjacent-channel PAL services, Tech. Rep. EBU-Winter-281, BBC Research and Development Department, [108] K. Pratt, Digital image processing. New York: John Wiley and Sons, Inc, [109] K. Rank, M. Lendl, and R. Unbehauen, Estimation of image noise variance, IEE Proc. Vis. Image Signal Process., vol. 146, pp , Apr [110] S. Reichert, Comparison of contour tracing and filling methods, Master s thesis, Dept. Elect. Eng., Univ. Dortmund, Feb In German. [111] A. Rosenfeld and C. Kak, Digital Picture Processing, vol. 2. Orlando: Academic Press, INC., [112] P. Rosin, Thresholding for change detection, in Proc. IEEE Int. Conf. Computer Vision, (Bombay, India), pp , Jan [113] P. Rosin and T. Ellis, Image difference threshold strategies and shadow detection, in Proc. British Machine Vision Conf., (Birmingham, UK), pp , [114] Y. Rui, T. Huang, and S. Chang, Digital image/video library and MPEG-7: Standardization and research issues, in Proc. IEEE Int. Conf. Acoustics Speech Signal Processing, (Seattle, WA), pp , May Invited paper. [115] Y. Rui, T. Huang, and S. Chang, Image retrieval: Current techniques, promising directions and open issues, J. Vis. Commun. Image Represent., vol. 10, pp. 1 23, [116] Y. Rui, T. Huang, and S. Mehrotra, Relevance feedback techniques in interactive content based image retrieval, in Proc. SPIE Conf. Storage and Retrieval for Image and Video Databases, (San Jose, CA), pp , Jan [117] P. Sahoo, S. Soltani, A. Wong, and Y. Chen, A survey of thresholding techniques, Comput. Vis. Graph. Image Process., vol. 41, pp , 1988.
219 [118] P. Salembier, L. Garrido, and D. Garcia, Image sequence analysis and merging algorithm, in Proc. Int. Workshop on Very Low Bit-rate Video, (Linkoping, Sweden), pp. 1 8, July Invited paper. [119] P. Salembier and F. Marqués, Region-based representations of image and video: Segmentation tools for multimedia services, IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 8, pp , [120] H. Schröder, Image processing for TV-receiver applications, in Proc. IEE Int. Conf. on Image Processing and its applications, (Maastricht, The Netherlands), Apr Keynote paper. [121] H. Schröder, Mehrdimensionale Signalverarbeitung, vol. 1. Stuttgart, Germany: Teubner, [122] T. Seemann and P. Tischer, Structure preserving noise filtering of images using explicit local segmentation, in Proc. Int. Conf. on Pattern Recognition, vol. 2, (Brisbane, Australia), pp , Aug [123] Z. Sivan and D. Malah, Change detection and texture analysis for image sequence coding, Signal Process., Image Commun., vol. 6, pp , Aug [124] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, Content-based image retrieval: the end of the early years, IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp , Dec [125] S. Smith, Feature Based Image Sequence Understanding. PhD thesis, Robotics Research Group, Department of Engineering Science, Oxford University, [126] M. Spann and R. Wilson, A quad-tree approach to image segmentation which combines statistical and spatial information, Pattern Recognit., vol. 18, no. 3/4, pp , [127] Call for analysis model comparisons. On-line COST211ter, [128] Workshop on image analysis for multimedia interactive services. Proc. COST211ter, Louvain-la-Neuve, Belgium, June [129] Special issue on segmentation, description, and retrieval of video content. IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, Sept [130] Special section on video surveillance. IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, Aug [131] J. Stauder, R. Mech, and J. Ostermann, Detection of moving cast shadows for object segmentation, IEEE Trans. on Multimedia, vol. 1, no. 1, pp , [132] C. Stiller, Object-based estimation of dense motion fields, IEEE Trans. Image Process., vol. 6, pp , Feb
220 194 Bibliography [133] E. Stringa and C. Regazzoni, Content-based retrieval and real time detection from video sequences acquired by surveillance systems, in Proc. IEEE Int. Conf. Image Processing, (Chicago, IL), pp , Oct [134] H. Sundaram and S. Chang, Efficient video sequence retrieval in large repositories, in Proc. SPIE Conf. Storage and Retrieval for Image and Video Databases, vol. 3656, (San Jose, CA), pp , Jan [135] R. Thoma and M. Bierling, Motion compensating interpolation considering covered and uncovered background, Signal Process., Image Commun., vol. 1, pp , [136] L. Torres and M. Kunt, Video coding: Second generation approach. Kluwer Academic Publishers, [137] O. Trier and A. Jain, Goal-directed evaluation of binarization methods, IEEE Trans. Pattern Anal. Machine Intell., vol. 17, pp , Dec [138] P. van Donkelaar, Introductory overview on eye movements. On-line, [139] N. Vasconcelos and A. Lippman, Towards semantically meaningful feature spaces for the characterization of video content, in Proc. IEEE Int. Conf. Image Processing, vol. 1, (Santa Barbara, CA), pp , Oct [140] P. Villegas, X. Marichal, and A. Salcedo, Objective evaluation of segmentation masks in video sequences, in Proc. Workshop on Image Analysis for Multimedia Interactive Services, (Berlin, Germany), pp , May [141] P. Zamperoni, Plus ca va, moins ca va, Pattern Recognit. Lett., vol. 17, pp , June [142] H. J. Zhang, C. Low, S. Smoliar, and J. Wu, Video parsing, retrieval and browsing: An integrated and content-based solution, in Proc. IEEE Conf. Multimedia, (San Francisco, CA), pp , Nov [143] S. Zhu and A. Yuille, Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation, IEEE Trans. Pattern Anal. Machine Intell., vol. 18, pp , Sept [144] Z. Zhu, G. Xu, Y. Yang, and J. Jin, Camera stabilization based on 2.5-D motion estimation and inertial motion filtering, in Proc. IEEE Int. Conf. on Intelligent Vehicles, pp , [145] F. Ziliani and A. Cavallaro, Image analysis for video surveillance based on spatial regularization of a statistical model-based change detection, in Proc. Int. Conf. on Image Analysis and Processing, (Venice, Italy), pp , Sept
Appendix A

Applications

This thesis has proposed a framework for object- and event-based video processing and representation. This framework uses two systems: an object-oriented shot analysis and a context-independent shot interpretation. The resulting video representation includes i) the shot's global features, ii) the objects' parametric and qualitative low-level features, iii) object relationships, and iv) events. The following are samples of video applications that can benefit from the proposed framework.

Video databases: retrieval of shots based on specific events and objects.

Surveillance: automated monitoring of activity in scenes, such as detecting people, their activities, and related events such as fighting or overstaying; monitoring traffic and related events such as accidents and other unusual events; detection of hazards such as fires; and monitoring flying objects such as aircraft.

Entertainment and telecommunications: video editing and reproduction, smart video appliances, dynamic video summarization, and browsing of video on the Internet.

Human motion analysis: dance performance, athletic activities in sports, and smart environments for human interaction.

The next sections address three applications in more detail and suggest ways of using the video analysis and interpretation framework proposed in this thesis.
A.1 Video surveillance

Closed-Circuit Television (CCTV), or video surveillance, is a system to monitor public, private, or commercial sites such as art galleries, residential districts, and stores. (Many video surveillance systems involve no recording of sound [96], which emphasizes the need for stable video analysis procedures.) Advances in video technologies, such as camcorder and digital video technology, have significantly increased the use of surveillance systems. Although surveillance cameras are widely used, video data is still mainly used as an after-the-event tool to manually locate interesting events. The continuous active monitoring of surveillance sites to alert human operators to events in progress is required in many applications. Human resources to detect events or observe the output of a surveillance system are expensive. Moreover, events typically occur at large time intervals, and system operators may lose attention and miss important events. Therefore, there is an increasing and immediate need for automated video interpretation systems for surveillance.

The goal of a video interpretation system for video surveillance is to detect, identify, and track moving objects, analyze their behavior, and interpret their activities (Fig. A.1(a)). A typical scene classification in a surveillance application is depicted in Fig. A.1(b). Interpretation methods for surveillance video need to consider the following conditions:

Ever-changing conditions: object appearance and shapes are highly variable, and there are many artifacts such as shadows, poor light, and reflections;
Object occlusion and unpredictable behavior;
Fault tolerance: robust localization and recognition of objects in the presence of occlusion; an error must not stop the whole system;
Special consideration should be given to inaccuracies and other sources of errors to handle specific situations such as false alarms;
Complexity and particular characteristics of each application, which may limit a wider use of general video processing systems; and
Real-time processing (typical frame rates are 3-15 frames per second).

Considering these conditions and the definition of a video surveillance system (Fig. A.1), the techniques proposed in this thesis for video analysis and interpretation are suitable to meet the requirements of video surveillance applications.

A.2 Video databases

Considering the establishment of large video archives, such as for the arts, the environment, science, and politics, the development of effective automated video retrieval systems is a problem of increasing importance. For example, one hour of video represents approximately 0.5 Gigabyte and requires approximately 10 hours of manual cataloging and archiving; one clip requires 5-10 minutes for viewing, extraction, and annotation. In a video retrieval system, features of a query shot are computed, compared to the features of the shots in the database, and the shots most similar to the query are returned to the user. Three models for video retrieval can be defined based on the way video content is represented.
Figure A.1: A definition of a content-based video surveillance system: (a) interpretation-based video surveillance; (b) scene classification in video surveillance.

In the first model (low-level Query-By-Example), the user either sketches or selects a video query, e.g., after browsing the video database. A video analysis module extracts a low-level quantitative video representation. This representation is compared to stored low-level representations, and the video shots most similar to the query are selected. Comparison based on low-level quantitative parameters can be expensive, in particular when the dimension of the parameter vector is high. This model is suitable for unstructured (raw) video and for small databases. In the second model (high-level Query-By-Example), the user selects a video query, and the system finds a high-level video representation and compares high-level features to find similar shots. In the third model (Query-By-Description), the user specifies a qualitative and high-level description of the query, and the system compares this description with the stored descriptions in the database. Such a model is useful when the user cannot specify a video but has memorized a (vague) description of it.

In most existing object-based video retrieval tools, the user either sketches or selects a query example, e.g., by browsing. Browsing a large database can be time consuming, and sketching is a difficult task, especially for complex scenes. Since the subjects of the majority of video shots are objects and related events, this thesis suggests a retrieval framework (Fig. A.2) where the user either selects a shot or gives a qualitative, high-level description of a query shot as in Fig. A.3. The suggested framework for video retrieval (Fig. A.2(a)) aims at introducing functionalities that are oriented to the way users usually describe and judge video similarity and to requirements of efficiency and reliability of video interpretation that can forego precision. An advantage of the high-level video representation proposed in this thesis is that it allows the construction of user-friendly queries, based on the observation that most people's interpretation of real-world domains is imprecise and that users, while viewing a video, usually memorize objects, their actions, and their locations rather than exact (quantitative) object features. In the absence of a specific application, such a generic model allows scalability (e.g., by introducing new definitions of object actions or events).
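As a small illustration of Query-By-Description, and not of the actual query language or matching procedure of the proposed system, the following Python sketch matches a qualitative query against stored qualitative shot descriptions; the attribute names and values are hypothetical.

def match_query(query, shot_descriptions):
    # query: dict of qualitative attributes, e.g. {"event": "Deposit", "size": "small"}.
    # shot_descriptions: dict mapping a shot name to its stored object descriptions.
    hits = []
    for shot_id, desc in shot_descriptions.items():
        objects = desc.get("objects", [])
        if any(all(obj.get(k) == v for k, v in query.items()) for obj in objects):
            hits.append(shot_id)
    return hits

# Hypothetical database of two shots.
db = {
    "hall":    {"objects": [{"event": "Deposit", "size": "small", "motion": "slow"}]},
    "highway": {"objects": [{"event": "Move fast", "size": "medium", "motion": "fast"}]},
}
print(match_query({"event": "Deposit", "size": "small"}, db))   # ['hall']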
Figure A.2: Object- and event-based framework for video retrieval (off-line shot analysis and interpretation; on-line shot monitoring and retrieval).

Using the proposed video interpretation of this thesis, users can formulate queries using qualitative object descriptions, spatio-temporal relationship features, location features, and semantic or high-level features (Fig. A.3). The retrieval system can then find video whose content matches some or all of these qualitative descriptions. Since video shot databases can be very large, pruning techniques are essential for efficient video retrieval. This thesis has suggested two methods for fast pruning: the first is based on the qualitative global motion detected in the scene, and the second on the notion of dominant objects (cf. Section 7.4.2).

A.3 MPEG-7

Because of the increasing availability of digital audio and visual information in various domains, MPEG started work on a new standard for a Multimedia Content Description Interface, MPEG-7. The goal is to provide tools for the description of multimedia content, where each type of multimedia data is characterized by a set of distinctive features. MPEG-7 aims at supporting effective and efficient retrieval of multimedia based on content features ranging from low-level to high-level [114]. Fig. A.4 shows a high-level block diagram of a possible MPEG-7 processing chain. Both feature extraction and retrieval techniques are relevant to MPEG-7 activities but are not part of the standard. MPEG-7 only defines the standard description of multimedia content and focuses on the inter-operability of internal representations of content descriptions.

There are large dependencies between video representation, applications, and access to MPEG-7 tools. For example, tools for extracting and interpreting descriptions are essential for effective use of the upcoming MPEG-7 standard. On the other hand, a well-defined MPEG-7 standard will significantly benefit exchange among various video applications.
Figure A.3: An object- and event-based query form (event specifications, spatial object locations and features, object relations, and global shot specifications such as global motion and dominant objects).

Effective and flexible multi-level video content models that are user-friendly play an important role in video representation, applications, and MPEG-7. In the proposed system for video representation in this thesis, a video is seen as a collection of video objects, related meanings, and local and global features. This supports access to MPEG-7 video content description models.

Figure A.4: Abstract scheme of an MPEG-7 processing chain (feature extraction and indexing, the standard description, and description-based applications such as search and retrieval; only the description itself is within the scope of MPEG-7).
Appendix B

Test Sequences

B.1 Indoor sequences

All test sequences used are real-world image sequences that represent different typical environments of surveillance applications. The indoor sequences show people walking in different environments. Most scenes include changes in illumination. The target objects have various features such as speed and shape, and many of them have shadowed regions.

Hall: This is a CIF sequence (30 Hz) of 300 images. It includes shadows, noise, and local illumination changes. A person enters the scene holding an object and deposits it. Another person enters and removes an object. The target objects are the two persons and the deposited/removed objects.

Stairs: This is an indoor CIF sequence (25 Hz) of 1475 images. A person enters from the back door, goes to the front door, and exits. The same person returns and exits from the back door. Another person comes down the stairs, goes to the back door, then to the front door, and exits. The same person returns through the front door and goes up the stairs. This is a noisy sequence with illumination changes (for example, through the glass door) and shadows.

Figure B.1: Images of the Hall shot, courtesy of the COST-211 group.
Figure B.2: Images of the Stair shot, courtesy of the COST-211 group.

Figure B.3: Images of the Floor shot, INRS-Télécommunications.

Floor: This is an SIF sequence (30 Hz) of 826 images. It was recorded with an interlaced DV camcorder (320x480 pixels at a rate of 60) and then converted to AVI. All a-fields are dropped, and the resulting YUV sequence is progressive. This sequence contains many coding and interlacing artifacts as well as shadows. Other sequences of the same environment were also used for testing.

B.2 Outdoor sequences

The selected test sequences are real-world image sequences. The main difficulty is how to cope with illumination changes, occlusion, and shadows.

Urbicande: This is a CIF sequence (4:2:0, 12.5 Hz) of 300 images. Several pedestrians enter, occlude each other, and exit. Some pedestrians enter the scene from buildings. A pedestrian remains in the scene for a long period of time and moves suspiciously. The sequence is noisy and has local illumination changes; some local flicker is visible. Objects are very small.

Survey: This is an SIF sequence (30 Hz) of 976 images. It was recorded with an analog NTSC-based camera and PC-sized at 320x240. It has interlacing artifacts, and a number of frames are dropped. The sequence was recorded at 60 Hz interlaced and converted to 30 Hz progressive video by merging the even and odd fields. This was done automatically since the original capture format was MPEG-1. Strong interlacing artifacts are present in the constructed frames.
Figure B.4: Images of the Urbicande shot, courtesy of the COST-211 group.

Figure B.5: Images of the Survey shot, courtesy of the University of Rochester.

Highway. This is a CIF sequence (4:2:0, 25 Hz) of 600 images. It was taken under daylight conditions from a camera placed on a bridge above a highway. Various vehicles with different features (e.g., speed and shape) are in the scene. The target objects are the moving (entering and leaving) vehicles. The challenge here is the detection and tracking of individual vehicles in the presence of occlusion, noise, and illumination changes.

Figure B.6: Images of the Highway shot, courtesy of the COST-211 group.
Appendix C
Abbreviations

CCIR Comité Consultatif International des Radiocommunications
HDTV High Definition Television
PAL Phase Alternating Line; television standard used extensively in Europe
NTSC National Television Standards Committee; television standard used extensively in North America
Y Luminance, corresponding to the brightness of an image pixel
UV/CrCb Chrominance, corresponding to the color of an image pixel
YCrCb/YUV A color encoding method for transmitting color video images while maintaining compatibility with black-and-white video
MPEG Moving Picture Experts Group
MPEG-7 A standard for a Multimedia Content Description Interface
COST Coopération Européenne dans la recherche Scientifique et Technique
AM Analysis Model
PSNR Peak Signal-to-Noise Ratio
MSE Mean Square Error
MBB Minimum Bounding Box
MED Median
LP Low-pass
MAP Maximum A Posteriori Probability
HVS Human Visual System
2-D Two-Dimensional
FIR Finite Impulse Response
CCD Charge-Coupled Device
DCT Discrete Cosine Transform
dB Decibel
IID Independent and Identically Distributed
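Since PSNR and MSE are used as quantitative quality measures throughout the thesis, for example to assess noise estimation and reduction, the following is a minimal sketch in Python of their standard definitions for 8-bit images. It is a generic illustration, not the evaluation code used in the thesis.

    import numpy as np

    def mse(reference, test):
        # Mean square error between two images of the same size.
        diff = reference.astype(np.float64) - test.astype(np.float64)
        return np.mean(diff ** 2)

    def psnr(reference, test, peak=255.0):
        # Peak signal-to-noise ratio in decibels (dB); peak = 255 for
        # 8-bit luminance images.
        error = mse(reference, test)
        if error == 0.0:
            return float("inf")
        return 10.0 * np.log10(peak ** 2 / error)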