Vision-Based Tracking and Recognition of Dynamic Hand Gestures


Vision-Based Tracking and Recognition of Dynamic Hand Gestures

Dissertation submitted for the degree of Doktor der Naturwissenschaften in the Fakultät für Physik und Astronomie of the Ruhr-Universität Bochum

submitted by Maximilian Krüger from Düsseldorf

Bochum 2007

First reviewer (1. Gutachter): Prof. Dr. Christoph von der Malsburg
Second reviewer (2. Gutachter): Prof. Dr. Andreas D. Wieck
Date of the defense (Tag der Disputation): 11 December 2007

Acknowledgements

I am grateful to Prof. Dr. Christoph von der Malsburg for his guidance and support. He offered me the opportunity to start my research in the challenging field of computer vision and taught me to ask the question "How does the brain work?". His passion for this topic inspired me and the present work. Furthermore, I would like to thank Prof. Dr. Andreas Wieck for his interest in being my second reviewer. Special thanks go to Dr. Rolf P. Würtz for many discussions and for his advice and support during my research and the writing of this thesis. Life at the institute has been so much easier with the help of Uta Schwalm and Anke Bücher; thank you for the administrative support. Acknowledgements are also made to Michael Neef for providing the IT infrastructure. For their patient help whenever I struggled with LaTeX or C++ problems I would like to thank Günter Westphal and Manuel Günter. I am also grateful to Marco Müller, who helped me find faces, and to Andreas Tewes for endless discussions on scientific and non-scientific matters. I am glad to have friends like Thorsten Prüstel and Bianca von Scheidt, who accompanied and supported me from my first academic day on. I thank my father Dietmar and my mother Margaretha, my brother Christian and his family for all the nice moments and for showing me the strength of a family. Words cannot express what you mean to me, my wife Seza and our son David.

Contents

1 Introduction
2 Survey of Sign Language Recognition Techniques
  2.1 Sign Language
  2.2 Previous Work of Sign Language Recognition
3 System Architecture
  3.1 Software Agents
  3.2 Multi-Agent System
4 Visual Tracking
  4.1 Multi-Agent Tracking Architecture
  4.2 Tracking Agent
  4.3 Recognition Agent
5 Feature Extraction
  5.1 Position
    5.1.1 Bunch Graph Matching for Face Detection
  5.2 Hand Posture
    5.2.1 Bunch Graph Matching for Hand Posture Classification
    5.2.2 Contour Information for Hand Posture Classification
6 Recognition
  6.1 Hidden Markov Models
    Elements of an HMM
    HMM Basic Problems: Evaluation Problem, Decoding Problem, Estimation Problem
  6.2 Modification of the Hidden Markov Model: Evaluation Problem, Estimation Problem
  The HMM Recognition Agent
    Layer One: HMM Sensor
    Layer Two: Gesture HMM Integration
    Layer Three: Decision Center
7 Experiments and Results
  BSL Data
  Signer Dependent Experiments
  Different Signer Experiments
8 Discussion and Outlook
  Discussion
  Outlook
A Democratic Integration
B Gabor Wavelet Transformation
C BSL Database
D BSL Experiments Results
List of Figures
List of Tables
Bibliography
Curriculum Vitae

Chapter 1

Introduction

Communication between humans and computers is limited and is done mainly by pressing keys, by touching screens and by verbal commands like reading coded signs. Speech, however, is the dominant factor in communication between humans. It is often accompanied by intended or unintended gestures. Wherever the environment is not favorable to verbal exchange, gestures are the preferred way of communication, i.e., when background noise, long distances or language problems make verbal interaction between people difficult or impossible. The human brain is capable of understanding gestures quite easily and clearly. There are different ways to perform and interpret a gesture, using one or two hands, the arms, or both. Some gestures are static, e.g., showing a single hand posture, while other gestures change over time and need dynamics to transmit their information.

The performance of computers could be greatly enhanced if they were able to recognize gestures, if their interaction with humans could become more human. Gesture recognition by computers offers new applications in industry (steering and control of robots) and in security (surveillance). Another important field for human computer interaction (HCI) lies in the recognition of sign language. A translation from the language of the deaf to the hearing will ease the communication in public institutions like post offices or medical centers. Thus, gesture recognition is an interdisciplinary field of research. Finding answers to questions like: How do we understand other people's behavior? How can we assign goals, intentions, or beliefs to the inhabitants of our social world? How can we combine and perform tracking and recognition? What data structures are suitable to store the information in a robust and fault-tolerant way? builds the basis for investigations in brain research and neural computation.

The current state of knowledge in the field of brain research is that gesture recognition or action recognition in primates and humans is managed by

mirror neurons (Gallese and Goldman, 1998; Iacoboni et al., 2005). A mirror neuron is defined as a neuron which fires both when an animal acts and when the animal observes the same action performed by another animal or person (Umiltà et al., 2001; Keysers et al., 2003). Thus, the neuron mirrors the behavior of another animal, as though the observer were itself acting. In humans, brain activity consistent with mirror neurons has been found in the premotor cortex and the inferior parietal cortex (Rizzolatti and Craighero, 2004). Hence, it seems that recognition is performed by comparing the current observations with a previously learned sequence. This makes it possible to predict the coming events during the observation and thus enhances the robustness of the whole process.

The aim of the present work is to build a recognition system which performs gesture recognition during the observation process. Each learned gesture is represented by an individual module that competes with the others during recognition. The data is acquired by a single camera facing the signer. No data gloves or electrical tracking is applied in the recognition system. Computer vision, and thus visual gesture recognition, has to deal with well-known problems of image processing, e.g., illumination changes and camera noise. Hence, the implemented recognition system has to be reliable in the presence of camera noise and a changing environment. Therefore, object tracking and object recognition are important steps towards the recognition of gestures. Thus, the system in the present work has to show robust feature extraction and adaptation to a flexible environment and signer. This is realized by applying different autonomous modules which cooperate in order to solve the given task.

Following the principles of Organic Computing presented in Müller-Schloer et al. (2004), the robustness of the recognition system is enhanced by dividing a problem into different subtasks, which are solved by autonomous subsystems. All subsystems are working on-line and therefore can help each other. Furthermore, they are able to flexibly adapt to new situations. The results of the different subsystems are combined to solve the overall task. This integration of information from different sources, like hand contour, position and their temporal development, presents, besides the coordination of these processes, the main challenge for creating the recognition system. The implemented subsystems autonomously solve a part of the problem using techniques like democratic integration (Triesch and von der Malsburg, 2001a) for information merging, bunch graph matching (Lades et al., 1993) for face and object recognition, and a modified parallel Hidden Markov Model (HMM) (Rabiner, 1989), in which the information of the different subsystems is merged in order to recognize the observed gesture. Each of these techniques learns its knowledge from examples. For this purpose, the training is performed with minimal user interaction.

Figure 1.1: Trajectory variations. (a) different, (b) bat. The signs different and bat shown with the trajectory differences of ten repetitions by the same professional signer. Both signs are part of the British Sign Language.

Just like the one-click learning described in Loos and von der Malsburg (2002), the user only defines a similarity threshold value. In order to realize the different autonomous units, their environment and the communication between them in a software framework, a multi-agent system (MAS) (Ferber, 1999) has been designed. The installed agents show self-x properties like dynamic adaptation to a changing environment (self-healing), perception of their environment and the capability to rate their actions (self-control). They can easily be added or deleted during execution time and thus provide the needed flexibility.

Sign language is a good starting point for gesture recognition research because it has a structure. This structure makes it possible to develop and test methods on sign language recognition before applying them to general gesture recognition. Therefore, the present work concentrates on the British Sign Language (BSL). Like gesture recognition, sign language recognition also has to deal with the challenge of recognizing the hand postures and their spatial and temporal changes that code the distinct sign. From a technical point of view, the projection of the 3D scene onto a 2D plane has to be considered. This results in the loss of depth information, and therefore the reconstruction of the 3D trajectory of the hand is not always possible. Besides, the position of

the signer in front of the camera may vary. Movements like shifting in one direction or rotating around the body axis must be considered, as well as the occlusion of some fingers or even a whole hand during signing. Despite its constant structure, each sign shows plenty of variation in the dimensions of time and space. Even if the same person performs the same sign twice, small changes in speed and position of the hand will occur. An example is presented in fig. 1.1, where the different trajectories which occur when the same sign is performed ten times by a professional signer are illustrated. Generally, a sign is affected by the preceding and subsequent signs. This effect is called co-articulation. Part of the co-articulation problem is that the recognition system has to be able to detect sign boundaries automatically, so that the user is not required to segment the sign sentence into single signs. Finally, any sign language or gesture recognition system should be able to reject unknown signs and gestures, respectively. A graphical overview of the sign language recognition system applied in the present work is given in fig. 1.2.

In order to explain the applied system, this thesis is structured as follows: Chapter 2 gives an overview of previous work in the field of sign language recognition. Subsequently, chapter 3 describes the multi-agent system architecture, in particular the constructed agents and the applied methods. The visual tracking is presented in chapter 4. Chapter 5 introduces the different features that are applied for the sign language recognition, which is presented in chapter 6. A description of the experiments undertaken and their results can be found in chapter 7. The present work is discussed in chapter 8, where future work is outlined.

Figure 1.2: Sign Language Recognition Work Flow. The sign language recognition system is divided into four work steps. Starting with the recording of the sign sequence using a monocular camera, the tracking is performed by multi-cue fusion, tracking each object separately. The next step is the feature extraction, where the position of the hands and the corresponding hand posture are extracted and processed in order to enter the sign recognition. There, the received features are integrated to recognize the performed sign.

Chapter 2

Survey of Sign Language Recognition Techniques

Research in sign language recognition is connected to the fields of spoken language recognition and computer vision: spoken language recognition, because it deals with the similar problem of recognizing a sequence of patterns in time (sound), and, as sign languages are visual languages, computer vision, which is needed to collect and process the input data. This chapter gives an overview of previous work in the field of gesture or, more precisely, sign language recognition. Each of the presented ideas, including the present work, is focused on the manual part of sign language. However, sign language is more complex, as will be shown before the sign language recognition systems are introduced.

2.1 Sign Language

In contrast to gestures, which are a typical component of spoken languages, sign languages present the natural way of communication between deaf people. Sign languages develop, like oral languages, in a self-organized way. An example which shows that sign language appears wherever communities of deaf people exist is reported by Bernard Tervoort¹ (Stokoe, 2005). Like spoken languages, sign languages are not universal and vary from region to region. British and American Sign Language, for instance, are quite different and mutually unintelligible, even though the hearing people of both countries share the same oral language. Furthermore, a sign language is

¹ Tervoort observed the development of a sign language for groups of deaf pupils. Although unacquainted with any sign language outside their own group, they developed signs that were only used by the group itself and tended to vanish when the group dispersed.

not a visual reproduction of an oral language. Its grammar is fundamentally different, and thus it distinguishes itself from gestures (Pavlovic et al., 1997). While the structure of a sentence in spoken language is linear, one word followed by another, the structure in sign language is not. It shows a simultaneous structure and allows parallel temporal and spatial configurations which code the information about time, location, person and predicate. Hence, even though the duration of a sign is approximately twice as long as the duration of a spoken word, the duration of a signed sentence is about the same (Kraiss, 2006).

Differing from pantomime, sign language does not include its environment. Signing takes place in the 3D signing space which surrounds the trunk and the head of the speaker. The communication is done by simultaneously combining hand postures, orientation and movement of the hands, arms or body, facial expressions and gaze. Fig. 2.1 (a) illustrates some signs of the British Sign Language (BSL), one-handed or two-handed, with facial expression. In addition to the signs, each sign language has a manual alphabet for finger spelling. Fig. 2.1 (b) depicts the alphabet performed in the BSL. Differing from the American Sign Language, its letters are performed two-handed. The finger spelling codes the letters of the spoken language and is mainly used to spell names or oral language words. In contrast to written text for spoken language, an equally accepted notation system does not exist for sign language (Kraiss, 2006). Although some systems like signwriting.org (2007) or HamNoSys (2007) exist, no standard notation system has been established.

As mentioned above, the performance of sign language can be divided into manual (hand shape, hand orientation, location and motion) and non-manual (trunk, head, gaze, facial expression, mouth) parameters. Some signs can be distinguished by manual parameters alone, while others remain ambiguous unless additional non-manual information is made available. If two signs only differ in one parameter they are called a minimal pair. The following recognition systems, including the present work, concentrate on manual features and investigate one-handed signs performed by the dominant hand only, and two-handed signs, which can be performed symmetrically or non-symmetrically.

Figure 2.1: BSL. Both figures are kindly provided by the Royal Association for Deaf people RAD (2007) and show examples taken from the British Sign Language. The left image (a) shows the variation of performed signs, which include one-handed and two-handed signs, and the importance of facial expressions. On the right hand side (b) the finger spelling chart of the BSL is depicted.

2.2 Previous Work of Sign Language Recognition

Sign language recognition has to solve three problems. The first challenge is reliable tracking of the hands, followed by robust feature extraction as the second problem. Finally, the third task concerns the interpretation of the temporal feature sequence. In the following, some approaches to solve these problems, which have inspired this work, are presented.

Starner and Pentland (1995) analyze sign gestures performed by one signer wearing colored gloves. After color segmentation and the extraction of

position and contour of the hands, their recognition is based on continuous sentences of signs, which are bound to a strict grammar, using trained Hidden Markov Models (HMMs). Their work is enhanced in Starner et al. (1998) by changing to skin color data collected from a camera in front of the speaker. In a second system, the camera is mounted in a cap worn by the user.

Hienz et al. (1999) and Bauer and Kraiss (2002) introduce an HMM-based continuous sign language recognition system which splits the signs into subunits to be recognized. The needed image segmentation and feature extraction is simplified by using gloves with different colors for fingers and palm. Thus, the extracted sequence of feature vectors reflects the manual sign parameters. The same group has built another recognition system that works with skin color segmentation and builds a multiple tracking hypothesis system (Akyol, 2003; Zieren and Kraiss, 2005; von Agris et al., 2006). The winner hypothesis is determined at the end of the sign. The authors include high-level knowledge of the human body and the signing process in order to compute the likelihood of all hypothesized configurations per frame. They extract geometric features like axis ratio, compactness and eccentricity of the hands segmented by skin color and apply HMMs as well. von Agris et al. (2006) use the recognition system for signer-independent sign language recognition.

Instead of colored gloves, Vogler and Metaxas (1997, 1999, 2001) use 3D object shape and motion extracted with computer vision methods as well as a magnetic tracker fixed at the signer's wrists. They propose a parallel HMM algorithm to model gesture components and recognize continuous signing sentences. Shape, movement and location of the right hand along with movement and location of the left hand are represented by separate HMM channels, which were trained with relevant data and features. For recognition, individual HMM networks were built in each channel and a modified Viterbi decoding algorithm searched through all the networks in parallel. Path probabilities from each network that went through the same sequence of words were combined. This work is enhanced in Vogler and Metaxas (2003) using multiple channels, by integrating 3D motion data and CyberGlove hand posture data.

Tanibata et al. (2002) propose a similar scheme for isolated word recognition in the Japanese Sign Language. The authors apply HMMs which model the gesture data from the right and left hand in a parallel mode. The information is merged by multiplying the resulting output probabilities.

Richard Bowden's group structures the classification around a linguistic definition of signed words (Bowden et al., 2003; Ong and Bowden, 2004; Kadir et al., 2004). This enables signs to be learned reliably from few training examples. Their classification process is divided into two stages. The

first stage generates a description of hand shape and movement using skin color detection. The extracted features are described in the same way that is used within sign linguistics to document signs. This description allows broad generalization and therefore significantly reduces the requirements of further classification stages. In the second stage, Independent Component Analysis (ICA) (Comon, 1994) is used to separate the sources of information from uncorrelated noise. Their final classification uses a bank of Markov chains to recognize the temporal transitions of individual signs.

All of the presented works are very inspiring and offer different interesting approaches to overcome the various problems of sign language recognition. Most of the introduced systems work off-line, meaning they collect the feature sequence and start recognition when the gesture has already been performed. The approach in the present work divides the problem into different subtasks that are solved by autonomous subsystems. Instead of single-color tracking, a self-organized multi-cue tracking for the different body parts is applied. As in the papers mentioned above, an HMM approach is chosen for the temporal recognition. The HMM method is extended by introducing self-controlling properties, allowing the recognition system to perform its recognition on-line during observation of the input sequence.

Chapter 3

System Architecture

The aim of this thesis is to develop a recognition system that distributes the task of recognition to a set of subsystems. Fig. 3.1 depicts the idea of splitting the process of sign language recognition into three main subsystems. There is one subsystem for object tracking and one for object recognition. Both provide the input data for the sign language recognition subsystem. Each subsystem works autonomously, but cooperation is needed to reach the overall goal. Chapter 4 explains the subsystem for visual object tracking of the head and the hands. The object recognition subsystem is used for the recognition of static hand postures and face detection. The applied techniques for object recognition are described in chapter 6. Both systems continuously exchange information about the input data from tracking and the recognized object. The information flow to the sign language recognition system is one-way and comprises the preprocessed features, which are integrated in order to recognize the observed sign.

Each subsystem is technically realized by one or more software agents. Software agents are a popular way to install a framework of autonomous subsystems on a computer. They provide the autonomy, flexibility and robustness that suit the demands of the tracking and recognition system. As there is more than one agent in use, a multi-agent system (MAS) has been developed to provide the infrastructure. This chapter will give a short introduction to the field of software agents and the implemented MAS. The whole recognition system is written in C++ using the image processing libraries FLAVOR (Rinne et al., 1999) and Ltilib (Kraiss, 2006).

Figure 3.1: System Architecture. The sign language recognition system is divided into three subsystems. The object tracking and the object recognition systems work on the sequence of input images and are responsible for the robust preprocessing of the features that enter the third subsystem. The sign language recognition system integrates the received information in order to determine the most probable sign. The data flow between the subsystems is denoted by arrows.

3.1 Software Agents

The term software agent describes a software abstraction, or an idea, similar to object-oriented programming terms such as methods, functions, and objects (Wikipedia, 2007). The concept of an agent provides a convenient way to describe a complex software entity that is capable of acting with a certain degree of autonomy in order to accomplish given tasks. Software agents act self-contained and are capable of making independent decisions. They take actions to satisfy internal goals based upon their perceived environment (Liu and Chua, 2006). Thus, they are able to exhibit goal-directed behavior by taking the initiative (Nikraz et al., 2006). Unlike objects, which are defined in terms of methods and attributes, an agent is defined in terms of its behavior.

Franklin and Graesser (1997), Bigus (2001) and Wooldridge (2002) discuss four main concepts that distinguish agents from arbitrary programs. Persistence is the concept of continuously running code that is not executed on demand, but rather decides for itself when it should perform which activity. Autonomy enables the agent to be task-selective, set its priorities, work goal-directed and make decisions without human intervention; the agent has control over its actions and internal state. Its social ability allows the agent to engage

other components through some kind of communication and coordination. Finally, reactivity enables the agents to perceive the context in which they operate and react to it appropriately; they perceive their surrounding area, to which they can adapt. These concepts also distinguish agents from expert systems, which are not coupled to their environment and are not designed for reactive or proactive behavior (Wooldridge, 2002).

Figure 3.2: MAS Objects. The multi-agent system contains three base classes. The environment and the blackboard are singleton objects and administrate the input data and the communication. Different kinds and a varying number of agents are installed to solve the subtasks of sign language recognition. The arrows denote the data flow between the interfaces of the object classes.

3.2 Multi-Agent System

The whole recognition system is built on a multi-agent system developed earlier by Krüger et al. (2004). The MAS constitutes the framework for the implementation of the different subsystems and provides the interface for communication. It has to solve problems like a changing number of agents, caused by occlusion or tracking failure, and the communication between the different entities (Ferber, 1999). Although the agents are equipped with all the abilities required to fulfill their individual task, they are not supposed to have all data or all methods available to solve the overall goal of sign language recognition. The robustness and flexibility of the present work is achieved by the idea of splitting a complex task into smaller and simpler subtasks. Hence, as demanded by Liu and Chua (2006), collaboration and communication between the agents have to be installed.

As depicted in fig. 3.2, the MAS consists of three base classes of objects: the environment, the blackboard, and the agent. While environment and blackboard are realized as singleton objects (Buschmann et al., 1996), there can be a multitude of different agents. These agents handle tasks ranging

from the coordination of subprocesses and the tracking of an image point up to the recognition of human extremities.

Environment

The information about the world is supplied by the environment. Based on the desired functionality of visual tracking and recognition, the environment provides access to image sequences, e.g., the current original color image and its processed versions, like the gray value image and the difference image between two consecutive video frames.

Blackboard

Communication within the system is done via the blackboard. For this purpose, a message can be quite complex, e.g., it can carry an image, and it has a defined lifetime. Each agent can write messages onto the blackboard and read the messages other agents have posted. Thus, message handling allows the creation of new agents with specific properties, the change of properties and also the elimination of agents during run-time.

Agent

The agent is the most interesting entity because it shows the above mentioned properties. In order to implement this behavior, agents have three layers, which are shown in fig. 3.3. The top layer, called AgentInterface, administrates the communication and provides the interface to the environment. The fusion center in the second layer is called cueintegrator and merges the information supplied by one or more sensors of layer three. The perception of the surrounding area is twofold: first, there is the message handling via the blackboard, and second, an agent can receive information through its sensors, which filter the input data coming directly from the environment. Based on the collected information, the agent reaches a decision about further actions.

As mentioned above, the sign language recognition is separated into the subtasks of object tracking and recognition (object and gesture). Each subtask is solved by one or more agents. Hence, teamwork and an observer/controller architecture are essential. For this purpose, three main classes of agents are implemented in the MAS: the tracking agents, the agents for recognition and the agents for control. Tracking agents merge different visual cues like color, texture and movement to follow an object. Cue fusion is done using democratic integration (Triesch and von der Malsburg, 2001a). This technique offers a self-organized, flexible, and robust way of tracking and will be explained in chapter 4.
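To make the three-layer agent design and the blackboard communication more concrete, the following C++ sketch outlines one possible way to organize such classes. It is only an illustration: the class and method names (Blackboard, Message, Sensor, Agent, post, read, ageMessages) and the placeholder Image type are assumptions of this sketch and do not reproduce the actual implementation based on FLAVOR and the Ltilib.

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

struct Image {};                        // placeholder for the FLAVOR/Ltilib image type

struct Message {                        // blackboard entry with a limited lifetime
    std::string topic;
    std::string payload;                // could equally well carry an Image
    int lifetime = 1;                   // frames until the message expires
};

class Blackboard {                      // singleton used for all communication
public:
    static Blackboard& instance() { static Blackboard bb; return bb; }
    void post(const Message& m) { board_.push_back(m); }
    std::vector<Message> read(const std::string& topic) const {
        std::vector<Message> result;
        for (const Message& m : board_)
            if (m.topic == topic) result.push_back(m);
        return result;
    }
    void ageMessages() {                // called once per frame; drops expired messages
        for (Message& m : board_) --m.lifetime;
        board_.erase(std::remove_if(board_.begin(), board_.end(),
                                    [](const Message& m) { return m.lifetime <= 0; }),
                     board_.end());
    }
private:
    Blackboard() = default;
    std::vector<Message> board_;
};

class Sensor {                          // layer three: filters the raw input data
public:
    virtual ~Sensor() = default;
    virtual void observe(const Image& frame) = 0;
};

class Agent {                           // layers one and two: interface and fusion
public:
    virtual ~Agent() = default;
    void perceive(const Image& frame) {     // AgentInterface: access the environment
        for (auto& s : sensors_) s->observe(frame);
    }
    virtual void act(Blackboard& bb) = 0;   // cueintegrator: fuse sensor results,
                                            // decide, and post messages
protected:
    std::vector<std::unique_ptr<Sensor>> sensors_;
};
```

In this reading, tracking, recognition and control agents would all derive from Agent and differ only in their sensors and in what they post to the blackboard.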

Figure 3.3: Design of an Agent. An agent is based on three modules. The connection to the environment and the communication are included in the AgentInterface module. It is followed by one cueintegrator, the module that integrates and interprets the information provided by one or more Sensors.

Agents which provide stored world knowledge are called recognition agents and are applied for face and hand posture recognition. Training the recognition agents, i.e., learning world knowledge from examples, is also a crucial task requiring autonomously working methods (Würtz, 2005). One key point of the recognition system is its independence from user interaction. Hence, controlling agents are responsible for solving the conflicts that might occur during execution.

Chapter 4

Visual Tracking

Human motion tracking, i.e., the tracking of a person or parts of a person, is a difficult task. Viewed on a large scale, humans look nearly the same; they have a head, two shoulders, two arms, a torso and two legs. However, they tend to be very different at smaller scales, having different clothes, shorter arms, et cetera¹. The same holds true for the movement of a person: when repeating the same sign, the trajectories and dynamics can be quite different.

Based on the framework presented in chapter 3, the tracking of the head and the left and right hand relies on visual information and is solved by a cooperation of globally and locally acting agents. Following Moeslund and Granum (2001), the implemented visual tracking has to solve two tasks: first, the detection of the interesting body parts and, second, their reliable tracking. An example tracking sequence is shown in fig. 4.1, where the implemented tracking system runs in a scene with complex background which includes moving objects. This chapter will introduce the applied agents and their interaction for object tracking. The description begins with the architecture, where the important modules are presented. Subsequently, the connection between the different agents is explained.

The features which are applied for sign language recognition in the present work focus on the information extracted from the head and both hands of the signer. Thus, the head and hands have to be detected to start the tracking. For this purpose, the whole image is scanned for skin-colored moving blobs to get a rough estimate of the body parts. These blobs are the only constraints applied for defining a region of interest (ROI). When detected, their position and size are passed to the control agent which administrates the tracking agents.

¹ The problems that occur when learning models of articulated objects like the human body are investigated in Schäfer (2006).

Figure 4.1: Tracking Sequence. In this tracking sequence head and hands were found. Object identity is visualized by the color of the rectangles, which delineate the attention region of each tracking agent. Moving skin color in the background is ignored.

The task of a tracking agent is twofold. It tracks its target and, as shown in fig. 4.5, provides the information needed by the object recognition agents. In order to enhance the tracking, the tracking agent uses a multi-cue tracking method that adapts to changes of the target. Each cue is initialized with cue values extracted from the object during the agent's initialization. The tracking agents will be explained in detail in the second part of this chapter, in section 4.2.

Figure 4.2: Applied Agents. The arrows denote the data flow between the different agents. Starting with the attention agent, which searches for possible positions of head and hands, its results are passed to the tracking control agent. The tracking control agent verifies whether the object is already tracked by a tracking agent and otherwise instantiates a new one. Recognition agents continuously try to identify the objects that are tracked by the tracking agents.

4.1 Multi-Agent Tracking Architecture

Fig. 4.2 pictures the agents which are implemented in the tracking system. While the attention agent and the tracking control agent are only used to detect and administrate new ROIs, the tracking agents and the recognition agents are running throughout the whole tracking and recognition process.

The object tracking starts with the attention agent scanning the whole image for the signer's head and hands. The identification of the foreground or target regions constitutes an interpretation of the image based on knowledge, which is usually specific to the application scenario. Since contour and size of a hand's projection onto the two-dimensional image plane vary considerably, color is the feature most frequently used for hand localization in gesture recognition systems. Although the color of the recorded object depends on the current illumination, skin color has proven to be a robust cue for human motion tracking (Jones and Rehg, 2002). Just like in Steinhage (2002), the attention agent has two sensors, one scanning for skin color and the other for movement. Each sensor produces a two-dimensional map that codes the presence of colors similar to skin color and of motion, respectively. These binary maps are merged in the cueintegration unit by applying the logical conjunction to obtain the skin-colored moving blobs.
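A minimal sketch of the cue combination just described, assuming simplified data structures: the skin-color map and the motion map are combined by a logical AND, connected blobs are segmented, and very small segments are discarded as noise. The type names, the flood-fill segmentation and the minimum-area parameter are illustrative placeholders rather than the actual Ltilib-based code.

```cpp
#include <algorithm>
#include <cstddef>
#include <stack>
#include <utility>
#include <vector>

using BinaryMap = std::vector<std::vector<bool>>;   // [row][column], assumed non-empty

struct Roi { int minX, minY, maxX, maxY; std::size_t area; };

// Combine the skin-color map and the motion map with a logical AND.
BinaryMap combineCues(const BinaryMap& skin, const BinaryMap& motion) {
    BinaryMap out(skin.size(), std::vector<bool>(skin[0].size(), false));
    for (std::size_t y = 0; y < skin.size(); ++y)
        for (std::size_t x = 0; x < skin[y].size(); ++x)
            out[y][x] = skin[y][x] && motion[y][x];
    return out;
}

// Segment connected blobs (4-connected flood fill) and drop segments below minArea.
std::vector<Roi> segmentBlobs(BinaryMap map, std::size_t minArea) {
    std::vector<Roi> rois;
    const int h = static_cast<int>(map.size());
    const int w = static_cast<int>(map[0].size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            if (!map[y][x]) continue;
            Roi roi{x, y, x, y, 0};
            std::stack<std::pair<int, int>> todo;
            todo.push({x, y});
            map[y][x] = false;                       // mark as visited
            while (!todo.empty()) {
                auto [cx, cy] = todo.top(); todo.pop();
                ++roi.area;
                roi.minX = std::min(roi.minX, cx); roi.maxX = std::max(roi.maxX, cx);
                roi.minY = std::min(roi.minY, cy); roi.maxY = std::max(roi.maxY, cy);
                const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
                for (int i = 0; i < 4; ++i) {
                    int nx = cx + dx[i], ny = cy + dy[i];
                    if (nx >= 0 && nx < w && ny >= 0 && ny < h && map[ny][nx]) {
                        map[ny][nx] = false;
                        todo.push({nx, ny});
                    }
                }
            }
            if (roi.area >= minArea) rois.push_back(roi);  // reject small noise artifacts
        }
    }
    return rois;
}
```

The bounding boxes of the surviving blobs would then be passed on as candidate ROIs for head and hands.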

At the current stage, these blobs are likely, but not yet verified, to be the head and the two hands of the signer. In the next step, the detected blobs are segmented and used to define the object's position and size. In order to reject artifacts due to noise, small segments are filtered out. However, the global search is expensive in terms of computational time. Thus, the attention agent scans for abstract cues and is only active as long as the head and hands have not been found in the image. It becomes reactivated if one of the tracking agents loses its target. Position and size of the remaining ROIs are then passed to the tracking control agent which supervises the tracking. It checks whether the target is already tracked and, if this is not the case, a new tracking agent is instantiated and calibrates its sensors to follow the object.

4.2 Tracking Agent

Object tracking is performed by tracing an image point on the target by its corresponding tracking agent. The tracking agents take on this task by scanning the local surrounding area of the previous target position on the current frame. This area is called the tracking agent's attention region. Local agents have the advantage that they save computational time. Furthermore, they do not get confused by objects that are outside their region of interest. Differing from the detection of ROIs, skin color and movement are not enough for reliable tracking. Therefore, the tracking agent uses a more robust method for object tracking. As shown in fig. 4.3, it does not rely on a single feature but instead integrates the results of four different information sources. In order to integrate this information, the tracking agent applies a scheme called democratic integration (Triesch and von der Malsburg, 2001a), which is explained in detail in appendix A.

The information acquisition of each feature is realized through a feature-specific sensor, namely pixel template, motion prediction, motion and color information. Each sensor contains a prototype of its corresponding feature. During the instantiation of the agent, the prototypes are extracted from the image position appointed by the tracking control agent. Hence, they are adjusted to the object's individual features and are capable of adapting themselves to new situations during the tracking process (see below). Scanning the attention region of its corresponding tracking agent, each sensor produces a similarity map by comparing its prototype with the image region. In order to determine the new position of the target object, the tracking agent's cueintegrator computes a weighted average of the similarity maps derived from the different sensors.

Figure 4.3: Democratic Integration. As the person is entering the scene, there is only one tracking agent in charge. On the left, the tracking result of the sensor is marked with a circle. The rectangle shows the border of the agent's search region. On the right, the similarity maps created by the different sensors are given, from left to right: color, motion, motion prediction and pixel template. The fusion center shows the resulting saliency map obtained by applying the democratic integration scheme.

Out of this map, the position with the highest value is assigned as the current target position. This position is fed back to the sensors and serves as the basis for two types of adaptation. First, the weights of the sensors are adapted according to their agreement with the overall result. Second, the sensors adapt their internal parameters in order to have their output match the determined collective result better.

Spengler and Schiele (2001) criticize the democratic integration presented in Triesch and von der Malsburg (2001a) by mentioning that the system is limited to single-target tracking. Furthermore, they add that self-organizing systems like democratic integration are likely to fail in cases where they start to track false positives. In such cases the system may adapt to a wrong target, resulting in reinforced false positive tracking. By introducing a multitude of local tracking agents, the limitation to single object tracking has been solved. It is true that the tracking agents can be susceptible to tracking false positives, but as shown in fig. 4.4 they perform well even in cases where the three objects are very close to each other.

After tracking the object on the current frame, each tracking agent evaluates its success. For this purpose, democratic integration provides a confidence value, which allows the object tracking to be rated. If the confidence value is below a certain threshold, the tracking result is not reliable and the object is defined to be lost, e.g., because it left the scanning region of the tracking agent.
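The fusion and adaptation steps can be sketched as follows. This is a simplified, discrete version of the democratic integration dynamics described in appendix A and in Triesch and von der Malsburg (2001a): the fused saliency map is a weighted average of the sensor maps, its maximum defines the target position and the confidence value, and each weight relaxes toward a quality measure that reflects how well the cue agrees with the fused result. The exact form of the quality measure and the adaptation rate used here are assumptions of the sketch, not values from the thesis.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

using SimilarityMap = std::vector<float>;   // flattened map over the attention region

struct Fused { std::size_t bestIndex; float confidence; };

// Weighted average of the sensor maps; the maximum gives the new target position,
// and its value serves as the confidence of the tracking result.
Fused fuse(const std::vector<SimilarityMap>& maps, const std::vector<float>& weights) {
    SimilarityMap total(maps[0].size(), 0.0f);
    for (std::size_t c = 0; c < maps.size(); ++c)
        for (std::size_t i = 0; i < total.size(); ++i)
            total[i] += weights[c] * maps[c][i];
    std::size_t best = 0;
    for (std::size_t i = 1; i < total.size(); ++i)
        if (total[i] > total[best]) best = i;
    return {best, total[best]};
}

// Simplified weight adaptation: cues whose response at the fused target position
// exceeds their average response gain weight, the others lose weight.
void adaptWeights(const std::vector<SimilarityMap>& maps, std::size_t target,
                  std::vector<float>& weights, float rate = 0.05f) {
    std::vector<float> quality(maps.size(), 0.0f);
    float sum = 0.0f;
    for (std::size_t c = 0; c < maps.size(); ++c) {
        float mean = std::accumulate(maps[c].begin(), maps[c].end(), 0.0f)
                     / static_cast<float>(maps[c].size());
        quality[c] = std::max(0.0f, maps[c][target] - mean);
        sum += quality[c];
    }
    if (sum <= 0.0f) return;                // keep the old weights if no cue agrees
    for (std::size_t c = 0; c < weights.size(); ++c)
        weights[c] += rate * (quality[c] / sum - weights[c]);  // relax toward quality
}
```

A cue that repeatedly disagrees with the collective result thus loses influence, which is exactly the self-organizing behavior criticized and defended in the paragraph above.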

Figure 4.4: Meet gesture in BSL. The four images are part of the sign meet performed in British Sign Language. As shown, the local tracking agents perform well and do not mix or switch to another object, even in situations where all three objects (head and hands) meet.

If the tracking agent is not able to retrieve the object during the next two frames, it terminates its tracking. Before it is deleted, it sends a message to the attention agent, which will restart the search for ROIs on the whole image. This evaluation of the tracking is part of the self-healing of the system. Otherwise, if tracking has been successful, the tracking agent informs the other agents by posting a message containing the information needed for recognition. This information includes the current position, the contour of the target and an image of its attention region. Fig. 4.5 illustrates that the images sent by different tracking agents can overlap.

4.3 Recognition Agent

The messages coming from the tracking agents are read by the three recognition agents, each with a different function. One scans the image included in the message for the presence of a face. The other two agents are specialized in hand posture recognition and apply two different approaches. Chapter 6 will give a detailed description of the used recognition techniques. The result of the recognition agents is fed back to the tracking agents and, furthermore, provides the input for sign language recognition.

When initialized, a tracking agent does not know which kind of object it is tracking. This information is stored in a type flag, which is initialized with UNKNOWN.

Figure 4.5: Input images. The input image is the attention region of the tracking agent in charge. On the left is the whole image, where the tracked objects (head, left hand and right hand) are color coded. The individual attention region of each tracking agent is displayed on the right. In addition to position and contour information, they present the input for the recognition agents, which use the bunch graph method introduced in section 5.1.1.

A tracking agent following an UNKNOWN target over a couple of frames (three in the presented work) is probably tracking an uninteresting object. Therefore, it will delete itself after sending a message to the attention agent. Only a message from a recognition agent changes the type flag to HEAD, LEFT HAND or RIGHT HAND and thereby allows the tracking to continue. After the recognition agents have sent their messages, the environment loads the next image and the next tracking cycle is started.
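The life cycle of a tracking agent around the type flag can be summarized in a few lines of C++. The enumeration values and the limit of three frames follow the description above; everything else (class and method names, the way recognition results arrive) is schematic and not taken from the actual implementation.

```cpp
enum class TargetType { UNKNOWN, HEAD, LEFT_HAND, RIGHT_HAND };

class TrackingAgentState {
public:
    // Called once per frame after the recognition agents have posted their results.
    // Returns false if the agent should delete itself and notify the attention agent.
    bool update(TargetType recognized) {
        if (recognized != TargetType::UNKNOWN) {
            type_ = recognized;            // a recognition agent identified the target
            framesUnknown_ = 0;
            return true;
        }
        if (type_ != TargetType::UNKNOWN) return true;   // already identified earlier
        return ++framesUnknown_ <= 3;      // give up on unidentified targets after 3 frames
    }
private:
    TargetType type_ = TargetType::UNKNOWN;
    int framesUnknown_ = 0;
};
```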

Chapter 5

Feature Extraction

The sign language recognition system has to be stable and robust enough to deal with variations in the execution speed and position of the hands. Hidden Markov Models (HMMs) can solve these problems. Their ability to compensate time and amplitude variations of signals has been amply demonstrated for sign language recognition, as described in Ong and Ranganath (2005) and chapter 2. Thus, the HMM method serves as the data structure to store and recognize learned feature sequences. The information integration used for sign language recognition is installed in the HmmRecognition agent, which embeds an extended, self-controlled Hidden Markov Model architecture. Before the extension of the HMM method is discussed in section 6.2, this chapter describes the input features to be processed by the HMM.

In order to be suitable for a recognition task, a feature should show a high inter-gesture variance, which means that it varies significantly between different gestures, and a low intra-gesture variance, which denotes that the feature shows only small variations between multiple productions of the same gesture. The first property means that the feature carries much information, while the second indicates that it is not significantly affected by noise or unintentional variations. Thus, the features used for sign language recognition in the present work include the position of both hands and the corresponding hand postures. As mentioned in the previous chapter, each feature assignment is performed by a special recognition agent. In contrast to the trajectory information, which comes directly from the corresponding tracking agent and is explained in section 5.1, the static hand posture has to be recognized and assigned to previously learned sign lexicons, as introduced in section 5.2. The generation of a sign lexicon is driven by examples and requires only minimal user interaction: it consists of one threshold for the similarity of two corresponding features.
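These two requirements can be made quantitative. As an illustration only (the thesis itself does not define such a score), the variance of a scalar feature across different gestures can be compared with its variance across repetitions of the same gesture:

$$ J(f) \;=\; \frac{\operatorname{Var}_{\text{inter}}(f)}{\operatorname{Var}_{\text{intra}}(f)} \;=\; \frac{\frac{1}{G}\sum_{g=1}^{G}\bigl(\bar f_g - \bar f\bigr)^2}{\frac{1}{G}\sum_{g=1}^{G}\frac{1}{N_g}\sum_{n=1}^{N_g}\bigl(f_{g,n} - \bar f_g\bigr)^2}, $$

where $f_{g,n}$ is the feature value of the $n$-th repetition of gesture $g$, $\bar f_g$ the mean over the repetitions of gesture $g$, and $\bar f$ the overall mean. A large $J(f)$ indicates a feature that discriminates well between gestures while being stable under repetition, which is exactly the property demanded above.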

5.1 Position

In order to make the position information translation invariant, the face of the signer serves as the origin of a body-centered coordinate system, which is used to describe the position of both hands. The position is posted directly by the corresponding tracking agent and is treated in a continuous fashion. Thus, a Gaussian mixture model (Titterington et al., 1985) is used to store the position information. A recognition agent for face detection browses the messages of the tracking agents and advises the corresponding agent if a face has been detected in its attention region. The face detection implemented in the present work is performed by the bunch graph matching algorithm (Wiskott et al., 1997). The concept of the bunch graph proved to be very flexible and robust in many circumstances and in the present work serves for hand posture classification as well.

5.1.1 Bunch Graph Matching for Face Detection

Two of the implemented recognition agents in the system apply bunch graph matching for object recognition. One is the above mentioned face detection agent, the other classifies static hand postures and will be introduced in section 5.2. Both agents need a database of stored bunch graphs, which are used for the matching. The object knowledge is learned from examples, e.g., for faces, the nodes of the graph are set on remarkable positions, so-called landmarks, like the tip of the nose, the eyes, etc. Bunch graph matching, developed at the Institut für Neuroinformatik, Systembiophysik, Ruhr University of Bochum, Germany, proved to be a reliable concept for face finding and face recognition in three international contests, namely the 1996 FERET test (Rizvi et al., 1998; Phillips et al., 2000), the Face Recognition Vendor Test (Phillips et al., 2003) and the face authentication test based on the BANCA database (Messer et al., 2004). A good overview is given in Tewes (2006), chapter 3.

Elastic Graph Matching

Elastic Graph Matching (EGM) is the umbrella term for model graph and bunch graph matching. Its neurally inspired object recognition architecture, introduced in Lades et al. (1993), has been successfully applied to object recognition, face finding and face recognition. One advantage of EGM is that it does not require a perfectly segmented input image. Instead, the attention region of the tracking agent in charge provides the input images. The task of object recognition using these images as input can be quite difficult. As depicted in fig. 4.5, the attention regions of tracking agents that trace different targets can overlap.

Figure 5.1: Face detection using EGM. (a) Elastic Graph, (b) Input Image, (c) Matching result. From left to right, this figure illustrates the previously trained elastic graph, the input image provided by the tracking agent and the result of a matching process. At the position marked in (c), the object information which is represented in the graph (a) reaches the highest similarity with the input image (b).

Fig. 5.1 outlines EGM. The data structure is a two-dimensional labeled elastic graph, which represents the learned objects. An example of an elastic graph for face detection is shown in fig. 5.1 (a). The nodes of an elastic graph are labeled with a local image description of the represented object. The edges are attributed with a distance vector. Object recognition using EGM works in an unsupervised way. It searches for the set of node positions on the input image which best matches the represented object, in the sense of maximizing a similarity measure computed from the local image descriptions attached to the nodes of the corresponding graph. An example of EGM is illustrated in fig. 5.1. The elastic graph depicted in fig. 5.1 (a) is trained to represent a face. The input image in fig. 5.1 (b) contains a face which is not included in the training set. Fig. 5.1 (c) shows the result of the EGM. As expected, the face region of the input image shows the highest similarity with the faces stored in the elastic graph.

Model Graph and Model Graph Matching

As mentioned above, each node codes a local image description. The texture information of the neighborhood around the node positions proved to be a reliable description when applied for the recognition of the object. Texture can be well stored using Gabor jets.

Figure 5.2: Construction of a bunch graph. (a) Jet and model graph, (b) Bunch graph. Each node of the model graph is attributed with a local texture description. This texture description is obtained by a wavelet transformation with a family of Gabor wavelets and stored in a so-called jet. In order to increase the flexibility, several model graphs are added or overlaid to form a bunch graph.

The Gabor jets are obtained by convolving the input image with a family of Gabor wavelets. These wavelets differ in size and orientation. They code an orientation- and scale-sensitive description of the texture around a node position and are explained in detail in appendix B. Furthermore, appendix B lists in tab. B.1 and tab. B.2 the applied set of parameters which control the Gabor jets in the present work. An elastic graph whose nodes are attributed with Gabor jets is called a model graph G. In fig. 5.2 (a) a Gabor jet is depicted with its corresponding Gabor wavelets, which differ in orientation and scale. The model graph in fig. 5.2 (a) shows that a Gabor jet is connected to each node.

The recognition process using a model graph is called graph matching. For the purpose of graph matching, a similarity function which compares Gabor jets with each other is applied in the present work. The jets have to be extracted from the input image. This is done by convolving the whole image with the same Gabor wavelets which were used to create the entries of the model graph. The resulting Gabor wavelet transformed input image is called Feature Set Image (FeaStImage) and is created by attaching the computed Gabor jets to all pixels of the input image.
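For reference, the family of Gabor wavelets commonly used with this approach (Lades et al., 1993) has the form below. The concrete parameter values used in the present work are those listed in appendix B, so the formula should be read as the standard textbook form rather than a copy of that appendix:

$$ \psi_{\vec{k}}(\vec{x}) \;=\; \frac{k^{2}}{\sigma^{2}}\,\exp\!\left(-\frac{k^{2}x^{2}}{2\sigma^{2}}\right)\left[\exp\!\left(i\,\vec{k}\cdot\vec{x}\right) - \exp\!\left(-\frac{\sigma^{2}}{2}\right)\right], \qquad k = |\vec{k}|,\; x = |\vec{x}|, $$

where the wave vector $\vec{k}$ selects the scale and orientation of the kernel and the subtracted term removes the DC component. The jet at an image position is then the vector of complex filter responses $J_j = a_j e^{i\phi_j}$ over all kernels of the family; the FeaStImage mentioned above simply stores such a jet for every pixel.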

In order to conduct a graph matching, several so-called moves are performed on the FeaStImage. Each move modifies the location of the graph's nodes, and thus of the whole graph, with respect to the input image. After the node positions have been modified, the node entries of the model graph are compared with the corresponding jets of the FeaStImage. This comparison is done by using the similarity function S_abs of eq. (B.7) and is explained in detail in appendix B. The applied similarity function S_abs uses the magnitudes of the jet entries and has the form of a normalized scalar product. It returns a similarity value for each node. By averaging the similarity values over all nodes, the similarity value s_G (eq. (B.8)) for the whole model graph is obtained¹. This similarity value s_G is used to rate the result of the EGM. Thus, the object which is represented by the model graph is defined to be at the position in the input image where the highest s_G has been computed. For object recognition, a set of previously learned bunch graphs (see below) is sequentially matched to the input image. The graph which receives the highest similarity s_G represents the classification result. In case s_G is below a certain threshold T_sG, the recognition is considered to be uncertain and is neglected.

Bunch Graphs

Wiskott et al. (1997) introduce an extension to the concept of model graphs. Model graphs are enlarged to store more object information and thus enhance the object recognition. While for model graphs only one Gabor jet is attached to each node, the bunch graph attaches several Gabor jets, as shown in fig. 5.2 (b). These Gabor jets are taken from different presentations of the same class of objects, e.g., different faces or the same object in front of different backgrounds. They form a bunch of jets for each node position. The object recognition procedure using bunch graph matching is similar to the presented model graph matching. The calculation of the similarity value s_B in eq. (B.9) differs only in the computation of the similarities at each node position. This has to be modified insofar as now one jet of the image is compared to the whole bunch of jets connected to the bunch graph's node. For this purpose, each jet in the bunch is compared to the jet of the FeaStImage using the similarity function S_abs already mentioned. The maximal similarity value is associated with the corresponding node. Finally, the similarity for the bunch graph is determined in the same way as for the model graph, by averaging over all nodes.

¹ This work dispenses with moves in which the model graph can be distorted. Therefore, the distortion term which can be added to s_G is neglected.
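The similarity computations described above can be sketched in a few lines of C++, working only on the jet magnitudes. The functions follow the verbal description (a normalized scalar product per node, averaging over all nodes, and taking the best jet of a bunch per node); the exact equations (B.7) to (B.9) of the appendix may differ in detail, so this is an illustration rather than the reference implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Jet   = std::vector<double>;   // magnitudes of the Gabor responses at one position
using Bunch = std::vector<Jet>;      // several jets attached to one bunch graph node

// Normalized scalar product of the jet magnitudes, as described for S_abs (eq. B.7).
double jetSimilarity(const Jet& a, const Jet& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return (na > 0.0 && nb > 0.0) ? dot / std::sqrt(na * nb) : 0.0;
}

// Model graph similarity: average node similarity at the current node positions.
double modelGraphSimilarity(const std::vector<Jet>& graphJets,
                            const std::vector<Jet>& imageJets) {
    double sum = 0.0;
    for (std::size_t n = 0; n < graphJets.size(); ++n)
        sum += jetSimilarity(graphJets[n], imageJets[n]);
    return sum / static_cast<double>(graphJets.size());
}

// Bunch graph similarity: per node, only the best-matching jet of the bunch counts.
double bunchGraphSimilarity(const std::vector<Bunch>& bunches,
                            const std::vector<Jet>& imageJets) {
    double sum = 0.0;
    for (std::size_t n = 0; n < bunches.size(); ++n) {
        double best = 0.0;
        for (const Jet& jet : bunches[n])
            best = std::max(best, jetSimilarity(jet, imageJets[n]));
        sum += best;
    }
    return sum / static_cast<double>(bunches.size());
}
```

During matching, the moves shift the node positions, these similarities are re-evaluated, and the position with the largest graph similarity is taken as the detection result.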

Figure 5.3: BSL sign different. The sign different taken from the British Sign Language. It is performed by both hands moving synchronously from the inner region to the outside.

5.2 Hand Posture

The classification of a hand posture denotes the search for the most similar entry in a previously learned sign lexicon of hand postures. However, vision-based object recognition is a complicated task (von der Malsburg, 2004), and the human hand, as an articulated object, makes hand posture recognition even more difficult. It has 27 degrees of freedom (Pavlovic et al., 1997) and therefore is very flexible and hard to model (Wu et al., 2005). Even if the scaling problem is left out by limiting the setup to a fixed distance from the camera, the recognition has to deal with the remaining problems of rotation and lighting conditions, as described in Barczak and Dadgostar (2005). Eliminating the rotation would be possible by increasing the number of angles to which the model graph has to be compared. This makes the classification computationally expensive. Therefore, the present work neglects rotation invariance; rotated hand images (fig. 5.3) are treated as different hand postures.

The lighting conditions are sensitive to different sources of light and to other objects producing shadows on the object to be detected. The way the

object appears may change dramatically with variations in the surrounding area. If new situations arise, it is unlikely that the object will be detected. Therefore, it is important to collect appropriate training images. Handled in the same way as the rotation problem, objects that include shadows or that are partially occluded can best be encoded if the examples include these situations.

For the purpose of hand posture recognition, the present work implements two different and independent classification methods. The first method is based on bunch graph matching, which has already been introduced for face detection, while the second classification method is contour matching based on shape context information. As the bunch graph and the contour matching approach are quite different and demand diverse input data, they are applied to complement each other in hand posture recognition. Nevertheless, the aim of both techniques is to extract the hand posture and assign it, if possible, to the most similar element of the corresponding sign lexicon.

Unlike the position, the hand postures are stored in a discrete manner. Thus, the first step of each technique is to build up a sign lexicon. It consists of a set of representative hand postures, where each is assigned a unique index number. The multiplicity of hand postures makes it necessary to construct the sign lexicons of hand postures by learning from examples instead of using hand-labeled data.

5.2.1 Bunch Graph Matching for Hand Posture Classification

Bunch graph matching has already been applied to hand posture recognition by Becker et al. (1999), Triesch and von der Malsburg (2001b) and Triesch and von der Malsburg (2002). As suggested in Triesch and von der Malsburg (2002), the classification is enhanced by using a bunch graph which models complex backgrounds. For this purpose, model graphs taken from one hand posture in front of different backgrounds are merged into a bunch graph. A hand posture in front of different backgrounds is shown in fig. 5.4. These bunch graphs proved to enhance the matching because the nodes which lie on the edge of the object always extract and compare their texture information with some parts (depending on the used scales of the Gabor kernels) of the background. Thus, during the matching process the nodes with the most similar learned background, compared to the current background, are used to compute the similarity between the bunch graph and the presented image.

Figure 5.4: Bunch Graph with Background. (a) Original Image, (b) Extracted Hand Posture, (c) Different Background. The nodes on the edge of the object always extract and compare their texture information with some part (depending on the used scales of the Gabor kernel) of the background. Thus, it proved to enhance the bunch graph matching if the hand posture is extracted from the original image, fig. 5.4 (a). Finally, the extracted hand posture, fig. 5.4 (b), can be pasted in front of different backgrounds, as pictured in fig. 5.4 (c).

Each of these scenes produces a model graph to be added to the bunch graph. Instead of different representations of the object, this bunch graph stores descriptions of the same model with different backgrounds. The bunch graphs used by Triesch and von der Malsburg (2001b) and Triesch and von der Malsburg (2002) are hand-labeled. Their node positions are manually placed at anatomically significant points, i.e., on the rim of the hand and on highly textured positions within the hand. Their work proved to give good recognition results on data sets containing 10 examples. In the case of sign language recognition, the huge amount of variation makes it complicated to set up a sign lexicon of representative hand postures manually. As reported by Triesch and von der Malsburg (2002), problems result from the large geometric variations which occur between different instantiations of a hand posture. These variations would require the bunch graph to be distorted in a way that would allow every node to easily reach a number of false targets, and therefore the overall performance would become worse (Triesch and von der Malsburg, 2002). Hence, the bunch graphs in the present work are not distorted, accepting that this enlarges the sign lexicon.

36 Feature Extraction 31 Figure 5.5: Automatic model graph extraction Examples of images for automatic model graph extraction. In contrast to the hand labeled data, the node positions are not set to geometric markers. Creation of the Sign Lexicon Automatic and unsupervised creation of bunch graphs used for hand posture classification has been a research topic in Ulomek (2007). The author investigates different methods for the automatic creation of a bunch graph, especially using corner information. Therefore, he computed the recognition rates and the average calculation times to compare the different concepts. His experiments show that the bunch graph having regularly spaced nodes are performing better than the ones including corner information for node placing 2. The more densely the nodes are set, the better is the recognition rate at the expense of increase in calculation time. Hence, the bunch graphs in the present work ignore corner information and are created by putting a grid upon the hand posture as shown in fig The sign lexicon is data driven, using a simple but automatic and unsupervised training and clustering of the model graphs. Model graphs are automatically extracted from segmented images showing a left or a right hand. On the foreground a node is placed at every fourth pixel. Thus, a very dense model graph as illustrated in fig. 5.5 is obtained for each example. Model graphs with too few nodes are rejected to enlarge the robustness of the recognition, knowing that very small hand postures will not be included in the conducted sign lexicon. After their extraction, the model graphs are clustered according to algorithm 1 to reduce the amount of entries in the sign lexicon. However, a nodewise comparison using the similarity function of the matching process 2 Although it is shown in Ulomek (2007) that the hand labeled graphs perform better, they are not used in this work for the reasons stated above.

only makes sense if the topology of the model graphs and the number of nodes to be compared are equal. This is true for face finding but cannot be realized for the multitude of hand postures. Thus, an indirect similarity is measured by matching the extracted model graphs on the input image of the other model graph, using the resulting $s_G$ and a threshold value $T_{sg}$ to conclude the similarity between two hand postures. Hence, the number of elements, and thus the decision which postures are needed to represent the training set in the lexicon, comes out of the data itself. Only minor user interaction, setting the similarity threshold value $T_{sg}$, is necessary.

Algorithm 1: Clustering of model graphs
Data: List of input images, $T_{sg} = 0.95$
    create List of model graphs
    sort List by size
    while List not empty do
        take next entry as matching graph
        foreach model graph after the matching graph do
            match matching graph on the image of model graph
            if $s_G \geq T_{sg}$ then
                model graph is represented by the matching graph
                delete model graph from List
            else
                match next entry in List
            end
        end
    end
Result: Database of model graphs (sign lexicon)

To enhance the robustness of the recognition, a bunch graph is formed using the computed model graph and the same model graph with added background information, as mentioned above. The sign lexicons for the left and the right hand are used to train the HMMs (section 6.2) and to equip the recognition agent for its recognition task.

Contour Information for Hand Posture Classification

In contrast to bunch graph matching, hand posture classification using contour information can only be done by segmenting the image and splitting

it into the hand as foreground and the rest as background. The input data containing the contour or shape of the static hand posture is provided in the message of the corresponding tracking agent. The agent performs color segmentation based on the current value of its color sensor to extract the contour. The hand posture recognition using contour information is performed on closed hand contours like the ones depicted in fig. 5.6 (a). The similarity of two closed hand shapes is measured by searching for correspondences between reference points on both contours. In order to solve the correspondence problem, a local descriptor called shape context is attached to each point on the contour (Belongie et al., 2002). The present work uses the shape context as described in Schmidt (2006). At a reference point the shape context captures the distribution of the remaining points relative to its position. Thus, a global discriminative characterization of the contour is offered. Therefore, corresponding points on two similar contours have similar shape contexts.

Shape Context

A contour is represented by a discrete set of $n$ uniformly spaced reference points $\Phi = \{p_1, \ldots, p_n\}$, where $p_i \in \mathbb{R}^2$. These points need not correspond to key points such as maxima of curvature or inflection points. Given that contours are piecewise smooth, the reference points are a good approximation of the underlying continuous contour as long as the number $n$ of reference points is sufficiently large. As illustrated in fig. 5.6 (b) for a single shape context, its $K$ bins are uniformly distributed in log-polar space. Thus, the descriptor is more sensitive to positions of nearby sample points than to those of more distant points. For a point $p_i$ on the contour, a histogram $h_i$ of the relative coordinates to the remaining $n-1$ points is computed. This histogram is defined to be the shape context at $p_i$.

Matching with Shape Contexts

Finding correspondences between two contours amounts to assigning the most similar reference points on both contours. The similarity is measured by comparing the corresponding shape contexts. Hence, the dissimilarity between two contours is computed as a sum of matching errors between corresponding points. Consider a point $p_i$ on the first contour and a point $q_j$ on the second contour, and let $C_{ij} = C(p_i, q_j)$ denote the cost of matching these two points.

Figure 5.6: Shape Context. (a) Contour Extraction, (b) Shape Context. Closed contours as shown on the left side (a) provide the input data for the recognition, which is based on local image descriptors. The shape context descriptor depicted in (b) collects contour information by recording the distribution of the reference points relative to a fixed point in log-polar coordinates.

As shape contexts are distributions represented as histograms, it is natural to use the chi-square test statistic:

$$C_{ij} \equiv C(p_i, q_j) = \frac{1}{2} \sum_{k=1}^{K} \frac{[h_i(k) - h_j(k)]^2}{h_i(k) + h_j(k)}, \qquad (5.1)$$

where $h_i(k)$ and $h_j(k)$ denote the $K$-bin normalized histograms at $p_i$ and $q_j$, respectively. Given the matrix of costs $C_{ij}$ between all pairs of points $p_i$ on the first contour and $q_j$ on the second contour, the total cost of a matching described by a permutation $\varepsilon$,

$$H(\varepsilon) = \sum_{i} C(p_i, q_{\varepsilon(i)}), \qquad (5.2)$$

is minimized subject to the constraint that the matching is one-to-one (Schmidt, 2006). In order to apply eq. (5.2), both contours have to have the same number of shape context points. Differing from Belongie et al. (2002), the present work restricts the matching to preserve topology, so the order of the points is a stringent condition: if point $p_i$ of the first contour corresponds to point $q_j$ of the second contour, then $p_{i+1}$ has to correspond to $q_{j+1}$ or, in the opposite direction, to $q_{j-1}$. The direction is a result of the computation of the best matching result.
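To make the descriptor and the cost of eq. (5.1) concrete, the following Python sketch computes log-polar shape context histograms and the chi-square cost matrix $C_{ij}$ for two contours. The bin layout (5 radial times 12 angular bins) and the mean-distance normalization are assumptions for illustration; the thesis only states that the $K$ bins are uniform in log-polar space.

```python
import numpy as np

def shape_context(points, i, n_r=5, n_theta=12, r_min=0.125, r_max=2.0):
    """Log-polar histogram of all remaining contour points relative to points[i].
    The bin layout (5 radial x 12 angular bins) and the mean-distance
    normalization are illustrative assumptions."""
    diff = np.delete(points, i, axis=0) - points[i]
    r = np.hypot(diff[:, 0], diff[:, 1])
    r = r / (r.mean() + 1e-12)                        # scale normalization
    theta = np.arctan2(diff[:, 1], diff[:, 0]) % (2 * np.pi)
    r_edges = np.logspace(np.log10(r_min), np.log10(r_max), n_r + 1)
    r_bin = np.clip(np.digitize(r, r_edges) - 1, 0, n_r - 1)
    t_bin = (theta / (2 * np.pi / n_theta)).astype(int) % n_theta
    hist = np.zeros((n_r, n_theta))
    np.add.at(hist, (r_bin, t_bin), 1.0)
    return (hist / hist.sum()).ravel()                # normalized K-bin histogram

def chi2_cost(h_i, h_j):
    """Matching cost C_ij of eq. (5.1)."""
    denom = h_i + h_j
    denom[denom == 0] = 1.0                           # empty bins contribute zero
    return 0.5 * np.sum((h_i - h_j) ** 2 / denom)

def cost_matrix(contour_p, contour_q):
    """C_ij for all pairs of reference points (equal numbers of points assumed)."""
    hp = [shape_context(contour_p, i) for i in range(len(contour_p))]
    hq = [shape_context(contour_q, j) for j in range(len(contour_q))]
    return np.array([[chi2_cost(a, b) for b in hq] for a in hp])
```

The resulting cost matrix is the input to the order-preserving, one-to-one matching whose total cost is minimized in eq. (5.2).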

Shape Distance Matching

The similarity of two contours measured at their $n$ reference points is expressed by their shape distance $D_{SC}$. The shape distance is computed as the sum of the shape context matching costs of the best matching reference points, using $\varepsilon_{min}$ as the optimal permutation:

$$D_{SC} = \frac{1}{n} H(\varepsilon_{min}). \qquad (5.3)$$

The shorter the distance $D_{SC}$, the higher is the similarity between the two contours. Hence, $D_{SC}$ can serve as a measure for a nearest neighbor classification. Together with a threshold value $T_{SC}$, the shape distance $D_{SC}$ is used to construct the sign lexicon by a standard vector quantization as described in Gray (1990). Similar to the bunch graph method, the training of the sign lexicon is performed with only very little user interaction. Again, just a threshold for the maximal distance $T_{SC}$, which declares two contours to be similar enough to be represented by one of them, has to be set. The same distance of eq. (5.3) is used for the classification of the current hand posture. In case no entry of the contour sign lexicon is below the threshold (footnote 3), the current hand posture cannot be detected by the contour recognition agent.

Footnote 3: Reasons are mainly connected with poor color segmentation or too small shapes.
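How the threshold $T_{SC}$ might be used for lexicon construction and classification can be illustrated with a short sketch. The greedy clustering below is an assumption; the thesis cites Gray (1990) for vector quantization without spelling out the exact procedure, and the function shape_distance stands for an implementation of eq. (5.3).

```python
def build_contour_lexicon(training_contours, t_sc, shape_distance):
    """Greedy vector-quantization sketch: a contour becomes a new lexicon entry
    unless an existing entry already lies within the distance T_SC. This is an
    illustration only; shape_distance implements D_SC of eq. (5.3)."""
    lexicon = []
    for contour in training_contours:
        if all(shape_distance(contour, entry) > t_sc for entry in lexicon):
            lexicon.append(contour)
    return lexicon

def classify_contour(contour, lexicon, t_sc, shape_distance):
    """Nearest-neighbour classification with rejection, based on D_SC of eq. (5.3)."""
    if not lexicon:
        return None
    dists = [shape_distance(contour, entry) for entry in lexicon]
    best = min(range(len(dists)), key=dists.__getitem__)
    return best if dists[best] <= t_sc else None      # None: posture not detected
```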

41 Chapter 6 Recognition As presented in the previous chapter, the data sources for sign language recognition are the position and the posture of the left and the right hand. Since the hand posture is determined by two procedures, which have two different approaches to the classification, the results of bunch graph and contour classification are supposed to be independent and will be treated as two different features. Therefore, six features/observations can be extracted from each frame of the sign sequence. These six feature streams are collected over time in order to come to a decision whether the presented sign is known to the system and, if so, which sign is performed. The feature integration method applied for the sign language recognition works on-line. The data is collected for every frame and is directly processed by the recognition system during performance of the sign. Thus, the most probable sign for the image sequence up to the current frame can be given. Sometimes, recognition of a static hand posture might fail in the bunch graph or the contour matching agent or for both of them. Therefore, the challenge is recognition with varying temporal execution of the sign as well as handling missing data or falsely classified data. The ability of Hidden Markov models to compensate time and amplitude variations of signals has been amply demonstrated for speech (Rabiner, 1989; Rabiner and Juang, 1993; Schukat-Talamazzini, 1995) and character recognition (Bunke et al., 1995). The use of an HMM is twofold. First, during the training phase, it is applied to store a specified data sequence and its possible variations, while second the stored information can be compared with an observed data sequence by computing the similarity over time during the recognition phase. According to the divide and conquer strategy, the present work uses a single HMM for each feature to store the temporal and spatial variations that occur when the sign is performed several times.

42 Recognition 37 The task of the recognition system lies not only in recognizing the sequence of a single feature, e.g., the left hand position but also in fusing the feature recognition information. This challenge of integrating different features is done by differentiating into strong and weak features. The position of the hands build up the strong feature which gives the baseline of the recognition. The hand posture information is chosen as the weak features. They are bound to the strong features and will be used in a rewarding way. Thus, weak features will add confidence to the recognition of the sign if they have been detected in the observation sequence. Otherwise, they will not change or punish the recognition if they are absent. The idea of defining position as the strong feature which gives the basis for sign recognition is taken from the early experiments by Johansson (1973), where moving light images suggest that many movements or gestures may be recognized by motion information alone. Besides, the way of the feature description confirms the strong and weak feature method in this work. Position has a Euclidean metric, which can describe the similarity of surrounding points, while recognition of the hand posture is based on discrete sign lexicons, where each entry is chosen to be dissimilar from the others. In order to run the sign language recognition system in real world applications, the detection of the temporal start and end point of meaningful gestures is important. This problem is referred to as gesture spotting (Lee and Kim, 1999; Derpanis, 2004) and addresses the problem of co-articulation in that the recognition system has to ignore the meaningless transition of the hands from the end of the previous sign to the start point of the following sign. Another critical point of consideration in any recognition system is sign rejection, the rejection of an unknown observation sequence. The idea of present work is to solve the previously mentioned problems by distributing the information of the signs among autonomous subsystems, one for each sign. Every subsystem calculates its similarity to the running sign sequence and decides about the start and end of the sign by its own. This chapter will start with an introduction of the theory of HMM 1. The structure and basic problems of the HMM are explained in section 6.1. The HMM has to be adapted to the problem of sign language recognition. In section 6.2 the modification undertaken in the present work will be motivated and explained in detail. The whole sign recognition is performed by a recognition agent that will be presented in section The introduction of Hidden Markov models is based on the famous paper of Rabiner (1989) which is recommended for a more detailed description of this topic.

43 Recognition 38 Figure 6.1: 4 State Markov Chain Markov Chain having four states (denoted by a circle). Each state can emit just one symbol, e.g., S1. The transition to the next state is dependent on the transition probability distribution at each state. The Markov chain is a first order Markov process, where future states depend only on the present state. 6.1 Hidden Markov Models The capabilities of an Hidden Markov model are to store a sequence of symbols 2 and to produce a sequence of observations based on the previously stored information. The production of the observation sequence is described by two stochastic processes. The first stochastic process applies the Markov chain (Meyn and Tweedie, 1993) and directs the transmission from the current state to the next state. A Markov chain as illustrated in fig. 6.1 is characterized by the distribution of the transition probabilities at each state. Thus, by running through the chain, it produces a sequence of states like S 1, S 3, S 3,.... In order to produce a sequence of observations each state can 2 including the occurring variations of the sequence

44 Recognition 39 Figure 6.2: Bakis Model HMM with the left-right (Bakis) topology, typically used in gesture and speech recognition. The solid lines denote the transition probabilities and thus a ij = 0 if j < i j > i + 2. The dotted line connects a continuous observation distribution to the belonging state (circle). be attributed to an observation symbol. The transition through the chain is described by a Markov process of order 1, in which the conditional probability distribution of future states, depends only on the present state. In contrast to the Markov chain where each state of the model can only emit a single symbol, states in the HMM architecture as depicted in fig. 6.2 can emit one symbol out of a distinct alphabet. The probability of emitting the symbol is stored for each state in the emission probability distribution over the alphabet. The probability of emitting a symbol in the state S 1 can be interpreted as the probability to observe the symbol, when being in state S 1. Therefore, the emission probability distribution is called observation distribution in the recognition process. The decision of emitting a symbol represents the second stochastic process. As the observer of an HMM only observes the emitted symbols, while the emitting states remain unknown, the states of an HMM are called hidden. Due to their doubly stochastic nature, HMMs are very flexible and became well known in the gesture recognition community, see chapter 2.

6.1.1 Elements of an HMM

A Hidden Markov Model is characterized by the following elements:

1. N, the number of states in the model. The individual states are denoted by $S = \{S_1, S_2, \ldots, S_N\}$, and the state at time t by $q_t$.

2. M, the number of distinct observation symbols per state, i.e., the size of a discrete alphabet. The observation symbols correspond to the physical output of the system being modeled. The included symbols are denoted by $V = \{v_1, v_2, \ldots, v_M\}$. If the observation is in a continuous space, M is replaced by a continuous distribution over an interval of possible observations.

3. The transition probability distribution $A = \{a_{ij}\}$, which is given by

$$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N \qquad (6.1)$$

and which adds up for a fixed state $S_i$ as

$$\sum_{j} a_{ij} = 1. \qquad (6.2)$$

Just like the state transition of the Markov chain, the state transition probability $a_{ij}$ (eq. (6.1)) from state $S_i$ to state $S_j$ only depends on the preceding state (first-order Markov process). For the special case that any state can reach any other state in a single step (the ergodic model), the transition probabilities have $a_{ij} > 0$ for all $i, j$. For other types of HMM (e.g., the Bakis model in fig. 6.2), the transition probabilities can be $a_{ij} = 0$ for one or more $(S_i, S_j)$ state pairs.

4. The observation probability distribution $B = \{b_j(k)\}$ in state j, where

$$b_j(k) = P(v_k \text{ at } t \mid q_t = S_j), \quad 1 \le j \le N, \; 1 \le k \le M \qquad (6.3)$$

and

$$\sum_{k} b_j(k) = 1. \qquad (6.4)$$

The observation probability distribution B can be discrete or continuous. It describes the probability of emitting/observing the symbol $v_k$ when the model is in state j.

5. The initial distribution $\pi = \{\pi_i\}$, where

$$\pi_i = P(q_1 = S_i), \quad 1 \le i \le N \qquad (6.5)$$

specifies the probability of the state $S_i$ to be the first state starting the HMM calculations.

Thus, a complete specification of an HMM consists of two model parameters (N and M), the specification of the observation symbols, and the specification of the three probabilistic measures A, B and π. In the following, the compact notation

$$\lambda = (\pi, A, B) \qquad (6.6)$$

will be used to indicate the complete parameter set of the model λ.

Generation of an Observation Sequence

Having appropriate values for an HMM λ, algorithm 2 can be used to generate a sequence of length T of observations $o_t$ taken from the alphabet V,

$$o = (o_1, o_2, \ldots, o_T). \qquad (6.7)$$

The advantage of algorithm 2 is twofold: first, it can generate a sequence of observations (eq. (6.7)) using an HMM, and second, it is able to model how a given observation sequence was generated by an appropriate HMM.

Algorithm 2: This algorithm can be applied to generate an observation sequence. It can also model how a given observation sequence was generated by an appropriate HMM.
Data: $\lambda = (\pi, A, B)$
Result: o
1  Choose the initial state $q_1 = S_i$ according to π
2  t = 1
3  while $t \le T$ do
4      Observation: select $o_t = v_k$ according to $B_i$ in state $S_i$
5      Transition: based on $(a_{ij})$ of state $S_i$, change to the new state $q_{t+1} = S_j$
6      t = t + 1
7  end
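To make the notation of eq. (6.6) and algorithm 2 concrete, the following minimal Python sketch stores a discrete HMM $\lambda = (\pi, A, B)$ and samples an observation sequence from it. The class and the toy parameter values are illustrative and not taken from the thesis implementation.

```python
import numpy as np

class DiscreteHMM:
    """Minimal container for lambda = (pi, A, B) with N states and M symbols."""
    def __init__(self, pi, A, B):
        self.pi = np.asarray(pi, dtype=float)   # initial distribution, shape (N,)
        self.A = np.asarray(A, dtype=float)     # transition probabilities, shape (N, N)
        self.B = np.asarray(B, dtype=float)     # emission probabilities, shape (N, M)

    def sample(self, T, rng=None):
        """Generate an observation sequence o = (o_1, ..., o_T), cf. algorithm 2."""
        if rng is None:
            rng = np.random.default_rng()
        N, M = self.B.shape
        q = rng.choice(N, p=self.pi)                   # choose q_1 according to pi
        o = []
        for _ in range(T):
            o.append(int(rng.choice(M, p=self.B[q])))  # emit a symbol according to B in state q
            q = rng.choice(N, p=self.A[q])             # move on according to (a_ij)
        return o

# toy example: a two-state left-right model, state 0 prefers symbol 0, state 1 symbol 1
hmm = DiscreteHMM(pi=[1.0, 0.0],
                  A=[[0.7, 0.3], [0.0, 1.0]],
                  B=[[0.9, 0.1], [0.2, 0.8]])
print(hmm.sample(T=10))
```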

6.1.2 HMM Basic Problems

The art of HMMs lies in their topology of allowed transitions, the features to be observed and their emission probabilities. While the choice of features depends heavily on the observation task, the solution of the following three basic problems helps to set transition and emission probabilities for real-world applications.

Evaluation problem: Given the model λ and the observation sequence o, how do we efficiently compute $P(o \mid \lambda)$, the probability of generating the observation sequence given the model?

Decoding problem: Given the model λ and the observation sequence $o = (o_1, o_2, \ldots, o_T)$, how do we choose a corresponding state sequence $q = q_1, q_2, \ldots, q_T$ which is meaningful in some sense, i.e., best explains the observations?

Estimation problem: How do we train the HMM by adjusting the model parameters $\lambda = (\pi, A, B)$ to maximize $P(o \mid \lambda)$?

6.1.3 Evaluation Problem

The objective is to calculate how well a given model λ matches a presented observation sequence o. This computation is important for any recognition application using HMMs. The standard way to recognize a gesture g out of a set G is to train an HMM $\lambda_g$ for every single known gesture $g \in G$. Based on the trained data, the recognition starts after the observation sequence of g is recorded, with the calculation of $P(o \mid \lambda_g)$ for every HMM $\lambda_g$. This computation is done by solving the evaluation problem. Finally, the model $\lambda_{\hat{g}}$ which produces the highest probability of describing the observation sequence,

$$\hat{g} = \arg\max_{g \in G} P(o \mid \lambda_g), \qquad (6.8)$$

is deemed to be the recognized gesture $\hat{g}$.

Given the observation sequence o and the HMM λ, the straightforward way to compute $P(o \mid \lambda)$ is to enumerate every possible state sequence q of length T and compute the corresponding probability $P(o \mid q_1, \ldots, q_T)$. The results are then added up to receive the overall $P(o \mid \lambda)$. The evaluation problem is solved by calculating the observation probability while running through a distinct sequence of states S of the HMM. Therefore,

first the probability $P(S \mid \lambda)$ of the HMM passing through this fixed state sequence is computed by

$$P(S \mid \lambda) = P(q_1, \ldots, q_T \mid \lambda) = \pi_{q_1} \prod_{t=2}^{T} a_{q_{t-1} q_t}. \qquad (6.9)$$

Then the observation probability of producing o by the state sequence S is calculated:

$$P(o \mid S, \lambda) = P(o_1, \ldots, o_T \mid q_1, \ldots, q_T, \lambda) = \prod_{t=1}^{T} b_{q_t}(o_t). \qquad (6.10)$$

Thus, the joint probability of S and o can be calculated from the probabilities of eq. (6.9) and eq. (6.10) using the chain rule:

$$P(o, S \mid \lambda) = P(o \mid S, \lambda)\, P(S \mid \lambda) = \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t). \qquad (6.11)$$

Finally, the observation probability is computed by

$$P(o \mid \lambda) = \sum_{S \in Q_T} P(o, S \mid \lambda) = \sum_{S \in Q_T} P(o \mid S, \lambda)\, P(S \mid \lambda) = \sum_{S \in Q_T} \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t), \qquad (6.12)$$

summing over all possible state sequences $Q_T$ of length T. Using eq. (6.12) to solve the evaluation problem, one has to sum over $N^T$ state sequences, resulting in about $2T \cdot N^T$ multiplications to calculate $P(o \mid \lambda)$. The computation time grows exponentially with the sequence length T. Therefore, even for small models and short observation sequences the calculation of the observation probability would take too much time. Thus, the above computation is by far too expensive for real-world applications, and the more efficient Forward-Backward algorithm is used.
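For illustration only, the brute-force evaluation of eq. (6.12) can be written down directly; it reuses the DiscreteHMM container sketched above and is feasible only for toy-sized models.

```python
from itertools import product

def evaluate_naive(hmm, o):
    """Brute-force P(o | lambda) of eq. (6.12): sum the joint probability of
    eq. (6.11) over all N**T state sequences; feasible only for toy sizes."""
    N = len(hmm.pi)
    total = 0.0
    for q in product(range(N), repeat=len(o)):        # every state sequence in Q_T
        p = hmm.pi[q[0]] * hmm.B[q[0], o[0]]
        for t in range(1, len(o)):
            p *= hmm.A[q[t - 1], q[t]] * hmm.B[q[t], o[t]]
        total += p
    return total
```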

Forward-Backward Algorithm

The Forward-Backward algorithm (Rabiner, 1989) is a recursive algorithm which runs the calculation of $P(o \mid \lambda)$ in time linear in the length of the observation sequence. The algorithm is based on the computation of the forward variable α and the backward variable β. The forward variable is defined as

$$\alpha_t(j) = P(o_1, \ldots, o_t, q_t = S_j \mid \lambda), \qquad (6.13)$$

the probability of observing the partial observation sequence $o_1, o_2, \ldots, o_t$ and being in state $S_j$ at time t, given the model λ, while the backward variable β,

$$\beta_t(i) = P(o_{t+1}, \ldots, o_T \mid q_t = S_i, \lambda), \qquad (6.14)$$

is defined as the probability of observing the partial observation sequence $o_{t+1}, \ldots, o_T$ of an HMM λ being in state $S_i$ at time t. Algorithm 3 describes the computation of α and β; their computation differs mainly in the temporal development of the recursion. As shown in the termination step, the resulting $P(o \mid \lambda)$ can be computed using either $\alpha_t(j)$ or $\beta_t(i)$ or both, using:

$$P(o \mid \lambda) = \sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j). \qquad (6.15)$$

Applying the Forward-Backward algorithm, especially the multiplication of the probabilities in eq. (6.15), the recognition would not be robust to:

1. a missing observation,
2. a falsely classified static hand posture which was not included in the training data (footnote 3),
3. an observation sequence which takes longer than the learned ones.

The first problem can be solved by perfect tracking and faultless classification of static hand postures. The remaining problems become less crucial when collecting more training data. However, up to now neither perfect tracking nor classification could be achieved. Thus, the aim of the present work is to develop a system that works reliably and robustly under real-world conditions, where only sparse data is obtained.

Footnote 3: Using a discrete observation distribution, zero will be returned for the observation probability of a symbol that has not been learned for the particular state.
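For reference, the two recursions of eqs. (6.13) and (6.14), spelled out in algorithm 3 below, can be written compactly as follows; the sketch again assumes the DiscreteHMM container from the earlier example.

```python
import numpy as np

def forward(hmm, o):
    """alpha[t, j] = P(o_1..o_t, q_t = S_j | lambda), eq. (6.13)."""
    T, N = len(o), len(hmm.pi)
    alpha = np.zeros((T, N))
    alpha[0] = hmm.pi * hmm.B[:, o[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ hmm.A) * hmm.B[:, o[t]]  # induction
    return alpha

def backward(hmm, o):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = S_i, lambda), eq. (6.14)."""
    T, N = len(o), len(hmm.pi)
    beta = np.ones((T, N))                                  # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = hmm.A @ (hmm.B[:, o[t + 1]] * beta[t + 1])
    return beta

# eq. (6.15), valid at any time step t:
# p_o = (forward(hmm, o)[t] * backward(hmm, o)[t]).sum()
```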

Algorithm 3: The Forward-Backward algorithm is used for efficient computation of $P(o \mid \lambda)$. Having computed the forward and backward variables, the probability of observing the sequence is calculated in the termination step.

Forward variable:
    Initialization: for $j = 1, 2, \ldots, N$ initialize $\alpha_1(j) = \pi_j\, b_j(o_1)$
    Induction: for $t > 1$ and $j = 1, 2, \ldots, N$ compute
        $\alpha_t(j) = \left( \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right) b_j(o_t)$
    Termination: compute $P(o \mid \lambda) = \sum_{j=1}^{N} \alpha_T(j)$

Backward variable:
    Initialization: for $i = 1, 2, \ldots, N$ initialize $\beta_T(i) = 1$
    Induction: for $t < T$ and $i = 1, 2, \ldots, N$ compute
        $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$
    Termination: compute $P(o \mid \lambda) = \sum_{j=1}^{N} \pi_j\, b_j(o_1)\, \beta_1(j)$

6.1.4 Decoding Problem

The aim of solving the decoding problem is to uncover the hidden part of the HMM, i.e., to find the most likely sequence of hidden states $\hat{q}$ that could have emitted a given output sequence o. This information about $\hat{q}$ becomes interesting when physical significance is attached to the states or to sets of states of the model λ. Besides, the decoding of the state sequence can be used for solving the estimation problem, which is applied for training the HMM. Although the state sequence is hidden, it can be estimated by the knowledge of the observation sequence o and the model parameters of the HMM λ. Therefore, it is postulated in the present work that the observation sequence

o is generated by one state sequence $q \in Q_T$ of length T. Thus, the state sequence q that maximizes the posterior probability

$$P(q \mid o, \lambda) = \frac{P(o, q \mid \lambda)}{P(o \mid \lambda)} \qquad (6.16)$$

is defined to be the producing sequence. Ignoring the (constant) denominator of eq. (6.16), the optimal state sequence $\hat{q}$ with

$$P(o, \hat{q} \mid \lambda) = \max_{q \in Q_T} P(o, q \mid \lambda) =: \hat{P}(o \mid \lambda) \qquad (6.17)$$

can be computed using the Viterbi algorithm.

Viterbi Algorithm

The Viterbi algorithm (Viterbi, 1967) is a dynamic programming algorithm which is applied to find the sequence of hidden states $\hat{q}$ most likely to emit the sequence of observed events. This sequence of hidden states $\hat{q}$ is called the Viterbi path. The Viterbi algorithm is based on several assumptions. First, both the observed events and the hidden states have to be in a sequence corresponding to time. Second, these two sequences have to be aligned; thus, an observed event needs to correspond to exactly one hidden state. Third, computing the most likely hidden sequence $\hat{q}_t$ up to a certain time t has to depend only on the observed event at time t and the most likely sequence $\hat{q}_{t-1}$ of the previous time step $t-1$. These assumptions are all satisfied in the first-order hidden Markov model. The concept behind the Viterbi algorithm is based on recursion with result caching (sometimes called memoization in the computer science literature) and is shown in algorithm 4. Instead of the forward variable $\alpha_t(j)$ from section 6.1.3, the maximal achievable probability

$$\vartheta_t(j) = \max_{q_1, \ldots, q_{t-1}} P(o_1, \ldots, o_t, q_1, \ldots, q_{t-1}, q_t = S_j \mid \lambda) \qquad (6.18)$$

of running through a sequence of states to produce the partial observation sequence $o_1, \ldots, o_t$ under the constraint that the HMM ends in state $S_j$ is applied. Thus, $\vartheta_t(j)$ is the best score (highest probability) along a single path at time t which accounts for the partial observation sequence until time t and ends in state $S_j$. The following state is obtained by induction,

$$\vartheta_{t+1}(j) = \left[ \max_{i} \vartheta_t(i)\, a_{ij} \right] b_j(o_{t+1}). \qquad (6.19)$$
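A direct transcription of the recursion of eqs. (6.18) and (6.19), including the backtracking step of algorithm 4, might look as follows in Python; it again assumes the DiscreteHMM container sketched earlier and uses 0-based state indices instead of $S_1, \ldots, S_N$.

```python
import numpy as np

def viterbi(hmm, o):
    """Viterbi path (cf. algorithm 4): returns the best score P_hat(o | lambda)
    and the most likely state sequence q_hat (0-based state indices)."""
    T, N = len(o), len(hmm.pi)
    theta = np.zeros((T, N))            # best path score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)   # back-pointers
    theta[0] = hmm.pi * hmm.B[:, o[0]]
    for t in range(1, T):
        scores = theta[t - 1][:, None] * hmm.A        # theta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        theta[t] = scores.max(axis=0) * hmm.B[:, o[t]]
    q_hat = [int(theta[-1].argmax())]
    for t in range(T - 1, 0, -1):                     # backtracking
        q_hat.append(int(psi[t, q_hat[-1]]))
    return float(theta[-1].max()), q_hat[::-1]
```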

Algorithm 4: The Viterbi algorithm is similar to the implementation of the forward variable $\alpha_t(j)$. The difference is the maximization over the previous states during the recursion, which is used instead of the summation used in the induction step of the calculation of the forward variable in algorithm 3.

Initialization:
    $\vartheta_1(j) = \pi_j\, b_j(o_1)$, $\psi_1(j) = 0$, for $1 \le j \le N$
Recursion: for $2 \le t \le T$ and $1 \le j \le N$
    $\vartheta_t(j) = \max_{1 \le i \le N} \left[ \vartheta_{t-1}(i)\, a_{ij} \right] b_j(o_t)$
    $\psi_t(j) = \arg\max_{1 \le i \le N} \left[ \vartheta_{t-1}(i)\, a_{ij} \right]$
Termination:
    $\hat{P}(o \mid \lambda) = \max_{1 \le j \le N} [\vartheta_T(j)]$
    $\hat{q}_T = \arg\max_{1 \le j \le N} [\vartheta_T(j)]$
Backtracking:
    $\hat{q}_t = \psi_{t+1}(\hat{q}_{t+1})$

In order to follow the argument which maximizes eq. (6.19) for each t and j, the path history is stored in the array $\psi_t(j)$. As depicted in algorithm 4, the last step of the Viterbi algorithm, backtracking, is applied to determine the Viterbi path.

6.1.5 Estimation Problem

Solving the estimation problem, i.e., training an HMM, means adjusting its model parameters such that they best describe how a given observation sequence $o_{train}$ can be produced. In speech recognition, e.g., each word is represented by an HMM which codes the sequence of phonemes. The HMM is trained by presenting a number of repetitions of the coded word and adapting the HMM parameters to enhance the recognition using the Forward-Backward algorithm from section 6.1.3.

Up to now, there is no known analytical solution for the HMM parameters which maximize the probability of the training sequence (Rabiner, 1989). Instead, the Baum-Welch algorithm constitutes an iterative procedure which can be used to adapt $\lambda = (\pi, A, B)$ in such a way that $P(o_{train} \mid \lambda)$ is locally maximized.

Baum-Welch Algorithm

The Baum-Welch algorithm (Baum et al., 1970) is a generalized expectation-maximization algorithm. It computes maximum likelihood estimates and posterior mode estimates for the parameters (A, B and π) of an HMM λ, when given a set of observation sequences $o_{train} \in O$ as training data. The construction of an improved model $\hat{\lambda}$ is mainly adapted from the $\alpha_t(j)$ and $\beta_t(j)$ probabilities computed for the calculation of $P(o_{train} \mid \lambda)$. In order to describe the adaptation of the HMM parameters, the probability

$$\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j \mid o_{train}, \lambda) = \frac{P(q_t = S_i, q_{t+1} = S_j, o_{train} \mid \lambda)}{P(o_{train} \mid \lambda)} = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)}, \quad 1 \le t < T, \qquad (6.20)$$

of being in state $S_i$ at time t and in state $S_j$ at time $t+1$ is introduced. Further,

$$\gamma_t(i) = P(q_t = S_i \mid o_{train}, \lambda), \qquad (6.21)$$

the probability of being in state $S_i$ at time t, given the observation sequence $o_{train}$ and the model λ, is defined. The sums of the two probabilities over the sequence can be interpreted as:

$$\sum_{t=1}^{T-1} \gamma_t(i) = \text{expected number of transitions from } S_i, \qquad (6.22)$$

$$\sum_{t=1}^{T-1} \xi_t(i, j) = \text{expected number of transitions from } S_i \text{ to } S_j. \qquad (6.23)$$

They are connected by:

$$\gamma_t(i) = \sum_{j} \xi_t(i, j) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)}, \quad t < T. \qquad (6.24)$$

54 Recognition 49 Using the concept of event counting, the expectations for the transition from state S i to S j and the number of transitions through state S i, respectively are achieved by summing the ξ t (i, j) and γ t (i) values resulting from the training sequence. These estimates allow to change the model parameters of λ in order to improve its evaluation towards the training sequence. The Baum-Welch algorithm is depicted in algorithm 5 and works in two steps. In the first step the forward variable α and the backward variable β for each HMM state are calculated in order to compute ξ (eq. (6.20)) and γ (eq. (6.21)). Based on these computations, the second step updates the HMM parameters. The initial distribution ˆπ i of state S i is calculated by counting the expected number of times in S i and dividing it by the overall sequence observation probability. The transition probabilities are updated to â ij by counting the frequency of the transition pair values from state S i to S j and dividing it by the entire number of transitions through S i. Finally, the new emission probabilities ˆb j (k) for emitting the symbol v k when the HMM is in state S j are determined by the fraction of the expected number of times being in state S j and observing symbol v k in the numerator and the expected number of times in state S j in the denominator. If the initial HMM λ = {A, B, π} is modified according to the Baum- Welch algorithm, it has been proven by Baum et al. (1970) that the resulting HMM ˆλ = {Â, ˆB, ˆπ} is either ˆλ = λ or the model ˆλ is more likely in the sense that P(o ˆλ) > P(o λ), i.e., the new model is more likely to produce the training sequence. While running the Baum-Welch algorithm, the new model ˆλ is placed iteratively as λ. The HMM parameter adaptation is repeated until no further improvement of the probability of o train is observed from the model or a fixed number of iterations (i max ) has been reached. The final result of the Baum- Welch algorithm is called a maximum likelihood estimate of the HMM. It should be noted that the number of states N is not part of the parameter estimation process and has to be specified manually. This is an important decision. Lower values of N will reduce the computational cost at the risk of inadequate generalization, while higher values may lead to overspecialization. Another disadvantage mentioned in Rabiner (1989) is the dependence of the result of the Baum-Welch algorithm from the initialization of the trained parameters. While the transition matrix can be initialized with random values, the emission matrix should contain good initial estimates.
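Restated as code, one re-estimation iteration of eqs. (6.20)-(6.24) and of the update formulas in algorithm 5 (shown below) could be sketched as follows, for a single discrete training sequence and without the numerical scaling needed in practice. The sketch reuses DiscreteHMM, forward() and backward() from the earlier examples; it is an illustration, not the thesis implementation.

```python
import numpy as np

def baum_welch_step(hmm, o):
    """One re-estimation step (eqs. (6.20)-(6.24), algorithm 5) for a single
    discrete training sequence; no scaling, so only suitable for short toys."""
    T, N = len(o), len(hmm.pi)
    alpha, beta = forward(hmm, o), backward(hmm, o)
    p_o = alpha[-1].sum()                              # P(o_train | lambda)
    gamma = alpha * beta / p_o                         # eqs. (6.21)/(6.24)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):                             # eq. (6.20)
        xi[t] = alpha[t][:, None] * hmm.A * hmm.B[:, o[t + 1]] * beta[t + 1] / p_o
    pi_new = gamma[0]                                  # expected times in S_i at t = 1
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(hmm.B, dtype=float)
    obs = np.asarray(o)
    for k in range(hmm.B.shape[1]):                    # expected times in S_j seeing v_k
        B_new[:, k] = gamma[obs == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return DiscreteHMM(pi_new, A_new, B_new)
```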

Algorithm 5: The Baum-Welch algorithm works in two steps, first calculating the forward and backward variables and second updating the HMM parameters. Starting with an initial HMM $\lambda = (\pi, A, B)$, the algorithm optimizes the parameters towards the training sequence $o_{train}$. The algorithm runs until a maximal number of iterations $i_{max}$ has been reached. $\chi_{[\,\cdot\,]}$ is a function that returns 1 in case of a true statement and zero otherwise.

Result: $\hat{\lambda}$   /* improved HMM */
begin
    initialize: $\hat{\lambda} = \lambda$
    while $i_{max}$ not reached and $\lambda \neq \hat{\lambda}$ do
        /* Step one: */
        set: $\lambda = \hat{\lambda}$
        calculate: γ, ξ for $\hat{\lambda}$, $o_{train}$
        /* Step two: update parameters */
        $$\hat{\pi}_i = \text{expected number of times in state } S_i \text{ at } t=1 = \gamma_1(i) = \frac{\alpha_1(i)\,\beta_1(i)}{\sum_{i=1}^{N} \alpha_1(i)\,\beta_1(i)}$$
        $$\hat{a}_{ij} = \frac{\text{expected number of transitions from state } S_i \text{ to state } S_j}{\text{expected number of transitions from state } S_i} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} = \frac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)}$$
        $$\hat{b}_j(k) = \frac{\text{expected number of times in } S_j \text{ and observing symbol } v_k}{\text{expected number of times in state } S_j} = \frac{\sum_{t=1}^{T} \gamma_t(j)\, \chi_{[o_t = v_k]}}{\sum_{t=1}^{T} \gamma_t(j)} = \frac{\sum_{t=1}^{T} \alpha_t(j)\,\beta_t(j)\, \chi_{[o_t = v_k]}}{\sum_{t=1}^{T} \alpha_t(j)\,\beta_t(j)}$$
    end
end

56 Recognition Modification of the Hidden Markov Model Following HMM notation the extracted features introduced in chapter 5 will be called observations in the following. Each kind of observation has a particular degree of uncertainty. Due to the applied tracking method, the position on the object can vary, the contour might not be accurately determined due to blurring or erroneous segmentation and the bunch graph classification could have gone wrong. Thus, instead of putting all observations into one observation vector as shown in fig. 6.3 (a), they are separated into single channels fig. 6.3 (b) in order to enhance the robustness of recognition and to decrease the training effort. The parallel HMM (PaHMM) structure has been applied by Vogler and Metaxas (1999, 2001, 2003) and Tanibata et al. (2002). These authors divided the observations for left and right hand and trained a HMM for each (a) Single HMMs (b) parallel HMMs Figure 6.3: HMM architectures By using single HMMs (a) each state stores the feature vector using a multi-dimensional space. The parallel HMMs (b) code each feature (circle, cross, triangle) in a single HMM and combine the information during runtime using the concept of strong and weak features presented in section 6.3.

57 Recognition 52 hand. Shet et al. (2004) and Vogler and Metaxas (2003) show that parallel HMMs model the parallel processes independent from each other. Thus, they can be trained independently which has the advantage that considerations of the different combinations are not required at training time. Hence, each gesture g will be represented by six channels which separate the position y relative to the head, the hand posture observations of contour classification c and texture τ for left and for right hand. The HMMs coding the sign are running in parallel and are merged during the recognition process. For this purpose, the information of the weak (c, τ) and strong (y) feature channels pass through two stages of integration. At the first stage, the integration is limited to the three HMMs which code the observations for left or right hand, respectively. The following second stage connects the information of both hands in order to determine the similarity of the trained sign to the incoming observation sequence. The recognition process is administrated by a recognition agent and will be explained in section 6.3. The topology of the HMM, which serves to code the information of each channel is an extension of the Bakis model as seen in fig While the data structure of the HMM is the same as described in section 6.1, the modifications needed to include the HMM to the recognition system refer to the evaluation and the estimation problem. In order to achieve the aim of robust sign language recognition the computation of the probability of emitting the observed sign is solved in a way different from that explained in section This modified solution to the evaluation problem will be explained in section and was made to enhance the robustness of the recognition by allowing a flexible combination of the independent channels. A different way of solving the training of the HMMs was chosen in order to improve the capability of the composition of the HMM, given that only a few training sequences are available. This alternative to the Baum-Welch algorithm is presented in section Evaluation Problem The flexibility of the HMM depends on the training data which determines the transition probabilities a ij and the observation probability distributions B = {b j (k)} of the model. Under real world conditions the HMM will be confronted with unknown, i.e., not learned, observations or variations in the dynamics, e.g., caused by missing tracking information, blurring etc. This will lead to problems when using the conventional procedure to solve the evaluation problem (section 6.1.3). Thus, the doubly probabilistic nature of the HMM is separated by introducing a self-controlled transition between the HMM states. The self-controlled transition allows to pass the current

observation to the next state (null transition) or to ignore it. This approach applies a strict left-right model with $\pi_1 = 1$ and $\pi_i = 0$ for $i \neq 1$, and $a_{ij} = 1$ if $j = i+1$, $a_{ij} = 0$ otherwise. Instead of using the transition probability matrix A where the transitions are learned, the $a_{ij}$ are replaced by a weighting function $w_t(u)$ which is updated during the recognition process. Thus, the computation of the forward variable α introduced in section 6.1.3 changes to:

$$\alpha_{t+1}(j) = \alpha_t(i)\, \underbrace{a_{ij}}_{=1}\, w(u)\, b_j(o_{t+1}), \qquad (6.25)$$

and therefore the computation of $P(o \mid \lambda)$ becomes

$$P(o \mid \lambda) = \alpha_T(N). \qquad (6.26)$$

Eq. (6.25) is used to perform on-line gesture recognition and computes the probability of observing the partial observation sequence on every frame. The weighting function w, expressing the certainty for each channel, is inspired by a Gaussian:

$$w(u) = \exp\left(-\frac{u^2}{2\sigma}\right), \qquad (6.27)$$

where $u \in [0, \infty)$ is a measure of uncertainty. The modified calculation of $P(o \mid \lambda)$ is presented in algorithm 6. Starting with a maximal certainty of u = 0 at the beginning of the recognition, the HMM checks whether the received observation $o_t$ is present in the observation distribution $B_i$ of the current state $S_i$. In case the probability $b_i(o_t)$ is not satisfying, i.e., below a recognition threshold $T_r$, the HMM can pass the observation to the next state $S_{i+1}$, check again, and pass it further to the state $S_{i+2}$ if necessary. If the observation does not even match at state $S_{i+2}$, it will be ignored and the HMM returns to state $S_i$. Each of these transitions is punished by increasing the uncertainty u and thus lowering the weighting function. To gain its certainty back, the HMM recovers with every recognized observation by decreasing u. In case of a recognized observation the system switches to the next state.

6.2.2 Estimation Problem

Six different information channels ($f = 1, \ldots, 6$) are applied for sign recognition. Each feature sequence is coded and treated in a separate HMM. Because one of the assumptions made above is independence of the features, the HMMs are trained independently. The HMMs which represent a sign g have the same number of states $N_g$. Their training is performed in the same way, differing only in the used feature and the feature representation.

Algorithm 6: Process performed for the determination of the recognition probability of an HMM, being in state i and receiving a new observation $o_t$

if $b_i(o_t) > T_r$ then
    update (decrease) u
    P(o | λ) += b_i(o_t) · w(u)
else
    if $b_{i+1}(o_t) > T_r$ then   /* null transition */
        i = i + 1
        update (increase) u
        P(o | λ) += b_{i+1}(o_t) · w(u)
    else
        if $b_{i+2}(o_t) > T_r$ then   /* null transition */
            i = i + 2
            update (increase) u
            P(o | λ) += b_{i+2}(o_t) · w(u)
        else   /* ignore observation */
            update (increase) u
            re-enter state i = i − 1
        end
    end
end

Therefore, the following describes the training for one HMM $\lambda_{f,g}$, but the process is the same for the other feature HMMs. The estimation or training procedure of the λ parameters presented in this section is completely based on training examples. Thus, the HMM parameters are adapted towards a set $O_g$ of observation sequences which represent the sign g. In the first step, the number of states $N_g$ of the HMM has to be defined (as mentioned in section 6.1.5, the number of states of the trained HMM is not given by the Baum-Welch algorithm). For this purpose, two different strategies can be applied. The first strategy uses a fixed number of states for each trained sign, as described in Starner et al. (1998) (using four states for each HMM) and Rigoll et al. (1996) (using three states). The other strategy is to couple the number of

states with the length of the applied training sequences of $O_g$. This work follows the second strategy. In contrast to Grobel and Assan (1997) and Zahedi et al. (2005), who take the minimal sequence length of the training data, this work uses the maximal sequence length to determine the $N_g$ for the HMM. As the HMM uses the modified evaluation algorithm from section 6.2.1, there is no need to train the transition matrix, and training concentrates on the estimation of the emission probability distributions. The creation of the observation probability distribution for each state is simple, easily extensible, and avoids the accurate initialization needed for the Baum-Welch algorithm.

Representation of Discrete Observations

The emission probability distributions of the hand posture information received from the sign lexicons of the bunch graph and the contour matching are discrete and thus can be described by a histogram $H_{hp}$, where hp represents the contour or texture. The sign lexicon entries represent the bins of the corresponding histogram. As mentioned in chapter 5, they are chosen to be as dissimilar as possible. The probability of observing a symbol o equals the entry in the attributed histogram bin:

$$b_j(o) = H_{hp,j}(o), \qquad (6.28)$$

where $H_{hp,j}$ is the normalized histogram of feature hp at state j.

Representation of Continuous Observations

In contrast to the hand posture observations, the position information has a Euclidean metric and the description of similarity or proximity makes sense. Thus, the emission probability distribution of the position is described by a mixture of Gaussians:

$$b_j(o) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}\!\left[o, \mu_{jm}, \Sigma_{jm}\right], \qquad (6.29)$$

where M is the number of Gaussians $\mathcal{N}$, with mean vector µ and covariance matrix Σ, which is constant in this work. The mixture coefficient c applied in the present work is set constant to $c = \frac{1}{M}$ for each Gaussian.

Figure 6.4: Estimation of HMM Parameters. The number of states of the generated HMM is set to the length of the longest training sequence. The observation distribution stored at each state is constructed from the observations which are available at the position in the training sequences that corresponds to the current state.

Training of the HMM

Thus, training of the emission probability distribution for each state j is reduced to either filling the bins of the histograms for the hand posture observations or setting up the mixture of Gaussians. Both types of representation are empty in the beginning. Training is performed as illustrated in fig. 6.4. The number of Gaussian mixtures M or the number of entries in the histogram is equal to the number of observations available for each time step of the observation sequences. As the training sequences are of different length, there will be fewer observations when coming to the last states of the HMM. In the case of the position, each observation at time step j is added as a mean vector µ, while for the hand posture the associated entry in the histogram is increased by one. If an observation is empty, e.g., the hand contour could not be matched with an entry of the sign lexicon, this special observation does not join the training and does not increase M or the number of entries in H, respectively.
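Read as code, this training procedure for a single feature channel of one sign might be sketched as follows; function and variable names are illustrative, and missing observations are marked with None.

```python
import numpy as np

def train_feature_hmm(training_sequences, discrete=True, n_symbols=None):
    """Training of one feature channel of one sign, as described above:
    N_g is the length of the longest training sequence; state j collects the
    observations found at position j of the training sequences. Missing
    observations (None) are simply skipped. Names are illustrative and not
    taken from the thesis implementation."""
    n_states = max(len(seq) for seq in training_sequences)
    states = []
    for j in range(n_states):
        obs = [seq[j] for seq in training_sequences
               if j < len(seq) and seq[j] is not None]
        if discrete:                                   # hand posture: fill histogram bins
            hist = np.zeros(n_symbols)
            for o in obs:
                hist[o] += 1.0
            states.append(hist / hist.sum() if hist.sum() > 0 else hist)
        else:                                          # position: each observation becomes a mean vector
            states.append([np.asarray(o, dtype=float) for o in obs])
    return states
```

For a discrete channel the normalized histogram directly realizes eq. (6.28); for the position channel the collected mean vectors enter the Gaussian mixture of eq. (6.29) with the fixed covariance and c = 1/M.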

62 Recognition 57 Figure 6.5: Recognition Agent The recognition agent is hierarchically organized. At the bottom are the HMM Sensor modules, six of them for each sign. They collect the information coming from the other agents and calculate their observation probability. The HMM Sensor results are merged in the Gesture HMM Integration modules, of which there is one for each learned sign. The Decision Center decides about the most probable sign. The advantage of this solution to the estimation problem is that it can be easily extended, a starting model is not needed and it can be trained with only little training data (9 sequences compared to the 30 sequences used in Vogler and Metaxas (1999) or 25 training sequences in Shet et al. (2004)). 6.3 The HMM Recognition Agent The administration of the sign language recognition is hosted in the HMM recognition agent. Starting the recognition cycle with collecting the observations, sign language recognition is done from bottom to top by running through the hierarchically ordered three layers of the HMM recognition agent (see fig. 6.5 and algorithm 8). In terms of the notation of the HMM recognition agent, each learned sign is represented as a Gesture HMM Integration module and its attached HMM Sensors. The HMM Sensors denote the trained feature HMMs which calculate their current observation probability relative to the incoming data. Their

63 Recognition 58 Figure 6.6: Information processing Information processing used for single sign recognition. The rectangles on the left side refer to the HMM Sensors, collecting input data. The information integration for one sign is placed in Gesture HMM Integration and runs through two stages. In the first stage, the information for each hand is analyzed. In the second stage the information from both hands are integrated to give a rating for the recognition of the gesture. results are passed to the corresponding Gesture HMM Integration module, which fuses the information in order to compute the similarity of the incoming observation sequence to the stored sign. While the HMM Sensors are collecting the observations supplied by the other agents, the Gesture HMM Integration module performs the information integration based only on the information provided by the HMM Sensors. In order to compute the similarity of the accommodated sign to the observed image sequence the Gesture HMM Integration module performs two integration steps as depicted in fig In the first integration step, all the HMM Sensor information connected to the same hand is fused. While, the information received for each hand is merged in the second step and the similarity of both hands information of the current frame is computed. This value is added to the overall similarity which has been aquired by the observations of the previous frames. Finally, in the Decision Center module of layer three, the results of the attached Gesture HMM Integration modules are analyzed in order to determine which sign is the most probable.

64 Recognition Layer One: HMM Sensor The HMM Sensors provide the framework for the computation of the weighted observation probability w(u) b j (o t ) using the algorithm 6 presented in section In order to perform the recognition, each HMM Sensor embeds an HMM. The threshold T r introduced in section is the same for each HMM Sensor. While the HMM stores the learned data and is applied for the recognition, T r determines if an observation can be recognized in respect of the current HMM state. Six HMM Sensors, as illustrated in fig. 6.6, form a group that constitutes the features used for the recognition of a sign. If the observed sequence is similar to the stored data, the encapsulated HMM, starting in its first state S 1, will proceed to the following states S 2,... during the recognition process described in section This is an indicator for the progress of the recognition. The HMM Sensors are not connected to each other. The only restriction is that the lowest state in the sign group is at most four states behind the leading, in terms of highest state, HMM of the HMM Sensor group. Besides this restriction the sensors are running independently and can be in different states. Due to the nature of the observation, the HMM Sensors differ in the probability distributions of their embedded HMM; these are a continuous distribution for the position, that has been realized by using Gaussian mixtures, and a discrete one, which represents the histogram for contour or texture information. As a Euclidean distance metric can be used for the position observations, the continuous distribution offers a more flexible way to evaluate the observation. In contrast to the position, the discrete feature space does not allow the concept of similarity and distance cannot be assumed to be Euclidean. However, only the resulting probability, independent of the engaged observation distribution, and the current state of the sensor are passed to the upper Gesture HMM Integration module Layer Two: Gesture HMM Integration Each learned sign g is represented by a Gesture HMM Integration module in layer two. The task of a Gesture HMM Integration module is to merge the information of its attached HMM Sensors to compute the quality κ g eq. (6.32) of the sign g matching the partial observation sequence. To get rid of possible multiplications with small numbers causing numerical problems when estimating α t+1 (j) using equation eq. (6.25), it is common

to work with the logarithmic values (log) of the probabilities sent by the sensors. Hence, the applied forward variable is expressed by:

$$\alpha_{t+1}(j) = \alpha_t(i) + \log\!\left(w_t(u)\, b_j(o_{t+1})\right). \qquad (6.30)$$

The logarithm is a strictly monotonic mapping, changing the multiplication of the probabilities into an addition of the received log values. Thus, for every frame the Gesture HMM Integration receives the log probability l of the left hand's (lh) position $l_{lh}(y)$, contour $l_{lh}(c)$ and texture $l_{lh}(\tau)$, and of the right hand $l_{rh}(y)$, $l_{rh}(c)$ and $l_{rh}(\tau)$. The weighted probabilities in eq. (6.30) are in the interval [0, 1]. Hence, the resulting log values are all negative or zero. They decrease very quickly for lower probabilities and become $-\infty$ for zero probability. Thus, high probability values will map to negative log values close to zero, while low probability values will map to lower negative values with a strong increase of their absolute value. The calculation of the current quality $\kappa_a$ for the present frame is done by passing the two integration stages illustrated in fig. 6.6. Integrating the quality for the left and right hand separately in the first stage is done using the concept of strong and weak features.

Missing Observations

Due to the continuous nature of its observation distribution, the position feature is chosen to be the strong cue. In case of a missing observation, e.g., the tracking agent failed to track the hand, the use of the Euclidean distance allows taking the last known position as the current observation. The missing data problem is more critical for the recognition of the static hand posture. As mentioned in section 5.2, recognition might not be stable on every frame, especially when the hand is moving. Thus, contour and texture information are called weak cues and are integrated in the recognition process by using the rewarding function ϱ of eq. (6.31). Hence, correct contour and texture information reward the recognition, while missing data or falsely classified hand postures do not disturb the corresponding Gesture HMM Integration module.

Rewarding Function

The rewarding function ϱ rewards only if position and contour or position and texture information of the corresponding hand are correlated. In this context, correlation does not necessarily mean that the HMMs have to be in the same state i. Instead, for each hand the corresponding l(c) and l(y) or l(τ) and l(y)

both just have to be above a correlation threshold θ. This means that both observations would match within the context of the performed sign. The reward is linked to the probability for the hand posture recognition, l(c) or l(τ), which are treated in the same way and are therefore denoted by the variable x in the following eq. (6.31). Both parameters, the hand posture probability x and the threshold θ, are negative or zero. The rewarding function is designed to give a higher reward if the hand posture is more likely in the sense of fitting the current observation sequence. Generally, the rewarding function can be written as:

$$\varrho(x) = (x - \theta)\, H(x - \theta), \quad H: \text{Heaviside step function}. \qquad (6.31)$$

Fig. 6.7 shows a plot of ϱ as applied in the present work.

Figure 6.7: Rewarding Function ϱ. [Plot of the reward ϱ(x) = (x − θ)H(x − θ) over the input values x of ϱ.] The most important parameter of the rewarding function is the correlation threshold θ. Only if the quality of the hand position is above this value is a reward given for the learned sign performed with the corresponding hand posture. The more probable the hand posture information (due to the applied log it will be close to zero), the higher will be the reward, which is given by the absolute value of the distance between threshold and input value x.
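As a compact illustration of eq. (6.31) and of the per-hand integration spelled out in algorithm 7 below, a sketch in Python might read as follows; variable names are illustrative.

```python
def reward(x, theta):
    """Rewarding function of eq. (6.31): rho(x) = (x - theta) * H(x - theta);
    x and theta are log probabilities and therefore non-positive."""
    return x - theta if x > theta else 0.0

def hand_quality(l_y, l_c, l_tau, theta):
    """Current per-hand quality kappa_a (cf. algorithm 7): the strong position
    cue l_y is the baseline; matching contour/texture cues add a reward."""
    kappa_a = l_y
    if l_y > theta and l_tau > theta:      # texture reward
        kappa_a += reward(l_tau, theta)
    if l_y > theta and l_c > theta:        # contour reward
        kappa_a += reward(l_c, theta)
    return kappa_a

# second integration stage, eq. (6.32), with the hand weights given in the text:
# kappa_g += 0.3 * hand_quality(...)   # left hand
# kappa_g += 0.7 * hand_quality(...)   # right (dominant) hand
```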

Algorithm 7: The κ_a for each hand is based on the position information and is rewarded by matching hand posture observations.
Result: Current quality κ_a of left and right hand.
foreach hand do
    /* compute current quality */
    κ_a = l(y)
    /* check reward of texture information */
    if l(y) > θ and l(τ) > θ then
        κ_a += ϱ(l(τ))   /* add reward */
    end
    /* check reward of contour information */
    if l(y) > θ and l(c) > θ then
        κ_a += ϱ(l(c))   /* add reward */
    end
end

Integration of Strong and Weak Features

The term in front of the Heaviside function H determines the amount of the reward which will be added to κ_a, as specified in algorithm 7. Due to the applied Heaviside function, the result of the rewarding function is zero or positive. Thus, in case of a missing or falsely classified observation the reward changes nothing, but it increases the κ_a if the hand posture stored in the present state of the HMM Sensor matches the learned sign at the current frame. At the beginning of the recognition process or after a reset (see below), the overall quality κ_g for the single sign g is initialized with zero. After computing κ_a of the current frame, the quality κ_g is updated by adding the κ_a of each hand. Without the hand posture information, using only the position information, the κ_g would continuously decrease with increasing sign length, as illustrated in fig. 6.8 (a). By introducing ϱ of eq. (6.31), the κ_a and thus κ_g can become positive. Therefore, κ_g cannot be transferred back into a probability again. The resulting belief in the sign represented by the Gesture HMM Integration module is computed by a weighted addition of the current qualities κ_lh, κ_rh received for each hand:

κ_g += w_lh κ_lh + w_rh κ_rh.  (6.32)

68 Recognition 63 Focusing on the dominant hand by setting w rh = 0.7 and w lh = 0.3, the recognition is run without further information whether the presented sign is performed by one or both hands. States of the Gesture HMM Integration Modules Each Gesture HMM Integration module has two states, active or inactive. In the active state, the module is certain that it could match the sign and by proceeding to higher states of the HMM Sensor, the recognition is continuously following the coming observations. The increase of the HMM Sensor states is a cue for the similarity of the learned sign to the performed sign. A Gesture HMM Integration module and therefore its corresponding sign becomes active if the κ a of the first state 5 is above the activation threshold ξ start. Otherwise the Gesture HMM Integration module is inactive. This means that all the connected HMM Sensors are set to their initial state and all the parameters like the uncertainty u of each sensor and the κ g are reset to zero. An active Gesture HMM Integration module becomes inactive if its κ g drops below the inactivation threshold ξ stop. A graphical overview of the behavior of the Gesture HMM Integration modules is given in fig Both plots where taken form the same recognition experiment described in section 7.2 where the system was tested to recognize the sign about. While fig. 6.8 (a) shows the development of the corresponding Gesture HMM Integration module, trained on the data of the about sign, fig. 6.8 (b) depicts the behavior of a module where learned data (the sign fast ) and thus the presented observation sequence does not match. As shown in fig. 6.8 (a) for the matching sign module, without the hand posture information adding only the l(y), the κ g decreases with increasing sign length. Depending on the matching position observation the sign would run in its inactive state and would be reset for the further recognition. By introducing ϱ of eq. (6.31) the κ a and thus κ g can become positive and helps the Gesture HMM Integration module to stay active. ξ start and ξ stop have global values and allow the system to reset a sign module autonomously in order to restart the recognition during the performance of the sign. With a view to a continuous stream of data, the active/inactive mode was developed to handle the problem of co-articulation (the frames between two gestures) and the case where the first frames for one or more signs is similar and only the following frames will decide which sign is performed. Thus, as the sequence is followed, only the most likely sign will stay active until the final decision about the most probable sign is made by the 5 According to the applied recognition algorithm, the observation is not matched with the first state the observation can be passed to the second state, see section

The Decision Center compares the results of the Gesture HMM Integration modules and determines which sign is the most probable so far.

(a) Gesture Module ABOUT, presented sign about — (b) Gesture Module FAST, presented sign about (quality values κ_g, position-only quality, rewards ϱ(c) and ϱ(τ), and thresholds ξ_start, ξ_stop plotted over frames)

Figure 6.8: Recognition Modules. The plots illustrate the behavior of the Gesture HMM Integration modules during the recognition of the sign about. For this purpose each figure plots the thresholds for activating and inactivating the modules and the curve which shows the overall quality κ_g as well as the rewards given by the attached weak cues. In order to demonstrate the advantage of the reward, the recognition quality based only on the position information is plotted. As can be seen for the about module in fig. 6.8 (a), this line constantly decreases and therefore would run the Gesture HMM Integration module into its inactive mode. The rewards allow the module to stay active and thus recognize the sign. Fig. 6.8 (b) shows that parts of the about sign are quite similar to the beginning of the stored fast sign. Thus, it is important to consider the introduced confidence value to discard these events.

Layer Three: Decision Center

Only active Gesture HMM Integration modules receive the attention of the Decision Center in layer three. The autonomy of the Gesture HMM Integration modules prevents the Decision Center from using eq. (6.8) to declare the sign g with the highest current value of κ_g as the recognized sign: the Decision Center would compare sign modules that are already running with sign modules that have just started. Thus, recognition is coupled to the progress of the HMM Sensor by means of a confidence value ζ_g, which is individual for each sign g. It is computed as the ratio of the current state of the sensor to the maximal number of states N of the HMM and is a measure of certainty. Therefore, only signs which are above the confidence threshold ζ_g,min are handled by the Decision Center. This minimal confidence ζ_g,min can be different for each sign and is computed as the ratio of its shortest to its longest sequence in the training set. Out of the signs that reached their ζ_g,min, the Decision Center chooses the one with the highest κ_g to be the most probable sign representing the observation sequence so far. This method favors short signs that only need a small number of recognized frames to reach their ζ_g,min. Therefore, competition is introduced in the Decision Center: based on the sign module with the highest κ_g, all Gesture HMM Integration modules are inhibited by subtracting this maximal κ_g. Thus, short signs become inactive before they reach the needed confidence value. If a sign reaches a confidence value of one, a reset signal is sent to all connected Gesture HMM Integration modules and the recognition of the current sign is completed.
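A minimal sketch of this winner selection and the subsequent inhibition is given below; the module attributes (zeta, zeta_min, kappa_g, active, reset) are assumed to be provided by objects such as the GestureModule sketched earlier and are not part of the actual implementation.

    def decision_center(modules):
        """Sketch of one Decision Center step: choose the most probable sign
        among confident modules and inhibit the competitors."""
        active = [m for m in modules if m.active]
        confident = [m for m in active if m.zeta >= m.zeta_min]
        winner = max(confident, key=lambda m: m.kappa_g) if confident else None
        if winner is not None and winner.zeta >= 1.0:
            # end of the sign reached: reset all connected modules
            for m in modules:
                m.reset()
        elif active:
            # competition: subtract the maximal quality so that short signs
            # cannot win before reaching their confidence threshold
            kappa_max = max(m.kappa_g for m in active)
            for m in active:
                m.kappa_g -= kappa_max
        return winner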

Algorithm 8: The recognition is hierarchically organized into three layers. The characteristic of each layer is its way of information integration. Layer one, the HMM Sensor, analyzes the received observation with its observation probability function. Layer two comprises the HMM Integration module of each learned gesture and integrates the information received from layer one. The top layer compares the results from the HMM Integration modules. The Decision Center determines the most probable sign and manages the inhibition.

    while not at end of gesture sequence do
        /* Layer one: HMM Sensor */
        foreach HMM Sensor do
            calculate observation probabilities
        end
        /* Layer two: HMM Integration modules */
        foreach HMM Integration module do
            compute ϱ to fuse the information of position and contour
            calculate the current quality κ_a
            update the overall quality κ_g
            control the activation using κ_g, ξ_start and ξ_stop
        end
        /* Layer three: Decision Center */
        if an HMM Integration module has reached its ζ_g,min then
            choose the HMM Integration module with the highest κ_g as current winner
        end
        if ζ_winner == 1 then
            reset all HMM Integration modules
        else
            /* inhibit all gestures */
            search for the maximal quality κ_max
            foreach HMM Integration module do
                subtract κ_max from κ_g
            end
        end
    end
    Result: the last winner is chosen as the recognized gesture.
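For orientation, the per-frame loop of algorithm 8 can be written as a compact sketch that ties the three layers together; the sensor and module objects and their method names (calculate_observation_probabilities, step_frame) are hypothetical placeholders, and the decision_center function is the one sketched above.

    def recognize(frames, sensors, modules):
        """Sketch of the per-frame recognition loop of algorithm 8."""
        winner = None
        for observation in frames:
            # layer one: every HMM Sensor evaluates the observation
            for s in sensors:
                s.calculate_observation_probabilities(observation)
            # layer two: every module fuses its sensors, updates kappa_a and
            # kappa_g and controls its activation
            for m in modules:
                m.step_frame()
            # layer three: pick the current winner, inhibit the competitors,
            # reset everything once a sign is completed
            current = decision_center(modules)
            if current is not None:
                winner = current
        return winner   # the last winner is chosen as the recognized gesture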

Chapter 7

Experiments and Results

The presented MAS recognition system was tested for its capabilities in signer dependent (section 7.2) and signer independent (section 7.3) sign language recognition. The aim of the experiments was to evaluate how the teamwork and cooperation within the MAS (handling object tracking and hand posture recognition) solves the problems immanent in sign language recognition. The information fusion based on the concept of weak and strong cues presented in section is of special interest.

One crucial problem of any sign language (or speech) recognition system is to detect the start and the end of the observed sign sequence. Continuous sign language recognition systems like Vogler and Metaxas (2001) solve the problem by modeling the movement between two consecutive signs. However, the authors remark that this enlarges the number of included HMMs considerably. This number can be reduced by using filler model HMMs as presented in Eickeler and Rigoll (1998); Eickeler et al. (1998). An alternative is presented in Starner et al. (1998) and Hienz et al. (1999). These authors put the sign sentence under the constraints of a known grammar. Thus, it is treated as one long sign, where the hand's resting positions mark the start and the end of the sentence. The recognition system in the present work solves this segmentation problem by introducing a high degree of autonomy to each sign module¹. Every sign module implemented in the recognition agent is designed to detect the start of the sign by screening the input stream in order to find a partial observation sequence that fits the data sequence stored in its first state. As described in section 6.3, the corresponding Gesture HMM Integration module becomes active if its quality κ_a is above the activation threshold ξ_start.

¹ The Gesture HMM Integration module and its HMM Sensors.

[Figure 7.1, frames (a)–(f)]

Figure 7.1: End of Sign Detection. This sequence of images shows the end of the sign fantastic, including some following frames. According to the provided ground truth information, the sign ends with frame

The detection of the last frame of the sign in the presented sequence is even more difficult. While the start of a sign performance is very similar in terms of starting position and hand posture, the different lengths of the individual performances make the precise decision of the last frame very difficult. Fig. 7.1 demonstrates the problem. Even for the human eye it is complicated to determine the turning point which marks the end of the sign and the transition to the next sign. In order to determine the end of the sign, the system uses the confidence value ζ_g introduced in section 6.2. As the Gesture HMM Integration module runs through the states of the HMM Sensors, the confidence value increases. Hence, the end of the sign can be determined once the confidence value is 1. However, a confidence value of 1 can only be reached if the presented sequence is as long as the longest sequence which was used to build the HMM Sensor. Thus, a confidence threshold ζ_g,min was introduced for each sign module. It is computed as the ratio of the shortest to the longest sequence in the corresponding training set.

The strength of the recognition system presented in this work is that it determines the most probable sign at any time during the performance of the sign, provided that the sign has already reached a certain degree of confidence, i.e., the confidence value is above the sign-specific threshold mentioned above. Hence, in the following experiments the sign which is most probable at the end of the presented sequence is defined as the recognized sign. It is worth noting that the system does not know the end of the sequence.

Both experiments, the signer dependent and the signer independent, work on signs of the British Sign Language (BSL). Sign language data was kindly
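As a small illustration of how these confidence quantities could be derived, the following sketch assumes the training sequences of a sign are available as plain lists of frames; the function names are hypothetical.

    def confidence_threshold(training_sequences):
        """zeta_g,min: ratio of the shortest to the longest training sequence."""
        lengths = [len(seq) for seq in training_sequences]
        return min(lengths) / max(lengths)

    def confidence(current_state, num_states):
        """zeta_g: progress of the HMM Sensor through its N states."""
        return current_state / num_states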

provided by Richard Bowden from the University of Surrey. The same data has been used to train the HMM Recognition agent as well as the applied agents for hand posture recognition. Some of the signs have already been used in recognition experiments published by Bowden et al. (2004) and Kadir et al. (2004). All experiments run with the same set of parameters: there is no individual tuning by changing the thresholds listed in tab. 7.1.

    Parameter   Value   Description
    T_r         0.01    Declare observation as known to the current state of the HMM
    θ           -5      Correlation threshold
    ξ_start     -1      Activate a sign module
    ξ_stop      -11     Reset a sign module

Table 7.1: HMM Recognition Agent Parameters. Parameters which are applied by the HMM Recognition Agent. These parameters are the same in each of the conducted experiments.

7.1 BSL Data

The BSL database consists of a continuous movie with a separate file containing ground truth information about the first and last frame of the recorded sign. Overall, the movie shows 91 different signs, each repeated 10 times. The signs were performed by one professional signer. This makes a total of 29,219 images that have to be processed. The number of frames per sign ranges from 11 to 81. Even within the sign repetitions, the sequence length shows differences of approximately 50 percent, e.g., the length of the sign live ranges from 18 to 41 frames. All 91 signs and some additional information about the temporal variations between the different signs and the variations within the repetitions of each sign are listed in appendix C. Fig. 7.2 illustrates the trajectories of two example sign sequences.

The signer is wearing a red shirt and colored gloves, a yellow glove on the right hand and a blue glove on the left hand. Thus, color segmentation is applied to extract the position of the hand (the center of gravity), the texture and the

contour during the training of the recognition agents. This information, in combination with the bunch graph face detection introduced in section, is applied to start the automated learning from examples, which builds up the hand posture sign lexicons for the bunch graph matching (section 5.2.1) and contour matching (section 5.2.2), respectively. These sign lexicons serve to generate the HMM for each sign as explained in section.

(a) different — (b) bat

Figure 7.2: BSL Trajectories. The images show the starting position of the signs different and bat, including the different trajectories of ten repetitions by the same signer.

7.2 Signer Dependent Experiments

In addition to the recognition experiments in Bowden et al. (2004) and Kadir et al. (2004), further investigations like testing the ability of the presented system to handle the effect of co-articulation and the rejection of unknown sequences were undertaken. Hence, three main signer dependent experiments were set up. The first two experiments investigate the mean recognition rate. It is computed knowing the start of the sign in the first and with an unknown start in the second experiment. In the third experiment the recognition system was confronted with an unknown sign and was tested on its rejection capability. The result is given by the positive rejection rate.

(a) Input image — (b) Color sensor detection

Figure 7.3: Attention Agent. In order to demonstrate the portability to experiments where the sign is performed with ungloved hands, the color sensor of the attention agent merges the computed color similarities of the blue, yellow and skin color values (a) into one overall map (b). The image on the right shows the color information map, which is used for the processing described in chapter 4.

Both recognition experiments were performed using a leave-one-out procedure, which means that for the tested sign all sequences excluding the one which is tested were used to build the HMMs as described in section. Thus, ten recognition experiments per sign have been carried out. In contrast to the recognition experiments, all ten repetitions were used to generate the HMMs during the evaluation of the positive rejection rate in the last experiment.

The integration scheme of coupling strong and weak features has been tested by running each recognition experiment three times. In order to estimate the value of each feature, the first run integrates only position and contour information for the sign language recognition. In the second run the combination of position and texture information is applied. Finally, all features are integrated to evaluate the advantage of the multi-cue information integration. During the detection of the ROIs, the attention agent's color sensor fuses the color information of the blue, yellow and skin color blobs as illustrated in fig. 7.3 to demonstrate the transferability to experiments where the sign is performed with ungloved hands.
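The leave-one-out procedure described above can be sketched as follows, assuming the sequences are available as a mapping from sign name to its recorded repetitions; build_hmm and recognize are placeholders for the training and the per-sequence recognition described earlier.

    def leave_one_out_rate(data, build_hmm, recognize):
        """data: dict mapping sign name -> list of its recorded sequences."""
        hits = trials = 0
        for sign, sequences in data.items():
            for i, test_seq in enumerate(sequences):
                # the tested sequence is excluded from the model of its own sign
                training = {s: seqs for s, seqs in data.items()}
                training[sign] = sequences[:i] + sequences[i + 1:]
                models = {s: build_hmm(seqs) for s, seqs in training.items()}
                if recognize(models, test_seq) == sign:
                    hits += 1
                trials += 1
        return 100.0 * hits / trials    # mean recognition rate in percent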

    Experiment       Mean Recognition Rate in %
                     Contour   Texture   Combined
    Known Start       83.08     86.81     91.43
    Unknown Start     78.02     84.61     87.03

Table 7.2: Mean Recognition Rates. As shown, the combination of both weak features with the position as the strong feature achieves the highest recognition rates in both experiments. The recognition rates decrease for the unknown start experiment, but only by approximately 5% for the combination run.

Recognition Experiments

Both recognition experiments differ only with respect to the start of the sign performance. The attained recognition rates are presented in tab. 7.2. In the case of a known start of the sign, a mean recognition rate of 83.08% was achieved by integrating the position and contour information. Although much slower in terms of processing speed, a better result of 86.81% was gained using the texture information instead of the contour information. As expected, the best mean recognition rate of 91.43% was achieved by the combination of all three cues.

In order to simulate the co-articulation, the recognition system started 10 frames before the ground truth setting. Having only changed the start of the input sequence, the experiment showed the same trend of the computed mean recognition rates as described above. Using only position and contour, the mean recognition rate was 78.02%. Again, the bunch graph in combination with the position showed a higher mean recognition rate of 84.61%. As in the experiments carried out with known start, the integration of the two weak cues gained the best result in terms of recognition rates; in this case a mean recognition rate of 87.03% was obtained. Comparing both experiments, tab. 7.2 shows that there is just a 5% difference between a given and a self-controlled start of the recognition.

The distribution of the recognition rates is shown in fig. 7.4 (a) for the given start of the sign and fig. 7.4 (b) for the co-articulation experiment. They are presented in histogram style where the number of recognized signs is the entry in the recognition rate bins. For example, as shown in fig. 7.4 (a)

[Figure 7.4: two histograms, (a) Known Start of Signing and (b) Unknown Start of Signing, plotting the number of signs over the recognition rates in % for the combined, texture and contour runs.]

Figure 7.4: BSL Recognition Rates. Both histograms show the number of recognized signs as the entry in the recognition rate bins. Especially the distribution plotted in fig. 7.4 (b) demonstrates the advantage of fusing both cues, contour and texture.

for the combined case, 21 signs had been recognized with a recognition rate of 90%. Taking the first set of experiments, a mean recognition rate of 90% and higher could be achieved for over 84% of the presented signs. The signs with lower recognition rates of 0 to 10% were dominated by very similar signs with a shorter sequence length.

The behavior of the recognition system is illustrated by discussing two results of the recognition experiments. Tab. 7.3 and tab. 7.4 show the results² of the recognition experiments for the signs computer, excited interested and live. The sign computer demonstrates the advantage of the multi-cue integration. While the sign is not recognized using only the position and contour information, using texture instead provides a mean recognition rate of 60% for the known start experiment and 40% for the unknown start experiment, respectively. Combining both weak cues enhances the mean recognition rate to 50% for the unknown start experiment, while the result of 60% stays the same for the known start experiment. Therefore, the applied information integration proved to handle the combination of two information streams and shows that the result can even be better than taking the best of the two single cues.

However, a disadvantage of the recognition system is shown by the results on the excited interested sign. This sign is dominated by the shorter live sign, which is very similar, as depicted in fig. 7.5. The difference of the trajectories is small compared to the inter-sign variations that can occur in other signs like the different and bat gesture trajectories given in fig. 7.2. This misclassification trap is caused by the self-organizing property of the system. All known signs are in a loop and are thus waiting to become active by passing the activation threshold ξ_start. Therefore, as seen above, signs with a low recognition rate are likely to be dominated by similar shorter signs. This effect can be decreased by introducing grammar or other non-manual observations like facial expressions to future sign language recognition systems.

² The complete list of recognition failures is presented in appendix D.

    Known Start (Contour)
    Gesture               Recognized Signs and Number of Hits
    computer              best 7, father 1, how 2
    excited interested    ill 2, excited interested 2, cheque 1, live 5
    live                  live 10

    Known Start (Texture)
    Gesture               Recognized Signs and Number of Hits
    computer              difficult 1, best 2, computer 6, father 1
    excited interested    excited interested 1, cheque 1, live 6, dog 2
    live                  best 1, live 9

    Known Start (Combined)
    Gesture               Recognized Signs and Number of Hits
    computer              difficult 1, best 1, in 1, computer 6, father 1
    excited interested    ill 2, excited interested 2, cheque 1, live 5
    live                  live 10

Table 7.3: Recognition Results Known Start. This table displays a selection of the recognition results for the experiments with known start. The input signs are listed in the left column and the corresponding recognition results in the right column. The Number of Hits entry encodes how often the sign has been recognized.

    Unknown Start (Contour)
    Gesture               Recognized Signs and Number of Hits
    computer              best 8, father 1, how 1
    excited interested    excited interested 2, cheque 1, live 6, angry 1
    live                  angry 1, live 9

    Unknown Start (Texture)
    Gesture               Recognized Signs and Number of Hits
    computer              later 1, difficult 2, best 3, computer 4
    excited interested    cheque 1, live 6, angry 1, dog 2
    live                  angry 1, live 9

    Unknown Start (Combined)
    Gesture               Recognized Signs and Number of Hits
    computer              later 1, difficult 1, ill 1, in 1, computer 5, father 1
    excited interested    ill 1, excited interested 3, in 1, cheque 1, live 3, angry 1
    live                  angry 1, live 9

Table 7.4: Recognition Results Unknown Start. This table displays a selection of the recognition results for the experiments with unknown start. The input signs are listed in the left column and the corresponding recognition results in the right column. The Number of Hits entry encodes how often the sign has been recognized.

(a) excited interested — (b) live

Figure 7.5: Similar Signs. The trajectories and static hand gestures of the signs excited interested (left) and live (right) are very similar. Therefore, the shorter sign live dominates the recognition of the excited interested performance. The integration of non-manual observations like a grammar or facial expressions should help to differentiate between similar signs.

Sign Rejection

The common HMM recognizer determines the model with the best likelihood to be the recognized sign. However, it is not guaranteed that the pattern is really similar to the reference sign unless the likelihood value is high enough. A simple threshold for the likelihood is difficult to realize and becomes even more complicated in the case of sign language recognition, where all signs differ in their length. Instead of a fixed value, Lee and Kim (1999) introduce a threshold HMM which provides the needed threshold. Their threshold HMM is a model for all trained gestures, and thus its likelihood is smaller than that of the dedicated gesture model. The authors demonstrate their model on a sign lexicon of 10 gestures. Applying the threshold model to a database of 91 signs causes several problems when generating the HMM. The first problem concerns the distribution of the stored observations. When merging the emission distributions of every sign, the resulting observation distribution of the threshold HMM might fill the whole observation space. The same holds true for the distribution of the transition probabilities. Thus, the threshold model might lose its

characteristic of producing a higher likelihood if the sign is not known. Or, even worse, it shows a higher probability and thus better explains the observation sequence. Differing from the threshold model, the filler model presented in Eickeler and Rigoll (1998) and Eickeler et al. (1998) is trained on arbitrary and other garbage movements. Their rejection experiments are performed on a set of 13 different gestures and thus pose the same problems as mentioned above.

The presented recognition system handles the sign rejection through the autonomy of the applied sign modules. If the presented image sequence is not known to the recognition system, none of the Gesture HMM Integration modules becomes active or is able to reach a confidence value which is high enough to confirm a recognized sign during the recognition process. Thus, the positive rejection rate represents the ability of the recognition system to reject a sign that is not included in its learned sign memory. The mean positive rejection rate was computed on the 910 experiments with known start, using all sign modules except the one of the running sign. The recognition system achieved a mean positive rejection rate of 27.8%. The false acceptance of signs can be explained by the missing competition and thus missing suppression in the Decision Center: as the corresponding sign is not included, similar signs are able to reach their minimal confidence value and are declared as the recognized gesture. However, the presented recognition system does not depend on any filler or threshold HMM and thus allows adding or deleting signs without further processing of existing data.

Discussion

The experiments conducted in the present work show that the MAS is a reliable recognition system for signer dependent sign language recognition. The presented information integration of strong and weak features proved to work well. Although the texture feature performed better than the contour, the combination of both features improves the recognition. All experiments were run with the same set of parameters; there was no individual tuning by changing the used thresholds.

When comparing the results of the recognition system with other systems, it should be considered that, in contrast to speech recognition, there is no standardized benchmark for sign language recognition. Thus, recognition rates cannot be compared directly (Kraiss, 2006). Nevertheless, there are impressive results by Zieren and Kraiss (2005) of 99.3% on a database of

isolated signs and von Agris et al. (2006) with 97.9% for a database of 153 isolated signs. As the data processed in the present work is partly included in the work of Bowden et al. (2004) and Kadir et al. (2004), their results are more comparable with the results of the present work than the ones mentioned above. Bowden et al. (2004) achieved a mean recognition rate of 97.67% for a sign lexicon of 43 signs, while Kadir et al. (2004) achieved a recognition rate of 92% for a lexicon of 164 words. The comparable recognition rate achieved in the present work is 91.43% on 91 signs and is therefore in line with the other systems. However, the strength of the MAS is the autonomy of the recognition process, which handles the effect of co-articulation and the rejection of unknown signs. Neither experiment has been addressed in the works mentioned above.

7.3 Different Signer Experiments

Analysis of the hand motion reveals that the variations between different signers are larger than within one signer (von Agris et al., 2006). The authors also confirm that other manual features such as hand posture and location exhibit analogous variability. Besides the problems which have to be solved for signer dependent sign language recognition, like temporal and spatial variations in the performance of the sign (which are further enlarged by the individual performance of each signer), a signer independent system has to deal with the different physique of the signers as well.

Two approaches can be applied to solve these problems. The first approach is to build the recognition on features that are described so coarsely that the differences mentioned above are too small to disturb the description. This approach is used in Eickeler and Rigoll (1998), where signer independence is achieved by using general features computed from difference (pixel change) images of the whole body region. Their recognition system proved to work with 13 different gestures. One problem resulting from the coarse feature description is the small number of gestures that can be distinguished by the recognition system. The other approach concerns signer adaptation methods as presented by von Agris et al. (2006). The authors extract geometric and dynamic features and adapt methods taken from speech recognition research. In their experiments, three signers are used for training and one signer for testing. The system was tested on 153 isolated signs. Using supervised adaptation with 80 adaptation sequences, their system reaches an accuracy of 78.6%.

Signer independence in the present work is investigated by running the recognition experiment with a known start on a subset of the signs mentioned above. The MAS recognition system is not changed; it is the same as for the signer dependent experiments, except for the color sensor of the attention agent, which is set to skin color. Hence, the system has been trained on data where the signer was wearing gloves, and the applied recognition agents stored contours and bunch graphs which were trained on the glove data. Therefore, these experiments not only test the signer independence of the recognition, i.e., the differences in performing the gesture and in body structure, but also indirectly test whether the applied object recognition methods generalize over identity.

The signer independence of the recognition has been investigated on a subset of signs (hello, meet, fast, know) included in two data sets. In the first set the signs were performed by the same signer as in the experiments in section 7.2. Recorded in different sessions, the signs were cut out of a running sentence performed without colored gloves. The signs in the second set were performed by members of the Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany, who performed sign language for the first time.

Object tracking worked well on both data sets, as shown in fig. 7.6 for the first data set and in fig. 7.7 for the second set. However, the recognition system could only recognize the sign meet of the first data set and failed on all other presented sign sequences.

Discussion

Possible reasons for the failure of the signer independent recognition are the following. One major problem is the temporal and spatial variation between the training and the testing set. Temporal, because the recorded signs of the first set, although recorded by the same signer as in the training set, are shorter than the shortest sequence in the training sets and could therefore not reach the required confidence threshold of the recognized sign. The spatial variation means that the position for performing the sign is outside the position variations of the trained position HMM. Another difficulty concerns feature extraction: the classification of the hand postures, which is based on data trained on the BSL database and therefore on colored gloves, is not adequate for signer independence.

[Figure 7.6, frames (a)–(l)]

Figure 7.6: Different Signer Detection difficult. This sequence of images was used as input for the signer independent recognition. As shown in the sequence, the tracking performed well. The problems that occurred concern the recognition of the hand posture and the length of the sequence, which is only half as long as the shortest sign sequence in the training set.

[Figure 7.7, frames (a)–(l)]

Figure 7.7: Different Signer Detection fast. This sequence was performed by an amateur who performed sign language for the first time. Therefore, temporal and spatial variations in the performance as well as the recognition failures of the hand posture did not allow the recognition. However, the tracking performed well in the cluttered environment. The first frame shows the system during its initialization phase; the face and the hands are not yet recognized and thus are marked with yellow boxes.

Chapter 8

Discussion and Outlook

8.1 Discussion

In order to realize the sign language recognition system, a software framework to design and test multi-agent systems has been built. The characteristics of the implemented multi-agent system are autonomous and cooperating units. Principles like divide and conquer, learning from examples and self-control have been applied for object tracking and sign language recognition. Both systems are further divided into smaller subsystems, which are realized as simultaneously running agents. The modular framework makes the recognition system easily extensible: new signs are included by connecting their HMM Sensors to a Gesture HMM Integration module, which is added to the HMM Recognition agent.

The recognition of signs is realized by introducing a modification to the standard HMM architecture. The task of the HMMs is to store feature sequences and their variations. This data is compared with an incoming feature sequence in order to recognize the performed sign. The presented recognition system divides the input features into two types of information. Reliable features show temporal continuity and are more robust under variations between the observed and the learned data, while weaker features are not as robust and therefore fail more often to be recognized from the observations. Both types of information channels are integrated by using a correlation and rewarding scheme. Another innovation is the competition of the learned signs during the recognition process. In addition to satisfactory recognition results, the autonomy of the system allows handling the problem of co-articulation. Although the sign rejection experiments did not show the expected positive rejection rate, the ability of

sign rejection is immanent in the design of the recognition system. Therefore it does not need extra modules like threshold or filler HMMs. Only simple features like the position and hand posture have been applied. The present work does not include a grammar or a high-level description. These would be an interesting challenge for future projects and are discussed in section 8.2.

8.2 Outlook

The presented work can and should be further enhanced by systematically investigating the parameters which control the dynamics and the tracking process as well as the applied thresholds. Although the system is designed to work online, and thus presents the most probable sign for each frame, the recognition processes are too expensive to run in real time. This problem does not hold if only tracking is demanded; in this case the tracking runs in real time. In order to speed up the recognition, the modular architecture allows the agents to be executed simultaneously on different computers. First tests using a CORBA interface have been carried out successfully.

Based on the presented work, sign language research can be continued in the following directions:

Online Learning

A recognition system should be adaptive to sign variations as well as capable of learning new signs. The modular design of the present work favors an easy integration of new sign modules. The most challenging task is to build the whole system from scratch by starting with an empty HMM Recognition agent. In this case, it would be realistic to start the learning with a defined start and end of the observed sign. Further, under the hypothesis of suitable color segmentation, even the hand posture sign lexicons for contour and texture could be built from scratch. Nevertheless, the first step should be to start learning by adding recognized signs to the corresponding sign modules. In a second step, the rejection capability of the system should be enhanced. This allows finding new gestures and then adding them to the HMM Recognition agent. The applied HMM structure allows expanding the observation probability distribution by simply adding a new entry to the histogram for discrete observations or a new Gaussian in the case of continuous observations, respectively. Alternatively, the weights of the existing Gaussian mixtures could be adjusted

to avoid a distribution with too many Gaussians. New states can easily be added to the HMM if the new sign has more frames than the previously learned ones. However, a problem occurs if the new sign is shorter: in this case, the HMM might not reach the needed confidence threshold which is used to declare the probable occurrence of the sign. As the start of the sign is expected to be given, the computed overall quality might be used to determine the similarity of the observation under the condition that the sign module is active.

Integration of Non-Manual Information

The integration of facial expressions is often demanded and important for the full understanding of sign language. However, facial expressions are hard to recognize and should therefore be integrated as a weak feature, giving a reward if the observation matches the expected expression and changing nothing otherwise. Classification of facial expressions is treated in Tewes et al. (2005) and could be imported as a new HMM Sensor using a discrete feature description. The consideration of grammar is an important feature for continuous sign language recognition. In the present work, the grammar would not be used to recognize the whole sentence; instead it would predict the appearance of a sign in the context of the previously seen observations. Like the facial expressions, the grammar could be integrated as a weak cue and thus contribute a reward.

Person Independence

The most challenging task for sign language recognition systems is the generalization over signer identity. The signer independence capability of the present work could be realized by an enhancement of the hand posture recognition as well as the adaptation of the position feature. The trajectories of the hands have to be adjusted to the characteristic behavior of the performing signer. In order to improve the contour matching, the first task would be to enhance the contour extraction by using improved color segmentation. A second approach concerns the process of contour matching, which is described in Horn (2007). The author investigates the advantage of integrating contour and texture information as well as the detection of contours. The improvement of the bunch graph concept in order to allow a more generalized hand posture recognition is more difficult. Triesch and von der Malsburg (2002) showed that generalization can be achieved if the bunch graph stores

the hand postures of multiple persons. As a second extension, the variations of the landmarks could be learned and stored as a special move, as shown for facial expressions in Tewes (2006). However, both enhancements of the bunch graph require more human interaction, at least for the initialization. The adaptation of the position information has to solve several problems, because the position of the trajectory in the signing space might differ not only by a linear shift; the whole performance could also show nonlinear variations in form and space. The approach of collecting more data from different signers to train the HMMs is of limited use: as the variations get too broad, the distribution loses its characteristics and becomes less distinguishable from the other signs. Thus, it seems necessary to find a transformation that adapts the learned position information to the observed position sequence. This solution would require solving a global optimization over the learned signs at runtime. In order to limit the number of applied signs, the system would profit from the introduction of a grammar and the improved hand posture recognition mentioned above.

Appendix A

Democratic Integration

Democratic integration was first introduced in Triesch and von der Malsburg (2001a) and developed for robust face tracking. By using democratic integration, different cues like color, motion, motion prediction and pixel templates are integrated to agree on one result. After this decision each cue adapts towards the result agreed on. In particular, discordant cues are quickly suppressed and re-calibrated, while cues that have been consistent with the result in the recent past are given a higher weight for future decisions.

The information fusion using democratic integration relies on two underlying assumptions. The first one is that the cues have to be statistically dependent, otherwise there is no point in trying to integrate them. The second assumption is that the environment has to exhibit a certain temporal continuity; otherwise, any adaptation would be useless.

In the following, democratic integration is explained in the context of the MAS (chapter 3) and the visual tracking (chapter 4) applied in the present work. Thus, each of the above mentioned cues is represented by a specific sensor, working on the two-dimensional attention region of the agent. Referring to fig. A.1, each sensor i provides a saliency map M_i(x, t) at time t that shows the image similarity at each coordinate x with an agent-specific and adaptable prototype template P_i(t). The default prototype template is extracted from the object when the agent is initialized. The integration is performed in the cue-integrator layer by summing up the weighted saliency maps to an overall map R(x, t),

    R(x, t) = Σ_i r_i(t) M_i(x, t).    (A.1)

The weight r_i(t) is part of the self-control of sensor i and is called reliability. All reliabilities included in the sensors of the agent are normalized and

therefore sum to one: Σ_i r_i(t) = 1.

Figure A.1: Democratic Integration. Tracking agent in use; on the left we see the tracking result marked with the circle. The rectangle shows the border of the agent's search region. On the right we see the similarity maps created by the different sensors, from left to right: color, motion, motion prediction and pixel template. The fusion center shows the result of the information integration.

In order to find the current target position x̂(t), the overall similarity map is scanned for the maximum entry,

    x̂(t) = arg max_x { R(x, t) }.    (A.2)

After the target position x̂(t) is determined, the tracking agent rates its success by analyzing the similarity value R(x̂(t), t) at position x̂(t). If the value is above a threshold, the image point has been found and thus the object successfully tracked. Otherwise, the tracking agent has failed and lost the image point. Depending on the threshold and its similarity value at the target position, the agent is capable of updating the reliabilities and adapting its sensors to the new situation.

Adaptation of Reliabilities

The benefit of each sensor can be evaluated by introducing a quality value q_i(t), which computes how well the sensor i could predict the target position x̂(t), using

    q_i(t) = R( M_i(x̂(t), t) − ⟨M_i(x, t)⟩ ),    (A.3)

where ⟨·⟩ denotes an average over all positions x, and

    R(x) = { 0 : x < 0
           { x : x ≥ 0.    (A.4)
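A minimal sketch of one democratic-integration step following eqs. (A.1)–(A.4) is given below; the saliency maps are assumed to be two-dimensional arrays and the reliabilities a list of weights summing to one. The function names are illustrative, and the subsequent adaptation of the reliabilities towards the normalized qualities is omitted.

    import numpy as np

    def ramp(x):
        """Eq. (A.4): pass positive values, clip negative ones to zero."""
        return np.maximum(x, 0.0)

    def democratic_integration_step(maps, reliabilities):
        """maps: list of saliency maps M_i(x, t); reliabilities: weights r_i(t)."""
        # eq. (A.1): weighted sum of the saliency maps
        R = sum(r * M for r, M in zip(reliabilities, maps))
        # eq. (A.2): the target position is the maximum of the overall map
        x_hat = np.unravel_index(np.argmax(R), R.shape)
        # eq. (A.3): quality of each cue at the chosen position
        qualities = np.array([ramp(M[x_hat] - M.mean()) for M in maps])
        return x_hat, R[x_hat], qualities

The returned overall similarity R(x̂(t), t) can then be compared against the success threshold mentioned above, and the per-cue qualities serve as the basis for adapting the reliabilities.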


More information

Sensory-motor control scheme based on Kohonen Maps and AVITE model

Sensory-motor control scheme based on Kohonen Maps and AVITE model Sensory-motor control scheme based on Kohonen Maps and AVITE model Juan L. Pedreño-Molina, Antonio Guerrero-González, Oscar A. Florez-Giraldo, J. Molina-Vilaplana Technical University of Cartagena Department

More information

A Reliability Point and Kalman Filter-based Vehicle Tracking Technique

A Reliability Point and Kalman Filter-based Vehicle Tracking Technique A Reliability Point and Kalman Filter-based Vehicle Tracing Technique Soo Siang Teoh and Thomas Bräunl Abstract This paper introduces a technique for tracing the movement of vehicles in consecutive video

More information

A Prototype For Eye-Gaze Corrected

A Prototype For Eye-Gaze Corrected A Prototype For Eye-Gaze Corrected Video Chat on Graphics Hardware Maarten Dumont, Steven Maesen, Sammy Rogmans and Philippe Bekaert Introduction Traditional webcam video chat: No eye contact. No extensive

More information

Static Environment Recognition Using Omni-camera from a Moving Vehicle

Static Environment Recognition Using Omni-camera from a Moving Vehicle Static Environment Recognition Using Omni-camera from a Moving Vehicle Teruko Yata, Chuck Thorpe Frank Dellaert The Robotics Institute Carnegie Mellon University Pittsburgh, PA 15213 USA College of Computing

More information

MANAGING QUEUE STABILITY USING ART2 IN ACTIVE QUEUE MANAGEMENT FOR CONGESTION CONTROL

MANAGING QUEUE STABILITY USING ART2 IN ACTIVE QUEUE MANAGEMENT FOR CONGESTION CONTROL MANAGING QUEUE STABILITY USING ART2 IN ACTIVE QUEUE MANAGEMENT FOR CONGESTION CONTROL G. Maria Priscilla 1 and C. P. Sumathi 2 1 S.N.R. Sons College (Autonomous), Coimbatore, India 2 SDNB Vaishnav College

More information

Introduction to Computer Graphics

Introduction to Computer Graphics Introduction to Computer Graphics Torsten Möller TASC 8021 778-782-2215 torsten@sfu.ca www.cs.sfu.ca/~torsten Today What is computer graphics? Contents of this course Syllabus Overview of course topics

More information

Part-Based Recognition

Part-Based Recognition Part-Based Recognition Benedict Brown CS597D, Fall 2003 Princeton University CS 597D, Part-Based Recognition p. 1/32 Introduction Many objects are made up of parts It s presumably easier to identify simple

More information

S. Hartmann, C. Seiler, R. Dörner and P. Grimm

S. Hartmann, C. Seiler, R. Dörner and P. Grimm &DVH6WXG\9LVXDOL]DWLRQRI0HFKDQLFDO3URSHUWLHVDQG 'HIRUPDWLRQVRI/LYLQJ&HOOV S. Hartmann, C. Seiler, R. Dörner and P. Grimm Fraunhofer Anwendungszentrum für Computergraphik in Chemie und Pharmazie Varrentrappstraße

More information

Robot Perception Continued

Robot Perception Continued Robot Perception Continued 1 Visual Perception Visual Odometry Reconstruction Recognition CS 685 11 Range Sensing strategies Active range sensors Ultrasound Laser range sensor Slides adopted from Siegwart

More information

Automated Recording of Lectures using the Microsoft Kinect

Automated Recording of Lectures using the Microsoft Kinect Automated Recording of Lectures using the Microsoft Kinect Daniel Sailer 1, Karin Weiß 2, Manuel Braun 3, Wilhelm Büchner Hochschule Ostendstraße 3 64319 Pfungstadt, Germany 1 info@daniel-sailer.de 2 weisswieschwarz@gmx.net

More information

Visual-based ID Verification by Signature Tracking

Visual-based ID Verification by Signature Tracking Visual-based ID Verification by Signature Tracking Mario E. Munich and Pietro Perona California Institute of Technology www.vision.caltech.edu/mariomu Outline Biometric ID Visual Signature Acquisition

More information

Sign Language in the Intelligent Sensory Environment

Sign Language in the Intelligent Sensory Environment Sign Language in the Intelligent Sensory Environment Ákos Lisztes*, Ákos Antal**, Andor Gaudia*, Péter Korondi* Budapest University of Technology and Economics *Department of Automation and Applied Informatics,

More information

CS231M Project Report - Automated Real-Time Face Tracking and Blending

CS231M Project Report - Automated Real-Time Face Tracking and Blending CS231M Project Report - Automated Real-Time Face Tracking and Blending Steven Lee, slee2010@stanford.edu June 6, 2015 1 Introduction Summary statement: The goal of this project is to create an Android

More information

Limits and Possibilities of Markerless Human Motion Estimation

Limits and Possibilities of Markerless Human Motion Estimation Limits and Possibilities of Markerless Human Motion Estimation Bodo Rosenhahn Universität Hannover Motion Capture Wikipedia (MoCap): Approaches for recording and analyzing Human motions Markerless Motion

More information

THEORETICAL MECHANICS

THEORETICAL MECHANICS PROF. DR. ING. VASILE SZOLGA THEORETICAL MECHANICS LECTURE NOTES AND SAMPLE PROBLEMS PART ONE STATICS OF THE PARTICLE, OF THE RIGID BODY AND OF THE SYSTEMS OF BODIES KINEMATICS OF THE PARTICLE 2010 0 Contents

More information

SoMA. Automated testing system of camera algorithms. Sofica Ltd

SoMA. Automated testing system of camera algorithms. Sofica Ltd SoMA Automated testing system of camera algorithms Sofica Ltd February 2012 2 Table of Contents Automated Testing for Camera Algorithms 3 Camera Algorithms 3 Automated Test 4 Testing 6 API Testing 6 Functional

More information

Analecta Vol. 8, No. 2 ISSN 2064-7964

Analecta Vol. 8, No. 2 ISSN 2064-7964 EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,

More information

Tracking Moving Objects In Video Sequences Yiwei Wang, Robert E. Van Dyck, and John F. Doherty Department of Electrical Engineering The Pennsylvania State University University Park, PA16802 Abstract{Object

More information

Test Coverage Criteria for Autonomous Mobile Systems based on Coloured Petri Nets

Test Coverage Criteria for Autonomous Mobile Systems based on Coloured Petri Nets 9th Symposium on Formal Methods for Automation and Safety in Railway and Automotive Systems Institut für Verkehrssicherheit und Automatisierungstechnik, TU Braunschweig, 2012 FORMS/FORMAT 2012 (http://www.forms-format.de)

More information

Sign Language Phoneme Transcription with Rule-based Hand Trajectory Segmentation

Sign Language Phoneme Transcription with Rule-based Hand Trajectory Segmentation J Sign Process Syst (2010) 59:211 222 DOI 10.1007/s11265-008-0292-5 Sign Language Phoneme Transcription with Rule-based Hand Trajectory Segmentation W. W. Kong Surendra Ranganath Received: 21 May 2008

More information

Off-line Model Simplification for Interactive Rigid Body Dynamics Simulations Satyandra K. Gupta University of Maryland, College Park

Off-line Model Simplification for Interactive Rigid Body Dynamics Simulations Satyandra K. Gupta University of Maryland, College Park NSF GRANT # 0727380 NSF PROGRAM NAME: Engineering Design Off-line Model Simplification for Interactive Rigid Body Dynamics Simulations Satyandra K. Gupta University of Maryland, College Park Atul Thakur

More information

EFFICIENT VEHICLE TRACKING AND CLASSIFICATION FOR AN AUTOMATED TRAFFIC SURVEILLANCE SYSTEM

EFFICIENT VEHICLE TRACKING AND CLASSIFICATION FOR AN AUTOMATED TRAFFIC SURVEILLANCE SYSTEM EFFICIENT VEHICLE TRACKING AND CLASSIFICATION FOR AN AUTOMATED TRAFFIC SURVEILLANCE SYSTEM Amol Ambardekar, Mircea Nicolescu, and George Bebis Department of Computer Science and Engineering University

More information

3D Interactive Information Visualization: Guidelines from experience and analysis of applications

3D Interactive Information Visualization: Guidelines from experience and analysis of applications 3D Interactive Information Visualization: Guidelines from experience and analysis of applications Richard Brath Visible Decisions Inc., 200 Front St. W. #2203, Toronto, Canada, rbrath@vdi.com 1. EXPERT

More information

Talking Head: Synthetic Video Facial Animation in MPEG-4.

Talking Head: Synthetic Video Facial Animation in MPEG-4. Talking Head: Synthetic Video Facial Animation in MPEG-4. A. Fedorov, T. Firsova, V. Kuriakin, E. Martinova, K. Rodyushkin and V. Zhislina Intel Russian Research Center, Nizhni Novgorod, Russia Abstract

More information

Evaluation of Optimizations for Object Tracking Feedback-Based Head-Tracking

Evaluation of Optimizations for Object Tracking Feedback-Based Head-Tracking Evaluation of Optimizations for Object Tracking Feedback-Based Head-Tracking Anjo Vahldiek, Ansgar Schneider, Stefan Schubert Baden-Wuerttemberg State University Stuttgart Computer Science Department Rotebuehlplatz

More information

MetropoGIS: A City Modeling System DI Dr. Konrad KARNER, DI Andreas KLAUS, DI Joachim BAUER, DI Christopher ZACH

MetropoGIS: A City Modeling System DI Dr. Konrad KARNER, DI Andreas KLAUS, DI Joachim BAUER, DI Christopher ZACH MetropoGIS: A City Modeling System DI Dr. Konrad KARNER, DI Andreas KLAUS, DI Joachim BAUER, DI Christopher ZACH VRVis Research Center for Virtual Reality and Visualization, Virtual Habitat, Inffeldgasse

More information

Numerical Field Extraction in Handwritten Incoming Mail Documents

Numerical Field Extraction in Handwritten Incoming Mail Documents Numerical Field Extraction in Handwritten Incoming Mail Documents Guillaume Koch, Laurent Heutte and Thierry Paquet PSI, FRE CNRS 2645, Université de Rouen, 76821 Mont-Saint-Aignan, France Laurent.Heutte@univ-rouen.fr

More information

Geometry of Vectors. 1 Cartesian Coordinates. Carlo Tomasi

Geometry of Vectors. 1 Cartesian Coordinates. Carlo Tomasi Geometry of Vectors Carlo Tomasi This note explores the geometric meaning of norm, inner product, orthogonality, and projection for vectors. For vectors in three-dimensional space, we also examine the

More information

How To Use Trackeye

How To Use Trackeye Product information Image Systems AB Main office: Ågatan 40, SE-582 22 Linköping Phone +46 13 200 100, fax +46 13 200 150 info@imagesystems.se, Introduction TrackEye is the world leading system for motion

More information

INSTRUCTOR WORKBOOK Quanser Robotics Package for Education for MATLAB /Simulink Users

INSTRUCTOR WORKBOOK Quanser Robotics Package for Education for MATLAB /Simulink Users INSTRUCTOR WORKBOOK for MATLAB /Simulink Users Developed by: Amir Haddadi, Ph.D., Quanser Peter Martin, M.A.SC., Quanser Quanser educational solutions are powered by: CAPTIVATE. MOTIVATE. GRADUATE. PREFACE

More information

Efficient on-line Signature Verification System

Efficient on-line Signature Verification System International Journal of Engineering & Technology IJET-IJENS Vol:10 No:04 42 Efficient on-line Signature Verification System Dr. S.A Daramola 1 and Prof. T.S Ibiyemi 2 1 Department of Electrical and Information

More information

Machine Learning. 01 - Introduction

Machine Learning. 01 - Introduction Machine Learning 01 - Introduction Machine learning course One lecture (Wednesday, 9:30, 346) and one exercise (Monday, 17:15, 203). Oral exam, 20 minutes, 5 credit points. Some basic mathematical knowledge

More information

Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System

Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System Oana NICOLAE Faculty of Mathematics and Computer Science, Department of Computer Science, University of Craiova, Romania oananicolae1981@yahoo.com

More information

PHOTOGRAMMETRIC TECHNIQUES FOR MEASUREMENTS IN WOODWORKING INDUSTRY

PHOTOGRAMMETRIC TECHNIQUES FOR MEASUREMENTS IN WOODWORKING INDUSTRY PHOTOGRAMMETRIC TECHNIQUES FOR MEASUREMENTS IN WOODWORKING INDUSTRY V. Knyaz a, *, Yu. Visilter, S. Zheltov a State Research Institute for Aviation System (GosNIIAS), 7, Victorenko str., Moscow, Russia

More information

Vision-Based Blind Spot Detection Using Optical Flow

Vision-Based Blind Spot Detection Using Optical Flow Vision-Based Blind Spot Detection Using Optical Flow M.A. Sotelo 1, J. Barriga 1, D. Fernández 1, I. Parra 1, J.E. Naranjo 2, M. Marrón 1, S. Alvarez 1, and M. Gavilán 1 1 Department of Electronics, University

More information

Edge tracking for motion segmentation and depth ordering

Edge tracking for motion segmentation and depth ordering Edge tracking for motion segmentation and depth ordering P. Smith, T. Drummond and R. Cipolla Department of Engineering University of Cambridge Cambridge CB2 1PZ,UK {pas1001 twd20 cipolla}@eng.cam.ac.uk

More information

AN IMPROVED DOUBLE CODING LOCAL BINARY PATTERN ALGORITHM FOR FACE RECOGNITION

AN IMPROVED DOUBLE CODING LOCAL BINARY PATTERN ALGORITHM FOR FACE RECOGNITION AN IMPROVED DOUBLE CODING LOCAL BINARY PATTERN ALGORITHM FOR FACE RECOGNITION Saurabh Asija 1, Rakesh Singh 2 1 Research Scholar (Computer Engineering Department), Punjabi University, Patiala. 2 Asst.

More information

DESIGN OF DIGITAL SIGNATURE VERIFICATION ALGORITHM USING RELATIVE SLOPE METHOD

DESIGN OF DIGITAL SIGNATURE VERIFICATION ALGORITHM USING RELATIVE SLOPE METHOD DESIGN OF DIGITAL SIGNATURE VERIFICATION ALGORITHM USING RELATIVE SLOPE METHOD P.N.Ganorkar 1, Kalyani Pendke 2 1 Mtech, 4 th Sem, Rajiv Gandhi College of Engineering and Research, R.T.M.N.U Nagpur (Maharashtra),

More information

Palmprint Recognition. By Sree Rama Murthy kora Praveen Verma Yashwant Kashyap

Palmprint Recognition. By Sree Rama Murthy kora Praveen Verma Yashwant Kashyap Palmprint Recognition By Sree Rama Murthy kora Praveen Verma Yashwant Kashyap Palm print Palm Patterns are utilized in many applications: 1. To correlate palm patterns with medical disorders, e.g. genetic

More information

MACHINE VISION MNEMONICS, INC. 102 Gaither Drive, Suite 4 Mount Laurel, NJ 08054 USA 856-234-0970 www.mnemonicsinc.com

MACHINE VISION MNEMONICS, INC. 102 Gaither Drive, Suite 4 Mount Laurel, NJ 08054 USA 856-234-0970 www.mnemonicsinc.com MACHINE VISION by MNEMONICS, INC. 102 Gaither Drive, Suite 4 Mount Laurel, NJ 08054 USA 856-234-0970 www.mnemonicsinc.com Overview A visual information processing company with over 25 years experience

More information

Real-Time Cooperative Multi-Target Tracking by Communicating Active Vision Agents

Real-Time Cooperative Multi-Target Tracking by Communicating Active Vision Agents Real-Time Cooperative Multi-Target Tracking by Communicating Active Vision Agents Norimichi Ukita Graduate School of Information Science Nara Institute of Science and Technology, Japan ukita@is.aist-nara.ac.jp

More information

Subspace Analysis and Optimization for AAM Based Face Alignment

Subspace Analysis and Optimization for AAM Based Face Alignment Subspace Analysis and Optimization for AAM Based Face Alignment Ming Zhao Chun Chen College of Computer Science Zhejiang University Hangzhou, 310027, P.R.China zhaoming1999@zju.edu.cn Stan Z. Li Microsoft

More information

A Study on SURF Algorithm and Real-Time Tracking Objects Using Optical Flow

A Study on SURF Algorithm and Real-Time Tracking Objects Using Optical Flow , pp.233-237 http://dx.doi.org/10.14257/astl.2014.51.53 A Study on SURF Algorithm and Real-Time Tracking Objects Using Optical Flow Giwoo Kim 1, Hye-Youn Lim 1 and Dae-Seong Kang 1, 1 Department of electronices

More information

Solving Simultaneous Equations and Matrices

Solving Simultaneous Equations and Matrices Solving Simultaneous Equations and Matrices The following represents a systematic investigation for the steps used to solve two simultaneous linear equations in two unknowns. The motivation for considering

More information

Automotive Applications of 3D Laser Scanning Introduction

Automotive Applications of 3D Laser Scanning Introduction Automotive Applications of 3D Laser Scanning Kyle Johnston, Ph.D., Metron Systems, Inc. 34935 SE Douglas Street, Suite 110, Snoqualmie, WA 98065 425-396-5577, www.metronsys.com 2002 Metron Systems, Inc

More information

FSI Machine Vision Training Programs

FSI Machine Vision Training Programs FSI Machine Vision Training Programs Table of Contents Introduction to Machine Vision (Course # MVC-101) Machine Vision and NeuroCheck overview (Seminar # MVC-102) Machine Vision, EyeVision and EyeSpector

More information