Online Place Recognition for Mobile Robots


Autonomous Systems Lab
Prof. Roland Siegwart

Master Thesis

Online Place Recognition for Mobile Robots

Spring Term 2011

Supervised by: Jérôme Maye, Ralf Kästner
Author: Ciril Baselgia


Contents

Abstract
Acknowledgment

1 Introduction
    Outline
2 Related Work
3 Modeling Images
    SIFT Descriptor
    Bag of Visual Words
    CENTRIST Descriptor
    Spatial Weighting
    Color Descriptor
        RGB Color
        Hue Color
        Transformed Color
    Image Descriptor Concatenation
4 Modeling Places
    The Unigram Model
    Mixture of Unigrams
    Probabilistic Latent Semantic Indexing
    Latent Dirichlet Allocation
    Dirichlet Compound Multinomial
5 Online Place Recognition using Bayesian Change-Point Detection
    Model Based Change-Point Detection
        Transition Probability
        Data Likelihood
    Particle Filter
    Change-Point Based Place Labelling
6 Experiments
    Experimental Setup
        VPC Database
        Cosy Localization Database
        ETH Data Set
    Change-Point Detection and Place Labeling
    Supervised Place Recognition
    Supervised Place Categorization

7 Discussion and Further Work
A VPC: Change Point Detection and Labeling
B COLD: Change Point Detection and Labeling
C Ground Truth COLD
D Ground Truth ETH Data Set
Bibliography

Abstract

Although visual place recognition and categorization is one of the most fundamental and natural tasks for humans, it generally remains an unsolved problem in robotics. In this work, we aim to tackle this problem by applying an unsupervised and online approach to place recognition and categorization on image streams obtained from a monocular video camera. Our approach begins with the segmentation of the incoming image streams into coherent segments that correspond to distinctive places. This is accomplished by using a model-based Bayesian change-point detection framework. Change-point detection flags abrupt changes in the generative parameters of a statistical model, which is computed online in a recursive and unsupervised manner. After segmenting the image streams, we assign the segments to the relevant distinctive places using an online unsupervised framework whereby new places are detected by means of hypothesis testing with Bayes factors. To model the places in our system, we use a Dirichlet Compound Multinomial (DCM) model, which is known for modeling word burstiness, the concept that a word is more likely to occur again once it has already appeared. It has been demonstrated that this phenomenon is also likely to occur in an image stream corresponding to a single place. In order to reduce the complexity and the running time of the maximum-likelihood update for the hyperparameters of the DCM model, we develop a new update scheme that is multiple times faster than the originally used gradient descent optimization. To assess the accuracy of our system, we present multiple experiments performed on two existing image data sets and a third data set which we recorded at the Autonomous Systems Lab of the Swiss Federal Institute of Technology. All experiments include a comparison of concatenations of different image features such as SIFT, CENTRIST and color histograms.


Acknowledgment

I would like to thank everyone who encouraged, supported and motivated me throughout my studies, especially during the master thesis. In particular, I am grateful to Jérôme Maye and Ralf Kästner, my supervisors, for their excellent support during this project, as well as for their fruitful feedback and discussions; Prof. Roland Y. Siegwart, for giving me the opportunity to carry out this project at the Autonomous Systems Lab; Yuanshan Lee, for proofreading this thesis; my family, who supported me wholeheartedly throughout my studies; and my friends, for making my life at ETH so enjoyable and colorful.


Chapter 1

Introduction

Where am I? and Have I been here before? are two very different questions, although both relate to being at a certain place. The answer to the former question is often an exact place, such as Zurich Paradeplatz, or a category of places, such as kitchen. For the latter, instead of one answer, there are usually two possible answers: yes or no. Although these are seemingly easy questions for us as humans, they are difficult questions for robots. In many ways, this is still an unsolved problem when dealing with robot localization. Nevertheless, it is sometimes essential for a robot or an intelligent agent to recognize or categorize places in a manner similar to how humans do it. This would, for instance, facilitate human-robot interaction or allow the system to overcome the kidnapped-robot problem. Robot localization is also used in many other applications such as Simultaneous Localization And Mapping (SLAM) algorithms, where a good localization enables proper loop closure. Whereas it is highly important in a SLAM approach that the localization gives the exact position and orientation of the robot, this is of less significance in a framework for topological mapping. A topological map is a graph-based structure of the environment. It consists of nodes and edges, where nodes indicate landmarks or significant places, while edges denote their connectivity. This is in contrast to a metric map, which shows space to scale (Fig. 1.1) [37]. In a topological framework, the robot only has to decide whether the current measurement comes from an already seen place, and if yes, from which one. It is not important to know its exact position within a metric space.

Figure 1.1: (a) Topological map, (b) metric map. The dotted red line is the robot path taken while gathering the measurements [37].

Performing place recognition or place labeling can be divided into

top-down and bottom-up approaches. In top-down approaches, one concludes from the overall room appearance the kind of place label which can be expected, whereas in bottom-up approaches, one concludes from the objects found in the measurement the kind of place category the measurement was taken from. The solutions to the two questions from the beginning of this chapter can be classified into place categorization and place recognition. Place categorization, also known as scene recognition, usually refers to the task of recognizing the semantic label of a scene when asked for a category of places. As already mentioned, semantic labels can have a very wide range, from corridor right up to coffee bar on the first floor of ETH Zurich's main building. Recognizing the scene category is part of the task of understanding the scene. The label of a scene strongly indicates the types of objects which can be found there, or the types of tasks a person in such a place could be doing. For instance, it is much more likely to find a person brewing coffee with a coffee machine in a kitchen than to find a person sleeping on the floor there. There are many other arguments why semantic labeling can be useful; a collection of them can be found in [44]. Most existing place categorization algorithms, which assume a finite set of place categories, require a lot of learning. The labels are commonly learned offline in a supervised manner with a corresponding set of training data. The training data contains manually labeled measurements, which the system uses to learn how measurements are grouped. During runtime, a classifier separates and categorizes input measurements into their corresponding labels using the previously learned groups. While such supervised systems relying on classifiers have the advantage of simplicity, they have several drawbacks [36], [13]:

1. The classifier needs a huge amount of labeled training data due to the large variation in the measurements (e.g. offices are vastly different in many aspects from one another), which implies long hours of manual work to annotate images by hand, which is tedious and expensive.

2. Expert-defined labels are somewhat arbitrary and therefore possibly suboptimal [13].

3. For the classifier to learn the different labels in the best possible manner, it is essential that each training data set contains the main characteristics of the underlying scene. When testing the system on new data, the data must also have the same main characteristics. This means that a human has to supervise the process of recording the data, making the use of continuous measurements almost impossible.

4. The training data assumes a fixed number of different labels, and the system will classify new measurements according to previously learned labels, which makes the recognition of possibly new categories impossible.

5. The system classifies each measurement individually and does not make decisions based on recently seen measurements.

One very important issue when using supervised place categorization is that it needs to be applicable across a wide range of spatial environments. Otherwise, accurate semantic labeling will not be possible for places which the robot has not visited before [44]. Recognizing a place is the ability to consistently label a place as the same when a particular place is revisited [36]. No semantic labeling of the place is

performed with this approach. This implies that the whole scene-understanding part can be omitted. Furthermore, there are two main tasks when talking about place recognition. These are global localization, where the robot's exact pose¹ is determined, and topological place recognition, where just a rough location (e.g. corridor) is determined [45]. In the following sections, we use the term place recognition to mean topological place recognition. While a topological mapping does not have to coincide with the human understanding of rooms or places [10], we will use the term place to indicate a room as humans define it. As in the place categorization task, most existing place recognition approaches require training. This is done by taking some measurements from a specific environment with manually labeled places and testing the system on previously unseen measurements from the same environment. Supervised place recognition is an easier task than place categorization because the learning and testing measurements look quite similar, as they come from the same places. The drawbacks of the existing place recognition approaches are similar to those of place categorization. One also needs labeled training data, although not as much as in place categorization approaches. The system also has the limitation that it simply classifies the current measurement according to the learned labels, making the creation of new labels impossible. No matter what kind of measurement is taken as the system's input, a good place recognition algorithm must be robust to dynamic changes in the environment. These range from low-dynamic changes, such as lighting and/or viewpoint changes (Fig. 1.2) when images are used, to highly dynamic changes, such as a person walking by.

Figure 1.2: Two different views of the same scene.

In this thesis, we devise an unsupervised place recognition approach.
In contrast to most existing approaches, the input of our system consists of image streams or videos instead of stand-alone images. Thus we are able to intrinsically capture time information in our algorithm. When using image streams, one has to overcome the problem that not all images will capture the characteristics of a room or scene. In fact, it is often the case that just a single wall or some other close-up view is visible, such that it is impossible even for a human being to characterize the current room. We extract and combine several descriptors from the images to form a single distinctive image histogram, in the hope that at least one descriptor will overcome the shortcomings described. Our method is based on change-point detection, which detects abrupt variations in the generative parametrization of a statistical model [30]. The change-points indicate place changes, as we assume that each place has its own parametrization and that the parameters within a place do not change drastically. Thus, when a change-point is observed, the robot exits one place and enters another. We use a Bayesian algorithm to infer the change-points by computing, for every new input image, the probability that a change-point occurs. The exact algorithm would keep track of all possibilities that a change-point could occur and make no irrevocable decision. As the computational cost of the exact algorithm would increase linearly with every time step, we use a Rao-Blackwellized particle filter to keep the costs almost constant. While change-points deliver boundaries between places, the place label is assigned based on the probability that the measurement comes from an already seen place. Thus, the place label assignment depends on the distribution of all change-points and the past model assignments. The algorithm then calculates a probability distribution across all place labels seen so far and uses a Bayes factor to test whether the robot is currently in a previously unseen place. It is thus possible for the algorithm to start from scratch and to systematically learn previously unseen places online by assigning them their corresponding parametrization.

¹ pose = position + orientation

1.1 Outline

The remainder of the thesis is structured as follows. In Chapter 2, we summarize previous work related to the topic of this thesis. Chapter 3 describes our method of image representation, and we discuss several descriptors with their advantages and disadvantages. In Chapter 4, we give a theoretical overview of five document modeling techniques which will later be used as place models in this thesis. Based on the evaluation of these techniques and the underlying theory, we will decide on one model to be used in our system. Chapter 5 describes the place recognition algorithm, which is based on visual change-point detection. In Chapter 6, experimental results are provided in which we compare our system to the PLISS [36] and VPC [44] systems. Finally, we conclude in Chapter 7 by summarizing the results and providing insights for future work.

Chapter 2

Related Work

In robotics, many approaches exist to perform visual place recognition. Although place classification methods based on laser and sonar range scans also exist [18], [26], they are not within the scope of this thesis. In this thesis, we focus on image feature-based methods instead. Typically, visual place recognition methods use measures of distinctiveness of image features to determine the location. Examples of such measures include comparing color histograms [41], matching SIFT features [20] obtained from different images [47], retrieving images by means of key-point matching [11], and using classifiers such as SVMs based on manually labeled data [32]. In a recent work by Pronobis et al. [33], sensory data is merged and the system's output for place recognition is obtained from individually trained SVM classifiers. However, as with other existing classifier-based approaches, the method by Pronobis et al. has the disadvantage that it cannot be generalized to learn from previously unseen places. Methods for visual place recognition/classification can be divided into two main types: those that use global features [41], [39], [13], [29], and those that use local features [35] or model distinctive parts of the image [34]. These can be further divided into methods that use omnidirectional cameras [41], [27], [24], [3] and those that use perspective cameras [19], [34]. In this thesis, we adopt the latter approach of using perspective cameras and follow the context-based vision system of Torralba et al. [39] in using global image texture features (Gist), which are related to functional constraints. According to Torralba et al., this method is more robust than using local features, which can occur randomly and are therefore highly variable. Their method is based on a hidden Markov model (HMM) which recursively computes the place label based on past measurements.
This method, however, has the disadvantage that prior training is required to learn the transition function for the HMM and the observation likelihood for each place in order to obtain the probability of the current measurement coming from a specific place. Ullah et al. [40] combine Harris-Laplace detectors [14] with SIFT descriptors for place recognition, as they provide an excellent trade-off between descriptive power (due to the SIFT descriptor) and generalizability, because they capture significant fragments which are very likely to appear again in different settings. In the work by Wu et al. [44], the CENTRIST [45] image descriptor is used as the system input, whereby place recognition and classification are done based on Bayesian filtering. These approaches, which are all classifier-based methods, have the implicit disadvantage that they have to pre-learn labels and are not able to learn new place categories during runtime. More recently, a new approach called PLISS (Place Labeling through Image Sequence Segmentation), which is based on online change-point detection, was introduced by Ranganathan [36].

Figure 2.1: Maximum-likelihood place labeling using PLISS. Thumbnails of the images are shown on top, followed by ground truth, maximum-likelihood place labeling output and change-point detection of the algorithm [36].

Compared to other approaches, PLISS is an algorithm which is able to learn new place labels in an online manner. This is particularly interesting and useful for mobile robots, as they are constantly confronted with new places when exploring the environment. Change-point detection in PLISS is done using the approach proposed by Adams and MacKay [30], whereby a particle filter approach by Fearnhead et al. [12] is used to control computing costs. Place labeling is conditioned on the change-point detection (see Fig. 2.1). In addition, the algorithm performs at each time step a hypothesis test which uses a likelihood ratio to determine the place model to which the current measurement belongs. If all hypotheses are rejected, the algorithm introduces a new place label. To model places, a Dirichlet Compound Multinomial (DCM) framework, which is supposed to model word burstiness [22], is used. PLISS uses a maximum-likelihood parameter update for the DCM model each time a new measurement is available, and is hence able to learn online the statistical model parameters for a specific place by using all measurements which belong to this place. As input, PLISS uses images that are modeled using the bag-of-words approach, where a word is assigned to each SIFT descriptor and SIFT features are calculated on a dense grid over the image. This group of words is then further processed with the spatial pyramid algorithm by Lazebnik et al. [19] to introduce some spatial information. Spatial pyramids are obtained by dividing the image into a grid, as shown in Fig. 2.2. Finally, each cell is represented as a histogram of words which is then weighted and concatenated to form an image descriptor.
There are two main drawbacks of the DCM approach. Firstly, because no analytical maximum-likelihood update in closed form exists for the DCM parameters, iterative algorithms have to be used, which can be quite time-consuming, especially when a lot of data has to be considered. Secondly, the framework will encounter storage problems over time, as the algorithm needs to store every measurement from each place visited in order to compute the maximum-likelihood parameters. In this thesis, we mainly adopt the PLISS approach, and we will show that better results can be achieved when a combination of different global and local descriptors is used. According to [19], some confusion occurs in the classification of indoor images (such as kitchen, bedroom, living room) when using the spatial

pyramid approach. We believe that this confusion results from the fact that indoor scenes are not as spacious as outdoor scenes. This leads to large image variations even when the camera is only slightly moved. This narrow spatiality results in drastic pyramid-cell-content changes in a short time; in particular, words which are close to a cell border can easily move to a different cell, which can lead to a very different overall image descriptor when the single histograms are concatenated. Thus, we found that in image sequences where no emphasis on image content is made (in contrast to [19], where only scene-characteristic images were used), the spatial pyramid approach is not suitable for place recognition. Due to these drawbacks, we will introduce other methods in this thesis to capture some of the image's spatial information. To model places, we also use the DCM model, but instead of using a maximum-likelihood update for its parameters, we will demonstrate a Bayesian approach that is many times faster and hence able to provide real-time responses. As we do not use a maximum-likelihood update, we can perform our hypothesis testing using a Bayes factor [17] instead of likelihood-ratio hypothesis testing. Finally, we will provide a thorough evaluation of our algorithm, testing it on different databases with clearly defined parameters.

Figure 2.2: Spatial pyramid histogram according to Lazebnik et al. [19].
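For reference, the DCM likelihood of a word-count histogram x under hyperparameters α has a closed form and can be evaluated with the standard library alone; the following is a minimal sketch (the helper name is ours, not from PLISS or our implementation) that also illustrates the burstiness property the model is chosen for.

```python
from math import lgamma

def dcm_log_likelihood(x, alpha):
    """Log-likelihood of a word-count histogram x under a Dirichlet
    Compound Multinomial with hyperparameters alpha:
    p(x | alpha) = n!/(prod x_i!) * G(A)/G(A+n) * prod_i G(a_i+x_i)/G(a_i),
    where A = sum(alpha), n = sum(x) and G is the Gamma function."""
    n, A = sum(x), sum(alpha)
    ll = lgamma(n + 1) - sum(lgamma(xi + 1) for xi in x)   # multinomial coeff.
    ll += lgamma(A) - lgamma(A + n)
    ll += sum(lgamma(a + xi) - lgamma(a) for a, xi in zip(alpha, x))
    return ll

# Burstiness: with small alpha, repeating one word is more likely
# than spreading the same number of counts over several words.
alpha = [0.5, 0.5, 0.5]
bursty = dcm_log_likelihood([4, 0, 0], alpha)
spread = dcm_log_likelihood([2, 1, 1], alpha)
print(bursty > spread)  # True
```

Because the likelihood is available in closed form, Bayes-factor tests between place models reduce to differences of such log-likelihoods; only the parameter update requires the iterative or Bayesian schemes discussed above.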


Chapter 3

Modeling Images

In this chapter, we describe three different image descriptors. The first one is the well-known SIFT descriptor, which can be represented by a histogram when using the bag-of-words approach. As most people are familiar with SIFT, we only provide a brief introduction to SIFT and focus on describing the bag-of-words method instead. The second section describes the recently introduced CENTRIST descriptor, and finally we discuss three different color descriptors.

3.1 SIFT Descriptor

SIFT was introduced by David G. Lowe in his well-known paper Object Recognition from Local Scale-Invariant Features [20]. Since its introduction, many researchers worldwide have used the SIFT descriptor for a wide range of tasks. The main reason for its success is its invariance to image translation, scaling and rotation. SIFT is robust to local variations arising from nearby clutter, resulting in a very distinctive local descriptor. Furthermore, it is partially invariant to illumination changes as well as affine or projective transformations [20]. SIFT's scale invariance results from a staged filtering approach in which so-called SIFT key-points or interest points are found. Key-points are extrema of a difference-of-Gaussians function sampled at different scale-space coordinates. To each key-point an orientation is assigned by computing a gradient orientation histogram in the key-point's neighborhood (see bar in Fig. 3.1). Projective, affine and rotational invariance is then achieved because all properties of a key-point are measured relative to the key-point orientation. Once the orientation is set for a key-point, the SIFT descriptor is computed as a set of orientation histograms over 4 × 4 pixel subregions, oriented relative to the key-point orientation. Each histogram contains 8 bins, and each descriptor contains a 4 × 4 array of histograms around the key-point. Hence, the SIFT feature vector has 4 × 4 × 8 = 128 elements.
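The descriptor layout just described (a 4 × 4 grid of cells with 8 orientation bins each) can be sketched in a few lines of NumPy. This toy version (function name ours) only illustrates the geometry; it omits the Gaussian weighting, trilinear interpolation and clipping steps of Lowe's actual implementation.

```python
import numpy as np

def sift_like_descriptor(magnitude, orientation):
    """Assemble a 128-D vector from a 16x16 patch of gradient magnitudes
    and orientations (radians, assumed already rotated relative to the
    key-point orientation). Toy illustration of the 4x4x8 layout only."""
    assert magnitude.shape == orientation.shape == (16, 16)
    cells = []
    for ci in range(4):                      # 4x4 grid of 4x4-pixel cells
        for cj in range(4):
            cell_mag = magnitude[4*ci:4*ci+4, 4*cj:4*cj+4]
            cell_ori = orientation[4*ci:4*ci+4, 4*cj:4*cj+4]
            # 8 orientation bins over [0, 2*pi), weighted by gradient magnitude
            hist, _ = np.histogram(cell_ori, bins=8, range=(0, 2*np.pi),
                                   weights=cell_mag)
            cells.append(hist)
    d = np.concatenate(cells)                # 16 cells * 8 bins = 128
    norm = np.linalg.norm(d)
    return d / norm if norm > 0 else d       # normalized feature vector

rng = np.random.default_rng(0)
mag = rng.random((16, 16))
ori = rng.random((16, 16)) * 2 * np.pi
d = sift_like_descriptor(mag, ori)
print(d.shape)  # (128,)
```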
Finally, the vector is normalized to enhance invariance to illumination changes.

3.2 Bag of Visual Words

The bag-of-words model originated from document modeling. It makes the simplifying assumption that a document can be represented as an unordered accumulation of words. The same method is applicable to images by stating that an image is a document whose content is built out of visual words. A visual word can be anything; in the simplest case, it is the intensity value of a pixel. Hence, the representation of an image by a bag of visual words, where a word is associated with a pixel's intensity value, is simply a histogram with at most 256 bins (the range for intensity values is [0, 255]).

Figure 3.1: SIFT key-point detection. The size of the ellipse indicates the scale at which the key-point is found, and the bar shows the key-point's orientation.

By representing intensity values with words, they now have their own intrinsic, so-called dictionary. A dictionary is a collection of words which describe the image. In the case of intensity values, a word is a single intensity value, and hence the dictionary, which is the bag, has a size of 256 words. Contrary to intensity values, SIFT descriptors do not come with a given dictionary; instead, a dictionary has to be learnt in advance. To learn a dictionary of SIFT words, SIFT descriptors are extracted from a set of training images. With N the number of SIFT descriptors extracted from all training images, we obtain N feature points within a 128-dimensional feature space. Learning is then accomplished using the well-known k-means clustering algorithm [21]. When k-means is applied to the feature space, it provides W cluster centres. This procedure generates a dictionary of size W, where each word is associated with one of the W cluster centres in the feature space. Note that the dictionary size W can be set to any number required; a typical dictionary size ranges between 200 and 400 [19]. With the dictionary of SIFT words, an image can therefore be represented by a bag of words whereby a word is assigned to every descriptor. Word assignment is accomplished by nearest-neighbour classification, i.e. each computed SIFT descriptor is assigned to the word whose cluster centre is closest in feature space. To be more concrete, let k denote the closest cluster centre for a given SIFT feature vector. The word w_k is then represented by a vector containing only zeros except at the kth position, where a 1 is set, i.e., w_k = (0, ..., 1, ..., 0).
Furthermore, let x_W be the bag of words, where W denotes the size of the dictionary, w_i denotes the ith word and x_i denotes the number of times an individual word i (i.e., w_i) is observed in an image. Summing up all resulting word vectors w_1, ..., w_N yields a one-dimensional histogram called the bag of words, which represents the word frequencies of an input image and is given by x_W = [x_1, x_2, x_3, ..., x_W]. In Fig. 3.2 the process is represented graphically. SIFT descriptors only contain local information around the scale at which they were computed. In contrast, histograms only contain global information. Bringing both together results in a more sophisticated image representation.
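The dictionary learning and word assignment just described can be sketched as follows. This is a minimal NumPy-only illustration: a few Lloyd iterations stand in for the k-means implementation of [21], random vectors stand in for real SIFT descriptors, and all names are ours.

```python
import numpy as np

def learn_dictionary(features, W, iters=10, seed=0):
    """Tiny Lloyd's k-means: learn W visual words (cluster centres)
    from an (N x D) array of training descriptors."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), W, replace=False)]
    for _ in range(iters):
        # assign every descriptor to its nearest centre
        dist = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(W):                 # move centres to cluster means
            pts = features[labels == k]
            if len(pts):
                centers[k] = pts.mean(axis=0)
    return centers

def bag_of_words(features, centers):
    """Nearest-neighbour word assignment; returns the histogram x_W."""
    dist = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    words = dist.argmin(axis=1)
    return np.bincount(words, minlength=len(centers))

rng = np.random.default_rng(1)
train = rng.random((200, 128))             # stand-in for SIFT descriptors
centers = learn_dictionary(train, W=16)
x = bag_of_words(rng.random((50, 128)), centers)
print(x.sum())  # 50: exactly one word per descriptor
```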

Figure 3.2: Graphical representation of the bag-of-words algorithm. The algorithm starts with the extraction of interest points (key-points). For each interest point a descriptor is calculated, which is then quantized into a histogram using the precomputed dictionary (see text for details).

3.3 CENTRIST Descriptor

Recently, Wu et al. [45] introduced a new global descriptor named CENTRIST (CENsus TRansform histogram), which they used for place categorization. CENTRIST is based on the Census Transform (CT) [46], which compares the intensity value of a pixel with its eight neighboring pixels [45]. If the center pixel exceeds (or is equal to) the intensity value of a neighboring pixel, a bit 1 is set at the corresponding location, otherwise a bit 0 is set. The bit stream resulting from the eight comparisons for each pixel is then converted into a base-10 number (Eq. 3.1). Hence, each center pixel is census transformed into a value in the range [0, 255]. Although it is possible to arrange the individual bits arbitrarily, we follow [45] and order the bits from top left to bottom right throughout this thesis. Once all CT values are calculated, one can transform them into a histogram with 256 bins, which results in the so-called CENTRIST descriptor:

    CT = (11010110)_2 = 214    (3.1)

As with other non-parametric local transforms for intensity values, CT is robust to illumination changes, gamma variation, etc. [45]. In addition, the underlying global image structure is retained after an image undergoes a census transformation. This is shown in Fig. 3.3, where each pixel's intensity value is replaced by its CT value. Additionally, the census transformation highlights the discontinuities of an image, which is a very useful property as discontinuities are among the most distinctive features in an image. In general, CT represents the underlying image geometry, as it captures structural properties by modeling the distribution of local structures [45].
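The transform and histogram can be sketched directly from the definition above. This is an illustrative NumPy version (function name ours) that compares each interior pixel with its eight neighbours in top-left-to-bottom-right bit order.

```python
import numpy as np

def centrist(gray):
    """256-bin CENTRIST histogram of a grayscale image. Each interior
    pixel is compared with its 8 neighbours (top-left to bottom-right);
    a bit is 1 where the centre value is >= the neighbour value."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                        # interior (centre) pixels
    ct = np.zeros_like(c)
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               ( 0, -1),          ( 0, 1),
               ( 1, -1), ( 1, 0), ( 1, 1)]
    for dy, dx in offsets:
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        ct = (ct << 1) | (c >= nb)           # append one comparison bit
    hist, _ = np.histogram(ct, bins=256, range=(0, 256))
    return hist

img = np.arange(25).reshape(5, 5).astype(np.uint8)
h = centrist(img)
print(h.sum())  # 9: one CT value per interior pixel of a 5x5 image
```

For this monotonically increasing toy image, every interior pixel exceeds its top and left neighbours and is exceeded by its bottom and right ones, so all nine CT values equal (11110000)_2 = 240.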
Many properties can be inferred from the census transform. One of them is that neighboring census-transformed values are highly correlated, because each neighboring pixel is involved in the census transform of the other pixel and vice versa. Hence, bit five of the pixel at (x, y) is strictly complementary to bit four of the pixel at (x + 1, y) (Fig. 3.4). Extending this constraint to the whole image implies that the number of 1s at bit 5 must be at least equal to the number of 0s at bit 4. Furthermore, there are eight other constraints belonging to a single pixel arising from its eight neighbors (strictly speaking, more constraints can be found in [45]). As a result of these constraints, the feature vector, although being

256-dimensional, is located in a much smaller subspace of the feature space, giving rise to PCA for dimensionality reduction.

Figure 3.3: Example of (a) an original and (b) a census-transformed image.

In fact, Wu et al. [45] found that 15, 23, 32, 232 (excluding 0 and 255) are the most frequent CT values. These values correspond to local shapes with horizontal or close-to-diagonal edge structure. It is counter-intuitive that vertical structures are not amongst these values. Wu et al. [45] state that vertical edges are possibly inclined in pictures owing to the perspective nature of cameras.

Figure 3.4: Example demonstrating the correlation of two neighboring pixels when the census transform is applied.

Due to the many constraints that can be derived from a census-transformed image, the single bins of the CENTRIST descriptor are not independent. A CENTRIST descriptor therefore implicitly encodes some of the underlying spatial image structure (note that this is not the same as shown in Fig. 3.3, because we do not look at local image structure; instead we investigate the global structure). The encoding of spatiality is best demonstrated in an image reconstruction experiment. The initial image is shuffled by repeatedly exchanging two randomly chosen pixels. The following reconstruction is done with the constraint that the initial and final image must have the same CENTRIST description. As shown in Fig. 3.5, the probability that the resulting reconstruction shares a similar structure with the input image is very high [45]. We have to note that the images used in this example are black and white and contain just a small number of pixels. Hence, a CENTRIST descriptor alone is insufficient for the reconstruction of larger gray-scale images. Nevertheless, this example shows that the CENTRIST descriptor captures at least some small-scale image structure. CENTRIST descriptors are well suited for use in computer vision because census transform values are very efficient to compute.
In practice, this is done with a sliding window of size 3 × 3. As comparing different pixels only involves integer calculations, it is possible to achieve a frame rate of up to 50 frames per second. Furthermore, the implementation is very easy and there are hardly any parameters that require tuning (so far, tuning is only required if PCA is used). As mentioned, the CENTRIST descriptor is invariant to illumination changes. It is

also invariant to translations and robust against scale changes. However, it is very sensitive to rotations. Although a CENTRIST descriptor implicitly encodes some spatial image structure, as discussed above, it is worth thinking about techniques to improve spatial information.

Figure 3.5: Image reconstruction from CENTRIST. The left image is always the initial image, the middle image is the shuffled image after repeatedly exchanging two randomly chosen pixels, and the right image is the reconstructed one. The first two images are completely reconstructed, the second two images are partially different, and in the last two images the reconstruction fails completely [45].

3.4 Spatial Weighting

Wu et al. [45] proposed one kind of spatial pyramid [19] to represent spatial information. As mentioned, we believe that such a method is not suitable for image sequences, because key-points close to cell boundaries are very likely to change their cell, and this results in a completely different image description. Furthermore, such a spatial pyramid enlarges the dimension of a descriptor several times over. This is due to the concatenation of different histograms arising from different scales and positions. To be more precise, let W denote the size of the dictionary (i.e. the histogram size when no spatial pyramid is used) and let L denote the number of pyramid levels. The resulting descriptor has dimension W · Σ_{l=0}^{L} 4^l = (1/3) W (4^{L+1} − 1) [19]. Thus, choosing W = 256 and L = 2 results in a 5376-dimensional descriptor. Nevertheless, it is obvious that spatial information is useful in both classifying and recognizing rooms. On the other hand, a descriptor of huge dimension can be problematic. Therefore, we propose a new method that keeps spatial information intrinsically in the descriptor without the unnecessary dimension expansion. In many images, the most characteristic part is mainly located in the middle of an image.
When dealing with image streams, e.g., when a robot moves around and records images, it is likely that the robot records one image in which all the characteristic content lies in the centre, while in the next frame the main characteristics are shifted to the side but remain at the same height. With this reasoning, we propose a weighting scheme in which an image is divided into horizontal strips instead of a pyramid. As shown in Fig. 3.6, we divide the image horizontally into three approximately equally sized patches, resulting in a 3 × 1 representation. To avoid artefacts arising from the non-overlapping regions, we introduce two additional patches (dashed lines), which results in a total of 5 blocks. We then extract a CENTRIST descriptor from each block (note that the same scheme can be applied to other descriptors). The descriptors are weighted such that the innermost block is assigned the highest weight and the weights decrease when moving outwards; blocks that are equally far from the horizontal image centre line are assigned the same weight. Note that the assignment of weights can differ in other applications. Finally, the weighted histograms are summed up, which results in a descriptor that retains some rough spatial information without any expansion of its dimension.
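The weighted summation described above can be sketched as follows. The strip boundaries, the descriptor function (a plain grey-level histogram standing in for CENTRIST) and the weights a > b > c are illustrative choices, not the exact values used in our experiments:

```python
import numpy as np

def toy_descriptor(block, bins=8):
    """Stand-in for CENTRIST: a plain grey-level histogram (illustrative)."""
    hist, _ = np.histogram(block, bins=bins, range=(0, 256))
    return hist.astype(float)

def spatially_weighted_histogram(image, descriptor_fn=toy_descriptor,
                                 weights=(1.0, 2.0, 4.0)):
    """Weighted sum of descriptors from 5 horizontal blocks: 3 roughly
    equal strips plus 2 blocks straddling the strip boundaries.
    weights = (c, b, a) with a for the centre block and a > b > c."""
    c, b, a = weights
    h = image.shape[0]
    t = h // 3
    blocks = [
        (image[0:t], c),                       # top strip
        (image[t // 2:t + t // 2], b),         # straddles top/centre boundary
        (image[t:2 * t], a),                   # centre strip, highest weight
        (image[t + t // 2:2 * t + t // 2], b), # straddles centre/bottom boundary
        (image[2 * t:h], c),                   # bottom strip
    ]
    out = sum(w * descriptor_fn(block) for block, w in blocks)
    return out / out.sum()  # dimension stays that of a single descriptor

img = np.random.randint(0, 256, size=(90, 120))  # toy grey-level image
desc = spatially_weighted_histogram(img)         # 8 bins, not 5 * 8
```

Note that, unlike a spatial pyramid, the result has the same dimension as a single descriptor.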

Figure 3.6: Image split into 5 patches, where the histogram of each patch is individually weighted with a, b or c, respectively.

3.5 Color Descriptor

In 1986, Biederman wrote [4]: "Surface characteristics such as color and texture will typically have only secondary roles in primal access... we may know that a chair has a particular color and texture simultaneously with its volumetric description, but it is only the volumetric description that provides efficient access to the representation of CHAIR." Biederman implicitly meant that geometrical cues are the most reliable for identifying objects [38]. This might be one of the reasons why color descriptors are not used very often in the computer vision community. As the focus of this thesis lies not in object categorization but in visual place recognition, we found color to be very useful when dealing with room recognition, and especially when dealing with changes in the overall room representation (see Chap. 5 for the change-point algorithm). For instance, the color of a bathroom can look quite different from most other rooms in a house. Furthermore, global color descriptors have some very valuable properties, such as invariance to rotation, translation and scale, and color is largely independent of the viewpoint and its resolution. On the other hand, color can be very sensitive to illumination and other light changes. Therefore, the following subsections discuss three different kinds of color-based histograms together with their invariances. At the end of the section, we summarize the invariances of all discussed color descriptors in a table.

3.5.1 RGB Color

RGB color is what most people understand when they talk about colors. Indeed, it is widely used, for instance, in the television market, where each color is a mixture of three or four base colors. Every color in an RGB image is a mixture of red (R), green (G) and blue (B).
In computer vision, an RGB image is represented as a three-dimensional matrix, where each of the three channels represents one color and each matrix entry denotes the intensity of the corresponding color. Thus, each channel of the RGB matrix represents one dimension in the three-dimensional color space. A color histogram is obtained by discretizing the individual dimensions of the color space and counting the number of times each color intensity occurs in the image array [38]. This results in a three-dimensional histogram where each bin represents a color in the discretized color space. A bin in the three-dimensional space can be understood as a sphere whose center is located at the discretized color position and whose radius is proportional to the bin count of the corresponding color. See Fig. 3.7 for an illustration of the three-dimensional histogram.

Figure 3.7: An image of a baboon with its corresponding color histogram [16].

The RGB color model is very intuitive to handle, but it has absolutely no invariance to illumination changes such as intensity changes, intensity shifts, etc. (Tab. 3.1).

3.5.2 Hue Color

Besides the previously discussed RGB color model, there exists another model named HSV color. In contrast to the cubic RGB model, HSV is a cylindrical model that describes the hue (H), saturation (S) and value (V) of a color (Fig. 3.8). Because the hue becomes unstable near the gray axis [42], van de Weijer et al. [43] applied an error propagation analysis to the hue transform and found that the certainty of the hue is inversely proportional to the saturation. Therefore, the hue histogram becomes more robust when each hue value is weighted with its corresponding saturation value [42]. Hue color histograms are invariant to light intensity changes and shifts, but they are not invariant to light color changes (Tab. 3.1).

3.5.3 Transformed Color

As already mentioned, the RGB color histogram is not invariant to any light changes. Yet, with proper normalization of the individual RGB channels, invariance against scale and shift with respect to light intensity and color changes can be achieved [42]. The color channels are normalized as follows:

    R_t = (R − µ_R) / σ_R,   G_t = (G − µ_G) / σ_G,   B_t = (B − µ_B) / σ_B,   (3.2)

with µ_X the mean and σ_X the standard deviation of the color distribution in channel X [42]. Thus, we have normalized the distribution in each channel and obtain a new color model with µ = 0 and σ = 1.
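A minimal sketch of the transformed-color normalization of Eq. 3.2, where the channel statistics are computed over the whole image; the final lines check the claimed invariance under a per-channel scaling and shift (the image itself is random toy data):

```python
import numpy as np

def transformed_color(rgb):
    """Eq. 3.2: standardise each channel to zero mean and unit standard
    deviation, computed over all pixels of the image."""
    rgb = rgb.astype(float)
    mu = rgb.mean(axis=(0, 1), keepdims=True)    # per-channel mean mu_X
    sigma = rgb.std(axis=(0, 1), keepdims=True)  # per-channel std sigma_X
    return (rgb - mu) / sigma

img = np.random.randint(0, 256, size=(48, 64, 3))
t1 = transformed_color(img)
# Scaling the channels (light colour change) and adding an offset (shift)
# cancels out in the normalisation, so the descriptor is unchanged:
t2 = transformed_color(img * np.array([2.0, 0.5, 1.5]) + 10.0)
assert np.allclose(t1, t2)
```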

Figure 3.8: The cylindrical HSV color model [2].

Table 3.1: Invariances of color descriptors (+ = invariant, − = not invariant).

              Light      Light      Light intensity   Light color   Light color
              intensity  intensity  change and        change        change and
              change     shift      shift                           shift
    RGB Hist.    −          −           −                 −             −
    Hue Hist.    +          +           +                 −             −
    Tr. Col.     +          +           +                 +             +

3.6 Image Descriptor Concatenation

So far, we have briefly presented the theory of three very different image descriptors, each of which has its advantages and disadvantages. In the hope of eliminating most of the disadvantages while retaining the advantages, we use a combination of all three descriptors. We combine the descriptors by concatenating the individual histograms, where either the ordinary RGB, the Hue, the Transformed Color histogram or none of them is used. This results in four slightly different image descriptors. While the SIFT, CENTRIST and Hue color descriptors are represented as one-dimensional histograms, the RGB and Transformed Color descriptors have a three-dimensional representation. In order to concatenate the three-dimensional histograms, we must reduce their dimension to one. The reduction is done by projecting the bins down to the individual dimensions. For the red axis, for instance, this is achieved by summing first along the green and then along the blue axis, which results in a one-dimensional histogram of bin counts. Applying the same procedure to the two other axes yields three different histograms, which we connect in series to form a single color histogram that is then combined with the SIFT and CENTRIST histograms.
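The projection of a three-dimensional color histogram onto three one-dimensional ones, as described above, can be sketched as follows (the bin count and image are toy values):

```python
import numpy as np

def rgb_hist_3d(image, bins=8):
    """Three-dimensional colour histogram over the discretised RGB cube."""
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist

def project_to_1d(hist3d):
    """Project the 3-D histogram down to its individual dimensions: for
    the red axis, sum first along green, then along blue; likewise for
    the other two axes, then connect the results in series."""
    r = hist3d.sum(axis=1).sum(axis=1)  # sum over green, then blue
    g = hist3d.sum(axis=0).sum(axis=1)  # sum over red, then blue
    b = hist3d.sum(axis=0).sum(axis=0)  # sum over red, then green
    return np.concatenate([r, g, b])

img = np.random.randint(0, 256, size=(32, 32, 3))
h3 = rgb_hist_3d(img)    # 8 x 8 x 8 = 512 bins
h1 = project_to_1d(h3)   # 3 * 8 = 24 bins after projection
```

Note that each pixel appears once in every one of the three projections, so the concatenated histogram sums to three times the pixel count.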

Chapter 4

Modeling Places

As mentioned in Sec. 3.2, we represent images with the widely used bag-of-words approach. It therefore seems natural to also use other models originating from document modeling. Many generative approaches to document modeling exist, some of which use latent variables while others do not. In this chapter, a few such models are discussed together with their mathematical background, and we explain the rationale for the model used in this thesis. Before going into details, it is important to clearly define the notation used in the following sections, as we will use the language of text collections throughout this thesis. Terms such as words, documents and corpus will be used often. The definitions are as follows:

A word is the basic unit of data, i.e., a single measurement. It is defined as one of the cluster centres calculated with k-means; see Sec. 3.2 for more detail. In computer vision, a word is associated with one descriptor.

A topic reflects the latent structure of a document.

A document is a sequence of N words represented as a vector w = (w_1, w_2, ..., w_N), with w_n the nth word in the sequence [8]. This vector w can be binned into a histogram x with W bins, where W denotes the size of the vocabulary (see Sec. 3.2). The count for a particular word w is denoted by x_w. A document can be associated with a single image.

A corpus is a collection of D documents, D = (w_1, w_2, ..., w_D). In the language of computer vision, where we perform place recognition/categorization, this can be associated with a single place.

4.1 The Unigram Model

The simplest statistical model for document modeling is the unigram. With this model, the words of each document are generated as independent samples from a multinomial distribution [8]:

    p(w) = ∏_{n=1}^{N} p(w_n),   (4.1)

where p(w_n) denotes the emission probability of the nth word w_n. The multinomial distribution specifies the probability that a given vector x = (x_1, x_2, ..., x_W) of

word counts is observed [22], where x_{w_n} denotes the number of times the nth word is the outcome, i.e.,

    x_{w_n} = Σ_i δ(w_i, w_n).   (4.2)

If we parametrize the multinomial with θ and denote by θ_w the probability that a specific word w is emitted, subject to the constraints Σ_{w=1}^{W} θ_w = 1 and θ_w ≥ 0, then the probability of a document having word counts x is given by

    p(x | θ) = (N choose x_1 x_2 ... x_W) ∏_{w=1}^{W} θ_w^{x_w} = N! / (x_1! x_2! ... x_W!) · ∏_{w=1}^{W} θ_w^{x_w},   (4.3)

where N = Σ_{w=1}^{W} x_w [5], [22]. Using the multinomial implies that the emission probability of a particular word depends only on the word itself and is not influenced by other factors. Fig. 4.1 shows the graphical representation of the unigram model.

Figure 4.1: Graphical representation of the unigram model. Blue denotes observed variables, red denotes multinomial parameters.

The multinomial distribution depends on the document length N and is therefore different for each N [22]. Nevertheless, this is not a problem because we only want to learn the model parameters θ. Since the maximum likelihood estimate ˆθ depends only on the fraction of times a particular word appears in the entire corpus, the length of a document has no influence (Eq. 4.4). To compute the maximum likelihood estimate ˆθ, consider a training data set D with D independent observations x_1, x_2, ..., x_D, i.e., a corpus with D documents. The maximum likelihood solution for the parameter is then given by

    ˆθ_w = Σ_{d=1}^{D} x_{d,w} / Σ_{w=1}^{W} Σ_{d=1}^{D} x_{d,w}.   (4.4)

As mentioned at the beginning of this section, the unigram model samples each document from the same multinomial parameter θ. This implies that a single word has exactly the same emission probability across all documents within a corpus. In the case of visual places, this implies that each image recorded from a particular place should exhibit approximately the same word frequencies.
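Eq. 4.4 amounts to pooling the word counts of the whole corpus and normalising over the vocabulary; a minimal sketch with a toy count matrix:

```python
import numpy as np

def unigram_mle(X):
    """Eq. 4.4: theta_hat_w is the fraction of all word occurrences in
    the corpus that belong to word w. X is the (D, W) matrix of
    per-document word counts x_{d,w}."""
    counts = X.sum(axis=0)        # total count of each word over all documents
    return counts / counts.sum()  # normalise over the vocabulary

# Toy corpus: D = 3 documents over a vocabulary of W = 4 visual words
X = np.array([[2, 0, 1, 1],
              [1, 3, 0, 0],
              [0, 1, 1, 2]])
theta_hat = unigram_mle(X)  # -> [3/12, 4/12, 2/12, 3/12]
```

As the estimate only depends on pooled fractions, documents of different lengths contribute in proportion to their total word count.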
It is intuitive that this model is not a good approximation of real places. Therefore, other techniques have to be developed.

4.2 Mixture of Unigrams

To overcome the drawback of the unigram model that each document within a corpus shares the same word parametrization, Nigam et al. [28] introduced the mixture of unigrams by augmenting the unigram model with a random topic variable z. Under this modified model, a document is generated by first choosing a topic

variable z out of T topics. Words are then drawn independently from a multinomial conditioned on that topic. The probability of a document is

    p(w) = Σ_z p(z) ∏_{n=1}^{N} p(w_n | z).   (4.5)

Parametrizing the mixing weights with τ and the word distributions with Θ, the likelihood of a document under the mixture of unigrams model becomes [7]

    p(w | τ, Θ) = Σ_{j=1}^{T} p(z_j | τ) ∏_{n=1}^{N} p(w_n | z_j, θ_{z_j}).   (4.6)

Rewriting the equation using word counts x_w instead of the document vector w, we get

    p(x | τ, Θ) = Σ_{j=1}^{T} p(z_j | τ) ∏_{w=1}^{W} p(x_w | z_j, θ_{z_j}),   (4.7)

where τ is a multinomial hyperparameter of dimension T and Θ is a T × W matrix whose jth row θ_{z_j} determines the probabilities of the words in the vocabulary given the jth topic. These parameters are learned from a corpus beforehand.

Figure 4.2: Graphical representation of the mixture of unigrams. Blue denotes observed variables, red denotes multinomial parameters, white denotes hidden variables.

The T multinomial distributions over topics represent an underlying semantic structure in the corpus [7]. Although a corpus contains documents generated from different topics, each individual document is a manifestation of only one topic. Using terminology similar to bag-of-words, the mixture of unigrams model is best described as bag-of-topics-of-bag-of-words, i.e., we have a bag of topics, where each topic implies a distribution over words (Fig. 4.5(a)). In the language of computer vision, the term topic is rather vague. However, one could interpret a topic as all words arising from, e.g., a sofa or a fridge. This would imply that an image should contain only sofas when it is conditioned on the sofa topic. It is obvious that this will hardly be the case with real data. Hence, a model allowing several topics per measurement would be a better approximation of real-life measurements.
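Eq. 4.7 can be evaluated in log space for numerical stability; in the sketch below the count-independent multinomial coefficient is dropped, since it does not affect parameter learning, and all numbers are toy values:

```python
import numpy as np

def mixture_loglik(x, tau, Theta):
    """log p(x | tau, Theta) of Eq. 4.7, omitting the multinomial
    coefficient (it is independent of the parameters).
    x: (W,) word counts; tau: (T,) mixing weights;
    Theta: (T, W) per-topic word distributions, rows summing to 1."""
    # log of each topic's term: log tau_j + sum_w x_w log theta_{j,w}
    log_terms = np.log(tau) + np.log(Theta) @ x
    # log-sum-exp over topics for numerical stability
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum())

tau = np.array([0.6, 0.4])           # T = 2 topic mixing weights
Theta = np.array([[0.7, 0.2, 0.1],   # topic 1 word distribution
                  [0.1, 0.3, 0.6]])  # topic 2 word distribution
x = np.array([2, 1, 0])              # word counts of one document
ll = mixture_loglik(x, tau, Theta)   # log(0.6*0.49*0.2 + 0.4*0.01*0.3)
```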
4.3 Probabilistic Latent Semantic Indexing

Hofmann [15] augmented the mixture of unigrams with a new variable γ, resulting in the probabilistic latent semantic indexing (plsi) model. The plsi

model posits that a word w_n and a document label γ are conditionally independent given an unobserved topic z [8]:

    p(γ, w_n) = p(γ) Σ_z p(w_n | z) p(z | γ).   (4.8)

The plsi model relaxes the simplifying assumption that a document is generated by just one topic. Because the multinomial p(z | γ) serves as a mixing component for a particular document d, it allows a document to contain more than one topic (Fig. 4.5(a)). However, it is important to note that γ is a dummy variable that takes as many values as there are training documents [8]. Thus, the model only learns the topic mixtures p(z | γ) for the documents it is trained with and cannot generalize to previously unseen documents. Furthermore, the plsi model is likely to overfit the data because its number of parameters grows linearly with the size of the corpus. For more information on this issue see [8].

Figure 4.3: Graphical representation of the plsi model. Blue denotes observed variables, red denotes multinomial parameters, white denotes hidden variables.

4.4 Latent Dirichlet Allocation

To overcome the overfitting and low generalizability of the plsi model, Blei et al. [8] introduced a new model called latent Dirichlet allocation (LDA). LDA treats the topic mixture weights p(z | τ) as a T-parameter hidden random variable rather than a large set of stand-alone parameters explicitly linked to the training set, i.e., the topic mixture τ is itself drawn from a Dirichlet distribution (see Fig. 4.4). The general idea behind LDA is that documents are generated by a random mixture over latent topics, where each topic is characterized by a distribution over words [8]. LDA assumes the following generative process for each document d in a corpus D [8]:

1. Choose the number of words N ∼ Poisson(ξ).
2. Choose the topic mixing parameter τ ∼ Dir(α).
3.
For each of the N words w_n:
   (a) Choose a topic z_n ∼ Multinomial(τ).
   (b) Choose a word w_n from p(w_n | θ_{z_n}), a multinomial probability conditioned on the topic z_n.

Thus, given the parameters α and Θ, this leads to the joint distribution

    p(τ, z, w | α, Θ) = p(τ | α) ∏_{n=1}^{N} p(z_n | τ) p(w_n | θ_{z_n}),   (4.9)
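The three generative steps above can be sketched directly; the topic count, the hyperparameters and the topic-word matrix below are toy values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(alpha, Theta, xi=50):
    """Sample one document with the LDA generative process:
    1. N ~ Poisson(xi), 2. tau ~ Dir(alpha), 3. for each word a topic
    z_n ~ Multinomial(tau) and a word w_n ~ Multinomial(theta_{z_n})."""
    T, W = Theta.shape
    N = rng.poisson(xi)                  # 1. document length
    tau = rng.dirichlet(alpha)           # 2. per-document topic mixture
    z = rng.choice(T, size=N, p=tau)     # 3a. one topic per word
    return np.array([rng.choice(W, p=Theta[zn]) for zn in z])  # 3b. words

Theta = np.array([[0.5, 0.3, 0.1, 0.1],   # topic 0 prefers words 0 and 1
                  [0.1, 0.1, 0.4, 0.4]])  # topic 1 prefers words 2 and 3
doc = generate_document(alpha=np.array([2.0, 1.0]), Theta=Theta)
```

Unlike the mixture of unigrams, each document draws its own topic mixture τ, so several topics can contribute words to a single image.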


More information

CS231M Project Report - Automated Real-Time Face Tracking and Blending

CS231M Project Report - Automated Real-Time Face Tracking and Blending CS231M Project Report - Automated Real-Time Face Tracking and Blending Steven Lee, slee2010@stanford.edu June 6, 2015 1 Introduction Summary statement: The goal of this project is to create an Android

More information

CS 585 Computer Vision Final Report Puzzle Solving Mobile App

CS 585 Computer Vision Final Report Puzzle Solving Mobile App CS 585 Computer Vision Final Report Puzzle Solving Mobile App Developed by Timothy Chong and Patrick W. Crawford December 9, 2014 Introduction and Motivation This project s aim is to create a mobile application

More information

Mean-Shift Tracking with Random Sampling

Mean-Shift Tracking with Random Sampling 1 Mean-Shift Tracking with Random Sampling Alex Po Leung, Shaogang Gong Department of Computer Science Queen Mary, University of London, London, E1 4NS Abstract In this work, boosting the efficiency of

More information

Fast Matching of Binary Features

Fast Matching of Binary Features Fast Matching of Binary Features Marius Muja and David G. Lowe Laboratory for Computational Intelligence University of British Columbia, Vancouver, Canada {mariusm,lowe}@cs.ubc.ca Abstract There has been

More information

Automatic Labeling of Lane Markings for Autonomous Vehicles

Automatic Labeling of Lane Markings for Autonomous Vehicles Automatic Labeling of Lane Markings for Autonomous Vehicles Jeffrey Kiske Stanford University 450 Serra Mall, Stanford, CA 94305 jkiske@stanford.edu 1. Introduction As autonomous vehicles become more popular,

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Robert Collins CSE598C, PSU. Introduction to Mean-Shift Tracking

Robert Collins CSE598C, PSU. Introduction to Mean-Shift Tracking Introduction to Mean-Shift Tracking Appearance-Based Tracking current frame + previous location likelihood over object location appearance model (e.g. image template, or Mode-Seeking (e.g. mean-shift;

More information

Topic models for Sentiment analysis: A Literature Survey

Topic models for Sentiment analysis: A Literature Survey Topic models for Sentiment analysis: A Literature Survey Nikhilkumar Jadhav 123050033 June 26, 2014 In this report, we present the work done so far in the field of sentiment analysis using topic models.

More information

Face Recognition in Low-resolution Images by Using Local Zernike Moments

Face Recognition in Low-resolution Images by Using Local Zernike Moments Proceedings of the International Conference on Machine Vision and Machine Learning Prague, Czech Republic, August14-15, 014 Paper No. 15 Face Recognition in Low-resolution Images by Using Local Zernie

More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Evaluation of local spatio-temporal features for action recognition

Evaluation of local spatio-temporal features for action recognition Evaluation of local spatio-temporal features for action recognition Heng WANG 1,3, Muhammad Muneeb ULLAH 2, Alexander KLÄSER 1, Ivan LAPTEV 2, Cordelia SCHMID 1 1 LEAR, INRIA, LJK Grenoble, France 2 VISTA,

More information

S = {1, 2,..., n}. P (1, 1) P (1, 2)... P (1, n) P (2, 1) P (2, 2)... P (2, n) P = . P (n, 1) P (n, 2)... P (n, n)

S = {1, 2,..., n}. P (1, 1) P (1, 2)... P (1, n) P (2, 1) P (2, 2)... P (2, n) P = . P (n, 1) P (n, 2)... P (n, n) last revised: 26 January 2009 1 Markov Chains A Markov chain process is a simple type of stochastic process with many social science applications. We ll start with an abstract description before moving

More information

EE 368 Project: Face Detection in Color Images

EE 368 Project: Face Detection in Color Images EE 368 Project: Face Detection in Color Images Wenmiao Lu and Shaohua Sun Department of Electrical Engineering Stanford University May 26, 2003 Abstract We present in this report an approach to automatic

More information

Digital Image Processing. Prof. P. K. Biswas. Department of Electronics & Electrical Communication Engineering

Digital Image Processing. Prof. P. K. Biswas. Department of Electronics & Electrical Communication Engineering Digital Image Processing Prof. P. K. Biswas Department of Electronics & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture - 28 Colour Image Processing - III Hello,

More information

C4 Computer Vision. 4 Lectures Michaelmas Term Tutorial Sheet Prof A. Zisserman. fundamental matrix, recovering ego-motion, applications.

C4 Computer Vision. 4 Lectures Michaelmas Term Tutorial Sheet Prof A. Zisserman. fundamental matrix, recovering ego-motion, applications. C4 Computer Vision 4 Lectures Michaelmas Term 2004 1 Tutorial Sheet Prof A. Zisserman Overview Lecture 1: Stereo Reconstruction I: epipolar geometry, fundamental matrix. Lecture 2: Stereo Reconstruction

More information

Classification of Fine Art Oil Paintings by Semantic Category

Classification of Fine Art Oil Paintings by Semantic Category Classification of Fine Art Oil Paintings by Semantic Category William Kromydas kromydas@stanford.edu Abstract In this paper we explore supervised learning techniques that are able to classify fine-art

More information

Theory of Computation Prof. Kamala Krithivasan Department of Computer Science and Engineering Indian Institute of Technology, Madras

Theory of Computation Prof. Kamala Krithivasan Department of Computer Science and Engineering Indian Institute of Technology, Madras Theory of Computation Prof. Kamala Krithivasan Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture No. # 31 Recursive Sets, Recursively Innumerable Sets, Encoding

More information

Markov chains and Markov Random Fields (MRFs)

Markov chains and Markov Random Fields (MRFs) Markov chains and Markov Random Fields (MRFs) 1 Why Markov Models We discuss Markov models now. This is the simplest statistical model in which we don t assume that all variables are independent; we assume

More information

IMPLICIT SHAPE MODELS FOR OBJECT DETECTION IN 3D POINT CLOUDS

IMPLICIT SHAPE MODELS FOR OBJECT DETECTION IN 3D POINT CLOUDS IMPLICIT SHAPE MODELS FOR OBJECT DETECTION IN 3D POINT CLOUDS Alexander Velizhev 1 (presenter) Roman Shapovalov 2 Konrad Schindler 3 1 Hexagon Technology Center, Heerbrugg, Switzerland 2 Graphics & Media

More information

Part-Based Recognition

Part-Based Recognition Part-Based Recognition Benedict Brown CS597D, Fall 2003 Princeton University CS 597D, Part-Based Recognition p. 1/32 Introduction Many objects are made up of parts It s presumably easier to identify simple

More information

Section 1.1. Introduction to R n

Section 1.1. Introduction to R n The Calculus of Functions of Several Variables Section. Introduction to R n Calculus is the study of functional relationships and how related quantities change with each other. In your first exposure to

More information

Face Model Fitting on Low Resolution Images

Face Model Fitting on Low Resolution Images Face Model Fitting on Low Resolution Images Xiaoming Liu Peter H. Tu Frederick W. Wheeler Visualization and Computer Vision Lab General Electric Global Research Center Niskayuna, NY, 1239, USA {liux,tu,wheeler}@research.ge.com

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode Value

Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode Value IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 10 April 2015 ISSN (online): 2349-784X Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode

More information

Eyeglass Localization for Low Resolution Images

Eyeglass Localization for Low Resolution Images Eyeglass Localization for Low Resolution Images Earl Arvin Calapatia 1 1 De La Salle University 1 earl_calapatia@dlsu.ph Abstract: Facial data is a necessity in facial image processing technologies. In

More information

Jiří Matas. Hough Transform

Jiří Matas. Hough Transform Hough Transform Jiří Matas Center for Machine Perception Department of Cybernetics, Faculty of Electrical Engineering Czech Technical University, Prague Many slides thanks to Kristen Grauman and Bastian

More information

Determining optimal window size for texture feature extraction methods

Determining optimal window size for texture feature extraction methods IX Spanish Symposium on Pattern Recognition and Image Analysis, Castellon, Spain, May 2001, vol.2, 237-242, ISBN: 84-8021-351-5. Determining optimal window size for texture feature extraction methods Domènec

More information

Supervised and unsupervised learning - 1

Supervised and unsupervised learning - 1 Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in

More information

Scalar Visualization

Scalar Visualization Scalar Visualization 4-1 Motivation Visualizing scalar data is frequently encountered in science, engineering, and medicine, but also in daily life. Recalling from earlier, scalar datasets, or scalar fields,

More information

Motivation. Lecture 31: Object Recognition: SIFT Keys. Simple Example. Simple Example. Simple Example

Motivation. Lecture 31: Object Recognition: SIFT Keys. Simple Example. Simple Example. Simple Example Lecture 31: Object Recognition: SIFT Keys Motivation Want to recognize a known objects from unknown viewpoints. find them in an image database of models Local Feature based Approaches Represent appearance

More information

THE development of methods for automatic detection

THE development of methods for automatic detection Learning to Detect Objects in Images via a Sparse, Part-Based Representation Shivani Agarwal, Aatif Awan and Dan Roth, Member, IEEE Computer Society 1 Abstract We study the problem of detecting objects

More information

Learning 3D Object Recognition Models from 2D Images

Learning 3D Object Recognition Models from 2D Images From: AAAI Technical Report FS-93-04. Compilation copyright 1993, AAAI (www.aaai.org). All rights reserved. Learning 3D Object Recognition Models from 2D Images Arthur R. Pope David G. Lowe Department

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

CVChess: Computer Vision Chess Analytics

CVChess: Computer Vision Chess Analytics CVChess: Computer Vision Chess Analytics Jay Hack and Prithvi Ramakrishnan Abstract We present a computer vision application and a set of associated algorithms capable of recording chess game moves fully

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

More information

INTRODUCTION TO NEURAL NETWORKS

INTRODUCTION TO NEURAL NETWORKS INTRODUCTION TO NEURAL NETWORKS Pictures are taken from http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html http://research.microsoft.com/~cmbishop/prml/index.htm By Nobel Khandaker Neural Networks An

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

Android Ros Application

Android Ros Application Android Ros Application Advanced Practical course : Sensor-enabled Intelligent Environments 2011/2012 Presentation by: Rim Zahir Supervisor: Dejan Pangercic SIFT Matching Objects Android Camera Topic :

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

More information

Multiple Object Tracking Using SIFT Features and Location Matching

Multiple Object Tracking Using SIFT Features and Location Matching Multiple Object Tracking Using SIFT Features and Location Matching Seok-Wun Ha 1, Yong-Ho Moon 2 1,2 Dept. of Informatics, Engineering Research Institute, Gyeongsang National University, 900 Gazwa-Dong,

More information

CHAPTER 3 Numbers and Numeral Systems

CHAPTER 3 Numbers and Numeral Systems CHAPTER 3 Numbers and Numeral Systems Numbers play an important role in almost all areas of mathematics, not least in calculus. Virtually all calculus books contain a thorough description of the natural,

More information

Efficient visual search of local features. Cordelia Schmid

Efficient visual search of local features. Cordelia Schmid Efficient visual search of local features Cordelia Schmid Visual search change in viewing angle Matches 22 correct matches Image search system for large datasets Large image dataset (one million images

More information

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a

More information

Robot Perception Continued

Robot Perception Continued Robot Perception Continued 1 Visual Perception Visual Odometry Reconstruction Recognition CS 685 11 Range Sensing strategies Active range sensors Ultrasound Laser range sensor Slides adopted from Siegwart

More information

Blind Deconvolution of Barcodes via Dictionary Analysis and Wiener Filter of Barcode Subsections

Blind Deconvolution of Barcodes via Dictionary Analysis and Wiener Filter of Barcode Subsections Blind Deconvolution of Barcodes via Dictionary Analysis and Wiener Filter of Barcode Subsections Maximilian Hung, Bohyun B. Kim, Xiling Zhang August 17, 2013 Abstract While current systems already provide

More information

A Robust Multiple Object Tracking for Sport Applications 1) Thomas Mauthner, Horst Bischof

A Robust Multiple Object Tracking for Sport Applications 1) Thomas Mauthner, Horst Bischof A Robust Multiple Object Tracking for Sport Applications 1) Thomas Mauthner, Horst Bischof Institute for Computer Graphics and Vision Graz University of Technology, Austria {mauthner,bischof}@icg.tu-graz.ac.at

More information

A Comparative Study between SIFT- Particle and SURF-Particle Video Tracking Algorithms

A Comparative Study between SIFT- Particle and SURF-Particle Video Tracking Algorithms A Comparative Study between SIFT- Particle and SURF-Particle Video Tracking Algorithms H. Kandil and A. Atwan Information Technology Department, Faculty of Computer and Information Sciences, Mansoura University,El-Gomhoria

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

More information