Furniture Recognition using Implicit Shape Models on 3D Data


Furniture Recognition using Implicit Shape Models on 3D Data

Master Thesis in Intelligent Systems
submitted by Jens Wittrowski

Written at the Applied Informatics Research Group, Faculty of Technology, Bielefeld University

Advisors: Dr.-Ing. Agnes Swadzba, M.Sc. Leon Ziegler
Started: 02 May 2012, Finished: 30 October 2012


I hereby declare that I have done this work with the title "Furniture Recognition using Implicit Shape Models on 3D Data" on my own. I have used no other sources than the ones listed and I have marked any citations accordingly.

Ich erkläre hiermit, dass ich die vorliegende Arbeit mit dem Titel "Furniture Recognition using Implicit Shape Models on 3D Data" selbstständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet und Zitate als solche gekennzeichnet habe.

Bielefeld, 30 October 2012
Jens Wittrowski


Abstract

The recognition and classification of objects in 3D scene data is still a very challenging task in computer vision. For robots acting in domestic environments, the correct recognition of furniture objects is especially important, because furniture belongs to the main things such robots need to consider, either for navigating around it or for using it. In this work an approach for the recognition of furniture objects in indoor room scenes is presented. The implemented method learns the spatial relationship of typical object regions by defining an Implicit Shape Model (ISM). For training the ISM, artificial 3D models are used, which are publicly available in several internet datasets. To recognize appearances of the learned relationships in test scenes captured with a 3D sensor, probabilistic Hough voting is performed, which makes it possible to simultaneously recognize and localize instances of the learned object category. The implemented method is tested on four furniture categories: office chair, dining chair, dining table and couch.


Contents

1 Introduction
   1.1 Related Work
2 Approach
   2.1 Implicit Shape Model
      2.1.1 Training Procedure
      2.1.2 Categorization Procedure
      2.1.3 Codebook
      2.1.4 3D Hough Space Voting
      2.1.5 Descriptor
   2.2 Relevant 3D Computer Vision Algorithms
      2.2.1 Uniform Downsampling
      2.2.2 Normal Computation
      2.2.3 Keypoints - Boundary Estimation
      2.2.4 Moving Least Squares Smoothing
3 Implementation
   3.1 Hardware
   3.2 Software
   3.3 Trainingdata
   3.4 Implicit Shape Model
      3.4.1 Codebook (Global Codebook, Separated Codebook)
      3.4.2 3D Hough Space Voting
      3.4.3 SHOT Descriptor
   3.5 Training & Category Recognition
      3.5.1 Training Chain
      3.5.2 Object Category Recognition Chain (Hypotheses Generation, Hypothesis Selection)
4 Evaluation
   Finding Good Thresholds
   Baseline Evaluation (Hypotheses Generation, Hypotheses Selection)
   Evaluation on Indoor Scenes (Office Chair, Dining Chair, Dining Table, Couch)
   Analysis of the Different Strategies
5 Discussion (Baseline results, Indoor Scenes results, Additional comments)
6 Summary & Outlook
A List of Abbreviations
B Addendum
   B.1 Used Models
      B.1.1 Office Chair
      B.1.2 Dining Chair
      B.1.3 Dining Table
      B.1.4 Couch
   B.2 Used Testscenes for the Evaluation
      B.2.1 Office Chair
      B.2.2 Dining Chair
      B.2.3 Dining Table
      B.2.4 Couch
Bibliography

List of Figures

2.1 Training procedure for a 3D-ISM (taken from [17])
2.2 Categorization procedure for a 3D-ISM (taken from [17])
2.3 Hough Voting based on local reference frames (taken from [22])
2.4 Hough Voting scheme (taken from [22])
2.5 SHOT sphere (taken from [23])
2.6 Normal computation via PCA
2.7 Normal computation on a training model
2.8 Left: original point cloud; right: detected keypoints
3.1 Microsoft Kinect
3.2 Model of the Kinect triangulation process (taken from [6])
3.3 Left: full model, right: virtual scan from the front
3.4 Camera positions for the virtual scans
3.5 Results of the codebook performance analysis
3.6 Implemented training procedure
3.7 Implemented keypoint detection and description for training
3.8 Implemented recognition procedure
3.9 Implemented keypoint detection and description for the object recognition procedure
3.10 Voting results - green points: keypoints that voted into the chosen bin; red points: resulting vote positions
Bounding box surrounding the detected object
Recognition results for the office chair scenes
Recognition rates in percent for the office chair scenes
Recognition results for the dining chair scenes
Recognition rates in percent for the dining chair scenes
Recognition results for the dining table scenes
Recognition rates in percent for the dining table scenes
Recognition results for the couch scenes
Recognition rates in percent for the couch scenes
Total recognition results by varying the parameter combinations
Average recognition results by varying the parameter combinations
Number of recognitions for the different codebook types
Average recognition rates in % for the different codebook types
Number of recognitions for the different combinations of codeword activation and vote weighting
Average recognition results for the different combinations of codeword activation and vote weighting
4.1 Screenshots of Recognitions

1. Introduction

Today we can find more and more examples of little robots aimed to support us in the household or in the garden. Robots vacuuming the floor, cutting the lawn or cleaning the pool are already available on the market. Most of them can get along with simple sensors and algorithms to navigate through their environment and execute the required task. In the future we may also have robots supporting us in more complex tasks, like setting the table or hanging up the laundry. Along with this, they will need to be equipped with better sensors and actuators and use more complex algorithms, making the robots more and more intelligent.

Generally the processing of an intelligent robot can be grouped into three parts: perception, reasoning and action. Perception covers the mechanisms (sensors and algorithms) that are needed for the robot to perceive its environment. The interpretation of the current situation and the choice of the next steps is part of reasoning. Finally, action covers all aspects that are needed for a robot to manipulate its environment, like moving forward or grasping things.

The scope of this work is placed in the first part, perception, in particular visual perception. Here we find several requirements for intelligent robots. For autonomous navigation, for example, they need to be able to construct a map of their environment and compute possible paths they can take. When interacting with persons they need to detect and track them or recognize special gestures. And for performing complex tasks in the household they need to be able to recognize and classify the objects in the rooms. In domestic environments, the correct recognition of furniture objects is especially important, because furniture belongs to the main things robots need to consider, either for navigating around it or for using it. If a robot is asked to perform a complex task like "Put the cup on the table", for example, it must be able to recognize and locate the supposed table.

With the introduction of better sensors for visual perception, like the Kinect camera from Microsoft, new promising approaches came up in the computer vision research

community. These sensors give robots the ability to get a 3D point cloud of their environment. The usage of 3D data has the main advantage, compared to 2D image data, that the sizes of the objects can be measured directly. Nevertheless, the recognition and classification of objects in 3D scene data is still a very challenging task in computer vision. The data received by currently available 3D sensors still contains a lot of noise and clutter. In addition, the target object is often occluded, or at least partly occluded, by other objects in the environment, making its segmentation and recognition very difficult. Especially the recognition of furniture objects is challenging, because they often do not contain many outstanding regions. Furniture objects mostly consist of planar or almost planar areas and hence do not differ that much from other captured regions in a flat, such as walls, floors, doors or windows. In addition, some furniture categories have a high intra-class variability (e.g. chairs) and a low inter-class variability (e.g. office chair and dining chair), making the recognition of the right category very challenging.

In this work an approach for the recognition of furniture objects in indoor room scenes is presented. The implemented method learns the spatial relationship of typical object regions by defining an Implicit Shape Model (ISM). For training the ISM, artificial 3D models are used, which are publicly available in several internet datasets. To recognize appearances of the learned relationships in a test scene captured with a 3D sensor, probabilistic Hough voting is performed, which makes it possible to simultaneously recognize and localize an instance of the learned object category. Beyond that, it even allows the recognition of multiple object instances. The implemented method is tested on four furniture categories: office chair, dining chair, dining table and couch.

The structure of this work is as follows. In the next section I give an overview of currently existing approaches for object category recognition in 3D data. In chapter 2 the theoretical background of the approach I used is described. It contains information about the approach itself as well as a description of some relevant 3D computer vision algorithms which were used for this work. Chapter 3 points out how the proposed method was adapted and integrated to be used for the recognition of furniture objects in indoor scenes. The used hardware, software libraries and the implemented algorithms with their parameters are presented in detail. Chapter 4 shows how I tested the implemented procedure. It describes what data was used for training and testing and how the tests were performed, and finally gives an overview of the obtained results. An interpretation of the results is then given in chapter 5. The results for the different combinations of strategies used are analysed and discussed. Finally, chapter 6 draws a conclusion of the implemented approach, points out the

main advantages and the remaining challenges. In addition, an outlook on further research is given.

To have the different areas of object recognition clear, I will use the definitions pointed out in [21] for this work. The task of finding an already known object instance in a scene is termed object instance recognition. Here the concrete object that is to be found is known and has to be detected and relocated in the given scene. In contrast, object category recognition is the task of detecting an (as yet) unknown object instance in a scene. Here the method has learned what typical object instances of this category look like, and the challenge is to find out whether an instance of this category is included in the presented scene and where it is located. Whereas in the recognition tasks the object has to be detected in a full scene of many objects, in object categorization the object is completely separated: the input data exclusively consists of the object and the task is to assign this object to the right category out of some possible options. The scope of this work is placed in the area of object category recognition, here by using 3D point clouds.

1.1 Related Work

Though the recognition of object categories in noisy and cluttered 3D data is a very challenging task, some promising approaches exist. Generally the task can be divided into two parts, a localization and a recognition part. The goal of the recognition part is to analyse whether an object of the requested category can be found in the presented scene, whereas the localization part aims at finding out where exactly the object is located. The methods proposed so far typically learn a kind of reference model of the requested object category and test if this model can be found in a scene. The reference models can generally be grouped into two types: statistical classifiers and geometry-based reference models. Statistical classifiers learn the weights of a stochastic function to classify or categorize the input data, whereas the geometry-based ones learn a model by considering the geometric relationship of object parts found in the training data.

To generate the reference model, the outstanding characteristics of the training data must be determined. Often this is achieved by using global or local 3D descriptors. Global descriptors use a single vector to describe the whole training object. In contrast, local descriptors describe the local region around each point, hence many local descriptor vectors are needed to describe the whole training object. An example for a local 3D descriptor is the Signature of Histograms of OrienTations (SHOT) descriptor introduced in [23], which will be described in detail later in this work. Another one is the Fast Point Feature Histogram (FPFH) descriptor, shown in [15]. This descriptor generates a histogram of the angular variations of the normals found in the neighbourhood of the point. For detecting the points and normals in the neighbourhood, a radius search is performed. Since local descriptors only describe the local area around a point, several descriptor vectors are needed to describe the whole object. To allow a generalization of the object description, an alphabet of the feature vectors, termed codebook, is typically

used. This alphabet is obtained by performing clustering or vector quantization on the whole set of feature vectors found in the training data. The model can then be described by generating a histogram of the used codebook entries, which is called Bag-of-Features or Bag-of-Visual-Words. When using global descriptors the whole model can be described in one vector. In [16] the Global Fast Point Feature Histogram (GFPFH) descriptor is introduced, which generates a global object description on the basis of the local FPFH descriptors.

The Spin Images introduced in [4] are another method to describe the local surface, but without using a feature vector. Here a 2D histogram map is calculated over the radial and orthogonal distances between points and a plane tangent to the feature point. The relevant points are detected by rotating a plane around the point's normal and choosing all points which intersect with the plane.

For performing object category recognition in 3D data, the mentioned building blocks need to be combined into a promising approach. A simple localization method is to run a brute-force search by using a sliding window over the depth image of the scene. A check would then be needed for every subwindow to decide whether an object of the searched category is contained. In [8] a similar method is described using multiple RGB-D images. In [3] the Bag-of-Features concept was used and tested in combination with two statistical classifiers: a Naive Bayes classifier [10] and a support vector machine [24]. In [16] the global GFPFH descriptor was used in combination with a Conditional Random Fields [7] classifier; the proposed method was tested on cups and bowls. A method that uses the Spin Images for object category recognition is shown in [5]. This method was especially tested on partly occluded objects and under clutter.

A good approach (and one similar to the approach used in this work) for the recognition of furniture objects is demonstrated in [11]. The authors used artificial 3D models from internet datasets for training and generated a vocabulary (codebook) of typical object parts. The parts were identified by a segmentation step, which investigates the normals of the points in the neighbourhood, and were then described by a set of parameters. For object recognition, they matched the parts detected in the scene against their vocabulary. Each activated vocabulary entry then casts votes for possible object center locations, which is known as probabilistic Hough voting. Finally the results of the Hough voting steps are further validated by fitting models of the training data into the scene points.

2. Approach

This chapter describes the theoretical background of the approach I applied for object category recognition. A detailed description of Implicit Shape Models and their usage for an object categorization task is given in chapter 2.1. Chapter 2.2 shows some general 3D computer vision algorithms which I used for preprocessing the 3D scene data.

2.1 Implicit Shape Model

The implemented approach is based on the method described by Samuele Salti, Federico Tombari and Luigi Di Stefano in [17]. They suggested adapting the 2D Implicit Shape Model (ISM) approach, pointed out in [9], for 3D object categorization. The main benefit of an ISM compared to other statistically based classifiers is that it considers the spatial relationship of the object parts found in the models.

In [9] an Implicit Shape Model for a category C is mathematically defined as ISM(C) = (I_C, P_{I,C}), where I_C is an alphabet of typical local appearances of the selected object category (termed codebook) and P_{I,C} is a spatial probability distribution, which specifies where each codebook entry may be found on an object. So the Implicit Shape Model stores information (frequency and position) about the appearances of typical object regions of the training models. Thinking of a table for example, one typical object region might be a corner of the tabletop. The ISM learns the possible geometrical relationships between such a corner point and an object reference point (the object center for example), i.e. the information where in a table these corner points appear. To realize this, it stores vectors describing the relationship between typical object feature points (like corners) and a unique object reference point, e.g. the object center. If the same relationships can be found in a test object, then the object is assigned the category of the trained ISM. When using a 3D-ISM the stored vote vectors are 3-dimensional, in contrast to the 2-dimensional vectors used in [9].

Since the main aim of the authors in [17] was to show the functioning of their proposed method for object categorization, they could use artificial 3D models from publicly available internet datasets for training and testing.

This section explains the idea of a 3D Implicit Shape Model in general, how it is constructed and how it can be used for object categorization. It is structured as follows: First the training procedure for a 3D-ISM is described to explain how this kind of model is constructed. After that the corresponding recognition procedure is explained to demonstrate how the ISM generally categorizes objects. The second part of the chapter describes in detail the main methods and techniques that are required to construct and use a 3D-ISM, which are the 3D Hough space, the codebook and the 3D descriptor.

2.1.1 Training Procedure

By looking at the training procedure of a 3D-ISM one can get a very good idea of what a 3D-ISM is and how it can be constructed. At first the training models are examined and feature points are detected for each model. Every feature point found is then described using a 3D descriptor, which examines the local environment of the feature point according to some given characteristics, like the points' normals for example. The descriptor stores the information about the environment with respect to a local reference frame computed for the feature point, to finally get a point-of-view invariant description. Then a clustering mechanism is run on the whole set of feature vectors to identify typical appearances in the training models. These typical appearances (being the cluster representatives) can be seen as a kind of codebook, representing all the object parts that are required to construct the objects in the training data. The entries of the codebook are typically termed codewords.

After the codebook has been generated, 3D vectors are computed, pointing from the feature points to the model center, with respect to the local reference frames computed by the 3D descriptor for every feature point. Then all feature vectors of the training models are matched against the codebook according to a codeword activation strategy. A detailed description of the different codeword activation strategies can be found later in section 2.1.3. After that the pre-computed 3D vectors are assigned to exactly those codewords that were activated by their corresponding feature points, resulting in a set of vote vectors related to each codeword. This relationship between the codewords and the sets of vote vectors is just what the ISM consists of. Figure 2.1 shows the training procedure schematically.

Figure 2.1: Training procedure for a 3D-ISM (taken from [17])

2.1.2 Categorization Procedure

To assign an object category to a test model, the procedure shown in figure 2.2 is used, which works as follows: The test features are extracted from the 3D input data and described using the same 3D descriptor that was also used for training. The feature vectors are then matched against the codebook according to the chosen codeword activation strategy. For each codeword match, the assigned vote vectors are retrieved. As these vote vectors were stored in coordinates of the local reference frames, they need to be re-transformed into world coordinates. Finally each activated codeword casts its set of votes for the object center position into a 3D Hough space belonging to the model category. For object categorization, the Hough space containing the bin into which the maximum number of votes across all Hough spaces has fallen (the global maximum) is chosen. The category belonging to this Hough space gives the categorization result. In the case of object category recognition, each local maximum in the Hough spaces gives a detection hypothesis for an object of the respective category, which then needs to be further investigated.

Figure 2.2: Categorization procedure for a 3D-ISM (taken from [17])
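The selection of the global maximum can be made concrete with a small sketch. The following is a minimal illustration of my own (not code from [17] or from this thesis); the names HoughSpace and categorize are hypothetical, and each per-category Hough space is reduced to a map from a flattened bin index to its accumulated vote weight.

#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

// One Hough space per category: bin index -> accumulated vote weight.
using HoughSpace = std::unordered_map<long, float>;

// Return the category owning the bin with the most votes across all spaces.
// (For categorization the global maximum is enough; for recognition every
// local maximum above a threshold would yield a hypothesis instead.)
std::string categorize(const std::map<std::string, HoughSpace>& spaces) {
    std::string best = "none";
    float bestVotes = 0.0f;
    for (const auto& [category, space] : spaces) {
        for (const auto& [bin, votes] : space) {
            if (votes > bestVotes) {
                bestVotes = votes;
                best = category;
            }
        }
    }
    return best;
}

int main() {
    std::map<std::string, HoughSpace> spaces;
    spaces["dining chair"][42] = 17.0f;     // toy values for illustration
    spaces["dining table"][7]  = 25.5f;
    std::cout << categorize(spaces) << "\n"; // prints "dining table"
}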

2.1.3 Codebook

As described in 2.1.1, the codebook consists of the cluster reference vectors of all features that were extracted from the training data. One can distinguish between two codebook types: a global codebook, built from the feature vectors of all considered object categories, and separated codebooks, where every object category uses its own codebook, built by only using the features of the training models belonging to that category.

Another factor that has to be considered when using a codebook is its size, being the number of codewords. As the codewords are the cluster centers of the training feature vectors, the size of the codebook is equal to the number of clusters that are computed by applying standard clustering algorithms like K-Means.

The third factor that plays a very important role is the codeword activation strategy. The codeword activation strategy defines the method that is used to search for a match between the feature vectors and the codewords. In [17] two different codeword activation strategies are presented: a k-nearest Neighbour (k-NN) and a threshold based (cutoff) strategy. The k-nearest Neighbour strategy searches for the k codewords that are nearest to the feature vector of the feature point. This strategy has the advantage that the total number of activated codewords is fixed and hence predictable, being k times the number of feature points. The disadvantage is that every feature point activates exactly k codewords, even if it does not fit any of the codewords very well. The cutoff codeword activation strategy matches all codebook entries whose (euclidean) distance to the feature vector is beneath a pre-defined threshold. Its advantages and disadvantages are just inverse to those of the k-NN strategy: only codewords that are really similar to the feature vector are activated, but it is not predictable how many of these there are. A code sketch of both strategies is given at the end of section 2.1.4.

2.1.4 3D Hough Space Voting

The basic idea of Hough space voting is to find feature points in a given scene and cast votes from these feature points for a unique object reference point. Typically either the object center or the center of mass is used as that unique object reference point; I will use the object center in the further explanations. If enough votes meet in a point or region, meaning that many points vote for the same object center, then the object is considered as detected and its orientation can be computed from the corresponding feature points. Figure 2.3 shows this procedure schematically.

The following procedure is described using an example taken from [22] for the task of object instance recognition, meaning that the challenge is to find an already known object instance in a test scene. At first the object center for the training model is computed. Then the feature points of the model are extracted and described, and the vector between each feature point and the object center is computed. To be rotation and translation invariant, these vectors need to be translated and rotated into a local reference frame computed for each feature point (left side of figure 2.3). These vectors define the votes for the object center position which are needed for the test models later on.

When examining a test scene, feature points of the scene are extracted, described and matched against the feature vectors of the model. If a match is found (green

arrows in the figure), e.g. the euclidean distance between the vectors is beneath a threshold, votes for the object center position are cast with respect to the computed local reference frame of the feature point.

Figure 2.3: Hough Voting based on local reference frames (taken from [22])

To count the number of votes meeting in a location, a 3D Hough space is used, meaning that the 3-dimensional scene space is divided into several equally sized bins. For each bin the number of votes falling into it is determined, indicating the probability that the points which voted into that bin belong together and constitute an object instance of the trained model. Figure 2.4 shows a schema of the 3D Hough space. The green lines show corresponding feature points of the model and the test object. As they belong to the same object, their votes for the object center meet inside the same Hough space bin. The red lines also show corresponding points between the model and the test object, but false ones, so their votes do not fall into the same bin.

Figure 2.4: Hough Voting scheme (taken from [22])

As mentioned before, the method described here is used for object instance recognition, meaning that the task is to find an already known object instance in a test scene. All feature points of the model and the test object can be directly compared

to each other, and one matched feature point always casts just one vote. In contrast, in this work the task is to find object categories, which means that there will be a lot more than one training model, and with it a lot more votes for each detected feature point will be possible, finally depending on the number of training models used. In addition, the usage of a codebook and the possibility to assign many votes to each codeword will increase the possible number of votes even more. To be able to deal with this, the strength of the votes can be influenced by using vote weights.

In [17] two vote weighting strategies are introduced. The first strategy uses the same constant weight (simply 1) for all votes, which they termed Categorization Weights (CW). The second strategy, called Localization Weights (LW), depends on the number of codewords that were activated and the number of vote vectors that are assigned to each activated codeword. The weights for the LW strategy are computed as

ω = (1/M) · (1/|Occ[i]|)

where ω is the vote weight, M is the number of codewords activated by this feature vector and Occ[i] is the set of vote vectors assigned to codeword i. As a result, the sum of all vote weights of each feature point is just 1.

In addition to the two vote weighting strategies proposed in [17], I introduce a third strategy which includes the accuracy with which a codeword was activated by the feature vector. Especially when using k-NN as codeword activation, the usage of the activation accuracy in the vote weights can have significant advantages, since k-NN always activates k codewords, regardless of how well they really fit. If a codebook is trained from feature vectors belonging to chairs, for example, any detected feature point in a scene will activate k codewords, even if it belongs to a totally different object and does not fit any of the codewords very well. The vote weighting strategy I suggest is based on the Localization Weights and adds a factor containing the accuracy of a codeword match. This factor is calculated as 1 − d_i/d_max, where d_i is the euclidean distance between the feature vector and the activated codeword and d_max is the maximum possible distance between the two. Since the feature vectors are normalized, the maximum distance d_max is equal to 2, which would mean that the vectors point in opposite directions. Altogether the weights are then computed as

ω = (1 − d_i/2) · (1/M) · (1/|Occ[i]|)    (2.1)

In the following I will use the term Activation Weights (AW) for this strategy.
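As a hedged sketch of how the two activation strategies and the Activation Weights of equation (2.1) fit together, the following brute-force C++ fragment may help. The thesis implementation uses FLANN for the nearest-neighbour search; all names here (activateKnn, activateCutoff, activationWeight) are illustrative rather than the actual code.

#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

using Feature = std::vector<float>; // normalized descriptor vector

float dist(const Feature& a, const Feature& b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// k-NN activation: the k closest codewords (always exactly k matches;
// assumes k <= codebook size).
std::vector<int> activateKnn(const Feature& f, const std::vector<Feature>& codebook, int k) {
    std::vector<int> idx(codebook.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
        [&](int a, int b) { return dist(f, codebook[a]) < dist(f, codebook[b]); });
    idx.resize(k);
    return idx;
}

// Cutoff activation: all codewords closer than the threshold (0..n matches).
std::vector<int> activateCutoff(const Feature& f, const std::vector<Feature>& codebook, float threshold) {
    std::vector<int> idx;
    for (size_t i = 0; i < codebook.size(); ++i)
        if (dist(f, codebook[i]) < threshold) idx.push_back((int)i);
    return idx;
}

// Activation Weight, eq. (2.1): (1 - d_i/2) * 1/M * 1/|Occ[i]|, with
// d_max = 2 because the feature vectors are normalized.
float activationWeight(float d_i, int M, int occSize) {
    return (1.0f - d_i / 2.0f) * (1.0f / M) * (1.0f / occSize);
}

int main() {
    std::vector<Feature> codebook = {{1, 0}, {0, 1}, {-1, 0}}; // toy 2-d "codewords"
    Feature f = {0.8f, 0.6f};                                  // toy normalized feature
    auto knn = activateKnn(f, codebook, 2);
    float w = activationWeight(dist(f, codebook[knn[0]]), (int)knn.size(), 4 /* hypothetical |Occ[i]| */);
    (void)activateCutoff(f, codebook, 0.75f);
    (void)w;
}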

2.1.5 Descriptor

It can easily be seen that the performance of an ISM heavily relies on a good matching between the codewords and the feature vectors of the detected scene points. So the characteristics of the points must be described as precisely as possible to be able to find good correspondences in the codebook. Furthermore, the description must be repeatable, meaning that the description results stay constant when being computed in multiple runs and/or from different points of view. To reach these goals, several local 3D descriptors were introduced by the 3D computer vision research community.

Local 3D descriptors aim to characterize a feature point by generating a description of the neighbourhood (support) of the point. Therefore the descriptors examine the support according to some given parameters and store the results in a vector, called the descriptor or feature vector. As the geometrical relationship of the point and the points in the support depends on the point of view, the descriptors need to compute a unique local reference frame. The detected characteristics can then be stored according to local coordinates, which finally results in a point-of-view invariant descriptor.

In [17] the hybrid Signature of Histograms of OrienTations (SHOT) descriptor proposed in [23] is used to describe the feature points. This descriptor lays a sphere with a given radius around the feature point and divides the sphere into 32 spatial bins by performing 8 azimuth, 2 elevation and 2 radial divisions. Figure 2.5 shows the division of the sphere into bins (here only 4 azimuth divisions are shown).

Figure 2.5: SHOT sphere (taken from [23])

For each spatial bin a histogram is then computed, giving information about the normals of the points in the bin in comparison to the feature point's normal. Therefore the cosine values of the angles between the feature point's normal and the normal of every point in the support are computed. The cosine values are then sorted into 11 histogram bins, where one histogram is calculated for each of the 32 spatial bins. The usage of cosine values for the histogram binning has the advantage that, when using equally spaced bins, there is a coarser binning of the angles for directions parallel to the feature point's normal and a finer binning for the angles of directions orthogonal to it. This is a consequence of the fact that an equally spaced binning on the cosine values is equivalent to a spatially varying binning on the angles themselves.
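This effect can be illustrated with a few lines of code. The snippet below is only a toy demonstration I added (the bin count of 11 matches the SHOT setup above; the helper name cosineBin is hypothetical):

#include <cmath>
#include <cstdio>

constexpr float kPi = 3.14159265f;

// Map the cosine of the angle between two normals (range [-1, 1]) to one
// of 11 equally spaced histogram bins.
int cosineBin(float cosTheta) {
    int bin = (int)((cosTheta + 1.0f) * 0.5f * 11.0f);
    return bin > 10 ? 10 : bin; // clamp cosTheta == 1.0 into the last bin
}

int main() {
    // Near 0 degrees (cos ~ 1) a 10-degree step barely changes the bin,
    // near 90 degrees (cos ~ 0) the same step moves across bins quickly:
    for (float deg : {0.f, 10.f, 80.f, 90.f, 100.f})
        std::printf("%5.1f deg -> bin %d\n", deg, cosineBin(std::cos(deg * kPi / 180.f)));
    // prints bins 10, 10, 6, 5, 4: coarse near parallel, fine near orthogonal
}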

As the points having a big difference in the direction of the normals are the most informative ones, a finer binning for them improves the description of the point's support. Finally the whole descriptor vector is normalized, so that the description is independent of the total number of points in the support.

To get a point-of-view invariant descriptor, the spatial bins of the SHOT descriptor are arranged according to a local reference frame computed for the feature point. This local reference frame needs to be repeatable and unambiguous to generate the same description independent of the point of view [13]. The chosen method for the computation of the local reference frame for the SHOT descriptor is based on the idea of the computation of the point normals described in 2.2.2, where the eigenvectors of the covariance matrix of all points in the support are investigated. The problem with this approach is that the directions of the axes are ambiguous. While the direction of the normal can be disambiguated by considering the point of view, this cannot be used for computing local reference frames, because the solution must hold without the use of a global coordinate system (since we want a point-of-view invariant descriptor). In addition, it must also work for the remaining two axes. Therefore the SHOT descriptor uses an adapted version of this approach. At first the elements of the covariance matrix are weighted with the distance of the points to the feature point. The covariance matrix is then defined by

C = (1 / Σ_{i: d_i ≤ R} (R − d_i)) · Σ_{i: d_i ≤ R} (R − d_i)(p_i − p)(p_i − p)^T

where R is the radius of the sphere that is used to determine the points in the support and d_i is the euclidean distance between p_i and the feature point p. This increases the repeatability of the local reference frame in the presence of clutter. To disambiguate the axes, the sign of each eigenvector is determined by ensuring that it is coherent with the majority of the points it represents. For this purpose the vectors (p_i − p) between the feature point p and every point in the support p_i are calculated. Then for every such vector the dot product is calculated, once with the positive and once with the negative eigenvector. Finally the total numbers of dot products greater than zero for the positive and the negative eigenvector are compared (greater than zero means an angle of less than 90 degrees, i.e. pointing in the same direction). If the total number for the positive eigenvector is greater, then that one is chosen, otherwise the negative one. This computation is performed for the x- and z-axis. The y-axis is then obtained by x × z.

2.2 Relevant 3D Computer Vision Algorithms

This section describes some basic 3D computer vision algorithms which are relevant for this work.

2.2.1 Uniform Downsampling

To speed up computation and ensure that the point cloud has a given resolution, a uniform downsampling mechanism can be performed. The method used

in this work assembles a 3D grid with a defined edge length over the point cloud and then computes the centroids of the points in each grid bin. The computation of the centroid has the advantage that the underlying surface is approximated more correctly compared to using the center of each bin. In addition it has the effect that the noise in the data is slightly reduced. After this downsampling step, the only points remaining are the centroids.

2.2.2 Normal Computation

Since the SHOT descriptor relies on the normals of the points, a precise computation of the normals is very important. Several approaches exist to estimate the normal of a point on a surface. The one I explain here fits a plane tangent to the point's surface and then computes the plane's normal, which can finally be seen as an approximation of the point's normal. To fit a plane tangent to the point's surface, a principal component analysis (PCA) is performed on the points lying in the neighbourhood of the analysed point. The neighbourhood can either be determined by a k-nearest Neighbour method or by taking all points whose euclidean distance is beneath a threshold. The PCA is based on the intention to examine the variance in the data. Therefore the covariance matrix C is computed for every point as

C = (1/k) Σ_{i=1}^{k} (p_i − p̄)(p_i − p̄)^T

where k is the number of points in the neighbourhood, p_i is the currently chosen (3D) neighbour point and p̄ is the 3-dimensional centroid of all neighbours. Then the three eigenvalues and the corresponding eigenvectors of the covariance matrix C are computed. The eigenvectors for the first and second largest eigenvalues are the vectors that span the fitted plane, and the eigenvector for the third largest (here also the smallest) eigenvalue is the estimated normal, as all three vectors are orthogonal to each other. Figure 2.6 shows the three eigenvectors computed by the PCA graphically.

Figure 2.6: Normal computation via PCA

The PCA can find two possible normals of the fitted plane, because there are two possible vectors orthogonal to the first two eigenvectors, pointing in opposite directions. If the viewpoint (p_v) of the camera is known, one can easily determine the normal that points in the direction of the camera. This normal n for the point p needs to satisfy the equation n · (p_v − p) > 0, because the dot product gives the cosine of the angle between n and the vector (p_v − p) pointing in viewing direction. So if that cosine value is greater than zero, the angle between the vectors is less than

90 degrees. Hence this must be the normal that points towards the camera, whereas the other normal must be the one pointing away (giving a negative cosine value). Figure 2.7 shows the result of the presented algorithm applied to a training model.

Figure 2.7: Normal computation on a training model

2.2.3 Keypoints - Boundary Estimation

In domestic environments a lot of planar surfaces can be found, like walls, floors, tables or kitchen fronts. These surfaces are scanned with the camera, giving a very high number of points that have similar characteristics when described with a typical 3D descriptor like SHOT. Using all these points would mean a lot of computation and memory usage, because every point would need to be described and inserted into the ISM. In addition, tests have shown that using all these points for Hough voting leads to very many false positive recognitions, because every planar surface casts very many votes (as a consequence of identical descriptors being assigned to the same codewords), finally giving a lot of Hough space bins that contain a significant number of votes. To prevent this, an interest point or keypoint detection method is necessary, which detects exactly those points that have very individual characteristics and hence a strong impact on the recognition task.

Furniture objects often consist of planar or almost planar areas; a table or a couch for example has very big planar areas. So interesting points for furniture objects could be the corners or the edges, as they are the ones with the most descriptive information. The boundary points detected in a scene are a good example for that, because they are exactly the points where two or more planes meet or where the object separates from the background. First tests have shown that the boundary estimation method works robustly on artificial training models as well as on captured scenes, compared to other mechanisms.

Boundaries can be found by performing the following algorithm. To detect whether a point is a boundary point or not, at first the neighbourhood of that point is determined, either by a k-nearest Neighbour or a distance threshold method. Then a plane is estimated according to a least-squares method by considering the received

neighbouring points. For every neighbouring point the angle between its normal and the point's estimated plane normal is computed. Finally these resulting angles are examined to detect whether the point is a boundary point: the angles are sorted and the maximum difference between two consecutive angles is determined. If that difference is above a pre-defined threshold, the point is considered to be a boundary point. (A code sketch of this test, combined with the PCA normal estimation, follows at the end of this chapter.)

Figure 2.8: Left: original point cloud; right: detected keypoints

2.2.4 Moving Least Squares Smoothing

The point cloud data coming from 3D sensors often contains a lot of noise, making processing, and especially recognition of objects, very difficult. To reduce the noise in the data, a Moving Least Squares (MLS) algorithm can be performed. This method achieves the noise reduction by approximating the underlying surface of the points. To do so, a plane is fitted to the local surface using principal component analysis. Then a polynomial function is fitted to the set of distances from the points to the surface. Finally the points are projected back onto the estimated surface defined by the polynomial function.
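To make the normal estimation from 2.2.2 and the boundary test from 2.2.3 concrete, here is a compact fragment using Eigen (as the thesis does for its linear algebra). It is my own illustration of the algorithm as described above, not the thesis code; the neighbour gathering is assumed to have happened already, the neighbourhood is assumed non-empty, and angleGapThreshold is a free parameter.

#include <Eigen/Dense>
#include <algorithm>
#include <cmath>
#include <vector>

// Fit a plane to the neighbourhood by PCA and return its normal: the
// eigenvector of the covariance matrix with the smallest eigenvalue (2.2.2).
Eigen::Vector3f pcaNormal(const std::vector<Eigen::Vector3f>& nbrs) {
    Eigen::Vector3f mean = Eigen::Vector3f::Zero();
    for (const auto& p : nbrs) mean += p;
    mean /= (float)nbrs.size();
    Eigen::Matrix3f C = Eigen::Matrix3f::Zero();
    for (const auto& p : nbrs) C += (p - mean) * (p - mean).transpose();
    C /= (float)nbrs.size();
    Eigen::SelfAdjointEigenSolver<Eigen::Matrix3f> es(C);
    return es.eigenvectors().col(0); // eigenvalues are sorted ascending
}

// Boundary test sketch (2.2.3): sort the angles between the fitted plane
// normal and the neighbours' normals; a large gap between two consecutive
// angles marks a boundary point.
bool isBoundary(const std::vector<Eigen::Vector3f>& nbrs,
                const std::vector<Eigen::Vector3f>& nbrNormals,
                float angleGapThreshold) {
    Eigen::Vector3f n = pcaNormal(nbrs);
    std::vector<float> angles;
    for (const auto& nn : nbrNormals)
        angles.push_back(std::acos(std::clamp(n.dot(nn.normalized()), -1.0f, 1.0f)));
    std::sort(angles.begin(), angles.end());
    float maxGap = 0.0f;
    for (size_t i = 1; i < angles.size(); ++i)
        maxGap = std::max(maxGap, angles[i] - angles[i - 1]);
    return maxGap > angleGapThreshold;
}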


3. Implementation

As the aim in [17] was to prove the general functioning of the approach for object categorization, the authors worked completely on artificial 3D models, for training as well as for testing. As a consequence they had complete and fully segmented test models and worked on the task of assigning the models to the right category out of some possible options. When using this method for recognizing object categories in real indoor room scenes containing many different objects, some things have to be adapted. I have to pay attention not to find objects where in fact none of the learned ones is actually present. So I cannot simply take the bin containing the maximum number of votes, because there will always be a maximum, even if the votes come from totally different objects. Instead, I have to use thresholds to find good hypotheses for possible object locations of the requested category. If the number of votes falling into a bin exceeds a certain threshold, there is a hint that these votes were cast from points belonging to the requested object category. For the final decision some post-processing steps are still necessary. The following sections show in detail how I adapted and used this method to recognize object categories in 3D point clouds.

3.1 Hardware

The sensor used to capture the 3D data is the Kinect camera from Microsoft. The camera, which was originally developed as a new input device for the X-Box 360 gaming console, is a widely used sensor in the computer vision research community, due to its very good price-performance ratio. The Kinect mainly delivers two images, a depth image and a color image, both with a resolution of 640 x 480 pixels. The depth image shows at every pixel the distance to the point that it captured. For the measurement of the depth the camera uses a triangulation process. I will describe the general functioning of this triangulation here; a detailed description can be found in [6].

Figure 3.1: Microsoft Kinect

To compute the depth of an image pixel, the Kinect uses an infrared (IR) laser projector and an infrared sensor. The projector emits a known IR pattern consisting of speckles onto the scene. The IR sensor captures the pattern and enables the camera to calculate the distance to the captured pixel. Figure 3.2 shows the mathematical model that is used to calculate the distance. The camera has an image of a reference plane at a known distance stored in memory. It is then able to estimate the depth of every pixel by looking at the shift of the captured speckle from the expected one shown in the reference image. If a speckle is, for example, projected on a plane nearer to the camera than the reference plane, the position of the speckle on the image sensor will be shifted to the left (d) in the direction of the baseline (b).

Figure 3.2: Model of the Kinect triangulation process (taken from [6])

The distance Z_k of a point lying on the object plane k can then finally be computed by

Z_k = Z_0 / (1 + (Z_0 / (f b)) · d)

where Z_0 is the distance from the camera to the reference plane, f is the focal length, b is the base length and d is the measured disparity in image space (Z_0, f and b can be determined by calibration) [6]. This finally results in the depth image that the camera delivers. The driver that is used in the Point Cloud Library [14] is able to further process the depth image and convert it into a 3-dimensional point cloud. For my work I use this point cloud directly, so the input data for my approach finally is a 3D point cloud captured with the Kinect camera.
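As a small worked example of the reconstructed disparity-to-depth relation, the following snippet evaluates it for a few disparities. The calibration constants are invented for illustration; they are not values from the thesis or from [6].

#include <cstdio>

// Depth from measured disparity following the relation above:
// Z_k = Z_0 / (1 + (Z_0 / (f * b)) * d). All values in consistent units.
double kinectDepth(double z0, double f, double b, double d) {
    return z0 / (1.0 + (z0 / (f * b)) * d);
}

int main() {
    // Hypothetical calibration: reference plane at 2.0 m, focal length
    // 580 (pixel units), baseline 0.075 m.
    double z0 = 2.0, f = 580.0, b = 0.075;
    for (double d : {-10.0, 0.0, 10.0})
        std::printf("d = %+5.1f -> Z_k = %.3f m\n", d, kinectDepth(z0, f, b, d));
    // d = 0 returns the reference distance; with this sign convention a
    // positive disparity yields a point closer than the reference plane.
}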

3.2 Software

Point Cloud Library
The Point Cloud Library (PCL) [14] is a library containing algorithms and tools relevant for 3D perception. It includes algorithms for filtering, feature estimation, surface reconstruction, registration, model fitting and segmentation. PCL is developed by a community of engineers and researchers and is released as open source software under the BSD license. Some companies financially support the further development of the library. For my work I use version 1.7, as it is the current version at the time this work was implemented.

FLANN
FLANN (Fast Library for Approximate Nearest Neighbors) [12] is a library containing algorithms for fast nearest neighbour search in high dimensional spaces. In my work it is mainly used for the implementation of the k-NN codeword activation strategy and for the K-Means clustering algorithm.

EIGEN
Eigen [1] is a C++ template library containing algorithms and classes for linear algebra computations like matrices, vectors and numerical solvers. I use Eigen for several matrix and vector operations in my work, like the vote vector computation for example.

3.3 Trainingdata

Usually the set of training data is generated under the same conditions (same sensor, computations, etc.) as the tests are performed later on. When using Implicit Shape Models one has to proceed in a different way. The ISM stores, for each codeword in the codebook, a set of vote vectors to a model reference point (for which I use the object center). As a consequence the correct object center positions need to be known during training, to be able to compute the vote vectors from the feature points to each model center. If scans captured with the Kinect were used as training data, the problem would be that the true object center is not known and cannot be computed either, because the camera just gives a scan from one side of the object. To compute the true object center, a complete 3D model is needed. So one way could be scanning objects from all sides, registering the data to receive a complete 3D model and finally removing all parts that were scanned but do not belong to the intended object, like the ground floor for example. Another possibility is to take artificial 3D models of the object categories for training, which are complete and therefore allow an easy computation of the object center positions. These models are available free of charge in several internet datasets. I decided to use the latter option, because new objects and object categories can be easily integrated, due to the high availability of 3D models in the internet, whereas generating models from a set of captured scans is very time intensive and complex.

When using artificial models, one thing has to be taken care of. As I use the SHOT descriptor, which constructs a sphere around the feature point and investigates the

normals of the points in the environment, the point environments of the training data and the test data (which are the real Kinect scans) need to be comparable. If I used the full artificial models for training, the SHOT descriptors would also include points on the backside of the object. But these points cannot be captured with a camera and hence will not be included in the descriptors of a test object. So to have comparable descriptors I cannot take the full artificial models directly, but have to use simulated scans of the objects instead. Figure 3.3 shows a point cloud of a full model and one of a simulated scan of the same model. It can easily be seen that the point cloud showing the virtual scan does not contain any points on the back- and downside of the chair, because the scan was simulated from the front, whereas the one showing the full model contains all points.

Figure 3.3: Left: full model, right: virtual scan from the front

To generate the scans I used a virtual scanner tool included in the Point Cloud Library. I chose five example models for each category from the Princeton Shape Database [18], rescaled the models to fit true object sizes (the Princeton Shape models are all sized to fit in a unit cube) and generated 12 scans of each model. To have more realistic scans I added a little noise to the data, as there is always noise in scenes captured with a camera. These scans were finally used as training data; only the corresponding model center positions were computed on the full artificial models in advance. Figure 3.4 shows the camera positions from where the scans were simulated.

Figure 3.4: Camera positions for the virtual scans

3.4 Implicit Shape Model

The Implicit Shape Model is implemented as a C++ class. Each instance of this class contains a codebook and a mapping of codeword IDs to sets of vote vectors, so each instance represents a model of one object category. The type of vote weighting strategy that is used for the Hough voting can be set via a variable. The main functionality is realized in two methods: one for generating the model, which means assigning the vote vectors to the codewords, and one for performing the Hough space voting and receiving the vote results. A condensed sketch of such a class is shown below.

To train and use the model for object category recognition, I implemented a class called ISMAPP. This class performs the following tasks:

- starting the application
- reading training files
- grabbing data from the Kinect / reading test data files
- performing the training and detection chain
- post-processing of the results
- visualization of the input data and the results

The following sections describe in detail the realization of the main building blocks needed to use an ISM for object category recognition, which are the codebook, the 3D Hough space and the chosen descriptor.
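The following condensed sketch shows how such a class could look. It is an illustration written for this text, not the thesis code; Keypoint, addVote and castVotes are invented names, and the local reference frame is stored as a rotation matrix whose rows are the frame axes.

#include <Eigen/Dense>
#include <map>
#include <vector>

struct Keypoint {
    Eigen::Vector3f position;
    Eigen::Matrix3f lrf; // local reference frame from the SHOT descriptor
};

class ImplicitShapeModel {
public:
    // Training: store the center-pointing vector in LRF coordinates under
    // every codeword the keypoint's feature vector activated.
    void addVote(int codewordId, const Keypoint& kp, const Eigen::Vector3f& objectCenter) {
        votes_[codewordId].push_back(kp.lrf * (objectCenter - kp.position));
    }

    // Recognition: re-transform the stored votes of an activated codeword
    // into world coordinates, relative to the scene keypoint.
    std::vector<Eigen::Vector3f> castVotes(int codewordId, const Keypoint& kp) const {
        std::vector<Eigen::Vector3f> out;
        auto it = votes_.find(codewordId);
        if (it == votes_.end()) return out;
        for (const auto& v : it->second)
            out.push_back(kp.position + kp.lrf.transpose() * v); // R^-1 = R^T
        return out;
    }

private:
    std::map<int, std::vector<Eigen::Vector3f>> votes_; // codeword ID -> vote vectors
};

int main() {
    ImplicitShapeModel ism;
    Keypoint kp{Eigen::Vector3f(1, 0, 0), Eigen::Matrix3f::Identity()};
    ism.addVote(7, kp, Eigen::Vector3f(0, 0, 0.5f));
    auto votes = ism.castVotes(7, kp); // recovers the original center position
    (void)votes;
}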

3.4.1 Codebook

The codebook has a very important role, because it defines the typical elements the trained Implicit Shape Model consists of. Beyond that, the size of the codebook has a very high influence on the computation time. If the codebook is very small, many votes are assigned to the same codewords during training. If these codewords are later activated, very many votes are cast into the Hough space, which costs a lot of computation time and memory. On the other hand, if the codebook is too big, there may be too few votes to get a significant number of votes concentrated in one bin. So a good size for the codebook has to be determined. As mentioned in chapter 2.1.3, there are two different codebook types: a global codebook and separated codebooks for each object category. I use both types to test which of them performs better for the recognition task. The codebook is implemented as a C++ class containing methods for performing both codeword activation functions (k-NN and cutoff).

Global Codebook

The global codebook is generated from the feature vectors of all training models, across the whole range of considered categories. After extracting and describing the feature points detected in all simulated scans, I received a dataset of 352-dimensional vectors. Then I ran K-Means clustering on this whole dataset to extract reference vectors representing typical feature vectors found in the scans. The cluster reference vectors finally construct the codebook containing the K codewords. To find out how compact the clusters are, I computed the distortion d, which is defined as follows:

d = Σ_{i=1}^{k} Σ_{x_j ∈ S_i} ||x_j − μ_i||²

where x_j is a point assigned to cluster S_i and μ_i is the center of cluster i. The disadvantage of using K-Means clustering is that the number of clusters the algorithm finds is fixed; it is not able to automatically detect how many clusters there really are in the data. To find out what number of clusters is a good choice, I varied the number of clusters the algorithm should find (K) and computed the distortion value for each of them. Figure 3.5 shows the resulting distortion as a function of K.

Figure 3.5: Results of the codebook performance analysis

As the graph does not allow using the elbow method to find a good number for K, I had to choose it manually. I finally decided to take K = 300 clusters, because the distortion does not decrease significantly for values greater than 300. Hence the resulting global codebook consists of 300 codewords. The feature vectors of the training data have an average euclidean distance of about 0.25 to their corresponding cluster centers, which is shown by the distortion.

Separated Codebook

For generating the separated codebooks I used only the scans of the training models of the corresponding category. Again I ran K-Means clustering, setting K to half of the number used for the global codebook: 150. So I finally received one separated codebook for each category I wanted to train, each of them containing 150 codewords.

3.4.2 3D Hough Space Voting

The Hough space itself is implemented in PCL since version 1.6, which I therefore use for my work. It provides methods to generate the Hough space with given bin sizes, cast weighted votes and find global and local maxima. For the two chair categories (office chair and dining chair) I use a fixed bin size, which is big enough to capture votes nearby but also small enough not to contain too many false votes coming from other objects in the scene. For the bigger furniture objects (couch and dining table) I use a larger fixed bin size, because with an increasing vote distance (as it occurs in big furniture objects) the accuracy gets worse, and thus a relatively small bin would be hard to hit and no significant number of votes would be concentrated in one bin. All vote weighting strategies mentioned in section 2.1.4 have been implemented: Categorization Weights, Localization Weights and Activation Weights.
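A minimal stand-in for such a Hough space is sketched below. The thesis relies on the PCL implementation, so this is only meant to make the bin quantization and the threshold criterion (used later in 3.5.2) explicit; all names are illustrative.

#include <Eigen/Dense>
#include <cmath>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// A minimal 3D Hough accumulator: world positions are quantized into cubic
// bins of the given edge length and the vote weights are summed per bin.
class HoughSpace3D {
public:
    explicit HoughSpace3D(float binSize) : binSize_(binSize) {}

    void vote(const Eigen::Vector3f& p, float weight) { bins_[key(p)] += weight; }

    // Every bin whose accumulated weight exceeds the threshold becomes an
    // object-center hypothesis (bin-center position, summed weight).
    std::vector<std::pair<Eigen::Vector3f, float>> hypotheses(float threshold) const {
        std::vector<std::pair<Eigen::Vector3f, float>> out;
        for (const auto& [k, w] : bins_)
            if (w > threshold) out.push_back({center(k), w});
        return out;
    }

private:
    using Key = std::tuple<int, int, int>;
    Key key(const Eigen::Vector3f& p) const {
        return { (int)std::floor(p.x() / binSize_), (int)std::floor(p.y() / binSize_),
                 (int)std::floor(p.z() / binSize_) };
    }
    Eigen::Vector3f center(const Key& k) const {
        return { (std::get<0>(k) + 0.5f) * binSize_, (std::get<1>(k) + 0.5f) * binSize_,
                 (std::get<2>(k) + 0.5f) * binSize_ };
    }
    float binSize_;
    std::map<Key, float> bins_;
};

int main() {
    HoughSpace3D hs(0.2f); // hypothetical 20 cm bins
    hs.vote(Eigen::Vector3f(0.55f, 0.10f, 1.00f), 1.0f);
    hs.vote(Eigen::Vector3f(0.58f, 0.12f, 1.02f), 1.0f); // falls into the same bin
    auto hyps = hs.hypotheses(1.5f);                     // one hypothesis survives
    (void)hyps;
}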

3.4.3 SHOT Descriptor

For the description of the points' characteristics I use the SHOT descriptor explained in chapter 2.1.5. This descriptor is used because of its unambiguous local reference frame computation and the good performance shown in [23]. To be able to compare the descriptors of the training models with the ones found in the test scenes, the descriptors must be computed with the same parameters. One very important parameter of the SHOT descriptor is the radius of the sphere that defines the support of the point. Tests with models and captured scenes showed that a radius of 8 cm is a good choice for describing parts of furniture objects, because it is wide enough to contain important information such as curvatures or corners in the point's support, but still close enough not to interfere with parts of other objects appearing in the test scenes (like a chair standing beside a table for example). For the other parameters the default values implemented in PCL are used, which means 32 spatial bins in the sphere and 11 histogram bins for the cosine values, giving a descriptor vector dimension of 352.

3.5 Training & Category Recognition

This section describes my implementation of the training and the recognition chain.

3.5.1 Training Chain

The training chain to construct an Implicit Shape Model of an object category can be divided into four main parts: keypoint detection and description, codeword activation, vote vector transformation and finally the assignment of the transformed vote vectors to the activated codewords. Figure 3.6 shows the whole process schematically.

Figure 3.6: Implemented training procedure

The detailed process for the keypoint detection and description part can be seen in figure 3.7. At first the keypoints of the given 3D model scan have to be detected and described. As already pointed out in 3.3, I use simulated scans of artificial models for training. These scans were configured to have a resolution comparable to real scans captured with the Kinect camera. In a first step the input scans are uniformly downsampled by applying the method shown in 2.2.1. The grid is chosen with the same bin size (edge length: 0.1 cm) as for the test scenes later on, to ensure equal resolutions on the training and test data. The resulting point cloud contains fewer points and even a bit less noise. After downsampling, the normals of the remaining points are computed. As shown in chapter 2.2.2, the method performs a principal component analysis on the points in the neighbourhood. To define the neighbouring points, a radius of 0.3 cm is used. Then the 3D boundary points of the model scans are estimated by applying the method shown in chapter 2.2.3, using the downsampled point cloud including the estimated normals. The resulting boundary points define the keypoints of the model. Finally, for every detected keypoint the SHOT descriptor (as shown in 2.1.5) including the point's local reference frame is computed. To have a realistic description of the point's characteristics, I use the point cloud that resulted from the

Figure 3.7: Implemented keypoint detection and description for training

Finally I have a set of feature vectors containing a good description of the keypoints detected in the model scan. These feature vectors are then matched against the pre-computed codebook according to the chosen codeword activation strategy. For training the ISMs I decided to use k-NN with k=1 for codeword activation, because each detected keypoint must correspond to exactly one codeword, as the codebook was generated from the feature vectors of the training data. Since the keypoints corresponding to the feature vectors are known, and the object center is known as well, the vote vectors, i.e. the displacement vectors between the keypoints and the model center, can easily be computed. These vote vectors are then translated and rotated into the local reference frames given by the SHOT descriptors. Finally the transformed vectors are assigned to each codeword that was matched by their corresponding feature vectors. In the end the resulting ISM consists of the IDs of the codewords and a set of vote vectors for each codeword ID.
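A minimal sketch of this vote construction for a single training keypoint is given below; the row-wise layout of the LRF axes in the rf field, and the convention that this rotation maps global into local coordinates, are assumptions of the sketch:

```cpp
#include <Eigen/Core>
#include <pcl/point_types.h>

// Sketch: vote vector for one training keypoint, expressed in the SHOT
// local reference frame.
Eigen::Vector3f
voteInLocalFrame (const pcl::PointXYZ &keypoint,
                  const Eigen::Vector3f &model_center,
                  const pcl::SHOT352 &descr)
{
  const Eigen::Vector3f v_global = model_center - keypoint.getVector3fMap ();
  Eigen::Matrix3f lrf;
  lrf << descr.rf[0], descr.rf[1], descr.rf[2],   // x axis
         descr.rf[3], descr.rf[4], descr.rf[5],   // y axis
         descr.rf[6], descr.rf[7], descr.rf[8];   // z axis
  return lrf * v_global;   // rotate the displacement into the local frame
}
```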

Object Category Recognition Chain

The recognition chain can basically be divided into two parts: the hypotheses generation and the hypotheses selection. The first part uses the Implicit Shape Model and the Hough space voting to find good hypotheses for object locations in a scene. The second part then performs further verifications to finally select valid hypotheses. Figure 3.8 shows the whole chain schematically.

Figure 3.8: Implemented recognition procedure

Hypotheses Generation

To find good hypotheses for possible locations of objects of the given category, first, as in the training procedure, the keypoints have to be detected and described. For this, the same uniform downsampling mechanism that was used for the training data is performed, to have an equal resolution for training and test data and to speed up the computation. Then the Moving Least Squares (MLS) smoothing described in 2.2.4 is performed to further reduce the noise in the Kinect scan and obtain a better conditioned point cloud, especially for the following normal computation. For the MLS smoothing the neighbouring points are found by a radius search with a radius of 0.2 cm.
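A minimal sketch of this smoothing step with pcl::MovingLeastSquares is given below; the helper name is illustrative:

```cpp
#include <pcl/point_types.h>
#include <pcl/search/kdtree.h>
#include <pcl/surface/mls.h>

// Sketch of the MLS smoothing step on the downsampled Kinect scan,
// with the 0.2 cm search radius quoted above.
pcl::PointCloud<pcl::PointXYZ>::Ptr
smoothScan (const pcl::PointCloud<pcl::PointXYZ>::ConstPtr &cloud)
{
  pcl::MovingLeastSquares<pcl::PointXYZ, pcl::PointXYZ> mls;
  mls.setInputCloud (cloud);
  mls.setSearchMethod (pcl::search::KdTree<pcl::PointXYZ>::Ptr
                         (new pcl::search::KdTree<pcl::PointXYZ>));
  mls.setSearchRadius (0.002);   // 0.2 cm neighbourhood radius
  pcl::PointCloud<pcl::PointXYZ>::Ptr smoothed (new pcl::PointCloud<pcl::PointXYZ>);
  mls.process (*smoothed);
  return smoothed;
}
```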

The next three steps are equal to the corresponding parts of the training chain: normal computation, boundary estimation to detect the scene keypoints, and then the computation of the SHOT descriptors for the detected keypoints. Figure 3.9 shows the first steps of the recognition process in detail.

Figure 3.9: Implemented keypoint detection and description for the object recognition procedure

The resulting feature vectors are then matched against the codebook according to the chosen codeword activation strategy. For every activated codeword the stored vote vectors are retrieved. Since these vote vectors were stored in coordinates of the local reference frames, they now have to be re-transformed into the global coordinate system. Here the usage of the SHOT descriptor shows its main advantage: the descriptor contains an unambiguous and repeatable local reference frame, allowing the vote vectors to be transformed during training and re-transformed in the recognition phase. Finally the votes are cast into the 3D Hough space described in 2.1.4. Since the recognition chain is designed to recognize object categories in scenes, I cannot simply choose the Hough space bin holding the maximum number of votes, because every scene will contain a bin with a maximum, however high this maximum may be. In addition, this would make the recognition of multiple object instances in a scene impossible. So I use a threshold-based criterion instead: each Hough space bin that contains more votes than the pre-defined threshold gives a good hypothesis for the center location of an instance of the searched object category.
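The following sketch shows this re-transformation and vote casting for a single stored vote; it reuses the row-wise LRF axis convention assumed in the training sketch and the HoughSpace3D interface assumed earlier:

```cpp
#include <Eigen/Core>
#include <pcl/point_types.h>
#include <pcl/recognition/cg/hough_3d.h>

// Sketch: one recognition-side vote. A stored vote vector (in LRF
// coordinates) is rotated back with the scene keypoint's reference frame,
// anchored at the keypoint and cast into the Hough space.
void castVote (pcl::recognition::HoughSpace3D &hough,
               const pcl::PointXYZ &scene_keypoint,
               const pcl::SHOT352 &descr,
               const Eigen::Vector3f &vote_local,
               double weight, int keypoint_id)
{
  Eigen::Matrix3f lrf;
  lrf << descr.rf[0], descr.rf[1], descr.rf[2],
         descr.rf[3], descr.rf[4], descr.rf[5],
         descr.rf[6], descr.rf[7], descr.rf[8];
  const Eigen::Vector3f center =
      scene_keypoint.getVector3fMap () + lrf.transpose () * vote_local;
  const Eigen::Vector3d c = center.cast<double> ();
  hough.vote (c, weight, keypoint_id);   // weighted vote for the object centre
}
```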

Hypothesis Selection

To further investigate and finally decide whether an object should be declared found, some additional verifications need to be performed. First, the keypoints which cast their votes into the chosen bin are determined. Figure 3.10 shows these keypoints (green points) and the votes they cast into the chosen bin (red points).

Figure 3.10: Voting results - green points: keypoints that voted into the chosen bin; red points: resulting vote positions

Then an oriented bounding box and its volume are computed for these keypoints. If this volume lies inside a given min-max range and the maximum edge length of the bounding box is beneath a threshold, an object instance of the category defined by the ISM is declared found at the position surrounded by the bounding box. Figure 3.11 shows the bounding box around the detected object.

Figure 3.11: Bounding box surrounding the detected object
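One way to realize this check is sketched below: the oriented bounding box is approximated by a PCA-aligned box spanned by the covariance eigenvectors of the keypoints. This is an illustrative reconstruction under that assumption, not the exact thesis implementation:

```cpp
#include <Eigen/Eigenvalues>
#include <pcl/common/centroid.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

// Sketch: test the volume and longest edge of a PCA-aligned bounding box
// of the voting keypoints against the category thresholds.
bool validateHypothesis (const pcl::PointCloud<pcl::PointXYZ> &keypoints,
                         double min_volume, double max_volume, double max_edge)
{
  Eigen::Matrix3f cov;
  Eigen::Vector4f mean;
  pcl::computeMeanAndCovarianceMatrix (keypoints, cov, mean);
  Eigen::SelfAdjointEigenSolver<Eigen::Matrix3f> es (cov);
  const Eigen::Matrix3f axes = es.eigenvectors ();  // columns = box axes

  Eigen::Vector3f lo = Eigen::Vector3f::Constant ( 1e9f);
  Eigen::Vector3f hi = Eigen::Vector3f::Constant (-1e9f);
  for (const pcl::PointXYZ &p : keypoints.points)   // project onto the axes
  {
    const Eigen::Vector3f q =
        axes.transpose () * (p.getVector3fMap () - mean.head<3> ());
    lo = lo.cwiseMin (q);
    hi = hi.cwiseMax (q);
  }
  const Eigen::Vector3f edges = hi - lo;
  const double volume = edges.prod ();
  return volume >= min_volume && volume <= max_volume &&
         edges.maxCoeff () <= max_edge;
}
```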


4. Evaluation

This chapter describes how I evaluated the implemented method for object category recognition. Whereas in [17] only artificial models were used for testing, to check the general functioning of the approach, I also need to investigate its performance on real scenes captured with the Kinect. The tests described in this chapter were performed for four object categories: office chair, dining chair, dining table and couch. For each category five models were chosen from the Princeton Shape Database; the chosen models are listed in B.1. The chapter is divided into three sections. In the first section I describe how good thresholds for the Hough voting and for the bounding box parameters were found. In the second section I test whether the applied classifier generally works. For this I perform a baseline evaluation, checking the trained Implicit Shape Models on a test set of artificial models. In the third section I finally show the results of my implementation on real room scenes, including the effects of different codebook types, codeword activation strategies and vote weighting strategies on the category recognition results.

4.1 Finding Good Thresholds

As pointed out in section 3.5, I use a threshold on the votes in the bins to find good hypotheses for the location of an object of the trained category. The height of this threshold has a very big impact on the results of the recognition chain. If it is too high, the procedure will not detect many of the presented objects in the scenes, leading to many false negatives. If it is too low, it will detect objects in locations where in fact no object is present, meaning a lot of false positives. In addition, the true location and embedding of an object in the scene plays a role in the number of votes it generates. If the distance from the camera to the object is very large, for example, the keypoint detection algorithm will find fewer keypoints belonging to the true object, which leads to a reduced number of votes. So the definition of good thresholds is a very challenging task.

I decided to use leave-one-out cross validation to get a good hint for a suitable threshold. I took the models of each category and iteratively chose four for training the ISM and one for testing. Then I counted the number of votes that were cast into the center bin of the test model (which is known because it is an artificial 3D model) and finally determined the average number of votes for each object category. As vote weighting strategy I chose the Localization Weights, because they are independent of the number of models used for training, whereas the Categorization Weights are not. As I had four models for training in the leave-one-out cross validation, but five models for training the full ISM which I finally used for recognition, the results would not have been comparable when using Categorization Weights. In addition I also used the Activation Weights, because they are based on the Localization Weights and hence are also independent of the number of training models. The Localization Weights were tested in combination with the k-NN and cutoff codeword activation strategies, the Activation Weights only in combination with k-NN, because in the cutoff strategy the distance to the activated codewords is already involved. Since the codebook type, the codeword activation strategy and the vote weighting strategy all have significant effects on the number of votes, I had to define different thresholds for each combination of them and for each object category. As the leave-one-out cross validation was performed on the scans of the artificial models, which constitute a kind of perfect artificial situation, the resulting thresholds were too high to be directly portable to the Kinect scans. So I took the standard deviation, calculated from the vote results of all test scans belonging to an object category, and subtracted it from the average value. The result gave the threshold that was used. Table 4.1 shows the resulting thresholds found by the leave-one-out cross validation (standard deviation values already subtracted). For the cutoff method I used a threshold of 0.25, since it fits the average distance of the training feature vectors to the codewords, as shown by the distortion computation in section 3.4. For the k-NN codeword activation I chose k=1.

Table 4.1: Resulting threshold values giving the minimum number of votes expected for the Hough space bins (rows: combinations of GC/SC, 1NN/cutoff activation and LW/AW weighting; columns: office chair, dining chair, dining table, couch)

Another two parameters that are used to further validate the results of the Hough space voting are the volume and the maximum edge length of the oriented bounding box surrounding the keypoints which cast their votes into the chosen bin. All these parameters are calculated from the training models of a category. The maximum edge length is determined as the largest edge length over all bounding boxes of the training models belonging to a category, plus the standard deviation of these per-box maxima. The minimum volume is calculated as the minimum volume of the bounding boxes of the training data minus their standard deviation, and correspondingly the maximum volume as the maximum volume of the training data plus their standard deviation. Table 4.2 shows the resulting values for the minimum and maximum volume and for the maximum edge length of the bounding boxes for each category.

Table 4.2: Resulting parameter values for the bounding box validation (columns: BB Min Volume, BB Max Volume, Max Edge Length; rows: office chair, dining chair, dining table, couch)
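In code, this derivation amounts to a mean-minus-standard-deviation over the leave-one-out vote counts, as the following illustrative helper shows; the bounding-box limits are derived analogously:

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Sketch: per-category Hough threshold from the leave-one-out runs,
// i.e. the mean number of votes in the true centre bin minus one
// standard deviation. The bounding-box limits are widened in the same
// way (min volume - stddev, max volume + stddev, max edge + stddev).
double voteThreshold (const std::vector<double> &center_bin_votes)
{
  const double n    = static_cast<double> (center_bin_votes.size ());
  const double mean = std::accumulate (center_bin_votes.begin (),
                                       center_bin_votes.end (), 0.0) / n;
  double var = 0.0;
  for (double v : center_bin_votes)
    var += (v - mean) * (v - mean);
  return mean - std::sqrt (var / n);
}
```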

4.2 Baseline Evaluation

With the baseline evaluation I checked the general functioning of my implementation. For this I divided the whole data set, consisting of simulated scans of 20 models (4 categories, 5 models per category), into training and testing data. For each model 12 simulated scans were computed, containing views from all sides (front, back, left, right). I chose the scans of 3 models per category as training data and the ones of the other 2 models per category as testing data. Then I trained an ISM with the scans of the 3 models of a category and tested it with the 8 models (2 per category) of the testing data. For the Hough voting and the bounding box parameters the thresholds found in section 4.1 were used. To test the quality of the two main steps in the recognition chain, I first evaluated the hypotheses generation, leaving out the validation steps of the hypotheses selection. After that I tested the whole category recognition chain including the hypotheses selection steps.

Hypotheses Generation

To test the quality of the hypotheses generation part, I left out the validation steps. I used the trained Implicit Shape Models and tested whether there is a vote bin containing more votes than the predefined threshold. Table 4.3 shows the number of times the threshold was exceeded in the scans. Each row shows the trained ISM category and the columns list the categories of the test objects. The results were obtained by using a global codebook, k-NN as codeword activation strategy with k=1 and Localization Weights as vote weighting strategy.

Table 4.3: Hypotheses generation results for 1-NN as codeword activation strategy and Localization Weights as vote weighting strategy (rows: trained ISM category; columns: test object category)

Since 24 scans were used for each test category, the maximum possible number is 24.

Hypotheses Selection

To test the whole recognition procedure, I included the hypotheses selection step and ran the tests again. Table 4.4 shows the resulting confusion matrix, again using a global codebook, k-NN as codeword activation strategy with k=1 and Localization Weights as vote weighting strategy. The numbers in the cells show the categorization results for the 24 scans of each category, so again 24 is the maximum possible number of categorization results.

Table 4.4: Confusion matrix (incl. hypotheses selection) for 1-NN as codeword activation strategy and Localization Weights as vote weighting strategy (rows: trained ISM category; columns: test object category)

Table 4.5 shows the resulting categorization rates in percent, calculated as the number of recognitions (table 4.4) divided by 24 and multiplied by 100.

Table 4.5: Categorization rates in percent (rows: trained ISM category; columns: test object category)

4.3 Evaluation on Indoor Scenes

For the evaluation of the implemented method I used several test scans from the Berkeley Kinect dataset [2]. I chose 12 scans for each object category (shown in B.2) in which a compatible object is included. Then I ran the implemented procedure on the 12 scans with each trained ISM. Figure 4.1 shows screenshots of some example object recognitions in scenes. To evaluate whether the correct object was found in a scene, I had to compare the object center position estimated by the implemented method with the true object center. For this I manually marked the true center points of the objects in the scenes prior to running the evaluation. Then the center of mass of the keypoints constituting the estimated object was calculated and compared to the previously marked true center. If the distance between these two points is less than 1/4 of the maximum edge length of the valid bounding box, I rate the detection as a success (true positive); if the distance is higher, the result is declared a miss (false positive). To prevent the algorithm from finding the same miss more than once, I compared the false positives found in a scene to each other. To decide whether two of them point to the same region, the centers of mass of their keypoints were again checked: if the distance between them was less than 1/4 of the maximum edge length of the valid bounding box, they were declared to point to the same false positive object.

The following graphs show the results obtained by using the different combinations of codebook type, codeword activation strategy and vote weighting strategy. The first bar graphs show the number of true and false positive recognitions. The maximum possible number of true positives is 12, as this is the number of scenes that were used for each category and one object was marked in each scene. The number of false positives is the total number of false positives found in the 12 scenes. The second graphs show the resulting recognition rate r, calculated by

r = \frac{\#\text{true positives}}{\#\text{objects} + \#\text{false positives}}

where #objects is the total number of objects that appear in the scenes (here: 12).

Office Chair

For the office chairs the best results were obtained by using the cutoff codeword activation strategy together with Localization Weights (for the global and the separated codebook), as shown in figures 4.2 and 4.3. Although these two combinations were only able to recognize 8 and 6 of the 12 chairs, they managed to keep the number of false recognitions, with 7 and 4, relatively low compared to other combinations. This resulted in the comparatively high recognition rates.

Figure 4.2: Recognition results for the office chair scenes

Figure 4.3: Recognition rates in % for the office chair scenes

Dining Chair

For the dining chairs the best results were obtained by the combination of global codebook, 1NN codeword activation and Localization Weights and by the combination of global codebook, 1NN codeword activation and Activation Weights, as can be seen in figures 4.4 and 4.5. Both combinations reached recognition rates above 30%.

Figure 4.4: Recognition results for the dining chair scenes

Figure 4.5: Recognition rates in % for the dining chair scenes

Dining Table

The dining tables were best recognized by the combination of global codebook, cutoff codeword activation and Localization Weights, which achieved a recognition rate of 47% (figure 4.7). This was reached by finding 8 of the 12 presented tables while recognizing 5 false positives, as shown in figure 4.6.

Figure 4.6: Recognition results for the dining table scenes

Figure 4.7: Recognition rates in % for the dining table scenes

Couch

Finally, for the couches the best results were obtained by the combination of global codebook, 1NN codeword activation and Activation Weights. As figures 4.8 and 4.9 show, a recognition rate of 40% could be reached by recognizing 10 of the 12 presented couches while also recognizing 13 false positives.

Figure 4.8: Recognition results for the couch scenes

Figure 4.9: Recognition rates in % for the couch scenes

Analysis of the Different Strategies

To find out whether there is a parameter or parameter combination which generally performed exceedingly well, I drew some graphs over the whole recognition results. Figure 4.10 shows the total number of true and false positive recognitions for the different parameter combinations.

Figure 4.10: Total recognition results by varying the parameter combinations

The average recognition rates are shown in figure 4.11.

Figure 4.11: Average recognition results by varying the parameter combinations

Finally, figures 4.12 and 4.13 show the recognition results for the different codebook types (global and separated), independent of the codeword activation and vote weighting strategies.

Figures 4.14 and 4.15 show the recognition results for the different combinations of codeword activation and vote weighting (independent of the codebook type) that were used in the tests.

Figure 4.12: Number of recognitions for the different codebook types

Figure 4.13: Average recognition rates in % for the different codebook types

Figure 4.14: Number of recognitions for the different combinations of codeword activation and vote weighting

Figure 4.15: Average recognition results for the different combinations of codeword activation and vote weighting

Figure 4.1: Screenshots of recognitions - (a) Office Chair scene: img 0434; (b) Dining Chair scene: img 0644; (c) Dining Table scene: img 0637; (d) Couch scene: img 0774


5. Discussion

This chapter presents my interpretation of the results. The next section contains an analysis of the recognition results on artificial models, which were obtained by the baseline evaluation. After that the results of the tests on the real scenes are discussed.

5.1 Baseline results

For the baseline evaluation two tests were performed. In the first test only the hypotheses generation part of the recognition chain was run, to test the functionality of the Implicit Shape Model and the Hough voting. The results show that the right objects were found in most of the cases. The best results were obtained for the office chair, where all 24 test scans were correctly recognized. But they also show that 21 of the dining chair scans led to a concentration of enough votes in the Hough space. Similar results were obtained for the ISM of the dining chair: here 15 of the 24 dining chair scans were correctly categorized, but 14 of the office chair scans were also recognized as dining chairs. So in general the chairs can be recognized quite well, but the differentiation between the office chair and the dining chair is often not possible with the implemented hypotheses generation part alone. The results for the dining table and the couch are similar. Whereas the dining table was correctly found in 13 scans, without any false hypotheses, the couch ISM produced a lot of false hypotheses (20) when a dining table was presented. This effect can be explained by the fact that the thresholds for the votes in the Hough space bins are in general much smaller for the couch than those for the dining table, as shown in table 4.1. So fewer votes are needed in a bin to generate a hypothesis for the couch. On the dining tables in the test data many keypoints were detected, which produced enough votes to form a concentration above the threshold defined for the couch. After that the whole recognition chain was tested on the test data, to see how well the hypotheses selection step can filter out false positive hypotheses. The results show that the mechanism could successfully filter out all the wrong hypotheses for the dining chair and also some for the office chair and the couch. But the false positives the dining tables produced on the couch ISM could not be filtered out, because the geometric parameters used for these two categories are very similar.

5.2 Indoor Scenes results

The results on the indoor scenes show that the implemented method is generally able to recognize object categories in scenes. For every tested object category a combination of codebook type, codeword activation strategy and vote weighting strategy could be found which was able to recognize at least 2/3 of the presented objects. But there is no combination which is best for all considered categories. On average, the combination of global codebook, cutoff codeword activation and Localization Weights gave the best results, which is shown in figure 4.11. Figure 4.10 shows that this result was mainly achieved by keeping the number of false positive recognitions comparatively low. Altogether the number of false positive recognitions is still very high for many of the used combinations.

Global vs. Local Codebook

As shown in figure 4.13, the usage of a global codebook gave on average better results than using separated codebooks. When using separated codebooks, all codewords belong to the trained object category and hence contain some votes, whereas when using a global codebook there are codewords which contain no or only very few votes, because they are representatives of regions of objects belonging to other categories. An explanation for the given results could have been that wrong keypoints, which are not part of the actual object, activated codewords containing a significant number of votes; the probability for that is of course much higher when using separated codebooks. But this would have resulted in an increased number of false positive recognitions. As figure 4.12 shows, the better performance of the global codebook was reached by recognizing more true positives, whereas the false positive recognitions are almost the same. So this explanation does not fit the measured results.

Codeword Activation & Vote Weighting

Figure 4.14 shows that the combinations of cutoff codeword activation together with Localization Weights and of 1NN codeword activation together with Activation Weights gave the best results on average. This can be explained by the consideration of the quality with which a codeword is actually matched, which is integrated in both combinations. Whereas the first combination performs the quality check in the codeword activation step (by using the cutoff strategy), the latter has the quality of the matched codewords integrated in the vote weights. Both options led to a reduced number of false positive recognitions, which can be seen in figure 4.14. As already mentioned in chapter 2, the k-NN codeword activation strategy always activates the best k codewords, regardless of how good the fit really is. This of course leads to wrong keypoints casting votes in the Hough space, which finally results in a high number of false positive recognitions due to randomly occurring vote concentrations. In this case the results confirm the assumptions made.

Additional comments

While performing the tests, some observations were made which explain the received results. The combination of the Implicit Shape Model and the Hough voting promised to be able to recognize and localize an object in one step. If the target objects were spatially separated from other objects, this implicit segmentation could indeed be seen, as shown in figures 4.1a and 4.1b for example. But if the object was included in a group of other objects (figure 4.1c), a clear segmentation could not be reached. Nearby objects often cast votes into the same bins as the target object itself and were consequently also treated as belonging to the object. A cause of this might be that the computed feature vectors do not differ that much: the edges of the planar areas found on chairs and tables, for example, have a similar distribution of normals. As a consequence the same codewords are activated and votes are cast from points which do not belong to the intended object. Problems also occurred when the target object was partly occluded by other objects. Here the detected keypoints of the actual object were not able to cast enough votes to exceed the given threshold. Figure 4.1c also shows a phenomenon that comes with the chosen keypoint detection mechanism: the boundary points of the dining chair were also detected on the wall behind it. These are known as shadow boundaries, similar to the ones addressed in [20], and arise because some parts of the wall were occluded by the chair and hence could not be captured by the camera. The points directly next to the occluded wall parts were then detected as boundary points.


6. Summary & Outlook

In this work a method for detecting furniture objects in indoor room scenes was implemented and evaluated. The method was based on Implicit Shape Models, which represent the geometric relationship of typical object regions of a category. These models were trained using artificial 3D models from publicly available internet databases. The procedures for training and testing, including the preprocessing steps, were explained in detail. The implementation was then tested on simulated scans of artificial models and on a set of 3D point clouds captured with the Kinect camera. Finally the measured results were analysed and discussed.

As the results of the evaluation show, the implemented procedure is generally able to recognize furniture objects in scenes. Especially when the requested object is not occluded and situated relatively near to the camera, the recognition results were very good. But in case the requested object is partly occluded or integrated in a scene with many other objects nearby, the implemented method produces a lot of false positive recognitions, which have to be filtered out by performing further verification steps. To reach this goal, the scans that were used for training could be fitted to the keypoints which created the hypothesis, similar to the model fitting step performed in [11]: if a scan fits well, the hypothesis is selected, if not, it is filtered out.

Furthermore, improvements on

- the training data
- keypoint detection
- feature description
- threshold definition

could result in better recognitions. For training the Implicit Shape Models, virtual scans of artificial 3D models were used, to which little noise was added.

Further developments of virtual scanner tools to produce more realistic Kinect-like scans could improve the quality of the training data and thus lead to better recognition results.

For the keypoint detection I decided to use a method which detects the boundaries of the objects in a scene. The tests on the complex scenes have shown that this method also detects shadow boundaries, similar to the ones pointed out in [20]. Improvements of the used method that prevent the algorithm from detecting shadow boundaries could also have a positive influence on the results. Furthermore, the usage of other keypoint detectors can be investigated.

To train the codebook and to test for codebook matches, the complete feature vector with a dimension of 352 was used. The application of a dimensionality reduction method in advance might have positive effects on computation time and recognition. Furthermore, the usage of different 3D descriptors can be evaluated as well.

I decided to determine the thresholds for the Hough space voting methodically by leave-one-out cross validation. It can be further investigated whether other methods or empirically found thresholds lead to further improvements.

In this work the implemented procedure was tested on furniture objects, which are mainly rigid and symmetrical bodies. How the method performs on non-rigid or asymmetrical bodies is also worth further investigation.

A. List of Abbreviations

1NN - k Nearest Neighbour, with k=1
AW - Activation Weights
cutoff - cutoff threshold
FPFH - Fast Point Feature Histogram
GFPFH - Global Fast Point Feature Histogram
GC - Global Codebook
ISM - Implicit Shape Model
k-NN - k Nearest Neighbour
LRF - Local Reference Frame
LW - Localization Weights
PCA - Principal Component Analysis
PCL - Point Cloud Library
RF - Reference Frame
SC - Separated Codebooks
SHOT - Signature of Histograms of OrienTations


B. Addendum

B.1 Used Models

The following models from the Princeton Shape Database were used for training the ISMs:

B.1.1 Office Chair

m795, m796, m798, m799, m801

B.1.2 Dining Chair

m810, m818, m823, m824, m825

B.1.3 Dining Table

m870, m872, m873, m882, m894

B.1.4 Couch

m829, m830, m833, m835, m842

B.2 Used Testscenes for the Evaluation

The evaluation of the implemented procedure was performed on the following scenes from the Berkeley dataset:

B.2.1 Office Chair

img 0414, img 0423, img 0426, img 0433, img 0434, img 0441, img 0511, img 0584, img 0625, img 0626, img 0627, img 0628

B.2.2 Dining Chair

img 0298, img 0312, img 0313, img 0319, img 0491, img 0570, img 0571, img 0629, img 0639, img 0644, img 0658, img 0836

B.2.3 Dining Table

img 0298, img 0312, img 0313, img 0629, img 0630, img 0634, img 0637, img 0638, img 0645, img 0647, img 0650, img 0658

B.2.4 Couch

img 0298, img 0312, img 0313, img 0359, img 0376, img 0655, img 0668, img 0682, img 0774, img 0775, img 0777, img 0797


Accurate and robust image superresolution by neural processing of local image representations Accurate and robust image superresolution by neural processing of local image representations Carlos Miravet 1,2 and Francisco B. Rodríguez 1 1 Grupo de Neurocomputación Biológica (GNB), Escuela Politécnica

More information

National Performance Evaluation Facility for LADARs

National Performance Evaluation Facility for LADARs National Performance Evaluation Facility for LADARs Kamel S. Saidi (presenter) Geraldine S. Cheok William C. Stone The National Institute of Standards and Technology Construction Metrology and Automation

More information

Analecta Vol. 8, No. 2 ISSN 2064-7964

Analecta Vol. 8, No. 2 ISSN 2064-7964 EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,

More information

Section 1.1. Introduction to R n

Section 1.1. Introduction to R n The Calculus of Functions of Several Variables Section. Introduction to R n Calculus is the study of functional relationships and how related quantities change with each other. In your first exposure to

More information

Epipolar Geometry. Readings: See Sections 10.1 and 15.6 of Forsyth and Ponce. Right Image. Left Image. e(p ) Epipolar Lines. e(q ) q R.

Epipolar Geometry. Readings: See Sections 10.1 and 15.6 of Forsyth and Ponce. Right Image. Left Image. e(p ) Epipolar Lines. e(q ) q R. Epipolar Geometry We consider two perspective images of a scene as taken from a stereo pair of cameras (or equivalently, assume the scene is rigid and imaged with a single camera from two different locations).

More information

A unified representation for interactive 3D modeling

A unified representation for interactive 3D modeling A unified representation for interactive 3D modeling Dragan Tubić, Patrick Hébert, Jean-Daniel Deschênes and Denis Laurendeau Computer Vision and Systems Laboratory, University Laval, Québec, Canada [tdragan,hebert,laurendeau]@gel.ulaval.ca

More information

C# Implementation of SLAM Using the Microsoft Kinect

C# Implementation of SLAM Using the Microsoft Kinect C# Implementation of SLAM Using the Microsoft Kinect Richard Marron Advisor: Dr. Jason Janet 4/18/2012 Abstract A SLAM algorithm was developed in C# using the Microsoft Kinect and irobot Create. Important

More information

Introduction to Robotics Analysis, Systems, Applications

Introduction to Robotics Analysis, Systems, Applications Introduction to Robotics Analysis, Systems, Applications Saeed B. Niku Mechanical Engineering Department California Polytechnic State University San Luis Obispo Technische Urw/carsMt Darmstadt FACHBEREfCH

More information

3-D Object recognition from point clouds

3-D Object recognition from point clouds 3-D Object recognition from point clouds Dr. Bingcai Zhang, Engineering Fellow William Smith, Principal Engineer Dr. Stewart Walker, Director BAE Systems Geospatial exploitation Products 10920 Technology

More information

Factor Analysis. Chapter 420. Introduction

Factor Analysis. Chapter 420. Introduction Chapter 420 Introduction (FA) is an exploratory technique applied to a set of observed variables that seeks to find underlying factors (subsets of variables) from which the observed variables were generated.

More information

Common Core Unit Summary Grades 6 to 8

Common Core Unit Summary Grades 6 to 8 Common Core Unit Summary Grades 6 to 8 Grade 8: Unit 1: Congruence and Similarity- 8G1-8G5 rotations reflections and translations,( RRT=congruence) understand congruence of 2 d figures after RRT Dilations

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information

MAVIparticle Modular Algorithms for 3D Particle Characterization

MAVIparticle Modular Algorithms for 3D Particle Characterization MAVIparticle Modular Algorithms for 3D Particle Characterization version 1.0 Image Processing Department Fraunhofer ITWM Contents Contents 1 Introduction 2 2 The program 2 2.1 Framework..............................

More information

HAND GESTURE BASEDOPERATINGSYSTEM CONTROL

HAND GESTURE BASEDOPERATINGSYSTEM CONTROL HAND GESTURE BASEDOPERATINGSYSTEM CONTROL Garkal Bramhraj 1, palve Atul 2, Ghule Supriya 3, Misal sonali 4 1 Garkal Bramhraj mahadeo, 2 Palve Atule Vasant, 3 Ghule Supriya Shivram, 4 Misal Sonali Babasaheb,

More information

Using Lexical Similarity in Handwritten Word Recognition

Using Lexical Similarity in Handwritten Word Recognition Using Lexical Similarity in Handwritten Word Recognition Jaehwa Park and Venu Govindaraju Center of Excellence for Document Analysis and Recognition (CEDAR) Department of Computer Science and Engineering

More information

Detecting falls and poses in image silhouettes

Detecting falls and poses in image silhouettes Detecting falls and poses in image silhouettes Master s thesis in Complex Adaptive Systems NIKLAS SCHRÄDER Department of Applied Mechanics Division of Vehicle Engineering and Autonomous Systems CHALMERS

More information

Topographic Change Detection Using CloudCompare Version 1.0

Topographic Change Detection Using CloudCompare Version 1.0 Topographic Change Detection Using CloudCompare Version 1.0 Emily Kleber, Arizona State University Edwin Nissen, Colorado School of Mines J Ramón Arrowsmith, Arizona State University Introduction CloudCompare

More information