Training-based Object Recognition in Cluttered 3D Point Clouds

Guan Pang
Department of Computer Science
University of Southern California
Los Angeles, CA, USA

Ulrich Neumann
Department of Computer Science
University of Southern California
Los Angeles, CA, USA

Abstract

Recognition of three-dimensional (3D) objects is a challenging problem, especially in cluttered or occluded scenes. Many existing methods focus on a specific type of object or scene, or require prior segmentation. We describe a robust and efficient general-purpose 3D object recognition method that combines machine learning procedures with 3D local features, without requiring a priori object segmentation. Experiments validate our method on various object types from engineering and street data scans.

Keywords: 3D Object Recognition; 3D Matching; Point Clouds

I. INTRODUCTION

3D object recognition, especially in cluttered scenes, is a challenging problem for several reasons. First, 3D scanners only sample objects at discrete points, so small-scale details are aliased or completely lost. Second, scans may contain significant noise or gaps, degrading recognition performance. Third, many object types are similar in both shape and size or contain repeating patterns. Finally, in cluttered scenes, nearby objects may occlude surfaces or otherwise interfere with the recognition process.

Various methods tackle the 3D object recognition problem, though many focus only on a specific type of object, e.g. cars, trees, or rooftops. Among the methods that handle multiple types of 3D objects, many require segmented input data. For example, 3D object recognition on street LiDAR data [4] requires that the scene be pre-processed based on ground estimation, so that candidate objects are segmented before recognition algorithms are applied. Because many methods rely on a global description of the target object, segmentation is necessary for them.
A few methods [4], [6] use machine learning to select the best description for a specific type of 3D object, so that it can be recognized efficiently and repeatedly in a large scene. These focus on street data, neglecting other applications such as engineering parts recognition [17], [26], where objects are often more densely arranged and segmentation is much harder or impossible.

Our 3D object recognition method is applicable to both sparse street scenes and dense engineering scenes, as shown in Fig. 1. We use a machine learning Adaboost training procedure [23]. Segmentation is not required, and we use local features instead of global features to describe the objects. Our local feature is a 3D Haar-like feature [24], and a 3D summed area table [28] is used for efficient computation. The 3D point cloud is resampled into a 3D image for 3D feature and summed area table calculation. Our method is demonstrated on both dense engineering data and sparse street data containing several different types of objects.

Figure 1. Object recognition from 3D point cloud in both engineering scene and street scene.

Our main contributions include:
- Combining Adaboost training procedures and local features for 3D object recognition.
- Scanning search without a priori segmentation to deal with cluttered object scenes.
- Converting the point cloud into volume data (a 3D image) as the basis for feature computation.

II. RELATED WORK

A. Training-based Methods

As a common object class that has variations within a generally similar size and shape, cars are a typical target for training-based methods. Often a set of car objects is defined first to train either a set of local features [1], [2] or a bag of words [3]. Golovinskiy et al. [4] extend the targets to over twenty types of street objects, using classifiers trained with global features. The most similar to our method is Laga et al. [6], which introduces the Adaboost training framework to select a combination of weighted weak classifiers based on Light Field Descriptors [7]. Training-based methods are also seen in other applications such as shape retrieval [5], segmentation [10], and scene analysis [8], [9].

B. Volumetric Methods

3D volumetric data is useful for feature extraction and description, though it is commonly applied only to 3D shape models or medical applications. Gelfand et al. [11] compute an integral volume descriptor for each surface point and pick those with unique descriptors. Mian et al. [12] describe partial surfaces and 3D models by defining three-dimensional grids over a surface and computing the surface area that passes through each grid cell. Knopp et al. [13] extend the 2D SURF descriptor to 3D shape classification by converting meshes into volumetric data. Features based on volumetric data are also used in medical applications, such as 3D SIFT [14] for CT scan volumes and MSV (3D MSER) [15] for MRIs. Yu et al. [16] provide an evaluation of several interest points on 3D scalar volumetric data.

C. 3D Descriptors

Many local and global 3D shape descriptors have been proposed. The most popular are spin images (SI) [17] and extended Gaussian images [18], followed by various others such as spherical spin images [19], regional point descriptors [20], angular spin images [21], and FPFH [22]. However, existing 3D descriptors are usually designed for mesh models and are hard to adapt to raw point clouds with noisy points and cluttered objects.

III. ALGORITHM INTRODUCTION

A. Detector Training and Detection

To detect objects in 3D point clouds, our algorithm is divided into two modules, a training module and a detection module. We train a detector for each class of object, using the Adaboost training procedure [23] with training samples generated from a pre-labeled object library. The detector exhaustively scans and evaluates the point cloud, returning the maximum positive responses as detected object locations. Figure 2 shows an overall flow diagram of our algorithm.
The object detector consists of N (default N = 30) weak classifiers $c_i$ (Sec. III-C), each with a weight $\alpha_i$. Each weak classifier evaluates a subset of the candidate region and returns a binary decision. The object detector, or strong classifier, is a combination of all weighted weak classifiers $\sum_i \alpha_i c_i$, which is compared to a predetermined threshold $t$ ($= 0.5 \sum_i \alpha_i$ by default) to determine whether the candidate region is a positive match. The difference $\sum_i \alpha_i c_i - t$ is also used to estimate a detection confidence.

The object detector is trained using an Adaboost training procedure [23]. Positive training samples are obtained from pre-labeled library objects by random sampling, with optional additional noise and occlusions (synthesized by selecting a random region in which all points are removed). Negative input samples are produced from negative point cloud regions (regions without the target object) by randomly sampling a subset with the size of the target object.

Figure 2. Flow diagram of our algorithm.

The input of the detection module is a region of the 3D point cloud, which is pre-processed (Sec. III-B) into a 3D summed area table for efficient computation. A 3D detection window is moved across the 3D image, and the detector trained during the training phase evaluates the match between the point cloud cluster within each detection window and the target object. After the scanning window has exhaustively evaluated the input point cloud, all detected positive match instances are further processed by non-maximum suppression to identify the target object with the best match and a confidence above a threshold.

B. Pre-Processing

The raw point cloud data often contains more detail than object detection needs. The spatial distribution of points is usually sufficient for the task. Therefore, we perform a voxelization process, in which we convert the point cloud into volumetric data, or a 3D image.
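As a rough illustration of the voxelization step, the following Python sketch bins a point cloud into a 3D image of per-voxel point counts. The function name `voxelize`, the `voxel_size` parameter, and the use of NumPy are assumptions made for this example and are not taken from the paper. Each point is assigned to exactly one voxel here, which is a simplification.

```python
import numpy as np


def voxelize(points, voxel_size):
    """Bin an (N, 3) array of points into a 3D image of point counts.

    Hypothetical helper for illustration: every point is assigned to a
    single voxel (hard assignment), and only the count per voxel is kept.
    """
    points = np.asarray(points, dtype=float)
    origin = points.min(axis=0)  # corner of the voxel grid
    # Integer voxel index of every point along x, y and z.
    idx = np.floor((points - origin) / voxel_size).astype(int)
    shape = idx.max(axis=0) + 1
    image = np.zeros(shape, dtype=np.float64)
    # np.add.at accumulates correctly when several points share a voxel.
    np.add.at(image, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    return image, origin


demo_points = [[0.0, 0.0, 0.0], [0.1, 0.1, 0.1], [1.5, 0.0, 0.0]]
demo_image, demo_origin = voxelize(demo_points, voxel_size=1.0)
```

In this toy call the first two points fall into the same voxel, so the resulting 2 x 1 x 1 image holds the counts 2 and 1. The point coordinates themselves are no longer needed once the counts are stored.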
Each voxel in the 3D image corresponds to a grid cell of the original point cloud. However, only the number of points within each grid cell is stored at each voxel; all coordinate information of the points is discarded. To smooth the sampling effect during grid conversion, each point contributes to more than one voxel through linear interpolation. In our tests, the size of a grid cell is set to about 1/100 of the average object size.

Feature computation based on the 3D image is accelerated with a 3D version of the summed area table [25], [28], which efficiently computes rectangular features in constant time (e.g. the 3D Haar-like feature in Sec. III-C). The 3D summed area table element at (x, y, z) holds the sum of all elements with coordinates no greater than x, y, z:

$$ii(x, y, z) = \sum_{x' \le x,\; y' \le y,\; z' \le z} i(x', y', z') \qquad (1)$$

where ii(x, y, z) is the 3D summed area table and i(x, y, z) is the original 3D image. Using the recursive equations

$$s(x, y, z) = s(x, y, z-1) + i(x, y, z) \qquad (2)$$
$$ss(x, y, z) = ss(x, y-1, z) + s(x, y, z) \qquad (3)$$
$$ii(x, y, z) = ii(x-1, y, z) + ss(x, y, z) \qquad (4)$$

ii(x, y, z) can be computed from i(x, y, z) in one pass in linear time. (s(x, y, z) and ss(x, y, z) are cumulative sums, with $s(x, y, -1) = 0$, $ss(x, -1, z) = 0$, $ii(-1, y, z) = 0$.) Any 3D rectangular sum is obtained with eight array references (see Fig. 3(a)). 3D Haar-like features (Sec. III-C) require twelve array references for the two neighboring rectangular regions.

Figure 3. (a) Illustration of the 3D summed area table [28]. The eight regions are A, B, C, D, E, F, G, H, respectively. Assume $V_{ijk}$ is the value of the 3D summed area table at $A_{ijk}$ (i, j, k = 0, 1). Then $V_{111} = A+B+C+D+E+F+G+H$, $V_{110} = A+B+C+D$, $V_{101} = A+B+E+F$, $V_{011} = A+C+E+G$, $V_{100} = A+B$, $V_{010} = A+C$, $V_{001} = A+E$, $V_{000} = A$. Thus $H = V_{111} - V_{110} - V_{101} - V_{011} + V_{100} + V_{010} + V_{001} - V_{000}$. (b) Illustration of 2D [24] and 3D Haar-like features. The feature value is the normalized difference between the sum of pixels (voxels) in the bright area and the sum of pixels (voxels) in the shaded area.

C. Features

Each weak classifier is based on a feature computed from the 3D image. The 3D Haar-like feature closely resembles its 2D version proposed in [24]. The feature value is the sum of voxels in one half of the region minus the sum in the other half, with possible orientations aligned to the x, y, or z-axis. Essentially, this feature extracts the boundary of the object, since that is where the two blocks are most distinctive. Figure 3(b) illustrates the computation of the 2D Haar-like feature [24] and its 3D counterpart. The feature may vary in both location and size; features are randomly generated in the weak classifier pool and selected for the best error rate by the Adaboost training procedure (Sec. III-A).

Figure 4 shows the first 5 features selected for the T-junction and oil tank objects. We can see that the features are located on the object surface, thus capturing the object boundary, and that the Adaboost training procedure can select the most distinctive regions of an object as its features.

Figure 4. The first 5 features selected by the Adaboost training procedure. (a) For the T-junction object; (b) For the oil tank object.

One key point of our method is a framework combining the Adaboost training procedure with local features. In this sense, any local feature can be used as long as it can support a weak classifier. As an illustration, we implemented another local feature based on binary occupancy, in addition to the 3D Haar-like feature already introduced. The occupancy feature is computed by converting the 3D image (Sec. III-B) into a binary 3D image through simple thresholding, then using the target object as a template to test the percentage of overlapping grid cells with respect to the matching candidates. A binary AND operation is used to speed up the computation, while a dilate operation is used to make the matching more resistant to noise. Figure 5 illustrates a simple version of the binary feature. Table I provides a performance comparison between the two features.
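The summed area table recursion of Sec. III-B and the half-minus-half Haar-like feature of Sec. III-C can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the half-open box convention `[lo, hi)`, and the zero padding (which stands in for the boundary conditions at index -1) are all choices made for this example. The feature is left unnormalized, and the two box sums are computed independently with sixteen lookups rather than sharing corners to reach twelve.

```python
import numpy as np


def summed_area_table(image):
    """3D summed area table, built with one cumulative sum per axis.

    The three cumulative sums correspond to the recursions for s, ss and
    ii. A zero plane is prepended on every axis, so table[x, y, z] holds
    the sum of image[:x, :y, :z] (an assumed, padded convention).
    """
    s = np.cumsum(image, axis=2)   # accumulate along z
    ss = np.cumsum(s, axis=1)      # then along y
    ii = np.cumsum(ss, axis=0)     # then along x
    return np.pad(ii, ((1, 0), (1, 0), (1, 0)), mode="constant")


def box_sum(table, lo, hi):
    """Sum of image[x0:x1, y0:y1, z0:z1] from eight table lookups."""
    x0, y0, z0 = lo
    x1, y1, z1 = hi
    return (table[x1, y1, z1]
            - table[x0, y1, z1] - table[x1, y0, z1] - table[x1, y1, z0]
            + table[x0, y0, z1] + table[x0, y1, z0] + table[x1, y0, z0]
            - table[x0, y0, z0])


def haar_feature(table, lo, hi, axis):
    """Sum of the first half of the box minus the second half.

    The box [lo, hi) is split at its midpoint along `axis` (0, 1 or 2).
    """
    mid = (lo[axis] + hi[axis]) // 2
    first_hi = list(hi)
    first_hi[axis] = mid
    second_lo = list(lo)
    second_lo[axis] = mid
    return box_sum(table, lo, first_hi) - box_sum(table, second_lo, hi)


demo = np.arange(24).reshape(2, 3, 4)
demo_table = summed_area_table(demo)
whole = box_sum(demo_table, (0, 0, 0), (2, 3, 4))
inner = box_sum(demo_table, (1, 1, 1), (2, 3, 3))
split_z = haar_feature(demo_table, (0, 0, 0), (2, 3, 4), axis=2)
```

Once the table is built, every box sum costs the same number of lookups regardless of box size, which is what makes an exhaustive scanning search affordable.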
The 3D Haar-like feature performs better, so we use it for the remainder of the experiments. Nevertheless, the binary occupancy feature illustrates that other local features may be used in our framework to pursue improved classification performance.

Figure 5. Illustration of the binary occupancy based feature.

Table I. Performance comparison of the 3D Haar-like feature and the binary occupancy based feature.

Scene | Metric | Haar-like Feat. | Binary Occupancy Feat.
Valves in Engineering Scene | Conf. of 2 True Positives | 90%, 60% | 90%, 30%
Valves in Engineering Scene | # False Alarms | 0 | 0
Valves in Engineering Scene | Processing Time | 2 sec | 2 min
Cars in Street Scene | Conf. of 2 True Positives | 100%, 70% | 80%, 60%
Cars in Street Scene | # False Alarms | 8 | 22
Cars in Street Scene | Processing Time | 5 sec | 4 min

IV. ANALYSIS

A. False Alarm Reduction

The negative training samples described in Sec. III-A are randomly cropped, which means they are usually very different from the target object, so the features selected during training are not very discriminative. As a result, the detector often produces a large number of false positive detections.

To reduce the number of false detections, and to train the detector for more discriminative power, we use false detections (relatively hard samples) as additional negative training samples to re-train the detector. The false detections used for re-training are collected purely from other negative scenes that are known not to contain the target object. The re-training process can be repeated several times to further reduce false detections.

Table II. The change in performance as more negative samples are added.

# Negative Samples | # Detections | # False Alarms | Conf. of 2 True Positives | Highest False Detection Conf.
 | | | %, 40% | 50%
 | | | %, 50% | 30%
 | | | %, 20% | 20%
 | | | %, < 0 | < 0

Figure 6 shows a scene in which we want to detect a valve (shown in the top-left corner). Table II shows the change in performance as more negative samples are added. We can see that the number of false detections is gradually reduced, and their maximum confidence is also reduced. One of the detected valves is the same as the reference object; it is always positively identified and keeps a high confidence. The other positive valve is somewhat different from the reference object, since it has a differently shaped base and handle, with a ball attached to it. Its confidence decreases as more negative samples are added, until it is finally identified as negative.
This shows that the detector is more precise and restrictive after re-training. As hard negatives are filtered out, the borderline positive is filtered out as well, which is understandable.

Figure 6. Valve detection result.

B. Speed-Up Processing Time

The exhaustive scanning search introduces a significant amount of computation in the object detection phase. Without any optimization, the original processing speed is about 400 detection windows per second. Two ideas are employed to speed up the process.

Firstly, when a detection window contains too few points, we skip its evaluation, since it cannot contain the target object. Equivalently, this means deciding whether the sum of 3D image voxels within a detection window is smaller than a threshold, so we can use the 3D summed area table (Sec. III-B) to obtain the sum efficiently. This helps raise the processing speed to about detection windows per second, because in a real scene most of the area in the point cloud is actually blank.

Moreover, detection window evaluation, which involves computing features and applying weak classifiers, occupies much of the processing time. As introduced in Sec. III-B, each 3D Haar-like feature requires only twelve array references with the help of the 3D summed area table. As a result, the final optimized processing speed is about windows per second, nearly times or 4 orders of magnitude speed-up from the original.

C. Rotation and Scale

Scale changes are accommodated by resizing the object detector and searching at different scales. Evaluating varying scales uses the same 3D summed area table to speed up the computation, so the extra time introduced by multiple scale searches is linear.

For rotation changes, we constrain our tests to perpendicular rotations (90 or 180 degrees about the x, y, z-axes) in most of our experiments, so the same 3D summed area table can be used. The assumption that rotations are usually 90 or 180 degrees holds for many engineering and street applications.

To cope with arbitrary rotations, a principal direction detector using PCA can be applied at each window position before matching evaluation, as shown in Fig. 7. The detected direction is used to align the candidate object to the same orientation as the library object. However, this renders the 3D summed area table ineffective, so these cases increase processing time.

Figure 7. Principal direction detection by PCA to align objects with arbitrary rotation changes.

V. EXPERIMENTS

To demonstrate the range and performance of our method, we choose data from several intrinsically different sources, including engineering data, street data, and a public 3D object recognition dataset [12] containing 3D models of cluttered objects with high occlusion.

A. Engineering Data

Engineering applications often focus on objects with simple shapes such as T-junctions, valves, etc. The challenge in detecting such objects is that they usually appear connected to or near other objects of similar shape, making it difficult for segmentation algorithms to isolate them and extract global features.

Figure 8 provides some object detection results in engineering scenes for six types of objects. We observe that the chosen objects are all simple structures and positioned in areas cluttered with other objects. The detector manages to identify all of them successfully.

Table III lists statistical results for nine categories of objects from the engineering dataset, including the number of instances of each type of object, the number of total detections, the number of correct detections (true positives) and wrong detections (false alarms), and the recall and precision rates based on these numbers. We see that the detector finds most of the instances, achieving a combined recall rate of 83.8%. Most false alarms are from the T-Junction and Warning Board categories, which have a very generic shape that is easily confused with similar structures, as shown in Fig. 10. False alarms usually have a detection confidence much lower than the true positives, making it possible to filter them out by ranking or adjusting the threshold. Moreover, since we run detectors for different objects in parallel, a low-confidence false alarm of one object that overlaps with a true positive of another object is automatically ignored.

B. Street Data

Street data applications often focus on a specific class of objects, such as trees or cars. Such objects are relatively easily segmented from the ground, so many detection methods recognize the whole cluster using global features. Experiments show that our method recognizes different types of street objects without the prerequisite of segmentation. Figure 9 provides some object detection results in street scenes.
Table IV lists statistical results for five kinds of street objects. All detections are achieved by a scanning search without a segmentation process. Our method locates almost all the street objects, with a combined recall rate of 94.1%, even though the point density is very low and the samples are highly biased to one side. However, as in the previous experiments, generically shaped objects and sparsely sampled data result in several false alarms. Some examples are shown in Fig. 10. As in the engineering data, most of the false alarms have a confidence much lower than the true positives, and thus can possibly be filtered out.

Figure 8. Object detection results in engineering scenes.

Table III. Performance of our detector on each type of objects in the engineering data.

Object Category | # appearance | # detected | # correct | Recall | # false alarms | Precision
Big-based Valve | | | | % | 1 | 75%
Small-based Valve | | | | % | | %
T-Junction | | | | % | | %
Long Handle Tank-top Tube | | | | % | 0 | 100%
Short Handle Tank-top Tube | | | | % | 0 | 100%
Warning Board | | | | % | | %
Lights | | | | % | 2 | 50%
Tube | | | | % | 0 | 100%
Ladder | | | | % | 3 | 40%
Total | | | | % | | %

Figure 9. Object detection results in street scenes.

Table IV. Performance of our detector on each type of objects in the street data.

Object Category | # appearance | # detected | # correct | Recall | # false alarms | Precision
Car | | | | % | | %
Lamp Post | | | | % | 2 | 60%
Tree | | | | % | | %
Roadblock | | | | % | | %
Stop Sign | | | | % | | %
Total | | | | % | | %

Figure 10. (Upper) False alarm examples in engineering scene. (Lower) False alarm examples in street scene.

C. Public Data

The last experiment uses a public dataset [12] for 3D object recognition in cluttered scenes. The dataset has 50 scenes, each containing 4-5 3D models of randomly placed objects. Since the objects are placed together, and the scene is captured only by a single-sided scan, high occlusion and partial data are common. We convert the surfaces of the 3D models into point clouds with a virtual scanner, then apply our detector.

We compare our results to several state-of-the-art algorithms, including spin image [17], tensor matching [12], and keypoint matching [27]. Figure 11 provides recognition results of our algorithm. A recognition rate vs. occlusion graph is plotted as in [27]. Even though our detector is global, the ensemble of locally trained weak classifiers achieves performance comparable to local descriptor matching methods for these highly occluded scenes with very limited partial views.

Figure 11. Left: Object recognition in cluttered scene with high occlusion. Right: Recognition rate vs. occlusion compared to spin image [17], tensor matching [12] and keypoint matching [27].

D. Run-Time Statistics

The optimized object detector can efficiently process large point cloud data. For example, in the first scene of Fig. 8, to detect a T-junction (with points, size of after voxelization) in an engineering scene (with points, size of after voxelization), detection windows are scanned, with of them skipped because they contain too few points. This leaves windows to be evaluated, leading to features and weak classifiers to be computed and evaluated, which cost seconds. The pre-processing of the 3D summed area table itself takes only seconds. The total detection time is seconds. Our complete engineering dataset contains a middle-size industry scene with about 14 million points, which can take less than half an hour to process. In another example, the first scene of Fig. 9, seconds are spent to detect a car (2590 points) in a street scene ( points), with seconds on feature computation and seconds on pre-processing.

VI. CONCLUSION

We describe a general-purpose 3D object recognition method that detects 3D objects in various 3D point cloud scenes, including street scenes and engineering scenes. An Adaboost training procedure is introduced to train detectors with 3D Haar-like features, whose computation receives a significant speed-up from the 3D summed area table. The framework can support other feature types, as illustrated by our implementation of a 3D Haar-like feature and a binary occupancy feature. Our method does not require prior object segmentation, and experiments show good performance on various object types from engineering data, street data, and mesh model data.

ACKNOWLEDGEMENT

This work is supported by Chevron Corp. under the joint project Center for Interactive Smart Oilfield Technologies (CiSoft), at the University of Southern California.

REFERENCES

[1] A. Patterson, P. Mordohai, K. Daniilidis. Object Detection from Large-Scale 3D Datasets using Bottom-up and Top-down Descriptors. ECCV.
[2] M. Himmelsbach, A. Mueller, T. Luettel, H.J. Wuensche. LIDAR-based 3D Object Perception. Proceedings of 1st International Workshop on Cognition for Technical Systems.
[3] A. Velizhev, R. Shapovalov, K. Schindler. Implicit Shape Models for Object Detection in 3D Point Clouds. International Society of Photogrammetry and Remote Sensing Congress.
[4] A. Golovinskiy, V.G. Kim, T. Funkhouser. Shape-based Recognition of 3D Point Clouds in Urban Environments. ICCV, September.
[5] R. Wessel, R. Baranowski, R. Klein. Learning Distinctive Local Object Characteristics for 3D Shape Retrieval. In Proceedings of Vision, Modeling and Visualization (VMV), Oct.
[6] H. Laga and M. Nakajima. A Boosting Approach to Content-based 3D Model Retrieval. In the 5th ACM International Conference on Computer Graphics and Interactive Techniques in Australasia and South East Asia (GRAPHITE), Nov.
[7] D.Y. Chen, X.P. Tian, Y.T. Shen, M. Ouhyoung. On Visual Similarity Based 3D Model Retrieval. Computer Graphics Forum, Vol. 22, Is. 3.
[8] J.F. Lalonde, N. Vandapel, D. Huber, M. Hebert. Natural Terrain Classification using Three-Dimensional Ladar Data for Ground Robot Mobility. Journal of Field Robotics, Vol. 23, No. 10, November.
[9] M. Ruhnke, B. Steder, G. Grisetti, W. Burgard. Unsupervised Learning of Compact 3D Models Based on the Detection of Recurrent Structures. IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct.
[10] D. Anguelov, B. Taskar, V. Chatalbashev, D. Koller, D. Gupta, G. Heitz, A. Ng. Discriminative Learning of Markov Random Fields for Segmentation of 3D Range Data. CVPR.
[11] N. Gelfand, N.J. Mitra, L.J. Guibas, H. Pottmann. Robust Global Registration. Symposium on Geometry Processing.
[12] A. Mian, M. Bennamoun, R. Owens. 3D Model-based Object Recognition and Segmentation in Cluttered Scenes. PAMI, Vol. 28(10).
[13] J. Knopp, M. Prasad, G. Willems, R. Timofte, L. Van Gool. Hough Transform and 3D SURF for robust three dimensional classification. In ECCV, Chersonissos.
[14] G.T. Flitton, T.P. Breckon, N.M. Bouallagu. Object Recognition using 3D SIFT in Complex CT Volumes. British Machine Vision Association, pp. 1-12.
[15] M. Donoser, H. Bischof. 3D segmentation by maximally stable volumes (MSVs). International Conference on Pattern Recognition, pp. 63-66.
[16] T.H. Yu, O.J. Woodford, R. Cipolla. An Evaluation of Volumetric Interest Points. International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT).
[17] A. Johnson and M. Hebert. Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes. PAMI, Vol. 21, No. 5, May.
[18] B.K.P. Horn. Extended Gaussian Images. Proceedings of the IEEE, Vol. 72, Is. 12.
[19] G. Hetzel, B. Leibe, P. Levi, B. Schiele. 3D Object Recognition from Range Images using Local Feature Histograms. CVPR, Vol. 2, pp. II II-399.
[20] A. Frome, D. Huber, R. Kolluri, T. Bulow, J. Malik. Recognizing Objects in Range Data Using Regional Point Descriptors. ECCV, May.
[21] F. Endres, C. Plagemann, C. Stachniss, W. Burgard. Unsupervised Discovery of Object Classes from Range Data using Latent Dirichlet Allocation. Robotics: Science and Systems.
[22] R.B. Rusu, N. Blodow, M. Beetz. Fast Point Feature Histograms (FPFH) for 3D registration. ICRA.
[23] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Eurocolt '95, pp. 23-37.
[24] P.A. Viola and M.J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR (1).
[25] F. Crow. Summed-area tables for texture mapping. Proceedings of SIGGRAPH, 18(3).
[26] S. Jayanti, Y. Kalyanaraman, N. Iyer, K. Ramani. Developing An Engineering Shape Benchmark For CAD Models. Computer-Aided Design, Vol. 38, Is. 9, Sep.
[27] A. Mian, M. Bennamoun, R. Owens. On the Repeatability and Quality of Keypoints for Local Feature-based 3D Object Retrieval from Cluttered Scenes. International Journal of Computer Vision, Vol. 89, Is. 2-3, Sep.
[28] E. Tapia. A note on the computation of high-dimensional integral images. Pattern Recognition Letters, Vol. 32, Is. 2.