Efficient & Effective Image-Based Localization

Dissertation approved by the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University in fulfillment of the requirements for the degree of Doktor der Naturwissenschaften (Doctor of Natural Sciences), presented by Diplom-Informatiker Torsten Sattler from Düsseldorf, Germany.

Reviewers: Prof. Dr. Leif Kobbelt, Prof. Dr. Bastian Leibe, Prof. Dr. Marc Pollefeys

Date of the oral examination:

This dissertation is available online on the web pages of the university library.
Selected Topics in Computer Graphics, edited by Prof. Dr. Leif Kobbelt, Chair of Computer Science 8 (Computer Graphics & Multimedia), RWTH Aachen University, Volume 11.

Torsten Sattler: Efficient & Effective Image-Based Localization. Shaker Verlag, Aachen 2014.
Bibliographic information published by the Deutsche Nationalbibliothek: the Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at

Zugl.: D 82 (Diss. RWTH Aachen University, 2013)

Copyright Shaker Verlag 2014. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publishers. Printed in Germany.

ISBN
ISSN

Shaker Verlag GmbH, P.O. Box, Aachen
Phone: 0049/2407/ Telefax: 0049/2407/ Internet:
Abstract

The problem of image-based localization is to accurately determine the position and orientation from which a novel photo was taken relative to a 3D representation of the scene. It is encountered in many interesting applications such as pedestrian or robot navigation, Augmented Reality, or Structure-from-Motion, creating a strong need for algorithms that solve the image-based localization problem. In this thesis, we therefore present solutions to this problem that are both effective and efficient, i.e., we propose methods that can localize novel query images taken under a wide range of viewing conditions while requiring only a small amount of processing time. We assume that the 3D scene representation is obtained by using Structure-from-Motion techniques to reconstruct the environment from a set of photos. As a result, we can associate each 3D point with multiple image descriptors modeling the local appearance of the scene around this point. We can then obtain 2D-3D correspondences between 2D feature points in the query image and 3D scene points in the model by solving a descriptor matching problem. These 2D-3D matches can in turn be used to estimate the camera pose of the query image, i.e., the position and orientation from which it was taken. The main difficulty of descriptor matching lies in the sheer size of the problem, since our models contain millions of 3D points while thousands of features are found in our query images. As a major contribution, we show that the resulting descriptor matching problem can still be solved very efficiently using prioritized search. We propose a prioritization scheme that is easy to implement, yet can be expected to perform close to optimal in practice.
By combining our prioritization with a novel active search step that is able to discover additional matches, we derive an image-based localization approach that achieves or surpasses state-of-the-art effectiveness while offering the fastest run-times published so far. Analyzing such direct matching methods, we demonstrate that their major advantage, namely their ability to identify a set of high-quality matches, also prevents them from scaling to larger datasets. Consequently, we also consider image retrieval methods for image-based localization, since they are inherently more scalable. As a second major contribution, we identify the algorithmic factors preventing image retrieval methods from achieving the same effectiveness as our original system and propose a modification that closes the gap in effectiveness without sacrificing scalability.
Zusammenfassung (Summary)

The goal of image-based localization methods is to determine, for a given photo, the position and orientation of the corresponding camera relative to a 3D scene model. The resulting image-based localization problem has many practical applications, such as pedestrian navigation, Augmented Reality, and Structure-from-Motion. In this thesis, we present effective and efficient approaches to solving this problem, i.e., we present methods that can compute the position and orientation of the camera for a wide range of viewpoints and illumination conditions in a short amount of time.

In the following, we assume that the 3D scene model was created by a Structure-from-Motion reconstruction of the environment from a set of images. This allows us to assign multiple feature descriptors to each 3D point, which describe the appearance of the scene around that point. Consequently, we can establish 2D-3D correspondences between feature points in the query image and 3D points in the model with the help of the associated descriptors. These correspondences in turn allow us to compute the position and orientation of the query camera. The main difficulty of descriptor matching lies in the size of the problem at hand, since our scene models contain several million 3D points while thousands of features are found in the query images. As one main contribution of this thesis, we show that even such large matching problems can still be solved efficiently using prioritized search. We present an easy-to-implement prioritization scheme that nevertheless performs close to optimal in practice. We combine our prioritization strategy with a novel approach that actively searches for additional correspondences.

The resulting image-based localization method achieves the fastest run-times published so far while matching or even surpassing other methods in effectiveness. We furthermore show that the great strength of this class of methods, their ability to find high-quality correspondences, at the same time prevents their applicability to arbitrarily large datasets. In the last part of the thesis, we therefore consider more scalable approaches and show how this scalability can be reconciled with efficiency and effectiveness.
Acknowledgments

I want to thank my parents for their never-ending love and support. Without them, none of what I have achieved would have been possible. I thank my advisors, Leif Kobbelt and Bastian Leibe, for providing this opportunity and a nurturing environment in which I could grow both academically and as a person. I am very grateful for all the (technical) discussions and the guidance and help they offered me over all the years. They provided direction when I got lost, and they are the reason I fell in love with Computer Vision and Computer Graphics. I thank all my colleagues at the Computer Graphics and Computer Vision groups, Alex, Alexander, Arne, Darko, David B., David S., Dennis, Dominik, Ellen, Esther, Georgios, H.C., Henrik, Jan, Johannes, Jun, Lars, Lucas, Marcel, Martin, Michael, Mike, Ming, Patrick, Robin, Robert, Sven, Tobias, Volker, and Wolfgang, for the great times that I had over all these years. I will never forget the kart sessions with Darko and Arne, the weekly Friday lunch, and especially the City Reconstruction meetings. A special thanks goes to Jan for the excellent technical support and for putting up with me when I again occupied most of the hard disk space or copied massive amounts of data over the internal network. A very special thank you goes to Dennis Mitzel for all the fun we had at the conferences we visited together. I am even more thankful towards Tobias Weyand for all our discussions, listening to and improving on my ideas, and helping with collecting datasets. I want to thank Ole for being the best HiWi ever. Finally, I want to thank all of my friends, Ann, Bianca, Bert & Manu, Claudia, Dirk & Laura, Elias, Eugen, Henrik, Jan, Jana & Manish, Jan-Thorsten, Mehmet, Melanie, Micha, Michel, Nadine, Oliver, Robert, Sara & Torsten, just to name a few, for their support and patience with me, especially as I too often stayed at work instead of meeting them.
They are a major reason I enjoyed my time in Aachen and I will always remember the times we had. This thesis is dedicated to my brother.
Contents

1. Introduction
   1.1. Solving the Image-Based Localization Problem
   1.2. Related Work
   1.3. Contributions & Overview
2. Foundations
   2.1. Camera Model
   2.2. Structure-from-Motion
   2.3. Local Features
        RootSIFT
        Compact Binary Descriptor Representations
   2.4. Image Retrieval

I. Feature Matching and Robust Pose Estimation

3. Correspondence Search
   3.1. Approximate Nearest Neighbor Search
        kd-tree Search
        Hierarchical k-means Trees
        Other Search Methods
   3.2. Correspondence Search for 3D Models
        Adapting the Ratio Test
        2D-to-3D vs. 3D-to-2D Matching
        Quantized Search
   3.3. Discussion
4. RANSAC-Based Pose Estimation
   4.1. The N-Point Pose Problem
        Calibrated Cameras
        Uncalibrated Cameras
   4.2. RANSAC
        Introduction to RANSAC
   4.3. Spatial Consistent RAndoM SAmple Consensus
        Identifying Possible Outliers
        The Spatial Consistency Check (SCC)
        SCRAMSAC
   4.4. Experimental Results
   4.5. Discussion

II. Large-Scale Localization using Direct Matching

5. Fast Direct 2D-to-3D Matching for Image-Based Localization
   5.1. 2D-to-3D vs. 3D-to-2D Matching for Localization
        2D-to-3D Matching
        3D-to-2D Matching
        Experimental Evaluation
   5.2. Vocabulary-Based Prioritized Search
        A Prioritization Scheme for 2D-to-3D Matching
        Parameter Evaluation
        Comparison With Previous Work
   5.3. Discussion
6. Active Correspondence Search for Direct Matching
   6.1. Active Correspondence Search
        Prioritization
        Efficient Implementation of Quantized 3D-to-2D Matching
        Computational Complexity
        Discussion of Active Search
   6.2. Visibility Filtering
   6.3. Experimental Evaluation
        Parameter Evaluation
        Faster Linear Search Through Cache Consistency
        Localization Accuracy
        Comparison With State-of-the-Art
   6.4. Discussion

III. Scalability of Image-Based Localization Approaches

7. The Scalability of Direct Matching
   7.1. Limitations of the SIFT Ratio Test
   7.2. Better Descriptor Representations
   7.3. Compact Models for Direct Matching
        Building Compact Models
        Experimental Setup
        Using Compact Models to Reduce Memory Requirements
        Evaluating the Scalability of Compact Models
   7.4. Discussion
8. Image Retrieval for Scalable Localization
   8.1. Image Retrieval Revisited
        Image Retrieval for Image-based Localization
        Retrieval vs. Direct Matching
   8.2. Selective Voting
   8.3. Efficient Correspondence Selection
   8.4. Experimental Evaluation
        The Impact of Incorrect Votes
        Correspondence Selection
   8.5. Discussion
9. Conclusion
   9.1. Summary & Contributions
   9.2. Future Work

Bibliography
1. Introduction

Users of modern smartphones are able to quickly determine their current position and orientation using the different sensors embedded in their devices, which enables them to interact with location-based services. For example, a user can learn about nearby places of interest such as touristic landmarks, local attractions, or shopping opportunities by simply looking at a digital map displaying their position and the places of interest. Yet, the camera that is integrated into every modern smartphone could be used to enable a much more powerful way to acquire such location-based information: A future smartphone user could simply take a picture of the object or landmark of interest and receive information on the content of the image, directly superimposed over the photo. For example, a tourist taking a photo of the western entrance of Notre Dame Cathedral could obtain the names of and details about the sculptures carved into the doorways, e.g., in the form of web links to Wikipedia pages, by sending the image to a localization server, which then returns the relevant information about the scene. In order to overlay information onto a photo in a perspectively correct way, the position and orientation of the camera relative to the object of interest needs to be known. Since the accuracy provided by the combination of GPS or WiFi localization and a digital compass is not sufficient for Augmented Reality applications such as the tourist information system outlined above, we have to use the information provided by the image itself to obtain a more precise estimate. In this thesis, we therefore consider the image-based localization problem: Given a photo and a representation of the scene of interest, we seek to determine the position and orientation of the camera, i.e., its pose. This problem is encountered in many interesting Computer Vision tasks.
For example, it is a fundamental step for visual navigation aids for pedestrians, robots, and cars, while Structure-from-Motion systems require estimates of the camera poses to reconstruct the 3D structure of a scene from a set of overlapping photos. Furthermore, recognizing the current location and determining the current position and orientation yields powerful cues for higher-level tasks such as scene understanding or semantic object annotation. In order to compute the camera pose, we need to establish correspondences between 2D pixel positions in the image and structures in the scene. Consequently, we use a visual scene representation, i.e., we extract information about the environment directly from a set of images capturing the appearance of the scene from different viewpoints, since
this approach enables us to directly compare the novel image and the place or object of interest. Consequently, there are two choices for computing the camera pose: We can either determine its position and orientation relative to another image depicting the same object or structure, or we can directly estimate its absolute pose in the coordinate frame of the scene. Naturally, the pose estimation strategy depends on the type of scene representation that is available. In the case that we are given a database of photos, each of which is annotated with a (precise) position and orientation, we prefer to compute the relative camera pose of the novel image. We first determine the database pictures most similar to the query image, which can be done extremely efficiently using image retrieval techniques. We then establish 2D-2D correspondences between 2D pixel coordinates in the query image and 2D coordinates in the retrieved photos which depict the same 3D point in the scene. Given such matches, we can estimate the position and orientation of the query camera relative to the database image and use the known pose of the latter to obtain the camera pose of the query image in the scene. The 2D-2D matches required to compute the relative pose can be established using local features found in both images, where each feature represents a salient point in its image. Two features can then be compared via feature descriptors encoding information about the local image regions surrounding the keypoints, and the result of that comparison yields the desired matches. In the following, we will refer to this approach as image matching. The advantage of image-based localization approaches using image matching is that large photo databases can easily be built from pictures found on photo sharing websites such as Flickr or Panoramio. At the same time, maintaining and expanding the image-based scene representation is also rather simple.
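The final step of this strategy, chaining the estimated relative pose with the known absolute pose of the database image, can be sketched as follows. This is a minimal illustrative example, not the thesis's implementation: the function name and the (R, t) world-to-camera convention are assumptions, and the translation of a relative pose estimated from 2D-2D matches alone is only known up to an unknown scale factor.

```python
import numpy as np

def compose_absolute_pose(R_db, t_db, R_rel, t_rel):
    """Chain a known world-to-database-camera pose (R_db, t_db) with an
    estimated database-to-query relative pose (R_rel, t_rel), where a pose
    (R, t) maps a world point x to camera coordinates via R @ x + t.
    Returns the world-to-query-camera pose."""
    R_q = R_rel @ R_db
    t_q = R_rel @ t_db + t_rel
    return R_q, t_q
```

Given the resulting world-to-query pose (R, t), the camera center in scene coordinates is recovered as C = -Rᵀ t.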
However, the precision of the obtained camera pose estimate strongly depends on the accuracy of the positions of the database images. One approach to acquire precise pose estimates for the database images is to use Structure-from-Motion techniques. Given the relative poses between pairs of database photos, Structure-from-Motion methods simultaneously recover the 3D structure of the scene and compute the absolute camera poses of the database images. The result of this process is a 3D point cloud of the environment, where every point corresponds to multiple local features in the database pictures. Consequently, we can associate every 3D point with its corresponding feature descriptors, which allows us to establish 2D-3D correspondences between features in the query image and the scene points. Based on these matches, we can directly estimate the absolute camera pose of the query image. Besides allowing us to precisely estimate the position and orientation of the query image, 3D models obtained from Structure-from-Motion (SfM) offer a richer representation of the scene than the images alone, as they contain information about both the local appearance of the scene in the form of descriptors and information about the 3D
structure of the environment. In addition, the 3D point clouds also offer a more compact representation than the original images due to the fact that not all local features found in the database photos have corresponding scene points. In this thesis, we therefore consider solutions to the image-based localization problem that use 3D point clouds to model the environment. While the methods presented in this thesis can also be used for indoor localization, we mainly focus on large-scale scenes reconstructed from thousands of images, resulting in 3D models containing millions of points. Since image-based localization is only a single step in a larger processing pipeline for many applications, this thesis aims at developing efficient methods that can easily handle such large point clouds. At the same time, our approaches also have to be as effective and accurate as possible, i.e., they should be able to precisely determine the camera poses for query images taken from a large variety of viewpoints under changing illumination conditions. As a last criterion, we also want our localization methods to be scalable, i.e., increasing the size of the models should not decrease the effectiveness and accuracy of our solutions while having only a slight impact on their efficiency. In the following, we will therefore propose different image-based localization approaches and use them to explore the trade-off between these (partially conflicting) requirements through rigorous experiments. The remainder of this chapter is structured as follows. In the next section, we introduce the two types of solutions to the image-based localization problem that are considered in this thesis. Sec. 1.2 discusses existing image-based localization approaches and their relation to the methods derived in this thesis. Sec.
1.3 then lists the contributions of this thesis and outlines its structure.

1.1. Solving the Image-Based Localization Problem

In order to solve the image-based localization problem, we need to estimate the camera pose of a given query image using a 3D model describing the structure of the environment. As outlined above, we can compute the pose from 2D-3D correspondences between 2D image features and 3D scene points. Since both features and points are associated with descriptors encoding their local appearance, establishing the 2D-3D correspondences results in a descriptor matching problem that is solved using nearest neighbor search. Notice that the feature descriptors are represented as high-dimensional vectors. Consequently, we have to use approximate nearest neighbor search, since the curse of dimensionality prevents search methods from finding the exact nearest neighbors in sub-linear time in high-dimensional spaces. Finding correct correspondences is complicated even further by the sheer size of the descriptor matching problem, since the
3D models considered in this thesis contain millions of points and descriptors while nearly ten thousand features are found in our query images. Yet, modern Structure-from-Motion approaches are able to reconstruct large scenes from thousands of images [ASS 09, COSH11, FGG 10, SPF10], thus creating a strong need for image-based localization approaches that solve the resulting descriptor matching problem efficiently. Essentially, there are two different approaches for establishing 2D-3D correspondences. The first is to directly compare the descriptors of the 2D image features and the 3D scene points, resulting in direct matching approaches that rely solely on the discriminative power of the individual descriptors. The second approach first determines a set of scene points likely to be seen in the query image and restricts the descriptor matching problem to this selected subset. Instead of considering each feature or point individually, indirect matching methods, e.g., image retrieval techniques, thereby aggregate the appearance information of multiple descriptors in order to determine a suitable subset more efficiently. Indirect matching approaches thus avoid the costs associated with solving a large correspondence search problem between millions of points and thousands of image features. While the main challenge for direct matching methods is to solve the matching problem efficiently, indirect methods need to find a trade-off between retrieving relevant parts of the scene efficiently and the risk of selecting irrelevant subsets of points. Fig. 1.1 illustrates the resulting image-based localization pipelines for both approaches. In the following, we discuss the individual advantages and disadvantages of direct and indirect matching. We then consider the problem of robustly estimating the camera pose from the correspondences found by either method.

Direct matching.
Using direct matching, we are given the choice of performing nearest neighbor search either for the 2D features (2D-to-3D matching) or for the 3D points (3D-to-2D matching). 2D-to-3D search thereby computes 2D-3D correspondences on a global scale by considering all points in the model as potential nearest neighbors for a given query feature. Using the so-called SIFT ratio test [Low04], we are able to detect if multiple scene points have descriptors similar to the query feature and reject the resulting matches as too ambiguous. While this approach reduces the number of false positive matches, it also has a rather high chance of rejecting correct matches, for example in the case that architectural details are repeated over multiple building facades. In contrast, 3D-to-2D matching solves a local nearest neighbor search problem, as a 3D point only needs to have a distinct neighbor in the image in order to pass the ambiguity test. This results in a more permissive matching approach that is more likely to accept wrong matches than 2D-to-3D search, since it is prone to accept correspondences for entire sets of points with similar descriptors [LSH10], e.g., points found on repetitive structures in the scene. Since there are multiple orders of magnitude more points in the 3D model than 2D features contained in a query image, matching a single 3D point against an image is inherently more efficient than matching a 2D feature against the 3D model. However, special care has to be taken to avoid matching all 3D points against the image and to prevent finding too many incorrect correspondences. One important result presented in this thesis is that 2D-to-3D search offers a better localization effectiveness than 3D-to-2D search due to its ability to find correspondences of a higher quality (cf. Chap. 5). Using our prioritization strategy for 2D-to-3D search, we are even able to achieve faster localization times than approaches based on 3D-to-2D matching. However, we will also show that state-of-the-art effectiveness can only be achieved by combining both matching directions, enabling us to profit from the complementary properties outlined above (cf. Chap. 6) and increasing the scalability of the resulting methods (cf. Chap. 7).

Figure 1.1.: General outline of image-based localization approaches using (top) direct matching and (bottom) indirect matching, respectively. Direct matching uses the descriptors of the 2D features and 3D points to directly establish the 2D-3D correspondences required for pose estimation. In contrast, indirect matching subdivides the scene into (overlapping) clusters and first identifies those clusters related to the query image before proceeding with feature matching and pose estimation. We visualize the 2D-3D correspondences as lines connecting the camera position and the scene points.

Indirect matching. The database images used to construct the 3D model offer a discrete approximation to the set of all possible viewpoints in the scene. As such, we can partition the model into overlapping parts, each of which contains all points visible together in one of the database images.
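To make the SIFT ratio test for 2D-to-3D matching concrete, a minimal brute-force sketch is given below. This is illustrative only: the function name and the 0.8 threshold are assumptions (0.8 is Lowe's commonly cited choice, not necessarily the value used in this thesis), and for models with millions of descriptors the exhaustive search shown here must be replaced by approximate, prioritized search.

```python
import numpy as np

def match_2d_to_3d(query_descs, point_descs, ratio=0.8):
    """Match each 2D query feature descriptor against all 3D point
    descriptors and keep only matches passing the SIFT ratio test,
    i.e., whose nearest neighbor is clearly closer than the second one."""
    matches = []
    for f, q in enumerate(query_descs):
        dists = np.linalg.norm(point_descs - q, axis=1)  # distance to every 3D point
        i1, i2 = np.argsort(dists)[:2]                   # two nearest neighbors
        if dists[i1] < ratio * dists[i2]:                # reject ambiguous matches
            matches.append((f, int(i1)))                 # (feature index, point index)
    return matches
```

Note how the direction matters: 3D-to-2D matching would reverse the roles, comparing each 3D point descriptor against the much smaller set of query descriptors, so its ambiguity test is only local to the image.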
The resulting localization approach, proposed by Irschara et al., thus first identifies database photos taken from viewpoints similar to the query image using an efficient image retrieval method [IZFB09]. In a second step, it
obtains 2D-3D correspondences by matching the 3D points observed in the corresponding database images against the 2D query features. One main advantage of this method is that keeping the storage-intensive feature descriptors found in the database images in memory is not required for the initial retrieval step. Consequently, indirect matching approaches are inherently more scalable than direct matching methods, since the latter need to keep the descriptors in memory at all times. However, the fact that the image retrieval step does not require the SIFT ratio test to determine visually similar database images is even more important. With increasing model size, the individual point descriptors become less and less distinctive, increasing the chance that the ratio test employed by 2D-to-3D search rejects too many correct matches as too ambiguous to allow pose estimation. At the same time, models containing more and more 3D points with similar descriptors increase the false positive matching rate of 3D-to-2D search to a level that does not allow efficient localization anymore. Yet, this decrease in the discriminative power of the individual descriptors has a smaller impact on image retrieval. As we will demonstrate in Part III, this property is crucial for truly scalable localization approaches. However, indirect matching approaches also have some disadvantages compared to direct search methods. First of all, the query image has to share enough visual overlap with the retrieved database photos in order to yield enough 2D-3D matches to facilitate camera pose estimation, a restriction that does not apply to direct matching. Secondly, the retrieval stage of indirect matching can be prone to select irrelevant database images instead of photos depicting the same part of the scene as the query image.
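The retrieval stage of such indirect approaches can be sketched with a simple inverted-file voting scheme. This is a hypothetical minimal example, not the method of [IZFB09]: real systems quantize descriptors against a large visual vocabulary, use tf-idf-style weighting, and typically re-rank the top candidates with spatial verification.

```python
from collections import defaultdict

def retrieve_images(query_words, inverted_file, idf, k=2):
    """Accumulate idf-weighted votes for database images that share
    visual words with the query image, and return the k best-scoring
    images. inverted_file maps a visual word to the images containing it."""
    scores = defaultdict(float)
    for word in query_words:                       # quantized query features
        for image in inverted_file.get(word, ()):  # images seeing this word
            scores[image] += idf.get(word, 0.0)    # rarer words vote stronger
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Only the 3D points observed in the returned images then need to be matched against the query features, which is what keeps the per-query matching problem small.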
This is confirmed by the experimental results presented in Part II, which will show that direct matching achieves a significantly better localization effectiveness than indirect methods. As one main contribution of this thesis, we identify the reasons for this difference in effectiveness and show how to improve indirect search approaches to achieve state-of-the-art results.

Camera pose estimation. Although most wrong correspondences can be rejected using the SIFT ratio test, we cannot avoid incorrect matches completely. Thus, image-based localization methods employ robust estimation techniques to ensure that the computed camera pose is unaffected by these outlier matches. The estimation technique most commonly used is the RANdom SAmple Consensus (RANSAC) algorithm by Fischler and Bolles [FB81]. RANSAC iteratively generates a camera pose hypothesis from a subset of all matches, which is verified using all correspondences. The pose consistent with the largest number of matches is then chosen as the best estimate. In order to avoid testing all possible subsets of a fixed size, RANSAC randomly selects the correspondences and terminates the estimation process if the probability of missing a better pose is below a predefined threshold. The number of samples required to achieve a given confidence level grows exponentially in the percentage of wrong matches. As a result,