MOBILE LOCALIZATION TECHNIQUES. Shyam Sunder Kumar & Sairam Sundaresan
1 MOBILE LOCALIZATION TECHNIQUES Shyam Sunder Kumar & Sairam Sundaresan
2 LOCALISATION The process of identifying the location and pose of a client from sensor inputs. The common approach uses GPS and orientation sensors, but very accurate GPS and orientation sensors are not cheap, so there is significant interest in fast and accurate image-based localisation. Useful for AR purposes.
3 From Structure-from-Motion Point Clouds to Fast Location Recognition
4 INTUITION 3D models impose stronger geometric constraints on scene views and, in particular, yield pose directly. 3D models can be built efficiently from large image collections, and image-based scene recognition and retrieval is also possible in near real time.
5 PROPOSED APPROACH Offline: build representative 3D models for a given scene; index features from images using vocabulary trees for fast retrieval. Online: feature matching, then geometric verification.
6 3D SCENE REPRESENTATION Use SIFT as primary tool to represent point features. For stable points, i.e. points which appear in many images and are matchable, the descriptor list shows redundancy. Hence, the descriptor set can be compressed without loss in registration performance. Mean shift clustering is applied to quantize SIFT descriptors belonging to each point.
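The mean-shift quantization of each point's descriptor list can be sketched as follows. This is a toy, stdlib-only illustration with a flat kernel; the 2-D descriptors and the bandwidth value are assumptions for readability (real SIFT descriptors are 128-D):

```python
def mean_shift_modes(descriptors, bandwidth=1.0, iters=20):
    """Quantize one 3D point's descriptor list into a few representative modes."""
    def shift(p):
        # repeatedly move p to the mean of its neighbours (flat kernel)
        for _ in range(iters):
            near = [d for d in descriptors
                    if sum((a - b) ** 2 for a, b in zip(p, d)) ** 0.5 <= bandwidth]
            p = tuple(sum(c) / len(near) for c in zip(*near))
        return p

    modes = []
    for d in descriptors:
        m = shift(d)
        # keep a mode only if no earlier mode converged to (almost) the same spot
        if all(sum((a - b) ** 2 for a, b in zip(m, q)) ** 0.5 > bandwidth / 2
               for q in modes):
            modes.append(m)
    return modes
```

Five redundant descriptors forming two clusters compress to two representatives, which is the point of the step: the descriptor set shrinks without losing registration-relevant variety.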
7 3D SCENE RECONSTRUCTION Using a fronto-parallel assumption, the scale found in the image can be extrapolated to a 3D scale. This scale is later used to estimate the size of a 3D feature in synthetic views, thereby affecting patch visibility. Each descriptor also carries a directional component pointing towards the camera in which the descriptor was extracted.
8 3D SCENE RECONSTRUCTION
9 SYNTHETIC VIEWS The reconstructed model is represented as a 3D point cloud with associated scale values and feature descriptors. In addition, the set of images used to build the model with known orientation is available. This information allows registration of new views sufficiently close to the original ones. But in order to be able to compute the poses for images taken far from the originally provided set of views, the authors propose the creation of synthetic views located at additional positions not covered by the original images.
10 SYNTHETIC VIEWS Synthetic cameras are placed uniformly on the horizontal plane. Under the assumption of dominant horizontal viewing directions, 12 synthetic views are used with a 30° rotation between cameras. Not all generated synthetic views are really useful: given the 3D position and the respective scale of each triangulated point in the sparse model, the projected feature size in the synthetic images can be estimated, and therefore the visibility of each 3D point can be inferred.
11 CONDITIONS FOR VISIBILITY OF A 3D POINT A 3D point is potentially visible in a synthetic view if the following criteria are met: (1) the projected feature lies in front of the camera and within the viewing frustum; (2) the scale of the projected 3D feature is at least one pixel, to ensure detectability; (3) at least one of the associated descriptors was extracted from an original image with a sufficiently similar viewing direction, owing to the limited repeatability of SIFT.
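The three criteria can be checked in a few lines. A minimal sketch, assuming a pinhole camera with axis-aligned conventions; the function name, the 60° direction threshold, and the simple symmetric-frustum test are illustrative assumptions, not the paper's exact parameters:

```python
import math

def point_visible(X, cam_pos, cam_dir, fov_deg, scale_3d, focal_px, desc_dirs,
                  max_angle_deg=60.0):
    """Check the three visibility criteria for a 3D point in a synthetic view.

    X, cam_pos: 3-vectors; cam_dir and desc_dirs: unit viewing directions.
    scale_3d: metric feature size; focal_px: focal length in pixels.
    """
    v = [x - c for x, c in zip(X, cam_pos)]            # camera -> point
    depth = sum(a * b for a, b in zip(v, cam_dir))
    if depth <= 0:                                     # 1. in front of camera
        return False
    norm = math.sqrt(sum(a * a for a in v))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, depth / norm))))
    if angle > fov_deg / 2:                            # 1. inside the frustum
        return False
    if focal_px * scale_3d / depth < 1.0:              # 2. projects to >= 1 px
        return False
    view_dir = [-a / norm for a in v]                  # point -> camera
    for d in desc_dirs:                                # 3. similar viewing dir
        cos = sum(a * b for a, b in zip(view_dir, d))
        if math.degrees(math.acos(max(-1.0, min(1.0, cos)))) <= max_angle_deg:
            return True
    return False
```

A point 5 m straight ahead of the camera passes all three tests; the same point behind the camera fails the first.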
12 COMPRESSED SCENE REPRESENTATION A reduced set of documents has two major advantages over utilizing the full set of real and synthetic views: the signal-to-noise ratio for the vocabulary tree is increased, since a reduced document set is expected to be more discriminative for its respective scene content, and the smaller database size has a positive impact on run-time efficiency. The overall goal of the compression strategy is to keep a minimal number of documents while still ensuring a high probability of successful registration of new images.
13 COMPRESSED SCENE REPRESENTATION A view V can be successfully registered by a set of 3D points P, if a certain number of 3D points from P is visible in V and has a good spatial distribution in the image. For given sets of 3D documents and views a binary matrix can be constructed, which has an entry equal to one, if the respective document covers the particular view, and zero otherwise. In order to have every view covered by at least one document, a document covers its corresponding view by default.
14 COMPRESSED SCENE REPRESENTATION The objective is to determine a subset of the documents, such that every view is still covered by at least one 3D document. A straightforward greedy approach is used to determine a reduced but representative subset of documents with low time complexity.
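The straightforward greedy approach over the binary coverage matrix is classic greedy set cover. A sketch, assuming the matrix is held as a dict from document id to the set of views it covers (the data layout is an assumption for illustration):

```python
def greedy_cover(doc_covers):
    """Pick a small subset of documents so that every view stays covered.

    doc_covers: dict mapping document id -> set of view ids it covers.
    Greedy set cover: repeatedly take the document that covers the most
    still-uncovered views, until nothing is left uncovered.
    """
    uncovered = set().union(*doc_covers.values())
    chosen = []
    while uncovered:
        best = max(doc_covers, key=lambda d: len(doc_covers[d] & uncovered))
        gain = doc_covers[best] & uncovered
        if not gain:            # safety net: no document helps any more
            break
        chosen.append(best)
        uncovered -= gain
    return chosen
```

The greedy choice runs in low polynomial time and, by the standard set-cover guarantee, stays within a logarithmic factor of the smallest possible document subset.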
15 VIEW REGISTRATION Steps: (1) find potentially matching relevant feature sets using the vocabulary tree; (2) geometric verification (expensive): check possible matches and determine the pose with respect to the 3D model. Verification work is reduced for a performance gain.
16 VOCAB TREE 3 levels with 50 children per node; the leaves contain quantised feature descriptors. An approximate solution is obtained with K*D comparisons. A novel scoring scheme is used. Optimisation: a CUDA-based approach for feature comparisons.
17 BUILDING A VOCABULARY TREE (k=3, L=2). Slide credit: T. Tommasi
18 VOCAB TREE NOVEL SCORING SCHEME Assumptions: if the query Q has fewer features than the document D, then corresponding words in D and Q refer to the same scene content with high probability, and the probability of a mismatch is uniform over the leaves of the vocabulary tree. The score is then obtained as the ratio of expected true positives to expected false positives.
19 POSE EXTRACTION If the camera parameters are available, use fast RANSAC over 3-point correspondences; otherwise, a 4-point perspective pose approach estimates pose and focal length simultaneously.
20 RESULTS
21 RESULTS
22 RESULTS
23 WIDE AREA LOCALISATION ON MOBILE PHONES
24 OVERVIEW Localise the mobile user's 6DOF pose. Approach: localisation using a sparse 3D reconstruction.
25 SYSTEM ORGANISATION Offline: generate sparse reconstructions; feature extraction. Online: localisation via feature extraction, matching, and pose estimation.
26 OFFLINE Structure from Motion (SfM) Image Acquisition Reconstruction Feature Extraction and Triangulation Global Registration Potentially Visible Sets (PVS)
27 SFM IMAGE ACQUISITION 8-megapixel SLR with a large FOV (~90°); pre-calibrated camera.
28 SFM - RECONSTRUCTION Extract SIFT features and coarsely match images using vocabulary trees; the tree is trained on ~2 million features from 2500 images. Each segment (50-300 images) is reconstructed separately.
29 FEATURE EXTRACTION AND TRIANGULATION Extract additional features (more later) and triangulate features from matched image pairs. Suppress as outliers features whose reprojection error exceeds a threshold.
30 RECONSTRUCTION (EXAMPLE)
31 GLOBAL REGISTRATION Combine reconstructed segments Align manually to a 2D floor plan
32 POTENTIALLY VISIBLE SETS Idea: discretise the environment into viewing cells and pre-compute cell-by-cell visibility. Benefit: reduces the data that must be loaded at run time.
33 PVS - ORGANISATION Every cell contains a number of PVS, and at least one PVS points to other cells. Each cell contains its visible features, and each feature remembers all images it occurs in.
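The cell/PVS/feature relationships described above can be sketched as a small data model. The class and field names are assumptions for illustration, not the authors' actual structures:

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    descriptor: bytes                   # ~80-byte descriptor payload
    point3d: tuple                      # triangulated 3D position
    image_ids: list = field(default_factory=list)  # every image it occurs in

@dataclass
class PVS:
    features: list                      # features visible from this cell
    neighbour_cells: list = field(default_factory=list)  # links to other cells

@dataclass
class Cell:
    pvs_list: list                      # every cell holds a number of PVS
```

Keeping the image list on each feature is what lets the matcher later vote per database image, while the neighbour links let tracking spill into adjacent cells.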
34 PVS EXAMPLE
35 ONLINE LOCALISATION Feature extraction, PVS selection, localisation, feature matching. They also use their own online memory management (~5 MB).
36 EXTRACT DESCRIPTORS A proprietary, SURF-based descriptor (~80 bytes/feature), faster than SURF and GPU-SIFT. Tested on a 2.5 GHz Intel Core2 Quad and an NVIDIA GeForce GTX 280. Describing a 640x480 frame takes about 120 ms on the phone and 20 ms on the Core2 Quad; that is 80% of the total localisation time!
37 PVS SELECTION Subselect the PVS based on orientation/scene. Outdoors: GPS. Indoors: WiFi triangulation, Bluetooth, infrared beacons (not sure how). No sensors: user interface. Perform incremental tracking (huh?) and reinitialise if required, using a cue from the previous PVS.
38 POINT MATCHING Two methods: directly matching PVS features with camera-image features, or a vocabulary-tree voting scheme. Neither method is robust to outliers, so RANSAC is used with a 3-point pose hypothesis and up to 50 correspondences.
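The RANSAC loop itself is generic, whatever minimal solver sits inside it. A sketch with the 3-sample hypotheses and 50-correspondence cap from the slide; the actual 3-point pose solver is abstracted into a callable, since a full P3P implementation is beyond a slide-sized example:

```python
import random

def ransac(correspondences, solve_minimal, residual, thresh, iters=100,
           sample_size=3, seed=0):
    """Generic RANSAC: fit a model to random minimal samples and keep the
    hypothesis that explains the most correspondences."""
    rng = random.Random(seed)
    data = correspondences[:50]          # the system caps at 50 matches
    best_model, best_inliers = None, []
    for _ in range(iters):
        sample = rng.sample(data, sample_size)
        model = solve_minimal(sample)
        if model is None:                # degenerate minimal sample
            continue
        inliers = [c for c in data if residual(model, c) < thresh]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```

As a toy stand-in for the pose solver, fitting the mean of three 1-D samples shows the structure: ten consistent values and one gross outlier yield a model driven only by the consistent ten.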
39 EXPERIMENTS
40 EXPERIMENTS
41 EXPERIMENTS
62 LOCATION BASED AUGMENTED REALITY ON MOBILE PHONES
63 IMPLEMENTING AR ON PHONES Typically, object detection and recognition are used to provide information about recognizable objects in the scene, but using markers on objects is invasive. SLAM approaches are better suited to mapping unfamiliar areas, but the maps so generated are not precise enough for AR and localization.
64 SYSTEM OVERVIEW A local database containing several images of the environment is created for use by the AR system. The images are taken at different locations and are used to find the best match to the live cell-phone image. Once the best match is found, features are extracted and point correspondences are established between the two images. From these, the relative pose between the two images is computed, and finally the position and orientation of the cell-phone camera are recovered.
65 SYSTEM OVERVIEW
66 BUILDING THE DATABASE A stereo camera is used to take images of the environment. For each image, the pose of the camera as well as its intrinsic parameters are stored. SURF features are extracted, and the positions of these features as well as their descriptors are stored, along with the 3D position of the image center. The stereo camera provides metric information, which is later used in user localization.
67 SENSORS AND POSE ESTIMATION A Nokia N97 is used for the experiments. It has an accelerometer, a magnetometer and a rotation sensor. The accelerometer provides the second derivative of position, but its data is too noisy; instead, it is used to estimate tilt from the gravitational components projected onto the phone axes when the user is still. The magnetometer measures the rotation around the vertical axis.
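The tilt-from-gravity estimate amounts to two arctangents over the accelerometer axes. A sketch; the axis convention (x/y in the screen plane, z out of the screen) is an assumption, since the slide does not state it:

```python
import math

def tilt_from_gravity(ax, ay, az):
    """Estimate pitch and roll (degrees) from accelerometer readings taken
    while the phone is still, so the sensor measures only gravity."""
    pitch = math.degrees(math.atan2(-ax, math.sqrt(ay * ay + az * az)))
    roll = math.degrees(math.atan2(ay, az))
    return pitch, roll
```

A phone lying flat reads (0, 0, 9.81) and yields zero tilt; standing it on its side rotates the gravity vector into the x axis and the pitch reads -90°. This only works while the user is still, exactly as the slide notes, because any user acceleration is indistinguishable from gravity in a single reading.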
68 REDUCING THE SEARCH SPACE Based on the computed pose, images in the database that are not likely to be seen by the user are discarded. E.g. : Images behind the user. To further reduce the search space, images whose centers are not within the camera field of view are discarded. This reduces the chances of poorly matching configurations. Image descriptors are loaded on demand. Grouping images which belong to the same room, further reduces complexity.
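The field-of-view pruning above reduces to an angle test between the user's viewing direction and the direction to each stored image centre. A sketch; the function name and the 60° FOV default are assumptions:

```python
import math

def candidate_images(user_pos, user_dir, image_centres, fov_deg=60.0):
    """Keep only database images whose 3D centre lies in front of the user
    and inside the camera field of view (this also discards images behind
    the user, whose angle exceeds 90 degrees)."""
    keep = []
    for idx, centre in enumerate(image_centres):
        v = [c - p for c, p in zip(centre, user_pos)]
        norm = math.sqrt(sum(a * a for a in v)) or 1e-9
        cos = sum(a * b for a, b in zip(v, user_dir)) / norm
        cos = max(-1.0, min(1.0, cos))
        if math.degrees(math.acos(cos)) <= fov_deg / 2:
            keep.append(idx)
    return keep
```

With the user at the origin looking along +z, an image centre straight ahead is kept, while centres behind the user or 45° off-axis are discarded before any descriptors are loaded.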
69 IMAGE RETRIEVAL SURF features are matched between the user image and the images in the database: for each feature, the nearest neighbor in the database of image features is picked. Only matches with a low enough distance, or whose ratio between the second-best and best distance is high enough, are selected. The database image with the highest number of matches is then selected for further computation.
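The acceptance rule is the familiar distance-ratio test. A brute-force sketch (real systems use an index rather than sorting all distances; the 0.7 ratio is an illustrative value, not taken from the slide):

```python
def match_ratio_test(query_desc, db_descs, ratio=0.7, max_dist=None):
    """Nearest-neighbour matching with a Lowe-style ratio test: accept a
    match only if the best distance is clearly smaller than the second
    best (optionally also below an absolute distance threshold)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    matches = []
    for qi, q in enumerate(query_desc):
        d = sorted((dist(q, db), di) for di, db in enumerate(db_descs))
        (d1, best), (d2, _) = d[0], d[1]
        if d1 < ratio * d2 and (max_dist is None or d1 <= max_dist):
            matches.append((qi, best))
    return matches
```

A query descriptor close to exactly one database descriptor is accepted; one that is equally close to two database descriptors is rejected as ambiguous, which is the point of the test.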
70 OUTLIER REMOVAL First fit a homography between the two sets of points. The points from one image are projected onto the other image using the computed homography. Points whose projection error is large are discarded as outliers. The remaining points are further refined using RANSAC to remove any outliers which may have passed the homography test. This step is done prior to the pose computation.
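The homography-based filtering step projects each point through the fitted homography and thresholds the residual. A sketch, assuming the homography has already been estimated; the 3-pixel threshold is illustrative:

```python
def apply_h(H, p):
    """Project 2D point p through 3x3 homography H (row-major nested lists)."""
    x = H[0][0] * p[0] + H[0][1] * p[1] + H[0][2]
    y = H[1][0] * p[0] + H[1][1] * p[1] + H[1][2]
    w = H[2][0] * p[0] + H[2][1] * p[1] + H[2][2]
    return (x / w, y / w)

def homography_inliers(H, pts_a, pts_b, thresh=3.0):
    """Keep correspondences whose projection error under H is small;
    the rest are discarded as outliers before pose computation."""
    keep = []
    for a, b in zip(pts_a, pts_b):
        pa = apply_h(H, a)
        err = ((pa[0] - b[0]) ** 2 + (pa[1] - b[1]) ** 2) ** 0.5
        if err <= thresh:
            keep.append((a, b))
    return keep
```

Under a pure-translation homography, a correspondence that agrees with the translation survives and a mismatched one is dropped; the survivors then go through the RANSAC refinement the slide mentions.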
71 POSE COMPUTATION The goal is to find the rotation and translation between the two images. Let Kc and Kd be the calibrated camera matrices of the phone and the stereo camera respectively, ci and di the 2D points in the cell-phone and database images respectively, and Xi a 3D point in the database coordinate system.
72 REPROJECTION MINIMIZATION To ensure that the projected virtual objects match the image content as closely as possible, the reprojection error is minimized using the Levenberg-Marquardt algorithm over 6 parameters (3 rotation + 3 translation). The initialization of R and T is done in two different ways, described next.
73 POSE INITIALIZATION In order to augment the scene accurately, the pose has to be initialized prior to the final minimization mentioned before. In the first method, the estimated rotation from the sensors is used to estimate the translation up to scale. This is done using SVD. The estimated pose is then refined by minimizing the Sampson criterion.
74 POSE INITIALIZATION In the second method, a linearized version of the reprojection error criterion is used. Once again, the rotation estimate obtained from the sensors is used here. It has the advantage that it can be quickly minimized. However, it is less meaningful, because it gives greater weight to points that are farther away from the image center and points which have high depth.
75 EXPERIMENTS The virtual objects used in the experiments were planar rectangles. The cell phone can also be localized in the environment using the proposed method; the pose error is between 10 and 15 cm. On the Nokia N97, SURF features take around 8 seconds to compute, while the rest of the processing takes less than 450 ms.
76 RESULTS
77 CONCLUSIONS Is this system feasible for practical use, at 8+ seconds to produce a result? What are the applications? Museum tours, art-gallery guides. Can this be done more efficiently and more SIMPLY?
More informationHow To Analyze Ball Blur On A Ball Image
Single Image 3D Reconstruction of Ball Motion and Spin From Motion Blur An Experiment in Motion from Blur Giacomo Boracchi, Vincenzo Caglioti, Alessandro Giusti Objective From a single image, reconstruct:
More informationPCL Tutorial: The Point Cloud Library By Example. Jeff Delmerico. Vision and Perceptual Machines Lab 106 Davis Hall UB North Campus. jad12@buffalo.
PCL Tutorial: The Point Cloud Library By Example Jeff Delmerico Vision and Perceptual Machines Lab 106 Davis Hall UB North Campus jad12@buffalo.edu February 11, 2013 Jeff Delmerico February 11, 2013 1/38
More informationBinary Image Scanning Algorithm for Cane Segmentation
Binary Image Scanning Algorithm for Cane Segmentation Ricardo D. C. Marin Department of Computer Science University Of Canterbury Canterbury, Christchurch ricardo.castanedamarin@pg.canterbury.ac.nz Tom
More informationVisual-based ID Verification by Signature Tracking
Visual-based ID Verification by Signature Tracking Mario E. Munich and Pietro Perona California Institute of Technology www.vision.caltech.edu/mariomu Outline Biometric ID Visual Signature Acquisition
More informationRandomized Trees for Real-Time Keypoint Recognition
Randomized Trees for Real-Time Keypoint Recognition Vincent Lepetit Pascal Lagger Pascal Fua Computer Vision Laboratory École Polytechnique Fédérale de Lausanne (EPFL) 1015 Lausanne, Switzerland Email:
More informationCS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen
CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 3: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major
More informationCees Snoek. Machine. Humans. Multimedia Archives. Euvision Technologies The Netherlands. University of Amsterdam The Netherlands. Tree.
Visual search: what's next? Cees Snoek University of Amsterdam The Netherlands Euvision Technologies The Netherlands Problem statement US flag Tree Aircraft Humans Dog Smoking Building Basketball Table
More informationReal Time Target Tracking with Pan Tilt Zoom Camera
2009 Digital Image Computing: Techniques and Applications Real Time Target Tracking with Pan Tilt Zoom Camera Pankaj Kumar, Anthony Dick School of Computer Science The University of Adelaide Adelaide,
More informationTracking Moving Objects In Video Sequences Yiwei Wang, Robert E. Van Dyck, and John F. Doherty Department of Electrical Engineering The Pennsylvania State University University Park, PA16802 Abstract{Object
More informationData Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based
More informationDETECTION OF PLANAR PATCHES IN HANDHELD IMAGE SEQUENCES
DETECTION OF PLANAR PATCHES IN HANDHELD IMAGE SEQUENCES Olaf Kähler, Joachim Denzler Friedrich-Schiller-University, Dept. Mathematics and Computer Science, 07743 Jena, Germany {kaehler,denzler}@informatik.uni-jena.de
More informationApplication of Face Recognition to Person Matching in Trains
Application of Face Recognition to Person Matching in Trains May 2008 Objective Matching of person Context : in trains Using face recognition and face detection algorithms With a video-surveillance camera
More information3D Object Recognition in Clutter with the Point Cloud Library
3D Object Recognition in Clutter with the Point Cloud Library Federico Tombari, Ph.D federico.tombari@unibo.it University of Bologna Open Perception Data representations in PCL PCL can deal with both organized
More informationImage Processing and Computer Graphics. Rendering Pipeline. Matthias Teschner. Computer Science Department University of Freiburg
Image Processing and Computer Graphics Rendering Pipeline Matthias Teschner Computer Science Department University of Freiburg Outline introduction rendering pipeline vertex processing primitive processing
More informationSimilarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
More informationReal Time Baseball Augmented Reality
Department of Computer Science & Engineering 2011-99 Real Time Baseball Augmented Reality Authors: Adam Kraft Abstract: As cellular phones grow faster and become equipped with better sensors, consumers
More informationEXPERIMENTAL EVALUATION OF RELATIVE POSE ESTIMATION ALGORITHMS
EXPERIMENTAL EVALUATION OF RELATIVE POSE ESTIMATION ALGORITHMS Marcel Brückner, Ferid Bajramovic, Joachim Denzler Chair for Computer Vision, Friedrich-Schiller-University Jena, Ernst-Abbe-Platz, 7743 Jena,
More informationGeoImaging Accelerator Pansharp Test Results
GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance
More informationSURVEYING WITH GPS. GPS has become a standard surveying technique in most surveying practices
SURVEYING WITH GPS Key Words: Static, Fast-static, Kinematic, Pseudo- Kinematic, Real-time kinematic, Receiver Initialization, On The Fly (OTF), Baselines, Redundant baselines, Base Receiver, Rover GPS
More informationTHE CONTROL OF A ROBOT END-EFFECTOR USING PHOTOGRAMMETRY
THE CONTROL OF A ROBOT END-EFFECTOR USING PHOTOGRAMMETRY Dr. T. Clarke & Dr. X. Wang Optical Metrology Centre, City University, Northampton Square, London, EC1V 0HB, UK t.a.clarke@city.ac.uk, x.wang@city.ac.uk
More informationTaking Inverse Graphics Seriously
CSC2535: 2013 Advanced Machine Learning Taking Inverse Graphics Seriously Geoffrey Hinton Department of Computer Science University of Toronto The representation used by the neural nets that work best
More informationEdge tracking for motion segmentation and depth ordering
Edge tracking for motion segmentation and depth ordering P. Smith, T. Drummond and R. Cipolla Department of Engineering University of Cambridge Cambridge CB2 1PZ,UK {pas1001 twd20 cipolla}@eng.cam.ac.uk
More informationUSING THE XBOX KINECT TO DETECT FEATURES OF THE FLOOR SURFACE
USING THE XBOX KINECT TO DETECT FEATURES OF THE FLOOR SURFACE By STEPHANIE COCKRELL Submitted in partial fulfillment of the requirements For the degree of Master of Science Thesis Advisor: Gregory Lee
More informationReal-Time Camera Tracking Using a Particle Filter
Real-Time Camera Tracking Using a Particle Filter Mark Pupilli and Andrew Calway Department of Computer Science University of Bristol, UK {pupilli,andrew}@cs.bris.ac.uk Abstract We describe a particle
More informationSpatio-Temporally Coherent 3D Animation Reconstruction from Multi-view RGB-D Images using Landmark Sampling
, March 13-15, 2013, Hong Kong Spatio-Temporally Coherent 3D Animation Reconstruction from Multi-view RGB-D Images using Landmark Sampling Naveed Ahmed Abstract We present a system for spatio-temporally
More informationClassifying Manipulation Primitives from Visual Data
Classifying Manipulation Primitives from Visual Data Sandy Huang and Dylan Hadfield-Menell Abstract One approach to learning from demonstrations in robotics is to make use of a classifier to predict if
More informationReal time vehicle detection and tracking on multiple lanes
Real time vehicle detection and tracking on multiple lanes Kristian Kovačić Edouard Ivanjko Hrvoje Gold Department of Intelligent Transportation Systems Faculty of Transport and Traffic Sciences University
More informationMULTI-LAYER VISUALIZATION OF MOBILE MAPPING DATA
MULTI-LAYER VISUALIZATION OF MOBILE MAPPING DATA D. Eggert, M. Sester Institute of Cartography and Geoinformatics, Leibniz Universität Hannover, Germany - (eggert, sester)@ikg.uni-hannover.de KEY WORDS:
More information3D Human Face Recognition Using Point Signature
3D Human Face Recognition Using Point Signature Chin-Seng Chua, Feng Han, Yeong-Khing Ho School of Electrical and Electronic Engineering Nanyang Technological University, Singapore 639798 ECSChua@ntu.edu.sg
More informationFast and Robust Normal Estimation for Point Clouds with Sharp Features
1/37 Fast and Robust Normal Estimation for Point Clouds with Sharp Features Alexandre Boulch & Renaud Marlet University Paris-Est, LIGM (UMR CNRS), Ecole des Ponts ParisTech Symposium on Geometry Processing
More informationSmart Infrastructure Emerging Trends and Future Opportunities. Tim Wark 7 May 2013
Smart Infrastructure Emerging Trends and Future Opportunities Tim Wark 7 May 2013 INTELLIGENT CITIES SUMMIT 2013 What is the future of our cities? Presentation title Presenter name Page 2 Image: Ilja Musik
More informationEfficient Pose Clustering Using a Randomized Algorithm
International Journal of Computer Vision 23(2), 131 147 (1997) c 1997 Kluwer Academic Publishers. Manufactured in The Netherlands. Efficient Pose Clustering Using a Randomized Algorithm CLARK F. OLSON
More informationCE801: Intelligent Systems and Robotics Lecture 3: Actuators and Localisation. Prof. Dr. Hani Hagras
1 CE801: Intelligent Systems and Robotics Lecture 3: Actuators and Localisation Prof. Dr. Hani Hagras Robot Locomotion Robots might want to move in water, in the air, on land, in space.. 2 Most of the
More informationEnvisor: Online Environment Map Construction for Mixed Reality
Envisor: Online Environment Map Construction for Mixed Reality Category: Research Figure 1: A cylindrical projection of an environment map constructed using Envisor with a camera on a tripod. ABSTRACT
More informationWHITE PAPER. Are More Pixels Better? www.basler-ipcam.com. Resolution Does it Really Matter?
WHITE PAPER www.basler-ipcam.com Are More Pixels Better? The most frequently asked question when buying a new digital security camera is, What resolution does the camera provide? The resolution is indeed
More informationDESIGN & DEVELOPMENT OF AUTONOMOUS SYSTEM TO BUILD 3D MODEL FOR UNDERWATER OBJECTS USING STEREO VISION TECHNIQUE
DESIGN & DEVELOPMENT OF AUTONOMOUS SYSTEM TO BUILD 3D MODEL FOR UNDERWATER OBJECTS USING STEREO VISION TECHNIQUE N. Satish Kumar 1, B L Mukundappa 2, Ramakanth Kumar P 1 1 Dept. of Information Science,
More information