Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite

Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite
Philip Lenz (Karlsruhe Institute of Technology)
Andreas Geiger (Max Planck Institute for Intelligent Systems)
Christoph Stiller (Karlsruhe Institute of Technology)
Raquel Urtasun (Toyota Technological Institute at Chicago)

The Challenge of Autonomous Cars
State of the art: localization, path planning, obstacle avoidance; heavy use of 3D laser scanners and detailed maps (depth/reflectance maps for 3D localization, detailed road maps for path planning)
Problems for computer vision: stereo, optical flow, localization; object detection, recognition and tracking; semantic segmentation, 3D scene understanding

The KITTI Vision Benchmark Suite
Two stereo rigs (54 cm baseline, 90° opening angle)
Velodyne laser scanner, GPS+IMU localization
6 hours of recordings at 10 frames per second

Sensor Calibration Challenges
Camera-to-camera calibration and Velodyne-to-camera registration: Geiger et al., ICRA 2012
GPS-to-Velodyne registration: ICP + hand-eye calibration
[Figure: chain of rigid-body transformations T between the GPS/IMU unit, the Velodyne and the cameras]
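
The registration results boil down to a chain of rigid-body transforms that maps points between the sensor frames. The sketch below is only illustrative: matrix names, calibration values and the projection step are assumptions (4x4 homogeneous transforms and a 3x4 camera projection matrix), not the official KITTI development kit API.

```python
import numpy as np

def velodyne_to_image(pts_velo, T_cam_velo, P_cam):
    """Map Velodyne points into a camera image.

    pts_velo   : Nx3 points in the Velodyne frame
    T_cam_velo : 4x4 rigid transform Velodyne -> camera (from the calibration chain)
    P_cam      : 3x4 projection matrix of that camera
    """
    pts_hom = np.hstack([pts_velo, np.ones((pts_velo.shape[0], 1))])  # Nx4
    pts_cam = T_cam_velo @ pts_hom.T                                   # 4xN, camera frame
    pts_img = P_cam @ pts_cam                                          # 3xN, image plane
    uv = (pts_img[:2] / pts_img[2]).T                                  # Nx2 pixel coordinates
    return uv, pts_cam[2]                                              # pixels, depths

# dummy calibration values, for illustration only
T = np.eye(4)
T[:3, 3] = [0.27, 0.0, -0.08]
P = np.array([[721.5, 0.0, 609.6, 0.0],
              [0.0, 721.5, 172.9, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
uv, depth = velodyne_to_image(np.array([[5.0, 1.0, 2.0]]), T, P)
print(uv, depth)
```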

Data Annotation Challenges
3D object labels: 22 annotators
Occlusion labels: Mechanical Turk

Data Statistics
[Bar chart: number of labels (0 to 150k) per object class: Car, Van, Truck, Pedestrian, Person (sitting), Cyclist, Tram, Misc]

Novel Challenges
Middlebury Stereo Evaluation Version 2: average errors of 2-3% (non-occluded regions)

Novel Challenges
Fast guided cost-volume filtering (Rhemann et al., CVPR 2011)
Middlebury errors: 2.7%; KITTI errors: 46.3%
Error threshold: 1 px (Middlebury) / 3 px (KITTI)
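
The error percentages above count the fraction of pixels whose estimated disparity deviates from the ground truth by more than a threshold (1 px on Middlebury, 3 px on KITTI). A minimal sketch of that metric, assuming dense numpy disparity maps and a validity mask marking pixels with ground truth:

```python
import numpy as np

def disparity_error_rate(d_est, d_gt, valid_mask, tau=3.0):
    """Fraction of valid pixels with |d_est - d_gt| > tau.

    d_est, d_gt : HxW disparity maps in pixels
    valid_mask  : HxW bool array, True where ground truth exists
                  (e.g. non-occluded pixels covered by the laser scanner)
    tau         : error threshold in pixels (3 for KITTI, 1 for Middlebury)
    """
    err = np.abs(d_est - d_gt)
    bad = (err > tau) & valid_mask
    return bad.sum() / max(valid_mask.sum(), 1)

# toy example: one of six pixels is off by 5 px -> error rate 1/6
d_gt  = np.array([[10., 12., 14.], [20., 22., 24.]])
d_est = np.array([[10., 12., 19.], [20., 22., 24.]])
mask  = np.ones_like(d_gt, dtype=bool)
print(disparity_error_rate(d_est, d_gt, mask))
```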

Novel Challenges
So what is the difference?
Middlebury: laboratory scenes, Lambertian surfaces, rich in texture, medium-size label set, largely fronto-parallel (example disparity: d = 16 px)
KITTI: recorded from a moving vehicle, specularities, sensor saturation, large label set, strong slants (example disparities: d = 0 px, d = 50 px, d = 150 px)

Stereo Evaluation
200 training images / 200 test images

Stereo Evaluation
Particle Convex Belief Propagation (PCBP), best results: errors of 0.5% and 0.5%
Natural scenes with lots of texture and no objects; only a couple of wrong pixels at poles

Stereo Evaluation
Particle Convex Belief Propagation (PCBP), worst results: errors of 19.5% and 21.1%
Inner city scenes with lots of objects; textureless surfaces, sensor saturation, reflections

Optical Flow Evaluation
Second Order Total Generalized Variation, best results: errors of 0.5% and 0.5%
City scenes with slow motion (intersections); small flow vectors (< 30 px)

Optical Flow Evaluation
Second Order Total Generalized Variation, worst results: errors of 56.5% and 58.8%
Difficult lighting conditions, highway driving; large flow vectors (> 150 px)

SLAM Evaluation
22 sequences covering roughly 40 kilometers
11 training sequences / 11 test sequences

Challenges: Visual Odometry
Ground truth from GPS+IMU
Metric: relative pose error evaluated over all frame combinations (i, j); a sketch of this computation follows below
[Plots: translation error [%] and rotation error [deg/m] over speed [km/h] for VISO2-S, VO3ptLBA, VO3pt, VOFSLBA, VOFS and VISO2-M]
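
The exact error formula did not survive the transcription; the following is a hedged sketch in the spirit of the KITTI odometry evaluation, comparing for each frame pair (i, j) the estimated relative motion against the ground-truth relative motion. Poses are assumed to be 4x4 homogeneous matrices; variable names are illustrative.

```python
import numpy as np

def relative_pose(T_i, T_j):
    """Relative motion from frame i to frame j (4x4 homogeneous poses)."""
    return np.linalg.inv(T_i) @ T_j

def pair_errors(gt_i, gt_j, est_i, est_j):
    """Translation and rotation error of the estimated motion between frames i and j."""
    delta = np.linalg.inv(relative_pose(gt_i, gt_j)) @ relative_pose(est_i, est_j)
    t_err = np.linalg.norm(delta[:3, 3])                          # meters
    cos_a = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    r_err = np.arccos(cos_a)                                      # radians
    return t_err, r_err

# Averaging these pair errors over many (i, j) combinations, normalized by the
# driven distance and binned by speed, yields curves like the translation [%]
# and rotation [deg/m] plots on the slide.
```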

Object Detection and Orientation
Image selection for training and test:
Sequences are added exclusively to either the training set or the test set
Maximize the number of non-occluded objects, prefer images with many objects, ensure diversity in object orientation
Entropy maximization (a greedy selection sketch follows below):
x^* = argmax_x [ \alpha \cdot noc(x) + (1/C) \sum_{c=1}^{C} H_c(X \cup \{x\}) ]
X: current set; x: image from the whole dataset; noc(x): number of non-occluded objects in the image; C: number of object classes; H_c: entropy of class c with respect to the orientation
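
A minimal sketch of that selection rule, read as a greedy loop (my assumption) over candidate images with simple per-class orientation histograms. The data layout and the number of histogram bins are illustrative, not taken from the paper.

```python
import numpy as np

N_BINS = 8  # orientation histogram bins (assumption)

def orientation_entropy(angles, n_bins=N_BINS):
    """Entropy of an orientation histogram; an empty class contributes zero."""
    if len(angles) == 0:
        return 0.0
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def objective(selected, candidate, classes, alpha=1.0):
    """alpha * noc(x) + (1/C) * sum_c H_c(X union {x})."""
    noc = sum(1 for obj in candidate["objects"] if not obj["occluded"])
    H = 0.0
    for c in classes:
        angles = [o["orientation"] for img in selected + [candidate]
                  for o in img["objects"] if o["cls"] == c]
        H += orientation_entropy(angles)
    return alpha * noc + H / len(classes)

def greedy_select(images, classes, k, alpha=1.0):
    """Greedily pick k images maximizing the objective above."""
    selected, remaining = [], list(images)
    for _ in range(min(k, len(images))):
        best = max(remaining, key=lambda x: objective(selected, x, classes, alpha))
        selected.append(best)
        remaining.remove(best)
    return selected
```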

Object Detection Evaluation
Evaluation criteria (difficulty levels):
  easy:     height > 40 px, fully visible, truncation < 0.15
  moderate: height > 25 px, partly occluded, truncation < 0.30
  hard:     height > 25 px, mostly occluded, truncation < 0.50
Overlap criterion per class: a detection must overlap the ground-truth box with o = (GT \cap DET) / (GT \cup DET) > 0.7 for cars and > 0.5 for pedestrians and cyclists
Baseline object detector for the benchmark: Discriminatively Trained Deformable Part Models v4 [1]
[1] P. Felzenszwalb et al., Cascade Object Detection with Deformable Part Models, CVPR 2010.
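
A small sketch of the overlap test described above, with axis-aligned boxes given as [x1, y1, x2, y2] and the per-class thresholds from the slide; the helper names are my own.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

MIN_OVERLAP = {"car": 0.7, "pedestrian": 0.5, "cyclist": 0.5}

def is_correct_detection(det_box, gt_box, cls):
    """True if the detection overlaps the ground-truth box enough for its class."""
    return iou(det_box, gt_box) > MIN_OVERLAP[cls]

print(is_correct_detection([0, 0, 10, 20], [1, 0, 11, 20], "car"))  # IoU ~0.82 -> True
```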

Object Orientation Evaluation
Evaluation metric for orientation: Average Orientation Similarity (AOS)
AOS = (1/11) \sum_{r \in \{0, 0.1, ..., 1\}} \max_{\tilde{r}: \tilde{r} \ge r} s(\tilde{r})
Normalized cosine similarity:
s(r) = (1/|D(r)|) \sum_{i \in D(r)} \delta_i (1 + \cos \Delta_\theta^{(i)}) / 2
D(r): set of all object detections at recall rate r
\Delta_\theta^{(i)}: difference in angle between estimated and ground-truth orientation of detection i
\delta_i = 1 if detection i has been assigned to a ground-truth bounding box, \delta_i = 0 if no ground truth has been assigned
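
A compact sketch of AOS under the definition above, assuming detections have already been matched to ground truth and come sorted by descending score, so that recall and s(r) can be accumulated in one pass; the data layout is an assumption.

```python
import numpy as np

def average_orientation_similarity(dets, num_gt):
    """AOS over a list of detections.

    dets   : list of (matched, delta_theta) tuples sorted by descending score;
             matched is True if the detection was assigned to a ground-truth box,
             delta_theta is the orientation difference in radians (ignored otherwise)
    num_gt : total number of ground-truth objects (for the recall axis)
    """
    recalls, sims = [], []
    tp, cum_sim = 0, 0.0
    for k, (matched, dtheta) in enumerate(dets, start=1):
        if matched:                                    # delta_i = 1
            tp += 1
            cum_sim += (1.0 + np.cos(dtheta)) / 2.0
        recalls.append(tp / num_gt)                    # recall after k detections
        sims.append(cum_sim / k)                       # s(r) over the first k detections
    aos = 0.0
    for r in np.linspace(0.0, 1.0, 11):                # 11-point interpolation
        candidates = [s for rec, s in zip(recalls, sims) if rec >= r]
        aos += max(candidates) if candidates else 0.0
    return aos / 11.0
```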

Object Detection Evaluation: Car
[Precision over recall for the easy, moderate and hard settings]
[Orientation similarity over recall for the easy, moderate and hard settings]

Object Detection Evaluation: Pedestrian
[Precision over recall for the easy, moderate and hard settings]
[Orientation similarity over recall for the easy, moderate and hard settings]

Which cars are hard to see?
Bad illumination conditions
Occlusions

Which pedestrians are hard to see?
Children
Carried items

Conclusion: Where are we now?
Realistic dataset with 3D ground truth: stereo, optical flow, SLAM, object detection / orientation estimation, object tracking
Recognition meets Reconstruction Challenge
Complements existing benchmarks / reduces overfitting
Submit your results.

Conclusion: But KITTI is much more...
3D object tracking (Lenz et al., IV 2011)
Loop closure / SLAM (Williams et al., RAS 2009)
Structure-from-Motion (Agarwal et al., ICCV 2009)
Semantic segmentation with class labels (Wojek et al., ECCV 2008)
3D scene understanding, layout and objects (Geiger et al., CVPR 2011 and NIPS 2011)
Use of maps (OpenStreetMap, the free wiki world map)

Related Datasets and Benchmarks
Stereo / Optical Flow (setting, #images, ground truth):
  EISATS: synthetic, 500, dense
  Middlebury: laboratory, 40, dense
  Make3D Stereo: real, %
  Ladicky: real, 70, manual
  KITTI: real, %
SLAM (setting, length, metric):
  TUM RGB-D: indoor, 0.4 km
  New College: outdoor, 2.2 km
  Malaga 2009: outdoor, 6.4 km
  Ford Campus: outdoor, 5.1 km
  KITTI: outdoor, 39.2 km

Related Datasets and Benchmarks
Object Detection (#categories, #labels/category, occlusion labels, 3D, orientation):
  Caltech
  MIT StreetScenes: 9, 3k
  LabelMe
  ETHZ Pedestrian: 1, 12k
  PASCAL: k
  Daimler: 1, 56k
  Caltech Pedestrian: 1, 350k
  COIL: discrete orientation
  EPFL Multi-View: discrete orientation
  Caltech 3D: discrete orientation
  KITTI: 3, 1k - 40k, continuous orientation
