Stereo-Assist: Top-down Stereo for Driver Assistance Systems


Gideon P. Stein, Yoram Gdalyahu and Amnon Shashua

G. Stein and Y. Gdalyahu are with Mobileye Vision Technologies Ltd., Jerusalem, Israel. gideon.stein@mobileye.com
A. Shashua is with the Department of Computer Science, Hebrew University, Jerusalem, Israel. shashua@cs.huji.ac.il

Abstract: This paper presents a top-down approach to stereo for use in driver assistance systems. We introduce an asymmetric configuration where monocular object detection and range estimation are performed in the primary camera, and the detected image patch is then aligned and matched in the secondary camera. The stereo distance measure from the matching assists in target verification and improves the distance measurements. This approach, Stereo-Assist, shows significant advantages over the classical bottom-up stereo approach, which relies on first computing a dense depth map and then using the depth map for object detection. The new approach can provide increased object detection range, reduced computational load, greater flexibility in camera configurations (we are no longer limited to side-by-side stereo configurations), greater robustness to obstructions in part of the image, and the ability to use mixed camera modalities (FIR/VIS). We show results with two novel configurations and illustrate how monocular object detection allows for simple online calibration of the stereo rig.

I. INTRODUCTION

Camera-based driver assistance systems can be divided into two main types: monocular systems and stereo systems. The latter typically employ two symmetric cameras mounted side by side, where the epipolar lines are aligned with the horizontal image scan lines. Due to their compact nature, hardware simplicity and lower cost, driver assistance applications using a monocular camera system are gaining traction. Recent years have witnessed product launches of Lane Departure Warning (LDW), Adaptive High-Beam Assist (AHB) and Traffic Sign Recognition (TSR) as a bundle driven by a monocular camera (available in the BMW 7 Series, Mercedes E-class and Audi A8, among others). LDW and monocular vehicle detection for Forward Collision Warning (FCW) [1], [4], [5] are bundled in the Mobileye AWS. Fusion between radar and a monocular camera is used in applications such as LDW and Collision Mitigation by Braking (CMBB) on production vehicles (the Volvo Collision Avoidance Package), including for pedestrians [2]. The NHTSA safety ratings for model year 2011 include LDW and FCW [3] and underscore the need to integrate both customer functions into a single monocular camera system in order to save costs. Taken together, one can identify three important trends: (i) monocular camera systems are in high demand and are entering the domain of (model-based) object detection (vehicles for certain, with the possibility of pedestrians as well), (ii) as a result, there is a strong drive to push the performance envelope of a monocular camera system in order to maximize the return on investment, and (iii) monocular camera systems are quite mature.

Fig. 1. Examples of cluttered scenes where a depth map is not useful for detecting pedestrians but Stereo-Assist can be used. (a) Pedestrian near a row of parked cars. (b) Driver stepping out of a car or truck.

However, monocular systems have no direct method for measuring distance, so information about the object class and/or context, such as the road plane, must be used. Depth can be computed directly using multiple cameras.
In the classic stereo approach, a dense disparity map is used to create a 3D map of the environment. This 3D representation is then used for foreground/background segmentation (e.g., for finding candidate regions for further processing), for triggering object detection processing, and for estimating range and range-rate to detected objects. The classical stereo approach is a low-level, pixel-based process and, in principle, can handle arbitrary shapes without modeling the object class beforehand. By contrast, monocular systems use pattern recognition to detect a specific object class prior to monocular range estimation. The classical stereo approach works well for close targets and in good weather conditions. However, its utility drops rapidly for longer distances and inclement weather. Furthermore, the effective range increases only with the square root of the camera resolution (see Sec. III-C for details) and thus will not increase significantly with the introduction of the newer 1M pixel automotive sensors entering the market. Finally, it is not clear how truly useful the depth map is in cluttered scenes (see Fig. 1 and Fig. 2). The images themselves hold much more information than the depth map for segmenting out objects (and object parts) in the presence of clutter. Moreover, in an automotive setting, packaging a stereo rig with a light shield and windscreen-wiper protection is not a simple feat. It adds to the cost of introducing a stereo system, and is thereby a hindrance to high-volume adoption. Instead of having depth estimated at an early stage throughout the image (classical stereo), we propose to turn the process on its head.

Fig. 2. Example of Stereo-Assist for pedestrian detection. (a) The primary camera image with the targets detected using monocular detection. The primary camera is attached to the windshield just below the rear-view mirror mount. (b) The secondary camera image. The camera in this case was mounted on the left tip of the rear-view mirror. It is a few cm from the glass and exposed to reflections, which can be seen here as three vertical dark lines in the center of the image. (c) The alignment results of the three targets are not hindered by the reflections.

Monocular camera systems are mature and have a proven ability to segment out objects (vehicles and pedestrians). In Stereo-Assist, targets (and parts thereof) are detected in the monocular system and the target distance is computed. This is the Primary camera, responsible also for all non-ranging applications (like LDW, IHC, TSR). The targets are then matched with the Secondary camera image, adding stereo depth on the selected targets (for an example see Fig. 2). This asymmetric configuration, which we call Stereo-Assist, combines stereo depth and monocular depth, thus maximizing the strengths of each modality. It can be viewed as a way of bolstering the performance envelope of a monocular system (whatever that envelope limit may be). In particular, more accurate distance information can be obtained and the stereo distance can also be used as an extra cue for target verification. Alternatively, Stereo-Assist can be viewed as an extension of classical stereo to handle high-level primitives (rather than pixel-level measurements), covering the spectrum from complete model-based objects to object parts common to man-made objects in general. The move towards high-level primitives, and thereby a stronger emphasis on mono abilities, endows the stereo system with increased robustness against clutter.

Following a review of current stereo and monocular systems in Section II, we discuss in more detail the rationale for Stereo-Assist in Section III. Section IV shows how the results of the monocular object detection can be used for online calibration of the stereo rig. Stereo-Assist can work well with a classic side-by-side stereo configuration, but to demonstrate the flexibility of the concept, Section VI describes experiments and results from two novel configurations. Section VII discusses the results and suggests other possible camera configurations.
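To make the top-down data flow concrete, the following minimal Python sketch mirrors the description above. It is purely illustrative: the detector and matcher interfaces are hypothetical placeholders for the production modules, and the simple averaging fusion rule is our own choice, as the paper does not prescribe one.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Target:
    box: Tuple[int, int, int, int]  # (x, y, w, h) in the primary image
    mono_range_m: float             # range estimate from the monocular system

def detect_targets(primary) -> List[Target]:
    # Placeholder for the production monocular detection + ranging module.
    raise NotImplementedError

def stereo_range(primary, secondary, target: Target) -> float:
    # Placeholder: align the target patch in the secondary image and convert
    # the resulting disparity to a range (see Sections III-B and IV).
    raise NotImplementedError

def stereo_assist(primary, secondary):
    fused = []
    for t in detect_targets(primary):          # top-down: whole objects first
        z_stereo = stereo_range(primary, secondary, t)
        z = 0.5 * (t.mono_range_m + z_stereo)  # illustrative fusion rule only
        fused.append((t, z_stereo, z))
    return fused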
II. BACKGROUND

An example of the classic stereo approach is given in [8], which presents both sparse and dense disparity maps, and their use both in foreground/background segmentation and to trigger object detection. Effective results are presented for vehicle and pedestrian targets up to 20m using a 32deg FOV lens and a 30cm baseline. A simple epipolar geometry is used where the epipolar lines are parallel to the image rows. This is done for computational efficiency and is common to virtually all stereo systems. Since it is difficult to mount the cameras accurately enough, image rectification is performed. Schick [9] describes using cameras with different optics, but still the images are warped to the simple epipolar geometry. Detection range was improved using higher resolution, narrower lenses and wider baselines (see [10]). [11] pushes the limits, using 35mm lenses and a baseline of 1.0m to detect targets with heights as small as 20cm at 100m range. Calibration becomes critical at these ranges, and the solution was found in alignment of the road surface and then determining points that are above the road, an approach that has become known as v-disparity [10]. Suganuma [12] uses v-disparity and clustering to impressively segment a vehicle from a row of tree targets one lane to the side, at a range of 30m, using only a narrow 12cm baseline and a 30deg FOV. At 70m, however, vehicle targets get blended together.

The disparity map is a computational challenge. Van der Mark and Gavrila [13] give a review of the faster methods, which are based on SSD and SAD computation, and then investigate the various methods for choosing window size, smoothing and edge preserving. Their performance analysis is focused on intelligent vehicle (IV) applications, including analysis of the effects of imager noise and alignment error. They do not, however, explore the effects of rain and lighting conditions. Hirschmuller [14] surveys more accurate methods, which are more suitable for offline computation, and presents a mutual-information-based cost function and an approximation to global matching (semi-global matching, SGM). Technology has progressed, and [15] reports a real-time SGM implementation on special hardware.
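For reference, the row-wise SSD/SAD block matching that such real-time systems build on can be sketched as follows. This brute-force version is purely illustrative (rectified images are assumed) and is not the implementation used by any of the cited systems.

import numpy as np

def dense_disparity_sad(left, right, max_disp=64, win=5):
    # Bottom-up baseline: for every pixel, slide a window along the same
    # image row and keep the disparity with the lowest sum of absolute
    # differences. left/right are 2-D arrays of equal shape.
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    h, w = left.shape
    r = win // 2
    disp = np.zeros((h, w), np.int32)
    for y in range(r, h - r):
        for x in range(r + max_disp, w - r):
            patch = left[y - r:y + r + 1, x - r:x + r + 1]
            costs = [np.abs(patch - right[y - r:y + r + 1,
                                          x - d - r:x - d + r + 1]).sum()
                     for d in range(max_disp)]
            disp[y, x] = int(np.argmin(costs))
    return disp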

Subpixel disparity maps can be obtained by fitting the cost function to a parabola over local disparity perturbations and finding the maximum/minimum analytically. This topic is surveyed by [16], and results are shown for 3D reconstruction of buildings and for road scenes. A 26deg FOV and a 55cm baseline are used, giving improved shape at 40m in good weather. Subpixel disparity accuracy has not been demonstrated for inclement weather conditions. Munder et al. [17] use the depth from a dense stereo depth map as one input to the detection system. Keller et al. [15] use the dense stereo depth map to determine road geometry and, in particular, to improve the ROI detection for possible pedestrian candidates. Chiu et al. [18] break from tradition and explicitly look for vehicle candidates in the grayscale images using vertical and horizontal edge maps before computing the disparity for the candidate patches. This significantly reduces the computational load. They limit themselves to simple epipolar geometry.

Monocular detection of specific object classes for driving assistance has been demonstrated successfully. Traffic sign recognition is already in serial production. Vehicle detection was presented in [1], [19] with serial production performance [4], [5], and pedestrian detection in [20], [21] is also in serial production [2]. Monocular camera-based vehicle and pedestrian detection technology is available as a black box. The experiments in this paper use such a system for the primary camera.

III. WHY USE STEREO-ASSIST?

The advantages of the Stereo-Assist concept over classic stereo can be grouped into several categories: (i) computational efficiency, (ii) increased robustness, (iii) increased effective range, (iv) flexibility in camera configurations, and (v) modularity. We expand on some of these categories in more detail below.

A. Computational efficiency

In classic stereo, a dense depth map is computed, which requires significant computational resources. In order to make the computation more efficient, simple similarity scores such as SSD and SAD are used in most real-time systems, and the epipolar geometry is reduced to a simple configuration where disparity is calculated by matching along image rows.

B. Robust Matching and Disparity Estimation

In Stereo-Assist, matching is performed only on a limited number of patches, thereby allowing for matching techniques which are both more advanced and more computationally expensive, per image patch, than those affordable when computing a dense depth map. In this paper we use normalized correlation. This measure is robust to lighting variations and to partial occlusions of the target [22]. Alternatively, one could consider measures such as the Hausdorff distance [23] and mutual information [24].
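As an illustration of the per-patch matching, a minimal zero-mean normalized-correlation matcher over a translation-only search could look as follows. The function names are ours, and the paper's alignment additionally allows rotation and scale (see Sec. IV-B).

import numpy as np

def ncc_score(patch, window):
    # Zero-mean normalized correlation between two equally sized patches.
    a = patch - patch.mean()
    b = window - window.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def match_patch(template, image, search_box):
    # Scan the template over search_box = (x0, y0, x1, y1) in the image and
    # return the best-scoring top-left corner and its score.
    th, tw = template.shape
    x0, y0, x1, y1 = search_box
    best = (None, -1.0)
    for y in range(y0, y1 - th + 1):
        for x in range(x0, x1 - tw + 1):
            s = ncc_score(template, image[y:y + th, x:x + tw])
            if s > best[1]:
                best = ((x, y), s)
    return best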
One example of the increased robustness is shown in Fig. 3. Figures 3a and 3b show the images from the primary and secondary cameras, respectively. In this frame, some moisture blocks part of the secondary camera image while the primary image is clear. This situation is quite common unless the windshield wipers are designed to clear the windshield in front of both cameras simultaneously. Figure 3c shows the three vehicle targets detected in the primary camera and aligned in the secondary camera. The alignment is good even though much of the target area in the secondary image is distorted by water.

Fig. 3. Example of stereo alignment with the secondary camera at the bottom center of the windshield. It is a difficult scene for traditional stereo because the precipitation on the windshield distorts the secondary image. (a) Primary image. (b) Secondary image: the car hood is visible in the bottom 20 lines of the image. (c) The three targets aligned. The left image shows the target cropped from the primary image; the right image shows the result of the alignment. Due to the arrangement of the cameras, the stereo signal is in the vertical alignment.

This highlights a significant advantage of the Stereo-Assist system: it is quite tolerant to image quality issues in the secondary camera. In Fig. 2 the secondary camera is mounted exposed and away from the glass, and significant reflections appear from the dashboard. These reflections, however, did not affect the overall matching of the objects.

C. Greater detection range

Consider the example of a classic stereo camera pair where each camera is of VGA resolution, with a horizontal FOV of 40deg (focal length f = 950 pixels), and the baseline b is 30cm. The disparity d as a function of the distance Z is given by:

d = fb/Z    (1)

The distance error Z_err resulting from a disparity error of one pixel is given by:

Z_err = Z^2/(fb + Z)    (2)

from which we conclude that, in the (relevant) case where fb >> Z, the error goes up quadratically with distance: Z_err ~ Z^2/(fb). A plot is shown in Fig. 4a.
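A short worked check of eqns (1) and (2) with the numbers used in this section (f = 950 pixels, b = 0.30m); at Z = 33m it reproduces the roughly 3m error figure used in the pedestrian example of Sec. III-D below:

f, b = 950.0, 0.30

def disparity(Z):                 # eqn (1): d = f*b / Z
    return f * b / Z

def range_error(Z, d_err=1.0):    # eqn (2), generalized to d_err pixels
    return d_err * Z**2 / (f * b + d_err * Z)

for Z in (10, 20, 33, 50, 72):
    print(f"Z = {Z:3d} m: d = {disparity(Z):5.1f} px, "
          f"error = {range_error(Z):4.2f} m")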

Fig. 4. (a) Range error due to a disparity error of one pixel, assuming a focal length of 950 pixels and a baseline of 30cm. (b) Example of a pedestrian at 33m. The person must be at least 3m from the wall in order to be seen in the depth map. Note also that the shop window is glass, and the mixture of reflections and transparency would increase the disparity noise on the background beyond 1 pixel.

D. Detection in cluttered scenes

In cluttered scenes the foreground and background are at the same depth, and the depth map, which is very expensive to compute, is not really useful in detecting the objects of interest. Some examples are shown in Fig. 1: pedestrians near rows of parked cars and people stepping out of a car or truck. Figure 4b shows an image captured of a pedestrian in front of a store. The distance to the pedestrian is approximately 33m. If we assume a disparity map accuracy of 1 pixel (although sub-pixel accuracies have been reported, those were limited to good weather conditions), eqn. 2 implies that the pedestrian cannot be detected by stereo unless he/she is at least 3m from the wall. Monocular pedestrian range in daylight is 50m [2] with a VGA sensor (f = 950 pixels), and this range increases linearly with increased resolution (either by narrowing the FOV or by moving to a 1M pixel sensor). For target verification we would like a minimum disparity of 4 pixels, which translates to a verification range of 72m (from eqn. 1, Z = fb/d). At those ranges, the distance estimate from stereo is only about 25% accurate and no better than the monocular distance. Thus the pedestrian in Fig. 4b can be detected by the monocular system and then verified by the Stereo-Assist algorithm for higher confidence.

E. Flexible configurations and tolerance to calibration errors

Since no dense depth map is required, there is no need to perform image rectification. Not only does this reduce the computational need for image warping, it also reduces the adverse effects of rectification in the following cases: (i) wide angle lenses with significant distortion, and (ii) the case where one camera is significantly in front of the other, so that the rectified images no longer face straight ahead. These complex geometries are made possible by allowing image translation, rotation and scale in the alignment of the small number of patches. Moreover, as we match entire objects, which inherently have edges and corners in a variety of orientations, we are no longer limited, by the aperture problem, to searching precisely along epipolar lines. The result is a system which is, to a large extent, tolerant to errors in epipolar geometry estimation.
IV. CALIBRATION

Calibration of a stereo rig involves calibration of internal and external camera parameters. In our case the focal length is known (within 1%). Furthermore, the position and orientation of the primary camera relative to the vehicle are known. What is left to determine is the relative position and orientation of the two cameras. This is known as the epipolar geometry.

Various methods have been proposed using calibration targets or point correspondences [25]. The online calibration method presented here does not require specialized calibration targets. Instead, it bootstraps off the object detection available from the monocular vision system. Vehicle targets are used for calibration (not pedestrians or motorcycles) since they are wide and have strong texture edges. Two variants of this concept are described. We will assume that the relative orientation angles are a few degrees at most, and thus small-angle approximations can be used.

A. Calibration using targets tracked to infinity

We use targets that are detected when they are close and then tracked until they are under 8 pixels in the image. The calibration has four steps:

1) Calibration of the relative rotation around the optical axis (Z axis): The targets, cropped from the primary image, are aligned with the targets in the secondary image, allowing for planar translation, rotation and scaling. The larger (closer) targets are used to determine the relative rotation around the optical axis; see Fig. 5.

2) Calibration of the relative rotation around the X and Y axes: The distant targets are used as points at infinity (Fig. 5c). These targets are vehicles (at least 1.5m wide) and thus must be over 150m in distance. The angles are small, so we can simply use the x and y alignment values of the distant targets as an offset and not compute a true image rotation. (This approximation can lead to a few pixels of error at the edges of the image. For accurate depth measurements, further calibration refinement is required, but this is beyond the scope of this paper.)

Fig. 5. (a) Alignment with a rotation of 1/80 rad around the optical axis. (b) Without rotation correction. (c) Alignment of the same vehicle when it is approximately 150m away.

3) Determining the epipolar lines: The x and y alignment values of the targets at closer range are used to determine the relative displacement in X and Y of the secondary camera. The scale gives an estimate of the displacement in Z, as described in Section IV-B.

4) The baseline: Often the absolute value of the baseline is known by design. If not, the baseline can be estimated from the monocular distance estimates, as shown in the next section.

B. Calibration using monocular distance estimates

In an urban environment distant targets might be rare. For such a setting, a different calibration method was developed which does not require targets at infinity. This method uses the range measurements from the monocular system and the stereo matches to determine the epipolar geometry and stereo baseline. First, determine the relative rotation around the Z axis and correct the secondary image:

1) Targets detected by the primary camera are matched to targets in the secondary camera using a coarse-to-fine search over a wide range of translations in x and y, rotations and scales. To speed up the search, best matches are first found using only translation and only for even disparities over a wide disparity range (±40 pixels). This is followed by a refinement stage which includes a search of ±4 pixels in each direction and a rotation over a ±0.1 rad range.

2) As in the previous method, the final rotation estimate is the median rotation of all the targets that are larger than 20 pixels.

3) The secondary images are all corrected for the rotation, and then the alignment is repeated using only image translation and scale. In practice this last step is a refinement step, with translation in a range of ±4 pixels in each direction and scale ranging from 0.95 to 1.05 in 0.01 increments.

These translation and scale values are used to calibrate for the relative rotation around the X and Y camera axes and the relative translation in X, Y and Z. We make the following assumptions and approximations:

1) The relative rotation around the Z axis (roll) has been computed and the secondary image has been corrected.

2) The relative rotation is small and can be approximated by an image shift. It can then be coupled in with the unknown principal points of the two cameras.

3) The effects of scale (S) due to the δZ relative translation can be decoupled from the image δx, δy displacement (disparity) due to the camera δX and δY relative translations.

This leads to the following relationships:

δx = fδX/Z + x0    (3)
δy = fδY/Z + y0    (4)

where f is the known camera focal length and the values x0 and y0 combine the relative orientation and the unknown differences in the principal points. The pairs of values (δX, x0) and (δY, y0) can be determined using a best line fit to (3) and (4), respectively. If there is also relative camera displacement in the Z direction, there will be some scaling required of the corresponding image patch, where the scale factor depends on the distance to the object:

S = Z/(Z - δZ)    (5)

The scale S is in most cases too small a signal to be used for measuring the distance. However, it cannot be neglected in the model, as it can add a few pixels of error to the disparity values δx and δy.
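A sketch of the coarse-to-fine alignment search of step 1 above, reusing match_patch from the Sec. III-B sketch; scipy is used here for the rotation and scaling of the patch, and the grid values in the usage comment follow the text:

import numpy as np
from scipy.ndimage import rotate, zoom
# match_patch (translation-only search) as in the Sec. III-B sketch.

def align_target(template, secondary, search_box,
                 rotations=(0.0,), scales=(1.0,)):
    # Exhaustive search over rotation and scale; the translation search is
    # handled inside match_patch over search_box.
    best_params, best_score = None, -1.0
    for ang in rotations:                  # radians; small angles only
        for s in scales:
            warped = zoom(rotate(template, np.degrees(ang), reshape=False), s)
            xy, score = match_patch(warped, secondary, search_box)
            if score > best_score:
                best_params, best_score = (xy, ang, s), score
    return best_params, best_score

# Coarse stage: translation only over a wide (+/-40 px) disparity range.
# Fine stage, e.g.:
#   align_target(tmpl, sec, small_box,
#                rotations=np.linspace(-0.1, 0.1, 9),
#                scales=np.arange(0.95, 1.051, 0.01))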
Equation 5 can also be written as:

1 - 1/S = δZ/Z    (6)

Dividing both sides by the dominant x or y disparity (for example δy) gives:

(1 - 1/S)/δy = (δZ/Z)/(fδY/Z) = δZ/(fδY)    (7)

where δy has already been corrected for the calibrated offset value y0. The ratio (1 - 1/S)/δy is thus fixed for a particular relative camera geometry. This ratio is computed for all the targets and the median value is taken as the calibration value. Using this ratio, one can perform a 1D stereo search over a range of disparities (δy) and, for each disparity, compute the appropriate scale factor.
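Under the assumptions above, recovering the calibration from per-target measurements reduces to two straight-line fits and a median, as in the following sketch (the input arrays are hypothetical per-target measurements; np.polyfit returns slope then intercept):

import numpy as np

def calibrate_from_mono(inv_depth, dx, dy, S, f):
    # inv_depth: f/Z per target (pixel/m, with Z from the monocular range)
    # dx, dy:    measured disparities (pixels); S: per-target scale factor
    inv_depth = np.asarray(inv_depth, float)
    dX, x0 = np.polyfit(inv_depth, np.asarray(dx, float), 1)  # eqn (3)
    dY, y0 = np.polyfit(inv_depth, np.asarray(dy, float), 1)  # eqn (4)
    S = np.asarray(S, float)
    # eqn (7): the ratio is constant over targets; take the median.
    ratio = np.median((1.0 - 1.0 / S) / (np.asarray(dy, float) - y0))
    dZ = ratio * f * dY                                       # cf. eqn (8)
    return {"dX": dX, "x0": x0, "dY": dY, "y0": y0, "dZ": dZ}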

V. TOWARDS GENERAL OBJECT DETECTION

We have described Stereo-Assist in the context of model-based detections, where the Primary camera is responsible for detecting generic object classes of vehicles, traffic signs and pedestrians. It is possible to extend the envelope and include object parts in the vocabulary of high-level entities extracted by the Primary camera. Most objects of interest, including partially occluded pedestrians and man-made objects in general, are rich with structured visual cues, of which the most notable are vertical line segments. Given texture analysis processes for extracting roadway segments (cf. [7]), it is possible to cluster vertical line segments, based on lateral proximity, height from the roadway, motion and monocular depth cues, into candidates of meaningful structure parts, which are fed into the Stereo-Assist direct depth measurement process for the purpose of target validation. The possibility of obtaining direct depth measurements using stereopsis is crucial for making the process robust, because otherwise accidental line segments, such as those arising from spurious shadows, can collude to form false positives. Those false positives can be removed using direct depth measurements or through model-based classification. Further details are outside the scope of this paper, but Fig. 6 illustrates the main steps: (i) roadway extraction, (ii) vertical line segments, and (iii) detected objects, which include a partially occluded pedestrian and a sign-post further ahead.

Fig. 6. Stereo-Assist used for non-model-based detection, which includes partially occluded pedestrians and man-made objects rich with vertical structure. (a) Finding the roadway based on statistical properties [7]. (b) Extracting vertical line segments common to man-made objects and pedestrians. (c) Objects detected through clustering using shape and depth cues.

Finally, the concept can be further extended to focus on all non-road texture clusters (not necessarily generated by vertical line segments). This case is not far from classical bottom-up stereo with some regions masked off, thus underscoring a fundamental principle: Stereo-Assist and classical bottom-up stereo lie on a spectrum defined by the degree of generality of the candidates sent to the Secondary camera for depth measurement. In classical stereo those candidates are single pixels, whereas with Stereo-Assist they are high-level clusters defined by model-based classification and general texture extracts common to the type of objects of interest found in the driver assistance domain.

VI. EXPERIMENTS WITH TWO NOVEL CONFIGURATIONS

Stereo-Assist can work well with a standard symmetric stereo camera configuration. However, this paper presents more novel camera arrangements in order to highlight the flexibility of the approach. The test vehicle used (Ford Focus, MY 2010) already had a monocular object detection system for vehicles and pedestrians [2], [4] located just below the rear-view mirror. This was used as the primary camera. We chose two locations for the Secondary camera (see Fig. 7):

1) A diagonal configuration, where the secondary camera head is mounted on the left tip of the rear-view mirror. The lateral displacement of the secondary camera (i.e. the stereo baseline) is half the rear-view mirror width, in this case 14cm. The camera is a few centimeters back from the windshield and there is no protection from reflections from the dashboard.

2) A vertical configuration, where the Secondary camera was placed below the Primary camera, approximately 13cm down and 26cm along the windshield (with a rake angle of 30deg).

Fig. 7. The image pair from the Primary and Secondary cameras with an overlay of the epipolar line. (a) Diagonal cameras configuration. (b) Vertical cameras configuration.
The vertical configuration is attractive from a production standpoint, as both cameras can fit into conventional standard mono packages. The cameras provided with the units used a Micron MT9V024 CMOS sensor (VGA resolution) and a 5.7mm lens providing a horizontal FOV of 40deg. The mounting height of the primary camera was 1.2m.

A. Data collection

Data capture was performed using the object detection system. The system performs real-time vehicle and pedestrian detection and provides as output target data (image-location bounding box and range estimate) and the images themselves. The bounding box and range data from the primary camera were used as targets to be matched between the stereo images. Weather conditions during the experiments included daytime winter conditions, i.e., overcast skies, on-and-off precipitation of a light snow/rain mixture, and wet roads after rain.

B. Calibration results using monocular distance estimates

The second data sequence, with the vertical camera configuration, involved driving around a suburban shopping and parking area. In such a setting, no target vehicles could be tracked to beyond 100m, so we used the second calibration method, which does not require targets at infinity.

Fig. 8. Images used for calibration and testing of the second camera configuration: primary camera images, secondary camera images, and the targets from the primary camera. Targets are numbered 1-20 in row-first order.

Fig. 9. (a) y disparity as a function of inverse depth (f/Z, in pixel/m); best line fit y = 0.22(f/Z) - 5.4. (b) x disparity as a function of inverse depth (f/Z); best line fit x = 0.0087(f/Z) + 17.4.

Fig. 10. Calibration of the scaling to be applied as a function of the δy disparity due to forward displacement of the secondary camera. The ratio of eqn. 7 over 20 samples has a median value of 0.0015, which corresponds to a forward displacement value of δZ = 0.22m.

Fig. 11. Distance to the 20 vehicle targets estimated using the monocular system (AWS output), Stereo-Assist, and manual labeling (ground truth using a guess at the true vehicle width).

Twelve images were used, which included twenty vehicle targets (Fig. 8 includes a sample). The median Z-axis rotation of all the targets that were larger than 20 pixels in width was 1/40 rad. This value was used to correct the secondary image prior to further processing. Figure 9a shows the disparity in y as a function of the inverse depth (f/Z). The value of the best line fit at f/Z = 0 gives a vertical offset for points at infinity (y0) of 5.4 pixels, and the slope gives a vertical baseline (δY) of 0.22m. Figure 9b shows the disparity in x as a function of the inverse depth (f/Z). The slope of the best line fit is almost zero: the cameras have a negligible horizontal displacement. The horizontal offset for points at infinity is 17.4 pixels. These numbers correspond well to the values of 6 pixels and 16 pixels estimated by manual alignment of points on distant buildings. The best line fit gives a value of δX of less than 1cm, and thus the disparity δx can be ignored and we can search along a range of disparities δy. We allow for some slight errors in the calibration by picking the best match within δx = (-1, 0, 1) around the calibrated x0.

In our case the secondary camera is mounted at the bottom of the windshield, which has a rake angle of about 30deg, so there is also displacement in the Z direction. Figure 10 shows the ratio of eqn. 7 computed for each of the 20 targets. The median value was 0.0015. Using (7) this gives:

δZ = 0.0015 fδY = 0.22m    (8)

C. Range results with Vertical Configuration

Fig. 11 shows the range estimates for the 20 targets (a sample is shown in Fig. 8). One can see the robustness of the stereo range even in the presence of rain on the windshield. Targets (9,10), (11,12) and (13,14) are of two parked vehicles at almost the same distance. The range estimates for both vehicles are almost the same, even though the dark vehicle on the left is obscured by rain in the secondary image. After calibration, the search space is reduced to a 1D search along δy disparity values, with the appropriate scaling used for each δy value.
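The resulting 1D search can be sketched as follows, reusing ncc_score from the Sec. III-B sketch; the scale for each candidate δy is obtained by inverting eqn (7), and all names are ours:

import numpy as np
from scipy.ndimage import zoom
# ncc_score as in the Sec. III-B sketch.

def search_1d_scaled(template, secondary, x, y, dy_range, ratio, y0):
    # 1-D search along the vertical disparity dy; each candidate dy implies
    # a patch scale via eqn (7): S = 1 / (1 - ratio * (dy - y0)).
    best_dy, best_score = None, -1.0
    for dy in dy_range:
        S = 1.0 / (1.0 - ratio * (dy - y0))
        t = zoom(template, S)
        h, w = t.shape
        if (y + dy < 0 or y + dy + h > secondary.shape[0]
                or x + w > secondary.shape[1]):
            continue                       # candidate window off the image
        score = ncc_score(t, secondary[y + dy:y + dy + h, x:x + w])
        if score > best_score:
            best_dy, best_score = dy, score
    return best_dy, best_score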
With pedestrians there is a strong vertical edge component. To ensure that the vertical edges did not influence the δy search, we allowed for a small (±1 pixel) perturbation in the δx positions. Figure 12 shows the results of Stereo-Assist on some pedestrian examples. Table I shows the distance results estimated (1) manually, assuming a pedestrian height of 1.75m, (2) by the monocular system, and (3) by Stereo-Assist.

TABLE I. Distance estimates (m) for the pedestrian targets in Fig. 12, calculated using three methods.

                (a)   (c)   (e)
Manual           20    22    23
Monocular        23    24    26
Stereo-Assist    24    26    26

Fig. 12. Examples of pedestrian targets detected and alignment results, panels (a)-(e). Only the Primary camera image is shown. The secondary camera is mounted at the bottom of the windshield, below and in front of the primary camera, so vertical alignment of the targets gives the range.

VII. SUMMARY

We have presented Stereo-Assist, a new approach to stereo for driver assistance systems. In the Stereo-Assist concept the system fuses information from both monocular and stereo cues by having a Primary camera responsible for target attention, selection and range estimation, and a Secondary camera providing stereo range on selected targets for the purposes of target verification and range fusion from both mono and stereo sources. The system combines the strengths of both sources and improves the performance envelope of both monocular systems and classic, bottom-up stereo systems. Furthermore, it is to a large extent immune to errors in stereo depth and to image quality issues that arise from a Secondary camera unprotected from reflections and rain. Stereo-Assist supports packaging configurations with a higher degree of flexibility than classical stereo, as for example the novel configurations addressed in Sec. VI.

As production monocular camera systems proliferate in volume and customer functions, more innovation and resources are invested in extending the envelope of performance and the variety of customer functions. The Stereo-Assist concept allows disparity to be added as a modular element to an existing (monocular) system, thereby providing an evolutionary growth of computer-vision capabilities in the automotive market, as opposed to a zero-sum game between monocular and stereo as to which technology will gain dominance.

REFERENCES

[1] O. Mano, G. Stein, E. Dagan and A. Shashua, Forward Collision Warning with a Single Camera, in IEEE Intelligent Vehicles Symposium (IV2004), Parma, Italy, June 2004.
[2] J. R. Quain, Volvo Safety, 2011 Style: It Brakes for Walkers, New York Times, Technology Section, November 13, 2009.
[3] Consumer Information; New Car Assessment Program, National Highway Traffic Safety Administration (NHTSA), US Department of Transportation, Docket No. NHTSA-2006-26555, Final notice, July 2008.
[4] Accident prevention systems for lorries: The results of a large-scale field operational test aimed at reducing accidents, improving safety and positively affecting traffic circulation, TNO - Netherlands Organisation for Applied Scientific Research, Building and Construction Research, Nov. 2009.
[5] G. Bootsma and L. Dekker, The Assisted Driver: Systems that support driving, Dutch Directorate-General for Public Works and Water Management (Rijkswaterstaat), July 2007.
[6] G. P. Stein, O. Mano and A. Shashua, Vision-based ACC with a Single Camera: Bounds on Range and Range Rate Accuracy, in IEEE Intelligent Vehicles Symposium (IV2003), Columbus, OH, June 2003.
[7] Y. Alon, A. Ferencz and A. Shashua, Off-road Path Following using Region Classification and Geometric Projection Constraints, in Proceedings of the CVPR, 2006.
[8] U. Franke and A. Joos, Real-time stereo vision for urban traffic scene understanding, in IEEE Intelligent Vehicles Symposium (IV 2000), Dearborn, MI, 2000.
[9] J. Schick, A. Schmack and S. Fritz, Stereo Camera for a Motor Vehicle, US Patent Application Publication number US20080199069, filed 2005, published August 21st, 2008.
[10] R. Labayrade, D. Aubert and J.-P. Tarel, Real time obstacle detection in stereovision on non flat road geometry through v-disparity representation, in IEEE Intelligent Vehicle Symposium, June 2002.
[11] T. Williamson and C. Thorpe, A specialized multibaseline stereo technique for obstacle detection, in Computer Vision and Pattern Recognition (CVPR), June 1998.
[12] N. Suganuma, Clustering and tracking of obstacles from Virtual Disparity Image, in IEEE Intelligent Vehicles Symposium, June 2009.
[13] W. van der Mark and D. M. Gavrila, Real-time dense stereo for intelligent vehicles, in IEEE Transactions on Intelligent Transportation Systems, Volume 7, Issue 1, March 2006.
[14] H. Hirschmuller, Accurate and efficient stereo processing by semi-global matching and mutual information, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), June 2005.
[15] C. G. Keller, D. F. Llorca and D. M. Gavrila, Dense Stereo-Based ROI Generation for Pedestrian Detection, in DAGM-Symposium 2009, 2009.
[16] S. K. Gehrig and U. Franke, Improving Stereo Sub-Pixel Accuracy for Long Range Stereo, in IEEE 11th International Conference on Computer Vision (ICCV), October 2007.
[17] S. Munder, C. Schnorr and D. M. Gavrila, Pedestrian Detection and Tracking Using a Mixture of View-Based Shape-Texture Models, in IEEE Transactions on Intelligent Transportation Systems, Volume 9, Issue 2, June 2008.
[18] C. C. Chiu, W. C. Chen, M. Y. Ku and Y. J. Liu, Asynchronous stereo vision system for front-vehicle detection, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2009.
[19] A. Haselhoff and A. Kummert, A vehicle detection system based on Haar and Triangle features, in IEEE Intelligent Vehicles Symposium, June 2009.
[20] A. Shashua, Y. Gdalyahu and G. Hayun, Pedestrian detection for driving assistance systems: single-frame classification and system level performance, in IEEE Intelligent Vehicles Symposium, June 2004.
[21] M. Enzweiler and D. M. Gavrila, Monocular Pedestrian Detection: Survey and Experiments, in IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 31, Issue 12, Dec. 2009.
[22] M. Irani and P. Anandan, Robust Multi-Sensor Image Alignment, in IEEE International Conference on Computer Vision (ICCV), India, January 1998.
[23] D. P. Huttenlocher and W. J. Rucklidge, A Multi-Resolution Technique for Comparing Images Using the Hausdorff Distance, technical report, Cornell University, December 1992.
[24] P. A. Viola and W. M. Wells III, Alignment by Maximization of Mutual Information, in International Journal of Computer Vision, 24(2):137-154, 1997.
[25] Z. Zhang, A flexible new technique for camera calibration, in IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330-1334, 2000.