Sensor Fusion of a CCD Camera and an Acceleration-Gyro Sensor for the Recovery of Three-Dimensional Shape and Scale

Toshiharu Mukai
Bio-Mimetic Control Research Center
The Institute of Physical and Chemical Research (RIKEN)
Nagoya, 463-0003 Japan

Noboru Ohnishi
Dept. of Information Engineering, Faculty of Engineering
Nagoya University
Nagoya, 464-8603 Japan

Abstract

Shape recovery methods from an image sequence have been studied by many researchers. These methods are theoretically sound, but they are sensitive to noise, so in many practical situations satisfactory results cannot be obtained. In addition, the scale of the recovered object cannot be obtained because of the image-projection property. To solve these problems, we propose a shape recovery method based on a sensor fusion technique. The method uses an acceleration-gyro sensor attached to a CCD camera to compensate the images.

Keywords: Recovery from images, Sensor fusion, Gyro sensor, Three-dimensional model

1 Introduction

Image changes produced by a moving camera are an important source of information on the observer's motion and the structure of the environment. These changes are represented by velocities on the image screen, called optical flow, or by point correspondences between two or more images. The recovery of three-dimensional structure and motion from an image sequence is one of the most important issues in computer vision. It can be used in many fields such as three-dimensional object modeling, tracking, passive navigation, and robot vision.

Recovery methods from an image sequence have been proposed by many researchers (for example, [1, 2, 3, 4]). Theoretically, these methods are exact, but they are very sensitive to noise, so in many practical situations satisfactory results cannot be obtained. Recently the factorization method developed by Tomasi and Kanade has attracted researchers' attention [5]. This method was proposed for orthographic projection [5] and then extended to approximations of perspective projection [6]. Good results have been reported in practical situations when the camera-model approximation is suitable. However, when the assumed camera approximation does not suit the situation, or when the amount of camera motion through the image sequence is small, the results are still not satisfactory.

There is another limitation, which comes from an image property. Under the perspective or orthographic projection widely used as a camera model, a slowly moving small object near the camera produces exactly the same image sequence as a fast-moving, proportionally larger object far from the camera. This means that the scale of the object and of the camera motion (velocity or displacement) cannot be recovered.

In order to solve these problems, we propose the use of an acceleration-gyro sensor attached to a CCD camera. We selected this sensor because it does not require any environmental setup, so the sensing system can be carried anywhere. In the following sections, we propose a method for the recovery of object shape and scale from the output of the CCD camera and the acceleration-gyro sensor. Experimental results are also shown.

2 Sensor Fusion for Obtaining Good-Quality Information

One of the causes of the difficulty in shape recovery is the fact that discriminating a small rotation from a small translation, as shown in Figure 1, is difficult

when the object's width along the optical axis is small relative to the distance between the camera and the object, because the two motions invoke similar image changes.

Figure 1: Small rotation and small translation have a similar effect on the image screen (Situation 1: translation; Situation 2: rotation).

When we study animals, we find that many of them control their eye motion so as to obtain better visual information. For example, the vestibulo-ocular reflex makes it possible for us to obtain stabilized images on the retina. This reflex rotates our eyeballs so as to cancel rapid head motions, using information on the head's rotation obtained from the three semicircular canals [7]. It is also reported that, when flying, insects control the direction of their bodies so as to obtain visual information without rotation [8].

In our system, we do not control the video camera to remove rotation, in order to build a compact, inexpensive system free from mechanical problems. Instead, we process the image sequence in a computer to remove rotation, using output from the gyro sensor obtained simultaneously with the image sequence. Conceptually, we design a virtual camera, as shown in Figure 2. This virtual sensor receives input from the video camera and the gyro sensor and outputs an image sequence without rotation.

Figure 2: The virtual camera outputs an image sequence without rotation.

3 Overview of Our System

3.1 Setup

Our system consists of two sensors, a CCD camera and an acceleration-gyro sensor, and a computer for processing. The purpose of the system is the recovery of an object's three-dimensional structure, including its scale, from the sensor output. We assume the following situation: the rigid object is fixed in the environment and the sensor system moves around it; the object has feature points which can be tracked through an image sequence, and its structure is determined by the three-dimensional feature point positions.

The acceleration-gyro sensor (GU-3011 by Data Tec), mounted on the CCD camera as shown in Figure 3, is used to compensate the CCD camera. It consists of 3 vibration gyroscopes and 3 acceleration sensors in a cube with sides 36 mm long, and outputs 3-axis acceleration, 3-axis angular velocity and 3-axis rotation angle at 60 Hz. The rotation angle is obtained by integrating the angular velocity, so it drifts, even though it can be corrected to some extent by using gravity as a reference. In the present paper, we use the acceleration and angular velocity information.

Figure 3: A photograph of the CCD camera and acceleration-gyro sensor.

3.2 Four Stages in Our Method

Our shape and scale recovery method using the CCD camera and acceleration-gyro sensor consists of four stages. In the first stage, optical flow or point correspondences through the image sequence are obtained. This is usually achieved by tracking feature points in the image sequence. Many methods have been studied for this purpose, but they are not discussed here.

In the second stage, we use one of the existing shape-from-image-sequence methods, modified to use the acceleration-gyro sensor output. This stage recovers the shape of the object and the camera velocity and angular velocity at a point in time. It should be noted that the scale factor relating the camera velocity and the recovered point positions cannot be obtained in this stage.

The third stage is the integration of the recovered parameters at different time points, which are obtained in the previous stage. Three effects can be expected from this integration. When recovered points exist in more than one structure at different time points, more accurate positions can be obtained by taking their average. In addition, points which exist in only some of the recovered structures are appended to the rest of the points; that is, points occluded in some parts of the image sequence can be recovered by integration if they are viewed in other parts of the image sequence. Finally, the transition of the camera motion is obtained.

In the fourth stage, we obtain the scale relating the recovered positions and the camera velocity. Theoretically, the camera velocity is obtained by integrating the acceleration output of the acceleration-gyro sensor. In practice, however, the acceleration output includes noise, so the velocity obtained from the acceleration-gyro sensor drifts. By using both the acceleration from the acceleration-gyro sensor and the velocity with the unknown scale factor obtained from the image sequence, the scale is obtained.

3.3 Three Coordinate Systems

Vector elements depend on the coordinate system to which the vector is referred. For representing vector elements, we use three kinds of coordinate systems. The first is fixed in the world and is constant over time. We call this the base world coordinate system. Recovered structures and camera motion parameters at different time points are integrated in this coordinate system. When a vector x is represented in this coordinate system, it is denoted as x^B. The second is the camera coordinate system, which is attached to and moves with the CCD camera. When a vector x is represented in this coordinate system, it is denoted as x^C. The last coordinate system is also fixed in the world, but its position changes according to the time referred to. It is used to represent the sensor output obtained at that time. This coordinate system is positioned so as to coincide with the camera coordinate system at the moment the sensor output is obtained, and therefore it depends on time. We term this the temporary world coordinate system. When a vector x is represented in this coordinate system, it is denoted as x^W.

An example of the use of these coordinate systems is as follows. When the camera velocity is v^B, we have v^C = o because the camera coordinate system moves with the camera. The relation between the base world coordinates and the temporary world coordinates is v^W = R v^B, where R is the rotation from the base world coordinates to the temporary world coordinates. Vector coordinates do not depend on the position of the origin of the coordinate system, so they are transformed by rotation only. The output of the acceleration-gyro sensor is based on the temporary world coordinate system; that is, the acceleration a^W and the angular velocity ω^W are obtained from the sensor.
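For concreteness, the frame bookkeeping above can be sketched in a few lines of NumPy. This is only an illustration of the convention v^W = R v^B; the rotation and velocity values are arbitrary examples, not quantities from the paper.

    import numpy as np

    def rot_z(theta):
        """Rotation matrix about the z axis (illustrative helper)."""
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

    # R: rotation from the base world frame to the temporary world frame
    # at the time the sensor output is taken (arbitrary example value).
    R = rot_z(np.deg2rad(30.0))

    v_B = np.array([0.2, 0.0, 0.0])   # camera velocity in base world coordinates [m/s]
    v_W = R @ v_B                     # the same velocity in temporary world coordinates

    # Free vectors transform by rotation only; the origin offset of the frame
    # does not enter, which is why no translation term appears here.
    print(v_W)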
4 The Recovery of Object Shape and Camera Motion from an Image Sequence

The recovery of object shape and camera motion from an image sequence at a point in time has been studied by many researchers. We modify one of these methods so that the acceleration-gyro sensor output can be used to compensate the images, and then use it in our system. In this section, we briefly introduce the method we proposed previously; the details are reported in [4].

Assume that we observe a point on the object at times t and t + δt. We denote the unit vector from the camera center to the point as q^W and the camera translation as δu^W, as shown in Figure 4. Then we obtain

    (q^W(t + δt) × q^W(t)) · δu^W = 0,                                           (1)

because q^W(t), q^W(t + δt) and δu^W lie in the same plane.

Figure 4: Relationship of camera positions before and after an infinitesimal time lapse.

Taking δt → 0 and using the relation (which can easily be proved)

    q̇^W = q̇^C + ω^W × q^C,                                                      (2)

we obtain

    ((q̇^C + ω^W × q^C) × q^C) · v^W = 0.                                          (3)

By arranging the above equation over n (≥ 8) points, we obtain

    G θ = 0,                                                                     (4)

where

    θ = [v^W_1 | v^W_2 | v^W_3 | ω^W_1 v^W_1 | ω^W_2 v^W_2 | ω^W_3 v^W_3 |
         ω^W_1 v^W_2 + ω^W_2 v^W_1 | ω^W_2 v^W_3 + ω^W_3 v^W_2 | ω^W_3 v^W_1 + ω^W_1 v^W_3]^T    (5)

and G is an n × 9 matrix composed only of observed values. The ith row of G is

    [q̇^C_{i,2} q^C_{i,3} − q̇^C_{i,3} q^C_{i,2} | q̇^C_{i,3} q^C_{i,1} − q̇^C_{i,1} q^C_{i,3} | q̇^C_{i,1} q^C_{i,2} − q̇^C_{i,2} q^C_{i,1} |
     (q^C_{i,1})^2 − 1 | (q^C_{i,2})^2 − 1 | (q^C_{i,3})^2 − 1 |
     q^C_{i,1} q^C_{i,2} | q^C_{i,2} q^C_{i,3} | q^C_{i,3} q^C_{i,1}],                            (6)

where the dot over a variable denotes its time derivative and q^C_{i,j} is the jth element of q^C_i, the unit vector from the camera center to the ith point.

By finding a nontrivial θ ≠ 0 from (4), the camera velocity v^W up to scale and the angular velocity ω^W are obtained. We select the unit vector v̂^W_s as the recovered camera velocity direction and denote the recovered camera angular velocity as ω̂^W. The positions of the observed points are then recovered as

    x^W_i = −s [v̂^W_s · (q̇^C_i + ω̂^W × q^C_i)] / ‖q̇^C_i + ω̂^W × q^C_i‖^2 · q^C_i,               (7)

where s is the unknown scale factor and v^W = s v̂^W_s if noise is absent.
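As a concrete illustration of equations (4)-(7), the sketch below builds G from measured unit view vectors and their time derivatives, extracts a nontrivial θ as the right singular vector of G with the smallest singular value, and reads off the velocity direction, the angular velocity, and the point positions up to the unknown scale s. This is a minimal NumPy sketch under our reading of the reconstructed equations, not the authors' implementation; all variable names (q_C, q_C_dot, etc.) are illustrative.

    import numpy as np

    def recover_motion_and_shape(q_C, q_C_dot, s=1.0):
        """Sketch of eqs. (4)-(7): q_C, q_C_dot are (n, 3) arrays of unit view
        vectors and their time derivatives in camera coordinates, n >= 8."""
        n = q_C.shape[0]
        G = np.zeros((n, 9))
        for i in range(n):
            q, qd = q_C[i], q_C_dot[i]
            G[i, 0:3] = np.cross(qd, q)                      # first three entries of (6)
            G[i, 3:6] = q**2 - 1.0                           # (q_j)^2 - 1 terms
            G[i, 6:9] = [q[0]*q[1], q[1]*q[2], q[2]*q[0]]    # cross terms
        # Nontrivial theta: right singular vector for the smallest singular value.
        _, _, Vt = np.linalg.svd(G)
        theta = Vt[-1]
        v_hat = theta[:3] / np.linalg.norm(theta[:3])        # camera velocity direction
        # omega enters theta[3:] linearly once v (up to scale) is known; the
        # overall scale of theta cancels, so omega is recovered absolutely.
        v = theta[:3]
        A = np.array([[v[0], 0, 0], [0, v[1], 0], [0, 0, v[2]],
                      [v[1], v[0], 0], [0, v[2], v[1]], [v[2], 0, v[0]]])
        omega, *_ = np.linalg.lstsq(A, theta[3:], rcond=None)
        # Point positions up to the unknown scale s, eq. (7).
        X = np.zeros((n, 3))
        for i in range(n):
            d = q_C_dot[i] + np.cross(omega, q_C[i])
            X[i] = -s * (v_hat @ d) / (d @ d) * q_C[i]
        return v_hat, omega, X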

5 Using the Acceleration-Gyro Sensor for Compensating an Image Sequence

In this section, we describe a modification of the shape and motion recovery method described in the previous section. From the acceleration-gyro sensor, ω^W is obtained. By substituting this into (2), we obtain q̇^W, which can be considered as the output of the virtual camera in Figure 2. The virtual camera output when observing m points (m ≥ 2) yields

    H v^W = o,                                                                   (8)

where H is the m × 3 matrix whose ith row is

    q̇^W_i × q^C_i.                                                               (9)

H can be determined from observed values only, and this equation is obtained from (3). By finding a nontrivial v^W ≠ 0 from this equation, we obtain the velocity up to scale. This velocity is expected to be better than the previous one because the number of degrees of freedom in the equation is smaller.

In practice, m is usually much larger than 3 and H is disturbed by noise, so the matrix has full rank 3; hence the equation is ill-conditioned for obtaining v^W ≠ 0. The SVD (singular value decomposition) is suitable for solving it. By the SVD, H is decomposed as

    H = U Σ V^T = [u_1 | u_2 | u_3] diag{σ_1, σ_2, σ_3} [v_1 | v_2 | v_3]^T,     (10)

where U and V are orthonormal matrices and σ_1 ≥ σ_2 ≥ σ_3. Then v_3 is adopted as v̂^W_s. Point positions can be determined by (7).
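The gyro-compensated variant is small in code: rotate the measured flow into the temporary world frame with the gyro output via (2), stack the cross products of (9), and take the right singular vector of H belonging to the smallest singular value as in (10). Again this is a hedged NumPy sketch of our reading of equations (8)-(10) with illustrative names, not the authors' code.

    import numpy as np

    def recover_velocity_with_gyro(q_C, q_C_dot, omega_W):
        """Sketch of eqs. (2), (8)-(10): q_C, q_C_dot are (m, 3) arrays (m >= 2),
        omega_W is the 3-vector angular velocity from the gyro sensor."""
        # Eq. (2): rotation-compensated derivative, i.e. the "virtual camera" output.
        q_W_dot = q_C_dot + np.cross(omega_W, q_C)
        # Eq. (9): each row of H is q_W_dot_i x q_C_i.
        H = np.cross(q_W_dot, q_C)
        # Eq. (10): SVD; the right singular vector with the smallest singular
        # value is the recovered velocity direction (up to sign and scale).
        _, _, Vt = np.linalg.svd(H)
        return Vt[-1]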

6 Integration of Recovered Structures and Motion Parameters at Different Time Points

The integration of recovered structures at different time points is expected to improve accuracy, recover occluded points, and clarify the transition of the camera motion. However, the object structures and camera motion parameters at different time points are obtained with respect to different temporary world coordinate systems. Their scales also differ, because they cannot be determined by the method of the previous stage. Hence we cannot simply integrate the recovered structures.

This problem can be solved as follows. The object shapes are the same even though the coordinate systems are different. Therefore we can determine the (relative) scaling, rotation and translation transformations that make the transformed structures overlap each other. We take the structure at the first time point as the base structure and find the transformation to this structure. This means that the temporary world coordinate system at the first time point is used as the base world coordinate system.

We denote the recovered position of point i with respect to the temporary world coordinate system at time k as x^k_i; the superscript W is dropped in this section for conciseness. The transformed point position x'^k_i obtained from x^k_i is defined as

    x'^k_i = s_k R_k x^k_i + t_k,                                                (11)

where s_k, R_k and t_k are the scaling, rotation and translation from the structure at time k to the base structure. In order to obtain s_k, R_k, t_k, we minimize

    E^k_1(s_k, R_k, t_k) = (1/2) Σ_i ‖x̃_i − x'^k_i‖^2,                            (12)

where x̃_i is the position of point i in the base structure. In practice, the translation t_k is obtained from ∂E^k_1/∂t = 0 as

    t_k = g̃ − s_k R_k g_k,                                                       (13)

where g̃ and g_k are the centroids of the x̃_i and the x^k_i. Hence we minimize

    E^k_1(s_k, R_k) = (1/2) Σ_i ‖x̃_i − g̃ − s_k R_k (x^k_i − g_k)‖^2.               (14)

In our implementation, we used the conjugate gradient method to minimize this function numerically. Using the s_k, R_k, t_k obtained above, we can obtain a better object structure and camera motion as follows (a code sketch of the alignment step follows this list).

Object structure: By averaging the structures transformed with the scaling, translation and rotation, accuracy is improved. If a corresponding point does not yet exist in the integrated structure, it is appended to the integrated structure.

Camera velocity: Transforming with the scaling only, v^W(t) is recovered. Transforming with the scaling and rotation, v^B(t) is recovered.

Camera angular velocity: From the recovery at a point of time, ω^W(t) is recovered; transforming with the rotation, ω^B(t) is recovered.

Camera position: In the recovery at a point of time, the camera center is assumed to be at the origin of the temporary world coordinate system. Using the scaling, translation and rotation information, the transition of the camera center position and direction in the base world coordinate system is obtained.
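For reference, the least-squares alignment problem (12)-(14) also has a well-known closed-form solution (the Umeyama/Procrustes solution) that avoids iterative minimization. The sketch below uses that closed form instead of the paper's conjugate-gradient implementation, so it is an alternative way to obtain s_k, R_k, t_k rather than a reproduction of the authors' code; the function and array names are illustrative.

    import numpy as np

    def align_structure(X_k, X_base):
        """Closed-form similarity alignment: find s, R, t minimizing
        sum_i || X_base[i] - (s * R @ X_k[i] + t) ||^2 (cf. eqs. (11)-(14)).
        X_k, X_base are (N, 3) arrays of corresponding recovered points."""
        g_k = X_k.mean(axis=0)            # centroid of the structure at time k
        g_b = X_base.mean(axis=0)         # centroid of the base structure
        Xc, Bc = X_k - g_k, X_base - g_b
        # Cross-covariance between the centered point sets.
        Sigma = Bc.T @ Xc / X_k.shape[0]
        U, D, Vt = np.linalg.svd(Sigma)
        S = np.eye(3)
        if np.linalg.det(U) * np.linalg.det(Vt) < 0:
            S[2, 2] = -1.0                # keep R a proper rotation
        R = U @ S @ Vt
        var_k = (Xc ** 2).sum() / X_k.shape[0]
        s = np.trace(np.diag(D) @ S) / var_k
        t = g_b - s * R @ g_k
        return s, R, t

    # Usage: transform the structure at time k into the base frame and average it
    # with the integrated structure, e.g.  X_aligned = s * (R @ X_k.T).T + t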
7 Determining the Scale of Structure and Velocity

In this section, we denote the camera velocity obtained from the image sequence as v^W_I(t), that obtained from the acceleration-gyro sensor as v^W_G(t), and the true camera velocity as v^W_T(t). Then, if noise is absent,

    v^W_T(t) = s v^W_I(t),                                                       (15)

where s is the unknown scale factor. Therefore, if the relation between v^W_T(t) and v^W_G(t) is known, we can determine s from the above equation.

Theoretically, v^W_G(t) is obtained by integrating the acceleration a^W(t) from the acceleration-gyro sensor if the initial value is known. It is formulated as

    v^W_G(t) = R(t) { ∫_{t_0}^{t} {R^{-1}(τ) a^W(τ) − g^B} dτ + v^B(t_0) },        (16)

where v^B(t_0) is the initial velocity and R(t) is the rotation from the base world coordinates to the temporary world coordinates; R(t) can be obtained from the acceleration-gyro sensor output ω^W(t) if the initial value R(t_0) is known. In practice, however, the acceleration-gyro sensor output includes noise, so v^W_G(t) drifts. Hence the following relation holds:

    v^W_T(t) = v^W_G(t) + b(t).                                                   (17)

Here b(t) represents the effect of drift and of the unknown initial value. The change of b(t) comes from the drift, so we can assume that its change over a short time is small. From discretization of the above equations, we obtain

    s v^{W,k}_I = v^{W,k}_G + b^k,                                                (18)

where v^{W,k}_I is v^W_I at time k (k = 0, 1, ..., K) and so on, and the changes of b^k along k are small. We minimize the following function to obtain s:

    E_2(s, {b^k}) = (1/2) Σ_{k=0}^{K} ‖s v^{W,k}_I − (v^{W,k}_G + b^k)‖^2
                  + (λ/2) Σ_{k=0}^{K} ‖2 b^k − b^{k−1} − b^{k+1}‖^2,              (19)

where b^{−1} = b^{K+1} = 0 and λ is some positive value weighting the smoothness term.
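Because (19) is quadratic in s and the b^k, one straightforward way to minimize it is to stack everything into a single linear least-squares problem. The paper does not state which solver was used, so the sketch below is only one plausible realization, with illustrative names.

    import numpy as np

    def estimate_scale(v_I, v_G, lam=1.0):
        """Minimize eq. (19) for the scale s and the drift terms b^k.
        v_I, v_G: (K+1, 3) velocity sequences from images and from the sensor."""
        K1 = v_I.shape[0]                       # K + 1 samples
        n_unknowns = 1 + 3 * K1                 # s followed by b^0 ... b^K
        rows, rhs = [], []
        # Data terms: s * v_I^k - b^k = v_G^k  (one row per vector component).
        for k in range(K1):
            for d in range(3):
                r = np.zeros(n_unknowns)
                r[0] = v_I[k, d]
                r[1 + 3 * k + d] = -1.0
                rows.append(r)
                rhs.append(v_G[k, d])
        # Smoothness terms: sqrt(lam) * (2 b^k - b^{k-1} - b^{k+1}) = 0,
        # with b^{-1} = b^{K+1} = 0 handled by omitting those columns.
        w = np.sqrt(lam)
        for k in range(K1):
            for d in range(3):
                r = np.zeros(n_unknowns)
                r[1 + 3 * k + d] = 2.0 * w
                if k > 0:
                    r[1 + 3 * (k - 1) + d] = -w
                if k < K1 - 1:
                    r[1 + 3 * (k + 1) + d] = -w
                rows.append(r)
                rhs.append(0.0)
        z, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
        return z[0], z[1:].reshape(K1, 3)       # s, drift terms b^k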

8 Experiments

8.1 Experimental Environment

We used a cube of known size, shown in Figure 5, for examining the recovery errors. The cube has sides 20 cm long. The acceleration-gyro sensor output is obtained at 60 Hz via a serial connection. The CCD camera has a lens with a focal length of 8 mm, and its output (640 × 240 pixels) is captured at 15 frames/sec synchronously with the acceleration-gyro sensor output. The CCD camera's intrinsic parameters were obtained in a preliminary calibration. The CCD camera was moved by hand, so we know only the rough trajectory of the camera motion.

Figure 5: Photograph of the object with sides 20 cm long.

The interval between the two images used for obtaining optical flow is determined automatically by a certain method, which we do not describe here for lack of space. In order to examine the accuracy of the recovery, we found the rotation, translation and scaling (if needed) from the recovered structure to the actual structure, because the recovered structures are related to a coordinate system different from that of the actual structure. We adopted the RMS (root mean square) of the distances between actual and recovered points as the recovery error.

8.2 Experiment 1: Simple and Short Camera Motion

In experiment 1, the CCD camera was moved over a short period along an almost straight path, as shown in Figure 6. The lengths mentioned in this figure are rough estimates, as explained above.

Figure 6: Camera motion in experiment 1 (figure labels: 65 cm, 1.5 m, object 20 cm, CCD camera moves along this trajectory).

We obtained 27 frames (in 1.68 sec) during the motion. When the acceleration-gyro sensor is not used, the optical flow needed to obtain results must be large; in this case, only two structures were recovered and the average error was 9.4 cm. When the acceleration-gyro sensor output was used, 13 structures were recovered and the average error was 3.1 cm. The accuracy was much improved by using the acceleration-gyro sensor.

In Figure 7, the errors of each recovered structure and of the integrated structures obtained when the acceleration-gyro sensor output was used are plotted. The error of the integration result at index i is the result of integrating structures 0 to i. The figure shows that the integration of recovered structures at different time points improves the accuracy.

Figure 7: Errors of recovered structures using the acceleration-gyro sensor in experiment 1 (axes: error [cm] versus number of recovered structures).

In Figure 8, the relation between ‖v^W_I‖ and ‖v^W_G‖ is plotted. The plotted points lie almost on the same line through the origin; because the motion finished in a short period, the drift of v^W_G was small. To determine s, (19) with λ = 1 was used. The results are shown in Table 1, where the point numbers specifying the sides are shown in Figure 5.

Figure 8: Velocities obtained from the image sequence and acceleration-gyro sensor output in experiment 1 (axes: ‖v^W_G‖ [m/s] versus ‖v^W_I‖).

Table 1: Recovered length and actual length in experiment 1

    Side     Recovered [cm]    Actual [cm]
    0-1      82.7              100
    1-2      91.9              100
    2-3      94.2              100
    3-4      80.8              100
    6-12     90.0              100
    12-13    82.9              100
    7-8      45.2              50
    1-9      58.0              50
    5-18     44.5              50
    12-17    49.0              50

8.3 Experiment 2: Complex Camera Motion

A more complex motion over a longer period was adopted in experiment 2. In this experiment, the CCD camera moved around the object as shown in Figure 9, and we obtained 94 frames (in 5.64 sec) during the motion.

Figure 9: Camera motion in experiment 2 (figure labels: 1 m, 60, object 20 cm).

When the acceleration-gyro sensor was not used, 6 structures were recovered, as shown in Figure 10; in this case, the average of the errors before the integration of recovered structures was 3.2 cm.

Figure 10: Errors of recovered positions without the acceleration-gyro sensor in experiment 2 (axes: error [cm] versus number of recovered structures).

In Figure 11, the results obtained when the acceleration-gyro sensor output was used are plotted; here 20 structures were recovered. The figure shows that the integration of the recovered structures at different time points improves the accuracy. In this case, the average of the errors before integration was 3.3 cm. This is slightly worse than the case without the acceleration-gyro sensor, but a larger number of structures was recovered, so the integrated structure was better than the result without the acceleration-gyro sensor. However, in this experiment, where the camera motion is complex, we could not obtain a reliable scale. A part of the results is shown in Table 2. We need further study to improve the accuracy.

Figure 11: Errors of recovered positions using the acceleration-gyro sensor in experiment 2 (axes: error [cm] versus number of recovered structures).

Table 2: Recovered length and actual length in experiment 2

    Side     Recovered [cm]    Actual [cm]
    0-1      60.1              100
    1-2      64.6              100
    5-18     31.9              50
    12-17    30.4              50

The recovered structure obtained using the acceleration-gyro sensor output, projected to new screen positions, is shown in Figure 12. It is displayed using a wire frame or texture mapping; in the wire-frame image, points are connected by lines to show the structure clearly.

Figure 12: Recovered structure using the acceleration-gyro sensor in experiment 2.

9 Conclusion

We have proposed a method for shape and scale recovery using a CCD camera and an acceleration-gyro sensor.

We modified the method we previously proposed so as to use both the CCD camera and the acceleration-gyro sensor. In the experiments, the improvement of the recovered structure was verified. However, the recovered scales are not very reliable when the camera motion is complex. As a next step, we will try to improve the accuracy of our method; in particular, improvement of the scale recovery is necessary.

References

[1] R. Y. Tsai and T. S. Huang. Uniqueness and Estimation of Three-Dimensional Motion Parameters of Rigid Objects with Curved Surfaces. IEEE Trans. Pattern Anal. Machine Intell., 6(1):13-27, 1984.

[2] T. S. Huang and A. N. Netravali. Motion and Structure from Feature Correspondences: A Review. Proc. of the IEEE, 82(2):252-268, 1994.

[3] J. K. Aggarwal and C. H. Chien. 3-D Structure from 2-D Images. In Advances in Machine Vision (J. L. C. Sanz, Ed.), 64-121, 1989.

[4] T. Mukai and N. Ohnishi. Motion and Structure from Perspectively Projected Optical Flow by Solving Linear Simultaneous Equations. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'97), 740-745, 1997.

[5] C. Tomasi and T. Kanade. Shape and Motion from Image Streams under Orthography: a Factorization Method. International Journal of Computer Vision, 9(2):137-154, 1992.

[6] C. J. Poelman and T. Kanade. A Paraperspective Factorization Method for Shape and Motion Recovery. IEEE Trans. Pattern Anal. Machine Intell., 19(3):206-218, 1997.

[7] O. Coenen and T. J. Sejnowski. A Dynamical Model of Context Dependencies for the Vestibulo-Ocular Reflex. In Advances in Neural Information Processing Systems 8, MIT Press, 1996.

[8] N. Franceschini, J. M. Pichon and C. Blanes. From Insect Vision to Robot Vision. Phil. Trans. Roy. Soc. B, 337:283-294, 1992.