Three-Dimensional Data Recovery Using Image-Based Modeling

Jeremy W. Cannon (jcannon1@mit.edu), Jonathan C. Derryberry (jonderry@mit.edu), Vitaly Y. Kulikov (vkulikov@mit.edu)

6.837: Introduction to Computer Graphics
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
Final Project Report, Team 13
December 6, 2002

Abstract

Extraction of a three-dimensional model from a set of two-dimensional projections is a well-known problem in contemporary computer science. Termed image-based modeling, solutions to this problem have a number of practical applications ranging from virtual tours and image recognition to generation of physical models from image data. However, the problem remains the subject of active research, as it has not yet been solved in the general case. Although the general case has proven very challenging, there are certain special cases in which a satisfactory solution can be achieved with minimal human intervention. The following report describes our approach to a general solution to the problem of inferring geometric information from a photographic image. A detailed description of our algorithm and its implementation is provided, along with sample results demonstrating the capabilities of this approach.

I. Introduction

Since the initial work of Horn in 1970 [1], the use of photographic images for constructing physical models has evolved into a range of new disciplines in the fields of both computer graphics and computer vision. This classical work has been termed shape from shading, as it uses the reflectance equation (1) to relate the image brightness I to the surface normal N:

    I = R(p, q) = \rho (N \cdot L)    (1)

where R(p, q) is the reflectance function in terms of the surface gradient (p, q), \rho is the composite albedo, and L is the light source direction. To derive the surface normals, the radiosity at a point P on the surface of the object is given by (2):

    B(P) = \rho(P) N(P) \cdot L    (2)

where \rho(P) is the surface albedo, N is the surface normal, and L is the light source vector.
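The Lambertian model of Equations (1) and (2) is easy to exercise numerically. The sketch below (illustrative names, not the project's actual code) evaluates a pixel intensity for a given albedo, unit normal, and unit light direction, clamping self-shadowed points to zero:

```java
// Minimal sketch of the Lambertian pixel model: I = k * rho * (N . L),
// clamped at zero for points facing away from the light.
public class LambertPixel {
    // Dot product of two 3-vectors.
    static double dot(double[] a, double[] b) {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }

    // Camera constant k, albedo rho, unit surface normal n, unit light direction l.
    public static double intensity(double k, double rho, double[] n, double[] l) {
        return k * rho * Math.max(0.0, dot(n, l));
    }

    public static void main(String[] args) {
        double[] n = {0.0, 0.0, 1.0};   // normal facing the camera
        double[] l = {0.0, 0.0, 1.0};   // light along the viewing axis
        System.out.println(intensity(1.0, 0.8, n, l));   // brightest case: 0.8
    }
}
```

With the light direction reversed, the clamped dot product returns 0.0, which is the self-shadowing behavior discussed later in the report.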
Assuming the camera response is linear with respect to the surface radiosity, the intensity value of each pixel can be written as

    I(x, y) = k B(x, y) = k \rho(x, y) N(x, y) \cdot L = g(x, y) \cdot V    (3)

where k is the constant relating camera response to surface radiosity, thereby making V a vector containing elements related to both the scene lighting and the camera. Although the surface normal is not uniquely
determined in this expression, it can be obtained by assuming a convex surface. Because N is a unit normal, \rho(x, y) is simply the 2-norm of the surface vector g(x, y). Thus, N can be found as

    N(x, y) = \frac{g(x, y)}{\| g(x, y) \|_2}    (4)

A surface model can then be determined from this reference normal by recognizing that the normal can also be written as a homogeneous vector:

    N(x, y) = \frac{1}{\sqrt{1 + f_x^2 + f_y^2}} (-f_x, -f_y, 1)^T    (5)

where f(x, y) is the equation of the parameterized surface and f_x, f_y are its partial derivatives, which can then be integrated over x and y to yield the final model. Subsequent work by Chen and Williams, McMillan, Debevec, and others has spawned the field of Image-Based Modeling and Rendering (IBMR), which seeks to enhance the realism of computer graphics scenes by extracting environmental information about a scene from photographs [2, 3]. This environmental information typically goes far beyond derivation of realistic geometry to include new approaches to visibility, modeling view-dependent variations in the appearance of materials, and the extraction of more accurate lighting models for complex scenes [3]. Indeed, many of these new approaches to model generation view photographic images as measurements which can inform the realism of any given scene. Although the general concepts of shape from shading have been studied for decades, the field remains quite active as a research discipline due to the wide range of complex issues that have been uncovered as research has progressed. Examples of these complex issues include variable albedo within an object, which confounds the relationship expressed in Equation 1 [4]; interreflections, which lead to dramatically different appearance from that predicted by local lighting models [5]; and ambiguous geometries, which cannot be resolved based on shading alone [6].
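Returning to Equation (4): recovering the albedo and unit normal from the vector g(x, y) amounts to a vector normalization, since the albedo is the 2-norm of g and the unit normal is g scaled by its inverse norm. A minimal sketch (hypothetical names, not the original implementation):

```java
// Sketch of Equation (4): given g(x, y) = rho(x, y) * N(x, y) (with the
// lighting/camera factors folded into V), split g into albedo and normal.
public class NormalFromG {
    // Returns {rho, nx, ny, nz} for a given g vector.
    public static double[] split(double[] g) {
        double rho = Math.sqrt(g[0] * g[0] + g[1] * g[1] + g[2] * g[2]);
        if (rho == 0.0) return new double[] {0.0, 0.0, 0.0, 0.0};  // degenerate pixel
        return new double[] {rho, g[0] / rho, g[1] / rho, g[2] / rho};
    }

    public static void main(String[] args) {
        double[] r = split(new double[] {0.0, 0.0, 2.0});
        System.out.println(r[0] + " " + r[3]);   // albedo 2.0, normal (0, 0, 1)
    }
}
```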
In summary, a general solution to the complete extraction of three-dimensional geometric and environmental data has not been described, in part because of the complexity of the problem and in part due to the diversity of subject matter and modeling objectives held by those employing the techniques of IBMR.

II. Goals

2.1 Image-based Modeling

A variety of techniques for solving image-based modeling problems have been developed since the original methods described by Horn [1, 7]. More recent techniques include using an array of silhouettes to reconstruct object geometry, using surface curves from object profiles to create the model geometry, and using stereoscopic imaging to extract a so-called depth map of the object [8]. In this project, we aimed to reconstruct a graphical model of physical objects based on photographic images of the object. Our goals for this phase included:

- Implementing an algorithm that identifies the boundaries of the two-dimensional projection of the model and that ensures that the RGB values of pixels within the boundaries of the projection are smooth functions of their position.
- Implementing an algorithm that, given a reference normal or normals, scans the area within the boundaries of the preprocessed projection and restores the depth and the normal direction at each vertex of the generated 3D model.
- Implementing an algorithm that generates a complete model from a set of two or more partial models, where each partial model corresponds to one two-dimensional projection (only if enough time remained in the term).

2.2 Using the Generated Model

Once the model is extracted, it needs to be rendered and, if the results are desirable, exported for use in other applications. Therefore, we needed to develop a flexible interface for rendering the model as well as to include the option to convert the extracted model to a universal format. Our goals for this phase included:

- Providing the user with simple tools to control different parameters of the image-based modeling process, such as the granularity of the model (i.e., the number of pixels in the image per vertex in the model).
- Providing the user with simple tools to view the model from different directions and distances to determine the quality of the model in real time, before exporting.
- Providing the user with a simple tool to save the model into VRML and/or Open Inventor file formats for easy access from other applications (whereby the generated model could be edited as needed).

III. Achievements

3.1. General Approach

As our initial approach to this problem, we assumed a single, diffuse (Lambertian) object with a uniform albedo, illuminated by a point light source and imaged using an orthographic projection. After generating several synthetic (Open Inventor) images satisfying these constraints, we began to develop an algorithm to determine a reference normal based on intensity values within the image. In particular, the reference normal was assumed to reside at the brightest pixel of the image, pointing in the positive z direction. From this normal, the entire field of normals was derived, which could then be integrated to obtain a surface mesh. The coordinates in the mesh and their corresponding normals were then used to create 3D geometry to be rendered next to the original image.
The user could then adjust the view of the model (orientation and zoom) to inspect the model before exporting the result to an Open Inventor (*.iv) file.

[Figure 1 here: pipeline flowchart — Image Acquisition (synthetic image during the development phase) → Filtering → Determine Reference Normal(s) (with optional user-specified concavity) → Compute Field of Normals → Generate a Surface Mesh → Model Rendering → Model Export.]

Figure 1. Stepwise approach to model generation from photographic data. Synthetic images refer to idealized scenes generated using Open Inventor which were used to test our algorithm. Steps in double frames indicate those integrated into a user interface.

Once the algorithm for model extraction was validated using synthetic images, we
then applied this approach to digital photographs obtained from scenes designed to meet the above constraints as closely as possible, so that there were no sharp edges, specular color components, or variations in surface color. We then applied this approach to a range of basic objects to test the robustness and stability of the algorithm. This sequence of steps is summarized in Figure 1, and the results of our analysis are presented below.

3.2. Physical Models & Lighting

Synthetic images for initial development and testing of our algorithm were obtained using SceneViewer. Models of single, monochromatic, diffuse objects (with the complexity node increased to eliminate surface irregularities) were illuminated with a single directional light source and presented with an orthographic projection. The SceneViewer window was then converted to a *.jpg image with minimal compression using XV 3.10a. Re-creating this environment using physical models required some approximations; however, the setup shown in Figure 2 gave images which were acceptable for use by our algorithm. In this setup, a directional light source is simulated by a single incandescent 100 W clear bulb placed 6 feet from the scene, thus permitting the assumption that the light rays were nearly parallel when they hit the objects. All images were obtained with only this light source illuminating the scene, which was contained in a chamber lined with black felt to further reduce any contribution from ambient light. A diffuse surface was achieved by using cardboard models or models made of modeling clay with a slightly roughened surface. All of these modeled objects were monochromatic with a uniform or nearly uniform albedo. The most difficult constraint to approximate was an orthographic projection, as we had to balance camera resolution against separation from the viewing chamber. A reasonable compromise between these parameters was achieved by fixing the camera viewpoint at 18 inches from the chamber with no zoom.
All of our objects were no more than 3 inches in diameter and were centered in the camera field of view, thereby giving a close approximation to an orthographic projection.

[Figure 2 here: schematic of the acquisition setup — a light source of radius ε at a distance d >> ε from the camera and the object chamber.]

Figure 2. Setup for image acquisition with a simulated point light source (an incandescent bulb placed far from the object) and an approximately orthographic projection. The black box containing the object represents a felt-lined box designed to minimize ambient light.

3.3. Image Pre-processing

Like estimation of surface curvature from photographic images, estimation of surface normals is highly noise sensitive [9]. To ensure the best possible model estimation, we implemented a set of image filters in our Java user interface to smooth the intensity curves while preserving the underlying shape as much as possible. In the frequency domain, we assumed that geometry causes low-frequency variations in image intensity, while detailed features of an image, such as edges and textures, and noise are generally
higher-frequency components of an image [10]. On this basis, we implemented several types of lowpass filters for the user to select for the purpose of minimizing the image noise. These filters included a Gaussian kernel filter, a mean filter (which uses a normalized uniform kernel), a median filter, and a minimum filter. For the latter three filters, the user inputs the size by setting the window radius in a dialog box. Sample results of the average filter are shown in Figure 3.

Figure 3. Intensity profiles for a green sphere showing the native intensity values for the green channel (A) and the filtered intensity values using a 50x50 average filter (B).

Our incorporation of the median filter is based on work by Tsai and Shah [4]; it sets the intensity of pixel (i, j) to the median value of the neighboring n x n pixels. The minimum filter assigns each pixel the smallest intensity value in the surrounding n x n pixels. Finally, the user has the option to skip the filtering step by selecting the blank filter prior to the model creation step.

3.4. Determining surface normals from shading values

Using a monocular viewpoint and a directional light source permits extraction of only a partial model of 3-D objects due to self-shadowing. However, under the constraints outlined above, this partial model can be rather convincing, and by combining multiple partial models from registered images, the complete three-dimensional geometry can be reproduced. This section presents our approach to determining the surface normals of an object based on image intensity values, from which a single partial model of the object is generated.

3.5. Model creation

a. Preliminaries

Calculating the field of normal vectors over a surface proves a difficult problem because concavity/convexity can vary across the photographed object, and the result is an ambiguous picture.
For example, a bowl and a sphere may produce the same light intensities at each point but are obviously different shapes. More generally, the concavity can differ in any arbitrary direction. For instance, a surface can be concave in the x direction while being convex in the y direction. One approach to resolving this ambiguity would be to use many pictures of the object from different viewpoints. However, given the time constraints, the complexity of such an algorithm was determined to be too great. Moreover, extracting the model from a single image is an interesting problem in and of itself. Therefore, we made a design decision to use just one picture but assume convexity everywhere. Obviously, this would excessively constrain the range of geometry the algorithm could successfully extract, so a set of tools was implemented in the user interface that allows the specification of regions in
which the object is concave. To specify concavity, the user can draw, move, and delete any number of polygonal regions to indicate that a particular region of the screen has concave underlying geometry. Moreover, the user has the freedom to make such specifications applicable only to a particular direction, either x or y. Thus, the user can specify certain regions that are concave in the x direction while specifying other regions to be concave in the y direction. For convenience, the user is allowed to specify concavity in both the x and y directions with a single polygon. Even with such flexible concavity specification, there are pathological objects whose concavity cannot be specified. Consider the graph of f(x, y) = xy, which is linear in both the x and y directions. However, giving the user additional freedom to specify arbitrary convexity would burden the user with too many choices. Also, it would complicate the model extraction algorithm. Therefore, the user was only allowed to specify the concavity in the x and y directions.

b. Mesh generation

Equation (5) describes the relationship between surface normals and a parameterized equation of the surface, which can be used to generate a mesh to reconstruct the geometry of the object. The algorithm that we use to reconstruct the surface consists of two separate steps. During step one, a separate partial 3D model is generated for each reference normal. During step two, the partial 3D models from the previous step are processed to form a more precise, average, partial 3D model of the object in question. Let us consider each of these steps separately. During step one, a separate partial 3D model is generated for each given reference normal. The process of generating a partial 3D model can in turn be subdivided into two stages. During the first stage, the mesh of surface normals is recovered using the reference normal and information about light intensity at each point of the surface.
During the second stage, the field of normals from the previous step is used to recover the z value, or depth, at each vertex in the mesh. While recovering the z-coordinate of each vertex in the mesh is a comparatively simple procedure once the field of normals is built, restoring the field of normals proved to be challenging. Given a reference normal at the most illuminated point of the image, the geometric set of vectors that satisfy the illumination equation above forms a cone. Among those vectors we need to choose the one that satisfies the convexity requirement at the part of the surface in question and that is consistent with the directions of nearby normals. We are considering only relatively smooth surfaces and can therefore expect the change between any two normals close to each other to be small. To determine the normal at a particular vertex in the mesh, we sample the intensity values of nearby pixels to find the direction in which the absolute change in intensity is largest. Of course, this direction cannot always be determined precisely due to a certain amount of noise, but it can be determined well enough to avoid large errors in the normal direction. Moreover, this direction is not necessarily the direction in which the normal should point (e.g., consider the case of a cylinder slanted in the z direction). However, in practice this assumption proved to generate reasonable normals for a wide family of objects. Once we determine the direction of the maximum absolute change in intensity, we can reduce the number of potential normals to the two that lie within the vertical plane containing that direction. Then, to choose between the two normals, we only need to check which normal is consistent both with the convexity of the surface in the region and with the direction of already-computed normals nearby.
This can be done by analyzing how parallel the direction of the maximum absolute change in intensity is to the X- and Y-axes and to the direction of the closest neighboring vertices in the mesh. The geometry of this portion of our algorithm is shown in Figure 4. Once we have the field of normals, the process of recovering the z-coordinate of each vertex in the mesh is simple. If we assume that the value of the unit normal at some point (x, y) is (a(x, y), b(x, y), c(x, y)) and Z(x, y) is the function of the surface, then it is easy to see that:

    \frac{\partial z}{\partial x} = -\frac{a(x, y)}{c(x, y)} \quad and \quad \frac{\partial z}{\partial y} = -\frac{b(x, y)}{c(x, y)}    (6)
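Equation (6) translates directly into code. A minimal sketch (hypothetical names, not the project's actual implementation, assuming the normal is oriented toward the camera so that c is nonzero):

```java
// Sketch of Equation (6): given a unit normal (a, b, c) at a vertex,
// the surface slopes are dz/dx = -a/c and dz/dy = -b/c.
public class NormalToSlope {
    // Returns {dz/dx, dz/dy} for normal (a, b, c); c must be nonzero,
    // i.e. the surface is not viewed edge-on at this vertex.
    public static double[] slopes(double a, double b, double c) {
        return new double[] {-a / c, -b / c};
    }

    public static void main(String[] args) {
        // A 45-degree ramp rising along x has unit normal
        // (-1, 0, 1) / sqrt(2), so dz/dx should be 1 and dz/dy should be 0.
        double s = Math.sqrt(2.0);
        double[] g = slopes(-1.0 / s, 0.0, 1.0 / s);
        System.out.println(g[0] + " " + g[1]);
    }
}
```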
[Figure 4 here: schematic in the x, y, z axes relating the image intensity I(x, y) to the candidate normals.]

Figure 4. Geometric relationship between the reference normal (red) and a neighboring point where the normal at that point (x, y) is being computed. One of two possible normals is identified (black), which our algorithm then evaluates to determine which is correct.

As a result, if we assume that the z-value at some reference point P_r is 0.0, the z-value at any other point P_d will be a line integral of (\partial z / \partial x, \partial z / \partial y) along some path from P_r to P_d. In theory, the z-values that we get from different paths should be the same. However, because the field of normals that we built is not completely error-free, the z-values obtained along different paths differ slightly. To keep them as close to correct as possible, we compute the z-value at each point along multiple paths and then take the average. If we assume that all errors are independent random variables, taking the average should reduce the amount of error. Once a separate 3D model has been computed for each of the given reference normals, we need to process these partial models to build a more precise, average model. There are many ways to do this. One possibility is to choose some reference normal M, and the coordinate system associated with it, as the main one; take the z-values of the origins associated with the other reference normals; and for each vertex compute its average depth using the following formula:

    Z_M(x, y) = \frac{[Z_{M1} + Z_1(x, y)] + [Z_{M2} + Z_2(x, y)] + \cdots + [Z_{MN} + Z_N(x, y)]}{N}    (7)

In (7), Z_{MJ} stands for the z-value of the origin of reference normal J in the main coordinate system, Z_J(x, y) denotes the z-value at the point (x, y) of the surface in the coordinate system associated with reference normal J, and N denotes the total number of reference normals. The algorithm described above is the final result of the team's multiple attempts to find the best way of solving the problem.
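The path-averaged integration described above can be sketched as follows (hypothetical names; a real implementation would average over more than two paths). Depth is accumulated along two axis-aligned paths from the reference vertex, and the two estimates are averaged:

```java
// Sketch of recovering z from per-vertex slopes p = dz/dx, q = dz/dy by
// integrating along two paths from the reference vertex (0, 0) to (x, y):
// path 1 goes along the row first, then the column; path 2 the reverse.
// Averaging the two estimates reduces the effect of independent errors.
public class DepthFromSlopes {
    // p[y][x], q[y][x] hold dz/dx and dz/dy on a unit-spaced grid.
    public static double depth(double[][] p, double[][] q, int x, int y) {
        double z1 = 0.0;                        // row-first path
        for (int i = 1; i <= x; i++) z1 += p[0][i];
        for (int j = 1; j <= y; j++) z1 += q[j][x];
        double z2 = 0.0;                        // column-first path
        for (int j = 1; j <= y; j++) z2 += q[j][0];
        for (int i = 1; i <= x; i++) z2 += p[y][i];
        return 0.5 * (z1 + z2);                 // average the two estimates
    }

    public static void main(String[] args) {
        // Plane z = 2x + 3y: p = 2 and q = 3 everywhere, so z(2, 1) = 7.
        double[][] p = new double[2][3];
        double[][] q = new double[2][3];
        for (int j = 0; j < 2; j++)
            for (int i = 0; i < 3; i++) { p[j][i] = 2.0; q[j][i] = 3.0; }
        System.out.println(depth(p, q, 2, 1));  // prints 7.0
    }
}
```

On a noise-free slope field both paths agree exactly; with noisy normals, each additional independent path tightens the averaged estimate, which is the rationale given above.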
Many of the algorithms that we considered first proved to be unsuccessful. For instance, there are many possible approaches to building a field of normals using a reference normal and the light intensity at each point of the surface. One way would be to average two or more nearby normals that have already been computed and then choose whichever of the two potential candidate normals forms the smallest angle with that average. This algorithm is very simple to implement, is much less sensitive to the order in which different vertices within the mesh are processed, and does not involve many computations. Unfortunately, while the algorithm worked perfectly on non-flat surfaces, it made too many errors on flat ones, where the two normals to choose from at each vertex were very close to each other.
Another version of the algorithm that we implemented but eventually rejected was similar to our final algorithm but used only one of the nearby normals. If the already-computed nearby normal belonged to a vertex located along the X-axis relative to the vertex being processed, we used our knowledge about the surface convexity in the X direction; otherwise, we used our knowledge about the surface convexity in the Y direction to choose between the two possible normals at the vertex. This algorithm was also less sensitive to the order in which different vertices in the mesh were processed, and it worked on most of the surfaces we considered. However, it did not work on cylinder-like surfaces, where there can be no change in the normal direction along one axis but a large change along the other.

3.6. Model viewing

Following computation of the parametric surface equation and mesh fitting, the user can manipulate the resulting model in a number of ways. These adjustments include rotated viewpoints and altered magnification (zoom), as in the SceneViewer application. To display the 3D geometry, Java3D was used. A simple interface was provided for the model extraction algorithm to satisfy: the extraction algorithm was required to provide a two-dimensional array of points for the coordinates and a two-dimensional array of per-vertex normals, which were used for Gouraud shading a TriangleStripArray built from the points. To allow the user flexibility in viewing the model, a key listener was added so that the user could rotate and recenter the model in addition to using the buttons provided for those functions.

IV. Description of Deliverables

The following figures demonstrate the abilities of our integrated model extraction system, which takes a single image as input and derives a partial model of the three-dimensional geometry, which it then renders in an adjacent window for viewing by the user.
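As an aside on the mesh construction of §3.6: a triangle strip built from a grid of mesh points consumes vertices in a zig-zag order, one strip per pair of adjacent rows. The sketch below (plain Java with illustrative names, not the Java3D API itself) shows one common row-pair ordering for an H-row by W-column grid under the assumption of row-major vertex indices:

```java
// Sketch of the vertex ordering fed to a triangle strip built from an
// H-row by W-column grid of mesh points (row-major indices). Each pair of
// adjacent rows becomes one strip of 2 * W vertices, alternating between
// the upper and lower row, which yields 2 * (W - 1) triangles per strip.
public class GridStrips {
    // Returns one array of vertex indices per strip.
    public static int[][] stripIndices(int w, int h) {
        int[][] strips = new int[h - 1][2 * w];
        for (int row = 0; row < h - 1; row++) {
            for (int col = 0; col < w; col++) {
                strips[row][2 * col] = row * w + col;            // upper row
                strips[row][2 * col + 1] = (row + 1) * w + col;  // lower row
            }
        }
        return strips;
    }

    public static void main(String[] args) {
        int[][] s = stripIndices(3, 2);   // one strip: 0 3 1 4 2 5
        StringBuilder sb = new StringBuilder();
        for (int i : s[0]) sb.append(i).append(' ');
        System.out.println(sb.toString().trim());
    }
}
```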
Final adjustments can then be made using the control buttons on the user interface before the model is exported.

Figure 5. User interface showing a filtered picture of a cylinder (left) and the resulting partial model (right) constructed using the algorithm described above.
[Figure 6 here: two panels, A and B.]

Figure 6. Demonstration of the effect of image filtering on model generation. (A) Partial model extracted from an unfiltered photograph of an egg. (B) Improved partial model of the egg after use of an average filter to reduce high-frequency noise components while preserving the geometry of the object.
[Figure 7 here: original image on the left and two reconstructed models, A and B.]

Figure 7. Demonstration of partial model extraction of a concave surface (original image on the left). (A) With the concave region not specified, the algorithm assumes a convex surface. (B) With user-specified concavity, the model is correctly generated.
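The concavity regions demonstrated in Figure 7 are user-drawn polygons (§3.5a); deciding whether a mesh vertex falls inside one is a standard point-in-polygon test. A minimal even-odd ray-casting sketch (hypothetical names, not the project's exact code):

```java
// Even-odd ray-casting test: count how many polygon edges a horizontal
// ray from (px, py) crosses; an odd count means the point is inside.
public class ConcavityRegion {
    // xs, ys hold the polygon's vertices in order.
    public static boolean contains(double[] xs, double[] ys, double px, double py) {
        boolean inside = false;
        int n = xs.length;
        for (int i = 0, j = n - 1; i < n; j = i++) {
            boolean crosses = (ys[i] > py) != (ys[j] > py)
                && px < (xs[j] - xs[i]) * (py - ys[i]) / (ys[j] - ys[i]) + xs[i];
            if (crosses) inside = !inside;
        }
        return inside;
    }

    public static void main(String[] args) {
        double[] xs = {0.0, 4.0, 4.0, 0.0};   // 4 x 4 square
        double[] ys = {0.0, 0.0, 4.0, 4.0};
        System.out.println(contains(xs, ys, 2.0, 2.0));  // true: center is inside
        System.out.println(contains(xs, ys, 5.0, 2.0));  // false: outside
    }
}
```

Arbitrary (even non-convex) polygons are handled, which matches the freedom the drawing tool gives the user.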
V. Individual Contributions

Producing this integrated model extraction system required extensive background reading on IBMR and shape-from-shading analysis, followed by systematic planning of the sequence of steps required to yield a functional result. To this end, we divided the project into four separate elements, with one team member primarily responsible for each of the first three elements and all team members participating in the last: (1) physical model and image acquisition (Jeremy Cannon), (2) surface normal determination and mesh generation (Vitaly Kulikov), (3) user interface for image processing and model rendering (Jonathan Derryberry), and (4) integration of these elements into a single system (all).

Specifically, Jeremy Cannon performed the following tasks in support of this project:

- Generating suitable synthetic images using Open Inventor for primitives and MATLAB for irregular surfaces
- Setting up the environment for image acquisition, including approximation of the modeling constraints
- Designing and testing the image processing filters
- Preparing and integrating the project documentation, including the project proposal, final report, and presentation

Vitaly Kulikov performed these specific tasks for this project:

- Converting indexed images to RGB images
- Deriving surface reference normal(s) from image intensity values
- Generating a field of normals from the derived reference normal(s)
- Producing a smooth, continuous surface mesh from these normals
- Writing the algorithm for exporting to Inventor

Finally, Jonathan Derryberry supported this project with the following contributions:

- Image display in a Java-based user interface
- Integration of image filters, model generation, and model exporting into this interface
- Model rendering using Java 3D
- Working with Vitaly on improving the model extraction algorithm

VI. Lessons Learned

This project taught us a great deal about the complexities of using images as measurements.
It also gave us great appreciation for the enormous complexity of the types of problems investigators such as Leonard McMillan, Paul Debevec, and Takeo Kanade are currently tackling, and for the incredible insight of Horn's groundbreaking work in the early 1970s. Although we had hoped to synthesize a complete three-dimensional model from stereo image pairs, this did not prove possible given the time constraints of the project and the significant increase in complexity over partial model extraction. However, in producing this system, which extracts partial three-dimensional models, our knowledge increased greatly in the following specific ways:

- Appreciation of image processing techniques specific to using images as environmental measurements
- Understanding of the necessary constraints required to extract precise geometric data from a physical scene
- Understanding of the mathematical basis for model extraction from single images
- Knowledge of Java 3D and the supporting mathematical libraries required for image processing and manipulation
- Engineering experience in choosing algorithms that may not be correct in general but provide adequate functionality without excessive complexity and computational cost, so that a rich set of surface geometry could be extracted reliably in a reasonable amount of time

Acknowledgments

We would like to acknowledge the insights of Dr. Doug Perin, who gave us the inspiration for this project and offered specific suggestions on optimizing the image acquisition setup. In addition, Addy Ngan was very helpful in keeping us on schedule and in giving suggestions on debugging our algorithm.

Bibliography

1. Horn BKP. Shape from shading: a method for obtaining the shape of a smooth opaque object from one view. PhD thesis, MIT, 1970.
2. Chen SE, Williams L. View interpolation for image synthesis. SIGGRAPH '93, 279-288, 1993.
3. Debevec P, McMillan L. Image-based modeling, rendering, and lighting. IEEE Computer Graphics and Applications, Mar/Apr 2002, 24-25.
4. Tsai P-S, Shah M. Shape from shading with variable albedo. Opt Eng 37(4): 1212-1220, 1998.
5. Forsyth D, Ponce J. Sources, shadows, and shading. In Computer Vision: A Modern Approach. Prentice Hall, NJ, 70-96, 2002.
6. Horn BKP. Impossible shaded images. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(2): 166-170, 1993.
7. Horn BKP. Height and gradient from shading. Int J Comput Vision 5(1): 37-75, 1990.
8. Szeliski R. From images to models (and beyond): a personal retrospective. In Vision Interface '97, 126-137, Kelowna, British Columbia, May 1997.
9. Fan T-J. Surface segmentation and description. In Describing and Recognizing 3-D Objects Using Surface Properties. Springer-Verlag, New York, 27-54, 1990.
10. Gonzalez RC, Woods RE. Image enhancement. In Digital Image Processing. Addison-Wesley, Reading, MA, 161-251, 1993.
Appendix: Compilation Instructions

The source code for our ModelBuilder UI is located in the following directory:

/afs/athena.mit.edu/user/j/o/jonderry/public/

The test images are contained in:

/afs/athena.mit.edu/user/j/o/jonderry/public/ivpics
/afs/athena.mit.edu/user/j/o/jonderry/public/photos

To execute this program, use the code contained in the first directory on a machine with Java and the Java 3D libraries.