Video Mosaicing for Document Imaging

Video Mosaicing or Document Imaging Noboru Nakajima Akihiko Iketani Tomokazu Sato Sei Ikeda Masayuki Kanbara Naokazu Yokoya Common Platorm Sotware Research Laboratories, NEC Cororation, 8916-47 Takayama, Ikoma 630-0101, Jaan Nara Institute o Science and Technology, 8916-5 Takayama, Ikoma 630-0192, Jaan {n-nakajima@ay,iketani@c}.j.nec.com, {tomoka-s,sei-i,kanbara,yokoya}@is.aist-nara.ac.j Abstract This aer describes a real-time video mosaicing method or digitizing aer documents with mobile and low-resolution video cameras. Our roosed method is based on the structure-rom-motion technique with real-time rocessing. Thereore, it can dewar an image suering rom ersective deormation caused by a slanted and curved document surace. The curved suraces are reconstructed as an eanded lat image by 3D olynomial surace regression and rojection onto a lat mosaic lane. The order o the olynomial is automatically determined according to the geometric AIC. Moreover, the real-time scanning and reconstruction rocess oers interactive user instruction on the style o user s camera handling, such as aroriateness o camera osture and scanning seed. Manual camera scanning or video mosaicing catures a oint on the target multile times and that allows other merits. They include suer-resolution and shading correction. The eeriments demonstrated the easibility o our aroach. 1. Introduction Digital camera devices are getting increasingly commoditizing. But in contrast to their acility, their quality and assurance o document digitization are not suicient or document analysis and recognition. The literature in the DAR ield describes several technical challenges or their ractical realization such as the ollowings [1-3]: (1) dissolution or deiciency in resolving ower, (2) dewaring o curved surace suering ersective deormation, (3) elimination o illumination irregularity, (4) target localization and segmentation. A ull A4 age image at 400 di requires more than 14.7M iels, whereas a latbed scanner enables a ew thousands di. For the resolution deiciency roblem (1), video mosaicing is one o the most romising solutions. In video mosaicing, artial images o a document are catured as a video sequence, and multile rames are stitched seamlessly into one large, high-resolution image. Conventional video mosaicing methods usually achieve air-wise registration between two successive images, and construct a mosaic image rojecting all the images onto a reerence rame, in general, the irst rame. Szeliski [4] has develoed a homograhy-based method, which adots 8-DOF lanar rojective transormation. Ater his work, various etensions have been roosed [7-10]. One o the major develoments is the use o image eatures instead o all o the iels or reducing comutational cost [5-6]. All o these methods, however, align the inut images to the reerence rame, and thus will suer ersective slanted eect i the reerence rame is not accurately arallel to the target surace, even though a target age is rigidly lanar, as shown in Figure 1(a). There actually eists a severe image distortion (2) when caturing a curved target, as shown in Figure 1(b). The homograhy-based methods are alicable just when a target is a iecewise lanar, or the otical (a) Mosaic image generated (b) Curved age catured by conventional methods with a digital camera Figure 1. Problems in digitized document image. 171

center o the camera is aroimately ied (anoramic mosaicing). Thus, i the target is curved, the above assumtion no longer holds, and misalignment o images will degrade the resultant image. Although there are some video mosaicing methods that can deal with a curved surace, they require an active camera and a slit light rojection device [10]. On the other hand, the usability o the system is also an imortant actor. Unortunately, conventional video mosaicing methods have not ocused on the usability o the system because interactive video mosaicing systems have not been develoed yet. In order to solve these roblems, we emloy a structure-rom-motion technique with real-time rocessing. Our method is constructed o two stages as shown in Figure 2. In the irst stage, a user catures a target document using a web-cam attached to a mobile comuter (Figure 3). In this stage, camera arameters are automatically estimated in real-time and the user is suorted interactively through a review window or dislaying a mosaic image on the mobile comuter. Ater that, camera arameters are reined, and a highresolution and distortion-ree mosaic image is generated in a ew minutes. The main contributions o our work can be summarized as ollows: (a) ersective deormation correction rom a mosaic image or a lat Figure 2. Flow o roosed method. target, (b) dewared mosaic image generation or a curved document, and (c) interactive user suort on a mobile system. Camera scanning allows surlus tetural inormation on the target, or a oint on the document is catured multile times in the image sequence. Other beneits to be reaed rom our system include (d) suer-high-deinite image construction by suerresolution [11] or enhancing the solution or the lowresolution roblem (1), and (e) shading correction by intrinsic image generation [12] or irregular illumination (3). In the objective o document caturing, the location o the target, at which a user rincially oints a camera, is straightorward here or the target location roblem (4). The assumtions made in the roosed method are that intrinsic camera arameters, including lens distortion, are ied and calibrated in advance. For curved documents, the curvature must lie along a onedimensional direction, or be cylindrical, and varies smoothly on a age. 2. Video Mosaicing or Flat and Curved Document This section describes a method or generating a high-resolution and distortion-ree mosaic image rom a video sequence. The low o the roosed method is given in Figure 2. Although our system has two modes, one or a lat target (lat mode) and the other or a curved target (curved mode), the basic low or these two modes is almost the same. In the real-time stage, the system carries out 3D reconstruction rocesses rame by rame by tracking image eatures (a). The review o a temoral mosaic image is rendered in real-time and udated in every rame to show what art o the document has already been catured (b). Ater the real-time stage, the o-line stage is automatically started. In this stage, irst, re-aearing image eatures are detected in the stored video sequence (c), and camera arameters and 3D ositions o eatures estimated in the real-time stage are reined (d). I the system works in the curved mode, surace arameters are also estimated by itting a 3D surace to an estimated 3D oint cloud (e). Ater several iterations, a distortionree mosaic image o high-resolution is inally generated () alying the suer-resolution method. In the ollowing sections, irst, etrinsic camera arameters and an error unction used in the roosed method are deined. Stages (1) and (2) are then described in detail. Figure 3. Mobile PC with web-cam. 172

2.1. Deinition o Etrinsic Camera Parameter and Error Function In this section, the coordinate system and the error unction used or 3D reconstruction are deined. We deine the coordinate system such that an arbitrary oint S, y, z in the world coordinate system is rojected to the coordinate u, v in the -th image lane. Deining 6-DOF etrinsic camera arameters o the -th image as a 34 matri M the relationshi between 3D coordinate S and 2D coordinate is eressed as ollows: T au T, av, a M, y, z, 1, (1) where a is a normalizing arameter. In the above eression, is regarded as a coordinate on the ideal camera with the ocal length o 1 and o eliminated radial distortion induced by the lens. In ractice, however, is the rojected osition o ˆ uˆ, vˆ in the real image, which is given transerring according to the known intrinsic camera arameters including ocus, asect, otical center and distortion arameters. This transormation rom ˆ to is alied only to coordinate calculation in the real-time stage regarding comutational cost, whereas it will be considered recisely as a art o image dewaring in o-line stage, and omitted or simlicity in the rest o this aer. Net, the error unction used or 3D reconstruction is described. In general, the rojected osition o S to the -th subimage rame does not coincide with the actually detected osition u, v, due to errors in eature detection, etrinsic camera arameters and 3D eature ositions. In this aer, the squared error E is deined as an error unction or the eature in the -th rame as ollows: E 2. (2) 2.2. Real-time Stage or Image Acquisition In the real-time stage, etrinsic camera arameters are estimated in real-time to show some inormation that hels users in image caturing. First, a user may very roughly set the image lane o the camera arallel to the target aer. Note that this setting is not laborious or users because it is done only or the irst rame. This setting is used to calculate initial values or the real-time camera arameter estimation. The user then starts image caturing with ree camera motion. As shown in Figure 2, the real-time stage is constructed o two iterative rocesses or each rame. First, the etrinsic camera arameter is estimated by tracking eatures (a). A review image o generating a mosaic image and instruction or controlling camera motion seed are then shown in the dislay on the mobile comuter and udated every rame (b). The ollowing describes each rocess o the real-time stage. Ste (a): Feature tracking and camera arameter estimation. An iterative rocess to estimate the etrinsic camera arameters and 3D osition S o each eature oint is described. This rocess is an etension o the structurerom-motion method in [13]. In the irst rame, assuming that the image lane in the irst rame is aroimately arallel to the target, rotation and translation comonents in M are set to an identity matri and 0, resectively. For each eature oint detected in the irst rame, its 3D osition S is set to u 1, v1,1, based on the same assumtion. Note that these are only initial values, which will be corrected in the reinement rocess (Figure 2 (d)). In the succeeding rames ( 1), M is determined by iterating the ollowing stes until the last rame. Feature oint tracking: All the image eatures are tracked rom the revious rame to the current rame by using a standard temlate matching with Harris corner detector [14]. The RANSAC aroach [15] is emloyed or eliminating outliers. Etrinsic camera arameter estimation: Etrinsic camera arameter M is estimated using the eature osition u, v and its corresonding 3D osition S, y, z. Here, etrinsic camera arameters are obtained minimizing the error summation E w.r.t. 6-DOF elements in M by the Levenberg- Marquadt method. For 3D osition S o the eature, the estimated value in the revious iteration is used. 3D eature osition estimation: For every eature oint in the current rame, its 3D osition S, y, z is reined by minimizing the error unction E. In the case o the lat mode, z value i 1 i o the S is ied to a constant to imrove the accuracy o estimated arameters. Addition and deletion o eature oints: In order to obtain accurate estimates o camera arameters, stable eatures should be selected. The set o eatures to be tracked is udated based on the eature reliability [13]. Iterating the above stes, etrinsic camera arameters M and 3D eature ositions S are estimated. Ste (b): Preview image generation and instruction notiication or user. Figure 4 shows dislayed inormation or a user in our system. Usually, the user watches the right side window in the real-time stage. I necessary, the user can 173

Figure 4. User Interace or real-time stage. Letto: inut image and tracking eature oint. Letbottom: temorarily estimated camera ath and osture. Right: review o ongoing mosaicing review and seed indicator o camera motion. (a) (b) (c) (d) Figure 5. Registrations o reaearing eatures. (a) Camera ath and 3D eature osition. (b) Two temorally distant rames sharing art o scoe. (c) Temlates o the corresonding eatures. (d) Temlates in which ersective distortion is corrected. check the inut subimages and estimated camera ath on the let side windows. The review image hels the user to conirm a lacking art o the target to be scanned. For each rame, the catured subimage is resized to a lower resolution and stored to teture memory. Every stored teture is dewared onto the mosaic image lane by teture maing using Eq. (1). The boundary o the current rame image is colored in the review so that the user can gras what art o the target has currently been catured. On the other hand, the instruction shown in the bottom o the right window in Figure 4 hels the user to control the seed o camera motion. In our system, there is aroriate seed or camera motion that is shown by an arrow mark in the seed gauge, and it is graded as three ranges; too slow", aroriate," and too ast." Too ast camera motion causes a worse mosaic image quality because the error in camera arameter estimation increases due to tracking errors and shortage o tracking san o each eature. In contrast, too slow motion consumes comutational resources redundantly because the number o images increases to cature the whole document. To obtain a better result, the user controls camera seed so as to kee the arrow mark being inside aroriate" range. In turn, too much slanted camera osture, which is still tolerable or 3D reconstruction, can cause degradations such as irregular resolution and artial deocus blur in a mosaic image. The camera ose can also be checked in the let-bottom window. 2.3. O-line Stage or 3D Reconstruction Reinement and Mosaic Image Generation This section describes the 3D reconstruction arameter reinement and shae regression, so as to minimize the summation o the error unction all over the inut sequence and eature misalignment rom a arameterized columnar olynomial surace. First, as shown in Figure 2, eature oints that reaear in temorally distant rames are homologized (c). Using these reaearing eatures, the 3D reconstruction is reined (d). Then, by itting a curved surace to 3D eature oints, the target shae is estimated (e). Ater a ew iterations o stes (c) to (e), a dewared mosaic image o the target is generated according to the target shae and the etrinsic camera arameters (). Details o stes (c) to () are described below. Ste (c): Reaearing eature registration. Due to the camera motion, the eatures relatively ass though the scoe o the rames. Some eatures reaear in the sight ater the revious lameout, as shown in Figure 5. This ste detects these reaearing eatures, and distinct tracks belonging to the same reaearing eature are linked to orm a single long track. This will give tighter constraints among the sequence o camera arameters in temorally distant rames, and thus makes it ossible to reduce the cumulative errors caused by sequential camera scanning. Reaearing eatures are detected by eamining the similarity o the atterns among eatures belonging to distinct tracks. The roblem here is that even i two atterns belong to the same eature on the target, they can have dierent aearance due to ersective distortion. To remove this eect, irst, temlates o all the eatures are rojected to the itted surace (described later). Net, eature airs whose distance in 3D sace is less than a given threshold are selected and tested with the normalized cross correlation unction. I the corre- 174

lation is higher than a certain threshold, the eature air is regarded as reaearing eatures. Note that in the lat mode, the shae o the target is simly assumed as a lane. When the system works in the curved mode, this ste is skied at an initial iteration o stes (c) to (e) because surace arameters have not been estimated at the irst iteration. Ste (d): Reinement o 3D reconstruction. Since the 3D reconstruction rocess described in Section 2.2 is an iterative rocess, its result is subject to cumulative errors. In this method, by introducing a bundle-adjustment ramework [16], the etrinsic camera arameters and 3D eature ositions are globally otimized so as to minimize the sum o re-rojection errors, which is given as ollows: E. all E (3) As or reaearing eatures, all the tracks belonging to the same reaearing eature are linked, and treated as a single track. This enables the etrinsic camera arameters and 3D eature ositions to be otimized, maintaining consistency among temorally distinct rames. Ste (e): Target shae estimation by surace itting. In this ste, assuming the curvature o the target lies along one direction, the target shae is estimated using 3D oint clouds otimized in the revious ste (d). Note that this stage is skied in the case o the lat mode. First, the rincial curvature direction is comuted rom the 3D oint clouds as shown in Figure 6. Net, the 3D osition o each eature oint is rojected to a lane erendicular to the direction o minimum Figure 6. Polynomial surace regression or target shae aroimation rincial curvatures. Finally, a olynomial equation o variable order is itted to the rojected 2D coordinates, and the target shae is estimated. Let us consider or each 3D oint S a oint cloud R that consists o eature oints lying within a certain distance rom S. First, the directions o maimum and minimum curvatures are comuted or each R using local quadratic surace it. For a target whose curve lies along one direction, as assumed in this aer, the minimum rincial curvature must be 0, and its direction must be the same or all the eature oints. In ractice, however, there eists some luctuation in the directions o minimum curvature, due to the estimation errors. Thus, a voting method is alied to eliminate outliers and the dominant direction V T min v m, vmy, vmz o minimum rincial curvatures or the whole target is determined. Net, 3D osition S or each eature oint is rojected to a lane whose normal vector N coincides with Vmin i.e. P, y, z vm vmy y vmz z 0. The rojected 2D coordinates, y o S is given as ollows: T V 1 T y S V, (4) 2 where V 1 is a unit vector arallel to the rincile ais o inertia o the rojected 2D coordinates, y, and V 2 is a unit vector that is erendicular to V 1 and V min, i.e. V 2 V 1 N. Finally, the target shae arameter a 1,, a m is obtained by itting the ollowing variable-order olynomial equation to the rojected 2D coordinates, y. y m i1 a i Here, the otimal order m in the above equation is automatically determined by using geometric AIC [17]. In the case where the target is comosed o multile curved suraces, e.g. the thick bound book shown in Figure 1(a), the target is irst divided with a line where the normal vector o the locally itted quadratic surace varies discontinuously, and then the shae arameter is comuted or each art o the target. The estimated shae is used or generating a dewared mosaic image in the net rocess, as well as or removing the ersective distortion in the reaearing eature detection rocess (Figure 2(c)). Ste (): Mosaic Image Generation. Finally, a mosaic image is generated using etrinsic camera arameters and surace shae arameters. The suer-resolution and shading correction are imlemented here regarding coordinate transormation be- i. (5) 175

low. Let us consider a 2D coordinate m, n on the dewared mosaic image, as shown in Figure 6. The high-deinite iel value at m, n on the dewared mosaic image is estimated rom the intrinsic iel values at all the corresonding coordinates u, v in the inut subimage [11,12]. The relation between m, n and its corresonding 3D coordinate,, z on the itting surace is given as ollows: 2 d m, n 1 d, z. (6) d 0 The coordinate,, z is transerred to the corresonding 2D coordinate u, v on the -th image lane by the ollowing equation. T au V 1 T av M V 2. (7) T a N z Assumtion o the lat target shae (lat mode) simliies these as lanar ersective transormations. cross marks. Note that none o the inut image lanes are arallel to the target document. Figure 8 illustrates estimated etrinsic camera arameters and eature ositions on the mosaic image lane. The curved line shows the estimated camera ath and yramids show the camera ostures every 10 rames. The generated mosaic image is shown in Figure 9. We can conirm that the slanted eect is correctly removed in the inal mosaic image by comaring the result with the homograhy-based result shown in Figure 1(b) where the same inut images and the same eature trajectories with this eeriment were used. Although the target aer is not erectly lat, there are no aarent distortions in the inal mosaic image. The erormance o the Figure 7. Inut subimage samles and tracked eatures or lat target. 3. Eeriments We have develoed a video mosaicing rototye system that consists o a mobile PC (Pentium-M 1. 2GHz, Memory 1GB) and a USB web-cam. The aearance o the rototye system has been shown in Figure 3. In our system, a user starts image caturing, reerring the monitor o mobile comuter. The realtime stage or image caturing is automatically inished when a video buer becomes ull. Eeriments have been carried out or lat and curved documents. In these eeriments, the intrinsic camera arameters are calibrated in advance by Tsai s method [18], and are ied throughout image caturing. Note that in the current version o the rototye system, the curved mode is not imlemented on the mobile system. Thus, the latter eeriment or the curved surace is carried out by using a deskto PC (Xeon 3.2GHz, Memory 2GB), and the initial camera arameter estimation is rocessed oline ater image caturing using an IEEE 1394 web-cam. Eerimental results chiely ocused on 3D reconstruction are shown below. (a) To view (b) Side view Figure 8. Estimated etrinsic camera arameter and 3D eature distribution or lat document. 3.1. Flat target As shown in Figure 7, a lat target document is catured as 150 rame images o 640480 iels at 6 s with initial camera arameter estimation. Image eatures tracked in the real-time stage are deicted with Figure 9. Reconstructed mosaic image. 176

rototye system or this sequence is as ollows: 6 s or image acquisition and initial arameter estimation with review o mosaicing, 17 seconds or camera arameter reinement, and 39 seconds or generating the inal mosaic image. 3.2. Curved target Figure 13. Estimated shae. (a) Beore shading removal Figure 10. Curved target. (b) Ater shading removal Figure 14. Reconstructed mosaic image. 1 st rame 66 th rame 123 rd rame 200 th rame Figure 11. Inut subimage samles or curved target. (a) Front view (b) Side view Figure 12. Estimated etrinsic camera arameter and 3D eature distribution or curved document. An eamle o curved documents used in eeriments is shown in Figure 10. The target is comosed o 2 ages o curved suraces: one age with tets and the other with ictures and igures. The target is catured with a web-cam as a video o 200 rames at 7.5 s, and is used as an inut to our system. Samled rames o the catured video are shown in Figure 11. Tracked eature oints are deicted with cross marks. The 3D reconstruction result obtained by the roosed method is shown in Figure 12. The curved line shows the camera ath, yramids show the camera ostures every 10 rames, and the oint cloud shows 3D ositions o eature oints. As can be seen, the oint cloud coincides with the shae o the thick bound book. The shae o the target is estimated ater 3 time iterations o reaearing eature detection and the surace itting rocess. The estimated shae o the target is shown in Figure 13. In the roosed method, the otimal order o the olynomial surace itted to the target is automatically determined by geometric AIC. In this eeriment, the order is 5 and 4 or the let and right ages, resectively. The dewared mosaic images beore and ater shadow removal are shown in Figure 14(a) and (b), resectively. As can be seen, the distortion on the target has been removed in the resultant image. The erormance o the system on the deskto PC or this sequence is as ollows: 27 seconds or initial 3D reconstruction, 71 seconds or camera arameter reinement, and 113 seconds or generating the inal mosaic image. Some other mosaicing results or curved targets are shown in Figure 15. 177

4. Conclusion A novel video mosaicing method or generating a high-resolution and distortion-ree mosaic image or lat and curved documents has been roosed. With this method based on 3D reconstruction, the 6-DOF camera motion and the shae o the target document are estimated without any etra devices. In eeriments, a rototye system o mobile video mosaicing has been develoed and has been successully demonstrated. For a lat target, a mosaic image without the slanted eect is successully generated. Although the curved mode has not yet been imlemented on the current version o mobile system, we have shown the oline result or curved documents generated by deskto PC. In the curved mode, assuming the curve o the target lies along one direction, the shae model is itted to the eature oint cloud and a dewared image without shadow is automatically generated. Our uture work involves reducing the comutational cost in the curved mode and imroving the accuracy o 3D reconstruction around the inner margin o a book. Reerences (a) Poster on a round illar (b) Bottle label Figure 15. Mosaicing result eamles. Let: target aearances. Right: reconstructed images. [1] K. Junga, K. I. Kim, and A. K. Jain, Tet inormation etraction in images and video: a survey, Pattern Recognition, Vol. 37, No.5, 2004,. 977 997. [2] A. Roseneld, D. Doermann, and D. DeMenthon, Video Mining, Kluwer, Massachusetts, 2003. [3] D. Doermann, J. Liang, and H. Li, Progress in camerabased document image analysis, Proc. Int. Con. Document Analysis and Recognition, 2003,. 606 616. [4] R. Szeliski, Image Mosaicing or Tele-Reality Alications," Proc. IEEE Worksho Alications o Comuter Vision,. 230-236, 1994. [5] S. Takeuchi, D. Shibuichi, N. Terashima, and H. Tominage, Adative Resolution Image Acquisition Using Image Mosaicing Technique rom Video Sequence," Proc. Int. Con. Image Processing, vol. I,. 220-223, 2000. [6] C. T. Hsu, T. H. Cheng, R. A. Beuker, and J. K. Hong, Feature-based Video Mosaicing," Proc. Int. Con. Image Processing, Vol. 11,. 887-890, 2000. [7] M. Lhuillier, L. Quan, H. Shum, and H. T. Tsui, Relie Mosaicing by Joint View Triangulation," Proc. Int. Con. Comuter Vision and Pattern Recognition, vol. 19,. 785-790, 2001. [8] P. T. McLauchlan, and A. Jaenicke, Image Mosaicing Using Bundle Adjustment," Image and Vision Comuting, 20,. 751-759, 2002. [9] D. W. Kim and K. S. Hong, Fast Global Registration or Image Mosaicing," Proc. Int. Con. Image Processing, 11,. 295-298, 2003. [10] P. Grattono, and M. Sertino, A Mosaicing Aroach or the Acquisition and Reresentation o 3D Painted Suraces or Conservation and Restoration Puroses," Machine Vision and Alications, Vol. 15, No. 1,. 1-10, 2003. [11] D. Cael, Image Mosaicing and Suer-resolution, Sringer-Verlag, London, 2004. [12] Y. Weiss, Deriving intrinsic images rom image sequences, Proc. Int. Con. Comuter Vision,. 68-75, 2001. [13] T. Sato, M. Kanbara, N. Yokoya, and H. Takemura, Dense 3D Reconstruction o an Outdoor Scene by Hundreds-baseline Stereo Using a Hand-held Video Camera," Int. J. Comuter Vision, Vol. 47, No. 1-3,. 119-129, 2002. [14] C. Harris, and M. Stehens, A Combined Corner and Edge Detector," Proc. Alvey Vision Con.,. 147-151, 1988. [15] MA Fischler, and R. C. Bolles, Random Samle Consensus: A Paradigm or Model Fitting with Alications to Image Analysis and Automated Cartograhy," Communications o the ACM, Vol. 24, No. 6,. 381-395, 1981. [16] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon, Bundle Adjustment - a Modern Synthesis," Proc. Int. Worksho Vision Algorithms,. 298-372, 1999. [17] K. Kanatani, Geometric Inormation Criterion or Model Selection,", Int. J. Comuter Vision, Vol. 26, No. 3,. 171-189, 1998. [18] R. Y. Tsai, An Eicient and Accurate Camera Calibration Technique or 3D Machine Vision," Proc. Con. Comuter Vision and Pattern Recognition,. 364-374, 1986. 178