Word Level Script Identification for Scanned Document Images

Huanfeng Ma and David Doermann
Language and Media Processing Laboratory, Institute for Advanced Computer Studies
University of Maryland, College Park, MD 20742, USA

ABSTRACT

In this paper, we compare the performance of three classifiers used to identify the script of words in scanned document images. In both training and testing, a Gabor filter is applied and 16 channels of features are extracted. Three classifiers (Support Vector Machines (SVM), Gaussian Mixture Model (GMM) and k-nearest-neighbor (k-nn)) are used to identify different scripts at the word level (glyphs separated by white space). These three classifiers are applied to a variety of bilingual dictionaries and their performance is compared. Experimental results show the capability of Gabor filter to capture script features and the effectiveness of these three classifiers for script identification at the word level.

Keywords: Script Identification, Support Vector Machines (SVM), Gaussian Mixture Model (GMM), k-nearest-neighbor (k-nn), Gabor Filter

1. INTRODUCTION

1.1. Background

In recent years, the demand for tools to be able to recognize, search and retrieve written and spoken sources of multilingual information has increased tremendously. With the rapid explosion of online repositories, researchers and developers of cross-lingual search and translation systems can get a lot of resources they need easily from the Internet. However, there are still significant resources that can only be accessed in a printed form, especially for sparse, low density languages. Manipulation and conversion of these printed documents is essential for many researchers and organizations.
One of the most important tasks to address with printed documents is the automatic recognition of text, which usually consists of three steps: (1) zone segmentation and text region identification using document layout analysis; (2) text line, word and character segmentation; and (3) optical character recognition (OCR). In the last step, OCR systems are often designed to work on documents with a specific script. In order to parse bilingual or multilingual documents such as patents [1] or bilingual dictionaries, or to perform multilingual document retrieval [2], the script must be identified before feeding words to an appropriate OCR system. Our motivation for script identification stems from attempts to acquire lexicons from bilingual dictionaries. In our previous work [2, 3], a document image was first segmented into physical zones and then into entries based on extracted features. For bilingual dictionaries with one non-Roman script, script identification is essential both in entry segmentation and in parsing and tagging the entry itself.

Further author information: (Send correspondence to Huanfeng Ma) Huanfeng Ma: hfma@umiacs.umd.edu; David Doermann: doermann@umiacs.umd.edu

1.2. Previous Work

In earlier work on script identification, Hochberg et al. [4] described a technique for identifying 13 scripts, including highly connected ones. In their algorithm, a scale-normalized cluster template was created for each script, based on the frequent characters or word shapes of that script. Scripts were then classified by comparing a subset of the document's textual symbols with these templates. Spitz et al. [1, 5] initially divided scripts into Asian (Chinese, Japanese and Korean) or Roman based on the observation that upward concavities are distributed evenly along the vertical axis of Asian characters, but tend to appear at specific locations in Roman characters. Discrimination among Asian scripts was then made on the basis of character density. Waked et al. [6] used features of the horizontal projection of a text line to classify scripts into three categories: Arabic, Roman and Ideographic. Specific features such as character complexity or curvature can be used to distinguish different scripts if the classifier designer has sufficient knowledge of the scripts; classification of this kind is therefore case-dependent.

In recent years, texture analysis techniques have been introduced to classify different font styles and font faces. Zhu et al. [7] presented a font recognition algorithm based on global texture analysis. Gabor filters were used to extract the global texture features, and the extracted features were used to recognize different font styles and faces. They also demonstrated the capability to identify Chinese and English documents with different fonts.

2. SCRIPT IDENTIFICATION AT THE WORD LEVEL

It should be noted that all of the script identification approaches mentioned above work at the block or page level, which means that a block or page is assumed to contain a single script. This is not the case for bilingual dictionaries, where text in different scripts may be interlaced.
For example, in the English-Chinese bilingual dictionary shown in Figure 1, there is no rule to identify which part should be Chinese and which part should be English unless the content is known. This means it is impossible to combine words which belong to the same script into a whole component, so the identification must be done at the word level.

Figure 1. English-Chinese dictionary.

In our work, we perform script identification using Gabor filter analysis of textures. The use of Gabor filters in extracting texture features of images was motivated by the following two factors: (1) the Gabor representation has been shown to be optimal in the sense of minimizing the joint two-dimensional uncertainty in space and frequency [8]; and (2) Gabor filters can be considered as orientation and scale tunable edge and line detectors, and the statistics of these micro-features in a given region are often used to characterize the underlying texture information.

In our previous work [9], we proposed a general approach which computed the mean and standard deviation feature vectors of each class in the training phase. For each test sample, the classification is based on the distance between this sample and each class. Suppose the feature vector lies in a d-dimensional space, and the computed mean and standard deviation feature vectors for class $\lambda_i$ are $\mu^{(i)}$ and $\alpha^{(i)}$, where $i = 1, \ldots, M$ and $M$ is the number of classes. Then for each test sample $x \in R^d$, the distance between this sample and each class is computed using the following formula:

$$d(x, \lambda_i) = \sum_{k=1}^{d} \left| \frac{x_k - \mu_k^{(i)}}{\alpha_k^{(i)}} \right|, \qquad i = 1, \ldots, M$$
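As a rough illustration of this distance-based baseline, the sketch below computes the per-class mean and standard deviation vectors and assigns a test sample to the class with the smallest normalized distance. The function names, the use of the population standard deviation, and the small floor on the standard deviation are assumptions made for this example; they are not taken from the original system.

```python
from statistics import mean, pstdev


def train_class_stats(samples_by_class):
    """Return {label: (mu, alpha)}: per-dimension mean and standard deviation."""
    stats = {}
    for label, samples in samples_by_class.items():
        dims = list(zip(*samples))  # transpose: one tuple per feature dimension
        mu = [mean(col) for col in dims]
        # Assumption: floor the spread so the division below is always defined.
        alpha = [max(pstdev(col), 1e-9) for col in dims]
        stats[label] = (mu, alpha)
    return stats


def normalized_distance(x, mu, alpha):
    """Sum over dimensions of |x_k - mu_k| / alpha_k."""
    return sum(abs(xk - mk) / ak for xk, mk, ak in zip(x, mu, alpha))


def classify(x, stats):
    """Pick the class whose normalized distance to x is smallest."""
    return min(stats, key=lambda label: normalized_distance(x, *stats[label]))
```

With real data, each sample would be a 32-dimensional Gabor feature vector and each label a script name.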

In this paper, the k-nearest-neighbor (k-nn), Support Vector Machine (SVM) and Gaussian Mixture Model (GMM) classifiers are applied to improve the classification performance. The performance of these three classifiers is compared in Section 4.

Figure 2. System architecture.

3. SYSTEM DESIGN

A diagram of the system is shown in Figure 2. The main operations in each part of the system are described in detail in the following subsections.

3.1. Preprocessing

The goal of document image preprocessing is to clean the document image and remove variations that may affect the final identification results. The main operations include: (1) image deskewing; (2) line removal; and (3) symbol removal.

Deskewing. During the scanning procedure, the document may be skewed. The word segmentation is based on the bounding box of the segmented word, which can be affected by skew. Deskewing is based on a horizontal projection profile [10]. Assuming the skew angle is less than 15 degrees, we first obtain the horizontal projection profile of all text lines. By iteratively rotating the image and computing the correlation of the profile, we obtain the deskewing angle. The image is then simply rotated by this angle.

Line removal. The lines we want to remove from bilingual dictionary pages usually appear as long horizontal lines at the top or bottom of a page, or as long vertical lines in the middle, so we are concerned primarily with these two cases in our work. The line detection and removal algorithm therefore does not need to be complicated; a Hough Transform was applied to detect the lines to be removed.

Symbol removal. In most bilingual dictionaries, there are some special symbols which belong to neither of the scripts we want to identify. Simply assigning these symbols to one class can degrade the classification performance. Before performing classification, these symbols need to be detected and removed from the original image to generate a clean image.
In our work, a template matching approach was applied to complete the symbol detection. First, we extracted all the symbols and created one model template for each symbol. Then, for each saved template, we scan the image to detect and recognize the symbol based on a generalized Hausdorff measure. The generalized Hausdorff distance measures the degree of mismatch between two point sets, and can therefore also be used to evaluate the resemblance of one point set to another [11]. Once a symbol is detected, the rectangular area that covers the symbol is simply set to the background color. Figure 3 shows an example of symbol detection and removal.

Figure 3. Symbol detection and removal. The left image is the original image, the middle image shows the removed symbols, and the right image is the clean image.

3.2. Word Extraction and Processing

Script classification is applied at the word level, so before extracting texture features, all words need to be extracted from the document image. In our work, the Docstrum algorithm [12] was applied to perform word segmentation. Word images in different classes, and even different word images in the same class, may have different sizes (width and height). To make the features consistent, word image replication and scaling is applied to create a normalized image of a predefined size (64 × 64 pixels in our case). Features used in the following sections are extracted from images of the same size. Figure 4 shows word image replication and scaling examples for two different scripts (Arabic and Roman).

Figure 4. Word image replication and scaling. (a, c) Original images; (b, d) normalized images.

3.3. Feature Extraction

A pair of isotropic Gabor filters is applied to extract the texture features of each class.

3.3.1. Gabor filter design

The computational models for the 2D isotropic Gabor filters are:

$$h_e(x, y) = g(x, y) \cos[2\pi f (x \cos\theta + y \sin\theta)]$$

$$h_o(x, y) = g(x, y) \sin[2\pi f (x \cos\theta + y \sin\theta)]$$

where $h_e$ and $h_o$ are the even-symmetric and odd-symmetric Gabor filters, and $g(x, y)$ is an isotropic Gaussian function of the form:

$$g(x, y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$

The spatial frequency responses of the Gabor functions are:

$$H_e(u, v) = \frac{H_1(u, v) + H_2(u, v)}{2}$$

$$H_o(u, v) = \frac{H_1(u, v) - H_2(u, v)}{2j}$$

where $j = \sqrt{-1}$ and

$$H_1(u, v) = \exp\{-2\pi^2\sigma^2 [(u - f\cos\theta)^2 + (v - f\sin\theta)^2]\}$$

$$H_2(u, v) = \exp\{-2\pi^2\sigma^2 [(u + f\cos\theta)^2 + (v + f\sin\theta)^2]\}$$

$f$, $\theta$ and $\sigma$ are the spatial frequency, orientation and space constant of the Gabor envelope. In our case, the image size is normalized to 64 × 64, so four values of spatial frequency are selected: 0.04, 0.08, 0.16 and [...]. The combination of these four frequencies with four selected values of $\theta$ (0°, 45°, 90°, 135°) gives a total of 16 Gabor channels. The non-orthogonality of the Gabor wavelets implies that there is redundant information in the filtered images. To reduce the redundancy, the filters are designed to ensure that the half-peak magnitude supports of the filter responses in the frequency spectrum touch each other, as shown in Figure 5. The space constant $\sigma$ is therefore selected based on the formula $\sigma = 1/(0.6f)$.

Figure 5. Frequency response of Gabor filters. (Left: desired response; right: real response.)

3.3.2. Feature representation

The Gabor wavelet transform of an image $I(x, y)$ is defined as:

$$G_{mn}(x, y) = \iint I(s, t)\, g_{mn}^{*}(x - s, y - t)\, ds\, dt$$

where $g_{mn}$ is the Gabor filter for the m-th frequency and n-th orientation and * indicates the complex conjugate. Based on the computed mean $\mu_{mn}$ and standard deviation $\sigma_{mn}$ of the magnitude of the transform coefficients, a feature vector (with dimension 32, representing 16 channels) is constructed as:

$$x = [\mu_{00}, \sigma_{00}, \mu_{01}, \sigma_{01}, \ldots, \mu_{33}, \sigma_{33}]$$

where $\mu_{mn}$ and $\sigma_{mn}$ are computed as:

$$\mu_{mn} = \iint |G_{mn}(x, y)|\, dx\, dy, \qquad \sigma_{mn} = \sqrt{\iint \left(|G_{mn}(x, y)| - \mu_{mn}\right)^2 dx\, dy}$$
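The feature extraction described in this section can be sketched in a few lines of pure Python. This is only an illustrative sketch under several assumptions that are not in the paper: the fourth spatial frequency is taken to be 0.32, the kernels are truncated to a small window (`radius`), the image border is zero padded, and the mean and standard deviation are normalized by the number of pixels. The function names are hypothetical.

```python
import math


def gabor_kernels(f, theta, sigma, radius):
    """Even (cosine) and odd (sine) isotropic Gabor kernels on a (2r+1)^2 grid."""
    even, odd = [], []
    for y in range(-radius, radius + 1):
        row_e, row_o = [], []
        for x in range(-radius, radius + 1):
            g = math.exp(-(x * x + y * y) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
            phase = 2 * math.pi * f * (x * math.cos(theta) + y * math.sin(theta))
            row_e.append(g * math.cos(phase))
            row_o.append(g * math.sin(phase))
        even.append(row_e)
        odd.append(row_o)
    return even, odd


def channel_magnitudes(image, even, odd):
    """Magnitude of the complex filter response at every pixel (zero padding)."""
    h, w = len(image), len(image[0])
    r = len(even) // 2
    mags = []
    for y in range(h):
        for x in range(w):
            re = im = 0.0
            for dy in range(-r, r + 1):
                yy = y - dy
                if not 0 <= yy < h:
                    continue
                for dx in range(-r, r + 1):
                    xx = x - dx
                    if 0 <= xx < w:
                        v = image[yy][xx]
                        re += v * even[dy + r][dx + r]
                        im += v * odd[dy + r][dx + r]
            mags.append(math.hypot(re, im))
    return mags


def gabor_features(image, freqs=(0.04, 0.08, 0.16, 0.32),
                   thetas_deg=(0, 45, 90, 135), radius=8):
    """Return [mu_00, sigma_00, ..., mu_33, sigma_33] for a grayscale image.

    Assumption: 0.32 stands in for the fourth frequency, which is not
    recoverable from the text; radius=8 is an arbitrary truncation.
    """
    features = []
    for f in freqs:
        sigma = 1.0 / (0.6 * f)  # space constant from the text
        for t in thetas_deg:
            even, odd = gabor_kernels(f, math.radians(t), sigma, radius)
            mags = channel_magnitudes(image, even, odd)
            mu = sum(mags) / len(mags)
            sd = math.sqrt(sum((m - mu) ** 2 for m in mags) / len(mags))
            features += [mu, sd]
    return features
```

Direct spatial convolution in pure Python is slow for 64 × 64 images, and a small `radius` truncates the wide Gaussian envelopes of the low-frequency channels, so a practical implementation would more likely filter in the frequency domain with a numerical library.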

3.4. Classifier Design

Although the system can extract the texture features of different scripts, how they are best used depends on the characteristics of the specific scripts. It is important to assign appropriate weights to different features based on the training samples. As mentioned in Section 2, in previous work [9] we performed the classification based on the distance of a test sample to each class. The distribution of training samples in the feature space was taken into account only by normalizing the distance with the standard deviations of the training samples. To improve the classification performance, we employ three new classifiers.

3.4.1. k-nn classifier

The k-nearest-neighbor classifier is an extension of the nearest neighbor classifier, which was first introduced by Cover and Hart [13]. As illustrated in Figure 6, a test sample x is classified by assigning it the label most frequently represented among the k nearest samples; that is, a decision is made by examining the labels of the k nearest neighbors and taking a vote.

Figure 6. The k-nearest-neighbor classifier. It starts at the test sample x and grows a spherical region until k training samples are enclosed. The test sample is labeled by a majority vote of these samples. In this k = 3 case, the test sample x would be labeled with the class of the * points.

3.4.2. SVM classifier

SVMs were first introduced in the late seventies, but have only recently received increased attention. SVMs have been applied in many fields such as handwritten digit recognition [14], object recognition [15], speaker identification [16], face detection in images [17] and text categorization [18]. The SVM classifier constructs a best separating hyperplane (the maximal margin plane) in a high-dimensional feature space which is defined by nonlinear transformations from the original feature variables.
Consider the binary classification task in which we have a set of training samples $\{x_i, y_i\}$, $i = 1, \ldots, N$, with $x_i \in R^d$ and $y_i \in \{-1, +1\}$, where the $y_i$ are labels corresponding to the two classes $\lambda_1$ and $\lambda_2$. The discriminant function is defined as:

$$g(x) = w^T \Phi(x) + b \qquad (1)$$

with the decision rule

$$w^T \Phi(x_i) + b > 0 \text{ for } x_i \in \lambda_1 \text{ with } y_i = +1 \qquad (2)$$

$$w^T \Phi(x_i) + b < 0 \text{ for } x_i \in \lambda_2 \text{ with } y_i = -1 \qquad (3)$$

and all training points are correctly classified if

$$y_i (w^T \Phi(x_i) + b) > 0 \text{ for all } i \qquad (4)$$

Figure 7(a) shows two linearly separable sets of data. Many possible hyperplanes can separate these two sets. The goal of the SVM is to determine the separating hyperplane H for which the margin, that is, the distance between the two parallel hyperplanes on each side of H (H1 and H2 in Figure 7, termed the canonical hyperplanes), is the largest. The data points that lie on the two canonical hyperplanes are called support vectors (circled in Figure 7).

Figure 7. Separating hyperplanes for two sets of data. (a) Linear separating hyperplanes; (b) nonlinear separating hyperplanes. The separating hyperplane is $H: w^T \Phi(x) + b = 0$ and the two canonical hyperplanes are $H_1: w^T \Phi(x) + b = +1$ and $H_2: w^T \Phi(x) + b = -1$. The circled data points (which lie on the two canonical hyperplanes) are support vectors.

The transformation defined by the mapping function $\Phi(x)$ in Eq. 1 can be linear or nonlinear, so the method applies both to linearly separable data and to data that can only be separated nonlinearly. Figure 7(a) shows an example of separating hyperplanes for linearly separable data, while the two data sets shown in Figure 7(b) can only be separated nonlinearly. For nonlinear SVMs, the kernel function $K(x_i, x_j)$, which is defined as $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, can be polynomial, Gaussian or sigmoid. Burges [18] gave a detailed description of how to find the separating hyperplanes. We chose the SVM implementation SVM-light [19] and the polynomial kernel function in our work. The SVM was trained using randomly chosen training pages.

3.4.3. GMM classifier

The Gaussian Mixture Model (GMM) classifier models the probability density function of a feature vector $x$ by a weighted combination of $M$ multivariate Gaussian densities:

$$p(x \mid \Lambda) = \sum_{i=1}^{M} p_i\, g_i(x)$$

where the weight (mixing parameter) $p_i$ corresponds to the prior probability that feature $x$ was generated by component $i$, and satisfies $\sum_{i=1}^{M} p_i = 1$. Each component $\lambda_i$ is represented by a Gaussian model $\lambda_i = N(p_i, \mu_i, \Sigma_i)$ whose probability density can be described as:

$$g_i(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left(-\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)$$

where $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix of Gaussian mixture component $i$ respectively, and $d$ is the dimension of the input feature vector.
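To make the mixture density concrete, the following sketch evaluates a multivariate Gaussian log-density with a full covariance matrix and then forms the weighted mixture. It uses a hand-written Cholesky factorization so that it needs only the standard library; the function names and the list-of-lists matrix representation are assumptions for this example, not part of the paper.

```python
import math


def cholesky(a):
    """Lower-triangular factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("covariance matrix must be positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def gaussian_log_density(x, mu, cov):
    """ln g(x) for a d-dimensional Gaussian with mean mu and covariance cov."""
    d = len(x)
    low = cholesky(cov)
    diff = [xi - mi for xi, mi in zip(x, mu)]
    z = []
    for i in range(d):  # forward substitution solves low * z = diff
        z.append((diff[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    mahalanobis = sum(v * v for v in z)          # (x - mu)^T cov^{-1} (x - mu)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(d))  # ln |cov|
    return -0.5 * (d * math.log(2 * math.pi) + log_det + mahalanobis)


def mixture_density(x, components):
    """p(x | Lambda) for components given as (p_i, mu_i, cov_i) triples."""
    return sum(p * math.exp(gaussian_log_density(x, mu, cov))
               for p, mu, cov in components)
```

Working with log-densities is a design choice of this sketch: for 32-dimensional features the raw density can underflow, so comparisons between components are usually safer in the log domain.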
The Gaussian mixture is therefore completely specified by the mean vectors, covariance matrices and mixture weights of all components, and can be represented by

$$\Lambda = \{\lambda_i = N(p_i, \mu_i, \Sigma_i)\}, \qquad i = 1, \ldots, M$$

The probability that an observed input vector $x$ belongs to the class $\lambda_i = N(p_i, \mu_i, \Sigma_i)$ is given, in terms of density, by

$$p(\lambda_i \mid x) = \frac{p(x \mid \lambda_i)\, p(\lambda_i)}{p(x \mid \Lambda)} = \frac{p_i\, g_i(x)}{\sum_{j=1}^{M} p_j\, g_j(x)} \qquad (5)$$

For script identification, the number of components $M$ is the number of different scripts. For bilingual documents and 16 channels of Gabor filter features, we therefore have $M = 2$ and $d = 32$. Given $N$ training samples $\{x_1, x_2, \ldots, x_N\}$, using standard techniques, the initial Gaussian mixture model represented by $(p_i, \mu_i, \Sigma_i)$ is estimated from the training samples as:

$$p_i = \frac{1}{N} \sum_{n=1}^{N} p(\lambda_i \mid x_n) = \frac{N_i}{N} \qquad (6)$$

$$\mu_i = \frac{\sum_{n=1}^{N} p(\lambda_i \mid x_n)\, x_n}{\sum_{n=1}^{N} p(\lambda_i \mid x_n)} = \frac{1}{N_i} \sum_{k=1}^{N_i} x_k^{(i)} \qquad (7)$$

$$\Sigma_i = \frac{\sum_{n=1}^{N} p(\lambda_i \mid x_n)\, (x_n - \mu_i)(x_n - \mu_i)^T}{\sum_{n=1}^{N} p(\lambda_i \mid x_n)} = \frac{1}{N_i} \sum_{k=1}^{N_i} (x_k^{(i)} - \mu_i)(x_k^{(i)} - \mu_i)^T \qquad (8)$$

In Eqs. 6, 7 and 8, $1 \le i \le M$ and $N_i$ is the number of samples which belong to class $\lambda_i$. Because the distributions of script components differ from page to page, the estimated models are refined iteratively via maximum-likelihood detection. At each iteration, the decision for each observation $x$ (test sample) is:

$$p(\lambda_1 \mid x) \underset{\lambda_2}{\overset{\lambda_1}{\gtrless}} p(\lambda_2 \mid x) \qquad (9)$$

Substituting Eq. 5 into the above equation and taking the logarithm of both sides, we obtain the following maximum likelihood decision rule:

$$(x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) \underset{\lambda_2}{\overset{\lambda_1}{\gtrless}} \ln|\Sigma_1| - \ln|\Sigma_2| + 2(\ln p_2 - \ln p_1) \qquad (10)$$

The procedure to obtain the classifier to identify the two scripts is:

(1) Estimate the parameters $(p_i, \mu_i, \Sigma_i)$ of the Gaussian mixture model using Eqs. 6, 7 and 8;
(2) For each feature vector $x$, perform the classification based on Eq. 10;
(3) Re-estimate the parameters $(p_i, \mu_i, \Sigma_i)$ based on the newly classified feature vectors;
(4) Go back to step (2) and perform the classification again until the stopping condition is satisfied.

4. EXPERIMENTS

The proposed approaches were applied to 20 randomly chosen pages from each of four bilingual dictionaries: Arabic-English, Korean-English, Hindi-English and Chinese-English. Based on these pages, we performed the following two experiments.

4.1. Experiment 1: leave-one-out

This experiment tests how the individual classifier affects the performance for limited data. For each of the four dictionaries, we partition the 20 pages into 19 training pages and 1 test page.
The process is repeated a total of 20 times and the accuracy across all partitions is shown in Table 1.

4.2. Experiment 2: use-one-training

In this experiment, one single page of the 20 pages is selected as the training set. The trained system is applied to all of the other pages and the average accuracy is recorded. Compared with the first experiment, these results show how a smaller (and more realistic) training set affects the performance. The results of this experiment are shown in Table 2.
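The two evaluation protocols can be sketched as follows, using a simple k-nn majority vote (k = 3) as the classifier. This is a minimal sketch, not the experimental code: the page representation (a list of `(feature_vector, label)` pairs per page), the Euclidean metric and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train, x, k=3):
    """Majority vote among the k training samples nearest to x (Euclidean)."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def page_accuracy(train, test_page, k=3):
    """Fraction of words on one page whose script is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test_page)
    return hits / len(test_page)


def leave_one_out(pages, k=3):
    """Train on all pages but one, test on the held-out page, for every page."""
    scores = []
    for i, test_page in enumerate(pages):
        train = [s for j, p in enumerate(pages) if j != i for s in p]
        scores.append(page_accuracy(train, test_page, k))
    return scores


def use_one_training(pages, train_index, k=3):
    """Train on a single page and test on each of the remaining pages."""
    train = pages[train_index]
    return [page_accuracy(train, p, k)
            for j, p in enumerate(pages) if j != train_index]
```

The per-page accuracies returned by either function can then be summarized with `statistics.mean`, `statistics.stdev`, `statistics.median`, `min` and `max` to obtain the kind of summary rows reported in the tables.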

Table 1. Leave-One-Out experimental results (k = 3 for k-nn; STD: standard deviation). [Table layout: columns are the Arabic, Chinese, Korean and Hindi dictionaries, each with the k-nn, SVM and GMM classifiers; rows are the average, STD, median, minimum and maximum accuracy (%). The numeric entries are not recoverable.]

Table 2. Use-One-Training experimental results (k = 3 for k-nn; STD: standard deviation). [Same layout as Table 1. The numeric entries are not recoverable.]

4.3. Experimental Result Analysis

The results in Tables 1 and 2 show the capability of Gabor filters to capture the features of different scripts and the effectiveness of these three classifiers in identifying scripts at the word level. Comparing Table 1 with Table 2 shows that a large number of training samples (19 pages) produces better performance than a small number of training samples, although the performance difference is often minimal. To show the robustness of these classifiers, we also visualize the average accuracy and standard deviation of each classifier for each dictionary.

Figure 8. The means and standard deviations of the three classifiers working on the four dictionaries.

From Figure 8(a), we can see that for a large number of training samples, all three classifiers are robust, with a maximal standard deviation of 5.43%; the k-nn classifier obtains the best average accuracy while the SVM has the smallest deviation. Figure 8(b) shows that a relatively small number of training samples can still provide reasonable results, although the training set must be selected very carefully. In the last column of Table 2, there is a very low accuracy of 15.34%, which is highlighted in italic. The reason for such a low accuracy is that the page used for training contained only a few words, which made the GMM classifier fail. Figure 8(b) also shows that for small training sets, the performance of the three classifiers is almost the same.

In the above analysis, we always set the k value of the k-nn classifier to 3. However, the choice of k may also affect the performance of this classifier. By choosing different k values (k = 1, 3, 5), we obtained the results of the leave-one-out experiment for all four dictionaries, which are shown in Figure 9. These results show that, for this script identification task, the 3-nn classifier often gives the best result, while the trend of the results is consistent across the different k values.

Figure 9. Experimental result comparison of the k-nn classifier with k = 1, 3, 5. The results are sorted based on the values of the 3-nn classifier.

Figure 10. Word segmentation for different scripts and image quality. (a) Over-segmentation of Arabic words; (b) over-segmentation of Chinese words caused by low image quality; (c) perfect word segmentation of Chinese words.

4.4. Factors in the Preprocessing Phase that Affect the Performance

We aim to provide a general script identification approach. The identification is a sequential process which includes three main phases: document image preprocessing, word segmentation and script identification. By manually examining the results, we found that the following factors in the preprocessing phase could affect the identification results:

Word segmentation and font face: Since the script identification is at the word level, word segmentation results heavily affect the script identification performance. By browsing the results of the Arabic-English dictionary, we noticed that many of the incorrectly identified Arabic words are over-segmented. This was caused by the nature of the Arabic language and by the layout of this dictionary, in which some of the bounding boxes of text lines overlap (Figure 10(a)). Many of the incorrect identifications occurred when an italic Roman script word was identified as Arabic. This may be because the two have similar texture features.

Word segmentation and image quality: By checking the Chinese/Roman identification results with low performance, we found that the incorrect identification was also caused by over-segmentation of Chinese words (Figure 10(b)). In addition to the over-segmentation of words, another factor that contributed to incorrect identification is low image quality. Comparing the second and third images in Figure 10, we can see that the image in Figure 10(b) has lower quality and the word segmentation is not as good (this page has the lowest accuracy). The image in Figure 10(c) has higher quality and the word segmentation is perfect (this page has the highest identification accuracy).

Single-character word: In all four dictionaries, another challenge is single-character Roman words. Although the identification is performed at the word level, training still yields a global texture representation of the Roman script. After word image replication and scaling, however, a single character may not have a texture similar to that representation, which leads to incorrect identification.

5. CONCLUSION

In this paper, we have compared the performance of three classifiers applied to script identification at the word level. All three classifiers are based on Gabor filter features. Experiments were carried out on Arabic-, Chinese-, Hindi- and Korean-English bilingual dictionaries, and the results show the effectiveness of the classifiers. Compared to our previous work [9], all three classifiers (k-nn, SVM and GMM) can significantly improve the classification performance. Since our classification is at the word level, one primary factor that may affect the accuracy is word segmentation, which in turn may be affected by scanning noise, text line spacing, word spacing, size and so on. We strongly believe that the results could be significantly improved by addressing the word segmentation problems.

6. ACKNOWLEDGMENT

The support of this research under DARPA cooperative agreement N , National Science Foundation grant EIA and DOD contract MDA90402C0406 is gratefully acknowledged.

REFERENCES

1. A. L. Spitz, Determination of the script and language content of document images, IEEE Trans. Pattern Analysis and Machine Intelligence 19(3).
2. D. Doermann, H. Ma, B. Karagol-Ayan, and D. W. Oard, Lexicon acquisition from bilingual dictionaries, in SPIE Conference Document Recognition and Retrieval, (San Jose, CA).
3. H. Ma and D. Doermann, Bootstrapping structured page segmentation, in SPIE Conference Document Recognition and Retrieval, (Santa Clara, CA).
4. J. Hochberg, P. Kelly, T. Thomas, and L. Kerns, Automatic script identification from document images using cluster-based templates, IEEE Trans. Pattern Analysis and Machine Intelligence 19(2).
5. P. Sibun and A. L. Spitz, Language determination: Natural language processing from scanned document images, in Proc. 4th Conference on Applied Natural Language Processing, (Stuttgart).
6. B. Waked, S. Bergler, and C. Y. Suen, Skew detection, page segmentation, and script classification of printed document images, in IEEE International Conference on Systems, Man, and Cybernetics (SMC'98), (San Diego, CA).
7. Y. Zhu, T. Tan, and Y. Wang, Font recognition based on global texture analysis, IEEE Trans. Pattern Analysis and Machine Intelligence 23(10).
8. J. G. Daugman, Complete discrete 2D Gabor transforms by neural networks for image analysis and compression, IEEE Trans. Acoustics, Speech and Signal Processing 36.
9. H. Ma and D. Doermann, Gabor filter based multi-class classifier for scanned document images, in 7th International Conference on Document Analysis and Recognition, (Edinburgh, Scotland).
10. D. J. Ittner and H. S. Baird, Language-free layout analysis, in IAPR 2nd Int'l Conf. on Document Analysis and Recognition, (Tsukuba Science City, Japan).
11. D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge, Comparing images using the Hausdorff distance, IEEE Trans. on Pattern Analysis and Machine Intelligence 15(9).
12. L. O'Gorman, The document spectrum for page layout analysis, IEEE Trans. Pattern Analysis and Machine Intelligence 15(11).
13. T. M. Cover and P. E. Hart, Nearest neighbor pattern classification, IEEE Trans. Information Theory IT-13(1).
14. C. Cortes and V. Vapnik, Support-vector networks, Machine Learning 20.
15. V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter, Comparison of view-based object recognition algorithms using realistic 3D models, in International Conference on Artificial Neural Networks, (Berlin).
16. M. Schmidt, Identifying speaker with support vector networks, in Interface '96 Proceedings, (Sydney).
17. E. Osuna, R. Freund, and F. Girosi, Training support vector machines: an application to face detection, in 1997 Conference on Computer Vision and Pattern Recognition, (San Juan, Puerto Rico).
18. C. J. C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2(2).
19. T. Joachims, Advances in Kernel Methods-Support Vector Learning, ch. Making Large-Scale SVM Learning Practical, MIT-Press, 1999.