Indian Lipi Recognition Using Image Analysis Technology

Bikram Ballav*
Department of CSE, BPUT, Bhubaneswar, Odisha, India

Abstract
Optical character recognition (OCR) is a document image analysis technology in which scanned digital images are supplied as input and translated into editable, machine-readable digital text. OCR has been an active research topic over the past decades, and efficient algorithms have increased both the speed and the accuracy of character recognition. A substantial amount of work has been done on languages such as English and Chinese, but very few papers address Indian languages, barring a few for Hindi and Bengali. Our research was therefore directed towards developing a novel algorithm for offline typewritten Odia character recognition using template matching.

Keywords: character recognition, digital text, image translation, OCR

*Corresponding Author E-mail: bikram7777@gmail.com

INTRODUCTION
Optical character recognition (OCR) is the process of translating images of handwritten or printed text into a format understood by machines for the purposes of editing, indexing/searching, and reducing storage size. [1–3] The first step in OCR is to go back to the roots of the language and study the individual characters that make it up. The Odia language has 50 different characters, of which 12 are vowels and the rest are consonants; I have developed an algorithm to make their recognition easier. Recognizing the characters and numerals of a language is challenging because of differing font sizes and variations in writing style. Character recognition (CR) can be classified into two types:
1. On-line character recognition systems
2. Off-line character recognition systems [4–6]

Organization of This Paper
This paper is organized as follows. Section Motivation describes the motivation of this paper.
Section Data Collection explains the Odia language and the collection of data. The major steps in character recognition are discussed in section Major Steps in Odia Character Recognition. Implementation details and the proposed framework are presented in section Methodology and Proposed Algorithm. The experimental results are discussed in section Results, and the conclusions and future work are given in section Conclusions and Future Work.

MOTIVATION
A large amount of work has been done in this area for some Indian languages and scripts, such as Bengali, Devanagari, and Tamil. I therefore attempt to develop an OCR system for the Odia language as well.

IJIPPR (2015) 11–17 JournalsPub 2015. All Rights Reserved

DATA COLLECTION
India is a multi-lingual and multi-script country, and Odia is one of its popular languages, used mainly in the state of Odisha. The Odia script, in which the Odia language is written, is
developed from the Kalinga script, one of the many descendants of the Brahmi script of ancient India. As in other Indian scripts, Odia has no concept of upper and lower case. Of the 12 independent vowels, 11 have dependent forms (all except the first vowel). These characters are called basic characters; they and the Odia numerals, with their corresponding English numerals, are shown in Figures 1–3. Writing runs from left to right. In Odia script, a vowel following a consonant takes a modified shape. Depending on the vowel, the modified shape is placed to the left, to the right, on both sides, or at the bottom of the consonant. These modified shapes are called modifiers or matra, as shown in Figure 4. There are more than 200 compound characters in Odia script; [7] in this paper I consider only the recognition of off-line typewritten Odia basic characters, using template matching with Unicode mapping. Similarly shaped characters, shown in Figure 5, make it harder to reach a high recognition rate and make the recognition system more complex. The Unicode values of the characters are shown in Figure 11.

Fig. 1. Odia Vowels.
Fig. 2. Odia Consonants.
Fig. 3. Odia Numerals.
Fig. 4. Odia Modified Characters.
Fig. 5. Similar Shaped Characters.

MAJOR STEPS IN ODIA CHARACTER RECOGNITION
The process of OCR consists of a series of stages, each passing its results on to the next in pipeline fashion, as shown in Figure 6.

Data collection → Pre-processing → Segmentation → Extraction of features → Classification → Unicode mapping → Text file
Fig. 6. Block Diagram of OCR.
The techniques for feature extraction are often divided into three main groups, according to where the features are found:
1. The distribution of points.
2. Transformations and series expansions.
3. Structural analysis.

Template matching is the process of finding the location of a sub-image, called a template, inside an image. It differs from the techniques above in that no features are actually extracted. Instead, the matrix containing the image of the input character is matched directly against a set of prototype characters, one for each possible class. The distance between the pattern and each prototype is computed, and the class of the prototype giving the best match is assigned to the pattern. The technique is simple, easy to implement in hardware, and has been used in many commercial OCR machines. However, it is sensitive to noise and style variations and has no way of handling rotated characters. Only a small number of postures can be recognized, so template matching does not work well when the application requires a large posture set.

METHODOLOGY AND PROPOSED ALGORITHM
The steps of the proposed algorithm are implemented in MATLAB (R2012a), following the block diagram shown in Figure 6.

Database Creation
A database is created containing images of all Odia characters from ଅ to ଲ, each of 50 × 50 pixels.

Data Acquisition
A digital image is captured through the scanning process. Scanned images are stored in a picture file format such as BMP, JPG, or GIF, as shown in Figure 7.

Fig. 7. (a,b) Two Input Images.

RGB to Gray Conversion
The first stage of pre-processing converts the input RGB image into a gray-scale image, as shown in Figure 8. [8]

Binarization
Binarization converts a gray-scale image (pixel values 0–255) into a binary image (pixel values 0 and 1) by selecting a threshold between 0 and 255 (here the threshold is 128), as shown in Figure 8. [9,10]

Fig. 8. Document Image Binarization.
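The two pre-processing steps described above (gray-scale conversion followed by thresholding at 128) can be illustrated with a minimal Python sketch. This is not the author's MATLAB code: the function names, the BT.601 luma weights, and the convention that dark ink maps to 1 are assumptions made for illustration, since the paper specifies only the threshold value.

```python
THRESHOLD = 128  # threshold value stated in the text


def rgb_to_gray(rgb_rows):
    """Convert rows of (r, g, b) tuples (0-255) into rows of gray levels (0-255)."""
    # Assumption: ITU-R BT.601 luma weights; the paper does not state its weights.
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_rows
    ]


def binarize(gray_rows, threshold=THRESHOLD):
    """Map gray levels to 0/1: 1 marks ink (darker than threshold), 0 background."""
    # Assumption: dark text on light paper, so values below the threshold become 1.
    return [[1 if g < threshold else 0 for g in row] for row in gray_rows]
```

A fixed global threshold such as 128 is the simplest choice and suits clean typewritten scans; unevenly lit pages would generally need an adaptive threshold, which the paper does not discuss.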
Fig. 8 panels: (a) Binary Image, (b) Gray Image.

Skew Detection and Correction
If the paper or source document is not aligned properly during scanning, the components of the image may be tilted, which can lead to erroneous behavior of the OCR system. [11] To prevent this, a skew detection and correction method is applied: it detects and removes the skew from the image, and the image boundaries are then adjusted so that the result looks like the original document.

Segmentation
Segmentation is an operation in which the image is decomposed into sub-images of individual
symbols. Character segmentation is a key requirement that determines the utility of conventional character recognition systems. It includes line, word, and character segmentation.

Line Segmentation
In a printed script, the text lines are of almost the same height, provided the script is written in a single font size. Here the script is produced by a typewriter, so the font size is uniform throughout. Between two text lines there is a narrow horizontal band with no or very few text pixels. By scanning the rows, detecting these valleys, and storing them as break-points, the text-line bands can be retrieved.

Character Segmentation
After line segmentation, every segmented line goes through character segmentation, which splits the line into isolated characters for further processing.

Feature Extraction
After character segmentation, each segmented character is extracted in the form of a matrix, as shown in Figure 9.

Fig. 9. Character Extraction in form of Matrix.

Classification Using Template Matching
As described earlier, template matching extracts no features: the matrix containing the image of the input character is matched directly against a set of prototype characters, one for each possible class. The distance between the pattern and each prototype is computed, and the class of the prototype giving the best match is assigned to the pattern. There are two steps in building a classifier: training and testing. These steps can be broken down further into sub-steps, as shown in Figure 10.

Fig. 10. Template Mapping Approach.
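The line segmentation and template-matching steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's MATLAB implementation: the function names, the `min_ink` parameter, and the choice of Hamming distance are assumptions, because the paper does not state which distance measure it uses. Images are 0/1 matrices with 1 marking ink.

```python
def segment_lines(binary_rows, min_ink=1):
    """Split a 0/1 page image into text-line bands via the horizontal projection.

    Rows with fewer than `min_ink` ink pixels are treated as gaps between lines.
    Returns a list of (top, bottom) row indices, with bottom exclusive.
    """
    bands, start = [], None
    for y, row in enumerate(binary_rows):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a text band begins on this row
        elif not has_ink and start is not None:
            bands.append((start, y))       # the band ended on the previous row
            start = None
    if start is not None:                  # band that reaches the bottom edge
        bands.append((start, len(binary_rows)))
    return bands


def hamming_distance(a, b):
    """Count the pixels at which two equal-sized 0/1 matrices differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(glyph, templates):
    """Return the label of the template closest to `glyph`.

    `templates` maps a label (for example a Unicode character) to a 0/1
    matrix of the same size as `glyph` (50 x 50 in the paper's database).
    """
    return min(templates, key=lambda label: hamming_distance(glyph, templates[label]))
```

Character segmentation within a line can reuse the same idea on columns instead of rows. Because the comparison is pixel by pixel, each segmented glyph has to be scaled to the template size before `classify` is called.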
Unicode Mapping
The Unicode Standard is a universal character encoding scheme that defines a uniform way of encoding multilingual text, enabling the international exchange of text data and forming a foundation for global software. The Odia Unicode range is U+0B00 to U+0B7F. For example, the code point for the character ଅ is U+0B05, and the code point for the character ଙ is U+0B19. All code points in the Odia block lie in the Basic Multilingual Plane, so each one fits in a single 16-bit (2-byte) UTF-16 code unit. Unicode text is simple to parse and process, and Unicode characters have well-defined semantics. Hence, Unicode is chosen as the encoding scheme for the current work. After classification, the characters are recognized and a mapping table is created
in which each recognized character is mapped to its Unicode value. As shown in Figure 11, Table 1 lists the Unicode values with the corresponding Odia characters.

Fig. 11. Unicode Mapping.

RESULTS
No standardized test sets exist for character recognition, and because the performance of an OCR system depends heavily on the quality of the input, it is difficult to evaluate and compare different systems. Still, recognition rates are often given, usually as the percentage of characters correctly classified:

\[ \%\,\text{Accuracy} = \frac{\text{No. of characters found correctly}}{\text{Total no. of patterns}} \times 100 \]

\[ \%\,\text{Accuracy} = \frac{42 \times 100}{\text{Total no. of patterns}} = 93.87\% \]

To illustrate the accuracy for Odia characters, typewritten text images in different fonts and sizes were tested with the OCR algorithm in MATLAB (R2010a, 64-bit), and performance was measured on these samples, as shown in Figures 14 and 15. According to the results in Figures 12 and 13, only the 2nd vowel is not matched properly.

Fig. 12. Input Image with All Vowels and Consonants.
Fig. 13. Output in Text Format of all Vowels and Consonants.

The evaluation of an OCR system uses three different performance rates:
Recognition rate: the proportion of characters correctly classified.
Rejection rate: the proportion of characters that the system was unable to recognize. Rejected characters can be flagged by the OCR system and are therefore easily traced for manual correction.
Error rate: the proportion of characters erroneously classified. Misclassified characters go undetected by the system, so manual inspection of the recognized text is necessary to detect and correct these errors.

Fig. 14. Input Image with Noise.
Fig. 15. Output with Text Format.

CONCLUSIONS AND FUTURE WORK
The typewritten Odia character recognition algorithm was tested successfully on a large number of test samples in different fonts. Accuracy was about 94% when basic characters and numerals are considered. Around 6% of the characters were misrecognized because of similarity between characters. Our work focused on the template matching technique with Unicode mapping, which matches each segmented character against the templates and maps the best match to its Unicode value. As a result, the recognition process runs smoothly. Although the prototype offers several advantages to users, it still has limitations with handwritten characters and compound characters. Character recognition remains a challenging problem because the same character varies with font size, type of noise, and the person writing it. Further research could improve the prototype by extending the template matching technique to handwritten and compound characters.

REFERENCES
1. Optical character recognition. http://en.wikipedia.org/wiki/optical_character_recognition.
2. Jadhav D.A., Veeresh G.K. Multi-Font/Size Character Recognition, Int J Adv Eng Technol. 2012.
3.
http://ethesis.nitrkl.ac.in/3823/1/thesis_-_odia_offline_character_recognition_-_108cs021.pdf
4. Kapoor R., Gupta S., Sharma C.M. Multi-font/size character recognition and document scanning, Int J Comput Appl. 2011; 23(1): 21–4p. Tavel P. Modeling and Simulation Design. AK Peters Ltd; 2007.
5. Sharma P., Singh R. Performance of English Character Recognition with and without Noise, Int J Comput Trends Technol. 2013; 4(3).
6. http://www.cs.uic.edu/~srizvi/bit_thesis.pdf.
7. Special Issue on character recognition and document understanding, IEICE Trans Inform Syst. 1996; E79-D(5).
8. Chaudhuri B.B., Pal U., Mitra M. Automatic recognition of printed Oriya script, IEEE Trans Pattern Recog Mach Intell. 2002; 27(1): 23–34p.
9. Chen M.Y., Kundu A., Zhou J. Off-line handwritten word recognition using a HMM type stochastic network, IEEE Trans Pattern Recog Mach Intell. 1994; 16: 481–6p.
10. Chandarana J., Kapadia M. Optical character recognition, IJETAE Tranjact. 2014; 4(5).
11. The Unicode Standard (U0B00.pdf). http://www.unicode.org/public/7.0.0/charts/ for a complete archived file of character code charts for Unicode 7.0.

Bikram Ballav is an Assistant Professor in the CSE department at Biju Patnaik University of Technology (BPUT), Odisha. He completed his M Tech in CSE from SoA University, Odisha, India, and his B Tech from Asansol Engineering College, WB, India. His areas of interest are mobile sensor networks and network security. He has also worked in the area of natural language processing. He has published 5 research papers, of which 3 are international journal papers and 2 are international conference papers. He loves reading books and listening to music.