Indian Lipi Recognition Using Image Analysis Technology

Bikram Ballav*
Department of CSE, BPUT, Bhubaneswar, Odisha, India

Abstract
Optical character recognition (OCR) is a document image analysis technology in which scanned digital images are given as input to the system and translated into an editable, machine-readable digital text format. OCR has been an active research topic for the past decades, and efficient algorithms have increased the speed and accuracy of character recognition. A substantial amount of work has been done on foreign languages such as English and Chinese, but very few papers address Indian languages, barring a few for Hindi and Bengali. Hence, our research work was directed towards the development of a novel algorithm for offline typewritten Odia character recognition using template matching.

Keywords: character recognition, digital text, image translation, OCR

*Corresponding Author E-mail: bikram7777@gmail.com

INTRODUCTION
Optical character recognition (OCR) is the process of translating images of handwritten or printed text into a format understood by machines, for the purposes of editing, indexing/searching, and reducing storage size. [1–3] The first step in OCR is to go back to the roots of the language and study the individual characters that make it up. The Odia language has 50 different characters, of which 12 are vowels and the rest are consonants; to make their recognition easier, I have developed an algorithm. Recognizing the characters and numerals of a language is a challenging problem because of different font sizes and different types of writing variations. Character recognition (CR) can be classified into two types:
1. On-line character recognition systems
2. Off-line character recognition systems [4–6]

Organization of This Paper
This paper is organized as follows. Section Motivation describes the motivation of this paper.
Section Data Collection explains the Odia language and the collection of data. The major steps in character recognition are discussed in section Major Steps in Odia Character Recognition. Implementation details and the proposed framework are presented in section Methodology and Proposed Algorithm. The experimental results are discussed in section Results, and finally the conclusion and future work are given in section Conclusions and Future Work.

IJIPPR (2015) 11–17 JournalsPub 2015. All Rights Reserved

MOTIVATION
A large amount of work has been done in this area for some Indian languages, such as Bengali, Devanagari, and Tamil. Thus, I am making an attempt to develop an OCR system for the Odia language.

DATA COLLECTION
India is a multi-lingual and multi-script country, and Odia is one of its popular languages, used mainly in the state of Odisha. The Odia script, in which the Odia (Oriya) language is written, developed from the Kalinga script, one of the many descendants of the Brahmi script of ancient India. As in other Indian scripts, the concept of upper/lower case is absent in Odia. Of the 12 independent vowels, 11 have dependent forms (i.e., all except the first vowel). These characters are called basic characters; they and the Odia numerals, with their corresponding English numerals, are shown in Figures 1–3. The writing direction is left to right.

In Odia script, a vowel following a consonant takes a modified shape. Depending on the vowel, the modified shape is placed to the left, to the right, on both sides, or at the bottom of the consonant. These modified shapes are called modifiers or matra, as shown in Figure 4. There are more than 200 compound characters in Odia script. [7] In this paper I consider the recognition of off-line typewritten Odia basic characters using template matching with Unicode mapping. Similarly shaped characters, shown in Figure 5, make recognition more difficult and make a system that achieves a high recognition rate more complex. The Unicode code points of the characters are shown in Figure 11.

Fig. 1. Odia Vowels.
Fig. 2. Odia Consonants.
Fig. 3. Odia Numerals.
Fig. 4. Odia Modified Characters.
Fig. 5. Similar Shaped Characters.

MAJOR STEPS IN ODIA CHARACTER RECOGNITION
The process of OCR consists of a series of stages, with each stage passing its results on to the next in pipeline fashion, as shown in Figure 6.

Data collection → Pre-processing → Segmentation → Extraction of features → Classification → Unicode mapping → Text file

Fig. 6. Block Diagram of OCR.

The techniques for feature extraction are often divided into three main groups, according to where the features are found:
1. The distribution of points.
2. Transformations and series expansions.
3. Structural analysis.

Template matching is the process of finding the location of a sub-image, called a template, inside an image. It differs from the other techniques in that no features are actually extracted. Instead, the matrix containing the image of the input character is matched directly against a set of prototype characters representing each possible class. The distance between the pattern and each prototype is computed, and the class of the prototype giving the best match is assigned to the pattern. The technique is simple, easy to implement in hardware, and has been used in many commercial OCR machines. However, it is sensitive to noise and style variations and has no way of handling rotated characters. Also, only a small number of possible postures can be recognized; if the application requires a large posture set, template matching will not work well.

METHODOLOGY AND PROPOSED ALGORITHM
The steps of the proposed algorithm are implemented in MATLAB (R2012a), following the block diagram shown in Figure 6.

Database Creation
A database was created of all character images of the Odia script from ଅ to ଲ, each 50 × 50 pixels.

Data Acquisition
A digital image is captured through the scanning process. Scanned images are stored in a picture file format such as BMP, JPG, or GIF, as shown in Figure 7.

Fig. 7. (a,b) Two Input Images.

RGB to Gray Conversion
The first stage of pre-processing is to convert the input RGB image into a gray-scale image, as shown in Figure 8. [8]

Binarization
Binarization is the process of converting a gray-scale image (pixel values 0–255) into a binary image (pixel values 0 and 1) by selecting a threshold value between 0 and 255 (here the threshold is 128), as shown in Figure 8. [9,10]

Fig. 8. Document Image Binarization.
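The gray conversion and binarization stages described above can be sketched in a few lines. This is only an illustration in plain Python, not the MATLAB code used in this work. The function names `rgb_to_gray` and `binarize` are hypothetical, the luma weights 0.299/0.587/0.114 are an assumed (commonly used) choice for the gray conversion, and mapping pixels at or above the threshold to 1 is likewise an assumption. Only the threshold value 128 comes from the text.

```python
def rgb_to_gray(rgb):
    """Convert rows of (R, G, B) tuples to gray levels in 0..255.

    Assumption: standard luma weights; the paper does not state its formula.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb
    ]


def binarize(gray, threshold=128):
    """Map gray levels to 0/1 with a fixed global threshold.

    Assumption: values >= threshold become 1 (background on white paper),
    darker values become 0 (ink).
    """
    return [[1 if value >= threshold else 0 for value in row] for row in gray]


# Tiny 2 x 2 "scan": white, black, light gray, dark gray.
sample = [
    [(255, 255, 255), (0, 0, 0)],
    [(200, 200, 200), (100, 100, 100)],
]
binary = binarize(rgb_to_gray(sample))
print(binary)
```

A fixed global threshold is the simplest possible choice; it suits clean typewritten scans but may need adjusting for unevenly lit pages.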
Fig. 8 panels: (a) Binary Image, (b) Gray Image.

Skew Detection and Correction
If the paper or source document is not aligned properly during scanning, the components of the image may be tilted, which could lead to erroneous behavior of the OCR system. [11] To prevent this, a skew detection and correction method is applied, which detects and removes the skew from the image; the image boundaries are then adjusted so that the result looks like the original image.

Segmentation
Segmentation is an operation in which the image is decomposed into sub-images of individual symbols. Character segmentation is the main requirement that determines the utility of conventional character recognition systems. It includes line, word, and character segmentation.

Line Segmentation
In a printed script, the text lines are of almost the same height, provided the script is written in a single font size. Here the script is produced by a typewriter, so the font size is uniform throughout. Between two text lines there is a narrow horizontal band with either no pixels or very few pixels. Hence, by locating and storing these break points, which mark the valleys between lines, the text-line bands can be retrieved.

Character Segmentation
After line segmentation, every segmented line goes through character segmentation, in which the line is split into its isolated characters for further processing.

Feature Extraction
After character segmentation, each segmented character is extracted in the form of a matrix, as shown in Figure 9.

Fig. 9. Character Extraction in form of Matrix.

Classification Using Template Matching
As described earlier, template matching extracts no features: the matrix containing the image of the input character is matched directly against the set of prototype characters, the distance between the pattern and each prototype is computed, and the class of the best-matching prototype is assigned to the pattern. There are two steps in building a classifier: training and testing. These steps can be broken down further into sub-steps, as shown in Figure 10.

Fig. 10. Template Mapping Approach.
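The line segmentation, character segmentation, and template-matching steps described above can be illustrated with a small sketch. It is written in plain Python rather than the MATLAB used in this work, and it rests on several assumptions that are not from the paper: ink pixels are 1 and background pixels are 0, lines and characters are separated by completely empty rows and columns, each segmented character already has the same size as the templates (no resizing to 50 × 50 is shown), and the distance is a simple count of mismatching pixels. All function names are hypothetical, and the 3 × 3 shapes are toy patterns, not real Odia glyphs; they are labelled with the code points of ଅ and ଙ purely for illustration.

```python
def runs(profile):
    """Return (start, end) pairs for each stretch of non-zero profile entries."""
    spans, start = [], None
    for i, count in enumerate(profile):
        if count and start is None:
            start = i
        elif not count and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_lines(image):
    """Split a binary page (1 = ink) into text-line bands via row ink counts."""
    row_profile = [sum(row) for row in image]
    return [image[top:bottom] for top, bottom in runs(row_profile)]


def segment_characters(line):
    """Split one text line into character matrices via column ink counts."""
    col_profile = [sum(col) for col in zip(*line)]
    return [[row[left:right] for row in line] for left, right in runs(col_profile)]


def distance(a, b):
    """Count mismatching pixels between two equally sized binary matrices."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(char, templates):
    """Return the label (here a code point) of the closest template."""
    return min(templates, key=lambda label: distance(char, templates[label]))


def recognize(page, templates):
    """Segment the page, classify every character, and build the output text."""
    lines = []
    for line in segment_lines(page):
        chars = segment_characters(line)
        lines.append("".join(chr(classify(c, templates)) for c in chars))
    return "\n".join(lines)


# Toy templates (NOT real glyph shapes), keyed by an Odia code point.
BOX = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
CROSS = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
TEMPLATES = {0x0B05: BOX, 0x0B19: CROSS}

# A noisy BOX: the centre pixel is wrongly set to ink.
NOISY_BOX = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]


def side_by_side(left, right):
    """Place two glyphs on one line, separated by one empty column."""
    return [a + [0] + b for a, b in zip(left, right)]


blank = [0] * 7
page = ([blank] + side_by_side(BOX, CROSS)
        + [blank] + side_by_side(CROSS, NOISY_BOX) + [blank])

text = recognize(page, TEMPLATES)
print(ascii(text))
```

The noisy character is still assigned to the nearer template, which is the behaviour the best-match rule is meant to give; a real system would also need size normalization before the pixel comparison is meaningful.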
Unicode Mapping
The Unicode standard is the universal character encoding scheme; it defines a uniform way of encoding multilingual text, which enables the international exchange of text data and creates the foundation for global software. The Odia Unicode range is U+0B00 to U+0B7F. For example, the code point for the character ଅ is U+0B05, and the code point for the character ଙ is U+0B19. Every code point in this range lies in the Basic Multilingual Plane, so each Odia character fits in a single 16-bit (2-byte) code unit in UTF-16. Unicode text is simple to parse and process, and Unicode characters have well-defined semantics. Hence, Unicode is chosen as the encoding scheme for the current work.

After classification, the characters are recognized and a mapping table is created in which each character is mapped to its Unicode code point. Figure 11 (Table 1) shows the Unicode code points with the corresponding Odia characters.

Fig. 11. Unicode Mapping.

RESULTS
No standardized test sets exist for character recognition, and because the performance of an OCR system is highly dependent on the quality of the input, it is difficult to evaluate and compare different systems. Still, recognition rates are often given, usually as the percentage of characters correctly classified. According to the results in Figures 12 and 13, only the 2nd vowel is not matched properly.

Fig. 12. Input Image with All Vowels and Consonants.
Fig. 13. Output in Text Format of all Vowels and Consonants.

$$\%\,\text{Accuracy} = \frac{\text{No. of characters found correctly}}{\text{Total no. of patterns}} \times 100$$

$$\%\,\text{Accuracy} = \frac{42}{\ldots} \times 100 = 93.87\%$$

So %Accuracy = 93.87%.

To illustrate the accuracy on Odia characters, typewritten text images in different fonts and sizes were tested with the OCR algorithm in MATLAB (R2010a, 64-bit), and performance was measured using this sample, as shown in Figures 14 and 15.

The evaluation of an OCR system uses three different performance rates:
Recognition rate: The proportion of correctly classified characters.
Rejection rate: The proportion of characters that the system was unable to recognize. Rejected characters can be flagged by the OCR system and are therefore easily retraceable for manual correction.
Error rate: The proportion of characters erroneously classified. Misclassified characters go undetected by the system, and manual inspection of the recognized text is necessary to detect and correct these errors.

Fig. 14. Input Image with Noise.
Fig. 15. Output with Text Format.

CONCLUSIONS AND FUTURE WORK
The typewritten Odia character recognition algorithm was successfully tested using a large number of test samples of different font images. Accuracy was about 94% when basic characters and numerals are considered. Around 6% of the characters were misrecognized because of similarity between characters. Our work focused on the template matching technique with Unicode mapping, which matches each individual character against the templates and maps it to its Unicode code point. Although the system prototype offers several advantages to users, it still faces a number of limitations with handwritten characters and compound characters. Character recognition remains a challenging problem because the same character varies with font size, type of noise, and the person writing. Further research could improve the prototype by handling handwritten characters and compound characters with the template matching technique.

REFERENCES
1. Optical character recognition. http://en.wikipedia.org/wiki/optical_character_recognition
2. Jadhav D.A., Veeresh G.K. Multi-Font/Size Character Recognition, Int J Adv Eng Technol. 2012.
3. http://ethesis.nitrkl.ac.in/3823/1/thesis_-_odia_offline_character_recognition_-_108cs021.pdf
4. Kapoor R., Gupta S., Sharma C.M. Multi-font/size character recognition and document scanning, Int J Comput Appl. 2011; 23(1): 21–4p. Tavel P. Modeling and Simulation Design. AK Peters Ltd; 2007.
5. Sharma P., Singh R. Performance of English Character Recognition with and without Noise, Int J Comput Trends Technol. 2013; 4(3).
6. http://www.cs.uic.edu/~srizvi/bit_thesis.pdf

7. Special Issue on character recognition and document understanding, IEICE Trans Inform Syst. 1996; E79-D(5).
8. Chaudhuri B.B., Pal U., Mitra M. Automatic recognition of printed Oriya script, IEEE Trans Pattern Recog Mach Intell. 2002; 27(1): 23–34p.
9. Chen M.Y., Kundu A., Zhou J. Off-line handwritten word recognition using a HMM type stochastic network, IEEE Trans Pattern Recog Mach Intell. 1994; 16: 481–6p.
10. Chandarana J., Kapadia M. Optical character recognition, IJETAE Tranjact. 2014; 4(5).
11. The Unicode Standard (U0B00.pdf). http://www.unicode.org/public/7.0.0/charts/ for a complete archived file of character code charts for Unicode 7.0.

Bikram Ballav is an Assistant Professor in the CSE department at Biju Patnaik University of Technology (BPUT), Odisha. He completed his M Tech in CSE from SoA University, Odisha, India and his B Tech from Asansol Engineering College, WB, India. His areas of interest are mobile sensor networks and network security. He has also worked in the area of natural language processing. He has published 5 research papers, of which 3 are international journal papers and 2 are international conference papers. He loves reading books and listening to music.