
Online Recognition of Handwritten Telugu Characters

M. Srinivas Rao, Gowrishankar, V.S. Chakravarthy
Department of Electrical Engineering, Indian Institute of Technology, Madras, Chennai 600 036.

Abstract

A system for online recognition of handwritten Telugu script is presented. A handwritten character is constructed by executing a sequence of strokes. A structure- or shape-based representation is used, in which a stroke is represented as a string of shape features. Using this string representation, an unknown stroke is identified by comparing it with a database of strokes. A full character is recognized by identifying all its component strokes. Development of similar systems for other Indian scripts is contemplated.

1.0 Introduction

Online handwriting recognition consists of recognizing a script as it is written, using an electronic stylus or a pen on a tablet. Parameters related to the pen tip, such as position, velocity, acceleration and sometimes pressure (on the writing surface), are available to the data acquisition system [1]. Although early work dates back to the sixties, electronic pen devices have gained special attention in recent times due to the increased demand for more human-like interfaces with the computer [2,3].

Online handwriting recognition takes on a novel significance in the context of Indian languages. Presently, word processing in Indian languages can be a vexing experience because it is restricted to the regular keyboard, which is designed for English. Elaborate keyboard mapping systems are normally used for Indian languages, and these are not convenient to use. A more comfortable solution would be to let the user write in a natural, normal fashion using a suitable pen-like device and let the computer do the rest. That transfers the burden of learning keyboard mappings from the user to the computer.
This is the motivating idea of the present work, which describes a system for online recognition of the script of Telugu, a language spoken most widely in South India.

2.0 The Telugu Character Set

Telugu script, like other Indian scripts, is generally written in a non-cursive style, unlike English handwriting, which is normally cursive and therefore difficult to recognize. However, Indian scripts pose a peculiar problem that does not exist in European scripts: the problem of composite characters. Unlike the English alphabet, where a single character represents a consonant or a vowel, in Indian scripts a composite character represents either a complete syllable, or the coda of one syllable and the onset of another. Therefore, although the basic units that form the composite characters of a script are not that many (O(10^2)), these units combine in various ways to produce a large number (O(10^4)) of composite characters. For example, the word stree, a single syllable comprising three consonant sounds and a vowel, is represented by a single composite character in which graphical elements corresponding to the 3 consonants and the vowel are combined to form a complex structure.

Moreover, Telugu script is the most complex of all Indian scripts for two reasons: a) it has the largest number of vowels and consonants, and b) it has complex composition rules. The Telugu script consists of 14 vowels, 36 consonants and 3 special characters. As a general rule, when a vowel appears at the beginning of a word, the full vowel character is displayed. When a vowel appears inside a word, a vowel modifier is added to the preceding consonant, creating a composite character. The number of combinations that can form a composite character is very large. For example, all of the 36 consonants can combine with vowels to form consonant-vowel (CV) type characters, approximately 576 in number. The 36 consonant modifiers can in turn combine with these to form composite characters of consonant-consonant-vowel (CCV) type (this number runs to approximately 20,736). A character can also have more than one consonant modifier (as in stree). Although rare, CCCV type characters are also encountered in Telugu script.

Word processing with such a complex script using the QWERTY keyboard can be a harrowing experience, considering the cumbersome keyboard mappings the user has to learn. A pen-based interface in such a situation can prove to be a most convenient tool.

3.0 The Algorithm

In our approach to recognition, a handwritten character is represented as a combination of strokes. A stroke is defined as the trajectory traced by the pen tip between a touch-down on the writing surface and the following lift-off. Therefore, the first step in character recognition is to recognize all the component strokes in a character. Once the component strokes are identified, the character itself can be easily recognized. The main modules in our Telugu online recognition system are shown in Fig. 1.

i) Pen Input: We have used the Superpen TM, a product of UC Logic Inc., to generate online character data. The product, which comes with an electronic pen and a tablet, can also be used as a regular substitute for a mouse. Therefore, the pen trajectory can be recorded directly using the standard functions used for reading mouse coordinates. The pen also has provision for real-time monitoring of the pressure exerted by the pen tip on the tablet. However, pressure information is not used in our recognition algorithm. Therefore, the data output by this stage consists of a sequence of strokes, where each stroke consists of the x- and y-coordinates of the pen tip over a finite interval of time.

ii) Preprocessing: Preprocessing consists of normalization and smoothing of strokes.
The x- and y-coordinates of each stroke are normalized such that the stroke fits tightly inside a unit square. The time-varying coordinate functions, x(t) and y(t), are then smoothed independently along the t-axis using a Gaussian filter of suitable standard deviation.

iii) Feature Extraction: A feature-based approach, rather than a template-based one, is perhaps more appropriate for handwritten character recognition, considering the extent of distortion that handwritten characters undergo. In an earlier work [4], we identified certain general features, known as the shape features of handwritten characters, which are less susceptible to the distortion introduced by writing. The set of 18 shape features used in the present system is summarized in Fig. 2. The 18 shape features are denoted by the uppercase English letters A to R.

iv) Stroke Identification: In this step, the shape feature string of an unknown stroke is compared with a database of such strings. This stroke database holds the shape feature strings of all the strokes that form handwritten Telugu characters. A single stroke may have multiple variations. All the variations of a single stroke are given a common identity, the stroke Label. The stroke identification step consists of mapping an unknown string of shape features onto a stroke Label. Special issues arise when comparing the unknown string with a database string. Exact string matching techniques give very poor results since, often, stray features are inserted into the unknown string or expected features are absent from it. Therefore, the situation warrants the use of soft matching of strings.
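The preprocessing step (ii) can be illustrated with a minimal sketch. The paper does not specify how the normalization is carried out or what standard deviation is used, so the choices here are assumptions: each axis is scaled independently to [0, 1], the kernel is truncated at three standard deviations, edge samples are repeated at the stroke ends, and `sigma=1.5` (in samples) is an arbitrary default. All function names are hypothetical.

```python
import math


def normalize_stroke(points):
    """Shift and scale a stroke so it fits tightly inside the unit square.

    Assumption: x and y are scaled independently. An axis with zero extent
    (e.g. a perfectly vertical line) is mapped to 0.0.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    span_x = max(xs) - min_x
    span_y = max(ys) - min_y
    return [
        ((x - min_x) / span_x if span_x else 0.0,
         (y - min_y) / span_y if span_y else 0.0)
        for x, y in points
    ]


def gaussian_kernel(sigma):
    """Discrete Gaussian weights, truncated at 3 sigma and summing to 1."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_series(values, sigma):
    """Gaussian-smooth a 1-D sequence, repeating the edge samples at the ends."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(values) - 1
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)  # clamp index to the stroke
            acc += w * values[j]
        smoothed.append(acc)
    return smoothed


def preprocess_stroke(points, sigma=1.5):
    """Normalize a stroke, then smooth x(t) and y(t) independently."""
    normalized = normalize_stroke(points)
    xs = smooth_series([p[0] for p in normalized], sigma)
    ys = smooth_series([p[1] for p in normalized], sigma)
    return list(zip(xs, ys))
```

Smoothing after normalization can pull the extreme points slightly inside the unit square; whether the original system renormalizes afterwards is not stated.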

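The soft matching called for in step (iv) is not disclosed in detail, so the following is only a stand-in: it uses the standard Levenshtein edit distance, which tolerates exactly the kind of inserted or missing features mentioned above. The database entries, the label names and the `max_relative_distance` threshold are all hypothetical and chosen purely for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # drop a feature from a
                            curr[j - 1] + 1,      # insert a feature from b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def identify_stroke(unknown, database, max_relative_distance=0.5):
    """Return the stroke Label of the closest database string.

    The distance is divided by the longer string length so that long and
    short strokes are treated alike. Returns None when nothing is close
    enough (the threshold is an assumed value).
    """
    best_label, best_score = None, None
    for feature_string, label in database:
        longest = max(len(unknown), len(feature_string), 1)
        score = edit_distance(unknown, feature_string) / longest
        if best_score is None or score < best_score:
            best_label, best_score = label, score
    if best_score is None or best_score > max_relative_distance:
        return None
    return best_label


# Hypothetical database: (shape feature string over A..R, stroke Label).
STROKE_DATABASE = [
    ("ABCD", "stroke_1"),
    ("ABD", "stroke_1"),   # a variation of the same stroke shares its Label
    ("GHKL", "stroke_2"),
]

# A stray feature "P" is tolerated:
# identify_stroke("ABPCD", STROKE_DATABASE) returns "stroke_1"
```

A weighted edit distance, in which confusable shape features are cheaper to substitute, would be a natural refinement, but the source gives no information on which to base such weights.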
v) Character Recognition: In this stage, the stroke Labels discovered in the previous stage are related to whole characters. This is particularly essential for characters that are composed of multiple strokes. One helpful feature of Telugu script is that all the strokes that constitute a composite character can be spatially grouped, either in terms of nearness or in terms of overlap in the horizontal direction. This enables us to perform recognition one composite character at a time without requiring any further contextual information.

We need to make more precise the definition of a character onto which the component strokes are mapped in this stage. Two kinds of codes exist for any character set used in a word-processor program. One is a syntactic or syllabic code, like the ASCII code for instance, and the other is a display code, which is the font code. These two codes are related in a simple one-to-one fashion in fonts that are based on ASCII characters. But in the case of Indian scripts the relation between the two codes is not so simple. For Indian scripts there is a syntactic or syllabic code based on the sounds of the characters. Notable among such codes is the Indian Script Code for Information Interchange (ISCII), a government standard, which, remarkably, is a common code for all Indian script systems [5]. The other code, used for display purposes, is a font code, known in the present case as the Indian Standard FOnt Code (ISFOC), which changes with script.

Telugu script is written from left to right, like all other Indian scripts (barring Perso-Arabic scripts), using isolated graphical structures, each of which is separated from its neighbors in the horizontal or x-direction. Every such horizontally isolated structure has a unique ID in ISCII. In this final stage of character recognition, the stroke Labels from the previous stage are mapped onto ISCII IDs. The final output of the present system is in ISCII code.
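The grouping of strokes into horizontally isolated structures can be sketched as an interval-merging pass over the x-extents of the strokes. This is an assumed reading of "nearness or overlap in the horizontal direction": strokes are merged when their x-extents overlap or are separated by at most `max_gap`, a hypothetical tolerance whose value the paper does not give.

```python
def horizontal_extent(stroke):
    """Return (leftmost x, rightmost x) of a stroke given as (x, y) points."""
    xs = [x for x, _ in stroke]
    return min(xs), max(xs)


def group_strokes(strokes, max_gap=0.0):
    """Group strokes whose x-extents overlap or lie within max_gap.

    Returns a list of groups ordered left to right; each group is a list of
    indices into `strokes`, ordered by left edge rather than by writing order.
    """
    order = sorted(range(len(strokes)),
                   key=lambda i: horizontal_extent(strokes[i])[0])
    groups = []
    group_right = None  # right edge of the group currently being built
    for i in order:
        left, right = horizontal_extent(strokes[i])
        if groups and left - group_right <= max_gap:
            groups[-1].append(i)
            group_right = max(group_right, right)
        else:
            groups.append([i])
            group_right = right
    return groups
```

Each resulting group would then be treated as one composite character in the mapping to ISCII that follows.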
Stroke Label to ISCII conversion is done using simple look-up tables. This can be illustrated with an example. Fig. 4 shows the handwritten Telugu character stree. It represents a complete syllable comprising 3 consonants (sa + ta + ra) and a vowel (ee), that is, the C1C2C3V form. The stroke labeled C1-base in the figure represents the first consonant. The loop on the top, labeled v, represents the vowel ee, which in Telugu script is always associated with the first consonant. The strokes labeled c2 and c3 in the figure represent the consonant modifiers ta and ra respectively.

First, all the strokes in the composite character are identified, and the consonant modifiers present (if any) are identified and flagged out from that group of strokes. In the remaining group of strokes we then look for the base (no character is without a base) and identify the remaining strokes with the help of a look-up table. Here the base (C1) is searched for first, followed by the remaining strokes among the various combinations possible with that base (i.e. the vowel V is identified). After obtaining the ISCII codes of this composite character (without the consonant modifiers), the ISCII codes of the consonant modifiers (C2, C3, C4 if they exist) are appended, and the final composite character is displayed in ileap.

An output document in ISCII code, instead of ISFOC, has greater mobility since it is directly readable by well-known ISCII-based Indian language word-processing programs like ileap, Leap Office etc. The ISCII-based document can not only be edited by the above-mentioned word processors, but can also be displayed in other Indian scripts supported by ISCII.

4.0 Discussion

We have presented a system for online recognition of Telugu characters. In our approach, a character is considered as a combination of multiple strokes. A single stroke is represented as a string of shape features.
Individual strokes are identified by comparing the unknown stroke with a database of strokes. Combinations of strokes are then mapped onto ISCII codes of Telugu characters. Full details of our algorithm cannot be revealed for reasons of confidentiality.
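Although the full algorithm is not disclosed, the look-up procedure described for the stree example in Section 3.0 can be sketched under stated assumptions. The stroke Labels, the table contents and the symbolic code names below are all hypothetical; actual ISCII byte values are defined in IS 13194-1991 [5] and are deliberately not reproduced. The output order (base, vowel, then appended modifiers) follows the description in the text and does not model any further ISCII encoding rules.

```python
# Hypothetical look-up tables keyed by stroke Label. The values are symbolic
# placeholders, not real ISCII byte values.
MODIFIER_TABLE = {"c2_ta": "CODE_TA", "c3_ra": "CODE_RA"}
BASE_TABLE = {"C1_sa": "CODE_SA"}
VOWEL_TABLE = {("C1_sa", "v_ee"): "CODE_VOWEL_EE"}  # keyed by (base, stroke)


def decode_character(stroke_labels):
    """Map the stroke Labels of one composite character to a code sequence.

    Steps, following the description in the text:
      1. flag out the consonant modifiers,
      2. find the base among the remaining strokes,
      3. resolve the other remaining strokes in the context of that base,
      4. append the modifier codes.
    """
    modifiers = [s for s in stroke_labels if s in MODIFIER_TABLE]
    remaining = [s for s in stroke_labels if s not in MODIFIER_TABLE]

    bases = [s for s in remaining if s in BASE_TABLE]
    if len(bases) != 1:
        raise ValueError("expected exactly one base stroke, found %d" % len(bases))
    base = bases[0]

    codes = [BASE_TABLE[base]]
    for label in remaining:
        if label == base:
            continue
        key = (base, label)
        if key not in VOWEL_TABLE:
            raise ValueError("no table entry for %r with base %r" % (label, base))
        codes.append(VOWEL_TABLE[key])

    codes.extend(MODIFIER_TABLE[m] for m in modifiers)
    return codes


# The four strokes of "stree" in Fig. 4, with hypothetical Labels:
# decode_character(["C1_sa", "v_ee", "c2_ta", "c3_ra"])
# returns ["CODE_SA", "CODE_VOWEL_EE", "CODE_TA", "CODE_RA"]
```

Keying the vowel table on the (base, stroke) pair reflects the statement that the remaining strokes are searched among the combinations possible with a given base.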

Fig. 5 shows a screen dump of the user interface of our system. The system is developed for the Windows platform and uses the Superpen TM, a product of UC Logic, Inc. The system is only meant to demonstrate the success of the algorithm and not to function as a regular editor. The writer writes anywhere on the lines shown in the display window, and the recognized characters are displayed using the ileap software (Fig. 6). Since the current system is a research prototype, we have not conducted a detailed performance evaluation. However, those who know Telugu script will appreciate the difficulty of recognizing the displayed handwritten characters.

Currently our stroke database is quite small (312 samples with 239 unique strokes) and has stroke data from only a few writers. Note that the database contains only strokes and does not deal with stroke combinations. Therefore, although there are only a few strokes, their combinations can generate all the necessary O(10^4) composite characters. Future efforts towards improvement of this system will include collection of strokes from a wider selection of writers, incorporating more variations of each stroke.

References

1. R. Plamondon and S.N. Srihari, On-line and Off-line Handwriting Recognition: A Comprehensive Survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, January 2000.
2. M. Eden, Handwriting and Pattern Recognition, IRE Transactions on Information Theory, vol. 8, 1962.
3. R. Plamondon, D. Lopresti, L.R.B. Schomaker and R. Srihari, On-line Handwriting Recognition, Encyclopedia of Electrical and Electronics Eng., J.G. Webster, ed., vol. 15, pp. 123-146, New York: Wiley, 1999.
4. V.S. Chakravarthy and B. Kompella, The Shape of Handwritten Character, National Conference on Document Analysis and Character Recognition NCDAR2001, Mandya, India, July 12-13, 2001.
5. Indian Script Code for Information Interchange, IS 13194-1991, Bureau of Indian Standards, Manak Bhavan, 9 Bahadur Shah Zafar Marg, New Delhi.

Figure 1: A schematic describing process flow of the online Telugu character recognition system. (Blocks shown: Pen Input, Preprocessing, Feature Extraction, Stroke Database, Stroke-Char LUT, Stroke Recognition, Character Recognition, Display/Archive.)

Figure 2: List of shape features.

Figure 3: Shape features extracted from Telugu character la.

Figure 4: Handwritten Telugu character stree, an example of a composite character comprising 3 consonants and a vowel.

Figure 5: A sample of handwritten Telugu text displayed in the client area of our software.

Figure 6: Our software recognizes the handwritten text shown in Figure 5. Two mistakes were made in the second line: a) in the 3rd character of the first word and b) in the 3rd character of the second word.