The Encoding and Processing of Indian Languages - an Alternative Approach

Md Maruf HASAN
Computational Linguistic Laboratory, Nara Institute of Science and Technology, Takayama, Ikoma, Nara, Japan
mmhasan@computer.org

Abstract

CJK (Chinese, Japanese and Korean) language systems use multi-byte coding, while the existing encodings of Indian languages are based on single-byte coding schemes. Because of the special characteristics of Indian languages, it is advantageous to consider multi-byte encoding for their computer processing. In this paper, we discuss a model for encoding and processing Indian languages in a multi-byte framework.

Keywords: Character Coding, Multilingual Text Processing, Indian Language Processing.

Introduction

Chinese, Japanese and Korean (CJK) languages are ideographic and consist of a huge number of ideographic characters. CJK languages have been processed quite efficiently on every computing platform for several years (Ken Lunde, 1999). Because of the large character sets of these languages, double-byte encoding is used in many implemented systems. Some CJK implementations are based on the UNICODE recommendation (UNICODE, 1996).

Indian languages, on the other hand, are considered alphabetic languages with a small set of characters to encode. A single-byte coding scheme of fewer than a hundred characters has been adopted, which places these characters in the Extended ASCII area and processes them like European languages. The UNICODE Consortium also recommends encoding this small number of characters in a compact coding space. In comparison to the European languages, however, Indian languages have certain special characteristics which can be better handled by a multi-byte coding scheme.

English text is a straightforward one-dimensional array of characters (letters). Other European languages (e.g., French) include a few accented characters. Single-byte encoding with proper rendering can process these European languages efficiently. Chinese, Japanese and Korean also have simple one-dimensional arrangements of their characters in their written forms.

Although Indian languages have alphabets of fewer than a hundred characters, including consonants and vowels, this small set of characters (unlike in European languages) generates a few thousand complicated and irregular ligatures. The original glyphs of the vowels and consonants, even with complicated rendering techniques, cannot always reproduce the typical face of a ligature. Moreover, dictionary sorting of Indian languages is unusual: consonants and vowels have different priorities in sorting.

Using single-byte encoding for the small set of characters (the alphabet) of an Indian language is comparable to encoding the small number of radicals of Chinese and generating all of the several thousand Chinese characters (ideographs) by rendering. For Korean, encoding the Jamo letters and then rendering and generating thousands of Hangul syllable blocks would be a similar example. However, neither Chinese nor Korean is encoded and processed in this way.

Because of the special characteristics of Indian languages mentioned above, we can take advantage of multi-byte encoding and encode all the consonants and vowels (comparable to Korean Jamo letters and Chinese radicals) along with the ligatures (comparable to Korean Hangul syllable blocks and Chinese characters), rather than encoding only the vowels and the consonants. Indian languages can be processed efficiently in this way. This paper presents such a multi-byte encoding model for Indian languages.

1 Overview of CJK Text Processing

Computer processing of Chinese, Japanese and Korean texts is technically very similar across the three languages. In this section, we use Chinese as an example to explain the processing method in detail (Zhao et al., 1990).

Coding

A multi-byte code is allocated to every Chinese ideographic character. This code is called the Internal Code. Several other codes are defined for different purposes, e.g., the Interchange Code for data communication and the QuWei Code for quickly locating a character (Zhao et al., 1990).

Font

For each Chinese ideographic character, a 16x16 bitmapped font (mainly for display) is created and saved in a binary file. Higher-resolution fonts (e.g., 24x24 and 48x48 bitmapped fonts) are also created and saved in binary files (mostly for printing). Current systems also use TrueType fonts extensively (Lin et al., 1994).

Input

Inputting a huge number of characters with a manageable keyboard is only possible through a conversion process, also known as Front End Processing (FEP). For example, Microsoft IME (Input Method Editor, a frequently used FEP) includes a number of methods for inputting CJK characters efficiently. A table of Input Keys and the Internal Codes of the target characters is used to map the Input Keys to the target characters (ideographs). Almost all input methods are one-to-many mappings, so a prompt-line display and selection mechanism is used to choose the appropriate character. The input rate is highly optimized by techniques that minimize the number of keystrokes required to locate a character, a word or a phrase.

Output

Since output is a simple one-to-one mapping, the output of CJK text is similar to that of other languages. A font management engine locates the unique font entry for a given internal code and sends it to the appropriate output device.
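The Input and Output steps above can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the table contents and the function names (`candidates`, `select`, `render`) are hypothetical, Python's standard `gb2312` codec stands in for the Internal Code, and decoding the code back to a character stands in for the font lookup.

```python
# Hypothetical input table: one Input Key maps to several candidate
# characters (the one-to-many mapping described in the text).
INPUT_TABLE = {
    "ma": ["妈", "麻", "马", "骂", "吗"],
    "shu": ["书", "树", "数"],
}


def internal_code(ch):
    """Return a double-byte code for ch (GB2312 used as a stand-in)."""
    return ch.encode("gb2312")


def candidates(key):
    """Return the candidates that would be shown on the prompt line."""
    return INPUT_TABLE.get(key, [])


def select(key, choice):
    """Resolve an Input Key plus a 1-based prompt-line choice to a code."""
    options = candidates(key)
    if not 1 <= choice <= len(options):
        raise ValueError("no such candidate on the prompt line")
    return internal_code(options[choice - 1])


def render(code):
    """One-to-one output mapping; a real system would fetch a bitmap here."""
    return code.decode("gb2312")


if __name__ == "__main__":
    print(candidates("ma"))      # candidates offered for the key "ma"
    code = select("ma", 3)       # the user picks the third candidate
    print(code.hex(), render(code))
```

The asymmetry the text describes is visible here: `select` needs an extra choice argument because the input mapping is one-to-many, while `render` needs only the code because the output mapping is one-to-one.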
2 Characteristics of Indian Languages

Indian languages are written in scripts related to that of Sanskrit (an ancient language) and are used by a population of about one billion people. Hindi (an official language of India), Bengali (the national language of Bangladesh and the language of the Indian state of West Bengal), Nepali, Tamil, Punjabi (written in Gurmukhi), Gujarati, Oriya, Telugu, Kannada, Malayalam, etc. are examples of languages written in such scripts. Hindi and Nepali are written in the Devanagari script; the other languages are written in their own scripts. These languages share a number of common characteristics.

Unlike CJK languages, Indian languages have a small alphabet of roughly 50 letters, including the vowels and the consonants. However, these languages do not form words as a straightforward one-dimensional array of letters, as is the case in English and some other Western languages.

Table 1: Devanagari and Bengali Codepages (Source: Unicode)

Table 1 shows the Devanagari (left-hand side) and the Bengali (right-hand side) coding tables recommended by the UNICODE Consortium. Single-byte encodings of the symbols included in the table are currently used in existing systems. The similarities between the two scripts can be seen in the table. Both have a small set of characters to encode. The two-dimensional characteristics and complex rendering are also noticeable in Table 1. Moreover, characters in the same position on the left-hand side (Devanagari) and on the right-hand side (Bengali) mostly share similar pronunciations. These characteristics are also common to other Indian languages.

Although most Indian languages have a tiny alphabet, the glyphs of the letters may take several different forms and shapes when constituting a word, depending on where they occur. Vowels usually change their shapes when appearing with a consonant or a ligature, and this change is sometimes irregular. Consonants and vowels, and two or more consonants with or without vowels, may combine to form ligatures, and their combined forms may look totally different. Moreover, the sequence of letters does not always appear in a straightforward order. A vowel pronounced after a consonant may appear above or below that consonant, and it may even appear before the consonant. In some cases, one part of a single vowel appears before the consonant and the other part appears after it. The formation of ligatures has many other irregularities. Rendering is therefore a complicated issue in Indian language processing. The complexity of rendering is easily seen in Figure 1, where the rendering of the Devanagari script is shown.

Figure 1: Example Text Rendering of Indian Text (Source: Unicode)

Indian languages also have a distinctive sorting mechanism. Unlike in English, consonants and vowels have different priorities in sorting. Words are sorted by taking the consonant's order as the first consideration and the associated vowel's order as the second consideration. In a [(Consonant) (Consonant) (Consonant)* (Vowel)*] type ligature, the second and higher-order consonants also have a different priority level from that of their individual occurrences or their occurrence as the first consonant in a ligature. A sorting example is illustrated in Figure 2.

Figure 2: Example of Complex Sorting

Considering the special characteristics mentioned above, a mathematical model for processing Indian languages, which requires a multi-byte encoding scheme, is proposed in the next section. In this approach, all the possible glyphs (including all the ligatures, the vowels and the consonants) are encoded. An input method is also proposed to make text input efficient.

3 Mathematical Model of Multi-byte Coding of Indian Languages

In this section, we introduce the mathematical model of Indian languages, taking Bengali as an example. Other Indian languages with related scripts can be modelled in the same way. In this model, we treat the letters (vowels and consonants) of the Bengali alphabet like the radicals or Jamo of CJK languages. Rather than encoding only the letters, we propose encoding all the letters as well as all the linguistically meaningful ligatures they may form.

Linguistic analysis is necessary to identify only the potential ligatures, since many possible ligatures are never used in reality. Finally, we treat these ligatures (along with the independent vowels and consonants) in the same way as CJK characters (including radicals or Jamo) are treated in CJK systems. The aim is to process Indian languages in the same fashion as characters are processed in CJK systems.

3.1 Basic Definitions

Definition 1. A consonant in Bengali is represented as $c_i$, $i = 1$ to $39$. (There are 39 commonly used consonants in Bengali.)
Consonant Set, $C = \{c_i\}$

Definition 2. An independent vowel is represented as $v_j$, $j = 1$ to $11$. (There are 11 commonly used vowels in Bengali. Dependent vowels are symbolic variations of the independent vowels and usually appear with consonants or ligatures.)
Vowel Set, $V = \{v_j\}$

Definition 3. A combination of one consonant and one vowel, of one consonant and one diacritical mark, or of two or more consonants (with or without a vowel or diacritical mark) is called a ligature and is represented as $l_k$, $k = 1$ to $2{,}500$. (We analyzed the Bengali language and found that there are about 2,500 commonly used ligatures.)
Ligature Set, $L = \{l_k\}$

Definition 4. Including Bengali numerals, monetary and other symbols, etc., there are about 20 commonly used symbols in Bengali, and they are represented as $s_l$, $l = 1$ to $20$.
Symbol Set, $S = \{s_l\}$

Definition 5. A Word Constituent Unit $u_m$ is defined as follows: $u_m \in C \cup V \cup L$, with $m$ generally up to about 3,000 for Indian languages.
Word Constituent Unit Set, $U = \{u_m\}$

Definition 6. Words are represented as $w_n$, where $n$ is virtually unbounded. A word is a sequence of one or more word constituent units: $w_n = (u_m)^{+}$.
Word Set, $W = \{w_n\}$

Definition 7. We denote by $B$ the set of Bengali characters:
Bengali Character Set, $B = \{\, b_i \mid b_i \in U \text{ or } b_i \in S \,\}$

3.2 Mathematical Model of Multi-byte Code for Bengali Language

Definition 8. Each element of $B$ is assigned a unique multi-byte (16-bit, in the case of double-byte coding) internal code $i_i$. There exists a function $\sigma$ such that $i_i = \sigma(b_i)$ and $b_i = \sigma^{-1}(i_i)$.
Internal Code Set, $I = \{\, i_i \mid |i_i| = 16 \text{ bits for double-byte coding} \,\}$

If $b_i$ appears before $b_j$ in the dictionary, then the corresponding codes satisfy $i_i < i_j$. Note that sorting Bengali words can now be done simply by comparing the internal codes.

3.3 Mathematical Model of Bengali Character Input

As in any CJK system, several input methods can be designed for Indian languages. We designed an input method called IAYS (Input As You Spell; spelling is unambiguous for Indian languages), in which the user is provided with a keyboard layout that includes only the vowels, consonants and symbols. To input a Bengali word constituent unit, users type the sequence of consonants and vowels as they spell the unit. In some cases, a selection option appears in the prompt line for disambiguation.

At first sight, selection keystrokes appear to be an extra overhead. However, the performance of the input method can be further optimized using associative rules and word-based or phrase-based input techniques. Word-based and phrase-based input methods give a very high input rate, as demonstrated in CJK systems. This is because words and phrases are less ambiguous than a single character, so the selection key is unnecessary in most cases. Moreover, abbreviated input of words and phrases is also possible, which further increases the input rate.

A simple input method uses a table lookup mechanism, with a table whose rows pair Input Codes (the spelling attributes of each word constituent unit) with their respective Internal Codes. Inputting Indian languages is less ambiguous than inputting CJK languages because of the smaller mapping space. For example, Bengali input involves a mapping from about 50 keys to about 3,000 units, whereas in CJK systems the mapping is usually from about 50 keys to more than 6,000 characters. The following two definitions explain the input process mathematically.
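Before the formal definitions, the ordering property of Section 3.2 and the table lookup described above can be sketched together in Python. Everything in the sketch is an assumption made for illustration: the romanized unit labels, their dictionary order, the starting code `FIRST_CODE`, and the names `SIGMA`, `RHO`, `sort_words` and `lookup` are hypothetical and do not reproduce the actual Bengali code table.

```python
from collections import defaultdict

# (unit label, spelling attributes), listed in an ASSUMED dictionary order.
# The romanized labels are placeholders for Bengali units.
UNIT_TABLE = [
    ("a", ("a",)),
    ("i", ("i",)),
    ("k", ("k",)),
    ("ka", ("k", "a")),
    ("ki", ("k", "i")),
    ("kk", ("k", "k")),
    ("kka", ("k", "k", "a")),
    ("g", ("g",)),
    ("ga", ("g", "a")),
]

FIRST_CODE = 0xA1A1  # hypothetical start of the code area

# sigma: units receive increasing codes in dictionary order, so an earlier
# unit always has a smaller code than a later one.
SIGMA = {unit: FIRST_CODE + n for n, (unit, _) in enumerate(UNIT_TABLE)}
SIGMA_INV = {code: unit for unit, code in SIGMA.items()}

# rho: spelling attributes -> list of candidate internal codes.
RHO = defaultdict(list)
for unit, spelling in UNIT_TABLE:
    RHO[spelling].append(SIGMA[unit])


def encode_word(units):
    """Encode a word (sequence of units) as a tuple of internal codes."""
    return tuple(SIGMA[u] for u in units)


def sort_words(words):
    """Dictionary sort done purely by comparing internal codes."""
    return sorted(words, key=encode_word)


def lookup(spelling):
    """Table lookup: typed consonants and vowels -> candidate codes."""
    return list(RHO.get(tuple(spelling), []))


if __name__ == "__main__":
    words = [["ga"], ["ka", "g"], ["k"], ["a", "ki"]]
    print(sort_words(words))
    print([SIGMA_INV[c] for c in lookup(["k", "k", "a"])])
```

Because the codes are assigned in dictionary order, no separate collation rules are needed at sort time. A real code table would also have to keep each byte within the ranges permitted by the host double-byte system, which this sketch ignores.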

Definition 9. Spelling Attribute Set. The spelling attributes of a word constituent unit consist of the relevant vowels and consonants.
Spelling Attribute Set, $A = \{\, a_i \mid a_i \in C \text{ or } a_i \in V \,\}$

Definition 10. The Input Method is a one-to-many mapping $\rho$ from the spelling attribute set to the Bengali character set: $b_i = \rho(a_j)$.

3.4 Mathematical Model of Bengali Character Output

The output process is a one-to-one mapping $\theta$ from the Internal Code Set to the Fonts Attribute Set. The output mechanism is less complicated than input because each internal code corresponds to exactly one font attribute.

Definition 11. Fonts Attribute Set, $F = \{\, f_i \mid f_i \text{ is a binary (0, 1) sequence of } 16 \times 16,\ 24 \times 24 \text{ bits, etc.} \,\}$

Definition 12. Output of a Bengali character is a one-to-one mapping $\theta$, where $f_i = \theta(i_j)$, a mapping from an internal code to the relevant font.

4 Experimentation and Validation

The approach was tested and validated by adding the 3,000 glyphs of the Bengali word constituent units to the user-defined space of an existing CJK system and by extending the lookup table of the Front End Processing system accordingly. Bengali text processing became possible immediately, and as efficiently as CJK text is processed in the original system. Moreover, our Bengali system instantly inherited all the other resources of the host CJK system. That is, all the available applications are immediately usable with the Bengali language, too.

Conclusion

To our knowledge, this is the first implementation of an Indian language in the CJK text processing framework using multi-byte coding. Although the ligature analysis, font design, etc. are not yet efficient or error-free, the approach explained here focuses on a more computationally inclined way of processing Indian languages.

English, the European languages, and the CJK languages have a long history of development on several computing platforms (Hu et al., 1989). Processing Indian languages in the CJK framework provides instant inheritance of the research results accumulated for CJK languages over the past years. UNICODE advocates using multi-byte codes for every language. Therefore, acquiring extra code space for Indian languages, so that their character sets can be encoded in the same manner as the CJK languages, is technically feasible too.

The mathematical model of the CJK system (Qian et al., 1992) is very similar to that of our Bengali system. Thus, it is easy to port our Bengali system to other platforms where CJK languages have already been successfully implemented. A multilingual environment is assured in this way.

Acknowledgement

I want to thank Professor Yuji Matsumoto, my current supervisor at the Nara Institute of Science and Technology, Japan, for kindly reviewing and commenting on this work. Thanks are also due to Professor Mao Yu-Heng and Professor Dai Mei-Er of Tsinghua University, China, for their encouragement and advice. Among the Bangladeshi fellows who helped in designing the font, testing the prototype and commenting, I must specially acknowledge the contribution of Mohammed Kawser and Ashraful Huq from the initial stage of this research.

References

Hu Xian-Xiang et al. (1989) Implementation of a Multilingual Computational Environment Based on X Window System. In Proceedings of Chinese Computing Conference '89, pp. 64

Ken Lunde (1999) CJKV Information Processing: Chinese, Japanese, Korean and Vietnamese Computing. O'Reilly & Associates, Inc.

Lin Yaw-Jen et al. (1994) Conversion of METAFONT file to TRUETYPE. In Proceedings of International Conference of Chinese Computing, ICCC-94, pp

Qian Pei-De et al. (1992) CCDOS Technical Handbook Volume 1. Tsinghua University Press, Beijing, China (in Chinese).

UNICODE Consortium (1996) The Unicode Standard 2.0. Addison Wesley. URL:

Zhao Po-Zhang et al. (1990) Chinese Information Processing Technique. Aeronautics and Aerospace Press, Beijing, China (in Chinese).