A POS-based Word Prediction System for the Persian Language

Masood Ghayoomi (1) and Ehsan Daroodi (2)
(1) Nancy 2 University, Nancy, France, masood29@gmail.com
(2) Iran National Science Foundation, Tehran, Iran, darrudi@insf.org

Abstract. Word prediction is the problem of guessing the words that are likely to follow a given text segment and displaying a list of the most probable words that could appear in that position. In this research, we designed and implemented three word predictors for Persian. Our baseline is a statistical system which uses language models. The first system uses word statistics; the second uses the main syntactic categories of a Persian POS-tagged corpus; and the last uses the main syntactic categories along with their morphological, syntactic and semantic subcategories. Using Keystroke Saving (KSS) as the most important metric for evaluating system performance, the primary word-based statistical system achieved 37% KSS, and the second system, which used only the main syntactic categories together with word statistics, achieved 38.95% KSS. Our last system, which used all of the information available about the words, achieved the best result with 42.45% KSS.

Keywords: word prediction, statistical language modeling, POS tagging

1 Introduction

A word prediction system facilitates the typing of text for users with physical or cognitive disabilities. As the user enters each letter of a word, the system displays a list of the most likely completions of the partially typed word. As the user continues typing more letters, the system updates the suggestion list accordingly. If the required word is in the list, the user can select it with a single keystroke. Then, the system tries to predict the next word. It displays a list of suggestions to the user, who can select the next intended word if it appears in the list. Otherwise, the user can enter the first letter of the next word to restrict the suggestions. The process continues until the text is complete.
For someone with physical disabilities, each keystroke is an effort; as a result, the prediction system saves the user's energy by reducing his or her physical effort. Additionally, the system assists the user in the composition of well-formed text

qualitatively and quantitatively (Fazly, 2002). Moreover, the system helps to increase the user's concentration (Klund and Novak, 2001).

Traditionally, word predictors have been built on statistical language modeling (SLM; Gustavii and Pettersson, 2003). SLM can be based merely on the probability of a sequence of n words (n-gram), or on a combination of the word sequence with the Part-of-Speech (POS) tags of the words. Using such knowledge of the language makes predictions more appropriate. A number of word prediction systems that use linguistic knowledge are available today for languages such as English and Swedish. This paper discusses the design and implementation of an SLM-based word prediction system which uses POS tags for Persian text.

2 Related Work

Early prediction systems, developed in the 1980s, were used as writing assistance systems for people with learning difficulties. Those early systems, such as SoothSayer and PAL (Booth et al., 1990) for English, mainly suggested the high-frequency words that matched the partially typed word and ignored the entire previous context (Swiffin et al., 1985). PAL has been shown to save over 50% of keystrokes. Systems like Profet (Carlberger et al., 1997a; Carlberger et al., 1997b) for Swedish and WordQ (Nantais et al., 2001; Shein et al., 2001) for English are among the examples that use word unigram and bigram sequences. Profet saved 26.1% of keystrokes (Carlberger, 1997).

Ghayoomi (2004) reports the first attempt to develop a word prediction system for Persian. His system simply used the statistical knowledge of uni-, bi- and trigram word models in its algorithms. It is further reported that this system saves 57.57% of keystrokes (Ghayoomi and Assi, 2005). The best result that their system achieved experimentally is 65.46% KSS, after adaptation of the system to the user's writing style (Ghayoomi, 2006).
Using solely statistical word knowledge for prediction often results in the suggestion of syntactically inappropriate words. In contrast, by using the POS tags of a language in prediction algorithms, we can filter the inappropriate words out of the predictions. Systems such as Syntax PAL (Morris et al., 1992) for English and Prophet (Carlberger, 1997) for Swedish are among the examples which have used syntactic knowledge of the language in predictions. Syntax PAL has reduced the problems of using PAL and has made it possible for users to write longer and more complicated sentences (Wood, 1996). Prophet saved 33% of keystrokes (Carlberger, 1997) compared to the earlier version, Profet.

This paper discusses the design and implementation of a word predictor for Persian using the bi-, tri-, and quadrogram word statistics and the bi-, tri-, and quadrogram POS tag statistics of the language. The paper also compares a system that uses word statistics alone with the designed systems that use word statistics as well as POS tags.

3 Language Models

3.1 N-gram Word Modeling

The task of predicting the next word can be stated as attempting to estimate the probability function P:

$P(W_n \mid W_1, \ldots, W_{n-1})$

In such a stochastic problem, we use the previous word(s), the history, to predict the next word. To make reasonable predictions for words that appear together, we use the Markov assumption that only the last few words affect the next word (Fazly, 2002). If we construct a model in which the history of the last n-1 words restricts the word that can appear in the next position, we have an (n-1)th-order Markov model, or an n-gram word model (Manning and Schütze, 1999; Jurafsky and Martin, 2000).

3.2 Knowledge Modeling

Systems that use only statistical modeling for prediction often present words that are syntactically, semantically or pragmatically inappropriate (Rosenfeld, 1994; McCoy and Demasco, 1995). Syntactic prediction is a method that tries to present words that are syntactically appropriate in a particular position within the sentence. This means that knowledge of the syntactic structure of the language is used. In syntactic prediction, the POS tags of all the words in a corpus are identified and the system uses this knowledge for making predictions (Fazly, 2002; Wood, 1996). Statistical syntax and rule-based grammars are the two general syntactic prediction methods.

Statistical Syntax

This approach uses the sequence of syntactic categories and POS tags for predictions. The appearance of a word in this method is based upon the correct usage of syntactic categories. In other words, the Markov assumption about n-gram word tags is used. Fazly (2002) has discussed three methods that can be used to obtain statistical knowledge about the syntax: (a) POS tags only, (b) previous word and two previous POS tags, and (c) linear combination. In the system presented here, we have used the three previous words as well as their syntactic knowledge in order to predict the following word.
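A minimal sketch of the n-gram word model from Section 3.1 is given below in Python. It is an illustration only, not the implementation used in this work: the function names (`train_ngram`, `prob`, `predict`) and the toy English token list are assumptions, and the probabilities are plain maximum-likelihood estimates without smoothing.

```python
from collections import Counter, defaultdict


def train_ngram(tokens, n):
    """Count how often each word follows each (n-1)-word history."""
    counts = defaultdict(Counter)
    for i in range(n - 1, len(tokens)):
        history = tuple(tokens[i - n + 1:i])
        counts[history][tokens[i]] += 1
    return counts


def prob(counts, history, word):
    """Maximum-likelihood estimate of P(word | history); 0.0 for unseen histories."""
    followers = counts.get(tuple(history))
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())


def predict(counts, history, top_n=3):
    """Return up to top_n words seen most often after the given history."""
    followers = counts.get(tuple(history), Counter())
    return [word for word, _ in followers.most_common(top_n)]


# Toy data (hypothetical); a real model would be trained on the corpus.
tokens = "the cat sat on the mat and the cat ran".split()
bigrams = train_ngram(tokens, 2)

p_cat = prob(bigrams, ["the"], "cat")
suggestions = predict(bigrams, ["the"], top_n=2)
```

With n = 2 the history is a single word, which corresponds to a first-order Markov model; passing n = 3 or n = 4 to `train_ngram` gives the tri- and quadrogram counts in the same way.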
4 Some Properties of Persian

Persian is a member of the Indo-European language family and has many features in common with the other members of the family in terms of morphology, syntax, phonology, and lexicon. Although Persian uses a modified version of the Arabic alphabet, it is worth noting that Arabic belongs to the Semitic family of languages and the two languages differ from one another in many respects. One important point which is related to the topic of the

present research is that there are a number of graphemes which represent the same spoken sound. The alphabet used in Persian is more appropriate for the Arabic sound system. For instance, ذ, ز, ض and ظ are four letters of the alphabet in both Persian and Arabic. However, all of these letters are pronounced the same way in Persian, namely /z/, while each is pronounced differently in Arabic. The Persian writing system runs from right to left, the same as Arabic, unlike the European languages, which are written from left to right. Persian letters have joined and non-joined forms; i.e., they take different forms depending on their position within a word. The vocabulary of Persian has been greatly influenced by Arabic and, to some extent, by French; a great number of words are borrowed from these two languages. Space is a word boundary for Persian words. There is also a pseudo-space, which acts as a morpheme boundary within a word. Persian is a null-subject language with SOV word order in unmarked structures. Word order is relatively free in Persian. The subjunctive mood is widely used. Verbs are inflected; they indicate tense and aspect, and agree with the subject in person and number. The language does not make use of gender.

5 Word Prediction Algorithms

Suppose the user is typing a sentence and the following sequence has been entered so far, from right to left according to the Persian writing system:

$W_i \quad PW_i \quad PPW_i \quad PPPW_i$

where $PPPW_i$, $PPW_i$ and $PW_i$ are the most recently completed words, and $W_i$ is the current word that is going to be predicted or completed. Let $W$ be the set of all words in the lexicon that could appear in that position. Our statistical prediction algorithm first attempts to estimate the probability of each candidate word's POS tag, $t_{W_i}$, given the previous tags $t_{PW_i}$, $t_{PPW_i}$, and $t_{PPPW_i}$.
Then, it tries to estimate the probability of the candidate word in the current position, $W_i$, given the previous words $PW_i$, $PPW_i$, and $PPPW_i$; i.e.,

$P(W_i, t_{W_i} \mid PW_i, t_{PW_i}, PPW_i, t_{PPW_i}, PPPW_i, t_{PPPW_i})$

Then the algorithm selects the N most appropriate words from $W$ that are likely to be the user's intended words, where N is usually 1, 5, 9 or 10, based on the experiment done by Soede and Foulds (1986). The general approach is to estimate the probability of each candidate word, $W_i \in W$, being the user's required word in that context based on the POS tags of the preceding words.

6 Methodology

6.1 Corpus

The corpus that we have used in our research (see footnote 1) consists of about ten million tokens and about 143 thousand types. It seems to be a balanced corpus in the sense

that it is a good representative of the language in terms of source, genre, style, register, and theme. [...] percent of the available texts are written, and 20 percent are dialog transcriptions. The sources of the data are the Internet, publications, magazines, journals, newspapers, and various circular letters. For our purposes, we have divided the corpus into three parts: nine million tokens as the training corpus, one million as the development corpus, and half a million as the test corpus.

6.2 Annotation

To annotate the corpus in our research, some inflectional morphemes were automatically added to the stems. Instead of a space, a pseudo-space is used between the components of a word, so that the separated morphemes are joined to each other to form a complete word. The spellings of certain words were replaced according to a list of accepted spellings. The corpus was tagged both automatically and manually. First, a POS tagger was trained on manually tagged text. The most important reason for manual tagging was to be able to distinguish homographs in terms of both syntactic distribution and semantic features. Then, based on the context, the corpus was tagged automatically. The accuracy of the tagger was experimentally over 90%. Finally, the corpus was checked again manually to remove errors. Homographs and scientific texts were problematic for the tagger. Other problems were with the genitive (Ezafe; see footnote 2), words that do not exist in the lexicon, and some multicategorical function words such as اين /in/ (this) and آن /ān/ (that). The examples below show problems in tagging ساعت /sā?at/ (watch). There are 19 POS tags as main syntactic categories in the corpus, along with morphological, syntactic, and semantic subcategories. Example (1) below illustrates how tags are ordered in terms of their hierarchy:

(1) اين ساعت دو هزار تومان ارزش دارد.
    in sā?at do hezār tumān arzeš dārad.
    this watch two thousand Toman worth has
    This watch is worth two thousand Tomans.
The tag order of sā?at in this example is N, SING, COM. Its main syntactic category is "noun", and its semantic subcategories are "single" and "common". Compare these categories and subcategories with example (2) below:

(2) ساعت دو آنجا می‌آيم.
    sā?at-e do ānjā mi āyam
    hour-genitive two there progressive-come-I
    I am coming there by two o'clock.

Footnotes:
1 This corpus is provided by the Research Center for Intelligent Signal Processing.
2 Ezafe in Persian is a vowel /e/. It is a genitive case marker; it has only a phonetic representation and is not written. It functions something like "of" in English.

The tag order of sā?at-e in this example is N, SING, TIME, GEN. Its main syntactic category is "noun"; its semantic subcategories are "single" and "time"; and genitive (Ezafe) is its syntactic subcategory (Bijankhan, p.c.).

6.3 Tokenization

For the tokenization process, we used software written in Visual Basic to compute the needed statistics. The software ran on the training corpus to compute word bi-, tri-, and quadrograms. The software was then used to extract POS bi-, tri- and quadrograms of the main categories only. Finally, the software was used to extract POS bi-, tri- and quadrograms of the main categories with their morphological, syntactic, and semantic subcategories. Space was considered a word boundary, and alphanumeric characters were treated as words. In addition, all words along with their POS tags (unigrams) were extracted from the corpus as the main lexicon of the system. These sources of information for the system were organized in hash tables.

6.4 Solving Sparseness

Since even a big corpus includes only a fraction of the possible n-grams, increasing n makes the observed events sparser. We have used the Simple Linear Interpolation (SLI) method (Manning and Schütze, 1999) to smooth the probability distribution. The development corpus was used to compute the lambda values of both the word and POS n-gram models to solve the sparse data problem. We have used the Boosting Algorithm to compute the lambda values (Freund and Schapire, 1996).

7 Implementation

7.1 The Algorithm

The architecture of our algorithm is shown in Figure 1. The system we developed has four major components: (a) the statistical information extracted from the training corpus for the prediction algorithm, (b) the component computing lambda values for solving the sparseness of both word and POS n-gram models, (c) the predictive program that tries to suggest words to the simulated user, and (d) a simulated user that types the test text. Component (c) has two parts: word completion and word prediction.
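As an illustration of the Simple Linear Interpolation step of Section 6.4, the following Python sketch mixes bi-, tri- and quadrogram estimates with fixed weights. It is a sketch under stated assumptions, not the system's implementation: the function names and the toy token sequence are hypothetical, the lambda values are hand-picked rather than estimated on a development corpus, and only the word models are shown (the POS models would be interpolated in the same way).

```python
from collections import Counter, defaultdict


def ngram_counts(tokens, n):
    """Hash table mapping each (n-1)-word history to a Counter of next words."""
    counts = defaultdict(Counter)
    for i in range(n - 1, len(tokens)):
        counts[tuple(tokens[i - n + 1:i])][tokens[i]] += 1
    return counts


def mle(counts, history, word):
    """Unsmoothed estimate of P(word | history); 0.0 if the history is unseen."""
    followers = counts.get(tuple(history))
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())


def interpolated_prob(models, lambdas, history, word):
    """Weighted sum of the n-gram estimates; the weights are assumed to sum to 1."""
    total = 0.0
    for n, counts in models.items():
        context = history[-(n - 1):]  # last n-1 words of the history
        if len(context) == n - 1:
            total += lambdas[n] * mle(counts, context, word)
    return total


# Toy data and hand-picked weights (assumptions for illustration only).
tokens = "a b c d a b c e a b d".split()
models = {n: ngram_counts(tokens, n) for n in (2, 3, 4)}
lambdas = {2: 0.2, 3: 0.3, 4: 0.5}

p_d = interpolated_prob(models, lambdas, ["e", "a", "b"], "d")
p_c = interpolated_prob(models, lambdas, ["e", "a", "b"], "c")
```

In this toy example the quadrogram "e a b c" never occurs, so its unsmoothed probability is zero, yet `p_c` is still positive because the bigram and trigram estimates contribute. This is the effect the interpolation is meant to achieve for sparse higher-order n-grams.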
The prediction algorithm first completes the partially spelled word and then predicts the next probable words and presents them in the suggestion list. The simulated typist is a perfect user who always chooses the desired word when it is available in the prediction list and never misses it.

7.2 Performance Measures

Following Wood (1996) and Fazly (2002), we used three standard performance metrics to evaluate our system. Keystroke Saving (KSS) refers to the percentage of keystrokes that the user saves by using the word prediction system. A higher value for keystroke saving implies

a better performance. Hit Rate (HR) is the percentage of correct words that appear in the suggestion list without entering any letters of the following word. A higher hit rate also implies a better performance. Keystroke until Prediction (KuP) refers to the average number of keystrokes that the user enters for each word before it appears in the prediction list. A lower value for this measure implies a better performance.

Figure 1. The architecture of our algorithm (flowchart; box labels: Training Corpus, Tagged Corpus, Annotating, Untagged Corpus, Extract N-gram Statistics, Developing Corpus, Computing Probability, Computing Lambda Value, Prediction, Settings, Simulate, Test Corpus, Test Result)

8 Results

Since the corpus we used to develop our systems was different from the Persian corpus used by Ghayoomi (2004) and Ghayoomi and Assi (2005), our results are not comparable with the output of their systems. One difference between their corpus and ours is the number of tokens in the training, development, and test corpora. Their training corpus contained about 6 million tokens; the development corpus about 850 thousand tokens; and the test corpus about 13 thousand tokens.

The other difference is the genre of their corpus, in which only newspaper texts were gathered for the training, development, and test corpora. In contrast, our corpus covers a wider range of genres. The n-gram word models used in their algorithms are merely uni-, bi-, and trigram word statistics. They did not benefit from the POS tags of the words, while we developed and tested our system in three different scenarios. The first test used only bi-, tri- and quadrogram word models; we called it System A. A second system was tested using the described n-gram word models along with the words' POS bi-, tri- and quadrograms of the main syntactic categories only; we called it System B. Finally, the system was tested using the described n-gram word models along with the words' POS n-grams of both the main syntactic categories and their morphological, syntactic, and semantic subcategories; we called this System C.

The test corpus was given to the simulated typist. It contained half a million tokens and 1,950,000 characters; white space was not treated as a character. The reason for not counting the space is that after any word is selected, a space is entered automatically, which results in a saved keystroke. On the other hand, to select a word from the list, one of the function keys, F1 through F9, must be pressed in order to place the intended word into the text being typed. The result is that the keystroke which was saved by the automatic space is lost again.

The virtual typist is a Visual C++ program that reads each text letter by letter. After reading each letter, it determines what the correct prediction for the current position is. The prediction program is then called and a list of suggestions is returned to the user. The user searches the prediction list for the correct one.
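The simulated typist and the three measures of Section 7.2 can be sketched in Python as follows. This is a minimal illustration, not the Visual C++ program described here. The predictor is a toy frequency-based completer (an assumption standing in for Systems A to C), the names `make_predictor` and `simulate` are hypothetical, and the accounting rests on two labeled assumptions: the selection keystroke and the automatic space cancel each other, as described above, and a word that never appears in the list counts its full length towards KuP.

```python
from collections import Counter


def make_predictor(training_tokens, list_size=9):
    """Toy predictor (assumption): most frequent training words matching the prefix."""
    ranked = [word for word, _ in Counter(training_tokens).most_common()]

    def predict(history, prefix):
        # The history is ignored here; a real system would condition on it.
        return [w for w in ranked if w.startswith(prefix)][:list_size]

    return predict


def simulate(test_tokens, predict):
    """Perfect typist: types letters until the word is listed, then selects it."""
    total_chars = typed = hits_without_typing = 0
    history = []
    for word in test_tokens:
        total_chars += len(word)  # spaces are not counted as characters
        letters = 0
        while letters < len(word) and word not in predict(history, word[:letters]):
            letters += 1
        typed += letters  # equals len(word) if the word was never suggested
        if letters == 0:
            hits_without_typing += 1
        history.append(word)
    n_words = len(test_tokens)
    return {
        "KSS": 100 * (total_chars - typed) / total_chars,
        "HR": 100 * hits_without_typing / n_words,
        "KuP": typed / n_words,
    }


# Toy transliterated data (hypothetical), with a short suggestion list.
training = "salam be to salam be man ketab".split()
result = simulate(["salam", "ketab", "man"], make_predictor(training, list_size=2))
```

Replacing `make_predictor` with a function that ranks candidates by interpolated word and POS n-gram probabilities would leave `simulate` unchanged, since the typist only needs a list of suggestions for a given history and prefix.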
If the correct prediction is found in the list, the user increments the count of correct predictions made by the predictor. The correctly predicted word is then completed and the user continues to read the rest of the text. The results obtained from using the various n-gram models are presented in Table 1 for 9 suggestions in the prediction list:

Table 1: Summary of the results obtained by using word and POS n-gram models on the test corpus

            KSS%     HR%     KuP
System A    37
System B    38.95
System C    42.45

As shown in Table 1, higher KSS and HR and lower KuP were obtained when the system used both the word statistics and the syntactic knowledge of the language (Systems B and C), compared to the model which used only word statistics (System A). However, the difference between Systems A and B, where only words and the main syntactic categories are used for prediction, is not very remarkable. Probably the reason is that the main syntactic categories are too noisy for the system, and the 2% better performance is achieved merely through minor filtering of the sequence of words by

considering the main syntactic categories that belong to the words. System C had the best performance among the developed systems, since it used all of the word and syntactic knowledge available to the system; having more syntactic information about the words therefore substantially improves predictions. The 42.45% KSS means that for each 100 characters that the user is required to type to enter a text segment, at least 42 characters are entered by the system, and the remaining 58 characters are entered by the user. 56% of words, more than half of the user's required words, appeared in the prediction list before any of their letters were entered. At least one keystroke is needed by the user to type a word on the system, while the average length of words for the corpus we used was [...]

9 Conclusion

By using the POS tags of Persian in the word prediction algorithm, we achieved a higher keystroke saving rate. Since every keystroke is an effort for disabled users, the result obtained is very important for users with disabilities. Moreover, there is a significant difference in performance between the system that uses all of the available syntactic knowledge, which achieved a marked increase in KSS, and the one that uses word statistics alone. Using the POS tags of the language allows the system to filter out of the prediction list words that are syntactically inappropriate in a particular position within the sentence. Thus, it would increase the user's confidence in selecting words from the prediction list, which can result in better written sentences while imposing a lower cognitive load on him or her. This feature is useful for users with cognitive disabilities, especially those suffering from aphasia.

10 Further Work

To achieve a higher percentage of KSS, we are planning to add the feature of adaptability of the system to the user's writing style.
By adapting itself to the user, the system would gradually improve its performance. Also, it is necessary to add a POS tagger to the system in order to identify the POS tags of new words and tag them automatically.

Bibliography

Booth, L. and W. Beattie and A. Newell (1990) I know what you mean. Special Children.

Carlberger, J. (1997) Word Prediction: Design and Implementation of a Probabilistic Word Prediction Program. Master dissertation. Royal Institute of Technology, Stockholm.

Carlberger, A. and T. Magnuson and J. Carlberger and H. Wachtmeister and S. Hunnicutt (1997a) Probability-based word prediction for writing support in dyslexia. In Barner, R., Heldner, M., Sullivan, K., and Wretling, P., editors, Proceedings of Fonetik '97 Conference, Volume 4.

Carlberger, A. and J. Carlberger and T. Magnuson and M.S. Hunnicutt and S.E. Palazuelos-Cagigas and S.A. Navarro (1997b) Profet, a new generation of word prediction: An evaluation study. In Copestake, A., Langer, S. and Palazuelos-Cagigas, S., editors, Natural Language Processing for Communication Aids, Proceedings of a workshop sponsored by ACL, Madrid, Spain.

Fazly, A. (2002) The Use of Syntax in Word Completion Utilities. Master dissertation. Canada: University of Toronto.

Freund, Y. and R.E. Schapire (1996) Experiments with a new boosting algorithm. In Proceedings of ICML.

Ghayoomi, M. (2004) Word Prediction in Computational Processing of the Persian Language. Master dissertation. Iran: Islamic Azad University, Tehran Central Branch.

Ghayoomi, M. (2006) Using word prediction systems for users with disabilities: A case study. In Proceedings of the 2nd Workshop on the Persian Language and Computer, Tehran University, Iran, June 27-28.

Ghayoomi, M. and S.M. Assi (2005) Word prediction in a running text: A statistical language modeling for the Persian language. In Proceedings of the Australasian Language Technology Workshop, University of Sydney, Australia, Dec 10-11.

Gustavii, E. and E. Pettersson (2003) A Swedish Grammar for Word Prediction. Stockholm: Uppsala University.

Jurafsky, D. and J.H. Martin (2000) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. New Jersey: Prentice-Hall.

Klund, J. and M. Novak (2001) If word prediction can help, which program do you choose?

Manning, C.D. and H. Schütze (1999) Foundations of Statistical Natural Language Processing. The MIT Press.

McCoy, K. and P. Demasco (1995) Some application of natural language processing to the field of augmentative and alternative communication. In Proceedings of the IJCAI 95 Workshop on Developing AI Applications for People with Disabilities.

Morris, C. and A. Newell and L. Booth and J. Arnott (1992) Syntax PAL: A system to improve the written syntax of language-impaired users. Assistive Technology, 4(2):51-59, Sept.

Nantais, T. and F. Shein and M. Johansson (2001) Efficacy of the word prediction algorithm in WordQ(TM). In Proceedings of the 24th Annual Conference on Technology and Disability, RESNA.

Rosenfeld, R. (1994) Adaptive Statistical Language Modeling: A Maximum Entropy Approach. Ph.D. dissertation. Pittsburgh: Carnegie Mellon University.

Shein, F. and T. Nantais and R. Nishiyama and C. Tam and P. Marshall (2001) Word cueing for persons with writing difficulties: WordQ. The 16th Annual International Conference on Technology and Persons with Disabilities, California State University at Northridge, Los Angeles, CA, March.

Soede, M. and R.A. Foulds (1986) Dilemma of prediction in communication aids and mental load. In Proceedings of the 9th Annual Conference on Rehabilitation Technology.

Swiffin, A.L. and J.A. Pickering and J.L. Arnott and A.F. Newell (1985) PAL: An effort efficient portable communication aid and keyboard emulator. In Proceedings of the 8th Annual Conference on Rehabilitation Technology.

Wood, M.E.J. (1996) Syntactic Pre-Processing in Single-Word Prediction for Disabled People. Ph.D. dissertation. University of Bristol, Bristol.