A POS-based Word Prediction System for the Persian Language


Masood Ghayoomi (1), Ehsan Daroodi (2)
(1) Nancy 2 University, Nancy, France, masood29@gmail.com
(2) Iran National Science Foundation, Tehran, Iran, darrudi@insf.org

Abstract. Word prediction is the problem of guessing the words that are likely to follow a given text segment and displaying a list of the most probable words for that position. In this research, we designed and implemented three word predictors for Persian. Our baseline is a statistical system that uses language models. The first system uses word statistics; the second uses the main syntactic categories of a Persian POS-tagged corpus; and the last one uses the main syntactic categories along with their morphological, syntactic, and semantic subcategories. Using Keystroke Saving (KSS) as the most important metric of system performance, the primary word-based statistical system achieved 37% KSS, and the second system, which used only the main syntactic categories together with word statistics, achieved 38.95% KSS. Our last system, which used all of the information available for the words, achieved the best result, with 42.45% KSS.

Keywords: word prediction, statistical language modeling, POS tagging

1 Introduction

A word prediction system facilitates the typing of text for users with physical or cognitive disabilities. As the user enters each letter of a word, the system displays a list of the most likely completions of the partially typed word. As the user types more letters, the system updates the suggestion list accordingly. If the required word is in the list, the user can select it with a single keystroke. The system then tries to predict the next word: it displays a list of suggestions, and the user can select the next intended word if it appears in the list. Otherwise, the user can enter the first letter of the next word to restrict the suggestions. The process continues until the text is complete.

For someone with physical disabilities, each keystroke is an effort; the prediction system therefore saves the user's energy by reducing his or her physical effort. Additionally, the system assists the user in composing well-formed text, both qualitatively and quantitatively (Fazly, 2002). Moreover, the system helps to increase the user's concentration (Klund and Novak, 2001).

Traditionally, word predictors have been built on statistical language modeling (SLM; Gustavii and Pettersson, 2003). SLM can be based merely on the probability of a sequence of n given words (n-gram), or it can combine the word sequence with the Part-of-Speech (POS) tags of the words. Using such knowledge of the language makes predictions more appropriate. A number of word prediction systems that use linguistic knowledge are available today for languages such as English and Swedish. This paper discusses the design and implementation of an SLM-based word prediction system that uses POS tags for Persian text.

2 Related Work

Early prediction systems, developed in the 1980s, were used as writing assistance systems for people with learning difficulties. Those early systems, such as SoothSayer and PAL (Booth et al., 1990) for English, mainly suggested high-frequency words that matched the partially typed word and ignored the entire previous context (Swiffin et al., 1985). PAL has been shown to save over 50% of keystrokes. Systems like Profet (Carlberger et al., 1997a; Carlberger et al., 1997b) for Swedish and WordQ (Nantais et al., 2001; Shein et al., 2001) for English are among the examples that use word unigram and bigram sequences. Profet saves 26.1% of keystrokes (Carlberger, 1997).

Ghayoomi (2004) reports the first attempt to develop a word prediction system for Persian. His system simply used the statistical knowledge of uni-, bi-, and trigram word models in its algorithms. This system is further reported to save 57.57% of keystrokes (Ghayoomi and Assi, 2005). The best result that the system has achieved experimentally is 65.46% KSS, after adaptation of the system to the user's writing style (Ghayoomi, 2006).

Using solely statistical word knowledge for prediction often results in the suggestion of syntactically inappropriate words. In contrast, by using the POS tags of a language in the prediction algorithms, we can filter the inappropriate words out of the predictions. Systems such as Syntax PAL (Morris et al., 1992) for English and Prophet (Carlberger, 1997) for Swedish are among the examples that use syntactic knowledge of the language in their predictions. Syntax PAL has reduced the problems of using PAL and has made it possible for users to write longer and more complicated sentences (Wood, 1996). Prophet saved 33% of keystrokes (Carlberger, 1997) compared to the earlier version, Profet.

This paper discusses the design and implementation of a word predictor for Persian using the bi-, tri-, and quadrogram word statistics and the bi-, tri-, and quadrogram POS tag statistics of the language. The paper also compares a system that uses only word statistics with the designed systems that use word statistics as well as POS tags.

3 Language Models

3.1 N-gram Word Modeling

The task of predicting the next word can be stated as estimating the probability function P:

$P(W_n \mid W_1, \ldots, W_{n-1})$

In such a stochastic problem, we use the previous word(s), the history, to predict the next word. To make reasonable predictions for words that appear together, we use the Markov assumption that only the last few words affect the next word (Fazly, 2002). So if we construct a model in which the history of the last n-1 words restricts the word that can appear in the next position, we have an (n-1)th-order Markov model, or an n-gram word model (Manning and Schütze, 1999; Jurafsky and Martin, 2000).

3.2 Knowledge Modeling

Systems that use only statistical modeling for prediction often present words that are syntactically, semantically, or pragmatically inappropriate (Rosenfeld, 1994; McCoy and Demasco, 1995). Syntactic prediction is a method that tries to present words that are syntactically appropriate in a particular position within the sentence. This means that knowledge of the syntactic structure of the language is used. In syntactic prediction, the POS tags of all the words in a corpus are identified, and the system uses this knowledge for making predictions (Fazly, 2002; Wood, 1996). Statistical syntax and rule-based grammars are two general syntactic prediction methods.

Statistical Syntax

This approach uses the sequence of syntactic categories and POS tags for predictions. The appearance of a word in this method is based upon the correct usage of syntactic categories. In other words, the Markov assumption is applied to n-grams of word tags. Fazly (2002) has discussed three methods that can be used to obtain statistical knowledge about the syntax: (a) POS tags only, (b) previous word and two previous POS tags, and (c) linear combination. In the system presented here, we have used the three previous words as well as their syntactic knowledge in order to predict the following word.
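As an illustration of the n-gram estimate described in this section, the following is a minimal sketch, not the authors' implementation. It assumes a plain maximum-likelihood estimate (count of history plus word, divided by count of history); the function names and the toy English corpus are hypothetical and chosen only for readability.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mle_probability(tokens, history, word):
    """Estimate P(word | history) as count(history + word) / count(history)."""
    n = len(history) + 1
    joint = ngram_counts(tokens, n)[tuple(history) + (word,)]
    context = ngram_counts(tokens, n - 1)[tuple(history)]
    return joint / context if context else 0.0


# Hypothetical toy corpus; the paper's corpus is Persian and far larger.
toy_corpus = "the cat sat on the mat the cat ran".split()
p_cat = mle_probability(toy_corpus, ["the"], "cat")     # bigram estimate
p_unseen = mle_probability(toy_corpus, ["dog"], "ran")  # unseen history gives 0.0
```

The same counting applies unchanged to a sequence of POS tags instead of words, which is the sense in which the Markov assumption is carried over to tag n-grams.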
4 Some Properties of Persian

Persian is a member of the Indo-European language family and has many features in common with the other languages of that family in terms of morphology, syntax, phonology, and lexicon. Although Persian uses a modified version of the Arabic alphabet, it is worth noting that Arabic belongs to the Semitic family of languages and the two languages differ from one another in many respects. One important point related to the topic of the present research is that a number of graphemes represent the same spoken sound. The alphabet used in Persian is more appropriate for the Arabic sound system. For instance, the letters ذ, ز, ض and ظ are four letters of the alphabet in both Persian and Arabic. However, all of these letters are pronounced the same way in Persian, namely /z/, while each is pronounced differently in Arabic.

The Persian writing system runs from right to left, the same as Arabic but unlike the European languages, which are written from left to right. Persian letters have joined and non-joined forms; i.e., they take different forms depending on their position within a word. The vocabulary of Persian has been greatly influenced by Arabic, and to some extent by French; a great number of words are borrowed from these two languages. Space is a word boundary for Persian words. There is also a pseudo-space that behaves as a morpheme boundary within a word.

Persian is a null-subject language with SOV word order in unmarked structures. Word order is relatively free in Persian. The subjunctive mood is widely used. Verbs are inflected; they indicate tense and aspect, and agree with the subject in person and number. The language does not make use of gender.

5 Word Prediction Algorithms

Suppose the user is typing a sentence and the following sequence has been entered so far, from right to left according to the Persian writing system:

$W_i \quad PW_i \quad PPW_i \quad PPPW_i$

where $PPPW_i$, $PPW_i$, and $PW_i$ are the most recently completed words, and $W_i$ is the current word that is going to be predicted or completed. Let $W$ be the set of all words in the lexicon that would likely appear in that position. Our statistical prediction algorithm first attempts to estimate the probability of each candidate word's POS tag, $t_{W_i}$, according to the previous tags $t_{PW_i}$, $t_{PPW_i}$, and $t_{PPPW_i}$. Then, it tries to estimate the probability of the candidate word in the current position, $W_i$, according to the previous words $PW_i$, $PPW_i$, and $PPPW_i$; i.e.,

$P(W_i, t_{W_i} \mid PW_i, t_{PW_i}, PPW_i, t_{PPW_i}, PPPW_i, t_{PPPW_i})$

Then the algorithm selects the N most appropriate words from $W$ that are likely to be the user's intended words, where N is usually 1, 5, 9, or 10, based on the experiment done by Soede and Foulds (1986). The general approach is to estimate the probability of each candidate word, $W_i \in W$, being the user's required word in that context, based on the POS tags of the preceding words.

6 Methodology

6.1 Corpus

The corpus that we have used in our research consists of about ten million tokens and about 143 thousand types. It appears to be a balanced corpus, in the sense that it is a good representative of the language in terms of source, genre, style, register, and theme. [...] percent of the available texts are written, and 20 percent are dialog transcriptions. The sources of the data are the Internet, publications, magazines, journals, newspapers, and various circular letters. For our purposes, we have divided the corpus into three parts: nine million tokens as the training corpus, one million as the development corpus, and half a million as the test corpus.

6.2 Annotation

To annotate the corpus in our research, some inflectional morphemes were automatically attached to their stems. Instead of a space, a pseudo-space is used between the components of a word, so that the separated morphemes are joined to each other and form a complete word. The spelling of certain words was normalized against a list of accepted spellings.

The corpus was tagged both automatically and manually. First, a POS tagger was trained on manually tagged text. The most important reason for tagging manually was to be able to distinguish homographs in terms of both syntactic distribution and semantic features. Then, based on the context, the corpus was tagged automatically. The accuracy of the tagger was experimentally over 90%. Finally, the corpus was checked again manually to remove errors. Homographs and scientific texts were problematic for the tagger. Other problems arose with the genitive (Ezafe [2]), with words that did not exist in the lexicon, and with some multi-categorial function words such as اين /in/ (this) and آن /ān/ (that). The examples below show problems in tagging ساعت /sā?at/ (watch).

There are 19 POS tags as main syntactic categories in the corpus, along with morphological, syntactic, and semantic subcategories. Example (1) below illustrates how tags are ordered in terms of their hierarchy:

(1) اين ساعت دو هزار تومان ارزش دارد.
    in sā?at do hezār tumān arzeš dārad.
    this watch two thousand Toman worth has
    'This watch is worth two thousand Tomans.'
The tag order of sā?at in this example is N, SING, COM. Its main syntactic category is "noun", and its semantic subcategories are "single" and "common". Compare these categories and subcategories with example (2) below:

(2) ساعت دو آنجا می‌آیم.
    sā?at-e do ānjā mi āyam
    hour-genitive two there progressive-come-I
    'I am coming there by two o'clock.'

The tag order of sā?at-e in this example is N, SING, TIME, GEN. Its main syntactic category is "noun"; its semantic subcategories are "single" and "time"; and genitive (Ezafe) is its syntactic subcategory (Bijankhan, p.c.).

[1] This corpus is provided by the Research Center for Intelligent Signal Processing.
[2] Ezafe in Persian is the vowel /e/. It is a genitive case marker; it has only a phonetic representation and is not written. It functions something like "of" in English.

6.3 Tokenization

For the tokenization process, we used software written in Visual Basic to compute the needed statistics. The software ran on the training corpus to compute word bi-, tri-, and quadrograms. It was then used to extract POS bi-, tri-, and quadrograms of the main categories only. Finally, it was used to extract POS bi-, tri-, and quadrograms of the main categories with their morphological, syntactic, and semantic subcategories. Space was considered a word boundary, and alphanumeric characters were treated as words. In the end, all words along with their POS tags (unigrams) were extracted from the corpus as the main lexicon of the system. These sources of information for the system were organized in hash tables.

6.4 Solving Sparseness

Since even a large corpus contains only a fraction of the possible n-grams, increasing n makes the observed events sparser. We have used the Simple Linear Interpolation (SLI) method (Manning and Schütze, 1999) to smooth the probability distribution. The development corpus was used to compute the lambda values of both the word and the POS n-gram models to solve the sparse data problem. We have used the Boosting Algorithm to compute the lambda values (Freund and Schapire, 1996).

7 Implementation

7.1 The Algorithm

The architecture of our algorithm is shown in Figure 1. The system we developed has four major components: (a) the statistical information extracted from the training corpus for the prediction algorithm, (b) the component computing lambda values for solving the sparseness of both word and POS n-gram models, (c) the predictive program that tries to suggest words to the simulated user, and (d) a simulated user that types the test text. Component (c) has two parts: word completion and word prediction.
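The two parts of component (c), combined with the Simple Linear Interpolation of Section 6.4, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the system described in the paper: it interpolates only unigram and bigram word estimates, uses a fixed hypothetical weight `lam` instead of lambda values learned on the development corpus, and ignores POS tags. All function names and the toy corpus are hypothetical.

```python
from collections import Counter


def train(tokens):
    """Collect unigram and bigram counts (kept in hash tables, here Counters)."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def interpolated(word, prev, unigrams, bigrams, lam=0.7):
    """Simple linear interpolation: lam * P(word | prev) + (1 - lam) * P(word)."""
    p_uni = unigrams[word] / sum(unigrams.values())
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni


def suggest(prev, prefix, unigrams, bigrams, n=9):
    """Word completion if prefix is non-empty, word prediction if it is empty."""
    candidates = [w for w in unigrams if w.startswith(prefix)]
    # Highest interpolated score first; ties are broken alphabetically.
    candidates.sort(key=lambda w: (-interpolated(w, prev, unigrams, bigrams), w))
    return candidates[:n]


# Hypothetical toy data for illustration only.
unigrams, bigrams = train("the cat sat on the mat the cat ran".split())
prediction = suggest("the", "", unigrams, bigrams, n=2)   # nothing typed yet
completion = suggest("the", "m", unigrams, bigrams)       # user has typed "m"
```

Because the unigram term is never zero for a lexicon word, every candidate receives a non-zero score even when its bigram was not observed, which is the purpose of the smoothing step.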
The prediction algorithm first completes the partially spelled word, and then it predicts the next probable words and presents them in the suggestion list. The simulated typist is a perfect user who always chooses the desired word when it is available in the prediction list and never misses it.

[Figure 1. The architecture of our algorithm. Box labels: Untagged Corpus, Annotating, Tagged Corpus, Training Corpus, Developing Corpus, Extract N-gram Statistics, Computing Probability, Computing Lambda Value, Prediction, Settings, Simulate, Test Corpus, Test Result.]

7.2 Performance Measures

Following Wood (1996) and Fazly (2002), we used three standard performance metrics to evaluate our system.

Keystroke Saving (KSS) is the percentage of keystrokes that the user saves by using the word prediction system. A higher value for keystroke saving implies a better performance.

Hit Rate (HR) is the percentage of correct words that appear in the suggestion list before any letter of the word has been entered. A higher hit rate also implies a better performance.

Keystroke until Prediction (KuP) is the average number of keystrokes that the user enters for each word before it appears in the prediction list. A lower value for this measure implies a better performance.

8 Results

Since the corpus we used to develop our systems was different from the Persian corpus used by Ghayoomi (2004) and Ghayoomi and Assi (2005), our results are not comparable with the output of their systems. One difference between their corpus and ours is the number of tokens in the training, development, and test corpora. Their training corpus contained about 6 million tokens, the development corpus about 850 thousand tokens, and the test corpus about 13 thousand tokens.

The other difference is the genre of their corpus, for which only newspaper texts were gathered for the training, development, and test corpora. In contrast, our corpus covers a wider range of genres. The n-gram word models used in their algorithms are merely uni-, bi-, and trigram word statistics. They did not benefit from the POS tags of the words, whereas we developed and tested our system in three different scenarios. The first test used only bi-, tri-, and quadrogram word models; we called it System A. A second system was tested using the described n-gram word models along with the words' POS bi-, tri-, and quadrograms of the main syntactic categories only; we called it System B. Finally, the system was tested using the described n-gram word models along with the words' POS n-grams of both the main syntactic categories and their morphological, syntactic, and semantic subcategories; we called this System C.

The test corpus was given to the simulated typist. It contained half a million tokens and 1,950,000 characters; white space was not treated as a character. The reason for not counting spaces is that after any word is selected, a space is entered automatically, which saves a keystroke. On the other hand, to select a word from the list, one of the function keys, F1 through F9, must be pressed in order to drop the intended word into the text being typed. As a result, the keystroke saved by the automatic space is lost again.

The virtual typist is a Visual C++ program that reads in each text letter by letter. After reading each letter, it determines what the correct prediction for the current position is. The prediction program is then called, and a list of suggestions is returned to the user. The user searches the prediction list for the correct word.
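The simulation loop described so far, together with the three measures of Section 7.2, can be sketched as follows. This is an illustrative reading rather than the authors' Visual C++ program: it assumes that the selection keystroke and the automatic space cancel out, so only typed letters are counted against the word's characters, and `toy_suggest` with its tiny lexicon is a hypothetical stand-in for the real predictor.

```python
def simulate(words, suggest, list_size=9):
    """Type each word letter by letter until it shows up in the suggestion list.

    `suggest(prev, prefix)` is any predictor returning a ranked list of words.
    Returns (KSS %, HR %, KuP); spaces and selection keys are not counted.
    """
    typed_total = char_total = hits = 0
    prev = None
    for word in words:
        char_total += len(word)
        typed = 0
        while typed < len(word) and word not in suggest(prev, word[:typed])[:list_size]:
            typed += 1                      # one more letter entered by the user
        typed_total += typed
        hits += typed == 0                  # suggested before any letter was typed
        prev = word
    kss = 100.0 * (char_total - typed_total) / char_total
    hit_rate = 100.0 * hits / len(words)
    kup = typed_total / len(words)
    return kss, hit_rate, kup


LEXICON = ["salam", "sal", "ketab"]         # hypothetical toy lexicon


def toy_suggest(prev, prefix):
    """Hypothetical predictor: offers lexicon words only once a letter is typed."""
    return [w for w in LEXICON if prefix and w.startswith(prefix)]


kss, hit_rate, kup = simulate(["salam", "ketab", "xyz"], toy_suggest)
```

In this toy run the two lexicon words are found after one letter each and the unknown word is typed in full, so 5 of 13 characters are typed by the user and the rest are saved.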
If the correct prediction is found in the list, the simulated user increments the count of correct predictions made by the predictor. The correctly predicted word is then completed, and the user continues to read the rest of the text. The results obtained from using the various n-gram models are presented in Table 1, for 9 suggestions in the prediction list:

Table 1: Summary of the results obtained by using word and POS n-gram models from the test corpus

            KSS%     HR%      KuP
System A    [...]    [...]    [...]
System B    [...]    [...]    [...]
System C    [...]    [...]    [...]

As shown in Table 1, higher KSS and HR and the lowest KuP were obtained when the system used both the word statistics and the syntactic knowledge of the language (Systems B and C), compared to the model that used only word statistics (System A). However, the difference between Systems A and B, where only words and the main syntactic categories are used for prediction, is small. The probable reason is that the main syntactic categories are too noisy for the system, and the roughly 2 percentage points of improvement are achieved simply by a minor filtering of the word sequences according to the main syntactic categories of the words. System C had the best performance among the developed systems, since it used all of the word and syntactic knowledge available to the system; having more syntactic information available for the words therefore improves predictions considerably.

The 42.45% KSS means that for every 100 characters the user needs in order to enter a text segment, at least 42 characters are entered by the system and the remaining 58 characters are entered by the user. 56% of words, more than half of the user's required words, appeared in the prediction list before any of their letters had been entered. At least one keystroke is needed by the user to type a word on the system, while the average length of words for the corpus we used was [...].

9 Conclusion

By using the POS tags of Persian in the word prediction algorithm, we achieved a higher keystroke saving rate. Since every keystroke is an effort for disabled users, this result is very important for users with disabilities. Moreover, there is a clear difference in performance between the system that uses all of the available syntactic knowledge, which achieved a marked increase in KSS, and the one that uses only word statistics. Using the POS tags of the language allows the system to filter out of the prediction list words that are syntactically inappropriate in a particular position within the sentence. This should increase the user's confidence in selecting words from the prediction list, which can result in better written sentences while imposing a lower cognitive load on him or her. This feature is useful for users with cognitive disabilities, especially those suffering from aphasia.

10 Further Work

To achieve a higher percentage of KSS, we are planning to add adaptability to the user's writing style to the system.
By adapting itself to the user, the system would gradually improve its performance. It is also necessary to add a POS tagger to the system in order to identify the POS tags of new words and tag them automatically.

Bibliography

Booth, L. and W. Beattie and A. Newell (1990) I know what you mean. Special Children, pp.

Carlberger, J. (1997) Word Prediction: Design and Implementation of a Probabilistic Word Prediction Program. Master dissertation. Royal Institute of Technology, Stockholm.

Carlberger, A. and T. Magnuson and J. Carlberger and H. Wachtmeister and S. Hunnicutt (1997a) Probability-based word prediction for writing support in dyslexia. In Barner, R., Heldner, M., Sullivan, K., and Wretling, P., editors, Proceedings of Fonetik '97 Conference, Volume 4, pp.

Carlberger, A. and J. Carlberger and T. Magnuson and M.S. Hunnicutt and S.E. Palazuelos-Cagigas and S.A. Navarro (1997b) Profet, a new generation of word prediction: An evaluation study. In Copestake, A., Langer, S., and Palazuelos-Cagigas, S., editors, Natural Language Processing for Communication Aids, Proceedings of a workshop sponsored by ACL, Madrid, Spain, pp.

Fazly, A. (2002) The Use of Syntax in Word Completion Utilities. Master dissertation. Canada: University of Toronto.

Freund, Y. and R.E. Schapire (1996) Experiments with a new boosting algorithm. In Proceedings of ICML.

Ghayoomi, M. (2004) Word Prediction in Computational Processing of the Persian Language. Master dissertation. Iran: Islamic Azad University, Tehran Central Branch.

Ghayoomi, M. (2006) Using word prediction systems for users with disabilities: A case study. In Proceedings of the 2nd Workshop on the Persian Language and Computer, Tehran University, Iran, June 27-28, pp.

Ghayoomi, M. and S.M. Assi (2005) Word prediction in a running text: A statistical language modeling for the Persian language. In Proceedings of the Australasian Language Technology Workshop, University of Sydney, Australia, Dec 10-11, pp.

Gustavii, E. and E. Pettersson (2003) A Swedish Grammar for Word Prediction. Stockholm: Uppsala University.

Jurafsky, D. and J.H. Martin (2000) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. New Jersey: Prentice-Hall.

Klund, J. and M. Novak (2001) If word prediction can help, which program do you choose?

Manning, C.D. and H. Schütze (1999) Foundations of Statistical Natural Language Processing. The MIT Press.

McCoy, K. and P. Demasco (1995) Some application of natural language processing to the field of augmentative and alternative communication. In Proceedings of the IJCAI 95 Workshop on Developing AI Applications for People with Disabilities.

Morris, C. and A. Newell and L. Booth and J. Arnott (1992) Syntax PAL: A system to improve the written syntax of language-impaired users. Assistive Technology, 4(2):51-59, Sept.

Nantais, T. and F. Shein and M. Johansson (2001) Efficacy of the word prediction algorithm in WordQ™. In Proceedings of the 24th Annual Conference on Technology and Disability, RESNA.

Rosenfeld, R. (1994) Adaptive Statistical Language Modeling: A Maximum Entropy Approach. PhD dissertation. Pittsburgh: Carnegie Mellon University.

Shein, F. and T. Nantais and R. Nishiyama and C. Tam and P. Marshall (2001) Word cueing for persons with writing difficulties: WordQ. The 16th Annual International Conference on Technology and Persons with Disabilities, California State University at Northridge, Los Angeles, CA, March.

Soede, M. and R.A. Foulds (1986) Dilemma of prediction in communication aids and mental load. In Proceedings of the 9th Annual Conference on Rehabilitation Technology.

Swiffin, A.L. and J.A. Pickering and J.L. Arnott and A.F. Newell (1985) PAL: An effort efficient portable communication aid and keyboard emulator. In Proceedings of the 8th Annual Conference on Rehabilitation Technology, pp.

Wood, M.E.J. (1996) Syntactic Pre-Processing in Single-Word Prediction for Disabled People. Ph.D. dissertation. University of Bristol, Bristol.


More information

Master of Arts in Linguistics Syllabus

Master of Arts in Linguistics Syllabus Master of Arts in Linguistics Syllabus Applicants shall hold a Bachelor s degree with Honours of this University or another qualification of equivalent standard from this University or from another university

More information

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg March 1, 2007 The catalogue is organized into sections of (1) obligatory modules ( Basismodule ) that

More information

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University Grammars and introduction to machine learning Computers Playing Jeopardy! Course Stony Brook University Last class: grammars and parsing in Prolog Noun -> roller Verb thrills VP Verb NP S NP VP NP S VP

More information

CS 533: Natural Language. Word Prediction

CS 533: Natural Language. Word Prediction CS 533: Natural Language Processing Lecture 03 N-Gram Models and Algorithms CS 533: Natural Language Processing Lecture 01 1 Word Prediction Suppose you read the following sequence of words: Sue swallowed

More information

Reading Competencies

Reading Competencies Reading Competencies The Third Grade Reading Guarantee legislation within Senate Bill 21 requires reading competencies to be adopted by the State Board no later than January 31, 2014. Reading competencies

More information

Study Plan. Bachelor s in. Faculty of Foreign Languages University of Jordan

Study Plan. Bachelor s in. Faculty of Foreign Languages University of Jordan Study Plan Bachelor s in Spanish and English Faculty of Foreign Languages University of Jordan 2009/2010 Department of European Languages Faculty of Foreign Languages University of Jordan Degree: B.A.

More information

Study Plan for Master of Arts in Applied Linguistics

Study Plan for Master of Arts in Applied Linguistics Study Plan for Master of Arts in Applied Linguistics Master of Arts in Applied Linguistics is awarded by the Faculty of Graduate Studies at Jordan University of Science and Technology (JUST) upon the fulfillment

More information

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features , pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of

More information

Terminology Extraction from Log Files

Terminology Extraction from Log Files Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier

More information

Building A Vocabulary Self-Learning Speech Recognition System

Building A Vocabulary Self-Learning Speech Recognition System INTERSPEECH 2014 Building A Vocabulary Self-Learning Speech Recognition System Long Qin 1, Alexander Rudnicky 2 1 M*Modal, 1710 Murray Ave, Pittsburgh, PA, USA 2 Carnegie Mellon University, 5000 Forbes

More information

Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.

Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach. Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach. Pranali Chilekar 1, Swati Ubale 2, Pragati Sonkambale 3, Reema Panarkar 4, Gopal Upadhye 5 1 2 3 4 5

More information

Processing: current projects and research at the IXA Group

Processing: current projects and research at the IXA Group Natural Language Processing: current projects and research at the IXA Group IXA Research Group on NLP University of the Basque Country Xabier Artola Zubillaga Motivation A language that seeks to survive

More information

Statistical Machine Translation

Statistical Machine Translation Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language

More information

Brill s rule-based PoS tagger

Brill s rule-based PoS tagger Beáta Megyesi Department of Linguistics University of Stockholm Extract from D-level thesis (section 3) Brill s rule-based PoS tagger Beáta Megyesi Eric Brill introduced a PoS tagger in 1992 that was based

More information

LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM*

LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM* LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM* Jonathan Yamron, James Baker, Paul Bamberg, Haakon Chevalier, Taiko Dietzel, John Elder, Frank Kampmann, Mark Mandel, Linda Manganaro, Todd Margolis,

More information

Using classes has the potential of reducing the problem of sparseness of data by allowing generalizations

Using classes has the potential of reducing the problem of sparseness of data by allowing generalizations POS Tags and Decision Trees for Language Modeling Peter A. Heeman Department of Computer Science and Engineering Oregon Graduate Institute PO Box 91000, Portland OR 97291 heeman@cse.ogi.edu Abstract Language

More information

POS Tagging 1. POS Tagging. Rule-based taggers Statistical taggers Hybrid approaches

POS Tagging 1. POS Tagging. Rule-based taggers Statistical taggers Hybrid approaches POS Tagging 1 POS Tagging Rule-based taggers Statistical taggers Hybrid approaches POS Tagging 1 POS Tagging 2 Words taken isolatedly are ambiguous regarding its POS Yo bajo con el hombre bajo a PP AQ

More information

UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE

UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE A.J.P.M.P. Jayaweera #1, N.G.J. Dias *2 # Virtusa Pvt. Ltd. No 752, Dr. Danister De Silva Mawatha, Colombo 09, Sri Lanka * Department of Statistics

More information

English Grammar Checker

English Grammar Checker International l Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 English Grammar Checker Pratik Ghosalkar 1*, Sarvesh Malagi 2, Vatsal Nagda 3,

More information

PHONETIC TOOL FOR THE TUNISIAN ARABIC

PHONETIC TOOL FOR THE TUNISIAN ARABIC PHONETIC TOOL FOR THE TUNISIAN ARABIC Abir Masmoudi 1,2, Yannick Estève 1, Mariem Ellouze Khmekhem 2, Fethi Bougares 1, Lamia Hadrich Belguith 2 (1) LIUM, University of Maine, France (2) ANLP Research

More information

Automatic Identification of Arabic Language Varieties and Dialects in Social Media

Automatic Identification of Arabic Language Varieties and Dialects in Social Media Automatic Identification of Arabic Language Varieties and Dialects in Social Media Fatiha Sadat University of Quebec in Montreal, 201 President Kennedy, Montreal, QC, Canada sadat.fatiha@uqam.ca Farnazeh

More information

Improving Word-Based Predictive Text Entry with Transformation-Based Learning

Improving Word-Based Predictive Text Entry with Transformation-Based Learning Improving Word-Based Predictive Text Entry with Transformation-Based Learning David J. Brooks and Mark G. Lee School of Computer Science University of Birmingham Birmingham, B15 2TT, UK d.j.brooks, m.g.lee@cs.bham.ac.uk

More information

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

More information

Customizing an English-Korean Machine Translation System for Patent Translation *

Customizing an English-Korean Machine Translation System for Patent Translation * Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,

More information

Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University

Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University 1. Introduction This paper describes research in using the Brill tagger (Brill 94,95) to learn to identify incorrect

More information

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University

More information

Towards a Visually Enhanced Medical Search Engine

Towards a Visually Enhanced Medical Search Engine Towards a Visually Enhanced Medical Search Engine Lavish Lalwani 1,2, Guido Zuccon 1, Mohamed Sharaf 2, Anthony Nguyen 1 1 The Australian e-health Research Centre, Brisbane, Queensland, Australia; 2 The

More information

Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms

Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms ESSLLI 2015 Barcelona, Spain http://ufal.mff.cuni.cz/esslli2015 Barbora Hladká hladka@ufal.mff.cuni.cz

More information

Contemporary Linguistics

Contemporary Linguistics Contemporary Linguistics An Introduction Editedby WILLIAM O'GRADY MICHAEL DOBROVOLSKY FRANCIS KATAMBA LONGMAN London and New York Table of contents Dedication Epigraph Series list Acknowledgements Preface

More information

Establishing the Uniqueness of the Human Voice for Security Applications

Establishing the Uniqueness of the Human Voice for Security Applications Proceedings of Student/Faculty Research Day, CSIS, Pace University, May 7th, 2004 Establishing the Uniqueness of the Human Voice for Security Applications Naresh P. Trilok, Sung-Hyuk Cha, and Charles C.

More information

LIUM s Statistical Machine Translation System for IWSLT 2010

LIUM s Statistical Machine Translation System for IWSLT 2010 LIUM s Statistical Machine Translation System for IWSLT 2010 Anthony Rousseau, Loïc Barrault, Paul Deléglise, Yannick Estève Laboratoire Informatique de l Université du Maine (LIUM) University of Le Mans,

More information

THE BACHELOR S DEGREE IN SPANISH

THE BACHELOR S DEGREE IN SPANISH Academic regulations for THE BACHELOR S DEGREE IN SPANISH THE FACULTY OF HUMANITIES THE UNIVERSITY OF AARHUS 2007 1 Framework conditions Heading Title Prepared by Effective date Prescribed points Text

More information

A System for Labeling Self-Repairs in Speech 1

A System for Labeling Self-Repairs in Speech 1 A System for Labeling Self-Repairs in Speech 1 John Bear, John Dowding, Elizabeth Shriberg, Patti Price 1. Introduction This document outlines a system for labeling self-repairs in spontaneous speech.

More information

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990

More information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

More information

Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery

Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery Jan Paralic, Peter Smatana Technical University of Kosice, Slovakia Center for

More information

Common Core Progress English Language Arts

Common Core Progress English Language Arts [ SADLIER Common Core Progress English Language Arts Aligned to the [ Florida Next Generation GRADE 6 Sunshine State (Common Core) Standards for English Language Arts Contents 2 Strand: Reading Standards

More information

An Overview of a Role of Natural Language Processing in An Intelligent Information Retrieval System

An Overview of a Role of Natural Language Processing in An Intelligent Information Retrieval System An Overview of a Role of Natural Language Processing in An Intelligent Information Retrieval System Asanee Kawtrakul ABSTRACT In information-age society, advanced retrieval technique and the automatic

More information

Reliable and Cost-Effective PoS-Tagging

Reliable and Cost-Effective PoS-Tagging Reliable and Cost-Effective PoS-Tagging Yu-Fang Tsai Keh-Jiann Chen Institute of Information Science, Academia Sinica Nanang, Taipei, Taiwan 5 eddie,chen@iis.sinica.edu.tw Abstract In order to achieve

More information

An Adaptive Approach to Named Entity Extraction for Meeting Applications

An Adaptive Approach to Named Entity Extraction for Meeting Applications An Adaptive Approach to Named Entity Extraction for Meeting Applications Fei Huang, Alex Waibel Language Technologies Institute Carnegie Mellon University 5000 Forbes Ave. Pittsburgh,PA 15213 fhuang@cs.cmu.edu,

More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 ushaduraisamy@yahoo.co.in 2 anishgracias@gmail.com Abstract A vast amount of assorted

More information

ARABIC PERSON NAMES RECOGNITION BY USING A RULE BASED APPROACH

ARABIC PERSON NAMES RECOGNITION BY USING A RULE BASED APPROACH Journal of Computer Science 9 (7): 922-927, 2013 ISSN: 1549-3636 2013 doi:10.3844/jcssp.2013.922.927 Published Online 9 (7) 2013 (http://www.thescipub.com/jcs.toc) ARABIC PERSON NAMES RECOGNITION BY USING

More information

COURSE SYLLABUS ESU 561 ASPECTS OF THE ENGLISH LANGUAGE. Fall 2014

COURSE SYLLABUS ESU 561 ASPECTS OF THE ENGLISH LANGUAGE. Fall 2014 COURSE SYLLABUS ESU 561 ASPECTS OF THE ENGLISH LANGUAGE Fall 2014 EDU 561 (85515) Instructor: Bart Weyand Classroom: Online TEL: (207) 985-7140 E-Mail: weyand@maine.edu COURSE DESCRIPTION: This is a practical

More information

UNIVERSITY OF JORDAN ADMISSION AND REGISTRATION UNIT COURSE DESCRIPTION

UNIVERSITY OF JORDAN ADMISSION AND REGISTRATION UNIT COURSE DESCRIPTION Course Description B.A Degree Spanish and English Language and Literature 2203103 Spanish Language for Beginners (1) (3 credit hours) Prerequisite : none In combination with Spanish for Beginners (2),

More information

Characteristics of Computational Intelligence (Quantitative Approach)

Characteristics of Computational Intelligence (Quantitative Approach) Characteristics of Computational Intelligence (Quantitative Approach) Shiva Vafadar, Ahmad Abdollahzadeh Barfourosh Intelligent Systems Lab Computer Engineering and Information Faculty Amirkabir University

More information

Minnesota K-12 Academic Standards in Language Arts Curriculum and Assessment Alignment Form Rewards Intermediate Grades 4-6

Minnesota K-12 Academic Standards in Language Arts Curriculum and Assessment Alignment Form Rewards Intermediate Grades 4-6 Minnesota K-12 Academic Standards in Language Arts Curriculum and Assessment Alignment Form Rewards Intermediate Grades 4-6 4 I. READING AND LITERATURE A. Word Recognition, Analysis, and Fluency The student

More information

Natural Language Processing: A Model to Predict a Sequence of Words

Natural Language Processing: A Model to Predict a Sequence of Words Natural Language Processing: A Model to Predict a Sequence of Words Gerald R. Gendron, Jr. Confido Consulting Spot Analytic Chesapeake, VA Gerald.gendron@gmail.com; LinkedIn: jaygendron ABSTRACT With the

More information

Log-Linear Models a.k.a. Logistic Regression, Maximum Entropy Models

Log-Linear Models a.k.a. Logistic Regression, Maximum Entropy Models Log-Linear Models a.k.a. Logistic Regression, Maximum Entropy Models Natural Language Processing CS 6120 Spring 2014 Northeastern University David Smith (some slides from Jason Eisner and Dan Klein) summary

More information

ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking

ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking Anne-Laure Ligozat LIMSI-CNRS/ENSIIE rue John von Neumann 91400 Orsay, France annlor@limsi.fr Cyril Grouin LIMSI-CNRS rue John von Neumann 91400

More information

Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies

Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies Sentiment analysis of Twitter microblogging posts Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies Introduction Popularity of microblogging services Twitter microblogging posts

More information

Natural Language Database Interface for the Community Based Monitoring System *

Natural Language Database Interface for the Community Based Monitoring System * Natural Language Database Interface for the Community Based Monitoring System * Krissanne Kaye Garcia, Ma. Angelica Lumain, Jose Antonio Wong, Jhovee Gerard Yap, Charibeth Cheng De La Salle University

More information

31 Case Studies: Java Natural Language Tools Available on the Web

31 Case Studies: Java Natural Language Tools Available on the Web 31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software

More information

Training and evaluation of POS taggers on the French MULTITAG corpus

Training and evaluation of POS taggers on the French MULTITAG corpus Training and evaluation of POS taggers on the French MULTITAG corpus A. Allauzen, H. Bonneau-Maynard LIMSI/CNRS; Univ Paris-Sud, Orsay, F-91405 {allauzen,maynard}@limsi.fr Abstract The explicit introduction

More information

Terminology Extraction from Log Files

Terminology Extraction from Log Files Terminology Extraction from Log Files Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet, Mathieu Roche To cite this version: Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet,

More information

Robust Methods for Automatic Transcription and Alignment of Speech Signals

Robust Methods for Automatic Transcription and Alignment of Speech Signals Robust Methods for Automatic Transcription and Alignment of Speech Signals Leif Grönqvist (lgr@msi.vxu.se) Course in Speech Recognition January 2. 2004 Contents Contents 1 1 Introduction 2 2 Background

More information

ChildFreq: An Online Tool to Explore Word Frequencies in Child Language

ChildFreq: An Online Tool to Explore Word Frequencies in Child Language LUCS Minor 16, 2010. ISSN 1104-1609. ChildFreq: An Online Tool to Explore Word Frequencies in Child Language Rasmus Bååth Lund University Cognitive Science Kungshuset, Lundagård, 222 22 Lund rasmus.baath@lucs.lu.se

More information

Computers and the Creative Process

Computers and the Creative Process Computers and the Creative Process Kostas Terzidis In this paper the role of the computer in the creative process is discussed. The main focus is the investigation of whether computers can be regarded

More information

UNIVERSITY OF PUERTO RICO RIO PIEDRAS CAMPUS COLLEGE OF HUMANITIES DEPARTMENT OF ENGLISH

UNIVERSITY OF PUERTO RICO RIO PIEDRAS CAMPUS COLLEGE OF HUMANITIES DEPARTMENT OF ENGLISH UNIVERSITY OF PUERTO RICO RIO PIEDRAS CAMPUS COLLEGE OF HUMANITIES DEPARTMENT OF ENGLISH Instructor: Dr. Alicia Pousada Course Title: Study of language Course Number: INGL 4205 Number of Credit Hours:

More information

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

More information

A prototype infrastructure for D Spin Services based on a flexible multilayer architecture

A prototype infrastructure for D Spin Services based on a flexible multilayer architecture A prototype infrastructure for D Spin Services based on a flexible multilayer architecture Volker Boehlke 1,, 1 NLP Group, Department of Computer Science, University of Leipzig, Johanisgasse 26, 04103

More information

Data Deduplication in Slovak Corpora

Data Deduplication in Slovak Corpora Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain

More information

209 THE STRUCTURE AND USE OF ENGLISH.

209 THE STRUCTURE AND USE OF ENGLISH. 209 THE STRUCTURE AND USE OF ENGLISH. (3) A general survey of the history, structure, and use of the English language. Topics investigated include: the history of the English language; elements of the

More information

What Is Linguistics? December 1992 Center for Applied Linguistics

What Is Linguistics? December 1992 Center for Applied Linguistics What Is Linguistics? December 1992 Center for Applied Linguistics Linguistics is the study of language. Knowledge of linguistics, however, is different from knowledge of a language. Just as a person is

More information

Robustness of a Spoken Dialogue Interface for a Personal Assistant

Robustness of a Spoken Dialogue Interface for a Personal Assistant Robustness of a Spoken Dialogue Interface for a Personal Assistant Anna Wong, Anh Nguyen and Wayne Wobcke School of Computer Science and Engineering University of New South Wales Sydney NSW 22, Australia

More information

Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System

Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System Oana NICOLAE Faculty of Mathematics and Computer Science, Department of Computer Science, University of Craiova, Romania oananicolae1981@yahoo.com

More information

Online Farsi Handwritten Character Recognition Using Hidden Markov Model

Online Farsi Handwritten Character Recognition Using Hidden Markov Model Online Farsi Handwritten Character Recognition Using Hidden Markov Model Vahid Ghods*, Mohammad Karim Sohrabi Department of Electrical and Computer Engineering, Semnan Branch, Islamic Azad University,

More information

Specialty Answering Service. All rights reserved.

Specialty Answering Service. All rights reserved. 0 Contents 1 Introduction... 2 1.1 Types of Dialog Systems... 2 2 Dialog Systems in Contact Centers... 4 2.1 Automated Call Centers... 4 3 History... 3 4 Designing Interactive Dialogs with Structured Data...

More information

Perplexity Method on the N-gram Language Model Based on Hadoop Framework

Perplexity Method on the N-gram Language Model Based on Hadoop Framework 94 International Arab Journal of e-technology, Vol. 4, No. 2, June 2015 Perplexity Method on the N-gram Language Model Based on Hadoop Framework Tahani Mahmoud Allam 1, Hatem Abdelkader 2 and Elsayed Sallam

More information

St. Petersburg College. RED 4335/Reading in the Content Area. Florida Reading Endorsement Competencies 1 & 2. Reading Alignment Matrix

St. Petersburg College. RED 4335/Reading in the Content Area. Florida Reading Endorsement Competencies 1 & 2. Reading Alignment Matrix Course Credit In-service points St. Petersburg College RED 4335/Reading in the Content Area Florida Reading Endorsement Competencies 1 & 2 Reading Alignment Matrix Text Rule 6A 4.0292 Specialization Requirements

More information

Text Processing with Hadoop and Mahout Key Concepts for Distributed NLP

Text Processing with Hadoop and Mahout Key Concepts for Distributed NLP Text Processing with Hadoop and Mahout Key Concepts for Distributed NLP Bridge Consulting Based in Florence, Italy Foundedin 1998 98 employees Business Areas Retail, Manufacturing and Fashion Knowledge

More information

Chapter 7. Language models. Statistical Machine Translation

Chapter 7. Language models. Statistical Machine Translation Chapter 7 Language models Statistical Machine Translation Language models Language models answer the question: How likely is a string of English words good English? Help with reordering p lm (the house

More information

Diagnosis Code Assignment Support Using Random Indexing of Patient Records A Qualitative Feasibility Study

Diagnosis Code Assignment Support Using Random Indexing of Patient Records A Qualitative Feasibility Study Diagnosis Code Assignment Support Using Random Indexing of Patient Records A Qualitative Feasibility Study Aron Henriksson 1, Martin Hassel 1, and Maria Kvist 1,2 1 Department of Computer and System Sciences

More information

NATURAL LANGUAGE QUERY PROCESSING USING SEMANTIC GRAMMAR

NATURAL LANGUAGE QUERY PROCESSING USING SEMANTIC GRAMMAR NATURAL LANGUAGE QUERY PROCESSING USING SEMANTIC GRAMMAR 1 Gauri Rao, 2 Chanchal Agarwal, 3 Snehal Chaudhry, 4 Nikita Kulkarni,, 5 Dr. S.H. Patil 1 Lecturer department o f Computer Engineering BVUCOE,

More information

CS 6740 / INFO 6300. Ad-hoc IR. Graduate-level introduction to technologies for the computational treatment of information in humanlanguage

CS 6740 / INFO 6300. Ad-hoc IR. Graduate-level introduction to technologies for the computational treatment of information in humanlanguage CS 6740 / INFO 6300 Advanced d Language Technologies Graduate-level introduction to technologies for the computational treatment of information in humanlanguage form, covering natural-language processing

More information

Kindergarten Common Core State Standards: English Language Arts

Kindergarten Common Core State Standards: English Language Arts Kindergarten Common Core State Standards: English Language Arts Reading: Foundational Print Concepts RF.K.1. Demonstrate understanding of the organization and basic features of print. o Follow words from

More information

Project 2: Term Clouds (HOF) Implementation Report. Members: Nicole Sparks (project leader), Charlie Greenbacker

Project 2: Term Clouds (HOF) Implementation Report. Members: Nicole Sparks (project leader), Charlie Greenbacker CS-889 Spring 2011 Project 2: Term Clouds (HOF) Implementation Report Members: Nicole Sparks (project leader), Charlie Greenbacker Abstract: This report describes the methods used in our implementation

More information