Chinese Term Extraction from Web Pages Based on Compound word Productivity

Size: px
Start display at page:

Download "Chinese Term Extraction from Web Pages Based on Compound word Productivity"

Transcription

1 Chinese Term Extraction from Web Pages Based on Compound word Productivity Hiroshi Nakagawa Information Technology Center, The University of Tokyo 7-3- Hongou, Bunkyo Tokyo, JAPAN, Abstract In this paper, we propose an automatic term recognition system for Chinese. Our idea is based on the relation between a compound word and its constituents that are simple words or individual Chinese character. More precisely, we basically focus on how many words/characters adjoin the word/character in question to form compound words. We also take into account the frequency of term. We evaluated word based method and character based method with several Chinese Web pages, resulting in precision of 75% for top ten candidate terms. Introduction Automatic term recognition, ATR in short, aims at extracting domain specific terms from a corpus or Web pages. Domain specific terms are terms that express the concept specifically defined in the given domain. They are required to have a unique meaning in order for efficient communication about the topic of the domain. It is, however, difficult to decide automatically whether they are unique. So we put this issue aside. In terms of feasibility, their grammatical status is important, for instance part of speeches. Although they are not necessarily confined to simple words where simple word means a word which could not be further divided into shorter and more basic words, the majority of them are actually compound words,. Thus, we here focus on both of simple and compound words. In terms of text length, even one Web page which is not long gives us a number of domain specific vocabulary like national library, library policy if the Web page is about libraries. If we expand domain specific terms to this extent, the big portion of domain specific terms are compound words. Obviously, the majority of compound words consist of relatively small number of distinct simple words. In this situation, it is natural to pay Hiroyuki ojima *, Akira Maeda** Faculty of Economics, The University of Tokyo 7-3- Hongou, Bunkyo Tokyo, JAPAN, * kojima@e.u-tokyo.ac.jp, **maeda@lib.u-tokyo.ac.jp attention to the relation among compound words and their constituent simple words. (ageura and Umino 996) proposed an important feature of domain specific terms called termhood which refers to the degree that a linguistic unit is related to a domain-specific concept. Presumably, it is necessary to develop an ATR method that calculates termhood of each term candidate extracted from a domain corpus that usually consists of a number of documents. Many works of ATR use statistics of term candidate distribution in a corpus such as term frequency to calculate the termhood of every term candidate. This frequency based methods, however, heavily depend on the size of corpus. Thus we do not expect a good result if we extract domain specific terms from one or a few Web pages. If we shift our focus from a corpus based statistics like frequency to term space that consists of all term candidates, we expect better result of extracted terms even from one Web page because of the following reason: A set of term candidates has its own structure like relations between compound words and their constituent simple words as stated before. The statistical information about these relations comes from more microscopic structure than term frequency. Thus, if we utilize more information from term space, it is reasonable to expect better performance in extracting from a small text like one Web page. Without this kind of information, we will be suffering from the shortage of information for ATR. Now look at frequency based information and information inherent with term space more closely. Even though several kinds of statistics about actual use in a corpus such as term frequency give a good approximation of termhood. They are not necessarily meanings in a writer's mind. On the contrary, the statistics of term space can reflect the meaning in a writer s mind because it is up to a writer s decision how to make a compound word term to express a complicated concept using simple word terms as its components. More precisely, if a certain simple word, say N,

2 expresses the basic concept of a domain that the document treats, the writer of the document, we expect, uses N not only many times but in various ways. One of typical way of this kind is that he/she composes quite a few compound words using N and uses these compound words in documents he/she writes. For this reason, we have to focus on the relation among simple words and compound words when pursuing new ATR methods. One of the attempts to make use of this relation has been done by Nakagawa and Mori (2003). Their method is based on the number of distinct simple words that come left or right of a simple word term to make up compound word terms. In this paper, we apply their method to deal with Web pages written in Chinese. In this paper, section 2 gives the background of ATR methods. In section 3 we introduce ATR method developed by Nakagawa and Mori(2003). Section 4, 5 and 6 are for how to apply their method to Chinese language and evaluation of two proposed method: ) Word based method using Chinese morphological analyzer ICTCLAS(Zhang, Yu, Xiong and Liu. 2003), 2) Stop character based method. 2 Background 2. Typical Procedures of Automatic Term Recognition An ATR procedure consists of two procedures in general. The first one is a procedure of extracting term candidates from a corpus. The second procedure is to assign each term candidate a score that indicates how likely the term candidate is a term to be recognized. Then all candidates are ranked according to their scores. In the remaining part of this section, we describe the background of a candidate extraction procedure and a scoring procedure respectively. 2.2 Candidates Extraction In term candidates extraction from the given text corpus, we mainly focus on compound words as well as simple words. To extract compound words which are promising term candidates and at the same time to exclude undesirable strings such as is a or of the, the frequently used method is to filter out the words that are the member of a stopword-list. The structure of complex term is another important factor for automatic term candidate extraction. A syntactic structure that is the result of parsing is focused on in many works. Since we focus on these complex structures, the first task in extracting term candidates is a morphological analysis including part of speech (POS) tagging. There are no explicit word boundary marker in Chinese, we first have to do morphological analysis which segments out words from a sentence and does POS tagging at the same time. After POS tagging, the complex structures mentioned above are extracted as term candidates. Previous studies have proposed many promising ways for this purpose, for instance, Smadja and Mceown (990), and Frantzi and Ananiadou (996) tried to treat more general structures like collocations. 2.3 Scoring The next step of ATR is to assign each term candidate its score in order to rank them in descending order of termhood. Many researchers have sought the definition of term candidate s score which approximates termhood. In fact, many of those proposals make use of statistics of actual use in a corpus such as term frequency which is so powerful and simple that many researchers directly or indirectly have used it. The combination of term frequency and inverse document frequency is also well studied i.e. (Uchimoto et al 2000), (Fukushige and Noguchi 2000). On the other hand, several scoring methods that are neither directly nor heavily based on frequency of term candidates have been proposed. Among those, Ananiadou et al. proposed C-value (Frantzi and Ananiadou 996) which counts how independently the given compound word is used in the given corpus. Hisamitsu (2000) proposes a way to measure termhood which estimates how far the document containing given term is different from the distribution of documents not containing the given term. However, the method proposed by Nakagawa and Mori (2003) outperforms these methods in terms of NTCIR TMREC task(ageura, et al, 999). 2.4 Chinese Term Extraction As for Chinese language NLP, very many works about word segmentation were published i.e. (Ma and Xia 2003). Nevertheless the term Term extraction has not yet been used for Chinese NLP, key words extraction have been a target for a long time. For instance, key words extraction from news articles (Li. et al. 2003) is the recent result which uses frequency and length of character string for scoring. Max-duplicated string based method (Yang and Li. 2002) is also promising. In spite of previous research efforts, there have been no attempt so far to apply the relation between simple and compound words to Chinese term extraction, and that is exactly what we propose in this paper.

3 3 Scoring methods with Simple word Bigrams 3. Simple word Bigrams The relation between a simple word and complex words that include the simple word is very important in terms of term space structure. Nevertheless, to my knowledge, this relation has not been paid enough attention so far except for the method proposed by Nakagawa and Mori (2003). In this paper, taking over their works, we focus on compound words among the various types of complex terms. In technical documents, the majority of domain-specific terms are noun phrases or compound words consisting of small size vocabulary of simple words. This observation leads to a new scoring methods that measures how many distinct compound words contain the simple word in question as their part in a given document or corpus. Here, suppose the situation where simple word: N occurs with other simple words as a part of many compound words shown in Figure where [N M] means bigram of noun N and M. [LN N] (#L) [LN2 N] (#L2) : : [LNn N](#Ln) [N RN](#R) [N RN2](#R2) [N RNm](#Rm) Figure. Noun Bigram and their Frequency In Figure, [LNi N] (i=,..,n) and [N RNj] (j=,...,m) are simple word bigrams which make (a part of) compound words. #Li and #Rj (i=,..,n and j=,..,m) mean the frequency of the bigram [LNi N] and [N RNj] in the corpus respectively. Note that since we depict only bigrams, compound words like [LNi N RNj] which contains [LNi N] and/or [N RNj] as their parts might actually occur in a corpus. Note that this noun trigram might be a part of longer compound words. We show an example of a set of noun bigrams. Suppose that we extract compound words including trigram as term candidates from a corpus as shown in the following example. Example. trigram statistics, word trigram, class trigram, word trigram, trigram acquisition, word trigram statistics, character trigram Then, noun bigrams consisting of a simple word trigram are shown in Figure 2 where the number between ( and ) shows the frequency in the corpus. word trigram (3) trigram statistics (2) class trigram () trigram acquisition () character trigram() Figure 2. An example of noun bigram Now we focus on and utilize simple word bigrams to define the scoring function. Note that we are only concerned with simple word bigrams and not with a simple word per se because, as stated before, we are concerned with the relation between a compound word and its component simple words. 3.2 Scoring Function 3.2. Score of simple word Since there are infinite number of scoring functions based on [LNi N] or [N RNj], we here consider the following simple but representative scoring functions. #LDN(N) and #RDN(N) : These are the number of distinct simple words which directly precede or succeed N. These coincide with n and m in Figure respectively. For instance, in an example shown in Figure 2, #LDN(trigram)=3, #RDN(trigram)=2. Using #LDN and #RDN we define LN(N) and RN(N): These are based on the number of occurrence of each noun bigram, and defined for [LNi N] and [N RNj] as follows respectively. #LDN(N) LN(N) = (#Li) () i= #RDN(N) RN(N) = (#Rj) (2) j= LN(N) and RN(N) are the frequencies of nouns that directly precede or succeed N. For instance, in an example shown in Figure 2, LN(trigram)=5, and RN(trigram)=3. Let s think about the background of these scoring functions. #LDN(N) and #RDN(N), where we do not take into account the frequency of each noun bigram but take into account the number of distinct nouns that adjoin to N to make compound words. That indicates how linguistically and domain dependently productive the noun:n is in a given corpus. That means that if N presents a key and/or basic concept of the domain treated by the corpus, writers in that domain work out many distinct compound words with N to express more complicated concepts. On the other hand, as for LN(N) and RN(N), we also focus on frequency of each noun bigram as well. In other words, statistic

4 bias in actual use of noun:n is, this time, one of our main concern. For example, in Figure 2, LN(trigram,2)=, and RN(trigram,2)=5. In conclusion, since LN(N) and RN(N) convey more information than #LDN(N) and #RDN(N), we adopt LN(N) and RN(N) in this research Score of compound words The next thing to do is expanding those scoring functions for simple word to the scoring functions for compound words. We adopt a geometric mean for this purpose. Now think of a compound word : CN = N N 2 N L, where N i (i=,.., L) is a simple word. Then a geometric mean: LR of CN is defined as follows. 2L L LR(CN) = (LN(N) i )(RN(N) i ) + + i= (3) For instance, if we use LN(N) and RN(N) in example, LR(trigram) = ( 3+ ) (5+ ) = LR does not depend on the length of CN where length means the number of simple words that consist of CN. This is because since we have not yet had any idea about the relation between the importance of a compound word and a length of the compound word, it is fair to treat all compound words, including simple words, equally no matter how long or short each compound word is Combining LR and Frequency of Nouns We still have not fully utilized the information about statistics of actual use in a corpus in the bigram based methods described in 3.2. and Among various kinds of information about actual use, the important and basic one is the frequency of single-and compound words that occur independently. The term independently means that the left and right adjacent words are not nouns. For instance, word patterns occurs independently in we use the word patterns which occur in this sentence. Since the scoring functions proposed in 3.2. is noun bigram statistics, the number of this kind of independent occurrences of nouns themselves have not been used so far. If we take this information into account, the better results are expected. Thus, if a simple word or a compound word occurs independently, the score of the noun is simply multiplied by the number of its independent occurrences. We call this new scoring function as FLR(CN) which is defined as follows. if N occurs independently then F LR(CN) = LR(CN) f(cn) where f(cn) means the number of independent occurrences of noun CN (4) 4 Term Extraction for Chinese based on Morphological Analysis If we try to apply the scoring method proposed in section 3 directly to a Chinese text, every word should be POS tagged because we extract multiword unit of several types of POS tag sequences as candidates of domain specific terms. For this we need a Chinese morphological analyzer because Chinese is an agglutinative language. Actually, we use Chinese morphological analyzer: ICTCLAS(Zhang and Liu 2004). As term candidates, we extract compound word: MWU having the following POS tag sequence expressed in (5). A multi-word-unit: MWU is defined by the following CFG rules where the right hand sides are expressed as a regular expression. MWU <--[ag a]* [ng n nr ns nt nz nx vn an i j]+ MWU <-- MWU? b [ng n nr ns nt nz nx vn an i j] + MWU <-- [ag a] + [u k] MWU MWU <-- MWU (u k he-2 yu-3) MWU (5) where ag, a, n, u are all tags used in ICTCLAS. Roughly speaking (5) means an adjective followed by the repetition of [adjective noun particle] followed by a noun. The problem is the ambiguity of POS tagging because the same word is very often used verb as well as noun. In addition, unknown words like newly appeared proper names also impairs the accuracy. Due to this problem caused by morphological analyzer, the accuracy is degraded. Once we segment out word sequences conforming the above POS tag sequences, we calculate LN and RN of each component word. In calculation of LN and RN, a word whose POS is c, u or k is omitted. In other words, if a word sequence w w2 w3 where POS of w2 is c u or k, then we calculate RN of w and LN of w3 by regarding the word sequence as w w3. Then we combine LN and RN of each word to calculate FLR by definition of (3) and (4) to sort all extracted candidates in descending order of FLR. We apply the proposed methods to 30 Web pages from People s Daily news. The areas are social, international and IT related news. The average length is characters. Firstly, we extract relevant terms by hand from each news article and use it as the gold standard. The average number of gold standard terms per one news particle is 5.9 words. Secondly, we extract terms from each news article and sort them in descending

5 order by the proposed method and evaluate them by a precision of top N terms defined as follows. CT()= if th term is one of the gold standard terms. 0 otherwise CT ( i) i precision = = ( ) (6) where N is the number of the gold standard terms, and in our experiment, N=20. Precision(), where =,..,20, are shown in Figure 3 as Strict. We also use another precision rate precision which is not strict and defined as follows. CTpart()= if one of gold standard terms. is a part of th term 0 otherwise CTpart( i) i precision = = '( ) (7) These are also shown in Figure 3 as Partly. documents, we expect POS sequences of terminologies with high accuracy. However, if we consider terminology extraction from Web pages, the possibility of unexpected POS sequence may rise. The third reason is language independency. Currently proposed and/or used morphological analyzers heavily depend either upon the sophisticated linguistic knowledge about the target language or upon a big size corpus of the target language if machine learning is employed. These linguistic resources, however, are not always available. Due to these reasons, we also developed term candidate extraction system which does not use a morphological analyzer. Instead of morphological analyzer, we try to employ a stop word list. In Chinese, as stop words, we find many character unigrams and bigrams because one Chinese character conveys larger amount of information than a character of Latin alphabet. They are partly shown in Appendix A. As term candidates, we simply extract character strings between two stop words that are nearest each other within a sentence. Obviously, the character strings thus extracted are not necessarily meaningful compound words. Therefore we cannot directly use these strings as words to calculate LN and RN function. Back to the idea that Chinese characters are ideograms, we come up to the idea that we calculate LN and RN of each character appearing within every character strings extracted as candidates. An example is shown in Figure 4. Figure 3. Strict and partly precision of word based extraction method. From Figure 3, we see that If we pick up the ten highest ranked terms, about 75% of them meet the gold standard. The case we loosen the definition of precision shows better than the strict case of (6) but the difference is not so large. That means that the proposed word based ranking method works very well to extract important Chinese terms from news articles. 5 Character based Term Extraction There are several reasons why we would like to develop a term extraction system without morphological analyzer. The first reason is that the accuracy of morphological analyzer is, in spite of the great advancement of these years, still around 95% (GuoDong and Jian 2003). The second reason is that there possibly exist terminologies with unexpected POS sequences. If we deal only with academic papers or technical LN(zuo-4)=3 RN(zuo-4)=2 Figure 4. LN and RN of Chinese character zuo-4 In calculation of LN and RN, we neglect characters whose POS are c,u or k as same as we did in morphological analyzer based method. Once we calculate LN and RN of each character, FLR of every character string is calculated as defined by (3) and (4) to sort them in descending order of FLR. Actually this idea is very similar with left and right entropy used to extract two character Chinese words from a corpus (Luo and Sun. 2003). However what we would like to extract is a set of longer compound words or even phrases used in a Web page. Moreover we only use the Web page and do not use any other language resources such as a big corpus at all due to the reason described above in this section.

6 We evaluate the proposed character based extraction method against the same Web pages from People s Daily news used in Morphological Analysis based method described in Section 4. We also use the same gold standard terms described in Section 4 for evaluation. The strict and partly precision defined by (6) and (7) are used. The result is shown in Figure 5. Figure 5. Strict and partly precision of character based extraction method. Comparing Figure 3 with Figure 5, apparently the result of extracted terms of word based method is better than that of character based method. However, it does not necessarily mean that the character based term extraction is inferior. If you take a glance at the stop word list of Appendix A, it seems that several of the stop words are selected mainly from words in auxiliary verbs, pronouns, adverbs, particles, prepositions, conjunctions, exclamations, onomatopoeic words and punctuation marks. However, in reality, our selection is based rather on meaning, usage and generally frequency of use than parts of speech. Thus some of them are not function words but content words in order to exclude non-domainspecific words. Actually, the stop words are not only character unigram but character bigram. Because Chinese character is ideograph and each character may have plural meanings, it is difficult only to use character unigram as a stop word in Chinese. Our method based on these viewpoints resulted in getting an interesting consequence. We show an example of news article and extracted terms from it by this method in Appendix B and Appendix C. This news article is entitled The Culture of Tibetan Web Site is formally created. Let s take a look at an underlined sentence in this short article and underlined terms extracted from there. This sentence says: According to the introduction, The Culture of Tibetan Web Site is a site of special pure culture for the purpose of investigating the essence of Tibetan culture, showing the scale of Tibetan culture and raising the spirit of Tibetan culture. In the case of method based on stop word list, we can extract compound term of investigating the essence of Tibetan culture, showing the scale of Tibetan culture ( raising the spirit of Tibetan culture ( and so on from this sentence. On the contrary, by the term extraction method based on morphological analysis, gerund, for example, showing( ) and raising ( ), can not be extracted. We said that the majority of domain specific terms are noun phrases or compound words consisting of small size vocabulary of simple words as stated in section 3. So we especially have paid attention to relation among nouns. However most of Chinese nouns can also be used as verbs. Moreover inflection of Chinese verbs can hardly be recognized visually. It is difficult to distinguish verb from noun by morphological analysis. Certainly ICTCLAS classifies the character that has meaning of both verb and noun into the category of vn (verb and noun). But gerunds and verbal noun infinitives are not contained in vn. For instance, means not only write a letter but writing letter. Thus we have to pay attention to verbs in Chinese too. Only by tuning up stop word list, we can take gerunds and verbal noun infinitives into account to some extent. Appendix C shows one of the evidence of this observation. 6 Conclusion In this paper, we apply automatic term recognition system based on FLR proposed by Nakagawa and Mori (2003) to Chinese Web pages because the term extraction from small text like one Web page is the future oriented topic. We proposed two methods: word based and character based extraction and ranking using the compound word productivity of simple words. Since the accuracies of term recognition are around 60% for top,000 term candidates in NTCIR TMREC task(ageura et al 999), the result of 75% accuracy of top ten candidates is a good start. References Ananiadou, S A Methodology for Automatic Term Recognition. In Proceedings of 4th International Conference on Computational Linguistics : GuoDong Zhou and Jian Su A Chinese Efficient Analyser Integrating Word Segmentation, Part-Of- Speech Tagging, Partial Parsing and Full Parsing, In

7 Proceedings of The Second SIGHAN Workshop, on Chinese Language Processing.ACL2003 :78-83 Frantzi, T.. and Ananiadou, S Extracting nested collocations. In Proceedings of 6th International Conference on Computational Linguistics :4-46. Fukushige, Y. and N. Nogichi Statistical and linguistic approaches to automatic term recognition NTCIR experiments at Matsushita. Terminology 6(2) : Hua-PingZhang, Hong-uYu, De-Yi Xiong and Qun Liu HHMM-based Chinese Lexical Analyzer ICTCLAS. In Proceedings of The Second SIGHAN Workshop, on Chinese Language Processing.ACL2003 :84-87 Hisamitsu, T, A Method of Measuring Term Representativeness. In Proceedings of 8th International Conference on Computational Linguistics : ageura,. and Umino, B Methods of automatic term recognition: a review. Terminology 3(2) : ageura,. et al, 999. TMREC Task: Overview and Evaluation. In Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition : Shengfen Luo and Maosong Sun Two-Character Chinese Word Extraction Based on Hybrid of Internal and Contextual Measures. In Proceedings of The Second SIGHAN Workshop, on Chinese Language Processing.ACL2003 :24-30 Qing Ma and Fei Xia Proceedings of The Second SIGHAN Workshop, on Chinese Language Processing.ACL2003, Sappro Sujian Li,Houfeng Wang,Shiwen Yu and Chengsheng Xin News-Oriented Automatic Chinese eyword Indexing. In Proceedings of The Second SIGHAN Workshop, on Chinese Language Processing.ACL2003 :92-97 Nakagawa, H. and Tatsunori Mori Automaic Term Recognition based on Statistics of Compound words and their Components, Terminology 9(2) :20-29 Smadja, F.A. and Mckeown,.R Automatically extracting and representing collocations for language generation. In Proceedings of the 28th Annual Meetings of the Association for Computational Linguistics : Uchimoto,., S.Sekine, M. Murata, H.Ozaku and H. Isahara Term recognition using corpora from different fields. Terminology 6(2) : Wenfeng Yang and Xing Li Chinese keyword extraction based on max-duplicated strings of the documents. In Proceedings of the 25th annual international ACM SIGIR conference on Research and develop-ment in information retrieval, pp evin Zhang and Qun Liu ICTCLAS. /manual /node2.html Appendix A: A part of stop word list Appendix B: An example of news article Appendix C: Terms extracted from Appendix B. Word Based (Top 0 terms with score of equation (4) ) Character Based(Top terms with score equation (4) )

Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology

Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology Makoto Nakamura, Yasuhiro Ogawa, Katsuhiko Toyama Japan Legal Information Institute, Graduate

More information

Word Completion and Prediction in Hebrew

Word Completion and Prediction in Hebrew Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology

More information

Extraction of Chinese Compound Words An Experimental Study on a Very Large Corpus

Extraction of Chinese Compound Words An Experimental Study on a Very Large Corpus Extraction of Chinese Compound Words An Experimental Study on a Very Large Corpus Jian Zhang Department of Computer Science and Technology of Tsinghua University, China ajian@s1000e.cs.tsinghua.edu.cn

More information

stress, intonation and pauses and pronounce English sounds correctly. (b) To speak accurately to the listener(s) about one s thoughts and feelings,

stress, intonation and pauses and pronounce English sounds correctly. (b) To speak accurately to the listener(s) about one s thoughts and feelings, Section 9 Foreign Languages I. OVERALL OBJECTIVE To develop students basic communication abilities such as listening, speaking, reading and writing, deepening their understanding of language and culture

More information

English Appendix 2: Vocabulary, grammar and punctuation

English Appendix 2: Vocabulary, grammar and punctuation English Appendix 2: Vocabulary, grammar and punctuation The grammar of our first language is learnt naturally and implicitly through interactions with other speakers and from reading. Explicit knowledge

More information

Comparative Analysis on the Armenian and Korean Languages

Comparative Analysis on the Armenian and Korean Languages Comparative Analysis on the Armenian and Korean Languages Syuzanna Mejlumyan Yerevan State Linguistic University Abstract It has been five years since the Korean language has been taught at Yerevan State

More information

Albert Pye and Ravensmere Schools Grammar Curriculum

Albert Pye and Ravensmere Schools Grammar Curriculum Albert Pye and Ravensmere Schools Grammar Curriculum Introduction The aim of our schools own grammar curriculum is to ensure that all relevant grammar content is introduced within the primary years in

More information

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words , pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan

More information

Writing Common Core KEY WORDS

Writing Common Core KEY WORDS Writing Common Core KEY WORDS An educator's guide to words frequently used in the Common Core State Standards, organized by grade level in order to show the progression of writing Common Core vocabulary

More information

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic

Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged

More information

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features , pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of

More information

Discourse Markers in English Writing

Discourse Markers in English Writing Discourse Markers in English Writing Li FENG Abstract Many devices, such as reference, substitution, ellipsis, and discourse marker, contribute to a discourse s cohesion and coherence. This paper focuses

More information

EAP 1161 1660 Grammar Competencies Levels 1 6

EAP 1161 1660 Grammar Competencies Levels 1 6 EAP 1161 1660 Grammar Competencies Levels 1 6 Grammar Committee Representatives: Marcia Captan, Maria Fallon, Ira Fernandez, Myra Redman, Geraldine Walker Developmental Editor: Cynthia M. Schuemann Approved:

More information

Identifying Prepositional Phrases in Chinese Patent Texts with. Rule-based and CRF Methods

Identifying Prepositional Phrases in Chinese Patent Texts with. Rule-based and CRF Methods Identifying Prepositional Phrases in Chinese Patent Texts with Rule-based and CRF Methods Hongzheng Li and Yaohong Jin Institute of Chinese Information Processing, Beijing Normal University 19, Xinjiekou

More information

Section 8 Foreign Languages. Article 1 OVERALL OBJECTIVE

Section 8 Foreign Languages. Article 1 OVERALL OBJECTIVE Section 8 Foreign Languages Article 1 OVERALL OBJECTIVE To develop students communication abilities such as accurately understanding and appropriately conveying information, ideas,, deepening their understanding

More information

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University

More information

Language Modeling. Chapter 1. 1.1 Introduction

Language Modeling. Chapter 1. 1.1 Introduction Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set

More information

Customizing an English-Korean Machine Translation System for Patent Translation *

Customizing an English-Korean Machine Translation System for Patent Translation * Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,

More information

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD. Svetlana Sokolova President and CEO of PROMT, PhD. How the Computer Translates Machine translation is a special field of computer application where almost everyone believes that he/she is a specialist.

More information

Terminology Extraction from Log Files

Terminology Extraction from Log Files Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier

More information

Indiana Department of Education

Indiana Department of Education GRADE 1 READING Guiding Principle: Students read a wide range of fiction, nonfiction, classic, and contemporary works, to build an understanding of texts, of themselves, and of the cultures of the United

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Sentence Blocks. Sentence Focus Activity. Contents

Sentence Blocks. Sentence Focus Activity. Contents Sentence Focus Activity Sentence Blocks Contents Instructions 2.1 Activity Template (Blank) 2.7 Sentence Blocks Q & A 2.8 Sentence Blocks Six Great Tips for Students 2.9 Designed specifically for the Talk

More information

Micro blogs Oriented Word Segmentation System

Micro blogs Oriented Word Segmentation System Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,

More information

Ohio Early Learning and Development Standards Domain: Language and Literacy Development

Ohio Early Learning and Development Standards Domain: Language and Literacy Development Ohio Early Learning and Development Standards Domain: Language and Literacy Development Strand: Listening and Speaking Topic: Receptive Language and Comprehension Infants Young Toddlers (Birth - 8 months)

More information

Paraphrasing controlled English texts

Paraphrasing controlled English texts Paraphrasing controlled English texts Kaarel Kaljurand Institute of Computational Linguistics, University of Zurich kaljurand@gmail.com Abstract. We discuss paraphrasing controlled English texts, by defining

More information

Simple maths for keywords

Simple maths for keywords Simple maths for keywords Adam Kilgarriff Lexical Computing Ltd adam@lexmasterclass.com Abstract We present a simple method for identifying keywords of one corpus vs. another. There is no one-sizefits-all

More information

ONLINE ENGLISH LANGUAGE RESOURCES

ONLINE ENGLISH LANGUAGE RESOURCES ONLINE ENGLISH LANGUAGE RESOURCES Developed and updated by C. Samuel for students taking courses at the English and French Language Centre, Faculty of Arts (Links live as at November 2, 2009) Dictionaries

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

Identifying Focus, Techniques and Domain of Scientific Papers

Identifying Focus, Techniques and Domain of Scientific Papers Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 sonal@cs.stanford.edu Christopher D. Manning Department of

More information

English-Japanese machine translation

English-Japanese machine translation [Information processing: proceedings of the International Conference on Information Processing, Unesco, Paris 1959] 163 English-Japanese machine translation By S. Takahashi, H. Wada, R. Tadenuma and S.

More information

UNIVERSITÀ DEGLI STUDI DELL AQUILA CENTRO LINGUISTICO DI ATENEO

UNIVERSITÀ DEGLI STUDI DELL AQUILA CENTRO LINGUISTICO DI ATENEO TESTING DI LINGUA INGLESE: PROGRAMMA DI TUTTI I LIVELLI - a.a. 2010/2011 Collaboratori e Esperti Linguistici di Lingua Inglese: Dott.ssa Fatima Bassi e-mail: fatimacarla.bassi@fastwebnet.it Dott.ssa Liliana

More information

Pupil SPAG Card 1. Terminology for pupils. I Can Date Word

Pupil SPAG Card 1. Terminology for pupils. I Can Date Word Pupil SPAG Card 1 1 I know about regular plural noun endings s or es and what they mean (for example, dog, dogs; wish, wishes) 2 I know the regular endings that can be added to verbs (e.g. helping, helped,

More information

Developing an Academic Essay

Developing an Academic Essay 2 9 In Chapter 1: Writing an academic essay, you were introduced to the concepts of essay prompt, thesis statement and outline. In this chapter, using these concepts and looking at examples, you will obtain

More information

POS Tagsets and POS Tagging. Definition. Tokenization. Tagset Design. Automatic POS Tagging Bigram tagging. Maximum Likelihood Estimation 1 / 23

POS Tagsets and POS Tagging. Definition. Tokenization. Tagset Design. Automatic POS Tagging Bigram tagging. Maximum Likelihood Estimation 1 / 23 POS Def. Part of Speech POS POS L645 POS = Assigning word class information to words Dept. of Linguistics, Indiana University Fall 2009 ex: the man bought a book determiner noun verb determiner noun 1

More information

Points of Interference in Learning English as a Second Language

Points of Interference in Learning English as a Second Language Points of Interference in Learning English as a Second Language Tone Spanish: In both English and Spanish there are four tone levels, but Spanish speaker use only the three lower pitch tones, except when

More information

Houghton Mifflin Harcourt StoryTown Grade 1. correlated to the. Common Core State Standards Initiative English Language Arts (2010) Grade 1

Houghton Mifflin Harcourt StoryTown Grade 1. correlated to the. Common Core State Standards Initiative English Language Arts (2010) Grade 1 Houghton Mifflin Harcourt StoryTown Grade 1 correlated to the Common Core State Standards Initiative English Language Arts (2010) Grade 1 Reading: Literature Key Ideas and details RL.1.1 Ask and answer

More information

The Cambridge English Scale explained A guide to converting practice test scores to Cambridge English Scale scores

The Cambridge English Scale explained A guide to converting practice test scores to Cambridge English Scale scores The Scale explained A guide to converting practice test scores to s Common European Framework of Reference (CEFR) English Scale 230 Key Preliminary First Advanced Proficiency Business Preliminary Business

More information

Compound Sentences and Coordination

Compound Sentences and Coordination Compound Sentences and Coordination Mary Westervelt Reference: Ann Hogue (2003) The Essentials of English: A Writer s Handbook. New York, Pearson Education, Inc. When two sentences are combined in a way

More information

Tagging with Hidden Markov Models

Tagging with Hidden Markov Models Tagging with Hidden Markov Models Michael Collins 1 Tagging Problems In many NLP problems, we would like to model pairs of sequences. Part-of-speech (POS) tagging is perhaps the earliest, and most famous,

More information

Correlation: ELLIS. English language Learning and Instruction System. and the TOEFL. Test Of English as a Foreign Language

Correlation: ELLIS. English language Learning and Instruction System. and the TOEFL. Test Of English as a Foreign Language Correlation: English language Learning and Instruction System and the TOEFL Test Of English as a Foreign Language Structure (Grammar) A major aspect of the ability to succeed on the TOEFL examination is

More information

GMAT.cz www.gmat.cz info@gmat.cz. GMAT.cz KET (Key English Test) Preparating Course Syllabus

GMAT.cz www.gmat.cz info@gmat.cz. GMAT.cz KET (Key English Test) Preparating Course Syllabus Lesson Overview of Lesson Plan Numbers 1&2 Introduction to Cambridge KET Handing Over of GMAT.cz KET General Preparation Package Introduce Methodology for Vocabulary Log Introduce Methodology for Grammar

More information

10th Grade Language. Goal ISAT% Objective Description (with content limits) Vocabulary Words

10th Grade Language. Goal ISAT% Objective Description (with content limits) Vocabulary Words Standard 3: Writing Process 3.1: Prewrite 58-69% 10.LA.3.1.2 Generate a main idea or thesis appropriate to a type of writing. (753.02.b) Items may include a specified purpose, audience, and writing outline.

More information

Collecting Polish German Parallel Corpora in the Internet

Collecting Polish German Parallel Corpora in the Internet Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska

More information

Historical Linguistics. Diachronic Analysis. Two Approaches to the Study of Language. Kinds of Language Change. What is Historical Linguistics?

Historical Linguistics. Diachronic Analysis. Two Approaches to the Study of Language. Kinds of Language Change. What is Historical Linguistics? Historical Linguistics Diachronic Analysis What is Historical Linguistics? Historical linguistics is the study of how languages change over time and of their relationships with other languages. All languages

More information

Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines

Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines , 22-24 October, 2014, San Francisco, USA Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines Baosheng Yin, Wei Wang, Ruixue Lu, Yang Yang Abstract With the increasing

More information

English. Universidad Virtual. Curso de sensibilización a la PAEP (Prueba de Admisión a Estudios de Posgrado) Parts of Speech. Nouns.

English. Universidad Virtual. Curso de sensibilización a la PAEP (Prueba de Admisión a Estudios de Posgrado) Parts of Speech. Nouns. English Parts of speech Parts of Speech There are eight parts of speech. Here are some of their highlights. Nouns Pronouns Adjectives Articles Verbs Adverbs Prepositions Conjunctions Click on any of the

More information

Understanding Video Lectures in a Flipped Classroom Setting. A Major Qualifying Project Report. Submitted to the Faculty

Understanding Video Lectures in a Flipped Classroom Setting. A Major Qualifying Project Report. Submitted to the Faculty 1 Project Number: DM3 IQP AAGV Understanding Video Lectures in a Flipped Classroom Setting A Major Qualifying Project Report Submitted to the Faculty Of Worcester Polytechnic Institute In partial fulfillment

More information

A Mixed Trigrams Approach for Context Sensitive Spell Checking

A Mixed Trigrams Approach for Context Sensitive Spell Checking A Mixed Trigrams Approach for Context Sensitive Spell Checking Davide Fossati and Barbara Di Eugenio Department of Computer Science University of Illinois at Chicago Chicago, IL, USA dfossa1@uic.edu, bdieugen@cs.uic.edu

More information

Speaking for IELTS. About Speaking for IELTS. Vocabulary. Grammar. Pronunciation. Exam technique. English for Exams.

Speaking for IELTS. About Speaking for IELTS. Vocabulary. Grammar. Pronunciation. Exam technique. English for Exams. About Collins series has been designed to be easy to use, whether by learners studying at home on their own or in a classroom with a teacher: Instructions are easy to follow Exercises are carefully arranged

More information

Quality Assurance of Automatic Annotation of Very Large Corpora: a Study based on Heterogeneous Tagging Systems

Quality Assurance of Automatic Annotation of Very Large Corpora: a Study based on Heterogeneous Tagging Systems Quality Assurance of Automatic Annotation of Very Large Corpora: a Study based on Heterogeneous Tagging Systems Chu-Ren Huang 1, Lung-Hao Lee 1, Wei-guang Qu 2, Jia-Fei Hong 1, Shiwen Yu 2 1 Institute

More information

UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE

UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE A.J.P.M.P. Jayaweera #1, N.G.J. Dias *2 # Virtusa Pvt. Ltd. No 752, Dr. Danister De Silva Mawatha, Colombo 09, Sri Lanka * Department of Statistics

More information

Tibetan Number Identification Based on Classification of Number Components in Tibetan Word Segmentation

Tibetan Number Identification Based on Classification of Number Components in Tibetan Word Segmentation Tibetan Number Identification Based on Classification of Number Components in Tibetan Word Segmentation Huidan Liu Institute of Software, Chinese Academy of Sciences, Graduate University of the huidan@iscas.ac.cn

More information

Strand: Reading Literature Topics Standard I can statements Vocabulary Key Ideas and Details

Strand: Reading Literature Topics Standard I can statements Vocabulary Key Ideas and Details Strand: Reading Literature Key Ideas and Craft and Structure Integration of Knowledge and Ideas RL.K.1. With prompting and support, ask and answer questions about key details in a text RL.K.2. With prompting

More information

The Oxford Learner s Dictionary of Academic English

The Oxford Learner s Dictionary of Academic English ISEJ Advertorial The Oxford Learner s Dictionary of Academic English Oxford University Press The Oxford Learner s Dictionary of Academic English (OLDAE) is a brand new learner s dictionary aimed at students

More information

Cross-Lingual Concern Analysis from Multilingual Weblog Articles

Cross-Lingual Concern Analysis from Multilingual Weblog Articles Cross-Lingual Concern Analysis from Multilingual Weblog Articles Tomohiro Fukuhara RACE (Research into Artifacts), The University of Tokyo 5-1-5 Kashiwanoha, Kashiwa, Chiba JAPAN http://www.race.u-tokyo.ac.jp/~fukuhara/

More information

9 The Difficulties Of Secondary Students In Written English

9 The Difficulties Of Secondary Students In Written English 9 The Difficulties Of Secondary Students In Written English Abdullah Mohammed Al-Abri Senior English Teacher, Dakhiliya Region 1 INTRODUCTION Writing is frequently accepted as being the last language skill

More information

Linguistic Knowledge-driven Approach to Chinese Comparative Elements Extraction

Linguistic Knowledge-driven Approach to Chinese Comparative Elements Extraction Linguistic Knowledge-driven Approach to Chinese Comparative Elements Extraction Minjun Park Dept. of Chinese Language and Literature Peking University Beijing, 100871, China karmalet@163.com Yulin Yuan

More information

From Terminology Extraction to Terminology Validation: An Approach Adapted to Log Files

From Terminology Extraction to Terminology Validation: An Approach Adapted to Log Files Journal of Universal Computer Science, vol. 21, no. 4 (2015), 604-635 submitted: 22/11/12, accepted: 26/3/15, appeared: 1/4/15 J.UCS From Terminology Extraction to Terminology Validation: An Approach Adapted

More information

Terminology Extraction from Log Files

Terminology Extraction from Log Files Terminology Extraction from Log Files Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet, Mathieu Roche To cite this version: Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet,

More information

Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach.

Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach. Sentiment analysis on news articles using Natural Language Processing and Machine Learning Approach. Pranali Chilekar 1, Swati Ubale 2, Pragati Sonkambale 3, Reema Panarkar 4, Gopal Upadhye 5 1 2 3 4 5

More information

Year 1 reading expectations (New Curriculum) Year 1 writing expectations (New Curriculum)

Year 1 reading expectations (New Curriculum) Year 1 writing expectations (New Curriculum) Year 1 reading expectations Year 1 writing expectations Responds speedily with the correct sound to graphemes (letters or groups of letters) for all 40+ phonemes, including, where applicable, alternative

More information

Understanding Clauses and How to Connect Them to Avoid Fragments, Comma Splices, and Fused Sentences A Grammar Help Handout by Abbie Potter Henry

Understanding Clauses and How to Connect Them to Avoid Fragments, Comma Splices, and Fused Sentences A Grammar Help Handout by Abbie Potter Henry Independent Clauses An independent clause (IC) contains at least one subject and one verb and can stand by itself as a simple sentence. Here are examples of independent clauses. Because these sentences

More information

Web Information Mining and Decision Support Platform for the Modern Service Industry

Web Information Mining and Decision Support Platform for the Modern Service Industry Web Information Mining and Decision Support Platform for the Modern Service Industry Binyang Li 1,2, Lanjun Zhou 2,3, Zhongyu Wei 2,3, Kam-fai Wong 2,3,4, Ruifeng Xu 5, Yunqing Xia 6 1 Dept. of Information

More information

ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS

ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS Gürkan Şahin 1, Banu Diri 1 and Tuğba Yıldız 2 1 Faculty of Electrical-Electronic, Department of Computer Engineering

More information

Reliable and Cost-Effective PoS-Tagging

Reliable and Cost-Effective PoS-Tagging Reliable and Cost-Effective PoS-Tagging Yu-Fang Tsai Keh-Jiann Chen Institute of Information Science, Academia Sinica Nanang, Taipei, Taiwan 5 eddie,chen@iis.sinica.edu.tw Abstract In order to achieve

More information

Teaching English as a Foreign Language (TEFL) Certificate Programs

Teaching English as a Foreign Language (TEFL) Certificate Programs Teaching English as a Foreign Language (TEFL) Certificate Programs Our TEFL offerings include one 27-unit professional certificate program and four shorter 12-unit certificates: TEFL Professional Certificate

More information

Morphology. Morphology is the study of word formation, of the structure of words. 1. some words can be divided into parts which still have meaning

Morphology. Morphology is the study of word formation, of the structure of words. 1. some words can be divided into parts which still have meaning Morphology Morphology is the study of word formation, of the structure of words. Some observations about words and their structure: 1. some words can be divided into parts which still have meaning 2. many

More information

DIFFICULTIES AND SOME PROBLEMS IN TRANSLATING LEGAL DOCUMENTS

DIFFICULTIES AND SOME PROBLEMS IN TRANSLATING LEGAL DOCUMENTS DIFFICULTIES AND SOME PROBLEMS IN TRANSLATING LEGAL DOCUMENTS Ivanka Sakareva Translation of legal documents bears its own inherent difficulties. First we should note that this type of translation is burdened

More information

Special Topics in Computer Science

Special Topics in Computer Science Special Topics in Computer Science NLP in a Nutshell CS492B Spring Semester 2009 Jong C. Park Computer Science Department Korea Advanced Institute of Science and Technology INTRODUCTION Jong C. Park, CS

More information

LANGUAGE! 4 th Edition, Levels A C, correlated to the South Carolina College and Career Readiness Standards, Grades 3 5

LANGUAGE! 4 th Edition, Levels A C, correlated to the South Carolina College and Career Readiness Standards, Grades 3 5 Page 1 of 57 Grade 3 Reading Literary Text Principles of Reading (P) Standard 1: Demonstrate understanding of the organization and basic features of print. Standard 2: Demonstrate understanding of spoken

More information

CERTIFICATION EXAMINATIONS FOR OKLAHOMA EDUCATORS (CEOE )

CERTIFICATION EXAMINATIONS FOR OKLAHOMA EDUCATORS (CEOE ) CERTIFICATION EXAMINATIONS FOR OKLAHOMA EDUCATORS (CEOE ) FIELD 74: OKLAHOMA GENERAL EDUCATION TEST (OGET ) Subarea Range of Competencies I. Critical Thinking Skills: Reading and Communications 01 05 II.

More information

Criterion SM Online Essay Evaluation: An Application for Automated Evaluation of Student Essays

Criterion SM Online Essay Evaluation: An Application for Automated Evaluation of Student Essays Criterion SM Online Essay Evaluation: An Application for Automated Evaluation of Student Essays Jill Burstein Educational Testing Service Rosedale Road, 18E Princeton, NJ 08541 jburstein@ets.org Martin

More information

Kindergarten Common Core State Standards: English Language Arts

Kindergarten Common Core State Standards: English Language Arts Kindergarten Common Core State Standards: English Language Arts Reading: Foundational Print Concepts RF.K.1. Demonstrate understanding of the organization and basic features of print. o Follow words from

More information

Handbook on Test Development: Helpful Tips for Creating Reliable and Valid Classroom Tests. Allan S. Cohen. and. James A. Wollack

Handbook on Test Development: Helpful Tips for Creating Reliable and Valid Classroom Tests. Allan S. Cohen. and. James A. Wollack Handbook on Test Development: Helpful Tips for Creating Reliable and Valid Classroom Tests Allan S. Cohen and James A. Wollack Testing & Evaluation Services University of Wisconsin-Madison 1. Terminology

More information

Natural Language Database Interface for the Community Based Monitoring System *

Natural Language Database Interface for the Community Based Monitoring System * Natural Language Database Interface for the Community Based Monitoring System * Krissanne Kaye Garcia, Ma. Angelica Lumain, Jose Antonio Wong, Jhovee Gerard Yap, Charibeth Cheng De La Salle University

More information

How To Rank Term And Collocation In A Newspaper

How To Rank Term And Collocation In A Newspaper You Can t Beat Frequency (Unless You Use Linguistic Knowledge) A Qualitative Evaluation of Association Measures for Collocation and Term Extraction Joachim Wermter Udo Hahn Jena University Language & Information

More information

Beyond single words: the most frequent collocations in spoken English

Beyond single words: the most frequent collocations in spoken English Beyond single words: the most frequent collocations in spoken English Dongkwang Shin and Paul Nation This study presents a list of the highest frequency collocations of spoken English based on carefully

More information

LESSON THIRTEEN STRUCTURAL AMBIGUITY. Structural ambiguity is also referred to as syntactic ambiguity or grammatical ambiguity.

LESSON THIRTEEN STRUCTURAL AMBIGUITY. Structural ambiguity is also referred to as syntactic ambiguity or grammatical ambiguity. LESSON THIRTEEN STRUCTURAL AMBIGUITY Structural ambiguity is also referred to as syntactic ambiguity or grammatical ambiguity. Structural or syntactic ambiguity, occurs when a phrase, clause or sentence

More information

Differences in linguistic and discourse features of narrative writing performance. Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu 3

Differences in linguistic and discourse features of narrative writing performance. Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu 3 Yıl/Year: 2012 Cilt/Volume: 1 Sayı/Issue:2 Sayfalar/Pages: 40-47 Differences in linguistic and discourse features of narrative writing performance Abstract Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu

More information

An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System

An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System Thiruumeni P G, Anand Kumar M Computational Engineering & Networking, Amrita Vishwa Vidyapeetham, Coimbatore,

More information

CS 533: Natural Language. Word Prediction

CS 533: Natural Language. Word Prediction CS 533: Natural Language Processing Lecture 03 N-Gram Models and Algorithms CS 533: Natural Language Processing Lecture 01 1 Word Prediction Suppose you read the following sequence of words: Sue swallowed

More information

INVESTIGATING DISCOURSE MARKERS IN PEDAGOGICAL SETTINGS:

INVESTIGATING DISCOURSE MARKERS IN PEDAGOGICAL SETTINGS: ARECLS, 2011, Vol.8, 95-108. INVESTIGATING DISCOURSE MARKERS IN PEDAGOGICAL SETTINGS: A LITERATURE REVIEW SHANRU YANG Abstract This article discusses previous studies on discourse markers and raises research

More information

Regular Expressions and Automata using Haskell

Regular Expressions and Automata using Haskell Regular Expressions and Automata using Haskell Simon Thompson Computing Laboratory University of Kent at Canterbury January 2000 Contents 1 Introduction 2 2 Regular Expressions 2 3 Matching regular expressions

More information

Elements of Writing Instruction I

Elements of Writing Instruction I Elements of Writing Instruction I Purpose of this session: 1. To demystify the goals for any writing program by clearly defining goals for children at all levels. 2. To encourage parents that they can

More information

Universiteit Leiden. ICT in Business. Leiden Institute of Advanced Computer Science (LIACS) Capability Maturity Model for Software Usage

Universiteit Leiden. ICT in Business. Leiden Institute of Advanced Computer Science (LIACS) Capability Maturity Model for Software Usage Universiteit Leiden ICT in Business Capability Maturity Model for Software Usage Name: Yunwei Huang Student-no: s1101005 Date: 16/06/2014 1st supervisor: Dr. Luuk Groenewegen 2nd supervisor: Dr. Nelleke

More information

Parts of Speech. Skills Team, University of Hull

Parts of Speech. Skills Team, University of Hull Parts of Speech Skills Team, University of Hull Language comes before grammar, which is only an attempt to describe a language. Knowing the grammar of a language does not mean you can speak or write it

More information

Culture and Language. What We Say Influences What We Think, What We Feel and What We Believe

Culture and Language. What We Say Influences What We Think, What We Feel and What We Believe Culture and Language What We Say Influences What We Think, What We Feel and What We Believe Unique Human Ability Ability to create and use language is the most distinctive feature of humans Humans learn

More information

BOOK REVIEW. John H. Dobson, Learn New Testament Greek (Grand Rapids: Baker Academic, 3rd edn, 2005). xiii + 384 pp. Hbk. US$34.99.

BOOK REVIEW. John H. Dobson, Learn New Testament Greek (Grand Rapids: Baker Academic, 3rd edn, 2005). xiii + 384 pp. Hbk. US$34.99. [JGRChJ 8 (2011 12) R61-R65] BOOK REVIEW John H. Dobson, Learn New Testament Greek (Grand Rapids: Baker Academic, 3rd edn, 2005). xiii + 384 pp. Hbk. US$34.99. There is no doubt that translation is central

More information

Strand: Reading Literature Topics Standard I can statements Vocabulary Key Ideas and Details

Strand: Reading Literature Topics Standard I can statements Vocabulary Key Ideas and Details Strand: Reading Literature Key Ideas and Details Craft and Structure RL.3.1 Ask and answer questions to demonstrate understanding of a text, referring explicitly to the text as the basis for the answers.

More information

Language Lessons. Secondary Child

Language Lessons. Secondary Child Scope & Sequence for Language Lessons for the Secondary Child by Sandi Queen Queen Homeschool Supplies, Inc. Lesson 1: Picture Study and Narration Lesson 2: Creative Writing Lesson 3-5: For Copywork Lesson

More information

The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT)

The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT) The Development of Multimedia-Multilingual Storage, Retrieval and Delivery for E-Organization (STREDEO PROJECT) Asanee Kawtrakul, Kajornsak Julavittayanukool, Mukda Suktarachan, Patcharee Varasrai, Nathavit

More information

Database Design For Corpus Storage: The ET10-63 Data Model

Database Design For Corpus Storage: The ET10-63 Data Model January 1993 Database Design For Corpus Storage: The ET10-63 Data Model Tony McEnery & Béatrice Daille I. General Presentation Within the ET10-63 project, a French-English bilingual corpus of about 2 million

More information

31 Case Studies: Java Natural Language Tools Available on the Web

31 Case Studies: Java Natural Language Tools Available on the Web 31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software

More information

Student Achievement in Asian Languages Education. Part 2: Descriptions of Student Achievement

Student Achievement in Asian Languages Education. Part 2: Descriptions of Student Achievement Student Achievement in Asian Languages Education Part 2: Descriptions of Student Achievement This report is in four volumes: Part 1: Project Report Part 2: Descriptions of Student Achievement Part 3: Appendices

More information

Keywords academic writing phraseology dissertations online support international students

Keywords academic writing phraseology dissertations online support international students Phrasebank: a University-wide Online Writing Resource John Morley, Director of Academic Support Programmes, School of Languages, Linguistics and Cultures, The University of Manchester Summary A salient

More information

A Survey of Online Tools Used in English-Thai and Thai-English Translation by Thai Students

A Survey of Online Tools Used in English-Thai and Thai-English Translation by Thai Students 69 A Survey of Online Tools Used in English-Thai and Thai-English Translation by Thai Students Sarathorn Munpru, Srinakharinwirot University, Thailand Pornpol Wuttikrikunlaya, Srinakharinwirot University,

More information

Finding Advertising Keywords on Web Pages. Contextual Ads 101

Finding Advertising Keywords on Web Pages. Contextual Ads 101 Finding Advertising Keywords on Web Pages Scott Wen-tau Yih Joshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University Contextual Ads 101 Publisher s website Digital Camera Review The

More information

KINDGERGARTEN. Listen to a story for a particular reason

KINDGERGARTEN. Listen to a story for a particular reason KINDGERGARTEN READING FOUNDATIONAL SKILLS Print Concepts Follow words from left to right in a text Follow words from top to bottom in a text Know when to turn the page in a book Show spaces between words

More information

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER

C o p yr i g ht 2015, S A S I nstitute Inc. A l l r i g hts r eser v ed. INTRODUCTION TO SAS TEXT MINER INTRODUCTION TO SAS TEXT MINER TODAY S AGENDA INTRODUCTION TO SAS TEXT MINER Define data mining Overview of SAS Enterprise Miner Describe text analytics and define text data mining Text Mining Process

More information