Retrieval effectiveness study with Farsi language
|
|
|
- Chad Ferguson
- 9 years ago
- Views:
Transcription
1 Publié dans les Actes 9e Conférence en Recherche d Information et Applications CORIA 12, 25-40, 2012 qui doivent être utilisés pour toute référence à ce travail 1 Retrieval effectiveness study with Farsi language Mitra Akasereh, Jacques Savoy Université de Neuchâtel Institut d'informatique - Rue Emile-Argand 11, CH Neuchâtel [email protected], [email protected] ABSTRACT. Having Farsi as the underlying language and using a test collection of 166,774 documents and 100 topics, this experiment evaluates the retrieval effectiveness of different IR models while using a light and a plural stemmer as well as n-grams and trunc-n indexing strategies. Moreover the impact of stoplist removal is evaluated. According to the obtained results the DFR-I(n e )C2 model is the best performing one. The proposed light and plural stemmer improve the retrieval performance compare to non-stemming approach. Indexing strategies trunc-4 and trunc-5 have also a positive impact on the performance while 3-grams and trunc-3 have the most negative impact on the results. The results reveal that for Farsi stoplist removal plays an important role in improving the retrieval performance. A query-byquery analysis on the results shows that avoiding extreme results would be possible by adding extra controls and rules, according to Farsi morphology, to the stemming algorithms. RÉSUMÉ. Dans le but d utiliser le persan comme langue de référence, et en utilisant une collection test de documents et de 100 requêtes, cette étude évalue la performance des différents modèles de RI sur lesquels sont appliquées diverses stratégies d indexation et de recherche. De plus, cette étude évalue l impact de l élimination de la liste des mots-outils lors de l indexation. Selon les résultats obtenus, le modèle DFR-I(n e )C2 est le plus performant. L enracineur léger et l enracineur pluriel améliorent la performance en comparaison à l approche sans enracineur. Les stratégies d indexation, comme tronc-4 et tronc-5 améliorent la performance, alors que les approches comme 3-grams et tronc-3 ont l impact le plus négatif sur les résultats. Les résultats révèlent que l élimination de la liste des mots-outils joue un rôle important dans l'amélioration de la performance. L'analyse requêtes par requêtes montre qu il serait possible d ajouter des règles supplémentaires aux enracineurs, pour éviter des résultats erronés. KEY WORDS: Farsi language, Information retrieval in Persian, Persian morphology MOTS-CLÉS: Langue persane, Recherche d information en persan, morphologie du persan
2 2 1. Introduction In spite of the constant growing of the need for information retrieval (IR) tools dealing with languages other than English, there has been much less done for Persian language (Farsi). In order to create effective IR tools for a new language a good basis would be readapting portions of certain existing retrieval systems for this new language (Savoy 1999). Of course restructuring these existing tools is not a trivial task. Different languages with different linguistic characteristics have their particular affects and restrictions in the process of IR systems development. A good start point is to choose a proper IR model and then to provide the necessary linguistic tools considering the characteristics (e.g., grammatical, morphological) of the target language (Savoy, 2004). Accordingly in our experiment studying Farsi language, we first study Farsi morphology. We then propose a light and a plural (very light) stemmer along with a stopword list for this language. And finally after applying these with different IR models on our test collection, we make a query-by-query analysis on the results to discover the weaknesses of these different methods. Our aim is to then create more accurate and effective IR tools for this language for both monolingual and bilingual retrieval. The rest of this paper is organized as follows: Section 2 is a brief introduction to Persian language and its morphology. Section 3 represents the setup of our experiment. Section 4 contains the obtained results and states the related analysis while Section 5 concludes the experiment. 2. Persian language Persian language, also known as Farsi, is a subclass of the western Iranian languages. This language is a member of Indo-European languages and in terms of orthography it belongs to the Arabic script-based languages. The underlying morphology is a bit more complex than English but it is not a difficult one compared to languages such as Turkish or Finnish (Dolamic & Savoy, 2009). Having more than 100 million native speakers Persian language is the official language of Iran, Afghanistan and Tajikistan (called Persian or Farsi, Dari and Tajik respectively). It is also spoken in parts of some other countries in the Middle East. As a member of Indo-European languages, Persian is to some extends related to the majority of European languages, including English and German, and has been in interaction with other non-iranian languages like Arabic, Turkish, Hindi, Mongolian, Latin, Greek, Russian and French (Dolamic & Savoy, 2009; Bankhan et al., 2011).
3 Persian morphology & problems with Persian language Persian language is written using 32 letters: the 28 Arabic letters plus four more characters, گگ /ɡ/, چچ /tʃ/, پپ /p/, ژژ /ʒ/, which are not used in classical Arabic. These letters are written from right to left, and for many of them the form differs if they are connected to another letter or isolated, at the beginning, in the middle or as the final letter in a word Word segmentation Defining the word s boundary in Persian is, sometimes, a challenging task. For some cases there is not a unique way of writing which makes it difficult to distinguish the word s boundary. The cursive nature of Arabic script causes this problem. Words which consist of minimum two morphemes can be written either concatenated or separately. Only if the first morpheme ends with one of the letters that cannot be concatenated to the next letter اا ) or ا آ /ɒː/, دد /d/, ذذ /z/, رر /r/, /v/) then the two morphemes cannot be concatenated. But still وو /ʒ/, ژژ /z/, زز when morphemes are not concatenated, there would be two different ways to write them: with a blank space between two morphemes or the Zero-Width-Non-Joiner (ZWNJ) (represented by # in the following example). In general for a word made of n morphemes there are at most 3 n-1 different ways of writing among which some of them are not correct or less common (indicated by * in the example below). For example having the word abc, in which n=3 and considering that morphemes a and b end with a concatenative letter there will be 9 possible forms of writing: {abc*, a b c, a#b#c, ab#c, ab c, a#bc*, a bc*, a b#c, and a#b c}. Although there are orthographic rules, published by the Persian Academy of Language and Literature (PALL), for the grammar of Persian orthography still there exists orthographic variation in different texts (Bankhan et al., 2011). This variety of writing creates challenges in the process of tokenization as well as stemming phase Inflectional morphology Like other Indo-European languages Persian has affixitive morphology. This means that words are modified by concatenating suffixes or prefixes to them (Dolamic & Savoy, 2009). Adding a prefix or suffix to a stem may cause a change in the word s orthography which makes the process of stemming and lemmatization more complex. Therefore extra rules are needed to conflate related words to their common stem (some examples of this phenomenon are mentioned in the next paragraphs). In Persian there are different ways to define different grammatical cases. The genitive and possessive cases can be shown by coupling two nouns using the particle /-e/ known as ezafe. Ezafe usually in not written and only pronounced. This هه /ʔ/, known as hamze, if the noun ends by an attached ء particle changes to a /h/ written as ە /-je/ and to a یی /-je/ if the word ends with اا /ɒː/ or وو /uː/. These are cases where ezafe is appeared in writing. The possessive construction as explained is for cases where the possessor is mentioned after the object if not the
4 possessive case is shown by adding different suffixes regarding the person of the /- شانن /-tɒn/, تانن /-mɒn/, مانن /-esh/, يیش /-et/, يیت /-æm/, اامم (i.e., possessor ʃɒn/). /- هھھھا To denote the plural, different suffixes can be used in Persian. The suffix hɒ/ is reserved for non-human nouns. This suffix can be written both attached to the گانن /-ɒn/ is used for human nouns and becomes اانن word or separately. The suffix /-gɒn/ if the word ends by هه /h/ and sometimes changes to يیانن /-jɒn/ if the word ends by other vowels. For nouns borrowed from Arabic, the plural form can be formed either by adding the suffix ااتت /-ɒt/ or the suffix يین /-in/ or the broken form for plurals, which owns an irregular format. While making plural form using suffix گانن /-gɒn/ if the word ends with a silent هه /he/ this letter will be omitted from the end of the word. The suffix then either attaches to the word or stays separately if the last letter (after omitting the هه /he/) is among the letters that cannot be joined to next letter. In Persian comparatives are formed by adding the suffix tar تر /-tær/ (with some exceptional cases where the word completely changes). The superlatives are built by adding the suffix تريین /-tæɾin/. And to make relative adjectives there are different suffixes most commonly the suffix یی /-je/. In other cases suffixes هه /h/, found. /-gɒn/ can be گانن /-in/ and rarely يین There is no grammatical gender in Persian language. In order to specify the natural gender the words man and woman or the adjectives male and female are using Other grammatical issues Persian language does not have specific definite or indefinite articles (as the and a or an in English). In general definite or indefinite words can be distinguished by looking at the structure of a phrase rather than the existence of a specific particle. Indefinite articles can be expressed in two ways. First with the suffix یی /-je/ which changes to اایی /-i/ if the word ends with a silent هه /h/. It changes to يیی /-jiː/ while being added to a plural noun made by adding the suffix noun. /jek/ (one) before the expected يیک /-hɒ/. Second, by adding the numeral هھھھا For the definite nouns the particle رراا /ɾɒ/ (the particle following the noun in accusative cases) is considered to have the function of the definite article. The relative یی /-je/ as explained above can also have the function of definite article if the word, to which suffix یی /-je/ is added followed by the particle کهھ /ke/. While in spoken grammar adding an ending هه /h/ to the word makes the word definite. In Farsi proper names are written in the same way as other words so it is not easy to prevent them from being stemmed. For example the word اايیراانن (Iran) could be stemmed into اايیر as the suffix اانن /-ɒn/ is one of the suffixes used for pluralisation. 4
5 In this language homonyms exist as in all other languages. But additionally the fact that the short vowels /æ/, /e/, /o/ are not written in Persian script, causes a remarkable number of double or even triple heteronyms and thus resulting ambiguity (e.g., سر /seɾ/, secret, سر /sæɾ/, head and سر /soɾ/, slippery ) (AleAhmad et al., 2008) Experiment architecture 3.1. Test collection The test-collection used for this experiment is the collection made available during the CLEF This collection is made up of 166,774 newspaper articles, with approximately 202 terms per document (after stopword removal), extracted from a national Iranian newspaper ( Hamshahri ) between the years 1996 to There are 100 topics (from Topic #551 to Topic #650) in the collection, where the first 50 topics are made and assessed during CLEF2008 and the last 50 ones during CLEF2009. The topics have a total number of 9,625 relevant documents with mean of 96.25, median of 89.5 (standard deviation 62.14). The topics cover a variety of subjects such as politics, literature, art, economics, etc. in both national and international domain. Subjects like Electronic commerce, smoking & heart disease, cinema, Tehran international book fair, Football world cup or Global oil price variation. Topic #574 ليیگ برتر ), قهھرمانن Champion of first Pro League ) with 7 relevant items has the smallest number of pertinent documents while Topic #649 نفت ددوولت خاتمی ), بحراانن Khatami government oil crises ) with 266 relevant items has the greatest number of relevant documents. Following the TREC model, each topic is divided into three sections: the title (T), which is a brief title, the description (D) that gives a one-sentence description and the narrative part (N), which specifies the relevance assessment criteria. Although in this experiment only the title section is used. The corpus is coded in UTF Stoplist, stemming and indexing strategies The applied stopwords list contains 881 terms covering the frequent terms such as determinants, prepositions, conjunctions, pronouns, different forms of some auxiliary verbs and also some suffixes (for cases where suffixes are written separately from the word). As indexing strategies different automatic indexing methods is applied on the collection to be evaluated and compared. Two, language-independent, indexing approaches that are used are n-grams (which is the act of producing the overlapping sequences of n characters (McNamee & Mayfield, 2004) and trunc-n (which is the process of truncating a word by keeping its first n characters and cutting of the
6 remaining letters). Different values for n, for both n-grams and trunc-n are tested to find which value of n gives the best performance. For stemming a light suffix-stripping algorithm is used which removes the morphological suffixes (mostly inflections) such as possessive, plural, relative, etc. The removal is adjusted by quantitative restrictions, means that for the removal the length of the term is taken into consideration in order to have a meaningful sequence as the result and also not to remove a whole word entirely. The procedure is mostly focused on nouns and adjectives and different forms of verbs are not taken into account. Another stemmer is also tested on the collection that removes only the plural suffixes. The proposed stoplist and light stemmer are freely available at IR models Six different IR models are implemented in the experiment to be evaluated and compared. The models are as follows: The first model is the classical tf idf model, where the weight for each indexing term t i is the product of its term frequency in the document D j ( tf ) and the logarithm of its inverse document frequency ( idf j ). We normalized the index weights using cosine normalization (Manning et al., 2008). As another vector-space model, the Lnu-ltc model, suggested by (Singhal, 2002), is adopted. In this model the length of the document is taken into account. Here the index weight for the document term (Lnu) is calculated as: w = [log(tf )+1]. norm i [1] 1 with norm i = tf (1 + log( )) ((1 slope) pivot + ( slope nti )) nti Where nt i is the length of document D i (number of its index terms), slope and pivot are constants. The index weight for the query term (ltc) is calculated as: w qj = (log( tf ) + 1) idf norm with qj j q norm q = k ( tf 1 qk idf As the first probabilistic model the Okapi (BM25) (Robertson et al., 2000) model, which also takes the document into account is used. For this model the parameters are fixed as b =0.75, k 1 =1.2 and advl=202. Also two other probabilistic models, DFR-PL2 and DFR-I(n e )C2 based on measuring the divergence from randomness (DFR) family are used (Amati et al., 2002). Here we have: j ) 2 [2]
7 7 w = Inf Inf = log (Prob ( tf )) (1 Prob ( tf )) [3] 2 DFR-PL2 is defined by: Prob 1 e = λ j λ tfn! tfn j [4] tfn Prob 2 = tfn + 1 c mean _ dl with tfn = tf log 2 (1 + ) [5] l i And DFR- I(n e )C2 is defined by: 1 n + 1 n 1 tc j Inf = tfn log with ne = n ( 1 ( ) ) [6] ne n tc 1 Prob 2 j + = 1 df ( tfn + 1) j [7] And finally one approach based on language model (LM) known as nonparametric probabilistic model is employed. Here the adopted model is the one suggested by Hiemstra (Hiemstra, 2000), which defined as follow where λ j is a smoothing factor (set to 0.35 for all index terms) and lc is an estimation of the corpus C length: [ j P( t j d i ) + (1 j ) P( t j C ] P ( d q) = P( d ) λ λ ) [8] with 3.4. Evaluation i i t q j tf P ( t j d i ) = and l i df j P( t j C) = ( lc = df ) k k lc To evaluate the retrieval performance the mean average precision (MAP) is used based on the 100 queries. The usage of mean provides the same level of importance for all the queries. In some cases the average measurements do not sufficiently describe the total performance (e.g., when extreme values influence the average). To overcome this inadequacy a query-by-query analysis is also applied for some of the models and strategies. Looking at specific examples helps to have a more precise understanding of the reasons behind the obtained results and gather more detailed information on how different strategies work.
8 8 4. Results and analysis Table 1 shows the results obtained during the experiment. In the following sections, referring to these results, different aspects are addressed to analyse. Table 1. Mean average precision (MAP) of different IR models and different stemmers Mean Average Precision LM DFR-PL2 tf idf DFR- I(n e)c2 Okapi Lnu-ltc no stem. /no stoplist no stem grams grams grams grams light stemmer light stem. /no stoplist plural stemmer trunc trunc trunc Average IR Models Evaluation Referring to Table 1 the best performing IR model is DFR-I(n e )C2 for any given stemming or indexing strategy. After DFR-I(n e )C2 model the best overall performances are respectively for DFR-PL2 and Okapi models. When comparing DFR-I(n e )C2 model with Okapi model (which has also good MAP results), by applying trunc-4 strategy, DFR-I(n e )C2 model performs better for 72 queries out of 100 (producing a higher AP). But the difference between APs for قيیمت ( #594 Topic each topic is not remarkable. The biggest difference is seen for Flight prices ) for which DFR-I(n e )C2 model performs an AP of , بليیط هھھھوااپيیما while Okapi gives In general number of topics for which the change is bigger than 15% is only 11. Having DFR-I(n e )C2 scheme as the model with the best performance, for this model the best stemming or indexing strategies are respectively trunc-4, light stemmer and trunc-5 (with a small difference between each of them). This is the
9 case for almost all the evaluated models except for the classical tf idf model for which plural stemmer and trunc-5 perform the best. One reason for which the truncating methods, with value of n equal to 4 or 5, performances are approximately the same as light stemmer can be explained as follow: Persian words are normally short terms so cutting off the end of a word after four or five characters either does not change the word at all (the word stays as it is) or leads to cutting off only the suffixes if there is any suffix attached to the term. Performing a query-by-query analysis confirms this conclusion. Comparing the performance of these three methods for the 100 queries using DFR-I(n e )C2 model shows that there is not a big difference between the performances for each query سلامتی وو ( #554 Topic except in 3 or 4 cases. One of these extreme cases is found for health & stress ), here the worth performance is for trunc-4 strategy (AP ااسترسس of compare to for trunc-5 and for light stemmer) this is due to ( سلامم & ااستر ) the fact that by truncating the two terms of the topic, the result causes ambiguity so there are documents with high ranks which are not really جشنهھایی سنتی اايیراانی ) #630 Topic relevant to the subject. Another extreme case is for Iranian traditional celebrations ) for which the light stemmer is the best performing one. With the trunc-4 or trunc-5 strategies the plural and genitive suffixes in celebrations ) are not removed properly. The same problem occurs for, جشنهھایی ) relative suffix یی in the term, سنتی ) traditional ) which still remains attached to the term by applying trunc-4 or trunc-5. But with the light stemmer these suffixes plus the relative suffix یی in, اايیراانی ) Iranian ) will be deleted correctly returning the correct stem and consequently retrieving the documents with any different forms for the stems اايیراانن & سنت جشن ) celebration, tradition & Iran ). As a result an AP of is obtained with light stemmer compare to and with trunc-5 and trunc-4. Same case is for Topic #648 برجهھایی ددووقلو ), حملهھ Twin towers attack ) where by applying the light stemmer the plural and genitive suffixes. برجج and it is correctly indexed under its lemma برجهھایی are deleted from the word This was not the case when applying trunc-4 or trunc-5 methods. For Topic #650 Imported fuel price volatility ) we have Aps of the best AP, نوساناتت بنزيین ووااررددااتی ) is for trunc-5, for trunc-4 compare to for light stemmer. Hear بنزيین the weak performance of light stemmer is due to over-stemming of the term which is stemmed into بنز and causes retrieving of non-relevant documents with higher ranks than relevant ones Differences between stemming and non-stemming approaches Table 2 shows for each stemming strategy the average of its performance for the six different IR models as well as the change percentage of this average over the average performance of no stemming approach. Based on these values the trunc-5 strategy has the best average of and an improvement of 2.6%. With a small difference trunc-4 and light stemmer are the
10 next best ones having an improvement of 2.2%. Applying methods like n-grams (especially with a small value of n) or trunc-3 clearly decrease the retrieval performance comparing to no stemming method. It can also be driven from the results that applying approaches like 3-grams or trunc-3 having more negative impact on a simple model like tf idf than on a robust one like DFR-I(n e )C2. The change percentage of MAP using 3-grams over no stemming approach for tf idf model is -17.9% compare to -2.1% for DFR-I(n e )C2. And by using trunc-3 this percentage is -19.2% for tf idf compare to -2.8% for DFR-I(n e )C2. Table 2. Average performance for stemming strategies & its change percentage over no stemming approach 10 LM DFR- PL2 Mean Average Precision tf idf DFR- I(n e)c2 Okapi Lnultc Average no stem Change % over no stem. 3-grams % 4-grams % 5-grams % 6-grams % light stem % Pl. stem % trunc % trunc % trunc % Considering the best performing model (DFR-I(n e )C2), the MAP results for this model applying different strategies and their change percentage over no stemming method is shown separately in Table 3. The results shows that again the trunc-4 and trunc-5 strategies have the most improvement over no stemming and trunc-3 and 3- grams decrease the performance comparing to no stemming approach. Table 3. MAP for DFR-I(n e )C2 model and different stemming methods & change percentage over no stemming approach MAP for DFR- I(n e)c2 Change % over no stem. no stem. 3gram 4gram 5gram 6gram light stem. plural stem. trunc3 trunc4 trunc % -0.2% -0.0% +0.5% 2.4% +1.4% -2.8% +2.9% +2.8%
11 A query-by-query analysis on this model shows that among the 100 queries trunc-4 improves the average precision for 56 queries comparing to no stemming while no stemming gives a better AP for 63 queries comparing to the results of trunc-3. This analysis also shows that in the cases where stemming strategies give a better result than no stemming approach (strategies with positive change percentages in Table 3), the number of queries for which the AP is improved is more or less equal to which the value of AP decreased. Comparing with no stemming: 6-grams method improves the AP for 50 topics, light stemmer for 51, plural stemmer for 43 and trunc-5 for 47. Results in Tables 2 and 3 reveal that even though light stemmer and plural stemmer approaches give better performance that no stemming approach but the difference is rather limited. One reason can be the fact that for many topics suffixes in are written separately from the words so in these cases the stemming algorithm will not transform the word. Thus the result will be the same as ignoring the stemming phase. For the same reason by looking at results in Table 2 we can see that there is not a notable difference between the MAP obtained with light stemmer and the plural stemmer. Because in the former the plural suffixes are removed by the stemmer and for other kinds of suffixes, if they are detached from the term, the پرووندهه هھھھایی فسادد ( #569 Topics stoplist removal will do the deletion. For example in Tourist جاذذبهھ هھھھایی گرددشگریی ) #617 or Economic corruption cases ) ااقتصاددیی attractions ) where the relative plural and genitive suffixes یی هھھھا and یی are separated from related terms the AP resulted from light stemmer does not differ from the one resulted with no stemming approach. On the contrary, by looking at topics where suffixes are attached to the terms the efficiency of light stemmer can be noticed. For Topic #630 ( جشنهھایی سنتی اايیراانی Iranian traditional celebrations ), where the same suffixes are attached to the searched keywords, the AP with DFR- I(n e )C2 model without stemming is This performance rises to by, حملهھ برجهھایی ددووقلو ) #648 Topic applying the light stemmer. Another example is Twin towers attack ), where again the plural suffixe هھھھا is concatenated to the term tower. In this example, the AP raises from (without stemming) to after applying a light stemmer. Another reason is over-stemming. In topic تورر inflation changes to توررمم Inflation in Iran ) the term توررمم ددرر اايیراانن ) #600 after stemming which is the same stem for the term tour. As a result among the first 10 documents retrieved by applying light stemmer, documents talking about traveling tours can be found which results a decrease of AP from (without stemming) to , after applying the light stemmer N-grams & Trunc-n For all the models except DFR-I(n e )C2 and tf idf the worst results are obtained by applying the 3-grams method. For DFR-I(n e )C2 and tf idf models the worst performance is resulted from trunc-3 and then (with a small difference) by 3-grams. Even though n-grams seems to be an effective indexing strategy for languages such
12 as Korean or Chinese (Abdou & Savoy, 2006), the results here show that, for Farsi, it is not more effective than word-based representations. The bad performance of these methods when choosing a small value for n can be explained by the fact that truncating the words and leaving only the first three letters or splitting them into overlapping sequences of three characters causes lots of ambiguity. For example for Topic #75 دديیدنی ااستانن گلستانن ), مناططق Tourist attractions in Golestan province ) trunc-3 and 3-grams result too much ambiguity that the retrieval functions only based on the name of the province. As a result the second ranked retrieved document talks about meeting between the president and delegates of Golestan province. But by applying 6-grams or trunc-5 the terms attraction and tourist remain as their original form results more accurate matching. From the results depicted in Table 2, it can be deducted that for n-grams and trunc-n approaches the bigger the value of n is the better is the performance. Actually as in Farsi words are usually not too long by choosing the bigger value for n, these approaches work like no stemming approach. In Table 3 the average of performance over the different 6 models shows that there is an increase of 7.8% when using trunc-5 compare to trunc-3 while 6-grams increase the average performance with 8.5% compared to 3-grams approach. When looking at the results for all queries for DFR-I(n e )C2 model it reveals that trunc-5 method gives a better AP than trunc-3 in 68 cases while 6-grams results a better AP than 3-grams in 54 cases. Table 4 presents, for 3 different queries, values of AP resulted from different values of n, in order to give an example of how AP increases by increasing the value of n. However this is not always the case as we can see in Table 5. One reason for this behaviour could be the existence of stems with 3 letters in the query so when using n-grams or trunc-n schemes with n equal to 3, it will be possible to obtain these stem with 3 letters from all the different forms in which they occurred in documents thus causing a better retrieval performance. For instance in Topic #565 خشک سالی ) خساررااتت Drought damages ) there is the term drought which is a compound composed of two terms both having 3 خشک سالی نفت World fuel economy ) the term ااقتصادد جهھانی نفت ) #571 Topic letters, in Fuel and in Topic #636 هھھھواا ا آلوددگی Air pollution the term هھھھواا Air which again consists of 3 letters. Table 4. AP for sample queries resulted from DFR-I(n e )C2 model and n-grams and trunc-n strategies Average Precision trunc-3 trunc-4 trunc-5 3-grams 4-grams 5-grams 6-grams Topic # Topic # Topic #
13 Table 5. AP for sample queries resulted from DFR-I(n e )C2 model and n-grams and trunc-n strategies Average Precision trunc-3 trunc-4 trunc-5 3-grams 4-grams 5-grams 6-grams Topic # Topic # Topic # Stoplist removal In order to analyse the effect of stoplist removal on retrieval effectiveness, the no stemmer strategy and the light stemmer were applied to the six IR models with and without applying stoplist removal. The results of these tests are shown in Table 6 and Table 7. Table 6. MAP for no stemmer strategy with and without stoplist removal & change percentage for each model and for the average MAP LM DFR- PL2 Mean Average Precision DFRtf idf Okapi Lnu-ltc average I(n e)c2 no stem. /no stoplist no stemmer + stoplist Change % +4.1% +3.1% +22.8% -0.4% +3.9% +0.9% +4.3% Table 7. MAP for light stemmer strategy with and without stoplist removal & change percentage for each model and for the average MAP LM DFR- PL2 Mean Average Precision DFRtf idf Okapi Lnu-ltc average I(n e)c2 light stem. /no stoplist light stemmer + stoplist Change % +6.8% +4.3% +27.4% +1.2% +7.7% +3.7% +6.9% Obviously, stoplist removal helps to improve the retrieval effectiveness, as there is an increase of 4.3% (for no stemming) and 6.9% (for light stemmer) of average performance by applying stoplist removal. The results reveal that for both approaches without stoplist removal the DFR-I(n e )C2 model has still the highest MAP compare to other IR models. In fact applying stoplist removal phase has a small impact on the MAP for this model (-0.4% for no stemming approach and 1.2%
14 for light stemmer). Performing the stoplist removal phase has its most impact on tf idf model. In this model as the term weighting depends on the term frequency, obviously in a text without noise a more accurate term weight and similarity calculation can be performed. For Okapi model (where there is also a remarkable change by adding stoplist removal) light stemmer with stoplist removal gives a better AP for 73 queries than without stoplist removal. For the same model and no stemming approach there is better AP results for 69 queries when applying stoplist removal compare to ignoring this step. Taking the Okapi model and light stemmer as an example, when analysing each query separately it can be found Topic #646 بهھ خاررجج اازز اايیراانن ) ااعزاامم Send off from Iran to abroad ) where the AP changed from (without stoplist removal) to (with stoplist removal), or Topic #638 سرمايیهھ گذاارریی ددرر اايیراانن ) مواانع Barriers to make an investment in Iran ) for which the performance changes from (without stoplist removal) to (with stoplist removal). On the other hand examples like Topic #609 بندیی ميیوهه صاددررااتی ) بستهھ Packing export fruit ) can be found where the AP is when stoplist removal is ignored compared to after stoplist removal. The reason of this behaviour for this particular topic is that the term packing in Farsi is a compound formed by two stems both included in the stoplist and thus being deleted from the topic after stoplist removal. This results in retrieval of documents about export fruit subject without covering packing. But such cases are not very often. In fact, cases where ignoring stoplist removal decreased the value of AP more than 10% is seen for only 3 queries. Another issue that can be driven from the results in Table 6 and Table 7 is that applying only stoplist removal without performing any stemming approach gives a better result than applying the light stemmer without stoplist removal. The reason is again the many suffixes not attached to the words which are put in the stoplist and thus removed only by stoplist removal and not by the light stemmer Conclusion From the obtained results in this experiment, having Farsi as the underlying language, the following conclusions can be drawn. In general IR models based on DFR paradigm are giving the best retrieval results for any stemming or indexing strategies. DFR-I(n e )C2 was the best performing IR model followed by DFR-PL2. With the best performing model applying trunc-4, light stemmer and trunc-5, respectively, gives the best retrieving performance compare to other stemming or indexing approaches. Stemming approaches, either light stemmer or plural stemmer, improve the performance comparing to non-stemming approach with plural stemmer being less effective than the light stemmer. Trunc-n stemming strategy when n is equal to 4 or 5 also increases the retrieval performance compare to non-stemming approach. Putting above mentioned indexing and stemming strategies in increasing order of effectiveness the order would be, with a slight difference: trunc-5, light
15 stemmer and trunc-4 and after all the plural stemmer. However, it can be deduced from the results that, for Persian language, either stemming or truncating does not make a significant improvement on performance. N-grams approach for any given n decreases the retrieval performance. The worst performance obtained by using 3- grams and trunc-3 approaches. For Persian language stoplist removal helps a lot to improve he retrieval performance. Actually for this language stoplist removal has a more positive impact on performance than applying light stemmer or the plural stemmer. The query-by-query analysis shows that during stemming phase there are some extreme situations happening because of some exceptions or rules in Persian morphology. These could be handled in several ways such as: Making a more precise morphological analysis in order to add extra rules to the stemming algorithm and hence enhancing the quality of suffix-removal process. Adding some extra controls on suffix-removal process such as defining names (personal, geographic, products, etc.) so as not to stem them. Adding certain rules in order to control the correctness of spelling after suffix removal (for cases where adding suffixes change the spelling). Taking account of part-of-speech (Savoy, 1993). In addition some techniques of query expansion such as pseudo-relevance feedback (PRF or blind-query expansion) can be applied for this language to evaluate their effect on enhancing the retrieval effectiveness. 15 Acknowledgements This work was supported in part by the Swiss NSF under Grant # /1. 6. References Abdou, S., & Savoy, J, Statistical and comparative evaluation of various indexing and search models, AIRS, vol. 4182, 2006, p AleAhmad, A., Kamalloo, E., Zareh, A., Rahgozar, M., & Oroumchian, F., Cross Language Experiments at Persian@CLEF 2008, LNCS, vol. 5706, p Amati, G., & Van Rsbergen, C., Probabilistic models of information retrieval based on measuring the divergence from randomness ACM Trans. Inf. Syst., vol. 20 no. 4, 2002, p Bankhan, M., Sheykhzadegan, J., Bahrani, M., & Ghayoomi, M., Lessons from building a Persian written corpus: Peykar, Language Resources and Evaluation, vol. 45 no. 2, 2011, p Dolamic, L., & Savoy, J., Ad Hoc Retrieval with the Persian Language, CLEF 2009 proceedings, 2009, p
16 Dolamic, L., & Savoy, J., Persian Language, Is Stemming Efficient?, DEXA 2009 proceedings, 2009, p Hiemstra, D., Using language models for information retrieval, CTIT Ph.D.Thesis, McNamee, P., Mayfield, J., Character n-gram tokenization for European language text retrieval, IR Journal, vol. 7 no. 1-2, 2004, p Manning, C. D., Raghavan, P., & Schütze, H., Introduction to Information Retrieval, Cambridge, Cambridge University Press, Robertson, S. E., Walker, S., & Hancock-Beaulieu, M., Experimentation as a way of life: Okapi at TREC, Inf. Process. Manage., vol. 36 no. 1, 2000, p Savoy, J., Stemming of French Words Based on Grammatical Categories, JASIS, vol. 44 no. 1, 1993, p Savoy, J., A stemming procedure and stopword list for general French corpora, JASIS, vol. 50 no.10, 1999, p Savoy, J., Combining multiple strategies for effective cross-language retrieval, IR Journal, vol. 7 no. 1-2, 2004, p Singhal, A., AT & T at TREC-6, 25th ACM Conference on Research and Development in Information Retrieval, ACM/SIGIR, 2002, p
An Information Retrieval using weighted Index Terms in Natural Language document collections
Internet and Information Technology in Modern Organizations: Challenges & Answers 635 An Information Retrieval using weighted Index Terms in Natural Language document collections Ahmed A. A. Radwan, Minia
Morphology. Morphology is the study of word formation, of the structure of words. 1. some words can be divided into parts which still have meaning
Morphology Morphology is the study of word formation, of the structure of words. Some observations about words and their structure: 1. some words can be divided into parts which still have meaning 2. many
TS3: an Improved Version of the Bilingual Concordancer TransSearch
TS3: an Improved Version of the Bilingual Concordancer TransSearch Stéphane HUET, Julien BOURDAILLET and Philippe LANGLAIS EAMT 2009 - Barcelona June 14, 2009 Computer assisted translation Preferred by
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation. Miguel Ruiz, Anne Diekema, Páraic Sheridan MNIS-TextWise Labs Dey Centennial Plaza 401 South Salina Street Syracuse, NY 13202 Abstract:
Pupil SPAG Card 1. Terminology for pupils. I Can Date Word
Pupil SPAG Card 1 1 I know about regular plural noun endings s or es and what they mean (for example, dog, dogs; wish, wishes) 2 I know the regular endings that can be added to verbs (e.g. helping, helped,
Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words
, pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection
Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Gareth J. F. Jones, Declan Groves, Anna Khasin, Adenike Lam-Adesina, Bart Mellebeek. Andy Way School of Computing,
English Appendix 2: Vocabulary, grammar and punctuation
English Appendix 2: Vocabulary, grammar and punctuation The grammar of our first language is learnt naturally and implicitly through interactions with other speakers and from reading. Explicit knowledge
Comparative Analysis on the Armenian and Korean Languages
Comparative Analysis on the Armenian and Korean Languages Syuzanna Mejlumyan Yerevan State Linguistic University Abstract It has been five years since the Korean language has been taught at Yerevan State
26 Mitra Akasereh, Jacques Savoy
CORIA 2012, pp. 25 40, Bordeaux, 21-23 mars 2012 26 Mitra Akasereh, Jacques Savoy Retrieval effectiveness study with Farsi language 27 28 Mitra Akasereh, Jacques Savoy Retrieval effectiveness study with
Third Grade Language Arts Learning Targets - Common Core
Third Grade Language Arts Learning Targets - Common Core Strand Standard Statement Learning Target Reading: 1 I can ask and answer questions, using the text for support, to show my understanding. RL 1-1
Comparative Study of Indexing and Search Strategies for the Hindi, Marathi, and Bengali Languages
Comparative Study of Indexing and Search Strategies for the Hindi, Marathi, and Bengali Languages LJILJANA DOLAMIC and JACQUES SAVOY University of Neuchatel 11 The main goal of this article is to describe
The University of Lisbon at CLEF 2006 Ad-Hoc Task
The University of Lisbon at CLEF 2006 Ad-Hoc Task Nuno Cardoso, Mário J. Silva and Bruno Martins Faculty of Sciences, University of Lisbon {ncardoso,mjs,bmartins}@xldb.di.fc.ul.pt Abstract This paper reports
Albert Pye and Ravensmere Schools Grammar Curriculum
Albert Pye and Ravensmere Schools Grammar Curriculum Introduction The aim of our schools own grammar curriculum is to ensure that all relevant grammar content is introduced within the primary years in
Wikipedia and Web document based Query Translation and Expansion for Cross-language IR
Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University
Writing Common Core KEY WORDS
Writing Common Core KEY WORDS An educator's guide to words frequently used in the Common Core State Standards, organized by grade level in order to show the progression of writing Common Core vocabulary
Strand: Reading Literature Topics Standard I can statements Vocabulary Key Ideas and Details
Strand: Reading Literature Key Ideas and Details Craft and Structure RL.3.1 Ask and answer questions to demonstrate understanding of a text, referring explicitly to the text as the basis for the answers.
Houghton Mifflin Harcourt StoryTown Grade 1. correlated to the. Common Core State Standards Initiative English Language Arts (2010) Grade 1
Houghton Mifflin Harcourt StoryTown Grade 1 correlated to the Common Core State Standards Initiative English Language Arts (2010) Grade 1 Reading: Literature Key Ideas and details RL.1.1 Ask and answer
AP FRENCH LANGUAGE 2008 SCORING GUIDELINES
AP FRENCH LANGUAGE 2008 SCORING GUIDELINES Part A (Essay): Question 31 9 Demonstrates STRONG CONTROL Excellence Ease of expression marked by a good sense of idiomatic French. Clarity of organization. Accuracy
Information Retrieval. Lecture 8 - Relevance feedback and query expansion. Introduction. Overview. About Relevance Feedback. Wintersemester 2007
Information Retrieval Lecture 8 - Relevance feedback and query expansion Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 32 Introduction An information
Ling 201 Syntax 1. Jirka Hana April 10, 2006
Overview of topics What is Syntax? Word Classes What to remember and understand: Ling 201 Syntax 1 Jirka Hana April 10, 2006 Syntax, difference between syntax and semantics, open/closed class words, all
INSTITUT FEMTO-ST. Webservices Platform for Tourism (WICHAI)
INSTITUT FEMTO-ST UMR CNRS 6174 Webservices Platform for Tourism (WICHAI) Kitsiri Chochiang Fouad Hanna Marie-Laure Betbeder Jean-Christophe Lapayre Rapport Technique n RTDISC2015-1 DÉPARTEMENT DISC October
AP FRENCH LANGUAGE AND CULTURE EXAM 2015 SCORING GUIDELINES
AP FRENCH LANGUAGE AND CULTURE EXAM 2015 SCORING GUIDELINES Identical to Scoring Guidelines used for German, Italian, and Spanish Language and Culture Exams Interpersonal Writing: E-mail Reply 5: STRONG
Points of Interference in Learning English as a Second Language
Points of Interference in Learning English as a Second Language Tone Spanish: In both English and Spanish there are four tone levels, but Spanish speaker use only the three lower pitch tones, except when
Module 9. Lesson 9:00 La Culture. Le Minitel. Can you guess what these words mean? surfer le net. chatter. envoyer un mail. télécharger.
Module 9 Lesson 9:00 La Culture Le Minitel The Minitel was a machine used so people could access: Can you guess what these words mean? surfer le net chatter envoyer un mail télécharger être en ligne les
Word Completion and Prediction in Hebrew
Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS
Adlai E. Stevenson High School Course Description
Adlai E. Stevenson High School Course Description Division: World Languages Course Number: SPA201, SPA202 Course Title: Spanish 2 Course Description: Students continue to develop listening, speaking, reading,
LANGUAGE! 4 th Edition, Levels A C, correlated to the South Carolina College and Career Readiness Standards, Grades 3 5
Page 1 of 57 Grade 3 Reading Literary Text Principles of Reading (P) Standard 1: Demonstrate understanding of the organization and basic features of print. Standard 2: Demonstrate understanding of spoken
Incorporating Window-Based Passage-Level Evidence in Document Retrieval
Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological
Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework
Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 [email protected] 2 [email protected] Abstract A vast amount of assorted
Unnatural language detection
Lyon - France Unnatural language detection Thomas Lavergne (Jeune Chercheur) France Telecom R&D 2 av Pierre Marzin 22300 Lannion [email protected] ABSTRACT. In the context of web search
Database Design For Corpus Storage: The ET10-63 Data Model
January 1993 Database Design For Corpus Storage: The ET10-63 Data Model Tony McEnery & Béatrice Daille I. General Presentation Within the ET10-63 project, a French-English bilingual corpus of about 2 million
Differences in linguistic and discourse features of narrative writing performance. Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu 3
Yıl/Year: 2012 Cilt/Volume: 1 Sayı/Issue:2 Sayfalar/Pages: 40-47 Differences in linguistic and discourse features of narrative writing performance Abstract Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu
Linguistic Universals
Armin W. Buch 1 2012/11/28 1 Relying heavily on material by Gerhard Jäger and David Erschler Linguistic Properties shared by all languages Trivial: all languages have consonants and vowels More interesting:
Clinical Decision Support with the SPUD Language Model
Clinical Decision Support with the SPUD Language Model Ronan Cummins The Computer Laboratory, University of Cambridge, UK [email protected] Abstract. In this paper we present the systems and techniques
Ontology-Based Multilingual Information Retrieval
Ontology-Based Multilingual Information Retrieval Jacques Guyot * Saïd Radhouani *,** Gilles Falquet * * Centre universitaire d informatique 24, rue Général-Dufour, CH-1211 Genève 4, Switzerland ** Laboratoire
Historical Linguistics. Diachronic Analysis. Two Approaches to the Study of Language. Kinds of Language Change. What is Historical Linguistics?
Historical Linguistics Diachronic Analysis What is Historical Linguistics? Historical linguistics is the study of how languages change over time and of their relationships with other languages. All languages
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
http://setup.clihome.com/cli.cmap.web/home/maps/viewmapmultipleyear.aspx?teach...
Page 1 of 6 READING - 1st School Teacher Email Course# Grade Level Lettie Brown Elementary School VanDerVoorn, Lauri [email protected] LA1100 1 Show Icon August 2015 August 2014 Review of
Morphemes, roots and affixes. 28 October 2011
Morphemes, roots and affixes 28 October 2011 Previously said We think of words as being the most basic, the most fundamental, units through which meaning is represented in language. Words are the smallest
Strand: Reading Literature Topics Standard I can statements Vocabulary Key Ideas and Details
Strand: Reading Literature Key Ideas and Craft and Structure Integration of Knowledge and Ideas RL.K.1. With prompting and support, ask and answer questions about key details in a text RL.K.2. With prompting
Cohesive writing 1. Conjunction: linking words What is cohesive writing?
Cohesive writing 1. Conjunction: linking words What is cohesive writing? Cohesive writing is writing which holds together well. It is easy to follow because it uses language effectively to guide the reader.
Correlation: ELLIS. English language Learning and Instruction System. and the TOEFL. Test Of English as a Foreign Language
Correlation: English language Learning and Instruction System and the TOEFL Test Of English as a Foreign Language Structure (Grammar) A major aspect of the ability to succeed on the TOEFL examination is
GMAT.cz www.gmat.cz [email protected]. GMAT.cz KET (Key English Test) Preparating Course Syllabus
Lesson Overview of Lesson Plan Numbers 1&2 Introduction to Cambridge KET Handing Over of GMAT.cz KET General Preparation Package Introduce Methodology for Vocabulary Log Introduce Methodology for Grammar
Indiana Department of Education
GRADE 1 READING Guiding Principle: Students read a wide range of fiction, nonfiction, classic, and contemporary works, to build an understanding of texts, of themselves, and of the cultures of the United
Towards a Visually Enhanced Medical Search Engine
Towards a Visually Enhanced Medical Search Engine Lavish Lalwani 1,2, Guido Zuccon 1, Mohamed Sharaf 2, Anthony Nguyen 1 1 The Australian e-health Research Centre, Brisbane, Queensland, Australia; 2 The
Statistical Machine Translation
Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language
Victor Shoup Avi Rubin. fshoup,[email protected]. Abstract
Session Key Distribution Using Smart Cards Victor Shoup Avi Rubin Bellcore, 445 South St., Morristown, NJ 07960 fshoup,[email protected] Abstract In this paper, we investigate a method by which smart
Index. 344 Grammar and Language Workbook, Grade 8
Index Index 343 Index A A, an (usage), 8, 123 A, an, the (articles), 8, 123 diagraming, 205 Abbreviations, correct use of, 18 19, 273 Abstract nouns, defined, 4, 63 Accept, except, 12, 227 Action verbs,
ARABIC PERSON NAMES RECOGNITION BY USING A RULE BASED APPROACH
Journal of Computer Science 9 (7): 922-927, 2013 ISSN: 1549-3636 2013 doi:10.3844/jcssp.2013.922.927 Published Online 9 (7) 2013 (http://www.thescipub.com/jcs.toc) ARABIC PERSON NAMES RECOGNITION BY USING
Glossary of literacy terms
Glossary of literacy terms These terms are used in literacy. You can use them as part of your preparation for the literacy professional skills test. You will not be assessed on definitions of terms during
Year 1 reading expectations (New Curriculum) Year 1 writing expectations (New Curriculum)
Year 1 reading expectations Year 1 writing expectations Responds speedily with the correct sound to graphemes (letters or groups of letters) for all 40+ phonemes, including, where applicable, alternative
KINDGERGARTEN. Listen to a story for a particular reason
KINDGERGARTEN READING FOUNDATIONAL SKILLS Print Concepts Follow words from left to right in a text Follow words from top to bottom in a text Know when to turn the page in a book Show spaces between words
How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.
Svetlana Sokolova President and CEO of PROMT, PhD. How the Computer Translates Machine translation is a special field of computer application where almost everyone believes that he/she is a specialist.
Blaenau Gwent Schools Year 8 Scheme of Work. UNIT 1 January April 2012
Blaenau Gwent Schools Year 8 Scheme of Work UNIT 1 January April 2012 Objectives To express likes and dislikes and give reasons To make considered comparisons in English as well as the TL and give reasons
Year 10 Scheme of Work Expo. Module Topic Units Objectives Language/Grammar
Personal identification/family & Friends/Sport & Leisure 1 - Moi Year 10 Scheme of Work Déjà vu 1: Moi...et quelques autres Déjà vu 2: Les choses que j aime faire talk about yourself talk about others
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
SPELLING DOES MATTER
Focus and content of the Session 1 Introduction Introduction to NSW syllabus objectives: A. Communicate through speaking, listening, reading, writing, viewing and representing B. Use language to shape
Resources: Encore Tricolore 1 (set textbook), Linguascope, Boardworks, languagesonline, differentiated resources devised by MFL staff.
CURRICULUM PLAN MODERN FOREIGN LANGUAGES YEAR 7 FRENCH Resources: Encore Tricolore 1 (set textbook), Linguascope, Boardworks, languagesonline, differentiated resources devised by MFL staff. TERM ONE Introductory
ENGLISH LANGUAGE - SCHEMES OF WORK. For Children Aged 8 to 12
1 ENGLISH LANGUAGE - SCHEMES OF WORK For Children Aged 8 to 12 English Language Lessons Structure Time Approx. 90 minutes 1. Remind class of last topic area explored and relate to current topic. 2. Discuss
Hebrew. Afro-Asiatic languages. Modern Hebrew is spoken in Israel, the United States, the
Jennifer Wagner Hebrew The Hebrew language belongs to the West-South-Central Semitic branch of the Afro-Asiatic languages. Modern Hebrew is spoken in Israel, the United States, the United Kingdom, Australia,
Correlation to the Common Core State Standards for English Language Arts, Grade 3
Correlation to the Common Core State Standards for English Language Arts, Grade 3 Journeys Grade 3 LESSON 1 LESSON 2 LESSON 3 LESSON 4 LESSON 5 1 Houghton Mifflin Harcourt Publishing Company. All rights
Discovering suffixes: A Case Study for Marathi Language
Discovering suffixes: A Case Study for Marathi Language Mudassar M. Majgaonker Comviva Technologies Limited Gurgaon, India Abstract Suffix stripping is a pre-processing step required in a number of natural
stress, intonation and pauses and pronounce English sounds correctly. (b) To speak accurately to the listener(s) about one s thoughts and feelings,
Section 9 Foreign Languages I. OVERALL OBJECTIVE To develop students basic communication abilities such as listening, speaking, reading and writing, deepening their understanding of language and culture
EAST PENNSBORO AREA COURSE: LFS 416 SCHOOL DISTRICT
EAST PENNSBORO AREA COURSE: LFS 416 SCHOOL DISTRICT Unit: Grammar Days: Subject(s): French 4 Grade(s):9-12 Key Learning(s): Students will passively recognize target grammatical structures alone and in
FUNCTIONAL SKILLS ENGLISH - WRITING LEVEL 2
FUNCTIONAL SKILLS ENGLISH - WRITING LEVEL 2 MARK SCHEME Instructions to marker There are 30 marks available for each of the three tasks, which should be marked separately, resulting in a total of 90 marks.
Simple maths for keywords
Simple maths for keywords Adam Kilgarriff Lexical Computing Ltd [email protected] Abstract We present a simple method for identifying keywords of one corpus vs. another. There is no one-sizefits-all
Livingston Public Schools Scope and Sequence K 6 Grammar and Mechanics
Grade and Unit Timeframe Grammar Mechanics K Unit 1 6 weeks Oral grammar naming words K Unit 2 6 weeks Oral grammar Capitalization of a Name action words K Unit 3 6 weeks Oral grammar sentences Sentence
Customizing an English-Korean Machine Translation System for Patent Translation *
Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,
SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON
SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON Essam S. Hanandeh, Department of Computer Information System, Zarqa University, Zarqa, Jordan [email protected] ABSTRACT The massive
A web-based multilingual help desk
LTC-Communicator: A web-based multilingual help desk Nigel Goffe The Language Technology Centre Ltd Kingston upon Thames Abstract Software vendors operating in international markets face two problems:
Archived Content. Contenu archivé
ARCHIVED - Archiving Content ARCHIVÉE - Contenu archivé Archived Content Contenu archivé Information identified as archived is provided for reference, research or recordkeeping purposes. It is not subject
SEPTEMBER Unit 1 Page Learning Goals 1 Short a 2 b 3-5 blends 6-7 c as in cat 8-11 t 12-13 p
The McRuffy Kindergarten Reading/Phonics year- long program is divided into 4 units and 180 pages of reading/phonics instruction. Pages and learning goals covered are noted below: SEPTEMBER Unit 1 1 Short
Minnesota K-12 Academic Standards in Language Arts Curriculum and Assessment Alignment Form Rewards Intermediate Grades 4-6
Minnesota K-12 Academic Standards in Language Arts Curriculum and Assessment Alignment Form Rewards Intermediate Grades 4-6 4 I. READING AND LITERATURE A. Word Recognition, Analysis, and Fluency The student
Data Pre-Processing in Spam Detection
IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 11 May 2015 ISSN (online): 2349-784X Data Pre-Processing in Spam Detection Anjali Sharma Dr. Manisha Manisha Dr. Rekha Jain
Latin and Greek Elements in English
Chapter 1: Dictionaries one purpose of this class is to learn to use the dictionary fully and effectively especially, the etymologies [often in braces] pilgrim, n. [Fr. pelerin; It. pellegrino, from L.
Teaching Vocabulary to Young Learners (Linse, 2005, pp. 120-134)
Teaching Vocabulary to Young Learners (Linse, 2005, pp. 120-134) Very young children learn vocabulary items related to the different concepts they are learning. When children learn numbers or colors in
CURRICULUM GUIDE. French 2 and 2 Honors LAYF05 / LAYFO7
CURRICULUM GUIDE French 2 and 2 Honors LAYF05 / LAYFO7 Course Description The course treats all language learning skills: listening, speaking, reading and writing. Students learn to manipulate structural
Data Deduplication in Slovak Corpora
Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Bratislava, Slovakia Abstract. Our paper describes our experience in deduplication of a Slovak corpus. Two methods of deduplication a plain
Module 8. Lesson 8:00 La Culture Northwestern France contains these 3 regions: 1) 2) 3)
Module 8 Lesson 8:00 La Culture Northwestern France contains these 3 regions: 1) 2) 3) Choose the correct region and place the letter in the blank: A. Bretagne B. Basse Normandie C.Pays de la Loire Mont-St
PDF hosted at the Radboud Repository of the Radboud University Nijmegen
PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is an author's version which may differ from the publisher's version. For additional information about this
CURRICULUM PACING GUIDE GRADE/SUBJECT: /English. 1st Nine Weeks 1
Approximately Alabama Course Of Study/Quality Core Standards. Ask and answer questions to demonstrate understanding of a text, referring explicitly to the text as the basis for the answers [RL.3.] (.)
3rd Grade Common Core State Standards Writing
Text Types and Purposes Writing CCSS.ELA-Literacy.W.3.1 Write opinion pieces on topics or texts, supporting a point of view with reasons. CCSS.ELA-Literacy.W.3.1a Introduce the topic or text they are writing
FRENCH AS A SECOND LANGUAGE TRAINING
FRENCH AS A SECOND LANGUAGE TRAINING Beginner 1 This course is intended for people who have never studied French or people who have taken French in the past but have either forgotten most of it or have
This image cannot currently be displayed. Course Catalog. Language Arts 400. 2016 Glynlyon, Inc.
This image cannot currently be displayed. Course Catalog Language Arts 400 2016 Glynlyon, Inc. Table of Contents COURSE OVERVIEW... 1 UNIT 1: READING AND WRITING... 3 UNIT 2: READING FOR MEANING... 3 UNIT
Axiomatic Analysis and Optimization of Information Retrieval Models
Axiomatic Analysis and Optimization of Information Retrieval Models ChengXiang Zhai Dept. of Computer Science University of Illinois at Urbana Champaign USA http://www.cs.illinois.edu/homes/czhai Hui Fang
The New Forest Small School
The New Forest Small School Spanish For Children Aged 11 to 16 OCR GCSE in Spanish J732 AIMS AND OBJECTIVES To provide: A meaningful and enjoyable educational experience Known and achievable but challenging
SunFDDI 6.0 on the Sun Enterprise 10000 Server
SunFDDI 6.0 on the Sun Enterprise 10000 Server Sun Microsystems, Inc. 901 San Antonio Road Palo Alto, CA 94303-4900 USA 650 960-1300 Fax 650 969-9131 Part No.: 806-3610-11 November 1999, Revision A Send
EAP 1161 1660 Grammar Competencies Levels 1 6
EAP 1161 1660 Grammar Competencies Levels 1 6 Grammar Committee Representatives: Marcia Captan, Maria Fallon, Ira Fernandez, Myra Redman, Geraldine Walker Developmental Editor: Cynthia M. Schuemann Approved:
