Methods for the Extraction of Hungarian Multi-Word Lexemes
Balázs Kis*, Begoña Villada Moirón, Tamás Bíró, Gosse Bouma, Gábor Pohl*, Gábor Ugray*, John Nerbonne
Rijksuniversiteit Groningen
* MorphoLogic, Budapest

Abstract

This paper describes an experiment on extracting Hungarian multi-word lexemes from a corpus using statistical methods. Corpus preparation (the addition of POS tags and stems) was done automatically. From the corpus, verb + noun + casemark patterns were extracted as collocation candidates. Evaluation shows that the statistical methods used by Villada Moirón (2004a) to identify Dutch V + PP collocations can also be applied to the Hungarian data. Some collocation types (such as verbal arguments) require special extraction methods, as explained in the evaluation section. Finally, we suggest that the extraction process can be further improved by a blend of statistical techniques with rule-based and dictionary-based methods.

1 Introduction

This paper presents partial results of a Hungarian-Dutch research project on finding and processing multi-word lexemes. The main research goal is to provide an efficient method to extract multi-word lexemes from Hungarian corpora, in order to have

1. better linguistic coverage for the Hungarian language,
2. a more efficient way to improve lexicons,
3. quality improvement in existing Hungarian linguistic applications.

The method builds on ongoing research on the identification of Dutch collocations by Villada and Bouma (2002) and Villada Moirón (2004a), and it uses the ngram statistical package (Pedersen and Banerjee 2003). So far, a corpus-based statistical method has been implemented for finding Hungarian multi-word lexemes. This presents a special challenge to the methods themselves, as they were developed for a language with a very different morphology and syntax. Although there are considerable results in the field of the automatic identification of collocations and multi-word lexemes in English (Kilgarriff and Tugwell 2001), German (Kermes and Heid 2003), Dutch, etc., no previous research has been carried out for Hungarian at this level of complexity.

After providing our working definition of multi-word lexemes, we describe the tools and the statistical method used for multi-word lexeme extraction in section 2. Section 3 reports on our evaluation results and discusses a few observations. Finally, section 4 proposes improvements and further research, while we summarize our conclusions in section 5.
1.1 The working definition of multi-word lexemes

Multi-word lexemes (MWLs) are potentially a very broad concept. For a unified and coherent definition, we may start from the definition of a lexeme itself: "A lexeme is the basic abstract unit of the lexicon (...)" (Routledge Dictionary of Language and Linguistics 1996). This definition makes no assumptions about the actual wording of a lexeme, i.e. it can be either a single-word or a multi-word expression. Multi-word lexemes become a problem in natural language processing because single-word units have clear boundaries in electronic text, while multi-word units do not. A concept represented by a lexeme can be broken up into words in different ways in different languages. Among European languages, both isolation (English and, in some cases, the Romance languages) and extreme compounding (Dutch, German, Hungarian, etc.) exist. Example (1) is taken from technical terminology.

(1) direction assistée (French)
    stuurbekrachtiging (Dutch)
    szervokormány (Hungarian)
    power steering (English)

The above definition of a lexeme as a basic unit may seem vague; we could also use some more specific synonyms here: a lexeme is not necessarily a minimal unit of language, neither semantically nor in terms of syntax. It is more appropriate to label lexemes as integral units, which are not required to be atomic units at the same time. A lexicographer has an internalized notion of a lexeme when he chooses an expression to be included in a dictionary as a headword. In several cases, dictionary headwords are not atomic units (in terms of morphology and syntax), but they are considered integral semantic units.

Since our goal is to apply computational tools to detect multi-word lexemes in running text, we need a working definition specifying clear features of MWLs that can be used directly in the computational model. Given the theoretical definition, we are looking for pieces of text forming a single semantic unit. Computational modeling of semantics is relatively difficult. Therefore, we may wish to identify more specific features of MWLs. One such feature is (semantic) non-compositionality, where the distinct sense of a lexical unit within a multi-word lexeme cannot be separated, i.e. the meaning of the unit is different from the combination of the meanings of its parts. This type of non-compositionality is difficult to model because it still requires semantics, and several multi-word lexemes do not even display this feature.

If we accept that an MWL might have a non-atomic meaning too (when the MWL is a lexical, morphosyntactic or syntactic compound, having a composite meaning but referring to one specific thing, as in the case of "power steering"), there remains one clear feature that an MWL can display, namely fixedness (Moon 1998). A fixed expression chosen as a model for MWLs is a collocation displaying syntactic irregularity and/or semantic non-compositionality.
Let us emphasize that these characteristics, used as working criteria for differentiating between multi-word lexemes and ordinary collocations, do not imply syntactic non-variability, i.e. the definition still allows for internal modification of the same collocation. Here, we consider collocations as two or more co-occurring unigrams, without any assumptions about their dependence or independence. Let us also note that MWLs form only a subset of fixed expressions, because there are fixed expressions, such as proverbs and idioms, that do not fulfill the original criterion of being a single semantic unit.

Co-occurrence and syntactic irregularity are surface characteristics relatively well recognizable by formal methods. Our hypothesis is that implementing these in a computational model provides an accurate approximation of the ideal set of multi-word lexemes. Within the scope of our experiment, the working model assumes that the parts (component unigrams) of multi-word lexemes (e.g. "take into account") combine with a better-than-chance frequency, i.e. it is more probable for them to occur together than we would expect on the basis of their individual frequencies. By our working definition, the necessary but not always sufficient defining criterion of an MWL is some degree of lexical fixedness; syntactic variability is allowed. However, for some purposes, we still need to look for semantic non-compositionality:

1. In machine translation, we have to determine whether a source-language expression may be translated as a combination, or whether different wording must be used.
2. In computational terminology, we have to determine whether a given expression belongs to the semantic domain in question.

In section 4.3, where we relate our work to the results of other researchers, aspects of computational terminology are emphasized to a great extent. Although terminology is a somewhat narrower concept than multi-word lexemes, published methods of terminology extraction and enrichment are very similar to those described here.

1.2 Cross-language and cross-platform investigation

The resource preparation process is not part of the ngram package. Therefore, new dataset extraction procedures were required to provide suitable input. To this end, we have developed a new collocation candidate extraction package, primarily for use with Hungarian texts, so that we can compare our results to those in other languages (Kis, Villada, Bouma, Bíró, Nerbonne, Ugray and Pohl 2004). This step was taken on the assumption that a computational method is proved stronger if it may be transferred to other languages and/or computational platforms and remains successful, i.e. continues to produce useful results.
2 Statistical investigation of Hungarian collocations

The extraction process for the Hungarian collocations has been designed to provide results compatible with those achieved earlier by Villada Moirón (2004b) and Villada Moirón (2004a). This applies both to the types of collocations extracted and to the way the intermediate results are formatted.

2.1 Extraction tools

Collocation candidates are identified using a Hungarian parser named HumorESK. (The name stands for High-Speed Unification-based Morphology Enriched by Syntactic Knowledge (Prószéky 1996), which indicates that it is a bottom-up parser based on lookups of finite syntactic patterns in a lexicon. In earlier versions, the entire grammar used to be finitized into a single lexicon by means of RTNs, and thus its operation was very similar to that of HuMor, MorphoLogic's morphological analyzer.) This parser is capable of identifying named entities, NPs, VPs, and most sentence structures, and it provides a fairly deep parse. The package provides for intelligent collocation searching, as one is able to specify the types of the components of the collocations and a window size in terms of terminal symbols within which the co-occurrence must appear. In this system, a terminal symbol is either a word or a punctuation mark (with the exception of periods at the end of abbreviations and of decimal separators within numbers, which in Hungarian are commas).

Parsing accuracy of the HumorESK parser is currently being measured. As of now, our estimates indicate a 70 to 98 percent recall in finding NP heads, depending on the type of the text parsed. Precision is difficult to estimate because, on the one hand, the parser identifies all partial NPs, and on the other hand, it performs heuristic filtering by applying highest-level NP rules looking for a longest match. This causes occasional problems with detecting NP boundaries. These are partially corrected by applying higher-level parsing, so some of these problematic parses are discarded. In Hungarian, the detection of NP boundaries is closely related to finding the NP head, because the head is almost always the last word in the NP.

Within the project, we are using the ngram program, developed entirely in Perl by Pedersen and Banerjee (2003). (Formerly known as NSP, the Ngram Statistics Package, ngram is a SourceForge project.) The ngram package was only used to apply the statistical tests and rank the ngrams (expressions) in datasets. The dataset is a collection of candidate ngrams, their observed frequency in the verb + noun + casemark event space, as well as the frequencies of their component unigrams and partial bigrams. This very same scheme is used by Villada Moirón (2004a).
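To make the dataset scheme concrete, the following minimal sketch shows one way such records could be assembled from the extracted trigrams. The Python representation and the field names are our own illustration, not the input format of the ngram package; following the scheme above, all frequencies are counted within the verb + noun + casemark event space.

```python
from collections import Counter

def build_dataset(trigrams):
    """trigrams: iterable of (verb, noun, case) tuples from the extractor.
    Returns one record per candidate ngram, carrying its joint frequency
    and the marginal frequencies needed by the association measures."""
    tri = Counter(trigrams)
    f_v, f_c, f_nc, f_vn = Counter(), Counter(), Counter(), Counter()
    for (v, n, c), k in tri.items():
        f_v[v] += k            # verb unigram
        f_c[c] += k            # casemark unigram
        f_nc[(n, c)] += k      # the <noun, casemark> unit
        f_vn[(v, n)] += k      # the <verb, noun> unit
    total = sum(tri.values())  # size of the event space
    return [{"trigram": (v, n, c), "freq": k,
             "f_verb": f_v[v], "f_case": f_c[c],
             "f_nouncase": f_nc[(n, c)], "f_verbnoun": f_vn[(v, n)],
             "n": total}
            for (v, n, c), k in tri.items()]
```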
2.2 The corpus

On site, we had the recently compiled SZAK Corpus (Kis and Kis 2003) available, an English-Hungarian parallel corpus of technical texts. The SZAK Corpus has been compiled by a publishing company that is now purposefully building a corpus of its publications. The corpus consists of a monolingual subcorpus of original Hungarian works in the field of computing, with a size of approx. 500,000 words, and a bilingual parallel corpus of translated works in computing, with approx. one million words per language. (The texts are copyrighted, but the copyright proprietors have granted us the right to use them for research purposes, provided that the texts are not re-published in their entirety. The corpus includes the full text of all electronically available publications on the backlist of the publisher, resulting in the corpus size mentioned above.) The main goal of the corpus is to facilitate research on terminology and translation studies, with text structures taken into consideration. This corpus did not have morphosyntactic annotation at the time.

During the present experiment, we used the Hungarian components of the corpus (the monolingual part and the Hungarian texts from the bilingual subcorpus) for the statistical tests. The Hungarian subcorpus contains approximately 1.5 million words altogether. A corpus of this size is suitable for seeking the most frequent multi-word lexemes; it is too small to be used to detect infrequent combinations.

2.3 Types of collocations extracted

In the experiment, we investigated verb + noun + casemark patterns as candidates, where the noun is in fact the head of an NP which has the casemark attached as a suffix. This structure has been selected in order to provide results that are to some extent comparable or compatible with those acquired by Villada (ms.). We wished to find a Hungarian construction that most closely corresponds to the verb + PP collocations studied earlier in the Dutch experiment. In Hungarian, the role of prepositions is fulfilled by a case suffix at the end of an NP's head, which is almost always at the end of the NP itself:

(2) Az utca vég-é-n
    the street end-POSSESSED-SUPERESSIVE
    'at the end of the street'

The typological difference between Hungarian and Dutch therefore makes the experiment especially interesting. The Hungarian morphological analyzer separates the case suffix from the lemma; the parser then classifies the case itself based on the surface form of the suffix. Note that Hungarian also has postpositions, but they express more unusual grammatical relations than the approximately twenty cases. Their role in Hungarian is therefore significantly smaller than the role of prepositions in English or Dutch, which is why we have ignored them in our experiment.

2.4 The extraction method

Based on the HumorESK parser described in section 2.1, a special candidate extractor was written that takes a single meta-rule as the description of the collocation we are looking for:

    VX!(lex),NP-FULL!(lex,case):5
Using this metarule, the extractor program extracts trigrams consisting of the lemma of a verb, the lemma of the head of an NP, and the casemark of (the head of) the same NP, where the verb and the NP occur within a window of 5 terminal symbols. (Strictly speaking, we cannot talk about heads in HumorESK, as it does not use a head-driven grammar. However, it specifies explicit feature inheritance: if the grammar has been written accordingly, one can treat the lex feature of an NP node as the lex of its head.) This window is far smaller than the one used by Villada (ms.). It was intentionally chosen, after some empirical experiments, in order to limit the otherwise significant level of noise (i.e. irrelevant co-occurrences of verbs and noun phrases) introduced partly by the lack of disambiguated POS tagging. In addition to the surface and lexical forms, the program is capable of extracting any feature of a node in the parse tree. The need for this is obvious if we aim at extracting the Hungarian entities corresponding to prepositions, which are in fact the types of case marks attached to nouns.

Without disambiguated POS tagging, morphological analysis and parsing are run in a single process. Although parsing is rather deep, and the nature of both the parser and the grammar allows for rules (patterns) overriding other rules, some parsing errors can still occur due to morphological misclassification. Most of these errors are quite obvious and directly related to the types of errors the morphological analyzer is most prone to. We have reviewed a number of morphological misclassifications and developed a filtering mechanism which discards some of the collocation candidates. This filtering program uses metarules similar to those of the extractor. These rules are entirely heuristic and are based on morphological ambiguities where the less probable interpretation (e.g. a noun instead of a number) might have been used to build an NP, a VX, or any other node in the parse tree. This post-filtering mechanism is in place only to limit the misclassification noise in the system, and is inactive when a Hungarian POS tagger is present.
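The following sketch illustrates the windowed co-occurrence search that the metarule describes. It assumes a stream of annotated terminal symbols; the token fields ("cat", "lex", "case") are hypothetical stand-ins for the parser output, not the HumorESK interface.

```python
WINDOW = 5  # window size in terminal symbols, the ":5" of the metarule

def extract_candidates(terminals):
    """Yield (verb_lemma, np_head_lemma, casemark) trigrams for every
    verb and casemarked NP head within WINDOW terminals of each other."""
    for i, tok in enumerate(terminals):
        if tok["cat"] != "VX":
            continue
        window = terminals[max(0, i - WINDOW + 1):i + WINDOW]
        for other in window:
            if other["cat"] == "NP-HEAD" and other.get("case"):
                yield (tok["lex"], other["lex"], other["case"])

sentence = [{"cat": "NP-HEAD", "lex": "figyelem", "case": "ILL"},
            {"cat": "VX", "lex": "vesz"}]
print(list(extract_candidates(sentence)))  # [('vesz', 'figyelem', 'ILL')]
```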
2.5 Statistical measures applied

When selecting statistical functions to apply to our data sets, we relied on the evaluation of the Dutch experiments (Villada, ms.). We disregarded those functions that the Dutch evaluation found less precise, and we selected the two that provided the best precision with the best recall. These two functions were log likelihood (Dunning 1993) and salience (Kilgarriff and Tugwell 2001). Both measures assess the components of an ngram occurring together against each component occurring independently.

The log likelihood score of a bigram is the ratio between two likelihoods: (i) the likelihood of seeing one component of a collocation given that another is present, and (ii) the likelihood of seeing the same component of a collocation in the absence of the other. When the ratio is large, we have evidence of statistical dependence. To compute the log likelihood ratio, we use a simpler formula, namely the log likelihood chi-square ratio G² (see Agresti (2002)). For a bigram (w_i, w_j), G² adds up the products of the observed bigram frequency O_ij and the logarithm of the observed frequency O_ij divided by the expected bigram frequency E_ij:

$$ G^2 = 2 \sum_{i,j} O_{ij} \log_2 \frac{O_{ij}}{E_{ij}} $$

The salience measure is an adjustment to the mutual information test. Mutual information, I(w_i, w_j), compares the probability of seeing the unigrams of an ngram together to the probability of the independent occurrence of each. The salience adjustment multiplies the mutual information score by the logarithm of the ngram's observed frequency, thus promoting frequent ngrams to the top ranks. Kilgarriff and Tugwell (2001) calculate it as follows:

$$ \mathrm{Sal}(w_i, w_j) = I(w_i, w_j) \cdot \log_2 O_{ij} = \log_2 \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)} \cdot \log_2 O_{ij} $$

The two statistics are designed to assess the dependence between two random variables (two unigrams). Because our candidate expressions are <verb, noun, casemark> trigrams, we decided to represent each candidate trigram by two bigrams: <verb, noun casemark> and <verb noun, casemark>. Bearing this in mind, we first compute both statistics for each bigram. Bigrams are ranked according to the statistic, so that candidate bigrams which were assigned a large score get a higher rank than bigrams assigned a small score. In a next step, the ranks of the two partial bigrams are added up, and we order the candidate trigrams according to the resulting rank. The trigrams with the highest ranks are those whose association measure score was higher overall.
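As an illustration, the two measures and the rank combination could be computed along the following lines. This is a sketch based on the formulas above (assuming a standard 2x2 contingency table for G²), not the code of the ngram package.

```python
import math

def g2(o11, f_x, f_y, n):
    """G^2 = 2 * sum O_ij * log2(O_ij / E_ij) over the 2x2 contingency
    table of a bigram <x, y>: o11 = joint frequency, f_x and f_y =
    marginal frequencies, n = size of the event space."""
    obs = [[o11, f_x - o11],
           [f_y - o11, n - f_x - f_y + o11]]
    rows, cols = [f_x, n - f_x], [f_y, n - f_y]
    total = 0.0
    for i in range(2):
        for j in range(2):
            if obs[i][j] > 0:
                expected = rows[i] * cols[j] / n
                total += obs[i][j] * math.log2(obs[i][j] / expected)
    return 2 * total

def salience(o11, f_x, f_y, n):
    """Mutual information weighted by log2 of the observed frequency."""
    mi = math.log2((o11 / n) / ((f_x / n) * (f_y / n)))
    return mi * math.log2(o11)

def ranks(scores):
    """Map each candidate to its rank; rank 1 is the highest score."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {cand: r + 1 for r, cand in enumerate(order)}

def order_trigrams(scores_bigram1, scores_bigram2):
    """Both dicts are keyed by the trigram; each holds the score of one
    of its two partial bigrams. The ranks are added up, and the smallest
    sum (i.e. the best combined rank) comes first."""
    r1, r2 = ranks(scores_bigram1), ranks(scores_bigram2)
    return sorted(r1, key=lambda t: r1[t] + r2[t])
```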
3 Evaluation

We have evaluated the results of ranking the candidates in detail, looking at the output of both the log likelihood and the salience functions. Because this was the very first experiment with this procedure and these texts, we had no hand-tagged data to compare the results to. Thus we decided to manually check those candidates ranked among the first 100 by either the log likelihood or the salience function. Manual checking was performed by 3 independent human judges assigning binary scores: a candidate was assigned 0 if it was not a multi-word lexeme recognizable to the judges, and 1 if it was either a compositional or a non-compositional expression. In a voting scheme, if a candidate received at least two 1 votes (a score of 2), it was accepted as a valid expression. The judges then determined whether a candidate, having received a score of 2 or 3, is compositional or non-compositional. As a baseline, we used the ranking provided by the raw frequency list, i.e. ordering the candidates in the dataset according to the number of occurrences of each candidate in the corpus.
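The acceptance step of this voting scheme amounts to a simple threshold on the summed judge scores, as the following sketch (with invented example candidates) shows.

```python
def accepted_candidates(judgements):
    """judgements: {candidate: (v1, v2, v3)} with each vote in {0, 1}.
    A candidate is accepted when at least two judges voted 1."""
    return {cand for cand, votes in judgements.items() if sum(votes) >= 2}

print(accepted_candidates({
    ("vesz", "figyelem", "ILL"): (1, 1, 1),   # accepted, score 3
    ("megy", "ház", "ILL"): (1, 0, 0),        # rejected, score 1
}))
```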
Table 1 shows that both the salience measure and the log likelihood test yield a substantial number of multi-word lexemes in their top 100. Both tests also outperform the baseline. The candidate rankings produced by the log likelihood and salience tests are rather similar, with a Spearman's correlation coefficient ρ = 0.94 calculated over the ranks of expressions. Nevertheless, the salience method seems more reliable, as it produces less noise in the 100-best list.

    Measure          non-compositional   transparent
    Salience
    Log likelihood
    Raw frequency

    Table 1: Number of multi-word lexemes among the 100 highest ranked trigrams.

A number of collocations were classified as real multi-word lexemes whose meaning is entirely non-compositional, regardless of the technical domain of the corpus. The majority of the multi-word lexemes can be classified as transparent, i.e. semantically compositional by common sense. However, many compositional combinations form a terminological collocation relating to the technical field of the texts in the corpus, such as the trigram kattint + parancsgomb + SUB in example (3).

(3) kattint parancsgomb-ra
    click button-SUBLATIVE
    'click a button'

Thus, most of these transparent collocations are indeed important from the point of view of translation, since they should be translated consistently.

3.1 Error analysis

Some instances of noise remained in the dataset. Some errors were ranked surprisingly high, considerably lowering the overall precision of the statistical methods (see Table 2).

    Measure          Incorrect patterns
    Salience         10
    Log likelihood   15

    Table 2: Evaluation of the influence of parse errors on the 100 highest ranked phrases.

Parse errors typically lead to a situation where human reviewers are unable to reconstruct a phrase corresponding to the pattern. These problematic co-occurrences may be caused by three reasons:

1. morphological misclassification or unwanted ambiguity;
2. accidental co-occurrence of a verb and an NP where the NP is not a real argument of the verb;
3. the unigrams are part of a larger fixed expression.

The second type of error can be eliminated if we use deeper parsing instead of NP chunking and lemmatizing. We could then determine whether the V and the NP belong to the same VP subtree, and list only those co-occurrences where they do. In the following, we present two further observations that account for errors and help in improving the tools we used.

3.2 Hungarian morphological ambiguity spotted and resolved correctly

Let us consider the Hungarian verbal form veszik, which has two possible morphological analyses:

1. It is the 3rd person plural form of the base verb vesz, meaning 'take'. In this meaning, it frequently co-occurs with the illative form of the noun figyelem ('attention'), because the multi-word lexeme figyelembe vesz (plural: figyelembe veszik) means 'take something into account'.
2. It is also the 3rd person singular form of the base verb veszik ('to be lost').

Obviously, the latter verb cannot be combined with the illative case of figyelem (the result would be nonsense such as 'to be lost into attention'). However, the form veszik within a corpus occurrence of the combination figyelembe veszik ('they take into account') will be automatically parsed in both ways by the morphological analyzer. Consequently, such a token increases the number of <vesz, figyelem, ILL> trigrams found, as well as that of <veszik, figyelem, ILL>. Occurrences of the singular form figyelembe vesz, of course, count only towards the first. Table 2 shows the ranks of these two collocations, by both the log-likelihood and the salience measures.

    Collocation                 Rank by salience   Rank by log-likelihood
    vesz + figyelem + ILL
    * veszik + figyelem + ILL
    vesz + igény + ILL
    * veszik + igény + ILL      40                 > 100

    Table 3: Morphological ambiguity represented in the Hungarian collocation tests.
The table also shows another frequent collocation of the same base verb vesz: igénybe vesz, combining vesz with the illative of the noun igény ('claim, demand') and carrying the non-compositional meaning 'make use of something'. This collocation, quite naturally, displays the same ambiguity inherent in the verb form.

Note that the salience function ranked both incorrect interpretations rather highly, while ranking the correct ones even higher. This can be attributed to the facts that (a) the salience measure is adjusted using the frequency of the collocation, and (b) these collocations (using the 3rd person plural form, which displays the ambiguity in question) occur fairly often in the technical corpus we have been investigating. However, when looking at the measures together, we can see that the correct interpretation was ranked considerably higher. Using this information, we can recognize different analyses of ambiguous words, where the collocations of the different interpretations are ranked differently. Here, the more highly ranked interpretation is very likely to be part of a multi-word lexeme, and it is possible that ambiguous words can be very efficiently disambiguated when they occur in these collocations. (This observation, however, requires further investigation. Note also that we needed non-disambiguated corpus annotation to make it.)

3.3 Casemarked NPs as verbal affixes

Similarly to Dutch and German, Hungarian has verbal affixes that form one word with the verb only when the affix immediately precedes the verb. As the following examples with the verb bemegy ('go in, enter') show, different syntactic factors (the presence of an auxiliary, focus, negation, ...) may alter the position of the verbal affix. (The hyphens in (4) were inserted for clarity and do not appear in written texts.)

(4) a. A gyermek be-megy a házba.
       the child in-go the house-ILLATIVE
       'The child enters the house.'
    b. A gyermek nem megy be a házba.
       the child not go in the house-ILLATIVE
       'The child does not enter the house.'
    c. A gyermek be fog menni a házba.
       the child in FUTURE go-INF the house-ILLATIVE
       'The child will enter the house.'

Furthermore, some nouns, together with their case ending, are on the way to becoming similar verbal affixes. Their syntactic behavior is similar, and orthographic tradition writes some of them joined to the verb whenever they immediately precede it. For instance, the noun lét ('being, existence') in the sublative case (létre) and the verb jön ('to come') have created the complex verb létrejön ('come into being, be established'):
(5) a. A szervezet létre-jön.
       the institution being-SUBLATIVE-come
       'The institution gets established.'
    b. A szervezet nem jön létre.
       the institution not come being-SUBLATIVE
       'The institution does not get established.'
    c. A szervezet létre fog jönni.
       the institution being-SUBLATIVE FUTURE come-INF
       'The institution will get established.'

When the latter structures occur as separate words, they are understood together very strongly as a collocation. These are obviously the strongest candidates for multi-word lexemes (see Table 4); however, they are already included in almost all Hungarian dictionaries as single verbs. When written separately, the dataset extraction process spotted these occurrences as verb + noun + casemark collocations, and the statistics ranked these instances very high. The ranks in the table might even understate the case: it is more telling that these collocations are in fact the first occurrences of each verb in the ranked dataset, according to both measures.

    Collocation        Single-verb form   Gloss                Salience   Log-likelihood
    hoz + lét + SUB    létrehoz           'create'             1          1
    vesz + ész + SUB   észrevesz          'notice'             8          12
    jön + lét + SUB    létrejön           'come into being'

    Table 4: Affixed verbs written as separate words in special cases only.

Whenever the noun behaving as a verbal affix was spelled as one word with the verb, the dataset extraction process saw it as a (complex) verb to which an additional NP head was added to form the verb + noun + casemark trigram. Unlike the case when the verbal affix was seen as an argument of the verb, these latter trigrams are far from being multi-word lexemes. In order to assess the statistical techniques, we have therefore compared the rank of the trigram hoz + lét + SUB ('bring into being') to the rank of trigrams like létrehoz + fájl + ACC ('create a file'). The difference in ranking is highly significant: the latter type, even when possibly forming a terminological collocation, does not appear within the 250 highest ranked candidates.
4 Suggestions for improvement

4.1 Improving the statistical methods

In order to improve the precision and reliability of the statistical extraction process, three issues need to be addressed:

1. The accuracy of the morphological analysis of the corpus text should be increased, either by introducing a (weakly) disambiguating POS tagger or by improving the Hungarian morphological analyzer in general.

2. Larger corpora, and corpora on other domains, should be used instead of, or in addition to, the current technical corpus. The acquisition of larger corpora, the compilation of a new corpus, and the enrichment of the technical corpus are in progress.

3. The Hungarian parser should be used more efficiently. The present extraction scheme uses the verbal (VX) and NP nodes independently; all it checks for is their co-occurrence within a specified window. We expect to improve this by ensuring that the co-occurring nodes, i.e. the components of the collocation, are in fact children of the same VP node (see the sketch below). We suppose that some of the false argument errors can be eliminated this way.
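Item 3 amounts to an ancestor check in the parse tree. A minimal sketch, assuming a hypothetical constituency-tree interface rather than the actual HumorESK one:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                       # e.g. "VP", "VX", "NP"
    parent: "Node" = None
    children: list = field(default_factory=list)

def nearest_vp(node):
    """Walk up the tree to the closest dominating VP, if any."""
    while node is not None:
        if node.label == "VP":
            return node
        node = node.parent
    return None

def same_vp(verb_node, np_node):
    """Keep a verb-NP candidate only if both nodes share a VP ancestor."""
    vp = nearest_vp(verb_node)
    return vp is not None and vp is nearest_vp(np_node)
```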
It is also possible to improve dataset extraction through detailed corrections. This includes the proper treatment of verbal affixes, which are at present discarded by the parser at a lower level. Furthermore, nominal postpositions (rare in Hungarian, and therefore presently ignored) could be processed as further case categories.

Other types of collocations should also be investigated. For example, a V + casemark dataset could be derived from the verb + noun + casemark collocations. This would provide important information on Hungarian verbal argument structures. However, this can produce sensible results if and only if the relatively free Hungarian argument structure is taken into account: the verb and any of its arguments can form a candidate regardless of their relative position in the surface order.

4.2 Introducing further methods

Statistical measures calculated from monolingual corpora are only one technique for identifying multi-word lexemes. The extraction of multi-word lexemes could potentially be implemented as a blend of different methods, of which statistical investigation is only one, though not an unimportant one. As mentioned in section 2.1, statistical methods detect fixed expressions without regard to semantic non-compositionality, which may indeed need to be detected in cases where multi-word expressions do not differ in combinatorial frequency.

There are fields of collocation research where both rule-based and statistical methods are applied; real strength lies in the combination of the two. As we have access to many bilingual dictionaries and are working to develop various translation (support) tools, it is an obvious step to investigate collocations through their translations. If a collocation has a non-compositional translation in another language, chances are that its meaning is not compositional either.

Our most ambitious non-statistical idea for approaching multi-word lexemes is a contrastive method that examines collocations through their translations in another language. This approach is still being worked on. However, the basic idea is a very practical one: it proposes using a blend of several methods, including corpus-based statistical investigation, dictionary-based lexicon enrichment, and contrastive tests based on phrase-level alignment. We would require a parallel corpus aligned at the sentence level and a high-quality bilingual dictionary. If there is a given type of collocation in the text of one language (the one under investigation), such as an NP or a V + NP + casemark pattern, and there is at least one word in it whose obvious (dictionary-based) translations cannot be found in the alignment pair, it is a good candidate for further testing.
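A minimal sketch of this contrastive test, assuming sentence-level alignment and a lemma-to-lemma dictionary (all names and data below are illustrative):

```python
def contrastive_candidates(aligned_pairs, dictionary):
    """aligned_pairs: iterable of (collocation, target_lemmas), where
    collocation is a tuple of source-language lemmas and target_lemmas
    is the set of lemmas in the aligned target-language sentence.
    A collocation is flagged when at least one of its words has none of
    its dictionary translations in the alignment pair."""
    for collocation, target_lemmas in aligned_pairs:
        for word in collocation:
            translations = dictionary.get(word, set())
            if translations and not (translations & target_lemmas):
                yield collocation
                break

pairs = [(("vesz", "figyelem"), {"they", "take", "it", "into", "account"})]
dic = {"vesz": {"take", "buy"}, "figyelem": {"attention"}}
print(list(contrastive_candidates(pairs, dic)))  # [('vesz', 'figyelem')]
```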
4.3 Relation to similar fields

There are research fields similar to the generalized research on multi-word lexemes that deserve attention. One such field is computational terminology (Castellví, Bagot and Palatresi (2001), Jacquemin (2001)), which employs more or less the same methods, namely

1. statistical collocation testing,
2. collocation extraction by means of parsing,
3. dynamic methods that start from a reference glossary and use either statistical or heuristic methods to enrich it.

Another closely related field is named entity recognition, whose methods offer good and simple examples of learning extraction rules. The Hungarian parser in fact employs a robust (still heuristic) named entity recognizer implemented as a subset of its grammar.

5 Conclusion

We have investigated Hungarian multi-word lexemes. We applied a reliable statistical scheme (used earlier by Villada (ms.) and Villada Moirón (2004a)) to a new and typologically different language. The evaluation of the initial experiments shows that the same statistical methods can be applied to Hungarian corpora; however, some collocation types (such as verbal arguments) require special extraction methods.

This project attempts to improve on existing methods not by inventing entirely new extraction schemes; the emphasis is rather on constructing an efficient blend of existing heuristic, statistical, and dictionary-based methods. These methods can be used as cascaded modules, or in a parallel manner, voting on each candidate. The ongoing work is directed towards improving the dataset extraction procedures for the statistical methods and, at the same time, developing dictionary-based methods for testing candidates through their translations.

Acknowledgments

This work has been carried out in parallel with similar work on Dutch corpora by a joint Dutch-Hungarian research group supported by NWO-OTKA under grant number.

References

Agresti, A. (2002), Categorical Data Analysis, John Wiley and Sons, New York.

Castellví, M. T. C., Bagot, R. E. and Palatresi, J. (2001), Automatic term detection: A review of current systems, in D. Bourigault, C. Jacquemin and M.-C. L'Homme (eds), Recent Advances in Computational Terminology, John Benjamins, Amsterdam/Philadelphia.

Dunning, T. (1993), Accurate methods for the statistics of surprise and coincidence, Computational Linguistics 19(1).

Jacquemin, C. (2001), Spotting and Discovering Terms through Natural Language Processing, MIT Press, Cambridge (Mass.).

Kermes, H. and Heid, U. (2003), Using chunked corpora for the acquisition of collocations and idiomatic expressions, Papers in Computational Lexicography: Proceedings of COMPLEX 2003, Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest.

Kilgarriff, A. and Tugwell, D. (2001), Word sketch: extraction and display of significant collocations for lexicography, Proceedings of the 39th ACL and 10th EACL Workshop "Collocation: Computational Extraction, Analysis and Exploitation", Toulouse.

Kis, A. and Kis, B. (2003), A prescriptive corpus-based technical dictionary: development of a multi-purpose technical dictionary, Papers in Computational Lexicography: Proceedings of COMPLEX 2003, Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest.

Kis, B., Villada, B., Bouma, G., Bíró, T., Nerbonne, J., Ugray, G. and Pohl, G. (2004), A new approach to the corpus-based statistical investigation of Hungarian multi-word lexemes, Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC) 2004, Lisbon, Portugal.

Moon, R. (1998), Fixed Expressions and Idioms in English: A Corpus-Based Approach, Oxford University Press, Oxford, UK.
Pedersen, T. and Banerjee, S. (2003), The design, implementation and use of the Ngram Statistics Package, Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, Mexico.

Prószéky, G. (1996), Syntax as meta-morphology, Proceedings of COLING-96, Vol. 2, Copenhagen, Denmark.

Routledge Dictionary of Language and Linguistics (1996), Routledge.

Villada, B. and Bouma, G. (2002), A corpus-based approach to the acquisition of collocational prepositional phrases, Proceedings of EURALEX 2002, Center for Sprogteknologi, Copenhagen, Denmark.

Villada Moirón, M. B. (2004a), Discarding noise in an automatically acquired lexicon of support verb constructions, Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC) 2004, Lisbon, Portugal.

Villada Moirón, M. B. (2004b), Acquisition of Dutch support verb collocations: a model comparison, ms., Groningen University. URL: rug.nl/begona/papers/svcmodels.ps