Aligning Word Senses in GermaNet and the DWDS Dictionary of the German Language
|
|
|
- Marvin Sparks
- 5 years ago
- Views:
From this document you will learn the answers to the following questions:
What is the German word for the German wordnet GermaNet?
What were the initial set of 47 randomly selected words?
What do lexical resources display variation in the number of word senses that lexicographers assign to a given lexical entry?
Transcription
1 Aligning Word Senses in GermaNet and the DWDS Dictionary of the German Language Verena Henrich, Erhard Hinrichs, Reinhild Barkey Department of Linguistics University of Tübingen, Germany Abstract A comparison and alignment of lexical resources brings about considerable mutual benefits for all resources involved. For all sense distinctions that are completely parallel in two resources, such an alignment provides supporting external evidence for the validity of sense distinction and allows enriching word senses by information contained in the other resource. By contrast, for all non-matching sense distinctions, reason for revisiting and possibly revising the lexical entries in question is provided. The purpose of this paper is to compare the German wordnet GermaNet with the Digital Dictionary of the German Language (DWDS) and to align word senses in the two resources. The paper presents issues that arise in practice when such an alignment is performed and indicates the benefits that both resources will gain. Introduction It has long been recognized that the identification and differentiation of word senses is one of the harder tasks that lexicographers face. As a result, lexical resources display considerable variation in the number of word senses that lexicographers assign to a given lexical entry in a dictionary. Against this background, lexicographic practice has undertaken considerable efforts to find external knowledge sources that can aid in distinguishing and identifying word senses. The external knowledge sources that are most widely used for this purpose are very large electronic corpora that can be harvested for a given word under lexicographic consideration. Another type of resource that has also been explored as an external reference point is the comparison with another semantic dictionary that has been constructed independently for the same language. The present paper reports on an ongoing project in which the German wordnet GermaNet (Hamp and Feldweg, 997; Henrich and Hinrichs, 2) is compared to the word senses contained in the Digital Dictionary of the German Language (Digitales Wörterbuch der Deutschen Sprache, DWDS; Klein and Geyken, 2). Both resources are long-term lexicographic projects aiming at a comprehensive coverage of contemporary standard German in electronic form. What makes a comparison between these resources particularly interesting and useful is the fact that they utilize two different methods for constructing word meanings. The DWDS is based on the digital versions of three pre-existing dictionaries: the Dictionary of Contemporary German (Wörterbuch der deutschen Gegenwartssprache, WDG), the Grimm Dictionaries Deutsches Wörterbuch von Jacob Grimm und Wilhelm Grimm (DWB) and its revised version (2DWB), as well as the Etymological German Dictionary by Wolfgang Pfeifer (EtymWb). The lexical entries inherited from these dictionaries have been revised and amended by information harvested from large electronic corpora of contemporary German (Didakowski et al., 22). DWDS lexical entries are structured by the number of senses which may be further differentiated by an enumeration of subsenses. Senses are accompanied by examples harvested from German text corpora or by so-called competence examples that are manually constructed. The conception of word meaning underlying GermaNet adheres to the idea of a network of meaningfully related words and concepts that are interlinked by a set of lexical and conceptual relations and that was first realized in the Princeton WordNet for English (Fellbaum, 998). The set of lexical and conceptual relations include synonymy, hypernymy/hyponymy, meronymy/holonymy, causation, antonymy, and pertainymy. The comparison of GermaNet and the DWDS dictionary will focus on the alignment of Germa- 63
2 Net senses (synsets and lexical units) with the senses and subsenses of DWDS lexical entries. The benefits of this GermaNet-DWDS comparison include the following: If the set of sense distinctions match for a given word lemma in both resources, then this provides supporting external evidence for the validity of these sense distinctions. If the set of sense distinctions differ between the two resources, then this provides reason for revisiting and possibly revising the lexical entries in question. Apart from the comparison of word senses, each resource stands to gain from the GermaNet- DWDS mapping in the following ways: It becomes possible to implement an intelligent semantic search for the DWDS that provides users not just with the word senses of a given lexical entry but also with lexical information about related words. GermaNet synsets and lexical units can be enriched by suitable definitions as well as examples contained in the DWDS. The purpose of this paper is to present the results of a pilot study that concentrates on a set of issues that arise in practice when such a mapping is performed. 2 Survey of the Overlapping Coverage The total number of lemmas that have lexical entries in both resources is 48,36 2 (6,2 adjectives, 34,366, and 7,735 ), which covers about 53.5% of all lemmas encoded in GermaNet. At first glance, this overlap might seem low. However, on a closer look, there is an explanation for this which mainly concerns the following three points: The history of the two resources causes differences in coverage. The DWDS is based on the digital versions of three pre-existing dictionaries that do not include most recent contemporary language. By contrast, the terms to be included in GermaNet follow frequency lists extracted from large corpora such as newspaper texts and Wikipedia, which also contain recent contemporary language. Both resources have different basic decisions on what terms and senses should be included. The perspectives and guidelines that the lexicographers of both resources pursue differ. For example, the resources deviate in the inclusion of regional, obsolete, technical, and colloquial terms as well as most recent contemporary language. This further explains why the coverage of GermaNet and the DWDS differs. Since compounding is a highly productive phenomenon of the German language, the question of which compounds to include in a lexical resource is not trivial to answer. There are many newly created compounds that eventually after some undefined time and depending on the frequency of general usage might become part of the fundamental vocabulary of the German language. Thus, especially for the coverage of compounds, there is a huge deviation between the two resources. Since senses in the DWDS might be further differentiated by an enumeration of subsenses, a survey on word senses involves more than one comparison. GermaNet distinguishes 59,495 senses for the 48,36 lemmas that the two resources share. The overall number of 6,53 main sense distinctions in the DWDS is very similar. On the contrary, the number of main senses plus subsenses on the highest level encoded in the DWDS is 74,346, which is more than in GermaNet. This suggests a mapping on the main sense level of the DWDS. The outcome of this survey proves that there is a considerable overlap of word lemmas with a comparable amount of senses in both resources, which supports the usefulness of conducting a sense alignment. 3 Evaluation of the Sense Alignment In order to be able to evaluate the alignment on the level of senses and subsenses, the lexical entries for an initial set of 47 randomly selected word lemmas (see Section 4 for the selection process) have been manually analyzed with regard to the appropriateness of matching senses from one resource onto the senses in the other resource. The variability of how good the senses can be matched leads to a division into four classes that are illustrated and described in the following four subsections in descending order according to their alignment appropriateness. 2 All numbers are calculated on GermaNet s current release 8. as of April 23 and on the DWDS subset taken from version.4.7 and filtered for all lexical entries for lemmas covered by both resources. This filtered subset has been made available to us on August 9,
3 Figure : Sense mapping using the example of Pferd (class ). 3. Class : Exact match of main senses Class represents the ideal case, i.e., senses in GermaNet correspond to main senses in the DWDS. The German noun Pferd is a case in point. As illustrated in Figure, this lemma has the three distinct senses in both resources representing an animal horse, a gymnastic horse, and a chess knight. All word senses that fall into this class show an identical overlapping lexical coverage and an identical granularity level of sense distinctions. For both GermaNet and the DWDS, this provides mutual supporting evidence for the validity of these sense distinctions. For GermaNet, the obvious gain for all these senses is an enrichment by suitable definitions and examples contained in the corresponding DWDS senses. For the DWDS, it becomes possible for all these senses that an intelligent semantic search provides users not just with the word senses of a given lexical entry but also with lexical information about related words. 3.2 Class 2: Exact match of subsenses There are several senses in GermaNet that do not correspond to main senses in the DWDS but which correspond to subsenses in the DWDS. These latter ones are included in class 2. Figure 2 gives an example using the word Bogen. In GermaNet, there are two distinct senses representing a violin bow and a bow as a weapon (see the left side of Figure 2). In the DWDS, there is a main sense described as gebogenes Gerät curved device which is further differentiated into the two subsenses of a violin bow (sub ) and a bow as a weapon (sub 2) see the two entries denoted by sub and sub 2 on the right side of Figure 2. The overall coverage for these senses is the same. It is only the granularity level of the sense distinctions that differs. The reason for this difference results from different perspectives and guidelines of how to model word senses that the lexicographers of both resources pursue. There is an agreement between lexicographers of both resources that the two senses under consideration should be modeled separately. The question of whether to constitute two separate word senses or two subsenses of a common main sense is bound to the nature of the resources, i.e., Germa- Net does not further distinguish word senses into subsenses. The senses that fall into this class again provide support for the validity of the sense distinctions for both resources. Furthermore, the enrichment of GermaNet senses with definitions and examples as well as the enrichment of DWDS senses with information on related words is equally possible than it is described for class. 3.3 Class 3: Partly overlapping coverage and different sense distinctions Class 3 contains senses for which a straightforward one-to-one mapping is not possible. This includes the following two cases: (i) two separate senses from one resource are jointly represented by only one sense in the other resource and (ii) the core meaning of two senses is the same, but the two senses are still not completely identical in their coverage. The German noun Pranke is a case in point for case (i). The DWDS encodes a sense defined as Vordertatze, besonders von großen Raubtieren; umgangssprachlich, scherzhaft, übertragen: große, starke Hand forepaw of an animal, especially a predator; colloquial, jokingly, figurative: big, strong hand (see the right side in Figure 3). In GermaNet, Pranke has the two fine-grained senses denoting the paw of an animal and the 65
4 Figure 2: Sense mapping using the example of Bogen (class 2). figurative term for the human hand (see the left side in Figure 3). In this example, both these more specific GermaNet senses are subsumed under one single DWDS sense. In the second case (ii) that is subsumed by class 3, there is no complete coverage of the meaning of one sense in one resource with the corresponding sense in the other resource. The core sense is mostly identical, but there are meaning aspects that led the lexicographers to decide differently on whether to explicitly encode a separate sense in the dictionary or not. An example of this type is the German noun Sturm storm. Both GermaNet and the DWDS encode a sense referring to the weather phenomenon. Accompanying example sentences of this word sense in the DWDS include instances exemplifying a figurative usage, such as, for example, ein Sturm der Entrüstung a storm of indignation. That means, the figurative meaning of Sturm is explicitly mentioned in the DWDS weather phenomenon sense without encoding a separate sense or subsense. By contrast, the figurative meaning of Sturm is not present in GermaNet neither as part of the corresponding weather phenomenon sense nor explicitly as a separate sense. The phenomena of both cases (i) and (ii) cannot solely be explained by the lexicographic background of the two resources. They rather illustrate different lexicographic perspectives of how to distinguish senses of a word. The question at what point a meaning should be regarded as a distinct sense or subsense to be included in a dictionary is a difficult issue in lexicographic work. Aspects that affect this decision include figurative meaning, technical, colloquial, or regional usage of a term. Both in the paw and in the storm examples, the lexicographers of the two resources have made different decisions with respect to the status of the figurative meaning of a word sense. As for the benefit from a mapping of senses in this class, it would mean that each example sentence for the DWDS senses in question has to be analyzed individually in order to decide whether it can be assigned to a GermaNet sense. Nonetheless, it is interesting to further analyze these cases since they concern the identification and differentiation of word senses which is one of the harder tasks that lexicographers face. 3.4 Class 4: Distinct coverage This class comprises lemmas where there is at least one sense or subsense in one resource that does not have a corresponding entry in the other resource. An example of this kind is illustrated in Figure 4 using the example of Maus. For this word, GermaNet encodes the two senses of the mouse as an animal and the computer mouse (see the left side of Figure 4). The DWDS also encodes the animal sense of a mouse, but it does not include the computer mouse sense. Instead, the DWDS lists Mäuse (plural for Maus) in the sense of an informal synonym for money (see the right side of Figure 4). As illustrated in the mouse example, both resources gain benefit from a sense alignment by mutually providing suggestions of possibly missing senses. In general, with the help of simple word comparisons, it is easy to automatically compile lists of lemmas that serve as candidates to be inserted into a dictionary. By contrast, it is much more difficult to provide (automatic) suggestions of possibly missing senses. In all cases where the sense alignment discovers different 66
5 Figure 3: Sense mapping using the example of Pranke (class 3). sets of sense distinctions between GermaNet and the DWDS, this provides reason for revisiting and possibly revising the lexical entries in question. 4 Evaluation Statistics The selection of the initial set of manually aligned word lemmas is guided by the following criteria: The selected words include all three word classes of adjectives,, and. In order to ensure a detailed evaluation of lexical items with different degrees of polysemy, the evaluation reports results for five different polysemy classes: words having (i) one sense in GermaNet, (ii) two senses in GermaNet, (iii) three or four senses, (iv) five to ten senses, and (v) more than ten senses in GermaNet. The sample as a whole represents a good balance of word classes and number of distinct word senses. That is, for adjectives and, 35 lemmas were randomly selected for each of the polysemy classes (i) to (v). Since the coverage for is higher compared to the coverage of the other two word classes, 5 nominal lemmas were randomly chosen for each polysemy class. Table shows the total number of word lemmas and corresponding word senses (in parentheses) in each polysemy class for the three word classes 3 that were manually aligned by two experienced lexicographers. Column All POS contains the summed numbers for all word classes (i.e., partof-speech, POS) separately for the polysemy classes. 3 The information both about the number of distinct word senses as well as about the word category of the lemmas is taken from GermaNet. Senses Adjectives Nouns Verbs All POS 35 (35) 5 (5) 35 (35) 2 (2) 2 35 (7) 5 () 35 (7) 2 (24) (4) 5 (6) 35 (2) 2 (387) 5 8 (5) 5 (282) 35 (29) 93 (542) > 3 (36) 4 (92) 7 (228) Total 3 (27) 23 (629) 54 (68) 47 (,57) Table : Aligned word lemmas (corresponding word senses in parentheses) and their sense distributions Note that the number of lemmas for adjectives with three or four senses and for and with more than ten senses is lower than mentioned above. The reason is simply because there are only few lemmas encoded both in GermaNet and the DWDS that fall into these classes, i.e., 8, 3, and 4, respectively. Adjectives with more than ten senses do not exist at all. Altogether, 47 distinct word lemmas were manually checked by the lexicographers. These lemmas correspond to,57 senses (in Germa- Net) of which 3 adjectives, 23, and 54. That is, the 47 words have an average of 3.2 senses (2.4 for adjectives, 3. for, and 4. for ). With the help of the manual sense alignment, it is possible to classify senses according to their alignment appropriateness, i.e., into classes to 4 described in Sections Table 2 lists the counts of these,57 Germa- Net senses classified into the four alignment classes separately for the previously defined polysemy classes (columns). The rightmost column depicts the overall results without classifying words with respect to their number of different senses. The rows show the different alignment classes 4 separately for each of the three word categories of adjectives,, and. The last row (All cl.) sums all aligned senses for each word class per polysemy class. Rows marked with denote results for all word categories. 67
6 Figure 4: Sense mapping using the example of Maus (class 4). Class Class 2 Class 3 Class 4 All Cl. Senses in GermaNet > Total (47%) (53%) (4%) (47%) Table 2: Sense distribution of the different alignment classes 36 (3%) 6 (%) (8%) 26 (4%) 92 (34%) 53 (24%) 22 (36%) 465 (3%) 6 (7%) 8 (3%) 38 (6%) 35 (9%) 27 (%) 629 (%) 68 (%),57 (%) The numbers in Table 2 count senses rather than lemmas. Note that this implies that senses of a single lemma do not necessarily all have to be classified to the same alignment class but can belong to different classes what arises quite frequently in practice. An example of this kind is the lemma Maus which has already been discussed in Section 3.4 (see Figure 4). The first GermaNet sense depicting the mouse as an animal has a corresponding main sense on the DWDS side; meaning that this sense is counted for alignment class. On the contrary, the second GermaNet sense for this lemma, which represents the computer mouse sense, does not have a corresponding match on the DWDS side. Thus, the second sense has to be counted for class 4. 5 Discussion of the Results To begin with the most prominent and important result, classes (exact match of main senses) and 2 (exact match of subsenses) together arise in 6% of all cases, i.e., 47% and 4%, respectively see Table 2. This suggests that for three out of five word senses from GermaNet there is a matching sense in the DWDS with which a GermaNet sense can be aligned. This underscores the overall feasibility of a sense alignment between the two lexical resources. The obvious gain for all these senses is the mutual enrichment by sense-specific information such as suitable definitions, examples, and lexical relations taken from the matching sense. Class arises in 47% of all cases and thus much more frequently than all other classes. The fact that matches between GermaNet senses and main senses in the DWDS (class ) outnumber matches between GermaNet senses and subsenses in the DWDS (class 2) was to be expected. This confirms the conception of word senses on the same granularity level in both resources. Both classes 3 (partly overlapping coverage and different sense distinctions) and 4 (distinct coverage) reveal differences between GermaNet and the DWDS that prevent a straight forward sense alignment. The explanation why class 3 arises in 3% of all cases, i.e., why there are differences in the distinction of senses, is due to the lexicographic background of the two resources. The lexicographers of GermaNet and the DWDS pursue different perspectives and guidelines of how to model word senses, e.g. with respect to the sense granularity. Thus, from a lexicographer s perspective, it is interesting to analyze these cases since they concern the identification and differentiation of word senses which is one of the harder tasks that lexicographers face. To gain benefit from a mapping of senses in this class, it would mean that all information for a 68
7 sense has to be analyzed in order to be individually assigned to a corresponding sense. Class 4, which indicates a distinct coverage of GermaNet and the DWDS, shows fewest occurrences. In only 9% of all GermaNet senses, there is no corresponding entry in the DWDS. It should be kept in mind that this number only applies to those 48,36 lemmas that are encoded in both resources. For all remaining lemmas, there are no lexical entries in the DWDS at all and thus these word senses would fall into class 4 as well. The evaluation for class 4 is biased towards one direction, i.e., it regards GermaNet senses with missing entries in the DWDS. Since it is also interesting to analyze and compare the other way around where there are DWDS senses lacking matches in GermaNet, these cases have also been recorded during the manual alignment. Altogether, there are 384 word senses (22 adjectival, 4 nominal, and 58 verbal senses) in the DWDS that do not have a corresponding entry in Germa- Net. In all cases where the sets of sense distinctions differ between the two resources, this provides reason for revisiting and possibly revising the lexical entries in question. Of course, this also applies to all those word lemmas for which there is a lexical entry in only one of the two resources. A comparison of the results for the three different word classes and polysemy classes yields the following tendencies: Words with only one GermaNet sense almost exclusively fall into class for all three word classes. This is not surprising since those words usually have one or few senses in the DWDS and thus the probability that the same most prominent sense of a word is encoded in both resources is significant. More than half of all (53%) fall into class much fewer (%) fall into class 2. By contrast, there are only 4% of all in class, but proportionally almost twice as many (8%) classified to class 2 compared to. This is especially remarkable for with more than four senses. One reason for this difference is the variety in the granularity level of the sense distinctions in GermaNet and the DWDS which arises more often for than for. The deviation between the three word classes for polysemous words, i.e., words with more than one sense in GermaNet, is interesting to observe. Adjectives and show a proportionally larger number of occurrences in class 3 (34% and 36%, respectively) compared to (24%). This means that there are more words with a partly overlapping coverage and different sense distinctions for adjectives and than for, e.g., where two senses from one resource jointly describe one sense of the other resource. By contrast, this ratio is reversed for class 4, where there are proportionally nearly twice as many occurrences for (3%) than for adjectives and (7% and 6%, respectively). The explanation for this is that there are more nominal senses that are not encoded in one resource, but more adjectival and verbal senses that encode an overlapping coverage with a different distinction of senses. All in all, it is worthwhile to perform a complete sense alignment between GermaNet and the DWDS. This will open up a wide range of benefits for both resources, including the harvesting of sense-specific information and the external support of sense distinctions for matching senses as well as indicators for revisiting and possibly revising the lexical entries in question for nonmatching senses. 6 Related Work There has been a considerable body of research for English that investigates the alignment of the Princeton WordNet with Wikipedia (including Ruiz-Casado et al., 25; Ponzetto and Navigli, 2; Niemann and Gurevych, 2), with Wiktionary (including Meyer and Gurevych, 2), with the Longman Dictionary of Contemporary English and with Roget's thesaurus (Kwong, 998), with the Hector lexicon (Litkowski, 999), or with the Oxford Dictionary of English (Navigli, 26). Previous work for German has been on the alignment of GermaNet with the German version of Wiktionary (Henrich et al., 2) and with the German Wikipedia (Henrich et al., 22). However, there is no other previous research that tries to align GermaNet to the DWDS. 7 Conclusion and Future Work This initial pilot study has proven the feasibility of a sense alignment between GermaNet and the DWDS both in term of quantity and appropriateness. We have learned about the differences in the distinction of senses that are due to different perspectives and guidelines of how to model word senses that the lexicographers of both resources pursue. The classification of senses ac- 69
8 cording to their appropriateness to be aligned with senses from the other resource allows an individual treatment of different issues and phenomena that arise in practice when an alignment of two resources is performed. The alignment of GermaNet with the DWDS brings about considerable mutual benefits for both resources. For all sense distinctions that are completely parallel in the two resources, the alignment provides supporting external evidence for the validity of sense distinction and allows enriching word senses by information contained in the other resource. By contrast, for all nonmatching sense distinctions, reason for revisiting and possibly revising the lexical entries in question is provided. The natural next step, which we have already started to work on, is to implement an algorithm that automatically aligns senses from the two resources. This provides a good basis for the lexicographer s work of post-correcting the automatic alignment and revising the senses in both resources, which still remains a complex and substantial task to be performed. Acknowledgments We are very grateful to our student assistants Sabrina Galasso and Amit Vrubel, who helped us with the evaluation reported in Sections 4 and 5. Special thanks go to our colleagues in Berlin for making available the DWDS to us. Financial support was provided by the German Research Foundation (DFG) as part of the Collaborative Research Center Emergence of Meaning (SFB 833), by the German Ministry of Education and Technology (BMBF) as part of the research grant CLARIN-D, and by a research grant from the University of Tübingen. References Jörg Didakowski, Lothar Lemnitzer, and Alexander Geyken. 22. Automatic example sentence extraction for a contemporary German dictionary. Proceedings of EURALEX 22, Oslo, pp Christiane Fellbaum. (eds.) WordNet An Electronic Lexical Database. The MIT Press. Birgit Hamp and Helmut Feldweg GermaNet a Lexical-Semantic Net for German. Proceedings of ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, Madrid. Verena Henrich and Erhard Hinrichs. 2. GernEdiT The GermaNet Editing Tool. Proceedings of the Seventh Conference on International Language Resources and Evaluation, pp Verena Henrich, Erhard Hinrichs, and Klaus Suttner: Automatically Linking GermaNet to Wikipedia for Harvesting Corpus Examples for GermaNet Senses. Journal for Language Technology and Computational Linguistics (JLCL), Vol. 27, No., 22, pp. -9. Verena Henrich, Erhard Hinrichs, and Tatiana Vodolazova. 2. Semi-Automatic Extension of GermaNet with Sense Definitions from Wiktionary. Proceedings of 5th Language & Technology Conference (LTC 2), Pozna", Poland, 2, pp Wolfgang Klein and Alexander Geyken. 2. Das Digitale Wörterbuch der Deutschen Sprache (DWDS). Heid, Ulrich/Schierholz, Stefan/Schweickard, Wolfgang/Wiegand, Herbert Ernst/Gouws, Rufus H./Wolski, Werner (Hg.): Lexikographica. Berlin/New York, pp Oi Yee Kwong Aligning wordnet with additional lexical resources. Proceedings of the COL- ING-ACL 98 Workshop on Usage of WordNet in Natural Language Processing Systems, Montreal, QC, Canada, pp Kenneth C. Litkowski Towards a meaning-full comparison of lexical resources. Proceedings of the ACL Special Interest Group on the Lexicon Workshop on Standardizing Lexical Resources, College Park, MD, USA, pp Christian M. Meyer and Irina Gurevych. 2. What psycholinguists know about chemistry: Aligning Wiktionary and WordNet for increased domain coverage. Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2), pages , Elisabeth Niemann and Iryna Gurevych. 2. The People s Web meets Linguistic Knowledge: Automatic Sense Alignment of Wikipedia and WordNet. Proceedings of the Ninth International Conference on Computational Semantics, pp Roberto Navigli. 26. Meaningful clustering of senses helps boost word sense disambiguation performance. Proceedings of COLING 26 and ACL 26. Association for Computational Linguistics, pp Simone P. Ponzetto and Roberto Navigli. 2. Knowledge-rich Word Sense Disambiguation rivaling supervised system. Proceedings of the 48th Annual Meeting of the ACL, pp Maria Ruiz-Casado, Enrique Alfonseca, and Pablo Castells. 25. Automatic Assignment of Wikipedia Encyclopedic Entries to WordNet Synsets. Advances in Web Intelligence, Volume 3528 of LNCS, Springer Verlag, pp
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS Gürkan Şahin 1, Banu Diri 1 and Tuğba Yıldız 2 1 Faculty of Electrical-Electronic, Department of Computer Engineering
Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets
Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets Maria Ruiz-Casado, Enrique Alfonseca and Pablo Castells Computer Science Dep., Universidad Autonoma de Madrid, 28049 Madrid, Spain
Construction of Thai WordNet Lexical Database from Machine Readable Dictionaries
Construction of Thai WordNet Lexical Database from Machine Readable Dictionaries Patanakul Sathapornrungkij Department of Computer Science Faculty of Science, Mahidol University Rama6 Road, Ratchathewi
The Oxford Learner s Dictionary of Academic English
ISEJ Advertorial The Oxford Learner s Dictionary of Academic English Oxford University Press The Oxford Learner s Dictionary of Academic English (OLDAE) is a brand new learner s dictionary aimed at students
Visualizing WordNet Structure
Visualizing WordNet Structure Jaap Kamps Abstract Representations in WordNet are not on the level of individual words or word forms, but on the level of word meanings (lexemes). A word meaning, in turn,
An Efficient Database Design for IndoWordNet Development Using Hybrid Approach
An Efficient Database Design for IndoWordNet Development Using Hybrid Approach Venkatesh P rabhu 2 Shilpa Desai 1 Hanumant Redkar 1 N eha P rabhugaonkar 1 Apur va N agvenkar 1 Ramdas Karmali 1 (1) GOA
Clustering of Polysemic Words
Clustering of Polysemic Words Laurent Cicurel 1, Stephan Bloehdorn 2, and Philipp Cimiano 2 1 isoco S.A., ES-28006 Madrid, Spain [email protected] 2 Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe,
Building the Multilingual Web of Data: A Hands-on tutorial (ISWC 2014, Riva del Garda - Italy)
Building the Multilingual Web of Data: A Hands-on tutorial (ISWC 2014, Riva del Garda - Italy) Multilingual Word Sense Disambiguation and Entity Linking on the Web based on BabelNet Roberto Navigli, Tiziano
Author Gender Identification of English Novels
Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in
Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words
, pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan
Reference Books. (1) English-English Dictionaries. Fiona Ross FindYourFeet.de
Reference Books This handout originated many years ago in response to requests from students, most of them at Konstanz University. Students from many different departments asked me for advice on dictionaries,
Appendix B Data Quality Dimensions
Appendix B Data Quality Dimensions Purpose Dimensions of data quality are fundamental to understanding how to improve data. This appendix summarizes, in chronological order of publication, three foundational
A Mapping of CIDOC CRM Events to German Wordnet for Event Detection in Texts
A Mapping of CIDOC CRM Events to German Wordnet for Event Detection in Texts Martin Scholz Friedrich-Alexander-University Erlangen-Nürnberg Digital Humanities Research Group Outline Motivation: information
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University [email protected] Kapil Dalwani Computer Science Department
Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology
Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology Makoto Nakamura, Yasuhiro Ogawa, Katsuhiko Toyama Japan Legal Information Institute, Graduate
Domain Adaptive Relation Extraction for Big Text Data Analytics. Feiyu Xu
Domain Adaptive Relation Extraction for Big Text Data Analytics Feiyu Xu Outline! Introduction to relation extraction and its applications! Motivation of domain adaptation in big text data analytics! Solutions!
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation. Miguel Ruiz, Anne Diekema, Páraic Sheridan MNIS-TextWise Labs Dey Centennial Plaza 401 South Salina Street Syracuse, NY 13202 Abstract:
Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision
Taxonomy learning factoring the structure of a taxonomy into a semantic classification decision Viktor PEKAR Bashkir State University Ufa, Russia, 450000 [email protected] Steffen STAAB Institute AIFB,
COMPUTATIONAL DATA ANALYSIS FOR SYNTAX
COLING 82, J. Horeck~ (ed.j North-Holland Publishing Compa~y Academia, 1982 COMPUTATIONAL DATA ANALYSIS FOR SYNTAX Ludmila UhliFova - Zva Nebeska - Jan Kralik Czech Language Institute Czechoslovak Academy
1 Basic concepts. 1.1 What is morphology?
EXTRACT 1 Basic concepts It has become a tradition to begin monographs and textbooks on morphology with a tribute to the German poet Johann Wolfgang von Goethe, who invented the term Morphologie in 1790
Biological kinds and the causal theory of reference
Biological kinds and the causal theory of reference Ingo Brigandt Department of History and Philosophy of Science 1017 Cathedral of Learning University of Pittsburgh Pittsburgh, PA 15260 E-mail: [email protected]
What Makes a Good Online Dictionary? Empirical Insights from an Interdisciplinary Research Project
Proceedings of elex 2011, pp. 203-208 What Makes a Good Online Dictionary? Empirical Insights from an Interdisciplinary Research Project Carolin Müller-Spitzer, Alexander Koplenig, Antje Töpel Institute
Topics in Computational Linguistics. Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment
Topics in Computational Linguistics Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment Regina Barzilay and Lillian Lee Presented By: Mohammad Saif Department of Computer
Comparing Ontology-based and Corpusbased Domain Annotations in WordNet.
Comparing Ontology-based and Corpusbased Domain Annotations in WordNet. A paper by: Bernardo Magnini Carlo Strapparava Giovanni Pezzulo Alfio Glozzo Presented by: rabee ali alshemali Motive. Domain information
ONTOLOGIES A short tutorial with references to YAGO Cosmina CROITORU
ONTOLOGIES p. 1/40 ONTOLOGIES A short tutorial with references to YAGO Cosmina CROITORU Unlocking the Secrets of the Past: Text Mining for Historical Documents Blockseminar, 21.2.-11.3.2011 ONTOLOGIES
A terminology model approach for defining and managing statistical metadata
A terminology model approach for defining and managing statistical metadata Comments to : R. Karge (49) 30-6576 2791 mail [email protected] Content 1 Introduction... 4 2 Knowledge presentation...
CALCULATIONS & STATISTICS
CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents
Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach -
Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Philipp Sorg and Philipp Cimiano Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe, Germany {sorg,cimiano}@aifb.uni-karlsruhe.de
Domain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using
PoS-tagging Italian texts with CORISTagger
PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy [email protected] Abstract. This paper presents an evolution of CORISTagger [1], an high-performance
Package wordnet. January 6, 2016
Title WordNet Interface Version 0.1-11 Package wordnet January 6, 2016 An interface to WordNet using the Jawbone Java API to WordNet. WordNet () is a large lexical database
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
Simple maths for keywords
Simple maths for keywords Adam Kilgarriff Lexical Computing Ltd [email protected] Abstract We present a simple method for identifying keywords of one corpus vs. another. There is no one-sizefits-all
Modeling adjectives in GL: accounting for all adjective classes 1
Modeling adjectives in GL: accounting for all adjective classes 1 Sara Mendes Raquel Amaro Group for the Computation of Lexical Grammatical Knowledge Centre of Linguistics, University of Lisbon Avenida
Lexico-Semantic Relations Errors in Senior Secondary School Students Writing ROTIMI TAIWO Obafemi Awolowo University, Ile-Ife, Nigeria
Nordic Journal of African Studies 10(3): 366-373 (2001) Lexico-Semantic Relations Errors in Senior Secondary School Students Writing ROTIMI TAIWO Obafemi Awolowo University, Ile-Ife, Nigeria ABSTRACT The
An Integrated Approach to Automatic Synonym Detection in Turkish Corpus
An Integrated Approach to Automatic Synonym Detection in Turkish Corpus Dr. Tuğba YILDIZ Assist. Prof. Dr. Savaş YILDIRIM Assoc. Prof. Dr. Banu DİRİ İSTANBUL BİLGİ UNIVERSITY YILDIZ TECHNICAL UNIVERSITY
Exploiting Comparable Corpora and Bilingual Dictionaries. the Cross Language Text Categorization
Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization Alfio Gliozzo and Carlo Strapparava ITC-Irst via Sommarive, I-38050, Trento, ITALY {gliozzo,strappa}@itc.it
BUSINESS RULES AND GAP ANALYSIS
Leading the Evolution WHITE PAPER BUSINESS RULES AND GAP ANALYSIS Discovery and management of business rules avoids business disruptions WHITE PAPER BUSINESS RULES AND GAP ANALYSIS Business Situation More
Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Folie 1]
Content 1. Empirical linguistics 2. Text corpora and corpus linguistics 3. Concordances 4. Application I: The German progressive 5. Part-of-speech tagging 6. Fequency analysis 7. Application II: Compounds
Building a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
Random Forest Based Imbalanced Data Cleaning and Classification
Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem
Domain Knowledge Extracting in a Chinese Natural Language Interface to Databases: NChiql
Domain Knowledge Extracting in a Chinese Natural Language Interface to Databases: NChiql Xiaofeng Meng 1,2, Yong Zhou 1, and Shan Wang 1 1 College of Information, Renmin University of China, Beijing 100872
Historical Linguistics. Diachronic Analysis. Two Approaches to the Study of Language. Kinds of Language Change. What is Historical Linguistics?
Historical Linguistics Diachronic Analysis What is Historical Linguistics? Historical linguistics is the study of how languages change over time and of their relationships with other languages. All languages
KNOWLEDGE ORGANIZATION
KNOWLEDGE ORGANIZATION Gabi Reinmann Germany [email protected] Synonyms Information organization, information classification, knowledge representation, knowledge structuring Definition The term
Usability metrics for software components
Usability metrics for software components Manuel F. Bertoa and Antonio Vallecillo Dpto. Lenguajes y Ciencias de la Computación. Universidad de Málaga. {bertoa,av}@lcc.uma.es Abstract. The need to select
Word Sense Disambiguation as an Integer Linear Programming Problem
Word Sense Disambiguation as an Integer Linear Programming Problem Vicky Panagiotopoulou 1, Iraklis Varlamis 2, Ion Androutsopoulos 1, and George Tsatsaronis 3 1 Department of Informatics, Athens University
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du [email protected] University of British Columbia
6.2.8 Neural networks for data mining
6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural
Doctoral School of Historical Sciences Dr. Székely Gábor professor Program of Assyiriology Dr. Dezső Tamás habilitate docent
Doctoral School of Historical Sciences Dr. Székely Gábor professor Program of Assyiriology Dr. Dezső Tamás habilitate docent The theses of the Dissertation Nominal and Verbal Plurality in Sumerian: A Morphosemantic
How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.
Svetlana Sokolova President and CEO of PROMT, PhD. How the Computer Translates Machine translation is a special field of computer application where almost everyone believes that he/she is a specialist.
Text Analytics Illustrated with a Simple Data Set
CSC 594 Text Mining More on SAS Enterprise Miner Text Analytics Illustrated with a Simple Data Set This demonstration illustrates some text analytic results using a simple data set that is designed to
Purposes and Processes of Reading Comprehension
2 PIRLS Reading Purposes and Processes of Reading Comprehension PIRLS examines the processes of comprehension and the purposes for reading, however, they do not function in isolation from each other or
User studies, user behaviour and user involvement evidence and experience from The Danish Dictionary
User studies, user behaviour and user involvement evidence and experience from The Danish Dictionary Henrik Lorentzen, Lars Trap-Jensen Society for Danish Language and Literature, Copenhagen, Denmark E-mail:
WebLicht: Web-based LRT services for German
WebLicht: Web-based LRT services for German Erhard Hinrichs, Marie Hinrichs, Thomas Zastrow Seminar für Sprachwissenschaft, University of Tübingen [email protected] Abstract This software
Association Between Variables
Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi
The End of Primary School Test Marleen van der Lubbe, Cito, The Netherlands
The End of Primary School Test Marleen van der Lubbe, Cito, The Netherlands The Dutch Education System Before explaining more about the End of Primary School Test (better known as Citotest), it is important
IT services for analyses of various data samples
IT services for analyses of various data samples Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky Technical
What Are the Differences?
Comparison between the MSA manual and VDA Volume 5 What Are the Differences? MSA is short for Measurement Systems Analysis. This document was first published by the Automotive Industry Action Group (AIAG)
A Software Tool for Thesauri Management, Browsing and Supporting Advanced Searches
J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria 105 A Software Tool for Thesauri Management, Browsing and Supporting Advanced Searches J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria
Central and South-East European Resources in META-SHARE
Central and South-East European Resources in META-SHARE Tamás VÁRADI 1 Marko TADIĆ 2 (1) RESERCH INSTITUTE FOR LINGUISTICS, MTA, Budapest, Hungary (2) FACULTY OF HUMANITIES AND SOCIAL SCIENCES, ZAGREB
ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking
ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking Anne-Laure Ligozat LIMSI-CNRS/ENSIIE rue John von Neumann 91400 Orsay, France [email protected] Cyril Grouin LIMSI-CNRS rue John von Neumann 91400
Depth-of-Knowledge Levels for Four Content Areas Norman L. Webb March 28, 2002. Reading (based on Wixson, 1999)
Depth-of-Knowledge Levels for Four Content Areas Norman L. Webb March 28, 2002 Language Arts Levels of Depth of Knowledge Interpreting and assigning depth-of-knowledge levels to both objectives within
The European Financial Reporting Advisory Group (EFRAG) and the Autorité des Normes Comptables (ANC) jointly publish on their websites for
The European Financial Reporting Advisory Group (EFRAG) and the Autorité des Normes Comptables (ANC) jointly publish on their websites for information purpose a Research Paper on the proposed new Definition
AN APPROACH TO WORD SENSE DISAMBIGUATION COMBINING MODIFIED LESK AND BAG-OF-WORDS
AN APPROACH TO WORD SENSE DISAMBIGUATION COMBINING MODIFIED LESK AND BAG-OF-WORDS Alok Ranjan Pal 1, 3, Anirban Kundu 2, 3, Abhay Singh 1, Raj Shekhar 1, Kunal Sinha 1 1 College of Engineering and Management,
Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Folie 1]
Content 1. Empirical linguistics 2. Text corpora and corpus linguistics 3. Concordances 4. Application I: The German progressive 5. Part-of-speech tagging 6. Fequency analysis 7. Application II: Compounds
The German Taxpayer-Panel
Schmollers Jahrbuch 127 (2007), 497 ± 509 Duncker & Humblot, Berlin The German Taxpayer-Panel By Susan Kriete-Dodds and Daniel Vorgrimler The use of panel data has become increasingly popular in socioeconomic
Unit 1 Number Sense. In this unit, students will study repeating decimals, percents, fractions, decimals, and proportions.
Unit 1 Number Sense In this unit, students will study repeating decimals, percents, fractions, decimals, and proportions. BLM Three Types of Percent Problems (p L-34) is a summary BLM for the material
3. Mathematical Induction
3. MATHEMATICAL INDUCTION 83 3. Mathematical Induction 3.1. First Principle of Mathematical Induction. Let P (n) be a predicate with domain of discourse (over) the natural numbers N = {0, 1,,...}. If (1)
What s in a Lexicon. The Lexicon. Lexicon vs. Dictionary. What kind of Information should a Lexicon contain?
What s in a Lexicon What kind of Information should a Lexicon contain? The Lexicon Miriam Butt November 2002 Semantic: information about lexical meaning and relations (thematic roles, selectional restrictions,
National Disability Authority Resource Allocation Feasibility Study Final Report January 2013
National Disability Authority Resource Allocation Feasibility Study January 2013 The National Disability Authority (NDA) has commissioned and funded this evaluation. Responsibility for the evaluation (including
How To Write A German Reference Corpus Of Computer Mediated Communication
DeRiK: A German Reference Corpus of Computer-Mediated Communication Michael Beißwenger 1, Maria Ermakova 2, Alexander Geyken 2, Lothar Lemnitzer 2, Angelika Storrer 1 1 Department of German Language and
Web opinion mining: How to extract opinions from blogs?
Web opinion mining: How to extract opinions from blogs? Ali Harb [email protected] Mathieu Roche LIRMM CNRS 5506 UM II, 161 Rue Ada F-34392 Montpellier, France [email protected] Gerard Dray [email protected]
Learning in Abstract Memory Schemes for Dynamic Optimization
Fourth International Conference on Natural Computation Learning in Abstract Memory Schemes for Dynamic Optimization Hendrik Richter HTWK Leipzig, Fachbereich Elektrotechnik und Informationstechnik, Institut
Natural Language Processing. Part 4: lexical semantics
Natural Language Processing Part 4: lexical semantics 2 Lexical semantics A lexicon generally has a highly structured form It stores the meanings and uses of each word It encodes the relations between
Paraphrasing controlled English texts
Paraphrasing controlled English texts Kaarel Kaljurand Institute of Computational Linguistics, University of Zurich [email protected] Abstract. We discuss paraphrasing controlled English texts, by defining
The open lexical infrastructure of Språkbanken
The open lexical infrastructure of Språkbanken Lars Borin, Markus Forsberg, Leif-Jöran Olsson and Jonatan Uppström Språkbanken, University of Gothenburg, Sweden {lars.borin, markus.forsberg, leif-joran.olsson,
A Survey on Product Aspect Ranking Techniques
A Survey on Product Aspect Ranking Techniques Ancy. J. S, Nisha. J.R P.G. Scholar, Dept. of C.S.E., Marian Engineering College, Kerala University, Trivandrum, India. Asst. Professor, Dept. of C.S.E., Marian
Get the most value from your surveys with text analysis
PASW Text Analytics for Surveys 3.0 Specifications Get the most value from your surveys with text analysis The words people use to answer a question tell you a lot about what they think and feel. That
On the use of antonyms and synonyms from a domain perspective
On the use of antonyms and synonyms from a domain perspective Debela Tesfaye IT PhD Program Addis Ababa University Addis Ababa, Ethiopia [email protected] Carita Paradis Centre for Languages and Literature
COMPARATIVES WITHOUT DEGREES: A NEW APPROACH. FRIEDERIKE MOLTMANN IHPST, Paris [email protected]
COMPARATIVES WITHOUT DEGREES: A NEW APPROACH FRIEDERIKE MOLTMANN IHPST, Paris [email protected] It has become common to analyse comparatives by using degrees, so that John is happier than Mary would
Absolute versus Relative Synonymy
Article 18 in LCPJ Danglli, Leonard & Abazaj, Griselda 2009: Absolute versus Relative Synonymy Absolute versus Relative Synonymy Abstract This article aims at providing an illustrated discussion of the
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
j A Handbook of Lexicography
j A Handbook of Lexicography This book provides a systematic survey of the theory and methods of dictionary-making (including the linguistic background): what types of dictionary there are, how different
The study of words. Word Meaning. Lexical semantics. Synonymy. LING 130 Fall 2005 James Pustejovsky. ! What does a word mean?
Word Meaning LING 130 Fall 2005 James Pustejovsky The study of words! What does a word mean?! To what extent is it a linguistic matter?! To what extent is it a matter of world knowledge? Thanks to Richard
Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources
Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources Michelle
UBY A Large-Scale Unified Lexical-Semantic Resource Based on LMF
UBY A Large-Scale Unified Lexical-Semantic Resource Based on LMF Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer and Christian Wirth Ubiquitous Knowledge Processing
