UBY A Large-Scale Unified Lexical-Semantic Resource Based on LMF
|
|
|
- Agnes Joseph
- 10 years ago
- Views:
Transcription
1 UBY A Large-Scale Unified Lexical-Semantic Resource Based on LMF Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer and Christian Wirth Ubiquitous Knowledge Processing Lab (UKP-DIPF) German Institute for Educational Research and Educational Information Ubiquitous Knowledge Processing Lab (UKP-TUDA) Department of Computer Science Technische Universität Darmstadt Abstract We present UBY, a large-scale lexicalsemantic resource combining a wide range of information from expert-constructed and collaboratively constructed resources for English and German. It currently contains nine resources in two languages: English WordNet, Wiktionary, Wikipedia, FrameNet and VerbNet, German Wikipedia, Wiktionary and GermaNet, and multilingual OmegaWiki modeled according to the LMF standard. For FrameNet, VerbNet and all collaboratively constructed resources, this is done for the first time. Our LMF model captures lexical information at a fine-grained level by employing a large number of Data Categories from ISOCat and is designed to be directly extensible by new languages and resources. All resources in UBY can be accessed with an easy to use publicly available API. 1 Introduction Lexical-semantic resources (LSRs) are the foundation of many NLP tasks such as word sense disambiguation, semantic role labeling, question answering and information extraction. They are needed on a large scale in different languages. The growing demand for resources is met neither by the largest single expert-constructed resources (ECRs), such as WordNet and FrameNet, whose coverage is limited, nor by collaboratively constructed resources (CCRs), such as Wikipedia and Wiktionary, which encode lexical-semantic knowledge in a less systematic form than ECRs, because they are lacking expert supervision. Previously, there have been several independent efforts of combining existing LSRs to enhance their coverage w.r.t. their breadth and depth, i.e. (i) the number of lexical items, and (ii) the types of lexical-semantic information contained (Shi and Mihalcea, 2005; Johansson and Nugues, 2007; Navigli and Ponzetto, 2010b; Meyer and Gurevych, 2011). As these efforts often targeted particular applications, they focused on aligning selected, specialized information types. To our knowledge, no single work focused on modeling a wide range of ECRs and CCRs in multiple languages and a large variety of information types in a standardized format. Frequently, the presented model is not easily scalable to accommodate an open set of LSRs in multiple languages and the information mined automatically from corpora. The previous work also lacked the aspects of lexicon format standardization and API access. We believe that easy access to information in LSRs is crucial in terms of their acceptance and broad applicability in NLP. In this paper, we propose a solution to this. We define a standardized format for modeling LSRs. This is a prerequisite for resource interoperability and the smooth integration of resources. We employ the ISO standard Lexical Markup Framework (LMF: ISO 24613:2008), a metamodel for LSRs (Francopoulo et al., 2006), and Data Categories (DCs) selected from ISOCat. 1 One of the main challenges of our work is to develop a model that is standard-compliant, yet able to express the information contained in diverse LSRs, and that in the long term supports the integration of the various resources. The main contributions of this paper can be 1
2 summarized as follows: (1) We present an LMFbased model for large-scale multilingual LSRs called UBY-LMF. We model the lexical-semantic information down to a fine-grained level of information (e.g. syntactic frames) and employ standardized definitions of linguistic information types from ISOCat. (2) We present UBY, a largescale LSR implementing the UBY-LMF model. UBY currently contains nine resources in two languages: English WordNet (WN, Fellbaum (1998), Wiktionary 2 (WKT-en), Wikipedia 3 (WPen), FrameNet (FN, Baker et al. (1998)), and VerbNet (VN, Kipper et al. (2008)); German Wiktionary (WKT-de), Wikipedia (WP-de), and GermaNet (GN, Kunze and Lemnitzer (2002)), and the English and German entries of OmegaWiki 4 (OW), referred to as OW-en and OW-de. OW, a novel CCR, is inherently multilingual its basic structure are multilingual synsets, which are a valuable addition to our multilingual UBY. Essential to UBY are the nine pairwise sense alignments between resources, which we provide to enable resource interoperability on the sense level, e.g. by providing access to the often complementary information for a sense in different resources. (3) We present a Java-API which offers unified access to the information contained in UBY. We will make the UBY-LMF model, the resource UBY and the API freely available to the research community. 5 This will make it easy for the NLP community to utilize UBY in a variety of tasks in the future. 2 Related Work The work presented in this paper concerns standardization of LSRs, large-scale integration thereof at the representational level, and the unified access to lexical-semantic information in the integrated resources. Standardization of resources. Previous work includes models for representing lexical information relative to ontologies (Buitelaar et al., 2009; McCrae et al., 2011), and standardized single wordnets (English, German and Italian wordnets) in the ISO standard LMF (Soria et al., 2009; Henrich and Hinrichs, 2010; Toral et al., 2010) McCrae et al. (2011) propose LEMON, a conceptual model for lexicalizing ontologies as an extension of the LexInfo model (Buitelaar et al., 2009). LEMON provides an LMF-implementation in the Web Ontology Language (OWL), which is similar to UBY-LMF, as it also uses DCs from ISOCat, but diverges further from the standard (e.g. by removing structural elements such as the predicative representation class). While we focus on modeling lexical-semantic information comprehensively and at a fine-grained level, the goal of LEMON is to support the linking between ontologies and lexicons. This goal entails a task-targeted application: domain-specific lexicons are extracted from ontology specifications and merged with existing LSRs on demand. As a consequence, there is no available large-scale instance of the LEMON model. Soria et al. (2009) define WordNet-LMF, an LMF model for representing wordnets used in the KYOTO project, and Henrich and Hinrichs (2010) do this for GN, the German wordnet. These models are similar, but they still present different implementations of the LMF metamodel, which hampers interoperability between the resources. We build upon this work, but extend it significantly: UBY goes beyond modeling a single ECR and represents a large number of both ECRs and CCRs with very heterogeneous content in the same format. Also, UBY-LMF features deeper modeling of lexical-semantic information. Henrich and Hinrichs (2010), for instance, do not explicitly model the argument structure of subcategorization frames, since each frame is represented as a string. In UBY-LMF, we represent them at a fine-grained level necessary for the transparent modeling of the syntaxsemantics interface. Large-scale integration of resources. Most previous research efforts on the integration of resources targeted at world knowledge rather than lexical-semantic knowledge. Well known examples are YAGO (Suchanek et al., 2007), or DBPedia (Bizer et al., 2009). Atserias et al. (2004) present the Meaning Multilingual Central Repository (MCR). MCR integrates five local wordnets based on the Interlingual Index of EuroWordNet (Vossen, 1998). The overall goal of the work is to improve word sense disambiguation. This work is similar to ours, as it
3 aims at a large-scale multilingual resource and includes several resources. It is however restricted to a single type of resource (wordnets) and features a single type of lexical information (semantic relations) specified upon synsets. Similarly, de Melo and Weikum (2009) create a multilingual wordnet by integrating wordnets, bilingual dictionaries and information from parallel corpora. None of these resources integrate lexicalsemantic information, such as syntactic subcategorization or semantic roles. McFate and Forbus (2011) present NULEX, a syntactic lexicon automatically compiled from WN, WKT-en and VN. As their goal is to create an open-license resource to enhance syntactic parsing, they enrich verbs and nouns in WN with inflection information from WKT-en and syntactic frames from VN. Thus, they only use a small part of the lexical information present in WKT-en. Padró et al. (2011) present their work on lexicon merging within the Panacea Project. One goal of Panacea is to create a lexical resource development platform that supports large-scale lexical acquisition and can be used to combine existing lexicons with automatically acquired ones. To this end, Padró et al. (2011) explore the automatic integration of subcategorization lexicons. Their current work only covers Spanish, and though they mention the LMF standard as a potential data model, they do not make use of it. Shi and Mihalcea (2005) integrate FN, VN and WN, and Palmer (2009) presents a combination of Propbank, VN and FN in a resource called SEM- LINK in order to enhance semantic role labeling. Similar to our work, multiple resources are integrated, but their work is restricted to a single language and does not cover CCRs, whose popularity and importance has grown tremendously over the past years. In fact, with the exception of NULEX, CCRs have only been considered in the sense alignment of individual resource pairs (Navigli and Ponzetto, 2010a; Meyer and Gurevych, 2011). API access for resources. An important factor to the success of a large, integrated resource is a single public API, which facilitates the access to the information contained in the resource. The most important LSRs so far can be accessed using various APIs, for instance the Java WordNet API, 6 or the Java-based Wikipedia API. 7 With a stronger focus of the NLP community on sharing data and reproducing experimental results these tools are becoming important as never before. Therefore, a major design objective of UBY is a single API. This is similar in spirit to the motivation of Pradhan et al. (2007), who present integrated access to corpus annotations as a main goal of their work on standardizing and integrating corpus annotations in the OntoNotes project. To summarize, related work focuses either on the standardization of single resources (or a single type of resource), which leads to several slightly different formats constrained to these resources, or on the integration of several resources in an idiosyncratic format. CCRs have not been considered at all in previous work on resource standardization, and the level of detail of the modeling is insufficient to fully accommodate different types of lexical-semantic information. API access is rarely provided. This makes it hard for the community to exploit their results on a large scale. Thus, it diminishes the impact that these projects might achieve upon NLP beyond their original specific purpose, if their results were represented in a unified resource and could easily be accessed by the community through a single public API. 3 UBY Data model LMF defines a metamodel of LSRs in the Unified Modeling Language (UML). It provides a number of UML packages and classes for modeling many different types of resources, e.g. wordnets and multilingual lexicons. The design of a standard-compliant lexicon model in LMF involves two steps: in the first step, the structure of the lexicon model has to be defined by choosing a combination of the LMF core package and zero to many extensions (i.e. UML packages). In the second step, these UML classes are enriched by attributes. To contribute to semantic interoperability, it is essential for the lexicon model that the attributes and their values refer to Data Categories (DCs) taken from a reference repository. DCs are standardized specifications of the terms that are used for attributes and their values, or in other words, the linguistic vocabulary occurring
4 in a lexicon model. Consider, for instance, the term lexeme that is defined differently in WN and FN: in FN, a lexeme refers to a word form, not including the sense aspect. In WN, on the contrary, a lexeme is an abstract pairing of meaning and form. According to LMF, the DCs are to be selected from ISOCat, the implementation of the ISO Data Category Registry (DCR, Broeder et al. (2010)), resulting in a Data Category Selection (DCS). Design of UBY-LMF. We have designed UBY- LMF 8 as a model of the union of various heterogeneous resources, namely WN, GN, FN, and VN on the one hand and CCRs on the other hand. Two design principles guided our development of UBY-LMF: first, to preserve the information available in the original resources and to uniformly represent it in UBY-LMF. Second, to be able to extend UBY in the future by further languages, resources, and types of linguistic information, in particular, alignments between different LSRs. Wordnets, FN and VN are largely complementary regarding the information types they provide, see, e.g. Baker and Fellbaum (2009). Accordingly, they use different organizational units to represent this information. Wordnets, such as WN and GN, primarily contain information on lexical-semantic relations, such as synonymy, and use synsets (groups of lexemes that are synonymous) as organizational units. FN focuses on groups of lexemes that evoke the same prototypical situation (so-called semantic frames, Fillmore (1982)) involving semantic roles (so-called frame elements). VN, a large-scale verb lexicon, is organized in Levin-style verb classes (Levin, 1993) (groups of verbs that share the same syntactic alternations and semantic roles) and provides rich subcategorization frames including semantic roles and a specification of semantic predicates. UBY-LMF employs several direct subclasses of Lexicon in order to account for the various organization types found in the different LSRs considered. While the LexicalEntry class reflects the traditional headword-based lexicon organization, Synset represents synsets from wordnets, SemanticPredicate models FN semantic frames, and SubcategorizationFrameSet corresponds to VN alternation classes. 8 See SubcategorizationFrame is composed of syntactic arguments, while SemanticPredicate is composed of semantic arguments. The linking between syntactic and semantic arguments is represented by the SynSemCorrespondence class. The SenseAxis class is very important in UBY-LMF, as it connects the different source LSRs. Its role is twofold: first, it links the corresponding word senses from different languages, e.g. English and German. Second, it represents monolingual sense alignments, i.e. sense alignments between different lexicons in the same language. The latter is a novel interpretation of SenseAxis introduced by UBY-LMF. The organization of lexical-semantic knowledge found in WP, WKT, and OW can be modeled with the classes in UBY-LMF as well. WP primarily provides encyclopedic information on nouns. It mainly consists of article pages which are modeled as Senses in UBY-LMF. WKT is in many ways similar to traditional dictionaries, because it enumerates senses under a given headword on an entry page. Thus, WKT entry pages can be represented by LexicalEntries and WKT senses by Senses. OW is different from WKT and WP, as it is organized in multilingual synsets. To model OW in UBY-LMF, we split the synsets per language and included them as monolingual Synsets in the corresponding Lexicon (e.g., OW-en or OWde). The original multilingual information is preserved by adding a SenseAxis between corresponding synsets in OW-en and OW-de. The LMF standard itself contains only few linguistic terms and does neither specify attributes nor their values. Therefore, an important task in developing UBY-LMF has been the specification of attributes and their values along with the proper attachment of attributes to LMF classes. In particular, this task involved selecting DCs from ISO- Cat and, if necessary, adding new DCs to ISOCat. Extensions in UBY-LMF. Although UBY- LMF is largely compliant with LMF, the task of building a homogeneous lexicon model for many highly heterogeneous LSRs led us to extend LMF in several ways: we added two new classes and several new relationships between classes. First, we were facing a huge variety of lexicalsemantic labels for many different dimensions of
5 semantic classification. Examples of such dimensions include ontological type (e.g. selectional restrictions in VN and FN), domain (e.g. Biology in WN), style and register (e.g. labels in WKT, OW), or sentiment (e.g. sentiment of lexical units in FN). Since we aim at an extensible LMF-model, capable of representing further dimensions of semantic classification, we did not squeeze the information on semantic classes present in the considered LSRs into existing LMF classes. Instead, we addressed this issue by introducing a more general class, SemanticLabel, which is an optional subclass of Sense, SemanticPredicate, and SemanticArgument. This new class has three attributes, encoding the name of the label, its type (e.g. ontological, register, sentiment), and a numeric quantification (e.g. sentiment strength). Second, we attached the subclass Frequency to most of the classes in UBY-LMF, in order to encode frequency information. This is of particular importance when using the resource in machine learning applications. This extension of the standard has already been made in WordNet-LMF (Soria et al., 2009). Currently, the Frequency class is used to keep corpus frequencies for lexical units in FN, but we plan to use it for enriching many other classes with frequency information in future work, such as Senses or SubcategorizationFrames. Third, the representation of FN in LMF required adding two new relationships between LMF classes: we added a relationship between SemanticArgument and Definition, in order to represent the definitions available for frame elements in FN. In addition, we added a relationship between the Context class and the MonoLingualExternalRef, to represent the links to annotated corpus sentences in FN. Finally, WKT turned out to be hard to tackle, because it contains a special kind of ambiguity in the semantic relations and translation links listed for senses: the targets of both relations and translation links are ambiguous, as they refer to lemmas (word forms), rather than to senses (Meyer and Gurevych, 2010). These ambiguous relation targets could not directly be represented in LMF, since sense and translation relations are defined between senses. To resolve this, we added a relationship between SenseRelation and FormRepresentation, in order to encode the ambiguous WKT relation target as a word form. Disambiguating the WKT relation targets to infer the target sense is left to future work. A related issue occurred, when we mapped WN to LMF. WN encodes morphologically related forms as sense relations. UBY-LMF represents these related forms not only as sense relations (as in WordNet-LMF), but also at the morphological level using the RelatedForm class from the LMF Morphology extension. In LMF, however, the RelatedForm class for morphologically related lexemes is not associated with the corresponding sense in any way. Discarding the WN information on the senses involved in a particular morphological relation would lead to information loss in some cases. Consider as an example the WN verb buy (purchase) which is derivationally related to the noun buy, while on the other hand buy (accept as true, e.g. I can t buy this story) is not derivationally related to the noun buy. We addressed this issue by adding a sense attribute to the RelatedForm class. Thus, in extension of LMF, UBY-LMF allows sense relations to refer to a form relation target and morphological relations to refer to a sense relation target. Data Categories in UBY-LMF. We encountered large differences in the availability of DCs in ISOCat for the morpho-syntactic, lexicalsyntactic, and lexical-semantic parts of UBY- LMF. Many DCs were missing in ISOCat and we had to enter them ourselves. While this was feasible at the morpho-syntactic and lexical-syntactic level, due to a large body of standardization results available, it was much harder at the lexicalsemantic level where standardization is still ongoing. At the lexical-semantic level, UBY-LMF currently allows string values for a number of attribute values, e.g. for semantic roles. We can easily integrate the results of the ongoing standardization efforts into UBY-LMF in the future. 4 UBY Population with information 4.1 Representing LSRs in UBY-LMF UBY-LMF is represented by a DTD (as suggested by the standard) which can be used to automatically convert any given resource into the corresponding XML format. 9 This conversion requires a detailed analysis of the resource to be converted, followed by the definition of a mapping of the 9 Therefore, UBY-LMF can be considered as a serialization of LMF.
6 concepts and terms used in the original resource to the UBY-LMF model. There are two major tasks involved in the development of an automatic conversion routine: first, the basic organizational unit in the source LSR has to be identified and mapped, e.g. synset in WN or semantic frame in FN, and second, it has to be determined, how a (LMF) sense is defined in the source LSR. A notable aspect of converting resources into UBY-LMF is the harmonization of linguistic terminology used in the LSRs. For instance, a WN Word and a GN Lexical Unit are mapped to Sense in UBY-LMF. We developed reusable conversion routines for the future import of updated versions of the source LSRs into UBY, provided the structure of the source LSR remains stable. These conversion routines extract lexical data from the source LSRs by calling their native APIs (rather than processing the underlying XML data). Thus, all lexical information which can be accessed via the APIs is converted into UBY-LMF. Converting the LSRs introduced in the previous section yielded an instantiation of UBY-LMF named UBY. The LexicalResource instance UBY currently comprises 10 Lexicon instances, one each for OW-de and OW-en, and one lexicon each for the remaining eight LSRs. 4.2 Adding Sense Alignments Besides the uniform and standardized representation of the single LSRs, one major asset of UBY is the semantic interoperability of resources at the sense level. In the following, we (i) describe how we converted already existing sense alignments of resources into LMF, and (ii) present a framework to infer alignments automatically for any pair of resources. Existing Alignments. Previous work on sense alignment yielded several alignments, such as WN WP-en (Niemann and Gurevych, 2011), WN WKT-en (Meyer and Gurevych, 2011) and VN FN (Palmer, 2009). We converted these alignments into UBY-LMF by creating a SenseAxis instance for each pair of aligned senses. This involved mapping the sense IDs from the proprietary alignment files to the corresponding sense IDs in UBY. In addition, we integrated the sense alignments already present in OW and WP. Some OW entries provide links to the corresponding WP page. Also, the German and English language editions of WP and OW are connected by inter-language links between articles (Senses in UBY). We can expect that these links have high quality, as they were entered manually by users and are subject to community control. Therefore, we straightforwardly imported them into UBY. Alignment Framework. Automatically creating new alignments is difficult because of word ambiguities, different granularities of senses, or language specific conceptualizations (Navigli, 2006). To support this task for a large number of resources across languages, we have designed a flexible alignment framework based on the state-of-the-art method of Niemann and Gurevych (2011). The framework is generic in order to allow alignments between different kinds of entities as found in different resources, e.g. WN synsets, FN frames or WP articles. The only requirement is that the individual entities are distinguishable by a unique identifier in each resource. The alignment consists of the following steps: First, we extract the alignment candidates for a given resource pair, e.g. WN sense candidates for a WKT-en entry. Second, we create a gold standard by manually annotating a subset of candidate pairs as valid or non-valid. Then, we extract the sense representations (e.g. lemmatized bag-of-words based on glosses) to compute the similarity of word senses (e.g. by cosine similarity). The gold standard with corresponding similarity values is fed into Weka (Hall et al., 2009) to train a machine learning classifier, and in the final step this classifier is used to automatically classify the candidate sense pairs as (non-)valid alignment. Our framework also allows us to train on a combination of different similarity measures. Using our framework, we were able to reproduce the results reported by Niemann and Gurevych (2011) and Meyer and Gurevych (2011) based on the publicly available evaluation datasets 10 and the configuration details reported in the corresponding papers. Cross-Lingual Alignment. In order to align word senses across languages, we extended the monolingual sense alignment described above to the cross-lingual setting. Our approach utilizes 10
7 Moses, 11 trained on the Europarl corpus. The lemma of one of the two senses to be aligned as well as its representations (e.g. the gloss) is translated into the language of the other resource, yielding a monolingual setting. E.g., the WN synset {vessel, watercraft} with its gloss a craft designed for water transportation is translated into {Schiff, Wasserfahrzeug} and Ein Fahrzeug für Wassertransport, and then the candidate extraction and all downstream steps can take place in German. An inherent problem with this approach is that incorrect translations also lead to invalid alignment candidates. However, these are most probably filtered out by the machine learning classifier as the calculated similarity between the sense representations (e.g. glosses) should be low if the candidates do not match. We evaluated our approach by creating a crosslingual alignment between WN and OW-de, i.e. the concepts in OW with a German lexicalization. 12 To our knowledge, this is the first study on aligning OW with another LSR. OW is especially interesting for this task due to its multilingual concepts, as described by Matuschek and Gurevych (2011). The created gold standard could, for instance, be re-used to evaluate alignments for other languages in OW. To compute the similarity of word senses, we followed the approach by Niemann and Gurevych (2011) while covering both translation directions. We used the cosine similarity for comparing the German OW glosses with the German translations of WN glosses and cosine and personalized page rank (PPR) similarity for comparison of the German OW glosses translated into English with the original English WN glosses. Note that PPR similarity is not available for German as it is based on WN. Thereby, we filtered out the OW concepts without a German gloss which left us with 11,806 unique candidate pairs. We randomly selected 500 WN synsets for analysis yielding 703 candidate pairs. These were manually annotated as being (non-)alignments. For the subsequent machine learning task we used a simple thresholdbased classifier and ten-fold cross validation. Table 1 summarizes the results of different system configurations. We observe that translation OmegaWiki consists of interlinked languageindependent concepts to which lexicalizations in several languages are attached. Translation Similarity direction measure P R F 1 EN > DE Cosine (Cos) DE > EN Cos DE > EN PPR DE > EN PPR + Cos Table 1: Cross-lingual alignment results into English works significantly better than into German. Also, the more elaborate similarity measure PPR yields better results than cosine similarity, while the best result is achieved by a combination of both. Niemann and Gurevych (2011) make a similar observation for the monolingual setting. Our F-measure of in the best configuration lies between the results of Meyer and Gurevych (2011) (0.66) and Niemann and Gurevych (2011) (0.78), and thus verifies the validity of the machine translation approach. Therefore, the best alignment was subsequently integrated into UBY. 5 Evaluating UBY We performed an intrinsic evaluation of UBY by computing a number of resource statistics. Our evaluation covers two aspects: first, it addresses the question if our automatic conversion routines work correctly. Second, it provides indicators for assessing UBY in terms of the gain in coverage compared to the single LSRs. Correctness of conversion. Since we aim to preserve the maximal amount of information from the original LSRs, we should be able to replace any of the original LSRs and APIs by UBY and the UBY-API without losing information. As the conversion is largely performed automatically, systematic errors and information loss could be introduced by a faulty conversion routine. In order to detect such errors and to prove the correctness of the automatic conversion and the resulting representation, we have compared the original resource statistics of the classes and information types in the source LSRs to the corresponding classes in their UBY counterparts. For instance, the number of lexical relations in WordNet has been compared to the number of SenseRelations in the UBY WordNet lexicon For detailed analysis results see the UBY website.
8 Lexical Sense Lexicon Entry Sense Relation FN 9,704 11,942 GN 83,091 93, ,213 OW-de 30,967 34,691 60,054 OW-en 51,715 57,921 85,952 WP-de 790, , ,286 WP-en 2,712,117 2,921,455 3,364,083 WKT-de 85,575 72, ,358 WKT-en 335, , ,595 WN 156, ,978 8,559 VN 3,962 31,891 UBY 4,259,894 4,691,313 5,300,941 Table 2: UBY resource statistics (selected classes). Lexicon pair Languages SenseAxis WN WP-en EN EN 50,351 WN WKT-en EN EN 99,662 WN VN EN EN 40,716 FN VN EN EN 17,529 WP-en OW-en EN EN 3,960 WP-de OW-de DE DE 1,097 WN OW-de EN DE 23,024 WP-en WP-de EN DE 463,311 OW-en OW-de EN DE 58,785 UBY All 758,435 Table 3: UBY alignment statistics. Gain in coverage. UBY offers an increased coverage compared to the single LSRs as reflected in the resource statistics. Tables 2 and 3 show the statistics on central classes in UBY. As UBY is organized in several Lexicons, the number of UBY lexical entries is the sum of the lexical entries in all 10 Lexicons. Thus, UBY contains more than 4.2 million lexical entries, 4.6 million senses, 5.3 million semantic relations between senses and more than 750,000 alignments. These statistics represent the total numbers of lexical entries, senses and sense relations in UBY without filtering of identical (i.e. corresponding) lexical entries, senses and relations. Listing the number of unique senses would require a full alignment between all integrated resources, which is currently not available. We can, however, show that UBY contains over 3.08 million unique lemma-pos combinations for English and over 860,000 for German, over 3.94 million in total, see Table 4. Therefore, we assessed the coverage on lemma level. Table 4 also shows the number of lemmas with entries in one or more than one lexicon, additionally split by POS and language. Lemmas occurring only once in UBY increase the coverage at lemma level. For lemmas with parallel entries in several UBY lexicons, new information becomes available in the form of additional sense definitions and complementary information types attached to lemmas. Finally, the increase in coverage at sense level can be estimated for senses that are aligned across at least two UBY-lexicons. We gain access to all available, partly complementary information types attached to these aligned senses, e.g. semantic relations, subcategorization frames, encyclopedic or multilingual information. The number of pairwise sense alignments provided by UBY is given in Table 3. In addition, we computed how many senses simultaneously take part in at least two pairwise sense alignments. For English, this applies to 31,786 senses, for which information from 3 UBY lexicons is available. EN Lexicons noun verb adjective ,630 1, ,439 1,948 2, ,856 4,727 12, ,900,652 50,209 41,731 Σ (unique EN) 3,080,771 DE Lexicons noun verb adjective 4 1, , ,813 3,174 2, ,770 6,108 7,737 Σ (unique DE) 862,879 Table 4: Number of lemmas (split by POS and language) with entries in i UBY lexicons, i = 1,..., 5. 6 Using UBY UBY API. For convenient access to UBY, we implemented a Java-API which is built around the Hibernate 14 framework. Hibernate allows to easily store the XML data which results from converting resources into Uby-LMF into a corresponding SQL database. Our main design principle was to keep the access to the resource as simple as possible, despite the rich and complex structure of UBY. Another 14
9 important design aspect was to ensure that the functionality of the individual, resource-specific APIs or user interfaces is mirrored in the UBY API. This enables porting legacy applications to our new resource. To facilitate the transition to UBY, we plan to provide reference tables which list the corresponding UBY-API operations for the most important operations in the WN API, some of which are shown in Table 5. WN function Dictionary getindexword(pos, lemma) IndexWord getlemma() Synset getgloss() getwords() Pointer gettype() Word getpointers() UBY function UBY getlexicalentries( pos, lemma) LexicalEntry getlemmaform() Synset getdefinitiontext() getsenses() SynsetRelation getrelname() Sense getsenserelations() Table 5: Some equivalent operations in WN API and UBY API. While it is possible to limit access to single resources by a parameter and thus mimic the behavior of the legacy APIs (e.g. only retrieve Synsets and their relations from WN), the true power of UBY API becomes visible when no such constraints are applied. In this case, all imported resources are queried to get one combined result, while retaining the source of the respective information. On top of this, the information about existing sense alignments across resources can be accessed via SenseAxis relations, so that the returned combined result covers not only the lexical, but also the sense level. Community issues. One of the most important reasons for UBY is creating an easy-to-use powerful LSR to advance NLP research and development. Therefore, community building around the resource is one of our major concerns. To this end, we will offer free downloads of the lexical data and software presented in this paper under open licenses, namely: The UBY-LMF DTD, mappings and conversion tools for existing resources and sense alignments, the Java API, and, as far as licensing allows, 15 already converted resources. If resources cannot be made available for download, the conversion tools will still allow users with access to these resources to import them into UBY easily. In this way, it will be possible for users to build their custom UBY containing selected resources. As the underlying resources are subject to continuous change, updates of the corresponding components will be made available on a regular basis. 7 Conclusions We presented UBY, a large-scale, standardized LSR containing nine widely used resources in two languages: English WN, WKT-en, WP-en, FN and VN, German WP-de, WKT-de, and GN, and OW in English and German. As all resources are modeled in UBY-LMF, UBY enables structural interoperability across resources and languages down to a fine-grained level of information. For FN, VN and all of the CCRs in English and German, this is done for the first time. Besides, by integrating sense alignments we also enable the lexical-semantic interoperability of resources. We presented a unified framework for aligning any LSRs pairwise and reported on experiments which align OW-de and WN. We will release the UBY-LMF model, the resource and the UBY-API at the time of publication. 16 Due to the added value and the large scale of UBY, as well as its ease of use, we believe UBY will boost the performance of NLP making use of lexical-semantic knowledge. Acknowledgments This work has been supported by the Emmy Noether Program of the German Research Foundation (DFG) under grant No. GU 798/3-1 and by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/ We thank Richard Eckart de Castilho, Yevgen Chebotar, Zijad Maksuti and Tri Duc Nghiem for their contributions to this project. References Jordi Atserias, Luís Villarejo, German Rigau, Eneko Agirre, John Carroll, Bernardo Magnini, and Piek 15 Only GermaNet is subject to a restricted license and cannot be redistributed in UBY format. 16
10 Vossen The Meaning Multilingual Central Repository. In Proceedings of the second international WordNet Conference (GWC 2004), pages 23 30, Brno, Czech Republic. Collin F. Baker and Christiane Fellbaum Word- Net and FrameNet as complementary resources for annotation. In Proceedings of the Third Linguistic Annotation Workshop, ACL-IJCNLP 09, pages , Suntec, Singapore. Collin F. Baker, Charles J. Fillmore, and John B. Lowe The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL 98, pages 86 90, Montreal, Canada. Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann DBpedia A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, (7): Daan Broeder, Marc Kemps-Snijders, Dieter Van Uytvanck, Menzo Windhouwer, Peter Withers, Peter Wittenburg, and Claus Zinn A Data Category Registry- and Component-based Metadata Framework. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC), pages 43 47, Valletta, Malta. Paul Buitelaar, Philipp Cimiano, Peter Haase, and Michael Sintek Towards Linguistically Grounded Ontologies. In Lora Aroyo, Paolo Traverso, Fabio Ciravegna, Philipp Cimiano, Tom Heath, Eero Hyvönen, Riichiro Mizoguchi, Eyal Oren, Marta Sabou, and Elena Simperl, editors, The Semantic Web: Research and Applications, pages , Berlin/Heidelberg, Germany. Springer. Gerard de Melo and Gerhard Weikum Towards a universal wordnet by learning from combined evidence. In Proceedings of the 18th ACM conference on Information and knowledge management (CIKM 09), CIKM 09, pages , New York, NY, USA. ACM. Christiane Fellbaum WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, USA. Charles J. Fillmore Frame Semantics. In The Linguistic Society of Korea, editor, Linguistics in the Morning Calm, pages Hanshin Publishing Company, Seoul, Korea. Gil Francopoulo, Nuria Bel, Monte George, Nicoletta Calzolari, Monica Monachini, Mandy Pet, and Claudia Soria Lexical Markup Framework (LMF). In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages , Genoa, Italy. Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten The WEKA Data Mining Software: An Update. ACM SIGKDD Explorations Newsletter, 11(1): Verena Henrich and Erhard Hinrichs Standardizing wordnets in the ISO standard LMF: Wordnet- LMF for GermaNet. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages , Beijing, China. Richard Johansson and Pierre Nugues Using WordNet to extend FrameNet coverage. In Proceedings of the Workshop on Building Framesemantic Resources for Scandinavian and Baltic Languages, at NODALIDA, pages 27 30, Tartu, Estonia. Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer A Large-scale Classification of English Verbs. Language Resources and Evaluation, 42: Claudia Kunze and Lothar Lemnitzer GermaNet representation, visualization, application. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC), pages , Las Palmas, Canary Islands, Spain. Beth Levin English Verb Classes and Alternations. The University of Chicago Press, Chicago, IL, USA. Michael Matuschek and Iryna Gurevych Where the journey is headed: Collaboratively constructed multilingual Wiki-based resources. In SFB 538: Mehrsprachigkeit, editor, Hamburger Arbeiten zur Mehrsprachigkeit, Hamburg, Germany. John McCrae, Dennis Spohr, and Philipp Cimiano Linking Lexical Resources and Ontologies on the Semantic Web with Lemon. In The Semantic Web: Research and Applications, volume 6643 of Lecture Notes in Computer Science, pages Springer, Berlin/Heidelberg, Germany. Clifton J. McFate and Kenneth D. Forbus NULEX: an open-license broad coverage lexicon. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, HLT 11, pages , Portland, OR, USA. Christian M. Meyer and Iryna Gurevych Worth its Weight in Gold or Yet Another Resource A Comparative Study of Wiktionary, OpenThesaurus and GermaNet. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing: 11th International Conference, volume 6008 of Lecture Notes in Computer Science, pages Berlin/Heidelberg: Springer, Iaşi, Romania. Christian M. Meyer and Iryna Gurevych What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain
11 Coverage. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), pages , Chiang Mai, Thailand. Roberto Navigli and Simone Paolo Ponzetto. 2010a. BabelNet: Building a Very Large Multilingual Semantic Network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages , Uppsala, Sweden, July. Roberto Navigli and Simone Paolo Ponzetto. 2010b. Knowledge-rich Word Sense Disambiguation Rivaling Supervised Systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages , Uppsala, Sweden. Roberto Navigli Meaningful Clustering of Senses Helps Boost Word Sense Disambiguation Performance. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), pages , Sydney, Australia. Elisabeth Niemann and Iryna Gurevych The People s Web meets Linguistic Knowledge: Automatic Sense Alignment of Wikipedia and WordNet. In Proceedings of the 9th International Conference on Computational Semantics (IWCS), pages , Oxford, UK. Muntsa Padró, Núria Bel, and Silvia Necsulescu Towards the Automatic Merging of Lexical Resources: Automatic Mapping. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), pages , Hissar, Bulgaria. Martha Palmer Semlink: Linking PropBank, VerbNet and FrameNet. In Proceedings of the Generative Lexicon Conference (GenLex-09), pages 9 15, Pisa, Italy. Sameer S. Pradhan, Eduard Hovy, Mitch Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel OntoNotes: A Unified Relational Semantic Representation. In Proceedings of the International Conference on Semantic Computing, pages , Washington, DC, USA. Lei Shi and Rada Mihalcea Putting Pieces Together: Combining FrameNet, VerbNet and Word- Net for Robust Semantic Parsing. In Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics (CI- CLing), pages , Mexico City, Mexico. Claudia Soria, Monica Monachini, and Piek Vossen Wordnet-LMF: fleshing out a standardized format for Wordnet interoperability. In Proceedings of the 2009 International Workshop on Intercultural Collaboration, pages , Palo Alto, CA, USA. Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum Yago: A Core of Semantic Knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages , Banff, Canada. Antonio Toral, Stefania Bracale, Monica Monachini, and Claudia Soria Rejuvenating the Italian WordNet: Upgrading, Standarising, Extending. In Proceedings of the 5th Global WordNet Conference (GWC), Bombay, India. Piek Vossen, editor EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht, Netherlands.
An Efficient Database Design for IndoWordNet Development Using Hybrid Approach
An Efficient Database Design for IndoWordNet Development Using Hybrid Approach Venkatesh P rabhu 2 Shilpa Desai 1 Hanumant Redkar 1 N eha P rabhugaonkar 1 Apur va N agvenkar 1 Ramdas Karmali 1 (1) GOA
ONTOLOGIES A short tutorial with references to YAGO Cosmina CROITORU
ONTOLOGIES p. 1/40 ONTOLOGIES A short tutorial with references to YAGO Cosmina CROITORU Unlocking the Secrets of the Past: Text Mining for Historical Documents Blockseminar, 21.2.-11.3.2011 ONTOLOGIES
Comprendium Translator System Overview
Comprendium System Overview May 2004 Table of Contents 1. INTRODUCTION...3 2. WHAT IS MACHINE TRANSLATION?...3 3. THE COMPRENDIUM MACHINE TRANSLATION TECHNOLOGY...4 3.1 THE BEST MT TECHNOLOGY IN THE MARKET...4
Building the Multilingual Web of Data: A Hands-on tutorial (ISWC 2014, Riva del Garda - Italy)
Building the Multilingual Web of Data: A Hands-on tutorial (ISWC 2014, Riva del Garda - Italy) Multilingual Word Sense Disambiguation and Entity Linking on the Web based on BabelNet Roberto Navigli, Tiziano
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
Construction of Thai WordNet Lexical Database from Machine Readable Dictionaries
Construction of Thai WordNet Lexical Database from Machine Readable Dictionaries Patanakul Sathapornrungkij Department of Computer Science Faculty of Science, Mahidol University Rama6 Road, Ratchathewi
Exploiting Comparable Corpora and Bilingual Dictionaries. the Cross Language Text Categorization
Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization Alfio Gliozzo and Carlo Strapparava ITC-Irst via Sommarive, I-38050, Trento, ITALY {gliozzo,strappa}@itc.it
Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED
Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED 17 19 June 2013 Monday 17 June Salón de Actos, Facultad de Psicología, UNED 15.00-16.30: Invited talk Eneko Agirre (Euskal Herriko
Domain Adaptive Relation Extraction for Big Text Data Analytics. Feiyu Xu
Domain Adaptive Relation Extraction for Big Text Data Analytics Feiyu Xu Outline! Introduction to relation extraction and its applications! Motivation of domain adaptation in big text data analytics! Solutions!
Aligning Word Senses in GermaNet and the DWDS Dictionary of the German Language
Aligning Word Senses in GermaNet and the DWDS Dictionary of the German Language Verena Henrich, Erhard Hinrichs, Reinhild Barkey Department of Linguistics University of Tübingen, Germany {vhenrich,eh,rbarkey}@sfs.uni-tuebingen.de
Approaches of Using a Word-Image Ontology and an Annotated Image Corpus as Intermedia for Cross-Language Image Retrieval
Approaches of Using a Word-Image Ontology and an Annotated Image Corpus as Intermedia for Cross-Language Image Retrieval Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information
Central and South-East European Resources in META-SHARE
Central and South-East European Resources in META-SHARE Tamás VÁRADI 1 Marko TADIĆ 2 (1) RESERCH INSTITUTE FOR LINGUISTICS, MTA, Budapest, Hungary (2) FACULTY OF HUMANITIES AND SOCIAL SCIENCES, ZAGREB
DISCOVERING RESUME INFORMATION USING LINKED DATA
DISCOVERING RESUME INFORMATION USING LINKED DATA Ujjal Marjit 1, Kumar Sharma 2 and Utpal Biswas 3 1 C.I.R.M, University Kalyani, Kalyani (West Bengal) India [email protected] 2 Department of Computer
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School
CLOVA: An Architecture for Cross-Language Semantic Data Querying
CLOVA: An Architecture for Cross-Language Semantic Data Querying John McCrae Semantic Computing Group, CITEC University of Bielefeld Bielefeld, Germany [email protected] Jesús R. Campaña Department
Clustering Connectionist and Statistical Language Processing
Clustering Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised
Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources
Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources Michelle
Interchanging lexical resources on the Semantic Web
Interchanging lexical resources on the Semantic Web John McCrae, Guadalupe Aguado-de- Cea, Paul Buitelaar, Philipp Cimiano, Thierry Declerck, Asunción Gómez- Pérez, Jorge Gracia, Laura Hollink, et al.
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation. Miguel Ruiz, Anne Diekema, Páraic Sheridan MNIS-TextWise Labs Dey Centennial Plaza 401 South Salina Street Syracuse, NY 13202 Abstract:
Chapter 8. Final Results on Dutch Senseval-2 Test Data
Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised
CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet
CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet Muhammad Atif Qureshi 1,2, Arjumand Younus 1,2, Colm O Riordan 1,
Collecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
Multilingual and Localization Support for Ontologies
Multilingual and Localization Support for Ontologies Mauricio Espinoza, Asunción Gómez-Pérez and Elena Montiel-Ponsoda UPM, Laboratorio de Inteligencia Artificial, 28660 Boadilla del Monte, Spain {jespinoza,
Semantic Search in Portals using Ontologies
Semantic Search in Portals using Ontologies Wallace Anacleto Pinheiro Ana Maria de C. Moura Military Institute of Engineering - IME/RJ Department of Computer Engineering - Rio de Janeiro - Brazil [awallace,anamoura]@de9.ime.eb.br
Visualizing WordNet Structure
Visualizing WordNet Structure Jaap Kamps Abstract Representations in WordNet are not on the level of individual words or word forms, but on the level of word meanings (lexemes). A word meaning, in turn,
Package wordnet. January 6, 2016
Title WordNet Interface Version 0.1-11 Package wordnet January 6, 2016 An interface to WordNet using the Jawbone Java API to WordNet. WordNet () is a large lexical database
Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets
Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets Maria Ruiz-Casado, Enrique Alfonseca and Pablo Castells Computer Science Dep., Universidad Autonoma de Madrid, 28049 Madrid, Spain
Semantic annotation of requirements for automatic UML class diagram generation
www.ijcsi.org 259 Semantic annotation of requirements for automatic UML class diagram generation Soumaya Amdouni 1, Wahiba Ben Abdessalem Karaa 2 and Sondes Bouabid 3 1 University of tunis High Institute
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS
LiDDM: A Data Mining System for Linked Data
LiDDM: A Data Mining System for Linked Data Venkata Narasimha Pavan Kappara Indian Institute of Information Technology Allahabad Allahabad, India [email protected] Ryutaro Ichise National Institute of
An Ontology Based Method to Solve Query Identifier Heterogeneity in Post- Genomic Clinical Trials
ehealth Beyond the Horizon Get IT There S.K. Andersen et al. (Eds.) IOS Press, 2008 2008 Organizing Committee of MIE 2008. All rights reserved. 3 An Ontology Based Method to Solve Query Identifier Heterogeneity
Application of ontologies for the integration of network monitoring platforms
Application of ontologies for the integration of network monitoring platforms Jorge E. López de Vergara, Javier Aracil, Jesús Martínez, Alfredo Salvador, José Alberto Hernández Networking Research Group,
LDIF - Linked Data Integration Framework
LDIF - Linked Data Integration Framework Andreas Schultz 1, Andrea Matteini 2, Robert Isele 1, Christian Bizer 1, and Christian Becker 2 1. Web-based Systems Group, Freie Universität Berlin, Germany [email protected],
WordNet Website Development And Deployment Using Content Management Approach
WordNet Website Development And Deployment Using Content Management Approach N eha R P rabhugaonkar 1 Apur va S N ag venkar 1 Venkatesh P P rabhu 2 Ramdas N Karmali 1 (1) GOA UNIVERSITY, Taleigao - Goa
Overview of MT techniques. Malek Boualem (FT)
Overview of MT techniques Malek Boualem (FT) This section presents an standard overview of general aspects related to machine translation with a description of different techniques: bilingual, transfer,
Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base
Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base Aitor González Agirre, Egoitz Laparra, German Rigau Basque Country University Donostia, Basque Country {aitor.gonzalez,
Building a Spanish MMTx by using Automatic Translation and Biomedical Ontologies
Building a Spanish MMTx by using Automatic Translation and Biomedical Ontologies Francisco Carrero 1, José Carlos Cortizo 1,2, José María Gómez 3 1 Universidad Europea de Madrid, C/Tajo s/n, Villaviciosa
Hybrid Strategies. for better products and shorter time-to-market
Hybrid Strategies for better products and shorter time-to-market Background Manufacturer of language technology software & services Spin-off of the research center of Germany/Heidelberg Founded in 1999,
Search Result Diversification Methods to Assist Lexicographers
Search Result Diversification Methods to Assist Lexicographers Lars Borin Markus Forsberg Karin Friberg Heppin Richard Johansson Annika Kjellandsson Språkbanken, Department of Swedish, University of Gothenburg
WebLicht: Web-based LRT services for German
WebLicht: Web-based LRT services for German Erhard Hinrichs, Marie Hinrichs, Thomas Zastrow Seminar für Sprachwissenschaft, University of Tübingen [email protected] Abstract This software
INF5820 Natural Language Processing - NLP. H2009 Jan Tore Lønning [email protected]
INF5820 Natural Language Processing - NLP H2009 Jan Tore Lønning [email protected] Semantic Role Labeling INF5830 Lecture 13 Nov 4, 2009 Today Some words about semantics Thematic/semantic roles PropBank &
Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words
, pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan
Evaluation of a layered approach to question answering over linked data
Evaluation of a layered approach to question answering over linked data Sebastian Walter 1, Christina Unger 1, Philipp Cimiano 1, and Daniel Bär 2 1 CITEC, Bielefeld University, Germany http://www.sc.cit-ec.uni-bielefeld.de/
Cross-Language Information Retrieval by Domain Restriction using Web Directory Structure
Cross-Language Information Retrieval by Domain Restriction using Web Directory Structure Fuminori Kimura Faculty of Culture and Information Science, Doshisha University 1 3 Miyakodani Tatara, Kyoutanabe-shi,
The Prolog Interface to the Unstructured Information Management Architecture
The Prolog Interface to the Unstructured Information Management Architecture Paul Fodor 1, Adam Lally 2, David Ferrucci 2 1 Stony Brook University, Stony Brook, NY 11794, USA, [email protected] 2 IBM
A Software Tool for Thesauri Management, Browsing and Supporting Advanced Searches
J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria 105 A Software Tool for Thesauri Management, Browsing and Supporting Advanced Searches J. Nogueras-Iso, J.A. Bañares, J. Lacasta, J. Zarazaga-Soria
Using Semantic Data Mining for Classification Improvement and Knowledge Extraction
Using Semantic Data Mining for Classification Improvement and Knowledge Extraction Fernando Benites and Elena Sapozhnikova University of Konstanz, 78464 Konstanz, Germany. Abstract. The objective of this
ONTOLOGY FOR MOBILE PHONE OPERATING SYSTEMS
ONTOLOGY FOR MOBILE PHONE OPERATING SYSTEMS Hasni Neji and Ridha Bouallegue Innov COM Lab, Higher School of Communications of Tunis, Sup Com University of Carthage, Tunis, Tunisia. Email: [email protected];
Representing Specialized Events with FrameBase
Representing Specialized Events with FrameBase Jacobo Rouces 1, Gerard de Melo 2, and Katja Hose 1 1 Aalborg University, Denmark [email protected], [email protected] 2 Tsinghua University, China [email protected]
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata
Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive
Survey Results: Requirements and Use Cases for Linguistic Linked Data
Survey Results: Requirements and Use Cases for Linguistic Linked Data 1 Introduction This survey was conducted by the FP7 Project LIDER (http://www.lider-project.eu/) as input into the W3C Community Group
Customer Intentions Analysis of Twitter Based on Semantic Patterns
Customer Intentions Analysis of Twitter Based on Semantic Patterns Mohamed Hamroun [email protected] Mohamed Salah Gouider [email protected] Lamjed Ben Said [email protected] ABSTRACT
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 INTELLIGENT MULTIDIMENSIONAL DATABASE INTERFACE Mona Gharib Mohamed Reda Zahraa E. Mohamed Faculty of Science,
Building a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
Parsing Software Requirements with an Ontology-based Semantic Role Labeler
Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh [email protected] Ewan Klein University of Edinburgh [email protected] Abstract Software
The open lexical infrastructure of Språkbanken
The open lexical infrastructure of Språkbanken Lars Borin, Markus Forsberg, Leif-Jöran Olsson and Jonatan Uppström Språkbanken, University of Gothenburg, Sweden {lars.borin, markus.forsberg, leif-joran.olsson,
Kybots, knowledge yielding robots German Rigau IXA group, UPV/EHU http://ixa.si.ehu.es
KYOTO () Intelligent Content and Semantics Knowledge Yielding Ontologies for Transition-Based Organization http://www.kyoto-project.eu/ Kybots, knowledge yielding robots German Rigau IXA group, UPV/EHU
2014/02/13 Sphinx Lunch
2014/02/13 Sphinx Lunch Best Student Paper Award @ 2013 IEEE Workshop on Automatic Speech Recognition and Understanding Dec. 9-12, 2013 Unsupervised Induction and Filling of Semantic Slot for Spoken Dialogue
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System
Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System Athira P. M., Sreeja M. and P. C. Reghuraj Department of Computer Science and Engineering, Government Engineering
Lightweight Data Integration using the WebComposition Data Grid Service
Lightweight Data Integration using the WebComposition Data Grid Service Ralph Sommermeier 1, Andreas Heil 2, Martin Gaedke 1 1 Chemnitz University of Technology, Faculty of Computer Science, Distributed
Mining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMiner Petar Ristoski, Christian Bizer, and Heiko Paulheim University of Mannheim, Germany Data and Web Science Group {petar.ristoski,heiko,chris}@informatik.uni-mannheim.de
PoS-tagging Italian texts with CORISTagger
PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy [email protected] Abstract. This paper presents an evolution of CORISTagger [1], an high-performance
How To Use Networked Ontology In E Health
A practical approach to create ontology networks in e-health: The NeOn take Tomás Pariente Lobo 1, *, Germán Herrero Cárcel 1, 1 A TOS Research and Innovation, ATOS Origin SAE, 28037 Madrid, Spain. Abstract.
Mapping a Traditional Dialectal Dictionary with Linked Open Data
Mapping a Traditional Dialectal Dictionary with Linked Open Data Eveline Wandl-Vogt 1, Thierry Declerck 1,2 1 Institute for Corpus Linguistics and Text Technology, Austrian Academy of Sciences Sonnenfelsgasse
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, [email protected] Abstract: Independent
Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde
Statistical Verb-Clustering Model soft clustering: Verbs may belong to several clusters trained on verb-argument tuples clusters together verbs with similar subcategorization and selectional restriction
Discovering and Querying Hybrid Linked Data
Discovering and Querying Hybrid Linked Data Zareen Syed 1, Tim Finin 1, Muhammad Rahman 1, James Kukla 2, Jeehye Yun 2 1 University of Maryland Baltimore County 1000 Hilltop Circle, MD, USA 21250 [email protected],
THE SEMANTIC WEB AND IT`S APPLICATIONS
15-16 September 2011, BULGARIA 1 Proceedings of the International Conference on Information Technologies (InfoTech-2011) 15-16 September 2011, Bulgaria THE SEMANTIC WEB AND IT`S APPLICATIONS Dimitar Vuldzhev
Comparing Ontology-based and Corpusbased Domain Annotations in WordNet.
Comparing Ontology-based and Corpusbased Domain Annotations in WordNet. A paper by: Bernardo Magnini Carlo Strapparava Giovanni Pezzulo Alfio Glozzo Presented by: rabee ali alshemali Motive. Domain information
Natural Language Database Interface for the Community Based Monitoring System *
Natural Language Database Interface for the Community Based Monitoring System * Krissanne Kaye Garcia, Ma. Angelica Lumain, Jose Antonio Wong, Jhovee Gerard Yap, Charibeth Cheng De La Salle University
How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.
Svetlana Sokolova President and CEO of PROMT, PhD. How the Computer Translates Machine translation is a special field of computer application where almost everyone believes that he/she is a specialist.
Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery
Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery Jan Paralic, Peter Smatana Technical University of Kosice, Slovakia Center for
Processing: current projects and research at the IXA Group
Natural Language Processing: current projects and research at the IXA Group IXA Research Group on NLP University of the Basque Country Xabier Artola Zubillaga Motivation A language that seeks to survive
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,
LEXUS: a web based lexicon tool
LEXUS: a web based lexicon tool Jacquelijn Ringersma Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands Content Max Planck Institute Archive of linguistic resources Tool support (archiving
A Semantic web approach for e-learning platforms
A Semantic web approach for e-learning platforms Miguel B. Alves 1 1 Laboratório de Sistemas de Informação, ESTG-IPVC 4900-348 Viana do Castelo. [email protected] Abstract. When lecturers publish contents
Simplifying e Business Collaboration by providing a Semantic Mapping Platform
Simplifying e Business Collaboration by providing a Semantic Mapping Platform Abels, Sven 1 ; Sheikhhasan Hamzeh 1 ; Cranner, Paul 2 1 TIE Nederland BV, 1119 PS Amsterdam, Netherlands 2 University of Sunderland,
Implementing Ontology-based Information Sharing in Product Lifecycle Management
Implementing Ontology-based Information Sharing in Product Lifecycle Management Dillon McKenzie-Veal, Nathan W. Hartman, and John Springer College of Technology, Purdue University, West Lafayette, Indiana
HOPS Project presentation
HOPS Project presentation Enabling an Intelligent Natural Language Based Hub for the Deployment of Advanced Semantically Enriched Multi-channel Mass-scale Online Public Services IST-2002-507967 (HOPS)
Multilingual Event Detection using the NewsReader Pipelines
Multilingual Event Detection using the NewsReader Pipelines Rodrigo Agerri, Itziar Aldabe, Egoitz Laparra, German Rigau Antske Fokkens 3, Paul Huijgen 3, Ruben Izquierdo 3, Marieke van Erp 3, Piek Vossen
Network Working Group
Network Working Group Request for Comments: 2413 Category: Informational S. Weibel OCLC Online Computer Library Center, Inc. J. Kunze University of California, San Francisco C. Lagoze Cornell University
Talend Metadata Manager. Reduce Risk and Friction in your Information Supply Chain
Talend Metadata Manager Reduce Risk and Friction in your Information Supply Chain Talend Metadata Manager Talend Metadata Manager provides a comprehensive set of capabilities for all facets of metadata
Mahesh Srinivasan. Assistant Professor of Psychology and Cognitive Science University of California, Berkeley
Department of Psychology University of California, Berkeley Tolman Hall, Rm. 3315 Berkeley, CA 94720 Phone: (650) 823-9488; Email: [email protected] http://ladlab.ucsd.edu/srinivasan.html Education
A View Integration Approach to Dynamic Composition of Web Services
A View Integration Approach to Dynamic Composition of Web Services Snehal Thakkar, Craig A. Knoblock, and José Luis Ambite University of Southern California/ Information Sciences Institute 4676 Admiralty
Czech Verbs of Communication and the Extraction of their Frames
Czech Verbs of Communication and the Extraction of their Frames Václava Benešová and Ondřej Bojar Institute of Formal and Applied Linguistics ÚFAL MFF UK, Malostranské náměstí 25, 11800 Praha, Czech Republic
