Human Language Technology Research and the Development of the Brazilian Portuguese Wordnet
Bento Carlos DIAS-DA-SILVA
Faculdade de Ciências e Letras, Universidade Estadual Paulista, Rodovia Araraquara-Jau Km 1, Araraquara, São Paulo, Brazil
[email protected]

Abstract: This paper discusses particular linguistic challenges in the task of compiling the Brazilian Portuguese Wordnet, the Wordnet.Br. After setting the scene by overviewing methodological issues, it focuses on the basic steps taken to compile the Wordnet.Br core database: a machine-tractable thesaurus-like lexical database. The discussion is split between three domains: the Linguistic Domain, the Representation Domain, and the Computational Domain.

1. Human Language Technology and Linguistics

Human Language Technology research (henceforth HLT) has grown enormously since the pioneers of machine translation recognized, in the early 1950s, the potential for building computer models of natural language understanding and generation. As a result, natural language processing (henceforth NLP) has become a discipline in ferment that gathers researchers with a wide range of backgrounds and interests, who emphasize its diverse aspects and employ manifold methods and techniques. Despite the enthusiasm, there have been drawbacks, some of which are due either to a lack of appreciation for the complexity of natural languages or to an underestimation of the complexity of the task itself, which reveals a disturbing gap between HLT and Linguistics. Furthermore, Linguistics has either disregarded computational issues altogether or provided the ammunition to deaden the enthusiastic development of NLP technologies.
On the one hand, the HLT challenge is to develop both user-visible NLP applications (e.g., spell and grammar checkers, machine translation systems, information retrieval systems, and text/speech synthesis and recognition systems) and user-transparent NLP components (e.g., grammars, parsers, tree-banks, lexicons, and lexical resources). On the other hand, the NLP task is to emulate a particular type of knowledge processing system in which complex linguistic and extra-linguistic pieces of knowledge are formally represented and electronically applied to perform a number of linguistic as well as metalinguistic tasks: check spelling and grammar, analyze morphological and syntactic structures, understand and produce texts, translate words, sentences and texts, ask and answer questions, and help linguists themselves develop their own linguistic models (Dias-da-Silva, 1996).

This research is sponsored by CNPq-Brasília and FAPESP-São Paulo, Brazil.

We assume a compromise between HLT and Linguistics and, based on the Artificial Intelligence notion of Knowledge Representation Systems (Hayes-Roth, 1990; Durkin, 1994), propose a three-domain approach methodology. It claims that the linguistic knowledge (i.e., linguistic information) needed to feed NLP systems must, like a rare metal, be mined, molded, and assembled. This amounts to saying that the process of designing and implementing NLP systems (i.e., the HLT research itself) should comprise the following iterative and evolutionary phases of analysis in three complementary domains:

The Linguistic Domain (the mining phase), where the relevant general linguistic information and usage is elicited;
The Representation Domain (the molding phase), where the computer-tractable representation of that information is dealt with;
The Computational Domain (the assembling phase), where the computational encoding of the resulting representation into and by means of computer programs is tackled.
Accordingly, the process of implementing the Wordnet.Br has been split between three complementary domains. This paper, in particular, resorts to the three-domain approach methodology to discuss the initial compilation stage of the Brazilian Portuguese Wordnet (henceforth Wordnet.Br): the task of sorting over 44,000 Brazilian Portuguese words into a machine-tractable thesaurus-like lexical database (henceforth the Wordnet.Br core database), which is the building block of Wordnet.Br, after Princeton's WordNet (with capital "N") and EuroWordNet.[1] In the Linguistic Domain, basic notions of thesaurus and meaning similarity are set up, together with strategies for reusing published dictionaries as a reference corpus and for mining synonym sets; in the Representation Domain, the representation scheme for lexical meanings and sense relations is established, plus the overall lexical database design; and in the Computational Domain, the editing tool and the statistics of the Wordnet.Br core database are sketched.

[1] Future work will include the specification of glosses for each synset and of hyponymy and meronymy relations between those synonym sets.

2. The Linguistic Domain

2.1 The Thesaurus Denotations

In what follows, we survey the denotations of the term thesaurus in Brazilian Portuguese and single out the one we had in mind when we embarked on the compilation of our computerized lexical resource. This specification turned out to be necessary because different specialists have used the term thesaurus to denote at least six different objects (Flexner, 1997; Lutz, 1994; Neufeldt, 1997; Roget, 1953):

1. An inventory of the vocabulary items in use in a particular language;
2. A thematically based dictionary, i.e., an onomasiologic dictionary;
3. A dictionary containing a store of synonyms and antonyms;
4. An index to information stored in a computer, consisting of a comprehensive list of subjects concerning which information may be retrieved by using the proper key terms;
5. A file containing a store of synonyms that are displayed to the user during the automatic proofreading process;
6. A dictionary of synonyms and antonyms stored in memory for use in word processing.

The Wordnet.Br core database is an instance of Object 6.
2.2 Synonymy and Similarity of Meaning

The Wordnet.Br core database compilation process benefited from two key WordNet notions: the synset and the lexical matrix. It is common ground that absolute synonyms are rare in language, if they exist at all. Thus, the notion of synset is derived from the conception of the symmetrical relation of meaning similarity, for "theories of lexical semantics do not depend on truth-functional conceptions of synonymy: semantic similarity is sufficient", and synonymy proper is understood as "simply one end of a continuum on which similarity of meaning can be graded" (Miller and Fellbaum, 1991, p. 202).

2.3 Reusability of Published Dictionaries and the Reference Corpus

The compilation of a bulky dictionary is a time-consuming activity that requires a team of more than fifty lexicographers, each responsible for (i) selecting the headwords that will head the dictionary entries, (ii) defining the number of senses for each headword, and (iii) exemplifying the senses with sentences and expressions from a selected corpus. Dictionary entries specify a cluster of information about words: orthographical, phonological, etymological, morphological, syntactic, definitional, collocational, variational, and register information, as well as sense relations such as synonymy and antonymy. Dictionaries make extensive use of synonyms and antonyms to define headwords. Lexicographers are also aware that compiling dictionary entries involves hard decisions about how to deal with polysemy and homonymy. In other words, they have to decide whether to lump or split word senses, or whether to create fresh entries for the same word form.

Such decisions, however, are arbitrary, for lexicographers rely on their own personal experience and expertise to make them; and that is probably the only way they manage to compile their unique store of words. Thus, reusing lexicographical information requires caution.
It must be stressed, though, that if we want to use dictionary information in natural language processing projects, it must be mined and filtered carefully.[2] The advent of computers has allowed lexicographers to use machine-readable, large-scale corpora in their work, establishing procedures as follows (Stubbs, 2001): (a) to gather concordances from the corpus; (b) to cluster the concordances around nuclear sense clusters; (c) to lump or split nuclear clusters; (d) to encode the relevant lexical information by means of the highly constrained language of dictionary definitions.

[2] Acquiring such information is a hard problem and has usually been approached by reusing, merging, and tuning existing lexical material. This initiative has been frequently reported in the literature (see Kilgarriff, 1993, 1997, and the papers cited therein).

Given our small team of researchers and the two-year time frame stipulated for the project, we bypassed those procedures and decided to reuse five outstanding published dictionaries of Brazilian Portuguese, which were chosen for the following reasons: (i) they are "fruits of the cumulative wisdom of generations of lexicographers" and offer "sheer breadth of coverage" (to borrow Kilgarriff's words, 1993, p. 365); (ii) the relevant sense relations that one of the five dictionaries registers can be complemented by similar pieces of information found in the other four; (iii) instead of using the Aristotelian analytical definition (i.e., genus and differentiae) to define word senses, they make extensive use of synonyms and antonyms in their definitions, a feature that helped speed up the process of collecting and selecting thousands of synonym and antonym word forms.

Two of them, Ferreira (1999) and Weiszflog (1998), are the most traditional and bulkiest Brazilian Portuguese dictionaries. Their electronic versions further sped up the process of synonym and antonym mining. Barbosa (1999) and Fernandes (1997) are specific dictionaries of synonyms and antonyms, and were used as complementary material.

The fifth dictionary is a dictionary of verbs (Borba, 1990) that uses a Chafe-based semantic classification of verbs (Chafe, 1970). For each verb entry, Borba's dictionary registers the relevant category ("state", "action", "process", or "action-process"), its sense definitions, when available, its synonyms, its grammatical features, its potential argument structures, its selectional restrictions, and sample sentences extracted from corpora. Such specificity helped fine-tune the process of compiling the verb synsets.
2.4 Dictionary Sense Distinctions and Leading Strategies

At the heart of the task of compiling dictionaries for the general public is the specification of word sense distinctions. On analyzing the LDOCE entries (Summers, 1995), Kilgarriff (1993) categorized four general types of sense distinctions made by lexicographers.

"Generalizing Metaphors", i.e., a sense that is the generalization of a specific sense. For example: martelar (to hammer)
sense 1: hit with a hammer (Core meaning)
sense 2: insist (Generalizing meaning)

"Must-be-theres", i.e., one of the senses is a logical consequence of the other. For example: casamento (marriage)
sense 1: the event of getting married (Event)
sense 2: the subsequent state of being married (Resulting state)

"Domain Shift", i.e., a sense that extends the "original" sense to different domains. For example: leve (light)
sense 1: not heavy, with little weight (Mass dimension)
sense 2: nimble, agile (Kinetic dimension)

"Natural and social kinds", i.e., the different word senses apply to world entities or situations that have many attributes in common, but belong to different classes of things. For example: asa (wing)
sense 1: a bird's wing (Natural)
sense 2: an airplane wing (Social)

Besides being aware of these sense distinctions, our team of linguists observed the following leading strategies:

Checking whether particular grammatical or semantic features made it necessary to lump synonym sets together or to split them (necessity strategy);
Checking the symmetry property of both synonymy and antonymy (consistency strategy);
Checking how wide the sense variation of a lexical unit was, so that new senses would be posited (centrality strategy).

3. The Representation Domain

3.1 The Synset and the Lexical Matrix Constructs

The Wordnet.Br core database compilation process benefited from the two key WordNet constructs: the synset and the lexical matrix. It is common ground that absolute synonyms are rare in language, if they exist at all. Thus, the notion of synset is derived from the aforementioned conception of the symmetrical relation of meaning similarity. Miller and Fellbaum (1991) argue that each synset is a set made up of semantically similar words that serve as unambiguous designators of meanings; they also assume that a speaker of a language has mastered collections of concepts and is expected to recognize them from the words that make up the synsets. The notion of lexical matrix, in turn, is intended to capture the "many to many" associations between form and meaning. In other words, it is conceived of as a mapping between written words (form representations) and synsets (meaning representations).

After adopting the key WordNet notions, the linguists embarked on the process of mining synsets. The best way to understand how the compilers "mined" for synonyms in the reference corpus is to follow a real example. Let us take, as the starting point of the mining process, the verb lembrar (English: "to remember"). Weiszflog (1998) distinguishes seven senses. After collecting the synonyms, and disregarding their definitions, the following synonym sets could be compiled:

1. {lembrar, recordar} (English: {"to remember", "to recall"})
2. {lembrar, advertir, notar} (English: {"to remember", "to warn", "to notify"})
3. {lembrar, sugerir} (English: {"to suggest", "to evoke", "to hint"})
4. {lembrar, recomendar} (English: {"to remember", "to commend"})

After that preliminary analysis, the linguist checked the consistency of the four synonym sets by looking up the dictionary synonym entries for the remaining five verbs: recordar, advertir, notar, sugerir, and recomendar. For example, the linguist looked up the dictionary entry for the verb recordar. Its first sense is given by the paraphrase trazer à memória (English: "to call back to memory"), and its fourth sense by the synonym lembrar. As these two senses are very close, and the examples confirm the similarity between the two, synonym set 1 was said to be consistent. The very same process was repeated for every verb listed above until the list was exhausted. The analytical cycle then began again by collecting the synonyms from the next dictionary entry in alphabetical order.

It should be pointed out that, when the linguist analyzed the verb esquecer (English: "to forget"), the canonical Brazilian Portuguese antonym for lembrar, he found only one synonym for it: the verb olvidar (Vulgar Latin: "oblitare"; English: "to efface"). So, after the consistency analysis, the following synonym set was compiled:

5. {esquecer, olvidar}

The dictionary also registers this antonymy indirectly: lembrar and esquecer are defined by means of the paraphrases trazer à memória and perder a memória de (English: "to stop remembering"), respectively. Thus, cross-referencing the entries confirmed the antonymic pair (lembrar, esquecer), which stresses the importance of examining paraphrases carefully. Just for the record, the synonym set (6) and its antonym set (7) are transcribed below:

6. {amentar2, comemorar, ementar, escordar1, lembrar, memorar, reconstituir, recordar, relembrar, rememorar, rever1, revisitar, reviver, revivescer, ver}
7. {deslembrar, desmemoriar, esquecer, olvidar}
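The symmetry part of the consistency check walked through above can be approximated mechanically. The following is only a minimal sketch, not the procedure the linguists actually used: the dictionary excerpt is a hypothetical toy (not transcribed from Weiszflog, 1998), the function names are illustrative, and the judgement about paraphrases and sample sentences that the paper stresses is left to a human reviewer. The code merely flags pairs in a candidate synset where one verb's entry fails to register the other.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy excerpt: headword -> list of senses, each sense given as
# the set of synonyms registered for it. Illustrative data only.
DICTIONARY: Dict[str, List[Set[str]]] = {
    "lembrar": [{"recordar"}, {"advertir", "notar"}, {"sugerir"}, {"recomendar"}],
    "recordar": [{"lembrar"}],
    "advertir": [{"lembrar", "notar"}],
    "notar": [{"advertir"}],
    "sugerir": [{"lembrar"}],
    "recomendar": [],
}


def lists_as_synonym(headword: str, candidate: str,
                     dictionary: Dict[str, List[Set[str]]]) -> bool:
    """True if some sense of `headword` registers `candidate` as a synonym."""
    return any(candidate in sense for sense in dictionary.get(headword, []))


def unreciprocated_pairs(synset: Set[str],
                         dictionary: Dict[str, List[Set[str]]]
                         ) -> List[Tuple[str, str]]:
    """Return ordered pairs (a, b) where the entry for a does not register b.

    An empty result means every member lists every other member, i.e. the
    candidate synset passes this (purely formal) symmetry check.
    """
    members = sorted(synset)
    return [(a, b) for a in members for b in members
            if a != b and not lists_as_synonym(a, b, dictionary)]


if __name__ == "__main__":
    for candidate in ({"lembrar", "recordar"},
                      {"lembrar", "advertir", "notar"},
                      {"lembrar", "recomendar"}):
        print(sorted(candidate), unreciprocated_pairs(candidate, DICTIONARY))
```

A non-empty result does not by itself reject a synset; under this sketch it only marks the pair for the kind of cross-reference and paraphrase inspection described in the example.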
3.2 The Wordnet.Br Core Database Design

Each Wordnet.Br core database entry consists of the following template:

[<Headword> n (<X>)
Sense n.1 [{Synset}; {Antonym Synset}]
...
Sense n.m [{Synset}; {Antonym Synset}]]

where n is the entry identification number; X is the syntactic category (noun, verb, adjective, or adverb); and n.1 ... n.m are the sense identification numbers of entry n.

From the logical point of view, the overall structure of the Wordnet.Br core database is made up of two lists: an Entry List (EL), containing the Wordnet.Br core database entries ordered alphabetically, and a Synset List (SL), containing the synsets. Each element of a synset is necessarily an element of the EL. Each EL entry, besides being specified for its graphemic representation, is also specified for a particular Sense Specification (SS). Each SS is indexed by three memory pointers: the "synonymy pointer" points to a particular synset (say, synset 1) in the SL; the "antonymy pointer" points to the synset in the SL (when there is one) which is the antonym of synset 1; and the "sense pointer", besides identifying the sense (say, sense 1), points to the particular entry in the EL with which both synsets are associated.

Each synset in the SL is represented as a "double-faced" list, whose faces we name the Entry Face (EF) and the Sense Face (SF). The EF contains pointers to the elements of the EL that are related to one another by means of the synonymy relation. The SF contains the list of all SSs to which the synset is linked.

A conventional relational database management system is used. Its main functionalities include the storage of general information of the Wordnet.Br core database and its bookkeeping.
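The two-list design described above can be made concrete with a small in-memory sketch. It is an illustration under stated assumptions, not the project's actual relational schema: the class and function names (Synset, SenseSpec, build_entry_list, render_entry) are hypothetical, list indices stand in for the "memory pointers", and the sample synsets are a reduced version of the lembrar/esquecer example. Because each element of a synset is necessarily an element of the EL, the sketch derives the EL from the SL.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Synset:
    members: List[str]  # Entry Face: headwords of the EL in this synset
    senses: List[Tuple[str, int]] = field(default_factory=list)  # Sense Face


@dataclass
class SenseSpec:
    headword: str            # "sense pointer": the EL entry this sense belongs to
    number: int              # sense number within that entry
    synset: int              # "synonymy pointer": index into the SL
    antonym: Optional[int]   # "antonymy pointer": index into the SL, if any


def build_entry_list(synset_list: List[Synset],
                     antonyms: Dict[int, int]) -> Dict[str, List[SenseSpec]]:
    """Derive the alphabetically ordered EL from the SL (assumed procedure)."""
    entries: Dict[str, List[SenseSpec]] = {}
    for idx, synset in enumerate(synset_list):
        for word in synset.members:
            specs = entries.setdefault(word, [])
            spec = SenseSpec(word, len(specs) + 1, idx, antonyms.get(idx))
            specs.append(spec)
            synset.senses.append((word, spec.number))
    return dict(sorted(entries.items()))


def render_entry(n: int, word: str, category: str,
                 specs: List[SenseSpec], synset_list: List[Synset]) -> str:
    """Print one entry in the bracketed template shown in this section."""
    parts = []
    for spec in specs:
        syn = ", ".join(synset_list[spec.synset].members)
        ant = ("" if spec.antonym is None
               else ", ".join(synset_list[spec.antonym].members))
        parts.append(f"Sense {n}.{spec.number} [{{{syn}}}; {{{ant}}}]")
    return f"[{word} {n} ({category}) " + " ".join(parts) + "]"


# Reduced sample data: synset 0 and synset 1 are antonyms of each other.
SL = [Synset(["lembrar", "recordar"]),
      Synset(["esquecer", "olvidar"]),
      Synset(["lembrar", "sugerir"])]
ANTONYMS = {0: 1, 1: 0}
EL = build_entry_list(SL, ANTONYMS)

if __name__ == "__main__":
    for n, (word, specs) in enumerate(EL.items(), start=1):
        print(render_entry(n, word, "verb", specs, SL))
```

Storing the antonymy link once per synset pair, rather than once per word pair, is what lets a single synset update propagate to every entry that points to it.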
The design includes the complete loading of all entries and their related information. The key feature is the automatic entry generation. Once the synset is entered in the Wordnet.Br core database
or updated, the system generates the appropriate entries automatically. To illustrate with numbers: 3,872 verb synsets generate 10,204 verb entries.

4. The Computational Domain

4.1 The Editing Tool

The editing tool is a Windows-based interface where the linguists enter synsets, sample sentences, and glosses, and generate different lists (synsets listed by syntactic category, number of elements, degree of homonymy and polysemy, and list of sample sentences) and statistics (number of headwords, synsets, antonymous synsets, and headword/synset ratio).

4.2 The Wordnet.Br Core Statistics

CATEGORY     LEXICAL UNITS   SYNSETS
Verbs        ~ 11,000        ~
Nouns        ~ 17,000        ~
Adjectives   ~ 15,000        ~ 6,000
Adverbs      ~ 1,000         ~ 500
TOTAL        ~ 44,000        ~ 18,500

As this paper has cautioned, care must be taken not to carry over published dictionary flaws into the Wordnet.Br core. But, despite their imperfections, the dictionaries selected as the reference corpus proved to be valuable resources of lexical-semantic information. Thanks to them, and to the systematic mining process and filtering strategies, the Wordnet.Br core database, with circa 18,000 synsets, can be further refined and updated to the Wordnet.Br. Accordingly, future steps will involve the specification of glosses for each sense, of sample sentences and expressions for each word form, and of the logical-conceptual relations of meronymy-holonymy and hyponymy-hypernymy.
References

Azevedo, F.F.S. (1983) Dicionário Analógico da Língua Portuguesa. Brasília: Thesaurus.
Barbosa, O. (1999) Grande Dicionário de Sinônimos e Antônimos. Rio de Janeiro: Ediouro.
Borba, F.S. (coord.) (1990) Dicionário Gramatical de Verbos do Português Contemporâneo do Brasil. São Paulo: Editora da Unesp.
Chafe, W. (1970) Meaning and the Structure of Language. Chicago: The University of Chicago Press.
Cruse, D.A. (1986) Lexical Semantics. New York: Cambridge University Press.
Dias-da-Silva, B.C. (1996) The technological facet of language studies: natural language processing. PhD Diss., FCL-UNESP, Araraquara, Brasil. (In Portuguese)
Dias-da-Silva, B.C. (1998) Bridging the gap between linguistic theory and natural language processing. In: Caron, B. (ed.) 16th International Congress of Linguists. Oxford: Pergamon-Elsevier Science, 10 p.
Dias-da-Silva, B.C., Oliveira, M.F., Hasegawa, R., Moraes, H.R., Amorim, D., Paschoalino, C., Nascimento, A.C. (2000) A construção de um thesaurus eletrônico para o português do Brasil. In: Proceedings of the 5th PROPOR - Encontro para o processamento computacional da língua portuguesa escrita e falada, Atibaia, Brazil.
Dias-da-Silva, B.C., Oliveira, M.F., Moraes, H.R. (2002) Groundwork for the Development of the Brazilian Portuguese Wordnet. In: Ranchhod, E.M., Mamede, N.J. (eds.) Advances in natural language processing. Berlin: Springer-Verlag.
Durkin, J. (1994) Expert systems: design and development. London: Prentice Hall International.
Fellbaum, C. (ed.) (1998) WordNet: An Electronic Lexical Database. Cambridge, Mass.: The MIT Press.
Fernandes, F. (1997) Dicionário de Sinônimos e Antônimos da Língua Portuguesa. São Paulo: Globo.
Ferreira, A.B.H. (1999) Dicionário Aurélio Eletrônico Século XXI (versão 3.0). São Paulo: Lexicon.
Flexner, S.B. (ed.) (1997) Random House Webster's Unabridged Electronic Dictionary (Version 2.0). New York: Random House Inc.
Hayes-Roth, F. (1990) Expert Systems. In: Shapiro, E. (ed.) Encyclopedia of artificial intelligence. New York: Wiley.
Kilgarriff, A. (1993) Dictionary word sense distinctions: an inquiry into their nature. Computers and the Humanities, 26.
Kilgarriff, A. (1997) I don't believe in word senses. Computers and the Humanities, 31.
Kilgarriff, A., Yallop, C. (2000) What's in a Thesaurus? In: Proceedings of the 2nd Conference on Language Resources and Evaluation, Athens, Greece, 8 p.
Lutz, W.D. (1994) The Cambridge Thesaurus of American English. Cambridge: Cambridge University Press.
Miller, G.A., Fellbaum, C. (1991) Semantic networks of English. Cognition, 41.
Nascentes, A. (1981) Dicionário de Sinônimos. São Paulo: Nova Fronteira.
Neufeldt, V. (ed.) (1997) Webster's New World Dictionary & Thesaurus (Version 1.0). New York: Macmillan.
Roget, P.M. (1953) Thesaurus. Middlesex: Penguin Books. (Original ed. 1852)
Stubbs, M. (2001) Words and phrases. Oxford: Blackwell.
Summers, D. (ed.) (1995) Longman Dictionary of Contemporary English. Essex: Longman.
Weiszflog, W. (ed.) (1998) Michaelis português moderno dicionário da língua portuguesa (versão 1.1). São Paulo: DTS Software Brasil Ltda.
Extracting Business. Value From CAD. Model Data. Transformation. Sreeram Bhaskara The Boeing Company. Sridhar Natarajan Tata Consultancy Services Ltd.
Extracting Business Value From CAD Model Data Transformation Sreeram Bhaskara The Boeing Company Sridhar Natarajan Tata Consultancy Services Ltd. GPDIS_2014.ppt 1 Contents Data in CAD Models Data Structures
Overview of MT techniques. Malek Boualem (FT)
Overview of MT techniques Malek Boualem (FT) This section presents an standard overview of general aspects related to machine translation with a description of different techniques: bilingual, transfer,
Exploiting Comparable Corpora and Bilingual Dictionaries. the Cross Language Text Categorization
Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization Alfio Gliozzo and Carlo Strapparava ITC-Irst via Sommarive, I-38050, Trento, ITALY {gliozzo,strappa}@itc.it
The Dictionary of the Common Modern Greek Language is being compiled 1 under
THE DICTIONARY OF THE COMMON MODERN GREEK LANGUAGE OF THE INSTITUTE OF MODERN GREEK STUDIES (MANOLIS TRIANDAFYLLIDIS FOUNDATION) OF THE ARISTOTLE UNIVERSITY OF THESSALONIKI ANASTASSIA TZIVANOPOULOU Aristotle
Minnesota K-12 Academic Standards in Language Arts Curriculum and Assessment Alignment Form Rewards Intermediate Grades 4-6
Minnesota K-12 Academic Standards in Language Arts Curriculum and Assessment Alignment Form Rewards Intermediate Grades 4-6 4 I. READING AND LITERATURE A. Word Recognition, Analysis, and Fluency The student
Intro to Linguistics Semantics
Intro to Linguistics Semantics Jarmila Panevová & Jirka Hana January 5, 2011 Overview of topics What is Semantics The Meaning of Words The Meaning of Sentences Other things about semantics What to remember
Psychology G4470. Psychology and Neuropsychology of Language. Spring 2013.
Psychology G4470. Psychology and Neuropsychology of Language. Spring 2013. I. Course description, as it will appear in the bulletins. II. A full description of the content of the course III. Rationale
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 INTELLIGENT MULTIDIMENSIONAL DATABASE INTERFACE Mona Gharib Mohamed Reda Zahraa E. Mohamed Faculty of Science,
TERMINOGRAPHY and LEXICOGRAPHY What is the difference? Summary. Anja Drame TermNet
TERMINOGRAPHY and LEXICOGRAPHY What is the difference? Summary Anja Drame TermNet Summary/ Conclusion Variety of language (GPL = general purpose SPL = special purpose) Lexicography GPL SPL (special-purpose
Natural Language Database Interface for the Community Based Monitoring System *
Natural Language Database Interface for the Community Based Monitoring System * Krissanne Kaye Garcia, Ma. Angelica Lumain, Jose Antonio Wong, Jhovee Gerard Yap, Charibeth Cheng De La Salle University
Computer Assisted Language Learning (CALL): Room for CompLing? Scott, Stella, Stacia
Computer Assisted Language Learning (CALL): Room for CompLing? Scott, Stella, Stacia Outline I What is CALL? (scott) II Popular language learning sites (stella) Livemocha.com (stacia) III IV Specific sites
Customer Intentions Analysis of Twitter Based on Semantic Patterns
Customer Intentions Analysis of Twitter Based on Semantic Patterns Mohamed Hamroun [email protected] Mohamed Salah Gouider [email protected] Lamjed Ben Said [email protected] ABSTRACT
ICAME Journal No. 24. Reviews
ICAME Journal No. 24 Reviews Collins COBUILD Grammar Patterns 2: Nouns and Adjectives, edited by Gill Francis, Susan Hunston, andelizabeth Manning, withjohn Sinclair as the founding editor-in-chief of
Absolute versus Relative Synonymy
Article 18 in LCPJ Danglli, Leonard & Abazaj, Griselda 2009: Absolute versus Relative Synonymy Absolute versus Relative Synonymy Abstract This article aims at providing an illustrated discussion of the
Teaching Vocabulary to Young Learners (Linse, 2005, pp. 120-134)
Teaching Vocabulary to Young Learners (Linse, 2005, pp. 120-134) Very young children learn vocabulary items related to the different concepts they are learning. When children learn numbers or colors in
The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT)
The Development of Multimedia-Multilingual Storage, Retrieval and Delivery for E-Organization (STREDEO PROJECT) Asanee Kawtrakul, Kajornsak Julavittayanukool, Mukda Suktarachan, Patcharee Varasrai, Nathavit
Get the most value from your surveys with text analysis
PASW Text Analytics for Surveys 3.0 Specifications Get the most value from your surveys with text analysis The words people use to answer a question tell you a lot about what they think and feel. That
Find the signal in the noise
Find the signal in the noise Electronic Health Records: The challenge The adoption of Electronic Health Records (EHRs) in the USA is rapidly increasing, due to the Health Information Technology and Clinical
Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets
Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets Maria Ruiz-Casado, Enrique Alfonseca and Pablo Castells Computer Science Dep., Universidad Autonoma de Madrid, 28049 Madrid, Spain
Building a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
English Descriptive Grammar
English Descriptive Grammar 2015/2016 Code: 103410 ECTS Credits: 6 Degree Type Year Semester 2500245 English Studies FB 1 1 2501902 English and Catalan FB 1 1 2501907 English and Classics FB 1 1 2501910
Sentiment analysis for news articles
Prashant Raina Sentiment analysis for news articles Wide range of applications in business and public policy Especially relevant given the popularity of online media Previous work Machine learning based
Construction of Thai WordNet Lexical Database from Machine Readable Dictionaries
Construction of Thai WordNet Lexical Database from Machine Readable Dictionaries Patanakul Sathapornrungkij Department of Computer Science Faculty of Science, Mahidol University Rama6 Road, Ratchathewi
Vocabulary notebooks: implementation and outcomes
Vocabulary notebooks: implementation and outcomes Clyde Fowle With the recent focus in applied linguistics on lexical competence, and the impact this has had on language teaching, many language teachers
Taxonomies for Auto-Tagging Unstructured Content. Heather Hedden Hedden Information Management Text Analytics World, Boston, MA October 1, 2013
Taxonomies for Auto-Tagging Unstructured Content Heather Hedden Hedden Information Management Text Analytics World, Boston, MA October 1, 2013 About Heather Hedden Independent taxonomy consultant, Hedden
To download the script for the listening go to: http://www.teachingenglish.org.uk/sites/teacheng/files/learning-stylesaudioscript.
Learning styles Topic: Idioms Aims: - To apply listening skills to an audio extract of non-native speakers - To raise awareness of personal learning styles - To provide concrete learning aids to enable
2. SEMANTIC RELATIONS
2. SEMANTIC RELATIONS 2.0 Review: meaning, sense, reference A word has meaning by having both sense and reference. Sense: Word meaning: = concept Sentence meaning: = proposition (1) a. The man kissed the
Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features
, pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of
Reading Competencies
Reading Competencies The Third Grade Reading Guarantee legislation within Senate Bill 21 requires reading competencies to be adopted by the State Board no later than January 31, 2014. Reading competencies
Some Reflections on the Making of the Progressive English Collocations Dictionary
43 Some Reflections on the Making of the Progressive English Collocations Dictionary TSUKAMOTO Michihisa Faculty of International Communication, Aichi University E-mail: [email protected] 1939
MATRIX OF STANDARDS AND COMPETENCIES FOR ENGLISH IN GRADES 7 10
PROCESSES CONVENTIONS MATRIX OF STANDARDS AND COMPETENCIES FOR ENGLISH IN GRADES 7 10 Determine how stress, Listen for important Determine intonation, phrasing, points signaled by appropriateness of pacing,
Developing a Theory-Based Ontology for Best Practices Knowledge Bases
Developing a Theory-Based Ontology for Best Practices Knowledge Bases Daniel E. O Leary University of Southern California 3660 Trousdale Parkway Los Angeles, CA 90089-0441 [email protected] Abstract Knowledge
What s in a Lexicon. The Lexicon. Lexicon vs. Dictionary. What kind of Information should a Lexicon contain?
What s in a Lexicon What kind of Information should a Lexicon contain? The Lexicon Miriam Butt November 2002 Semantic: information about lexical meaning and relations (thematic roles, selectional restrictions,
How To Teach Reading
Florida Reading Endorsement Alignment Matrix Competency 1 The * designates which of the reading endorsement competencies are specific to the competencies for English to Speakers of Languages (ESOL). The
Sense-Tagging Verbs in English and Chinese. Hoa Trang Dang
Sense-Tagging Verbs in English and Chinese Hoa Trang Dang Department of Computer and Information Sciences University of Pennsylvania [email protected] October 30, 2003 Outline English sense-tagging
CORPUS VS DICTIONARY IN EFL CLASSES
English Matters III 65 CORPUS VS DICTIONARY IN EFL CLASSES Ivana Cimermanová Abstract: Dictionaries (bilingual, monolingual, visual, specific etc.) have become a natural aid for learners of English, teachers
LANGUAGE! 4 th Edition, Levels A C, correlated to the South Carolina College and Career Readiness Standards, Grades 3 5
Page 1 of 57 Grade 3 Reading Literary Text Principles of Reading (P) Standard 1: Demonstrate understanding of the organization and basic features of print. Standard 2: Demonstrate understanding of spoken
Natural Language to Relational Query by Using Parsing Compiler
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
ELECTRONIC DICTIONARIES AND ESP STUDENTS
ELECTRONIC DICTIONARIES AND ESP STUDENTS VINȚEAN Adriana Lucian Blaga University of Sibiu, Romania MATIU Ovidiu Lucian Blaga University of Sibiu, Romania Abstract: This article presents the results of
Latin WordNet project
Latin WordNet project Stefano Minozzi Laboratorio di Informatica Umanistica Università degli Studi di Verona Latin WordNet project Laboratorio di Informatica Umanistica Università degli Studi di Verona
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
SPELLING DOES MATTER
Focus and content of the Session 1 Introduction Introduction to NSW syllabus objectives: A. Communicate through speaking, listening, reading, writing, viewing and representing B. Use language to shape
ROGET S THESAURUS AS A LEXICAL RESOURCE FOR NATURAL LANGUAGE PROCESSING
ROGET S THESAURUS AS A LEXICAL RESOURCE FOR NATURAL LANGUAGE PROCESSING Mario Jarmasz Thesis submitted to the Faculty of Graduate and Postdoctoral Studies in partial fulfillment of the requirements for
Thesis Proposal Verb Semantics for Natural Language Understanding
Thesis Proposal Verb Semantics for Natural Language Understanding Derry Tanti Wijaya Abstract A verb is the organizational core of a sentence. Understanding the meaning of the verb is therefore key to
Basic Unified Process: A Process for Small and Agile Projects
Basic Unified Process: A Process for Small and Agile Projects Ricardo Balduino - Rational Unified Process Content Developer, IBM Introduction Small projects have different process needs than larger projects.
Building the Multilingual Web of Data: A Hands-on tutorial (ISWC 2014, Riva del Garda - Italy)
Building the Multilingual Web of Data: A Hands-on tutorial (ISWC 2014, Riva del Garda - Italy) Multilingual Word Sense Disambiguation and Entity Linking on the Web based on BabelNet Roberto Navigli, Tiziano
Visualizing WordNet Structure
Visualizing WordNet Structure Jaap Kamps Abstract Representations in WordNet are not on the level of individual words or word forms, but on the level of word meanings (lexemes). A word meaning, in turn,
The First Online 3D Epigraphic Library: The University of Florida Digital Epigraphy and Archaeology Project
Seminar on Dec 19 th Abstracts & speaker information The First Online 3D Epigraphic Library: The University of Florida Digital Epigraphy and Archaeology Project Eleni Bozia (USA) Angelos Barmpoutis (USA)
Moving Enterprise Applications into VoiceXML. May 2002
Moving Enterprise Applications into VoiceXML May 2002 ViaFone Overview ViaFone connects mobile employees to to enterprise systems to to improve overall business performance. Enterprise Application Focus;
