Development of a Dependency Treebank for Russian and its Possible Applications in NLP
Igor BOGUSLAVSKY, Ivan CHARDIN, Svetlana GRIGORIEVA, Nikolai GRIGORIEV, Leonid IOMDIN, Leonid KREIDLIN, Nadezhda FRID
Laboratory for Computational Linguistics, Institute for Information Transmission Problems, Russian Academy of Sciences
Bolshoj Karetnyj per. 19, Moscow, RUSSIA
{bogus, ic, sveta, grig, iomdin, lenya, nadya}@iitp.ru

Abstract

The paper describes a tagging scheme designed for the Russian Treebank and presents tools used for corpus creation.

1. Introductory Remarks

The present paper describes a project aimed at developing the first annotated corpus of Russian texts. Large text corpora have been used in the computational linguistics community for quite a long time now; at present, over 20 large corpora for the main European languages are available, the largest of them containing hundreds of millions of words (Language Resources (1997); Marcus, Santorini and Marcinkiewicz (1993); Kurohashi, Nagao (1998)). For Russian, no annotated corpora existed until 2000, when the first part of the corpus reported here was compiled (Boguslavsky et al., 2000). Since then, Russian corpus linguistics has been evolving rapidly, and several groups of researchers have announced their intent to create corpora. Of these, the CLD-MSU Corpus project looks particularly promising. It aims at building a morphologically tagged corpus; an upgrade to syntactic annotation is envisaged for the future but is not pursued at the present stage (Sichinava, 2001).

Different tasks require different annotation levels, which entail different amounts of information about text structure. The corpus that is being created in the framework of the project under discussion consists of several subcorpora that differ in the level of annotation.
The following three levels are envisaged:
- lemmatized texts: for every word, its normal form (lemma) and part of speech are indicated;
- morphologically tagged texts: for every word, along with the lemma and the part of speech, a full set of inflectional morphological attributes is specified;
- syntactically tagged texts: apart from the full morphological markup at the word level, every sentence is assigned a syntactic structure.

We annotate Russian texts with dependency structures, a formalism which we consider more suitable than constituent structures for Slavonic languages with their relatively free word order. The structure not only shows which words of the sentence are syntactically linked, but also assigns each link to one of several dozen syntactic types (at present, we use 78 syntactic relations). The choice of dependency structures is an important feature, since the majority of syntactically annotated corpora, both those already available and those under construction, represent syntactic structure by means of constituents.

The closest analogue to our work is the Prague Dependency Treebank (PDT), an annotated corpus of Czech collected at Charles University in Prague (see Hajicova, Panevova, Sgall, 1998). In this corpus, the syntactic data are also expressed in the dependency formalism, although the inventory of syntactic functional relations is much smaller than ours: it has only 23 relations. Our corpus therefore gives a more fine-grained representation of syntactic phenomena. On the other hand, the Czech researchers have made an extremely interesting attempt to incorporate information on discourse structure (topic-focus opposition) into their annotation (Bemova et al., 1999). Besides PDT, several other corpus-related projects use some kind of dependency structures; worth noting are NEGRA for German (Brants et al., 1999) and the Alpino Dependency Treebank for Dutch (Van der Beek et al., 2001).
In what follows, we describe the types of texts used to create the corpus (Section 2), the markup format (Section 3), annotation tools and procedures (Section 4), and the types of linguistic data included in the markup (Section 5).

2. Source Text Selection

The well-known Uppsala University Corpus of contemporary Russian prose has been chosen as the primary source for the first part of our corpus, which has already been completed. This part contains about 10,000 sentences. The Uppsala Corpus is well balanced between fiction and journalistic genres, with a smaller percentage of scientific and popular science texts. The Corpus includes samples of contemporary Russian prose, as well as excerpts from newspapers and magazines of the last few decades of the 20th century, and gives a representative coverage of written Russian in modern use. Conversational examples are scarce and appear as dialogues inside fiction texts.
The second part of the corpus consists of several hundred short texts published on various Internet news portals. The bulk of the texts come from newswires. Each text is a small story (up to 30 sentences) about a single event. Their themes include political, financial, cultural, and sports news, both domestic and international; a number of texts deal with hi-tech achievements. We have done our best to make the selection of source texts a representative sample of Internet news delivery in Russian.

3. Markup Format

The design principles were formulated as follows:
- layered markup: several annotation levels coexist and can be extracted or processed independently;
- incrementality: it should be easy to add higher annotation levels;
- convenient parsing of the annotated text by means of standard software packages.

The most natural solution to meet these criteria is an XML-based markup language. We have tried to make our format compatible with TEI (Text Encoding Initiative, see TEI Guidelines (1994)), introducing new elements or attributes only in situations where TEI markup does not provide adequate means to describe the text structure in the dependency grammar framework. Listed below are the types of information about text structure that must be encoded in the markup, and the respective tags and attributes used to carry this information.

3.1. Splitting Text into Sentences. A special container element <S> (available in TEI) is used to delimit sentence boundaries. The element may have an optional ID attribute that supplies a unique identifier for the sentence within the text; this identifier may be used to store information about extra-sentential relations in the text. It may also have a COMMENT attribute, used by linguists to store notes and observations about particular syntactic phenomena encountered in the sentence.

3.2. Splitting Sentences into Lexical Items (Words). The words are delimited by a container element <W>.
Like sentences, words may have a unique ID attribute that is used to refer to the word within the sentence.

3.3. Assigning Morphological Features to Words. Morphological information is ascribed to the word by means of two attributes attached to the <W> tag: LEMMA (the normalized word form) and FEAT (the list of morphological features).

3.4. Storing Information about Syntactic Structure. To annotate the information about syntactic dependencies, we use two other attributes attached to the <W> element: DOM (the ID of the word on which W depends) and LINK (the syntactic function label).

The formalism has special provisions to store auxiliary information, e.g. multiple morphological analyses and syntactic trees. These will not appear in the final version of the corpus.

4. Annotation Tools and Procedures

The procedure of corpus data acquisition is semi-automatic. An initial version of the markup is generated by a computer using a general-purpose morphological analyzer and syntactic parser; after that, the results of the automatic processing are submitted to human post-editing. The analysis engine (morphology and parsing) is based upon the ETAP-3 machine translation engine as discussed in Apresjan et al. (1992, 1993).

To support the creation of annotated data, a variety of tools have been designed and implemented. All tools are Win32 applications written in C++. The tools available are:
- Chopper, a program that creates the sentence boundary markup;
- Structure Editor (or StrEd), a post-editor for building, editing and managing syntactically annotated texts.

The amount of manual labor required to build annotations depends on the complexity of the input data. StrEd offers different options for building structures. Most sentences can be reliably processed without any human intervention; in that case, a linguist need only look through the result of the processing and endorse it.
If the structure contains errors, the linguist can edit it using a user-friendly graphical interface (see screenshots below). If there are too many errors or no structure could be produced, the linguist may resort to a special split-and-run mode. This mode involves manually pre-chunking the input sentence into pieces that have a more transparent structure and applying the analyzer/parser to every chunk in turn. The linguist must then manually integrate the subtrees produced for the chunks into a single tree structure.

If the linguist has come across an especially difficult syntactic construction and is uncertain about what the adequate structure is, he/she may mark as doubtful either the whole sentence or single words whose functions are not completely clear. This information will be stored in the markup, and StrEd will display the respective sentence as one needing further editing.
Fig. 1 presents the main dialog window for editing sentence properties. An operator can edit the markup directly in any text editor, or edit single properties using a graphical interface. The source text under analysis is written in the top line of the edit window: Xotja pis'mo ne bylo podpisano, ja mgnovenno dogadalsja, kto ego napisal [Although the letter was not signed, I instantly guessed who had written it]. The information about particular words is written into a list: e.g. the first word xotja [although] has the identifier ID="1"; the lemmatized form is XOTJA; its feature list consists of a single feature, a part-of-speech characteristic (it is a conjunction); the word depends on the word with ID="8" by the adverbial relation (link type is "adverb").

Figure 1. Sentence Properties dialog in StrEd (labelled areas: source sentence, raw markup, word list, comments).

By double-clicking an item in the word list or pressing the button, a linguist can invoke dialog windows for editing properties of single words. However, the most convenient way of editing the structure consists in invoking a Tree Editor window, shown in Fig. 2 with the same sentence as in the previous picture. The Tree Editor interface is simple and natural. Words of the source sentence are written on the left, their lemmas are placed into gray rectangles, and their morphological features are written on the right. The syntactic relations are shown as arrows directed from the master to the slave; the link types are indicated in rounded rectangles on the arcs. All text fields except for the source sentence are editable in place. Moreover, one can drag the rounded rectangles: dropping one on a word means that this word is declared a new master for the word from which the rectangle was dragged. A single right-button click on the lemma rectangle pops up the word properties dialog. All colors, sizes and fonts are customizable.
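Since the markup is XML-based, an annotated sentence of the kind shown in the word list can be read with any standard XML library. The following is a minimal sketch in Python, not part of the project's own tools. The element and attribute names (S, W, ID, LEMMA, FEAT, DOM, LINK) and the values for the word xotja come from the description above; the feature labels, the "predic" link label, the lemma spelling for dogadalsja and the convention that the root word carries no DOM attribute are assumptions made for illustration only. Only four words of the example sentence are included.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of an annotated sentence. Only the values for
# word 1 (XOTJA, DOM="8", LINK="adverb") are taken from the paper;
# all other attribute values are illustrative assumptions.
SAMPLE = """<S ID="1">
  <W ID="1" LEMMA="XOTJA" FEAT="CONJ" DOM="8" LINK="adverb">Xotja</W>
  <W ID="6" LEMMA="JA" FEAT="PRON" DOM="8" LINK="predic">ja</W>
  <W ID="7" LEMMA="MGNOVENNO" FEAT="ADV" DOM="8" LINK="adverb">mgnovenno</W>
  <W ID="8" LEMMA="DOGADAT'SJA" FEAT="V">dogadalsja</W>
</S>"""


def read_sentence(xml_text):
    """Return the words of one <S> element as a list of dictionaries."""
    sentence = ET.fromstring(xml_text)
    words = []
    for w in sentence.iter("W"):
        words.append({
            "id": w.get("ID"),
            "form": (w.text or "").strip(),
            "lemma": w.get("LEMMA"),
            "feat": (w.get("FEAT") or "").split(),
            "dom": w.get("DOM"),    # None for the root (assumed convention)
            "link": w.get("LINK"),
        })
    return words


def dependents(words, head_id):
    """Word forms that depend directly on the word with the given ID."""
    return [w["form"] for w in words if w["dom"] == head_id]


words = read_sentence(SAMPLE)
roots = [w["form"] for w in words if w["dom"] is None]
print(dependents(words, "8"))
print(roots)
```

Keeping each annotation level in its own attribute means that a reader interested only in lemmas, for instance, can ignore DOM and LINK entirely, which matches the layered-markup principle of Section 3.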
Figure 2. Tree Editor dialog in StrEd.

5. Types of Linguistic Information by Level

5.1. Morphological Information

The morphological analyzer ascribes features to every word. The feature set for Russian includes: part of speech, animateness, gender, number, case, degree of comparison, short form (for adjectives and participles), representation (of verbs), aspect, tense, person, and voice.

5.2. Syntactic Information

As we have already mentioned, the result of the parsing is a tree whose nodes are words and whose arcs are syntactic links. All the links are binary and oriented; they link single words rather than syntactic groups. For every syntactic group, one word (the head) is chosen to represent it as a slave in larger syntactic units; all other members of the group become slaves of the head.

In a typical case, the number of nodes in the syntactic tree corresponds to the number of word tokens. However, there are several exceptional situations in which the number of nodes may be either less or greater than the number of word tokens. The latter case is especially interesting. We postulate additional nodes in the following cases:

a) Copulative sentences in the present tense, where the auxiliary verb can be omitted. This is treated as a special zero form of the copula, e.g. On uchitel [He is a teacher, lit. He teacher]. The copula should be introduced in the syntactic representation.

b) Elliptical constructions (e.g. omitted members of contrasted coordinative expressions), as in Ja kupil rubashku, a on galstuk [I bought a shirt, and he bought a necktie, lit. I bought a shirt, and he a necktie].

The latter type of sentences should be discussed in more detail. Elliptical constructions are known to be one of the toughest problems in the formalization of natural language syntax. In our corpus, we decided to reconstruct the omitted elements in the syntactic trees, marking them with a special phantom feature.
In the above example, a phantom node is inserted into the sentence between the words on [he] and galstuk [necktie]. This new node will have the lemma POKUPAT' [BUY] and will bear exactly the same morphological features as the word form kupil [bought] physically present in the sentence, plus a special phantom marker. In certain cases, the feature set for the phantom may differ from that of the prototype: e.g. in the slightly modified phrase Ja kupil rubashku, a ona galstuk [I bought a shirt, and she (bought) a necktie] the phantom node will have the feminine gender, as required by agreement with the subject of the second clause. Most real-life elliptical constructions can be represented in this way.

The inventory of syntactic relation types generated by the ETAP-3 system is fairly large: at present, we count 78 different syntactic function types. All relations are divided into 6 major groups: actant, attributive, quantitative, adverbial, coordinative, and auxiliary. For the reader's convenience, we give equivalent English examples ([X] is the master, [Y] is the slave).

Actant relations link the predicate word to its arguments. Some examples:
- predicative: Pete [Y] knows [X];
- completive (1, 2, 3): translate [X] the book [Y, 1-compl] from [Y1, 2-compl] English into [Y2, 3-compl] Russian.

Attributive relations often link a noun to a modifier expressed by an adjective, another noun, a participle clause, etc.:
- relative: The house [X] we live [Y] in.

Quantitative relations link a noun to a quantifier or numeral, or two such words together:
- quantitative: five [Y] pages [X];
- auxiliary-quantitative: thirty [Y] five [X].

Adverbial relations link the predicate word to various adverbial modifiers:
- adverbial: He came [X] every evening [Y];
- parenthetic: In [Y] my opinion, he is [X] right.

Coordinative relations serve phrases and clauses coordinated by conjunctions:
- coordinative: buy apples [X1], pears [Y1 = X2] and [Y2] apricots;
- coordinative-conjunctive: buy apples and [X] pears [Y].

Auxiliary relations typically link two elements that form a single syntactic unit (e.g. an analytical verb form):
- analytical: will [X] buy [Y].

The list of syntactic relations is not closed. The process of data acquisition brings up a variety of rare syntactic constructions that are hardly, if at all, covered by traditional grammars. In some cases, this has led to the introduction of new syntactic link types in order to reflect a particular relation between single words and make the syntactic structure unambiguous.

6. Application of the Tagged Corpus in NLP

The first type of research application on which we have started to test the annotated corpus is the resolution of syntactic ambiguity in the course of Russian parsing as part of ETAP-3 Russian-to-English machine translation. Within the parser, an additional filter has been created that assigns weights to all potential subtrees of the processed sentence that consist of two to four nodes (so-called N-grams of the 1st, 2nd and 3rd order), based on their relative occurrence in the dependency trees belonging to the Treebank (for details, see Chardin 2001). The weights of competing subtrees are compared with the existing priority values of individual syntactic links in the operational space of the parser and modify these values accordingly, which eventually helps create a more adequate tree structure. The first results are promising; a detailed report on the ongoing experiments is being prepared.

In fact, the results of such experiments are reusable in the creation of the Treebank itself, since new automatically derived parses that take into account the subtree weights are likely to conform better to the previously produced corpus and will require less manual editing.
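The idea of weighting subtrees by their relative occurrence in the Treebank can be illustrated for the simplest case, two-node subtrees. The sketch below is in Python and is only an illustration under stated assumptions: the toy treebank, the part-of-speech tags and the link labels ("predic", "attrib" and so on) are invented, and the actual weighting scheme used inside ETAP-3, including how weights are combined with link priorities, is not specified in this paper.

```python
from collections import Counter

# Toy treebank (invented data). Each sentence is a list of
# (word_id, part_of_speech, head_id, link_label); head_id None marks the root.
TREEBANK = [
    [(1, "S", 2, "predic"), (2, "V", None, None), (3, "S", 2, "1-compl")],
    [(1, "A", 2, "attrib"), (2, "S", 3, "predic"), (3, "V", None, None)],
    [(1, "S", 2, "predic"), (2, "V", None, None), (3, "ADV", 2, "adverb")],
]


def two_node_subtrees(sentence):
    """Yield (head POS, link label, dependent POS) for every link."""
    pos_by_id = {word_id: pos for word_id, pos, _, _ in sentence}
    for _, pos, head_id, link in sentence:
        if head_id is not None:
            yield (pos_by_id[head_id], link, pos)


def subtree_weights(treebank):
    """Relative frequency of each two-node subtree in the treebank."""
    counts = Counter(
        subtree
        for sentence in treebank
        for subtree in two_node_subtrees(sentence)
    )
    total = sum(counts.values())
    return {subtree: count / total for subtree, count in counts.items()}


weights = subtree_weights(TREEBANK)
print(weights[("V", "predic", "S")])  # 3 of the 6 links in the toy data: 0.5
```

A parser could consult such a table when two candidate links compete for the same word, preferring the one whose subtree is more frequent in the corpus. Extending the sketch to three- and four-node subtrees would mean collecting chains and sibling pairs of links in the same manner.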
Conclusion

Corpus creation is not yet completed: at present, the full syntactic markup has been generated for 12,000 sentences (180,000 words), which constitutes 50% of the total amount planned. Our approach makes it possible to include all information expressed by morphological and syntactic means in contemporary Russian. We expect that the new corpus will stimulate a broad range of further research and development projects. We plan to make the corpus publicly available after completion.

Acknowledgement

This work has been supported by a grant from the Russian Foundation of Fundamental Research.

References

Apresjan Ju.D., Boguslavskij I.M., Iomdin L.L., Lazurskij A.V., Sannikov V.Z. and Tsinman L.L. (1992). The linguistics of a Machine Translation System. Meta, 37 (1).
Apresjan Ju.D., Boguslavskij I.M., Iomdin L.L., Lazurskij A.V., Sannikov V.Z. and Tsinman L.L. (1993). Système de traduction automatique ETAP. // La Traductique. P. Bouillon and A. Clas (eds). Les Presses de l'Université de Montréal, Montréal.
Boguslavsky I.M., Grigorieva S.A., Grigoriev N.V., Kreidlin L.G., Frid N.E. (2000). Dependency Treebank for Russian: Concepts, Tools, Types of Information. // Proceedings of the 18th Conference on Computational Linguistics. Vol. 2, Saarbrücken.
Chardin I.S. (2001). Using a tagged corpus to resolve syntactic ambiguity in the ETAP-3 Linguistic Processor. // Proceedings of the 2nd All-Russian Conference Theory and Practice of Speech Resources (ARSO-2001). Moscow State University, Moscow. [in Russian]
Hajicova E., Panevova J., Sgall P. (1998). Language Resources Need Annotations To Make Them Really Reusable: The Prague Dependency Treebank. // Proceedings of the First International Conference on Language Resources & Evaluation.
Kurohashi S., Nagao M. (1998). Building a Japanese Parsed Corpus while Improving the Parsing System. // Proceedings of the First International Conference on Language Resources & Evaluation.
Language Resources (1997).
Survey of the State of the Art in Human Language Technology. Eds. G. B. Varile, A. Zampolli. Linguistica Computazionale, vol. XII-XIII.
Marcus M. P., Santorini B., and Marcinkiewicz M.-A. (1993). Building a large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, Vol. 19, No. 2.
TEI Guidelines (1994). TEI Guidelines for Electronic Text Encoding and Interchange (P3).
Brants Th., Skut W., and Uszkoreit H. (1999). Syntactic annotation of a German newspaper corpus. // Proceedings of the ATALA Treebank Workshop, Paris, France.
Van der Beek L., Bouma G., Malouf R., van Noord G. (2001). The Alpino Dependency Treebank. // Proceedings of LINC2001.
Sichinava D.V. (2001). On the problem of building Russian linguistic corpora for the Internet. [in Russian]
Bemova A., Hajic J., Hladka B., Panevova J. (1999). Morphological and Syntactic Tagging of the Prague Dependency Treebank.
Master of Arts in Linguistics Syllabus Applicants shall hold a Bachelor s degree with Honours of this University or another qualification of equivalent standard from this University or from another university
Electronic offprint from. baltic linguistics. Vol. 3, 2012
Electronic offprint from baltic linguistics Vol. 3, 2012 ISSN 2081-7533 Nɪᴄᴏʟᴇ Nᴀᴜ, A Short Grammar of Latgalian. (Languages of the World/Materials, 482.) München: ʟɪɴᴄᴏᴍ Europa, 2011, 119 pp. ɪѕʙɴ 978-3-86288-055-3.
Ling 201 Syntax 1. Jirka Hana April 10, 2006
Overview of topics What is Syntax? Word Classes What to remember and understand: Ling 201 Syntax 1 Jirka Hana April 10, 2006 Syntax, difference between syntax and semantics, open/closed class words, all
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
Word Clouds - Educational Tools by Terence Cavanaugh
Word Clouds - Educational Tools by Terence Cavanaugh While browsing the web you might see some tag clouds or word clouds, often they are over to the side of the page and look like bunches of words random
Computer Aided Document Indexing System
Computer Aided Document Indexing System Mladen Kolar, Igor Vukmirović, Bojana Dalbelo Bašić, Jan Šnajder Faculty of Electrical Engineering and Computing, University of Zagreb Unska 3, 0000 Zagreb, Croatia
Presented to The Federal Big Data Working Group Meetup On 07 June 2014 By Chuck Rehberg, CTO Semantic Insights a Division of Trigent Software
Semantic Research using Natural Language Processing at Scale; A continued look behind the scenes of Semantic Insights Research Assistant and Research Librarian Presented to The Federal Big Data Working
Peeling Back the Layers Sister Grade Seven
2-7th pages 68-231.15 8/3/04 9:58 AM Page 178 Peeling Back the Layers Sister Grade Seven Skill Focus Grammar Composition Reading Strategies Annotation Determining Main Idea Generalization Inference Paraphrase
Curriculum for the Master of Arts programme in Slavonic Studies at the Faculty of Humanities 2 of the University of Innsbruck
The English version of the curriculum for the Master of Arts programme in Slavonic Studies is not legally binding and is for informational purposes only. The legal basis is regulated in the curriculum
PICAI Italian Language Courses for Adults 6865, Christophe-Colomb, Montreal, Quebec Tel: 514-271 5590 Fax: 514 271 5593 Email: picai@axess.
PICAI Italian Language Courses for Adults 6865, Christophe-Colomb, Montreal, Quebec Tel: 514-271 5590 Fax: 514 271 5593 Email: [email protected] School Year 2013/2014 September April (24 lessons; 3hours/lesson
10th Grade Language. Goal ISAT% Objective Description (with content limits) Vocabulary Words
Standard 3: Writing Process 3.1: Prewrite 58-69% 10.LA.3.1.2 Generate a main idea or thesis appropriate to a type of writing. (753.02.b) Items may include a specified purpose, audience, and writing outline.
Differences in linguistic and discourse features of narrative writing performance. Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu 3
Yıl/Year: 2012 Cilt/Volume: 1 Sayı/Issue:2 Sayfalar/Pages: 40-47 Differences in linguistic and discourse features of narrative writing performance Abstract Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu
A Statistical Parser for Czech*
Michael Collins AT&T Labs-Research, Shannon Laboratory, 180 Park Avenue, Florham Park, NJ 07932 mcollins@research, att.com A Statistical Parser for Czech* Jan Haj i~. nstitute of Formal and Applied Linguistics
Parsing Software Requirements with an Ontology-based Semantic Role Labeler
Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh [email protected] Ewan Klein University of Edinburgh [email protected] Abstract Software
Overview of MT techniques. Malek Boualem (FT)
Overview of MT techniques Malek Boualem (FT) This section presents an standard overview of general aspects related to machine translation with a description of different techniques: bilingual, transfer,
Open Domain Information Extraction. Günter Neumann, DFKI, 2012
Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for
Shallow Parsing with Apache UIMA
Shallow Parsing with Apache UIMA Graham Wilcock University of Helsinki Finland [email protected] Abstract Apache UIMA (Unstructured Information Management Architecture) is a framework for linguistic
Application of Natural Language Interface to a Machine Translation Problem
Application of Natural Language Interface to a Machine Translation Problem Heidi M. Johnson Yukiko Sekine John S. White Martin Marietta Corporation Gil C. Kim Korean Advanced Institute of Science and Technology
31 Case Studies: Java Natural Language Tools Available on the Web
31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software
Parsing Technology and its role in Legacy Modernization. A Metaware White Paper
Parsing Technology and its role in Legacy Modernization A Metaware White Paper 1 INTRODUCTION In the two last decades there has been an explosion of interest in software tools that can automate key tasks
Helpful Hints, Inserting Images, and creating Links to documents
Helpful Hints, Inserting Images, and creating Links to documents When you re putting together an email with images, links and formatting, you re really writing a piece of HTML code. Editors use what are
Collecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
Annotation in Language Documentation
Annotation in Language Documentation Univ. Hamburg Workshop Annotation SEBASTIAN DRUDE 2015-10-29 Topics 1. Language Documentation 2. Data and Annotation (theory) 3. Types and interdependencies of Annotations
D2.4: Two trained semantic decoders for the Appointment Scheduling task
D2.4: Two trained semantic decoders for the Appointment Scheduling task James Henderson, François Mairesse, Lonneke van der Plas, Paola Merlo Distribution: Public CLASSiC Computational Learning in Adaptive
Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing
1 Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing Lourdes Araujo Dpto. Sistemas Informáticos y Programación, Univ. Complutense, Madrid 28040, SPAIN (email: [email protected])
OWA User Guide. Table of Contents
OWA User Guide Table of Contents 1. Basic Functionality of Outlook Web Access... 2 How to Login to Outlook Web Access (OWA)... 2 Change Password... 3 Mail... 3 Composing Mail... 5 Attachments - Web Ready
Multi language e Discovery Three Critical Steps for Litigating in a Global Economy
Multi language e Discovery Three Critical Steps for Litigating in a Global Economy 2 3 5 6 7 Introduction e Discovery has become a pressure point in many boardrooms. Companies with international operations
Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability
Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability Ana-Maria Popescu Alex Armanasu Oren Etzioni University of Washington David Ko {amp, alexarm, etzioni,
Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University
Grammars and introduction to machine learning Computers Playing Jeopardy! Course Stony Brook University Last class: grammars and parsing in Prolog Noun -> roller Verb thrills VP Verb NP S NP VP NP S VP
MARY. V NP NP Subject Formation WANT BILL S
In the Logic tudy Guide, we ended with a logical tree diagram for WANT (BILL, LEAVE (MARY)), in both unlabelled: tudy Guide WANT BILL and labelled versions: P LEAVE MARY WANT BILL P LEAVE MARY We remarked
Udacity cs101: Building a Search Engine. Extracting a Link
Udacity cs101: Building a Search Engine Unit 1: How to get started: your first program Extracting a Link Introducing the Web Crawler (Video: Web Crawler)... 2 Quiz (Video: First Quiz)...2 Programming (Video:
THE BACHELOR S DEGREE IN SPANISH
Academic regulations for THE BACHELOR S DEGREE IN SPANISH THE FACULTY OF HUMANITIES THE UNIVERSITY OF AARHUS 2007 1 Framework conditions Heading Title Prepared by Effective date Prescribed points Text
UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE
UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE A.J.P.M.P. Jayaweera #1, N.G.J. Dias *2 # Virtusa Pvt. Ltd. No 752, Dr. Danister De Silva Mawatha, Colombo 09, Sri Lanka * Department of Statistics
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 Jin Yang and Satoshi Enoue SYSTRAN Software, Inc. 4444 Eastgate Mall, Suite 310 San Diego, CA 92121, USA E-mail:
PUBLIC Preferences Setup Automated Analytics User Guide
SAP Predictive Analytics 2.3 2015-08-27 PUBLIC Automated Analytics User Guide Content 1 About Startup Options....3 1.1 Accessing the Preferences Dialog.... 3 2 Setting the General Options....4 2.1 Default
1 Basic concepts. 1.1 What is morphology?
EXTRACT 1 Basic concepts It has become a tradition to begin monographs and textbooks on morphology with a tribute to the German poet Johann Wolfgang von Goethe, who invented the term Morphologie in 1790
DEALMAKER: An Agent for Selecting Sources of Supply To Fill Orders
DEALMAKER: An Agent for Selecting Sources of Supply To Fill Orders Pedro Szekely, Bob Neches, David Benjamin, Jinbo Chen and Craig Milo Rogers USC/Information Sciences Institute Marina del Rey, CA 90292
