Evalita 09 Parsing Task: constituency parsers and the Penn format for Italian
|
|
|
- Ethel Banks
- 10 years ago
- Views:
Transcription
1 Evalita 09 Parsing Task: constituency parsers and the Penn format for Italian Cristina Bosco, Alessandro Mazzei, and Vincenzo Lombardo Dipartimento di Informatica, Università di Torino, Corso Svizzera 185, 10149, Torino Abstract. The aim of Evalita Parsing Task is at defining and extending the state of the art for parsing Italian by encouraging the application of existing models and approaches. Therefore, as in the first edition, the Task includes two tracks, i.e. dependency and constituency. This second track is based on a development set in a format, which is an adaptation for Italian of the Penn Treebank format, and has been applied by conversion to an existing dependency Italian treebank. The paper describes the constituency track and the data for development and testing of the participant systems. Moreover it presents and discusses the results, which positively compare with those obtained for constituency parsing in the Evalita 07 Parsing Task. Keywords: Parsing evaluation, constituency, Italian, treebank, Penn Treebank format 1 Introduction: motivations of the constituency Parsing Task The general aim of the Evalita Parsing Task contests is at defining and extending the current state of the art in parsing Italian with reference to the existing resources, by encouraging the application of existing models to this language. In the first edition, in 2007 ([1], [2], [3]), the focus was mainly on the application to the Italian language of various parsing approaches, i.e. rule-based and statistical, and paradigms, i.e. constituency and dependency. The same development data, extracted from the Turin University Treebank (TUT 1 ), have been therefore distributed both in dependency (TUT native) and constituency (TUT-Penn) format, and the Task was articulated in two parallel tracks, i.e. dependency and constituency. The results for dependency parsing have been evaluated as very close to the state of the art for English, while those for constituency showed a higher distance from it. Notwithstanding the results of the Parsing Task in 2007 and the increasing popularity of the dependency parsing, the Evalita 09 Parsing Task includes again also a constituency parsing track, together with the track for dependency. While the development of new approaches for constituency parsing has allowed 1 tutreeb
2 for the achievement of results of around F 92.1 [4] for English using the Penn Treebank, various experiences demonstrated that it is impossible to reproduce these results on other languages, e.g. Czech [5], German [6], Chinese [7], Italian [8], and on treebanks others than Penn [9]. Nevertheless, by proposing again the constituency track, we aim at contributing to the investigation on the causes of this irreproducibility with reference to Italian. Moreover, since the test set for the constituency and that for the dependency track share the same sentences, the contest makes available new materials for the development of cross-paradigm analyses about constituency parsing for Italian. The focus of this paper is on the constituency track of the Evalita 09 Parsing Task only (see [10] for dependency track at the Evalita 09). The paper is organized as follows. In the following section, we describe the constituency track. In the other sections, we show the development data sets, the evaluation measures, the participation results and the discussion about them. 2 Definition of the constituency parsing task The constituency parsing task is defined as the activity of assigning a syntactic structure to a given set of PoS tagged sentences (called test set), using a fully automatic parser and according to the constituency-based annotation scheme presented in a large set of sentences (called training or development set). The evaluation for this task is based on an annotated version of the test set (called gold standard test set). 3 Data sets The data for the constituency track are from TUT, the treebank for Italian developed by the Natural Language Processing group of the Department of Computer Science of the University of Turin 2. Even if smaller than other existing Italian resources, i.e. VIT [11] and ISST [12], TUT makes available more annotation formats [13], [14] that allowed for a larger variety of training and testing for parsing systems and for meaningfull comparisons with theoretical linguistic frameworks. In particular, since it is available both in dependency and constituency format, this treebank has been used in Evalita contests in 2007 and is used in 2009 as reference treebank. A recent project involving TUT concerns instead the development of the CCG-TUT, a treebank of Combinatory Categorial Grammar derivations for Italian [15]. The native annotation scheme of TUT is dependency-based, but in order to increase the comparability with other existing resources and in particular with the Penn Treebank, TUT has been converted in the TUT-Penn format, an adaptation of the Penn Treebank format for Italian. TUT-Penn has a morphosyntactic tagging richer than the native Penn format used for English, but implements almost the same syntactic structure of Penn. In fact, since Italian is 2 The free download of the resource is available at tutreeb.
3 inflectionally richer than English, TUT-Penn includes more PoS tags, i.e. 68 tags in TUT vs 36 in Penn Treebank [16]. Among the tags used in Penn, 22 are basic and 14 represent additional morphological features, while in TUT 19 are basic and 26 represent features. In both the tag sets, each feature can be composed with a limited number of basic tags, e.g. in TUT-Penn DE (for demonstrative) can be composed only with the basic tags ADJ (for adjective) and PRO (for pronoun). The higher number of tags in TUT-Penn is mainly motivated by the more fine-grained tagging for adjective, pronoun and especially verbs. For instance, in Penn a verb form can be annotated by using the basic tag VB (for basic form) or with a feature, e.g. VBD (for past tense) or VBG (for gerund or past participle), VBN (for past participle), VBP (non-third person singular present), VBZ (third person singular present), or can be annotated as MD (for modal). By contrary in TUT-Penn, the verb is annotated as VMA (for verb main), VMO (for modal) and VAU (for auxiliary) always associated with one of the 11 features that represent the conjugation, e.g. VMO PA for a modal verb in participle past or VMA IM for a main verb in imperfect. With respect to the syntactic annotation of Penn Treebank, TUT-Penn is different only for some particular constructions and phenomena. For instance, TUT- Penn features a representation different from Penn for the subjects occurring after the verb. In Italian, where the word order is more free than in English, this is a quite common phenomenon, which is typically challenging for phrase structures (since the subject is considered as external argument of the VP). Therefore the TUT-Penn annotates a special functional tag, i.e. EXTPSBJ on the lexically realized subject (PRO PE loro) and a null element co-indexed with it, (-NONE- *-233 ) but positioned in the canonical position of the subject, as in the following example (sentence ALB-108 from the development set, subcorpus newspaper): Erano loro quelli che in città guadagnavano di più. (word-by-word rough translation: Were they that in the town gained more). ( (S (NP-SBJ (-NONE- *-233)) (VP (VMA~IM Erano) (NP-EXTPSBJ-233 (PRO~PE loro)) (NP-PRD (NP (PRO~DE quelli)) (SBAR (NP-4333 (PRO~RE che)) (S (NP-SBJ (-NONE- *-4333)) (PP-LOC (PREP in) (NP (NOU~CA città))) (VP (VMA~IM guadagnavano) (ADVP (ADVB DI_PIÙ))))))) (..)) ) Observe instead that TUT-Penn implements the same Penn annotation of null elements and coindexing, as you can see in the example where the relative
4 clause is represented as in the Penn format: the head of the structure is the relative pronoun (PRO RE che), and a null element coindexed with the pronoun (-NONE- *-4333 ) is the subject of the clause. As well as TUT in native dependency format, the treebank in TUT-Penn format has been newly released in 2009 in an improved version where the annotation is more correct and consistent with respect to the previous version, that used in Evalita 07. Among the improvements of this last release there is the annotation of multi-word expressions, which are currently annotated by considering all the elements of an expression composed by more words as a single word or terminal node of the constituency tree (see e.g., in the previous example, the adverbial multi word expression di più more). The development set of the constituency track includes 2,200 sentences that correspond to 64,193 annotated tokens in TUT native format. The corpus is organized in two subcorpora, i.e. one from Italian newspaper (1,100 sentences and 33,522 tokens in TUT format) and one from the Italian Civil Law Code (1,100 sentences and 30,671 tokens in TUT format). The test set includes 200 sentences and features the same balancement of the development set: 100 sentences from newspapers and 100 from the Civil Law Code. In order to make possible comparisons with the dependency track, the same sentences of the test set for the constituency track are included in the test set for the dependency track. More precisely, the dependency track includes two subtasks, i.e. the Main Subtask 3, whose test set includes all the 200 sentences of the test set for the constituency track, and the Pilot Subtask 4, whose test set includes the 100 sentences from newspapers of the test set for the constituency track. 4 Evaluation measures and participation results We have used for the evaluation of constituency parsing results the standard metric EVALB ([5], it is a bracket scoring program that reports labelled precision (LP), recall (LR), F-score (LF), non crossing and tagging accuracy for given data. Note that the official measure for the final result is the F-score. As usual we did not score the TOP label; moreover, in contrast with usual, in order to have a direct comparison with dependency subtask we do use punctuation in scoring. We had two participants to the constituency track, i.e. FBKirst Lavelli CPAR and UniAmsterdam Sangati CPAR 5. The parser from FBKirst adopts probabilistic context-free grammars model; the parser from University of Amsterdam adopts the DOP model. 3 Where the development set includes, in TUT format, also the 2,200 sentences of the development set of the constituency track. 4 Where the development set is extracted from ISST 5 The name of each system that participated to the contest is composed according to the following pattern: institution author XPAR, where X is D for dependency and C for constituency.
5 The evaluation of the participation results for the constituency track is presented in table 1. Note that the difference between two results is taken to be significant if p < 0.05 (see and conll/software.html#eval). We can observe that the best results for this track have been achieved by the FBKirst Lavelli CPAR. Nevertheless, according to the p-value the difference between the first and second score cannot be considered as significant for recall. Table 1. Constituency parsing: evaluation on all the test set (200 sentences). LF LR LP p for LR p for LP Participant FBKirst Lavelli CPAR UniAmsterdam Sangati CPAR With respect to the subcorpora that represent the text genres of TUT within the test set, we can observe that the best results have been achieved on the set of sentences extracted from civil code section (see table 2). Table 2. Constituency parsing: separate evaluation on the newspaper (100 sentences) and civil law (100 sentences) test set. newspaper civillaw Participant LF LR LP LF LR LP FBKirst Lavelli DPAR UniAmsterdam Sangati DPAR 5 Discussion The results obtained in the constituency parsing track positively compare with the previous Evalita experience. The best results in 2007 and in the current contest have been achieved by the same team of research, i.e. FBKirst Lavelli DPAR, but with a different parser (Berkeley vs Bikel s [17]). In Evalita 07 the best scores are LF 67.97%, LP 65.36% and LR 70.81%, while in Evalita 09 the best scores are higher with an improvement of 10.76% for the LF, 12.12% for LP and 9.21% for LR.
6 Nevertheless, the distance from the results for English (F-score 92.1% [4]) remains high. Among the motivations for this distance it is important to take into account first of all the limited size of the treebank used for the training of the Italian parsers with respect to the Penn Treebank for English. Also the limited number of participants (two as in Evalita 07), which confirms the low popularity of the constituency parsing with respect to the dependency parsing for Italian (e.g. in Evalita 09 six participants for the main dependency subtask and five for the pilot), shows that we can expect from this area of research only a limited number of contributes and correlated evidences. Such a kind of evidences should come in the future also from comparisons among different constituency formats, as happened this year for dependency. With respect to the text genres involved in the evaluation we can say that civil code is easier to parse with respect to newspaper texts in constituency parsing s well as in dependency [10]. Finally, with respect to dependency track, for constituency the results remain lower, but the distance is meaningfully lower than in It is well known in literature that is difficult to compare dependency and constituency evaluations, but the distance among the results can be clearly interpreted according to the hypothesis that dependency parsing is more adequate for Italian than the constituency parsing, at least with reference to the currently applied models and annotation schemes. 6 References References 1. Bosco C., Mazzei A., Lombardo V.: Evalita Parsing Task: an analysis of the first parsing system contest for Italian. Intelligenza Artificiale, vol. 12, pp (2007) 2. Bosco, C., Mazzei, A., Lombardo, V., Attardi, G., Corazza, A., Lavelli, A., Lesmo, L., Satta, G., Simi, M.: Comparing Italian parsers on a common treebank: the Evalita experience. In: Proceedings of LREC 08, pp (2008) 3. Magnini, B., Cappelli, A., Tamburini, F., Bosco, C., Mazzei, A., Lombardo, V., Bertagna, F., Calzolari, N., Toral, A., Bartalesi Lenzi, V., Sprugnoli, R., Speranza, M.: Evaluation of Natural Language Tools for Italian: EVALITA In: Proceedings of LREC 08, pp (2008) 4. McClosky D., Charniak E., Johnson M.: When is self-training effettive for parsing? In: Proceedings of CoLing (2008) 5. Collins M., Hajic J., Ramshaw L., Tillmann C.: A Statistical Parser of Czech. In: Proceedings of ACL 99, pp (1999) 6. Dubey A., Keller F.: Probabilistic parsing for German using sister-head dependencies. In: Proceedings of ACL 03, pp (2003) 7. Levy R., Manning C.: Is it harder to parse Chinese, or the Chinese treebank? In: Proceedings of ACL 03, pp (2003) 8. Corazza A., Lavelli A., Satta G., Zanoli R.: Analyzing an Italian treebank with state-of-the-art statistical parser. In: Proceedings of TLT-2004, pp (2004) 9. Gildea D.: Corpus variation and parser performance. In: Proceedings of EMNLP 01, pp (2001)
7 10. Bosco C., Montemagni S., Mazzei A., Lombardo V., Dell Orletta F., Lenci A.: Evalita 09 Parsing Task: comparing dependency parsers and treebanks. In: Proceedings of EVALITA 2009 (2009) 11. Delmonte R.: Strutture sintattiche dall analisi computazionale di corpora di italiano. Franco Angeli, Milano (2008) 12. Montemagni S., Barsotti F., Battista M., Calzolari N., Corazzari O., Lenci A., Zampolli A., Fanciulli F., Massetani M., Raffaelli R., Basili R., Pazienza M. T., Saracino D., Zanzotto F., Mana N., Pianesi F., Delmonte R.: Building the Italian Syntactic-Semantic Treebank. In: Abeillé A. (ed.): Building and Using syntactically annotated corpora, pp Kluwer, Dordrecht (2003) 13. Bosco C., Lombardo V.: Comparing linguistic information in treebank annotations. In: Proceedings of LREC 06, pp (2006) 14. Bosco C.: Multiple-step treebank conversion: from dependency to Penn format. In: Proceedings of the Workshop on Linguistic Annotation at the ACL 07, pp (2007) 15. Bos, J., Bosco, C., Mazzei, A.: Converting a Dependency-Based Treebank to a Categorial Grammar Treebank for Italian. In: Proceedings of the 8th Workshop on Treebanks and Linguistic Theories, in press (2009) 16. Santorini B.: Part-of-Speech tagginf guidelines for the Penn Treebank project (3rd revision). Technical report no. MS-CIS at University of Pennsylvania, reports/570/ 17. Corazza A., Lavelli A., Satta G.: Phrase Based Statistical Parsing. Intelligenza Artificiale, vol. 12, pp (2007) 18. Buchholz S., Marsi E.: CoNLL-X Shared Task on Multilingual Dependency Parsing. In: Proceedings of the CoNLL-X, pp (2006) 19. Nivre J., Hall J., Kübler S., McDonald R., Nilsson J., Riedel S., Yuret D.: The CoNLL 2007 Shared Task on Dependency Parsing. In: Proceedings of the EMNLP- CoNLL, pp (2007)
PoS-tagging Italian texts with CORISTagger
PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy [email protected] Abstract. This paper presents an evolution of CORISTagger [1], an high-performance
CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test
CINTIL-PropBank I. Basic Information 1.1. Corpus information The CINTIL-PropBank (Branco et al., 2012) is a set of sentences annotated with their constituency structure and semantic role tags, composed
Overview of the EVALITA 2009 PoS Tagging Task
Overview of the EVALITA 2009 PoS Tagging Task G. Attardi, M. Simi Dipartimento di Informatica, Università di Pisa + team of project SemaWiki Outline Introduction to the PoS Tagging Task Task definition
Automatic Detection and Correction of Errors in Dependency Treebanks
Automatic Detection and Correction of Errors in Dependency Treebanks Alexander Volokh DFKI Stuhlsatzenhausweg 3 66123 Saarbrücken, Germany [email protected] Günter Neumann DFKI Stuhlsatzenhausweg
Effective Self-Training for Parsing
Effective Self-Training for Parsing David McClosky [email protected] Brown Laboratory for Linguistic Information Processing (BLLIP) Joint work with Eugene Charniak and Mark Johnson David McClosky - [email protected]
Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems
Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems cation systems. For example, NLP could be used in Question Answering (QA) systems to understand users natural
DEPENDENCY PARSING JOAKIM NIVRE
DEPENDENCY PARSING JOAKIM NIVRE Contents 1. Dependency Trees 1 2. Arc-Factored Models 3 3. Online Learning 3 4. Eisner s Algorithm 4 5. Spanning Tree Parsing 6 References 7 A dependency parser analyzes
Shallow Parsing with Apache UIMA
Shallow Parsing with Apache UIMA Graham Wilcock University of Helsinki Finland [email protected] Abstract Apache UIMA (Unstructured Information Management Architecture) is a framework for linguistic
Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing
1 Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing Lourdes Araujo Dpto. Sistemas Informáticos y Programación, Univ. Complutense, Madrid 28040, SPAIN (email: [email protected])
Building gold-standard treebanks for Norwegian
Building gold-standard treebanks for Norwegian Per Erik Solberg National Library of Norway, P.O.Box 2674 Solli, NO-0203 Oslo, Norway [email protected] ABSTRACT Språkbanken at the National Library of Norway
A Dynamic Oracle for Arc-Eager Dependency Parsing
A Dynamic Oracle for Arc-Eager Dependency Parsing Yoav Gold ber g 1 Joakim N ivre 1,2 (1) Google Inc. (2) Uppsala University [email protected], [email protected] ABSTRACT The standard training regime
Transition-Based Dependency Parsing with Long Distance Collocations
Transition-Based Dependency Parsing with Long Distance Collocations Chenxi Zhu, Xipeng Qiu (B), and Xuanjing Huang Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science,
Statistical Machine Translation
Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language
Chinese Open Relation Extraction for Knowledge Acquisition
Chinese Open Relation Extraction for Knowledge Acquisition Yuen-Hsien Tseng 1, Lung-Hao Lee 1,2, Shu-Yen Lin 1, Bo-Shun Liao 1, Mei-Jun Liu 1, Hsin-Hsi Chen 2, Oren Etzioni 3, Anthony Fader 4 1 Information
Open Domain Information Extraction. Günter Neumann, DFKI, 2012
Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for
Context Grammar and POS Tagging
Context Grammar and POS Tagging Shian-jung Dick Chen Don Loritz New Technology and Research New Technology and Research LexisNexis LexisNexis Ohio, 45342 Ohio, 45342 [email protected] [email protected]
Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 & the Fourth International Workshop EVALITA 2014
to es Qu Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 & the Fourth International Workshop EVALITA 2014 art pp ka oo e-b 9-11 December 2014, Pisa e ien ci en rol
POS Tagsets and POS Tagging. Definition. Tokenization. Tagset Design. Automatic POS Tagging Bigram tagging. Maximum Likelihood Estimation 1 / 23
POS Def. Part of Speech POS POS L645 POS = Assigning word class information to words Dept. of Linguistics, Indiana University Fall 2009 ex: the man bought a book determiner noun verb determiner noun 1
A Statistical Parser for Czech*
Michael Collins AT&T Labs-Research, Shannon Laboratory, 180 Park Avenue, Florham Park, NJ 07932 mcollins@research, att.com A Statistical Parser for Czech* Jan Haj i~. nstitute of Formal and Applied Linguistics
Why language is hard. And what Linguistics has to say about it. Natalia Silveira Participation code: eagles
Why language is hard And what Linguistics has to say about it Natalia Silveira Participation code: eagles Christopher Natalia Silveira Manning Language processing is so easy for humans that it is like
Development of a Dependency Treebank for Russian and its Possible Applications in NLP
Development of a Dependency Treebank for Russian and its Possible Applications in NLP Igor BOGUSLAVSKY, Ivan CHARDIN, Svetlana GRIGORIEVA, Nikolai GRIGORIEV, Leonid IOMDIN, Lеonid KREIDLIN, Nadezhda FRID
Workshop. Neil Barrett PhD, Jens Weber PhD, Vincent Thai MD. Engineering & Health Informa2on Science
Engineering & Health Informa2on Science Engineering NLP Solu/ons for Structured Informa/on from Clinical Text: Extrac'ng Sen'nel Events from Pallia've Care Consult Le8ers Canada-China Clean Energy Initiative
Introduction. BM1 Advanced Natural Language Processing. Alexander Koller. 17 October 2014
Introduction! BM1 Advanced Natural Language Processing Alexander Koller! 17 October 2014 Outline What is computational linguistics? Topics of this course Organizational issues Siri Text prediction Facebook
Identifying Focus, Techniques and Domain of Scientific Papers
Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 [email protected] Christopher D. Manning Department of
Customizing an English-Korean Machine Translation System for Patent Translation *
Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,
Genre distinctions and discourse modes: Text types differ in their situation type distributions
Genre distinctions and discourse modes: Text types differ in their situation type distributions Alexis Palmer and Annemarie Friedrich Department of Computational Linguistics Saarland University, Saarbrücken,
Sense-Tagging Verbs in English and Chinese. Hoa Trang Dang
Sense-Tagging Verbs in English and Chinese Hoa Trang Dang Department of Computer and Information Sciences University of Pennsylvania [email protected] October 30, 2003 Outline English sense-tagging
Applying Co-Training Methods to Statistical Parsing. Anoop Sarkar http://www.cis.upenn.edu/ anoop/ [email protected]
Applying Co-Training Methods to Statistical Parsing Anoop Sarkar http://www.cis.upenn.edu/ anoop/ [email protected] 1 Statistical Parsing: the company s clinical trials of both its animal and human-based
Domain Adaptation for Dependency Parsing via Self-training
Domain Adaptation for Dependency Parsing via Self-training Juntao Yu 1, Mohab Elkaref 1, Bernd Bohnet 2 1 University of Birmingham, Birmingham, UK 2 Google, London, UK 1 {j.yu.1, m.e.a.r.elkaref}@cs.bham.ac.uk,
Chunk Parsing. Steven Bird Ewan Klein Edward Loper. University of Melbourne, AUSTRALIA. University of Edinburgh, UK. University of Pennsylvania, USA
Chunk Parsing Steven Bird Ewan Klein Edward Loper University of Melbourne, AUSTRALIA University of Edinburgh, UK University of Pennsylvania, USA March 1, 2012 chunk parsing: efficient and robust approach
LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM*
LINGSTAT: AN INTERACTIVE, MACHINE-AIDED TRANSLATION SYSTEM* Jonathan Yamron, James Baker, Paul Bamberg, Haakon Chevalier, Taiko Dietzel, John Elder, Frank Kampmann, Mark Mandel, Linda Manganaro, Todd Margolis,
Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives
Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives Ramona Enache and Adam Slaski Department of Computer Science and Engineering Chalmers University of Technology and
SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer
SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer Timur Gilmanov, Olga Scrivner, Sandra Kübler Indiana University
INF5820 Natural Language Processing - NLP. H2009 Jan Tore Lønning [email protected]
INF5820 Natural Language Processing - NLP H2009 Jan Tore Lønning [email protected] Semantic Role Labeling INF5830 Lecture 13 Nov 4, 2009 Today Some words about semantics Thematic/semantic roles PropBank &
31 Case Studies: Java Natural Language Tools Available on the Web
31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software
Syntax: Phrases. 1. The phrase
Syntax: Phrases Sentences can be divided into phrases. A phrase is a group of words forming a unit and united around a head, the most important part of the phrase. The head can be a noun NP, a verb VP,
NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR
NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR Arati K. Deshpande 1 and Prakash. R. Devale 2 1 Student and 2 Professor & Head, Department of Information Technology, Bharati
Annotation Guidelines for Dutch-English Word Alignment
Annotation Guidelines for Dutch-English Word Alignment version 1.0 LT3 Technical Report LT3 10-01 Lieve Macken LT3 Language and Translation Technology Team Faculty of Translation Studies University College
Research Portfolio. Beáta B. Megyesi January 8, 2007
Research Portfolio Beáta B. Megyesi January 8, 2007 Research Activities Research activities focus on mainly four areas: Natural language processing During the last ten years, since I started my academic
Learning Translation Rules from Bilingual English Filipino Corpus
Proceedings of PACLIC 19, the 19 th Asia-Pacific Conference on Language, Information and Computation. Learning Translation s from Bilingual English Filipino Corpus Michelle Wendy Tan, Raymond Joseph Ang,
Automating Legal Research through Data Mining
Automating Legal Research through Data Mining M.F.M Firdhous, Faculty of Information Technology, University of Moratuwa, Moratuwa, Sri Lanka. [email protected] Abstract The term legal research generally
Linguistic Knowledge-driven Approach to Chinese Comparative Elements Extraction
Linguistic Knowledge-driven Approach to Chinese Comparative Elements Extraction Minjun Park Dept. of Chinese Language and Literature Peking University Beijing, 100871, China [email protected] Yulin Yuan
Markus Dickinson. Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013
Markus Dickinson Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013 1 / 34 Basic text analysis Before any sophisticated analysis, we want ways to get a sense of text data
Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words
, pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan
Comprendium Translator System Overview
Comprendium System Overview May 2004 Table of Contents 1. INTRODUCTION...3 2. WHAT IS MACHINE TRANSLATION?...3 3. THE COMPRENDIUM MACHINE TRANSLATION TECHNOLOGY...4 3.1 THE BEST MT TECHNOLOGY IN THE MARKET...4
Annotation and Evaluation of Swedish Multiword Named Entities
Annotation and Evaluation of Swedish Multiword Named Entities DIMITRIOS KOKKINAKIS Department of Swedish, the Swedish Language Bank University of Gothenburg Sweden [email protected] Introduction
The Specific Text Analysis Tasks at the Beginning of MDA Life Cycle
SCIENTIFIC PAPERS, UNIVERSITY OF LATVIA, 2010. Vol. 757 COMPUTER SCIENCE AND INFORMATION TECHNOLOGIES 11 22 P. The Specific Text Analysis Tasks at the Beginning of MDA Life Cycle Armands Šlihte Faculty
Outline of today s lecture
Outline of today s lecture Generative grammar Simple context free grammars Probabilistic CFGs Formalism power requirements Parsing Modelling syntactic structure of phrases and sentences. Why is it useful?
Parsing Software Requirements with an Ontology-based Semantic Role Labeler
Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh [email protected] Ewan Klein University of Edinburgh [email protected] Abstract Software
Brill s rule-based PoS tagger
Beáta Megyesi Department of Linguistics University of Stockholm Extract from D-level thesis (section 3) Brill s rule-based PoS tagger Beáta Megyesi Eric Brill introduced a PoS tagger in 1992 that was based
Text Generation for Abstractive Summarization
Text Generation for Abstractive Summarization Pierre-Etienne Genest, Guy Lapalme RALI-DIRO Université de Montréal P.O. Box 6128, Succ. Centre-Ville Montréal, Québec Canada, H3C 3J7 {genestpe,lapalme}@iro.umontreal.ca
Hybrid Strategies. for better products and shorter time-to-market
Hybrid Strategies for better products and shorter time-to-market Background Manufacturer of language technology software & services Spin-off of the research center of Germany/Heidelberg Founded in 1999,
Special Topics in Computer Science
Special Topics in Computer Science NLP in a Nutshell CS492B Spring Semester 2009 Jong C. Park Computer Science Department Korea Advanced Institute of Science and Technology INTRODUCTION Jong C. Park, CS
Interactive Dynamic Information Extraction
Interactive Dynamic Information Extraction Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann, and Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, 66123 Saarbrücken
BEER 1.1: ILLC UvA submission to metrics and tuning task
BEER 1.1: ILLC UvA submission to metrics and tuning task Miloš Stanojević ILLC University of Amsterdam [email protected] Khalil Sima an ILLC University of Amsterdam [email protected] Abstract We describe
Training and evaluation of POS taggers on the French MULTITAG corpus
Training and evaluation of POS taggers on the French MULTITAG corpus A. Allauzen, H. Bonneau-Maynard LIMSI/CNRS; Univ Paris-Sud, Orsay, F-91405 {allauzen,maynard}@limsi.fr Abstract The explicit introduction
Constraints in Phrase Structure Grammar
Constraints in Phrase Structure Grammar Phrase Structure Grammar no movement, no transformations, context-free rules X/Y = X is a category which dominates a missing category Y Let G be the set of basic
Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability
Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability Ana-Maria Popescu Alex Armanasu Oren Etzioni University of Washington David Ko {amp, alexarm, etzioni,
Simple Type-Level Unsupervised POS Tagging
Simple Type-Level Unsupervised POS Tagging Yoong Keok Lee Aria Haghighi Regina Barzilay Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology {yklee, aria42, regina}@csail.mit.edu
LINGUISTIC PROCESSING IN THE ATLAS PROJECT
LINGUISTIC PROCESSING IN THE ATLAS PROJECT Leonardo LESMO, Alessandro Mazzei, Daniele Radicioni Interaction Models Group Dipartimento di Informatica e Centro di Scienze Cognitive Università di Torino {lesmo,mazzei,radicion}@di.unito.it
Less Grammar, More Features
Less Grammar, More Features David Hall Greg Durrett Dan Klein Computer Science Division University of California, Berkeley {dlwh,gdurrett,klein}@cs.berkeley.edu Abstract We present a parser that relies
How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.
Svetlana Sokolova President and CEO of PROMT, PhD. How the Computer Translates Machine translation is a special field of computer application where almost everyone believes that he/she is a specialist.
Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University
Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University 1. Introduction This paper describes research in using the Brill tagger (Brill 94,95) to learn to identify incorrect
LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH
LASSY: LARGE SCALE SYNTACTIC ANNOTATION OF WRITTEN DUTCH Gertjan van Noord Deliverable 3-4: Report Annotation of Lassy Small 1 1 Background Lassy Small is the Lassy corpus in which the syntactic annotations
Robustness and processing difficulty models. A pilot study for eye-tracking data on the French Treebank
Robustness and processing difficulty models. A pilot study for eye-tracking data on the French Treebank Stéphane RAUZY Philippe BLACHE Aix-Marseille Université & CNRS Laboratoire Parole & Langage Aix-en-Provence,
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
Linguistics to Structure Unstructured Information
Linguistics to Structure Unstructured Information Authors: Günter Neumann (DFKI), Gerhard Paaß (Fraunhofer IAIS), David van den Akker (Attensity Europe GmbH) Abstract The extraction of semantics of unstructured
Binarizing Syntax Trees to Improve Syntax-Based Machine Translation Accuracy
Binarizing Syntax Trees to Improve Syntax-Based Machine Translation Accuracy Wei Wang and Kevin Knight and Daniel Marcu Language Weaver, Inc. 4640 Admiralty Way, Suite 1210 Marina del Rey, CA, 90292 {wwang,kknight,dmarcu}@languageweaver.com
The PALAVRAS parser and its Linguateca applications - a mutually productive relationship
The PALAVRAS parser and its Linguateca applications - a mutually productive relationship Eckhard Bick University of Southern Denmark [email protected] Outline Flow chart Linguateca Palavras History
Detecting Parser Errors Using Web-based Semantic Filters
Detecting Parser Errors Using Web-based Semantic Filters Alexander Yates Stefan Schoenmackers University of Washington Computer Science and Engineering Box 352350 Seattle, WA 98195-2350 Oren Etzioni {ayates,
Machine Learning Approach To Augmenting News Headline Generation
Machine Learning Approach To Augmenting News Headline Generation Ruichao Wang Dept. of Computer Science University College Dublin Ireland [email protected] John Dunnion Dept. of Computer Science University
ETL Ensembles for Chunking, NER and SRL
ETL Ensembles for Chunking, NER and SRL Cícero N. dos Santos 1, Ruy L. Milidiú 2, Carlos E. M. Crestana 2, and Eraldo R. Fernandes 2,3 1 Mestrado em Informática Aplicada MIA Universidade de Fortaleza UNIFOR
Object-Extraction and Question-Parsing using CCG
Object-Extraction and Question-Parsing using CCG Stephen Clark and Mark Steedman School of Informatics University of Edinburgh 2 Buccleuch Place, Edinburgh, UK stevec,steedman @inf.ed.ac.uk James R. Curran
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 Jin Yang and Satoshi Enoue SYSTRAN Software, Inc. 4444 Eastgate Mall, Suite 310 San Diego, CA 92121, USA E-mail:
Wiki-ly Supervised Part-of-Speech Tagging
Wiki-ly Supervised Part-of-Speech Tagging Shen Li Computer & Information Science University of Pennsylvania [email protected] João V. Graça L 2 F INESC-ID Lisboa, Portugal [email protected] Ben
Automated Extraction of Security Policies from Natural-Language Software Documents
Automated Extraction of Security Policies from Natural-Language Software Documents Xusheng Xiao 1 Amit Paradkar 2 Suresh Thummalapenta 3 Tao Xie 1 1 Dept. of Computer Science, North Carolina State University,
An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines)
An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines) James Clarke, Vivek Srikumar, Mark Sammons, Dan Roth Department of Computer Science, University of Illinois, Urbana-Champaign.
A chart generator for the Dutch Alpino grammar
June 10, 2009 Introduction Parsing: determining the grammatical structure of a sentence. Semantics: a parser can build a representation of meaning (semantics) as a side-effect of parsing a sentence. Generation:
COMPUTATIONAL DATA ANALYSIS FOR SYNTAX
COLING 82, J. Horeck~ (ed.j North-Holland Publishing Compa~y Academia, 1982 COMPUTATIONAL DATA ANALYSIS FOR SYNTAX Ludmila UhliFova - Zva Nebeska - Jan Kralik Czech Language Institute Czechoslovak Academy
BILINGUAL TRANSLATION SYSTEM
BILINGUAL TRANSLATION SYSTEM (FOR ENGLISH AND TAMIL) Dr. S. Saraswathi Associate Professor M. Anusiya P. Kanivadhana S. Sathiya Abstract--- The project aims in developing Bilingual Translation System for
Language Model of Parsing and Decoding
Syntax-based Language Models for Statistical Machine Translation Eugene Charniak ½, Kevin Knight ¾ and Kenji Yamada ¾ ½ Department of Computer Science, Brown University ¾ Information Sciences Institute,
Task-oriented Evaluation of Syntactic Parsers and Their Representations
Task-oriented Evaluation of Syntactic Parsers and Their Representations Yusuke Miyao Rune Sætre Kenji Sagae Takuya Matsuzaki Jun ichi Tsujii Department of Computer Science, University of Tokyo, Japan School
A Method for Automatic De-identification of Medical Records
A Method for Automatic De-identification of Medical Records Arya Tafvizi MIT CSAIL Cambridge, MA 0239, USA [email protected] Maciej Pacula MIT CSAIL Cambridge, MA 0239, USA [email protected] Abstract
L130: Chapter 5d. Dr. Shannon Bischoff. Dr. Shannon Bischoff () L130: Chapter 5d 1 / 25
L130: Chapter 5d Dr. Shannon Bischoff Dr. Shannon Bischoff () L130: Chapter 5d 1 / 25 Outline 1 Syntax 2 Clauses 3 Constituents Dr. Shannon Bischoff () L130: Chapter 5d 2 / 25 Outline Last time... Verbs...
