Sense-Tagging Verbs in English and Chinese

Hoa Trang Dang
Department of Computer and Information Sciences, University of Pennsylvania
htd@linc.cis.upenn.edu
October 30, 2003
Outline
- English sense-tagging
  - Senseval-1 verbs
  - Senseval-2 verbs
  - WordNet verb sense groupings
- Chinese sense-tagging
  - Penn Chinese Treebank
  - People's Daily News
- Sense-tagging in PropBank II
Local Contextual Predicates for English WSD
- Collocational (Ratnaparkhi POS tagger): target verb w; POS of w; POS of words at positions -1, +1 w.r.t. w; words at positions -2, -1, +1, +2 w.r.t. w
- Syntactic (Collins parser): whether the sentence containing w is passive; whether there is a sentential complement, subject, direct object, or indirect object; the words (if any) in the positions of subject, direct object, indirect object, particle, prepositional complement (and its object)
- Semantic (Nymble; Bikel et al.): Named Entity tag (PERSON, ORGANIZATION, LOCATION) for proper nouns, and WordNet synsets and hypernyms for all nouns in the above syntactic relations to w
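The collocational part of this feature template can be sketched as follows. This is a minimal illustration, not the authors' implementation; the feature names and the `<pad>` boundary token are assumptions.

```python
def collocational_features(tokens, pos_tags, i):
    """Collocational features for the target verb at index i, following the
    template on this slide: the target word and its POS, POS of the words
    at -1/+1, and the words at -2/-1/+1/+2 relative to the target.
    """
    def word(j):
        # Pad at sentence boundaries (an assumed convention).
        return tokens[j] if 0 <= j < len(tokens) else "<pad>"

    def pos(j):
        return pos_tags[j] if 0 <= j < len(pos_tags) else "<pad>"

    return {
        "w": word(i), "pos": pos(i),
        "pos-1": pos(i - 1), "pos+1": pos(i + 1),
        "w-2": word(i - 2), "w-1": word(i - 1),
        "w+1": word(i + 1), "w+2": word(i + 2),
    }
```

Each key/value pair would become one binary predicate in a maximum-entropy model.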
Topical Contextual Keywords
- Generate a list of keywords from the training set for each verb:
  - Sort all words k by the entropy H(p(s|k)) of the sense distribution conditioned on k, where k appears anywhere in the context, provided that k appears in more than M (= 2) instances in the corpus
  - Select the 200-300 words k with the lowest entropy (most informative)
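The selection step above can be sketched as below, assuming the training data for one verb is available as (sense, context words) pairs; the function and variable names are illustrative, not from the original system.

```python
import math
from collections import Counter, defaultdict

def select_topical_keywords(instances, min_count=2, max_keywords=200):
    """Select low-entropy context words as topical keywords for one verb.

    instances: list of (sense, context_words) pairs for a single target verb
    (a hypothetical representation of the training data).
    """
    # Count how often each context word co-occurs with each sense.
    sense_counts = defaultdict(Counter)
    word_counts = Counter()
    for sense, context in instances:
        for w in set(context):
            sense_counts[w][sense] += 1
            word_counts[w] += 1

    # Entropy H(p(s|k)) of the conditional sense distribution given word k.
    def entropy(word):
        total = word_counts[word]
        return -sum((c / total) * math.log2(c / total)
                    for c in sense_counts[word].values())

    # Keep only words seen in more than min_count instances, as on the slide.
    candidates = [w for w, n in word_counts.items() if n > min_count]
    candidates.sort(key=entropy)  # lowest entropy = most informative
    return candidates[:max_keywords]
```

A word that nearly always co-occurs with one sense has entropy near zero and is kept; a word spread evenly across senses has high entropy and is dropped.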
Senseval-1 Lexical Sample Task
- Lexicon: Hector lexical database; senses are organized in hierarchies
- Corpus: British National Corpus
- High average inter-annotator agreement (95.5%)
- 13 verbs (12 senses/verb in corpus)
- Avg training set size: 215 instances/verb
- Baseline (most frequent sense): 57%
Senseval-1 Verb Results

  System                      Accuracy   p-value
  Avg. System                 66.4       0.001
  ETS (Naive Bayes)           71.0       0.005
  MaxEnt (lex+trans+topic)    72.3       0.100
  MaxEnt (best variants)      73.7       0.400
  JHU-final (Decision List)   74.3       -
Senseval-2 English Verb Lexical Sample Task
- Lexicon: WordNet 1.7; senses are also grouped
- Corpus: Penn Treebank WSJ, supplemented with the British National Corpus
- Inter-annotator agreement: 71%
- 29 verbs, mostly highly polysemous (16 senses/verb in corpus)
- Avg training set size: 110 instances/verb
- Baseline (most frequent sense): 40%
- Best system performance: 60%
System Accuracy and Feature Types (English)

  Feature (local)   Accuracy    Feature (local, topic)   Accuracy
  collocation       48.3        collocation              52.9
  +syn              53.9        +syn                     54.2
  +syn+sem          59.0        +syn+sem                 60.2

Linguistically richer features improve system accuracy.
Senseval-2 Verb Results

  System        Accuracy   p-value
  Avg. System   38.2       0.001
  SMU           56.3       0.010
  JHU           56.6       0.020
  KUNLP         57.6       0.100
  MaxEnt        60.2       -
  (Human)       71.3       0.001
Senseval-2 Verb Groupings Methodology
- Groupings of senses done after sense-tagging for Senseval-2
- Double-blind grouping of each verb by two people
- Discussion of the criteria used for groupings, both syntactic and semantic
- Adjudication of groupings by a third person using the agreed-upon criteria
Groupings Improve Performance
- Well-defined groupings improve human inter-annotator agreement (71% to 82%)
- Random groupings produced an insignificant improvement in inter-annotator agreement (71% to 73%)
- Similar improvement in system score (60% to 70%)
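Scoring at the group level means a prediction is credited whenever it falls in the same sense group as the gold tag. A minimal sketch, with hypothetical sense and group labels:

```python
def grouped_accuracy(gold, predicted, group_of):
    """Accuracy at the sense-group level: a prediction is correct if it is
    in the same group as the gold sense.  group_of maps each fine-grained
    sense to its group id (labels here are illustrative).
    """
    correct = sum(group_of[g] == group_of[p] for g, p in zip(gold, predicted))
    return correct / len(gold)
```

Under this scoring, confusions between closely related senses within a group (the kind annotators also disagree on) no longer count as errors, which is why both human agreement and system scores rise.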
Chinese WSD (CTB)
- Lexicon: CETA (Chinese-English Translation Assistance) Dictionary
- Corpus: Penn Chinese Treebank (100K words)
  - Manual segmentation, POS tagging, parsing
- 28 words (multiple verb senses, possibly other POS), most polysemous in a 5K-word sample of the corpus
- 3.5 senses/word in corpus
- Baseline (most frequent sense): 77%
Contextual Predicates (Chinese)
- Local features:
  - Collocational features: same as for English, plus a follows-verb feature
  - Syntactic features: hassubj, subj, hasobj, obj-p, obj, hasinobj, Comp-VP, VP-Comp, Comp-IP, hasprd
  - Semantic features (for verbs only): HowNet noun category for each subject and object
- Topical features: same as for English
System Accuracy and Feature Types (CTB)

  Feature type                       Accuracy   Std. Dev.
  collocation                        86.8       1.0
  collocation (+ pos)                93.4       0.5
  collocation + syntax               94.3       0.4
  collocation + syntax + semantics   94.4       0.6
  baseline                           76.7       -
Chinese WSD (PDN)
- Five words with low accuracy and low counts in CTB were subsequently sense-tagged in People's Daily News (1M words)
- PDN corpus has manual segmentation and POS tagging, but no parses
- About 200 sentences/word in PDN
- 8.2 senses/verb in corpus
- Baseline (most frequent sense): 58%
- Automatic segmentation, POS tagging, and parsing also applied
System Accuracy and Feature Types (PDN, automatic)

  Feature type                       Accuracy   Std. Dev.
  collocation                        72.3       2.2
  collocation (+ pos)                70.3       2.9
  collocation + syntax               71.7       3.0
  collocation + syntax + semantics   72.7       3.1
  baseline                           57.6       -
System Accuracy and Feature Types (PDN, manual)

  Feature type          Accuracy   Std. Dev.
  collocation           71.4       4.3
  collocation (+ pos)   74.7       2.3
  collocation + topic   72.1       3.1
Differences between English and Chinese
- Higher number of verbs in Chinese than in English
- Lower polysemy per verb in Chinese
- Many multi-character Chinese verbs
- Much of the ambiguity in Chinese is at the level of word segmentation
- Lexical collocational information may be sufficient for Chinese
PropBank II Sense-Tagging
- Feasibility study: tag a reasonable set of polysemous words in English and in the Chinese Treebank (CTB)
- Determine realistic, concrete sense-tagging goals for the next two years:
  - Which sense distinctions will be most relevant to IE and MT? How fine-grained do we really need to be?
  - What is the most efficient/accurate way to produce the data? Hierarchical tagging? Active learning? Does hand-correcting automatic tagging bias the results?