Sense-Tagging Verbs in English and Chinese. Hoa Trang Dang

1 Sense-Tagging Verbs in English and Chinese Hoa Trang Dang Department of Computer and Information Sciences University of Pennsylvania October 30, 2003

2 Outline
- English sense-tagging
  - Senseval-1 verbs
  - Senseval-2 verbs
  - WordNet verb sense groupings
- Chinese sense-tagging
  - Penn Chinese Treebank
  - People's Daily News
- Sense-tagging in PropBank II

3 Local Contextual Predicates for English WSD
- Collocational (Ratnaparkhi pos-tagger): target verb w; pos of w; pos of words at positions -1, +1 wrt w; words at positions -2, -1, +1, +2 wrt w
- Syntactic (Collins parser): is the sentence containing w passive; is there a sentential complement, subject, direct object, or indirect object; the words (if any) in the positions of subject, direct object, indirect object, particle, prepositional complement (and its object)
- Semantic (Nymble: Bikel et al.): Named Entity tag (PERSON, ORGANIZATION, LOCATION) for proper nouns, and WN synsets and hypernyms for all nouns in above syntactic relation to w
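The collocational predicates above can be sketched in a few lines of Python. This is only an illustration, not the system described in the slides: the function name `collocational_features`, the feature-name strings, the `<pad>` placeholder for positions outside the sentence, and the lowercasing of words are all assumptions, and the input is taken to be a sentence already tagged by some pos-tagger.

```python
from typing import Dict, List, Tuple

PAD = "<pad>"  # assumed placeholder for positions that fall outside the sentence


def collocational_features(tagged: List[Tuple[str, str]], i: int) -> Dict[str, str]:
    """Collocational predicates for the target verb at index i.

    `tagged` is a list of (word, pos) pairs for one sentence.
    """
    def word(j: int) -> str:
        return tagged[j][0].lower() if 0 <= j < len(tagged) else PAD

    def pos(j: int) -> str:
        return tagged[j][1] if 0 <= j < len(tagged) else PAD

    # Target verb w and its pos tag.
    feats = {"w": word(i), "pos(w)": pos(i)}
    # pos tags of the words at positions -1 and +1 relative to w.
    for off in (-1, +1):
        feats[f"pos[{off:+d}]"] = pos(i + off)
    # Words at positions -2, -1, +1, +2 relative to w.
    for off in (-2, -1, +1, +2):
        feats[f"word[{off:+d}]"] = word(i + off)
    return feats


sentence = [("The", "DT"), ("company", "NN"), ("called", "VBD"),
            ("a", "DT"), ("meeting", "NN")]
features = collocational_features(sentence, 2)
```

The syntactic and semantic predicates would need a parser and a named-entity tagger, so they are left out of this sketch.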

4 Topical Contextual Keywords
Generate list of keywords from training set for each verb:
- Sort all words k by the entropy of the sense distribution P(sense | k), where k appears anywhere in context, provided that k appears in more than a threshold number (= 2) of instances in the corpus
- Select words k with lowest entropy (most informative)
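A minimal sketch of this keyword selection, assuming the entropy is taken over the distribution of senses among the training instances whose context contains k. The function name `select_keywords`, the `top_n` cutoff, and the toy sense labels are hypothetical; only the frequency threshold of 2 comes from the slide.

```python
import math
from collections import Counter, defaultdict


def select_keywords(instances, min_count=2, top_n=3):
    """Rank context words by the entropy of the senses they co-occur with.

    instances: list of (sense, context_words) pairs for one target verb.
    A word is kept only if it appears in more than `min_count` instances.
    """
    sense_counts = defaultdict(Counter)
    for sense, words in instances:
        for k in set(words):  # count each word once per instance
            sense_counts[k][sense] += 1

    scored = []
    for k, counts in sense_counts.items():
        n = sum(counts.values())
        if n <= min_count:
            continue
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        scored.append((entropy, k))

    scored.sort()  # lowest entropy (most informative) first
    return [k for _, k in scored[:top_n]]


# Toy data with made-up sense labels.
instances = [
    ("call.phone", ["phone", "office", "the"]),
    ("call.phone", ["phone", "home", "the"]),
    ("call.phone", ["phone", "the"]),
    ("call.name", ["baby", "name", "the"]),
    ("call.name", ["name", "the", "office"]),
    ("call.name", ["name", "the"]),
]
keywords = select_keywords(instances, top_n=2)
```

In the toy data, "phone" and "name" each occur with a single sense (entropy 0), while "the" is split evenly across both senses (entropy 1 bit) and ranks last. The frequency threshold matters because a word seen only once trivially has zero entropy.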

5 Senseval-1 Lexical Sample Task
- Lexicon: Hector lexical database; senses are organized in hierarchies
- Corpus: British National Corpus
- High average inter-annotator agreement (95.5%)
- 13 verbs (12 senses/verb in corpus)
- Avg training set size: 215 instances/verb
- Baseline (most frequent sense): 57%
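The most-frequent-sense baseline quoted above can be illustrated with a short sketch: tag every test instance with the sense seen most often in training for that verb. The function names and the toy labels below are hypothetical, and the numbers are not from the Senseval data.

```python
from collections import Counter


def most_frequent_sense(train_senses):
    """Return the sense label that occurs most often in the training set."""
    return Counter(train_senses).most_common(1)[0][0]


def baseline_accuracy(train_senses, test_senses):
    """Accuracy obtained by always predicting the most frequent training sense."""
    mfs = most_frequent_sense(train_senses)
    return sum(s == mfs for s in test_senses) / len(test_senses)


# Toy labels for a single verb.
train = ["s1", "s1", "s2", "s3"]
test = ["s1", "s2", "s1", "s1"]
acc = baseline_accuracy(train, test)
```

For a lexical sample task the baseline would be computed per verb and then averaged or pooled; which of the two the slides use is not stated.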

6 Senseval-1 Verb Results
Columns: System, Accuracy, p-value
Rows: Avg. System; ETS (Naive Bayes); MaxEnt (lex+trans+topic); MaxEnt (best variants); JHU-final (Decision List)

7 Senseval-2 English Verb Lexical Sample Task
- Lexicon: WordNet 1.7; senses are also grouped
- Corpus: Penn Treebank WSJ, supplemented with British National Corpus
- Inter-annotator agreement: 71%
- 29 verbs, mostly highly polysemous (16 senses/verb in corpus)
- Avg training set size: 110 instances/verb
- Baseline (most frequent sense): 40%
- Best system performance: 60%

8 System Accuracy and Feature Types (English)

Feature (local)   Accuracy   Feature (local, topic)   Accuracy
collocation       48.3       collocation              52.9
+syn              53.9       +syn                     54.2
+syn+sem          59.        +syn+sem                 60.2

Linguistically richer features improve system accuracy

9 Senseval-2 Verbs Results
Columns: System, Accuracy, p-value
Rows: Avg. System; SMU; JHU; KUNLP; MaxEnt (accuracy 60.2); (Human)

10 Senseval-2 verb groupings methodology
- Groupings of senses done after sense-tagging for Senseval-2
- Double-blind grouping of each verb by two people
- Discussion of criteria used for groupings (syntactic and semantic)
- Adjudication of groupings by third person using agreed-upon criteria

11 Groupings improve performance
- Well-defined groupings improve human inter-annotator agreement (71% to 82%)
- Random grouping produced insignificant improvement in inter-annotator agreement (71% to 73%)
- Similar improvement in system score (60% to 70%)
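One way to read the grouped scores is that a tag counts as correct whenever it falls in the same sense group as the gold tag. The sketch below illustrates that reading only; it is an assumption about the scoring, and the function name `accuracy`, the group labels, and the sense labels are made up for the example.

```python
def accuracy(gold, predicted, group_of=None):
    """Fraction of instances tagged correctly.

    With `group_of` (a mapping from fine-grained sense to group label),
    a prediction is correct if it lies in the same group as the gold sense.
    Senses missing from the mapping are treated as singleton groups.
    """
    def to_group(sense):
        return group_of.get(sense, sense) if group_of else sense

    correct = sum(to_group(g) == to_group(p) for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical grouping of three fine-grained senses into two groups.
groups = {"call_1": "G1", "call_2": "G1", "call_3": "G2"}
gold = ["call_1", "call_2", "call_3", "call_1"]
pred = ["call_2", "call_2", "call_3", "call_3"]

fine_score = accuracy(gold, pred)
grouped_score = accuracy(gold, pred, groups)
```

The same function applies to inter-annotator agreement if one annotator's tags are passed as `gold` and the other's as `predicted`. Grouping can only raise or preserve the score, which is why the slide contrasts well-defined groupings with random ones.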

12 Chinese WSD (CTB)
- Lexicon: CETA (Chinese-English Translation Assistance) Dictionary
- Corpus: Penn Chinese Treebank (100K words)
- Manual segmentation, pos-tagging, parsing
- 28 words (multiple verb senses, possibly other pos), most polysemous in 5K-word sample of corpus
- 3.5 senses/word in corpus
- Baseline (most frequent sense): 77%

13 Contextual predicates (Chinese)
Local features:
- Collocational features: same as for English, plus a "follows verb" feature
- Syntactic features: hassubj, subj, hasobj, obj-p, obj, hasinobj, Comp-VP, VP-Comp, Comp-IP, hasprd
- Semantic features (for verbs only): HowNet noun category for each subject and object
Topical features: same as for English

14 System Accuracy and Feature Types (CTB)
Columns: Feature type, Accuracy, Std. Dev.
Rows: collocation; collocation (+ pos); collocation + syntax; collocation + syntax + semantics; baseline

15 Chinese WSD (PDN)
- Five words with low accuracy and counts in CTB subsequently sense-tagged in People's Daily News (1M words)
- PDN corpus has manual segmentation, pos-tagging; no parse
- About 200 sentences/word in PDN
- 8.2 senses/verb in corpus
- Baseline (most frequent sense): 58%
- Automatic segmentation, pos-tagging, parsing

16 System Accuracy and Feature Types (PDN, automatic)

Feature type                       Accuracy   Std. Dev.
collocation
collocation (+ pos)                70.3       2.9
collocation + syntax               71.
collocation + syntax + semantics
baseline

17 System Accuracy and Feature Types (PDN, manual)
Columns: Feature type, Accuracy, Std. Dev.
Rows: collocation; collocation (+ pos); collocation + topic

18 Differences between English and Chinese
- Higher number of verbs in Chinese than English
- Lower polysemy per verb for Chinese
- Many multi-character Chinese verbs
- Much ambiguity in Chinese is at level of word segmentation
- Lexical collocational information may be sufficient for Chinese

19 PropBank II sense-tagging
- Feasibility study: tag a reasonable set of polysemous words in Eng/Chin CTB; determine realistic, concrete sense-tagging goals for next two years
- Which sense distinctions will be most relevant to IE and MT? How fine-grained do we really need to be?
- What is the most efficient/accurate way to produce the data? Hierarchical tagging? Active learning? Does hand correcting automatic tagging bias the results?
