Effective Self-Training for Parsing




Effective Self-Training for Parsing
David McClosky (dmcc@cs.brown.edu)
Brown Laboratory for Linguistic Information Processing (BLLIP)
Joint work with Eugene Charniak and Mark Johnson
NAACL 2006

Parsing
I need a sentence with ambiguity.

(S (NP (PRP I))
   (VP (VBP need)
       (NP (NP (DT a) (NN sentence))
           (PP (IN with)
               (NP (NN ambiguity)))))
   (. .))

Parsing
s is a sentence, π is a parse tree.

parse(s) = argmax_π p(π | s)  such that yield(π) = s

Flow Chart (figure)

Flow Chart (figure)

n-best parsing
π₁ (PP attached to the NP):
(S (NP (PRP I)) (VP (VBP need) (NP (NP a sentence) (PP with ambiguity))) (. .))
p(π₁) = 7.25 × 10^-20

π₂ (PP attached to the VP):
(S (NP (PRP I)) (VP (VBP need) (NP a sentence) (PP with ambiguity)) (. .))
p(π₂) = 7.05 × 10^-21

Reranking Parsers
The best parse is not always ranked first, but the correct parse is often in the top 50. Rerankers rescore the parses from the n-best parser using more complex (not necessarily context-free) features. Oracle rerankers on the Charniak parser's 50-best list can achieve over 95% f-score.
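
To make the rescoring step concrete, here is a minimal Python sketch of how a discriminative reranker picks a parse from an n-best list. The feature extractor and weight vector are placeholders for illustration, not the actual Charniak and Johnson feature set:

    def rerank(nbest, features, weights):
        # nbest:    list of (tree, parser_log_prob) pairs from the n-best parser
        # features: hypothetical feature extractor mapping a tree to {name: value}
        # weights:  learned weights, e.g. from maximum-entropy training
        def score(tree, log_prob):
            # The parser's log probability is itself a feature; the reranker
            # adds arbitrary (not necessarily context-free) features on top.
            return log_prob + sum(weights.get(name, 0.0) * value
                                  for name, value in features(tree).items())
        return max(nbest, key=lambda pair: score(*pair))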

Flow Chart (figure)

Our reranking parser
The parser and reranker are as described in Charniak and Johnson (ACL 2005), with new features: a lexicalized context-free generative parser and a maximum-entropy discriminative reranker. The new reranking features improve the reranking parser's performance on section 23 by 0.3% over ACL 2005.

Unlabelled data
Question: Can we improve the reranking parser with cheap unlabeled data?
- Self-training
- Co-training
- Clustering: cluster n-grams and use the clusters as general classes of n-grams
- Improving the vocabulary, n-gram language model, etc.

Self-training
1. Train a model from labeled data: train the reranking parser on WSJ.
2. Use the model to annotate unlabeled data: use it to parse NANC.
3. Combine the annotated data with the labeled training data: merge the WSJ training data with the parsed NANC data.
4. Train a new model from the combined data: train the reranking parser on the WSJ+NANC data.
Optional: repeat with the new model on more unlabeled data.
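
As a rough illustration, this recipe fits in a few lines of Python. Here train and best_parse stand in for training and decoding with the reranking parser; they are assumptions made for the sketch, not BLLIP's actual interface:

    def self_train(wsj_trees, nanc_batches, train, best_parse):
        # nanc_batches: iterable of batches of unlabeled NANC sentences;
        # each extra batch is one optional round of self-training.
        model = train(wsj_trees)       # 1. train the reranking parser on WSJ
        auto_trees = []
        for batch in nanc_batches:
            auto_trees += [best_parse(model, s) for s in batch]  # 2. parse NANC
            model = train(wsj_trees + auto_trees)  # 3-4. merge and retrain
        return model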

Flow Chart (figure)

Previous work
- Parsing: Charniak (1997), confirmed by Steedman et al. (2003): insignificant improvement.
- Part-of-speech tagging: Clark et al. (2003): minor improvement or damage, depending on the amount of training data.
- Parser adaptation: Bacchiani et al. (2006): self-training on news data helps a Brown-trained parser when parsing WSJ.

Experiments (overview)
- How should we annotate the data: with the parser or with the reranking parser?
- How much unlabelled data should we label?
- How should we combine the annotated unlabeled data with the true data?

Annotating unlabeled data
Parser (not reranking parser) f-scores on all sentences in section 22, by which annotator parsed the added NANC sentences:

Sentences added | Parser | Reranking parser
0 (baseline)    |       90.3
50k             |  90.1  |  90.7
500k            |  90.0  |  90.9
1,000k          |  90.0  |  90.8
1,500k          |  90.0  |  90.8
2,000k          |   –    |  91.0

Annotating unlabeled data
Reranking-parser f-scores for all sentences, by WSJ section:

Sentences added | §1   | §22  | §24
0 (baseline)    | 91.8 | 92.1 | 90.5
50k             | 91.8 | 92.4 | 90.8
500k            | 92.0 | 92.4 | 90.9
1,000k          | 92.1 | 92.2 | 91.3
2,000k          | 92.2 | 92.0 | 91.3

Weighting WSJ data
Wall Street Journal data is more reliable than the self-trained data, so we multiply each event in the Wall Street Journal data by a constant to give it a higher relative weight:

events = c × events_WSJ + events_NANC

Increasing the WSJ weight tends to improve f-scores. Based on development data, our best model is WSJ×5 + 1,750k sentences from NANC.
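
For a generative parser trained from event counts, this weighting amounts to scaling the WSJ counts before merging them with the NANC counts, roughly as in this sketch (the event representation is an assumption):

    from collections import Counter

    def combine_events(wsj_events, nanc_events, c=5):
        # events = c * events_wsj + events_nanc
        # wsj_events, nanc_events: Counters over parser events (e.g. rule expansions)
        # c: relative weight of the hand-labeled WSJ data (c = 5 was best on dev data)
        combined = Counter()
        for event, count in wsj_events.items():
            combined[event] += c * count  # each WSJ event counts c times
        combined.update(nanc_events)      # each self-trained event counts once
        return combined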

Evaluation on test section
f-scores on all sentences in WSJ section 23:

Model                       | f (parser) | f (reranker)
Charniak and Johnson (2005) |     –      |    91.0
Current baseline            |    89.7    |    91.3
Self-trained                |    91.0    |    92.1

The Story So Far...
- Retraining the parser on its own output doesn't help.
- Retraining the parser on the reranker's output helps.
- Retraining the reranker on the reranker's output doesn't help.

Analysis: Global changes
Oracle f-scores increase; the self-trained parser has greater potential:

Model           | 1-best | 10-best | 50-best
Baseline        |  89.0  |  94.0   |  95.9
WSJ×1 + 250k    |  89.8  |  94.6   |  96.2
WSJ×5 + 1,750k  |  90.4  |  94.8   |  96.4

The average of log2(Pr(1-best) / Pr(50th-best)) increases from 12.0 (baseline parser) to 14.1 (self-trained parser).
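
For reference, the oracle f-score is what we would get by always picking the best parse in each n-best list. A per-sentence sketch, assuming a helper fscore(candidate, gold) that computes labeled bracketing f-score (published numbers aggregate bracket counts over the whole corpus; averaging per-sentence scores is a simplification):

    def oracle_fscore(nbest_lists, gold_trees, fscore):
        # Upper bound on what a perfect reranker could achieve: for each
        # sentence, credit the single best parse anywhere in its n-best list.
        best = [max(fscore(tree, gold) for tree, _ in nbest)
                for nbest, gold in zip(nbest_lists, gold_trees)]
        return sum(best) / len(best)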

Sentence-level Analysis
(Figures: counts of sentences that parse better, unchanged, or worse, plotted against sentence length, number of unknown words, number of CCs, and number of INs.)

Effect of Sentence Length
(Figure: smoothed counts of better / no change / worse sentences by sentence length.)

The Goldilocks Effect™
(Same figure as the previous slide: the gains concentrate on medium-length sentences.)

... and ...
(Figure: counts of better / no change / worse sentences by number of CCs.)

Ongoing work
- Parser adaptation (McClosky, Charniak, and Johnson, ACL 2006)
- Sentence selection
- Clustering local trees
- Other ways of combining data

Conclusions
- Self-training can improve on state-of-the-art parsing for the Wall Street Journal.
- Reranking parsers can self-train their first-stage parser.
- More analysis is needed to understand why reranking is necessary.
The self-trained reranking parser is available from ftp://ftp.cs.brown.edu/pub/nlparser

Acknowledgements
This work was supported by NSF grants LIS9720368 and IIS0095940, and by DARPA GALE contract HR0011-06-2-0001. Thanks to Michael Collins, Brian Roark, James Henderson, Miles Osborne, and the BLLIP team for their comments. Questions?