Practical Structured Learning for Natural Language Processing

Transcription

1 Practical Structured Learning for Natural Language Processing Hal Daumé III Information Sciences Institute University of Southern California Committee: Daniel Marcu (Adviser), Kevin Knight, Eduard Hovy Stefan Schaal, Gareth James, Andrew McCallum Slide 1

2 All have innate structure to Processing of language to varying solve a degrees problem What Is Natural Language Processing? Machine Translation Summarization Information Extraction Input Arabic sentence Long documents Document Output English sentence Short documents Database Question Answering Corpus + query Corpus based NLP: Answer Parsing Collect example input/output Sentence pairs and Syntactic learn tree Understanding Sentence Logical sentence Slide 2

3 What is Machine Learning? Automatically uncovering structure in data Classification Regression Dimensionality reduction Clustering Sequence labeling Input Vector Vector Vector Vectors Vectors Output Bit (or bits) Real number Shorter vector Partition Labels Slide 3 Reinforcement learning State space + observations Policy

4 NLP versus Machine Learning Machine Translation Summarization Information Extraction Question Answering Parsing Understanding Natural Decomposition Classification Regression Dimensionality reduction Clustering Sequence labeling Arabic sentence Long documents Document Corpus + query Sentence Sentence Encoding Vector Vector Vector Vectors Vectors English sentence Short documents Database Answer Syntactic tree Logical sentence Ad Hoc Provably Search Good Bit (or bits) Real number Shorter vector Partition Labels Slide 4 Reinforcement learninginspired State space Policy

5 Entity Detection and Tracking Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright and suggested, to boot, that he had died of cancer. 1.3 (Example Problem: EDT) Slide 5 As the National Portrait Gallery planned to reveal that only one of half a dozen claimed portraits of William Shakespeare can now be considered genuine, Prof Hildegard Hammerschmidt Hummel said she could prove that there were at least four surviving portraits of the playwright. Why? Summarization, Machine Translation, etc...

6 Why is EDT Hard? Long range dependencies World Knowledge Syntax, Semantics and Discourse 1.3 (Example Problem: EDT) Slide 6 Highly constrained outputs Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright and suggested, to boot, that he had died of cancer.

7 How is EDT Typically Attacked? Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright and suggested, to boot, that he had died of cancer. Shakespeare scholarship,... a German academic claimed... PER * * * GPE PER * 5.2 (EDT: Prior Work) Shakespeare Slide 7 he playwright German academic Sequence Labeling Binary Classification + Clustering

8 Learning in NLP Applications Model Features Learning Search Slide 8

9 Result: Searn (= Search + Learn ) Computationally efficient Strong theoretical guarantees 3 (Search-based Structured Prediction) Slide 9 State of the art performance Broadly applicable Easy to implement

10 Talk Outline Searn: Search based Structured Prediction Experimental Results: Sequence Labeling Entity Detection and Tracking Automatic Document Summarization Discussion and Future Work Slide 10

11 Structured Prediction Formulated as a maximization problem: 2.2 (Machine Learning: Structured Prediction) Slide 11 y = arg max y Y x f x, y ; Search Model Features Learning Search (and learning) tractable only for very simple Y(x) and f

12 Reduction for Structured Prediction 3.3 (Search-based Structured Prediction) Slide 12 Idea: view structured prediction in light of search D P N R V D P N R V D P N R V I sing merrily Each step here looks like it could be represented as a weighted multiclass problem. Can we formalize this idea?

13 Error-Limiting Reductions A formal mapping between learning problems F [Beygelzimer et al., ICML 2005] 2.3 (Machine Learning: Reductions) Slide 13 Learning Problem A What we want to solve F 1 Learning Problem B F maps examples from A to examples from B What we know how to solve F 1 maps a classifier h for B to a classifier that solves A Error limiting: good performance on B implies good performance on A

14 Our Reduction Goal 2.3 (Machine Learning: Reductions) Slide 14 Structured Prediction Error bounds get composed Searn Importance Weighted Classification New techniques for binary classification lead to new techniques for structured prediction Any generalization theory for binary classification immediately applies to structured prediction Weighted All Pairs [Beygelzimer et al., ICML 2005] Costsensitive Classification Binary Classification Costing [Zadronzy, Langford & Abe, ICDM 2003]

15 Two Easy Reductions X {±1} R 0 Costing [Zadronzy, Langford & Abe, ICDM 2003] X {±1} h h h L IW L BC E [w] 2.3 (Machine Learning: Reductions) Slide 15 X R 0 R 0 1v2 1v3 2v Weighted All Pairs 1vK 2vK [Beygelzimer, Dani, Hayes, Langford and Zadrozny, ICML 2005] X {±1} R 0 L CS 2 L IW

16 A First Attempt 3.3 (Search-based Structured Prediction) Slide 16 D N A R V D N A R V D N A R V I sing merrily At each correct state: Train classifier to choose next state Thm (Kääriäinen): There is a 1st order binary Markov problem such that a classification error implies a Hamming loss of: T T T 2

17 Reducing Structured Prediction D N A R V D N A R V D N A R V I sing merrily (Searn: Optimal Policy) Key Assumption: Optimal Policy for training data Slide 17 Given: input, true output and state; Return: best successor state Weak!

18 Idea Optimal Policy Learned Policy (Features) 3.4 (Searn: Training) Slide 18

19 Iterative Algorithm: Searn Set current policy = optimal policy Repeat: Decode using current policy (Searn: Algorithm)... after a German academic claimed... * * GPE PER * PER Generate classification problems ORG * * PER * * PER * Learn new multiclass classifier Interpolate: cur = *cur + (1 )*new L = 0 L = 0.3 L = 0.2 L = 0.25 Slide 19

20 Theoretical Analysis Theorem: For conservative, after 2T3 ln T iterations, the loss of the learned policy is bounded as follows: 3.5 (Searn: Theoretical Analysis) Slide 20 L h L h 0 2 T ln T l avg 1 ln T c max T Loss of the optimal policy Average multiclass classification loss Worst case per step loss

21 Why Does Searn Work? A classifier tells us how to search 3.7 (Searn: Discussion and Conclusions) Impossible to train on full space Want to train only where we'll wind up Searn tells us where we'll wind up Slide 21

23 Proof on Concept: Sequence Labels Spanish Named Entity Recognition Handwriting Recognition Chunking+ Tagging (Sequence Labeling: Comparison) Slide 23 SVM ISO CRF Searn Searn+data LR SVM M3N Searn Searn+data Incremental Perceptron LaSO CRF Searn

25 Entity Detection and Tracking (EDT: Joint Detection and Coreference: Search) Slide 25 Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright and suggested, to boot, that he had died of cancer.... he had died of cancer... he had died of cancer... he had died of cancer... he had died of cancer... he had died of cancer... he had died of cancer... he had died of cancer

26 EDT Features Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright and suggested, to boot, that he had died of cancer. Lexical 5.[45].3 (EDT: Feature Functions) Syntactic Semantic Gazetteer Knowledge based Slide 26

27 EDT Results (ACE 2004 data) (EDT: Experimental Results) Slide Sys1 Sys2 Sys3 Sys4 Searn Previous state of the art with extra proprietary data

28 Feature Contributions (EDT: Experimental Results) Slide Lex -Syn -Sem -Gaz -KB

30 New Task: Summarization Give me That's information too much about the Falkland information That's islands perfect! to read! war Argentina was still obsessed with the Falkland Islands even in 1994, 12 years after its defeat in the 74 day war with Britain. The country's overriding foreign policy aim continued to be winning sovereignty over the islands. The Falkland islands war, in 1982, was fought between Britain and Argentina. Standard approach is sentence extraction, but that is often deemed too coarse to produce good, very short summaries. We wish to also drop words and phrases => document compression Slide 30

31 6.1 (Summarization: Vine-Growth Model) Structure of Search Argentina was still obsessed with the Falkland Islands even in 1994, 12 years after its defeat in the 74 day war with Britain. The country's overriding foreign policy aim continued to be winning sovereignty over the islands. man Optimal Policy not analytically available; Approximated with beam search. Slide 31 To make search more tractable, we run an initial round of sentence 5x length Lay sentences out sequentially Generate a dependency parse of each sentence Mark Theeach root as a a frontier big node Repeat: Choose a frontier node node to add to the summary Add all its children sandwich to the frontier Finish when we have enough words ate The man ate a big sandwich S1 S2 S3 S4 S5... Sn = frontier node = summary node

32 6.7 (Summarization: Error Analaysis) Example Output (40 word limit) Slide 32 Sentence Extraction + Compression: Argentina and Britain announced an agreement, nearly eight years after they fought a 74 day war a populated archipelago off Argentina's coast. Argentina gets out the red carpet, official royal visitor since the end of the Falklands war in Vine Growth: Argentina and Britain announced to restore full ties, eight years after they fought a 74 day war over the Falkland islands. Britain invited Argentina's minister Cavallo to London in 1992 in the first official visit since the Falklands war in Diplomatic ties restored 3 Falkland war was in Major cabinet member visits 3 Cavallo visited UK 5 Exchanges were in War was 74 days long 3 War between Britain and Argentina

33 Results (Summarization: Experimental Results) Slide 33 Approx. Optimal Vine Growth Vine Growth (Searn) Optimal Extract [Daumé III and Marcu, DUC 2005]

35 What Can Searn Do? Solve structured prediction problems......efficiently. 7 (Conclusions and Future Directions) Slide 35...with theoretical guarantees....with weak assumptions on structure....and outperform the competition....with little extra code required.

36 What Can Searn Not Do? (Yet) Solve structured prediction problems......with only weak feedback....with hidden variables. 7 (Conclusions and Future Directions) Slide 36...with enormous branching factors....without pre defined search spaces....with combinatorial outputs....with only an optimal path.

37 Thanks! Questions? Slide 37

38 BACKUP SLIDES Slide 38

39 Optimal Policies Weak Feedback Optimal Path? Most NLP Problems Approximately Optimal Policy Slide 39? Optimal Policy Optimally Completable Loss augmented Search Normalized Search Searn [Daumé III, Langford, Marcu, submitted to NIPS 2006] Incremental Perceptron [Collins & Roark, EMNLP 2004] LaSO [Daumé III, Marcu, ICML 2005] Max margin Markov Networks [Taskar, Guestrin, Koller, NIPS 2003] Conditional Random Fields [Lafferty, McCallum, Pereira, ICML 2001]