Training PCFGs. BM1 Advanced Natural Language Processing. Alexander Koller. 2 December 2014

1 Training PCFGs. BM1 Advanced Natural Language Processing. Alexander Koller. 2 December 2014

2 Probabilistic CFGs. [Figure: example grammar in which every rule carries its probability in square brackets, e.g. NP → i [0.2], V → shot [1.0], N → elephant [0.3], Det → an [0.5].] (Let's pretend for simplicity that Det = PRP$.)

3 Parse trees. [Figure: two parse trees for the same sentence, each with its probability p.] "Correct" = the more probable parse tree.
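
The probability of a parse tree under a PCFG is the product of the probabilities of the rules used in it, which is what makes two trees for the same sentence comparable. A minimal Python sketch, assuming trees are written as nested tuples; the grammar `RULE_PROB` and its numbers are hypothetical and are not the grammar from the slides:

```python
from math import prod

# Hypothetical toy PCFG for illustration only: (lhs, rhs) -> probability.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("slept",)): 0.3,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("i",)): 0.4,
    ("V", ("shot",)): 1.0,
    ("Det", ("an",)): 1.0,
    ("N", ("elephant",)): 1.0,
}


def rules_of(tree):
    """Yield every rule (lhs, rhs) used in a tree given as nested tuples."""
    label, *children = tree
    if isinstance(children[0], str):
        # Lexical node of the form (label, word).
        yield (label, (children[0],))
    else:
        # Internal node of the form (label, child, child, ...).
        yield (label, tuple(child[0] for child in children))
        for child in children:
            yield from rules_of(child)


def tree_prob(tree, rule_prob=RULE_PROB):
    """Probability of a tree = product of the probabilities of its rules."""
    return prod(rule_prob[rule] for rule in rules_of(tree))


tree = ("S",
        ("NP", "i"),
        ("VP", ("V", "shot"), ("NP", ("Det", "an"), ("N", "elephant"))))
p = tree_prob(tree)  # 1.0 * 0.4 * 0.7 * 1.0 * 0.6 * 1.0 * 1.0
```

With several candidate trees for one sentence, the parser would prefer the one for which `tree_prob` is largest.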

4 Today
Parameters of a PCFG = rule probabilities. How do we learn parameters from corpora?
- maximum likelihood estimation
- hard EM using Viterbi
- soft EM using the inside-outside algorithm

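5 ML Estimation
Assume we have a treebank, that is, every sentence is annotated by hand with its correct parse tree. Then we can use MLE to obtain rule probabilities, where $C(r)$ is the number of times rule $r$ occurs in the treebank:
$$P(A \to w) = \frac{C(A \to w)}{C(A \to \bullet)} = \frac{C(A \to w)}{\sum_{w'} C(A \to w')}$$
This is the standard way of parameter estimation in practice. It works well; not much smoothing is necessary on the PTB.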
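
The relative-frequency estimate can be sketched in a few lines of Python. The two-tree treebank below is hypothetical (it is not the example from the slides), trees are assumed to be nested tuples, and `Fraction` is used only so that the results print as exact ratios:

```python
from collections import Counter
from fractions import Fraction


def rules_of(tree):
    """Yield every rule (lhs, rhs) used in a tree given as nested tuples."""
    label, *children = tree
    if isinstance(children[0], str):
        yield (label, (children[0],))          # lexical rule
    else:
        yield (label, tuple(child[0] for child in children))
        for child in children:
            yield from rules_of(child)


def mle(treebank):
    """P(A -> w) = C(A -> w) / sum over w' of C(A -> w')."""
    rule_count = Counter()
    lhs_count = Counter()
    for tree in treebank:
        for lhs, rhs in rules_of(tree):
            rule_count[lhs, rhs] += 1
            lhs_count[lhs] += 1
    return {rule: Fraction(count, lhs_count[rule[0]])
            for rule, count in rule_count.items()}


# Hypothetical treebank with two hand-annotated trees.
treebank = [
    ("S", ("NP", "i"),
          ("VP", ("V", "shot"), ("NP", ("Det", "an"), ("N", "elephant")))),
    ("S", ("NP", ("Det", "an"), ("N", "elephant")),
          ("VP", "slept")),
]
probs = mle(treebank)
```

Here NP is expanded three times in total, twice as `Det N` and once as `i`, so the estimates for those two rules are 2/3 and 1/3.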

6 Example. [Figure: example treebank; labels include TV, V, shot, an, elephant, slept.]

9 Example. [Figure: the same treebank with the MLE rule probabilities read off from it; visible values: [0], TV [1/4], elephant [1/3], V [1/4], [2/3], [1/2].]

10 Unsupervised estimation
MLE works well for English. German: the Tiger treebank exists, but is hard for PCFGs, e.g. because of free word order. Most other languages: phrase structure annotations are unavailable and expensive to create -> unsupervised methods?
Unsupervised methods: provide a CFG, learn parameters from an unannotated corpus. We show first hard EM, then soft EM; the ideas are instructive and generalize to related problems.

11 Hard aka Viterbi EM
In the absence of syntactic annotations, the learner must invent its own parse trees. Viterbi EM:
- start with some parameter estimate
- produce syntactic annotations by computing the best tree for each sentence using Viterbi
- apply MLE to re-estimate parameters
- repeat as long as needed
This is not real EM!
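
One round of this loop can be sketched in Python as follows. This is an illustrative sketch, not the lecture's implementation: it assumes a grammar in Chomsky normal form stored as two dictionaries (`binary` maps `(A, B, C)` to the probability of A → B C, `lexical` maps `(A, word)` to the probability of A → word), uses 0-based half-open spans, and keeps the old probability for any nonterminal that never occurs in a Viterbi parse. All names and the toy grammar are hypothetical.

```python
from collections import Counter


def viterbi_parse(words, binary, lexical, start="S"):
    """Best parse of `words` under a CNF PCFG: (prob, tree) or (0.0, None)."""
    n = len(words)
    best = {}  # (A, i, k) -> (prob, tree) for the span words[i:k]
    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w and p > 0:
                best[a, i, i + 1] = (p, (a, w))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (a, b, c), p in binary.items():
                    if (b, i, j) in best and (c, j, k) in best:
                        cand = p * best[b, i, j][0] * best[c, j, k][0]
                        if cand > best.get((a, i, k), (0.0, None))[0]:
                            tree = (a, best[b, i, j][1], best[c, j, k][1])
                            best[a, i, k] = (cand, tree)
    return best.get((start, 0, n), (0.0, None))


def count_rules(tree, counts):
    """Add the rules of one tree to `counts` (keys shaped like the grammar's)."""
    if len(tree) == 2:                       # (A, word)
        counts[tree[0], tree[1]] += 1
    else:                                    # (A, left, right)
        a, left, right = tree
        counts[a, left[0], right[0]] += 1
        count_rules(left, counts)
        count_rules(right, counts)


def viterbi_em_step(corpus, binary, lexical, start="S"):
    """Parse every sentence with Viterbi, then apply MLE to the best trees."""
    counts = Counter()
    for words in corpus:
        _, tree = viterbi_parse(words, binary, lexical, start)
        if tree is not None:
            count_rules(tree, counts)
    lhs_total = Counter()
    for rule, c in counts.items():
        lhs_total[rule[0]] += c

    def reestimate(rules):
        # Assumption: left-hand sides that were never used keep their old value.
        return {r: counts[r] / lhs_total[r[0]] if lhs_total[r[0]] else p
                for r, p in rules.items()}

    return reestimate(binary), reestimate(lexical)


# Hypothetical grammar with two competing analyses of "a b".
binary = {("S", "A", "B"): 0.7, ("S", "C", "D"): 0.3}
lexical = {("A", "a"): 1.0, ("B", "b"): 1.0, ("C", "a"): 1.0, ("D", "b"): 1.0}
new_binary, new_lexical = viterbi_em_step([["a", "b"]], binary, lexical)
```

In this toy run the more probable analysis receives all the counts, so the probability of S → A B goes to 1 and that of S → C D goes to 0 after a single step.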

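12 Example, step 1. [Figure: initial parameter estimate with visible values [0.6], TV [1/3], elephant [0.2], V [1/3], [0.2], [1/3]; probabilities p of the competing parse trees are compared for the sentences over "slept" and "shot an elephant".]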

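15 MLE on Viterbi parses, step 2. [Figure: re-estimated parameters with visible values [1/4], TV [1/3], elephant [1/4], V [1/3], [1/2], [1/3]; probabilities p of the competing parse trees are compared again for the sentences over "shot an elephant" and "slept".]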

16 Some things to note
In this example, the likelihood increased. This need not always be the case for Viterbi EM.
Viterbi EM commits to a single parse tree per sentence. This has advantages and disadvantages:
- the parse tree is easy to compute, and we can simply apply MLE
- it ignores all uncertainty we had about the correct parse (winning parse tree takes all)

17 Towards real (aka "soft") EM
Idea: weighted counting of rules in all parse trees $t_1, \ldots, t_4$ of a sentence $w$.
Viterbi EM: $1 \cdot C_{t_1}(r) + 0 \cdot C_{t_2}(r) + 0 \cdot C_{t_3}(r) + 0 \cdot C_{t_4}(r)$
EM: $P(t_1 \mid w) \cdot C_{t_1}(r) + P(t_2 \mid w) \cdot C_{t_2}(r) + P(t_3 \mid w) \cdot C_{t_3}(r) + P(t_4 \mid w) \cdot C_{t_4}(r)$
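
The difference between the two counting schemes can be made concrete with a small Python sketch. It assumes that the parse trees of a sentence have already been enumerated, with `tree_probs[i]` the probability P(t_i) and `rule_counts[i]` the count C_{t_i}(r) of one fixed rule r; the numbers are made up for illustration.

```python
def hard_count(tree_probs, rule_counts):
    """Viterbi EM: weight 1 for the most probable tree, weight 0 for the rest."""
    best = max(range(len(tree_probs)), key=lambda i: tree_probs[i])
    return rule_counts[best]


def soft_count(tree_probs, rule_counts):
    """EM: weight every tree by its posterior P(t | w) = P(t) / P(w)."""
    p_w = sum(tree_probs)  # P(w) is the sum over all parse trees of w
    return sum(p / p_w * c for p, c in zip(tree_probs, rule_counts))


# Hypothetical sentence with two parses; rule r occurs once in the first
# tree and three times in the second.
tree_probs = [0.4, 0.1]
rule_counts = [1, 3]
hard = hard_count(tree_probs, rule_counts)   # only the first tree counts
soft = soft_count(tree_probs, rule_counts)   # 0.8 * 1 + 0.2 * 3
```

Enumerating all trees is only feasible for tiny examples, which is why the expected counts are computed with dynamic programming in practice.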

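18 Expected counts
Define the expected count of rule $A \to B\,C$, based on the previous parameter estimate, where $T$ is the set of parse trees of $w$:
$$E(A \to B\,C) = \sum_{t \in T} P(t \mid w) \cdot C_t(A \to B\,C)$$
If we have them, we can re-estimate the parameters:
$$P(A \to B\,C) = \frac{E(A \to B\,C)}{\sum_r E(A \to r)}$$
Challenge: how to compute $E(A \to B\,C)$ efficiently? We assume grammars in CNF here.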

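20 Fundamental idea
$$\begin{aligned}
E(A \to B\,C) &= \sum_{t \in T} P(t \mid w) \cdot C_t(A \to B\,C) \\
&= \frac{1}{P(w)} \sum_{t \in T} P(t) \cdot C_t(A \to B\,C) \\
&= \frac{1}{P(w)} \sum_{t \in T} \; \sum_{\substack{i,j,k \\ \text{rule for } i,j,k \text{ in } t \text{ is } A \to B\,C}} P(t) \\
&= \frac{1}{P(w)} \sum_{i,j,k} \; \sum_{\substack{t \in T \\ \text{rule for } i,j,k \text{ in } t \text{ is } A \to B\,C}} P(t) \\
&= \frac{1}{P(w)} \sum_{i,j,k} \Big( \sum_{t \text{ of this form}} P(t) \Big)
\end{aligned}$$
(Note that $P(t, w) = P(t)$.) Call the inner sum $\mu(A \to B\,C, i, j, k)$.
[Figure: a tree "of this form" has root S over $w_1 \ldots w_n$ and contains a node A whose children B and C cover $w_i \ldots w_{j-1}$ and $w_j \ldots w_{k-1}$.]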

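23 Computing μ
$$\mu(A \to B\,C, i, j, k) = \sum_{t \text{ of this form}} P(t) = \beta(A, i, k) \cdot P(A \to B\,C) \cdot \alpha(B, i, j) \cdot \alpha(C, j, k)$$
Inside probability: $\alpha(B, i, j) = \sum_{t \text{ for } B \Rightarrow^* w_i \ldots w_{j-1}} P(t)$
Outside probability: $\beta(A, i, k) = \sum_{t \text{ for } S \Rightarrow^* w_1 \ldots w_{i-1} \, A \, w_k \ldots w_n} P(t)$
[Figure: root S over $w_1 \ldots w_n$; node A with children B over $w_i \ldots w_{j-1}$ and C over $w_j \ldots w_{k-1}$.]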

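25 Inside probabilities
$$\alpha(A, i, k) = \sum_{t \text{ for } A \Rightarrow^* w_i \ldots w_{k-1}} P(t)$$
Base case: $\alpha(A, i, i+1) = P(A \to w_i)$
Recursion: $\alpha(A, i, k) = \sum_{A \to B\,C} \; \sum_{i < j < k} P(A \to B\,C) \cdot \alpha(B, i, j) \cdot \alpha(C, j, k)$
Special case: $P(w) = \alpha(S, 1, n+1)$
[Figure: node A with children B over $w_i \ldots w_{j-1}$ and C over $w_j \ldots w_{k-1}$.]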
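
The recursion for α can be filled in bottom-up over increasing span widths, as in CKY. The following Python sketch assumes a CNF grammar stored in two hypothetical dictionaries (`binary[(A, B, C)]` and `lexical[(A, word)]`) and uses 0-based half-open spans, so the slides' α(S, 1, n+1) corresponds to `alpha[("S", 0, n)]` here:

```python
def inside(words, binary, lexical):
    """Inside probabilities alpha[(A, i, k)] for the span words[i:k]."""
    n = len(words)
    alpha = {}
    # Base case: alpha(A, i, i+1) = P(A -> w_i)
    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w:
                alpha[a, i, i + 1] = alpha.get((a, i, i + 1), 0.0) + p
    # Recursion: sum over rules A -> B C and split points i < j < k
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for (a, b, c), p in binary.items():
                for j in range(i + 1, k):
                    left = alpha.get((b, i, j), 0.0)
                    right = alpha.get((c, j, k), 0.0)
                    if left and right:
                        alpha[a, i, k] = alpha.get((a, i, k), 0.0) + p * left * right
    return alpha


# Hypothetical ambiguous toy grammar: S -> S S [0.4], S -> a [0.6].
binary = {("S", "S", "S"): 0.4}
lexical = {("S", "a"): 0.6}
words = ["a", "a", "a"]
alpha = inside(words, binary, lexical)
p_w = alpha["S", 0, len(words)]  # P(w): sums over both bracketings of "a a a"
```

In this toy grammar each of the two bracketings of "a a a" uses S → S S twice and S → a three times, so `p_w` equals 2 · 0.4² · 0.6³.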

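27 Outside probabilities
$$\begin{aligned}
\beta(A, i, k) &= \sum_{t \text{ for } S \Rightarrow^* w_1 \ldots w_{i-1} \, A \, w_k \ldots w_n} P(t) \\
&= \sum_{B \to A\,C} \; \sum_{k < j \le n+1} P(B \to A\,C) \cdot \alpha(C, k, j) \cdot \beta(B, i, j) \\
&\quad + \sum_{B \to C\,A} \; \sum_{1 \le j < i} P(B \to C\,A) \cdot \alpha(C, j, i) \cdot \beta(B, j, k)
\end{aligned}$$
Base case: $\beta(A, 1, n+1) = 1$ iff $A = S$, and 0 otherwise.
[Figure: the two configurations. Left: parent B with children A over $w_i \ldots w_{k-1}$ and C over $w_k \ldots w_{j-1}$. Right: parent B with children C over $w_j \ldots w_{i-1}$ and A over $w_i \ldots w_{k-1}$.]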
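
A Python sketch of the outside computation follows. Instead of collecting, for each node A, the contributions of all possible parents, it walks over parent spans from widest to narrowest and pushes each parent's outside probability down to its two children; this is the same recursion, only organised top-down. The grammar dictionaries, function names and the 0-based half-open spans are assumptions of this sketch.

```python
def inside(words, binary, lexical):
    """Inside probabilities alpha[(A, i, k)] for the span words[i:k]."""
    n = len(words)
    alpha = {}
    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w:
                alpha[a, i, i + 1] = alpha.get((a, i, i + 1), 0.0) + p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for (a, b, c), p in binary.items():
                for j in range(i + 1, k):
                    inc = p * alpha.get((b, i, j), 0.0) * alpha.get((c, j, k), 0.0)
                    if inc:
                        alpha[a, i, k] = alpha.get((a, i, k), 0.0) + inc
    return alpha


def outside(words, binary, alpha, start="S"):
    """Outside probabilities beta[(A, i, k)], computed from widest span down."""
    n = len(words)
    beta = {(start, 0, n): 1.0}  # base case: only the start symbol spans w
    for width in range(n, 1, -1):
        for i in range(n - width + 1):
            k = i + width
            for (parent, left, right), p in binary.items():
                out = beta.get((parent, i, k), 0.0)
                if out == 0.0:
                    continue
                for j in range(i + 1, k):
                    in_left = alpha.get((left, i, j), 0.0)
                    in_right = alpha.get((right, j, k), 0.0)
                    # The left child sees the parent's outside and its sibling's inside.
                    beta[left, i, j] = beta.get((left, i, j), 0.0) + p * out * in_right
                    # Symmetrically for the right child.
                    beta[right, j, k] = beta.get((right, j, k), 0.0) + p * out * in_left
    return beta


# Hypothetical toy grammar: S -> S S [0.4], S -> a [0.6].
binary = {("S", "S", "S"): 0.4}
lexical = {("S", "a"): 0.6}
words = ["a", "a", "a"]
alpha = inside(words, binary, lexical)
beta = outside(words, binary, alpha)
```

A useful sanity check: every parse tree has exactly one node directly above each word, so for each position the sum over A of α(A, i, i+1) · β(A, i, i+1) must equal P(w).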

28 The Inside-Outside Algorithm
- Start with some initial estimate of the parameters.
- For each sentence w, compute α, β, and μ.
- Compute expected counts $E(A \to B\,C)$: sum the expected counts over all sentences; remember that $P(w) = \alpha(S, 1, n+1)$.
- Re-estimate $P(A \to B\,C)$ from the expected counts.
- Iterate until convergence.
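
One iteration of this procedure might look as follows in Python. This is a sketch under several assumptions that go beyond the slides: the CNF grammar is stored in two hypothetical dictionaries, spans are 0-based and half-open, expected counts of lexical rules A → w_i are taken to be β(A, i, i+1) · P(A → w_i) / P(w) so that each nonterminal can be normalised over all of its rules, and sentences with P(w) = 0 are skipped.

```python
from collections import defaultdict


def inside(words, binary, lexical):
    n, alpha = len(words), {}
    for i, w in enumerate(words):
        for (a, word), p in lexical.items():
            if word == w:
                alpha[a, i, i + 1] = alpha.get((a, i, i + 1), 0.0) + p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for (a, b, c), p in binary.items():
                for j in range(i + 1, k):
                    inc = p * alpha.get((b, i, j), 0.0) * alpha.get((c, j, k), 0.0)
                    if inc:
                        alpha[a, i, k] = alpha.get((a, i, k), 0.0) + inc
    return alpha


def outside(words, binary, alpha, start="S"):
    n = len(words)
    beta = {(start, 0, n): 1.0}
    for width in range(n, 1, -1):
        for i in range(n - width + 1):
            k = i + width
            for (a, b, c), p in binary.items():
                out = beta.get((a, i, k), 0.0)
                if out == 0.0:
                    continue
                for j in range(i + 1, k):
                    beta[b, i, j] = beta.get((b, i, j), 0.0) + p * out * alpha.get((c, j, k), 0.0)
                    beta[c, j, k] = beta.get((c, j, k), 0.0) + p * out * alpha.get((b, i, j), 0.0)
    return beta


def em_step(corpus, binary, lexical, start="S"):
    """One inside-outside iteration: expected counts, then re-estimation."""
    expected = defaultdict(float)
    for words in corpus:
        n = len(words)
        alpha = inside(words, binary, lexical)
        p_w = alpha.get((start, 0, n), 0.0)
        if p_w == 0.0:
            continue  # assumption: skip sentences the grammar cannot derive
        beta = outside(words, binary, alpha, start)
        for (a, b, c), p in binary.items():
            for i in range(n):
                for k in range(i + 2, n + 1):
                    for j in range(i + 1, k):
                        mu = (beta.get((a, i, k), 0.0) * p
                              * alpha.get((b, i, j), 0.0) * alpha.get((c, j, k), 0.0))
                        expected[a, b, c] += mu / p_w
        for i, w in enumerate(words):
            for (a, word), p in lexical.items():
                if word == w:
                    expected[a, word] += beta.get((a, i, i + 1), 0.0) * p / p_w
    lhs_total = defaultdict(float)
    for rule, e in list(expected.items()):
        lhs_total[rule[0]] += e

    def renorm(rules):
        # Left-hand sides with no expected mass keep their old probabilities.
        return {r: expected[r] / lhs_total[r[0]] if lhs_total[r[0]] > 0 else p
                for r, p in rules.items()}

    return renorm(binary), renorm(lexical)


# Hypothetical toy grammar and corpus.
binary = {("S", "S", "S"): 0.4}
lexical = {("S", "a"): 0.6}
corpus = [["a", "a", "a"], ["a"]]
new_binary, new_lexical = em_step(corpus, binary, lexical)
```

In this toy corpus every parse of "a a a" uses S → S S twice and S → a three times, and "a" uses S → a once, so the expected counts are 2 and 4 and the re-estimated probabilities are 1/3 and 2/3. To train, `em_step` would be called repeatedly on its own output until the parameters stop changing.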

29 Some remarks
Inside-outside increases the likelihood in each step. [Photos: Charniak, Pereira.]
But there are huge problems with local maxima. Carroll & Charniak 92 find 300 different local maxima for 300 different initial parameter estimates. Improve by partially bracketing strings (Pereira & Schabes 92).
Also, quite slow. Every EM step takes time $O(n^3)$. Lari & Young 90 find that learning a grammar from a corpus generated from a PCFG with n nonterminals requires 3n nonterminals.

30 Summary
Learning parameters of PCFGs:
- maximum likelihood estimation (from a treebank)
- from raw text: "hard EM": iterate MLE on Viterbi parses
- from raw text: EM: use the inside-outside algorithm with expected rule counts
PCFG parsing with the MLE parse gets an f-score in the 70s. Will improve on this next time.
Have assumed that the CFG is given and only the parameters are to be learned. Will fix this in January.
