Tekniker för storskalig parsning (Techniques for Large-Scale Parsing)

Discriminative Models

Joakim Nivre
Uppsala University
Department of Linguistics and Philology

Generative Models

A generative statistical model defines the joint probability $P(x, y)$ of input $x$ and output $y$.

Pros:
- Learning problems have closed-form solutions
- Related probabilities can be derived:
  - Conditionalization: $P(y \mid x) = \frac{P(x, y)}{P(x)}$
  - Marginalization: $P(x) = \sum_{y} P(x, y)$

Cons:
- Rigid independence assumptions (or intractable parsing)
- Indirect modeling of the parsing problem
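
The two derived quantities can be illustrated with a toy joint table. This is only a sketch: the sentences, analysis labels, and probabilities below are invented for illustration and are not from the lecture.

```python
from collections import defaultdict

# Hypothetical joint distribution P(x, y) over sentences x and analyses y.
joint = {
    ("saw her duck", "duck=noun"): 0.3,
    ("saw her duck", "duck=verb"): 0.1,
    ("time flies", "flies=verb"): 0.6,
}

def marginal(joint):
    """Marginalization: P(x) = sum over y of P(x, y)."""
    p_x = defaultdict(float)
    for (x, _y), p in joint.items():
        p_x[x] += p
    return dict(p_x)

def conditional(joint):
    """Conditionalization: P(y | x) = P(x, y) / P(x)."""
    p_x = marginal(joint)
    return {(x, y): p / p_x[x] for (x, y), p in joint.items()}

p_x = marginal(joint)
p_y_given_x = conditional(joint)
```

Going the other way is not possible: a table of $P(y \mid x)$ alone does not determine $P(x)$, which is the limitation noted for discriminative models.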

Discriminative Models

A discriminative statistical model defines the conditional probability $P(y \mid x)$ of output $y$ given input $x$.

Pros:
- No rigid independence assumptions
- More direct modeling of the parsing problem

Cons:
- Learning problems require numerical approximation
- Related probabilities cannot be derived:
  - No way to compute $P(x, y)$ from $P(y \mid x)$
  - No way to compute $P(x)$ or $P(y)$ from $P(y \mid x)$

Conditional and Discriminative Models

Subdivision of discriminative models:

Conditional model:
- Explicitly model the conditional probability $P(y \mid x)$
- Use the model in the mapping $X \to Y$: $\arg\max_{y} P(y \mid x)$

Purely discriminative model:
- Directly optimize the mapping $X \to Y$
- No explicit model of the conditional probability $P(y \mid x)$

Local and Global Models

Local discriminative models:
- Maximize probability (or accuracy) of local decisions in the derivation of analysis $y$ given input $x$
- Find a globally optimal solution by making a sequence of locally optimal decisions

Global discriminative models:
- Maximize probability (or accuracy) of the complete analysis $y$ given input $x$

Examples:
- Local: transition-based dependency parsing
- Global: graph-based dependency parsing

Local Discriminative Models

Conditional history-based model:

$$P(y \mid x) = \prod_{i=1}^{m} P(d_i \mid \Phi(d_1, \ldots, d_{i-1}, x))$$

- Probabilities can be conditioned on properties of the input
- For example: lookahead in left-to-right derivations

Compare the generative model:

$$P(x, y) = \prod_{i=1}^{m} P(d_i \mid \Phi(d_1, \ldots, d_{i-1}))$$
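
As a rough sketch of the conditional history-based factorization, the log-probability of a derivation is the sum of local log-probabilities, each conditioned on a feature representation of the history and the input. The function names `phi` and `local_prob`, the decision labels, and the probability values are all assumptions made for this example; the default of 0.5 for unseen pairs is a toy value, not a normalized model.

```python
import math

def derivation_log_prob(decisions, x, local_prob, phi):
    """log P(y | x) = sum_i log P(d_i | phi(d_1..d_{i-1}, x))."""
    total = 0.0
    for i, d in enumerate(decisions):
        history = phi(decisions[:i], x)
        total += math.log(local_prob(d, history))
    return total

def phi(prev_decisions, x):
    """Hypothetical history representation: last decision plus one-word lookahead."""
    last = prev_decisions[-1] if prev_decisions else None
    i = len(prev_decisions)
    lookahead = x[i] if i < len(x) else None
    return (last, lookahead)

def local_prob(d, history):
    """Hypothetical local model: a small lookup table with a toy default."""
    table = {
        ((None, "the"), "SHIFT"): 0.9,
        (("SHIFT", "dog"), "SHIFT"): 0.8,
    }
    return table.get((history, d), 0.5)

x = ["the", "dog"]
y = ["SHIFT", "SHIFT"]
log_p = derivation_log_prob(y, x, local_prob, phi)
```

Because `phi` receives `x`, the lookahead word is available to every local decision; in the generative variant the history function would see only the earlier decisions.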

Parsing Model

GEN(x):
- Defined by a derivational process (for example, a transition system)

EVAL(y):
- Score local decisions, conditioned on input and history
- Combine local scores into global scores

Inference

Local discriminative models typically use greedy inference:
- Deterministic best-first search
- Beam search with an agenda of the k best hypotheses

Properties:
- Very efficient
- Reasonably accurate thanks to lookahead
- No guarantee that the globally best solution is found
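
A minimal sketch of beam search over a generic derivation process, assuming a hypothetical `successors` function that returns `(decision, next_state, log_prob)` triples. With `k=1` it reduces to deterministic greedy search. The toy probability table is invented to show a case where the greedy choice misses the best complete derivation.

```python
import math

def beam_search(x, initial, is_final, successors, k=1):
    """Keep the k best partial derivations; k=1 is greedy search."""
    beam = [(0.0, initial, [])]  # (log score, state, decisions so far)
    while not all(is_final(state) for _, state, _ in beam):
        candidates = []
        for score, state, decisions in beam:
            if is_final(state):
                candidates.append((score, state, decisions))
                continue
            for d, next_state, log_p in successors(state, x):
                candidates.append((score + log_p, next_state, decisions + [d]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:k]
    return beam[0]

def successors(state, x):
    """Hypothetical local model: decision probabilities depend on the history."""
    table = {
        (): [("A", 0.6), ("B", 0.4)],
        ("A",): [("A", 0.5), ("B", 0.5)],
        ("B",): [("A", 0.9), ("B", 0.1)],
    }
    return [(d, state + (d,), math.log(p)) for d, p in table[state]]

x = ["w1", "w2"]

def is_final(state):
    return len(state) == len(x)

greedy = beam_search(x, (), is_final, successors, k=1)
beam2 = beam_search(x, (), is_final, successors, k=2)
```

In this toy table the greedy search commits to `A` first and cannot recover, while a beam of size 2 keeps `B` alive long enough to find the higher-scoring derivation. A wider beam still gives no optimality guarantee in general.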

Learning

Learning problem: local decision, conditioned on input and history.

Conditional:
- Estimate $P(d_i \mid \Phi(d_1, \ldots, d_{i-1}, x))$
- Training: conditional MLE (more later)

Purely discriminative:
- Optimize the mapping $\Phi(d_1, \ldots, d_{i-1}, x) \to d_i$
- Training: any classifier (SVM, perceptron, CMLE, ...)

Evaluation Criteria

- Robustness: Yes (same as generative history-based models)
- Disambiguation: Yes, thanks to the probability model or classifier
- Accuracy: Sometimes state of the art
- Efficiency: Very good (often linear complexity)

Global Discriminative Models

Global models:
- No specific factorization of $P(y \mid x)$
- Features can be defined over arbitrary substructures
- Training optimizes probability/accuracy of global structures

In practice, some shortcuts are always necessary:
- Restricted scope of features
- Approximate inference

Parsing Model

GEN(x):
- Formal grammar:
  - HPSG [Toutanova et al. 2002, Miyao et al. 2003]
  - LFG [Riezler et al. 2002, Kaplan et al. 2004]
  - CCG [Clark and Curran 2004]
- Generative statistical parser (reranking) [Charniak and Johnson 2005]
- All possible trees over some alphabets [Taskar et al. 2004, McDonald et al. 2005]

EVAL(y):
- Score related to $P(y \mid x)$

Inference

Exact parsing with a global model is intractable.

Strategy 1: Use dynamic programming (chart parsing)
- Restrict feature scope
- Example: graph-based dependency parsing (projective)

Strategy 2: Use an independent generative component
- Restrict GEN(x) to make exact inference feasible
- Examples:
  - Grammar-driven parsers for HPSG, LFG, CCG
  - Reranking parsers using a k-best list from a statistical parser

Learning 1: Linear Classifiers

Score $S(x, y)$ defined as the inner product of two vectors:

$$S(x, y) = f(x, y) \cdot w = \sum_{i=1}^{k} w_i f_i(x, y)$$

- Feature vector: $f(x, y) = \langle f_1(x, y), \ldots, f_k(x, y) \rangle$
- Weight vector: $w = \langle w_1, \ldots, w_k \rangle$

Note:
- Each $f_i(x, y)$ is a numerical feature of $x$ and $y$
- Each $w_i$ is a real-valued weight for $f_i(x, y)$:
  - Positive if $f_i(x, y)$ tends to occur in good trees
  - Negative if $f_i(x, y)$ tends to occur in bad trees
- $S(x, y)$ summarizes the evidence of all (non-zero) features
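
The score can be sketched with sparse dictionaries, so that only non-zero features contribute to the sum. The feature names and weight values here are hypothetical, chosen only to show a positive weight for an arc typical of good trees and a negative weight for one typical of bad trees.

```python
def score(features, weights):
    """S(x, y) = f(x, y) . w, summing only over the non-zero features."""
    return sum(value * weights.get(name, 0.0) for name, value in features.items())

# Hypothetical weights for dependency-arc features.
weights = {
    "head=saw,dep=dog": 1.5,
    "head=dog,dep=the": 2.0,
    "head=the,dep=saw": -3.0,
}

# Hypothetical feature vectors f(x, y) for two candidate trees of one sentence.
good_tree = {"head=saw,dep=dog": 1, "head=dog,dep=the": 1}
bad_tree = {"head=the,dep=saw": 1, "head=dog,dep=the": 1}

good_score = score(good_tree, weights)
bad_score = score(bad_tree, weights)
```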

Learning 2: Discriminative Training

Find weights that maximize accuracy on the training set.

Training criterion (for all pairs $(x, y)$ in the training set):

$$y = \arg\max_{y' \in \mathrm{GEN}(x)} f(x, y') \cdot w$$

Examples:
- Perceptron learning
- Support vector machines
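
One way to pursue this criterion is the perceptron update, sketched below under simplifying assumptions: GEN(x) is a small enumerable candidate set, and `gen`, `feats`, and the toy data are hypothetical stand-ins rather than a real parser.

```python
def dot(features, weights):
    """Inner product f(x, y) . w over sparse feature dictionaries."""
    return sum(v * weights.get(n, 0.0) for n, v in features.items())

def predict(x, gen, feats, weights):
    """argmax over y' in GEN(x) of f(x, y') . w (first candidate wins ties)."""
    return max(gen(x), key=lambda y: dot(feats(x, y), weights))

def perceptron_train(data, gen, feats, epochs=5):
    weights = {}
    for _ in range(epochs):
        for x, gold in data:
            pred = predict(x, gen, feats, weights)
            if pred != gold:
                # Move weights toward the gold analysis, away from the prediction.
                for n, v in feats(x, gold).items():
                    weights[n] = weights.get(n, 0.0) + v
                for n, v in feats(x, pred).items():
                    weights[n] = weights.get(n, 0.0) - v
    return weights

def gen(x):
    """Hypothetical GEN(x): the same two candidate labels for every input."""
    return ["VERB", "NOUN"]

def feats(x, y):
    """Hypothetical feature function: one indicator per (word, label) pair."""
    return {f"word={x},label={y}": 1.0}

data = [("duck", "NOUN"), ("swim", "VERB")]
weights = perceptron_train(data, gen, feats)
predictions = [predict(x, gen, feats, weights) for x, _ in data]
```

The weights change only when the current argmax disagrees with the gold analysis, so training stops changing once the criterion holds for every training pair.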

Learning 3: Log-Linear Models

Transform the score to a conditional probability:

$$P(y \mid x) = \frac{\exp[f(x, y) \cdot w]}{\sum_{y' \in \mathrm{GEN}(x)} \exp[f(x, y') \cdot w]}$$

Note:
- $\exp[f(x, y) \cdot w] > 0$
- $\exp[f(x, y) \cdot w] \leq \sum_{y' \in \mathrm{GEN}(x)} \exp[f(x, y') \cdot w]$
- If $\exp a = b$, then $a = \log b$
- $\log ab = \log a + \log b$
- The linear sum of products $\sum_i w_i f_i(x, y)$ corresponds to the log of the probability (up to the normalization term)
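
The transformation amounts to exponentiating the scores of all candidates in GEN(x) and normalizing. The sketch below assumes the candidate set is small enough to enumerate; the candidate names and scores are made up, and subtracting the maximum score is a standard numerical-stability step that does not change the result.

```python
import math

def log_linear_probs(scores):
    """Map scores f(x, y) . w for each y in GEN(x) to P(y | x)."""
    m = max(scores.values())  # subtract the max to avoid overflow in exp
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())  # normalization over GEN(x)
    return {y: e / z for y, e in exps.items()}

# Hypothetical scores for three candidate analyses of one sentence.
scores = {"tree_a": 2.0, "tree_b": 1.0, "tree_c": -1.0}
probs = log_linear_probs(scores)
```

Every probability is strictly positive and the values sum to one, which matches the two inequalities in the note above.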

Learning 4: Conditional MLE

Joint MLE:
- Find the estimate that maximizes $P(x, y)$ for the training set
- Easy: relative frequencies (analytical, closed-form solution)

Conditional MLE:
- Find the estimate that maximizes $P(y \mid x)$ for the training set
- Hard: no closed-form solution

Numerical optimization:
- Many methods: iterative scaling, gradient descent, ...
- Computationally intensive
- Guaranteed to converge to the global maximum
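
A minimal sketch of conditional MLE for a log-linear model by plain gradient ascent, where the gradient is the observed feature counts minus the expected feature counts under the current model. The step size, the number of steps, and the toy data are assumptions; the data are deliberately not separable so that a finite maximum exists.

```python
import math

def dot(features, w):
    return sum(v * w.get(n, 0.0) for n, v in features.items())

def log_probs(x, gen, feats, w):
    """log P(y | x) for every y in GEN(x) under the log-linear model."""
    scores = {y: dot(feats(x, y), w) for y in gen(x)}
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {y: s - log_z for y, s in scores.items()}

def log_likelihood_and_gradient(data, gen, feats, w):
    ll, grad = 0.0, {}
    for x, gold in data:
        lp = log_probs(x, gen, feats, w)
        ll += lp[gold]
        # Observed feature counts ...
        for n, v in feats(x, gold).items():
            grad[n] = grad.get(n, 0.0) + v
        # ... minus expected feature counts under the current model.
        for y, log_p in lp.items():
            for n, v in feats(x, y).items():
                grad[n] = grad.get(n, 0.0) - math.exp(log_p) * v
    return ll, grad

def train(data, gen, feats, steps=100, lr=0.5):
    w, history = {}, []
    for _ in range(steps):
        ll, grad = log_likelihood_and_gradient(data, gen, feats, w)
        history.append(ll)
        for n, g in grad.items():
            w[n] = w.get(n, 0.0) + lr * g  # ascent step on the log-likelihood
    return w, history

def gen(x):
    """Hypothetical GEN(x): two candidate labels."""
    return ["NOUN", "VERB"]

def feats(x, y):
    """Hypothetical feature function: one indicator per (word, label) pair."""
    return {f"word={x},label={y}": 1.0}

data = [("duck", "NOUN"), ("duck", "NOUN"), ("duck", "VERB")]
w, history = train(data, gen, feats)
p_noun = math.exp(log_probs("duck", gen, feats, w)["NOUN"])
```

In this trivial setting the learned `p_noun` should approach 2/3, the relative frequency of `NOUN` for that word; with overlapping features there is no such shortcut, which is why iterative methods are needed.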

Evaluation Criteria

- Robustness: Depends on GEN(x)
- Disambiguation: Yes, same as other statistical models
- Efficiency: Not so good, especially during training
- Accuracy: Currently the state of the art

Summary

Discriminative models focus on the conditional distribution $P(y \mid x)$.

Pros:
- No rigid independence assumptions, which allows more global features
- Easy to combine with different base parsers

Cons:
- Learning requires computationally intensive numerical methods
- Inference is often intractable and requires approximation

Log-linear models are the most widely used.

References and Further Reading

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL).

Stephen Clark and James R. Curran. 2004. Parsing the WSJ using CCG and log-linear models. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 104-111.

Ronald M. Kaplan, Stefan Riezler, Tracy Holloway King, John T. Maxwell III, Alexander Vasserman, and Richard Crouch. 2004. Speed and accuracy in shallow and deep stochastic parsing. In Proceedings of Human Language Technology and the Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL).

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL).

Yusuke Miyao, T. Ninomiya, and Jun'ichi Tsujii. 2003. Probabilistic modeling of argument structures including non-local dependencies. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP).

Stefan Riezler, Tracy H. King, Ronald M. Kaplan, Richard Crouch, John T. Maxwell III, and Mark Johnson. 2002. Parsing the Wall Street Journal using a Lexical-Functional Grammar and discriminative estimation techniques. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).

Ben Taskar, Dan Klein, Michael Collins, Daphne Koller, and Christopher Manning. 2004. Max-margin parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1-8.

Kristina Toutanova, Christopher D. Manning, Stuart M. Shieber, Dan Flickinger, and Stephan Oepen. 2002. Parse disambiguation for a rich HPSG grammar. In Proceedings of the 1st Workshop on Treebanks and Linguistic Theories (TLT).