Training a Log-Linear Parser with Loss Functions via Softmax-Margin

Michael Auli, School of Informatics, University of Edinburgh, m.auli@sms.ed.ac.uk
Adam Lopez, HLTCOE, Johns Hopkins University, alopez@cs.jhu.edu

Abstract

Log-linear parsing models are often trained by optimizing likelihood, but we would prefer to optimise for a task-specific metric like F-measure. Softmax-margin is a convex objective for such models that minimises a bound on expected risk for a given loss function, but its naïve application requires the loss to decompose over the predicted structure, which is not true of F-measure. We use softmax-margin to optimise a log-linear CCG parser for a variety of loss functions, and demonstrate a novel dynamic programming algorithm that enables us to use it with F-measure, leading to substantial gains in accuracy on CCGbank. When we embed our loss-trained parser into a larger model that includes supertagging features incorporated via belief propagation, we obtain further improvements and achieve a labelled/unlabelled dependency F-measure of 89.3%/94.0% on gold part-of-speech tags, and 87.2%/92.8% on automatic part-of-speech tags, the best reported results for this task.

1 Introduction

Parsing models based on Conditional Random Fields (CRFs; Lafferty et al., 2001) have been very successful (Clark and Curran, 2007; Finkel et al., 2008). In practice, they are usually trained by maximising the conditional log-likelihood (CLL) of the training data. However, it is widely appreciated that optimizing for task-specific metrics often leads to better performance on those tasks (Goodman, 1996; Och, 2003). An especially attractive means of accomplishing this for CRFs is the softmax-margin (SMM) objective (Sha and Saul, 2006; Povey and Woodland, 2008; Gimpel and Smith, 2010a) (§2). In addition to retaining a probabilistic interpretation and optimizing towards a loss function, it is also convex, making it straightforward to optimise. Gimpel and Smith (2010a) show that it can be easily implemented with a simple change to standard likelihood-based training, provided that the loss function decomposes over the predicted structure. Unfortunately, the widely-used F-measure metric does not decompose over parses. To solve this, we introduce a novel dynamic programming algorithm that enables us to compute the exact quantities needed under the softmax-margin objective using F-measure as a loss (§3). We experiment with this and several other metrics, including precision, recall, and decomposable approximations thereof. Our ability to optimise towards exact metrics enables us to verify the effectiveness of more efficient approximations. We test the training procedures on the state-of-the-art Combinatory Categorial Grammar (CCG; Steedman, 2000) parser of Clark and Curran (2007), obtaining substantial improvements under a variety of conditions. We then embed this model into a more accurate model that incorporates additional supertagging features via loopy belief propagation. The improvements are additive, obtaining the best reported results on this task (§4).

2 Softmax-Margin Training

The softmax-margin objective modifies the standard likelihood objective for CRF training by reweighting each possible outcome of a training input according to its risk, which is simply the loss incurred on a particular example. This is done by incorporating the loss function directly into the linear scoring function of an individual example. Formally, we are given m training pairs (x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)}), where each x^{(i)} ∈ X is drawn from the set of possible inputs, and each y^{(i)} ∈ Y(x^{(i)}) is drawn from a set of possible instance-specific outputs. We want to learn the K parameters θ of a log-linear model, where each λ_k ∈ θ is the weight of an associated feature h_k(x, y). Function f(x, y) maps input/output pairs to the vector h_1(x, y), ..., h_K(x, y), and our log-linear model assigns probabilities in the usual way:

p(y \mid x) = \frac{\exp\{\theta^\top f(x, y)\}}{\sum_{y' \in Y(x)} \exp\{\theta^\top f(x, y')\}}   (1)

The conditional log-likelihood objective function is given by Eq. 2 (Figure 1). Now consider a function l(y, y') that returns the loss incurred by choosing to output y' when the correct output is y. The softmax-margin objective simply modifies the unnormalised, unexponentiated score θ^\top f(x, y') by adding l(y, y') to it. This yields the objective function (Eq. 3) and gradient computation (Eq. 4) shown in Figure 1.

Figure 1: Conditional log-likelihood (Eq. 2), softmax-margin objective (Eq. 3) and its gradient (Eq. 4):

\min_\theta \sum_{i=1}^{m} \Big( -\theta^\top f(x^{(i)}, y^{(i)}) + \log \sum_{y \in Y(x^{(i)})} \exp\{\theta^\top f(x^{(i)}, y)\} \Big)   (2)

\min_\theta \sum_{i=1}^{m} \Big( -\theta^\top f(x^{(i)}, y^{(i)}) + \log \sum_{y \in Y(x^{(i)})} \exp\{\theta^\top f(x^{(i)}, y) + l(y^{(i)}, y)\} \Big)   (3)

\frac{\partial}{\partial \lambda_k} = \sum_{i=1}^{m} \Big( -h_k(x^{(i)}, y^{(i)}) + \sum_{y \in Y(x^{(i)})} \frac{\exp\{\theta^\top f(x^{(i)}, y) + l(y^{(i)}, y)\}}{\sum_{y' \in Y(x^{(i)})} \exp\{\theta^\top f(x^{(i)}, y') + l(y^{(i)}, y')\}}\, h_k(x^{(i)}, y) \Big)   (4)

This straightforward extension has several desirable properties. In addition to having a probabilistic interpretation, it is related to maximum-margin and minimum-risk frameworks, it can be shown to minimise a bound on expected risk, and it is convex (Gimpel and Smith, 2010b). We can also see from Eq. 4 that the only difference from standard CLL training is that we must compute feature expectations with respect to the cost-augmented scoring function. As Gimpel and Smith (2010a) discuss, if the loss function decomposes over the predicted structure, we can treat its decomposed elements as unweighted features that fire on the corresponding structures, and compute expectations in the normal way. In the case of our parser, where we compute expectations using the inside-outside algorithm, a loss function decomposes if it decomposes over spans or productions of a CKY chart.

3 Loss Functions for Parsing

Ideally, we would like to optimise our parser towards a task-based evaluation. Our CCG parser is evaluated on labeled, directed dependency recovery using F-measure (Clark and Hockenmaier, 2002). Under this evaluation we represent the output y' and the ground truth y as variable-sized sets of dependencies. We can then compute precision P(y, y'), recall R(y, y'), and F-measure F_1(y, y'):

P(y, y') = \frac{|y \cap y'|}{|y'|}   (5)

R(y, y') = \frac{|y \cap y'|}{|y|}   (6)

F_1(y, y') = \frac{2PR}{P + R} = \frac{2\,|y \cap y'|}{|y| + |y'|}   (7)

These metrics are positively correlated with performance: they are gain functions. To incorporate them in the softmax-margin framework we reformulate them as loss functions by subtracting them from one.

3.1 Computing F-Measure-Augmented Expectations at the Sentence Level

Unfortunately, none of these metrics decompose over parses. However, the individual statistics that are used to compute them do decompose, a fact we will exploit to devise an algorithm that computes the necessary expectations. Note that since y is fixed, F_1 is a function of two integers: |y ∩ y'|, the number of correct dependencies in y', and |y'|, the total number of dependencies in y', which we denote n and d (for numerator and denominator), respectively. Each pair ⟨n, d⟩ leads to a different value of F_1. Importantly, both n and d decompose over parses. The key idea will be to treat F_1 as a non-local feature of the parse, dependent on the values n and d.
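To make the gain-to-loss conversion concrete, here is a minimal sketch (our illustration, not code from the paper; the tuple encoding of a dependency is an assumption) that computes the three metrics from dependency sets and expresses the F-measure loss purely as a function of the pair ⟨n, d⟩, the form that will be attached to goal items in Section 3.1.

```python
# Minimal sketch of Eqs. 5-7 and their loss versions. A dependency is assumed
# to be any hashable tuple, e.g. (head_index, category, arg_slot, dep_index).

def prf_losses(predicted, gold):
    """Return (1 - P, 1 - R, 1 - F1) for predicted and gold dependency sets."""
    n = len(predicted & gold)   # correct dependencies, |y ∩ y'|
    d = len(predicted)          # predicted dependencies, |y'|
    g = len(gold)               # gold dependencies, |y|
    p = n / d if d else 0.0
    r = n / g if g else 0.0
    f1 = 2.0 * n / (d + g) if d + g else 0.0
    return 1.0 - p, 1.0 - r, 1.0 - f1

def f1_loss(n, d, num_gold):
    """F-measure loss as a function of only n and d once |y| = num_gold is fixed."""
    return 1.0 - (2.0 * n / (d + num_gold) if d + num_gold else 0.0)
```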
To compute expectations, we split each span in an otherwise usual inside-outside computation by all pairs ⟨n, d⟩ incident at that span; this is essentially the same trick used in the oracle F-measure algorithm of Huang (2008), and indeed our algorithm is a sum-product variant of that max-product algorithm. Formally, our goal will be to compute expectations over the sentence a_1 ... a_L. In order to abstract away from the particulars of CCG, we present the algorithm in relatively familiar terms as a variant of the classic inside-outside algorithm (Baker, 1979).

We use the notation a : A for lexical entries and BC ⇒ A to indicate that categories B and C combine to form category A via forward or backward composition or application; these correspond respectively to unary rules A → a and binary rules A → BC in a Chomsky normal form grammar. The weight of a rule is denoted by w. The classic algorithm associates an inside score I(A_{i,j}) and an outside score O(A_{i,j}) with category A spanning sentence positions i through j, computed via the following recursions:

I(A_{i,i+1}) = w(a_{i+1} : A)
I(A_{i,j}) = \sum_{k,B,C} I(B_{i,k}) \, I(C_{k,j}) \, w(BC \Rightarrow A)
I(\text{GOAL}) = I(S_{0,L})
O(\text{GOAL}) = 1
O(A_{i,j}) = \sum_{k,B,C} O(C_{i,k}) \, I(B_{j,k}) \, w(AB \Rightarrow C) + \sum_{k,B,C} O(C_{k,j}) \, I(B_{k,i}) \, w(BA \Rightarrow C)

The expectation of A spanning positions i through j is then I(A_{i,j}) O(A_{i,j}) / I(GOAL). Our algorithm extends these computations to state-split items A_{i,j,n,d}, where state-splitting refers to splitting an item A_{i,j} into many items A_{i,j,n,d}, one for each ⟨n, d⟩ pair. Using functions n^+(·) and d^+(·) to respectively represent the number of correct and total dependencies introduced by a parsing action, we present our algorithm in Figure 3.

Figure 3: State-split inside and outside recursions for computing softmax-margin with F-measure:

I(A_{i,i+1,n,d}) = w(a_{i+1} : A) \quad \text{iff } n = n^+(a_{i+1} : A),\; d = d^+(a_{i+1} : A)
I(A_{i,j,n,d}) = \sum_{\substack{k,B,C,\,n',n'',d',d'' \\ n' + n'' + n^+(BC \Rightarrow A) = n \\ d' + d'' + d^+(BC \Rightarrow A) = d}} I(B_{i,k,n',d'}) \, I(C_{k,j,n'',d''}) \, w(BC \Rightarrow A)
I(\text{GOAL}) = \sum_{n,d} I(S_{0,L,n,d}) \left(1 - \frac{2n}{d + |y|}\right)
O(S_{0,L,n,d}) = 1 - \frac{2n}{d + |y|}
O(A_{i,j,n,d}) = \sum_{\substack{k,B,C,\,n',n'',d',d'' \\ n' - n'' - n^+(AB \Rightarrow C) = n \\ d' - d'' - d^+(AB \Rightarrow C) = d}} O(C_{i,k,n',d'}) \, I(B_{j,k,n'',d''}) \, w(AB \Rightarrow C) \;+\; \sum_{\substack{k,B,C,\,n',n'',d',d'' \\ n' - n'' - n^+(BA \Rightarrow C) = n \\ d' - d'' - d^+(BA \Rightarrow C) = d}} O(C_{k,j,n',d'}) \, I(B_{k,i,n'',d''}) \, w(BA \Rightarrow C)

The final inside equation and initial outside equation incorporate the loss function for all derivations having a particular F-score, enabling us to obtain the desired expectations. A simple modification of the goal equations enables us to optimise precision, recall, or a weighted F-measure.

To analyze the complexity of this algorithm, we must ask: how many pairs ⟨n, d⟩ can be incident at each span? A CCG parser does not necessarily return one dependency per word (see Figure 2 for an example), so d is not necessarily equal to the sentence length L as it might be in many dependency parsers, though it is still bounded by O(L). However, this behavior is sufficiently uncommon that we expect all parses of a sentence, good or bad, to have close to L dependencies, and hence we expect the range of d to be constant on average. Furthermore, n is bounded from below by zero and from above by min(|y|, |y'|). Hence the set of all possible F-measures for all possible parses is bounded by O(L^2), but on average it should be closer to O(L). Following McAllester (1999), we can see from inspection of the free variables in Figure 3 that the algorithm requires worst-case O(L^7) and average-case O(L^5) time complexity, and worst-case O(L^4) and average-case O(L^3) space complexity. Note finally that while this algorithm computes exact sentence-level expectations, it is approximate at the corpus level, since F-measure does not decompose over sentences. We give the extension to exact corpus-level expectations in Appendix A.
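As a concrete illustration, the following sketch gives a sum-product rendering of the inside pass of Figure 3. It is our own schematic code, not the C&C implementation: the chart encoding is an assumption made for the sketch, and rule weights are assumed to be already exponentiated.

```python
from collections import defaultdict
from itertools import product

# Schematic sum-product inside pass over (n, d)-split items (cf. Figure 3).
# Assumed chart encoding: chart[(i, j)][A] is a list of ways to build
# category A over span (i, j):
#   ("lex", w, n_plus, d_plus)            -- lexical entry a_{i+1} : A
#   ("bin", B, C, k, w, n_plus, d_plus)   -- B over (i, k) combines with C over (k, j)
# Weights w are assumed to be exponentiated rule scores; num_gold = |y| > 0.

def state_split_inside(chart, length, goal="S", num_gold=1):
    inside = defaultdict(lambda: defaultdict(float))  # (i, j, A) -> {(n, d): score}
    for span in range(1, length + 1):
        for i in range(length - span + 1):
            j = i + span
            for cat, ways in chart.get((i, j), {}).items():
                cell = inside[(i, j, cat)]
                for way in ways:
                    if way[0] == "lex":
                        _, w, n_p, d_p = way
                        cell[(n_p, d_p)] += w
                    else:
                        _, B, C, k, w, n_p, d_p = way
                        left, right = inside[(i, k, B)], inside[(k, j, C)]
                        for (n1, d1), (n2, d2) in product(left, right):
                            cell[(n1 + n2 + n_p, d1 + d2 + d_p)] += (
                                left[(n1, d1)] * right[(n2, d2)] * w)
    # Goal: weight each (n, d) class by its F-measure loss, 1 - 2n / (d + |y|).
    return sum(score * (1.0 - 2.0 * n / (d + num_gold))
               for (n, d), score in inside[(0, length, goal)].items())
```

Running the mirror-image outside pass with O(S_{0,L,n,d}) initialised to the loss, as in Figure 3, would then yield the cost-augmented expectations needed for Eq. 4.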

Figure 2: Example of flexible dependency realisation in CCG. Our parser (Clark and Curran, 2007) creates dependencies arising from coordination once all conjuncts are found and treats "and" as the syntactic head of coordinations. The coordination rule (Φ) does not yet establish the dependency and-pears (dotted line); it is the backward application (<) in the larger span, "apples and pears", that establishes it, together with and-pears. CCG also deals with unbounded dependencies, which potentially lead to more dependencies than words (Steedman, 2000); in this example a unification mechanism creates the dependencies likes-apples and likes-pears in the forward application (>). For further examples and a more detailed explanation of the mechanism as used in the C&C parser, refer to Clark et al. (2002).

3.2 Approximate Loss Functions

We will also consider approximate but more efficient alternatives to our exact algorithms. The idea is to use cost functions which only utilise statistics available within the current local structure, similar to those used by Taskar et al. (2004) for tracking constituent errors in a context-free parser. We design three simple losses to approximate precision, recall and F-measure on CCG dependency structures. Let T(y) be the set of parsing actions required to build parse y. Our decomposable approximation to precision simply counts the number of incorrect dependencies using the local dependency counts n^+(·) and d^+(·):

\text{DecP}(y) = \sum_{t \in T(y)} d^+(t) - n^+(t)   (8)

To compute our approximation to recall we require the number of gold dependencies, c^+(·), which should have been introduced by a particular parsing action. A gold dependency is due to be recovered by a parsing action if its head lies within one child span and its dependent within the other. This yields a decomposed approximation to recall that counts the number of missed dependencies:

\text{DecR}(y) = \sum_{t \in T(y)} c^+(t) - n^+(t)   (9)
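Because these losses are sums of per-action counts, they can be handled exactly like local features during training. The sketch below is ours, not the paper's code; the per-action count triple is an assumed input, produced however the parser exposes it.

```python
# Decomposed precision/recall losses (Eqs. 8-9). Each parsing action t is
# summarised here by an assumed triple (n_plus, d_plus, c_plus): correct
# dependencies, total dependencies, and gold dependencies due at t.

def dec_precision_loss(actions):
    """DecP: number of incorrect dependencies, summed over parsing actions."""
    return sum(d_p - n_p for n_p, d_p, _ in actions)

def dec_recall_loss(actions):
    """DecR: number of missed dependencies, summed over parsing actions."""
    return sum(c_p - n_p for n_p, _, c_p in actions)
```

Since each term depends only on its own action, the same quantities can be added directly to the corresponding rule scores when computing the cost-augmented expectations of Eq. 4.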

Unfortunately, the flexible handling of dependencies in CCG complicates our formulation of c^+, rendering it slightly more approximate. The unification mechanism of CCG sometimes causes dependencies to be realised later in the derivation, at a point when both the head and the dependent are in the same span, violating the assumption used to compute c^+ (see again Figure 2). Exceptions like this can cause mismatches between n^+ and c^+. We set c^+ = n^+ whenever c^+ < n^+ to account for these occasional discrepancies. Finally, we obtain a decomposable approximation to F-measure:

\text{DecF}_1(y) = \text{DecP}(y) + \text{DecR}(y)   (10)

4 Experiments

Parsing strategy. CCG parsers use a pipeline strategy: we first multitag each word of the sentence with a small subset of its possible lexical categories using a supertagger, a sequence model over these categories (Bangalore and Joshi, 1999; Clark, 2002). Then we parse the sentence under the requirement that the lexical categories are fixed to those preferred by the supertagger. In our experiments we used two variants of this strategy. The first is the adaptive supertagging (AST) approach of Clark and Curran (2004). It is based on a step function over supertagger beam widths, relaxing the pruning threshold for lexical categories only if the parser fails to find an analysis. The process either succeeds and returns a parse after some iteration or gives up after a predefined number of iterations. As Clark and Curran (2004) show, most sentences can be parsed with very tight beams. Reverse adaptive supertagging is a much less aggressive method that seeks only to make sentences parsable when they otherwise would not be due to an impractically large search space. Reverse AST starts with a wide beam, narrowing it at each iteration only if a maximum chart size is exceeded. Table 1 shows beam settings for both strategies. Adaptive supertagging aims for speed via pruning, while the reverse strategy aims for accuracy by exposing the parser to a larger search space. Although Clark and Curran (2007) found no actual improvements from the latter strategy, we will show that with our softmax-margin-trained models it can have a substantial effect.

Parser. We use the C&C parser (Clark and Curran, 2007) and its supertagger (Clark, 2002). Our baseline is the hybrid model of Clark and Curran (2007), which contains features over both normal-form derivations and CCG dependencies. The parser relies solely on the supertagger for pruning, using exact CKY for search over the pruned space. Training requires calculation of feature expectations over packed charts of derivations. For training, we limited the number of items in this chart to 0.3 million, and for testing, 1 million. We also used a more permissive training supertagger beam (Table 2) than in previous work (Clark and Curran, 2007). Models were trained with the parser's L-BFGS trainer.

Evaluation. We evaluated on CCGbank (Hockenmaier and Steedman, 2007), a right-most normal-form CCG version of the Penn Treebank. We use sections 02-21 (39603 sentences) for training, section 00 (1913 sentences) for development and section 23 (2407 sentences) for testing. We supply gold-standard part-of-speech tags to the parsers. We evaluate on labelled and unlabelled predicate-argument structure recovery and supertag accuracy.

4.1 Training with Maximum F-measure Parses

So far we have discussed how to optimise towards task-specific metrics by changing the training objective. In our first experiment we instead change the data on which we optimise CLL.
This is a simple baseline to our later experiments, attempting to achieve the same effect by simpler means. Specifically, we use the algorithm of Huang (2008) to generate oracle F-measure parses for each sentence. Updating towards these oracle parses corrects the reachability problem in standard CLL training. Since the supertagger is used to prune the training forests, the correct parse is sometimes pruned away, reducing data utilisation to 91%. Clark and Curran (2007) correct for this by adding the gold tags to the parser input. While this increases data utilisation, it biases the model by training in an idealised setting not available at test time. Using oracle parses corrects this bias while permitting 99% data utilisation. The labelled F-score of the oracle parses lies at 98.1%. Though we expected that this might result in some improvement, the results (Table 3) show that it has no effect. However, it does serve as a useful baseline.
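The oracle parses themselves can be found with the max-product counterpart of the algorithm in Section 3.1. The sketch below is ours, under the same toy chart encoding assumed earlier; it only tracks which ⟨n, d⟩ pairs are reachable and reads off the best achievable F-measure, and adding backpointers would recover the corresponding parse.

```python
from collections import defaultdict
from itertools import product

# Reachability variant of the (n, d) state-splitting, in the spirit of
# Huang (2008): enumerate the (n, d) pairs derivable at each chart item and
# pick the pair with the best F-measure at the goal. Chart encoding as in the
# earlier inside-pass sketch; num_gold = |y| is assumed to be positive.

def oracle_f1(chart, length, goal="S", num_gold=1):
    reach = defaultdict(set)                       # (i, j, A) -> reachable (n, d) pairs
    for span in range(1, length + 1):
        for i in range(length - span + 1):
            j = i + span
            for cat, ways in chart.get((i, j), {}).items():
                cell = reach[(i, j, cat)]
                for way in ways:
                    if way[0] == "lex":
                        _, _, n_p, d_p = way
                        cell.add((n_p, d_p))
                    else:
                        _, B, C, k, _, n_p, d_p = way
                        for (n1, d1), (n2, d2) in product(reach[(i, k, B)], reach[(k, j, C)]):
                            cell.add((n1 + n2 + n_p, d1 + d2 + d_p))
    # Best F-measure reachable at the goal item.
    return max((2.0 * n / (d + num_gold) for n, d in reach[(0, length, goal)]), default=0.0)
```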

Table 1: Beam step function used for standard (AST) and less aggressive (Reverse) adaptive supertagging throughout our experiments. Parameter β is a beam threshold, while k bounds the number of lexical categories considered for each word.

Table 2: Beam step functions used for training. The first row shows the large-scale settings used for most experiments; the second shows the standard C&C settings (cf. Table 1).

Table 3: Performance on section 00 of CCGbank comparing models trained on treebank parses (Baseline), on maximum-F-score parses (Max-F), and on a combination of CCGbank and Max-F parses, using adaptive supertagging. Evaluation is based on labelled and unlabelled F-measure (LF/UF), precision (LP/UP) and recall (LR/UR), together with data utilisation.

4.2 Training with the Exact Algorithm

We first tested our assumptions about the feasibility of training with our exact algorithm by measuring the amount of state-splitting. Figure 4 plots the average number of splits per span against the relative span frequency; this is based on a typical set of training forests containing over 600 million states. The number of splits increases exponentially with span size, but the number of spans with many splits decreases equally quickly. Hence the small number of states with a high number of splits is balanced by a large number of spans with only a few splits: the highest number of splits per span observed with our settings was 4888, but the average number of splits lies at 44. Encouragingly, this enables experimentation in all but very large-scale settings.

Figure 4: Average number of state-splits per span length as introduced by a sentence-level F-measure loss function. The statistics are averaged over the training forests generated using the settings described in Section 4.

Figure 5 shows the distribution of ⟨n, d⟩ pairs across all split states in the training corpus; since n, the number of correct dependencies, over d, the number of all recovered dependencies, is precision, the graph shows that only a minority of states have either very high or very low precision. The range of values suggests that the softmax-margin criterion will have an opportunity to substantially modify the expectations, hopefully to good effect.

Figure 5: Distribution of states with d dependencies of which n are correct in the training forests.

We next turn to the question of optimization with these algorithms. Due to the significant computational requirements, we used the computationally less intensive normal-form model of Clark and Curran (2007) as well as their more restrictive training beam settings (Table 2). We train on all sentences of the training set as above and test with AST. In order to provide greater control over the influence of the loss function, we introduce a multiplier τ, which simply amends the second term of the objective function (Eq. 3) to:

\log \sum_{y \in Y(x^{(i)})} \exp\{\theta^\top f(x^{(i)}, y) + \tau\, l(y^{(i)}, y)\}

Figure 6 plots the performance of the exact loss functions across different settings of τ on various evaluation criteria, for models restricted to at most 3000 items per chart at training time to allow rapid experimentation with a wide parameter set. Even in this constrained setting, it is encouraging to see that each loss function performs best on the criterion it optimises. The precision-trained parser also does very well on F-measure; this is because the parser has a tendency to perform better in terms of precision than recall.

4.3 Exact vs. Approximate Loss Functions

With these results in mind, we conducted a comparison of parsers trained using our exact and approximate loss functions. Table 4 compares their performance head to head when restricting training chart sizes to 100,000 items per sentence, the largest setting our computing resources allowed us to experiment with. The results confirm that the loss-trained models improve over a likelihood-trained baseline, and furthermore that the exact loss functions seem to have the best performance. However, the approximations are extremely competitive with their exact counterparts. Because they are also efficient, this makes them attractive for larger-scale experiments. Training time increases by only an order of magnitude with exact loss functions, despite their increased theoretical complexity (§3.1); there is no significant change with approximate loss functions. Table 5 shows the performance of the approximate losses with the large-scale settings outlined initially (§4). One striking result is that the softmax-margin-trained models coax more accurate parses from the larger search space, in contrast to the likelihood-trained models. Our best loss model improves labelled F-measure by over 0.8%.

4.4 Combination with Integrated Parsing and Supertagging

As a final experiment, we embed our loss-trained model into an integrated model that incorporates Markov features over supertags into the parsing model (Auli and Lopez, 2011). These features have serious implications for search: even allowing for the observation of Fowler and Penn (2010) that our CCG is weakly context-free, the search problem is equivalent to finding the optimal derivation in the weighted intersection of a regular and a context-free language (Bar-Hillel et al., 1964), making search very expensive. Therefore parsing with this model requires approximations. To experiment with this combined model we use loopy belief propagation (LBP; Pearl et al., 1985), previously applied to dependency parsing by Smith and Eisner (2008).
A more detailed account of its application to our combined model can be found in Auli and Lopez (2011), but we sketch the idea here.

Figure 6: Performance of exact cost functions optimizing F-measure, precision and recall in terms of (a) labelled F-measure, (b) precision, (c) recall and (d) supertag accuracy across various settings of τ on the development set.

Table 4: Performance of exact and approximate loss functions against conditional log-likelihood (CLL) on section 00 (dev) and section 23 (test): decomposable precision (DecP), recall (DecR) and F-measure (DecF1) versus exact precision (P), recall (R) and F-measure (F1). Evaluation is based on labelled and unlabelled F-measure (LF/UF), precision (LP/UP) and recall (LR/UR).

Table 5: Performance of decomposed loss functions in the large-scale training setting on section 00 (dev) and section 23 (test), under AST and Reverse AST. Evaluation is based on labelled and unlabelled F-measure (LF/UF) and supertag accuracy (ST).

We construct a graphical model with two factors: one is a distribution over supertag variables defined by a supertagging model, and the other is a distribution over these variables and a set of span variables defined by our parsing model (these complex factors resemble those of Smith and Eisner (2008) and Dreyer and Eisner (2009), and can be thought of as case-factor diagrams; McAllester et al., 2008). The factors communicate by passing messages across the shared supertag variables that correspond to their marginal distributions over those variables. Hence, to compute approximate expectations across the entire model, we run forward-backward to obtain posterior supertag assignments. These marginals are passed as inside values to the inside-outside algorithm, which returns a new set of posteriors. The new posteriors are incorporated into a new iteration of forward-backward, and the algorithm iterates until convergence, or until a fixed number of iterations is reached; we found that a single iteration is sufficient, corresponding to a truncated version of the algorithm in which posteriors are simply passed from the supertagger to the parser. To decode, we use the posteriors in a minimum-risk parsing algorithm (Goodman, 1996).

Our baseline models are trained separately as before and combined at test time. For softmax-margin, we combine a parsing model trained with F1 and a supertagger trained with Hamming loss. Table 6 shows the results: we observe a gain of up to 1.5% in labelled F1 and 0.9% in unlabelled F1 on the test set. The loss functions prove their robustness by improving the more accurate combined models by up to 0.4% in labelled F1. Table 7 shows results with automatic part-of-speech tags and a direct comparison with the Petrov parser trained on CCGbank (Fowler and Penn, 2010), which we outperform on all metrics.

5 Conclusion and Future Work

The softmax-margin criterion is a simple and effective approach to training log-linear parsers. We have shown that it is possible to compute exact sentence-level losses under standard parsing metrics, not only approximations (Taskar et al., 2004). This enables us to show the effectiveness of these approximations, and it turns out that they are excellent substitutes for exact loss functions. Indeed, the approximate losses are as easy to use as standard conditional log-likelihood. Empirically, softmax-margin training improves parsing performance across the board, beating the state-of-the-art CCG parsing model of Clark and Curran (2007) by up to 0.8% labelled F-measure. It also proves robust, improving a stronger baseline based on a combined parsing and supertagging model. Our final result of 89.3%/94.0% labelled and unlabelled F-measure is the best result reported for CCG parsing accuracy, beating the original C&C baseline by up to 1.5%. In future work we plan to scale our exact loss functions to larger settings and to explore training with loss functions within loopy belief propagation. Although we have focused on CCG parsing in this work, we expect our methods to be equally applicable to parsing with other grammar formalisms, including context-free grammar or LTAG.
Acknowledgements

We would like to thank Stephen Clark, Christos Christodoulopoulos, Mark Granroth-Wilding, Gholamreza Haffari, Alexandre Klementiev, Tom Kwiatkowski, Kira Mourao, Matt Post, and Mark Steedman for helpful discussion related to this work and comments on previous drafts, and the anonymous reviewers for helpful comments. We also acknowledge funding from EPSRC grant EP/P504171/1 (Auli) and the resources provided by the Edinburgh Compute and Data Facility.

Table 6: Performance of combined parsing and supertagging with belief propagation (BP) on section 00 (dev) and section 23 (test), under AST and Reverse AST, using decomposed F1 as the parser loss function and supertag accuracy (SA) as the loss in the supertagger. Evaluation is based on labelled and unlabelled F-measure (LF/UF) and supertag accuracy (ST).

Table 7: Results on automatically assigned POS tags on section 00 (dev) and section 23 (test). Petrov I-5 is based on the parser output of Fowler and Penn (2010); evaluation is based on sentences for which all parsers returned an analysis.

A Computing F-Measure-Augmented Expectations at the Corpus Level

To compute exact corpus-level expectations for softmax-margin using F-measure, we add an additional transition before reaching the GOAL item in our original program. To reach it, we must parse every sentence in the corpus, associating statistics of aggregate ⟨n, d⟩ pairs for the entire training set with intermediate symbols Γ^{(1)}, ..., Γ^{(m)} via the following inside recursions:

I(\Gamma^{(1)}_{n,d}) = I(S^{(1)}_{0,|x^{(1)}|,n,d})
I(\Gamma^{(l)}_{n,d}) = \sum_{\substack{n' + n'' = n \\ d' + d'' = d}} I(\Gamma^{(l-1)}_{n',d'}) \, I(S^{(l)}_{0,|x^{(l)}|,n'',d''})
I(\text{GOAL}) = \sum_{n,d} I(\Gamma^{(m)}_{n,d}) \left(1 - \frac{2n}{d + |y|}\right)

Outside recursions follow straightforwardly. Implementation of this algorithm would require substantial distributed computation or external data structures, so we did not attempt it.

References

M. Auli and A. Lopez. 2011. A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing. In Proc. of ACL, June.
J. K. Baker. 1979. Trainable grammars for speech recognition. Journal of the Acoustical Society of America, 65.
S. Bangalore and A. K. Joshi. 1999. Supertagging: An Approach to Almost Parsing. Computational Linguistics, 25(2), June.
Y. Bar-Hillel, M. Perles, and E. Shamir. 1964. On formal properties of simple phrase structure grammars. In Language and Information: Selected Essays on their Theory and Application.
S. Clark and J. R. Curran. 2004. The importance of supertagging for wide-coverage CCG parsing. In COLING, Morristown, NJ, USA.
S. Clark and J. R. Curran. 2007. Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. Computational Linguistics, 33(4).
S. Clark and J. Hockenmaier. 2002. Evaluating a Wide-Coverage CCG Parser. In Proceedings of the LREC 2002 Beyond Parseval Workshop, pages 60-66, Las Palmas, Spain.
S. Clark, J. Hockenmaier, and M. Steedman. 2002. Building deep dependency structures with a wide-coverage CCG parser. In Proc. of ACL.

S. Clark. 2002. Supertagging for Combinatory Categorial Grammar. In TAG+6.
M. Dreyer and J. Eisner. 2009. Graphical models over multiple strings. In Proc. of EMNLP.
J. R. Finkel, A. Kleeman, and C. D. Manning. 2008. Feature-based, conditional random field parsing. In Proceedings of ACL-HLT.
T. A. D. Fowler and G. Penn. 2010. Accurate context-free parsing with Combinatory Categorial Grammar. In Proc. of ACL.
K. Gimpel and N. A. Smith. 2010a. Softmax-margin CRFs: training log-linear models with cost functions. In HLT '10: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
K. Gimpel and N. A. Smith. 2010b. Softmax-margin training for structured log-linear models. Technical Report CMU-LTI, Carnegie Mellon University.
J. Goodman. 1996. Parsing algorithms and metrics. In Proc. of ACL, June.
J. Hockenmaier and M. Steedman. 2007. CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3).
L. Huang. 2008. Forest Reranking: Discriminative Parsing with Non-Local Features. In Proceedings of ACL-08: HLT.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML.
D. McAllester, M. Collins, and F. Pereira. 2008. Case-factor diagrams for structured probabilistic modeling. Journal of Computer and System Sciences, 74(1).
D. McAllester. 1999. On the complexity analysis of static analyses. In Proc. of Static Analysis Symposium, volume 1694/1999 of LNCS. Springer Verlag.
F. J. Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of ACL, July.
J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
D. Povey and P. Woodland. 2008. Minimum phone error and I-smoothing for improved discriminative training. In Proc. of ICASSP.
F. Sha and L. K. Saul. 2006. Large margin hidden Markov models for automatic speech recognition. In Proc. of NIPS.
D. A. Smith and J. Eisner. 2008. Dependency parsing by belief propagation. In Proc. of EMNLP.
M. Steedman. 2000. The Syntactic Process. MIT Press, Cambridge, MA.
B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. 2004. Max-margin parsing. In Proc. of EMNLP, pages 1-8, July.


More information

South Carolina College- and Career-Ready (SCCCR) Algebra 1

South Carolina College- and Career-Ready (SCCCR) Algebra 1 South Carolina College- and Career-Ready (SCCCR) Algebra 1 South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR) Mathematical Process

More information

Natural Language to Relational Query by Using Parsing Compiler

Natural Language to Relational Query by Using Parsing Compiler Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

Tree based ensemble models regularization by convex optimization

Tree based ensemble models regularization by convex optimization Tree based ensemble models regularization by convex optimization Bertrand Cornélusse, Pierre Geurts and Louis Wehenkel Department of Electrical Engineering and Computer Science University of Liège B-4000

More information

Analysis of Approximation Algorithms for k-set Cover using Factor-Revealing Linear Programs

Analysis of Approximation Algorithms for k-set Cover using Factor-Revealing Linear Programs Analysis of Approximation Algorithms for k-set Cover using Factor-Revealing Linear Programs Stavros Athanassopoulos, Ioannis Caragiannis, and Christos Kaklamanis Research Academic Computer Technology Institute

More information

Detection of changes in variance using binary segmentation and optimal partitioning

Detection of changes in variance using binary segmentation and optimal partitioning Detection of changes in variance using binary segmentation and optimal partitioning Christian Rohrbeck Abstract This work explores the performance of binary segmentation and optimal partitioning in the

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Multiple Kernel Learning on the Limit Order Book

Multiple Kernel Learning on the Limit Order Book JMLR: Workshop and Conference Proceedings 11 (2010) 167 174 Workshop on Applications of Pattern Analysis Multiple Kernel Learning on the Limit Order Book Tristan Fletcher Zakria Hussain John Shawe-Taylor

More information

Optimal shift scheduling with a global service level constraint

Optimal shift scheduling with a global service level constraint Optimal shift scheduling with a global service level constraint Ger Koole & Erik van der Sluis Vrije Universiteit Division of Mathematics and Computer Science De Boelelaan 1081a, 1081 HV Amsterdam The

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

GLM, insurance pricing & big data: paying attention to convergence issues.

GLM, insurance pricing & big data: paying attention to convergence issues. GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK - michael.noack@addactis.com Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

Classification with Hybrid Generative/Discriminative Models

Classification with Hybrid Generative/Discriminative Models Classification with Hybrid Generative/Discriminative Models Rajat Raina, Yirong Shen, Andrew Y. Ng Computer Science Department Stanford University Stanford, CA 94305 Andrew McCallum Department of Computer

More information

Revisiting Output Coding for Sequential Supervised Learning

Revisiting Output Coding for Sequential Supervised Learning Revisiting Output Coding for Sequential Supervised Learning Guohua Hao and Alan Fern School of Electrical Engineering and Computer Science Oregon State University Corvallis, OR 97331 USA {haog, afern}@eecs.oregonstate.edu

More information

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation

More information

Solution of Linear Systems

Solution of Linear Systems Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start

More information

A Constraint Programming based Column Generation Approach to Nurse Rostering Problems

A Constraint Programming based Column Generation Approach to Nurse Rostering Problems Abstract A Constraint Programming based Column Generation Approach to Nurse Rostering Problems Fang He and Rong Qu The Automated Scheduling, Optimisation and Planning (ASAP) Group School of Computer Science,

More information

MapReduce Approach to Collective Classification for Networks

MapReduce Approach to Collective Classification for Networks MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty

More information

arxiv:1112.0829v1 [math.pr] 5 Dec 2011

arxiv:1112.0829v1 [math.pr] 5 Dec 2011 How Not to Win a Million Dollars: A Counterexample to a Conjecture of L. Breiman Thomas P. Hayes arxiv:1112.0829v1 [math.pr] 5 Dec 2011 Abstract Consider a gambling game in which we are allowed to repeatedly

More information