Approximate Factoring for A* Search

Aria Haghighi, John DeNero, Dan Klein
Computer Science Division
University of California, Berkeley
{aria42, denero, klein}@cs.berkeley.edu

Abstract

We present a novel method for creating A* estimates for structured search problems. In our approach, we project a complex model onto multiple simpler models for which exact inference is efficient. We use an optimization framework to estimate parameters for these projections in a way which bounds the true costs. Similar to Klein and Manning (2003), we then combine completion estimates from the simpler models to guide search in the original complex model. We apply our approach to bitext parsing and lexicalized parsing, demonstrating its effectiveness in these domains.

1 Introduction

Inference tasks in NLP often involve searching for an optimal output from a large set of structured outputs. For many complex models, selecting the highest scoring output for a given observation is slow or even intractable. One general technique to increase efficiency while preserving optimality is A* search (Hart et al., 1968); however, successfully using A* search is challenging in practice. The design of admissible (or nearly admissible) heuristics which are both effective (close to actual completion costs) and also efficient to compute is a difficult, open problem in most domains. As a result, most work on search has focused on non-optimal methods, such as beam search or pruning based on approximate models (Collins, 1999), though in certain cases admissible heuristics are known (Och and Ney, 2000; Zhang and Gildea, 2006). For example, Klein and Manning (2003) show a class of projection-based A* estimates, but their application is limited to models which have a very restrictive kind of score decomposition. In this work, we broaden their projection-based technique to give A* estimates for models which do not factor in this restricted way.

Like Klein and Manning (2003), we focus on search problems where there are multiple projections or views of the structure, for example lexicalized parsing, in which trees can be projected onto either their CFG backbone or their lexical attachments. We use general optimization techniques (Boyd and Vandenberghe, 2005) to approximately factor a model over these projections. Solutions to the projected problems yield heuristics for the original model. This approach is flexible, providing either admissible or nearly admissible heuristics, depending on the details of the optimization problem solved. Furthermore, our approach allows a modeler explicit control over the trade-off between the tightness of a heuristic and its degree of inadmissibility (if any). We describe our technique in general and then apply it to two concrete NLP search tasks: bitext parsing and lexicalized monolingual parsing.

2 General Approach

Many inference problems in NLP can be solved with agenda-based methods, in which we incrementally build hypotheses for larger items by combining smaller ones with some local configurational structure. We can formalize such tasks as graph search problems, where states encapsulate partial hypotheses and edges combine or extend them locally.[1] For example, in HMM decoding, the states are anchored labels, e.g. VBD[5], and edges correspond to hidden transitions, e.g. VBD[5] → DT[6]. The search problem is to find a minimal cost path from the start state to a goal state, where the path cost is the sum of the costs of the edges in the path.

[1] In most complex tasks, we will in fact have a hypergraph, but the extension is trivial and not worth the added notation.
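To make the search setting concrete, the following is a minimal sketch (not the authors' implementation) of the agenda-based A* loop assumed throughout; the `successors` and `h` arguments are hypothetical stand-ins for a particular model's edge generator and heuristic.

```python
# Minimal agenda-based A* sketch: states are expanded in order of
# path-cost-so-far plus the heuristic completion estimate h(s).
import heapq
import itertools

def astar(start, is_goal, successors, h):
    """successors(s) yields (next_state, edge_cost); h(s) estimates completion cost."""
    counter = itertools.count()              # tie-breaker so states are never compared
    agenda = [(h(start), next(counter), 0.0, start)]
    best = {start: 0.0}
    while agenda:
        _, _, g, s = heapq.heappop(agenda)
        if is_goal(s):
            return g                         # minimal path cost if h is admissible
        if g > best.get(s, float("inf")):
            continue                         # stale agenda entry
        for s2, cost in successors(s):
            g2 = g + cost
            if g2 < best.get(s2, float("inf")):
                best[s2] = g2
                heapq.heappush(agenda, (g2 + h(s2), next(counter), g2, s2))
    return float("inf")
```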
[Figure 1 omitted: three panels labeled Local Configurations, Original Cost Matrix, and Factored Cost Matrix, showing example costs for configurations built from row projections b/c and column projections b'/c'.]

Figure 1: Example cost factoring. In (a), each cell of the matrix is a local configuration composed of two projections (the row and column of the cell). In (b), the top matrix is an example cost matrix, which specifies the cost of each local configuration. The bottom matrix represents our factored estimates, where each entry is the sum of configuration projections. For this example, the actual cost matrix can be decomposed exactly into two projections. In (c), the top cost matrix cannot be exactly decomposed along two dimensions. Our factored cost matrix has the property that each factored cost estimate is below the actual configuration cost. Although our factorization is no longer tight, it still can be used to produce an admissible heuristic.

For probabilistic inference problems, the cost of an edge is typically a negative log probability which depends only on some local configuration type. For instance, in PCFG parsing, the hyperedges reference anchored spans X[i, j], but the edge costs depend only on the local rule type X → Y Z. We will use a to refer to a local configuration and use c(a) to refer to its cost. Because edge costs are sensitive only to local configurations, the cost of a path is a sum of local configuration costs c(a). A* search requires a heuristic function, which is an estimate h(s) of the completion cost, the cost of a best path from state s to a goal.

In this work, following Klein and Manning (2003), we consider problems with projections or views, which define mappings to simpler state and configuration spaces. For instance, suppose that we are using an HMM to jointly model part-of-speech (POS) and named-entity-recognition (NER) tagging. There might be one projection onto the NER component and another onto the POS component. Formally, a projection π is a mapping from states to some coarser domain. A state projection induces projections of edges and of the entire graph π(G). We are particularly interested in search problems with multiple projections {π_1, ..., π_l} where each projection π_i has the following properties: its state projections induce well-defined projections π_i(a) of the local configurations used for scoring, and the projected search problem admits simpler inference. For instance, the POS projection in our NER-POS HMM is a simpler HMM, though the gains from this method are greater when inference in the projections has lower asymptotic complexity than the original problem (see sections 3 and 4).

In defining projections, we have not yet dealt with the projected scoring function. Suppose that the cost of local configurations decomposes along projections as well. In this case,

    c(a) = Σ_i c_i(π_i(a)),  ∀a ∈ A    (1)

where A is the set of local configurations and c_i(π_i(a)) represents the cost of configuration a under projection π_i. A toy example of such a cost decomposition in the context of a Markov process over two-part states is shown in figure 1b, where the costs of the joint transitions equal the sum of the costs of their projections.

Under the strong assumption of equation (1), Klein and Manning (2003) give an admissible A* bound. They note that the cost of a path decomposes as a sum of projected path costs. Hence, the following is an admissible additive heuristic (Felner et al., 2004):

    h(s) = Σ_i h_i(s)    (2)

where h_i(s) denotes the optimal completion cost in the projected search graph π_i(G). That is, the completion cost of a state bounds the sum of the completion costs in each projection.
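As a concrete illustration of equation (1), the tiny check below (with assumed numbers in the spirit of figure 1b, not taken verbatim from the figure) verifies that a joint cost matrix over two-part configurations decomposes exactly into a sum of projection costs, which is precisely the situation in which the additive heuristic of equation (2) applies directly.

```python
# Exact cost factoring as in equation (1): every joint configuration cost equals
# the sum of its two projection costs (illustrative numbers).
joint = {("b", "b'"): 1.0, ("b", "c'"): 3.0, ("c", "b'"): 3.0, ("c", "c'"): 5.0}
c1 = {"b": 1.0, "c": 3.0}        # costs under projection 1
c2 = {"b'": 0.0, "c'": 2.0}      # costs under projection 2
assert all(abs(cost - (c1[u] + c2[v])) < 1e-12 for (u, v), cost in joint.items())
```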
In virtually all cases, however, configuration costs will not decompose over projections, nor would we expect them to. For instance, in our joint POS-NER task, this assumption requires that the POS and NER
transitions and observations be generated independently. This independence assumption undermines the motivation for assuming a joint model.

In the central contribution of this work, we exploit the projection structure of our search problem without making any assumption about cost decomposition. Rather than assuming decomposition, we propose to find scores φ for the projected configurations which are pointwise admissible:

    Σ_i φ_i(π_i(a)) ≤ c(a),  ∀a ∈ A    (3)

Here, φ_i(π_i(a)) represents the factored projection cost of π_i(a), the π_i projection of configuration a. Given pointwise admissible φ_i's, we can again apply the heuristic recipe of equation (2). An example of factored projection costs is shown in figure 1c, where no exact decomposition exists, but a pointwise admissible lower bound is easy to find.

Claim. If a set of factored projection costs {φ_1, ..., φ_l} satisfies pointwise admissibility, then the heuristic from (2) is an admissible A* heuristic.

Proof. Assume a_1, ..., a_k are the configurations used to optimally reach the goal from state s. Then,

    h*(s) = Σ_{j=1..k} c(a_j) ≥ Σ_{j=1..k} Σ_{i=1..l} φ_i(π_i(a_j)) = Σ_{i=1..l} Σ_{j=1..k} φ_i(π_i(a_j)) ≥ Σ_{i=1..l} h_i(s) = h(s)

The first inequality follows from pointwise admissibility. The second inequality follows because each inner sum is the cost of a completion in the projected problem π_i(G) and therefore h_i(s) lower bounds it.

Intuitively, we can see two sources of slack in such projection heuristics. First, there may be slack in the pointwise admissible scores. Second, the best paths in the projections will be overly optimistic because they have been decoupled (see figure 5 for an example of decoupled best paths in projections).

2.1 Finding Factored Projections for Non-Factored Costs

We can find factored costs φ_i which are pointwise admissible by solving an optimization problem. We think of our unknown factored costs as a block vector φ = [φ_1, ..., φ_l], where each vector φ_i is composed of the factored costs φ_i(π_i(a)) for each configuration a ∈ A. We can then find admissible factored costs by solving the following optimization problem:

    minimize_φ  ||γ||
    such that   γ_a = c(a) − Σ_i φ_i(π_i(a)),  ∀a ∈ A    (4)
                γ_a ≥ 0,  ∀a ∈ A

We can think of each γ_a as the amount by which the cost of configuration a exceeds the factored projection estimates (the pointwise A* gap). Requiring γ_a ≥ 0 ensures pointwise admissibility. Minimizing the norm of the γ variables encourages tighter bounds; indeed, if ||γ|| = 0, the solution corresponds to an exact factoring of the search problem. In the case where we minimize the 1-norm or ∞-norm, the problem above reduces to a linear program, which can be solved efficiently for a large number of variables and constraints.[2]

Viewing our procedure decision-theoretically, by minimizing the norm of the pointwise gaps we are effectively choosing a loss function which decomposes along configuration types and takes the form of the norm (i.e., linear or squared losses). A complete investigation of the alternatives is beyond the scope of this work, but it is worth pointing out that in the end we will care only about the gap on entire structures, not configurations, and individual configuration factored costs need not even be pointwise admissible for the overall heuristic to be admissible.

Notice that the number of constraints is |A|, the number of possible local configurations. For many search problems, enumerating the possible configurations is not feasible, and therefore neither is solving an optimization problem with all of these constraints. We deal with this situation in applying our technique to lexicalized parsing models (section 4).

Sometimes, we might be willing to trade search optimality for efficiency. In our approach, we can explicitly make this trade-off by designing an alternative optimization problem which allows for slack in the admissibility constraints.

[2] We used the MOSEK package (Andersen and Andersen, 2000).
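For intuition, here is a hedged sketch of problem (4) as a linear program using SciPy; the configuration costs are illustrative (loosely modeled on figure 1c), the two-projection setup is a simplification of the general block-vector formulation, and the nonnegativity bounds on the factored costs are an added assumption rather than part of the general problem.

```python
# Sketch of problem (4) with a 1-norm objective: find factored costs phi whose
# sums never exceed the true configuration costs, while keeping the gaps small.
import numpy as np
from scipy.optimize import linprog

costs = {("b", "b'"): 1.0, ("b", "c'"): 3.0, ("c", "b'"): 3.0, ("c", "c'"): 4.0}

proj1 = sorted({u for u, _ in costs})
proj2 = sorted({v for _, v in costs})
index = {("p1", u): i for i, u in enumerate(proj1)}
index.update({("p2", v): len(proj1) + j for j, v in enumerate(proj2)})
n = len(index)

# Pointwise admissibility: phi1(u) + phi2(v) <= c(u, v) for every configuration.
A_ub, b_ub = [], []
for (u, v), cost in costs.items():
    row = np.zeros(n)
    row[index[("p1", u)]] = 1.0
    row[index[("p2", v)]] = 1.0
    A_ub.append(row)
    b_ub.append(cost)

# Minimizing the 1-norm of the gaps sum_a (c(a) - phi1(u) - phi2(v)) is, up to a
# constant, the same as maximizing each factored cost weighted by its count.
counts = np.zeros(n)
for (u, v) in costs:
    counts[index[("p1", u)]] += 1.0
    counts[index[("p2", v)]] += 1.0

res = linprog(c=-counts, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(0, None)] * n)       # nonnegative factored costs (assumption)
phi = {key: res.x[i] for key, i in index.items()}
print(phi)  # factored costs whose pairwise sums lower-bound every entry of `costs`
```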
We solve the following soft version of problem (4):

    minimize_φ  ||γ⁺|| + C ||γ⁻||
    such that   γ_a = c(a) − Σ_i φ_i(π_i(a)),  ∀a ∈ A    (5)

where γ⁺ = max{0, γ} and γ⁻ = max{0, −γ} represent the componentwise positive and negative elements of γ, respectively. Each γ⁻_a > 0 represents a configuration where our factored projection estimate is not pointwise admissible. Since this situation may result in our heuristic becoming inadmissible if a is used in the projected completion costs, we more heavily penalize overestimating the cost by the constant C.

2.2 Bounding Search Error

In the case where we allow pointwise inadmissibility, i.e. negative γ_a variables, we can bound our search error. Suppose γ_max = max_{a ∈ A} γ⁻_a and that L is the length of the longest optimal solution for the original problem. Then, h(s) ≤ h*(s) + L γ_max for all s ∈ S. This ε-admissible heuristic (Ghallab and Allard, 1982) bounds our search error by L γ_max.[3]

3 Bitext Parsing

In bitext parsing, one jointly infers a synchronous phrase structure tree over a sentence w_s and its translation w_t (Melamed et al., 2004; Wu, 1997). Bitext parsing is a natural candidate task for our approximate factoring technique. A synchronous tree projects monolingual phrase structure trees onto each sentence. However, the costs assigned by a weighted synchronous grammar (WSG) G do not typically factor into independent monolingual WCFGs. We can, however, produce a useful surrogate: a pair of monolingual WCFGs with structures projected by G and weights that, when combined, underestimate the costs of G.

Parsing optimally relative to a synchronous grammar using a dynamic program requires time O(n^6) in the length of the sentence (Wu, 1997). This high degree of complexity makes exhaustive bitext parsing infeasible for all but the shortest sentences. In contrast, monolingual CFG parsing requires time O(n^3) in the length of the sentence.

3.1 A* Parsing

Alternatively, we can search for an optimal parse guided by a heuristic. The states in A* bitext parsing are rooted bispans, denoted X[i, j] :: Y[k, l]. States represent a joint parse over subspans [i, j] of w_s and [k, l] of w_t, rooted by the nonterminals X and Y respectively. Given a WSG G, the algorithm prioritizes a state or edge e by the sum of its inside cost β_G(e) (the negative log of its inside probability) and its outside estimate h(e), or completion cost.[4] We are guaranteed the optimal parse if our heuristic h(e) is never greater than α_G(e), the true outside cost of e.

We now consider a heuristic combining the completion costs of the monolingual projections of G, and guarantee admissibility by enforcing pointwise admissibility. Each state e = X[i, j] :: Y[k, l] projects a pair of monolingual rooted spans. The heuristic we propose sums independent outside costs of these spans in each monolingual projection:

    h(e) = α_s(X[i, j]) + α_t(Y[k, l])

These monolingual outside scores are computed relative to a pair of monolingual WCFG grammars G_s and G_t, given by splitting each synchronous rule r = ⟨X_s → α β, Y_t → γ δ⟩ into its components π_s(r) = X → α β and π_t(r) = Y → γ δ and weighting them via optimized φ_s(r) and φ_t(r), respectively.[5] To learn pointwise admissible costs for the monolingual grammars, we formulate the following optimization problem:[6]

    minimize_{γ, φ_s, φ_t}  ||γ||_1
    such that   γ_r = c(r) − [φ_s(r) + φ_t(r)]  for all synchronous rules r ∈ G
                φ_s ≥ 0, φ_t ≥ 0, γ ≥ 0

[3] This bound may be very loose if L is large.
[4] All inside and outside costs are Viterbi, not summed.
[5] Note that we need only parse each sentence monolingually once to compute the outside probabilities for every span.
[6] The stated objective is merely one reasonable choice among many possibilities which require pointwise admissibility and encourage tight estimates.
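To show how the pieces fit together at search time, here is a small sketch (not the authors' code) of the bitext A* priority computation; the charts `inside_G`, `outside_s`, and `outside_t` are assumed to have been precomputed, the latter two once per sentence from the projected monolingual grammars as footnote [5] suggests.

```python
# Priority of a bitext edge e = X[i, j] :: Y[k, l]: Viterbi inside cost under the
# full synchronous grammar plus the summed monolingual outside estimates,
# h(e) = alpha_s(X[i, j]) + alpha_t(Y[k, l]).
def bitext_priority(edge, inside_G, outside_s, outside_t):
    X, i, j, Y, k, l = edge
    h = outside_s[(X, i, j)] + outside_t[(Y, k, l)]
    return inside_G[edge] + h
```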
[Figure 2 omitted: three panels over a Source span [k, l] and Target span [i, j], headed "Cost under G_s", "Cost under G_t", and "Cost under G", showing monolingual completions scored by the factored model, a synchronized completion scored by the factored model, and a synchronized completion scored by the original model.]

Figure 2: The gap between the heuristic (left) and true completion cost (right) comes from relaxing the synchronized problem to independent subproblems and slack in the factored models.

[Figure 3 omitted: (a) a tree-to-tree transducer rule pairing NP_s with NP_t and reordering NN/NNS children around the Spanish "de"; (b) an aligned training sentence pair, "sistemas de traduccion funcionan a veces" / "Translation systems sometimes work", with its English parse.]

Figure 3: (a) A tree-to-tree transducer rule. (b) An example training sentence pair that yields rule (a).

Figure 2 diagrams the two bounds that enforce the admissibility of h(e). For any outside cost α_G(e), there is a corresponding optimal completion structure o under G, which is an outer shell of a synchronous tree. o projects monolingual completions o_s and o_t, which have well-defined costs c_s(o_s) and c_t(o_t) under G_s and G_t respectively. Their sum c_s(o_s) + c_t(o_t) will underestimate α_G(e) by pointwise admissibility. Furthermore, the heuristic we compute underestimates this sum. Recall that the monolingual outside score α_s(X[i, j]) is the minimal cost of any completion of the edge. Hence, α_s(X[i, j]) ≤ c_s(o_s) and α_t(Y[k, l]) ≤ c_t(o_t). Admissibility follows.

3.2 Experiments

We demonstrate our technique using the synchronous grammar formalism of tree-to-tree transducers (Knight and Graehl, 2004). In each weighted rule, an aligned pair of nonterminals generates two ordered lists of children. The non-terminals in each list must align one-to-one to the non-terminals in the other, while the terminals are placed freely on either side. Figure 3 shows an example rule. Following Galley et al. (2004), we learn a grammar by projecting English syntax onto a foreign language via word-level alignments, as in figure 3b.[7]

We parsed 1200 English-Spanish sentences using a grammar learned from 40,000 sentence pairs of the English-Spanish Europarl corpus.[8] Figure 4a shows that A* expands substantially fewer states while searching for the optimal parse with our optimization heuristic. The exhaustive curve shows edge expansions using the null heuristic. The intermediate result, labeled English only, used only the English monolingual outside score as a heuristic. Similar results using only Spanish demonstrate that both projections contribute to parsing efficiency. All three curves in figure 4a represent running times for finding the optimal parse.

Zhang and Gildea (2006) offer a different heuristic for A* parsing of ITG grammars that provides a forward estimate of the cost of aligning the unparsed words in both sentences. We cannot directly apply this technique to our grammar because tree-to-tree transducers only align non-terminals. Instead, we can augment our synchronous grammar model to include a lexical alignment component, then employ both heuristics. We learned the following two-stage generative model: a tree-to-tree transducer generates trees whose leaves are parts of speech. Then, the words of each sentence are generated, either jointly from aligned parts of speech or independently given a null alignment. The cost of a complete parse under this new model decomposes into the cost of the synchronous tree over parts of speech and the cost of generating the lexical items. Given such a model, both our optimization heuristic and the lexical heuristic of Zhang and Gildea (2006) can be computed independently.

[7] The bilingual corpus consists of translation pairs with fixed English parses and word alignments. Rules were scored by their relative frequencies.
[8] Rare words were replaced with their parts of speech to limit the memory consumption of the parser.
Crucially, the sum of these heuristics is still admissible. Results appear in figure 4b. Both heuristics (lexical and optimization) alone improve parsing performance, but their sum (opt+lex) substantially improves upon either one.
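The combination just described can be read as a one-line change to the edge priority; the sketch below is an assumed wiring, not the authors' implementation.

```python
# With the two-stage model, two admissible estimates that bound disjoint cost
# components (tree costs vs. lexical alignment costs) can simply be added.
def combined_priority(edge, inside, h_opt, h_lex):
    return inside(edge) + h_opt(edge) + h_lex(edge)
```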
[Figure 4 omitted: two charts of edges expanded versus sentence length (5-15); chart (a) compares Exhaustive, English Only, and Optimization; chart (b) compares Exhaustive, Lexical, Optimization, and Opt+Lex.]

Figure 4: (a) Parsing efficiency results with optimization heuristics show that both component projections constrain the problem. (b) Including a lexical model and corresponding heuristic further increases parsing efficiency.

4 Lexicalized Parsing

We next apply our technique to lexicalized parsing (Charniak, 1997; Collins, 1999). In lexicalized parsing, the local configurations are lexicalized rules of the form X[h, t] → Y[h, t] Z[h′, t′], where h, t, h′, and t′ are the head word, head tag, argument word, and argument tag, respectively. We will use r = X → Y Z to refer to the CFG backbone of a lexicalized rule. As in Klein and Manning (2003), we view each lexicalized rule, l, as having a CFG projection, π_c(l) = r, and a dependency projection, π_d(l) = (h, t, h′, t′) (see figure 5).[9] Broadly, the CFG projection encodes constituency structure, while the dependency projection encodes lexical selection, and both projections are asymptotically more efficient than the original problem.

Klein and Manning (2003) present a factored model where the CFG and dependency projections are generated independently (though with compatible bracketing):

    P(Y[h, t] Z[h′, t′] | X[h, t]) = P(Y Z | X) P(h′, t′ | t, h)    (6)

In this work, we explore the following non-factored model, which allows correlations between the CFG and dependency projections:

    P(Y[h, t] Z[h′, t′] | X[h, t]) = P(Y Z | X, t, h) P(t′ | t, Z, h) P(h′ | t′, t, Z, h)    (7)

This model is broadly representative of the successful lexicalized models of Charniak (1997) and Collins (1999), though simpler.[10]

4.1 Choosing Constraints and Handling Unseen Dependencies

Ideally we would like to be able to solve the optimization problem in (4) for this task. Unfortunately, exhaustively listing all possible configurations (lexical rules) yields an impractical number of constraints. We therefore solve a relaxed problem in which we enforce the constraints for only a subset of the possible configurations, A′ ⊆ A. Once we start dropping constraints, we can no longer guarantee pointwise admissibility, and therefore there is no reason not to also allow penalized violations of the constraints we do list, so we solve (5) instead.

To generate the set of enforced constraints, we first include all configurations observed in the gold training trees. We then sample novel configurations by choosing X, h, t from the training distribution and then using the model to generate the rest of the configuration. In our experiments, we ended up with 434,329 observed configurations, and sampled the same number of novel configurations. Our penalty multiplier C was 10.

Even if we supplement our training set with many sampled configurations, we will still see new projected dependency configurations at test time. It is therefore necessary to generalize scores from training configurations to unseen ones. We enrich our procedure by expressing the projected configuration costs as linear functions of features. Specifically, we define feature vectors f_c(r) and f_d(h, t, h′, t′) over the CFG and dependency projections, and introduce corresponding weight vectors w_c and w_d.

[9] We assume information about the distance and direction of the dependency is encoded in the dependency tuple, but we omit it from the notation for compactness.
[10] All probability distributions for the non-factored model are estimated by Witten-Bell smoothing (Witten and Bell, 1991), where conditioning lexical items are backed off first.
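A hypothetical sketch of the constraint-sampling step in section 4.1 follows; the `model` object and its sampling methods are invented stand-ins that mirror the generative order of equation (7).

```python
# Sample novel lexicalized configurations: draw a parent (X, h, t) from the
# empirical training distribution, then let the model fill in the rest.
import random

def sample_novel_configurations(observed_parents, model, k):
    sampled = []
    for _ in range(k):
        X, h, t = random.choice(observed_parents)        # empirical parent draw
        Y, Z = model.sample_backbone(X, t, h)            # CFG backbone Y Z
        t_arg = model.sample_arg_tag(t, Z, h)            # argument tag t'
        h_arg = model.sample_arg_word(t_arg, t, Z, h)    # argument word h'
        sampled.append((X, Y, Z, h, t, h_arg, t_arg))
    return sampled
```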
[Figure 5 omitted: parse structures for "These stocks eventually reopened" under (a) the CFG projection, (b) the dependency projection, and (c) the original lexicalized model, annotated with Actual Cost: 18.7; Best Projected CFG Cost: 4.1; Best Projected Dep. Cost: 9.5; CFG Projection Cost: 6.9; Dep. Projection Cost: 11.1.]

Figure 5: Lexicalized parsing projections. The figure in (a) is the optimal CFG projection solution and the figure in (b) is the optimal dependency projection solution. The tree in (c) is the optimal solution for the original problem. Note that the sum of the CFG and dependency projections is a lower bound (albeit a fairly tight one) on the actual solution cost.

The weight vectors are learned by solving the following optimization problem:

    minimize_{γ, w_c, w_d}  ||γ⁺||_2 + C ||γ⁻||_2    (8)
    such that   w_c ≥ 0, w_d ≥ 0
                γ_l = c(l) − [w_c^T f_c(r) + w_d^T f_d(h, t, h′, t′)]  for l = (r, h, t, h′, t′) ∈ A′

Our CFG feature vector has only indicator features for the specific rule. However, our dependency feature vector consists of an indicator feature of the tuple (h, t, h′, t′) (including direction), an indicator of the part-of-speech type (t, t′) (also including direction), as well as a bias feature.

4.2 Experimental Results

We tested our approximate projection heuristic on two lexicalized parsing models. The first is the factored model of Klein and Manning (2003), given by equation (6), and the second is the non-factored model described in equation (7). Both models use the same parent-annotated, head-binarized CFG backbone and basic dependency projection, which models direction, but not distance or valence.[11] In each case, we compared A* using our approximate projection heuristics to exhaustive search. We measure efficiency in terms of the number of expanded hypotheses (edges popped); see figure 6.[12]

In both settings, the factored A* approach substantially outperforms exhaustive search. For the factored model of Klein and Manning (2003), we can also compare our reconstructed bound to the known tight bound which would result from solving the pointwise admissible problem in (4) with all constraints. As figure 6 shows, the exact factored heuristic does outperform our approximate factored heuristic, primarily because of many looser, backed-off cost estimates for unseen dependency tuples.

For the non-factored model, we compared our approximate factored heuristic to one which only bounds the CFG projection, as suggested by Klein and Manning (2003). They suggest

    φ_c(r) = min_{l ∈ A : π_c(l) = r} c(l)

where we obtain factored CFG costs by minimizing over dependency projections. As figure 6 illustrates, this CFG-only heuristic is substantially less efficient than our heuristic, which bounds both projections.

Since our heuristic is no longer guaranteed to be admissible, we evaluated its effect on search in several ways. The first is to check for search errors, where the model-optimal parse is not found. In the case of the factored model, we can find the optimal parse using the exact factored heuristic and compare it to the parse found by our learned heuristic. In our test set, the approximate projection heuristic failed to return the model optimal parse in less than 1% of sentences. Of these search errors, none of the costs were more than 0.1% greater than the model optimal cost in negative log-likelihood.

[11] The CFG and dependency projections correspond to the PCFG-PA and DEP-BASIC settings in Klein and Manning (2003).
[12] All models are trained on sections 2 through 21 of the English Penn treebank, and tested on section 23.
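For concreteness, here is a hedged sketch of the feature-weighted soft bound in problem (8) above using a general-purpose optimizer; the toy feature matrix, the concatenation of the CFG and dependency feature blocks into a single vector, and the choice of L-BFGS-B are all assumptions, not the paper's actual setup (the authors report using the MOSEK package).

```python
# Learn nonnegative weights w so that w . f(l) lower-bounds c(l) on the listed
# configurations, softly penalizing overestimates by C (cf. problem (8)).
import numpy as np
from scipy.optimize import minimize

def soft_bound_objective(w, F, c, C=10.0):
    gamma = c - F @ w                               # pointwise gaps gamma_l
    g_pos = np.maximum(gamma, 0.0)                  # admissible slack
    g_neg = np.maximum(-gamma, 0.0)                 # inadmissible overestimates
    return np.linalg.norm(g_pos) + C * np.linalg.norm(g_neg)

# Toy data: rows are configurations, columns concatenate f_c(r) and f_d(h,t,h',t').
F = np.array([[1, 0, 1, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 1]], dtype=float)
c = np.array([4.0, 2.5, 3.0])
w0 = np.zeros(F.shape[1])
res = minimize(soft_bound_objective, w0, args=(F, c),
               method="L-BFGS-B", bounds=[(0.0, None)] * F.shape[1])
print(res.x)   # nonnegative weights; F @ res.x gives the factored cost estimates
```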
For the non-factored model, the model optimal parse is known only for shorter sentences which can be parsed exhaustively. For these sentences (up to length 15), there were no search errors. We can also check for violations of pointwise admissibility for configurations encountered during search.
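The check just described amounts to a simple count over the configurations touched by the search; a minimal sketch, with hypothetical `factored_cost` and `true_cost` callables:

```python
# Fraction of configurations scored during search whose factored estimate
# exceeds the true cost, i.e. pointwise admissibility violations.
def violation_rate(scored_configs, factored_cost, true_cost, tol=1e-9):
    bad = sum(1 for a in scored_configs if factored_cost(a) > true_cost(a) + tol)
    return bad / max(1, len(scored_configs))
```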
[Figure 6 omitted: two charts of average edges popped versus sentence length (0-40); chart (a) compares Exhaustive, Approx. Factored, and Exact Factored; chart (b) compares Exhaustive, CFG Only, and Approx. Factored.]

Figure 6: Edges popped by exhaustive versus factored A* search. The chart in (a) is using the factored lexicalized model from Klein and Manning (2003). The chart in (b) is using the non-factored lexicalized model described in section 4.

For both the factored and non-factored models, less than 2% of the configurations scored by the approximate projection heuristic during search violated pointwise admissibility.

While this is a paper about inference, we also measured the accuracy in the standard way, on sentences of length up to 40, using EVALB. The factored model with the approximate projection heuristic achieves an F1 of 82.2, matching the performance with the exact factored heuristic, though slower. The non-factored model, using the approximate projection heuristic, achieves an F1 of 83.8 on the test set, which is slightly better than the factored model.[13] We note that the CFG and dependency projections are as similar as possible across models, so the increase in accuracy is likely due in part to the non-factored model's coupling of CFG and dependency projections.

5 Conclusion

We have presented a technique for creating A* estimates for inference in complex models. Our technique can be used to generate provably admissible estimates when all search transitions can be enumerated, and an effective heuristic even for problems where all transitions cannot be efficiently enumerated. In the future, we plan to investigate alternative objective functions and error-driven methods for learning heuristic bounds.

Acknowledgments

We would like to thank the anonymous reviewers for their comments. This work is supported by a DHS fellowship to the first author and a Microsoft new faculty fellowship to the third author.

References

E. D. Andersen and K. D. Andersen. 2000. The MOSEK interior point optimizer for linear programming. In H. Frenk et al., editor, High Performance Optimization. Kluwer Academic Publishers.

Stephen Boyd and Lieven Vandenberghe. 2005. Convex Optimization. Cambridge University Press.

Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In National Conference on Artificial Intelligence.

Michael Collins. 1999. Head-driven statistical models for natural language parsing.

Ariel Felner, Richard Korf, and Sarit Hanan. 2004. Additive pattern database heuristics. JAIR.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In HLT-NAACL.

Malik Ghallab and Dennis G. Allard. 1982. Aε - an efficient near admissible heuristic search algorithm. In IJCAI.

P. Hart, N. Nilsson, and B. Raphael. 1968. A formal basis for the heuristic determination of minimum cost paths. In IEEE Transactions on Systems Science and Cybernetics. IEEE.

Dan Klein and Christopher D. Manning. 2003. Factored A* search for models over sequences and trees. In IJCAI.

Kevin Knight and Jonathan Graehl. 2004. Training tree transducers. In HLT-NAACL.

I. Dan Melamed, Giorgio Satta, and Ben Wellington. 2004. Generalized multitext grammars. In ACL.

F. J. Och and H. Ney. 2000. Improved statistical alignment models. In ACL.

Ian H. Witten and Timothy C. Bell. 1991. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Comput. Linguist.

Hao Zhang and Daniel Gildea. 2006. Efficient search for inversion transduction grammar. In EMNLP.
[13] Since we cannot exhaustively parse with this model, we cannot compare our F1 to an exact search method.