Online Max-Margin Weight Learning for Markov Logic Networks

Tuyen N. Huynh and Raymond J. Mooney
Department of Computer Science, The University of Texas at Austin, 1616 Guadalupe, Suite 2.408, Austin, Texas 78701, USA. {hntuyen,mooney}@cs.utexas.edu

Abstract

Most of the existing weight-learning algorithms for Markov Logic Networks (MLNs) use batch training, which becomes computationally expensive and even infeasible for very large datasets since the training examples may not fit in main memory. To overcome this problem, previous work has used online learning algorithms to learn weights for MLNs. However, this prior work has only applied existing online algorithms, and there is no comprehensive study of online weight learning for MLNs. In this paper, we derive a new online algorithm for structured prediction using the primal-dual framework, apply it to learn weights for MLNs, and compare against existing online algorithms on three large, real-world datasets. The experimental results show that our new algorithm generally achieves better accuracy than existing methods, especially on noisy datasets.

Keywords: Online learning, structured prediction, statistical relational learning

1 Introduction

Statistical relational learning (SRL) concerns the induction of probabilistic knowledge that supports accurate prediction for multi-relational structured data [11]. These powerful SRL models have been successfully applied to a variety of real-world problems. However, the power of these models comes with a cost, since they can be computationally expensive to train, in particular because most existing SRL learning methods employ batch training, where the learner must repeatedly run inference over all training examples in each iteration. Training becomes even more expensive on larger datasets containing thousands of examples, and even infeasible in some cases where there is not enough main memory to fit the training data [24]. A well-known solution to this problem is online learning, where the learner sequentially processes one example at a time.

In this work, we look at the problem of online weight learning for Markov Logic Networks (MLNs), a recently developed SRL model that generalizes both full first-order logic and Markov networks [29, 10]. Riedel and Meza-Ruiz [31] and Mihalkova and Mooney [24] have used online learning algorithms to learn weights for MLNs. However, previous work only applied one existing online algorithm to MLNs and did not provide a comparative study of online weight learning for MLNs. In this work, we derive a new online algorithm for structured prediction [1] from the primal-dual framework for strongly convex loss functions [16], which is the latest framework for deriving online algorithms that have low regret; we apply it to learn weights for MLNs and compare against existing online algorithms that have been used in previous work. The experimental results show that our new algorithms generally achieve better accuracy than existing algorithms on three large, real-world datasets, especially on noisy datasets.

2 Background

2.1 Notation

We use lower case letters (e.g. w, λ) to denote scalars, bold face letters (e.g. x, y, λ) to denote vectors, and upper case letters (e.g. W, X) to denote sets. The inner product between vectors w and x is denoted by ⟨w, x⟩.

2.2 MLNs

An MLN consists of a set of weighted first-order clauses. It provides a way of softening first-order logic by making situations in which not all clauses are satisfied less likely but not impossible [29, 10].
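For example (the classic illustration from the MLN literature [29, 10]; the clauses and weights below are purely illustrative and are not an MLN used in this paper's experiments), soft clauses might state that smoking causes cancer and that friends of smokers tend to smoke:

1.5   Smokes(x) => Cancer(x)
1.1   Friends(x, y) ∧ Smokes(x) => Smokes(y)

A world in which someone smokes but does not get cancer violates a grounding of the first clause and therefore becomes less probable, but not impossible; the larger the weight, the stronger the constraint.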
More formally, let X be the set of all propositions describing a world (i.e. the set of all ground atoms), F be the set of all clauses in the MLN, w_i be the weight associated with clause f_i ∈ F, G_{f_i} be the set of all possible groundings of clause f_i, and Z be the normalization constant. Then the probability of a particular truth assignment x to the variables in X is defined as [29]:

P(X = x) = \frac{1}{Z} \exp\Big( \sum_{f_i \in F} w_i \sum_{g \in G_{f_i}} g(x) \Big) = \frac{1}{Z} \exp\Big( \sum_{f_i \in F} w_i n_i(x) \Big)

where g(x) is 1 if g is satisfied and 0 otherwise, and n_i(x) = \sum_{g \in G_{f_i}} g(x) is the number of groundings of f_i that are satisfied given the current truth assignment to the variables in X.
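The definition above can be made concrete with a minimal sketch (not the paper's implementation, which relies on an MLN grounding engine). Here a world x is any object (e.g. a dict of ground-atom truth values) that hypothetical ground-clause evaluator functions know how to test, and normalization is done by brute-force enumeration of worlds, which is only feasible for toy domains:

```python
from math import exp

def unnormalized_score(clauses, x):
    """exp(sum_i w_i * n_i(x)), where n_i(x) counts satisfied groundings of clause i.

    clauses: list of (weight, groundings) pairs, each grounding a boolean function of x.
    """
    total = 0.0
    for weight, groundings in clauses:
        n_i = sum(1 for g in groundings if g(x))  # n_i(x)
        total += weight * n_i
    return exp(total)

def probability(clauses, worlds, x):
    """P(X = x): normalize over an explicitly enumerated set of possible worlds."""
    z = sum(unnormalized_score(clauses, w) for w in worlds)  # partition function Z
    return unnormalized_score(clauses, x) / z
```

Real MLN systems never enumerate the worlds explicitly; they rely on the approximate or exact inference algorithms discussed next.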
There are two inference tasks in MLNs. The first is to infer the Most Probable Explanation (MPE), i.e. the most probable truth values for a set of unknown literals y, given a set of known literals x provided as evidence. Both approximate and exact MPE methods for MLNs have been proposed [17, 30, 14]. The second inference task is computing the conditional probabilities of some unknown literals y given some evidence x. Computing these probabilities is also intractable, but there are good approximation algorithms such as MC-SAT [27] and lifted belief propagation [36].

There are two approaches to weight learning in MLNs: generative and discriminative. In generative learning, the goal is to learn a weight vector that maximizes the likelihood of all the observed data [29]. In discriminative learning, we know a priori which predicates will be used to supply evidence and which ones will be queried, and the goal is to correctly predict the latter given the former. Several discriminative weight learning methods have been proposed, most of which try to find weights that maximize the conditional log-likelihood of the data [35, 20, 13]. Recently, Huynh and Mooney [14] proposed a max-margin approach to learning weights for MLNs.

2.3 The Primal-Dual Algorithmic Framework for Online Convex Optimization

In this section, we briefly review the primal-dual framework for strongly convex loss functions [16], which is the latest framework for deriving online algorithms that have low regret (the difference between the cumulative loss of the online algorithm and the cumulative loss of the optimal offline solution). Consider the following primal optimization problem:

(2.1)   \inf_{w \in W} P_{t+1}(w) = \inf_{w \in W} \Big( (t\sigma) f(w) + \sum_{i=1}^{t} g_i(w) \Big)

where f : W \to R_+ is a function that measures the complexity of the weight vectors in W, each g_i : W \to R is a loss function, and σ is a non-negative scalar. For example, if W = R^d, f(w) = \frac{1}{2}\|w\|_2^2, and

g_i(w) = \max_{y \in Y} \big[ \rho(y_i, y) - \langle w, \phi(x_i, y_i) - \phi(x_i, y) \rangle \big]_+

where \phi(x, y) : X \times Y \to R^d is a joint feature function, then the above optimization problem is the max-margin structured classification problem [39, 41, 38]. We can rewrite the optimization problem in Eq. 2.1 as follows:

\inf_{w_0, w_1, \dots, w_t} \Big( (t\sigma) f(w_0) + \sum_{i=1}^{t} g_i(w_i) \Big) \quad \text{s.t. } w_0 \in W, \ \forall i \in \{1, \dots, t\}: w_i = w_0

where we introduce new vectors w_1, …, w_t and constrain them all to be equal to w_0. The dual of this problem is:

\sup_{\lambda_1, \dots, \lambda_t} D_{t+1}(\lambda_1, \dots, \lambda_t) = \sup_{\lambda_1, \dots, \lambda_t} \Big[ -(t\sigma) f^*\Big( -\frac{1}{t\sigma} \sum_{i=1}^{t} \lambda_i \Big) - \sum_{i=1}^{t} g_i^*(\lambda_i) \Big]

where each λ_i is a vector of Lagrange multipliers for the equality constraint w_i = w_0, and f^*, g_1^*, …, g_t^* are the Fenchel conjugate functions of f, g_1, …, g_t. The Fenchel conjugate of a function f : W \to R is defined as f^*(\theta) = \sup_{w \in W} ( \langle w, \theta \rangle - f(w) ). See [16] for details of the steps used to derive the dual problem. From the weak duality theorem [3], we know that the dual objective is upper bounded by the optimal value of the primal problem.
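As a concrete instance of the conjugate definition above (a standard derivation, included here only for concreteness), take the complexity function f(w) = \frac{1}{2}\|w\|_2^2 on W = R^d:

f^*(\theta) = \sup_{w} \Big( \langle w, \theta \rangle - \frac{1}{2}\|w\|_2^2 \Big)

The supremum is attained at w = \theta (set the gradient \theta - w to zero), giving

f^*(\theta) = \|\theta\|_2^2 - \frac{1}{2}\|\theta\|_2^2 = \frac{1}{2}\|\theta\|_2^2, \qquad \nabla f^*(\theta) = \theta.

So for this choice of f, the mapping from dual variables to weights used in Algorithm 1 below reduces to w_t = -\frac{1}{t\sigma} \sum_{i=1}^{t-1} \lambda_i^t, a fact used again in Section 3.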
Thus, if an online algorithm can incrementally ascend the dual objective function at each step, then its performance is close to the performance of the best fixed weight vector that minimizes the primal objective function (the best offline learner), since by increasing the dual objective, the algorithm moves closer to the optimal primal value. Based on this observation, Kakade et al. [16] proposed the general online incremental dual ascent algorithm (Algorithm 1), where

\partial g_t(w_t) = \{ \lambda : \forall w \in W, \ g_t(w) - g_t(w_t) \ge \langle \lambda, w - w_t \rangle \}

is the set of subgradients of g_t at w_t.

Algorithm 1: A general incremental dual ascent algorithm for σ-strongly convex loss functions [16]
Input: a strongly convex function f and a positive scalar σ
for t = 1 to T do
    Set: w_t = \nabla f^*\big( -\frac{1}{t\sigma} \sum_{i=1}^{t-1} \lambda_i^t \big)
    Receive: l_t(w_t) = \sigma f(w_t) + g_t(w_t)
    Choose (\lambda_1^{t+1}, \dots, \lambda_t^{t+1}) that satisfy the condition:
        \exists \lambda' \in \partial g_t(w_t) \ \text{s.t.} \ D_{t+1}(\lambda_1^{t+1}, \dots, \lambda_t^{t+1}) \ge D_{t+1}(\lambda_1^t, \dots, \lambda_{t-1}^t, \lambda')
end for

The condition \exists \lambda' \in \partial g_t(w_t) \ \text{s.t.} \ D_{t+1}(\lambda_1^{t+1}, \dots, \lambda_t^{t+1}) \ge D_{t+1}(\lambda_1^t, \dots, \lambda_{t-1}^t, \lambda') ensures that the dual objective is increased at each step. The regret of any algorithm derived from Algorithm 1 is O(log T) [16], where T is the number of examples seen so far. A simple update rule that satisfies the condition in Algorithm 1 is to find a subgradient \lambda'_t \in \partial g_t(w_t), set \lambda_t^{t+1} = \lambda'_t, and keep all other λ_i's unchanged (i.e. \lambda_i^{t+1} = \lambda_i^t for all i < t).
However, the gain in the dual objective for this simple update rule is minimal. To achieve the largest gain in the dual objective, one can optimize all the λ_i's at each step. But this approach is usually computationally prohibitive since at each step we need to solve a large optimization problem:

(\lambda_1^{t+1}, \dots, \lambda_t^{t+1}) = \arg\max_{\lambda_1, \dots, \lambda_t} D_{t+1}(\lambda_1, \dots, \lambda_t)

A compromise approach is to fully optimize the dual objective function at each time step, but only with respect to the last variable λ_t:

\lambda_i^{t+1} = \begin{cases} \lambda_i^t & \text{if } i < t \\ \arg\max_{\lambda_t} D_{t+1}(\lambda_1^t, \dots, \lambda_{t-1}^t, \lambda_t) & \text{if } i = t \end{cases}

This is called the Coordinate-Dual-Ascent (CDA) update rule. If we can find a closed-form solution of the optimization problem with respect to the last variable λ_t, then the computational complexity of the CDA update is similar to that of the simple update, but the gain in the dual objective function is larger. Previous work [34] showed that algorithms which more aggressively ascend the dual function have better performance. In the next section, we show that it is possible to obtain a closed-form solution of the CDA update rule for the case of structured prediction.

3 Online Coordinate-Dual-Ascent Algorithms for Structured Prediction

In this section, we derive new online algorithms for structured prediction based on the algorithmic framework described in the previous section using the CDA update rule. In structured prediction [1], the label y of each example x ∈ X belongs to some structured output space Y. We assume that there is a joint feature function \phi(x, y) : X \times Y \to R^d and that the prediction function takes the following form:

h_w(x) = \arg\max_{y \in Y} \langle w, \phi(x, y) \rangle

So in this case the weight vector w lies in R^d. A standard complexity function used in many tasks is f(w) = \frac{1}{2}\|w\|_2^2. Regarding the loss function g_t, a generalized version of the hinge loss is widely used in max-margin structured prediction [39, 41]:

l_{MM}(w, (x_t, y_t)) = \max_{y \in Y} \big[ \rho(y_t, y) - \langle w, \phi(x_t, y_t) - \phi(x_t, y) \rangle \big]_+

where ρ(y, y') is a non-negative label loss function that measures the difference between two labels y and y', such as the Hamming loss. However, minimizing the above loss results in an optimization problem with a lot of constraints in the primal (one constraint for each possible label y ∈ Y), which is usually expensive to solve. To overcome this problem, we consider two simpler variants of the max-margin loss which only involve a particular label: the maximal loss function and the prediction-based loss function.

Maximal loss (ML) function: This loss function is based on the maximal loss label at step t, y_t^{ML} = \arg\max_{y \in Y} \{ \rho(y_t, y) + \langle w_t, \phi(x_t, y) \rangle \}:

l_{ML}(w, (x_t, y_t)) = \big[ \rho(y_t, y_t^{ML}) - \langle w, \phi(x_t, y_t) - \phi(x_t, y_t^{ML}) \rangle \big]_+

Note that the maximal loss label y_t^{ML} is an input to the maximal loss (this is possible in online learning since the loss is computed after the weight vector w_t is chosen); therefore it does not depend on the weight vector w for which we want to compute the loss. The loss l_{ML}(w_t, (x_t, y_t)) is the greatest loss the algorithm would suffer at step t if it used the maximal loss label y_t^{ML} as the prediction. On the other hand, it checks whether the max-margin constraints are satisfied, since if l_{ML}(w_t, (x_t, y_t)) = 0 then y_t^{ML} = y_t, which means that the current weight vector w_t scores the correct label y_t higher than any other label y, where the difference is at least ρ(y_t, y). So the maximal loss function only concerns the particular constraint for whether the true label y_t is scored higher than the maximal loss label with a margin of ρ(y_t, y_t^{ML}). This is the key difference between the maximal loss and the max-margin loss, since the latter looks at the constraints of all possible labels. The main drawback of the maximal loss is that finding the maximal loss label y_t^{ML}, which is also called the loss-augmented inference problem [38], is only feasible for some decomposable label loss functions [38] such as the Hamming loss (the number of misclassified atoms), since the maximal loss label depends on the label loss function ρ(y, y').
This is the reason why we want to consider the second loss function, the prediction-based loss, which can be used with any label loss function, such as the (1 − F1) loss, where F1 is the harmonic mean of precision and recall.

Prediction-based loss (PL) function: This loss function is based on the predicted label y_t^P = h_{w_t}(x_t) = \arg\max_{y \in Y} \langle w_t, \phi(x_t, y) \rangle:

l_{PL}(w, (x_t, y_t)) = \big[ \rho(y_t, y_t^P) - \langle w, \phi(x_t, y_t) - \phi(x_t, y_t^P) \rangle \big]_+

Like the maximal loss, the prediction-based loss only concerns the constraint for the predicted label y_t^P. We have l_{PL}(w_t, (x_t, y_t)) ≤ l_{ML}(w_t, (x_t, y_t)) since y_t^{ML} is the maximal loss label for w_t. As a result, the update based on the prediction-based loss function is less aggressive than the one based on the maximal loss function. However, the prediction-based loss function can be used with any label loss function, since the predicted label y_t^P does not depend on the label loss function.
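To make the two losses concrete, here is a small hypothetical sketch (not the paper's code): it brute-forces the argmax over an explicitly enumerated candidate set, which is only feasible for toy label spaces, whereas the paper uses MPE and loss-augmented inference for this step.

```python
import numpy as np

def prediction_based_loss(w, phi, x, y_true, candidates, rho):
    """l_PL: hinge loss w.r.t. the predicted label y^P = argmax_y <w, phi(x, y)>."""
    y_pred = max(candidates, key=lambda y: np.dot(w, phi(x, y)))
    violation = rho(y_true, y_pred) - np.dot(w, phi(x, y_true) - phi(x, y_pred))
    return max(0.0, violation), y_pred

def maximal_loss(w, phi, x, y_true, candidates, rho):
    """l_ML: hinge loss w.r.t. the loss-augmented label
    y^ML = argmax_y { rho(y_true, y) + <w, phi(x, y)> }."""
    y_ml = max(candidates, key=lambda y: rho(y_true, y) + np.dot(w, phi(x, y)))
    violation = rho(y_true, y_ml) - np.dot(w, phi(x, y_true) - phi(x, y_ml))
    return max(0.0, violation), y_ml
```

For the same weight vector, the maximal loss always upper-bounds the prediction-based loss, as noted above, since y^ML maximizes the bracketed quantity over all candidate labels.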
To apply the primal-dual algorithmic framework described in the previous section, we need to find the Fenchel conjugate functions of the complexity function f(w) and the loss function g(w). The Fenchel conjugate of the complexity function f(w) = \frac{1}{2}\|w\|_2^2 is itself, i.e. f^*(\theta) = \frac{1}{2}\|\theta\|_2^2 [3]. For the loss function, recall that the Fenchel conjugate of the hinge loss g(w) = [\gamma - \langle w, x \rangle]_+ is:

g^*(\theta) = \begin{cases} -\gamma\alpha & \text{if } \theta \in \{-\alpha x : \alpha \in [0, 1]\} \\ \infty & \text{otherwise} \end{cases}

(Appendix A in [33]). We can see that both the prediction-based loss and the maximal loss have the same form as the hinge loss, where γ is replaced by the label loss ρ(y_t, y_t^P) or ρ(y_t, y_t^{ML}), and x is replaced by Δφ_t^{PL} = \phi(x_t, y_t) - \phi(x_t, y_t^P) or Δφ_t^{ML} = \phi(x_t, y_t) - \phi(x_t, y_t^{ML}) for the prediction-based loss and the maximal loss respectively. Using the result for the hinge loss, the Fenchel conjugates of the prediction-based loss and the maximal loss are (writing Δφ_t for Δφ_t^{PL} or Δφ_t^{ML}, and y_t^{PL|ML} for the corresponding label):

g_t^*(\theta) = \begin{cases} -\rho(y_t, y_t^{PL|ML})\,\alpha & \text{if } \theta \in \{-\alpha\, \Delta\phi_t : \alpha \in [0, 1]\} \\ \infty & \text{otherwise} \end{cases}

The next step is to derive the closed-form solution of the CDA update rule. The optimization problem that we need to solve is:

(3.2)   \arg\max_{\lambda_t} \ -(t\sigma)\, f^*\Big( -\frac{\lambda_{1:(t-1)} + \lambda_t}{t\sigma} \Big) - g_t^*(\lambda_t)

where \lambda_{1:(t-1)} = \sum_{i=1}^{t-1} \lambda_i. Substituting the conjugate functions f^* and g_t^* given above into Eq. 3.2 (so that λ_t = -\alpha\, \Delta\phi_t with α ∈ [0, 1]), we obtain the following optimization problem:

\arg\max_{\alpha \in [0,1]} \ -\frac{t\sigma}{2} \Big\| \frac{\lambda_{1:(t-1)} - \alpha\, \Delta\phi_t}{t\sigma} \Big\|_2^2 + \alpha\, \rho(y_t, y_t^{PL|ML})
= \arg\max_{\alpha \in [0,1]} \Big( -\frac{\alpha^2 \|\Delta\phi_t\|_2^2}{2t\sigma} + \alpha \Big( \rho(y_t, y_t^{PL|ML}) + \frac{1}{t\sigma} \langle \lambda_{1:(t-1)}, \Delta\phi_t \rangle \Big) - \frac{\|\lambda_{1:(t-1)}\|_2^2}{2t\sigma} \Big)

This objective is a function of α only, and in fact it is a concave parabola whose maximum is attained at the point:

\alpha^* = \frac{ (t\sigma)\, \rho(y_t, y_t^{PL|ML}) + \langle \lambda_{1:(t-1)}, \Delta\phi_t \rangle }{ \|\Delta\phi_t\|_2^2 }

If α* ∈ [0, 1], then α* is the maximizer of the problem. If α* < 0, then 0 is the maximizer, and if α* > 1 then 1 is the maximizer. In summary, the solution of the above optimization is:

\alpha_{\max} = \min\Big\{ 1, \ \frac{ \big[ (t\sigma)\, \rho(y_t, y_t^{PL|ML}) + \langle \lambda_{1:(t-1)}, \Delta\phi_t \rangle \big]_+ }{ \|\Delta\phi_t\|_2^2 } \Big\}

To obtain the update in terms of the weight vectors w, we have:

w_{t+1} = \nabla f^*\Big( -\frac{\lambda_{1:t}}{t\sigma} \Big) = -\frac{1}{t\sigma}\big( \lambda_{1:(t-1)} + \lambda_t \big) = -\frac{\lambda_{1:(t-1)}}{t\sigma} + \frac{\alpha_{\max}}{t\sigma}\, \Delta\phi_t
= \frac{\sigma(t-1)\, w_t}{t\sigma} + \frac{\alpha_{\max}}{t\sigma}\, \Delta\phi_t
= \frac{t-1}{t}\, w_t + \min\Big\{ \frac{1}{t\sigma}, \ \frac{ \big[ \rho(y_t, y_t^{PL|ML}) - \langle \frac{t-1}{t} w_t, \Delta\phi_t \rangle \big]_+ }{ \|\Delta\phi_t\|_2^2 } \Big\}\, \Delta\phi_t

where the third equality uses λ_t = -\alpha_{\max}\, \Delta\phi_t, and the fourth uses w_t = -\lambda_{1:(t-1)} / ((t-1)\sigma). The new method is summarized in Algorithm 2.

Algorithm 2: Online Coordinate-Dual-Ascent Algorithms for Structured Prediction
1:  Parameters: a constant σ > 0; a label loss function ρ(y, y')
2:  Initialize: w_1 = 0
3:  for t = 1 to T do
4:      Receive an instance x_t
5:      Predict y_t^P = \arg\max_{y \in Y} \langle w_t, \phi(x_t, y) \rangle
6:      Receive the correct target y_t
7:      (For maximal loss) Compute y_t^{ML} = \arg\max_{y \in Y} \{ \rho(y_t, y) + \langle w_t, \phi(x_t, y) \rangle \}
8:      Compute Δφ_t:
            PL: Δφ_t = \phi(x_t, y_t) - \phi(x_t, y_t^P)
            ML: Δφ_t = \phi(x_t, y_t) - \phi(x_t, y_t^{ML})
9:      Compute the loss:
            PL (CDA): l_t = \big[ \rho(y_t, y_t^P) - \langle \frac{t-1}{t} w_t, \Delta\phi_t \rangle \big]_+
            ML (CDA): l_t = \big[ \rho(y_t, y_t^{ML}) - \langle \frac{t-1}{t} w_t, \Delta\phi_t \rangle \big]_+
10:     Update (CDA): w_{t+1} = \frac{t-1}{t} w_t + \min\big\{ \frac{1}{t\sigma}, \ l_t / \|\Delta\phi_t\|_2^2 \big\}\, \Delta\phi_t
11: end for
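The following is a compact sketch of one pass of Algorithm 2, under simplifying assumptions that are not the authors' implementation: the feature map `phi`, the candidate generator `candidates`, and the label loss `rho` are caller-supplied stand-ins, and the brute-force argmax over `candidates(x)` replaces the exact MPE and loss-augmented inference calls used in the paper.

```python
import numpy as np

def cda_online_learn(examples, phi, candidates, rho, sigma, use_maximal_loss=True):
    """One pass of the CDA online update (sketch of Algorithm 2).

    examples: iterable of (x, y_true) pairs; phi(x, y) -> np.ndarray feature vector;
    candidates(x) -> iterable of labels (stand-in for MPE / loss-augmented inference);
    rho(y, y') -> non-negative label loss; sigma > 0.
    """
    w = None
    for t, (x, y_true) in enumerate(examples, start=1):
        if w is None:
            w = np.zeros_like(phi(x, y_true), dtype=float)               # w_1 = 0 (line 2)
        labels = list(candidates(x))
        y_pred = max(labels, key=lambda y: np.dot(w, phi(x, y)))          # line 5
        if use_maximal_loss:
            y_upd = max(labels,
                        key=lambda y: rho(y_true, y) + np.dot(w, phi(x, y)))  # line 7
        else:
            y_upd = y_pred
        dphi = phi(x, y_true) - phi(x, y_upd)                              # line 8
        shrunk_w = (t - 1) / t * w
        loss = max(0.0, rho(y_true, y_upd) - np.dot(shrunk_w, dphi))       # line 9
        norm_sq = np.dot(dphi, dphi)
        if norm_sq > 0.0:
            eta = min(1.0 / (t * sigma), loss / norm_sq)                   # CDA learning rate
            w = shrunk_w + eta * dphi                                      # line 10
        else:
            w = shrunk_w                                                   # zero update direction
    return w
```

Note that the effective learning rate is min{1/(tσ), l_t/‖Δφ_t‖²₂}. The paper's experiments additionally average the weight vectors over the pass and use the average for test-time prediction; that bookkeeping is omitted here.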
Interestingly, this update formula has the same form as that of the subgradient algorithm [25], which is derived from the simple update criterion:

w_{t+1} = w_t - \frac{1}{t\sigma}\big( \sigma w_t - \Delta\phi_t^{ML} \big) = \frac{t-1}{t}\, w_t + \frac{1}{t\sigma}\, \Delta\phi_t^{ML}

The key difference is in the learning rate. The learning rate of the subgradient algorithm, which is equal to 1/(tσ), does not depend on the loss suffered at each step, while the learning rate of CDA is the minimum of 1/(tσ) and the loss suffered at each step (normalized by ‖Δφ_t‖²₂). In the beginning, when t is small and therefore 1/(tσ) is large (assuming σ is small), CDA's learning rate is controlled by the loss suffered at each step. In contrast, when t is large and therefore 1/(tσ) is small, the learning rate of CDA is driven by the quantity 1/(tσ). In other words, at the beginning, when the model is not good, CDA aggressively updates the model based on the loss suffered at each step; later, when the model is good, it updates the model less aggressively.

We can use the derived CDA algorithm to perform online weight learning for MLNs, since the weight learning problem in MLNs can be cast as a max-margin structured prediction problem [14]. For MLNs, the number of true groundings of the clauses plays the role of the joint feature function φ(x, y).

4 Experimental Evaluation

In this section, we conduct experiments to answer the following questions in the context of MLNs:

1. How does our new online learning algorithm, CDA, compare to existing online max-margin learning methods? In particular, is it better than the subgradient method due to its more aggressive update in the dual?
2. How does it compare to existing batch max-margin weight learning methods?
3. How well does using the prediction-based loss compare to the maximal loss in practice?

4.1 Datasets

We ran experiments on three large, real-world datasets: the CiteSeer dataset [19] for bibliographic citation segmentation, a web search query dataset [24] obtained from Microsoft Research for query disambiguation, and the CoNLL 2005 dataset [4] for semantic role labeling.

For CiteSeer, we used the version created by Poon and Domingos [28] and the simplest MLN in their work, the isolated segmentation model.¹ The dataset contains 1,563 bibliographic citations such as:

J. Jaffar, J.-L. Lassez. Constraint logic programming. In Proceedings of the Fourteenth ACM symposium of the principles of programming languages, pages 111-119, Munich, 1987.

The task is to segment each of these citations into three fields: Author, Title and Venue. The dataset has four independent subsets consisting of citations to disjoint publications in four different research areas.

For the search query disambiguation task, we used the data created by Mihalkova and Mooney [24]. The dataset consists of thousands of search sessions in which ambiguous queries are asked. The data are split into 3 disjoint sets: training, validation, and test. There are 4,618 search sessions in the training set, 4,803 sessions in the validation set, and 11,234 sessions in the test set. In each session, the set of possible search results for a given ambiguous query is given, and the goal is to rank these results based on how likely they are to be clicked by the user. A user may click on more than one result for a given query. To solve this problem, Mihalkova and Mooney [24] proposed three different MLNs which correspond to different levels of information used in disambiguating the query. We used all three MLNs in our experiments. In comparison to the CiteSeer dataset, the search query dataset is larger but much noisier, since a user can click on a result because it is relevant or because the user is just doing an exploratory search.

The CoNLL 2005 dataset contains over 40,000 sentences from the Wall Street Journal (WSJ). Given a sentence, the task is to analyze the propositions expressed by some target verbs of the sentence.
In particular, for each target verb, all of its semantic components must be identified and labeled with their semantic roles, as in the following sentence for the verb accept:

[A0 He] [AM-MOD would] [AM-NEG n't] [V accept] [A1 anything of value] from [A2 those he was writing about].

A verb and its set of semantic roles form a proposition in the sentence, and a sentence usually contains more than one proposition. Each proposition serves as a training example. The dataset consists of three disjoint subsets: training, development, and test. The numbers of propositions (or examples) in the training, development, and test sets are 90,750; 3,248; and 5,267 respectively.² We used the MLN constructed by Riedel [30], which contains clauses that capture the features of constituents and dependencies between semantic components of the same verb.

¹ Both the dataset and the MLN can be found at http://alchemy.cs.washington.edu/data/citeseer/
² We only used the WSJ part of the test set.
Table 1: F1 scores on the CiteSeer dataset. Highest F1 scores are shown in bold.

Algorithms        Constraint  Face    Reasoning  Reinforcement
MM-HM             93.187      92.467  92.581     95.496
1-best-MIRA-HM    90.982      90.598  93.124     97.518
1-best-MIRA-F1    89.764      90.046  93.200     96.841
Subgradient-HM    90.957      89.859  91.505     95.318
CDA-PL-HM         91.245      90.992  92.589     96.516
CDA-PL-F1         91.742      92.368  92.726     96.994
CDA-ML-HM         93.287      93.204  93.448     97.560

4.2 Methodology

To answer the above questions, we ran experiments with the following systems:

MM: The offline max-margin weight learner for MLNs proposed by Huynh and Mooney [14].³
³ This max-margin weight learner has been shown to be comparable to other offline weight learners for MLNs [14].

1-best MIRA: MIRA is one of the first online learning algorithms for structured prediction, proposed by McDonald et al. [22]. A simple version of MIRA, called 1-best MIRA, is widely used in practice since its update rule has a closed-form solution. 1-best MIRA has been used in previous work [31] to learn weights for MLNs. In each round, it updates the weight vector w_t as follows:

w_{t+1} = w_t + \frac{ \big[ \rho(y_t, y_t^P) - \langle w_t, \Delta\phi_t^{PL} \rangle \big]_+ }{ \|\Delta\phi_t^{PL}\|_2^2 }\, \Delta\phi_t^{PL}

Subgradient: This algorithm, proposed by Ratliff et al. [25], is an extension of the Greedy Projection algorithm [42] to the case of structured prediction. Its update rule is an instance of the simple update criterion.

CDA: Our newly derived online learning algorithm presented in Algorithm 2.

Regarding label loss functions, we use the Hamming (HM) loss, which is the standard loss function for structured prediction [39, 41]. As mentioned earlier, the Hamming loss is a decomposable loss function, so it can be used with both the maximal loss and the prediction-based loss. Since F1 is the standard evaluation metric for the citation segmentation task on CiteSeer, we also considered the label loss function 100(1 − F1) [15]. However, since this loss function is not decomposable, we can only use it with the prediction-based loss.

In training, for the online learning algorithms, we use the exact MPE inference based on Integer Linear Programming (ILP) described by Huynh and Mooney [14] on the CiteSeer and web search query datasets, and Cutting Plane Inference [30] on the CoNLL 2005 dataset. For the offline weight learner MM, we use the approximate inference algorithm developed by Huynh and Mooney [14], since it is computationally intractable to run exact inference for all training examples at once. In testing, we use MC-SAT to compute marginal probabilities for the web search query dataset, since we want to rank the query results, and exact MPE inference on the other two datasets.

For all online learning algorithms, we ran one pass over the training set and used the average weight vector to predict on the test set. For CiteSeer, we ran four-fold cross-validation (i.e. leave one topic out). The parameter σ of the Subgradient and CDA algorithms is set based on the performance on the validation set, except on CiteSeer where the parameter is set based on training performance. For testing the statistical significance of differences between algorithms, we use McNemar's test [9] on CiteSeer and a two-sided paired t-test on the web search query dataset. The significance level was set to 5% (p-value smaller than 0.05) in both cases.

4.3 Metrics

Like previous work, for citation segmentation on CiteSeer, we used F1 at the token level to measure the performance of each algorithm; for search query disambiguation, we used MAP (Mean Average Precision), which measures how close the relevant results are to the top of the ranking; and for semantic role labeling on CoNLL 2005, we used F1 of the predicted arguments as described in [4].

4.4 Results and Discussion

Table 1 presents the F1 scores of the different algorithms on CiteSeer.
On this dataset, the CDA algorithm with the maximal loss, CDA-ML-HM, has the best F1 scores across all four folds. These results are statistically significantly better than those of the subgradient method. So the more aggressive update in the dual results in better F1 scores. The F1 scores of CDA-ML-HM are a little higher than those of 1-best-MIRA, but the difference is not significant. Interestingly, with
the possibility of using exact inference in training, CDA is a little more accurate than the batch max-margin algorithm (MM), since the batch learner can only afford to use approximate inference in training.

Other advantages of online algorithms are in terms of training time and memory. Table 2 shows the average training time of the different algorithms on this dataset.

Table 2: Average training time on the CiteSeer dataset.

Algorithms        Average training time
MM-HM             90.282 min.
1-best-MIRA-HM    11.772 min.
1-best-MIRA-F1    11.768 min.
Subgradient-HM    12.655 min.
CDA-PL-HM         11.869 min.
CDA-PL-F1         11.915 min.
CDA-ML-HM         12.887 min.

All online learning algorithms took on average about 12-13 minutes for training, while the batch one took an hour and a half on the same machine. In addition, since online algorithms process one example at a time, they use much less memory than batch methods. On the other hand, the running time results also confirm that the new algorithm, CDA, has the same computational complexity as other existing online methods. Regarding the comparison between the maximal loss and the prediction-based loss, the former is better than the latter on this dataset due to its more aggressive updates. For the prediction-based loss function, there is not much difference between using different label loss functions in this case.

Table 3 shows the MAP scores of the different algorithms on the Microsoft web search query dataset.

Table 3: MAP scores on the Microsoft search query dataset. Highest MAP scores are shown in bold.

Algorithms       MLN1   MLN2   MLN3
CD               0.375  0.386  0.366
1-best-MIRA-HM   0.366  0.375  0.379
Subgradient-HM   0.374  0.397  0.396
CDA-PL-HM        0.382  0.397  0.398
CDA-ML-HM        0.380  0.397  0.397

The first row in the table is from Mihalkova and Mooney [24], who used a variant of the structured perceptron [5] called Contrastive Divergence (CD) [12] to do online weight learning for MLNs. It is clear that the CDA algorithm has better MAP scores than CD. For this dataset, we were unable to run offline weight learning since the large amount of training data exhausted memory during training. The 1-best MIRA has the worst MAP scores on this dataset. This behavior can be explained as follows. From the update rule of the 1-best MIRA algorithm, we can see that it aggressively updates the weight vector according to the loss incurred in each round. Since this dataset is noisy, this update rule leads to overfitting. This also explains why the subgradient algorithm has good performance on this data, since its update rule does not depend on the loss incurred in each round. The MAP scores of the CDA algorithms are not significantly better than those of the subgradient method, but their performance is more consistent across the three MLNs. Regarding the loss function, the MAP scores of CDA-PL and CDA-ML are almost the same.

Figure 1: Learning curve on CoNLL 2005.

Figure 1 shows the learning curves of the three online learning algorithms, CDA, 1-best MIRA and subgradient, on the CoNLL 2005 dataset. In general, the relative accuracy of the three algorithms is similar to what we have seen on CiteSeer. CDA outperforms the subgradient method across the whole learning curve. In particular, at 30,000 training examples, about 1/3 of the training set, the F1 score of CDA is already better than that of the subgradient method trained on the whole training set. The performance of CDA and 1-best MIRA are comparable to each other, except on the early part of the learning curve (less than 10,000 examples), where the F1 scores of CDA are about 1 to 2 percentage points higher than those of 1-best MIRA.

The CoNLL 2005 dataset was carefully annotated by experts [26], which is a time-consuming and expensive process. Nowadays, a faster and cheaper way to obtain this type of annotation is using crowdsourcing
services such as Amazon Mechanical Turk,⁴ which makes it possible to assign annotation jobs to thousands of people and get results back in a few hours [37]. However, a downside of this approach is the high variance in the quality of labels obtained from different annotators. As a result, there is a lot of noise in the annotated data. To simulate this type of noisy labeled data, we introduce random noise into the CoNLL 2005 dataset. At p percent noise, there is probability p that an argument in a proposition is swapped with another argument in the same proposition. For example, an argument with role A0 may be swapped with an argument with role A1 and vice versa.

⁴ https://www.mturk.com/mturk/

Figure 2: F1 scores on noisy CoNLL 2005.

Figure 2 shows the F1 scores of the above three online learning algorithms on the noisy CoNLL 2005 dataset at various levels of noise. In the presence of noise, CDA is the most accurate and also the most robust to noise among the three algorithms. For 10% noise and higher, CDA is significantly better than the other two methods. The F1 score of CDA at a noise level of 50% is 8.5% higher than that of 1-best MIRA and 12.6% higher than that of the subgradient method. On the other hand, compared with the F1 score on the clean dataset, the F1 score of CDA at 50% noise only drops 8.4 points, while those of 1-best MIRA and subgradient drop about 17.6 and 16.1 points respectively. In addition, the F1 score of CDA at 50% noise is higher than the F1 score of 1-best MIRA at 35% noise and comparable to the F1 score of the subgradient method at 20% noise.

In summary, our new online learning algorithm CDA generally has better accuracy than existing max-margin online methods for structured prediction such as 1-best MIRA and the subgradient method, which have been shown to achieve good performance in previous work. In particular, CDA is significantly better than the other methods on noisy datasets.

5 Related Work

Online learning for max-margin structured prediction has been studied in several pieces of previous work. In addition to those mentioned earlier, a family of online algorithms similar to 1-best MIRA, called passive-aggressive algorithms, was presented in [7]. Another piece of related work is the exponentiated gradient algorithm [2, 6], which also performs updates based on the dual of the primal problem. However, the dual problem in [2, 6] is more complicated and expensive to solve, since it was derived based on the max-margin loss l_MM. As a result, to efficiently solve the problem, the authors assume that each label y is a set of parts and that both the joint feature and the label loss function can be decomposed into a sum over those individual parts. Even under this assumption, efficiently computing the marginal values of the part variables is still a challenging problem.

In the context of online weight learning for MLNs, one related algorithm is SampleRank [8], which uses a sampling algorithm to generate samples from a given training example and updates the weight vector whenever it misranks a pair of samples. So unlike traditional online learning algorithms that perform one update per example, SampleRank performs multiple updates per example. However, the performance of SampleRank depends heavily on the sampling algorithm, and which sampling algorithms are best is an open research question. The issue of prediction-based loss versus maximal loss has been discussed previously [7, 32], but no experiments have been conducted to compare them on real-world datasets.

6 Future Work

In this work, we applied our derived online learning algorithm to MLNs, but it can be used for any structured prediction model.
So it would be interesting to apply the same method to other structured prediction models such as M3Ns [39], Structural SVMs [41], RMNs [40], and FACTORIE [21]. On the other hand, like most online learners, our algorithm assumes that the model's structure (e.g. the set of clauses in an MLN) is correct, and only updates the model parameters (e.g. the weights of an MLN). However, in practice, the input structure is usually not optimal, so it should also be revised.
A number of methods for learning and revising MLN structure have been developed [23, 18]; however, they are all batch algorithms that do not scale adequately to very large training sets. We are currently developing a new algorithm that performs both online parameter and structure learning.

7 Conclusions

We have presented a comprehensive study of online weight learning for MLNs. Based on the primal-dual framework, we derived a new CDA online algorithm for structured prediction, applied it to learn weights for MLNs, and compared it to existing online methods on three large, real-world datasets. Our new algorithm generally achieved better accuracy than existing online methods. In particular, our new algorithm is more accurate and robust when the training data is noisy.

Acknowledgments

The authors thank Sebastian Riedel for providing the MLN for the semantic role labeling task on CoNLL 2005. We also thank IBM for the free academic license of CPLEX. This research is supported by a gift from Microsoft Research and by ARO MURI grant W911NF-08-1-0242. Most of the experiments were run on the Mastodon Cluster, provided by NSF Grant EIA-0303609. The first author also thanks the Vietnam Education Foundation (VEF) for its sponsorship.

References

[1] Bakir, G.H., Hofmann, T., Schölkopf, B., Smola, A.J., Taskar, B., Vishwanathan, S.V.N. (eds.): Predicting Structured Data. The MIT Press (2007)
[2] Bartlett, P.L., Collins, M., Taskar, B., McAllester, D.A.: Exponentiated gradient algorithms for large-margin structured classification. In: Advances in Neural Information Processing Systems 17, NIPS 2004, December 13-18, 2004, Vancouver, British Columbia, Canada (2005)
[3] Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press (2004)
[4] Carreras, X., Màrquez, L.: Introduction to the CoNLL-2005 shared task: Semantic role labeling. In: Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005). pp. 152-164. Ann Arbor, MI (Jun 2005)
[5] Collins, M.: Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP-02). Philadelphia, PA (Jul 2002)
[6] Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P.L.: Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research 9, 1775-1822 (2008)
[7] Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive-aggressive algorithms. Journal of Machine Learning Research 7, 551-585 (2006)
[8] Culotta, A.: Learning and inference in weighted logic with application to natural language processing. Ph.D. thesis, University of Massachusetts (2008)
[9] Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10(7), 1895-1923 (1998)
[10] Domingos, P., Lowd, D.: Markov Logic: An Interface Layer for Artificial Intelligence. Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers (2009)
[11] Getoor, L., Taskar, B. (eds.): Introduction to Statistical Relational Learning. MIT Press, Cambridge, MA (2007)
[12] Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Computation 14(8), 1771-1800 (2002)
[13] Huynh, T.N., Mooney, R.J.: Discriminative structure and parameter learning for Markov logic networks. In: Proceedings of the 25th International Conference on Machine Learning (ICML-2008). pp. 416-423. Helsinki, Finland (2008)
[14] Huynh, T.N., Mooney, R.J.: Max-margin weight learning for Markov logic networks.
In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2009), Part I. pp. 564-579 (2009)
[15] Joachims, T.: A support vector method for multivariate performance measures. In: Proceedings of 22nd International Conference on Machine Learning (ICML-2005). pp. 377-384 (2005)
[16] Kakade, S.M., Shalev-Shwartz, S.: Mind the duality gap: Logarithmic regret algorithms for online optimization. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems 21, Vancouver, British Columbia, Canada, December 8-11, 2008. pp. 1457-1464. MIT Press (2009)
[17] Kautz, H., Selman, B., Jiang, Y.: A general stochastic approach to solving problems with hard and soft constraints. In: Dingzhu Gu, J.D., Pardalos, P. (eds.) The Satisfiability Problem: Theory and Applications. pp. 573-586. American Mathematical Society (1997)
[18] Kok, S., Domingos, P.: Learning Markov logic networks using structural motifs. In: Fürnkranz, J., Joachims, T. (eds.) Proceedings of the 27th International Conference on Machine Learning (ICML-10). pp. 551-558. Haifa, Israel (June 2010)
[19] Lawrence, S., Giles, C.L., Bollacker, K.D.: Autonomous citation matching. In: Proceedings of the Third Annual Conference on Autonomous Agents (1999)
[20] Lowd, D., Domingos, P.: Efficient weight learning
for Markov logic networks. In: Proceedings of 7th European Conference of Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-2007). pp. 200-211 (2007)
[21] McCallum, A., Schultz, K., Singh, S.: FACTORIE: Probabilistic programming via imperatively defined factor graphs. In: Advances in Neural Information Processing Systems 22 (NIPS-2009). pp. 1249-1257 (2009)
[22] McDonald, R., Crammer, K., Pereira, F.: Online large-margin training of dependency parsers. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05). pp. 91-98. Association for Computational Linguistics, Morristown, NJ, USA (2005)
[23] Mihalkova, L., Mooney, R.J.: Bottom-up learning of Markov logic network structure. In: Proceedings of 24th International Conference on Machine Learning (ICML-2007). Corvallis, OR (June 2007)
[24] Mihalkova, L., Mooney, R.J.: Learning to disambiguate search queries from short sessions. In: Buntine, W.L., Grobelnik, M., Mladenic, D., Shawe-Taylor, J. (eds.) Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2009), Part II. pp. 111-127 (2009)
[25] Ratliff, N.D., Bagnell, J.A., Zinkevich, M.: (Online) subgradient methods for structured prediction. In: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AIStats) (2007)
[26] Palmer, M., Gildea, D., Kingsbury, P.: The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics 31(1), 71-106 (2005)
[27] Poon, H., Domingos, P.: Sound and efficient inference with probabilistic and deterministic dependencies. In: Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06). Boston, MA (July 2006)
[28] Poon, H., Domingos, P.: Joint inference in information extraction. In: Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI-07). pp. 913-918. Vancouver, British Columbia, Canada (2007)
[29] Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62, 107-136 (2006)
[30] Riedel, S.: Improving the accuracy and efficiency of MAP inference for Markov logic. In: Proceedings of 24th Conference on Uncertainty in Artificial Intelligence (UAI-2008). pp. 468-475. Helsinki, Finland (2008)
[31] Riedel, S., Meza-Ruiz, I.: Collective semantic role labelling with Markov logic. In: Proceedings of the Twelfth Conference on Computational Natural Language Learning (CoNLL 2008) (2008)
[32] Shalev-Shwartz, S.: Online Learning: Theory, Algorithms, and Applications. Ph.D. thesis, The Hebrew University of Jerusalem (2007)
[33] Shalev-Shwartz, S., Singer, Y.: Convex repeated games and Fenchel duality. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems 19, pp. 1265-1272. MIT Press (2007)
[34] Shalev-Shwartz, S., Singer, Y.: A unified algorithmic approach for efficient online label ranking. In: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AIStats) (2007)
[35] Singla, P., Domingos, P.: Discriminative training of Markov logic networks. In: Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05). pp. 868-873 (2005)
[36] Singla, P., Domingos, P.: Lifted first-order belief propagation. In: Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI-08). pp. 1094-1099. Chicago, Illinois, USA (2008)
[37] Snow, R., O'Connor, B., Jurafsky, D., Ng, A.Y.: Cheap and fast - but is it good? Evaluating non-expert annotations for natural language tasks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2008). pp. 254-263.
Association for Computational Linguistics, Morristown, NJ, USA (2008)
[38] Taskar, B., Chatalbashev, V., Koller, D., Guestrin, C.: Learning structured prediction models: A large margin approach. In: Proceedings of 22nd International Conference on Machine Learning (ICML-2005). pp. 896-903. ACM, Bonn, Germany (2005)
[39] Taskar, B., Guestrin, C., Koller, D.: Max-margin Markov networks. In: Advances in Neural Information Processing Systems 16 (NIPS 2003) (2003)
[40] Taskar, B., Abbeel, P., Koller, D.: Discriminative probabilistic models for relational data. In: Proceedings of 18th Conference on Uncertainty in Artificial Intelligence (UAI-2002). pp. 485-492. Edmonton, Canada (2002)
[41] Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Support vector machine learning for interdependent and structured output spaces. In: Proceedings of 21st International Conference on Machine Learning (ICML-2004). pp. 104-112. Banff, Canada (July 2004)
[42] Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of 20th International Conference on Machine Learning (ICML-2003). pp. 928-936 (2003)