Stackelberg Games for Adversarial Prediction Problems

Michael Brückner
Department of Computer Science
University of Potsdam, Germany
mibrueck@cs.uni-potsdam.de

Tobias Scheffer
Department of Computer Science
University of Potsdam, Germany
scheffer@cs.uni-potsdam.de

ABSTRACT
The standard assumption of identically distributed training and test data is violated when test data are generated in response to a predictive model. This becomes apparent, for example, in the context of email spam filtering, where an email service provider employs a spam filter and the spam sender can take this filter into account when generating new emails. We model the interaction between learner and data generator as a Stackelberg competition in which the learner plays the role of the leader and the data generator may react to the leader's move. We derive an optimization problem to determine the solution of this game and present several instances of the Stackelberg prediction game. We show that the Stackelberg prediction game generalizes existing prediction models. Finally, we explore properties of the discussed models empirically in the context of email spam filtering.

Categories and Subject Descriptors
I.5.1 [Pattern Recognition]: Models - statistical; H.4.3 [Information Systems Applications]: Communications Applications - electronic mail

General Terms
Theory, Algorithms

Keywords
Adversarial Classification, Stackelberg Competition, Prediction Game, Spam Filtering

1. INTRODUCTION
A common assumption on which most learning algorithms are based is that training and test data are governed by identical distributions. However, in a variety of applications, the distribution that governs data at application time may be influenced by an adversary whose interests conflict with those of the learner. Consider, for instance, the following three scenarios.
In computer and network security, scripts that control attacks are engineered with botnet and intrusion detection systems in mind. Credit card fraudsters adapt their unauthorized use of credit cards (in particular, amounts charged per transaction and per day, and the type of businesses that amounts are charged from) so as not to trigger alerting mechanisms employed by credit card companies. Email spam senders design message templates that are instantiated by nodes of botnets; templates are specifically designed to produce a low spam score with current spam filters. The domain of email spam filtering will serve as a running example throughout the paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'11, August 21-24, 2011, San Diego, California, USA. Copyright 2011 ACM 978--4503-083-7//08...$0.00.

In all of these applications, assailants factor information about countermeasures that are being employed into the process of data generation. The interaction between learner and data generators can be modeled as a game in which one player controls the predictive model whereas another exercises some control over the process of data generation. The adversary's influence on the generation of the data can be mathematically modeled as a transformation that is imposed on the distribution that governs the data at training time. The transformed distribution then governs the data at application time. The optimization criterion of either player takes as arguments both the predictive model chosen by the learner and the transformation carried out by the adversary. Typically, this problem is modeled under the worst-case assumption that the adversary desires to impose the highest possible costs on the learner.
This amounts to a zero-sum game in which the loss of one player is the gain of the other. In this setting, both players can maximize their expected outcome by following a minimax strategy. El Ghaoui et al. [5] derive a minimax model for input data that are known to lie within some hyper-rectangles around the training instances. Their solution minimizes the worst-case loss over all possible choices of the data in these intervals. Lanckriet et al. [0] study the minimax probability machine. This classifier minimizes the maximal probability of misclassifying new instances for a given mean and covariance matrix of each class. Geometrically, this solution corresponds to a minimax strategy with hyper-ellipsoids around the training instances, rather than hyper-rectangles. Similarly, worst-case solutions to classification games in which the adversary deletes input features or performs arbitrary feature transformations have been studied [3, 6, 7, 4, 4].

Several applications motivate problem settings in which the goals of the learner and the data generator, while still conflicting, are not necessarily entirely antagonistic. For instance, a fraudster's goal of maximizing the profit made from exploiting phished account information is not the inverse of an email service provider's goal of achieving a high spam recognition rate at close-to-zero false positives. When playing a minimax strategy, one often makes overly pessimistic assumptions about the adversary's behavior and may not necessarily obtain an optimal outcome.

For games that do not exhibit the zero-sum property, a game-theoretic model has been studied that assumes both players to commit to their actions simultaneously []; that is, without information about the opponent's course of action. When the parameter space of the learner's model and the adversary's transformation and both players' loss functions satisfy specific criteria (e.g., the loss functions have to be monotonic with distinct monotonicity and twice differentiable), then the prediction game has a unique Nash equilibrium that can be found by solving a compact optimization problem []. The Nash equilibrium is a combination of parameters for the predictive model and the adversary's transformation which has the property that neither player benefits by unilaterally deviating from it.

For the learner, playing the Nash equilibrium instead of the minimax strategy is an optimal course of action under the following sufficient conditions: First, the adversary has to be trusted to behave rationally in the sense of maximizing their profit by playing a Nash strategy, too. If the learner plays the Nash equilibrium but the adversary deviates from that equilibrium, then both players may fare arbitrarily poorly. Secondly, a unique equilibrium needs to exist, since a combination of actions from two distinct equilibria may lead to an arbitrarily poor outcome for either player. Thirdly, the adversary must not have any information about the predictive model that the learner commits to before generating the data. In practice, this assumption can be violated when the adversary is able to probe the predictive model. If any of the above three conditions is violated, no guarantees on the optimality can be given and, consequently, a learner may be ill-advised to play the Nash equilibrium.
In practice, a spam sender may follow heuristics derived from past experience and experiments with the filter. Such a setting in which both players act non-simultaneously can be modeled as a Stackelberg competition which allows one player (the follower) to be potentially fully informed about the move of the other player (the leader). We model adversarial learning as a Stackelberg competition in which the learner acts as leader by committing to a predictive model in the first step. The model is then disclosed to the follower (the data generator) who then gets to transform the input distribution.

Some authors [9, 2] study the case in which the data generator acts as leader and the learner as follower. This reflects a setting in which the adversary discloses how the future distribution will differ from the current distribution before the learner has to commit to a model, which contradicts the intuition of an adversarial model-building problem. When the data generator acts as leader and discloses the data transformation, the learner only has to solve a simple optimization problem in order to minimize the risk on the transformed data points.

The rest of this paper is organized as follows. Section 2 introduces the problem setting. We formalize the Stackelberg prediction game, derive an optimization problem to determine the Stackelberg equilibrium, and show how to employ kernel functions in Section 3. In Section 4, we present three instances of the SPG and discuss their relation to existing prediction models. We report on experiments on email spam filtering in Section 5; Section 6 concludes.

2. PROBLEM SETTING
We study prediction games between two players: the learner ($v = -1$) and an adversary, the data generator ($v = +1$). In our running example of email spam filtering, we study the competition between recipient and senders, not competition among senders. To this end, $v = -1$ refers to the recipient whereas $v = +1$ models the entirety of all legitimate and abusive email senders as a single, amalgamated player.
In the past, the data generator $v = +1$ produced a sample $D = \{(x_i, y_i)\}_{i=1}^n$ of $n$ training instances $x_i \in \mathcal{X}$ with corresponding class labels $y_i \in \mathcal{Y} = \{-1, +1\}$. These object-class pairs are drawn according to a training distribution with density function $p(x, y)$. By contrast, future object-class pairs, produced by the data generator at application time, are drawn from some test distribution with density $\dot{p}(x, y)$ which may differ from $p(x, y)$.

The task of the learner $v = -1$ is to select the parameters $w \in \mathbb{R}^m$ of a predictive model $h(x) = \operatorname{sign} f_w(x)$ implemented in terms of a generalized linear decision function $f_w : \mathcal{X} \to \mathbb{R}$ with $f_w(x) = w^T \phi(x)$ and feature mapping $\phi : \mathcal{X} \to \mathbb{R}^m$. The learner's theoretical costs at application time are given by

$$\theta_{-1}(w, \dot{p}) = \sum_{\mathcal{Y}} \int_{\mathcal{X}} c_{-1}(x, y)\, l_{-1}(f_w(x), y)\, \dot{p}(x, y)\, dx,$$

where weighting function $c_{-1} : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ and loss function $l_{-1} : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ detail the weighted loss $c_{-1}(x, y)\, l_{-1}(f_w(x), y)$ that the learner incurs when the predictive model classifies instance $x$ as $h(x) = \operatorname{sign} f_w(x)$ while the true label is $y$. The positive class- and instance-specific weighting factors $c_{-1}(x, y)$ with $E[c_{-1}(X, Y)] = 1$ specify the importance of minimizing the loss $l_{-1}(f_w(x), y)$ for the corresponding object-class pair $(x, y)$. For instance, in spam filtering, the correct classification of non-spam messages can be business-critical for email service providers while failing to detect spam messages runs up processing and storage costs, depending on the size of the message.

The data generator $v = +1$ can modify the data generation process for future instances. In practice, spam senders update their campaign templates which are disseminated to the nodes of botnets. Formally, the data generator transforms the training distribution with density $p$ to the test distribution with density $\dot{p}$. The data generator incurs transformation costs by modifying the data generation process, which are quantified by $\Omega_{+1}(p, \dot{p})$. This term acts as a regularizer on the transformation and may implicitly constrain the shift that can be imposed on the distribution, depending on the nature of the application that is to be modeled.
For instance, the email sender may not be allowed to alter the training distribution for non-spam messages, or to modify the nature of the messages by changing the label from spam to non-spam or vice versa. Additionally, changing the training distribution for spam messages may run up costs depending on the extent of distortion inflicted on the informational payload.

The theoretical costs of the data generator at application time are the sum of the expected prediction costs and the transformation costs,

$$\theta_{+1}(w, \dot{p}) = \sum_{\mathcal{Y}} \int_{\mathcal{X}} c_{+1}(x, y)\, l_{+1}(f_w(x), y)\, \dot{p}(x, y)\, dx + \Omega_{+1}(p, \dot{p}).$$

In analogy to the learner's costs, $c_{+1}(x, y)\, l_{+1}(f_w(x), y)$ quantifies the loss that the data generator incurs when instance $x$ is labeled as $h(x) = \operatorname{sign} f_w(x)$ while the true label is $y$. The weighting factors $c_{+1}(x, y)$ with $E[c_{+1}(X, Y)] = 1$ express the significance of $(x, y)$ from the perspective of the data generator. In our example scenario, this allows us to reflect that costs of correctly or incorrectly classified instances may vary greatly across the different physical senders that are aggregated into the amalgamated player.

Since the theoretical costs of both players depend on the test distribution, they cannot, for all practical purposes, be calculated. Hence, we focus on a regularized, empirical counterpart of the theoretical costs based on the training sample $D$. The empirical counterpart $\hat{\Omega}_{+1}(D, \dot{D})$ of the data generator's regularizer $\Omega_{+1}(p, \dot{p})$ penalizes the divergence between training sample $D = \{(x_i, y_i)\}_{i=1}^n$ and a perturbed training sample $\dot{D} = \{(\dot{x}_i, y_i)\}_{i=1}^n$ that would be the outcome of applying the transformation that translates $p$ into $\dot{p}$ to sample $D$. The learner's cost function, instead of integrating over $\dot{p}$, sums over the elements of the perturbed training sample $\dot{D}$. The players' empirical cost functions can still only be evaluated after the learner has committed to parameters $w$ and the data generator to a transformation from training to test density function, but this transformation need only be represented in terms of the effects that it will have on the training sample $D$. The transformed training sample $\dot{D}$ must not be mistaken for test data; test data will be generated under $\dot{p}$ at application time after the players have committed to their actions.
The empirical costs incurred by the predictive model $h$ with parameters $w$ and the shift from $p$ to $\dot{p}$ amount to

$$\hat{\theta}_{-1}(w, \dot{D}) = \sum_{i=1}^n c_{-1,i}\, l_{-1}(f_w(\dot{x}_i), y_i) + \rho_{-1} \hat{\Omega}_{-1}(w), \tag{1}$$

$$\hat{\theta}_{+1}(w, \dot{D}) = \sum_{i=1}^n c_{+1,i}\, l_{+1}(f_w(\dot{x}_i), y_i) + \rho_{+1} \hat{\Omega}_{+1}(D, \dot{D}), \tag{2}$$

where we have replaced the weighting terms $c_v(\dot{x}_i, y_i)$ by constant cost factors $c_{v,i} > 0$ with $\sum_i c_{v,i} = 1$. The learner's regularizer $\hat{\Omega}_{-1}(w)$ in (1) accounts for the fact that $\dot{D}$ does not constitute the test data itself, but is merely a training sample transformed to reflect the test distribution and then used to learn the model parameters $w$. The trade-off between the empirical loss and the regularizer is controlled by each player's regularization parameter $\rho_v > 0$ for $v \in \{-1, +1\}$.

In our analysis, we estimate the transformation costs by the average squared $\ell_2$-distance between $x_i$ and $\dot{x}_i$ in feature space,

$$\hat{\Omega}_{+1}(D, \dot{D}) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \|\phi(\dot{x}_i) - \phi(x_i)\|^2. \tag{3}$$

The learner's regularizer $\hat{\Omega}_{-1}$ penalizes the complexity of the predictive model $h(x) = \operatorname{sign} f_w(x)$. For our analysis, we consider Tikhonov regularization which, for linear decision functions $f_w$, reduces to the squared $\ell_2$-norm of $w$,

$$\hat{\Omega}_{-1}(w) = \frac{1}{2} \|w\|^2. \tag{4}$$

Note that either player's empirical costs $\hat{\theta}_v(w, \dot{D})$ depend on both players' actions. The concept of an optimal choice of model parameters $w$ regardless of the adversary's choice of a data transformation is therefore not well-defined. In the following section, we will refer to the Stackelberg model which identifies the concept of an optimal move of the leader which minimizes $\hat{\theta}_{-1}$ over $w$ under the assumption that the follower will react by minimizing $\hat{\theta}_{+1}$ over $\dot{D}$ given the parameters $w$ chosen by the leader.

3. STACKELBERG PREDICTION GAME
We model the prediction game as a Stackelberg competition; we refer to the resulting model as the Stackelberg prediction game (SPG). A Stackelberg game is one of the simplest dynamic games: In the first stage, the leader (in our case, the learner) decides on a predictive model $h(x) = \operatorname{sign} f_w(x)$ with parameters $w$.
In the second stage, the data generator, who plays the part of the follower, observes the leader's decision and chooses a transformation that changes the distribution of past instances into the distribution of future instances. In this scenario, the learner has to commit to a set of parameters unilaterally whereas the data generator can take the model parameters $w$ into account when preparing the data transformation.

The optimality of a Stackelberg equilibrium (which we will now introduce) rests on the assumption that the follower (the data generator) will act rationally in the sense of choosing a transformation that minimizes the resulting costs $\hat{\theta}_{+1}$ given the disclosed $w$. To reach minimal costs given $w$, the data generator has to identify a sample $\dot{D}$ that constitutes a global minimum of the cost function $\hat{\theta}_{+1}(w, \dot{D})$. There may be several global minima with identical values of the cost function; in general, the data generator has to identify any element $\dot{D}$ from the set of optimal responses to $w$,

$$\dot{\mathcal{D}}(w) = \Big\{ \{(\dot{x}_i, y_i)\}_{i=1}^n : \{\dot{x}_i\}_{i=1}^n \in \operatorname*{argmin}_{\dot{x}_1, \dots, \dot{x}_n \in \mathcal{X}} \hat{\theta}_{+1}\big(w, \{(\dot{x}_i, y_i)\}_{i=1}^n\big) \Big\}.$$

Identifying an element $\dot{D} \in \dot{\mathcal{D}}(w)$ amounts to solving a regular optimization problem because $w$ can be observed before $\dot{D}$ has to be chosen.

A Stackelberg equilibrium is now identified by backward induction. Assuming that the data generator will decide for any $\dot{D} \in \dot{\mathcal{D}}(w)$, the learner has to choose model parameters $w^*$ that minimize the learner's cost function $\hat{\theta}_{-1}$ for any of the possible reactions $\dot{D} \in \dot{\mathcal{D}}(w)$ that are optimal for the data generator:

$$w^* \in \operatorname*{argmin}_{w \in \mathbb{R}^m} \max_{\dot{D} \in \dot{\mathcal{D}}(w)} \hat{\theta}_{-1}(w, \dot{D}). \tag{5}$$

An action $w^*$ that minimizes the learner's costs and a corresponding optimal action $\dot{D}^* \in \dot{\mathcal{D}}(w^*)$ of the data generator are called a Stackelberg equilibrium. The Stackelberg equilibrium is a special case of a subgame perfect equilibrium which is an extension of the Nash equilibrium for games that are played non-simultaneously.
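The backward induction behind Equation 5 can be illustrated on a finite toy game. The following is only a sketch under strong simplifying assumptions that are not part of the paper: both action sets are small grids, the sample consists of a single spam instance with a scalar feature, and the function name `stackelberg_by_enumeration` as well as all constants are hypothetical. For each candidate `w`, the follower's optimal responses are collected first; ties are resolved against the leader (the inner maximum), and the leader then keeps the commitment with the smallest resulting cost.

```python
import math


def stackelberg_by_enumeration(leader_actions, follower_actions,
                               leader_cost, follower_cost, tol=1e-12):
    """Return (w, response, leader_value) of a finite game via backward induction."""
    best = None
    for w in leader_actions:
        # Stage 2: the follower observes w and minimizes its own cost.
        costs = [follower_cost(w, d) for d in follower_actions]
        lowest = min(costs)
        responses = [d for d, c in zip(follower_actions, costs) if c <= lowest + tol]
        # Ties among optimal responses are resolved against the leader (inner max).
        worst = max(responses, key=lambda d: leader_cost(w, d))
        value = leader_cost(w, worst)
        # Stage 1: the leader keeps the commitment with the smallest such cost.
        if best is None or value < best[2]:
            best = (w, worst, value)
    return best


# Toy instance (all numbers are illustrative assumptions): one spam email with
# scalar feature X and label Y = +1, scalar weight w, scalar shifted feature x_dot.
X, Y, RHO_LEARNER, RHO_SENDER = 1.0, 1, 0.1, 1.0


def learner_cost(w, x_dot):
    # Logistic loss of the learner plus Tikhonov regularizer.
    return math.log(1.0 + math.exp(-Y * w * x_dot)) + 0.5 * RHO_LEARNER * w * w


def sender_cost(w, x_dot):
    # Linear loss of the sender plus squared-distance transformation cost.
    return w * x_dot + 0.5 * RHO_SENDER * (x_dot - X) ** 2


weights = [i / 10.0 for i in range(0, 31)]
shifts = [i / 10.0 for i in range(-30, 31)]
w_star, x_dot_star, value = stackelberg_by_enumeration(
    weights, shifts, learner_cost, sender_cost)
```

Enumeration is only feasible for such tiny grids; it is meant to make the order of the two stages explicit, not to suggest a practical solver.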

3.1 Finding a Stackelberg Equilibrium
Equation 5 establishes a hierarchical mathematical program (specifically, a bilevel optimization problem) with upper-level objective $\hat{\theta}_{-1}$ and lower-level objective $\hat{\theta}_{+1}$:

$$\min_{w \in \mathbb{R}^m} \; \max_{\forall i:\, \dot{x}_i \in \mathcal{X}} \; \hat{\theta}_{-1}\big(w, \{(\dot{x}_i, y_i)\}_{i=1}^n\big) \tag{6}$$

$$\text{s.t.} \quad \{\dot{x}_i\}_{i=1}^n \in \operatorname*{argmin}_{\dot{x}_1, \dots, \dot{x}_n \in \mathcal{X}} \hat{\theta}_{+1}\big(w, \{(\dot{x}_i, y_i)\}_{i=1}^n\big) \tag{7}$$

Bilevel programs are intrinsically hard to solve. Even the simplest instance in which all constraints and objectives are linear is known to be NP-hard [8]. The main difficulties arise from the constraints $\dot{x}_i \in \mathcal{X}$ of the lower-level optimization problem which generally render constraint 7 of the upper-level optimization problem non-differentiable in $w$, even if $\hat{\theta}_{+1}$ is continuously differentiable in $w$ and $\dot{x}_i$ for $i = 1, \dots, n$.

Numerous approaches that address bilevel programs have been studied, for instance, based on gradient descent, penalty function, and trust-region methods; see [2] for a detailed survey. Commonly, these methods reformulate the optimization problem into a mathematical program with equilibrium constraints. In this, the lower-level optimization problem is replaced by its Karush-Kuhn-Tucker (KKT) conditions. The resulting optimization problem with equilibrium constraints can be solved approximately by relaxing the complementary conditions [5]. However, these methods do not necessarily converge to a local optimum and are applicable to small problems only.

That is why we focus on a special case of the above bilevel program. The following theorem reformulates the lower-level optimization problem into an unconstrained problem such that constraint 7 becomes continuously differentiable in $w$. This requires the feature space induced by mapping $\phi$, but not necessarily the input space $\mathcal{X}$, to be unrestricted and the data generator's loss function $l_{+1}(z, y)$ to be convex and continuously differentiable in $z \in \mathbb{R}$.

Theorem 1. Let the leader's cost function $\hat{\theta}_{-1}$ and the follower's cost function $\hat{\theta}_{+1}$ be defined as in (1) and (2) with regularizers $\hat{\Omega}_{-1}$ and $\hat{\Omega}_{+1}$ defined as in (4) and (3), respectively.
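Let feature mapping $\phi : \mathcal{X} \to \mathbb{R}^m$ be surjective, and let the data generator's loss function $l_{+1}(z, y)$ be convex and continuously differentiable with respect to $z \in \mathbb{R}$ for any fixed $y \in \mathcal{Y}$. Now let weight vector $w^* \in \mathbb{R}^m$ and factors $\tau_1^*, \dots, \tau_n^* \in \mathbb{R}$ be a solution of the optimization problem

$$\min_{w,\; \forall i:\, \tau_i} \; \sum_{i=1}^n c_{-1,i}\, l_{-1}\big(f_w(x_i) + \tau_i \|w\|^2, y_i\big) + \frac{\rho_{-1}}{2} \|w\|^2 \tag{8}$$

$$\text{s.t.} \quad \forall i:\; 0 = \tau_i + \frac{n\, c_{+1,i}}{\rho_{+1}}\, l'_{+1}\big(f_w(x_i) + \tau_i \|w\|^2, y_i\big).$$

Then the Stackelberg prediction game in Equation 6 attains an equilibrium at $(w^*, \dot{D}^*)$ with $\dot{D}^* = \{(\dot{x}_i^*, y_i)\}_{i=1}^n$ and $\dot{x}_i^* \in \{\dot{x} \in \mathcal{X} : \phi(\dot{x}) = \phi(x_i) + \tau_i^* w^*\}$.

Proof. Constraint 7 says that $\{\dot{x}_i\}_{i=1}^n$ has to be a solution of the restricted optimization problem

$$\min_{\forall i:\, \dot{x}_i \in \mathcal{X}} \; \sum_{i=1}^n c_{+1,i}\, l_{+1}\big(w^T \phi(\dot{x}_i), y_i\big) + \frac{\rho_{+1}}{2n} \|\phi(\dot{x}_i) - \phi(x_i)\|^2.$$

As the objective as well as the constraints are entirely defined in terms of $\dot{\mathbf{x}}_i = \phi(\dot{x}_i)$, this condition is equivalent to enforcing $\{\dot{\mathbf{x}}_i\}_{i=1}^n$ to be a solution of the unrestricted optimization problem

$$\min_{\forall i:\, \dot{\mathbf{x}}_i \in \mathbb{R}^m} \; \sum_{i=1}^n c_{+1,i}\, l_{+1}\big(w^T \dot{\mathbf{x}}_i, y_i\big) + \frac{\rho_{+1}}{2n} \|\dot{\mathbf{x}}_i - \phi(x_i)\|^2. \tag{9}$$

This solution is uniquely defined for any fixed $w$ as loss function $l_{+1}(z, y)$ is required to be convex in $z$, and consequently in $\dot{\mathbf{x}}_i$, and the term $\|\dot{\mathbf{x}}_i - \phi(x_i)\|^2$ is quadratic in $\dot{\mathbf{x}}_i$ and therefore strictly convex for any fixed $\phi(x_i)$. Given $w \in \mathbb{R}^m$ and minimizer $\dot{\mathbf{x}}_i^* \in \mathbb{R}^m$, the set $\mathcal{X}_i^w = \{\dot{x} \in \mathcal{X} : \phi(\dot{x}) = \dot{\mathbf{x}}_i^*\}$ contains all instances $\dot{x}$ which correspond to the optimally transformed instance in feature space, $\dot{\mathbf{x}}_i^*$. Since $\phi$ is surjective, $\mathcal{X}_i^w$ is guaranteed to be non-empty, and consequently, for any solution $\{\dot{\mathbf{x}}_i^*\}_{i=1}^n$, there exists at least one corresponding set of instances $\{\dot{x}_i^*\}_{i=1}^n$. As $\phi$ is not required to be a bijective mapping, there may exist multiple instances $\dot{x} \in \mathcal{X}_i^w$ which are optimal in the sense of minimizing the data generator's loss. However, since all of these instances share the same feature representation $\dot{\mathbf{x}}_i^*$, the inner maximization of the upper-level optimization problem in 6 vanishes,

$$\min_{w \in \mathbb{R}^m} \; \max_{\forall i:\, \dot{x}_i \in \mathcal{X}_i^w} \hat{\theta}_{-1}\big(w, \{(\dot{x}_i, y_i)\}_{i=1}^n\big) = \min_{w \in \mathbb{R}^m} \; \sum_{i=1}^n c_{-1,i}\, l_{-1}\big(w^T \dot{\mathbf{x}}_i^*, y_i\big) + \frac{\rho_{-1}}{2} \|w\|^2, \tag{10}$$

where $\{\dot{\mathbf{x}}_i^*\}_{i=1}^n$ is the solution of Optimization Problem 9. Since 9 is convex, this constraint can be replaced by its complementary conditions which are given by $\nabla_{\dot{\mathbf{x}}_i} \hat{\theta}_{+1}(w, \dot{D}) = 0$ for $i = 1, \dots, n$ where

$$\nabla_{\dot{\mathbf{x}}_i} \hat{\theta}_{+1}(w, \dot{D}) = c_{+1,i}\, l'_{+1}\big(w^T \dot{\mathbf{x}}_i, y_i\big)\, w + \frac{\rho_{+1}}{n} \big(\dot{\mathbf{x}}_i - \phi(x_i)\big). \tag{11}$$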
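The mapped instance $\dot{\mathbf{x}}_i$ that satisfies the $i$-th complementary condition is given by $\dot{\mathbf{x}}_i = \phi(x_i) + \tau_i w$ with

$$\tau_i = -\frac{n\, c_{+1,i}}{\rho_{+1}}\, l'_{+1}\big(w^T \dot{\mathbf{x}}_i, y_i\big) = -\frac{n\, c_{+1,i}}{\rho_{+1}}\, l'_{+1}\big(w^T \phi(x_i) + \tau_i w^T w, y_i\big) = -\frac{n\, c_{+1,i}}{\rho_{+1}}\, l'_{+1}\big(f_w(x_i) + \tau_i \|w\|^2, y_i\big). \tag{12}$$

When replacing $\dot{\mathbf{x}}_i$ by $\phi(x_i) + \tau_i w$ in the upper-level Optimization Problem 10 and enforcing Equation 12, Optimization Problem 8 follows. Hence, a solution $w^*$ of 8 with corresponding $\tau_1^*, \dots, \tau_n^*$ is also a solution of 6 with $\dot{x}_i^* \in \mathcal{X}_i^{w^*} = \{\dot{x} \in \mathcal{X} : \phi(\dot{x}) = \phi(x_i) + \tau_i^* w^*\}$.

The objective as well as the constraints of the optimization problem in Theorem 1 are generally not jointly convex in $w$ and $\tau_1, \dots, \tau_n$. However, under the assumptions of the following proposition, a locally optimal solution can still be found efficiently by standard SQP solvers.

Proposition 1. Let loss function $l_{-1}(z, y)$ be twice continuously differentiable and loss function $l_{+1}(z, y)$ be convex and thrice continuously differentiable with respect to $z \in \mathbb{R}$ for any fixed $y \in \mathcal{Y}$. Then, a point satisfying the KKT conditions of the optimization problem in Equation 8 can be obtained by sequential quadratic programming (SQP) methods.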

The objective as well as the constraints in 8 are twice continuously differentiable with respect to $w$ and $\tau_i$ for $i = 1, \dots, n$. Hence, the corresponding complementary conditions are continuously differentiable, which is a sufficient condition to apply SQP methods; this proves Proposition 1.

3.2 Applying Kernels
Theorem 1 states that a Stackelberg equilibrium with parameter vector $w^* \in \mathbb{R}^m$ can be obtained by solving the optimization problem in 8, which requires an explicit feature representation $\phi(x_i)$ of the training instances. However, in some applications, such a feature mapping is unwieldy or does not even exist. Instead, one is often equipped with a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which measures the similarity between two instances. Generally, kernel function $k$ is assumed to be a positive-semidefinite kernel such that it can be stated in terms of a scalar product in the corresponding reproducing kernel Hilbert space; i.e., there exists $\phi$ with $k(x, x') = \phi(x)^T \phi(x')$. Making use of the representer theorem [3], we can now express weight vector $w$ as a linear combination of the mapped training instances; that is,

$$w = \sum_{i=1}^n \alpha_i \phi(x_i), \tag{13}$$

where feature mapping $\phi$ is implicitly defined by kernel $k$. When substituting $w$ in 8 by 13, the squared norm of $w$ and decision function $f_w$ can be completely expressed in terms of the kernel,

$$\|w\|^2 = \sum_{j,k=1}^n \alpha_j \alpha_k\, k(x_j, x_k), \tag{14}$$

$$f_w(x_i) = \sum_{j=1}^n \alpha_j\, k(x_i, x_j). \tag{15}$$

Hence, the optimization problem in 8 can be reformulated into an optimization problem over $\tau_1, \dots, \tau_n \in \mathbb{R}$ and the dual weights $\alpha_1, \dots, \alpha_n \in \mathbb{R}$ without the need of an explicit feature mapping $\phi$. However, inferring an optimal transformed sample $\dot{D}$ still requires the knowledge of an explicit mapping $\phi$ and its inverse $\phi^{-1}$. Of course, this is not a restriction as we are interested in the predictive model $f_w$ rather than the transformed sample $\dot{D}$. Note that for computational reasons, it may be advisable to first construct an explicit feature mapping from the kernel matrix and then to train the Stackelberg model in the primal. For instance, we can employ the kernel PCA map

$$\phi : x \mapsto K^{-\frac{1}{2}} \big[k(x_1, x), \dots, k(x_n, x)\big]^T, \tag{16}$$

where $K$ denotes the kernel matrix with $K_{ij} = k(x_i, x_j)$.
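The kernel expansions of the squared norm and the decision values can be checked numerically against an explicit weight vector. The snippet below is a minimal sketch that assumes a linear kernel and a three-instance toy sample; the helper names (`gram_matrix`, `squared_norm`, `decision_values`) and all numbers are illustrative and not taken from the paper.

```python
def linear_kernel(a, b):
    """Scalar product of two feature lists (assumed kernel for this sketch)."""
    return sum(ai * bi for ai, bi in zip(a, b))


def gram_matrix(xs, kernel):
    """Kernel matrix K with K[i][j] = k(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in xs] for xi in xs]


def squared_norm(alpha, K):
    """||w||^2 = sum_{j,k} alpha_j alpha_k k(x_j, x_k)."""
    n = len(alpha)
    return sum(alpha[j] * alpha[k] * K[j][k] for j in range(n) for k in range(n))


def decision_values(alpha, K):
    """f_w(x_i) = sum_j alpha_j k(x_i, x_j) for every training instance."""
    n = len(alpha)
    return [sum(alpha[j] * K[i][j] for j in range(n)) for i in range(n)]


# Toy sample and dual weights (illustrative values only).
xs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
alpha = [0.5, -0.25, 1.0]
K = gram_matrix(xs, linear_kernel)

# For the linear kernel, w = sum_i alpha_i x_i can be formed explicitly,
# which allows a direct comparison with the kernelized expressions.
w = [sum(a * x[d] for a, x in zip(alpha, xs)) for d in range(2)]
explicit_norm = linear_kernel(w, w)
explicit_values = [linear_kernel(w, x) for x in xs]
```

With a non-linear kernel the explicit vector `w` is generally unavailable, which is exactly the situation in which the kernelized expressions are needed.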
Within our experiments presented in Section 5, where we use linear kernels, we study all three variants: computing the model in input space, computing the kernelized version, and computing the PCA map-induced variant. Even though all variants yield the same solution, using an explicit PCA mapping is generally fastest for reasonable $n$. Matrix $K^{-\frac{1}{2}}$ can be computed directly from the eigenvalue decomposition of the kernel matrix $K$; in case it is singular we use the pseudo-inverse of $K^{\frac{1}{2}}$.

4. INSTANCES OF THE SPG
By the choice of $l_v$, distinct instances of the Stackelberg prediction game (SPG) can be identified which, to some extent, generalize existing prediction models such as the SVM for invariances [4] and the SVM with uneven margins [].

4.1 SPG with Worst-Case Loss
The SPG with worst-case loss is an instance of the Stackelberg prediction game that is characterized by antagonistic weighted empirical costs of learner and data generator; that is, the data generator employs the loss function $l^{wc}_{+1}(z, y) = -l_{-1}(z, y)$ and cost factors $c_{+1,i} = c_{-1,i}$. Loss functions $l^{wc}_{+1}$ and $l_{-1}$ cannot both be convex at the same time (except for an inappropriate linear function) and so the requirements of either Theorem 1 or Proposition 1 are violated.

As we cannot apply Theorem 1, we consider the original optimization problem (Equations 6-7). We substitute $l^{wc}_{+1}$ and $c_{+1,i}$ in the objective (Equation 2) of the lower-level optimization problem

$$\min_{\forall i:\, \dot{x}_i \in \mathcal{X}} \; \sum_{i=1}^n c_{+1,i}\, l^{wc}_{+1}\big(f_w(\dot{x}_i), y_i\big) + \frac{\rho_{+1}}{2n} \|\phi(\dot{x}_i) - \phi(x_i)\|^2$$

which decouples into $n$ maximization problems

$$\max_{\dot{x}_i \in \mathcal{X}} \; c_{-1,i}\, l_{-1}\big(f_w(\dot{x}_i), y_i\big) - \frac{\rho_{+1}}{2n} \|\phi(\dot{x}_i) - \phi(x_i)\|^2. \tag{17}$$

An equivalent formulation of 17 is given by

$$\max_{\dot{x}_i \in \mathcal{X}_i} \; l_{-1}\big(f_w(\dot{x}_i), y_i\big) \tag{18}$$

where $\mathcal{X}_i = \{\dot{x} \in \mathcal{X} : c_{-1,i} = \frac{\rho_{+1}}{2n} \|\phi(\dot{x}) - \phi(x_i)\|^2\}$ are feasible sets of transformed instances. The difference between both formulations is that in 18, regularization parameter $\rho_{+1}$ explicitly restricts the amount of transformation of each instance $x_i$.
As now the inner maximization of the upper-level optimization problem in 6 can be stated in terms of the solution of the lower-level optimization problem, $l_{-1}(f_w(\dot{x}_i^*), y_i)$, the entire bilevel optimization problem reduces to the following constrained minimization problem.

$$\min_{w,\; \forall i:\, \xi_i} \; \sum_{i=1}^n c_{-1,i}\, \xi_i + \frac{\rho_{-1}}{2} \|w\|^2 \tag{19}$$

$$\text{s.t.} \quad \forall i:\; \xi_i \ge 0, \quad \xi_i \ge \max_{\dot{x}_i \in \mathcal{X}_i} l_{-1}\big(f_w(\dot{x}_i), y_i\big) \tag{20}$$

If the lower-level maximization problem 20 has a unique solution for any fixed $w \in \mathbb{R}^m$, then the above optimization problem can be solved by gradient descent where in each iteration the maximization problem in 20 has to be solved for the current iterate $w_k$ (see, e.g., [4]). In case the learner chooses the hinge loss,

$$l^{h}_{-1}(z, y) = \max(0, 1 - yz), \tag{21}$$

the SPG with worst-case loss reduces to an instance of the SVM for invariances [4].

4.2 SPG with Linear Loss
A second instance of the Stackelberg prediction game is the SPG with linear loss in which the data generator employs a linear loss function,

$$l^{lin}_{+1}(z, y) = z,$$

which penalizes high decision values $z$ independently of the class. This choice is appropriate, for instance, in email spam filtering where the data generator is purely interested in the delivery of an email $x$, which becomes unlikely for large values of $z$, independently of the corresponding true class $y$. For the linear loss, which is continuously differentiable and convex, the constraints in 8 reduce to

$$\tau_i = -\frac{n\, c_{+1,i}}{\rho_{+1}} \tag{22}$$

for $i = 1, \dots, n$. When choosing the hinge loss 21 for the learner and replacing $\tau_i$ in 8 by 22 we arrive at the following minimization problem.

$$\min_{w,\; \forall i:\, \xi_i} \; \sum_{i=1}^n c_{-1,i}\, \xi_i + \frac{\rho_{-1}}{2} \|w\|^2$$

$$\text{s.t.} \quad \forall i:\; \xi_i \ge 0, \quad \xi_i \ge 1 - y_i \Big( w^T \phi(x_i) - \frac{n\, c_{+1,i}}{\rho_{+1}} \|w\|^2 \Big)$$

The latter constraints can be reformulated to $y_i w^T \phi(x_i) \ge 1 + y_i \kappa_i - \xi_i$, which amounts to the constraints of the SVM with uneven margins []. The only syntactic distinction is that $\kappa_i = \frac{n\, c_{+1,i}}{\rho_{+1}} \|w\|^2$ is indirectly defined by $\rho_{+1}$ and $c_{+1,i}$; however, for each choice of $\kappa_i \ge 0$ in the SVM with uneven margins, there exist appropriate parameters $\rho_{+1}$ and $c_{+1,i}$ of an equivalent SPG with linear loss and vice versa. Consider the special case of equal factors $c_{+1,i} = c_{+1,j}$, and consequently $\kappa = \kappa_i = \kappa_j$, for all $i, j = 1, \dots, n$. Then the margin of negative instances becomes $1 - \kappa$ whereas the margin of positive instances is $1 + \kappa$. In our example of spam filtering, this goes with the intuition that the margin of spam instances that vary greatly has to be larger than the margin of non-spam instances that remain almost unmodified. This effect is stronger when the data generator's regularization parameter $\rho_{+1}$ is small. By contrast, if $\rho_{+1}$ goes to infinity, and consequently $\kappa$ attains zero, then the SPG with linear loss reduces to the regular SVM.

4.3 SPG with Logistic Loss
Finally, this section introduces the SPG with logistic loss. This instantiation meets the preconditions of Theorem 1 and Proposition 1, and the resulting optimization criterion can be solved with standard tools.
The learner may use any loss function that is convex and twice continuously differentiable (Equation 23 details the loss function used in our experiments) while the data generator uses the logistic loss

$$l^{log}_{+1}(z, y) = \log(1 + e^{z})$$

which again penalizes large decision values $z$. The rationale behind this loss function is that the data generator experiences costs when the learner blocks an event, i.e., produces a high decision function value for an instance. For instance, a legitimate sender experiences costs when a legitimate email is erroneously blocked, just like an abusive sender, also amalgamated into the data generator, experiences costs when spam messages are blocked. Cost function $l^{log}_{+1}$ approaches zero for small values of the decision function. Now, the constraints in 8 resolve to $g_i(w, \tau_i) = 0$ for $i = 1, \dots, n$ with

$$g_i(w, \tau_i) = \tau_i \big(1 + e^{-f_w(x_i) - \tau_i \|w\|^2}\big) + \frac{n\, c_{+1,i}}{\rho_{+1}}.$$

Functions $g_i(w, \tau_i)$ are not jointly convex in $w$ and $\tau_i$. However, as they are smooth (i.e., infinitely differentiable in both arguments), their roots can be obtained efficiently and, consequently, the resulting optimization problem

$$\min_{w,\; \forall i:\, \tau_i} \; \sum_{i=1}^n c_{-1,i}\, l_{-1}\big(f_w(x_i) + \tau_i \|w\|^2, y_i\big) + \frac{\rho_{-1}}{2} \|w\|^2 \quad \text{s.t.} \quad \forall i:\; 0 = g_i(w, \tau_i)$$

can be solved by standard SQP solvers.

5. EXPERIMENTAL EVALUATION
The goal of this section is to explore the relative strengths and weaknesses of the discussed instances of Stackelberg prediction games and existing baseline methods in the context of email spam filtering. We compare a regular support vector machine (SVM), logistic regression (LogReg), the SVM for invariances with feature scaling (Invar-SVM, [4]), Nash logistic regression (Nash, []), and the Stackelberg instances SPG with worst-case loss (SPG wc, cf. Section 4.1), SPG with linear loss (SPG lin, cf. Section 4.2), and the SPG with logistic loss (SPG log, cf. Section 4.3). For all Stackelberg instances we choose the logistic loss function

$$l^{log}_{-1}(z, y) = \log(1 + e^{-yz}) \tag{23}$$

for the learner which is convex and smooth, and consequently satisfies Proposition 1. In the absence of prior knowledge on the instance-specific costs, we set $c_{v,i} = \frac{1}{n}$ for all $v \in \{-1, +1\}$, $i = 1, \dots, n$ and train all methods in the PCA map induced feature space.
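For a fixed weight vector, the root of the constraint $g_i(w, \tau_i) = 0$ of the SPG with logistic loss can be located per instance by bisection. Dividing $g_i$ by the positive factor $1 + e^{-f_w(x_i) - \tau_i \|w\|^2}$ gives the equivalent condition $\tau_i + s\, \sigma(f_w(x_i) + \tau_i \|w\|^2) = 0$, where $\sigma$ is the logistic function and $s > 0$ denotes the constant term of $g_i$; its left-hand side is strictly increasing in $\tau_i$, negative at $-s$ and positive at $0$, so the root is unique and lies in $(-s, 0)$. The sketch below covers only this inner step, not the joint SQP problem; the name `follower_shift` and the parameter `scale` (standing for $s$) are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def follower_shift(f_x, w_sq_norm, scale, tol=1e-12, max_iter=200):
    """Solve tau + scale * sigmoid(f_x + tau * w_sq_norm) = 0 for tau by bisection.

    f_x       -- decision value f_w(x_i) of the unmodified instance
    w_sq_norm -- squared norm ||w||^2
    scale     -- positive constant term of g_i (assumed given)
    """
    def h(tau):
        return tau + scale * sigmoid(f_x + tau * w_sq_norm)

    lo, hi = -scale, 0.0  # h(lo) <= 0 < h(hi) because 0 < sigmoid <= 1
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def g(f_x, w_sq_norm, scale, tau):
    """Constraint function g_i in its original (undivided) form."""
    return tau * (1.0 + math.exp(-f_x - tau * w_sq_norm)) + scale


# Illustrative values only.
tau = follower_shift(f_x=2.0, w_sq_norm=1.5, scale=0.8)
residual = g(2.0, 1.5, 0.8, tau)
```

The resulting `tau` is negative, which matches the intuition that the data generator shifts an instance against the direction of `w` in order to lower its decision value.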
To solve the nonlinear program of the SPG with logistic loss we use the Ipopt solver [6].

We use four email corpora detailed in Table 1: The first data set contains emails of an email service provider (ESP) collected between 2007 and 2010. The second (Mailinglist) is a collection of emails from publicly available mailing lists augmented by spam emails from Bruce Guenter's spam trap of the same time period. The third corpus (Private) contains newsletters and spam and non-spam emails of the authors. The last corpus is the NIST TREC 2007 spam corpus. All emails are tokenized, converted into binary bag-of-word vectors, and sorted chronologically.

Table 1: Data sets used in the experiments.

data set       instances   features   delivery period
ESP            69,62       54,73      0/06/2007-27/04/200
Mailinglist    28,7        266,378    0/04/999-3/05/2006
Private        08,78       582,00     0/08/2005-3/03/200
TREC 2007      75,496      24,839     04/08/2007-07/06/2007

Our evaluation protocol is as follows. We use the 4,000 oldest emails as training portion and set the remaining emails aside as test instances. We use the F-measure (that is, the harmonic mean of precision and recall) as evaluation measure and train all methods 20 times on a stratified subset of 200 spam and 200 non-spam messages sampled from the training portion. In order to tune the regularization parameters we perform a 5-fold cross validation on the training sample within each repetition of an experiment and for each method separately.

In the first experiment, we evaluate all methods into the future by processing the test set in chronological order. Each test sample is split into 20 disjoint subsets. We average the F-measure on each of those subsets over the 20 models (trained on different samples drawn from the training portion) for each method and perform a paired t-test.

[Figure 1: F-measure of predictive models over time on the ESP, Mailinglist, Private, and TREC 2007 corpora for SVM, LogReg, Invar-SVM, Nash, SPG wc, SPG lin, and SPG log. Error bars indicate standard errors.]

Figure 1 shows that, for all data sets, the Stackelberg prediction games with linear loss and with logistic loss outperform the regular SVM and logistic regression that do not explicitly factor the adversary into the optimization criterion. On the ESP corpus, the SPG with linear loss is slightly better than the SPG with logistic loss whereas for the Mailinglist corpus the SPG with logistic loss outperforms the SPG with linear loss. On the TREC 2007 data set, most of the methods behave comparably with a slight advantage for the Nash logistic regression and the SPG instances with logistic loss and linear loss. The period over which the TREC 2007 data have been collected is very short; therefore we believe that the training and test instances are governed by nearly identical distributions. Consequently the game-theoretic models do not gain a significant advantage over logistic regression that assumes iid samples. For the other three data sets, the game-theoretic models outperform the iid baselines.

Table 2 shows aggregated results over all four data sets. For each point in each of the diagrams of Figure 1, we conduct a pairwise comparison of all methods based on a paired t-test at a significance level of α = 0.05. When a difference is significant, we count this as a win for the method that achieves a higher F-measure. Each line of Table 2 details the wins and, set in italics, the losses of one method against all other methods.
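The win counting based on paired t-tests could be sketched as follows. This is an illustration under assumptions that the text does not spell out: the test is taken to be two-sided, each comparison is assumed to pair the 20 per-model F-measures of two methods (19 degrees of freedom), and the critical value 2.093 of Student's t distribution for that setting is hard-coded instead of being computed. The function names and the synthetic scores are hypothetical.

```python
import math

# Two-sided 5% critical value of Student's t with 19 degrees of freedom
# (20 paired observations); hard-coded assumption of this sketch.
T_CRIT_DF19 = 2.093


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0.0:
        # Degenerate case: all differences are identical.
        return math.copysign(math.inf, mean) if mean != 0.0 else 0.0
    return mean / math.sqrt(var / n)


def outcome(a, b, t_crit=T_CRIT_DF19):
    """Return 'win' if a is significantly better than b, 'loss' if worse, else 'tie'."""
    t = paired_t_statistic(a, b)
    if t > t_crit:
        return "win"
    if t < -t_crit:
        return "loss"
    return "tie"


# Synthetic F-measures of two methods over 20 trained models (not real results).
scores_a = [0.90 + 0.001 * i for i in range(20)]
scores_b = [s - 0.05 - 0.001 * (i % 2) for i, s in enumerate(scores_a)]
result = outcome(scores_a, scores_b)
```

Summing such outcomes over all evaluation points and corpora would yield one wins-versus-losses cell of a comparison table.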
The Stackelberg prediction game with logistic loss has more wins than it has losses against each of the other methods. The Stackelberg prediction game with linear loss has more wins than losses against each of the other methods except for the SPG with logistic loss and the Nash logistic regression. The ranking continues with the Invar-SVM, the SPG with worst-case loss, logistic regression, and the regular SVM, which loses more frequently than it wins against all other methods.

To study the predictive performance as well as the running-time behavior with respect to the size of the data set, we train the baselines and the three SPG instances for a varying number of training examples. We report on the results for the representative ESP data set in Figure 2. Except for SPG wc, the game models significantly outperform the trivial baseline methods SVM and logistic regression, especially for small corpus sizes. However, this comes at the price of considerably higher computational cost. Among the game models, the Stackelberg instance SPG lin clearly outperforms all reference methods with respect to efficiency. However, the larger the data set, the larger the differences in computational cost, while at the same time the differences in predictive performance diminish.

Figure 2: Predictive performance (left) and execution time (right) for varying sizes of the training data set. Panels: performance on the ESP corpus and execution time on the ESP corpus (time in seconds), each plotted against the number of training emails. Methods: SVM, LogReg, Invar-SVM, Nash, SPG wc, SPG lin, SPG log.

Table 2: Results of paired t-test over all corpora: number of trials in which each method (row) has significantly outperformed each other method (column) vs. number of times it was outperformed.

method      SVM    LogReg  Invar-SVM  Nash   SPG wc  SPG lin  SPG log
SVM         0:0    6:44    2:64       0:72   8:50    6:54     6:69
LogReg      44:6   0:0     3:4        0:72   0:29    6:48     5:57
Invar-SVM   64:2   4:3     0:0        6:40   39:0    20:23    8:30
Nash        72:0   72:0    40:6       0:0    57:2    33:7     4:6
SPG wc      50:8   29:0    0:39       2:57   0:0     7:46     9:48
SPG lin     54:6   48:6    23:20      7:33   46:7    0:0      0:23
SPG log     69:6   57:5    30:8       6:4    48:9    23:0     0:0

The data generator's regularizer that we use in the experiments does not distinguish between modifications of spam and non-spam messages. In reality, most senders of legitimate messages do not deliberately change their writing behavior such as to bypass spam filters, perhaps with the exception of senders of legitimate newsletters, who must be careful not to trigger filtering mechanisms. In a final experiment, we want to study whether the Stackelberg model reflects this aspect of reality. Table 3 shows the average number of modifications (i.e., word additions and deletions) performed by the sender per spam and per non-spam email depending on the sender's regularization parameter for fixed ρ.

Table 3: Average number of word additions and deletions per instance for SPG log.

        non-spam               spam
        additions  deletions   additions  deletions
4.4.6 4.6 7.6
6       0.3        0.3         9.9.6
64      0.0        0.0         7.         8.7
256     0.0        0.0         2.4        2.8
024     0.0        0.0         0.8        0.9

As expected, the number of transformations varies inversely with the regularization parameter. Even for equal cost factors c_{v,i}, non-spam messages are rarely modified because the interests of sender and recipient are coherent for legitimate messages.

6. CONCLUSIONS

We model adversarial prediction problems as a game in which a learner has to commit to a predictive model using past data, whereas the data generator may choose a transformation function, which then defines the test distribution, after the predictive model has been disclosed.
This model reflects applications such as the detection of network attacks and spam filtering, in which an assailant can probe the filter. The cost functions of learner and data generator are generally conflicting but are not constrained to be perfectly antagonistic. Playing the Stackelberg equilibrium instead of a worst-case strategy based on a zero-sum model is advisable when the data generator can be assumed to behave rationally in the sense of minimizing a cost function. In contrast to the Nash strategy, however, the Stackelberg model relies neither on the existence of a unique equilibrium nor on the assumptions that the adversary has no information about the predictive model and is able to identify and follow the equilibrium strategy. We derived a compact optimization problem that determines the solution of the resulting Stackelberg prediction game. We showed that the Stackelberg model generalizes existing prediction models such as the SVM with uneven margins and the SVM for invariances. We evaluated spam filters resulting from a regular SVM, logistic regression, existing game-theoretic models, and three instances of the Stackelberg game on several spam-filtering data sets. The relative performance of the distinct game-theoretic models varies, but we observe that the Stackelberg model with logistic loss has more wins than it has losses when compared to any other model.

Acknowledgments

This work was supported by the German Science Foundation DFG under grant SCHE 540/2- and by STRATO AG.

7. REFERENCES

[1] M. Brückner and T. Scheffer. Nash equilibria of static prediction games. In Advances in Neural Information Processing Systems. MIT Press, 2009.
[2] B. Colson, P. Marcotte, and G. Savard. An overview of bilevel optimization. Annals of Operations Research, 153:235-256, 2007.
[3] O. Dekel and O. Shamir. Learning to classify with missing and corrupted features. In Proceedings of the International Conference on Machine Learning, pages 216-223. ACM, 2008.
[4] O. Dekel, O. Shamir, and L. Xiao. Learning to classify with missing and corrupted features. Machine Learning, 81(2):149-178, 2010.
[5] L. E. Ghaoui, G. R. G. Lanckriet, and G. Natsoulis. Robust classification with interval data. Technical Report UCB/CSD-03-1279, EECS Department, University of California, Berkeley, 2003.
[6] A. Globerson and S. T. Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of the International Conference on Machine Learning. ACM, 2006.
[7] A. Globerson, C. H. Teo, A. J. Smola, and S. T. Roweis. Dataset Shift in Machine Learning, chapter An adversarial view of covariate shift and a minimax approach, pages 179-198. MIT Press, 2009.
[8] R. Jeroslow. The polynomial hierarchy and a simple model for competitive analysis. Mathematical Programming, 32:146-164, 1985.
[9] M. Kantarcioglu, B. Xi, and C. Clifton. Classifier evaluation and attribute selection against active adversaries. Data Mining and Knowledge Discovery, 22(1-2):291-335, 2011.
[10] G. R. G. Lanckriet, L. E. Ghaoui, C. Bhattacharyya, and M. I. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555-582, 2002.
[11] Y. Li and J. Shawe-Taylor. The SVM with uneven margins and Chinese document categorization. In Proceedings of the Pacific Asia Conference on Language, Information and Computation, pages 216-227, 2003.
[12] W. Liu and S. Chawla. A game theoretical model for adversarial learning. In ICDM Workshops, pages 25-30. IEEE Computer Society, 2009.
[13] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, 2001.
[14] C. H. Teo, A. Globerson, S. T. Roweis, and A. J. Smola. Convex learning with invariances. In Advances in Neural Information Processing Systems. MIT Press, 2007.
[15] S. Veelken. A New Relaxation Scheme for Mathematical Programs with Equilibrium Constraints: Theory and Numerical Experience. PhD thesis, Technische Universität München, 2009.
[16] A. Wächter and L. T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106:25-57, 2006.