Cost-Sensitive Learning by Cost-Proportionate Example Weighting


Bianca Zadrozny, John Langford, Naoki Abe
Mathematical Sciences Department
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598

Abstract

We propose and evaluate a family of methods for converting classifier learning algorithms and classification theory into cost-sensitive algorithms and theory. The proposed conversion is based on cost-proportionate weighting of the training examples, which can be realized either by feeding the weights to the classification algorithm (as often done in boosting), or by careful subsampling. We give some theoretical performance guarantees on the proposed methods, as well as empirical evidence that they are practical alternatives to existing approaches. In particular, we propose costing, a method based on cost-proportionate rejection sampling and ensemble aggregation, which achieves excellent predictive performance on two publicly available datasets, while drastically reducing the computation required by other methods.

1 Introduction

Highly non-uniform misclassification costs are very common in a variety of challenging real-world data mining problems, such as fraud detection, medical diagnosis and various problems in business decision-making. In many cases, one class is rare but the cost of not recognizing some of the examples belonging to this class is high. In these domains, classifier learning methods that do not take misclassification costs into account do not perform well. In extreme cases, ignoring costs may produce a model that is useless because it classifies every example as belonging to the most frequent class, even though misclassifications of the least frequent class result in a very large cost.

Recently a body of work has attempted to address this issue, with techniques known as cost-sensitive learning in the machine learning and data mining communities. Current cost-sensitive learning research falls into three categories. The first is concerned with making particular classifier learners cost-sensitive [3, 7].
The second uses Bayes risk theory to assign each example to its lowest-risk class [2, 19, 14]. This requires estimating class membership probabilities and, in the case where costs are non-deterministic, also requires estimating expected costs [19]. The third category concerns methods for converting arbitrary classification learning algorithms into cost-sensitive ones [2]. The work described here belongs to the last category. In particular, the approach here is akin to the pioneering work of Domingos on MetaCost [2], which also is a general method for converting cost-sensitive learning problems to cost-insensitive learning problems. However, the method here is distinguished by the following properties: (1) it is even simpler; (2) it has some theoretical performance guarantees; and (3) it does not involve any probability density estimation in its process: MetaCost estimates conditional probability distributions via bagging with a classifier in its procedure, and as such it also belongs to the second category (Bayes risk minimization) mentioned above.

The family of proposed methods is motivated by a folk theorem that is formalized and proved in Section 2.1. This theorem states that altering the original example distribution D to another distribution ^D, by multiplying it by a factor proportional to the relative cost of each example, makes any error-minimizing classifier learner accomplish expected cost minimization on the original distribution. Representing samples drawn from ^D, however, is more challenging than it may seem. There are two basic methods for doing this: (i) Transparent Box: supply the costs of the training data as example weights to the classifier learning algorithm. (ii) Black Box: resample according to these same weights. While the transparent box approach cannot be applied to arbitrary classifier learners, it can be applied to many, including any classifier which only uses the data to calculate expectations. We show empirically that this method gives good results.

(Footnote from the title page: This author's present address: Toyota Technological Institute at Chicago, 1427 East 60th Street, Second Floor, Press Building, Chicago, IL.)
The black box approach has the advantage that it can be applied to any classifier learner. It turns out, however, that straightforward sampling-with-replacement can result in severe overfitting related to duplicate examples. We propose, instead, to employ cost-proportionate rejection sampling to realize the latter approach, which allows us to independently draw examples according to ^D. This method comes with a theoretical guarantee: in the worst case it produces a classifier that achieves at least as good
approximate cost minimization as applying the base classifier learning algorithm on the entire sample. This is a remarkable property for a subsampling scheme: in general, we expect any technique using only a subset of the examples to compromise predictive performance. The runtime savings made possible by this sampling technique enable us to run the classification algorithm on multiple draws of subsamples and average over the resulting classifiers. This last method is what we call costing (cost-proportionate rejection sampling with aggregation). Costing allows us to use an arbitrary cost-insensitive learning algorithm as a black box in order to accomplish cost-sensitive learning, achieves excellent predictive performance and can achieve drastic savings of computational resources.

2 Motivating Theory and Methods

2.1 A Folk Theorem

We assume that examples are drawn independently from a distribution D with domain X x Y x C, where X is the input space to a classifier, Y is a (binary) output space and C in [0, infinity) is the importance (extra cost) associated with mislabeling that example. The goal is to learn a classifier h : X -> Y which minimizes the expected cost

    E_{(x,y,c)~D}[c I(h(x) != y)]

given training data of the form (x, y, c), where I(.) is the indicator function that has value 1 in case its argument is true and 0 otherwise. This model does not explicitly allow using cost information at prediction time, although X might include a cost feature if that is available.

This formulation of cost-sensitive learning in terms of one number per example is more general than cost matrix formulations which are more typical in cost-sensitive learning [6, 2], when the output space is binary. In the cost matrix formulation, costs are associated with false negative, false positive, true negative, and true positive predictions. Given the cost matrix and an example, only two entries (false positive, true negative) or (false negative, true positive) are relevant for that example.
These two numbers can be further reduced to one: (false positive - true negative) or (false negative - true positive), because it is the difference in cost between classifying an example correctly or incorrectly which controls the importance of correct classification. This difference is the importance c we use here. This setting is more general in the sense that the importance may vary on an example-by-example basis. (How to formulate the problem in this way when the output space is not binary is non-trivial and is beyond the scope of this paper.)

A basic folk theorem states that if we have examples drawn from the distribution

    ^D(x, y, c) = (c / E_{(x,y,c)~D}[c]) D(x, y, c),

then optimal error rate classifiers for ^D are optimal cost minimizers for data drawn from D. (We say "folk theorem" here because the result appears to be known by some, and it is straightforward to derive it from results in decision theory, although we have not found it published.)

Theorem 2.1. (Translation Theorem) For all distributions D, there exists a constant N = E_{(x,y,c)~D}[c] such that for all classifiers h:

    E_{(x,y,c)~^D}[I(h(x) != y)] = (1/N) E_{(x,y,c)~D}[c I(h(x) != y)].

Proof.

    E_{(x,y,c)~D}[c I(h(x) != y)] = Sum_{(x,y,c)} D(x,y,c) c I(h(x) != y)
                                  = N Sum_{(x,y,c)} ^D(x,y,c) I(h(x) != y)
                                  = N E_{(x,y,c)~^D}[I(h(x) != y)],

where ^D(x,y,c) = (c/N) D(x,y,c).

Despite its simplicity, this theorem is useful to us because the right-hand side expresses the expectation we want to control (via the choice of h) and the left-hand side is the probability that h errs under another distribution. Choosing h to minimize the rate of errors under ^D is equivalent to choosing h to minimize the expected cost under D. Similarly, epsilon-approximate error minimization under ^D is equivalent to N*epsilon-approximate cost minimization under D.

The prescription for coping with cost-sensitive problems is straightforward: reweight the distribution in your training set according to the importances so that the training set is effectively drawn from ^D. Doing this in a correct and general manner is more challenging than it may seem and is the topic of the rest of the paper.

2.2 Transparent Box: Using Weights Directly
2.2.1 General conversion

Here we examine how importance weights can be used within different learning algorithms to accomplish cost-sensitive classification. We call this the transparent box approach because it requires knowledge of the particular learning algorithm (as opposed to the black box approach that we develop later). The mechanisms for realizing the transparent box approach have been described elsewhere for a number of weak learners used in boosting, but we will describe them here for completeness.

The classifier learning algorithm must use the weights so that it effectively learns from data drawn according to ^D. This requirement is easy to apply for all learning algorithms which fit the statistical query model [13]. As shown in Figure 1, many learning algorithms can be divided into two components: a portion which calculates the (approximate) expected value of some function (or query) f, and a portion which forms these queries and uses their output to construct a classifier. For example, neural networks, decision trees, and Naive Bayes classifiers can be
constructed in this manner.

[Figure 1. The statistical query model: the learning algorithm exchanges query/reply pairs with a query oracle.]

Support vector machines are not easily constructible in this way, because the individual classifier is explicitly dependent upon individual examples rather than on statistics derived from the entire sample. With finite data we cannot precisely calculate the expectation E_{(x,y)~D}[f(x, y)]. With high probability, however, we can approximate the expectation given a set of examples drawn independently from the underlying distribution D.

Whenever we have a learning algorithm that can be decomposed as in Figure 1, there is a simple recipe for using the weights directly. Instead of simulating the expectation with

    (1/|S|) Sum_{(x,y) in S} f(x, y),

we use

    Sum_{(x,y,c) in S} c f(x, y) / Sum_{(x,y,c) in S} c.

This method is equivalent to importance sampling for ^D using the distribution D, and so the modified expectation is an unbiased Monte Carlo estimate of the expectation w.r.t. ^D. Even when a learning algorithm does not fit this model, it may be possible to incorporate importance weights directly. We now discuss how to incorporate importance weights into some specific learning algorithms.

2.2.2 Naive Bayes and boosting

Naive Bayes learns by calculating empirical probabilities for each output y using Bayes' rule and assuming that each feature is independent given the output:

    P(y | x) = P(x | y) P(y) / P(x) = (Prod_i P(x_i | y)) P(y) / Prod_i P(x_i).

Each probability estimate in the above expression can be thought of as a function of empirical expectations according to D, and thus it can be formulated in the statistical query model. For example, P(x_i | y) is just the expectation of I(X_i = x_i) I(Y = y) divided by the expectation of I(Y = y). More specifically, to compute the empirical estimate of P(x_i | y) with respect to D, we need to count the number of training examples that have y as output, and those having x_i as the i-th input dimension among those. When we compute these empirical estimates with respect to ^D, we simply have to sum the weight of each example, instead of counting the examples. (This property is used in the implementation of boosted Naive Bayes [5].)
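A minimal sketch of the two recipes above (illustrative only, not the authors' code; the function names are our own, and features are assumed discrete):

```python
from collections import defaultdict

def weighted_expectation(sample, f):
    """Estimate E_{^D}[f] from (x, y, c) triples drawn from D:
    replace the unweighted average (1/|S|) * sum f(x, y) with
    sum c*f(x, y) / sum c (importance sampling for ^D under D)."""
    total_c = sum(c for _, _, c in sample)
    return sum(c * f(x, y) for x, y, c in sample) / total_c

def weighted_naive_bayes_estimates(sample):
    """P(y) and P(x_i | y) from cost-weighted sums instead of raw counts."""
    class_w = defaultdict(float)   # total weight per class
    feat_w = defaultdict(float)    # total weight per (dimension, value, class)
    for x, y, c in sample:
        class_w[y] += c
        for i, v in enumerate(x):
            feat_w[(i, v, y)] += c
    total = sum(class_w.values())
    prior = {y: w / total for y, w in class_w.items()}
    cond = {k: w / class_w[k[2]] for k, w in feat_w.items()}
    return prior, cond
```

Note that the weighted class prior is exactly `weighted_expectation` applied to the indicator of the class, so the estimates remain statistical queries, just taken under ^D.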
To incorporate importance weights into AdaBoost [8], we give the importance weights to the weak learner in the first iteration, thus effectively drawing examples from ^D. In the subsequent iterations, we use the standard AdaBoost rule to update the weights. Therefore, the weights are adjusted according to the accuracy on ^D, which corresponds to the expected cost on D.

2.2.3 C4.5

C4.5 [16] is a widely used decision tree learner. There is a standard way of incorporating example weights into it, which in the original algorithm was intended to handle missing attributes (examples with missing attributes were divided into fractional examples, each with a smaller weight, during the growth of the tree). This same facility was later used by Quinlan in the implementation of boosted C4.5 [15].

2.2.4 Support Vector Machine

The SVM algorithm [11] learns the parameters a and b describing a linear decision rule h(x) = sign(a . x + b), so that the smallest distance between each training example and the decision boundary (the margin) is maximized. It works by solving the following optimization problem:

    minimize: V(a, b, xi) = (1/2) a . a + C Sum_{i=1}^{n} xi_i
    subject to: for all i: y_i (a . x_i + b) >= 1 - xi_i, xi_i >= 0.

The constraints require that all examples in the training set are classified correctly up to some slack xi_i. If a training example lies on the wrong side of the decision boundary, the corresponding xi_i is greater than 1. Therefore, Sum_{i=1}^{n} xi_i is an upper bound on the number of training errors. The factor C is a parameter that allows one to trade off training error and model complexity. The algorithm can be generalized to non-linear decision rules by replacing inner products with a kernel function in the formulas above.

The SVM algorithm does not fit the statistical query model. Despite this, it is possible to incorporate importance weights in a natural way. First, we note that Sum_{i=1}^{n} c_i xi_i, where c_i is the importance of example i, is an upper bound on the total cost. Therefore, we can modify V(a, b, xi) to

    V(a, b, xi) = (1/2) a . a + C Sum_{i=1}^{n} c_i xi_i.

Now C controls model complexity versus total cost.
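Any solver for the modified objective will do. As an illustrative sketch (not SVMLight's implementation; simplified to a linear rule, optimized by plain subgradient descent on the equivalent cost-weighted hinge loss), assuming examples given as (x, y, c) triples:

```python
def weighted_svm_sgd(data, C=1.0, epochs=200, lr=0.01):
    """Subgradient descent on (1/2) a.a + C * sum_i c_i * xi_i,
    where xi_i = max(0, 1 - y_i * (a.x_i + b)) is the hinge slack.
    `data` is a list of (x, y, c): x a tuple of floats, y in {-1, +1},
    c the importance of the example."""
    dim = len(data[0][0])
    a, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y, c in data:
            margin = y * (sum(ai * xi for ai, xi in zip(a, x)) + b)
            # regularizer subgradient; if the hinge is active, also add
            # the cost-weighted loss subgradient
            grad_a = list(a)
            grad_b = 0.0
            if margin < 1:
                grad_a = [ga - C * c * y * xi for ga, xi in zip(grad_a, x)]
                grad_b = -C * c * y
            a = [ai - lr * ga for ai, ga in zip(a, grad_a)]
            b -= lr * grad_b
    return a, b
```

With highly important positive examples, the learned rule shifts so that misclassifying them becomes expensive; per-example weights in packages such as SVMLight (below) or scikit-learn's `sample_weight` argument have the same effect.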
The SVMLight package [10] allows users to input the weights c_i and works with the modified V(a, b, xi) as above, although this feature has not yet been documented.

2.3 Black Box: Sampling methods

Suppose we do not have transparent box access to the learner. In this case, sampling is the obvious method to convert from one distribution of examples to another, so as to obtain a cost-sensitive learner using the Translation Theorem (Theorem 2.1). As it turns out, straightforward sampling does not work well in this case, motivating us to propose an alternative method based on rejection sampling.
2.3.1 Sampling-with-replacement

Sampling-with-replacement is a sampling scheme where each example (x, y, c) is drawn according to the distribution p(x, y, c) = c / Sum_{(x,y,c) in S} c. Many examples are drawn to create a new dataset S'. This method, at first pass, appears useful because every example is effectively drawn from the distribution ^D. In fact, very poor performance can result when using this technique, essentially due to overfitting caused by the fact that the examples in S' are not drawn independently from ^D, as we will elaborate in the section on experimental results (Section 3).

Sampling-without-replacement is also not a solution to this problem. In sampling-without-replacement, an example (x, y, c) is drawn from the distribution p(x, y, c) = c / Sum_{(x,y,c) in S} c, and the next example is drawn from the set S - {(x, y, c)}. This process is repeated, drawing from a smaller and smaller set according to the weights of the examples remaining in the set. To see how this method fails, note that sampling-without-replacement m times from a set of size m results in the original set, which (by assumption) is drawn from the distribution D, and not ^D as desired.

2.3.2 Cost-proportionate rejection sampling

There is another sampling scheme called rejection sampling [18] which allows us to draw examples independently from the distribution ^D, given examples drawn independently from D. In rejection sampling, examples from ^D are obtained by first drawing examples from D, and then keeping (or accepting) the sample with probability proportional to ^D/D. Here, we have ^D/D proportional to c, so we accept an example with probability c/Z, where Z is some constant chosen so that max_{(x,y,c) in S} c <= Z, leading to the name cost-proportionate rejection sampling. Rejection sampling results in a set S' which is generally smaller than S. Furthermore, because inclusion of an example in S' is independent of the other examples, and the examples in S are drawn independently, we know that the examples in S' are distributed independently according to ^D.
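The acceptance step is a one-liner; a minimal sketch (illustrative only) that keeps each example independently with probability c/Z:

```python
import random

def rejection_sample(sample, z, rng):
    """Cost-proportionate rejection sampling: keep each (x, y, c) triple
    independently with probability c / z, where z >= max importance.
    The kept examples are i.i.d. draws from ^D."""
    assert all(c <= z for _, _, c in sample)
    return [(x, y, c) for (x, y, c) in sample if rng.random() <= c / z]
```

The expected size of the returned set is Sum c / Z, i.e. roughly |S| * N / Z, which is why the subsamples are so much smaller than the original training set.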
Using cost-proportionate rejection sampling to create a set S', and then applying a learning algorithm A to S', is guaranteed to produce an approximately cost-minimizing classifier, as long as the learning algorithm A achieves approximate minimization of classification error.

(In practice, we choose Z = max_{(x,y,c) in S} c so as to maximize the size of the set S'. A data-dependent choice of Z is not formally allowed for rejection sampling. However, the introduced bias appears small when |S| is large. A precise measurement of "small" is an interesting theoretical problem.)

Theorem 2.2. (Correctness) For all cost-sensitive sample sets S, if cost-proportionate rejection sampling produces a sample set S' and A(S') achieves epsilon classification error:

    E_{(x,y,c)~^D}[I(h(x) != y)] <= epsilon,

then h = A(S') approximately minimizes cost:

    E_{(x,y,c)~D}[c I(h(x) != y)] <= epsilon N,

where N = E_{(x,y,c)~D}[c].

Proof. Rejection sampling produces a sample set S' drawn independently from ^D. By assumption, A(S') outputs a classifier h such that

    E_{(x,y,c)~^D}[I(h(x) != y)] <= epsilon.

By the Translation Theorem (Theorem 2.1), we know that

    E_{(x,y,c)~^D}[I(h(x) != y)] = (1/N) E_{(x,y,c)~D}[c I(h(x) != y)].

Thus, E_{(x,y,c)~D}[c I(h(x) != y)] <= epsilon N.

2.3.3 Sample complexity of cost-proportionate rejection sampling

The accuracy of a learned classifier generally improves monotonically with the number of examples in the training set. Since cost-proportionate rejection sampling produces a smaller training set (by a factor of about N/Z), one would expect worse performance than using the entire training set. This turns out not to be the case, in the agnostic PAC-learning model [17, 12], which formalizes the notion of probably approximately optimal learning from arbitrary distributions D.

Definition 2.1. A learning algorithm A is said to be an agnostic PAC-learner for hypothesis class H, with sample complexity m(epsilon, delta), if for all epsilon > 0 and delta > 0, m = m(epsilon, delta) is the least sample size such that for all distributions D (over X x Y), the classification error rate of its output h is at most epsilon more than the best achievable by any member of H, with probability at least 1 - delta, whenever the sample size exceeds m.

By analogy, we can formalize the notion of cost-sensitive agnostic PAC-learning.

Definition 2.2.
A learning algorithm A is said to be a cost-sensitive agnostic PAC-learner for hypothesis class H, with cost-sensitive sample complexity m(epsilon, delta), if for all epsilon > 0 and delta > 0, m = m(epsilon, delta) is the least sample size such that for all distributions D (over X x Y x C), the expected cost of its output h is at most epsilon more than the best achievable by any member of H, with probability at least 1 - delta, whenever the sample size exceeds m.

We will now use this formalization to compare the cost-sensitive PAC-learning sample complexity of two methods: applying a given base classifier learning algorithm to a sample obtained through cost-proportionate rejection sampling, and applying the same algorithm on the original training set. We show that the cost-sensitive sample complexity of the latter method is lower-bounded by that of the former.
Theorem 2.3. (Sample Complexity Comparison) Fix an arbitrary base classifier learning algorithm A, and suppose that m_orig(epsilon, delta) and m_rej(epsilon, delta), respectively, are the cost-sensitive sample complexity of applying A on the original training set, and that of applying A with cost-proportionate rejection sampling. Then, we have

    m_orig(epsilon, delta) = Omega(m_rej(epsilon, delta)).

Proof. Let m(epsilon, delta) be the (cost-insensitive) sample complexity of the base classifier learning algorithm A. (If no such function exists, then neither m_orig(epsilon, delta) nor m_rej(epsilon, delta) exists, and the theorem holds vacuously.) Since Z is an upper bound on the cost of misclassifying an example, the cost-sensitive sample complexity of using the original training set satisfies

    m_orig(epsilon, delta) = Theta(m(epsilon/Z, delta)).

This is because, given a distribution that forces epsilon more classification error than optimal, another distribution can be constructed that forces epsilon*Z more cost than optimal, by assigning cost Z to all examples on which A errs. Now, from Theorem 2.2, and noting that the central limit theorem implies that cost-proportionate rejection sampling reduces the sample size by a factor of Theta(N/Z), the cost-sensitive sample complexity for rejection sampling is:

    m_rej(epsilon, delta) = (Z/N) Theta(m(epsilon/N, delta)).

A fundamental theorem from PAC-learning theory states that m(epsilon, delta) = Omega((1/epsilon) ln(1/delta)) [4]. When m(epsilon, delta) = Theta((1/epsilon) ln(1/delta)), the above implies:

    m_rej(epsilon, delta) = (Z/N) Theta((N/epsilon) ln(1/delta)) = Theta((Z/epsilon) ln(1/delta)) = Theta(m_orig(epsilon, delta)).

Finally, note that when m(epsilon, delta) grows faster than linearly in 1/epsilon, we have m_rej(epsilon, delta) = o(m_orig(epsilon, delta)), which finishes the proof.

Note that the linear dependence of sample size on 1/epsilon is only achievable by an ideal learning algorithm, and in practice superlinear dependence is expected, especially in the presence of noise. Thus, the above theorem implies that cost-proportionate rejection sampling minimizes cost better than no sampling for worst-case distributions. This is a remarkable property for any sampling scheme, since one generally expects that predictive performance is compromised by using a smaller sample.
Cost-proportionate rejection sampling seems to distill the original sample into a sample of smaller size which is at least as informative as the original.

2.3.4 Cost-proportionate rejection sampling with aggregation (costing)

From the same original training sample, different runs of cost-proportionate rejection sampling will produce different training samples. Furthermore, the fact that rejection sampling produces very small samples means that the time required for learning a classifier is generally much smaller. We can take advantage of these properties to devise an ensemble learning algorithm based on repeatedly performing rejection sampling from S to produce multiple sample sets S'_1, ..., S'_t, and then learning a classifier for each set. The output classifier is the average over all learned classifiers. We call this technique costing:

    Costing(Learner A, Sample Set S, count t)
    1. For i = 1 to t do:
       (a) S' <- rejection sample from S with acceptance probability c/Z.
       (b) Let h_i = A(S').
    2. Output h(x) = sign(Sum_{i=1}^{t} h_i(x)).

The goal in averaging is to improve performance. There is both empirical and theoretical evidence suggesting that averaging can be useful. On the empirical side, many people have observed good performance from bagging despite throwing away a 1/e fraction of the samples. On the theoretical side, there has been considerable work which proves that the ability of an average of classifiers to overfit might be smaller than naively expected when a large margin exists. The preponderance of learning algorithms producing averaging classifiers provides significant evidence that averaging is useful.

Note that despite the extra computational cost of averaging, the overall computational time of costing is generally much smaller than that of a learning algorithm using sample set S (with or without weights). This is the case because most learning algorithms have running times that are superlinear in the number of examples.

3 Empirical evaluation

We show empirical results using two real-world datasets. We selected datasets that are publicly available and for which cost information is available on a per-example basis.
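The costing procedure of Section 2.3.4 can be sketched end-to-end as follows. This is a minimal illustration, not the authors' implementation; the base learner here is a deliberately trivial majority-label stub standing in for any error-minimizing learner:

```python
import random

def costing(learn, sample, t, z, seed=0):
    """Costing: t runs of cost-proportionate rejection sampling, one
    classifier per run, combined by the sign of the summed votes."""
    rng = random.Random(seed)
    classifiers = []
    for _ in range(t):
        # rejection sample: keep each (x, y, c) with probability c / z
        sub = [(x, y, c) for (x, y, c) in sample if rng.random() <= c / z]
        if sub:  # skip the rare empty draw
            classifiers.append(learn(sub))
    def h(x):
        vote = sum(hi(x) for hi in classifiers)
        return 1 if vote >= 0 else -1
    return h

def majority_learner(sub):
    """Stand-in base learner: predicts the majority label of its
    training set, ignoring x entirely."""
    label = 1 if sum(y for _, y, _ in sub) >= 0 else -1
    return lambda x: label
```

Even this trivial base learner shows the mechanism: on data where positives are rare but carry high importance, the subsamples drawn from ^D are balanced, so the ensemble predicts the rare, expensive class that the unweighted learner would ignore.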
Both datasets are from the direct marketing domain. Although there are many other data mining domains that are cost-sensitive, such as credit card fraud detection and medical diagnosis, publicly available data are lacking.

3.1 The datasets used

3.1.1 KDD-98 dataset

This is the well-known and challenging dataset from the KDD-98 competition, now available at the UCI KDD repository [9]. The dataset contains information about persons who have made donations in the past to a particular charity. The decision-making task is to choose which donors to mail a request for a new donation. The measure of success is the total profit obtained in the mailing campaign.
The dataset is divided in a fixed way into a training set and a test set. Each set consists of approximately 96,000 records for which it is known whether or not the person made a donation and how much the person donated, if a donation was made. The overall percentage of donors is about 5%. Mailing a solicitation to an individual costs the charity $0.68. The donation amount for persons who respond varies from $1 to $200. The profit obtained by soliciting every individual in the test set is $10,560, while the profit attained by the winner of the KDD-98 competition was $14,712.

The importance of each example is the absolute difference in profit between mailing and not mailing an individual. Mailing results in the donation amount minus the cost of mailing. Not mailing results in zero profit. Thus, for positive examples (respondents), the importance varies from $0.32 to $199.32. For negative examples (non-respondents), it is fixed at $0.68.

3.1.2 DMEF dataset

This dataset can be obtained from the DMEF dataset library [1] for a nominal fee. It contains customer buying history for 96,551 customers of a nationally known catalog. The decision-making task is to choose which customers should receive a new catalog so as to maximize the total profit of the catalog mailing campaign. Information on the cost of mailing a catalog is not available, so we fixed it at $2. The overall percentage of respondents is about 2.5%. The purchase amount for customers who respond varies from $3 to $6,247.

As is the case for the KDD-98 dataset, the importance of each example is the absolute difference in profit between mailing and not mailing a customer. Therefore, for positive examples (respondents), the importance varies from $1 to $6,245. For negative examples (non-respondents), it is fixed at $2. We divided the dataset in half to create a training set and a test set. As a baseline for comparison, the profit obtained by mailing a catalog to every individual in the training set is $26,474.

3.2 Experimental results
3.2.1 Transparent box results

Table 1 shows the results for Naive Bayes, boosted Naive Bayes (100 iterations), C4.5 and SVMLight on the KDD-98 and DMEF datasets, with and without the importance weights. Without the importance weights, the classifiers label very few of the examples positive, resulting in small (and even negative) profits. With the costs given as weights to the learners, the results improve significantly for all learners, except C4.5. Cost-sensitive boosted Naive Bayes gives results comparable to the best so far with this dataset [19] using more complicated methods.

We optimized the parameters of the SVM by cross-validation on the training set. Without weights, no setting of the parameters prevented the algorithm from labeling all examples as negatives. With weights, the best parameters were a polynomial kernel with degree 3 for KDD-98 and a linear kernel for DMEF (with C tuned by the same cross-validation).

[Table 1. Test set profits with the transparent box: Naive Bayes, Boosted NB, C4.5 and SVMLight on KDD-98 and DMEF, with and without weights.]

However, even with this parameter setting, the results are not so impressive. This may be a hard problem for margin-based classifiers because the data is very noisy. Note also that running SVMLight on this dataset takes about three orders of magnitude longer than AdaBoost with 100 iterations.

The failure of C4.5 to achieve good profits with importance weights is probably related to the fact that the facility for incorporating weights provided in the algorithm is heuristic. So far, it has been used only in situations where the weights are fairly uniform (such as is the case for fractional instances due to missing data). These results indicate that it might not be suitable for situations with highly non-uniform costs. The fact that it is non-trivial to incorporate costs directly into existing learning algorithms is the motivation for the black box approaches that we present here.

3.2.2 Black box results

Table 2 shows the results of applying the same learning algorithms to the KDD-98 and DMEF data using training sets of different sizes obtained by sampling-with-replacement.
For each size, we repeat the experiments 10 times with different sampled sets to get the mean and standard error (in parentheses). The training set profits are on the original training set from which we draw the sampled sets. The results confirm that applying sampling-with-replacement to implement the black box approach can result in very poor performance due to overfitting. When there are large differences in the magnitude of the importance weights, it is typical for an example to be picked twice (or more). In Table 2, we see that as we increase the sampled training set size, and as a consequence the number of duplicate examples in the training set, the training profit becomes larger while the test profit becomes smaller for C4.5.

Examples which appear multiple times in the training set can defeat the complexity control mechanisms built into learning algorithms. For example, suppose that we have a decision tree algorithm which divides the training data into a growing set (used to construct a tree)
and a pruning set (used to prune the tree for complexity control purposes). If the pruning set contains examples which appear in the growing set, the complexity control mechanism is defeated.

[Table 2. Profits using sampling-with-replacement: training and test set profits for NB, BNB, C4.5 and SVM on KDD-98 and DMEF at three sampled training set sizes, with standard errors in parentheses.]

Although not as markedly as for C4.5, we see the same phenomenon for the other learning algorithms. In general, the larger the resampled set, the larger the difference between training set profit and test set profit. And even at the largest sampled size, we do not obtain the same test set results as giving the weights directly to Boosted Naive Bayes and SVM.

The fundamental difficulty here is that the samples in S' are not drawn independently from ^D. In particular, if ^D is a density, the probability of observing the same example twice given independent draws is 0, while the probability using sampling-with-replacement is greater than 0. Thus sampling-with-replacement fails because the sampled set S' is not constructed independently.

Figure 2 shows the results of costing on the KDD-98 and DMEF datasets, with the same base learners and Z = 200 or Z = 6,247, respectively. We repeated the experiment 10 times for each t and calculated the mean and standard error of the profit. The results for t = 1, 100 and 200 are also given in Table 3.

In the KDD-98 case, each resampled set has only about 600 examples, because the importance of the examples varies from 0.68 to 199.32 and there are few important examples. About 55% of the examples in each set are positive, even though in the original dataset the percentage of positives is only 5%.
With t = 100, the C4.5 version yields profits around $15,000, which is exceptional performance for this dataset. In the DMEF case, each set has only about 35 examples, because the importances vary even more widely (from 1 to 6,245) and there are even fewer examples with a large importance than in the KDD-98 case. The percentage of positive examples in each set is about 50%, even though in the original dataset it was only 2.5%.

[Table 3. Test set profits using costing (t = 1, 100, 200) for NB, BNB, C4.5 and SVM on KDD-98 and DMEF, with standard errors in parentheses.]

For learning the SVMs, we used the same kernels as in Section 3.2.1 and the default setting for C. In that section, we saw that by feeding the weights directly to the SVM, we obtain a profit of $13,683 on the KDD-98 dataset and of $36,443 on the DMEF dataset. Here, we obtain profits around $13,000 and $35,000, respectively. However, this did not require parameter optimization and, even with t = 200, was much faster to train. The reason for the speedup is that the time complexity of SVM learning is generally superlinear in the number of training examples.

4 Discussion

Costing is a technique which produces a cost-sensitive classification from a cost-insensitive classifier using only black box access. This simple method is fast, results in excellent performance and often achieves drastic savings in computational resources, particularly with respect to space requirements. This last property is especially desirable in applications of cost-sensitive learning to domains that involve massive amounts of data, such as fraud detection, targeted marketing, and intrusion detection.

Another desirable property of any reduction is that it applies to the theory as well as to concrete algorithms. Thus, the reduction presented here allows us to automatically apply any future results in cost-insensitive classification to cost-sensitive classification. For example, a
bound on the future error rate of A(S') implies a bound on the expected cost with respect to the distribution D. This additional property of a reduction is especially important because cost-sensitive learning theory is still young and relatively unexplored.

[Figure 2. Costing: test set profit vs. number of sampled sets t, for NB, BNB, C4.5 and SVM on the KDD-98 and DMEF datasets.]

One direction for future work is multiclass cost-sensitive learning. If there are K classes, the minimal representation of costs is K weights. A reduction to cost-insensitive classification using these weights is an open problem.

References

[1] Anifantis, S. The DMEF Data Set Library. The Direct Marketing Association, New York, NY, 2002. [http://www.the-dma.org/dmef/dmefdset.shtml]
[2] Domingos, P. MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, 155-164, 1999.
[3] Drummond, C. & Holte, R. Exploiting the cost (in)sensitivity of decision tree splitting criteria. Proceedings of the 17th International Conference on Machine Learning, 239-246, 2000.
[4] Ehrenfeucht, A., Haussler, D., Kearns, M. & Valiant, L. A general lower bound on the number of examples needed for learning. Information and Computation, 82:3, 247-261, 1989.
[5] Elkan, C. Boosting and naive Bayesian learning (Technical Report). University of California, San Diego, 1997.
[6] Elkan, C. The foundations of cost-sensitive learning. Proceedings of the 17th International Joint Conference on Artificial Intelligence, 973-978, 2001.
[7] Fan, W., Stolfo, S., Zhang, J. & Chan, P. AdaCost: Misclassification cost-sensitive boosting. Proceedings of the 16th International Conference on Machine Learning, 97-105, 1999.
[8] Freund, Y. & Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting.
Journal of Computer and System Sciences, 55:1, 119-139, 1997.
[9] Hettich, S. & Bay, S. D. The UCI KDD Archive. University of California, Irvine. [http://kdd.ics.uci.edu/]
[10] Joachims, T. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
[11] Joachims, T. Estimating the generalization performance of an SVM efficiently. Proceedings of the 17th International Conference on Machine Learning, 2000.
[12] Kearns, M., Schapire, R., & Sellie, L. Toward efficient agnostic learning. Machine Learning, 17, 115-141, 1994.
[13] Kearns, M. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45:6, 983-1006, 1998.
[14] Margineantu, D. Class probability estimation and cost-sensitive classification decisions. Proceedings of the 13th European Conference on Machine Learning, 270-281, 2002.
[15] Quinlan, J. R. Bagging, boosting, and C4.5. Proceedings of the Thirteenth National Conference on Artificial Intelligence, 725-730, 1996.
[16] Quinlan, J. R. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[17] Valiant, L. A theory of the learnable. Communications of the ACM, 27:11, 1134-1142, 1984.
[18] von Neumann, J. Various techniques used in connection with random digits. National Bureau of Standards, Applied Mathematics Series, 12, 36-38, 1951.
[19] Zadrozny, B. and Elkan, C. Learning and making decisions when costs and probabilities are both unknown. Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining, 203-213, 2001.
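The costing procedure discussed above (cost-proportionate rejection sampling followed by ensemble aggregation) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the one-dimensional threshold-stump base learner and all function names here are stand-ins for a real base learner such as NB, C4.5 or an SVM.

```python
import random


def rejection_sample(data, z, rng):
    # Cost-proportionate rejection sampling (after von Neumann [18]):
    # keep each (x, y, importance) example with probability importance / z,
    # where z is an upper bound on the importances.
    return [(x, y) for (x, y, w) in data if rng.random() < w / z]


def costing(data, base_learner, t, seed=0):
    # Draw t independent resampled sets, train one classifier on each,
    # and aggregate the resulting ensemble by majority vote.
    rng = random.Random(seed)
    z = max(w for (_, _, w) in data)
    classifiers = [base_learner(rejection_sample(data, z, rng))
                   for _ in range(t)]

    def predict(x):
        votes = sum(h(x) for h in classifiers)
        return 1 if votes * 2 >= len(classifiers) else 0

    return predict


def stump_learner(sample):
    # A deliberately simple cost-insensitive base learner: a threshold
    # stump on a single numeric feature, standing in for NB/C4.5/SVM.
    if not sample:
        return lambda x: 0
    pos = [x for (x, y) in sample if y == 1]
    neg = [x for (x, y) in sample if y == 0]
    if not pos or not neg:
        label = 1 if pos else 0
        return lambda x: label
    thresh = (min(pos) + max(neg)) / 2.0
    return lambda x: 1 if x >= thresh else 0
```

Using z = max importance keeps every acceptance probability at most 1, and averaging over the t samples recovers (in a bagging-like way) the information lost by subsampling, while each individual training set stays small.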