Cost-Sensitive Learning by Cost-Proportionate Example Weighting

Bianca Zadrozny, John Langford, Naoki Abe
Mathematical Sciences Department
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598

Abstract

We propose and evaluate a family of methods for converting classifier learning algorithms and classification theory into cost-sensitive algorithms and theory. The proposed conversion is based on cost-proportionate weighting of the training examples, which can be realized either by feeding the weights to the classification algorithm (as often done in boosting), or by careful subsampling. We give some theoretical performance guarantees on the proposed methods, as well as empirical evidence that they are practical alternatives to existing approaches. In particular, we propose costing, a method based on cost-proportionate rejection sampling and ensemble aggregation, which achieves excellent predictive performance on two publicly available datasets, while drastically reducing the computation required by other methods.

1 Introduction

Highly non-uniform misclassification costs are very common in a variety of challenging real-world data mining problems, such as fraud detection, medical diagnosis and various problems in business decision-making. In many cases, one class is rare but the cost of not recognizing some of the examples belonging to this class is high. In these domains, classifier learning methods that do not take misclassification costs into account do not perform well. In extreme cases, ignoring costs may produce a model that is useless because it classifies every example as belonging to the most frequent class, even though misclassifications of the least frequent class result in a very large cost. Recently, a body of work has attempted to address this issue, with techniques known as cost-sensitive learning in the machine learning and data mining communities.

Current cost-sensitive learning research falls into three categories. The first is concerned with making particular classifier learners cost-sensitive [3, 7].
The second uses Bayes risk theory to assign each example to its lowest risk class [2, 19, 14]. This requires estimating class membership probabilities and, in the case where costs are nondeterministic, also requires estimating expected costs [19]. The third category concerns methods for converting arbitrary classification learning algorithms into cost-sensitive ones [2]. The work described here belongs to the last category. In particular, the approach here is akin to the pioneering work of Domingos on MetaCost [2], which also is a general method for converting cost-sensitive learning problems to cost-insensitive learning problems. However, the method here is distinguished by the following properties: (1) it is even simpler; (2) it has some theoretical performance guarantees; and (3) it does not involve any probability density estimation in its process: MetaCost estimates conditional probability distributions via bagging with a classifier in its procedure, and as such it also belongs to the second category (Bayes risk minimization) mentioned above.

1. This author's present address: Toyota Technological Institute at Chicago, 1427 East 60th Street, Second Floor - Press Building, Chicago, IL 60637.

The family of proposed methods is motivated by a folk theorem that is formalized and proved in section 2.1. This theorem states that altering the original example distribution $D$ to another $\hat{D}$, by multiplying it by a factor proportional to the relative cost of each example, makes any error-minimizing classifier learner accomplish expected cost minimization on the original distribution. Representing samples drawn from $\hat{D}$, however, is more challenging than it may seem. There are two basic methods for doing this: (i) Transparent Box: supply the costs of the training data as example weights to the classifier learning algorithm. (ii) Black Box: resample according to these same weights. While the transparent box approach cannot be applied to arbitrary classifier learners, it can be applied to many, including any classifier which only uses the data to calculate expectations. We show empirically that this method gives good results.
The black box approach has the advantage that it can be applied to any classifier learner. It turns out, however, that straightforward sampling-with-replacement can result in severe overfitting related to duplicate examples. We propose, instead, to employ cost-proportionate rejection sampling to realize the latter approach, which allows us to independently draw examples according to $\hat{D}$. This method comes with a theoretical guarantee: in the worst case it produces a classifier that achieves at least as good
approximate cost minimization as applying the base classifier learning algorithm on the entire sample. This is a remarkable property for a subsampling scheme: in general, we expect any technique using only a subset of the examples to compromise predictive performance. The runtime savings made possible by this sampling technique enable us to run the classification algorithm on multiple draws of subsamples and average over the resulting classifiers. This last method is what we call costing (cost-proportionate rejection sampling with aggregation). Costing allows us to use an arbitrary cost-insensitive learning algorithm as a black box in order to accomplish cost-sensitive learning, achieves excellent predictive performance, and can achieve drastic savings of computational resources.

2 Motivating Theory and Methods

2.1 A Folk Theorem

We assume that examples are drawn independently from a distribution $D$ with domain $X \times Y \times C$, where $X$ is the input space to a classifier, $Y$ is a (binary) output space and $C = [0, \infty)$ is the importance (extra cost) associated with mislabeling that example. The goal is to learn a classifier $h : X \to Y$ which minimizes the expected cost

$E_{(x,y,c) \sim D}[c \, I(h(x) \neq y)]$

given training data of the form $(x, y, c)$, where $I(\cdot)$ is the indicator function that has value $1$ in case its argument is true and $0$ otherwise. This model does not explicitly allow using cost information at prediction time, although $X$ might include a cost feature if that is available.

This formulation of cost-sensitive learning in terms of one number per example is more general than the cost matrix formulations which are more typical in cost-sensitive learning [6, 2], when the output space is binary. In the cost matrix formulation, costs are associated with false negative, false positive, true negative, and true positive predictions. Given the cost matrix and an example, only two entries (false positive, true negative) or (false negative, true positive) are relevant for that example.
These two numbers can be further reduced to one: (false positive $-$ true negative) or (false negative $-$ true positive), because it is the difference in cost between classifying an example correctly or incorrectly which controls the importance of correct classification. This difference is the importance $c$ we use here.2 This setting is more general in the sense that the importance may vary on an example-by-example basis.

A basic folk theorem3 states that if we have examples drawn from the distribution

$\hat{D}(x, y, c) \equiv \frac{c}{E_{(x,y,c) \sim D}[c]} D(x, y, c)$,

then optimal error rate classifiers for $\hat{D}$ are optimal cost minimizers for data drawn from $D$.

2. How to formulate the problem in this way when the output space is not binary is nontrivial and is beyond the scope of this paper.
3. We say "folk theorem" here because the result appears to be known by some, and it is straightforward to derive it from results in decision theory, although we have not found it published.

Theorem 2.1. (Translation Theorem) For all distributions $D$, there exists a constant $N = E_{(x,y,c) \sim D}[c]$ such that for all classifiers $h$:

$E_{(x,y,c) \sim \hat{D}}[I(h(x) \neq y)] = \frac{1}{N} E_{(x,y,c) \sim D}[c \, I(h(x) \neq y)]$.

Proof.

$E_{(x,y,c) \sim D}[c \, I(h(x) \neq y)] = \sum_{(x,y,c)} D(x, y, c) \, c \, I(h(x) \neq y) = N \sum_{(x,y,c)} \hat{D}(x, y, c) \, I(h(x) \neq y) = N E_{(x,y,c) \sim \hat{D}}[I(h(x) \neq y)]$,

where $\hat{D}(x, y, c) = \frac{c}{N} D(x, y, c)$.

Despite its simplicity, this theorem is useful to us because the right-hand side expresses the expectation we want to control (via the choice of $h$) and the left-hand side is the probability that $h$ errs under another distribution. Choosing $h$ to minimize the rate of errors under $\hat{D}$ is equivalent to choosing $h$ to minimize the expected cost under $D$. Similarly, $\varepsilon$-approximate error minimization under $\hat{D}$ is equivalent to $N\varepsilon$-approximate cost minimization under $D$.

The prescription for coping with cost-sensitive problems is straightforward: re-weight the distribution in your training set according to the importances, so that the training set is effectively drawn from $\hat{D}$. Doing this in a correct and general manner is more challenging than it may seem and is the topic of the rest of the paper.

2.2 Transparent Box: Using Weights Directly

2.2.1
General conversion

Here we examine how importance weights can be used within different learning algorithms to accomplish cost-sensitive classification. We call this the transparent box approach because it requires knowledge of the particular learning algorithm (as opposed to the black box approach that we develop later). The mechanisms for realizing the transparent box approach have been described elsewhere for a number of weak learners used in boosting, but we describe them here for completeness.

The classifier learning algorithm must use the weights so that it effectively learns from data drawn according to $\hat{D}$. This requirement is easy to apply for all learning algorithms which fit the statistical query model [13]. As shown in figure 1, many learning algorithms can be divided into two components: a portion which calculates the (approximate) expected value of some function (or query) $f$, and a portion which forms these queries and uses their output to construct a classifier. For example, neural networks, decision trees, and Naive Bayes classifiers can be
[Figure 1. The statistical query model: the learning algorithm exchanges query/reply pairs with a query oracle.]

constructed in this manner. Support vector machines are not easily constructible in this way, because the individual classifier is explicitly dependent upon individual examples rather than on statistics derived from the entire sample.

With finite data we cannot precisely calculate the expectation $E_{(x,y) \sim D}[f(x, y)]$. With high probability, however, we can approximate the expectation given a set of examples drawn independently from the underlying distribution $D$. Whenever we have a learning algorithm that can be decomposed as in figure 1, there is a simple recipe for using the weights directly. Instead of simulating the expectation with $\frac{1}{|S|} \sum_{(x,y) \in S} f(x, y)$, we use

$\frac{\sum_{(x,y,c) \in S} c \, f(x, y)}{\sum_{(x,y,c) \in S} c}$.

This method is equivalent to importance sampling for $\hat{D}$ using the distribution $D$, and so the modified expectation is an unbiased Monte Carlo estimate of the expectation w.r.t. $\hat{D}$.

Even when a learning algorithm does not fit this model, it may be possible to incorporate importance weights directly. We now discuss how to incorporate importance weights into some specific learning algorithms.

2.2.2 Naive Bayes and boosting

Naive Bayes learns by calculating empirical probabilities for each output $y$ using Bayes' rule and assuming that each feature is independent given the output:

$P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)} \approx \frac{\prod_i P(x_i \mid y) \, P(y)}{\prod_i P(x_i)}$.

Each probability estimate in the above expression can be thought of as a function of empirical expectations according to $D$, and thus it can be formulated in the statistical query model. For example, $P(x_i \mid y)$ is just the expectation of $I(X_i = x_i) I(Y = y)$ divided by the expectation of $I(Y = y)$. More specifically, to compute the empirical estimate of $P(x_i \mid y)$ with respect to $D$, we need to count the number of training examples that have $y$ as output, and those having $x_i$ as the $i$-th input dimension among those. When we compute these empirical estimates with respect to $\hat{D}$, we simply have to sum the weight of each example, instead of counting the examples. (This property is used in the implementation of boosted Naive Bayes [5].)
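To make the weighted-counting recipe concrete, here is a minimal sketch (the function and variable names are ours, not from the paper) that computes the Naive Bayes estimates by summing importance weights instead of counting examples:

```python
from collections import defaultdict

def weighted_nb_estimates(examples):
    """Estimate P(y) and P(x_i = v | y) from (x, y, c) triples by summing
    importance weights c instead of counting examples, so the estimates
    are taken with respect to the reweighted distribution D-hat."""
    total = 0.0
    class_w = defaultdict(float)   # sum of weights per class y
    feat_w = defaultdict(float)    # sum of weights per (feature index, value, y)
    for x, y, c in examples:
        total += c
        class_w[y] += c
        for i, v in enumerate(x):
            feat_w[(i, v, y)] += c
    p_y = {y: w / total for y, w in class_w.items()}
    p_xi_given_y = {k: w / class_w[k[2]] for k, w in feat_w.items()}
    return p_y, p_xi_given_y
```

With all importances $c = 1$ this reduces to ordinary counting, i.e. the usual cost-insensitive Naive Bayes estimates.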
To incorporate importance weights into AdaBoost [8], we give the importance weights to the weak learner in the first iteration, thus effectively drawing examples from $\hat{D}$. In the subsequent iterations, we use the standard AdaBoost rule to update the weights. Therefore, the weights are adjusted according to the accuracy on $\hat{D}$, which corresponds to the expected cost on $D$.

2.2.3 C4.5

C4.5 [16] is a widely used decision tree learner. There is a standard way of incorporating example weights into it, which in the original algorithm was intended to handle missing attributes (examples with missing attributes were divided into fractional examples, each with a smaller weight, during the growth of the tree). This same facility was later used by Quinlan in the implementation of boosted C4.5 [15].

2.2.4 Support Vector Machine

The SVM algorithm [11] learns the parameters $a$ and $b$ describing a linear decision rule $h(x) = \mathrm{sign}(a \cdot x + b)$, so that the smallest distance between each training example and the decision boundary (the margin) is maximized. It works by solving the following optimization problem:

minimize: $V(a, b, \xi) = \frac{1}{2} a \cdot a + C \sum_{i=1}^{n} \xi_i$
subject to: $\forall i: \; y_i (a \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$.

The constraints require that all examples in the training set are classified correctly up to some slack $\xi_i$. If a training example lies on the wrong side of the decision boundary, the corresponding $\xi_i$ is greater than $1$. Therefore, $\sum_{i=1}^{n} \xi_i$ is an upper bound on the number of training errors. The factor $C$ is a parameter that allows one to trade off training error and model complexity. The algorithm can be generalized to non-linear decision rules by replacing inner products with a kernel function in the formulas above.

The SVM algorithm does not fit the statistical query model. Despite this, it is possible to incorporate importance weights in a natural way. First, we note that $\sum_{i=1}^{n} c_i \xi_i$, where $c_i$ is the importance of example $i$, is an upper bound on the total cost. Therefore, we can modify $V(a, b, \xi)$ to

$V(a, b, \xi) = \frac{1}{2} a \cdot a + C \sum_{i=1}^{n} c_i \xi_i$.

Now $C$ controls model complexity versus total cost.
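The effect of the modified objective can be illustrated with a bare-bones subgradient descent on the importance-weighted hinge loss. This is our own simplified sketch (no bias term, fixed learning rate, $\lambda$ playing the role of $1/C$), not SVMLight's algorithm:

```python
def weighted_hinge_svm(data, lam=0.01, lr=0.1, epochs=200):
    """Linear classifier minimizing (lam/2)*||a||^2 + sum_i c_i*hinge(y_i, a.x_i)
    by subgradient descent -- the importance-weighted analogue of the SVM
    objective, with c_i scaling each example's slack penalty."""
    dim = len(data[0][0])
    a = [0.0] * dim
    for _ in range(epochs):
        grad = [lam * aj for aj in a]       # gradient of the regularizer
        for x, y, c in data:                # y in {-1, +1}, c = importance
            margin = y * sum(aj * xj for aj, xj in zip(a, x))
            if margin < 1:                  # hinge active: subgradient -c*y*x
                for j in range(dim):
                    grad[j] -= c * y * x[j]
        a = [aj - lr * gj for aj, gj in zip(a, grad)]
    return a

def predict(a, x):
    return 1 if sum(aj * xj for aj, xj in zip(a, x)) >= 0 else -1
```

On two conflicting examples at the same point, the classifier sides with whichever label carries the larger importance, which is exactly the behavior the weighted objective is meant to produce; with all $c_i = 1$ the procedure reduces to the ordinary hinge-loss objective.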
The SVMLight package [10] allows users to input weights $c_i$ and works with the modified $V(a, b, \xi)$ as above, although this feature has not yet been documented.

2.3 Black Box: Sampling methods

Suppose we do not have transparent box access to the learner. In this case, sampling is the obvious method to convert from one distribution of examples to another, so as to obtain a cost-sensitive learner using the translation theorem (Theorem 2.1). As it turns out, straightforward sampling does not work well in this case, motivating us to propose an alternative method based on rejection sampling.
2.3.1 Sampling-with-replacement

Sampling-with-replacement is a sampling scheme where each example $(x, y, c)$ is drawn according to the distribution $p(x, y, c) = c / \sum_{(x,y,c) \in S} c$. Many examples are drawn to create a new dataset $S'$. At first pass, this method appears useful because every example is effectively drawn from the distribution $\hat{D}$. In fact, very poor performance can result when using this technique, essentially due to overfitting caused by the fact that the examples in $S'$ are not drawn independently from $\hat{D}$, as we will elaborate in the section on experimental results (Section 3).

Sampling-without-replacement is also not a solution to this problem. In sampling-without-replacement, an example $(x, y, c)$ is drawn from the distribution $p(x, y, c) = c / \sum_{(x,y,c) \in S} c$, and the next example is drawn from the set $S - \{(x, y, c)\}$. This process is repeated, drawing from a smaller and smaller set according to the weights of the examples remaining in the set. To see how this method fails, note that sampling-without-replacement $m$ times from a set of size $m$ results in the original set, which (by assumption) is drawn from the distribution $D$, and not $\hat{D}$ as desired.

2.3.2 Cost-proportionate rejection sampling

There is another sampling scheme called rejection sampling [18] which allows us to draw examples independently from the distribution $\hat{D}$, given examples drawn independently from $D$. In rejection sampling, examples from $\hat{D}$ are obtained by first drawing examples from $D$, and then keeping (or accepting) the sample with probability proportional to $\hat{D}/D$. Here, we have $\hat{D}/D \propto c$, so we accept an example with probability $c/Z$, where $Z$ is some constant chosen so that $\max_{(x,y,c) \in S} c \leq Z$,4 leading to the name cost-proportionate rejection sampling. Rejection sampling results in a set $S'$ which is generally smaller than $S$. Furthermore, because inclusion of an example in $S'$ is independent of other examples, and the examples in $S$ are drawn independently, we know that the examples in $S'$ are distributed independently according to $\hat{D}$.
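A minimal sketch of cost-proportionate rejection sampling (the names and the synthetic data are ours): each example is kept independently with probability $c/Z$, so the expected size of $S'$ is $|S|$ times the average importance divided by $Z$, and high-importance examples are strongly over-represented:

```python
import random

def rejection_sample(S, Z, rng):
    """Cost-proportionate rejection sampling: keep (x, y, c) with
    probability c / Z, independently for each example."""
    return [(x, y, c) for (x, y, c) in S if rng.random() < c / Z]

rng = random.Random(0)
# Synthetic importance-weighted set: 100 rare positives with weight 9,
# 900 common negatives with weight 1.
S = [((i,), 1, 9.0) for i in range(100)] + [((i,), 0, 1.0) for i in range(900)]
Z = max(c for _, _, c in S)
Ssub = rejection_sample(S, Z, rng)
```

Here the average importance is $1.8$ and $Z = 9$, so $S'$ has about $1000 \times 1.8 / 9 = 200$ examples; all positives survive (acceptance probability $1$) while only about one ninth of the negatives do, so the subsample is roughly half positive, mirroring the class-balance shift reported in the experiments below.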
Using cost-proportionate rejection sampling to create a set $S'$ and then using a learning algorithm $A(S')$ is guaranteed to produce an approximately cost-minimizing classifier, as long as the learning algorithm $A$ achieves approximate minimization of classification error.

Theorem 2.2. (Correctness) For all cost-sensitive sample sets $S$, if cost-proportionate rejection sampling produces a sample set $S'$ and $h = A(S')$ achieves classification error

$E_{(x,y,c) \sim \hat{D}}[I(h(x) \neq y)] = \varepsilon$,

then $h$ approximately minimizes cost:

$E_{(x,y,c) \sim D}[c \, I(h(x) \neq y)] = \varepsilon N$,

where $N = E_{(x,y,c) \sim D}[c]$.

4. In practice, we choose $Z = \max_{(x,y,c) \in S} c$ so as to maximize the size of the set $S'$. A data-dependent choice of $Z$ is not formally allowed for rejection sampling. However, the introduced bias appears small when $|S|$ is large. A precise measurement of "small" is an interesting theoretical problem.

Proof. Rejection sampling produces a sample set $S'$ drawn independently from $\hat{D}$. By assumption, $A(S')$ outputs a classifier $h$ such that

$E_{(x,y,c) \sim \hat{D}}[I(h(x) \neq y)] = \varepsilon$.

By the translation theorem (Theorem 2.1), we know that

$E_{(x,y,c) \sim \hat{D}}[I(h(x) \neq y)] = \frac{1}{N} E_{(x,y,c) \sim D}[c \, I(h(x) \neq y)]$.

Thus, $E_{(x,y,c) \sim D}[c \, I(h(x) \neq y)] = \varepsilon N$.

2.3.3 Sample complexity of cost-proportionate rejection sampling

The accuracy of a learned classifier generally improves monotonically with the number of examples in the training set. Since cost-proportionate rejection sampling produces a smaller training set (by a factor of about $N/Z$), one would expect worse performance than using the entire training set. This turns out not to be the case in the agnostic PAC-learning model [17, 12], which formalizes the notion of probably approximately optimal learning from arbitrary distributions $D$.

Definition 2.1. A learning algorithm $A$ is said to be an agnostic PAC-learner for hypothesis class $H$, with sample complexity $m(\varepsilon, \delta)$, if for all $\varepsilon > 0$ and $\delta > 0$, $m = m(\varepsilon, \delta)$ is the least sample size such that for all distributions $D$ (over $X \times Y$), the classification error rate of its output $h$ is at most $\varepsilon$ more than the best achievable by any member of $H$ with probability at least $1 - \delta$, whenever the sample size exceeds $m$.

By analogy, we can formalize the notion of cost-sensitive agnostic PAC-learning.

Definition 2.2.
A learning algorithm $A$ is said to be a cost-sensitive agnostic PAC-learner for hypothesis class $H$, with cost-sensitive sample complexity $m(\varepsilon, \delta)$, if for all $\varepsilon > 0$ and $\delta > 0$, $m = m(\varepsilon, \delta)$ is the least sample size such that for all distributions $D$ (over $X \times Y \times C$), the expected cost of its output $h$ is at most $\varepsilon$ more than the best achievable by any member of $H$ with probability at least $1 - \delta$, whenever the sample size exceeds $m$.

We will now use this formalization to compare the cost-sensitive PAC-learning sample complexity of two methods: applying a given base classifier learning algorithm to a sample obtained through cost-proportionate rejection sampling, and applying the same algorithm on the original training set. We show that the cost-sensitive sample complexity of the latter method is lower-bounded by that of the former.
Theorem.3. (Sample Complexiy Comparison) Fix an arbirary base classifier learning algorihm A, and suppose ha m orig ε δ and m rej ε δ, respecively, are cos-sensiive sample complexiy of applying A on he original raining se, and ha of applying A wih cosproporionae rejecion sampling. Then, we have m orig ε δ Ω m rej ε δ Proof. Le m ε δ be he (cos-insensiive) sample complexiy of he base classifier learning algorihm A. (If no such funcion exiss, hen neiher m orig ε δ nor m rej ε δ exiss, and he heorem holds vacuously.) Since Z is an upper bound on he cos of misclassifying an example, we have ha he cos-sensiive sample complexiy of using he original raining se saisfies m orig ε δ Θ m Z ε δ This is because given a disribuion ha forces ε more classificaion error han opimal, anoher disribuion can be consruced, ha forces εz more cos han opimal, by assigning cos Z o all examples on which A errs. Now from Theorem. and noing ha he cenral limi heorem implies ha cos-proporionae rejecion sampling reduces he sample size by a facor of Θ N Z, he cossensiive sample complexiy for rejecion sampling is: m rej ε δ () δ Z Θ m N ε N A fundamenal heorem from PAC-learning heory saes ha m ε δ Ω ε ln δ [4]. When m ε δ Θ ε ln δ δ, Equaion Θ () implies: δ m rej ε δ Z N Θ ln m orig ε N ε Finally, noe ha when m ε δ grows faser han linear in ε, we have m rej ε δ o m orig ε δ, which finishes he proof. Noe ha he linear dependence of sample size on ε is only achievable by an ideal learning algorihm, and in pracice super-linear dependence is expeced, especially in he presence of noise. Thus, he above heorem implies ha cos-proporionae rejecion sampling minimizes cos beer han no sampling for wors case disribuions. This is a remarkable propery abou any sampling scheme, since one generally expecs ha predicive performance is compromised by using a smaller sample. 
Cost-proportionate rejection sampling seems to distill the original sample, obtaining a sample of smaller size which is at least as informative as the original.

2.3.4 Cost-proportionate rejection sampling with aggregation (costing)

From the same original training sample, different runs of cost-proportionate rejection sampling will produce different training samples. Furthermore, the fact that rejection sampling produces very small samples means that the time required for learning a classifier is generally much smaller. We can take advantage of these properties to devise an ensemble learning algorithm based on repeatedly performing rejection sampling from $S$ to produce multiple sample sets $S'_1, \ldots, S'_t$, and then learning a classifier for each set. The output classifier is the average over all learned classifiers. We call this technique costing:

Costing(Learner $A$, Sample Set $S$, count $t$)
1. For $i = 1$ to $t$ do
   (a) $S' =$ rejection sample from $S$ with acceptance probability $c/Z$.
   (b) Let $h_i = A(S')$.
2. Output $h(x) = \mathrm{sign}\left(\sum_{i=1}^{t} h_i(x)\right)$.

The goal in averaging is to improve performance. There is both empirical and theoretical evidence suggesting that averaging can be useful. On the empirical side, many people have observed good performance from bagging despite throwing away a $1/e$ fraction of the samples. On the theoretical side, there has been considerable work proving that the ability of an average of classifiers to overfit might be smaller than naively expected when a large margin exists. The preponderance of learning algorithms producing averaging classifiers provides significant evidence that averaging is useful.

Note that despite the extra computational cost of averaging, the overall computational time of costing is generally much smaller than that of a learning algorithm using the full sample set $S$ (with or without weights). This is the case because most learning algorithms have running times that are superlinear in the number of examples.

3 Empirical evaluation

We show empirical results using two real-world datasets. We selected datasets that are publicly available and for which cost information is available on a per-example basis.
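The costing procedure of section 2.3.4 can be sketched in a few lines of code. The sketch below is ours: the base learner is a deliberately trivial majority-class stub standing in for an arbitrary cost-insensitive learner $A$, and all names are our own:

```python
import random

def rejection_sample(S, Z, rng):
    # Keep each (x, y, c) independently with probability c / Z.
    return [ex for ex in S if rng.random() < ex[2] / Z]

def majority_base_learner(Ssub):
    """Toy stand-in for A: predicts the most frequent label in its sample."""
    ones = sum(1 for _, y, _ in Ssub if y == 1)
    label = 1 if ones * 2 >= len(Ssub) else -1
    return lambda x: label

def costing(learner, S, t, seed=0):
    """Train t classifiers on independent cost-proportionate rejection
    samples of S and combine them by the sign of the summed votes."""
    rng = random.Random(seed)
    Z = max(c for _, _, c in S)
    hs = [learner(rejection_sample(S, Z, rng)) for _ in range(t)]
    return lambda x: 1 if sum(h(x) for h in hs) >= 0 else -1
```

On a set with a few high-importance positives and many low-importance negatives, the plain majority learner on $S$ would predict the negative class everywhere, while the costing ensemble predicts positive, because each rejection sample is dominated by the high-importance examples.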
Both datasets are from the direct marketing domain. Although there are many other cost-sensitive data mining domains, such as credit card fraud detection and medical diagnosis, publicly available data are lacking.

3.1 The datasets used

3.1.1 KDD-98 dataset

This is the well-known and challenging dataset from the KDD-98 competition, now available at the UCI KDD repository [9]. The dataset contains information about persons who have made donations in the past to a particular charity. The decision-making task is to choose which donors to mail a request for a new donation. The measure of success is the total profit obtained in the mailing campaign.
The dataset is divided in a fixed way into a training set and a test set. Each set consists of approximately 96,000 records for which it is known whether or not the person made a donation and how much the person donated, if a donation was made. The overall percentage of donors is about 5%. Mailing a solicitation to an individual costs the charity $0.68. The donation amount for persons who respond varies from $1 to $200. The profit obtained by soliciting every individual in the test set is $10,560, while the profit attained by the winner of the KDD-98 competition was $14,712.

The importance of each example is the absolute difference in profit between mailing and not mailing an individual. Mailing results in the donation amount minus the cost of mailing. Not mailing results in zero profit. Thus, for positive examples (respondents), the importance varies from $0.32 to $199.32. For negative examples (non-respondents), it is fixed at $0.68.

3.1.2 DMEF-2 dataset

This dataset can be obtained from the DMEF dataset library [1] for a nominal fee. It contains customer buying history for 96,551 customers of a nationally known catalog. The decision-making task is to choose which customers should receive a new catalog so as to maximize the total profit on the catalog mailing campaign. Information on the cost of mailing a catalog is not available, so we fixed it at $2. The overall percentage of respondents is about 2.5%. The purchase amount for customers who respond varies from $3 to $6,247.

As is the case for the KDD-98 dataset, the importance of each example is the absolute difference in profit between mailing and not mailing a customer. Therefore, for positive examples (respondents), the importance varies from $1 to $6,245. For negative examples (non-respondents), it is fixed at $2. We divided the dataset in half to create a training set and a test set. As a baseline for comparison, the profit obtained by mailing a catalog to every individual is $26,474 on the training set and $27,584 on the test set.

3.2 Experimental results

3.2.1
Transparent box results

Table 1 shows the results for Naive Bayes, boosted Naive Bayes (100 iterations), C4.5 and SVMLight on the KDD-98 (top) and DMEF-2 (bottom) datasets, with and without the importance weights. Without the importance weights, the classifiers label very few of the examples positive, resulting in small (and even negative) profits. With the costs given as weights to the learners, the results improve significantly for all learners except C4.5. Cost-sensitive boosted Naive Bayes gives results comparable to the best so far with this dataset [19], obtained using more complicated methods.

We optimized the parameters of the SVM by cross-validation on the training set. Without weights, no setting of the parameters prevented the algorithm from labeling all examples as negative. With weights, the best parameters were

KDD-98:
Method        Without Weights   With Weights
Naive Bayes   0.4               367
Boosted NB    -.36              4489
C4.5          0                 8
SVMLight      0                 3683

DMEF-2:
Method        Without Weights   With Weights
Naive Bayes   646               3608
Boosted NB                      3638
C4.5          0                 478
SVMLight      0                 36443

Table 1. Test set profits with transparent box.

a polynomial kernel with degree 3 and $C = 5 \times 10^{-5}$ for KDD-98, and a linear kernel with $C = 0.0005$ for DMEF-2. However, even with this parameter setting, the results are not so impressive. This may be a hard problem for margin-based classifiers because the data is very noisy. Note also that running SVMLight on this dataset takes about 3 orders of magnitude longer than AdaBoost with 100 iterations.

The failure of C4.5 to achieve good profits with importance weights is probably related to the fact that the facility for incorporating weights provided in the algorithm is heuristic. So far, it has been used only in situations where the weights are fairly uniform (as is the case for fractional instances due to missing data). These results indicate that it might not be suitable for situations with highly non-uniform costs. The fact that it is non-trivial to incorporate costs directly into existing learning algorithms is the motivation for the black box approaches that we present here.

3.2.2
Black box results

Table 2 shows the results of applying the same learning algorithms to the KDD-98 and DMEF-2 data using training sets of different sizes obtained by sampling-with-replacement. For each size, we repeat the experiments 10 times with different sampled sets to get the mean and standard error (in parentheses). The training set profits are on the original training set from which we draw the sampled sets.

The results confirm that applying sampling-with-replacement to implement the black box approach can result in very poor performance due to overfitting. When there are large differences in the magnitude of the importance weights, it is typical for an example to be picked twice (or more). In Table 2, we see that as we increase the sampled training set size and, as a consequence, the number of duplicate examples in the training set, the training profit becomes larger while the test profit becomes smaller for C4.5. Examples which appear multiple times in the training set can defeat the complexity control mechanisms built into learning algorithms. For example, suppose that we have a decision tree algorithm which divides the training data into a growing set (used to construct a tree)
KDD-98:
          1000                    10000                   100000
          Training   Test         Training   Test         Training   Test
NB        5 (330)    0850 (35)    8 (55)     993 (85)     53 (4)     06 (56)
BNB       658 (3)    76 (383)     3838 (65)  886 ()       407 (5)    335 (59)
C4.5      4 (55)     9548 (33)    083 (7)    7599 (30)    40704 (5)  59 (07)
SVM       030 (37)   03 (8)      8 (8)      05 (6)       3565 (9)   808 (0)

DMEF-2:
          1000                    10000                   100000
          Training   Test         Training   Test         Training   Test
NB        3398 (495) 3464 (49)    374 (793)  33956 (798)  335 (475)  34506 (405)
BNB       3390 (558) 30304 (660)  3480 (806) 334 (77)     34505 (8)  3889 (733)
C4.5      37905 (467) 40 (93)     67960 (763) 988 (458)   7574 (05)  349 (59)
SVM       8837 (09)  3077 (96)    363 ()     3585 (89)    34309 (79) 33674 (600)

Table 2. Profits using sampling-with-replacement.

and a pruning set (used to prune the tree for complexity control purposes). If the pruning set contains examples which appear in the growing set, the complexity control mechanism is defeated. Although not as markedly as for C4.5, we see the same phenomenon for the other learning algorithms. In general, the larger the resampled set, the larger the difference between training set profit and test set profit. And even with 100,000 examples, we do not obtain the same test set results as giving the weights directly to boosted Naive Bayes and SVM.

The fundamental difficulty here is that the samples in $S'$ are not drawn independently from $\hat{D}$. In particular, if $\hat{D}$ is a density, the probability of observing the same example twice given independent draws is 0, while the probability under sampling-with-replacement is greater than 0. Thus sampling-with-replacement fails because the sampled set $S'$ is not constructed independently.

Figure 2 shows the results of costing on the KDD-98 and DMEF-2 datasets, with the base learners and $Z = 200$ or $Z = 6247$, respectively. We repeated the experiment 10 times for each $t$ and calculated the mean and standard error of the profit. The results for $t = 1$, $t = 100$ and $t = 200$ are also given in Table 3. In the KDD-98 case, each resampled set has only about 600 examples, because the importance of the examples varies from 0.68 to 199.32 and there are few important examples.
About 55% of the examples in each set are positive, even though on the original dataset the percentage of positives is only 5%. With $t = 100$, the C4.5 version yields profits around $15,000, which is exceptional performance for this dataset. In the DMEF-2 case, each set has only about 35 examples, because the importances vary even more widely (from 1 to 6,245) and there are even fewer examples with a large importance than in the KDD-98 case. The percentage of positive examples in each set is about 50%, even though on the original dataset it was only 2.5%.

For learning the SVMs, we used the same kernels as we did in section 3.2.1 and the default setting for $C$.

KDD-98:
        t = 1        t = 100      t = 200
NB      667 (9)      3 (0)        363 (68)
BNB     377 (63)     489 (9)      474 (6)
C4.5    968 (5)      4935 (0)     506 (6)
SVM     004 (393)    3075 (4)     35 (56)

DMEF-2:
        t = 1        t = 100      t = 200
NB      687 (3444)   3767 (335)   3769 (39)
BNB     440 (839)    37376 (393)  3789 (364)
C4.5    7089 (345)   3699 (374)   37500 (307)
SVM     7 (3487)     33584 (5)    3590 (849)

Table 3. Test set profits using costing.

In that section, we saw that by feeding the weights directly to the SVM, we obtain a profit of $13,683 on the KDD-98 dataset and of $36,443 on the DMEF-2 dataset. Here, we obtain comparable profits of roughly $13,000 and $35,000, respectively. However, this did not require parameter optimization and, even with $t = 200$, was much faster to train. The reason for the speedup is that the time complexity of SVM learning is generally superlinear in the number of training examples.

4 Discussion

Costing is a technique which produces a cost-sensitive classification from a cost-insensitive classifier using only black box access. This simple method is fast, results in excellent performance, and often achieves drastic savings in computational resources, particularly with respect to space requirements. This last property is especially desirable in applications of cost-sensitive learning to domains that involve massive amounts of data, such as fraud detection, targeted marketing, and intrusion detection.

Another desirable property of any reduction is that it applies to the theory as well as to concrete algorithms.
Thus, the reduction presented in the present paper allows us to automatically apply any future results in cost-insensitive classification to cost-sensitive classification. For example, a
[Figure 2. Costing: test set profit vs. number of sampled sets $t$. Panels: NB, BNB, C4.5 and SVM on the KDD-98 (top row) and DMEF-2 (bottom row) datasets.]

bound on the future error rate of $A(S')$ implies a bound on the expected cost with respect to the distribution $D$. This additional property of a reduction is especially important because cost-sensitive learning theory is still young and relatively unexplored.

One direction for future work is multiclass cost-sensitive learning. If there are $K$ classes, the minimal representation of costs is $K - 1$ weights. A reduction to cost-insensitive classification using these weights is an open problem.

References

[1] Anifantis, S. The DMEF Data Set Library. The Direct Marketing Association, New York, NY, 2002. [http://www.the-dma.org/dmef/dmefdset.shtml]
[2] Domingos, P. MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, 155-164, 1999.
[3] Drummond, C. & Holte, R. Exploiting the cost (in)sensitivity of decision tree splitting criteria. Proceedings of the 17th International Conference on Machine Learning, 239-246, 2000.
[4] Ehrenfeucht, A., Haussler, D., Kearns, M. & Valiant, L. A general lower bound on the number of examples needed for learning. Information and Computation, 82:3, 247-261, 1989.
[5] Elkan, C. Boosting and naive Bayesian learning (Technical Report). University of California, San Diego, 1997.
[6] Elkan, C. The foundations of cost-sensitive learning. Proceedings of the 17th International Joint Conference on Artificial Intelligence, 973-978, 2001.
[7] Fan, W., Stolfo, S., Zhang, J. & Chan, P. AdaCost: Misclassification cost-sensitive boosting.
Proceedings of the 16th International Conference on Machine Learning, 97-105, 1999.
[8] Freund, Y. & Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:1, 119-139, 1997.
[9] Hettich, S. & Bay, S. D. The UCI KDD Archive. University of California, Irvine. [http://kdd.ics.uci.edu/]
[10] Joachims, T. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
[11] Joachims, T. Estimating the generalization performance of a SVM efficiently. Proceedings of the 17th International Conference on Machine Learning, 431-438, 2000.
[12] Kearns, M., Schapire, R., & Sellie, L. Toward Efficient Agnostic Learning. Machine Learning, 17, 115-141, 1998.
[13] Kearns, M. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45:6, 983-1006, 1998.
[14] Margineantu, D. Class probability estimation and cost-sensitive classification decisions. Proceedings of the 13th European Conference on Machine Learning, 270-281, 2002.
[15] Quinlan, J. R. Boosting, Bagging, and C4.5. Proceedings of the Thirteenth National Conference on Artificial Intelligence, 725-730, 1996.
[16] Quinlan, J. R. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[17] Valiant, L. A theory of the learnable. Communications of the ACM, 27:11, 1134-1142, 1984.
[18] von Neumann, J. Various techniques used in connection with random digits. National Bureau of Standards, Applied Mathematics Series, 12, 36-38, 1951.
[19] Zadrozny, B. and Elkan, C. Learning and making decisions when costs and probabilities are both unknown. Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining, 203-213, 2001.