Multiclass-Multilabel Classification with More Classes than Examples

Ofer Dekel
Microsoft Research
One Microsoft Way, Redmond, WA, 98052, USA

Ohad Shamir
School of Computer Science and Engineering
The Hebrew University, Jerusalem, 91904, Israel

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.

Abstract

We discuss multiclass-multilabel classification problems in which the set of classes is extremely large. Most existing multiclass-multilabel learning algorithms expect to observe a reasonably large sample from each class, and fail if they receive only a handful of examples per class. We propose and analyze the following two-stage approach: first use an arbitrary (perhaps heuristic) classification algorithm to construct an initial classifier, then apply a simple but principled method to augment this classifier by removing harmful labels from its output. A careful theoretical analysis allows us to justify our approach under some reasonable conditions (such as label sparsity and power-law distribution of class frequencies), even when the training set does not provide a statistically accurate representation of most classes. Surprisingly, our theoretical analysis continues to hold even when the number of classes exceeds the sample size. We demonstrate the merits of our approach on the ambitious task of categorizing the entire web using the 1.5 million categories defined on Wikipedia.

1 Introduction

In multiclass-multilabel classification, the goal is to assign one or more labels to each instance in an instance space. Each label associates an instance with one of $k$ possible classes. An example of a multiclass-multilabel problem is document categorization, which is the problem of associating each document in a corpus with one or more topics (e.g. McCallum, 1999). Multiclass-multilabel problems are also common in other fields, such as computer vision (Boutell et al., 2004) and computational biology (Barutcuoglu et al., 2006). If a training set of $m$ labeled examples is available, a multiclass-multilabel classifier can be learned using a supervised
machine learning algorithm.

Typically, learning algorithms for multiclass-multilabel problems are developed and analyzed under the assumption that $k$ is held constant as $m$ grows. In this paper, we consider a different version of the multiclass-multilabel problem, where the set of classes grows with the number of examples (i.e. $k = \Omega(m)$). For example, this situation occurs when the set of classes is a folksonomy, a set of classes that emerges from a collaborative tagging or a social tagging scheme.

The concrete problem that motivates this work is the problem of categorizing the entire web using the categories defined by the Wikipedia website. At the bottom of every Wikipedia article there is a short list of categories, and we define our set of classes to be the union of these lists. The Wikipedia articles themselves can be used as training examples, since they are labeled web pages. When new articles are added to Wikipedia, they often introduce new categories. At the time of writing this paper, Wikipedia contains 2.9 million articles and almost 1.5 million categories.

The Internet provides many other examples where $k$ grows with $m$. For instance, photo sharing websites allow users to annotate their photos with keywords. The implied classification task is to recommend keywords whenever new photos are uploaded. Assuming that the site does not impose any restrictions on the keywords that may be used, the set of distinct keywords is likely to grow as more and more photos are uploaded to the site.

For such datasets, applying standard techniques is problematic for two reasons. First, since $k$ scales with $m$, many classes occur only a handful of times in the training set, so we do not have a statistically reasonable sample from each class. Second, the concrete values of $m$ and $k$ we deal with are very large, to the point that most existing multiclass-multilabel learning algorithms become computationally intractable. For example, the most common approach to multiclass-multilabel learning is to train a separate binary classifier for each class. For the datasets we have in mind, such an approach is both statistically untenable (due to the small number of examples per class) and computationally impractical (as it requires maintaining millions of hypotheses for all the classes). Similar problems are encountered for other standard approaches, such as those based on ranking (e.g. Amit et al., 2007; Crammer and Singer, 2003; Lebanon and Lafferty, 2002).

In practice, the only realistic solution is to turn to much simpler classification algorithms, such as nearest neighbor methods. For example, consider once again the problem of categorizing the entire web using Wikipedia categories, and assume that we have access to the logs of an Internet search engine. We can use the logs to construct a click graph, a bipartite graph whose vertices include all web pages and all queries ever issued to the search engine. An edge is drawn between a query Q and a web page W if enough users issued the query Q and then clicked on W. We can use the graph distance induced by the click graph to define a metric over web pages. Given a set of labeled Wikipedia pages, we can label the entire web using a nearest neighbor type algorithm over this metric. In other words, labels are propagated from the labeled Wikipedia pages along the edges of the click graph to the rest of the web. Such simple algorithms can be implemented in a way that is almost entirely insensitive to the number of classes, but their simplicity often comes at the cost of lower classification accuracy.

One way to improve the accuracy of a simple classification algorithm is to refine its output with a separate post-learning step. Taking such a two-stage approach is common in other areas of machine learning. For example, in ranking problems it is common to run a simple and fast algorithm to obtain an initial ranking, and then to run a more accurate re-ranking algorithm only on the top few results. In this paper, we focus on a post-learning step that modifies a classifier by pruning certain labels from the output. In other words, the original classifier associates an instance with a set of classes and the post-learning step deletes certain classes from this set. The intuition behind
this approach is that in such massive multiclass-multilabel datasets many classes are inherently hard to learn, and attempting to predict them decreases the overall accuracy of the classifier.

We propose and analyze, both theoretically and experimentally, a simple label-pruning method. The method is based on comparing the number of true-positives and false-positives in each class, and discarding classes where the ratio of false to true positives exceeds a certain threshold. Returning to the example of categorizing the web, the initial nearest neighbor algorithm is likely to find that many web pages on classical composers turn out to be close neighbors of the Wikipedia article on Mozart. The nearest neighbor classifier indiscriminately assigns all of the Wikipedia categories that are associated with the article on Mozart to all of these pages. One of these categories is "People Born in 1756", which is likely to have many false positives across the validation set. Intuitively, the category "People Born in 1756" is incompatible with the click-graph based metric we have chosen, namely, our metric is unlikely to put different web pages from this class in close proximity to each other. In this situation, our label-pruning method removes this class from the set of classes the classifier is allowed to output.

While our method is simple and straightforward to implement, its analysis is quite tricky, since it is based on premises that might initially appear to be statistically unacceptable. After all, our basic assumption is that most classes are very rare, so the decision to drop a label may be based on statistically insufficient evidence. Say that we see two false-positives and one true-positive of a given class in our validation set: can we confidently decide to remove this label based on this empirical evidence?
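To make the question concrete, the following is a minimal sketch of the per-class decision. It assumes the weighting that Sections 2 and 3 make precise: each true positive counts $1-\gamma$, each false positive counts $\gamma$, and a label is pruned when its weighted false positives outweigh its weighted true positives. The function name `labels_to_prune`, the set-based data layout, and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def labels_to_prune(predicted, actual, gamma):
    """Return the set of labels that the pruning rule would remove.

    predicted, actual: lists of label sets, one pair per validation example.
    gamma: weight of a false positive; a true positive weighs 1 - gamma.
    """
    true_pos, false_pos = Counter(), Counter()
    for pred, truth in zip(predicted, actual):
        for label in pred:
            if label in truth:
                true_pos[label] += 1
            else:
                false_pos[label] += 1
    # The common normalization of the two counts cancels, so raw counts suffice.
    seen = set(true_pos) | set(false_pos)
    return {j for j in seen if (1 - gamma) * true_pos[j] < gamma * false_pos[j]}


# Toy validation set (hypothetical): one true positive and two false
# positives for "born_1756", three true positives for "composers".
predicted = [{"born_1756", "composers"}] * 3
actual = [{"born_1756", "composers"}, {"composers"}, {"composers"}]

print(labels_to_prune(predicted, actual, gamma=0.5))
print(labels_to_prune(predicted, actual, gamma=0.1))
```

With equal weights ($\gamma = 0.5$) the label with two false positives and one true positive is pruned; with $\gamma = 0.1$, where false positives count for little, the same evidence is not enough to prune it.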
The key to the formal analysis of our technique is to think of its overall effect rather than considering its effect on each individual class. Indeed, we cannot conclusively evaluate each class and our technique will most certainly mistake some good classes for bad ones. Nevertheless, we can show that our pruning criterion removes more bad labels than good ones, and overall improves the accuracy of our classifier, under mild conditions that often hold in practice. Concretely, we assume that the class frequencies follow a power-law distribution and that every example only has a rather small number of labels. To our knowledge, our theoretical approach is unique and quite distinct from previous analyses of multiclass-multilabel learning algorithms. Most previous such analyses build on techniques originally developed for the analysis of binary classification algorithms, and therefore require at least some degree of class-wise convergence.

We conclude our paper with a set of experiments, in which we validate our approach on the task of categorizing web pages using the set of 1.5 million Wikipedia categories. The simplicity of our approach enables us to perform these experiments on a single server, without requiring a large cluster computer.

1.1 Related Work

The work in (Zhang, 2004) deals with multiclass classification and bears similarities to our work, in that the space of possible classes can be very large compared to the size of the dataset. However, the analysis there is specific to multiclass rather than multiclass-multilabel learning (i.e. each instance is assigned only a single label), and focuses on large margin classifiers with a particular rule for choosing the label of each instance.

A more closely related paper is (Hsu et al., 2009), which also deals with massive multiclass-multilabel classification. It proposes a clever method, where a predictor is trained on a compressed representation of the original label vector. The original labels are reconstructed using techniques from compressed sensing. The problem setting and some of the assumptions made (such as label sparsity) are similar to our work. However, the approach of (Hsu et al., 2009) applies only to learning algorithms that regress on a real-valued compressed label vector. This is often not the case with algorithms designed for massive datasets, such as the click-graph based approach described earlier. In contrast, our approach makes no assumptions about the learning algorithm.

2 Setting and Notation

We assume that the learning task at hand is a supervised multiclass-multilabel problem. Formally, let $\mathcal{X}$ be an arbitrary measurable space, $\mathcal{Y} = \{0,1\}^k$, and let $\mathcal{D}$ be an unknown distribution on the product space $\mathcal{X} \times \mathcal{Y}$. Each element $(x, y)$ in this space is composed of an instance $x$ and a label vector $y$, which is a vector of indicators $y = (y_1, \ldots, y_k)$ that specifies which classes are associated with $x$. We assume that label vectors sampled from $\mathcal{D}$ are sparse, namely, that $\Pr\left(\sum_j y_j \le s\right) = 1$ for some constant $s$. A classifier is a function $h : \mathcal{X} \to \mathcal{Y}$ that maps an instance $x$ to a label vector $\hat{y} = h(x)$. We restrict our discussion to classifiers that output sparse label vectors, namely $\sum_j \hat{y}_j \le s$.

We evaluate the accuracy of a classifier using a loss function $\ell(h(x), y)$, which measures the disparity between the predicted label set and the actual label set. In this paper, we focus on a simple weighted loss function that is parameterized by $\gamma \in (0,1)$, and defined as

$$\ell(\hat{y}, y) = \frac{1}{s} \sum_{j=1}^{k} \Big( (1-\gamma)\, \mathbb{1}[\hat{y}_j = 0,\, y_j = 1] + \gamma\, \mathbb{1}[\hat{y}_j = 1,\, y_j = 0] \Big), \tag{1}$$

where $\mathbb{1}[\cdot]$ is the indicator function. The parameter $\gamma$ controls the importance of false negatives vs. false positives, and the normalization by $s$ ensures that the loss is always bounded in $[0,1]$. Our ultimate goal is to obtain a classifier $h$ with a small risk, which is defined as $E_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]$.

We distinguish between two phases of the learning process. In the learning phase, we use some learning algorithm to obtain an initial classifier $h$. We then perform a post-learning phase, where we find a label transformation function $\phi : \mathcal{Y} \to \mathcal{Y}$, such that the final classifier is the composition $\phi \circ h$. In this paper, we focus on the
post-learning phase, and make no assumptions on the learning phase or on the quality of the initial classifier. From the perspective of the post-learning phase, the initial classifier $h$ is simply a predefined function. For simplicity, we assume that the data used to train $h$ is independent from the data used in the post-learning phase to train $\phi$.

In principle, the label transformation function $\phi$ can be arbitrarily complex. In this paper, we focus on the simple set of label pruning rules. A label pruning rule $\phi_{\tilde{y}}$ corresponds to an element $\tilde{y} \in \{0,1\}^k$, and $\phi_{\tilde{y}}(y)$ is the vector resulting from removing the labels represented by $\tilde{y}$ from the set of labels represented by $y$. Such rules are simple to implement and are particularly useful in massive multilabel problems, where many labels are both inherently noisy and very rare. In such cases, refraining from predicting these labels can actually improve the final classifier's performance.

The four basic quantities we work with are the risk and the empirical risk of $h$ and of $\phi \circ h$. Letting $S$ denote an i.i.d. sample of size $m$ from $\mathcal{D}$, we define the initial empirical risk $\hat{R}_0 = \frac{1}{m} \sum_{(x,y) \in S} \ell(h(x), y)$, the initial risk $R_0 = E_{(x,y)}[\ell(h(x), y)]$, the final empirical risk $\hat{R}_\phi = \frac{1}{m} \sum_{(x,y) \in S} \ell(\phi(h(x)), y)$, and the final risk $R_\phi = E_{(x,y)}[\ell(\phi(h(x)), y)]$. Our goal is to find a pruning rule $\phi$ such that $R_\phi$ is as small as possible. For the analysis, we need to describe these quantities in an alternative form, as specified in the following easy-to-prove lemma:

Lemma 1. For a given classifier $h$, define

$$\hat{p}_{j,1} = \frac{1-\gamma}{sm} \sum_{(x,y) \in S} \mathbb{1}[h(x)_j = y_j = 1], \qquad \hat{p}_{j,0} = \frac{\gamma}{sm} \sum_{(x,y) \in S} \mathbb{1}[h(x)_j = 1,\, y_j = 0],$$

$$\hat{p}_{j,-1} = \frac{1}{sm} \sum_{(x,y) \in S} \Big( (1-\gamma)\, \mathbb{1}[h(x)_j = 0,\, y_j = 1] + \gamma\, \mathbb{1}[h(x)_j = 1,\, y_j = 0] \Big).$$

Let $p_{j,-1}$, $p_{j,1}$, and $p_{j,0}$ be the expected values over the sample $S$ of $\hat{p}_{j,-1}$, $\hat{p}_{j,1}$, and $\hat{p}_{j,0}$ respectively. Also, for a fixed pruning rule $\phi$, let $\mathbb{1}[\text{label } j \text{ pruned}]$ be an indicator that equals 1 if and only if the pruning rule $\phi$ removes label $j$. Then it holds that

$$\hat{R}_0 = \sum_{j=1}^{k} \hat{p}_{j,-1}, \qquad R_0 = \sum_{j=1}^{k} p_{j,-1},$$

$$\hat{R}_\phi = \sum_{j=1}^{k} \Big( \hat{p}_{j,-1} + \mathbb{1}[\text{label } j \text{ pruned}]\, (\hat{p}_{j,1} - \hat{p}_{j,0}) \Big), \qquad R_\phi = \sum_{j=1}^{k} \Big( p_{j,-1} + \mathbb{1}[\text{label } j \text{ pruned}]\, (p_{j,1} - p_{j,0}) \Big).$$

3 The Pruning Method

Recall that our goal is to reduce the final risk $R_\phi$. The expression for $R_\phi$ given in Lemma 1 suggests that $R_\phi$ can be reduced by
removing those labels for which $p_{j,0} > p_{j,1}$. Unfortunately, $p_{j,1}$ and $p_{j,0}$ are unknown quantities that depend on $\mathcal{D}$, and we must resort to using their empirical counterparts $\hat{p}_{j,1}$ and $\hat{p}_{j,0}$. Specifically, our simple label pruning procedure proceeds as follows: given a sample $S$, calculate $\hat{p}_{j,1}$ and $\hat{p}_{j,0}$, and choose the label pruning rule $\phi$ that removes all labels for which $\hat{p}_{j,1} < \hat{p}_{j,0}$. In other words, this procedure prunes any label for which the ratio of false positives to true positives exceeds $(1-\gamma)/\gamma$. Notice that this makes $\phi$ a random function that depends on the randomness of the sample $S$. For the theoretical analysis, it will be convenient to view $R_\phi$ and $\hat{R}_\phi$ as random variables, which depend on the random draw of $S$.

This algorithm essentially attempts to decrease the final empirical risk $\hat{R}_\phi$ in lieu of $R_\phi$. However, notice that in our setting where $k$ scales with $m$, we cannot assume that each and every $\hat{p}_{j,1}, \hat{p}_{j,0}$ is an accurate estimate of $p_{j,1}, p_{j,0}$. In fact, our analysis shows that $\hat{R}_\phi$ is generally not a good estimator of $R_\phi$. Nevertheless, we can prove that our method reduces the final risk $R_\phi$ compared to the initial risk $R_0$ with high probability, under mild conditions.

4 Theoretical Analysis

Our pruning procedure works by making the empirical risk $\hat{R}_\phi$ as small as possible. In this section, we show that this is also likely to make $R_\phi$ smaller than $R_0$. The straightforward theoretical approach would be to show that for reasonably large samples, $\hat{R}_0$ is close to $R_0$ and $\hat{R}_\phi$ is close to $R_\phi$. While the first premise is easy to show via a large deviation inequality, it turns out that $\hat{R}_\phi$ does not necessarily converge to $R_\phi$ when the number of classes grows with the number of examples. This is implied by the following theorem and the discussion which follows. Its proof is a simple consequence of the definitions, and is omitted due to lack of space.

Theorem 1. $E[R_\phi - \hat{R}_\phi]$ is lower bounded by

$$\sum_{j=1}^{k} \Pr(\text{label } j \text{ pruned})\, (p_{j,1} - p_{j,0}).$$

If we were to assume that $k$ is fixed, we might expect $\hat{p}_{j,1}, \hat{p}_{j,0}$ to converge to $p_{j,1}, p_{j,0}$ uniformly for all $j = 1, \ldots, k$. Since our method prunes labels for which $\hat{p}_{j,1} < \hat{p}_{j,0}$, we would have that $\Pr(\text{label } j \text{ pruned})\,(p_{j,1} - p_{j,0})$ converges to a non-positive quantity uniformly for any $j$, and thus our lower bound would converge to a non-positive number. However, when
we assume that $k$ grows with $m$, $\hat{p}_{j,1}, \hat{p}_{j,0}$ need not converge uniformly to $p_{j,1}, p_{j,0}$, and the correlation between $\Pr(\text{label } j \text{ pruned})$ and the sign of $p_{j,1} - p_{j,0}$ can remain weak regardless of the sample size. To give a concrete example, if we take $\gamma = 1/2$, $s = 10$ and assume that $p_{j,1} = 1/(3k)$, $p_{j,0} = 1/(6k)$ for all $j$, then we have by the theorem above that

$$E[R_\phi - \hat{R}_\phi] \ge \frac{1}{6k} \sum_{j=1}^{k} \Pr(\text{label } j \text{ pruned}).$$

It can be shown that when $m, k \to \infty$ but, say, $m/k = 3$, the right hand side above converges to a strictly positive constant. Therefore, it is possible that our lower bound will remain larger than some positive constant regardless of sample size, which implies that $\hat{R}_\phi$ does not converge to $R_\phi$ in such cases. This observation precisely captures the difficulty of working with a sample that does not sufficiently represent many of the individual classes in the problem, and is the reason why most existing algorithms are inadequate when the number of classes is not fixed.

Nevertheless, we can show that it is possible to analyze the behavior of $R_\phi$ directly. Specifically, we prove that $R_\phi$ is well behaved when the training set is large enough, even when $k$ is very large and grows with $m$. Namely, although the empirical quantities do not necessarily correspond to their expected values, we can still provide high probability guarantees that our pruning method reduces the overall risk of the classifier. In a nutshell, the analysis consists of proving that $|R_\phi - E[R_\phi]|$ is small with high probability (where the expectations are taken over the random draw of the sample $S$), and then directly proving that $E[R_\phi]$ is strictly smaller than $R_0$, under mild conditions.

The first part of the proposed approach is formalized in the following theorem. Informally, it states that when $m$ is large enough, $R_\phi$ is arbitrarily close to its expectation with arbitrarily high probability. Note that this bound does not depend at all on $k$, the number of classes.

Theorem 2. For any fixed $\epsilon > 0$, it holds that

$$\Pr\left( \big| R_\phi - E[R_\phi] \big| > \frac{2s\, m^{-1/6+\epsilon}}{\gamma(1-\gamma)} + m^{2/3} \exp\left(-m^{2\epsilon}\right) \right) \le 2\, m^{2/3} \exp\left(-m^{2\epsilon}\right).$$

The proof is presented in the appendix. Intuitively, the idea is to distinguish between labels for which $|p_{j,1} - p_{j,0}|$ is large, and labels
for which this difference is small. The first type of labels are more common in the data, and thus we can reliably estimate $p_{j,1} - p_{j,0}$ and decide whether to prune them or not. On the other hand, there cannot be too many such labels, because $\sum_j (p_{j,1} + p_{j,0})$ is a bounded quantity. This effectively limits the dimensionality of the problem regardless of the parameter $k$. Whenever $|p_{j,1} - p_{j,0}|$ is small, the pruning process is noisy and prone to error, but it can be shown that these cases do not influence $R_\phi$ too much. A careful formalization of these ideas, using Bernstein's and McDiarmid's large deviation bounds, allows us to show that $R_\phi$ concentrates around its expectation with high probability, regardless of $k$.

Next, we need to show that $R_0 - E[R_\phi]$ is strictly positive, to prove that our method indeed reduces the final risk. It turns out that the exact value of $R_0 - E[R_\phi]$ is highly dependent on the specific values of $p_{j,1}$ and $p_{j,0}$ for each $j$. Intuitively, if labels for which $p_{j,0} > p_{j,1}$ are pruned with high probability and labels for which $p_{j,0} \le p_{j,1}$ are pruned with low probability, we expect $R_0 - E[R_\phi]$ to be large. Although it is possible to provide positive lower bounds on $R_0 - E[R_\phi]$ in terms of these quantities, they are not particularly enlightening. Instead, the theorem below will allow us to characterize a mild condition, under which we can expect $R_0 - E[R_\phi]$ to be strictly positive. A proof appears in the appendix.

Theorem 3. The difference $R_0 - E[R_\phi]$ is at least

$$\sum_{j:\, p_{j,0} \ge p_{j,1}} (p_{j,0} - p_{j,1}) \;-\; \frac{1}{\sqrt{m}} \sum_{j=1}^{k} \sqrt{p_{j,1} + p_{j,0}}.$$

Moreover, if we assume that $p_{j,0} + p_{j,1}$ are sorted in descending order, and there exists some $r \ne 2$ such that $\Pr(h(x)_j = 1) \le O(j^{-r})$ for all $j$, then $R_0 - E[R_\phi]$ is at least

$$\sum_{j:\, p_{j,0} \ge p_{j,1}} (p_{j,0} - p_{j,1}) \;-\; O\left( \sqrt{\frac{k^{\max\{2-r,\,0\}}}{m}} \right).$$

The requirement that $r \ne 2$ is for technical reasons, and the case $r = 2$ can easily be treated separately.

What does this theorem tell us? The non-negative term $\sum_{j:\, p_{j,0} \ge p_{j,1}} (p_{j,0} - p_{j,1})$ can be arbitrarily small, but we can expect it to be lower bounded by a positive constant independent of $m, k$ if a fixed fraction of the labels are such that $p_{j,0} \ge p_{j,1}$, and if $p_{j,0} - p_{j,1}$ is proportional to $p_{j,0} + p_{j,1}$. So we turn our attention to the term $\frac{1}{\sqrt{m}} \sum_{j=1}^{k} \sqrt{p_{j,1} + p_{j,0}}$, which can indeed be large in the regime where $k$ scales with $m$. For example, if $p_{j,1} + p_{j,0} = 1/k$ for all $j$, the above equals $\sqrt{k/m} = \Omega(1)$, and Thm. 3 may become vacuous.

Luckily, assuming that $p_{j,1} + p_{j,0}$ is equal for all $j$ is unrealistic. By definition, $p_{j,1} + p_{j,0}$ is upper bounded by $\Pr(h(x)_j = 1)$, or the probability that our learned hypothesis labels a random instance with label $j$. If the marginal class distribution of the classifier is similar to the marginal class distribution of the data, then this distribution is often observed to follow a power law, which corresponds to the assumption that $\Pr(h(x)_j = 1) \le O(j^{-r})$ for all $j$. Under this assumption, we obtain the second statement in Thm. 3. This power-law behavior, sometimes known as Zipf's law, is a very well known and well documented phenomenon for many rank vs. frequency datasets (see examples in Manning and Schütze, 2002; Adamic and Huberman, 2002; Gabaix, 1999), and in particular for the application we have in mind.
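The effect of the power-law assumption on the second term of Thm. 3 can be checked numerically. The sketch below evaluates $\frac{1}{\sqrt{m}} \sum_j \sqrt{q_j}$ for two hypothetical class-mass profiles $q_j$ standing in for $p_{j,1} + p_{j,0}$: a uniform one with $q_j = 1/k$, and a power-law one with $q_j \propto j^{-r}$. Normalizing both profiles to sum to 1, setting $k = m$, and the function names are assumptions made purely for illustration.

```python
import math


def penalty_term(class_masses, m):
    """Second term of the bound: (1 / sqrt(m)) * sum_j sqrt(q_j)."""
    return sum(math.sqrt(q) for q in class_masses) / math.sqrt(m)


def uniform_profile(k):
    """Hypothetical profile with q_j = 1/k for every class."""
    return [1.0 / k] * k


def power_law_profile(k, r):
    """Hypothetical profile with q_j proportional to j^(-r), summing to 1."""
    weights = [j ** (-r) for j in range(1, k + 1)]
    total = sum(weights)
    return [w / total for w in weights]


for m in (100, 10_000, 1_000_000):
    k = m  # as many classes as examples
    uniform = penalty_term(uniform_profile(k), m)
    zipf = penalty_term(power_law_profile(k, 1.5), m)
    print(f"m = k = {m}: uniform {uniform:.3f}, power law (r = 1.5) {zipf:.3f}")
```

With $k = m$ the uniform profile keeps the term at $\sqrt{k/m} = 1$ however large $m$ becomes, whereas the power-law profile with $r = 1.5$ drives it toward zero as $m$ grows.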
We verify this property in our experiments, presented below. Overall, this lower bound implies that if we let $m, k \to \infty$, we can expect $R_0 - E[R_\phi]$ to be positive whenever $m$ grows faster than $k^{2-r}$. In particular, if $r > 1$ (which happens quite often in practice, including in our experiments), we obtain the interesting result that the lower bound remains meaningful, even when the number of classes $k$ grows faster than the number of examples $m$.

5 Experiments

We applied our technique to the task of categorizing web pages using the 1.5 million categories defined in Wikipedia. As mentioned in the introduction, we first used search engine logs to create a click graph, which is a bipartite graph between queries and web pages. A link between query Q and web page W indicates that a sufficiently large number of users issued the query Q and then clicked on the link to page W. Next, we randomly split the set of Wikipedia articles into three sets: 50% training, 30% validation, and 20% test. Each Wikipedia article is associated with a set of categories and also corresponds to a node in the click graph. Next, we propagated the categories from each Wikipedia training article along the edges of the click graph, to all of the web pages that have a query in common with that article (namely, to all web pages whose distance to the training article is 2). We call the resulting labeling of the web labeling A. The rationale behind this labeling procedure is the assumption that two web pages that were clicked on by different people, at different times, after the same query are likely to share many topics. Next, we propagated the categories along the edges of the click graph a second time, extending the reach of each category to all pages with graph distance 4 from the original article. We call this labeling B.

We repeated the process described above a second time, this time seeded with a larger set of labels per Wikipedia training article. We used the fact that Wikipedia categories are themselves categorized by higher-level categories. For example, the Wikipedia article on Dog is associated with the category Domesticated Animals, and the latter
is associated with the category Animals. We added all of these second-order categories to each Wikipedia article. We propagated the extended category sets along the edges of the click graph as before, to obtain labeling C. We then performed a second iteration of label-propagation to obtain labeling D.

We applied our label-pruning technique independently to each of the four initial labellings. Namely, we revealed the true categories of the Wikipedia validation articles and compared them to the propagated labels in the four versions of our experiment. For each class we counted true and false positives, and decided which labels to prune.

The set of Wikipedia categories is problematic in that it is over-complete. Many categories have duplicates or near-duplicates; some articles are labeled by one category while other articles are labeled by its near-duplicate category. Also, the false-positives in all four labellings significantly outnumber the true-positives. For these reasons, false-positives should be treated with great suspicion. When we see a false-positive, either our classification is wrong or the Wikipedia editor may have simply neglected to add this category. Spot-checking reveals that many false-positives are actually quite reasonable. On the other hand, false-negatives should always be taken seriously: a human editor explicitly added a category to the article and our algorithm concluded that it is not relevant. To correct this imbalance, we set $\gamma$ in Eq. (1) to give more weight to false-negatives. Specifically, we set $\gamma$ to values between 0.01 and 0.1.

Figure 1: Ratio between the best attainable test loss and the test loss attained by three different techniques, on four initial labellings. (Four panels, one per labeling A, B, C, D; horizontal axis: $\gamma$; vertical axis: test loss ratio; curves: our algorithm, original, random.)

After using the validation set to identify and remove harmful labels, we revealed the categories in the Wikipedia test set, and evaluated the performance of our algorithm. For each of the four labellings and for each value of $\gamma$, we also calculated an oracle pruning which provides a lower bound on the test loss of any possible pruning algorithm. This was done by cheating and finding the best pruning on the test set in terms of each $\gamma$-weighted loss. The loss attained by the oracle varies greatly with $\gamma$, so it is meaningless to plot absolute loss values for different values of $\gamma$ on the same figure. To get a coherent visualization of our results, we plotted the ratio between the oracle loss and the loss of our algorithm. The performance of our algorithm is shown in solid lines in Fig. 1. Values close to 1 indicate that our test loss is very close to the loss of the ideal pruning. For comparison, the plots in Fig. 1 also show the performance of two other simple algorithms. The first is the algorithm that performs no pruning and just keeps the initial labeling. The second is an algorithm that uses our method to determine how many labels to remove, and then removes labels randomly. These experimental results clearly
show the amount of improvement achieved by our algorithm. Despite the statistical challenges of generalizing with only a handful of examples per class, our algorithm performs very well across a wide range of $\gamma$.

Finally, using a simple least-squares fitting technique, we validated that all four datasets satisfy the power-law assumption used in our theoretical analysis (see Thm. 3 and the discussion which follows). Namely, when we sort the classes by frequency in the data, we see that the frequency of class $y_j$ is proportional to $j^{-r}$, with $r \approx 1.3$ for labeling A; $r \approx 1.6$ for labeling B; $r \approx 1.9$ for labeling C; and $r \approx 2.3$ for labeling D.

6 Conclusions

In this paper, we studied the problem of massive multiclass-multilabel learning, where the set of classes scales with the number of available training examples. This setting is very relevant when the set of classes results from a collaborative tagging scheme, such as Wikipedia categories or keywords in media hosting websites. In this regime, the standard assumption of a fixed set of classes is too simplistic, and straightforward generalizations of methods for binary classification (such as multiclass SVM) may be impractical. Motivated by the computational issues faced by practitioners in this area, we proposed and analyzed a post-learning method on top of any desired learning algorithm, which for our purposes can be treated as a black-box. Our experiments demonstrate that the method works quite well on real-world, large scale data.

Theoretically, this setting poses a challenge, since we cannot hope to get a lot of data on each and every class. As far as we know, this setting violates the assumptions underlying most previous theoretical work on multiclass-multilabel learning. Nevertheless, a careful analysis allows us to justify our approach, using some non-trivial but mild sufficient conditions, such as sparsity of labels per instance and a power-law behavior of the class frequencies.

While our approach seems to work in practice, and has some interesting theoretical properties, the algorithm we have focused on is obviously a very simple one, and several extensions immediately come to mind. One direction is to utilize additional knowledge about class
dependencies, rather than treating each class separately. Also, we have dealt only with very simple label transformation rules, which prune a subset of labels (i.e. "if label A appears, remove it"). However, it is possible to envision more complex rules, such as "if labels A and B appear, but not label C, replace label D by label E". Understanding how to implement these extensions effectively and in a theoretically justified manner, even when there are as many classes as examples, remains a topic for future research.

References

L.A. Adamic and B.A. Huberman. 2002. Zipf's law and the internet. Glottometrics, 3:143-150.

Y. Amit, O. Dekel, and Y. Singer. 2007. A boosting algorithm for label covering in multilabel problems. In Proceedings of Artificial Intelligence and Statistics 2007.

Z. Barutcuoglu, R.E. Schapire, and O.G. Troyanskaya. 2006. Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7).

S. Boucheron, G. Lugosi, and O. Bousquet. 2004. Concentration inequalities. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures on Machine Learning, Springer LNCS 3176.

M.R. Boutell, J. Luo, X. Shen, and C.M. Brown. 2004. Learning multi-label scene classification. Pattern Recognition, 37(9).

K. Crammer and Y. Singer. 2003. A family of additive online algorithms for category ranking. Journal of Machine Learning Research, 3.

X. Gabaix. 1999. Zipf's law for cities: An explanation. The Quarterly Journal of Economics, 114(3).

D. Hsu, S. Kakade, J. Langford, and T. Zhang. 2009. Multi-label prediction via compressed sensing. In Advances in Neural Information Processing Systems 22.

G. Lebanon and J.D. Lafferty. 2002. Conditional models on the ranking poset. In Advances in Neural Information Processing Systems 15.

C.D. Manning and H. Schütze. 2002. Foundations of Statistical Natural Language Processing. MIT Press.

A. McCallum. 1999. Multi-label text classification with a mixture model trained by EM. In AAAI'99 Workshop on Text Learning.

T. Zhang. 2004. Class-size independent generalization analysis of some discriminative multi-category classification. In Advances in Neural Information Processing Systems 17.

A Technical Proofs

A.1 Proof of Thm. 2

We need the following two lemmas. The first lemma follows directly from Bernstein's inequality (see for instance Boucheron et al., 2004). We note that using an inequality that relies on variances is crucial to obtain a non-trivial bound with our proof technique. The second lemma follows directly from the definitions. The proofs are omitted due to lack of space.

Lemma 2. For any $j$, if $p_{j,0} \le p_{j,1}$, then $\Pr(\hat{p}_{j,0} > \hat{p}_{j,1})$ is at most

$$\exp\left( - \frac{m\, (p_{j,0} - p_{j,1})^2}{2 \left( (1-\gamma)\, p_{j,1} + \gamma\, p_{j,0} + |p_{j,0} - p_{j,1}|/3 \right)} \right).$$

A similar bound holds on $\Pr(\hat{p}_{j,0} \le \hat{p}_{j,1})$ if
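$p_{j,0} \ge p_{j,1}$.

Lemma 3. It holds that $\sum_{j=1}^{k} |p_{j,1} - p_{j,0}| \le \sum_{j=1}^{k} (p_{j,1} + p_{j,0}) \le 1$.

Let $\alpha > 0$ be an arbitrary parameter to be specified later, and define the label subsets $J_1 = \{j : |p_{j,1} - p_{j,0}| \le \alpha\}$, $J_2 = \{1, \ldots, k\} \setminus J_1$. We have by definition of the pruning procedure and Lemma 1 that $|R_\phi - E[R_\phi]|$ is at most

$$\begin{aligned}
&\left| \sum_{j \in J_1} (p_{j,1} - p_{j,0}) \Big( \mathbb{1}[\hat{p}_{j,0} > \hat{p}_{j,1}] - \Pr(\hat{p}_{j,0} > \hat{p}_{j,1}) \Big) \right| \\
&\quad + \sum_{j \in J_2} |p_{j,1} - p_{j,0}| \, \Big| \mathbb{1}[\hat{p}_{j,0} > \hat{p}_{j,1}] - \Pr(\hat{p}_{j,0} > \hat{p}_{j,1}) \Big|.
\end{aligned} \tag{2}$$

Focusing on the first line in the expression, note that if we change any single instance in our sample, at most $2s$ terms will change, each by at most $|p_{j,1} - p_{j,0}| \le \alpha$. Therefore, the expression in the first line will change by at most $2s\alpha$. Applying McDiarmid's inequality, and noting that the expectation of what is inside the absolute value is zero, we get that with probability of at least $1 - \delta$, it is upper bounded by

$$\sqrt{2\, m\, s^2 \alpha^2 \log(1/\delta)}. \tag{3}$$

Turning to the second line in Eq. (2), and applying Lemma 2, we get that for any $j$, with probability of at least $1 - g(m, p_{j,1}, p_{j,0})$, it holds that

$$\Big| \mathbb{1}[\hat{p}_{j,0} > \hat{p}_{j,1}] - \Pr(\hat{p}_{j,0} > \hat{p}_{j,1}) \Big| \le g(m, p_{j,1}, p_{j,0}),$$

where $g(m, p_{j,1}, p_{j,0})$ equals

$$\exp\left( - \frac{m\, (p_{j,1} - p_{j,0})^2}{2 \left( (1-\gamma)\, p_{j,1} + \gamma\, p_{j,0} + |p_{j,0} - p_{j,1}|/3 \right)} \right).$$

Let $c > 0$ be another parameter to be determined later. If $c \left( (1-\gamma)\, p_{j,1} + \gamma\, p_{j,0} \right) \le |p_{j,0} - p_{j,1}|$, we can upper bound $g(m, p_{j,1}, p_{j,0})$ by

$$\exp\left( - \frac{m\, c^2 \left( (1-\gamma)\, p_{j,1} + \gamma\, p_{j,0} \right)^2}{2 \left( (1-\gamma)\, p_{j,1} + \gamma\, p_{j,0} + |p_{j,0} - p_{j,1}|/3 \right)} \right).$$

Dividing the numerator and denominator of the fraction in the exponent by $(1-\gamma)\, p_{j,1} + \gamma\, p_{j,0}$, and using the easily verified fact that for any $a > 0$, $b > 0$, $\gamma \in (0,1)$, it holds that $|a - b| / \left( (1-\gamma) a + \gamma b \right) \le 1 / \left( \gamma (1-\gamma) \right)$, we get the upper bound

$$\exp\left( - \frac{m\, c^2 \left( (1-\gamma)\, p_{j,1} + \gamma\, p_{j,0} \right)}{2 \left( 1 + 1/(3\gamma(1-\gamma)) \right)} \right). \tag{4}$$

On the other hand, we always have

$$\Big| \mathbb{1}[\hat{p}_{j,0} > \hat{p}_{j,1}] - \Pr(\hat{p}_{j,0} > \hat{p}_{j,1}) \Big| \le 1 \tag{5}$$

with probability 1. Applying Eq. (4) and Eq. (5) on the second line of Eq. (2), we get a probabilistic upper bound for it, of the form

$$\sum_{j \in J_{2,1}} |p_{j,1} - p_{j,0}| \exp\left( - \frac{m\, c^2 \left( (1-\gamma)\, p_{j,1} + \gamma\, p_{j,0} \right)}{2 \left( 1 + 1/(3\gamma(1-\gamma)) \right)} \right) + \sum_{j \in J_{2,2}} |p_{j,1} - p_{j,0}|, \tag{6}$$

where $J_{2,1} = \left\{ j \in J_2 : c \le \frac{|p_{j,0} - p_{j,1}|}{(1-\gamma)\, p_{j,1} + \gamma\, p_{j,0}} \right\}$, and $J_{2,2} = J_2 \setminus J_{2,1}$. By a union bound, Eq. (6) holds with probability at least

$$1 - \sum_{j \in J_{2,1}} \exp\left( - \frac{m\, c^2 \left( (1-\gamma)\, p_{j,1} + \gamma\, p_{j,0} \right)}{2 \left( 1 + 1/(3\gamma(1-\gamma)) \right)} \right). \tag{7}$$

We now make four observations. First, by Lemma 3, $\sum_j |p_{j,1} - p_{j,0}| \le 1$, so there can be at most $1/\alpha$ labels $j$ such that $|p_{j,1} - p_{j,0}| > \alpha$. Second, it is easy to verify that if $|p_{j,1} - p_{j,0}| > \alpha$ (which holds for any $j \in J_{2,1}$), then $(1-\gamma)\, p_{j,1} + \gamma\, p_{j,0} > \alpha \gamma (1-\gamma)$. Third, for any $j \in J_{2,2}$, $|p_{j,1} - p_{j,0}| < c \left( (1-\gamma)\, p_{j,1} + \gamma\, p_{j,0} \right)$. Fourth, $\sum_{j \in J_{2,2}} \left( (1-\gamma)\, p_{j,1} + \gamma\, p_{j,0} \right) \le 1$ by Lemma 3 and the fact that $\gamma \in (0,1)$. Applying these four observations on Eq. (6) and Eq. (7), we can weaken this bound to the form

$$\frac{1}{\alpha} \exp\left( - \frac{m\, c^2 \alpha \gamma (1-\gamma)}{2 \left( 1 + 1/(3\gamma(1-\gamma)) \right)} \right) + c,$$

which holds with probability of at least

$$1 - \frac{1}{\alpha} \exp\left( - \frac{m\, c^2 \alpha \gamma (1-\gamma)}{2 \left( 1 + 1/(3\gamma(1-\gamma)) \right)} \right).$$

To get the theorem statement, we combine this with the bound in Eq. (3), substitute into Eq. (2), choose $\alpha = m^{-2/3}$, $\delta = m^{2/3} \exp(-m^{2\epsilon})$ for some $\epsilon > 0$, let

$$c = m^{-1/6+\epsilon} \sqrt{ \frac{2 \left( 1 + 1/(3\gamma(1-\gamma)) \right)}{\gamma (1-\gamma)} },$$

and perform some straightforward simplifications.

A.2 Proof of Thm. 3

We have that $R_0 - E[R_\phi]$ equals

$$\sum_{j=1}^{k} (p_{j,0} - p_{j,1}) \Pr(\hat{p}_{j,0} > \hat{p}_{j,1}). \tag{8}$$

For any $j$, if $p_{j,0} - p_{j,1} \ge 0$, we have by Lemma 2 that $\Pr(\hat{p}_{j,0} > \hat{p}_{j,1})$ is lower bounded by

$$1 - \exp\left( - \frac{m\, (p_{j,0} - p_{j,1})^2}{2 \left( \gamma\, p_{j,0} + (1-\gamma)\, p_{j,1} + (p_{j,0} - p_{j,1})/3 \right)} \right) \ge 1 - \exp\left( - \frac{m\, (p_{j,0} - p_{j,1})^2}{2 \left( p_{j,0} + p_{j,1} + (p_{j,0} + p_{j,1})/3 \right)} \right) = 1 - \exp\left( - \frac{3m\, (p_{j,0} - p_{j,1})^2}{8\, (p_{j,0} + p_{j,1})} \right).$$

If $p_{j,0} - p_{j,1} \le 0$, we have by Lemma 2 in a similar manner that

$$\Pr(\hat{p}_{j,0} > \hat{p}_{j,1}) \le \exp\left( - \frac{3m\, (p_{j,0} - p_{j,1})^2}{8\, (p_{j,0} + p_{j,1})} \right).$$

Substituting these results into Eq. (8), we get that $R_0 - E[R_\phi]$ is lower bounded by

$$\begin{aligned}
&\sum_{j:\, p_{j,0} \ge p_{j,1}} (p_{j,0} - p_{j,1}) \\
&\quad - \sum_{j=1}^{k} |p_{j,0} - p_{j,1}| \exp\left( - \frac{3m\, (p_{j,0} - p_{j,1})^2}{8\, (p_{j,0} + p_{j,1})} \right).
\end{aligned} \tag{9}$$

In order to upper bound the second line in the expression with something which does not depend on $p_{j,0} - p_{j,1}$, it is enough to upper bound for any $j$ the expression

$$\max_{p_{j,0} - p_{j,1}} \; |p_{j,0} - p_{j,1}| \exp\left( - \frac{3m\, (p_{j,0} - p_{j,1})^2}{8\, (p_{j,0} + p_{j,1})} \right). \tag{10}$$

For that, it is sufficient to find the maximal value of the function $f(x) = x \exp\left( -3m x^2 / (8p) \right)$, where $p := p_{j,1} + p_{j,0}$, for any $x \in [0, p]$. It can be verified that this function is maximized at $x = \sqrt{4p/(3m)}$. Substituting this value for $|p_{j,0} - p_{j,1}|$ in Eq. (10), we get an upper bound of the form

$$\sqrt{\frac{4\, (p_{j,0} + p_{j,1})}{3m}} \exp\left( -\frac{1}{2} \right).$$

Substituting this bound in Eq. (9), and simplifying by noting that $\sqrt{4/3}\, \exp(-1/2) \approx 0.7 < 1$, we get the required lower bound

$$\sum_{j:\, p_{j,0} \ge p_{j,1}} (p_{j,0} - p_{j,1}) \;-\; \frac{1}{\sqrt{m}} \sum_{j=1}^{k} \sqrt{p_{j,1} + p_{j,0}}$$

on $R_0 - E[R_\phi]$.

To derive from it the second inequality in the theorem, notice that under the assumptions stated there,

$$\frac{1}{\sqrt{m}} \sum_{j=1}^{k} \sqrt{p_{j,1} + p_{j,0}} \le \frac{1}{\sqrt{m}} \sum_{j=1}^{k} \sqrt{\Pr(h(x)_j = 1)}$$

is at most $\frac{C}{\sqrt{m}} \sum_{j=1}^{k} j^{-r/2}$ for some constant $C$. This sum is $O(k^{1-r/2})$ if $r < 2$, $O(\log k)$ if $r = 2$, and $O(1)$ if $r > 2$. Ignoring the case $r = 2$ for simplicity, we upper bound the different cases by $O\left( \sqrt{k^{\max\{2-r,\,0\}}} \right)$, and the inequality stated in the theorem follows.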
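The maximization step in the proof of Thm. 3 can be verified directly. The short derivation below is a sketch that introduces the shorthand $a = 3m/(8p)$ (not used in the proof itself) and ignores the constraint $x \le p$, which can only lower the maximum:

```latex
\begin{align*}
f(x) &= x\, e^{-a x^2}, \qquad a = \frac{3m}{8p}, \\
f'(x) &= \left(1 - 2 a x^2\right) e^{-a x^2} = 0
  \;\Longrightarrow\; x^{*} = \frac{1}{\sqrt{2a}} = \sqrt{\frac{4p}{3m}}, \\
f(x^{*}) &= \sqrt{\frac{4p}{3m}}\; e^{-1/2}
  = \sqrt{\tfrac{4}{3}}\; e^{-1/2} \sqrt{\frac{p}{m}}
  \approx 0.70\, \sqrt{\frac{p}{m}} \;\le\; \sqrt{\frac{p}{m}}.
\end{align*}
```

Since $f(0) = 0$, $f(x) \to 0$ as $x \to \infty$, and $x^{*}$ is the only positive critical point, $x^{*}$ is the maximizer. Summing the per-class bound $\sqrt{p/m}$ over $j$ gives the term $\frac{1}{\sqrt{m}} \sum_j \sqrt{p_{j,1} + p_{j,0}}$ that appears in the theorem.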