(Almost) No Label No Cry

Size: px
Start display at page:

Download "(Almost) No Label No Cry"

Transcription

1 (Almost) No Label No Cry Gorgo Patrn,, Rchard Nock,, Paul Rvera,, Tbero Caetano,3,4 Australan Natonal Unversty, NICTA, Unversty of New South Wales 3, Ambata 4 Sydney, NSW, Australa {namesurname}@anueduau Abstract In Learnng wth Label Proportons (LLP), the objectve s to learn a supervsed classfer when, nstead of labels, only label proportons for bags of observatons are known Ths settng has broad practcal relevance, n partcular for prvacy preservng data processng We frst show that the mean operator, a statstc whch aggregates all labels, s mnmally suffcent for the mnmzaton of many proper scorng losses wth lnear (or kernelzed) classfers wthout usng labels We provde a fast learnng algorthm that estmates the mean operator va a manfold regularzer wth guaranteed approxmaton bounds Then, we present an teratve learnng algorthm that uses ths as ntalzaton We ground ths algorthm n Rademacher-style generalzaton bounds that ft the LLP settng, ntroducng a generalzaton of Rademacher complexty and a Label Proporton Complexty measure Ths latter algorthm optmzes tractable bounds for the correspondng bag-emprcal rsk Experments are provded on fourteen domans, whose sze ranges up to 300K observatons They dsplay that our algorthms are scalable and tend to consstently outperform the state of the art n LLP Moreover, n many cases, our algorthms compete wth or are just percents of AUC away from the Oracle that learns knowng all labels On the largest domans, half a dozen proportons can suffce, e roughly 40K tmes less than the total number of labels Introducton Machne learnng has recently experenced a prolferaton of problem settngs that, to some extent, enrch the classcal dchotomy between supervsed and unsupervsed learnng Cases as multple nstance labels, nosy labels, partal labels as well as sem-supervsed learnng have been studed motvated by applcatons where fully supervsed learnng s no longer realstc In the present work, we are nterested n learnng a bnary classfer from nformaton provded at the level of groups of nstances, called bags The type of nformaton we assume avalable s the label proportons per bag, ndcatng the fracton of postve bnary labels of ts nstances Inspred by [], we refer to ths framework as Learnng wth Label Proportons (LLP) Settngs that perform a bag-wse aggregaton of labels nclude Multple Instance Learnng (MIL) [] In MIL, the aggregaton s logcal rather than statstcal: each bag s provded wth a bnary label expressng an OR condton on all the labels contaned n the bag More general settng also exst [3] [4] [5] Many practcal scenaros ft the LLP abstracton (a) Only aggregated labels can be obtaned due to the physcal lmts of measurement tools [6] [7] [8] [9] (b) The problem s sem- or unsupervsed but doman experts have knowledge about the unlabelled samples n form of expectaton, as pseudomeasurement [5] (c) Labels exsted once but they are now gven n an aggregated fashon for prvacy-preservng reasons, as n medcal databases [0], fraud detecton [], house prce market, electon results, census data, etc (d) Ths settng also arses n computer vson [] [3] [4] Related work The settng was frst ntroduced by [], where a prncpled herarchcal model generates labels consstent wth the proportons and s traned through MCMC Subsequently, [9] and ts follower [6] offer a varety of standard learnng algorthms desgned to generate self-consstent

2 labels [5] gves a Bayesan nterpretaton of LLP where the key dstrbuton s estmated through an RBM Other deas rely on structural learnng of Bayesan networks wth mssng data [7], and on K- MEANS clusterng to solve prelmnary label assgnment [3] [8] Recent SVM mplementatons [] [6] outperform most of the other known methods Theoretcal works on LLP belong to two man categores The frst contans unform convergence results, for the estmators of label proportons [], or the estmator of the mean operator [7] The second contans approxmaton results for the classfer [7] Our work bulds upon ther Mean Map algorthm, that reles on the trck that the logstc loss may be splt n two, a convex part dependng only on the observatons, and a lnear part nvolvng a suffcent statstc for the label, the mean operator Beng able to estmate the mean operator means beng able to ft a classfer wthout usng labels In [7], ths estmaton reles on a restrctve homogenety assumpton that the class-condtonal estmaton of features does not depend on the bags Experments dsplay the lmts of ths assumpton [][6] Contrbutons In ths paper we consder lnear classfers, but our results hold for kernelzed formulatons followng [7] We frst show that the trck about the logstc loss can be generalzed, and the mean operator s actually mnmally suffcent for a wde set of symmetrc proper scorng losses wth no class-dependent msclassfcaton cost, that encompass the logstc, square and Matsushta losses [8] We then provde an algorthm, LMM, whch estmates the mean operator va a Laplacan-based manfold regularzer wthout callng to the homogenety assumpton We show that under a weak dstngushablty assumpton between bags, our estmaton of the mean operator s all the better as the observatons norm ncrease Ths, as we show, cannot hold for the Mean Map estmator Then, we provde a data-dependent approxmaton bound for our classfer wth respect to the optmal classfer, that s shown to be better than prevous bounds [7] We also show that the manfold regularzer s soluton s tghtly related to the lnear separablty of the bags We then provde an teratve algorthm, AMM, that takes as nput the soluton of LMM and optmzes t further over the set of consstent labelngs We ground the algorthm n a unform convergence result nvolvng a generalzaton of Rademacher complextes for the LLP settng The bound nvolves a bag-emprcal surrogate rsk for whch we show that AMM optmzes tractable bounds All our theoretcal results hold for any symmetrc proper scorng loss Experments are provded on fourteen domans, rangng from hundreds to hundreds of thousands of examples, comparng AMM and LMM to ther contenders: Mean Map, InvCal [] and SVM [6] They dsplay that AMM and LMM outperform ther contenders, and sometmes even compete wth the fully supervsed learner whle requrng few proportons only Tests on the largest domans dsplay the scalablty of both algorthms Such expermental evdence serously questons the safety of prvacy-preservng summarzaton of data, whenever accurate aggregates and nformatve ndvdual features are avalable Secton () presents our algorthms and related theoretcal results Secton (3) presents experments Secton (4) concludes A Supplementary Materal [9] ncludes proofs and addtonal experments LLP and the mean operator: theoretcal results and algorthms Learnng settng Hereafter, boldfaces lke p denote vectors, whose coordnates are denoted p l for l,, For any m N, let [m] {,,, m} Let Σ m {σ {, } m } and X R d Examples are couples (observaton, label) X Σ, sampled d accordng to some unknown but fxed dstrbuton D Let S {(x, y ), 
[m]} D m denote a sze-m sample In Learnng wth Label Proportons (LLP), we do not observe drectly S but S y, whch denotes S wth labels removed; we are gven ts partton n n > 0 bags, S y j S j, j [n], along wth ther respectve label proportons ˆπ j ˆP[y + S j ] and bag proportons ˆp j m j /m wth m j card(s j ) (Ths generalzes to a cover of S, by copyng examples among bags) The bag assgnment functon that parttons S s unknown but fxed In real world domans, t would rather be known, eg state, gender, age band A classfer s a functon h : X R, from a set of classfers H H L denotes the set of lnear classfers, noted h θ (x) θ x wth θ X A (surrogate) loss s a functon F : R R + We let F (S, h) (/m) F (y h(x )) denote the emprcal surrogate rsk on S correspondng to loss F For the sake of clarty, ndexes, j and k respectvely refer to examples, bags and features The mean operator and ts mnmal suffcency µ S m We defne the (emprcal) mean operator as: y x ()

3 Algorthm Laplacan Mean Map (LMM) Input S j, ˆπ j, j [n]; γ > 0 (7); w (7); V (8); permssble φ (); λ > 0; Step : let B± arg mn X R n d l(l, X) usng (7) (Lemma ) Step : let µ S j ˆp j(ˆπ j b+ j ( ˆπ j) b j ) Step 3 : let θ arg mn θ F φ (S y, θ, µ S ) + λ θ (3) Return θ Table : Correspondence between permssble functons φ and the correspondng loss F φ loss name F φ (x) φ(x) logstc loss log( + exp( x)) x log x ( x) log( x) square loss ( x) x( x) Matsushta loss x + + x x( x) The estmaton of the mean operator µ S appears to be a learnng bottleneck n the LLP settng [7] The fact that the mean operator s suffcent to learn a classfer wthout the label nformaton motvates the noton of mnmal suffcent statstc for features n ths context Let F be a set of loss functons, H be a set of classfers, I be a subset of features Some quantty t(s) s sad to be a mnmal suffcent statstc for I wth respect to F and H ff: for any F F, any h H and any two samples S and S, the quantty F (S, h) F (S, h) does not depend on I ff t(s) t(s ) Ths defnton can be motvated from the one n statstcs by buldng losses from log lkelhoods The followng Lemma motvates further the mean operator n the LLP settng, as t s the mnmal suffcent statstc for a broad set of proper scorng losses that encompass the logstc and square losses [8] The proper scorng losses we consder, hereafter called symmetrc (SPSL), are twce dfferentable, non-negatve and such that msclassfcaton cost s not label-dependent Lemma µ S s a mnmal suffcent statstc for the label varable, wth respect to SPSL and H L ([9], Subsecton ) Ths property, very useful for LLP, may also be exploted n other weakly supervsed tasks [] Up to constant scalngs that play no role n ts mnmzaton, the emprcal surrogate rsk correspondng to any SPSL, F φ (S, h), can be wrtten wth loss: F φ (x) φ(0) + φ ( x) a φ + φ ( x), () φ(0) φ(/) b φ and φ s a permssble functon [0, 8], e dom(φ) [0, ], φ s strctly convex, dfferentable and symmetrc wth respect to / φ s the convex conjugate of φ Table shows examples of F φ It follows from Lemma and ts proof, that any F φ (Sθ), can be wrtten for any θ h θ H L as: ( ) F φ (S, θ) b φ F φ (σθ x ) m θ µ S F φ (S y, θ, µ S ), (3) where σ Σ σ The Laplacan Mean Map (LMM) algorthm The sum n eq (3) s convex and dfferentable n θ Hence, once we have an accurate estmator of µ S, we can then easly ft θ to mnmze F φ (S y, θ, µ S ) Ths two-steps strategy s mplemented n LMM n algorthm µ S can be retreved from n bag-wse, label-wse unknown averages b σ j : n µ S (/) ˆp j j σ Σ (ˆπ j + σ( σ))b σ j, (4) wth b σ j E S [x σ, j] denotng these n unknowns (for j [n], σ Σ ), and let b j (/m j ) x S j x The n b σ j s are soluton of a set of n denttes that are (n matrx form): B Π B ± 0, (5) 3

4 where B [b b b n ] R n d, Π [DIAG(ˆπ) DIAG( ˆπ)] R n n and B ± R n d s the matrx of unknowns: [ ] B ± b + b + b + n b - b - b - n (6) } {{ } } {{ } (B + ) (B ) System (5) s underdetermned, unless one makes the homogenety assumpton that yelds the Mean Map estmator [7] Rather than makng such a restrctve assumpton, we regularze the cost that brngs (5) wth a manfold regularzer [], and search for B± arg mn X R n d l(l, X), wth: l(l, X) tr ( (B X Π)D w (B Π X) ) + γtr ( X ) LX, (7) and γ > 0 D w DIAG(w) s a user-fxed bas matrx wth w R n +, (and w ˆp n general) and: [ ] La 0 L εi + R 0 n n, (8) L a where L a D V R n n s the Laplacan of the bag smlartes V s a symmetrc smlarty matrx wth non negatve coordnates, and the dagonal matrx D satsfes d jj j v jj, j [n] The sze of the Laplacan s O(n ), whch s small compared to O(m ) f there are not many bags One can nterpret the Laplacan regularzaton as smoothng the estmates of b σ j wrt the smlarty of the respectve bags Lemma The soluton B± to mn X R n d l(l, X) s B± ( ΠD w Π + γl ) ΠDw B ([9], Subsecton ) Ths Lemma explans the role of penalty εi n (8) as ΠD w Π and L have respectvely n- and ( )-dm null spaces, so the nverson may not be possble Even when ths does not happen exactly, ths may ncur numercal nstabltes n computng the nverse For domans where ths rsk exsts, pckng a small ε > 0 solves the problem Let b σ j denote the row-wse decomposton of B± followng (6), from whch we compute µ S followng (4) when we use these n estmates n leu of the true b σ j We compare µ j ˆπ j b + j ( ˆπ j)b j, j [n] to our estmates µ j ˆπ j b+ j ( ˆπ j) b j, j [n], granted that µ S j ˆp jµ j and µ S j ˆp j µ j Theorem 3 Suppose that γ satsfes γ ((ε(n) ) + max j j v jj )/ mn j w j Let M [µ µ µ n ] R n d, M [ µ µ µ n ] R n d and ς(v, B ± ) ((ε(n) ) + max j j v jj ) B ± F The followng holds: M M F ( ) n mn wj ς(v, B ± ) (9) j ([9], Subsecton 3) The multplcatve factor to ς n (9) s roughly O(n 5/ ) when there s no large dscrepancy n the bas matrx D w, so the upperbound s drven by ς(, ) when there are not many bags We have studed ts varatons when the dstngushablty between bags ncreases Ths settng s nterestng because n ths case we may kll two brds n one shot, wth the estmaton of M and the subsequent learnng problem potentally easer, n partcular for lnear separators We consder two examples for v jj, the frst beng (half) the normalzed assocaton []: v nc jj ( ASSOC(Sj, S j ) ASSOC(S j, S j S j ) + ASSOC(S j, S j ) ASSOC(S j, S j S j ) ) NASSOC(S j, S j ), (0) v G,s jj exp( b j b j /s), s > 0 () Here, ASSOC(S j, S j ) x S j,x S x x j [] To put these two smlarty measures n the context of Theorem 3, consder the settng where we can make assumpton (D) that there exsts a small constant κ > 0 such that b j b j κ max σ,j b σ j, j, j [n] Ths s a weak dstngushablty property as f no such κ exsts, then the centers of dstnct bags may just be confounded Consder also the addtonal assumpton, (D), that there exsts κ > 0 such that max j d j κ, j [n], where d j max x,x x Sj x s a bag s dameter In the followng Lemma, the lttle-oh notaton s wth respect to the largest unknown n eq (4), e max σ,j b σ j 4

5 Algorthm Alternatng Mean Map (AMM OPT ) Input LMM parameters + optmzaton strategy OPT {mn, max} + convergence predcate PR Step : let θ 0 LMM(LMM parameters) and t 0 Step : repeat Step : let σ t arg OPT σ Σ ˆπ F φ (S y, θ t, µ S (σ)) Step : let θ t+ arg mn θ F φ (S y, θ, µ S (σ t )) + λ θ Step 3 : let t t + untl predcate PR s true Return θ arg mn t F φ (S y, θ t+, µ S (σ t )) Lemma 4 There exsts ε > 0 such that ε ε, the followng holds: () ς(v nc, B ± ) o() under assumptons (D + D); () ς(v G,s, B ± ) o() under assumpton (D), s > 0 ([9], Subsecton 4) Hence, provded a weak (D) or stronger (D+D) dstngushablty assumpton holds, the dvergence between M and M gets smaller wth the ncrease of the norm of the unknowns b σ j The proof of the Lemma suggests that the convergence may be faster for VG,s The followng Lemma shows that both smlartes also partally encode the hardness of solvng the classfcaton problem wth lnear separators, so that the manfold regularzer lmts the dstorton of the b ± s between two bags that tend not to be lnearly separable Lemma 5 Take v jj {v G, jj, vnc jj } There exsts 0 < κ l < κ n < such that () f v jj > κ n then S j, S j are not lnearly separable, and f v jj < κ l then S j, S j are lnearly separable ([9], Subsecton 5) Ths Lemma s an advocacy to ft s n a data-dependent way n v G,s jj The queston may be rased as to whether fnte samples approxmaton results lke Theorem 3 can be proven for the Mean Map estmator [7] [9], Subsecton 6 answers by the negatve In the Laplacan Mean Map algorthm (LMM, Algorthm ), Steps and have now been descrbed Step 3 s a dfferentable convex mnmzaton problem for θ that does not use the labels, so t does not present any techncal dffculty An nterestng queston s how much our classfer θ n Step 3 dverges from the one that would be computed wth the true expresson for µ S, θ It s not hard to show that Lemma 7 n Altun and Smola [3], and Corollary 9 n Quadranto et al [7] hold for LMM so that θ θ (λ) µ S µ S The followng Theorem shows a data-dependent approxmaton bound that can be sgnfcantly better, when t holds that θ x, θ x φ ([0, ]), (φ s the frst dervatve) We call ths settng proper scorng complance (PSC) [8] PSC always holds for the logstc and Matsushta losses for whch φ ([0, ]) R For other losses lke the square loss for whch φ ([0, ]) [, ], shrnkng the observatons n a ball of suffcently small radus s suffcent to ensure ths Theorem 6 Let f k R m denote the vector encodng the k th feature varable n S : f k x k (k [d]) Let F denote the feature matrx wth column-wse normalzed feature vectors: fk (d/ k f k ) (d )/(d) f k Under PSC, we have θ θ (λ + q) µ S µ S, wth: q det F F m e b φ φ (φ (q /λ)) (> 0), () for some q I [±(x + max{ µ S, µ S })] Here, x max x and φ (φ ) ([9], Subsecton 7) To see how large q can be, consder the smple case where all egenvalues of F F, λk ( F F) [λ ± δ] for small δ In ths case, q s proportonal to the average feature norm : det F F tr ( ) F F + o(δ) x + o(δ) m md md 5

6 The Alternatng Mean Map (AMM) algorthm Let us denote Σˆπ {σ Σ m : :x S j σ (ˆπ j )m j, j [n]} the set of labelngs that are consstent wth the observed proportons ˆπ, and µ S (σ) (/m) σ x the based mean operator computed from some σ Σˆπ Notce that the true mean operator µ S µ S (σ) for at least one σ Σˆπ The Alternatng Mean Map algorthm, (AMM, Algorthm ), starts wth the output of LMM and then optmzes t further over the set of consstent labelngs At each teraton, t frst pcks a consstent labelng n Σˆπ that s the best (OPT mn) or the worst (OPT max) for the current classfer (Step ) and then fts a classfer θ on the gven set of labels (Step ) The algorthm then terates untl a convergence predcate s met, whch tests whether the dfference between two values for F φ (,, ) s too small (AMM mn ), or the number of teratons exceeds a user-specfed lmt (AMM max ) The classfer returned θ s the best n the sequence In the case of AMM mn, t s the last of the sequence as rsk F φ (S y,, ) cannot ncrease Agan, Step s a convex mnmzaton wth no techncal dffculty Step s combnatoral It can be solved n tme almost lnear n m [9] (Subsecton 8) Lemma 7 The runnng tme of Step n AMM s Õ(m), where the tlde notaton hdes log-terms Bag-Rademacher generalzaton bounds for LLP We relate the mn and max strateges of AMM by unform convergence bounds nvolvng the true surrogate rsk, e ntegratng the unknown dstrbuton D and the true labels (whch we may never know) Prevous unform convergence bounds for LLP focus on coarser graned problems, lke the estmaton of label proportons [] We rely on a LLP generalzaton of Rademacher complexty [4, 5] Let F : R R + be a loss functon and H a set of classfers The bag emprcal Rademacher complexty of sample S, Rm, b s defned as Rm b E σ Σm sup h H {E σ Σ ˆπ E S [σ(x)f (σ (x)h(x))] The usual emprcal Rademacher complexty equals Rm b for card(σˆπ ) The Label Proporton Complexty of H s: L m E Dm E I /,I / sup E S [σ (x)(ˆπ s (x) ˆπl (x))h(x)] (3) h H Here, each of I / l, l, s a random (unformly) subset of [m] of cardnal m Let S(I/ l ) be the sze-m subset of S that corresponds to the ndexes Take l, and any x S If I / l then ˆπ l s (x ) ˆπ l l (x ) s x s bag s label proporton measured on S\S(I / l ) Else, ˆπs (x ) s ts bag s label proporton measured on S(I / ) and ˆπl (x ) s ts label (e a bag s label proporton that would contan only x ) Fnally, σ (x) x S(I / ) Σ L m tends to be all the smaller as classfers n H have small magntude on bags whose label proporton s close to / Theorem 8 Suppose h 0 st h(x) h, x, h Then, for any loss F φ, any tranng sample of sze m and any 0 < δ, wth probablty > δ, the followng bound holds over all h H: ( ) E D [F φ (yh(x))] E Σ ˆπ E S [F φ (σ(x)h(x))] + Rm b h + L m b φ m log δ (4) Furthermore, under PSC (Theorem 6), we have for any F φ : Rm b b φ E Σm sup {E S [σ(x)(ˆπ(x) (/))h(x)]} (5) h H ([9], Subsecton 9) Despte smlar shapes (3) (5), R b m and L m behave dfferently: when bags are pure (ˆπ j {0, }, j), L m 0 When bags are mpure (ˆπ j /, j), R b m 0 As bags get mpure, the bag-emprcal surrogate rsk, E Σ ˆπ E S [F φ (σ(x)h(x))], also tends to ncrease AMM mn and AMM max respectvely mnmze a lowerbound and an upperbound of ths rsk 3 Experments Algorthms We compare LMM, AMM (F φ logstc loss) to the orgnal MM [7], InvCal [], conv- SVM and alter- SVM [6] (lnear kernels) To make experments extensve, we test several ntalzatons for AMM that are not dsplayed n Algorthm (Step ): () the edge mean map estmator, µ S EMM /m ( y )( x ) (AMM EMM ), () the constant estmator µ S (AMM ), and fnally AMM 
0ran whch runs 0 random ntal models ( θ 0 ), and selects the one wth smallest rsk; 6

7 AUC rel to MM 3 0 MM LMM G LMM G,s LMM nc 4 6 dvergence (a) AUC rel to Oracle MM LMM G LMM G,s LMM nc (b) AUC rel to Oracle AMM MM AMM G AMM G,s AMM nc AMM 0ran (c) AUC Oracle AMM G Bgger domans Small domans 0^ 5 0^ 3 0^ #bags/#nstance (d) Fgure : Relatve AUC (wrt MM) as homogenety assumpton s volated (a) Relatve AUC (wrt Oracle) vs on heart for LMM(b), AMM mn (c) AUC vs n/m for AMM mn G and the Oracle (d) Table : Small domans results #wn/#lose for row vs column Bold faces means p-val < 00 for Wlcoxon sgned-rank tests Top-left subtable s for one-shot methods, bottom-rght teratve ones, bottom-left compare the two Italc s state-of-the-art Grey cells hghlght the best of all (AMM mn G ) LMM algorthm MM LMM InvCal AMM mn AMM max conv- G G,s nc MM G G,s 0ran MM G G,s 0ran SVM AMM mn AMM max SVM G 36/4 G,s 38/3 30/6 nc 8/ 3/37 /37 InvCal 4/46 3/47 4/46 4/46 MM 33/6 6/4 5/5 3/8 46/4 G 38/ 35/4 30/0 37/3 47/3 3/7 G,s 35/4 33/7 30/0 35/5 47/3 4/ 7/5 eg AMM mn G,s wns on AMMmn G 7 tmes, loses 5, wth 8 tes 0ran 7/ 4/6 /8 6/4 44/6 0/30 6/34 9/3 MM 5/5 3/7 /8 5/5 45/5 5/35 3/37 3/37 8/4 G 7/3 /8 /8 6/4 45/5 7/33 4/36 4/36 0/40 3/4 G,s 5/5 /9 /8 4/6 45/5 5/35 3/37 3/37 /38 5/ 6/ 0ran 3/7 /9 9/3 4/6 50/0 9/3 5/35 7/33 7/43 9/30 0/9 7/3 conv- /9 /48 /48 /48 /48 4/46 3/47 3/47 4/46 3/47 3/47 4/46 0/50 alter- 0/50 0/50 0/50 0/50 0/30 0/50 0/50 0/50 3/47 3/47 /48 /49 0/50 7/3 ths s the same procedure of alter- SVM Matrx V (eqs (0), ()) used s ndcated n subscrpt: LMM/AMM G, LMM/AMM G,s, LMM/AMM nc respectvely denote v G,s wth s, v G,s wth s learned on cross valdaton (CV; valdaton ranges ndcated n [9]) and v nc For space reasons, results not dsplayed n the paper can be found n [9], Secton 3 (ncludng runtme comparsons, and detaled results by doman) We splt the algorthms n two groups, one-shot and teratve The latter, ncludng AMM, (conv/alter)- SVM, teratvely optmze a cost over labelngs (always consstent wth label proportons for AMM, not always for (conv/alter)- SVM) The former (LMM, InvCal) do not and are thus much faster Tests are done on a 4-core 3GHz CPUs Mac wth 3GB of RAM AMM/LMM/MM are mplemented n R Code for InvCal and SVM s [6] Smulated domans, MM and the homogenety assumpton The testng metrc s the AUC Pror to testng on our domans, we generate 6 domans that gradually move away the b σ j away from each other (wrt j), thus volatng ncreasngly the homogenety assumpton [7] The degree of volaton s measured as B ± B ± F, where B ± s the homogenety assumpton matrx, that replaces all b σ j by b σ for σ {, }, see eq (5) Fgure (a) dsplays the ratos of the AUC of LMM to the AUC of MM It shows that LMM s all the better wth respect to MM as the homogenety assumpton s volated Furthermore, learnng s n LMM mproves the results Experments on the smulated doman of [6] on whch MM obtans zero accuracy also dsplay that our algorthms perform better ( teraton only of AMM max brngs 00% AUC) Small and large domans experments We convert 0 small domans [9] (m 000) and 4 bgger ones (m > 8000) from UCI[6] nto the LLP framework We cast to one-aganst-all classfcaton when the problem s multclass On large domans, the bag assgnment functon s nspred by []: we craft bags accordng to a selected feature value, and then we remove that feature from the data Ths conforms to the dea that bag assgnment s structured and non random n real-world problems Most of our small domans, however, do not have a lot of features, so nstead of clusterng on one feature and then dscard t, we run K-MEANS on the whole data to make the bags, for K n [5] Small domans 
results We perform 5-folds nested CV comparsons on the 0 domans 50 AUC values for each algorthm Table synthesses the results [9], splttng one-shot and teratve algo- 7

8 Table 3: AUCs on bg domans (name: #nstances #features) Icap-shape, IIhabtat, IIIcap-colour, IVrace, Veducaton, VIcountry, VIIpoutcome, VIIIjob (number of bags); for each feature, the best result over one-shot, and over teratve algorthms s bold faced AMM mn AMM max algorthm mushroom: adult: marketng: 45 4 census: I(6) II(7) III(0) IV(5) V(6) VI(4) V(4) VII(4) VIII() IV(5) VIII(9) VI(4) EMM MM LMM G LMM G,s AMMEMM AMMMM AMM G AMM G,s AMM AMMEMM AMMMM AMM G AMM G,s AMM Oracle rthms LMM G,s outperforms all one-shot algorthms LMM G and LMM G,s are compettve wth many teratve algorthms, but lose aganst ther AMM counterpart, whch proves that addtonal optmzaton over labels s benefcal AMM G and AMM G,s are confrmed as the best varant of AMM, the frst beng the best n ths case Surprsngly, all mean map algorthms, even one-shots, are clearly superor to SVMs Further results [9] reveal that SVM performances are dampened by learnng classfers wth the nverted polarty e flppng the sgn of the classfer mproves ts performances Fgure (b, c) presents the AUC relatve to the Oracle (whch learns the classfer knowng all labels and mnmzng the logstc loss), as a functon of the Gn of bag assgnment, gn(s) 4E j [ˆπ j ( ˆπ j )] For an close to, we were expectng a drop n performances The unexpected [9] s that on some domans, large entropes ( 8) do not prevent AMM mn to compete wth the Oracle No such pattern clearly emerges for SVM and AMM max [9] Bg domans results We adopt a /5 hold-out method Scalablty results [9] dsplay that every method usng v nc and SVM are not scalable to bg domans; n partcular, the estmated tme for a sngle run of alter- SVM s >00 hours on the adult doman Table 3 presents the results on the bg domans, dstngushng the feature used for bag assgnment Bg domans confrm the effcency of LMM+AMM No approach clearly outperforms the rest, although LMM G,s s often the best one-shot Synthess Fgure (d) gves the AUCs of AMM mn G over the Oracle for all domans [9], as a functon of the degree of supervson, n/m ( f the problem s fully supervsed) Notceably, on 90% of the runs, AMM mn G gets an AUC representng at least 70% of the Oracle s Results on bg domans can be remarkable: on the census doman wth bag assgnment on race, 5 proportons are suffcent for an AUC 5 ponts below the Oracle s whch learns wth 00K labels 4 Concluson In ths paper, we have shown that effcent learnng n the LLP settng s possble, for general loss functons, va the mean operator and wthout resortng to the homogenety assumpton Through ts estmaton, the suffcency allows one to resort to standard learnng procedures for bnary classfcaton, practcally mplementng a reducton between machne learnng problems [7]; hence the mean operator estmaton may be a vable shortcut to tackle other weakly supervsed settngs [] [3] [4] [5] Approxmaton results and generalzaton bounds are provded Experments dsplay results that are superor to the state of the art, wth algorthms that scale to bg domans at affordable computatonal costs Performances sometmes compete wth the Oracle s that learns knowng all labels, even on bg domans Such expermental fndng poses severe mplcatons on the relablty of prvacy-preservng aggregaton technques wth smple group statstcs lke proportons Acknowledgments NICTA s funded by the Australan Government through the Department of Communcatons and the Australan Research Councl through the ICT Centre of Excellence Program G Patrn acknowledges that part of the research was conducted at the Commonwealth Bank of Australa We thank A Menon, D García-García, N de 
Fretas for nvaluable feedback, and FYu for help wth the code 8

9 References [] F X Yu, S Kumar, T Jebara, and S F Chang On learnng wth label proportons CoRR, abs/40590, 04 [] T G Detterch, R H Lathrop, and T Lozano-Pérez Solvng the multple nstance problem wth axsparallel rectangles Artfcal Intellgence, 89:3 7, 997 [3] G S Mann and A McCallum Generalzed expectaton crtera for sem-supervsed learnng of condtonal random felds In 46 th ACL, 008 [4] J Graça, K Ganchev, and B Taskar Expectaton maxmzaton and posteror constrants In NIPS*0, pages , 007 [5] P Lang, M I Jordan, and D Klen Learnng from measurements n exponental famles In 6 th ICML, pages , 009 [6] D J Muscant, J M Chrstensen, and J F Olson Supervsed learnng by tranng on aggregate outputs In 7 th ICDM, pages 5 6, 007 [7] J Hernández-González, I Inza, and J A Lozano Learnng bayesan network classfers from label proportons Pattern Recognton, 46(): , 03 [8] M Stolpe and K Mork Learnng from label proportons by optmzng cluster model selecton In 5 th ECMLPKDD, pages , 0 [9] B C Chen, L Chen, R Ramakrshnan, and D R Muscant Learnng from aggregate vews In th ICDE, pages 3 3, 006 [0] J Wojtusak, K Irvn, A Brerdnc, and A V Baranova Usng publshed medcal results and nonhomogenous data n rule learnng In 0 th ICMLA, pages 84 89, 0 [] S Rüpng Svm classfer estmaton from group probabltes In 7 th ICML, pages 9 98, 00 [] H Kueck and N de Fretas Learnng about ndvduals from group statstcs In th UAI, pages , 005 [3] S Chen, B Lu, M Qan, and C Zhang Kernel k-means based framework for aggregate outputs classfcaton In 9 th ICDMW, pages , 009 [4] K T La, F X Yu, M S Chen, and S F Chang Vdeo event detecton by nferrng temporal nstance labels In th CVPR, 04 [5] K Fan, H Zhang, S Yan, L Wang, W Zhang, and J Feng Learnng a generatve classfer from label proportons Neurocomputng, 39:47 55, 04 [6] F X Yu, D Lu, S Kumar, T Jebara, and S F Chang SVM for Learnng wth Label Proportons In 30 th ICML, pages 504 5, 03 [7] N Quadranto, A J Smola, T S Caetano, and Q V Le Estmatng labels from label proportons JMLR, 0: , 009 [8] R Nock and F Nelsen Bregman dvergences and surrogates for learnng IEEE TransPAMI, 3: , 009 [9] G Patrn, R Nock, P Rvera, and T S Caetano (Almost) no label no cry - supplementary materal In NIPS*7, 04 [0] M J Kearns and Y Mansour On the boostng ablty of top-down decson tree learnng algorthms In 8 th ACM STOC, pages , 996 [] M Belkn, P Nyog, and V Sndhwan Manfold regularzaton: A geometrc framework for learnng from labeled and unlabeled examples JMLR, 7: , 006 [] J Sh and J Malk Normalzed cuts and mage segmentaton IEEE TransPAMI, : , 000 [3] Y Altun and A J Smola Unfyng dvergence mnmzaton and statstcal nference va convex dualty In 9 th COLT, pages 39 53, 006 [4] P L Bartlett and S Mendelson Rademacher and gaussan complextes: Rsk bounds and structural results JMLR, 3:463 48, 00 [5] V Koltchnsk and D Panchenko Emprcal margn dstrbutons and boundng the generalzaton error of combned classfers Ann of Stat, 30: 50, 00 [6] K Bache and M Lchman UCI machne learnng repostory, 03 [7] A Beygelzmer, V Dan, T Hayes, J Langford, and B Zadrozny Error lmtng reductons between classfcaton tasks In th ICML, pages 49 56, 005 9

10 (Almost) No Label No Cry - Supplementary Materal Gorgo Patrn,, Rchard Nock,, Paul Rvera,, Tbero Caetano,3,4 Australan Natonal Unversty, NICTA, Unversty of New South Wales 3, Ambata 4 Sydney, NSW, Australa {namesurname}@anueduau Table of contents Supplementary materal on proofs Pg Proof of Lemma Pg Proof of Lemma Pg Proof of Theorem 3 Pg 3 Proof of Lemma 4 Pg 4 Proof of Lemma 5 Pg 6 Mean Map estmator s Lemma and Proof Pg 8 Proof of Theorem 6 Pg 9 Proof of Lemma 7 Pg 3 Proof of Theorem 8 Pg 3 Supplementary materal on experments Pg 7 Full Expermental Setup Pg 7 Smulated Doman for Volaton of Homogenety Assumpton Pg 8 Smulated Doman from [] Pg 8 Addtonal Tests on alter- SVM [] Pg 8 Scalablty Pg 9 Full Results on Small Domans Pg 9

11 Supplementary Materal on Proofs Proof of Lemma For any SPSL F (S, h), we can wrte t as ([], Lemma, [3]): F (S, h) F φ (S, h) D φ (y m φ (h(x ))), () where y ff y and 0 otherwse, φ s permssble and D φ s the Bregman dvergence wth generator φ [3] It also holds that: D φ (y φ (h(x ))) b φ F φ (yh(x)) wth: F φ (x) φ ( x) + φ(0) φ(0) φ(/) a φ + φ ( x), () b φ and φ s the convex conjugate of φ, e φ (x) xφ (x) φ(φ (x)) Furthermore, for any permssble φ, the conjex conjugate φ (x) verfes the property φ ( x) φ (x) x, (3) and so we get that: F (S, h) D φ (y m φ (h(x ))) b φ m b φ m b φ m b φ m b φ m b φ m F φ (y h(x )) ( F φ (y h(x )) + ) F φ (y h(x )) ( F φ (y h(x )) + ) F φ ( y h(x )) y h(x ) b φ F φ (yh(x )) y h(x ) m y {,+} ( ) F φ (σh(x )) h y x m σ {,+} σ {,+} F φ (σh(x )) h (µ S) (6) (4) holds because of (3), (5) holds because h s lnear So for any samples S and S wth respectve sze m and m, we have (agan usng the property that h s lnear): ( ) F (S, h) F (S, h) b φ F φ (σh(x )) m m F φ (σh(x )) x S x S σ {,+} whch yelds the statement of the Lemma Proof of Lemma Usng the fact that D w and L are symmetrc, we have: l(l, X) X + h (µ S µ S ), (7) X tr ( B D w Π ) X + X tr ( X ΠD w Π ) X + γ X tr ( X ) LX ΠD w B + ΠD w Π X + γlx 0, out of whch B± follows n Lemma (4) (5)

12 3 Proof of Theorem 3 We let Π o [DIAG(ˆπ) DIAG(ˆπ )] N an orthonormal system (n jj (ˆπ j +( ˆπ j) ) /, j [n] and 0 otherwse) Let K Πo be the n-dm subspace of R d generated by Π o The proof of Theorem (3) explots the followng Lemma, whch assumes that ε s any > 0 real for L n (8) (man fle) to be 0 When ε 0, the result of Theorem (3) stll holds but follows a dfferent proof Lemma Let A ΠD w Π and L defned as n (8) (man paper) Denote for short U ( L A + γ I ) (8) Suppose there exsts ξ > 0 such that for any x R n, the projecton of Ux n K Πo, x U,o, satsfes Then: Proof Combnng Lemma and (5), we get x U,o ξ x (9) M M F γξ B ± F (0) B ± B± Defne the followng permutaton matrx: C ( ) (A + γl) A I B ± ( (γl) A + I ) B ± () [ 0 I I 0 ] R n n () A ΠD w Π s not nvertble but dagonalsable Its (orthonormal) egenvectors can be parttoned n two matrces P o and P such that: We have: P o P [DIAG(ˆπ ) DIAG(ˆπ)] N CΠ o R n n (egenvalues 0), (3) ΠN R n n (egenvalues w j (ˆπ j + ( ˆπ j) ), j) (4) M M P o CB ± P o C B± P ( o C (γl) A + ) I B ± Π ( o (γl) A + ) I B ± (5) γπ ( o L A + γ ) I B ± (6) Eq (5) follows from the fact that C s dempotent Pluggng Frobenus norm n (6), we obtan M M F γ Π ( o L A + γ ) I B ± F γ d k Π o ( L A + γ I ) b ± k d γ ξ b ± k (7) k γ ξ B ± F, whch yelds (0) In (7), b ± k denotes column k n B± Ineq (7) makes use of assumpton (9) To ensure x U,o ξ x, t s suffcent that Ux ξ x, and snce Ux U F x, t s suffcent to show that, (8) U ξ F 3

13 wth U ξ L ξ A + ξγ I, for relevant choces of ξ We have let L ξ (/ξ)l Let 0 λ () λ n () denote the ordered egenvalues of a postve-semdefnte matrx n R n n It follows that, snce L s symmetrc postve defnte, we have λ j (L ξ A) λ j(a) λ n (L ξ ) ( 0), j [n] We have used eq (3) Weyl s Theorem then brngs: λ j (U ξ ) λ n (L ξ ) λ j (A) + ξγ λ n (L ξ ) { ξ γ f j [n] λ n(l ξ ) λ j(a) otherwse (9) Gershgorn s Theorem brngs λ n (/ξ)(ε + max j j l jj ), and furthermore the egenvalues of A satsfy λ j w j /, j n + We thus have: U ξ F nγ ξ ) 4n (ε + max j j l + jj ξ mn j wj (0) In (9) and (0), we have used the egenvalues of A gven n eqs (3) and (4) Assumng: γ ξ n, () a suffcent condton for the rght-hand sde of (0) to be s that ξ ε + max j j l jj n mn j w j () To fnsh up the proof, recall that L D V wth d jj j,j v jj and the coordnates v jj 0 Hence, l jj j j j v jj n max v jj, j [n] j j The proof s fnshed by pluggng ths upperbound n () to choose ξ, then takng the maxmal value for γ n () and fnally solvng the upperbound n (0) Ths ends the proof of Theorem 3 4 Proof of Lemma 4 We frst consder the normalzed assocaton crteron n (0): ASSOC(S j, S j ) vjj N ( ASSOC(Sj, S j ) ASSOC(S j, S j S j ) + x S j,x S j ASSOC(S ) j, S j ) ASSOC(S j, S j S j ) x x (3), 4

14 Remark that b j b j x x m j m j x S j x S j m x + j x S j m j x S j m + j m j m j x S j x x S j x + m j m j x S j,x S j x S j m j x S j x x x x x m j m j x S j m j m j x m j m j x S j,x S j x S j,x S j x x x S j x x x (4) + m j x m j m + m j x j m j m x x j m j m j x S j x S j x S j,x S j } {{ } a x x (5) m j m j x S j,x S j ASSOC(S j, S j ) (6) m j m j ( n ) ( Eq (4) explots the fact that j a n ) j n j a j and eq (5) explots the fact that a (m j m j ) x S j,x S x j x We thus have: ASSOC(S j, S j ) ASSOC(S j, S j S j ) ASSOC(S j, S j ) ASSOC(S j, S j ) + ASSOC(S j, S j ) ASSOC(S j, S j ) ASSOC(S j, S j ) + mjm j b j b j κ m j κ m j + mjm j b j b j + m j κ b j b j 5 (7) (8) (9)

15 Eq (7) uses (6) and eq (8) uses assumpton (D) Eq (8) also holds when permutng j and j, so we get: ( ) ς(v NC ε, B ± ) max j j n + + mj κ b j b j + + m j κ b j b j B ± F ( ) ε n + B ± mnj mj F + κ mn j,j b j b j ( ) ε n + B ± mnj mj F (30) + κ mn j,j b j b j ε n d max σ,j bσ j + 4κ d max σ,j b σ j mn j,j b j b j ε n d max 4κ d σ,j bσ j + κ max σ,j b σ j ) f (max NC σ,j bσ j o(), (3) where the last nequalty uses assumpton (D), and (30) uses the property that (a+b) a +b We have let f NC (x) ε n dx + 4κ d κx, (3) whch s ndeed o() f ε o(n / x) Ths proves the Lemma for ς(v NC, B ± ) The case of ς(v G,s, B ± ) s easer, as ( exp b ) ( j b j exp mn j,j b j b j ) s s ( exp κ ) s max σ,j bσ j, from assumpton (D) alone, whch gves ( ( ε ς(v G,s, B ± ) B ± F n + exp κ )) s max σ,j bσ j ( ( ε B ± F n + exp κ )) s max σ,j bσ j ( ( ε d max σ,j bσ j n + exp κ )) s max σ,j bσ j ) f (max G σ,j bσ j o(), (33) as clamed We have let f G (x) ε n dx+dx exp( κx/s), whch s ndeed o() f ε o(n / x) Remark that we shall have n general f G (x) f NC (x) and even f G (x) o(f NC (x)) f ε 0, so we may expect better convergence n the case of V G,s as max σ,j b σ j grows 5 Proof of Lemma 5 We frst restate the Lemma n a more explct way, that shall provde explct values for κ l and κ n Lemma There exst κ jj and s jj dependng on d j, d j, and κ jj > dependng on m j, m j, such that: 6

16 If v G,s jj jj > exp( /4) then S j, S j are not lnearly separable; If v G,s jj jj < exp( 64) then S j, S j are lnearly separable; If v NC jj If v NC jj > κ jj then S j, S j are not lnearly separable; < κ jj /κ jj then S j, S j are lnearly separable Proof We frst consder the normalzed assocaton crteron n (0), and we prove the Lemma for the followng expressons of κ jj and κ jj : κ jj d jj + d jj d j d j, (34) κ jj 5 max{m j, m j }, (35) wth d jj max{d j, d j } and d j max x,x S j x x, j j [n] For any bag S j, we let (b j, r j) MEB(S j ) denote the mnmum enclosng ball (MEB) for bag S j and dstance L, that s, r j s the smallest unque real such that!b j : d(x, b j ) x b j r j, x S j We have let d(x, b j ) x b j We are gong to prove a frst result nvolvng the MEBs of S j and S j, and then wll translate the result to the Lemma s statement The followng propertes follows from standard propertes of MEBs and the fact that d(, ) s a dstance (they hold for any j j ): (a) d(x, x ) r j, x, x S j ; (b) If bags S j and S j are lnearly separable, then x CO(S j ), x S j such that d(x, x ) max{r j, r j }; here, CO denotes the convex closure; (c) If bags S j and S j are lnearly separable, then d(b j, b j ) max{r j, r j }, where b j and b j are the bags average; (d) x S j, x S j st d(x, x ) r j ; (e) d(x, x ) max{r j, r j } + d(b j, b j ), x CO(S j), x CO(S j ) Let us defne ASSOC(S j, S j ) d (x, x ) (36) x S j,x S j We remark that, assumng that each bag contans at least two elements wthout loss of generalty: vjj NC + (37) + ASSOC(Bj,B j ) ASSOC(B j,b j) + ASSOC(Bj,B j ) ASSOC(B j,b j ) We have ASSOC(S j, S j ) 4m j rj and ASSOC(S j, S j ) 4m j r j (because of (a)), and also ASSOC(S j, S j ) max{m j, m j } max{rj, r j } when S j and S j are lnearly separable (because of (b)), whch yelds n ths case vjj NC + + max{mj,m j } max{r j,r j } m jrj + max{r j,r j } r j + + max{mj,m j } max{r j,r j } m j r j + max{r j,r j } r j (38) Let us name κ jj the rght-hand sde of (38) It follows that when vnc jj > κ jj, S j and S j are not lnearly separable 7

17 On the other hand, we have ASSOC(S j, S j ) m j rj and ASSOC(S j, S j ) m j r j (because of (d)), and also ASSOC(S j, S j ) m j m j ( max{r j, r j } + d(b j, b j )) m j m j (4 max{rj, rj } + d (b j, b j )), (39) because of (e) and the fact that (a + b) a + b It follows that j j : vjj NC + (40) + m j (4 max{r j,r j }+d (b j,b j )) + mj(4 max{r j,r j }+d (b j,b j )) rj r j For any j j, when d (b j, b j ) 4 max{r j, r j }, then we have from (40): vjj NC + + 6m j max{r j,r j } + 6mj max{r j,r j } rj r j > κ jj /(3 max{m j, m j }) (4) Hence, when vjj NC κ jj /(3 max{m j, m j }), t mples d(b j, b j ) > max{r j, r j }, mplyng d(b j, b j ) > r j + r j, whch s a suffcent condton for the lnear separablty of S j and S j So, we can relate the lnear separablty of S j and S j to the value of vjj NC wth respect to κ jj defned n (38) To remove the dependence n the MEB parameters and obtan the statement of the Lemma, we just have to remark that d j /4 r j 4d j, j [n], whch yelds κ jj /6 κ jj κ jj Hence, when vjj NC > κ jj, t follows that vnc jj > κ jj and S j and S j are not lnearly separable On the other hand, when vjj NC κ jj /(6 3 max{m j, m j }) κ jj /κ jj, then vjj NC κ jj /(3 max{m j, m j }) and the bags S j and S j are lnearly separable Ths acheves the proof of Lemma 5 for the normalzed assocaton crteron n (0) The proof for v G,s jj s shorter, and we prove t for s j,j max{d j, d j } (4) We have (/) max{d j, d j } max{r j, r j } max{d j, d j } Hence, because of (c) above, f S j and S j are lnearly separable, then v G,s jj /e/4 ; so, when v G,s jj > /e/4, the two bags are not lnearly separable On the other hand, f d(b j, b j ) max{r j, r j }, then because of (e) above d(b j, b j ) 4 max{r j, r j } 8 max{d j, d j }, and so v G,s jj /e64 Ths mples that f v G,s jj < /e64, then d(b j, b j ) > max{r j, r j } r j + r j, and thus the two bags are lnearly separable, as clamed Ths acheves the proof of Lemma Ths acheves the proof of Lemma 5 6 Mean Map estmator s Lemma and Proof It s not hard to check that the randomzed procedure that bulds µ S RAND yx for some random x S and y {, } guarantees O( + γ) approxmablty when some bags are close to the convex hull of S, for small γ > 0 Hence, the Mean Map estmaton of µ S can be very poor n that respect Lemma 3 For any γ > 0, the Mean Map estmator µ S MM µ S / max σ,j b σ j γ, even when (D + D) hold cannot guarantee µ MM S Proof Let x > 0, ɛ (0, ), p (0, ), p / We create a dataset from four observatons, {(x 0, ), (x 0, ), (x 3 x, ), (x 4 x, )} There are two bags, S takes ɛ of x and ɛ of x S takes ɛ of x 4 and ɛ of x 3 The label-wse estmators µ σ of [4] are soluton of ( [ ] [ ] ɛ ɛ ɛ ɛ [ µ µ ] ɛ ɛ ɛ [ ( ɛ)x ɛx ] ɛ 8 ɛ ] ) [ ɛ ɛ ɛ ɛ ] [ x 0 (43)

18 On the other hand, the true quanttes are: [ ] µ µ [ ( ɛ)x ɛx ] (44) We now mx classes n S and pck bag proportons q P S [S ] and q P S [S ] We have the class proportons defned by P S [y +] ɛq + ( ɛ)( q) p Then ( ) ( ) µ S µ S p( ɛ) ɛ x ( p)ɛ ɛ x ɛ p ɛ ɛ x ɛ( q)x (45) Furthermore, max b σ x We get µ S µ S max b σ ɛ( q) (46) Pckng ɛ and ( q) both > (γ/) s suffcent to have eq (46) > γ for any γ > 0 Remark that both assumptons (D) and (D) hold for any κ < and any κ > 0 7 Proof of Theorem 6 The proof of the Theorem nvolves two Lemmata, the frst of whch s of ndependent nterest and holds for any convex twce dfferentable functon F, and not just any F φ So, let us defne: ( ) b F (S y, θ, µ) F (σθ x ) m θ µ (47) where b s any fxed postve real Defne also the regularzed loss: F (S y, θ, µ, λ) F (S y, θ, µ) + λ θ (48) Let f k R m denote the vector encodng the k th varable n S : f k x k For any k [d], let ( d f k σ k f k denote a normalzaton of vectors f k n the sense that d f k ( d d k ( d k f k f k k ) d d fk (49) ) d ) d k f k (50) Let Ṽ collect all vectors f k n column and V collect all vectors f k n column Wthout loss of generalty, we assume V V 0, e V V postve defnte (e no feature s a lnear combnaton of the others), mplyng, because the columns of Ṽ are just postve rescalng of the columns of V, that Ṽ Ṽ 0 as well We use V nstead of F as n the man paper, n order not to counfound wth the general convex surrogate notaton F that we use here Lemma 4 Gven any two µ and µ, let θ and θ be the respectve mnmzers of F (S y,, µ, λ) and F (S y,, µ, λ) Suppose there exsts F > 0 such that surrogate F satsfes F (±(αθ + ( α)θ ) x ) F, α [0, ], [m] (5) Then the followng holds: θ θ λ + em F vol (Ṽ) µ µ, (5) where vol(ṽ) det Ṽ Ṽ denote the volume of the (row/column) system of Ṽ 9

19 Proof Our proof begns followng the same frst steps as the proof of Lemma 7 n [5], addng the steps that handle the lowerbound on F Consder the followng auxlary functon A F (τ ): A F (τ ) ( F (S y, θ, µ) F (S y, θ, µ ) ) (τ θ ) + λ τ θ, (53) where the gradent of F s computed wth respect to parameter θ The gradent of A F () s: The gradent of A F satsfes A F (τ ) F (S y, θ, µ) F (S y, θ, µ ) + λ(τ θ ), (54) A F (θ ) F (S y, θ, µ, λ) F (S y, θ, µ, λ) 0, (55) as both gradents n the rght are 0 because of the optmalty of θ and θ wth respect to F (S y,, µ, λ) and F (S y,, µ, λ) The Hessan H of A F s HA F (τ ) λi 0 and so A F s convex and s thus mnmal at τ θ Fnally, A F (θ ) 0 It comes thus A F (θ ) 0, whch yelds equvalently: 0 ( F (S y, θ, µ) F (S y, θ, µ ) ) (θ θ ) + λ θ θ ( ) b F (yθ x ) m µ b F (yθ x ) + m µ (θ θ ) y y +λ θ θ ( b F (yθ x ) ) F (yθ x ) (θ θ m ) y y } {{ } a (µ µ ) (θ θ ) + λ θ θ (56) Let us lowerbound a We have F (yθ x) yf (yθ x)x, and a Taylor expanson brngs that for any θ, θ, there exsts some α [0, ] such that, defnng we have: We thus get: a u α, y(αθ + ( α)θ ) x, (57) F (yθ x ) F (yθ x ) + y(θ θ ) x F (u α, ) (58) ( F (yθ x ) y y ( y ) F (yθ x ) (θ θ ) y(f (yθ x ) F (yθ x ))x ) (θ θ ) ( ) (θ θ ) x F (u α, )x (θ θ ) y ((θ θ ) x ) F (u α, ) F ((θ θ ) x ) (59) F (θ θ ) SS (θ θ ), (60) where matrx S R d m s formed by the observatons of S y n columns, and neq (59) comes from (5) Defne T (d/ x )SS Its trace satsfes tr (T) d Let λ d λ d λ > 0 0

20 denote egenvalues of T, wth λ strctly postve because SS V V 0 The AGH nequalty brngs: Multplyng both sde by λ and rearrangng yelds: d λ k ( ) d d λ k (6) d k ( ) d tr (T) λ d ( ) d d λ d ( ) d d (6) d λ ( ) d d det T (63) d Let λ > 0 denote the mnmal egenvalue of SS It satsfes λ ( x /d)λ and thus t comes from neq (63): ( ) d ( ) d d d λ d x det SS ( ) [ d ( ) ] d d d det d x SS ( ) d d det Ṽ Ṽ (64) d ( ) d d vol (Ṽ) (65) d e vol (Ṽ) (66) We have used notaton vol(ṽ) det Ṽ Ṽ Snce (θ θ ) SS (θ θ ) λ θ θ, combnng (60) wth (66) yelds the followng lowerbound on a: Gong back to (56), we get λ θ θ (µ µ ) (θ θ ) + a e F vol (Ṽ) θ θ (67) b em F vol (Ṽ) θ θ 0 Snce (µ µ ) (θ θ ) µ µ θ θ, we get after channg the nequaltes and solvng for θ θ : as clamed θ θ λ + em F vol (Ṽ) µ µ, The second Lemma s used to (5) when F (x) F φ Notce that we cannot rely on strong convexty arguments on F φ, as ths do not hold n general The Lemma s stated n a more general settng than for just F F φ

21 Lemma 5 Fx λ, b > 0, and let x max x Suppose that µ µ for some µ > 0 Let ( ) b F (S y, θ, µ, λ) F (σθ x ) m θ µ + λ θ, (68) and let θ arg mn θ F (S y, θ, µ, λ) Suppose that F () s L-Lpschtz Then σ θ blx + µ λ (69) Proof Let us defne a shrnkng of the optmal soluton θ, θ α αθ for α (0, ) We have ( ) b F (S y, θ α, µ, λ) F (σθα x ) m θ α µ + λ θ α σ ( ) b F (σαθ x ) α m θ µ + λα θ σ ( b F (σθ x ) + L ) σαθ m x σθ x + α θ µ σ +λα θ (70) ( ) b F (σθ bk( α) x ) + θ x α m m θ µ σ +λα θ, (7) where (70) holds because F s L-Lpschtz To have eq (7) smaller than F (S y, θ, µ, λ), we need equvalently: bl( α) θ x α m θ µ + λα θ θ µ + λ θ, that s: bl( α) m θ x + α θ µ λ( α ) θ, and to fnd an α (0, ) such that ths holds, because of Cauchy-Schwartz nequalty, t s suffcent that ( α)(blx + µ) λ( α ) θ, e: θ blx + µ λ( + α) Hence, whenever θ > (blx + µ )/λ, there s a shrnkng of the optmal soluton to eq (68) that further decreases the rsk, thus contradctng ts optmalty Ths ends the proof of Lemma 5 Notce that Lemma 5 does not requre F (x) to be convex, nor dfferentable To use ths Lemma, remark that for any F φ, F φ(x) b φ (φ ) ( x) b φ (φ ) ( x) [ /b φ, 0], (7) for any x φ ([0, ]) [], and thus F φ s /b φ -Lpschtz Fnally, consderng (5), for any α [0, ] ± (αθ + ( α)θ ) x (α θ + ( α) θ )x x + α µ + ( α) µ (73) λ x + max{ µ, µ }, (74) λ where neq (73) uses Lemma 5 wth b /K b φ µ and µ are the parameters of F (S y,, µ, λ) and F (S y,, µ, λ) n Lemma 4

22 Algorthm Label Assgnaton (LA) Input θ R d, a bag B {x R d,,,, m}, bag sze m + [m]; If B then stop Else f m + (m) then y I(m + m) I(m + 0),,,, m Else Step : arg max θ x Step : y sgn(θ x ) Step 3 : LA(θ, B\{x }, m + I(y )) Now, gong back to the parameters of Theorem 6, we make the change µ µ S and µ µ S and obtan the statement of the Theorem for nterval Ths acheves the proof of Theorem 6 I [±(x + max{ µ S, µ S })] (75) 8 Proof of Lemma 7 We make the proof for optmzaton strategy OPT mn The case OPT max flps the choce of the label n Step To mnmze F φ (S y, θ t, µ S (σ)) over σ Σˆπ, we just have to fnd σ arg max σ Σ ˆπ θ σ x, and we can do that bag-wse Algorthm presents the labelng (notaton (m) {,,, m }) Remark that the tme complexty for one bag s O(m j log m j ) due to the orderng (Step ), so the overall complexty s ndeed O(m max log m ) Lemma 6 Let σ {σ, σ,, σ m} be the set of labels obtaned after runnng LA(θ, S j, m + j ) for j,,, n Then σ arg max σ Σ ˆπ θ σ x Proof The total edge, θ σ x (for any σ Σˆπ ), can be summable bag-wse wrt the coordnates of σ Consder thus the optmal set {σ } B arg max σ {,} m : σm + m θ x σ B x, for some bag B {x,,,, m }, wth constrant m + [m ] Ths set contans the label assgnment σ returned by LA(θ, B, m + ), a property that follows from two smple observatons: P Consder any observaton x of bag B; for any optmal labelng σ of B, let m + m + I(σ ) Defne the set {σ } of optmal labelngs of B\{x } wth constrant m + m + I(σ ) Then ths set concdes wth the set created by takng the elements of {σ } B to whch we drop coordnate Ths follows from the per-observaton summablty of the total edge wrt labels P Assume m + (m ) arg max θ x, there exsts an optmal assgnment σ such that σ sgn(θ x ) Otherwse, startng from any optmal assgnment σ, we can flp the label of x and the label of any other x for whch σ σ, and get a label assgnment that satsfes constrant m + and cannot be worse than σ, and s thus optmal, a contradcton Hence, LA(θ, B, m + ) pcks at each teraton a label that matches one n a subset of optmal labelngs, and the recursve call preserves the subset of optmal labelngs Snce when m + (m) the soluton returned by LA(θ, B, m + ) s obvously optmal, we end up when the current B s empty wth σ arg max σ Σ ˆπ θ σ x, as clamed 9 Proof of Theorem 8 We prove separately Eqs (4) and (5) 3

23 9 Proof of eq (4) Notatons : unless explctly stated, all samples lke S and S are of sze m To make the readng of our expectatons clear and smple, we shall wrte E D for E (x,y) D, E Σm for E σ Σm, E S for E (x,y) S, E D m for E S D and E Dm for E S D We now proceed to the proof, that follows the same man steps as that of Theorem 5 n [6] For any q [0, ], let us defne the convex combnaton: F φ (q, h(x)) qf φ (h(x)) + ( q)f φ ( h(x)) (76) It follows that E Σ ˆπ E S [F φ (σ(x)h(x))] E S [F φ (ˆπ(x), h(x))], (77) wth ˆπ(x) the label proporton of the bag to whch x belongs n S We also have h, wth Λ(S) E D [F φ (yh(x))] E S [F φ (ˆπ(x), h(x))] + Λ(S), (78) sup g {E D [F φ (yg(x))] E S [F φ (ˆπ(x), g(x))]} (79) Let us bound the devatons of Λ(S) around ts expectaton on the samplng of S, usng the ndependent bounded dfferences nequalty (IBDI, [7]) for whch we need to upperbound the maxmum dfference for the supremum term computed over two samples S and S of the same sze, such that S s S wth one example replaced We have: Λ(S) Λ(S ) E S [F φ (ˆπ(x), g(x))] E S [F φ (ˆπ (x), g(x))], (80) wth ˆπ and ˆπ denotng the correspondng label proportons n S and S Let {x } S\S and {x } S \S Let x S j and x S j for some bags j and j Upperbound (80) depends only on bags j and j For any x (S j S j )\{x, x }, eqs () and (3) brng: F φ (ˆπ(x), g(x)) F φ (ˆπ (x), g(x)) F φ(g(x)) F φ ( g(x)) m(x) g(x) b φ m(x) (8) h b φ m(x), (8) where m(x) s the sze of the bag to whch t belongs n S, plus ff t s bag j and j j, mnus ff t s bag j and j j Furthermore, () and (3) also brng: F φ (ˆπ(x), g(x)) F φ ( g(x) ) + b φ (( ˆπ(x)) g(x)>0 + ˆπ(x)( g(x)>0 )) g(x) F φ (0) + b φ (( ˆπ(x)) g(x)>0 + ˆπ(x)( g(x)>0 ))h Also, t comes from ts defnton that: We obtan that: Λ(S) Λ(S ) m F φ (0) + h b φ, x S F φ (0) b φ (0φ (0) φ(φ (0))) φ(/) b φ (83) ) ( + h + + h + b φ b φ m x (S j S j )\{x,x } h b φ m(x) Q m, (84) 4

24 where ( ) h Q + b φ So the IBDI yelds that wth probablty δ/ over the samplng of S, (85) Λ(S) E Dm sup {E D [F φ (yg(x))] E S [F φ (ˆπ(x), g(x))]} + Q g m log δ, (86) We now upperbound the expectaton n (86) Usng the convexty of the supremum, we have E Dm sup {E D [F φ (yg(x))] E S [F φ (ˆπ(x), g(x))]} g { E Dm sup ED m [F φ(yg(x))] E S [F φ (ˆπ(x), g(x))] } g E Dm,D sup {E m S [F φ (yg(x))] E S [F φ (ˆπ(x), g(x))]} (87) g Consder any set S D m, and let I / [m] be a subset of m ndces, pcked unformly at random among all ( ) m m possble choces For any I [m], let S(I) denote the subset of examples whose ndex matches I, and for any x S(I), let ˆπ(x S(I)) denote ts bag proporton n S(I) For any I / l ndexed by l and any x S, let: ˆπ s l (x) { ˆπ(x S(I / l )) f x S(I / l ) ˆπ(x S\S(I / l )) otherwse (88) denote the label proportons nduced by the splt of S n two subsamples S(I / l ) and S\S(I/ l ) Let { ˆπ l l (x) y f x S(I / l ) ˆπ(x S\S(I / l )) otherwse, (89) where y s the true label of x Let σ l (x) x S(I / l ) The Label Proporton Complexty (LPC) L m quantfes the dscrepance between these two estmators When each bag n S has label proporton zero or one, each term factorng classfer h n eq (3) (man fle) s zero, so L m 0 Lemma 7 The followng holds true: E Dm,D sup {E m S [F φ (yg(x))] E S [F φ (ˆπ(x), g(x))]} g E Dm,Σ m sup {E S [σ(x)f φ (ˆπ(x), h(x))]} + L m (90) h Proof For any σ Σ m and any sets S {x, x,, x m } and S {x, x,, x m}of sze m, denote and S σ S σ {x ff σ, x otherwse}, {x ff σ, x otherwse} (S S )\S σ (9) ˆπ (x) { ˆπσ (x) f x S σ, ˆπ σ (x) otherwse, (9) where ˆπ σ () denote the label proportons n S σ and ˆπ σ () denote the label proportons n S σ Let ˆπ() denote the label proportons n S, ˆπ () denote the label proportons n S (we know each bag to whch each example n S belongs to, so we can compute these estmators), We have E Dm,D m sup h E Dm,D m sup h E Dm,D m sup h {E S [F φ (yh(x))] E S [F φ (ˆπ(x), h(x))]} { E S [F φ (ˆπ (x), h(x))] E S [F φ (ˆπ(x), h(x))] b φ { E Sσ [σ(x)f φ (ˆπ l (x), h(x))] E Sσ [σ(x)f φ (ˆπ r (x), h(x))] b φ 5 } } (93),

25 wth E S [(( ˆπ (x)) y ˆπ (x) y )h(x)] ; (94) ˆπ l (x) (( + σ(x))ˆπ (x) + ( σ(x))ˆπ(x)), ˆπ r (x) We also have from eq () and (3): (( + σ(x))ˆπ(x) + ( σ(x))ˆπ (x)) (95) E Sσ [σ(x)f φ (ˆπ l (x), h(x))] E Sσ [σ(x)f φ (ˆπ σ (x), h(x))] b φ, (96) E Sσ [σ(x)f φ (ˆπ r (x), h(x))] E Sσ [σ(x)f φ (ˆπ σ (x), h(x))] 3, b φ (97) wth E Sσ [σ(x)(ˆπ l (x) ˆπ σ (x))h(x)], (98) 3 E Sσ [σ(x)(ˆπ r (x) ˆπ σ (x))h(x)] (99) We also have: 3 E S [(ˆπ (x) y )h(x)] + E S [(ˆπ(x) ˆπ (x))h(x)] 4 (00) Puttng eqs (93), (96), (97) and (00) altogether, we get, after ntroducng Rademacher varables: {E S [F φ (yh(x))] E S [F φ (ˆπ(x), h(x))]} E Dm,D m,σm sup h E Dm,D m,σm sup h E Dm,D m,σm sup h +E Dm,D m,σm sup h {E Sσ [σ(x)f φ (ˆπ σ (x), h(x))] E Sσ [σ(x)f φ (ˆπ σ (x), h(x))] + 4 } {E Sσ [σ(x)f φ (ˆπ σ (x), h(x))] E Sσ [σ(x)f φ (ˆπ σ (x), h(x))]} {E S [(ˆπ (x) y )h(x)] + E S [(ˆπ(x) ˆπ (x))h(x)]} E Dm,D sup {E m,σm S [σ(x)f φ (ˆπ (x), h(x))] E S [σ(x)f φ (ˆπ(x), h(x))]} h +E Dm,D sup {E m,σm S [(ˆπ (x) y )h(x)] + E S [(ˆπ(x) ˆπ (x))h(x)]} (0) h E Dm,Σ m sup {E S [σ(x)f φ (ˆπ(x), h(x))]} h {E S [(ˆπ (x) y )h(x)] + E S [(ˆπ(x) ˆπ (x))h(x)]} (0) +E Dm,D m,σm sup h Eq (0) holds because the dstrbuton of the supremum s the same We also have: E Dm,D sup {E m,σm S [(ˆπ (x) y )h(x)] + E S [(ˆπ(x) ˆπ (x))h(x)]} h E Dm,D m,σm sup h {E S [(ˆπ(x) ˆπ (x))h(x)] E S [( y ˆπ (x))h(x)]} E Dm E I /,I / sup E S [σ (x)(ˆπ s (x) ˆπl (x))h(x)] (03) h L m (04) Eq (03) holds because swappng the sample does not make any dfference n the outer expectaton, as each couple of swapped samples s generated wth the same probablty wthout swappng Puttng altogether (0) and (04) ends the proof of Lemma 7 We now bound the devatons of E Σm sup h {E S [σ(x)f φ (ˆπ(x), h(x))]} wth respect to ts expectaton over the samplng of S, E Dm,Σ m sup h {E S [σ(x)f φ (ˆπ(x), h(x))]} To do that, we use a thrd tme the IBDI and compute an upperbound for E Σ m sup g {E S [σ(x)f φ (ˆπ(x), h(x))]} E Σm sup g {E S [σ(x)f φ (ˆπ(x), h(x))]} [ ] sup E g {E S [σ(x)f φ (ˆπ(x), h(x))]} Σm (05) max Σ m sup g {E S [σ(x)f φ (ˆπ(x), h(x))]} [ ] sup g {E S [σ(x)f φ (ˆπ(x), h(x))]} sup g {E S [σ(x)f φ (ˆπ(x), h(x))]} 6 Q m, (06)

26 where Q s defned n eq (85) Eq (05) holds because of the trangular nequalty Ineq (06) holds because σ() So wth probablty δ/ over the samplng of S, E Σm sup {E S [σ(x)f φ (ˆπ(x), h(x))]} h E Dm,Σ m sup {E S [σ(x)f φ (ˆπ(x), h(x))]} Q h m log δ, (07) where Q s defned va (84) We obtan that wth probablty > ((δ/) + (δ/)) δ, the followng holds h: E D [F φ (yh(x))] E S [F φ (ˆπ(x), h(x))] + Λ(S) (see (78) and (79)) E S [F φ (ˆπ(x), h(x))] + E Dm sup {E D [F φ (yg(x))] E S [F φ (ˆπ(x), g(x))]} g as clamed 9 Proof of eq (5) +Q m log (from (86)) δ E S [F φ (ˆπ(x), h(x))] + E Dm,D sup {E m S [F φ (yg(x))] E S [F φ (ˆπ(x), g(x))]} g +Q m log (from (87)) δ E S [F φ (ˆπ(x), h(x))] + E Dm,Σ m sup {E S [σ(x)f φ (ˆπ(x), g(x))]} + L m g +Q m log (Lemma (7)) δ E S [F φ (ˆπ(x), h(x))] + E Σm sup {E S [σ(x)f φ (ˆπ(x), h(x))]} + L m h +Q m log δ (from (07)) E Σ ˆπ E S [F φ (σ(x)h(x))] + ˆR b m + L m + 4 ( ) h + b φ m log δ, We have F φ (x) (/b φ))(φ ) ( x) (/b φ )(φ ) ( x) [ /b φ, 0], and thus F φ s /b φ - Lpschtz, so Theorem 4 n [8] brngs: Rm(F, b { η) E σ Σm sup E [m] [σ E σ Σ [F ˆπ φ(σ h(x ) η)]] } h H { b φ E σ Σm sup E [m] [σ E σ Σ [σ ˆπ h(x ) η]] } h H { b φ E σ Σm sup E [m] [σ E σ Σ [σ ˆπ h(x )]] } h H { b φ E σ Σm sup E [m] [σ (ˆπ(x ) )h(x )] }, h H as clamed 3 Supplementary Materal on Experments 3 Full Expermental Setup All mean operator algorthms have been coded n R For SVM and InvCal, we used a Matlab mplementaton from the authors of [] The ranges of parameters for cross valdaton are λ λ m wth λ {0} 0 {0,,}, γ 0 {,,0}, σ {,,0} for mean operator algorthms We ran all 7

3 Supplementary Material on Experiments

3.1 Full Experimental Setup

All mean operator algorithms have been coded in R. For ∝SVM and InvCal, we used a Matlab implementation from the authors of [1]. The ranges of parameters for cross validation are λ = λ̃·λ_m with λ̃ ∈ {0} ∪ 10^{0,1,2}, γ ∈ 10^{1,…,10} and σ ∈ {1,…,10} for the mean operator algorithms. We ran all experiments with D = wI and ε = 0. Since we tested on similar domains (1-6 are actually the same), ranges for InvCal and ∝SVM were taken from [1]. To avoid an additional source of complexity in the analysis, we cross-validated all hyper-parameters using the knowledge of all labels of the validation sets; notice that labels at validation time would generally not be accessible in real-world applications.

3.2 Simulated Domain for Violation of the Homogeneity Assumption

The synthetic data generated for this test consists of 6 classification problems, each formed by 6 bags of 100 two-dimensional normal samples. The distribution generating the first dataset satisfies the homogeneity assumption (Figure 1 (a)). Then, we gradually change the position of the class-conditional, bag-conditional means along one linear direction (to the right in Figure 1 (b) and (c)), with different offsets for different bags. In Figure 1 we give a graphical explanation of the process with 3 bags; a sketch of a generator of this kind is given below.

[Figure 1: Violation of the homogeneity assumption. Three panels (a)-(c) of two-dimensional samples, with marker shape giving the label (+/−) and colour the bag (1-3); the class-conditional bag means shift further right from (a) to (c).]
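The following R sketch (our own illustration, not the experiment code) generates data of this kind; the number of bags, points per bag and offset magnitudes are assumptions chosen for the illustration, not the exact constants used above:

    # Bags of 2D Gaussians whose class-conditional means drift per bag,
    # violating the homogeneity assumption behind the Mean Map estimator.
    set.seed(1)
    make_problem <- function(n_bags = 4, n_per_bag = 100, drift = 0.5) {
      do.call(rbind, lapply(seq_len(n_bags), function(b) {
        y  <- ifelse(runif(n_per_bag) < 0.5, +1, -1)   # balanced binary labels
        mu <- y * 1.0 + drift * (b - 1)                # class mean shifted per bag, x1 only
        data.frame(bag = b, y = y,
                   x1 = rnorm(n_per_bag, mean = mu),
                   x2 = rnorm(n_per_bag, mean = 0))
      }))
    }
    d_hom   <- make_problem(drift = 0)   # homogeneous, as in Figure 1 (a)
    d_shift <- make_problem(drift = 1)   # strongly shifted, as in Figure 1 (c)
    # the only supervision an LLP learner sees: per-bag label proportions
    props <- tapply((d_shift$y + 1) / 2, d_shift$bag, mean)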

3.3 Simulated Domain from [1]

The MM algorithm was shown to learn a model with zero-accuracy prediction on the toy domain of [1]. We report in Table 1 the performance of all mean operator algorithms, measured in the transductive setting, training with cross-validation. Although none of the distances used in our experiments with LMM leads to reasonable accuracy on the toy dataset, AMM_max initialised with any starting point learns in one step a model which perfectly classifies all the instances. We also notice that EMM returns an optimal classifier by itself (not reported in Table 1).

[Table 1: AUC on the toy dataset of [1] for AMM_min, AMM_max, EMM, MM, LMM_G, LMM_{G,s} and LMM_nc; numeric entries lost in transcription.]

3.4 Additional Tests on alter-∝SVM [1]

In our experiments we observe that the AUC achieved by ∝SVM can be high, but it is also often below 0.5; in those cases the algorithm outputs models which are worse than random, and the average performance over the 5 test folds drops. We are able to reproduce the same behaviour on the heart dataset provided by the authors in a demo for alter-∝SVM; this also proves that our bag assignment for the LLP simulation does not introduce the issue. In a first test, we randomly select 3/4 of the dataset and randomly assign instances to 4 bags of fixed size 64, following [1]. We repeat the training split 50 times with C and C_p as in the demo, and we measure AUCs on the same training set. As expected, a consistent number of runs ends up producing AUC smaller than 0.5. We display in Figure 2 (a) the AUC's density profile, which shows a relevant mass around 0.5; notice also that the two distribution modes look symmetric around 0.5. In a second test we investigate further, measuring the pairs of training-set AUC and loss value obtained by the same execution of the algorithm. In this case, we run over all the parameter ranges defined in ∝SVM's paper, and do not pick the model that minimizes the loss over the 10 random runs, but record the losses of all of them. Figures 2 (b) and (c) show scatter plots relative to two chosen training-set splits. We observe that loss minimization can lead both to high and to low AUCs, with only few points close to 0.5. A possible explanation might be the inverted polarity of the learnt linear classifier; inverted polarity in this context means having a model which would achieve better performance classifying instance labels opposite to the ones it predicts (see the sketch after Figure 2). We conclude that optimizing ∝SVM's loss might in some cases be equivalent to training a max-margin separator of the unlabelled data, which exploits only weakly the information given by the label proportions. This would give a heuristic understanding of the frequent symmetrical behaviour of the AUC.

[Figure 2: alter-∝SVM: empirical distribution of training-set AUC (a), and relationship between loss and AUC on two different training splits (b), (c).]
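The polarity remark has a simple mechanical core: negating a model's scores turns an AUC of a into exactly 1 − a, which is consistent with the two modes of Figure 2 (a) sitting symmetrically around 0.5. A minimal R illustration (the scores are synthetic and the small auc helper is ours, not code from [1]):

    # AUC as P(score of a positive > score of a negative), ties counted 1/2
    auc <- function(score, y) {
      pos <- score[y == +1]
      neg <- score[y == -1]
      mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
    }
    set.seed(1)
    y <- rep(c(+1, -1), each = 200)
    s <- rnorm(400, mean = (y + 1) / 2)  # mildly informative scores
    a         <- auc(s, y)               # some value above 0.5
    a_flipped <- auc(-s, y)              # equals 1 - a: inverted polarity
    stopifnot(isTRUE(all.equal(a + a_flipped, 1)))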

3.5 Scalability

Figure 3 (a) shows the runtime of learning (including cross-validation) for MM and LMM with regard to the number of bags, which is the natural parameter of time complexity for our Laplacian-based methods. Despite the 3 layers of cross-validation of LMM_{G,s}, LMM_nc turns out to be the only method clearly not scalable. Figure 3 (b) presents how our one-shot algorithms scale on all the small domains as a function of problem size; runtime is averaged over the different bag assignments. The same plot is given in Figure 3 (c) for the iterative algorithms, in particular AMM_min and (alter/conv)-∝SVM. All curves are completed with measurements on the bigger domains when available. The runtime of the ∝SVMs is not directly comparable with that of our methods. This is due both (a) to the implementations being in different programming languages and (b) to the fact that the code provided implements kernel ∝SVM, even for linear kernels, which is a big overhead in computation and memory access. Nevertheless, the high growth rate of conv-∝SVM makes the algorithm not suitable for large datasets. Noticeably, even if alter-∝SVM does not show such behaviour, we are not able to run it on our bigger domains, since it requires approximately 10 hours to run on a training-set split with fixed parameters.

[Figure 3: Learning runtime of LMM as a function of the number of bags (a), and as a function of domain size (#instances × #features) for one-shot (b) and iterative (c) methods; curves for MM, LMM_G, LMM_{G,s}, LMM_nc in (a)-(b), and for the AMM variants, alter-∝SVM and conv-∝SVM in (c).]

3.6 Full Results on Small Domains

Finally, we report details of all the experiments run on the 10 small domains (Table 2).

[Table 2: Small domains size, with columns dataset, instances, features; datasets: arrhythmia, australian, breastw, colic, german, heart, ionosphere, vertebral column, vote, wine. Numeric size columns lost in transcription.]

In the following tables, columns show the number of bags generated through K-MEANS (2, 4, 8, 16 and 32 bags; a sketch of this bag-assignment procedure is given below). Each cell contains the average AUC over the 5 test splits and its standard deviation; runtime in seconds is in the separate column. The best performing algorithm, and the ones not worse than 0.01 AUC from it, are bold faced. Comparisons are made within the respective top/bottom sub-tables, which group one-shot and iterative algorithms. We also highlight the runs which achieve average AUC greater than or equal to the Oracle's.
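A sketch of ours of that bag-assignment step (the helper name is hypothetical and the actual experiment code may differ): cluster the features with k-means, use cluster membership as bags, and keep the labels only through each bag's label proportion:

    # Simulate LLP supervision on a labelled dataset (X: numeric matrix, y in {-1,+1}):
    # bags are k-means clusters of the features; labels survive only as proportions.
    make_llp_bags <- function(X, y, n_bags) {
      bag <- kmeans(X, centers = n_bags, nstart = 10)$cluster
      list(bag = bag,
           proportions = tapply((y + 1) / 2, bag, mean))  # fraction of positives per bag
    }
    # e.g., for the 8-bag column of the tables: b <- make_llp_bags(X, y, n_bags = 8)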

[Table 3: arrhythmia. Average AUC with standard deviation and runtime (s) for 2, 4, 8, 16 and 32 bags; one-shot sub-table: EMM, MM, LMM_G, LMM_{G,s}, LMM_nc, InvCal; iterative sub-tables (AMM_min and AMM_max blocks): AMM_EMM, AMM_MM, AMM_G, AMM_{G,s}, AMM_nc, AMM, AMM_10ran, then alter-∝SVM, conv-∝SVM and the Oracle. Numeric entries lost in transcription.]

[Table 4: australian. Same layout as Table 3; numeric entries lost in transcription.]

[Table 5: breastw. Same layout as Table 3; numeric entries lost in transcription.]

[Table 6: colic. Same layout as Table 3; numeric entries lost in transcription.]

[Table 7: german. Same layout as Table 3; numeric entries lost in transcription.]

[Table 8: heart. Same layout as Table 3; numeric entries lost in transcription.]

[Table 9: ionosphere. Same layout as Table 3; numeric entries lost in transcription.]

[Table 10: vertebral column. Same layout as Table 3; numeric entries lost in transcription.]

[Table 11: vote (the feature physician-fee-freeze was removed to make the problem harder). Same layout as Table 3; numeric entries lost in transcription.]

[Table 12: wine. Same layout as Table 3; numeric entries lost in transcription.]

[Figure 4: Relative AUC (w.r.t. the Oracle) on arrhythmia; panels: (a) MM and the LMM variants, (b) the AMM variants, (c) alter-∝SVM, conv-∝SVM and InvCal.]

[Figure 5: Relative AUC (w.r.t. the Oracle) on australian; panels as in Figure 4.]

[Figure 6: Relative AUC (w.r.t. the Oracle) on breastw; panels as in Figure 4.]

[Figure 7: Relative AUC (w.r.t. the Oracle) on colic; panels as in Figure 4.]

[Figure 8: Relative AUC (w.r.t. the Oracle) on german; panels as in Figure 4.]

[Figure 9: Relative AUC (w.r.t. the Oracle) on heart; panels as in Figure 4.]

[Figure 10: Relative AUC (w.r.t. the Oracle) on ionosphere; panels as in Figure 4.]

[Figure 11: Relative AUC (w.r.t. the Oracle) on vertebral column; panels as in Figure 4.]

[Figure 12: Relative AUC (w.r.t. the Oracle) on vote; panels as in Figure 4.]

[Figure 13: Relative AUC (w.r.t. the Oracle) on wine; panels as in Figure 4.]

References

[1] F. X. Yu, D. Liu, S. Kumar, T. Jebara, and S. F. Chang. ∝SVM for Learning with Label Proportions. In 30th ICML, pages 504-512, 2013.

[2] R. Nock and F. Nielsen. Bregman divergences and surrogates for learning. IEEE Trans. PAMI, 31, 2009.

[3] A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. on Information Theory, 51, 2005.

[4] N. Quadrianto, A. J. Smola, T. S. Caetano, and Q. V. Le. Estimating labels from label proportions. JMLR, 10, 2009.

[5] Y. Altun and A. J. Smola. Unifying divergence minimization and statistical inference via convex duality. In 19th COLT, pages 139-153, 2006.

[6] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. JMLR, 3:463-482, 2002.

[7] C. McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed, editors, Probabilistic Methods for Algorithmic Discrete Mathematics. Springer Verlag, 1998.

[8] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer Verlag, 1991.
