(Almost) No Label No Cry Gorgo Patrn,, Rchard Nock,, Paul Rvera,, Tbero Caetano,3,4 Australan Natonal Unversty, NICTA, Unversty of New South Wales 3, Ambata 4 Sydney, NSW, Australa {namesurname}@anueduau Abstract In Learnng wth Label Proportons (LLP), the objectve s to learn a supervsed classfer when, nstead of labels, only label proportons for bags of observatons are known Ths settng has broad practcal relevance, n partcular for prvacy preservng data processng We frst show that the mean operator, a statstc whch aggregates all labels, s mnmally suffcent for the mnmzaton of many proper scorng losses wth lnear (or kernelzed) classfers wthout usng labels We provde a fast learnng algorthm that estmates the mean operator va a manfold regularzer wth guaranteed approxmaton bounds Then, we present an teratve learnng algorthm that uses ths as ntalzaton We ground ths algorthm n Rademacher-style generalzaton bounds that ft the LLP settng, ntroducng a generalzaton of Rademacher complexty and a Label Proporton Complexty measure Ths latter algorthm optmzes tractable bounds for the correspondng bag-emprcal rsk Experments are provded on fourteen domans, whose sze ranges up to 300K observatons They dsplay that our algorthms are scalable and tend to consstently outperform the state of the art n LLP Moreover, n many cases, our algorthms compete wth or are just percents of AUC away from the Oracle that learns knowng all labels On the largest domans, half a dozen proportons can suffce, e roughly 40K tmes less than the total number of labels Introducton Machne learnng has recently experenced a prolferaton of problem settngs that, to some extent, enrch the classcal dchotomy between supervsed and unsupervsed learnng Cases as multple nstance labels, nosy labels, partal labels as well as sem-supervsed learnng have been studed motvated by applcatons where fully supervsed learnng s no longer realstc In the present work, we are nterested n learnng a bnary classfer from nformaton provded at the level of groups of nstances, called bags The type of nformaton we assume avalable s the label proportons per bag, ndcatng the fracton of postve bnary labels of ts nstances Inspred by [], we refer to ths framework as Learnng wth Label Proportons (LLP) Settngs that perform a bag-wse aggregaton of labels nclude Multple Instance Learnng (MIL) [] In MIL, the aggregaton s logcal rather than statstcal: each bag s provded wth a bnary label expressng an OR condton on all the labels contaned n the bag More general settng also exst [3] [4] [5] Many practcal scenaros ft the LLP abstracton (a) Only aggregated labels can be obtaned due to the physcal lmts of measurement tools [6] [7] [8] [9] (b) The problem s sem- or unsupervsed but doman experts have knowledge about the unlabelled samples n form of expectaton, as pseudomeasurement [5] (c) Labels exsted once but they are now gven n an aggregated fashon for prvacy-preservng reasons, as n medcal databases [0], fraud detecton [], house prce market, electon results, census data, etc (d) Ths settng also arses n computer vson [] [3] [4] Related work The settng was frst ntroduced by [], where a prncpled herarchcal model generates labels consstent wth the proportons and s traned through MCMC Subsequently, [9] and ts follower [6] offer a varety of standard learnng algorthms desgned to generate self-consstent
labels [5] gves a Bayesan nterpretaton of LLP where the key dstrbuton s estmated through an RBM Other deas rely on structural learnng of Bayesan networks wth mssng data [7], and on K- MEANS clusterng to solve prelmnary label assgnment [3] [8] Recent SVM mplementatons [] [6] outperform most of the other known methods Theoretcal works on LLP belong to two man categores The frst contans unform convergence results, for the estmators of label proportons [], or the estmator of the mean operator [7] The second contans approxmaton results for the classfer [7] Our work bulds upon ther Mean Map algorthm, that reles on the trck that the logstc loss may be splt n two, a convex part dependng only on the observatons, and a lnear part nvolvng a suffcent statstc for the label, the mean operator Beng able to estmate the mean operator means beng able to ft a classfer wthout usng labels In [7], ths estmaton reles on a restrctve homogenety assumpton that the class-condtonal estmaton of features does not depend on the bags Experments dsplay the lmts of ths assumpton [][6] Contrbutons In ths paper we consder lnear classfers, but our results hold for kernelzed formulatons followng [7] We frst show that the trck about the logstc loss can be generalzed, and the mean operator s actually mnmally suffcent for a wde set of symmetrc proper scorng losses wth no class-dependent msclassfcaton cost, that encompass the logstc, square and Matsushta losses [8] We then provde an algorthm, LMM, whch estmates the mean operator va a Laplacan-based manfold regularzer wthout callng to the homogenety assumpton We show that under a weak dstngushablty assumpton between bags, our estmaton of the mean operator s all the better as the observatons norm ncrease Ths, as we show, cannot hold for the Mean Map estmator Then, we provde a data-dependent approxmaton bound for our classfer wth respect to the optmal classfer, that s shown to be better than prevous bounds [7] We also show that the manfold regularzer s soluton s tghtly related to the lnear separablty of the bags We then provde an teratve algorthm, AMM, that takes as nput the soluton of LMM and optmzes t further over the set of consstent labelngs We ground the algorthm n a unform convergence result nvolvng a generalzaton of Rademacher complextes for the LLP settng The bound nvolves a bag-emprcal surrogate rsk for whch we show that AMM optmzes tractable bounds All our theoretcal results hold for any symmetrc proper scorng loss Experments are provded on fourteen domans, rangng from hundreds to hundreds of thousands of examples, comparng AMM and LMM to ther contenders: Mean Map, InvCal [] and SVM [6] They dsplay that AMM and LMM outperform ther contenders, and sometmes even compete wth the fully supervsed learner whle requrng few proportons only Tests on the largest domans dsplay the scalablty of both algorthms Such expermental evdence serously questons the safety of prvacy-preservng summarzaton of data, whenever accurate aggregates and nformatve ndvdual features are avalable Secton () presents our algorthms and related theoretcal results Secton (3) presents experments Secton (4) concludes A Supplementary Materal [9] ncludes proofs and addtonal experments LLP and the mean operator: theoretcal results and algorthms Learnng settng Hereafter, boldfaces lke p denote vectors, whose coordnates are denoted p l for l,, For any m N, let [m] {,,, m} Let Σ m {σ {, } m } and X R d Examples are couples (observaton, label) X Σ, sampled d accordng to some unknown but fxed dstrbuton D Let S {(x, y ), 
[m]} D m denote a sze-m sample In Learnng wth Label Proportons (LLP), we do not observe drectly S but S y, whch denotes S wth labels removed; we are gven ts partton n n > 0 bags, S y j S j, j [n], along wth ther respectve label proportons ˆπ j ˆP[y + S j ] and bag proportons ˆp j m j /m wth m j card(s j ) (Ths generalzes to a cover of S, by copyng examples among bags) The bag assgnment functon that parttons S s unknown but fxed In real world domans, t would rather be known, eg state, gender, age band A classfer s a functon h : X R, from a set of classfers H H L denotes the set of lnear classfers, noted h θ (x) θ x wth θ X A (surrogate) loss s a functon F : R R + We let F (S, h) (/m) F (y h(x )) denote the emprcal surrogate rsk on S correspondng to loss F For the sake of clarty, ndexes, j and k respectvely refer to examples, bags and features The mean operator and ts mnmal suffcency µ S m We defne the (emprcal) mean operator as: y x ()
Algorthm Laplacan Mean Map (LMM) Input S j, ˆπ j, j [n]; γ > 0 (7); w (7); V (8); permssble φ (); λ > 0; Step : let B± arg mn X R n d l(l, X) usng (7) (Lemma ) Step : let µ S j ˆp j(ˆπ j b+ j ( ˆπ j) b j ) Step 3 : let θ arg mn θ F φ (S y, θ, µ S ) + λ θ (3) Return θ Table : Correspondence between permssble functons φ and the correspondng loss F φ loss name F φ (x) φ(x) logstc loss log( + exp( x)) x log x ( x) log( x) square loss ( x) x( x) Matsushta loss x + + x x( x) The estmaton of the mean operator µ S appears to be a learnng bottleneck n the LLP settng [7] The fact that the mean operator s suffcent to learn a classfer wthout the label nformaton motvates the noton of mnmal suffcent statstc for features n ths context Let F be a set of loss functons, H be a set of classfers, I be a subset of features Some quantty t(s) s sad to be a mnmal suffcent statstc for I wth respect to F and H ff: for any F F, any h H and any two samples S and S, the quantty F (S, h) F (S, h) does not depend on I ff t(s) t(s ) Ths defnton can be motvated from the one n statstcs by buldng losses from log lkelhoods The followng Lemma motvates further the mean operator n the LLP settng, as t s the mnmal suffcent statstc for a broad set of proper scorng losses that encompass the logstc and square losses [8] The proper scorng losses we consder, hereafter called symmetrc (SPSL), are twce dfferentable, non-negatve and such that msclassfcaton cost s not label-dependent Lemma µ S s a mnmal suffcent statstc for the label varable, wth respect to SPSL and H L ([9], Subsecton ) Ths property, very useful for LLP, may also be exploted n other weakly supervsed tasks [] Up to constant scalngs that play no role n ts mnmzaton, the emprcal surrogate rsk correspondng to any SPSL, F φ (S, h), can be wrtten wth loss: F φ (x) φ(0) + φ ( x) a φ + φ ( x), () φ(0) φ(/) b φ and φ s a permssble functon [0, 8], e dom(φ) [0, ], φ s strctly convex, dfferentable and symmetrc wth respect to / φ s the convex conjugate of φ Table shows examples of F φ It follows from Lemma and ts proof, that any F φ (Sθ), can be wrtten for any θ h θ H L as: ( ) F φ (S, θ) b φ F φ (σθ x ) m θ µ S F φ (S y, θ, µ S ), (3) where σ Σ σ The Laplacan Mean Map (LMM) algorthm The sum n eq (3) s convex and dfferentable n θ Hence, once we have an accurate estmator of µ S, we can then easly ft θ to mnmze F φ (S y, θ, µ S ) Ths two-steps strategy s mplemented n LMM n algorthm µ S can be retreved from n bag-wse, label-wse unknown averages b σ j : n µ S (/) ˆp j j σ Σ (ˆπ j + σ( σ))b σ j, (4) wth b σ j E S [x σ, j] denotng these n unknowns (for j [n], σ Σ ), and let b j (/m j ) x S j x The n b σ j s are soluton of a set of n denttes that are (n matrx form): B Π B ± 0, (5) 3
where B [b b b n ] R n d, Π [DIAG(ˆπ) DIAG( ˆπ)] R n n and B ± R n d s the matrx of unknowns: [ ] B ± b + b + b + n b - b - b - n (6) } {{ } } {{ } (B + ) (B ) System (5) s underdetermned, unless one makes the homogenety assumpton that yelds the Mean Map estmator [7] Rather than makng such a restrctve assumpton, we regularze the cost that brngs (5) wth a manfold regularzer [], and search for B± arg mn X R n d l(l, X), wth: l(l, X) tr ( (B X Π)D w (B Π X) ) + γtr ( X ) LX, (7) and γ > 0 D w DIAG(w) s a user-fxed bas matrx wth w R n +, (and w ˆp n general) and: [ ] La 0 L εi + R 0 n n, (8) L a where L a D V R n n s the Laplacan of the bag smlartes V s a symmetrc smlarty matrx wth non negatve coordnates, and the dagonal matrx D satsfes d jj j v jj, j [n] The sze of the Laplacan s O(n ), whch s small compared to O(m ) f there are not many bags One can nterpret the Laplacan regularzaton as smoothng the estmates of b σ j wrt the smlarty of the respectve bags Lemma The soluton B± to mn X R n d l(l, X) s B± ( ΠD w Π + γl ) ΠDw B ([9], Subsecton ) Ths Lemma explans the role of penalty εi n (8) as ΠD w Π and L have respectvely n- and ( )-dm null spaces, so the nverson may not be possble Even when ths does not happen exactly, ths may ncur numercal nstabltes n computng the nverse For domans where ths rsk exsts, pckng a small ε > 0 solves the problem Let b σ j denote the row-wse decomposton of B± followng (6), from whch we compute µ S followng (4) when we use these n estmates n leu of the true b σ j We compare µ j ˆπ j b + j ( ˆπ j)b j, j [n] to our estmates µ j ˆπ j b+ j ( ˆπ j) b j, j [n], granted that µ S j ˆp jµ j and µ S j ˆp j µ j Theorem 3 Suppose that γ satsfes γ ((ε(n) ) + max j j v jj )/ mn j w j Let M [µ µ µ n ] R n d, M [ µ µ µ n ] R n d and ς(v, B ± ) ((ε(n) ) + max j j v jj ) B ± F The followng holds: M M F ( ) n mn wj ς(v, B ± ) (9) j ([9], Subsecton 3) The multplcatve factor to ς n (9) s roughly O(n 5/ ) when there s no large dscrepancy n the bas matrx D w, so the upperbound s drven by ς(, ) when there are not many bags We have studed ts varatons when the dstngushablty between bags ncreases Ths settng s nterestng because n ths case we may kll two brds n one shot, wth the estmaton of M and the subsequent learnng problem potentally easer, n partcular for lnear separators We consder two examples for v jj, the frst beng (half) the normalzed assocaton []: v nc jj ( ASSOC(Sj, S j ) ASSOC(S j, S j S j ) + ASSOC(S j, S j ) ASSOC(S j, S j S j ) ) NASSOC(S j, S j ), (0) v G,s jj exp( b j b j /s), s > 0 () Here, ASSOC(S j, S j ) x S j,x S x x j [] To put these two smlarty measures n the context of Theorem 3, consder the settng where we can make assumpton (D) that there exsts a small constant κ > 0 such that b j b j κ max σ,j b σ j, j, j [n] Ths s a weak dstngushablty property as f no such κ exsts, then the centers of dstnct bags may just be confounded Consder also the addtonal assumpton, (D), that there exsts κ > 0 such that max j d j κ, j [n], where d j max x,x x Sj x s a bag s dameter In the followng Lemma, the lttle-oh notaton s wth respect to the largest unknown n eq (4), e max σ,j b σ j 4
Algorthm Alternatng Mean Map (AMM OPT ) Input LMM parameters + optmzaton strategy OPT {mn, max} + convergence predcate PR Step : let θ 0 LMM(LMM parameters) and t 0 Step : repeat Step : let σ t arg OPT σ Σ ˆπ F φ (S y, θ t, µ S (σ)) Step : let θ t+ arg mn θ F φ (S y, θ, µ S (σ t )) + λ θ Step 3 : let t t + untl predcate PR s true Return θ arg mn t F φ (S y, θ t+, µ S (σ t )) Lemma 4 There exsts ε > 0 such that ε ε, the followng holds: () ς(v nc, B ± ) o() under assumptons (D + D); () ς(v G,s, B ± ) o() under assumpton (D), s > 0 ([9], Subsecton 4) Hence, provded a weak (D) or stronger (D+D) dstngushablty assumpton holds, the dvergence between M and M gets smaller wth the ncrease of the norm of the unknowns b σ j The proof of the Lemma suggests that the convergence may be faster for VG,s The followng Lemma shows that both smlartes also partally encode the hardness of solvng the classfcaton problem wth lnear separators, so that the manfold regularzer lmts the dstorton of the b ± s between two bags that tend not to be lnearly separable Lemma 5 Take v jj {v G, jj, vnc jj } There exsts 0 < κ l < κ n < such that () f v jj > κ n then S j, S j are not lnearly separable, and f v jj < κ l then S j, S j are lnearly separable ([9], Subsecton 5) Ths Lemma s an advocacy to ft s n a data-dependent way n v G,s jj The queston may be rased as to whether fnte samples approxmaton results lke Theorem 3 can be proven for the Mean Map estmator [7] [9], Subsecton 6 answers by the negatve In the Laplacan Mean Map algorthm (LMM, Algorthm ), Steps and have now been descrbed Step 3 s a dfferentable convex mnmzaton problem for θ that does not use the labels, so t does not present any techncal dffculty An nterestng queston s how much our classfer θ n Step 3 dverges from the one that would be computed wth the true expresson for µ S, θ It s not hard to show that Lemma 7 n Altun and Smola [3], and Corollary 9 n Quadranto et al [7] hold for LMM so that θ θ (λ) µ S µ S The followng Theorem shows a data-dependent approxmaton bound that can be sgnfcantly better, when t holds that θ x, θ x φ ([0, ]), (φ s the frst dervatve) We call ths settng proper scorng complance (PSC) [8] PSC always holds for the logstc and Matsushta losses for whch φ ([0, ]) R For other losses lke the square loss for whch φ ([0, ]) [, ], shrnkng the observatons n a ball of suffcently small radus s suffcent to ensure ths Theorem 6 Let f k R m denote the vector encodng the k th feature varable n S : f k x k (k [d]) Let F denote the feature matrx wth column-wse normalzed feature vectors: fk (d/ k f k ) (d )/(d) f k Under PSC, we have θ θ (λ + q) µ S µ S, wth: q det F F m e b φ φ (φ (q /λ)) (> 0), () for some q I [±(x + max{ µ S, µ S })] Here, x max x and φ (φ ) ([9], Subsecton 7) To see how large q can be, consder the smple case where all egenvalues of F F, λk ( F F) [λ ± δ] for small δ In ths case, q s proportonal to the average feature norm : det F F tr ( ) F F + o(δ) x + o(δ) m md md 5
The Alternatng Mean Map (AMM) algorthm Let us denote Σˆπ {σ Σ m : :x S j σ (ˆπ j )m j, j [n]} the set of labelngs that are consstent wth the observed proportons ˆπ, and µ S (σ) (/m) σ x the based mean operator computed from some σ Σˆπ Notce that the true mean operator µ S µ S (σ) for at least one σ Σˆπ The Alternatng Mean Map algorthm, (AMM, Algorthm ), starts wth the output of LMM and then optmzes t further over the set of consstent labelngs At each teraton, t frst pcks a consstent labelng n Σˆπ that s the best (OPT mn) or the worst (OPT max) for the current classfer (Step ) and then fts a classfer θ on the gven set of labels (Step ) The algorthm then terates untl a convergence predcate s met, whch tests whether the dfference between two values for F φ (,, ) s too small (AMM mn ), or the number of teratons exceeds a user-specfed lmt (AMM max ) The classfer returned θ s the best n the sequence In the case of AMM mn, t s the last of the sequence as rsk F φ (S y,, ) cannot ncrease Agan, Step s a convex mnmzaton wth no techncal dffculty Step s combnatoral It can be solved n tme almost lnear n m [9] (Subsecton 8) Lemma 7 The runnng tme of Step n AMM s Õ(m), where the tlde notaton hdes log-terms Bag-Rademacher generalzaton bounds for LLP We relate the mn and max strateges of AMM by unform convergence bounds nvolvng the true surrogate rsk, e ntegratng the unknown dstrbuton D and the true labels (whch we may never know) Prevous unform convergence bounds for LLP focus on coarser graned problems, lke the estmaton of label proportons [] We rely on a LLP generalzaton of Rademacher complexty [4, 5] Let F : R R + be a loss functon and H a set of classfers The bag emprcal Rademacher complexty of sample S, Rm, b s defned as Rm b E σ Σm sup h H {E σ Σ ˆπ E S [σ(x)f (σ (x)h(x))] The usual emprcal Rademacher complexty equals Rm b for card(σˆπ ) The Label Proporton Complexty of H s: L m E Dm E I /,I / sup E S [σ (x)(ˆπ s (x) ˆπl (x))h(x)] (3) h H Here, each of I / l, l, s a random (unformly) subset of [m] of cardnal m Let S(I/ l ) be the sze-m subset of S that corresponds to the ndexes Take l, and any x S If I / l then ˆπ l s (x ) ˆπ l l (x ) s x s bag s label proporton measured on S\S(I / l ) Else, ˆπs (x ) s ts bag s label proporton measured on S(I / ) and ˆπl (x ) s ts label (e a bag s label proporton that would contan only x ) Fnally, σ (x) x S(I / ) Σ L m tends to be all the smaller as classfers n H have small magntude on bags whose label proporton s close to / Theorem 8 Suppose h 0 st h(x) h, x, h Then, for any loss F φ, any tranng sample of sze m and any 0 < δ, wth probablty > δ, the followng bound holds over all h H: ( ) E D [F φ (yh(x))] E Σ ˆπ E S [F φ (σ(x)h(x))] + Rm b h + L m + 4 + b φ m log δ (4) Furthermore, under PSC (Theorem 6), we have for any F φ : Rm b b φ E Σm sup {E S [σ(x)(ˆπ(x) (/))h(x)]} (5) h H ([9], Subsecton 9) Despte smlar shapes (3) (5), R b m and L m behave dfferently: when bags are pure (ˆπ j {0, }, j), L m 0 When bags are mpure (ˆπ j /, j), R b m 0 As bags get mpure, the bag-emprcal surrogate rsk, E Σ ˆπ E S [F φ (σ(x)h(x))], also tends to ncrease AMM mn and AMM max respectvely mnmze a lowerbound and an upperbound of ths rsk 3 Experments Algorthms We compare LMM, AMM (F φ logstc loss) to the orgnal MM [7], InvCal [], conv- SVM and alter- SVM [6] (lnear kernels) To make experments extensve, we test several ntalzatons for AMM that are not dsplayed n Algorthm (Step ): () the edge mean map estmator, µ S EMM /m ( y )( x ) (AMM EMM ), () the constant estmator µ S (AMM ), and fnally 
AMM 0ran whch runs 0 random ntal models ( θ 0 ), and selects the one wth smallest rsk; 6
AUC rel to MM 3 0 MM LMM G LMM G,s LMM nc 4 6 dvergence (a) AUC rel to Oracle 0 09 08 07 06 MM LMM G LMM G,s LMM nc 06 08 0 (b) AUC rel to Oracle 0 09 08 07 06 AMM MM AMM G AMM G,s AMM nc AMM 0ran 06 08 0 (c) AUC 0 08 06 04 0 Oracle AMM G Bgger domans Small domans 0^ 5 0^ 3 0^ #bags/#nstance (d) Fgure : Relatve AUC (wrt MM) as homogenety assumpton s volated (a) Relatve AUC (wrt Oracle) vs on heart for LMM(b), AMM mn (c) AUC vs n/m for AMM mn G and the Oracle (d) Table : Small domans results #wn/#lose for row vs column Bold faces means p-val < 00 for Wlcoxon sgned-rank tests Top-left subtable s for one-shot methods, bottom-rght teratve ones, bottom-left compare the two Italc s state-of-the-art Grey cells hghlght the best of all (AMM mn G ) LMM algorthm MM LMM InvCal AMM mn AMM max conv- G G,s nc MM G G,s 0ran MM G G,s 0ran SVM AMM mn AMM max SVM G 36/4 G,s 38/3 30/6 nc 8/ 3/37 /37 InvCal 4/46 3/47 4/46 4/46 MM 33/6 6/4 5/5 3/8 46/4 G 38/ 35/4 30/0 37/3 47/3 3/7 G,s 35/4 33/7 30/0 35/5 47/3 4/ 7/5 eg AMM mn G,s wns on AMMmn G 7 tmes, loses 5, wth 8 tes 0ran 7/ 4/6 /8 6/4 44/6 0/30 6/34 9/3 MM 5/5 3/7 /8 5/5 45/5 5/35 3/37 3/37 8/4 G 7/3 /8 /8 6/4 45/5 7/33 4/36 4/36 0/40 3/4 G,s 5/5 /9 /8 4/6 45/5 5/35 3/37 3/37 /38 5/ 6/ 0ran 3/7 /9 9/3 4/6 50/0 9/3 5/35 7/33 7/43 9/30 0/9 7/3 conv- /9 /48 /48 /48 /48 4/46 3/47 3/47 4/46 3/47 3/47 4/46 0/50 alter- 0/50 0/50 0/50 0/50 0/30 0/50 0/50 0/50 3/47 3/47 /48 /49 0/50 7/3 ths s the same procedure of alter- SVM Matrx V (eqs (0), ()) used s ndcated n subscrpt: LMM/AMM G, LMM/AMM G,s, LMM/AMM nc respectvely denote v G,s wth s, v G,s wth s learned on cross valdaton (CV; valdaton ranges ndcated n [9]) and v nc For space reasons, results not dsplayed n the paper can be found n [9], Secton 3 (ncludng runtme comparsons, and detaled results by doman) We splt the algorthms n two groups, one-shot and teratve The latter, ncludng AMM, (conv/alter)- SVM, teratvely optmze a cost over labelngs (always consstent wth label proportons for AMM, not always for (conv/alter)- SVM) The former (LMM, InvCal) do not and are thus much faster Tests are done on a 4-core 3GHz CPUs Mac wth 3GB of RAM AMM/LMM/MM are mplemented n R Code for InvCal and SVM s [6] Smulated domans, MM and the homogenety assumpton The testng metrc s the AUC Pror to testng on our domans, we generate 6 domans that gradually move away the b σ j away from each other (wrt j), thus volatng ncreasngly the homogenety assumpton [7] The degree of volaton s measured as B ± B ± F, where B ± s the homogenety assumpton matrx, that replaces all b σ j by b σ for σ {, }, see eq (5) Fgure (a) dsplays the ratos of the AUC of LMM to the AUC of MM It shows that LMM s all the better wth respect to MM as the homogenety assumpton s volated Furthermore, learnng s n LMM mproves the results Experments on the smulated doman of [6] on whch MM obtans zero accuracy also dsplay that our algorthms perform better ( teraton only of AMM max brngs 00% AUC) Small and large domans experments We convert 0 small domans [9] (m 000) and 4 bgger ones (m > 8000) from UCI[6] nto the LLP framework We cast to one-aganst-all classfcaton when the problem s multclass On large domans, the bag assgnment functon s nspred by []: we craft bags accordng to a selected feature value, and then we remove that feature from the data Ths conforms to the dea that bag assgnment s structured and non random n real-world problems Most of our small domans, however, do not have a lot of features, so nstead of clusterng on one feature and then dscard t, we run K-MEANS on the 
whole data to make the bags, for K n [5] Small domans results We perform 5-folds nested CV comparsons on the 0 domans 50 AUC values for each algorthm Table synthesses the results [9], splttng one-shot and teratve algo- 7
Table 3: AUCs on bg domans (name: #nstances #features) Icap-shape, IIhabtat, IIIcap-colour, IVrace, Veducaton, VIcountry, VIIpoutcome, VIIIjob (number of bags); for each feature, the best result over one-shot, and over teratve algorthms s bold faced AMM mn AMM max algorthm mushroom: 84 08 adult: 4884 89 marketng: 45 4 census: 9985 38 I(6) II(7) III(0) IV(5) V(6) VI(4) V(4) VII(4) VIII() IV(5) VIII(9) VI(4) EMM 556 5980 7668 439 4750 666 6349 5450 443 5605 565 5787 MM 599 9879 50 8093 7665 740 5464 507 4970 75 9037 755 LMM G 739 9857 470 879 7840 7878 5466 500 593 7580 775 763 LMM G,s 949 984 8943 8489 7894 80 497 500 658 8488 607 6974 AMMEMM 85 9945 6943 4997 5698 709 639 5573 430 8786 877 4080 AMMMM 898 990 574 8373 7739 8067 585 757 589 8968 849 6836 AMM G 898 9945 5044 834 855 896 56 756 575 876 888 7699 AMM G,s 894 9957 38 88 7853 896 503 756 5398 8993 8354 53 AMM 9590 9849 973 83 7580 8005 653 6496 666 8909 8894 567 AMMEMM 9304 33 667 5446 6963 566 548 5563 5748 70 774 667 AMMMM 5945 556 9970 857 763 839 4846 534 5690 5075 6676 5867 AMM G 9550 653 9930 875 76 839 5058 477 349 483 6754 7746 AMM G,s 9584 653 846 869 7095 839 6688 477 349 8033 7445 570 AMM 950 7348 9 75 675 7767 6670 66 794 5797 807 534 Oracle 998 998 998 9055 9055 9050 795 7555 7943 943 9437 9445 rthms LMM G,s outperforms all one-shot algorthms LMM G and LMM G,s are compettve wth many teratve algorthms, but lose aganst ther AMM counterpart, whch proves that addtonal optmzaton over labels s benefcal AMM G and AMM G,s are confrmed as the best varant of AMM, the frst beng the best n ths case Surprsngly, all mean map algorthms, even one-shots, are clearly superor to SVMs Further results [9] reveal that SVM performances are dampened by learnng classfers wth the nverted polarty e flppng the sgn of the classfer mproves ts performances Fgure (b, c) presents the AUC relatve to the Oracle (whch learns the classfer knowng all labels and mnmzng the logstc loss), as a functon of the Gn of bag assgnment, gn(s) 4E j [ˆπ j ( ˆπ j )] For an close to, we were expectng a drop n performances The unexpected [9] s that on some domans, large entropes ( 8) do not prevent AMM mn to compete wth the Oracle No such pattern clearly emerges for SVM and AMM max [9] Bg domans results We adopt a /5 hold-out method Scalablty results [9] dsplay that every method usng v nc and SVM are not scalable to bg domans; n partcular, the estmated tme for a sngle run of alter- SVM s >00 hours on the adult doman Table 3 presents the results on the bg domans, dstngushng the feature used for bag assgnment Bg domans confrm the effcency of LMM+AMM No approach clearly outperforms the rest, although LMM G,s s often the best one-shot Synthess Fgure (d) gves the AUCs of AMM mn G over the Oracle for all domans [9], as a functon of the degree of supervson, n/m ( f the problem s fully supervsed) Notceably, on 90% of the runs, AMM mn G gets an AUC representng at least 70% of the Oracle s Results on bg domans can be remarkable: on the census doman wth bag assgnment on race, 5 proportons are suffcent for an AUC 5 ponts below the Oracle s whch learns wth 00K labels 4 Concluson In ths paper, we have shown that effcent learnng n the LLP settng s possble, for general loss functons, va the mean operator and wthout resortng to the homogenety assumpton Through ts estmaton, the suffcency allows one to resort to standard learnng procedures for bnary classfcaton, practcally mplementng a reducton between machne learnng problems [7]; hence the mean operator estmaton may be a vable shortcut to 
tackle other weakly supervsed settngs [] [3] [4] [5] Approxmaton results and generalzaton bounds are provded Experments dsplay results that are superor to the state of the art, wth algorthms that scale to bg domans at affordable computatonal costs Performances sometmes compete wth the Oracle s that learns knowng all labels, even on bg domans Such expermental fndng poses severe mplcatons on the relablty of prvacy-preservng aggregaton technques wth smple group statstcs lke proportons Acknowledgments NICTA s funded by the Australan Government through the Department of Communcatons and the Australan Research Councl through the ICT Centre of Excellence Program G Patrn acknowledges that part of the research was conducted at the Commonwealth Bank of Australa We thank A Menon, D García-García, N de Fretas for nvaluable feedback, and FYu for help wth the code 8
References [] F X Yu, S Kumar, T Jebara, and S F Chang On learnng wth label proportons CoRR, abs/40590, 04 [] T G Detterch, R H Lathrop, and T Lozano-Pérez Solvng the multple nstance problem wth axsparallel rectangles Artfcal Intellgence, 89:3 7, 997 [3] G S Mann and A McCallum Generalzed expectaton crtera for sem-supervsed learnng of condtonal random felds In 46 th ACL, 008 [4] J Graça, K Ganchev, and B Taskar Expectaton maxmzaton and posteror constrants In NIPS*0, pages 569 576, 007 [5] P Lang, M I Jordan, and D Klen Learnng from measurements n exponental famles In 6 th ICML, pages 64 648, 009 [6] D J Muscant, J M Chrstensen, and J F Olson Supervsed learnng by tranng on aggregate outputs In 7 th ICDM, pages 5 6, 007 [7] J Hernández-González, I Inza, and J A Lozano Learnng bayesan network classfers from label proportons Pattern Recognton, 46():345 3440, 03 [8] M Stolpe and K Mork Learnng from label proportons by optmzng cluster model selecton In 5 th ECMLPKDD, pages 349 364, 0 [9] B C Chen, L Chen, R Ramakrshnan, and D R Muscant Learnng from aggregate vews In th ICDE, pages 3 3, 006 [0] J Wojtusak, K Irvn, A Brerdnc, and A V Baranova Usng publshed medcal results and nonhomogenous data n rule learnng In 0 th ICMLA, pages 84 89, 0 [] S Rüpng Svm classfer estmaton from group probabltes In 7 th ICML, pages 9 98, 00 [] H Kueck and N de Fretas Learnng about ndvduals from group statstcs In th UAI, pages 33 339, 005 [3] S Chen, B Lu, M Qan, and C Zhang Kernel k-means based framework for aggregate outputs classfcaton In 9 th ICDMW, pages 356 36, 009 [4] K T La, F X Yu, M S Chen, and S F Chang Vdeo event detecton by nferrng temporal nstance labels In th CVPR, 04 [5] K Fan, H Zhang, S Yan, L Wang, W Zhang, and J Feng Learnng a generatve classfer from label proportons Neurocomputng, 39:47 55, 04 [6] F X Yu, D Lu, S Kumar, T Jebara, and S F Chang SVM for Learnng wth Label Proportons In 30 th ICML, pages 504 5, 03 [7] N Quadranto, A J Smola, T S Caetano, and Q V Le Estmatng labels from label proportons JMLR, 0:349 374, 009 [8] R Nock and F Nelsen Bregman dvergences and surrogates for learnng IEEE TransPAMI, 3:048 059, 009 [9] G Patrn, R Nock, P Rvera, and T S Caetano (Almost) no label no cry - supplementary materal In NIPS*7, 04 [0] M J Kearns and Y Mansour On the boostng ablty of top-down decson tree learnng algorthms In 8 th ACM STOC, pages 459 468, 996 [] M Belkn, P Nyog, and V Sndhwan Manfold regularzaton: A geometrc framework for learnng from labeled and unlabeled examples JMLR, 7:399 434, 006 [] J Sh and J Malk Normalzed cuts and mage segmentaton IEEE TransPAMI, :888 905, 000 [3] Y Altun and A J Smola Unfyng dvergence mnmzaton and statstcal nference va convex dualty In 9 th COLT, pages 39 53, 006 [4] P L Bartlett and S Mendelson Rademacher and gaussan complextes: Rsk bounds and structural results JMLR, 3:463 48, 00 [5] V Koltchnsk and D Panchenko Emprcal margn dstrbutons and boundng the generalzaton error of combned classfers Ann of Stat, 30: 50, 00 [6] K Bache and M Lchman UCI machne learnng repostory, 03 [7] A Beygelzmer, V Dan, T Hayes, J Langford, and B Zadrozny Error lmtng reductons between classfcaton tasks In th ICML, pages 49 56, 005 9
(Almost) No Label No Cry - Supplementary Materal Gorgo Patrn,, Rchard Nock,, Paul Rvera,, Tbero Caetano,3,4 Australan Natonal Unversty, NICTA, Unversty of New South Wales 3, Ambata 4 Sydney, NSW, Australa {namesurname}@anueduau Table of contents Supplementary materal on proofs Pg Proof of Lemma Pg Proof of Lemma Pg Proof of Theorem 3 Pg 3 Proof of Lemma 4 Pg 4 Proof of Lemma 5 Pg 6 Mean Map estmator s Lemma and Proof Pg 8 Proof of Theorem 6 Pg 9 Proof of Lemma 7 Pg 3 Proof of Theorem 8 Pg 3 Supplementary materal on experments Pg 7 Full Expermental Setup Pg 7 Smulated Doman for Volaton of Homogenety Assumpton Pg 8 Smulated Doman from [] Pg 8 Addtonal Tests on alter- SVM [] Pg 8 Scalablty Pg 9 Full Results on Small Domans Pg 9
Supplementary Materal on Proofs Proof of Lemma For any SPSL F (S, h), we can wrte t as ([], Lemma, [3]): F (S, h) F φ (S, h) D φ (y m φ (h(x ))), () where y ff y and 0 otherwse, φ s permssble and D φ s the Bregman dvergence wth generator φ [3] It also holds that: D φ (y φ (h(x ))) b φ F φ (yh(x)) wth: F φ (x) φ ( x) + φ(0) φ(0) φ(/) a φ + φ ( x), () b φ and φ s the convex conjugate of φ, e φ (x) xφ (x) φ(φ (x)) Furthermore, for any permssble φ, the conjex conjugate φ (x) verfes the property φ ( x) φ (x) x, (3) and so we get that: F (S, h) D φ (y m φ (h(x ))) b φ m b φ m b φ m b φ m b φ m b φ m F φ (y h(x )) ( F φ (y h(x )) + ) F φ (y h(x )) ( F φ (y h(x )) + ) F φ ( y h(x )) y h(x ) b φ F φ (yh(x )) y h(x ) m y {,+} ( ) F φ (σh(x )) h y x m σ {,+} σ {,+} F φ (σh(x )) h (µ S) (6) (4) holds because of (3), (5) holds because h s lnear So for any samples S and S wth respectve sze m and m, we have (agan usng the property that h s lnear): ( ) F (S, h) F (S, h) b φ F φ (σh(x )) m m F φ (σh(x )) x S x S σ {,+} whch yelds the statement of the Lemma Proof of Lemma Usng the fact that D w and L are symmetrc, we have: l(l, X) X + h (µ S µ S ), (7) X tr ( B D w Π ) X + X tr ( X ΠD w Π ) X + γ X tr ( X ) LX ΠD w B + ΠD w Π X + γlx 0, out of whch B± follows n Lemma (4) (5)
3 Proof of Theorem 3 We let Π o [DIAG(ˆπ) DIAG(ˆπ )] N an orthonormal system (n jj (ˆπ j +( ˆπ j) ) /, j [n] and 0 otherwse) Let K Πo be the n-dm subspace of R d generated by Π o The proof of Theorem (3) explots the followng Lemma, whch assumes that ε s any > 0 real for L n (8) (man fle) to be 0 When ε 0, the result of Theorem (3) stll holds but follows a dfferent proof Lemma Let A ΠD w Π and L defned as n (8) (man paper) Denote for short U ( L A + γ I ) (8) Suppose there exsts ξ > 0 such that for any x R n, the projecton of Ux n K Πo, x U,o, satsfes Then: Proof Combnng Lemma and (5), we get x U,o ξ x (9) M M F γξ B ± F (0) B ± B± Defne the followng permutaton matrx: C ( ) (A + γl) A I B ± ( (γl) A + I ) B ± () [ 0 I I 0 ] R n n () A ΠD w Π s not nvertble but dagonalsable Its (orthonormal) egenvectors can be parttoned n two matrces P o and P such that: We have: P o P [DIAG(ˆπ ) DIAG(ˆπ)] N CΠ o R n n (egenvalues 0), (3) ΠN R n n (egenvalues w j (ˆπ j + ( ˆπ j) ), j) (4) M M P o CB ± P o C B± P ( o C (γl) A + ) I B ± Π ( o (γl) A + ) I B ± (5) γπ ( o L A + γ ) I B ± (6) Eq (5) follows from the fact that C s dempotent Pluggng Frobenus norm n (6), we obtan M M F γ Π ( o L A + γ ) I B ± F γ d k Π o ( L A + γ I ) b ± k d γ ξ b ± k (7) k γ ξ B ± F, whch yelds (0) In (7), b ± k denotes column k n B± Ineq (7) makes use of assumpton (9) To ensure x U,o ξ x, t s suffcent that Ux ξ x, and snce Ux U F x, t s suffcent to show that, (8) U ξ F 3
wth U ξ L ξ A + ξγ I, for relevant choces of ξ We have let L ξ (/ξ)l Let 0 λ () λ n () denote the ordered egenvalues of a postve-semdefnte matrx n R n n It follows that, snce L s symmetrc postve defnte, we have λ j (L ξ A) λ j(a) λ n (L ξ ) ( 0), j [n] We have used eq (3) Weyl s Theorem then brngs: λ j (U ξ ) λ n (L ξ ) λ j (A) + ξγ λ n (L ξ ) { ξ γ f j [n] λ n(l ξ ) λ j(a) otherwse (9) Gershgorn s Theorem brngs λ n (/ξ)(ε + max j j l jj ), and furthermore the egenvalues of A satsfy λ j w j /, j n + We thus have: U ξ F nγ ξ ) 4n (ε + max j j l + jj ξ mn j wj (0) In (9) and (0), we have used the egenvalues of A gven n eqs (3) and (4) Assumng: γ ξ n, () a suffcent condton for the rght-hand sde of (0) to be s that ξ ε + max j j l jj n mn j w j () To fnsh up the proof, recall that L D V wth d jj j,j v jj and the coordnates v jj 0 Hence, l jj j j j v jj n max v jj, j [n] j j The proof s fnshed by pluggng ths upperbound n () to choose ξ, then takng the maxmal value for γ n () and fnally solvng the upperbound n (0) Ths ends the proof of Theorem 3 4 Proof of Lemma 4 We frst consder the normalzed assocaton crteron n (0): ASSOC(S j, S j ) vjj N ( ASSOC(Sj, S j ) ASSOC(S j, S j S j ) + x S j,x S j ASSOC(S ) j, S j ) ASSOC(S j, S j S j ) x x (3), 4
Remark that b j b j x x m j m j x S j x S j m x + j x S j m j x S j m + j m j m j x S j x x S j x + m j m j x S j,x S j x S j m j x S j x x x x x m j m j x S j m j m j x m j m j x S j,x S j x S j,x S j x x x S j x x x (4) + m j x m j m + m j x j m j m x x j m j m j x S j x S j x S j,x S j } {{ } a x x (5) m j m j x S j,x S j ASSOC(S j, S j ) (6) m j m j ( n ) ( Eq (4) explots the fact that j a n ) j n j a j and eq (5) explots the fact that a (m j m j ) x S j,x S x j x We thus have: ASSOC(S j, S j ) ASSOC(S j, S j S j ) ASSOC(S j, S j ) ASSOC(S j, S j ) + ASSOC(S j, S j ) ASSOC(S j, S j ) ASSOC(S j, S j ) + mjm j b j b j κ m j κ m j + mjm j b j b j + m j κ b j b j 5 (7) (8) (9)
Eq (7) uses (6) and eq (8) uses assumpton (D) Eq (8) also holds when permutng j and j, so we get: ( ) ς(v NC ε, B ± ) max j j n + + mj κ b j b j + + m j κ b j b j B ± F ( ) ε n + B ± mnj mj F + κ mn j,j b j b j ( ) ε n + B ± mnj mj F (30) + κ mn j,j b j b j ε n d max σ,j bσ j + 4κ d max σ,j b σ j mn j,j b j b j ε n d max 4κ d σ,j bσ j + κ max σ,j b σ j ) f (max NC σ,j bσ j o(), (3) where the last nequalty uses assumpton (D), and (30) uses the property that (a+b) a +b We have let f NC (x) ε n dx + 4κ d κx, (3) whch s ndeed o() f ε o(n / x) Ths proves the Lemma for ς(v NC, B ± ) The case of ς(v G,s, B ± ) s easer, as ( exp b ) ( j b j exp mn j,j b j b j ) s s ( exp κ ) s max σ,j bσ j, from assumpton (D) alone, whch gves ( ( ε ς(v G,s, B ± ) B ± F n + exp κ )) s max σ,j bσ j ( ( ε B ± F n + exp κ )) s max σ,j bσ j ( ( ε d max σ,j bσ j n + exp κ )) s max σ,j bσ j ) f (max G σ,j bσ j o(), (33) as clamed We have let f G (x) ε n dx+dx exp( κx/s), whch s ndeed o() f ε o(n / x) Remark that we shall have n general f G (x) f NC (x) and even f G (x) o(f NC (x)) f ε 0, so we may expect better convergence n the case of V G,s as max σ,j b σ j grows 5 Proof of Lemma 5 We frst restate the Lemma n a more explct way, that shall provde explct values for κ l and κ n Lemma There exst κ jj and s jj dependng on d j, d j, and κ jj > dependng on m j, m j, such that: 6
If v G,s jj jj > exp( /4) then S j, S j are not lnearly separable; If v G,s jj jj < exp( 64) then S j, S j are lnearly separable; If v NC jj If v NC jj > κ jj then S j, S j are not lnearly separable; < κ jj /κ jj then S j, S j are lnearly separable Proof We frst consder the normalzed assocaton crteron n (0), and we prove the Lemma for the followng expressons of κ jj and κ jj : κ jj 6 6 + + d jj + d jj d j d j, (34) κ jj 5 max{m j, m j }, (35) wth d jj max{d j, d j } and d j max x,x S j x x, j j [n] For any bag S j, we let (b j, r j) MEB(S j ) denote the mnmum enclosng ball (MEB) for bag S j and dstance L, that s, r j s the smallest unque real such that!b j : d(x, b j ) x b j r j, x S j We have let d(x, b j ) x b j We are gong to prove a frst result nvolvng the MEBs of S j and S j, and then wll translate the result to the Lemma s statement The followng propertes follows from standard propertes of MEBs and the fact that d(, ) s a dstance (they hold for any j j ): (a) d(x, x ) r j, x, x S j ; (b) If bags S j and S j are lnearly separable, then x CO(S j ), x S j such that d(x, x ) max{r j, r j }; here, CO denotes the convex closure; (c) If bags S j and S j are lnearly separable, then d(b j, b j ) max{r j, r j }, where b j and b j are the bags average; (d) x S j, x S j st d(x, x ) r j ; (e) d(x, x ) max{r j, r j } + d(b j, b j ), x CO(S j), x CO(S j ) Let us defne ASSOC(S j, S j ) d (x, x ) (36) x S j,x S j We remark that, assumng that each bag contans at least two elements wthout loss of generalty: vjj NC + (37) + ASSOC(Bj,B j ) ASSOC(B j,b j) + ASSOC(Bj,B j ) ASSOC(B j,b j ) We have ASSOC(S j, S j ) 4m j rj and ASSOC(S j, S j ) 4m j r j (because of (a)), and also ASSOC(S j, S j ) max{m j, m j } max{rj, r j } when S j and S j are lnearly separable (because of (b)), whch yelds n ths case vjj NC + + max{mj,m j } max{r j,r j } m jrj + max{r j,r j } r j + + max{mj,m j } max{r j,r j } m j r j + max{r j,r j } r j (38) Let us name κ jj the rght-hand sde of (38) It follows that when vnc jj > κ jj, S j and S j are not lnearly separable 7
On the other hand, we have ASSOC(S j, S j ) m j rj and ASSOC(S j, S j ) m j r j (because of (d)), and also ASSOC(S j, S j ) m j m j ( max{r j, r j } + d(b j, b j )) m j m j (4 max{rj, rj } + d (b j, b j )), (39) because of (e) and the fact that (a + b) a + b It follows that j j : vjj NC + (40) + m j (4 max{r j,r j }+d (b j,b j )) + mj(4 max{r j,r j }+d (b j,b j )) rj r j For any j j, when d (b j, b j ) 4 max{r j, r j }, then we have from (40): vjj NC + + 6m j max{r j,r j } + 6mj max{r j,r j } rj r j > κ jj /(3 max{m j, m j }) (4) Hence, when vjj NC κ jj /(3 max{m j, m j }), t mples d(b j, b j ) > max{r j, r j }, mplyng d(b j, b j ) > r j + r j, whch s a suffcent condton for the lnear separablty of S j and S j So, we can relate the lnear separablty of S j and S j to the value of vjj NC wth respect to κ jj defned n (38) To remove the dependence n the MEB parameters and obtan the statement of the Lemma, we just have to remark that d j /4 r j 4d j, j [n], whch yelds κ jj /6 κ jj κ jj Hence, when vjj NC > κ jj, t follows that vnc jj > κ jj and S j and S j are not lnearly separable On the other hand, when vjj NC κ jj /(6 3 max{m j, m j }) κ jj /κ jj, then vjj NC κ jj /(3 max{m j, m j }) and the bags S j and S j are lnearly separable Ths acheves the proof of Lemma 5 for the normalzed assocaton crteron n (0) The proof for v G,s jj s shorter, and we prove t for s j,j max{d j, d j } (4) We have (/) max{d j, d j } max{r j, r j } max{d j, d j } Hence, because of (c) above, f S j and S j are lnearly separable, then v G,s jj /e/4 ; so, when v G,s jj > /e/4, the two bags are not lnearly separable On the other hand, f d(b j, b j ) max{r j, r j }, then because of (e) above d(b j, b j ) 4 max{r j, r j } 8 max{d j, d j }, and so v G,s jj /e64 Ths mples that f v G,s jj < /e64, then d(b j, b j ) > max{r j, r j } r j + r j, and thus the two bags are lnearly separable, as clamed Ths acheves the proof of Lemma Ths acheves the proof of Lemma 5 6 Mean Map estmator s Lemma and Proof It s not hard to check that the randomzed procedure that bulds µ S RAND yx for some random x S and y {, } guarantees O( + γ) approxmablty when some bags are close to the convex hull of S, for small γ > 0 Hence, the Mean Map estmaton of µ S can be very poor n that respect Lemma 3 For any γ > 0, the Mean Map estmator µ S MM µ S / max σ,j b σ j γ, even when (D + D) hold cannot guarantee µ MM S Proof Let x > 0, ɛ (0, ), p (0, ), p / We create a dataset from four observatons, {(x 0, ), (x 0, ), (x 3 x, ), (x 4 x, )} There are two bags, S takes ɛ of x and ɛ of x S takes ɛ of x 4 and ɛ of x 3 The label-wse estmators µ σ of [4] are soluton of ( [ ] [ ] ɛ ɛ ɛ ɛ [ µ µ ] ɛ ɛ ɛ [ ( ɛ)x ɛx ] ɛ 8 ɛ ] ) [ ɛ ɛ ɛ ɛ ] [ x 0 (43)
On the other hand, the true quanttes are: [ ] µ µ [ ( ɛ)x ɛx ] (44) We now mx classes n S and pck bag proportons q P S [S ] and q P S [S ] We have the class proportons defned by P S [y +] ɛq + ( ɛ)( q) p Then ( ) ( ) µ S µ S p( ɛ) ɛ x ( p)ɛ ɛ x ɛ p ɛ ɛ x ɛ( q)x (45) Furthermore, max b σ x We get µ S µ S max b σ ɛ( q) (46) Pckng ɛ and ( q) both > (γ/) s suffcent to have eq (46) > γ for any γ > 0 Remark that both assumptons (D) and (D) hold for any κ < and any κ > 0 7 Proof of Theorem 6 The proof of the Theorem nvolves two Lemmata, the frst of whch s of ndependent nterest and holds for any convex twce dfferentable functon F, and not just any F φ So, let us defne: ( ) b F (S y, θ, µ) F (σθ x ) m θ µ (47) where b s any fxed postve real Defne also the regularzed loss: F (S y, θ, µ, λ) F (S y, θ, µ) + λ θ (48) Let f k R m denote the vector encodng the k th varable n S : f k x k For any k [d], let ( d f k σ k f k denote a normalzaton of vectors f k n the sense that d f k ( d d k ( d k f k f k k ) d d fk (49) ) d ) d k f k (50) Let Ṽ collect all vectors f k n column and V collect all vectors f k n column Wthout loss of generalty, we assume V V 0, e V V postve defnte (e no feature s a lnear combnaton of the others), mplyng, because the columns of Ṽ are just postve rescalng of the columns of V, that Ṽ Ṽ 0 as well We use V nstead of F as n the man paper, n order not to counfound wth the general convex surrogate notaton F that we use here Lemma 4 Gven any two µ and µ, let θ and θ be the respectve mnmzers of F (S y,, µ, λ) and F (S y,, µ, λ) Suppose there exsts F > 0 such that surrogate F satsfes F (±(αθ + ( α)θ ) x ) F, α [0, ], [m] (5) Then the followng holds: θ θ λ + em F vol (Ṽ) µ µ, (5) where vol(ṽ) det Ṽ Ṽ denote the volume of the (row/column) system of Ṽ 9
Proof Our proof begns followng the same frst steps as the proof of Lemma 7 n [5], addng the steps that handle the lowerbound on F Consder the followng auxlary functon A F (τ ): A F (τ ) ( F (S y, θ, µ) F (S y, θ, µ ) ) (τ θ ) + λ τ θ, (53) where the gradent of F s computed wth respect to parameter θ The gradent of A F () s: The gradent of A F satsfes A F (τ ) F (S y, θ, µ) F (S y, θ, µ ) + λ(τ θ ), (54) A F (θ ) F (S y, θ, µ, λ) F (S y, θ, µ, λ) 0, (55) as both gradents n the rght are 0 because of the optmalty of θ and θ wth respect to F (S y,, µ, λ) and F (S y,, µ, λ) The Hessan H of A F s HA F (τ ) λi 0 and so A F s convex and s thus mnmal at τ θ Fnally, A F (θ ) 0 It comes thus A F (θ ) 0, whch yelds equvalently: 0 ( F (S y, θ, µ) F (S y, θ, µ ) ) (θ θ ) + λ θ θ ( ) b F (yθ x ) m µ b F (yθ x ) + m µ (θ θ ) y y +λ θ θ ( b F (yθ x ) ) F (yθ x ) (θ θ m ) y y } {{ } a (µ µ ) (θ θ ) + λ θ θ (56) Let us lowerbound a We have F (yθ x) yf (yθ x)x, and a Taylor expanson brngs that for any θ, θ, there exsts some α [0, ] such that, defnng we have: We thus get: a u α, y(αθ + ( α)θ ) x, (57) F (yθ x ) F (yθ x ) + y(θ θ ) x F (u α, ) (58) ( F (yθ x ) y y ( y ) F (yθ x ) (θ θ ) y(f (yθ x ) F (yθ x ))x ) (θ θ ) ( ) (θ θ ) x F (u α, )x (θ θ ) y ((θ θ ) x ) F (u α, ) F ((θ θ ) x ) (59) F (θ θ ) SS (θ θ ), (60) where matrx S R d m s formed by the observatons of S y n columns, and neq (59) comes from (5) Defne T (d/ x )SS Its trace satsfes tr (T) d Let λ d λ d λ > 0 0
denote egenvalues of T, wth λ strctly postve because SS V V 0 The AGH nequalty brngs: Multplyng both sde by λ and rearrangng yelds: d λ k ( ) d d λ k (6) d k ( ) d tr (T) λ d ( ) d d λ d ( ) d d (6) d λ ( ) d d det T (63) d Let λ > 0 denote the mnmal egenvalue of SS It satsfes λ ( x /d)λ and thus t comes from neq (63): ( ) d ( ) d d d λ d x det SS ( ) [ d ( ) ] d d d det d x SS ( ) d d det Ṽ Ṽ (64) d ( ) d d vol (Ṽ) (65) d e vol (Ṽ) (66) We have used notaton vol(ṽ) det Ṽ Ṽ Snce (θ θ ) SS (θ θ ) λ θ θ, combnng (60) wth (66) yelds the followng lowerbound on a: Gong back to (56), we get λ θ θ (µ µ ) (θ θ ) + a e F vol (Ṽ) θ θ (67) b em F vol (Ṽ) θ θ 0 Snce (µ µ ) (θ θ ) µ µ θ θ, we get after channg the nequaltes and solvng for θ θ : as clamed θ θ λ + em F vol (Ṽ) µ µ, The second Lemma s used to (5) when F (x) F φ Notce that we cannot rely on strong convexty arguments on F φ, as ths do not hold n general The Lemma s stated n a more general settng than for just F F φ
Lemma 5 Fx λ, b > 0, and let x max x Suppose that µ µ for some µ > 0 Let ( ) b F (S y, θ, µ, λ) F (σθ x ) m θ µ + λ θ, (68) and let θ arg mn θ F (S y, θ, µ, λ) Suppose that F () s L-Lpschtz Then σ θ blx + µ λ (69) Proof Let us defne a shrnkng of the optmal soluton θ, θ α αθ for α (0, ) We have ( ) b F (S y, θ α, µ, λ) F (σθα x ) m θ α µ + λ θ α σ ( ) b F (σαθ x ) α m θ µ + λα θ σ ( b F (σθ x ) + L ) σαθ m x σθ x + α θ µ σ +λα θ (70) ( ) b F (σθ bk( α) x ) + θ x α m m θ µ σ +λα θ, (7) where (70) holds because F s L-Lpschtz To have eq (7) smaller than F (S y, θ, µ, λ), we need equvalently: bl( α) θ x α m θ µ + λα θ θ µ + λ θ, that s: bl( α) m θ x + α θ µ λ( α ) θ, and to fnd an α (0, ) such that ths holds, because of Cauchy-Schwartz nequalty, t s suffcent that ( α)(blx + µ) λ( α ) θ, e: θ blx + µ λ( + α) Hence, whenever θ > (blx + µ )/λ, there s a shrnkng of the optmal soluton to eq (68) that further decreases the rsk, thus contradctng ts optmalty Ths ends the proof of Lemma 5 Notce that Lemma 5 does not requre F (x) to be convex, nor dfferentable To use ths Lemma, remark that for any F φ, F φ(x) b φ (φ ) ( x) b φ (φ ) ( x) [ /b φ, 0], (7) for any x φ ([0, ]) [], and thus F φ s /b φ -Lpschtz Fnally, consderng (5), for any α [0, ] ± (αθ + ( α)θ ) x (α θ + ( α) θ )x x + α µ + ( α) µ (73) λ x + max{ µ, µ }, (74) λ where neq (73) uses Lemma 5 wth b /K b φ µ and µ are the parameters of F (S y,, µ, λ) and F (S y,, µ, λ) n Lemma 4
Algorthm Label Assgnaton (LA) Input θ R d, a bag B {x R d,,,, m}, bag sze m + [m]; If B then stop Else f m + (m) then y I(m + m) I(m + 0),,,, m Else Step : arg max θ x Step : y sgn(θ x ) Step 3 : LA(θ, B\{x }, m + I(y )) Now, gong back to the parameters of Theorem 6, we make the change µ µ S and µ µ S and obtan the statement of the Theorem for nterval Ths acheves the proof of Theorem 6 I [±(x + max{ µ S, µ S })] (75) 8 Proof of Lemma 7 We make the proof for optmzaton strategy OPT mn The case OPT max flps the choce of the label n Step To mnmze F φ (S y, θ t, µ S (σ)) over σ Σˆπ, we just have to fnd σ arg max σ Σ ˆπ θ σ x, and we can do that bag-wse Algorthm presents the labelng (notaton (m) {,,, m }) Remark that the tme complexty for one bag s O(m j log m j ) due to the orderng (Step ), so the overall complexty s ndeed O(m max log m ) Lemma 6 Let σ {σ, σ,, σ m} be the set of labels obtaned after runnng LA(θ, S j, m + j ) for j,,, n Then σ arg max σ Σ ˆπ θ σ x Proof The total edge, θ σ x (for any σ Σˆπ ), can be summable bag-wse wrt the coordnates of σ Consder thus the optmal set {σ } B arg max σ {,} m : σm + m θ x σ B x, for some bag B {x,,,, m }, wth constrant m + [m ] Ths set contans the label assgnment σ returned by LA(θ, B, m + ), a property that follows from two smple observatons: P Consder any observaton x of bag B; for any optmal labelng σ of B, let m + m + I(σ ) Defne the set {σ } of optmal labelngs of B\{x } wth constrant m + m + I(σ ) Then ths set concdes wth the set created by takng the elements of {σ } B to whch we drop coordnate Ths follows from the per-observaton summablty of the total edge wrt labels P Assume m + (m ) arg max θ x, there exsts an optmal assgnment σ such that σ sgn(θ x ) Otherwse, startng from any optmal assgnment σ, we can flp the label of x and the label of any other x for whch σ σ, and get a label assgnment that satsfes constrant m + and cannot be worse than σ, and s thus optmal, a contradcton Hence, LA(θ, B, m + ) pcks at each teraton a label that matches one n a subset of optmal labelngs, and the recursve call preserves the subset of optmal labelngs Snce when m + (m) the soluton returned by LA(θ, B, m + ) s obvously optmal, we end up when the current B s empty wth σ arg max σ Σ ˆπ θ σ x, as clamed 9 Proof of Theorem 8 We prove separately Eqs (4) and (5) 3
9 Proof of eq (4) Notatons : unless explctly stated, all samples lke S and S are of sze m To make the readng of our expectatons clear and smple, we shall wrte E D for E (x,y) D, E Σm for E σ Σm, E S for E (x,y) S, E D m for E S D and E Dm for E S D We now proceed to the proof, that follows the same man steps as that of Theorem 5 n [6] For any q [0, ], let us defne the convex combnaton: F φ (q, h(x)) qf φ (h(x)) + ( q)f φ ( h(x)) (76) It follows that E Σ ˆπ E S [F φ (σ(x)h(x))] E S [F φ (ˆπ(x), h(x))], (77) wth ˆπ(x) the label proporton of the bag to whch x belongs n S We also have h, wth Λ(S) E D [F φ (yh(x))] E S [F φ (ˆπ(x), h(x))] + Λ(S), (78) sup g {E D [F φ (yg(x))] E S [F φ (ˆπ(x), g(x))]} (79) Let us bound the devatons of Λ(S) around ts expectaton on the samplng of S, usng the ndependent bounded dfferences nequalty (IBDI, [7]) for whch we need to upperbound the maxmum dfference for the supremum term computed over two samples S and S of the same sze, such that S s S wth one example replaced We have: Λ(S) Λ(S ) E S [F φ (ˆπ(x), g(x))] E S [F φ (ˆπ (x), g(x))], (80) wth ˆπ and ˆπ denotng the correspondng label proportons n S and S Let {x } S\S and {x } S \S Let x S j and x S j for some bags j and j Upperbound (80) depends only on bags j and j For any x (S j S j )\{x, x }, eqs () and (3) brng: F φ (ˆπ(x), g(x)) F φ (ˆπ (x), g(x)) F φ(g(x)) F φ ( g(x)) m(x) g(x) b φ m(x) (8) h b φ m(x), (8) where m(x) s the sze of the bag to whch t belongs n S, plus ff t s bag j and j j, mnus ff t s bag j and j j Furthermore, () and (3) also brng: F φ (ˆπ(x), g(x)) F φ ( g(x) ) + b φ (( ˆπ(x)) g(x)>0 + ˆπ(x)( g(x)>0 )) g(x) F φ (0) + b φ (( ˆπ(x)) g(x)>0 + ˆπ(x)( g(x)>0 ))h Also, t comes from ts defnton that: We obtan that: Λ(S) Λ(S ) m F φ (0) + h b φ, x S F φ (0) b φ (0φ (0) φ(φ (0))) φ(/) b φ (83) ) ( + h + + h + b φ b φ m x (S j S j )\{x,x } h b φ m(x) Q m, (84) 4
where ( ) h Q + b φ So the IBDI yelds that wth probablty δ/ over the samplng of S, (85) Λ(S) E Dm sup {E D [F φ (yg(x))] E S [F φ (ˆπ(x), g(x))]} + Q g m log δ, (86) We now upperbound the expectaton n (86) Usng the convexty of the supremum, we have E Dm sup {E D [F φ (yg(x))] E S [F φ (ˆπ(x), g(x))]} g { E Dm sup ED m [F φ(yg(x))] E S [F φ (ˆπ(x), g(x))] } g E Dm,D sup {E m S [F φ (yg(x))] E S [F φ (ˆπ(x), g(x))]} (87) g Consder any set S D m, and let I / [m] be a subset of m ndces, pcked unformly at random among all ( ) m m possble choces For any I [m], let S(I) denote the subset of examples whose ndex matches I, and for any x S(I), let ˆπ(x S(I)) denote ts bag proporton n S(I) For any I / l ndexed by l and any x S, let: ˆπ s l (x) { ˆπ(x S(I / l )) f x S(I / l ) ˆπ(x S\S(I / l )) otherwse (88) denote the label proportons nduced by the splt of S n two subsamples S(I / l ) and S\S(I/ l ) Let { ˆπ l l (x) y f x S(I / l ) ˆπ(x S\S(I / l )) otherwse, (89) where y s the true label of x Let σ l (x) x S(I / l ) The Label Proporton Complexty (LPC) L m quantfes the dscrepance between these two estmators When each bag n S has label proporton zero or one, each term factorng classfer h n eq (3) (man fle) s zero, so L m 0 Lemma 7 The followng holds true: E Dm,D sup {E m S [F φ (yg(x))] E S [F φ (ˆπ(x), g(x))]} g E Dm,Σ m sup {E S [σ(x)f φ (ˆπ(x), h(x))]} + L m (90) h Proof For any σ Σ m and any sets S {x, x,, x m } and S {x, x,, x m}of sze m, denote and S σ S σ {x ff σ, x otherwse}, {x ff σ, x otherwse} (S S )\S σ (9) ˆπ (x) { ˆπσ (x) f x S σ, ˆπ σ (x) otherwse, (9) where ˆπ σ () denote the label proportons n S σ and ˆπ σ () denote the label proportons n S σ Let ˆπ() denote the label proportons n S, ˆπ () denote the label proportons n S (we know each bag to whch each example n S belongs to, so we can compute these estmators), We have E Dm,D m sup h E Dm,D m sup h E Dm,D m sup h {E S [F φ (yh(x))] E S [F φ (ˆπ(x), h(x))]} { E S [F φ (ˆπ (x), h(x))] E S [F φ (ˆπ(x), h(x))] b φ { E Sσ [σ(x)f φ (ˆπ l (x), h(x))] E Sσ [σ(x)f φ (ˆπ r (x), h(x))] b φ 5 } } (93),
wth E S [(( ˆπ (x)) y ˆπ (x) y )h(x)] ; (94) ˆπ l (x) (( + σ(x))ˆπ (x) + ( σ(x))ˆπ(x)), ˆπ r (x) We also have from eq () and (3): (( + σ(x))ˆπ(x) + ( σ(x))ˆπ (x)) (95) E Sσ [σ(x)f φ (ˆπ l (x), h(x))] E Sσ [σ(x)f φ (ˆπ σ (x), h(x))] b φ, (96) E Sσ [σ(x)f φ (ˆπ r (x), h(x))] E Sσ [σ(x)f φ (ˆπ σ (x), h(x))] 3, b φ (97) wth E Sσ [σ(x)(ˆπ l (x) ˆπ σ (x))h(x)], (98) 3 E Sσ [σ(x)(ˆπ r (x) ˆπ σ (x))h(x)] (99) We also have: 3 E S [(ˆπ (x) y )h(x)] + E S [(ˆπ(x) ˆπ (x))h(x)] 4 (00) Puttng eqs (93), (96), (97) and (00) altogether, we get, after ntroducng Rademacher varables: {E S [F φ (yh(x))] E S [F φ (ˆπ(x), h(x))]} E Dm,D m,σm sup h E Dm,D m,σm sup h E Dm,D m,σm sup h +E Dm,D m,σm sup h {E Sσ [σ(x)f φ (ˆπ σ (x), h(x))] E Sσ [σ(x)f φ (ˆπ σ (x), h(x))] + 4 } {E Sσ [σ(x)f φ (ˆπ σ (x), h(x))] E Sσ [σ(x)f φ (ˆπ σ (x), h(x))]} {E S [(ˆπ (x) y )h(x)] + E S [(ˆπ(x) ˆπ (x))h(x)]} E Dm,D sup {E m,σm S [σ(x)f φ (ˆπ (x), h(x))] E S [σ(x)f φ (ˆπ(x), h(x))]} h +E Dm,D sup {E m,σm S [(ˆπ (x) y )h(x)] + E S [(ˆπ(x) ˆπ (x))h(x)]} (0) h E Dm,Σ m sup {E S [σ(x)f φ (ˆπ(x), h(x))]} h {E S [(ˆπ (x) y )h(x)] + E S [(ˆπ(x) ˆπ (x))h(x)]} (0) +E Dm,D m,σm sup h Eq (0) holds because the dstrbuton of the supremum s the same We also have: E Dm,D sup {E m,σm S [(ˆπ (x) y )h(x)] + E S [(ˆπ(x) ˆπ (x))h(x)]} h E Dm,D m,σm sup h {E S [(ˆπ(x) ˆπ (x))h(x)] E S [( y ˆπ (x))h(x)]} E Dm E I /,I / sup E S [σ (x)(ˆπ s (x) ˆπl (x))h(x)] (03) h L m (04) Eq (03) holds because swappng the sample does not make any dfference n the outer expectaton, as each couple of swapped samples s generated wth the same probablty wthout swappng Puttng altogether (0) and (04) ends the proof of Lemma 7 We now bound the devatons of E Σm sup h {E S [σ(x)f φ (ˆπ(x), h(x))]} wth respect to ts expectaton over the samplng of S, E Dm,Σ m sup h {E S [σ(x)f φ (ˆπ(x), h(x))]} To do that, we use a thrd tme the IBDI and compute an upperbound for E Σ m sup g {E S [σ(x)f φ (ˆπ(x), h(x))]} E Σm sup g {E S [σ(x)f φ (ˆπ(x), h(x))]} [ ] sup E g {E S [σ(x)f φ (ˆπ(x), h(x))]} Σm (05) max Σ m sup g {E S [σ(x)f φ (ˆπ(x), h(x))]} [ ] sup g {E S [σ(x)f φ (ˆπ(x), h(x))]} sup g {E S [σ(x)f φ (ˆπ(x), h(x))]} 6 Q m, (06)
So, with probability $\geq 1 - \delta/2$ over the sampling of $S$,

$$E_{D_m,\Sigma_m}\sup_h\left\{E_S[\sigma(x)F_\phi(\hat\pi(x), h(x))]\right\} \leq E_{\Sigma_m}\sup_h\left\{E_S[\sigma(x)F_\phi(\hat\pi(x), h(x))]\right\} + Q\sqrt{\frac{1}{2m}\log\frac{2}{\delta}}, \quad(107)$$

where $Q$ is defined via (85). We obtain that with probability $> 1 - ((\delta/2) + (\delta/2)) = 1 - \delta$, the following holds $\forall h$:

$$E_D[F_\phi(yh(x))] \leq E_S[F_\phi(\hat\pi(x), h(x))] + \Lambda(S) \qquad \text{(see (78) and (79))}$$
$$\leq E_S[F_\phi(\hat\pi(x), h(x))] + E_{D_m}\sup_g\left\{E_D[F_\phi(yg(x))] - E_S[F_\phi(\hat\pi(x), g(x))]\right\} + Q\sqrt{\frac{1}{2m}\log\frac{2}{\delta}} \qquad \text{(from (86))}$$
$$\leq E_S[F_\phi(\hat\pi(x), h(x))] + E_{D_m,D'_m}\sup_g\left\{E_{S'}[F_\phi(yg(x))] - E_S[F_\phi(\hat\pi(x), g(x))]\right\} + Q\sqrt{\frac{1}{2m}\log\frac{2}{\delta}} \qquad \text{(from (87))}$$
$$\leq E_S[F_\phi(\hat\pi(x), h(x))] + 2E_{D_m,\Sigma_m}\sup_g\left\{E_S[\sigma(x)F_\phi(\hat\pi(x), g(x))]\right\} + L_m + Q\sqrt{\frac{1}{2m}\log\frac{2}{\delta}} \qquad \text{(Lemma 7)}$$
$$\leq E_S[F_\phi(\hat\pi(x), h(x))] + 2E_{\Sigma_m}\sup_h\left\{E_S[\sigma(x)F_\phi(\hat\pi(x), h(x))]\right\} + L_m + 2Q\sqrt{\frac{1}{2m}\log\frac{2}{\delta}} \qquad \text{(from (107))}$$
$$= E_{\Sigma_{\hat\pi}}E_S[F_\phi(\sigma(x)h(x))] + 2\hat{R}^b_m + L_m + 4\left(\frac{2h^*}{b_\phi} + 1\right)\sqrt{\frac{1}{2m}\log\frac{2}{\delta}},$$

as claimed.

9 Proof of eq. (5)

We have $F'_\phi(x) \in [-2/b_\phi, 0]$, and thus $F_\phi$ is $(2/b_\phi)$-Lipschitz, so Theorem 4.12 in [8] brings:

$$R_m(F_\phi, \eta) \doteq E_{\sigma \sim \Sigma_m}\sup_{h \in H}\left\{E_{i \sim [m]}\left[\sigma_i E_{\sigma' \sim \Sigma_{\hat\pi}}[F_\phi(\sigma'_i h(x_i)) - \eta]\right]\right\}$$
$$\leq \frac{2}{b_\phi}E_{\sigma \sim \Sigma_m}\sup_{h \in H}\left\{E_{i \sim [m]}\left[\sigma_i E_{\sigma' \sim \Sigma_{\hat\pi}}[\sigma'_i h(x_i) - \eta]\right]\right\}$$
$$= \frac{2}{b_\phi}E_{\sigma \sim \Sigma_m}\sup_{h \in H}\left\{E_{i \sim [m]}\left[\sigma_i E_{\sigma' \sim \Sigma_{\hat\pi}}[\sigma'_i h(x_i)]\right]\right\}$$
$$= \frac{2}{b_\phi}E_{\sigma \sim \Sigma_m}\sup_{h \in H}\left\{E_{i \sim [m]}\left[\sigma_i(2\hat\pi(x_i) - 1)h(x_i)\right]\right\},$$

as claimed; the second equality holds because the $\eta$ term vanishes in expectation over $\sigma$, and the last one because $E_{\sigma' \sim \Sigma_{\hat\pi}}[\sigma'_i] = 2\hat\pi(x_i) - 1$.

3 Supplementary Material on Experiments

3.1 Full Experimental Setup

All mean operator algorithms have been coded in R. For ∝SVM and InvCal, we used a Matlab implementation from the authors of [1] (https://github.com/felixyu/pSVM). The ranges of the parameters for cross-validation are $\lambda = \tilde\lambda m$ with $\tilde\lambda \in \{0\} \cup 10^{\{0,1,2\}}$, $\gamma \in 10^{\{1,\dots,10\}}$ and $\sigma \in \{1,\dots,10\}$ for the mean operator algorithms. We ran all experiments with $D_w = I$ and $\varepsilon = 0$. Since we tested on similar domains (domains 1-6 are actually the same), the ranges for InvCal and ∝SVM were taken from [1]. To avoid an additional source of complexity in the analysis, we cross-validated all hyper-parameters using the knowledge of all labels of the validation sets; notice that labels would generally not be accessible at validation time in real-world applications. A sketch of the selection protocol follows.
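The sketch below is ours and purely illustrative: the helper train_llp is hypothetical, and the grid values mirror the ranges as reconstructed above, so treat them as placeholders rather than the exact experimental values.

    # Hypothetical sketch of hyper-parameter selection for the mean operator
    # algorithms; grid values are reconstructed/assumed, not authoritative.
    m <- 500                                   # illustrative training-set size
    grid <- expand.grid(lambda = c(0, 10^(0:2)) * m,
                        gamma  = 10^(1:10),
                        sigma  = 1:10)
    # for (k in seq_len(nrow(grid))) {
    #   h <- train_llp(train_bags, proportions, grid[k, ])  # hypothetical helper
    #   score[k] <- validation_auc(h)  # validation labels used for selection only
    # }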
3.2 Simulated Domain for Violation of the Homogeneity Assumption

The synthetic data generated for this test consists of 6 classification problems, each one formed by 6 bags of 100 two-dimensional normal samples. The distribution generating the first dataset satisfies the homogeneity assumption (Figure 1(a)). Then, we gradually change the position of the class-conditional, bag-conditional means along one linear direction (to the right in Figures 1(b) and (c)), with different offsets for different bags. Figure 1 gives a graphical explanation of the process with 3 bags, and a code sketch of the process follows the figure.

Figure 1: Violation of the homogeneity assumption; the panels plot the samples in the $(x_1, x_2)$ plane, colored by label and bag: (a) homogeneous bags, (b) and (c) class-conditional, bag-conditional means progressively shifted to the right with bag-dependent offsets.
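For concreteness, here is a minimal R sketch of such a generator; the constants, and the choice of shifting only the positive-class means, are our illustrative assumptions, not the authors' code:

    # Sketch: 6 bags of 2-D Gaussian samples; the positive class-conditional
    # mean drifts rightwards with a bag-dependent offset, so class-conditional
    # distributions differ across bags (homogeneity violated; shift = 0
    # recovers the homogeneous case).
    set.seed(1)
    n_bags <- 6
    n_per  <- 100    # samples per bag (assumed)
    shift  <- 2.0    # strength of the violation

    make_bag <- function(j) {
      offset <- shift * (j - 1) / (n_bags - 1)   # bag-dependent offset
      y  <- sample(c(-1, 1), n_per, replace = TRUE)
      mu <- ifelse(y == 1, 1 + offset, -1)       # class- and bag-conditional mean
      data.frame(x1 = rnorm(n_per, mean = mu),
                 x2 = rnorm(n_per),
                 y = y, bag = j)
    }
    d <- do.call(rbind, lapply(seq_len(n_bags), make_bag))
    # the only supervision an LLP learner sees: per-bag label proportions
    pi_hat <- tapply(d$y == 1, d$bag, mean)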
3.3 Simulated Domain from [1]

The MM algorithm was shown to learn a model with zero predictive accuracy on the toy domain of [1]. We report here, in Table 1, the performance of all mean operator algorithms measured in the transductive setting, training with cross-validation. Although none of the distances used in our experiments with LMM leads to reasonable accuracy on the toy dataset, AMM_max initialized with any starting point learns in one step a model which perfectly classifies all the instances. We also notice that EMM returns an optimal classifier by itself (not reported in Table 1).

Table 1: AUC on the toy dataset of [1], for AMM_min and AMM_max under each initialization (EMM, MM, LMM_G, LMM_{G,s}, LMM_nc, 10ran); AMM_max reaches an AUC of 100.00 for every initialization.

3.4 Additional Tests on alter-∝SVM [1]

In our experiments, we observe that the AUC achieved by ∝SVM can be high, but it is also often below 0.5; in those cases the algorithm outputs models which are worse than random, and the average performance over the 5 test folds drops. We are able to reproduce the same behaviour on the heart dataset provided by the authors in a demo for alter-∝SVM; this also proves that our bag assignment for LLP simulation does not introduce the issue. In a first test, we randomly select 3/4 of the dataset, and randomly assign instances to 4 bags of fixed size 64, following [1]. We repeat the training split 50 times with C and C_p set as in the demo, and we measure the AUCs on the same training set. As expected, a consistent number of runs ends up producing an AUC smaller than 0.5. We display in Figure 2(a) the density profile of the AUC, which shows a relevant mass around 0.5; notice also that the two modes of the distribution look symmetric around 0.5. In a second test, we investigate further, measuring pairs of training-set AUC and loss value obtained by the same execution of the algorithm. In this case, we run over all the parameter ranges defined in the ∝SVM paper [1], and do not pick the model that minimizes the loss over the 10 random runs, but record the losses of all of them. Figures 2(b) and (c) show scatter plots relative to two chosen training set splits. We observe that loss minimization can lead both to high and to low AUCs, with only few points close to 0.5. A possible explanation might be the inverted polarity of the learnt linear classifier; inverted polarity in this context means having a model which would achieve better performance by assigning instances the labels opposite to the ones it predicts, which reflects the AUC around 0.5 (see the sketch after Figure 2). We conclude that optimizing ∝SVM's loss might in some cases be equivalent to training a max-margin separator of the unlabelled data, which only weakly exploits the information given by the label proportions. This would give a heuristic understanding of the frequent symmetrical behaviour of the AUC.

Figure 2: alter-∝SVM: empirical distribution of training-set AUC (a), and relationship between alter-∝SVM loss and training-set AUC on two different training splits (b), (c).
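The polarity argument can be checked directly: negating a tie-free scorer reflects its AUC around 0.5, i.e. AUC(-s) = 1 - AUC(s). A minimal R sketch of ours:

    # Rank-based (Wilcoxon-Mann-Whitney) AUC; y takes values in {-1, +1}.
    auc <- function(s, y) {
      r <- rank(s)
      n_pos <- sum(y == 1); n_neg <- sum(y == -1)
      (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    }
    set.seed(1)
    y <- sample(c(-1, 1), 200, replace = TRUE)
    s <- y + rnorm(200)        # an informative score
    c(auc(s, y), auc(-s, y))   # the two values sum to 1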
3.5 Scalability

Figure 3(a) shows the runtime of learning (including cross-validation) of MM and LMM with respect to the number of bags, which is the natural parameter of time complexity for our Laplacian-based methods. Although LMM_{G,s} has 3 layers of cross-validation, LMM_nc turns out to be the only method that is clearly not scalable. Figure 3(b) presents how our one-shot algorithms scale on all small domains as a function of problem size. Runtime is averaged over the different bag assignments. The same plot is given in Figure 3(c) for the iterative algorithms, in particular AMM_min and (alter/conv)-∝SVM. All curves are completed with measurements on the bigger domains when available. The runtime of the ∝SVMs is not directly comparable with that of our methods. This is due both (a) to the implementations being in different programming languages, and (b) to the fact that the code provided implements kernel ∝SVM, even for linear kernels, which is a big overhead in computation and memory access. Nevertheless, the high growth rate of conv-∝SVM makes the algorithm unsuitable for large datasets. Noticeably, even if alter-∝SVM does not show such behaviour, we are not able to run it on our bigger domains, since it requires approximately 10 hours to run on a training set split with fixed parameters.

Figure 3: Learning runtime (s) of MM and the LMM variants as a function of the number of bags (a), and as a function of domain size (#instances x #features) for the one-shot (b) and iterative (c) methods.

3.6 Full Results on Small Domains

Finally, we report details about all the experiments run on the 10 small domains (Table 2).

Table 2: Small domains size

dataset             instances   features
arrhythmia          452         97
australian          690         39
breastw             699
colic               368         83
german              1000        7
heart               270         4
ionosphere          351         37
vertebral column    60          9
vote                435         49
wine                178         6

In the following tables, columns show the number of bags (2, 4, 8, 16, 32) generated through K-MEANS; a sketch of this bag-generation protocol is given right below. Each cell contains the average AUC over the 5 test splits and its standard deviation; the runtime in seconds is in the separate column. The best performing algorithm, and the ones not worse than 1.0 AUC below it, are bold-faced. Comparisons are made within the respective top/bottom sub-tables, which group one-shot and iterative algorithms. We also highlight the runs which achieve an average AUC greater than or equal to the Oracle's.
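A minimal R sketch of ours for that protocol (not the exact experimental code): bags are the clusters found by k-means on the features alone, so bag membership never depends on the labels, and each bag is then summarized by its label proportion:

    # Assign instances to n_bags bags via k-means on the features, then
    # reduce the supervision to per-bag label proportions (LLP simulation).
    make_llp_bags <- function(X, y, n_bags) {
      bag <- kmeans(X, centers = n_bags, nstart = 10)$cluster
      list(bag = bag, proportions = tapply(y == 1, bag, mean))
    }
    # e.g., with the synthetic data d of Section 3.2:
    # b <- make_llp_bags(as.matrix(d[, c("x1", "x2")]), d$y, n_bags = 8)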
Table 3: arrhythmia. For 2, 4, 8, 16 and 32 bags, each cell reports the AUC ± standard deviation and, in the adjacent column, the runtime (s); the rows cover the one-shot algorithms (EMM, MM, LMM_G, LMM_{G,s}, LMM_nc, InvCal), the AMM_min and AMM_max variants (initialized with EMM, MM, G, G,s, nc and 10ran), alter-∝SVM, conv-∝SVM, and the Oracle.

Table 4: australian; same layout as Table 3.

Table 5: breastw; same layout as Table 3.
Table 6: colic; same layout as Table 3.

Table 7: german; same layout as Table 3.

Table 8: heart; same layout as Table 3.
Table 9: ionosphere; same layout as Table 3.

Table 10: vertebral column; same layout as Table 3.

Table 11: vote (the feature physician-fee-freeze was removed to make the problem harder); same layout as Table 3.
Table 12: wine; same layout as Table 3.

Figure 4: Relative AUC (with respect to the Oracle) on arrhythmia: (a) MM and the LMM variants, (b) the AMM variants, (c) alter-∝SVM, conv-∝SVM and InvCal.

Figure 5: Relative AUC (with respect to the Oracle) on australian; panels as in Figure 4.
Figure 6: Relative AUC (with respect to the Oracle) on breastw; panels as in Figure 4.
Figure 7: Relative AUC (with respect to the Oracle) on colic; panels as in Figure 4.

Figure 8: Relative AUC (with respect to the Oracle) on german; panels as in Figure 4.

Figure 9: Relative AUC (with respect to the Oracle) on heart; panels as in Figure 4.
Figure 10: Relative AUC (with respect to the Oracle) on ionosphere; panels as in Figure 4.

Figure 11: Relative AUC (with respect to the Oracle) on vertebral column; panels as in Figure 4.

Figure 12: Relative AUC (with respect to the Oracle) on vote; panels as in Figure 4.
Figure 13: Relative AUC (with respect to the Oracle) on wine; panels as in Figure 4.

References

[1] F. X. Yu, D. Liu, S. Kumar, T. Jebara, and S. F. Chang. ∝SVM for Learning with Label Proportions. In 30th ICML, pages 504-512, 2013.

[2] R. Nock and F. Nielsen. Bregman divergences and surrogates for learning. IEEE Trans. PAMI, 31:2048-2059, 2009.

[3] A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. on Information Theory, 51:2664-2669, 2005.

[4] N. Quadrianto, A. J. Smola, T. S. Caetano, and Q. V. Le. Estimating labels from label proportions. JMLR, 10:2349-2374, 2009.

[5] Y. Altun and A. J. Smola. Unifying divergence minimization and statistical inference via convex duality. In 19th COLT, pages 139-153, 2006.

[6] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. JMLR, 3:463-482, 2002.

[7] C. McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramirez-Alfonsin, and B. Reed, editors, Probabilistic Methods for Algorithmic Discrete Mathematics, pages 195-248. Springer Verlag, 1998.

[8] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer Verlag, 1991.