Machine Learning for Big Data: Sampling and Distributed On-Line Algorithms
Stéphan Clémençon, LTCI UMR CNRS No. 5141, Telecom ParisTech
Journée Traitement de Masses de Données du Laboratoire J.-L. Lions, UPMC
Goals of Statistical Learning Theory

Statistical issues cast as M-estimation problems: classification, regression, density level set estimation... and their variants
- Minimal assumptions on the distribution
- Build realistic M-estimators for special criteria

Questions: optimal elements, consistency, non-asymptotic excess risk bounds, fast rates of convergence, oracle inequalities
Main Example: Classification

(X, Y): random pair with unknown distribution P
- X ∈ 𝒳: observation vector
- Y ∈ {−1, +1}: binary label/class

A posteriori probability / regression function: ∀x ∈ 𝒳, η(x) = P{Y = 1 | X = x}

Classifier: g : 𝒳 → {−1, +1}

Performance measure = classification error: L(g) = P{g(X) ≠ Y} → min over g

Solution: the Bayes rule, ∀x ∈ 𝒳, g*(x) = 2·I{η(x) > 1/2} − 1, with Bayes error L* = L(g*)
Empirical Risk Minimization

Sample (X_1, Y_1), ..., (X_n, Y_n) of i.i.d. copies of (X, Y); class 𝒢 of classifiers

Empirical Risk Minimization principle:
ĝ_n = argmin_{g∈𝒢} L_n(g) := (1/n) Σ_{i=1}^n I{g(X_i) ≠ Y_i}

Best classifier in the class:
ḡ = argmin_{g∈𝒢} L(g)
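A minimal sketch of the ERM principle on a toy problem; the threshold class and the simulated data are illustrative assumptions, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: X uniform on [0, 1], Y = +1 with probability eta(x) = 0.2 + 0.6 x.
n = 2000
X = rng.uniform(0.0, 1.0, size=n)
eta = 0.2 + 0.6 * X                              # regression function eta(x)
Y = np.where(rng.uniform(size=n) < eta, 1, -1)

# Finite class G of threshold classifiers g_t(x) = sign(x - t).
thresholds = np.linspace(0.0, 1.0, 101)

def empirical_risk(t):
    """L_n(g_t) = (1/n) sum_i I{g_t(X_i) != Y_i}."""
    pred = np.where(X > t, 1, -1)
    return float(np.mean(pred != Y))

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]             # ERM solution g_hat_n
# Here eta(x) > 1/2 iff x > 0.5, so the Bayes rule is the threshold t = 0.5.
print(f"ERM threshold: {t_hat:.2f}, empirical risk: {risks.min():.3f}")
```

With n = 2000 the empirical minimizer lands near the Bayes threshold 0.5, and its empirical risk is close to the Bayes risk of this model (0.35).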
Empirical Processes in Classification

Bias-variance decomposition:
L(ĝ_n) − L* ≤ (L(ĝ_n) − L_n(ĝ_n)) + (L_n(ḡ) − L(ḡ)) + (L(ḡ) − L*)
           ≤ 2 sup_{g∈𝒢} |L_n(g) − L(g)| + (inf_{g∈𝒢} L(g) − L*)

Concentration inequality. With probability 1 − δ:
sup_{g∈𝒢} |L_n(g) − L(g)| ≤ E[sup_{g∈𝒢} |L_n(g) − L(g)|] + √(2 log(1/δ)/n)
Classification Theory - Main Results

1. Bayes risk consistency and rate of convergence. Complexity control:
   E[sup_{g∈𝒢} |L_n(g) − L(g)|] ≤ C √(V/n)
   if 𝒢 is a VC class with VC dimension V
2. Fast rates of convergence: under variance control, rate faster than n^(−1/2)
3. Convex risk minimization
4. Oracle inequalities
Big Data? Big Challenge!

Now, it is much easier:
- to collect data, massively and in real time: ubiquity of sensors (cell phones, internet, embedded systems, social networks, ...)
- to store and manage Big (and Complex) Data (distributed file systems, NoSQL)
- to implement massively parallelized and distributed computational algorithms (MapReduce, clouds)

The three features of Big Data analysis:
- Velocity: process data in quasi-real time (on-line algorithms)
- Volume: scalability (parallelized, distributed algorithms)
- Variety: complex data (text, signal, image, graph)
How to apply ERM to Big Data?

Suppose that n is too large to evaluate the empirical risk L_n(g)

Common sense: run your preferred learning algorithm on a subsample of "reasonable" size B ≪ n, e.g. obtained by drawing with replacement from the original training data set...

... but of course, statistical performance is downgraded: 1/√n ≪ 1/√B
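The degradation of the 1/√n rate to 1/√B can be seen in a quick Monte Carlo sketch; treating the error indicators of a fixed classifier as Bernoulli draws is an assumption made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, B, reps = 50_000, 500, 300
p = 0.3   # true error probability L(g) of some fixed classifier g

# I{g(X_i) != Y_i} modeled as Bernoulli(p) indicators, reps independent datasets.
errors = rng.uniform(size=(reps, n)) < p

full = errors.mean(axis=1)                                  # L_n(g), all n points
idx = rng.integers(0, n, size=(reps, B))                    # draw B with replacement
sub = np.take_along_axis(errors, idx, axis=1).mean(axis=1)  # subsampled estimate

print(f"std over full samples: {full.std():.4f}  (scale 1/sqrt(n))")
print(f"std over subsamples  : {sub.std():.4f}  (scale 1/sqrt(B))")
```

The fluctuation of the subsampled risk estimate is larger by roughly √(n/B), which is exactly the statistical price paid for the computational saving.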
Survey designs: a solution to Big Data learning?

Framework: massive original sample (X_1, Y_1), ..., (X_n, Y_n), viewed as a superpopulation

Survey plan: R_n = probability distribution on the ensemble of all nonempty subsets of {1, ..., n}

Let S ~ R_n and set ε_i = 1 if i ∈ S, ε_i = 0 otherwise; the vector (ε_1, ..., ε_n) fully describes S

First and second order inclusion probabilities:
π_i(R_n) = P{i ∈ S} and π_{i,j}(R_n) = P{(i, j) ∈ S²}

Do not rely on the empirical risk based on the survey sample {(X_i, Y_i) : i ∈ S}:
(1/#S) Σ_{i∈S} I{g(X_i) ≠ Y_i} is a biased estimate of L(g)
Horvitz-Thompson theory

Consider the Horvitz-Thompson estimator of the risk:
L̄_n^{R_n}(g) = (1/n) Σ_{i=1}^n (ε_i / π_i) I{g(X_i) ≠ Y_i}

and the Horvitz-Thompson empirical risk minimizer:
g̃_n = argmin_{g∈𝒢} L̄_n^{R_n}(g)

It may work if sup_{g∈𝒢} |L̄_n^{R_n}(g) − L_n(g)| is small. In general, due to the dependence structure, not much can be said about the fluctuations of this supremum.
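A small sketch of the bias of the naive survey-sample risk and its Horvitz-Thompson correction, under a Poisson plan whose inclusion probabilities depend on the losses; all numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
loss = (rng.uniform(size=n) < 0.25).astype(float)   # I{g(X_i) != Y_i}, risk ~ 0.25
L_n = loss.mean()                                   # full-sample empirical risk

# Poisson survey plan: include point i independently with probability pi_i,
# here deliberately oversampling the misclassified points.
pi = np.where(loss == 1.0, 0.4, 0.1)                # first-order inclusion probs
eps = rng.uniform(size=n) < pi                      # eps_i = I{i in S}

# Naive average over the survey sample is biased toward oversampled points...
naive = loss[eps].mean()
# ...while the Horvitz-Thompson estimator reweights each term by 1/pi_i:
L_HT = np.sum(eps * loss / pi) / n

print(f"L_n = {L_n:.4f}, naive = {naive:.4f}, HT = {L_HT:.4f}")
```

The naive estimate lands near 0.57 here (errors are four times more likely to be sampled), while the reweighted estimator recovers L_n.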
The Poisson case: the ε_i's are independent

In this case, L̄_n^{R_n}(g) is a simple average of independent r.v.'s ⇒ back to empirical process theory

One recovers the same learning rate as if all data had been used, e.g. in the finite VC dimension case:
E[L(g̃_n) − L*] ≤ (κ_n √2 + 4) √((V log(n + 1) + log 2)/n)
where κ_n = √(Σ_{i=1}^n 1/π_i²) (the π_i's should not be too small...)

The upper bound is optimal in the minimax sense.
The Poisson case: extensions

Can be extended to more general sampling plans Q_n, provided you are able to control
d_TV(R_n, Q_n) := Σ_{S∈𝒫(U_n)} |R_n(S) − Q_n(S)|

A coupling technique (Hájek, 1964) can be used to show that it works for rejective sampling, Rao-Sampford sampling, successive sampling, post-stratified sampling, etc.
Beyond Empirical Processes: U-Statistics as Performance Criteria

In various situations, the performance criterion is no longer a basic sample mean statistic

Examples:
- Clustering: within-cluster point scatter related to a partition 𝒫:
  (2/(n(n−1))) Σ_{i<j} D(X_i, X_j) Σ_{C∈𝒫} I{(X_i, X_j) ∈ C²}
- Graph inference (link prediction)
- Ranking

The empirical criterion is an average over all possible k-tuples: a U-statistic of degree k ≥ 2
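The within-cluster point scatter above can be computed directly as a degree-2 U-statistic; a sketch on simulated data, where the two-cell partition and the Euclidean dissimilarity D are illustrative choices:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 2))
labels = (X[:, 0] > 0).astype(int)       # a partition P of the sample into 2 cells

def point_scatter(X, labels):
    """W(P) = 2/(n(n-1)) * sum_{i<j} D(X_i, X_j) * I{X_i and X_j in same cell}."""
    n = len(X)
    total = 0.0
    for i, j in combinations(range(n), 2):
        if labels[i] == labels[j]:
            total += np.linalg.norm(X[i] - X[j])   # D = Euclidean distance
    return 2.0 * total / (n * (n - 1))

W = point_scatter(X, labels)
print(f"within-cluster point scatter: {W:.3f}")
```

A partition aligned with the geometry of the data gives a smaller scatter than a random relabeling of the same points, which is what clustering criteria of this type reward.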
Example: Ranking

Data with ordinal labels: (X_1, Y_1), ..., (X_n, Y_n) ∈ (𝒳 × {1, ..., K})^n

Goal: rank X_1, ..., X_n through a scoring function s : 𝒳 → ℝ such that s(X) and Y tend to increase/decrease together with high probability

Quantitative formulation: maximize the criterion
L(s) = P{s(X^(1)) < ... < s(X^(K)) | Y^(1) = 1, ..., Y^(K) = K}

Observations: n_k i.i.d. copies of X given Y = k, namely X_1^(k), ..., X_{n_k}^(k), with n = n_1 + ... + n_K
Example: Ranking (continued)

A natural empirical counterpart of L(s) is
L̂_n(s) = (1/(n_1 ⋯ n_K)) Σ_{i_1=1}^{n_1} ⋯ Σ_{i_K=1}^{n_K} I{s(X_{i_1}^(1)) < ... < s(X_{i_K}^(K))}

But the number of terms to be summed, n_1 × ⋯ × n_K, is prohibitive! Maximization of L̂_n(s) is computationally unfeasible...
Generalized U-statistics

K ≥ 1 samples and degrees (d_1, ..., d_K) ∈ ℕ^K

(X_1^(k), ..., X_{n_k}^(k)), 1 ≤ k ≤ K: K independent i.i.d. samples drawn from F_k(dx) on 𝒳_k respectively

Kernel H : 𝒳_1^{d_1} × ⋯ × 𝒳_K^{d_K} → ℝ, square integrable w.r.t. μ = F_1^{⊗d_1} ⊗ ⋯ ⊗ F_K^{⊗d_K}
Generalized U-statistics: Definition

The K-sample U-statistic of degrees (d_1, ..., d_K) with kernel H is
U_n(H) = (∏_{k=1}^K C(n_k, d_k))^{−1} Σ_{I_1} ⋯ Σ_{I_K} H(X_{I_1}^(1); X_{I_2}^(2); ...; X_{I_K}^(K))
where C(n, d) denotes the binomial coefficient and Σ_{I_k} refers to summation over all subsets X_{I_k}^(k) = (X_{i_1}^(k), ..., X_{i_{d_k}}^(k)) related to a set I_k of d_k indexes 1 ≤ i_1 < ... < i_{d_k} ≤ n_k

It is said to be symmetric when H is permutation symmetric in each set of d_k arguments X_{I_k}^(k)

Reference: Lee (1990)
Generalized U-statistics: Properties

- Unbiased estimator of θ(H) = E[H(X_1^(1), ..., X_{d_1}^(1), ..., X_1^(K), ..., X_{d_K}^(K))] with minimum variance
- Asymptotically Gaussian as n_k/n → λ_k > 0 for k = 1, ..., K
- Its computation requires the summation of ∏_{k=1}^K C(n_k, d_k) terms

K-partite ranking: d_k = 1 for 1 ≤ k ≤ K and
H_s(x_1, ..., x_K) = I{s(x_1) < s(x_2) < ⋯ < s(x_K)}
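For instance, with K = 2, d_1 = d_2 = 1 and kernel H_s(x_1, x_2) = I{s(x_1) < s(x_2)}, the two-sample U-statistic U_n(H_s) is the familiar empirical AUC of the scoring function s; a sketch on simulated Gaussian classes (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2 = 300, 400
X1 = rng.normal(0.0, 1.0, size=n1)       # sample from class Y = 1
X2 = rng.normal(1.0, 1.0, size=n2)       # sample from class Y = 2 (shifted upward)

def s(x):
    return x                             # scoring function s(x) = x

# Complete two-sample U-statistic with kernel H_s(x1, x2) = I{s(x1) < s(x2)}:
# U_n(H_s) = (1 / (n1 * n2)) * sum_{i, j} I{s(X1_i) < s(X2_j)}
U = float(np.mean(s(X1)[:, None] < s(X2)[None, :]))
print(f"U_n(H_s) = {U:.3f}")
```

For these two unit-variance Gaussians one step apart, the population value is Φ(1/√2) ≈ 0.76, and the statistic concentrates around it.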
Incomplete U-statistics

Replace U_n(H) by an incomplete version, involving far fewer terms

Build a set 𝒟_B of cardinality B by sampling with replacement from the set Λ of index tuples ((i_1^(1), ..., i_{d_1}^(1)), ..., (i_1^(K), ..., i_{d_K}^(K))), with 1 ≤ i_1^(k) < ... < i_{d_k}^(k) ≤ n_k, 1 ≤ k ≤ K

Compute the Monte Carlo version based on B terms:
Ũ_B(H) = (1/B) Σ_{(I_1,...,I_K)∈𝒟_B} H(X_{I_1}^(1), ..., X_{I_K}^(K))

Note: an incomplete U-statistic is NOT a U-statistic
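A sketch comparing the complete two-sample U-statistic with its incomplete Monte Carlo version built from B index pairs drawn with replacement (the kernel and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n2 = 2_000, 2_000
X1 = rng.normal(0.0, 1.0, size=n1)
X2 = rng.normal(1.0, 1.0, size=n2)

# Complete two-sample U-statistic with kernel I{x1 < x2}: n1 * n2 = 4e6 terms.
U_full = float(np.mean(X1[:, None] < X2[None, :]))

# Incomplete version: draw B index pairs with replacement and average.
B = 10_000
i = rng.integers(0, n1, size=B)
j = rng.integers(0, n2, size=B)
U_B = float(np.mean(X1[i] < X2[j]))

print(f"complete: {U_full:.4f} ({n1 * n2} terms), incomplete: {U_B:.4f} ({B} terms)")
```

With B = 10⁴ terms instead of 4·10⁶, the incomplete version typically agrees with the complete one to within about a hundredth here, which is the point of the deviation bounds that follow.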
ERM based on incomplete U-statistics

Replace the criterion by a tractable incomplete version based on B = O(n) terms:
min_{H∈𝓗} Ũ_B(H)

This leads to investigating the maximal deviations
sup_{H∈𝓗} |Ũ_B(H) − U_n(H)|
Main Result

Theorem. Let 𝓗 be a VC major class of bounded symmetric kernels of finite VC dimension V < +∞. Set M_𝓗 = sup_{(H,x)∈𝓗×𝒳} |H(x)|. Then:

(i) P{sup_{H∈𝓗} |Ũ_B(H) − U_n(H)| > ε} ≤ 2(1 + #Λ)^V e^{−Bε²/(2M_𝓗²)}

(ii) for all δ ∈ (0, 1), with probability at least 1 − δ, we have:
(1/M_𝓗) sup_{H∈𝓗} |Ũ_B(H) − E[Ũ_B(H)]| ≤ 2√(2V log(1+κ)/κ) + √(log(2/δ)/κ) + √((V log(1+#Λ) + log(4/δ))/B)

where κ = min{⌊n_1/d_1⌋, ..., ⌊n_K/d_K⌋} and #Λ is the cardinality of the set Λ of index tuples
Consequences

Empirical risk sampling with B = O(n) yields a rate bound of order O(√(log n / n))

One suffers no loss in terms of learning rate, while drastically reducing the computational cost
Example: Ranking (experiments)

Figure: Empirical ranking performance for SVMrank based on 1%, 5%, 10%, 20% and 100% of the "LETOR 2007" dataset.
Sketch of Proof

Set ε = ((ε_k(I))_{I∈Λ})_{1≤k≤B}, where ε_k(I) equals 1 if the tuple I = (I_1, ..., I_K) has been selected at the k-th draw and 0 otherwise

- The ε_k's are i.i.d. random vectors
- For all (k, I) ∈ {1, ..., B} × Λ, the r.v. ε_k(I) has a Bernoulli distribution with parameter 1/#Λ

With these notations,
Ũ_B(H) − U_n(H) = (1/B) Σ_{k=1}^B Z_k(H),  where  Z_k(H) = Σ_{I∈Λ} (ε_k(I) − 1/#Λ) H(X_I)

Freezing the X_I's, by virtue of Sauer's lemma:
#{(H(X_I))_{I∈Λ} : H ∈ 𝓗} ≤ (1 + #Λ)^V
Sketch of Proof (continued)

Conditioned upon the X_I's, Z_1(H), ..., Z_B(H) are independent. The first assertion is thus obtained by applying Hoeffding's inequality combined with the union bound.

For the second assertion, set
V_H(X_1^(1), ..., X_{n_1}^(1), ..., X_1^(K), ..., X_{n_K}^(K)) = (1/κ) [ H(X_1^(1), ..., X_{d_1}^(1), ..., X_1^(K), ..., X_{d_K}^(K)) + H(X_{d_1+1}^(1), ..., X_{2d_1}^(1), ..., X_{d_K+1}^(K), ..., X_{2d_K}^(K)) + ⋯ + H(X_{(κ−1)d_1+1}^(1), ..., X_{κd_K}^(K)) ]
Sketch of Proof (continued)

The proof of the second assertion is based on the Hoeffding decomposition
U_n(H) = (1/(n_1! ⋯ n_K!)) Σ_{σ_1∈𝔖_{n_1}, ..., σ_K∈𝔖_{n_K}} V_H(X_{σ_1(1)}^(1), ..., X_{σ_K(n_K)}^(K))

The concentration result is then obtained in a classical manner:
- Convexity (Chernoff's bound)
- Symmetrization
- Randomization
- Application of McDiarmid's bounded difference inequality
Beyond finite VC dimension

Challenge: develop probabilistic tools and complexity assumptions to investigate the concentration properties of collections of sums of weighted binomials
Ũ_B(H) − U_n(H) = (1/B) Σ_{k=1}^B Z_k(H),  with  Z_k(H) = Σ_{I∈Λ} (ε_k(I) − 1/#Λ) H(X_I)
Some references

- S. Clémençon, S. Robbiano and J. Tressou (2013). Maximal deviations of incomplete U-statistics with applications to empirical risk sampling. In Proceedings of the SIAM International Conference on Data Mining, Austin (USA).
- P. Bertail, E. Chautru and S. Clémençon (2013). Empirical processes in survey sampling. Submitted.
- S. Clémençon (2014). A statistical view of clustering performance through the theory of U-processes. Journal of Multivariate Analysis.
- P. Bertail, E. Chautru and S. Clémençon (2014). On survey sampling and empirical risk minimization. ISAIM 2014, Fort Lauderdale (USA).
Introduction

We investigate the binary classification problem in the statistical learning context:
- Data are not stored in a central unit but processed by independent agents (processors)
- Aim: not to find a consensus on a common classifier, but to find how to combine the local ones efficiently
- Solution: implement the combination in an on-line and distributed manner
Outline
- Background
- Proposed algorithm
- Theoretical results
- Improvement of agent selection
- Numerical experiments
Background
Learning problem sign(h (X )) r.v. observation r.v. binary output X 2 X R n!! Ŷ 2 { 1,+1} Given training dataset (X,Y )=(X i,y i ) i=1,...,n in a high dimension n and with unknown joint distribution......find the best prediction rule sign(h? ) such the classifier function H (x) : H? =min P e(h ) where P e (H )=P[ YH (X ) > 0] = E 1 { YH (X )>0} H minimizes the probability of error P e B but 1(x) is not a di erentiable function! 4/21
Learning problem

Majorize E[1_{−Y H(X) > 0}] by a convex function:

Convex surrogate: E[1_{−Y H(X) > 0}] ≤ E[φ(−Y H(X))]

How? Use a cost function φ : ℝ → [0, +∞) with appropriate properties
Example: the quadratic function φ(u) = (u + 1)²/2
Learning problem sign(h (X )) r.v. observation r.v. binary output X 2 X R n!! Ŷ 2 { 1,+1} Given training dataset (X,Y )=(X i,y i ) i=1,...,n in a high dimension n and with unknown joint distribution......find the best prediction rule sign(h? ) such the classifier function H (x) : H? =min H R j(h ) where R j (H )=E[j( YH (X ))] minimizes the risk function R j (H ) 4 when j(u)= (u+1)2 2! H? coincides with the naive Bayes classifier! 4/21
Aggregation of local classifiers

Consider a classification device composed of a set V of N connected agents

Each agent v ∈ V:
- holds {(X_{1,v}, Y_{1,v}), ..., (X_{n_v,v}, Y_{n_v,v})}: n_v independent copies of (X, Y)
- selects a local soft classifier function from a parametric class {h_v(·, θ_v)}

Setting θ_v = (a_v, b_v), the global soft classifier is:
H(x, Θ) = Σ_{v∈V} h_v(x, θ_v)
where h_v(x, θ_v) = a_v h_v(x, b_v) and Θ = (θ_1, ..., θ_N)
Problem statement

The problem can be summarized as follows:
- given an observed data point X,
- obtain the best estimated label Ŷ as sign(H(X, Θ)),
- where Θ is computed from the optimization problem, using the training data (X, Y) = (X_i, Y_i)_{i=1,...,n}:
min_Θ E[φ(−Y H(X, Θ))]
Problem statement: Approaches

1. Agreement on a common decision rule [Tsitsiklis '84, Agarwal '10]: consensus approach
   - find an average consensus solution: Θ = (θ, ..., θ)
   - each agent uses the global classifier H(X, Θ)
2. Mixture of experts: cooperative approach
   - find the best aggregation solution: Θ = (θ_1, ..., θ_N)
   - each agent uses its local classifier h_v(x, θ_v)
✓ Example (cooperative approach): set b_v ≡ 0, a_v ≥ 0 and h_v : 𝒳 → {−1, +1}; the weak classifier is h_v(x, θ_v) = a_v h_v(x)
Proposed algorithm
High rate distributed learning

Solve the minimization problem of the parametric risk function:
min_Θ R_φ(H(X, Θ))
A standard distributed gradient descent iterative approach:
- generates a sequence of estimated parameter vectors (Θ_t)_{t≥1} = (θ_{t,1}, ..., θ_{t,N})_{t≥1}
- at each agent v, the update step writes:
θ_{t+1,v} = θ_{t,v} + γ_t E[Y ∇_v h_v(X, θ_{t,v}) φ′(−Y H(X, Θ_t))]

⚠ but the joint distribution is unknown
A standard distributed and on-line gradient descent iterative approach:
- generate a sequence of estimated parameter vectors (Θ_t)_{t≥1} = (θ_{t,1}, ..., θ_{t,N})_{t≥1}
- each agent v observes a pair (X_{t+1,v}, Y_{t+1,v})
- at each agent v, the update step (expectation replaced by its empirical version) writes:
θ_{t+1,v} = θ_{t,v} + γ_t Y_{t+1,v} ∇_v h_v(X_{t+1,v}, θ_{t,v}) φ′(−Y_{t+1,v} H(X_{t+1,v}, Θ_t))

⚠ evaluating H(X_{t+1,v}, Θ_t) is required at each t and v!
Example (full communication). At iteration t, each agent v ∈ V has (X_{t,v}, θ_{t,v}) and evaluates its local h_v(X_{t,v}, θ_{t,v}). Each node v then sends its observation X_{t,v} to all the other nodes and obtains the evaluations h_w(X_{t,v}, θ_{t,w}) from all the other nodes; e.g. node 1 collects {h_2(X_{t,1}, θ_{t,2}), h_3(X_{t,1}, θ_{t,3}), h_4(X_{t,1}, θ_{t,4})} and computes the global
H(X_{t,1}, Θ_t) = Σ_{w=1}^4 h_w(X_{t,1}, θ_{t,w})

⚠ N(N−1) communications per iteration: N = 4 → 12!
Proposed distributed learning: the OLGA algorithm

✓ Replace the global H(X_{t+1,v}, Θ_t) by a local estimate Ŷ_{t+1,v}^(V) at each v ∈ V such that:
E[Ŷ_{t+1,v}^(V) | X_{t+1,v}, Θ_t] = H(X_{t+1,v}, Θ_t)

How? Sparse communications with sparsity ratio p...

On-line Learning Gossip Algorithm (OLGA): for each v ∈ V at time t, the local gradient descent update writes:
θ_{t+1,v} = θ_{t,v} + γ_t Y_{t+1,v} ∇_v h_v(X_{t+1,v}, θ_{t,v}) φ′(−Y_{t+1,v} Ŷ_{t+1,v}^(V))
Example (OLGA). At iteration t, each agent v ∈ V has (X_{t,v}, θ_{t,v}) and evaluates its local h_v(X_{t,v}, θ_{t,v}). Each node v sends its observation X_{t,v} to randomly selected nodes, each chosen with probability p = 1/3, and obtains the evaluations h_w(X_{t,v}, θ_{t,w}) from the selected nodes only; e.g. node 1 collects {h_4(X_{t,1}, θ_{t,4})} and computes its local estimate:
Ŷ_{t,1}^(V) = h_1(X_{t,1}, θ_{t,1}) + (1/p) h_4(X_{t,1}, θ_{t,4})

⚠ pN(N−1) communications per iteration on average: N = 4, p = 1/3 → 4 (a reduction of 67%)!
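The key unbiasedness property E[Ŷ^(V) | X, Θ] = H(X, Θ) of the sparsified estimate is easy to check by simulation; the agent outputs below are arbitrary numbers chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
N, p, T = 10, 1.0 / 3.0, 200_000
h = rng.normal(size=N)             # local outputs h_w(X, theta_w) at a fixed point
H = h.sum()                        # global classifier H(X, Theta) = sum_w h_w

# Agent 0 keeps its own term exactly and receives each other agent's output
# with probability p; received terms are reweighted by 1/p so that the
# estimate is unbiased: E[est] = h_0 + (1/p) * p * sum_{w != 0} h_w = H.
sel = rng.uniform(size=(T, N)) < p
sel[:, 0] = False                  # agent 0 never samples its own output
est = h[0] + (sel * h).sum(axis=1) / p

print(f"H = {H:.3f}, mean of estimates = {est.mean():.3f}")
```

Individual estimates fluctuate (the price of sparsity, quantified by the CLT below), but their average converges to the exact global score H.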
Performance analysis

⚠ What is the effect of sparsification?

Study the behaviour of the sequence (Θ_t) as t → ∞:
- the consistency of the final solution given by the algorithm
- quantify the excess error variance due to the sparsity
Theoretical results
Asymptotic behaviour of OLGA

Under suitable assumptions, we prove the following results:
1. Consistency: (Θ_t)_{t≥1} → θ* ∈ 𝓛 = {Θ : ∇R_φ(Θ) = 0} a.s.
2. CLT: conditioned on the event {lim_{t→∞} Θ_t = Θ*},
   γ_t^{−1/2} (Θ_t − Θ*) → 𝒩(0, Σ(Γ*)) in distribution
where
Γ* = E[(H(X, Θ*) − Y)² ∇_v h_v(X, θ_v*) ∇_v h_v(X, θ_v*)ᵀ]   (estimation error in a centralized case)
   + ((1−p)/p) Σ_{w≠v} E[h_w(X, θ_w*)² ∇_v h_v(X, θ_v*) ∇_v h_v(X, θ_v*)ᵀ]   (additional noise term induced by the distributed setting)
Improvement of agent selection
A best-agent selection approach

When...
⚠ the number of agents N is large → difficult to implement
⚠ agents are redundant → avoid similar outputs

...include distributed agent selection! How? Add an ℓ1-penalization term with tuning parameter λ:
min_Θ R_φ(H(X, Θ)) + λ Σ_v |a_v|
where the weight a_v = 0 for an idle agent and a_v > 0 when it is active
Including best-agent selection in the OLGA algorithm

Introduce an update step at each time t of OLGA to seek the time-varying set of active nodes S_t ⊆ V
The extended algorithm is summarized as follows; at time t:
1. obtain the active nodes S_t from the sequence of updated weights (a_{t,1}, ..., a_{t,N})
2. apply OLGA to the set of active agents v ∈ S_t:
   i) estimate the local Ŷ_{t+1,v}^(S_t) from a random selection among the current active nodes
   ii) update the local gradient descent:
   θ_{t+1,v} = θ_{t,v} + γ_t Y_{t+1,v} ∇_v h_v(X_{t+1,v}, θ_{t,v}) φ′(−Y_{t+1,v} Ŷ_{t+1,v}^(S_t))
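One plausible way to realize step 1 is a proximal (soft-thresholding) update of the weights a_v, which is the standard treatment of an ℓ1 penalty in stochastic optimization; the sketch below is an assumption about the implementation, not taken from the talk, and uses a placeholder gradient:

```python
import numpy as np

def soft_threshold(a, tau):
    """Proximal operator of tau * |.|: shrinks toward 0, zeroing small weights."""
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

# Hypothetical weight update: one gradient step on the surrogate risk followed
# by the l1 proximal step; agents whose weight hits 0 become idle.
a = np.array([0.9, 0.05, -0.4, 0.02, 0.6])   # current weights a_{t,v}
gamma, lam = 0.1, 0.5                         # step size and penalty parameter
grad = np.zeros_like(a)                       # placeholder gradient, illustration only
a_new = soft_threshold(a - gamma * grad, gamma * lam)

active = np.flatnonzero(a_new != 0.0)         # the active set S_t
print("active agents:", active, "weights:", np.round(a_new, 2))
```

With threshold γλ = 0.05, the two small weights are zeroed and agents 0, 2 and 4 remain active, which is the sparsifying behaviour the penalty is introduced for.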
Numerical experiments
Example with simulated data

Binary classification of (+) and (o) data samples with N = 60 agents using weak linear classifiers. When using distributed selection, the ensemble reduces to 25 active classifiers.

Figure: (a) OLGA; (b) OLGA with distributed selection.
Comparison with real data

Binary classification on the available benchmark dataset "banana" using weak linear classifiers, for increasing N.

Figure: Error rate vs. number of weak learners, comparing a centralized, sequential approach (GentleBoost) with our distributed, on-line algorithm (OLGA, p = 0.6).
Conclusions

- A fully distributed and on-line algorithm is proposed for binary classification of big datasets handled by N processors
  ✓ the algorithm is then adapted to select the useful classifiers, reducing N
- We obtain theoretical results from the asymptotic analysis of the sequence of estimates produced by OLGA
- Numerical results illustrate behaviour comparable to a centralized, batch, sequential approach (GentleBoost)