Machine-Learning for Big Data: Sampling and Distributed On-Line Algorithms. Stéphan Clémençon




Stéphan Clémençon, LTCI UMR CNRS No. 5141, Telecom ParisTech. Journée Traitement de Masses de Données du Laboratoire JL Lions, UPMC.

Goals of Statistical Learning Theory

Statistical issues cast as M-estimation problems: classification, regression, density level set estimation... and their variants.
- Minimal assumptions on the distribution
- Build realistic M-estimators for special criteria

Questions:
- Optimal elements
- Consistency
- Non-asymptotic excess risk bounds
- Fast rates of convergence
- Oracle inequalities

Main Example: Classification

$(X, Y)$ random pair with unknown distribution $P$:
- $X \in \mathcal{X}$: observation vector
- $Y \in \{-1, +1\}$: binary label/class

A posteriori probability / regression function: $\forall x \in \mathcal{X}$, $\eta(x) = \mathbb{P}\{Y = +1 \mid X = x\}$

Classifier: $g : \mathcal{X} \to \{-1, +1\}$

Performance measure = classification error: $L(g) = \mathbb{P}\{g(X) \neq Y\} \to \min_g$

Solution: the Bayes rule $\forall x \in \mathcal{X}$, $g^*(x) = 2\,\mathbb{I}\{\eta(x) > 1/2\} - 1$, with Bayes error $L^* = L(g^*)$

Empirical Risk Minimization

Sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ of i.i.d. copies of $(X, Y)$; class $\mathcal{G}$ of classifiers.

Empirical Risk Minimization principle:
$\hat{g}_n = \arg\min_{g \in \mathcal{G}} L_n(g) := \frac{1}{n} \sum_{i=1}^n \mathbb{I}\{g(X_i) \neq Y_i\}$

Best classifier in the class:
$\bar{g} = \arg\min_{g \in \mathcal{G}} L(g)$
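The ERM principle can be illustrated with a minimal sketch, not from the talk: hypothetical 1-D toy data and a finite class $\mathcal{G}$ of threshold classifiers $g_t(x) = \mathrm{sign}(x - t)$. The empirical risk $L_n(g_t)$ is evaluated for each candidate and the minimizer $\hat{g}_n$ is returned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: the label tends to be +1 when X is large.
n = 1000
X = rng.normal(size=n)
Y = np.where(X + 0.3 * rng.normal(size=n) > 0, 1, -1)

# Finite class G of threshold classifiers g_t(x) = sign(x - t).
thresholds = np.linspace(-2, 2, 81)

def empirical_risk(t):
    """L_n(g_t) = (1/n) sum_i 1{g_t(X_i) != Y_i}."""
    g = np.where(X - t > 0, 1, -1)
    return np.mean(g != Y)

risks = [empirical_risk(t) for t in thresholds]
t_hat = thresholds[int(np.argmin(risks))]   # empirical risk minimizer in G
print(t_hat, min(risks))
```

With this noise level the minimizing threshold lands near the true decision boundary at 0, and the minimal empirical risk stays close to the Bayes error of the toy problem.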

Empirical Processes in Classification

Bias-variance decomposition:
$L(\hat{g}_n) - L^* \leq (L(\hat{g}_n) - L_n(\hat{g}_n)) + (L_n(\bar{g}) - L(\bar{g})) + (L(\bar{g}) - L^*) \leq 2 \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| + \left(\inf_{g \in \mathcal{G}} L(g) - L^*\right)$

Concentration inequality: with probability $1 - \delta$,
$\sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \leq \mathbb{E}\left[\sup_{g \in \mathcal{G}} |L_n(g) - L(g)|\right] + \sqrt{\frac{2 \log(1/\delta)}{n}}$

Classification Theory - Main Results

1. Bayes risk consistency and rate of convergence. Complexity control:
$\mathbb{E}\left[\sup_{g \in \mathcal{G}} |L_n(g) - L(g)|\right] \leq C \sqrt{\frac{V}{n}}$
if $\mathcal{G}$ is a VC class with VC dimension $V$.
2. Fast rates of convergence: under variance control, rates faster than $n^{-1/2}$.
3. Convex risk minimization.
4. Oracle inequalities.

Big Data? Big Challenge!

Now, it is much easier:
- to collect data, massively and in real time: ubiquity of sensors (cell phones, internet, embedded systems, social networks, ...)
- to store and manage Big (and Complex) Data: distributed file systems, NoSQL
- to implement massively parallelized and distributed computational algorithms: MapReduce, clouds

The three features of Big Data analysis:
- Velocity: process data in quasi-real time (on-line algorithms)
- Volume: scalability (parallelized, distributed algorithms)
- Variety: complex data (text, signal, image, graph)

How to apply ERM to Big Data?

Suppose that $n$ is too large to evaluate the empirical risk $L_n(g)$.

Common sense: run your preferred learning algorithm on a subsample of "reasonable" size $B \ll n$, e.g. drawn with replacement from the original training data set...

... but of course, statistical performance is downgraded: $1/\sqrt{n} \ll 1/\sqrt{B}$
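A quick simulation, with hypothetical numbers (a fixed classifier of true risk $L = 0.2$), illustrates the degradation: the fluctuation of the risk estimate grows from roughly $\sqrt{L(1-L)/n}$ on the full sample to $\sqrt{L(1-L)/B}$ on a size-$B$ subsample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Errors of a fixed classifier: the indicators 1{g(X_i) != Y_i} are Bernoulli(L),
# with a hypothetical true risk L = 0.2.
L, n, B, trials = 0.2, 50_000, 500, 1_000

full, sub = [], []
for _ in range(trials):
    errors = rng.random(n) < L           # one training set of size n
    idx = rng.integers(0, n, size=B)     # size-B subsample, drawn with replacement
    full.append(errors.mean())           # L_n(g): uses all n points
    sub.append(errors[idx].mean())       # risk estimated on the subsample only

# Fluctuations around L scale as sqrt(L(1-L)/n) vs sqrt(L(1-L)/B).
print(np.std(full), np.std(sub))
```

Both estimates are unbiased, but the subsample one fluctuates by a factor of about $\sqrt{n/B}$ more, matching the $1/\sqrt{n} \ll 1/\sqrt{B}$ comparison above.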

Survey designs: a solution to Big Data learning?

Framework: the massive original sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ is viewed as a superpopulation.

Survey plan: $R_n$ = probability distribution on the ensemble of all nonempty subsets of $\{1, \ldots, n\}$. Let $S \sim R_n$ and set $\epsilon_i = 1$ if $i \in S$, $\epsilon_i = 0$ otherwise; the vector $(\epsilon_1, \ldots, \epsilon_n)$ fully describes $S$.

First and second order inclusion probabilities: $\pi_i(R_n) = \mathbb{P}\{i \in S\}$ and $\pi_{i,j}(R_n) = \mathbb{P}\{(i,j) \in S^2\}$.

Do not rely on the empirical risk based on the survey sample $\{(X_i, Y_i) : i \in S\}$: $\frac{1}{\#S} \sum_{i \in S} \mathbb{I}\{g(X_i) \neq Y_i\}$ is a biased estimate of $L(g)$.

Horvitz-Thompson theory

Consider the Horvitz-Thompson estimator of the risk,
$\bar{L}^{R_n}_n(g) = \frac{1}{n} \sum_{i=1}^n \frac{\epsilon_i}{\pi_i}\, \mathbb{I}\{g(X_i) \neq Y_i\},$
and the Horvitz-Thompson empirical risk minimizer $g^*_n = \arg\min_{g \in \mathcal{G}} \bar{L}^{R_n}_n(g)$.

It may work if $\sup_{g \in \mathcal{G}} |\bar{L}^{R_n}_n(g) - L_n(g)|$ is small. In general, due to the dependence structure, not much can be said about the fluctuations of this supremum.
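A small sketch of the Horvitz-Thompson risk estimate under a Poisson survey plan, with hypothetical inclusion probabilities: it checks empirically that reweighting each sampled indicator by $\epsilon_i/\pi_i$ yields an unbiased estimate of the full-sample risk $L_n(g)$.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 5_000
errors = (rng.random(n) < 0.25).astype(float)   # 1{g(X_i) != Y_i}, fixed data
L_n = errors.mean()                             # full-sample empirical risk

# Hypothetical unequal first-order inclusion probabilities pi_i in (0.05, 0.5).
pi = rng.uniform(0.05, 0.5, size=n)

def ht_risk(eps):
    """Horvitz-Thompson estimate: (1/n) sum_i (eps_i / pi_i) 1{g(X_i) != Y_i}."""
    return np.mean(eps / pi * errors)

# Poisson survey plan: eps_i ~ Bernoulli(pi_i), drawn independently.
estimates = [ht_risk(rng.random(n) < pi) for _ in range(500)]
print(np.mean(estimates), L_n)
```

The average of the HT estimates matches $L_n$, whereas the plain subsample mean $\frac{1}{\#S}\sum_{i \in S}\mathbb{I}\{g(X_i) \neq Y_i\}$ would be biased toward the points with large $\pi_i$.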

The Poisson case: the $\epsilon_i$'s are independent

In this case, $\bar{L}^{R_n}_n(g)$ is a simple average of independent r.v.'s, so we are back to empirical process theory. One recovers the same learning rate as if all the data had been used; e.g., in the finite VC dimension case,
$\mathbb{E}[L(g^*_n) - L^*] \leq (\kappa_n \sqrt{2} + 4) \sqrt{\frac{V \log(n+1) + \log 2}{n}},$
where $\kappa_n = \left(\frac{1}{n} \sum_{i=1}^n 1/\pi_i^2\right)^{1/2}$ (the $\pi_i$'s should not be too small...).

The upper bound is optimal in the minimax sense.

The Poisson case (continued)

This can be extended to more general sampling plans $Q_n$, provided one is able to control
$d_{TV}(R_n, Q_n) \stackrel{\text{def}}{=} \sum_{S \in \mathcal{P}(U_n)} |R_n(S) - Q_n(S)|.$
A coupling technique (Hajek, 1964) can be used to show that it works for rejective sampling, Rao-Sampford sampling, successive sampling, post-stratified sampling, etc.

Beyond Empirical Processes: U-Statistics as Performance Criteria

In various situations, the performance criterion is no longer a basic sample mean statistic. Examples:
- Clustering: within-cluster point scatter related to a partition $\mathcal{P}$,
$\frac{2}{n(n-1)} \sum_{i<j} D(X_i, X_j) \sum_{\mathcal{C} \in \mathcal{P}} \mathbb{I}\{(X_i, X_j) \in \mathcal{C}^2\}$
- Graph inference (link prediction)
- Ranking

The empirical criterion is an average over all possible $k$-tuples: a U-statistic of degree $k \geq 2$.

Example: Ranking

Data with ordinal labels: $(X_1, Y_1), \ldots, (X_n, Y_n) \in (\mathcal{X} \times \{1, \ldots, K\})^n$

Goal: rank $X_1, \ldots, X_n$ through a scoring function $s : \mathcal{X} \to \mathbb{R}$ such that $s(X)$ and $Y$ tend to increase/decrease together with high probability.

Quantitative formulation: maximize the criterion
$L(s) = \mathbb{P}\{s(X^{(1)}) < \ldots < s(X^{(K)}) \mid Y^{(1)} = 1, \ldots, Y^{(K)} = K\}$

Observations: $n_k$ i.i.d. copies of $X$ given $Y = k$, namely $X^{(k)}_1, \ldots, X^{(k)}_{n_k}$, with $n = n_1 + \ldots + n_K$.

Example: Ranking

A natural empirical counterpart of $L(s)$ is
$\hat{L}_n(s) = \frac{1}{n_1 \cdots n_K} \sum_{i_1=1}^{n_1} \cdots \sum_{i_K=1}^{n_K} \mathbb{I}\left\{s(X^{(1)}_{i_1}) < \ldots < s(X^{(K)}_{i_K})\right\}$

But the number of terms to be summed, $n_1 \times \cdots \times n_K$, is prohibitive! Maximization of $\hat{L}_n(s)$ is computationally unfeasible...

Generalized U-statistics

$K \geq 1$ samples and degrees $(d_1, \ldots, d_K) \in \mathbb{N}^K$

$(X^{(k)}_1, \ldots, X^{(k)}_{n_k})$, $1 \leq k \leq K$: $K$ independent i.i.d. samples drawn from $F_k(dx)$ on $\mathcal{X}_k$ respectively

Kernel $H : \mathcal{X}_1^{d_1} \times \cdots \times \mathcal{X}_K^{d_K} \to \mathbb{R}$, square integrable w.r.t. $\mu = F_1^{\otimes d_1} \otimes \cdots \otimes F_K^{\otimes d_K}$

Generalized U-statistics

Definition. The $K$-sample U-statistic of degrees $(d_1, \ldots, d_K)$ with kernel $H$ is
$U_n(H) = \frac{\sum_{I_1} \cdots \sum_{I_K} H(X^{(1)}_{I_1}; X^{(2)}_{I_2}; \ldots; X^{(K)}_{I_K})}{\binom{n_1}{d_1} \cdots \binom{n_K}{d_K}},$
where $\sum_{I_k}$ refers to summation over all subsets $X^{(k)}_{I_k} = (X^{(k)}_{i_1}, \ldots, X^{(k)}_{i_{d_k}})$ related to a set $I_k$ of $d_k$ indexes $1 \leq i_1 < \ldots < i_{d_k} \leq n_k$.

It is said to be symmetric when $H$ is permutation symmetric in each set of $d_k$ arguments $X^{(k)}_{I_k}$.

Reference: Lee (1990)

Generalized U-statistics

- Unbiased estimator of $\theta(H) = \mathbb{E}[H(X^{(1)}_1, \ldots, X^{(1)}_{d_1}, \ldots, X^{(K)}_1, \ldots, X^{(K)}_{d_K})]$ with minimum variance
- Asymptotically Gaussian as $n_k/n \to \lambda_k > 0$ for $k = 1, \ldots, K$
- Its computation requires the summation of $\prod_{k=1}^K \binom{n_k}{d_k}$ terms
- $K$-partite ranking: $d_k = 1$ for $1 \leq k \leq K$ and $H_s(x_1, \ldots, x_K) = \mathbb{I}\{s(x_1) < s(x_2) < \cdots < s(x_K)\}$

Incomplete U-statistics

Replace $U_n(H)$ by an incomplete version, involving far fewer terms:
- Build a set $\mathcal{D}_B$ of cardinality $B$ by sampling with replacement in the set $\Lambda$ of index tuples $((i^{(1)}_1, \ldots, i^{(1)}_{d_1}), \ldots, (i^{(K)}_1, \ldots, i^{(K)}_{d_K}))$ with $1 \leq i^{(k)}_1 < \ldots < i^{(k)}_{d_k} \leq n_k$, $1 \leq k \leq K$
- Compute the Monte Carlo version based on $B$ terms:
$\tilde{U}_B(H) = \frac{1}{B} \sum_{(I_1, \ldots, I_K) \in \mathcal{D}_B} H(X^{(1)}_{I_1}, \ldots, X^{(K)}_{I_K})$

An incomplete U-statistic is NOT a U-statistic.
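The construction can be sketched for the two-sample ranking kernel $H_s(x_1, x_2) = \mathbb{I}\{s(x_1) < s(x_2)\}$ (i.e. the AUC), using $s(x) = x$ as a stand-in scoring function and hypothetical Gaussian samples: $B$ index pairs are drawn with replacement and averaged, instead of summing over all $n_1 n_2$ pairs.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

# Two samples (K = 2, d_1 = d_2 = 1) and the ranking kernel H_s(x1, x2) = 1{x1 < x2}.
neg = rng.normal(0.0, 1.0, size=200)   # X^(1): class Y = 1
pos = rng.normal(1.0, 1.0, size=200)   # X^(2): class Y = 2

def H(x1, x2):
    return float(x1 < x2)

# Complete two-sample U-statistic: average over all n1 * n2 pairs.
U_n = np.mean([H(a, b) for a, b in itertools.product(neg, pos)])

# Incomplete version: D_B index pairs drawn with replacement, B << n1 * n2.
B = 4_000
i = rng.integers(0, len(neg), size=B)
j = rng.integers(0, len(pos), size=B)
U_B = np.mean(neg[i] < pos[j])

print(U_n, U_B)
```

$\tilde{U}_B$ is an unbiased, much cheaper approximation of $U_n$; its Monte Carlo error around $U_n$ is of order $1/\sqrt{B}$ here.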

ERM based on incomplete U-statistics

Replace the criterion by a tractable incomplete version based on $B = O(n)$ terms:
$\min_{H \in \mathcal{H}} \tilde{U}_B(H)$

This leads to investigating the maximal deviations
$\sup_{H \in \mathcal{H}} |\tilde{U}_B(H) - U_n(H)|$

Main Result

Theorem. Let $\mathcal{H}$ be a VC major class of bounded symmetric kernels of finite VC dimension $V < +\infty$. Set $M_{\mathcal{H}} = \sup_{(H,x) \in \mathcal{H} \times \mathcal{X}} |H(x)|$ and let $\#\Lambda$ denote the cardinality of the set of index tuples. Then:
(i) $\mathbb{P}\left\{\sup_{H \in \mathcal{H}} |\tilde{U}_B(H) - U_n(H)| > \eta\right\} \leq 2(1 + \#\Lambda)^V e^{-B\eta^2/M_{\mathcal{H}}^2}$
(ii) for all $\delta \in (0,1)$, with probability at least $1 - \delta$, we have:
$\frac{1}{M_{\mathcal{H}}} \sup_{H \in \mathcal{H}} \left|\tilde{U}_B(H) - \mathbb{E}[\tilde{U}_B(H)]\right| \leq 2\sqrt{\frac{2V \log(1+\kappa)}{\kappa}} + \sqrt{\frac{\log(2/\delta)}{\kappa}} + \sqrt{\frac{V \log(1 + \#\Lambda) + \log(4/\delta)}{B}}$
where $\kappa = \min\{\lfloor n_1/d_1 \rfloor, \ldots, \lfloor n_K/d_K \rfloor\}$.

Consequences

- Empirical risk sampling with $B = O(n)$ yields a rate bound of order $O(\sqrt{\log n / n})$
- One suffers no loss in terms of learning rate, while drastically reducing computational cost

Example: Ranking

Figure: Empirical ranking performance for SVMrank based on 1%, 5%, 10%, 20% and 100% of the "LETOR 2007" dataset.

Sketch of Proof

Set $\epsilon = ((\epsilon_k(I))_{I \in \Lambda})_{1 \leq k \leq B}$, where $\epsilon_k(I)$ equals 1 if the tuple $I = (I_1, \ldots, I_K)$ has been selected at the $k$-th draw and 0 otherwise.
- The $\epsilon_k$'s are i.i.d. random vectors
- For all $(k, I) \in \{1, \ldots, B\} \times \Lambda$, the r.v. $\epsilon_k(I)$ has a Bernoulli distribution with parameter $1/\#\Lambda$

With these notations,
$\tilde{U}_B(H) - U_n(H) = \frac{1}{B} \sum_{k=1}^B Z_k(H)$, where $Z_k(H) = \sum_{I \in \Lambda} (\epsilon_k(I) - 1/\#\Lambda)\, H(X_I)$.

Freezing the $X_I$'s, by virtue of Sauer's lemma:
$\#\{(H(X_I))_{I \in \Lambda} : H \in \mathcal{H}\} \leq (1 + \#\Lambda)^V$

Sketch of Proof (continued)

- Conditioned upon the $X_I$'s, $Z_1(H), \ldots, Z_B(H)$ are independent
- The first assertion is thus obtained by applying Hoeffding's inequality combined with the union bound

Set
$V_H\left(X^{(1)}_1, \ldots, X^{(1)}_{n_1}, \ldots, X^{(K)}_1, \ldots, X^{(K)}_{n_K}\right) = \frac{1}{\kappa}\Big[H\big(X^{(1)}_1, \ldots, X^{(1)}_{d_1}, \ldots, X^{(K)}_1, \ldots, X^{(K)}_{d_K}\big) + H\big(X^{(1)}_{d_1+1}, \ldots, X^{(1)}_{2d_1}, \ldots, X^{(K)}_{d_K+1}, \ldots, X^{(K)}_{2d_K}\big) + \ldots + H\big(X^{(1)}_{(\kappa-1)d_1+1}, \ldots, X^{(1)}_{\kappa d_1}, \ldots, X^{(K)}_{(\kappa-1)d_K+1}, \ldots, X^{(K)}_{\kappa d_K}\big)\Big]$

Sketch of Proof (continued)

The proof of the second assertion is based on the Hoeffding decomposition
$U_n(H) = \frac{1}{n_1! \cdots n_K!} \sum_{\sigma_1 \in \mathfrak{S}_{n_1}, \ldots, \sigma_K \in \mathfrak{S}_{n_K}} V_H\left(X^{(1)}_{\sigma_1(1)}, \ldots, X^{(K)}_{\sigma_K(n_K)}\right)$

The concentration result is then obtained in a classical manner:
- Convexity (Chernoff's bound)
- Symmetrization
- Randomization
- Application of McDiarmid's bounded difference inequality

Beyond finite VC dimension

Challenge: develop probabilistic tools and complexity assumptions to investigate the concentration properties of collections of sums of weighted binomials
$\tilde{U}_B(H) - U_n(H) = \frac{1}{B} \sum_{k=1}^B Z_k(H)$, with $Z_k(H) = \sum_{I \in \Lambda} (\epsilon_k(I) - 1/\#\Lambda)\, H(X_I)$

Some references

- S. Clémençon, S. Robbiano and J. Tressou (2013). Maximal Deviations of Incomplete U-statistics with Applications to Empirical Risk Sampling. In Proceedings of the SIAM International Conference on Data Mining, Austin (USA).
- P. Bertail, E. Chautru and S. Clémençon (2013). Empirical processes in survey sampling. Submitted.
- S. Clémençon (2014). A statistical view of clustering performance through the theory of U-processes. Journal of Multivariate Analysis.
- P. Bertail, E. Chautru and S. Clémençon (2014). On Survey Sampling and Empirical Risk Minimization. ISAIM 2014, Fort Lauderdale (USA).

Introduction

Investigate the binary classification problem in a statistical learning context:
- Data are not stored in a central unit but processed by independent agents (processors)
- Aim: not to find a consensus on a common classifier, but to find how to combine the local ones efficiently
- Solution: implement in an on-line and distributed manner

Outline

- Background
- Proposed algorithm
- Theoretical results
- Improvement of agents selection
- Numerical experiments

Learning problem

$X \in \mathcal{X} \subset \mathbb{R}^n$ (r.v. observation) $\to$ $\hat{Y} = \mathrm{sign}(H(X)) \in \{-1, +1\}$ (r.v. binary output)

Given a training dataset $(\mathbf{X}, \mathbf{Y}) = (X_i, Y_i)_{i=1,\ldots,n}$ in high dimension and with unknown joint distribution...

...find the best prediction rule $\mathrm{sign}(H^\star)$ such that the classifier function $H^\star$ minimizes the probability of error:
$H^\star = \arg\min_H P_e(H)$, where $P_e(H) = \mathbb{P}[-YH(X) > 0] = \mathbb{E}\left[\mathbb{1}_{\{-YH(X) > 0\}}\right]$

But $\mathbb{1}(\cdot)$ is not a differentiable function!

Learning problem

Majorize $\mathbb{E}\left[\mathbb{1}_{\{-YH(X) > 0\}}\right]$ by a convex function:

Convex surrogate: $\mathbb{E}\left[\mathbb{1}_{\{-YH(X) > 0\}}\right] \leq \mathbb{E}[\varphi(-YH(X))]$

How? Use a cost function $\varphi : \mathbb{R} \to [0, +\infty)$ with appropriate properties. Example: the quadratic function $\varphi(u) = \frac{(u+1)^2}{2}$.

Learning problem

...find the best prediction rule $\mathrm{sign}(H^\star)$ such that the classifier function $H^\star$ minimizes the risk function $R_\varphi$:
$H^\star = \arg\min_H R_\varphi(H)$, where $R_\varphi(H) = \mathbb{E}[\varphi(-YH(X))]$

When $\varphi(u) = \frac{(u+1)^2}{2}$, $H^\star$ coincides with the naive Bayes classifier!

Aggregation of local classifiers

Consider a classification device composed of a set $V$ of $N$ connected agents. Each agent $v \in V$:
- disposes of $\{(X_{1,v}, Y_{1,v}), \ldots, (X_{n_v,v}, Y_{n_v,v})\}$: $n_v$ independent copies of $(X, Y)$
- selects a local soft classifier function from a parametric class $\{h_v(\cdot, \theta_v)\}$

Setting $\theta_v = (a_v, b_v)$, the global soft classifier is
$H(x, \Theta) = \sum_{v \in V} h_v(x, \theta_v),$
where $h_v(x, \theta_v) = a_v h_v(x, b_v)$ and $\Theta = (\theta_1, \ldots, \theta_N)$.

Problem statement

The problem can be summarized as follows:
- given an observed data point $X$,
- obtain the best estimated label $\hat{Y}$ as $\mathrm{sign}(H(X, \Theta))$,
- where $\Theta$ is computed from the optimization problem, using the training data $(\mathbf{X}, \mathbf{Y}) = (X_i, Y_i)_{i=1,\ldots,n}$:
$\min_{\Theta} R_\varphi(H(X, \Theta))$

Problem statement: Approaches

1. Agreement on a common decision rule [Tsitsiklis-84, Agarwal-10]: consensus approach
- find an average consensus solution: $\Theta = (\theta, \ldots, \theta)$
- each agent uses the global classifier $H(X, \Theta)$
2. Mixture of experts: cooperative approach
- find the best aggregation solution: $\Theta = (\theta_1, \ldots, \theta_N)$
- each agent uses its local classifier $h_v(x, \theta_v)$

Problem statement: Approaches (continued)

Example (cooperative approach): set $b_v = 0$, $a_v \geq 0$ and $h_v : \mathcal{X} \to \{-1, +1\}$: the weak classifier $h_v(x, \theta_v) = a_v h_v(x)$.

High rate distributed learning

Solve the minimization problem for the parametric risk function:
$\min_{\Theta} R_\varphi(H(X, \Theta))$

High rate distributed learning

A standard distributed gradient descent iterative approach:
- generates a sequence of estimated parameter vectors $(\Theta_t)_{t \geq 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \geq 1}$
- at each agent $v$, the update step writes:
$\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, \mathbb{E}\left[Y\, \nabla_v h_v(X, \theta_{t,v})\, \varphi'(-YH(X, \Theta_t))\right]$

But the joint distribution is unknown!

High rate distributed learning

A standard distributed and on-line gradient descent iterative approach:
- generate a sequence of estimated parameter vectors $(\Theta_t)_{t \geq 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \geq 1}$
- each agent $v$ observes a pair $(X_{t+1,v}, Y_{t+1,v})$
- at each agent $v$, the update step replaces the expectation by its empirical version:
$\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, Y_{t+1,v}\, \nabla_v h_v(X_{t+1,v}, \theta_{t,v})\, \varphi'(-Y_{t+1,v} H(X_{t+1,v}, \Theta_t))$

But evaluating $H(X_{t+1,v}, \Theta_t)$ is required at each $t$ and each $v$!
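One synchronous update can be sketched under assumptions not in the slides: linear local classifiers $h_v(x, \theta_v) = \langle \theta_v, x \rangle$, the quadratic surrogate $\varphi(u) = (u+1)^2/2$ (so $\varphi'(u) = u + 1$), and, for readability, a single pair $(x, y)$ shared by all agents.

```python
import numpy as np

rng = np.random.default_rng(4)

N, d = 5, 3                        # number of agents, feature dimension
theta = rng.normal(size=(N, d))    # theta_{t,v}, stacked row-wise

def h(v, x, th):
    """Assumed local soft classifier: a linear score <theta_v, x>."""
    return th[v] @ x

def grad_h(v, x, th):
    """Gradient of <theta_v, x> with respect to theta_v."""
    return x

def phi_prime(u):
    """Derivative of the quadratic surrogate phi(u) = (u + 1)^2 / 2."""
    return u + 1.0

def full_update(theta, x, y, gamma):
    """One synchronous step: every agent uses the exact global H(x, Theta)."""
    H_val = sum(h(w, x, theta) for w in range(N))   # needs N(N-1) communications
    new = theta.copy()
    for v in range(N):
        new[v] = theta[v] + gamma * y * grad_h(v, x, theta) * phi_prime(-y * H_val)
    return new

x, y = rng.normal(size=d), 1.0
theta = full_update(theta, x, y, gamma=0.01)
print(theta.shape)
```

The cost of this exact update is precisely the bottleneck flagged above: each agent must learn the full $H(X_{t+1,v}, \Theta_t)$, hence the gossip variant that follows.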

High rate distributed learning: Example

At iteration $t$, each agent $v \in V$ holds $(X_{t,v}, \theta_{t,v})$ and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$.

Each node $v$ then sends its observation $X_{t,v}$ to all the other nodes, obtains the evaluations $h_w(X_{t,v}, \theta_{t,w})$ from all of them, and computes the global classifier:
$H(X_{t,1}, \Theta_t) = \sum_{w=1}^{4} h_w(X_{t,1}, \theta_{t,w})$

This requires $N(N-1)$ communications per iteration: for $N = 4$, 12 communications!

Proposed distributed learning: the OLGA algorithm

Replace the global $H(X_{t+1,v}, \Theta_t)$ by a local estimate $\hat{Y}^{(V)}_{t,v}$ at each $v \in V$ such that:
$\mathbb{E}[\hat{Y}^{(V)}_{t+1,v} \mid X_{t+1,v}, \Theta_t] = H(X_{t+1,v}, \Theta_t)$

How? Sparse communications with sparsity ratio $p$...

On-line Learning Gossip Algorithm (OLGA): for each $v \in V$ at time $t$, the local gradient descent update writes:
$\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, Y_{t+1,v}\, \nabla_v h_v(X_{t+1,v}, \theta_{t,v})\, \varphi'(-Y_{t+1,v}\, \hat{Y}^{(V)}_{t+1,v})$

Proposed distributed learning: the OLGA algorithm (Example)

At iteration $t$, each agent $v \in V$ holds $(X_{t,v}, \theta_{t,v})$ and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$.

Each node $v$ sends its observation $X_{t,v}$ only to randomly selected nodes, each polled with probability $p = 1/3$, obtains the evaluations $h_w(X_{t,v}, \theta_{t,w})$ from the chosen nodes, and computes its local estimate, e.g.:
$\hat{Y}^{(V)}_{t,1} = h_1(X_{t,1}, \theta_{t,1}) + \frac{1}{p}\, h_4(X_{t,1}, \theta_{t,4})$

This requires $pN(N-1)$ communications per iteration on average: for $N = 4$ and $p = 1/3$, 4 communications (a reduction of 67%)!
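The key property of the OLGA estimate, $\mathbb{E}[\hat{Y}^{(V)}_{t,v} \mid X, \Theta] = H(X, \Theta)$, can be checked numerically: each peer's term enters with probability $p$ and is reweighted by $1/p$. The $h_w$ values below are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)

p = 1 / 3                                   # sparsity ratio: poll each peer w.p. p
h_vals = np.array([0.4, -1.2, 0.7, 2.0])    # h_w(X_{t,1}, theta_{t,w}) for w = 1..4
H_true = h_vals.sum()                       # global classifier H(X_{t,1}, Theta_t)

# Agent 1 (index 0) always keeps its own term and polls each peer independently.
T = 200_000
polled = rng.random((T, len(h_vals))) < p
polled[:, 0] = False                        # peers only; the own term is added below

# Local estimate: own term + (1/p) * sum of the polled peers' terms.
estimates = h_vals[0] + (polled * h_vals).sum(axis=1) / p

print(estimates.mean(), H_true)
```

The average of the estimates matches $H$ exactly in expectation; the price of sparsification is the extra variance, which reappears as the additional noise term in the CLT below.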

Performance analysis

What is the effect of sparsification? Study the behaviour of the vector sequence $\Theta_t$ as $t \to \infty$:
- establish the consistency of the final solution given by the algorithm
- quantify the excess error variance due to the sparsity

Asymptotic behaviour of OLGA

Under suitable assumptions, we prove the following results:
1. Consistency: $(\Theta_t)_{t \geq 1} \xrightarrow{a.s.} \Theta^\star \in \mathcal{L} = \{\Theta : \nabla R_\varphi(\Theta) = 0\}$
2. CLT: conditioned on the event $\{\lim_{t \to \infty} \Theta_t = \Theta^\star\}$,
$\gamma_t^{-1/2}(\Theta_t - \Theta^\star) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \Sigma(\Gamma^\star))$
where
$\Gamma^\star = \underbrace{\mathbb{E}\big[(H(X, \Theta^\star) - Y)^2\, \nabla_v h_v(X, \theta_v^\star)\, \nabla_v h_v(X, \theta_v^\star)^T\big]}_{\text{estimation error in a centralized case}} + \underbrace{\frac{1-p}{p} \sum_{w \neq v} \mathbb{E}\big[h_w(X, \theta_w^\star)^2\, \nabla_v h_v(X, \theta_v^\star)\, \nabla_v h_v(X, \theta_v^\star)^T\big]}_{\text{additional noise term induced by the distributed setting}}$

A best-agents selection approach

When...
- the number of agents $N$ grows, the scheme becomes difficult to implement
- agents are redundant, similar outputs should be avoided

...include distributed agent selection! How? Add an $\ell_1$-penalization term with tuning parameter $\lambda$:
$\min_{\Theta} R_\varphi(H(X, \Theta)) + \lambda \sum_v |a_v|$
where the weight $a_v = 0$ for an idle agent and $a_v > 0$ when it is active.

Including best-agents selection in the OLGA algorithm

Introduce an update step at each time $t$ of OLGA to seek the time-varying set of active nodes $S_t \subset V$.

Including best-agents selection in the OLGA algorithm

The extended algorithm is summarized as follows, at time $t$:
1. obtain the active nodes $S_t$ from the sequence of updated weights $(a_{t,1}, \ldots, a_{t,N})$
2. apply OLGA to the set of active agents $v \in S_t$:
   i) estimate the local $\hat{Y}^{(S_t)}_{t+1,v}$ from a random selection among the currently active nodes
   ii) update the local gradient descent:
$\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, Y_{t+1,v}\, \nabla_v h_v(X_{t+1,v}, \theta_{t,v})\, \varphi'(-Y_{t+1,v}\, \hat{Y}^{(S_t)}_{t+1,v})$
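The slides do not spell out how the active set $S_t$ is derived from the weights $(a_{t,1}, \ldots, a_{t,N})$. A standard choice for an $\ell_1$ penalty, shown here purely as an assumption, is a proximal (soft-thresholding) step that zeroes out small weights; the values of `a`, `lam` and `gamma_t` below are hypothetical.

```python
import numpy as np

def soft_threshold(a, tau):
    """Proximal operator of tau * |a|: shrinks each weight toward zero."""
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

# Hypothetical weights a_{t,v} after a gradient step; lam and gamma_t assumed.
a = np.array([0.9, 0.03, -0.4, 0.01, 1.5])
lam, gamma_t = 1.0, 0.05

a_next = soft_threshold(a, lam * gamma_t)   # l1-penalized weight update
S_t = np.flatnonzero(a_next != 0.0)         # current set of active agents
print(a_next, S_t)
```

Agents whose weight falls below the threshold $\lambda \gamma_t$ are set idle ($a_v = 0$) and dropped from $S_t$, matching the idle/active convention of the penalized criterion above.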

Example with simulated data

Binary classification of (+) and (o) data samples with $N = 60$ agents using weak linear classifiers. When using distributed selection, the ensemble reduces to 25 active classifiers.

(a) OLGA; (b) OLGA with distributed selection.

Comparison with real data

Binary classification on the benchmark dataset "banana" using weak linear classifiers, for an increasing number of agents $N$.

Figure: Comparison between a centralized and sequential approach (GentleBoost) and our distributed and on-line algorithm (OLGA, $p = 0.6$): error rate versus number of weak learners.

Conclusions

- A fully distributed and on-line algorithm is proposed for the binary classification of big datasets, solved by $N$ processors; the algorithm is then adapted to select the useful classifiers, reducing $N$.
- We obtain theoretical results from the asymptotic analysis of the sequence estimated by OLGA.
- Numerical results illustrate behaviour comparable to a centralized, batch and sequential approach (GentleBoost).