Machine-Learning for Big Data: Sampling and Distributed On-Line Algorithms. Stéphan Clémençon

1 Machine-Learning for Big Data: Sampling and Distributed On-Line Algorithms. Stéphan Clémençon, LTCI UMR CNRS, Telecom ParisTech. Journée Traitement de Masses de Données du Laboratoire J.-L. Lions, UPMC.

2 Goals of Statistical Learning Theory
Statistical issues cast as M-estimation problems: classification, regression, density level set estimation... and their variants
Minimal assumptions on the distribution
Build realistic M-estimators for special criteria
Questions: optimal elements, consistency, non-asymptotic excess risk bounds, fast rates of convergence, oracle inequalities

3 Main Example: Classification
$(X, Y)$ random pair with unknown distribution $P$: $X \in \mathcal{X}$ observation vector, $Y \in \{-1, +1\}$ binary label/class
A posteriori probability (regression function): $\forall x \in \mathcal{X}$, $\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$
Classifier: $g : \mathcal{X} \to \{-1, +1\}$
Performance measure = classification error: $L(g) = \mathbb{P}\{g(X) \neq Y\} \to \min_g$
Solution: Bayes rule $\forall x \in \mathcal{X}$, $g^*(x) = 2\,\mathbb{I}\{\eta(x) > 1/2\} - 1$; Bayes error $L^* = L(g^*)$

4 Empirical Risk Minimization
Sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ of i.i.d. copies of $(X, Y)$
Class $\mathcal{G}$ of classifiers
Empirical Risk Minimization principle: $\hat{g}_n = \arg\min_{g \in \mathcal{G}} L_n(g) := \frac{1}{n} \sum_{i=1}^n \mathbb{I}\{g(X_i) \neq Y_i\}$
Best classifier in the class: $\bar{g} = \arg\min_{g \in \mathcal{G}} L(g)$
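
As a concrete illustration of the ERM principle (not taken from the slides), the minimal sketch below evaluates the empirical risk of every classifier in a small finite class of decision stumps and returns the minimizer; the synthetic data and the stump class are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: X uniform on [0, 1], Bayes rule is sign(X - 0.5) with 10% label noise.
n = 1000
X = rng.uniform(0.0, 1.0, size=n)
Y = np.where(X > 0.5, 1, -1)
Y[rng.uniform(size=n) < 0.1] *= -1

# Finite class G of decision stumps g_t(x) = sign(x - t).
thresholds = np.linspace(0.0, 1.0, 21)

def empirical_risk(t, X, Y):
    """L_n(g_t) = (1/n) * sum_i 1{g_t(X_i) != Y_i}."""
    pred = np.where(X > t, 1, -1)
    return np.mean(pred != Y)

risks = np.array([empirical_risk(t, X, Y) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]          # empirical risk minimizer in G
print(f"ERM threshold: {t_hat:.2f}, empirical risk: {risks.min():.3f}")
```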

5 Empirical Processes in Classification
Bias-variance decomposition:
$L(\hat{g}_n) - L^* \le (L(\hat{g}_n) - L_n(\hat{g}_n)) + (L_n(\bar{g}) - L(\bar{g})) + (L(\bar{g}) - L^*) \le 2 \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| + \left( \inf_{g \in \mathcal{G}} L(g) - L^* \right)$
Concentration inequality: with probability $1 - \delta$,
$\sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \le \mathbb{E}\left[ \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \right] + \sqrt{\frac{2 \log(1/\delta)}{n}}$

6 Classification Theory - Main Results
1. Bayes risk consistency and rate of convergence. Complexity control: $\mathbb{E}\left[\sup_{g \in \mathcal{G}} |L_n(g) - L(g)|\right] \le C \sqrt{V/n}$ if $\mathcal{G}$ is a VC class with VC dimension $V$.
2. Fast rates of convergence: under variance control, rate faster than $n^{-1/2}$
3. Convex risk minimization
4. Oracle inequalities

9 Big Data? Big Challenge!
Now, it is much easier:
to collect data, massively and in real time: ubiquity of sensors (cell phones, internet, embedded systems, social networks, ...)
to store and manage Big (and Complex) Data (distributed file systems, NoSQL)
to implement massively parallelized and distributed computational algorithms (MapReduce, clouds)
The three features of Big Data analysis:
Velocity: process data in quasi-real time (on-line algorithms)
Volume: scalability (parallelized, distributed algorithms)
Variety: complex data (text, signal, image, graph)

10-11 How to apply ERM to Big Data?
Suppose that $n$ is too large to evaluate the empirical risk $L_n(g)$.
Common sense: run your preferred learning algorithm using a subsample of "reasonable" size $B \ll n$, e.g. by drawing with replacement in the original training data set...
...but, of course, statistical performance is downgraded: $1/\sqrt{n} \ll 1/\sqrt{B}$
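
A minimal sketch of this "common sense" baseline: draw $B$ indices with replacement from the original training set and run ERM over decision stumps on the subsample only. The toy data, the stump class and the sizes $n$ and $B$ are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, B = 100_000, 1_000            # full sample size n, subsample size B << n

X = rng.uniform(0.0, 1.0, size=n)
Y = np.where(X > 0.5, 1, -1) * np.where(rng.uniform(size=n) < 0.1, -1, 1)

# Draw B indices with replacement from the original training set.
idx = rng.choice(n, size=B, replace=True)
X_B, Y_B = X[idx], Y[idx]

# Run ERM over decision stumps on the subsample only.
thresholds = np.linspace(0.0, 1.0, 21)
risks_B = [np.mean(np.where(X_B > t, 1, -1) != Y_B) for t in thresholds]
t_hat_B = thresholds[int(np.argmin(risks_B))]
# Statistical price: deviations now scale as 1/sqrt(B) instead of 1/sqrt(n).
print(f"subsampled ERM threshold: {t_hat_B:.2f}")
```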

12 Survey designs: a solution to Big Data learning?
Framework: massive original sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ viewed as a superpopulation
Survey plan $R_n$ = probability distribution on the ensemble of all nonempty subsets of $\{1, \ldots, n\}$
Let $S \sim R_n$ and set $\epsilon_i = 1$ if $i \in S$, $\epsilon_i = 0$ otherwise; the vector $(\epsilon_1, \ldots, \epsilon_n)$ fully describes $S$
First and second order inclusion probabilities: $\pi_i(R_n) = \mathbb{P}\{i \in S\}$ and $\pi_{i,j}(R_n) = \mathbb{P}\{(i,j) \in S^2\}$
Do not rely on the empirical risk based on the survey sample $\{(X_i, Y_i) : i \in S\}$: $\frac{1}{\#S} \sum_{i \in S} \mathbb{I}\{g(X_i) \neq Y_i\}$ is a biased estimate of $L(g)$

13 Horvitz-Thompson theory
Consider the Horvitz-Thompson estimator of the risk: $\bar{L}^{R_n}_n(g) = \frac{1}{n} \sum_{i=1}^n \frac{\epsilon_i}{\pi_i} \mathbb{I}\{g(X_i) \neq Y_i\}$
and the Horvitz-Thompson empirical risk minimizer: $g_n = \arg\min_{g \in \mathcal{G}} \bar{L}^{R_n}_n(g)$
It may work if $\sup_{g \in \mathcal{G}} |\bar{L}^{R_n}_n(g) - L_n(g)|$ is small
In general, due to the dependence structure, not much can be said about the fluctuations of this supremum

14 The Poisson case: the $\epsilon_i$'s are independent
In this case, $\bar{L}^{R_n}_n(g)$ is a simple average of independent r.v.'s $\Rightarrow$ back to empirical process theory
One recovers the same learning rate as if all data had been used, e.g. in the finite VC dimension case: $\mathbb{E}[L(g_n) - L^*] \le (\kappa_n \sqrt{2} + 4) \sqrt{\frac{V \log(n+1) + \log 2}{n}}$, where $\kappa_n = \sqrt{\sum_{i=1}^n (1/\pi_i^2)}$ (the $\pi_i$'s should not be too small...)
The upper bound is optimal in the minimax sense.
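
The sketch below illustrates the Horvitz-Thompson weighting under a Poisson survey plan with unequal first-order inclusion probabilities $\pi_i$: the plain subsample mean is biased, while the $1/\pi_i$-weighted estimate is not. The toy data, the fixed classifier `g` and the choice of the $\pi_i$'s are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Toy data and a fixed classifier g (threshold deliberately offset).
X = rng.normal(size=n)
Y = np.where(X + 0.3 * rng.normal(size=n) > 0, 1, -1)
g = lambda x: np.where(x > 0.3, 1, -1)
losses = (g(X) != Y).astype(float)          # 1{g(X_i) != Y_i}

# Poisson survey plan: independent inclusions with P{i in S} = pi_i.
# Unequal probabilities (positives over-sampled) make the naive mean biased.
pi = np.where(Y == 1, 0.05, 0.01)
eps = rng.uniform(size=n) < pi              # eps_i = 1{i in S}

L_n = losses.mean()                         # full-sample empirical risk
L_HT = np.sum(eps * losses / pi) / n        # Horvitz-Thompson estimate (unbiased)
L_naive = losses[eps].mean()                # plain subsample mean (biased here)
print(f"L_n = {L_n:.4f}   HT = {L_HT:.4f}   naive = {L_naive:.4f}")
```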

15 The Poisson case: the $\epsilon_i$'s are independent
Can be extended to more general sampling plans $Q_n$, provided you are able to control $d_{TV}(R_n, Q_n) := \sum_{S \in \mathcal{P}(U_n)} |R_n(S) - Q_n(S)|$
A coupling technique (Hajek, 1964) can be used to show that it works for rejective sampling, Rao-Sampford sampling, successive sampling, post-stratified sampling, etc.

16 Beyond Empirical Processes: U-Statistics as Performance Criteria
In various situations, the performance criterion is not a basic sample mean statistic any more.
Examples:
Clustering: within-cluster point scatter related to a partition $\mathcal{P}$: $\frac{2}{n(n-1)} \sum_{i < j} D(X_i, X_j) \sum_{\mathcal{C} \in \mathcal{P}} \mathbb{I}\{(X_i, X_j) \in \mathcal{C}^2\}$
Graph inference (link prediction)
Ranking
The empirical criterion is an average over all possible $k$-tuples: a U-statistic of degree $k \ge 2$

17-20 Example: Ranking
Data with ordinal label: $(X_1, Y_1), \ldots, (X_n, Y_n) \in (\mathcal{X} \times \{1, \ldots, K\})^n$
Goal: rank $X_1, \ldots, X_n$ through a scoring function $s : \mathcal{X} \to \mathbb{R}$ such that $s(X)$ and $Y$ tend to increase/decrease together with high probability
Quantitative formulation: maximize the criterion $L(s) = \mathbb{P}\{s(X^{(1)}) < \ldots < s(X^{(K)}) \mid Y^{(1)} = 1, \ldots, Y^{(K)} = K\}$
Observations: $n_k$ i.i.d. copies of $X$ given $Y = k$: $X^{(k)}_1, \ldots, X^{(k)}_{n_k}$, with $n = n_1 + \ldots + n_K$

21-23 Example: Ranking
A natural empirical counterpart of $L(s)$ is $\widehat{L}_n(s) = \frac{1}{n_1 \cdots n_K} \sum_{i_1=1}^{n_1} \cdots \sum_{i_K=1}^{n_K} \mathbb{I}\left\{ s(X^{(1)}_{i_1}) < \ldots < s(X^{(K)}_{i_K}) \right\}$
But the number of terms to be summed, $n_1 \times \ldots \times n_K$, is prohibitive!
Maximization of $\widehat{L}_n(s)$ is computationally unfeasible...
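
To make the combinatorial obstacle concrete, here is a brute-force computation of the complete criterion for $K = 3$ small samples, summing the indicator kernel over all $n_1 n_2 n_3$ tuples; the scoring function `s` and the sample sizes are illustrative assumptions, and the tuple count already explodes for moderate $n_k$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n_k = [30, 30, 30]                       # class-conditional sample sizes
# X^(k): observations with Y = k; a higher class index shifts the mean upward.
samples = [rng.normal(loc=k, scale=1.0, size=n) for k, n in enumerate(n_k, start=1)]

s = lambda x: x                          # scoring function s: X -> R

# Complete empirical criterion: average of 1{s(x_1) < s(x_2) < s(x_3)}
# over all n_1 * n_2 * n_3 = 27,000 tuples -- already costly here, and
# prohibitive as soon as the n_k reach realistic sizes.
total = 0
for x1, x2, x3 in itertools.product(*samples):
    total += s(x1) < s(x2) < s(x3)
L_hat = total / np.prod(n_k)
print(f"complete L_hat(s) = {L_hat:.3f} over {np.prod(n_k)} tuples")
```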

24 Generalized U-statistics
$K \ge 1$ samples and degrees $(d_1, \ldots, d_K) \in \mathbb{N}^K$
$(X^{(k)}_1, \ldots, X^{(k)}_{n_k})$, $1 \le k \le K$: $K$ independent i.i.d. samples drawn from $F_k(dx)$ on $\mathcal{X}_k$ respectively
Kernel $H : \mathcal{X}_1^{d_1} \times \cdots \times \mathcal{X}_K^{d_K} \to \mathbb{R}$, square integrable w.r.t. $\mu = F_1^{\otimes d_1} \otimes \cdots \otimes F_K^{\otimes d_K}$

25 Generalized U-statistics
Definition. The $K$-sample U-statistic of degrees $(d_1, \ldots, d_K)$ with kernel $H$ is
$U_n(H) = \frac{1}{\binom{n_1}{d_1} \cdots \binom{n_K}{d_K}} \sum_{I_1} \cdots \sum_{I_K} H(X^{(1)}_{I_1}; X^{(2)}_{I_2}; \ldots; X^{(K)}_{I_K})$,
where $\sum_{I_k}$ refers to summation over all subsets $X^{(k)}_{I_k} = (X^{(k)}_{i_1}, \ldots, X^{(k)}_{i_{d_k}})$ related to a set $I_k$ of $d_k$ indexes $1 \le i_1 < \ldots < i_{d_k} \le n_k$.
It is said to be symmetric when $H$ is permutation symmetric in each set of $d_k$ arguments $X^{(k)}_{I_k}$.
Reference: Lee (1990)

26 Generalized U-statistics
Unbiased estimator of $\theta(H) = \mathbb{E}[H(X^{(1)}_1, \ldots, X^{(1)}_{d_1}, \ldots, X^{(K)}_1, \ldots, X^{(K)}_{d_K})]$ with minimum variance
Asymptotically Gaussian as $n_k / n \to \lambda_k > 0$ for $k = 1, \ldots, K$
Its computation requires the summation of $\prod_{k=1}^K \binom{n_k}{d_k}$ terms
$K$-partite ranking: $d_k = 1$ for $1 \le k \le K$, with $H_s(x_1, \ldots, x_K) = \mathbb{I}\{s(x_1) < s(x_2) < \cdots < s(x_K)\}$

27 Incomplete U-statistics
Replace $U_n(H)$ by an incomplete version, involving far fewer terms
Build a set $\mathcal{D}_B$ of cardinality $B$ by sampling with replacement in the set $\Lambda$ of index tuples $((i^{(1)}_1, \ldots, i^{(1)}_{d_1}), \ldots, (i^{(K)}_1, \ldots, i^{(K)}_{d_K}))$ with $1 \le i^{(k)}_1 < \ldots < i^{(k)}_{d_k} \le n_k$, $1 \le k \le K$
Compute the Monte-Carlo version based on $B$ terms: $\widetilde{U}_B(H) = \frac{1}{B} \sum_{(I_1, \ldots, I_K) \in \mathcal{D}_B} H(X^{(1)}_{I_1}, \ldots, X^{(K)}_{I_K})$
An incomplete U-statistic is NOT a U-statistic
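
A sketch of the corresponding incomplete U-statistic for the $K$-partite ranking kernel: $B$ index tuples are drawn with replacement and the kernel is averaged over those $B$ terms only, instead of over all $n_1 \cdots n_K$ tuples. The sample sizes, the scoring function and the budget $B$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
B = 2_000                                 # Monte-Carlo budget, B = O(n)
n_k = [500, 500, 500]
samples = [rng.normal(loc=k, size=n) for k, n in enumerate(n_k, start=1)]

s = lambda x: x                           # scoring function

# Incomplete U-statistic: draw B tuples (i_1, i_2, i_3) with replacement
# instead of summing over all n_1 * n_2 * n_3 = 1.25e8 terms.
idx = [rng.integers(0, n, size=B) for n in n_k]
kernel_values = (s(samples[0][idx[0]]) < s(samples[1][idx[1]])) & \
                (s(samples[1][idx[1]]) < s(samples[2][idx[2]]))
U_B = kernel_values.mean()
print(f"incomplete estimate U_B = {U_B:.3f} from B = {B} of {np.prod(n_k):,} tuples")
```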

28 ERM based on incomplete U-statistics
Replace the criterion by a tractable incomplete version based on $B = O(n)$ terms: $\min_{H \in \mathcal{H}} \widetilde{U}_B(H)$
This leads to investigating the maximal deviations $\sup_{H \in \mathcal{H}} |\widetilde{U}_B(H) - U_n(H)|$

29 Main Result
Theorem. Let $\mathcal{H}$ be a VC major class of bounded symmetric kernels of finite VC dimension $V < +\infty$. Set $M_{\mathcal{H}} = \sup_{(H,x) \in \mathcal{H} \times \mathcal{X}} |H(x)|$. Then,
(i) $\mathbb{P}\left\{ \sup_{H \in \mathcal{H}} |\widetilde{U}_B(H) - U_n(H)| > \eta \right\} \le 2(1 + \#\Lambda)^V e^{-B \eta^2 / M_{\mathcal{H}}^2}$
(ii) for all $\delta \in (0,1)$, with probability at least $1 - \delta$, we have:
$\frac{1}{M_{\mathcal{H}}} \sup_{H \in \mathcal{H}} \left| \widetilde{U}_B(H) - \mathbb{E}[\widetilde{U}_B(H)] \right| \le 2 \sqrt{\frac{2 V \log(1 + \kappa)}{\kappa}} + \sqrt{\frac{\log(2/\delta)}{\kappa}} + \sqrt{\frac{V \log(1 + \#\Lambda) + \log(4/\delta)}{B}}$,
where $\kappa = \min\{\lfloor n_1/d_1 \rfloor, \ldots, \lfloor n_K/d_K \rfloor\}$

30 Consequences
Empirical risk sampling with $B = O(n)$ yields a rate bound of the order $O(\sqrt{\log n / n})$
One suffers no loss in terms of learning rate, while drastically reducing computational cost

31 Example: Ranking Empirical ranking performance for SVMrank based on 1%, 5%, 10%, 20% and 100% of the "LETOR 2007" dataset.

32 Sketch of Proof
Set $\epsilon = ((\epsilon_k(I))_{I \in \Lambda})_{1 \le k \le B}$, where $\epsilon_k(I)$ equals 1 if the tuple $I = (I_1, \ldots, I_K)$ has been selected at the $k$-th draw and 0 otherwise
The $\epsilon_k$'s are i.i.d. random vectors
For all $(k, I) \in \{1, \ldots, B\} \times \Lambda$, the r.v. $\epsilon_k(I)$ has a Bernoulli distribution with parameter $1/\#\Lambda$
With these notations, $\widetilde{U}_B(H) - U_n(H) = \frac{1}{B} \sum_{k=1}^B Z_k(H)$, where $Z_k(H) = \sum_{I \in \Lambda} (\epsilon_k(I) - 1/\#\Lambda) H(X_I)$
Freezing the $X_I$'s, by virtue of Sauer's lemma: $\#\{(H(X_I))_{I \in \Lambda} : H \in \mathcal{H}\} \le (1 + \#\Lambda)^V$

33 Sketch of Proof (continued)
Conditioned upon the $X_I$'s, $Z_1(H), \ldots, Z_B(H)$ are independent
The first assertion is thus obtained by applying Hoeffding's inequality combined with the union bound
Set $V_H\left(X^{(1)}_1, \ldots, X^{(1)}_{n_1}, \ldots, X^{(K)}_1, \ldots, X^{(K)}_{n_K}\right) = \frac{1}{\kappa} \Big[ H\left(X^{(1)}_1, \ldots, X^{(1)}_{d_1}, \ldots, X^{(K)}_1, \ldots, X^{(K)}_{d_K}\right) + H\left(X^{(1)}_{d_1+1}, \ldots, X^{(1)}_{2d_1}, \ldots, X^{(K)}_{d_K+1}, \ldots, X^{(K)}_{2d_K}\right) + \cdots + H\left(X^{(1)}_{(\kappa-1)d_1+1}, \ldots, X^{(1)}_{\kappa d_1}, \ldots, X^{(K)}_{(\kappa-1)d_K+1}, \ldots, X^{(K)}_{\kappa d_K}\right) \Big]$

34 Sketch of Proof (continued)
The proof of the second assertion is based on the Hoeffding decomposition $U_n(H) = \frac{1}{n_1! \cdots n_K!} \sum_{\sigma_1 \in \mathfrak{S}_{n_1}, \ldots, \sigma_K \in \mathfrak{S}_{n_K}} V_H\left(X^{(1)}_{\sigma_1(1)}, \ldots, X^{(K)}_{\sigma_K(n_K)}\right)$
The concentration result is then obtained in a classical manner: convexity (Chernoff's bound), symmetrization, randomization, application of McDiarmid's bounded difference inequality

35 Beyond finite VC dimension
Challenge: develop probabilistic tools and complexity assumptions to investigate the concentration properties of collections of sums of weighted binomials $\widetilde{U}_B(H) - U_n(H) = \frac{1}{B} \sum_{k=1}^B Z_k(H)$, with $Z_k(H) = \sum_{I \in \Lambda} (\epsilon_k(I) - 1/\#\Lambda) H(X_I)$

36 Some references
Maximal Deviations of Incomplete U-statistics with Applications to Empirical Risk Sampling. S. Clémençon, S. Robbiano and J. Tressou (2013). In the Proceedings of the SIAM International Conference on Data Mining, Austin (USA).
Empirical processes in survey sampling. P. Bertail, E. Chautru and S. Clémençon (2013). Submitted.
A statistical view of clustering performance through the theory of U-processes. S. Clémençon (2014). Journal of Multivariate Analysis.
On Survey Sampling and Empirical Risk Minimization. P. Bertail, E. Chautru and S. Clémençon (2014). ISAIM 2014, Fort Lauderdale (USA).

37 Introduction
Investigate the binary classification problem in the statistical learning context.
Data are not stored in a central unit but processed by independent agents (processors).
Aim: not to find a consensus on a common classifier, but to find how to combine the local ones efficiently.
Solution: implement in an on-line and distributed manner.

38 Outline
Background
Proposed algorithm
Theoretical results
Improvement of agents selection
Numerical experiments

40 Learning problem
Observation r.v. $X \in \mathcal{X} \subset \mathbb{R}^n$ $\to$ sign$(H(X))$ $\to$ binary output $\hat{Y} \in \{-1, +1\}$
Given a training dataset $(\mathbf{X}, \mathbf{Y}) = (X_i, Y_i)_{i=1,\ldots,n}$ in high dimension $n$ and with unknown joint distribution...
...find the best prediction rule sign$(H^\star)$, i.e. the classifier function $H^\star = \arg\min_H P_e(H)$ minimizing the probability of error, where $P_e(H) = \mathbb{P}[-Y H(X) > 0] = \mathbb{E}[\mathbb{1}_{\{-Y H(X) > 0\}}]$
But the indicator $\mathbb{1}(\cdot)$ is not a differentiable function!

41 Learning problem
Majorize $\mathbb{E}[\mathbb{1}_{\{-Y H(X) > 0\}}]$ by a convex function (convex surrogate): $\mathbb{E}[\mathbb{1}_{\{-Y H(X) > 0\}}] \le \mathbb{E}[\varphi(-Y H(X))]$
How? Use a cost function $\varphi$ with appropriate properties.
Example: use the quadratic function $\varphi(u) = \frac{(u+1)^2}{2} : \mathbb{R} \to [0, +\infty)$

42 Learning problem
Observation r.v. $X \in \mathcal{X} \subset \mathbb{R}^n$ $\to$ sign$(H(X))$ $\to$ binary output $\hat{Y} \in \{-1, +1\}$
Given the training dataset $(\mathbf{X}, \mathbf{Y}) = (X_i, Y_i)_{i=1,\ldots,n}$ with unknown joint distribution, find the best prediction rule sign$(H^\star)$, i.e. the classifier function $H^\star = \arg\min_H R_\varphi(H)$ minimizing the risk function $R_\varphi(H) = \mathbb{E}[\varphi(-Y H(X))]$
When $\varphi(u) = \frac{(u+1)^2}{2}$, $H^\star$ coincides with the Bayes classifier!

43 Aggregation of local classifiers
Consider a classification device composed of a set $V$ of $N$ connected agents.
Each agent $v \in V$:
disposes of $\{(X_{1,v}, Y_{1,v}), \ldots, (X_{n_v,v}, Y_{n_v,v})\}$: $n_v$ independent copies of $(X, Y)$
selects a local soft classifier function from a parametric class $\{h_v(\cdot, \theta_v)\}$
Setting $\theta_v = (a_v, b_v)$, the global soft classifier is $H(x, \theta) = \sum_{v \in V} h_v(x, \theta_v)$, where $h_v(x, \theta_v) = a_v h_v(x, b_v)$ and $\theta = (\theta_1, \ldots, \theta_N)^T$
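
A small sketch of the aggregated soft classifier $H(x, \theta) = \sum_{v \in V} a_v h_v(x, b_v)$; the tanh-based base learners and the parameter values are illustrative assumptions, not the experimental setup of the talk.

```python
import numpy as np

N = 4                                          # number of agents

# theta_v = (a_v, b_v): weight a_v and offset b_v of agent v's local soft classifier.
a = np.array([0.5, 1.0, 0.2, 0.8])
b = np.array([-1.0, 0.0, 0.5, 2.0])

def h_local(x, a_v, b_v):
    """Local soft classifier h_v(x, theta_v) = a_v * tanh(x - b_v)."""
    return a_v * np.tanh(x - b_v)

def H_global(x):
    """Aggregated soft classifier H(x, theta) = sum_v h_v(x, theta_v)."""
    return sum(h_local(x, a[v], b[v]) for v in range(N))

x = 0.3
print(f"H(x) = {H_global(x):+.3f}  ->  predicted label {int(np.sign(H_global(x)))}")
```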

44 Problem statement
The problem can be summarized as follows:
given an observed data point $X$
obtain the best estimated label $\hat{Y}$ as sign$(H(X, \theta))$
where $\theta$ is computed from the optimization problem, using the training data $(\mathbf{X}, \mathbf{Y}) = (X_i, Y_i)_{i=1,\ldots,n}$: $\min_{\theta \in \Theta} R_\varphi(H(X, \theta))$

45-46 Problem statement
Approaches:
1. Agreement on a common decision rule [Tsitsiklis-84, Agarwal-10]: consensus approach
find an average consensus solution: $\theta = (\theta, \ldots, \theta)$
each agent uses the global classifier $H(X, \theta)$
2. Mixture of experts: cooperative approach
find the best aggregation solution: $\theta = (\theta_1, \ldots, \theta_N)$
each agent uses its local classifier $h_v(x, \theta_v)$
Example: set $b_v = 0$, $a_v \ge 0$ and $h_v : \mathcal{X} \to \{-1, +1\}$: the weak classifier $h_v(x, \theta_v) = a_v h_v(x)$

48 High rate distributed learning
Solve the minimization problem of the parametric risk function: $\min_{\theta \in \Theta} R_\varphi(H(X, \theta))$

49 High rate distributed learning
A standard distributed gradient descent iterative approach:
generates a sequence of estimated parameter vectors $(\theta_t)_{t \ge 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \ge 1}$
at each agent $v$ the update step writes: $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, \mathbb{E}\left[ Y \nabla_v h_v(X, \theta_{t,v}) \, \varphi'(-Y H(X, \theta_t)) \right]$
...but the joint distribution is unknown!

50 High rate distributed learning
A standard distributed and on-line gradient descent iterative approach:
generates a sequence of estimated parameter vectors $(\theta_t)_{t \ge 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \ge 1}$
each agent $v$ observes a pair $(X_{t+1,v}, Y_{t+1,v})$
at each agent $v$ the update step writes (the expectation is replaced by its empirical version): $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, Y_{t+1,v} \nabla_v h_v(X_{t+1,v}, \theta_{t,v}) \, \varphi'(-Y_{t+1,v} H(X_{t+1,v}, \theta_t))$
...but evaluating $H(X_{t+1,v}, \theta_t)$ is required at each $t$ and $v$!

51 High rate distributed learning - Example
At iteration $t$, each agent $v \in V$ has $(X_{t,v}, \theta_{t,v})$ and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$. (Figure: network of 4 connected agents.)

52 High rate distributed learning - Example
Each node $v$ sends its observation $X_{t,v}$ to all the other nodes. (Figure: node 1 broadcasting $X_{t,1}$ to nodes 2, 3 and 4.)

53 High rate distributed learning - Example
Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from all the other nodes. (Figure: node 1 receiving $h_2$, $h_3$ and $h_4$ evaluated at $X_{t,1}$.)

54 High rate distributed learning - Example
Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from all the other nodes and computes the global $H(X_{t,1}, \theta_t) = \sum_{w=1}^4 h_w(X_{t,1}, \theta_{t,w})$.
$N(N-1)$ communications per iteration: $N = 4 \Rightarrow 12$!

55 Proposed distributed learning: OLGA algorithm
Replace the global $H(X_{t+1,v}, \theta_t)$ by a local estimate $\hat{Y}^{(V)}_{t+1,v}$ at each $v \in V$ such that: $\mathbb{E}[\hat{Y}^{(V)}_{t+1,v} \mid X_{t+1,v}, \theta_t] = H(X_{t+1,v}, \theta_t)$
How? Sparse communications with sparsity ratio $p$...
On-line Learning Gossip Algorithm (OLGA): for each $v \in V$ at time $t$, the local gradient descent update writes: $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, Y_{t+1,v} \nabla_v h_v(X_{t+1,v}, \theta_{t,v}) \, \varphi'(-Y_{t+1,v} \hat{Y}^{(V)}_{t+1,v})$
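
The following is only a plausible single-machine simulation of one OLGA iteration, assuming the quadratic surrogate $\varphi(u) = (u+1)^2/2$ (so $\varphi'(u) = u + 1$) and scalar tanh base learners: each agent forms the unbiased local estimate $\hat{Y}$ by adding its own output to the outputs received from a random subset of the other agents rescaled by $1/p$, then takes one stochastic gradient step on $\theta_v = (a_v, b_v)$. The network size, base learners and step size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, gamma = 4, 1.0 / 3.0, 0.05            # number of agents, sparsity ratio, step size

a = rng.normal(size=N)                      # local parameters theta_v = (a_v, b_v)
b = rng.normal(size=N)

def h(v, x):
    """Local soft classifier h_v(x, theta_v) = a_v * tanh(x - b_v)."""
    return a[v] * np.tanh(x - b[v])

def phi_prime(u):
    """Derivative of the quadratic surrogate phi(u) = (u + 1)^2 / 2."""
    return u + 1.0

def olga_step(x, y):
    """One OLGA iteration: every agent observes (x, y) and updates its theta_v."""
    outputs = np.array([h(v, x) for v in range(N)])
    for v in range(N):
        # Sparse gossip: agent v hears each other agent w with probability p and
        # rescales by 1/p, so that E[Y_hat | x, theta] = H(x, theta) = sum_w h_w.
        heard = rng.uniform(size=N) < p
        weights = np.where(np.arange(N) == v, 1.0, heard / p)
        y_hat = float(np.sum(weights * outputs))
        # Local stochastic gradient step on theta_v = (a_v, b_v).
        scale = gamma * y * phi_prime(-y * y_hat)
        tanh_v = np.tanh(x - b[v])
        a[v] += scale * tanh_v                          # d h_v / d a_v
        b[v] += scale * (-a[v] * (1.0 - tanh_v ** 2))   # d h_v / d b_v

olga_step(x=0.7, y=+1)
print("updated a:", np.round(a, 3))
print("updated b:", np.round(b, 3))
```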

56 Proposed distributed learning: OLGA algorithm - Example
At iteration $t$, each agent $v \in V$ has $(X_{t,v}, \theta_{t,v})$ and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$. (Figure: network of 4 connected agents.)

57 Proposed distributed learning: OLGA algorithm - Example
Each node $v$ sends its observation $X_{t,v}$ to randomly selected nodes, each selected with probability $p$. (Figure: node 1 sending $X_{t,1}$ to a random subset of the other nodes.)

58 Proposed distributed learning: OLGA algorithm - Example
Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from the randomly selected nodes. (Figure: node 1 receiving only $h_4$ evaluated at $X_{t,1}$.)

59 Proposed distributed learning: OLGA algorithm - Example
Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from the chosen nodes and computes its local estimate: $\hat{Y}^{(V)}_{t,1} = h_1(X_{t,1}, \theta_{t,1}) + \frac{1}{p} h_4(X_{t,1}, \theta_{t,4})$
$pN(N-1)$ communications per iteration: $N = 4$, $p = 1/3 \Rightarrow 4$ (a reduction of 67%)!

60 Performance analysis
What is the effect of sparsification? Study the behaviour of the vector sequence $\theta_t$ as $t \to \infty$:
the consistency of the final solution given by the algorithm
quantify the excess of error variance due to the sparsity

62 Asymptotic behaviour of OLGA
Under suitable assumptions, we prove the following results:
1. Consistency: $(\theta_t)_{t \ge 1} \to \theta^\star \in \mathcal{L} = \{\theta : \nabla R_\varphi(\theta) = 0\}$ a.s.
2. CLT: conditioned on the event $\{\lim_{t \to \infty} \theta_t = \theta^\star\}$, $\gamma_t^{-1/2} (\theta_t - \theta^\star) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \Sigma(\Gamma^\star))$, where
$\Gamma^\star = \underbrace{\mathbb{E}\left[(H(X, \theta^\star) - Y)^2 \, \nabla_v h_v(X, \theta_v^\star) \, \nabla_v h_v(X, \theta_v^\star)^T\right]}_{\text{estimation error in a centralized case}} + \underbrace{\frac{1-p}{p} \sum_{w \neq v} \mathbb{E}\left[h_w(X, \theta_w^\star)^2 \, \nabla_v h_v(X, \theta_v^\star) \, \nabla_v h_v(X, \theta_v^\star)^T\right]}_{\text{additional noise term induced by the distributed setting}}$

64 A best agents selection approach
When the number of agents $N$ grows, the algorithm becomes difficult to implement; redundant agents produce similar outputs, which should be avoided... so include distributed agent selection!
How? Add an $\ell_1$-penalization term with tuning parameter $\lambda$: $\min_{\theta \in \Theta} R_\varphi(H(X, \theta)) + \lambda \sum_v |a_v|$
where the weight $a_v = 0$ for an idle agent and $a_v > 0$ when it is active

65 Including best agents selection to OLGA algorithm
Introduce an update step at each time $t$ of OLGA to seek the time-varying set of active nodes $S_t \subseteq V$.

66 Including best agents selection to OLGA algorithm
The extended algorithm is summarized as follows, at time $t$:
1. obtain the active nodes $S_t$ from the sequence of updated weights $(a_{t,1}, \ldots, a_{t,N})$
2. apply OLGA to the set of active agents $v \in S_t$:
i) estimate the local $\hat{Y}^{(S_t)}_{t+1,v}$ from a random selection among the current active nodes
ii) update the local gradient descent: $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, Y_{t+1,v} \nabla_v h_v(X_{t+1,v}, \theta_{t,v}) \, \varphi'(-Y_{t+1,v} \hat{Y}^{(S_t)}_{t+1,v})$
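
The slides do not spell out how the weights $a_v$ are updated under the $\ell_1$ penalty, so the sketch below shows only one standard possibility: a soft-thresholding (proximal) step applied to the weights after a gradient update, with the active set $S_t$ defined as the agents with nonzero weight. The example weights, $\lambda$ and the step size are assumptions.

```python
import numpy as np

def soft_threshold(a, tau):
    """Proximal operator of tau * ||a||_1: shrinks each weight toward zero."""
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

# Example weights after a gradient step (illustrative values only).
a = np.array([0.90, 0.03, 0.02, 0.41, 0.008, 0.55])
lam, gamma = 0.1, 0.5                       # penalization parameter and step size

a = soft_threshold(a, lam * gamma)          # l1-proximal update of the weights
S_t = np.flatnonzero(a != 0.0)              # time-varying set of active agents
print("active agents:", S_t, "weights:", np.round(a, 3))
```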

68 Example with simulated data
Binary classification of (+) and (o) data samples with $N = 60$ agents using weak linear classifiers. When using distributed selection, the ensemble reduces to 25 active classifiers. (Figures: (a) OLGA, (b) OLGA with distributed selection.)

69 Comparison with real data
Binary classification of the benchmark dataset banana using weak linear classifiers with increasing $N$. (Figure: error rate vs. number of weak learners, comparing a centralized and sequential approach, GentleBoost, with our distributed and on-line algorithm, OLGA, $p = 0.6$.)

70 Conclusions
A fully distributed and on-line algorithm is proposed for binary classification of big datasets solved by $N$ processors; the algorithm is then adapted to select the useful classifiers, reducing $N$.
We obtain theoretical results from the asymptotic analysis of the sequence estimated by OLGA.
Numerical results illustrate a behaviour comparable to a centralized, batch and sequential approach (GentleBoost).
