arXiv:1506.08910v1 [stat.ML] 30 Jun 2015


Learning Single Index Models in High Dimensions

Ravi Ganti¹, Nikhil Rao², Rebecca M. Willett³ and Robert Nowak³

¹ Wisconsin Institutes for Discovery, 330 N Orchard St, Madison, WI, 53715
² Department of Computer Science, University of Texas at Austin, 78712
³ Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI, 53706

Abstract

Single Index Models (SIMs) are simple yet flexible semi-parametric models for classification and regression. Response variables are modeled as a nonlinear, monotonic function of a linear combination of features. Estimation in this context requires learning both the feature weights and the nonlinear function. While methods have been described to learn SIMs in the low dimensional regime, a method that can efficiently learn SIMs in high dimensions has not been forthcoming. We propose three variants of a computationally and statistically efficient algorithm for SIM inference in high dimensions. We establish excess risk bounds for the proposed algorithms and experimentally validate the advantages that our SIM learning methods provide relative to Generalized Linear Model (GLM) and low dimensional SIM based learning methods.

1 Introduction

High-dimensional learning is often tackled using generalized linear models, where we assume that a response variable $Y \in \mathbb{R}$ is related to a feature vector $X \in \mathbb{R}^d$ via

$$\mathbb{E}[Y \mid X = x] = g_\star(w_\star^\top x) \qquad (1)$$

for some weight vector $w_\star \in \mathbb{R}^d$ and some monotonic and smooth function $g_\star$ called the transfer function. Typical examples of $g_\star$ are the logit function and the probit function for classification, and the linear function for regression. While classical work on generalized linear models (GLMs) assumes $g_\star$ is known, this potentially nonlinear function is often unknown and hence a major challenge in statistical inference. The model in (1) with $g_\star$ unknown is called a Single Index Model (SIM) and is a powerful semi-parametric generalization of a GLM. SIMs were first introduced in econometrics and statistics [3, 4, 2].
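As a concrete illustration of model (1), the following minimal sketch draws synthetic data from a SIM with a sparse weight vector. The function name `sample_sim`, the logistic choice of transfer function, the standard Gaussian features, and the Bernoulli labels are all assumptions made for this example; the paper itself only requires a monotonic, Lipschitz transfer function.

```python
import math
import random


def logistic(t):
    """A monotonic, Lipschitz transfer function (illustrative choice of g)."""
    return 1.0 / (1.0 + math.exp(-t))


def sample_sim(n, d, s, g=logistic, seed=0):
    """Draw n samples (x_i, y_i) with E[y | x] = g(w^T x) and an s-sparse w.

    Hypothetical helper: w has s equal non-zero entries and unit Euclidean norm,
    features are i.i.d. standard Gaussian, labels are Bernoulli with mean g(w^T x).
    """
    rng = random.Random(seed)
    support = rng.sample(range(d), s)
    w = [0.0] * d
    for j in support:
        w[j] = 1.0 / math.sqrt(s)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        mean = g(sum(w[j] * x[j] for j in support))
        y.append(1.0 if rng.random() < mean else 0.0)
        X.append(x)
    return X, y, w
```

Calling `sample_sim(n=50, d=1000, s=5)` gives a small instance of the regime with far more features than samples.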
Author emails: gantimahapat@wisc.edu, nikhilr@cs.utexas.edu, rmwillett@wisc.edu, rdnowak@wisc.edu

Recently, computationally and statistically efficient algorithms have been provided for learning SIMs [6, 5] in low-dimensional settings where the number of samples/observations $n$ is larger than the ambient dimension $d$. However, modern data analysis problems in machine learning, signal processing, and computational biology involve high dimensional datasets, where the number of parameters far exceeds the number of samples ($n \ll d$). In this paper we consider the problem of learning SIMs, given labeled data, in the high-dimensional regime. We provide algorithms that are both computationally and statistically efficient for learning SIMs in high dimensions, and validate our methods on several high dimensional datasets. Our contributions in this paper can be summarized as follows:

1. We propose a suite of algorithms to learn SIMs in high dimensions. Our simplest algorithm, called SILO (Single Index Lasso Optimization), is a simple, non-iterative method that estimates the vector $w_\star$ and a monotonic, Lipschitz function $g_\star$. iSILO and ciSILO are iterative variants of SILO that use different loss functions. While iSILO uses a squared loss function, ciSILO uses a calibrated loss function that adapts to the SIM from which our data is generated.
2. We provide excess risk bounds on the hypotheses returned by SILO, iSILO, and ciSILO.
3. We experimentally compare our algorithms with other methods used both for SIM learning and high dimensional parameter estimation on various real world high dimensional datasets. Our experimental results show superior performance of iSILO and ciSILO when compared to commonly used methods for high dimensional estimation.

The rest of the paper is organized as follows: In Section (2), we formally set up the problem we wish to solve, and detail the proposed methods, SILO, iSILO, and ciSILO. In Section (3), we perform a theoretical analysis of SILO, iSILO, and ciSILO. We perform a thorough empirical evaluation on several datasets in Section (4), and conclude our paper in Section (5). Full proofs of our theoretical analysis are available in the appendix.

1.1 Related work

High dimensional parameter estimation for GLMs has been widely studied, both from a theoretical and algorithmic point of view ([15, 7, 9] and references therein). Learning SIMs is a harder problem and was first introduced in econometrics [4] and statistics [3]. In [6] the authors proposed and analyzed the Isotron algorithm to learn SIMs in the low dimensional setting. Isotron uses perceptron type updates to learn $w_\star$, along with application of the Pool Adjacent Violator (PAV) algorithm to learn $g_\star$. This was improved in [5], where the authors proposed the Slisotron algorithm that combined perceptron updates to learn $w_\star$ with a Lipschitz PAV (LPAV) procedure to learn $g_\star$. Both the Isotron and the Slisotron algorithms rely on performing perceptron updates.
While the perceptron algorithm works for low-dimensional classification problems, to the best of our knowledge the performance of the perceptron algorithm has not been studied in high dimensions. Hence, it is not clear if the Isotron and the Slisotron algorithms, designed for learning SIMs in low dimensions, would work in the high dimensional setting. Alquier and Biau [1] consider learning high dimensional single index models. The authors provide estimators of $g_\star$, $w_\star$ using PAC-Bayesian analysis. However, the estimator relies on reversible jump MCMC, and it is seemingly hard to implement. Also, the MCMC step is slow to converge even for moderately sized problems. To the best of our knowledge, simple, practical algorithms with theoretical guarantees and good empirical performance for learning single index models in high dimensions are not available. Restricted versions of the SIM estimation problem have been considered in [11, 12], where the authors are only interested in accurate parameter estimation and not prediction. Hence, in these works the proposed algorithms do not learn the transfer function.

The LPAV: Before we discuss algorithms for learning high dimensional SIMs, we discuss the LPAV algorithm proposed in [5] as an extension to the PAV method used in [6]. Given data $(p_1, y_1), \ldots, (p_n, y_n)$, where $p_1, \ldots, p_n \in \mathbb{R}$, the LPAV outputs the best univariate monotonic, 1-Lipschitz function $\hat g$ that minimizes the squared error $\sum_{i=1}^n (g(p_i) - y_i)^2$. In order to do this, the LPAV first solves the following optimization problem:

$$\hat z = \arg\min_{z \in \mathbb{R}^n} \|z - y\|_2^2 \quad \text{s.t.} \quad 0 \le z_j - z_i \le p_j - p_i \ \text{ if } p_i \le p_j \qquad (2)$$

where $\hat g(p_i) = \hat z_i$. This gives us the value of $\hat g$ on a discrete set of points $p_1, \ldots, p_n$. To get $\hat g$ everywhere else on the real line, we simply perform linear interpolation as follows: Sort $p_i$ for all $i$ and let $p_{\{i\}}$ be the $i$-th entry after sorting, with $\hat z_{\{i\}}$ the corresponding entry of $\hat z$. Then, for any $\nu \in \mathbb{R}$, we have

$$\hat g(\nu) = \begin{cases} \hat z_{\{1\}}, & \text{if } \nu \le p_{\{1\}} \\ \hat z_{\{n\}}, & \text{if } \nu \ge p_{\{n\}} \\ \mu \hat z_{\{i\}} + (1-\mu)\hat z_{\{i+1\}}, & \text{if } \nu = \mu p_{\{i\}} + (1-\mu) p_{\{i+1\}} \end{cases} \qquad (3)$$

In the algorithms that we shall discuss in this paper, we shall invoke the LPAV routine with $p_i$ set to the projection of the data point $x_i$ on some algorithm-dependent weight vector $w$.
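The interpolation rule (3) and the constraints of problem (2) can be sketched in a few lines. The names `lpav_feasible` and `make_g_hat` are hypothetical, and the sketch does not solve the quadratic program (2) itself: it assumes a fitted vector `z` is already available, checks that it satisfies the monotonicity and 1-Lipschitz constraints, and extends it to the whole real line.

```python
from bisect import bisect_right


def lpav_feasible(p, z, tol=1e-12):
    """Check 0 <= z_j - z_i <= p_j - p_i whenever p_i <= p_j.

    Checking consecutive points in sorted order is enough, because the
    constraints for any other pair follow by summing consecutive gaps.
    """
    order = sorted(range(len(p)), key=lambda i: p[i])
    for a, b in zip(order, order[1:]):
        gap_z = z[b] - z[a]
        gap_p = p[b] - p[a]
        if gap_z < -tol or gap_z > gap_p + tol:
            return False
    return True


def make_g_hat(p, z):
    """Return the function of equation (3): constant outside the sorted range
    of p, and linear between neighbouring sorted points."""
    pairs = sorted(zip(p, z))
    ps = [a for a, _ in pairs]
    zs = [b for _, b in pairs]

    def g_hat(v):
        if v <= ps[0]:
            return zs[0]
        if v >= ps[-1]:
            return zs[-1]
        i = bisect_right(ps, v) - 1  # ps[i] <= v < ps[i + 1]
        mu = (ps[i + 1] - v) / (ps[i + 1] - ps[i])
        return mu * zs[i] + (1.0 - mu) * zs[i + 1]

    return g_hat
```

Because the fitted values are monotone and 1-Lipschitz on the sample points, the piecewise linear extension keeps both properties everywhere.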

2 Statistical model and proposed algorithms

Assume we are provided i.i.d. data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where the label $Y$ is generated according to the model $\mathbb{E}[Y \mid X = x] = g_\star(w_\star^\top x)$ for an unknown parameter vector $w_\star \in \mathbb{R}^d$ and an unknown 1-Lipschitz, monotonic function $g_\star$. We additionally assume that $y \in [0, 1]$, $\|w_\star\|_2 \le 1$ and $\|w_\star\|_0 \le s$, where $\|\cdot\|_0$ is the $\ell_0$ pseudo-norm. The sparsity assumption on $w_\star$ is motivated by the fact that consistent estimation in high dimensions is an ill-posed problem without making further structural assumptions on the underlying parameters. Our goal is to make predictions on unseen data. Specifically, we would like to provide estimators $\hat g$ and $\hat w$ of $g_\star$ and $w_\star$ so that, given a previously unseen sample $x$, we predict $\hat y = \hat g(\hat w^\top x)$. To this end, we propose three algorithms that we explain next.

2.1 SILO: Single Index Lasso Optimization

We first propose SILO, a simple SIM learning algorithm that first learns $\hat w$ and then fits a function $\hat g$ using $\hat w$. Specifically, SILO performs the following two steps in a single pass:

1. In order to learn $\hat w$ we solve the problem that was first proposed in [10]. This optimization problem is independent of the transfer function $g_\star$ and minimizes a linear loss subject to model constraints:

$$\hat w = \arg\min_{w:\ \|w\|_2 \le 1,\ \|w\|_1 \le \sqrt{s}} \ -\frac{1}{n}\sum_{i=1}^n y_i x_i^\top w, \qquad (4)$$

where the constraint $\|w\|_1 \le \sqrt{s}$ arises from constraining an $s$-sparse vector to have unit Euclidean norm.

2. After learning $\hat w$, SILO simply fits a 1-Lipschitz monotonic function by invoking the LPAV routine with the vector $p = [p_1, \ldots, p_n]$, where $p_i = \hat w^\top x_i$. LPAV outputs a function $\hat g$. Our final predictor has the form $\hat y = \hat g(\hat w^\top x)$.

Note that there is no need to re-learn $\hat w$ after learning $\hat g$, since the optimization problem to learn $\hat w$ is independent of $\hat g$. This property makes SILO a very simple and computationally attractive algorithm.

2.2 iSILO: Iterative SILO with squared loss

SILO is computationally very efficient, since it only involves learning $\hat w$, $\hat g$ once. However, completely ignoring $\hat g$ to learn $\hat w$ could be suboptimal, and we propose two algorithms to overcome this drawback.
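The first SILO step of Section 2.1 optimizes a linear function of $w$ over the intersection of an $\ell_2$ ball and an $\ell_1$ ball. A minimal sketch follows, under the assumption (not stated in the paper) that one uses the standard characterization of such problems: the solution is a soft-thresholded and renormalized copy of the correlation vector $v = \frac{1}{n}\sum_i y_i x_i$, with the threshold found by bisection. The names `linear_max_l1_l2` and `silo_direction` are hypothetical.

```python
import math


def soft_threshold(v, delta):
    """Shrink every coordinate of v towards zero by delta."""
    return [math.copysign(max(abs(a) - delta, 0.0), a) for a in v]


def linear_max_l1_l2(v, c, iters=60, tol=1e-12):
    """Maximize <v, w> subject to ||w||_2 <= 1 and ||w||_1 <= c (with c >= 1).

    Assumed closed form: w = S(v, delta) / ||S(v, delta)||_2, where delta = 0 if
    that point already satisfies the l1 constraint, and otherwise delta is the
    smallest threshold that makes it feasible (located here by bisection).
    Ties among the largest entries of |v| with a very small c are not handled.
    """
    def normalized(delta):
        u = soft_threshold(v, delta)
        norm = math.sqrt(sum(a * a for a in u))
        return [a / norm for a in u] if norm > 0 else u

    w = normalized(0.0)
    if sum(abs(a) for a in w) <= c + tol:
        return w
    lo, hi = 0.0, max(abs(a) for a in v)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        w = normalized(mid)
        if any(w) and sum(abs(a) for a in w) <= c + tol:
            hi = mid  # feasible: try a smaller threshold
        else:
            lo = mid  # l1 norm still too large: shrink more
    return normalized(hi)


def silo_direction(X, y, s):
    """Step 1 of SILO under the assumption above: correlate, then project."""
    n, d = len(X), len(X[0])
    v = [sum(y[i] * X[i][j] for i in range(n)) / n for j in range(d)]
    return linear_max_l1_l2(v, math.sqrt(s))
```

A general-purpose convex solver would serve equally well here; the bisection is used only to keep the sketch dependency-free.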
We first propose iSILO, an iterative method detailed in Algorithm 1. Given the model in (1), iSILO minimizes the squared loss with a sparsity penalty to estimate $\hat w$, $\hat g$:

$$\hat w, \hat g = \arg\min_{w, g} \ \frac{1}{n}\sum_{i=1}^n \big(y_i - g(w^\top x_i)\big)^2 + \gamma \|w\|_1. \qquad (5)$$

We adopt an alternating minimization procedure. In iteration $t$, given $g_{t-1}$, we would ideally perform a proximal point update w.r.t. $w$ to obtain

$$w_t = \mathrm{Prox}_{\gamma\eta, \|\cdot\|_1}\!\left(w_{t-1} - \frac{\eta}{n}\sum_{i=1}^n \big(g_{t-1}(w_{t-1}^\top x_i) - y_i\big)\, g'_{t-1}(w_{t-1}^\top x_i)\, x_i\right)$$

where $\mathrm{Prox}(\cdot)$ is the soft thresholding operator associated with the $\|\cdot\|_1$ norm, $\gamma$ is the regularization parameter, $\eta > 0$ is an appropriate step size, and $g'_{t-1}$ is the derivative of $g_{t-1}$. Unfortunately, the above gradient step requires us to estimate the derivative of $g_{t-1}$, which can be difficult. So, instead of performing the above proximal gradient update, we perform a proximal perceptron type update similar in spirit to [6, 5], by replacing $g'_{t-1}$ by the Lipschitz constant of $g_{t-1}$. Since $g_{t-1}$ is obtained using the LPAV algorithm, $g_{t-1}$ is 1-Lipschitz. Note that unlike the perceptron, we have a non-unity step size. This leads to the following update equation:

$$w_t = \mathrm{Prox}_{\gamma\eta, \|\cdot\|_1}\!\left(w_{t-1} - \frac{\eta}{n}\sum_{i=1}^n \big(g_{t-1}(w_{t-1}^\top x_i) - y_i\big)\, x_i\right) \qquad (6)$$

Given $w_t$ in iteration $t$, iSILO updates $g_t$ to be the solution to the LPAV problem with $p_i = w_t^\top x_i$. The non-convexity of (5) requires us to perform a book-keeping procedure that keeps track of the best estimate of $\hat g$, $\hat w$ by calculating the MSE of the current hypothesis on a held-out validation set. This is done in steps 5-9 and 11-15 of Algorithm 1. Similar book-keeping procedures have been used in the Isotron and Slisotron algorithms of [6, 5].

Algorithm 1: iSILO
Require: Data: $X = [x_1, \ldots, x_n]$, Labels: $y = [y_1, \ldots, y_n]^\top$, Regularization: $\gamma$, Step size: $\eta$, Initial parameters: $g_0$ is a 1-Lipschitz, monotonic function, $w_0 \in \mathbb{R}^d$, Iterations: $T > 0$.
1: Initialize $\hat w = w_0$, $\hat g = g_0$.
2: opterr = MSE($w_0$, $g_0$)
3: for $t = 1, \ldots, T$ do
4: Perform the update shown in Equation (6) to get $w_t$.
5: Calculate err = MSE($w_t$, $g_{t-1}$).
6: if err $\le$ opterr then
7: opterr = err.
8: $\hat w = w_t$, $\hat g = g_{t-1}$
9: end if
10: Obtain $g_t$ by solving problem (2) with $p_i = w_t^\top x_i$ and linear interpolation (3)
11: Calculate err = MSE($w_t$, $g_t$).
12: if err $\le$ opterr then
13: opterr = err.
14: $\hat w = w_t$, $\hat g = g_t$
15: end if
16: end for
17: Output $\hat w$, $\hat g$

2.3 ciSILO: Iterative SILO with calibrated loss

iSILO, like the Slisotron algorithm [5], uses a squared loss function and an approximate gradient descent method to estimate $w_\star$. These methods do not take into account the derivative of the estimate of the transfer function while taking gradient descent steps. We now propose ciSILO, a version of SILO that uses a calibrated loss function that adapts to the SIM that we are trying to learn. Suppose $g_\star$ was known. Let $\Phi_\star : \mathbb{R} \to \mathbb{R}$ be a function such that $\Phi_\star' = g_\star$. Since $g_\star$ is monotonically increasing, $\Phi_\star$ is convex, and we can learn $\hat w$ by solving the following convex program:

$$\hat w := \arg\min_w \ \frac{1}{n}\sum_{i=1}^n \big[\Phi_\star(w^\top x_i) - y_i\, w^\top x_i\big] + \gamma \|w\|_1 \qquad (7)$$

When the transfer function is linear, $\Phi_\star$ is a quadratic function, and we obtain the standard Lasso problem that minimizes squared loss with an $\ell_1$ penalty. When the transfer function is the logit function, (7) reduces to sparse logistic regression. Modulo the $\ell_1$ penalty term, the above objective is a sample version of the following stochastic optimization problem:

$$\min_w \ \mathbb{E}\big[\Phi_\star(w^\top x) - y\, w^\top x\big]. \qquad (8)$$

If $\Phi_\star' = g_\star$, then the optimal solution to the above problem corresponds to the single index model that satisfies $\mathbb{E}[Y \mid X = x] = g_\star(w_\star^\top x)$. Hence the above calibrated loss function takes into account the transfer function $g_\star$ used in the SIM via $\Phi_\star$ and automatically adapts to the SIM from which the data is generated. When $g_\star$ is unknown, we instead consider the following optimization problem:

$$\hat w, \hat g = \arg\min_{w, g} \ \frac{1}{n}\sum_{i=1}^n \big[\Phi(w^\top x_i) - y_i\, w^\top x_i\big] + \gamma \|w\|_1 \quad \text{s.t.} \quad g = \Phi' \in \mathcal{G} \qquad (9)$$

where the set $\mathcal{G} = \{g : \mathbb{R} \to \mathbb{R} \text{ is a 1-Lipschitz, monotonic function}\}$. Note that the above optimization problem optimizes for $g$ via its integral $\Phi$. ciSILO solves the above optimization problem by iteratively minimizing over $w$ and $g$. The pseudo-code for ciSILO is given in Algorithm 2. The key update procedures performed in each iteration of ciSILO are explained below.

In Step 4, ciSILO fixes $g$ to $g_{t-1}$ and performs one step of a proximal point update on the objective in problem (9) w.r.t. $w$ to get:

$$w_t = \mathrm{Prox}_{\gamma\eta, \|\cdot\|_1}\!\left(w_{t-1} - \frac{\eta}{n}\sum_{i=1}^n \big(g_{t-1}(w_{t-1}^\top x_i) - y_i\big)\, x_i\right). \qquad (10)$$

This step is identical to the update step in iSILO, except that here $g'_{t-1}$ genuinely does not feature in the update: the gradient of the calibrated loss involves only $g_{t-1}$. Thus, the proximal point steps using a calibrated loss function can be performed exactly, unlike the proximal point steps in iSILO.

The use of a calibrated loss function brings with it another challenge: the LPAV procedure, which was designed to minimize the squared loss, can no longer be used in ciSILO to estimate $g_\star$. ciSILO instead uses a novel quadratic program to efficiently estimate $g_\star$. From the first order optimality conditions of the optimization problem (9) for $w$ at $w_t$, we get that the optimal function $g_t$ should satisfy

$$\frac{1}{n}\sum_{i=1}^n \big(g_t(w_t^\top x_i) - y_i\big)\, x_i + \gamma \beta_t = 0, \qquad \beta_t \in \partial \|w_t\|_1. \qquad (11)$$

$g_t$ is updated such that the L.H.S. of (11) has the smallest possible norm. This can be cast as a quadratic program (QP) as follows: Define $p = [p_1, \ldots, p_n]^\top$, where $p_i = w_t^\top x_i$, and $z = [z_1, \ldots, z_n]^\top$, where $z_i = g_t(p_i)$. Let $X = [x_1, \ldots, x_n]^\top$ be an $n \times d$ data matrix. Let $q = n\gamma\beta_t - X^\top y$. Now, solve the problem

$$\min_z \ \|X^\top z + q\|_2^2 \quad \text{s.t.} \quad 0 \le z_i \le 1 \ \ \forall i \quad \text{and} \quad 0 \le z_j - z_i \le p_j - p_i \ \text{ if } p_i \le p_j. \qquad (12)$$

We call optimization problem (12) QPFit. It differs from the LPAV in that it is derived from optimizing a calibrated loss function, which could be very different from the squared loss.

2.4 Initializing iSILO and ciSILO

Since both iSILO and ciSILO are non-convex, alternating minimization procedures, a good initialization is key to achieving good performance. A simple initialization would be to choose $w_0$ randomly and $g_0$ to be the identity function. However, we initialize both methods with $\hat w$, $\hat g$ obtained by running the (efficient) SILO algorithm from Section 2.1. We demonstrate in the next section that this yields very good theoretical guarantees, and in Section 4 that it yields good empirical performance.

Remarks: As in iSILO, we perform book-keeping steps in ciSILO too. Since obtaining exact or approximate gradients in iSILO and ciSILO is easy, we use first order methods to solve for $\hat w$. Using line search methods in ciSILO to compute step sizes would require evaluating the calibrated loss function. This can be computationally intensive, since we have access to the calibrated loss function only via its gradient. Hence, in iSILO and ciSILO we use a fixed step size to perform our updates. Despite the use of a fixed step size, we show empirically that iSILO is often as competitive as, and sometimes better at making predictions than, GLM based methods with optimal step sizes, and ciSILO is significantly superior.
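The proximal perceptron-type update shared by iSILO (Equation (6)) and ciSILO (Equation (10)) can be sketched as a gradient-like step followed by soft thresholding. This is an illustration under stated assumptions rather than the authors' implementation: the names `prox_step`, `reg` and `step` are hypothetical, and the soft-thresholding level is taken to be the product of the regularization weight and the step size, as in standard proximal gradient methods.

```python
import math


def soft_threshold(w, tau):
    """Proximal operator of tau * ||.||_1: shrink each coordinate by tau."""
    return [math.copysign(max(abs(a) - tau, 0.0), a) for a in w]


def prox_step(w, g, X, y, reg, step):
    """One update: w <- Prox(w - (step / n) * sum_i (g(w^T x_i) - y_i) x_i).

    g is the current transfer-function estimate (any callable); its derivative
    is never evaluated, which is the point of the perceptron-type update.
    """
    n, d = len(X), len(w)
    direction = [0.0] * d
    for x_i, y_i in zip(X, y):
        residual = g(sum(a * b for a, b in zip(w, x_i))) - y_i
        for j in range(d):
            direction[j] += residual * x_i[j] / n
    moved = [w_j - step * d_j for w_j, d_j in zip(w, direction)]
    return soft_threshold(moved, reg * step)
```

For instance, starting from `w = [0.0, 0.0]` with the identity as `g`, rows `[1, 0]` and `[0, 1]`, labels `[1.0, 0.1]`, `reg = 0.1` and `step = 1.0`, the step moves `w` to `[0.5, 0.05]` and thresholding then zeroes out the second coordinate.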

Algorithm 2: ciSILO
Require: Data: $X = [x_1, \ldots, x_n]$, Labels: $y = [y_1, \ldots, y_n]^\top$, Regularization parameter $\gamma$, step size $\eta$, Initial parameters: $w_0 \in \mathbb{R}^d$, $g_0 : \mathbb{R} \to \mathbb{R}$ is a 1-Lipschitz, monotonic function.
1: Initialize $\hat w = w_0$, $\hat g = g_0$.
2: opterr = MSE($w_0$, $g_0$)
3: for $t = 1, 2, \ldots, T$ do
4: Perform the update step shown in Equation (10) to obtain $w_t$.
5: Calculate err = MSE($w_t$, $g_{t-1}$).
6: if err $\le$ opterr then
7: opterr = err.
8: $\hat w = w_t$, $\hat g = g_{t-1}$
9: end if
10: Calculate: $p \leftarrow X w_t$, $\beta_t \in \partial \|w_t\|_1$, $q \leftarrow n\gamma\beta_t - X^\top y$
11: Obtain $g_t$ by solving problem (12) and linear interpolation.
12: Calculate err = MSE($w_t$, $g_t$).
13: if err $\le$ opterr then
14: opterr = err.
15: $\hat w = w_t$, $\hat g = g_t$
16: end if
17: end for
18: Output $\hat w$, $\hat g$

3 Theoretical analysis of SILO, iSILO and ciSILO

In this section, we analyze the excess risk of the predictors output by SILO, iSILO, and ciSILO. For a given hypothesis $\hat h(x) = \hat g(\hat w^\top x)$, define $\mathrm{err}(h) := \mathbb{E}(h(x) - y)^2$. The excess risk is then defined as

$$\mathcal{E}(\hat h) := \mathrm{err}(\hat h) - \mathrm{err}(h_\star) = \mathbb{E}(y - \hat h(x))^2 - \mathbb{E}(y - g_\star(w_\star^\top x))^2 \qquad (13)$$

We first list the technical assumptions we make:

A1. The data $x_1, \ldots, x_n$ is sampled i.i.d. from the standard multivariate Gaussian distribution.
A2. $\mathbb{E}[Y \mid X = x] = g_\star(w_\star^\top x)$, and $0 \le Y \le 1$.
A3. $g_\star$ is monotonic and 1-Lipschitz.
A4. $\|w_\star\|_0 \le s$, $\|w_\star\|_2 \le 1$, $\|\hat w\|_0 \le k$, and $k \ll d$.

We provide sketches of relevant results in this section, and refer the interested reader to the Appendix for detailed proofs. Our first main result is an excess risk bound for SILO:

Theorem 1. Let $\hat h(x) = \hat g(\hat w^\top x)$ be the hypothesis output by SILO. Let $\lambda = \mathbb{E}_{\mu \sim N(0,1)}\, g_\star(\mu)\mu > 0$. Then under assumptions A1-A4, the excess risk of the predictor $\hat h$ is, with probability at least $1 - \delta$, bounded from above by

$$\mathcal{E}(\hat h) = \tilde O\!\left(\frac{(s+k)\log(2d)}{\lambda}\sqrt{\frac{s}{n}} + \frac{\sqrt{s}\,\sqrt{(s+k)\log(2d)}}{\sqrt{n}}\right) \qquad (14)$$

where $\tilde O$ hides factors that are poly-logarithmic in $n$, $d$, $1/\delta$, $s$ and $k$.
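The constant in Theorem 1, the expectation of $g_\star(\mu)\mu$ under a standard Gaussian $\mu$, is a one-dimensional integral, so it can be evaluated numerically for any candidate transfer function. The sketch below uses a plain trapezoidal rule on a truncated range; the function names, the truncation to $[-10, 10]$, and the example transfer functions are assumptions made for illustration.

```python
import math


def gaussian_expectation(f, lo=-10.0, hi=10.0, steps=20000):
    """Trapezoidal approximation of E[f(mu)] for mu ~ N(0, 1)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        mu = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * f(mu) * math.exp(-0.5 * mu * mu)
    return total * h / math.sqrt(2.0 * math.pi)


def snr_constant(g):
    """E[g(mu) * mu] for a transfer function g and standard Gaussian mu."""
    return gaussian_expectation(lambda mu: g(mu) * mu)


def identity(t):
    return t


def step01(t):
    """A 0/1 threshold response, standing in for one-bit measurements."""
    return 1.0 if t > 0 else 0.0


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))
```

For the identity the constant equals the variance of $\mu$, namely 1, and for the 0/1 threshold it equals $1/\sqrt{2\pi}$; flatter transfer functions such as the logistic give smaller values, matching the reading of this constant as a signal to noise ratio.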

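Proof Sketch: For notational convenience, denote $\Delta^2 = \frac{1}{\lambda}\sqrt{\frac{C s \log(2d/s)}{n}}$, where $C > 0$ is a universal constant. WLOG, we can assume that $\|\hat w\|_0 \le s$. Our assumption on the sparsity of $\hat w$ is pretty lenient, and is most often satisfied in practice. Also, since $\hat w$ is obtained from SILO, we have $\|\hat w\|_2 \le 1$, $\|\hat w\|_1 \le \sqrt{s}$. From a result of Plan and Vershynin [10, Corollary 3.1] (Lemma 4 in appendix), we know that $\|w_\star - \hat w\|_2^2 \le \Delta^2$. The excess risk $\mathcal{E}(\hat h)$ can be bounded as follows:

$$\begin{aligned} \mathcal{E}(\hat h) &= \mathbb{E}\big[(\hat g(\hat w^\top x) - y)^2 - (g_\star(w_\star^\top x) - y)^2\big] = \mathbb{E}\big(\hat g(\hat w^\top x) - g_\star(w_\star^\top x)\big)^2 \\ &= \mathbb{E}\big(\hat g(\hat w^\top x) - \hat g(w_\star^\top x) + \hat g(w_\star^\top x) - g_\star(w_\star^\top x)\big)^2 \\ &\le 2(s+k)\Delta^2\log(2d) + 2\,\mathbb{E}\big(\hat g(w_\star^\top x) - g_\star(w_\star^\top x)\big)^2 \end{aligned}$$

with probability at least $1 - \delta$, where we used the fact that $\hat g$ is 1-Lipschitz, and upper bounds on the expected suprema of a collection of Gaussian random variables. Next, we shall bound the R.H.S. of the above equation:

$$\begin{aligned} \mathbb{E}\big(\hat g(w_\star^\top x) - g_\star(w_\star^\top x)\big)^2 &\overset{(a)}{\le} \mathbb{E}\big(\hat g(w_\star^\top x) - y\big)^2 - \mathbb{E}\big(g_\star(w_\star^\top x) - y\big)^2 \\ &\overset{(b)}{\le} \frac{1}{n}\sum_{i=1}^n \Big[\big(\hat g(w_\star^\top x_i) - y_i\big)^2 - \big(g_\star(w_\star^\top x_i) - y_i\big)^2\Big] + \tilde O\!\left(\frac{(s\log(2d))^{1/4}}{\sqrt{n}}\right) \end{aligned}$$

In inequality (a) we used a certain projection inequality for convex sets (see Lemma 1 in appendix). To obtain inequality (b) we replace the expected value quantities with their empirical versions, plus deviation terms. Via standard application of large deviation inequalities, it is possible to establish that these deviations are $\tilde O\big((s\log(2d))^{1/4}/\sqrt{n}\big)$ (see Lemma 5 in appendix). The proof concludes by upper bounding the empirical term in the above equation using optimality of $\hat g$ and properties of maxima of a collection of Gaussian random variables.

Our next result is an upper bound on the excess risk of iSILO and ciSILO:

Theorem 2. Suppose $\hat g$, $\hat w$ are the outputs of SILO on our data. Let $\hat h(x) = \hat g(\hat w^\top x)$ be the hypothesis corresponding to these outputs. Let $h_\star(x) \overset{\mathrm{def}}{=} g_\star(w_\star^\top x)$. Now, let $\hat h_T$ be the output of ciSILO obtained by using $\hat g$, $\hat w$ as initializers.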
Then under the assumptions A1-A4, with high probability we can bound the excess risk of $\hat h_T$ by

$$\mathcal{E}(\hat h_T) \le \tilde O\!\left(\frac{(s+k)\log(2d)}{\lambda}\sqrt{\frac{s}{n}} + \frac{\sqrt{s}\,\sqrt{(s+k)\log(2d)}}{\sqrt{n}} + \sqrt{\frac{s\log^2(2d+1)}{n}} + \frac{1}{\sqrt{n}}\right)$$

where $\tilde O$ hides factors that are poly-logarithmic in $n$, $d$, $1/\delta$, $s$, $k$. Moreover, the same excess risk guarantees hold for $\hat h_T$ obtained by running iSILO.

Proof Sketch: From Theorem 1 we know that

$$\mathcal{E}(\hat h) = \mathrm{err}(\hat h) - \mathrm{err}(h_\star) \le \tilde O\!\left(\frac{(s+k)\log(2d)}{\lambda}\sqrt{\frac{s}{n}} + \frac{\sqrt{s}\,\sqrt{(s+k)\log(2d)}}{\sqrt{n}}\right).$$

Writing $\widehat{\mathrm{err}}$ for the empirical error, standard large deviation arguments (see Lemma 6 in appendix) let us claim that $|\mathrm{err}(\hat h) - \widehat{\mathrm{err}}(\hat h)| = \tilde O\big(\sqrt{s/n}\big)$ with probability at least $1 - \delta$. This gives us

$$\begin{aligned} \widehat{\mathrm{err}}(\hat h) &= \mathrm{err}(\hat h) + \tilde O\!\left(\sqrt{\frac{s}{n}}\right) = \mathrm{err}(h_\star) + \mathrm{err}(\hat h) - \mathrm{err}(h_\star) + \tilde O\!\left(\sqrt{\frac{s}{n}}\right) \\ &= \mathrm{err}(h_\star) + \tilde O\!\left(\frac{(s+k)\log(2d)}{\lambda}\sqrt{\frac{s}{n}} + \frac{\sqrt{s}\,\sqrt{(s+k)\log(2d)}}{\sqrt{n}} + \sqrt{\frac{s\log^2(2d+1)}{n}}\right). \end{aligned}$$

Now consider $\hat h_T$ obtained by running either ciSILO or iSILO for $T$ iterations, when initialized with $\hat w$, $\hat g$ obtained by running SILO first on the data. Since $\hat h_T$ is chosen by using a held-out validation set as the iterate corresponding to the smallest validation error, we can claim via Hoeffding's inequality that the empirical error of $\hat h_T$ cannot be too much larger than that of $\hat h$ (for otherwise $\hat h_T$ will not be the iterate with the smallest validation error). Precisely, if the validation set is of size $n$, then with high probability $\widehat{\mathrm{err}}(\hat h_T) \le \widehat{\mathrm{err}}(\hat h) + \tilde O\big(1/\sqrt{n}\big)$. Using the above inequalities, and via standard large deviation arguments to bound $\mathrm{err}(\hat h_T) - \widehat{\mathrm{err}}(\hat h_T)$, we get the desired result.

Figure 1: Error rates are normalized so that the Slisotron has an error of 1. Note that ciSILO consistently outperforms all other methods, and iSILO is very competitive. The numbers below each dataset refer to $(n, d)$.

Remarks: In the bound of Theorem 2, the first term in $\tilde O$ dominates, and the excess risk bound is essentially $\tilde O\!\left(\frac{(s+k)\log(2d)}{\lambda}\sqrt{\frac{s}{n}}\right)$. Also, using the output of SILO to initialize iSILO and ciSILO yields strong theoretical guarantees.

The constant $\lambda$ in our results: $\lambda$ acts like the signal to noise ratio in our results. The larger $\lambda$ is, the better our bound gets. For example, for the logistic model, $\lambda$ is approximately the norm of the data ($\sqrt{\log(d)}$). For measurements of the form $y = \mathrm{sign}(x^\top w_\star)$, $\lambda$ is a constant. The case $\lambda < 0$ can be easily tackled by reversing the signs of $y$, and $\lambda = 0$ implies that the data and observations are uncorrelated, so naturally any error bound will be meaningless.

Comparisons to existing results in low dimensions: In [5] the authors obtained dimension dependent as well as dimension independent bounds on the prediction error for the Slisotron algorithm for the SIM problem. However, these results were obtained under the restrictive assumption that $\|w_\star\|_2 \le W$, $\|x\|_2 \le B$, and both $W$, $B$ are fixed and independent of dimensions. In order to carry through a correct high-dimensional analysis, one needs to let either $W$ or $B$ or both grow with $d$. In our analysis, we assume that the data is sampled from a standard multivariate Gaussian, and hence $\|x\|_2 \le \sqrt{d}$ with high probability.
If one were to replace $B$ with $\sqrt{d}$ in the results of [5] (in their analysis $B = 1$), then the excess risk of their predictor would scale as $\min\{d / n^{1/3}, \sqrt{d} / n^{1/4}\}$, and since $n \ll d$, their bounds are meaningless in the high-dimensional setting. In contrast, our results in Theorem 2 have a (poly)-logarithmic dependence on $d$, and hence are useful in the high dimensional setting studied in this paper. The same arguments apply to the results of [6], where in addition one needs a fresh batch of samples at each run.

4 Experimental results

We tested our algorithms SILO, iSILO, and ciSILO on many real world high dimensional datasets. For comparison with methods that assume $g_\star$ known, we used Sparse Logistic Regression (SLR) and Sparse Squared Hinge Loss minimization (SHL) [13]². We also tested the Slisotron [5] algorithm designed for low-dimensional SIMs. For each dataset we randomly chose 60% of the data for training, and 20% each for validation and testing. The tuning parameters are chosen via validation. Mac-Win, Crypt-Elec, Atheism-Religion and Auto-Motorcycle are from the 20 Newsgroups dataset. Arcene is from the NIPS challenge³, and the Page dataset is obtained from the WebKB dataset [8]⁴. Prostate and Colon cancer datasets are available online⁵. Figure 1 shows the misclassification error obtained on the test set. We show results for 8 datasets of varying size. Additional results are available in the supplementary material. Since the datasets (and errors) are varied, we normalize the error rates so that the Slisotron has unit error. As we can see from these results, using the calibrated loss in ciSILO yields the best performance in all the datasets considered, except Mac-Win. iSILO is as good as or better than SLR in 6/8 cases. It is encouraging to note that iSILO and ciSILO do well despite not having the luxury of choosing optimal step sizes at each iteration. Finally, the relatively poor performance of SILO underlines the importance of iterative methods in the SIM learning setting.

² Code downloaded from http://www.cs.ubc.ca/~schmidtm/Software/L1General.html
³ http://www.nipsfsc.ecs.soton.ac.uk/datasets/
⁴ http://vikas.sindhwani.org/manifoldregularization.html
⁵ http://www.stat.cmu.edu/~jiashun/research/software/hcclassification/prostate/

5 Conclusions

In this paper, we introduced a suite of algorithms based on sparse parameter estimation for learning single index models in the high dimensional setting. We derived excess risk guarantees for the proposed methods. Our algorithm employing a calibrated loss and a novel quadratic programming method to fit the transfer function achieves superior results compared to standard high dimensional classification methods based on minimizing the logistic or the hinge loss. In the future we plan to investigate learning single index models with structural constraints other than sparsity, such as low rank, group sparsity, and indeed other very general constraints.

References

[1] Pierre Alquier and Gérard Biau. Sparse single-index model. The Journal of Machine Learning Research, 14(1):243-280, 2013.
[2] Joel L Horowitz. Semiparametric and nonparametric methods in econometrics. Springer, 2009.
[3] Joel L Horowitz and Wolfgang Härdle. Direct semiparametric estimation of single-index models with discrete covariates. Journal of the American Statistical Association, 91(436):1632-1640, 1996.
[4] Hidehiko Ichimura. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58(1):71-120, 1993.
[5] Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems, pages 927-935, 2011.
[6] Adam Tauman Kalai and Ravi Sastry. The Isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
[7] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538-557, 2012.
[8] Kamal Paul Nigam. Using unlabeled data to improve text classification. PhD thesis, Citeseer, 2001.
[9] Mee Young Park and Trevor Hastie. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):659-677, 2007.
[10] Yaniv Plan and Roman Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. Information Theory, IEEE Transactions on, 59(1):482-494, 2013.

[11] Yaniv Plan, Roman Vershynin, and Elena Yudovina. High-dimensional estimation with geometric constraints. arXiv preprint arXiv:1404.3749, 2014.
[12] Nikhil S Rao, Robert D Nowak, Christopher R Cox, and Timothy T Rogers. Classification with sparse overlapping groups. arXiv preprint arXiv:1402.4512, 2014.
[13] Mark Schmidt, Glenn Fung, and Romer Rosales. Optimization methods for l1-regularization. University of British Columbia, Technical Report TR-2009, 19, 2009.
[14] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, pages 2199-2207, 2010.
[15] Sara A Van de Geer. High-dimensional generalized linear models and the lasso. The Annals of Statistics, pages 614-645, 2008.
[16] Tong Zhang. Covering number bounds of certain regularized linear function classes. The Journal of Machine Learning Research, 2:527-550, 2002.

A Preliminaries

We shall need a few definitions and a few important lemmas and propositions before we can state the proofs of our theorems. We shall consider the following function class:

$$\mathcal{G} = \{ g : [-W, W] \to [0, 1],\ g \text{ is 1-Lipschitz and monotonic} \}. \tag{15}$$

Though the above definition of $\mathcal{G}$ uses an unspecified parameter $W$, most often we shall use $W = \sqrt{s \log(2d)}$. The following result concerning suprema of a collection of i.i.d. Gaussian random variables is standard and we shall state it without proof.

Proposition 1. Let $[g_i]_{i=1}^m$ be a collection of $m$ i.i.d. Gaussian random variables with mean 0 and variance $\sigma^2$. Then,

$$\max_{i \in [m]} |g_i| \le \sigma\sqrt{\log(2m)} + \sigma\sqrt{2\log(2/\delta)} \quad \text{w.p. } 1-\delta.$$

The next lemma is standard and a proof can be found in Lemma 9 in [5].

Lemma 1. Let $\mathcal{F}$ be a convex class of functions, and let $f^* = \arg\min_{f \in \mathcal{F}} \mathbb{E}(f(x) - y)^2$. Suppose that $\mathbb{E}[Y \mid X = x] = g_\star(w^{*\top} x)$ for some $g_\star \in \mathcal{G}$. Then for any $f \in \mathcal{F}$, the following holds true:

$$\mathbb{E}[(f(x) - y)^2] - \mathbb{E}[(f^*(x) - y)^2] \ge \mathbb{E}[(f(x) - f^*(x))^2]. \tag{16}$$

Lemma 2. Let $x \in \mathbb{R}^d$ be a standard normal random vector. Then with probability at least $1-\delta$,

$$|w^{*\top} x| \le \tilde{O}\big(\sqrt{s \log(2d)}\big).$$

Proof. The proof follows immediately from Proposition (1) and the fact that $\|w^*\|_1 \le \sqrt{s}$.

Lemma 3. Let $e \in \mathbb{R}^d$ be such that $\|e\|_0 \le s + k$ and $\|e\|_2 \le \eta$. Let $x$ be a standard normal random vector. Then with probability at least $1-\delta$,

$$|e^\top x| \le \tilde{O}\big(\eta\sqrt{(s + k)\log(2d)}\big).$$

Proof. Let $e = [e_1, \ldots, e_d]$. Similarly, let $x = [x_1, \ldots, x_d]$. We then have

\begin{align}
|e^\top x| &= \Big|\sum_{i=1}^{d} e_i x_i\Big| \tag{17}\\
&\le \max_i |x_i| \sum_{i=1}^{d} |e_i| \tag{18}\\
&\overset{(a)}{\le} \sqrt{\log(2d)} \sum_{i=1}^{s+k} |e_i| \quad \text{w.p. } 1-\delta \tag{19}\\
&\overset{(b)}{\le} \eta\sqrt{(s + k)\log(2d)}. \tag{20}
\end{align}

In obtaining inequality (a) we used the fact that the max of the absolute value of $d$ Gaussian random variables is bounded by $\sqrt{\log(2d)}$. In inequality (b) we used the fact that $\|e\|_0 \le s + k$, and hence only $s + k$ of the elements of $e$ are non-zero, so that by the Cauchy-Schwarz inequality $\sum_i |e_i| \le \sqrt{s+k}\,\|e\|_2 \le \eta\sqrt{s+k}$.

We next need the following important result (Corollary 3.1 in [10]).

Lemma 4. Let $\mathcal{W} = \{w \in \mathbb{R}^d : \|w\|_2 \le 1, \|w\|_1 \le \sqrt{s}\}$. Let $\hat{w}$ be obtained from SILO, shown in the main paper. Suppose $\hat{w} \in \mathcal{W}$. Let $x_1, \ldots, x_n$ be independent Gaussian random vectors. Assume that the measurements satisfy $\mathbb{E}[Y \mid X = x] = g_\star(w^{*\top} x)$, where $\|w^*\|_2 \le 1$, $\|w^*\|_0 \le s$. Then with probability at least $1-\delta$, the solution $\hat{w}$ obtained from SILO satisfies the inequality

$$\|\hat{w} - w^*\|_2^2 \le \eta^2 := \frac{1}{\lambda}\sqrt{\frac{C s \log(2d/s)}{n}},$$

where $C > 0$ is a universal constant, and $\lambda = \mathbb{E}_{\mu \sim N(0,1)}\, g_\star(\mu)\mu$.
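The Gaussian maximum bound (Proposition 1) and the chain of inequalities (17)-(20) in the proof of Lemma 3 can be checked numerically. The following is a minimal sketch, not part of the original analysis. The function names, the parameter values, and the explicit constants in `gaussian_max_bound` (the usual union-bound constants) are assumptions made for illustration.

```python
import math
import random


def gaussian_max_bound(m, sigma, delta):
    """Union-bound style bound on max_i |g_i| for m i.i.d. N(0, sigma^2) draws.

    Assumed constants: sigma*sqrt(2 log(2m)) + sigma*sqrt(2 log(2/delta)).
    """
    return sigma * (math.sqrt(2.0 * math.log(2.0 * m))
                    + math.sqrt(2.0 * math.log(2.0 / delta)))


def violation_rate(m, sigma, delta, trials, seed=0):
    """Fraction of trials in which max_i |g_i| exceeds the bound."""
    rng = random.Random(seed)
    bound = gaussian_max_bound(m, sigma, delta)
    violations = 0
    for _ in range(trials):
        largest = max(abs(rng.gauss(0.0, sigma)) for _ in range(m))
        if largest > bound:
            violations += 1
    return violations / trials


def sparse_inner_product_chain(d, nnz, eta, seed=0):
    """Return (|e.x|, max_i|x_i| * ||e||_1, eta * sqrt(nnz) * max_i|x_i|).

    e is nnz-sparse with ||e||_2 = eta; x is a standard normal vector in R^d.
    The three values should be non-decreasing, mirroring steps (17)-(20)
    with max_i|x_i| kept explicit instead of replaced by sqrt(log(2d)).
    """
    rng = random.Random(seed)
    support = rng.sample(range(d), nnz)
    raw = [rng.gauss(0.0, 1.0) for _ in support]
    norm = math.sqrt(sum(v * v for v in raw))
    e = [0.0] * d
    for idx, v in zip(support, raw):
        e[idx] = eta * v / norm  # rescale so that ||e||_2 = eta
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    max_x = max(abs(v) for v in x)
    inner = abs(sum(ei * xi for ei, xi in zip(e, x)))
    l1 = sum(abs(ei) for ei in e)
    return inner, max_x * l1, eta * math.sqrt(nnz) * max_x


# Hypothetical parameter values, chosen only to keep the check fast.
rate = violation_rate(m=200, sigma=1.0, delta=0.1, trials=200)
chain = sparse_inner_product_chain(d=1000, nnz=20, eta=0.5)
```

Under these assumptions, `rate` should not exceed `delta`, and the three entries of `chain` should be non-decreasing: the first step is Hölder's inequality and the second is Cauchy-Schwarz on the support of `e`.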

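Lemma 5. With probability at least $1-\delta$,

$$\mathbb{E}(\hat{g}(w^{*\top} x) - y)^2 - \mathbb{E}(g_\star(w^{*\top} x) - y)^2 \le \frac{1}{n}\sum_{i=1}^{n}\Big[(\hat{g}(w^{*\top} x_i) - y_i)^2 - (g_\star(w^{*\top} x_i) - y_i)^2\Big] + \tilde{O}\Big(\sqrt{\frac{W}{n}}\Big), \tag{21}$$

where $\tilde{O}$ hides factors that are (poly)-logarithmic in $n$ and $1/\delta$.

Proof. From Lemma 6 (i) in [5] we know that

$$N_2(r, \mathcal{G}, z_1, \ldots, z_n) \le N_\infty(r, \mathcal{G}) \le \frac{1}{r}\, 2^{2W/r}, \tag{22}$$

where $N_2(r, \mathcal{G}, z_1, \ldots, z_n)$ is the $L_2$ empirical covering number of the function class $\mathcal{G}$ at radius $r$, and $N_\infty(r, \mathcal{G})$ is the $L_\infty$ covering number. Using the Dudley entropy integral, we can upper bound the empirical Rademacher complexity by

$$\hat{R}_n(\mathcal{G}) \le \inf_{\alpha > 0}\Big[ 4\alpha + 10\int_{\alpha}^{1} \sqrt{\frac{\log(1/r) + \frac{2W}{r}}{n}}\, dr \Big] \le \frac{40\sqrt{W}}{\sqrt{n}}. \tag{23}$$

Hence, via standard large deviation inequalities we can claim that

$$\mathbb{E}[(\hat{g}(w^{*\top} x) - y)^2] \le \frac{1}{n}\sum_{i=1}^{n} (\hat{g}(w^{*\top} x_i) - y_i)^2 + O\Big(\sqrt{\frac{W}{n}}\Big). \tag{24}$$

Similarly, via standard concentration inequalities we can claim that with probability at least $1-\delta$,

$$\Big|\mathbb{E}[(g_\star(w^{*\top} x) - y)^2] - \frac{1}{n}\sum_{i=1}^{n} (g_\star(w^{*\top} x_i) - y_i)^2\Big| \le O\Big(\sqrt{\frac{\log(2/\delta)}{n}}\Big), \tag{25}$$

and hence, putting together the above two inequalities, the desired result follows.

B Proof of Theorem 1

For notational convenience, denote by $\eta^2 = \frac{1}{\lambda}\sqrt{\frac{C s \log(2d/s)}{n}}$, where $C > 0$ is a universal constant. Since $\hat{w}$ is obtained from SILO, we have $\|\hat{w}\|_2 \le 1$, $\|\hat{w}\|_1 \le \sqrt{s}$. The excess risk $\mathcal{E}(\hat{h})$ can be bounded as follows.

\begin{align}
\mathcal{E}(\hat{h}) &= \mathbb{E}[(\hat{g}(\hat{w}^\top x) - y)^2 - (g_\star(w^{*\top} x) - y)^2] \notag\\
&= \mathbb{E}(\hat{g}(\hat{w}^\top x) - g_\star(w^{*\top} x))^2 \notag\\
&= \mathbb{E}(\hat{g}(\hat{w}^\top x) - \hat{g}(w^{*\top} x) + \hat{g}(w^{*\top} x) - g_\star(w^{*\top} x))^2 \notag\\
&\le 2\,\mathbb{E}(\hat{g}(\hat{w}^\top x) - \hat{g}(w^{*\top} x))^2 + 2\,\mathbb{E}(\hat{g}(w^{*\top} x) - g_\star(w^{*\top} x))^2 \notag\\
&\overset{(a)}{\le} 2\,\mathbb{E}((\hat{w} - w^*)^\top x)^2 + 2\,\mathbb{E}(\hat{g}(w^{*\top} x) - g_\star(w^{*\top} x))^2 \notag\\
&\overset{(b)}{\le} s\,\eta^2 \log(2d) + 2\,\mathbb{E}(\hat{g}(w^{*\top} x) - g_\star(w^{*\top} x))^2 \quad \text{with probability at least } 1-\delta, \tag{26}
\end{align}

where, in order to obtain inequality (a), we used the fact that $\hat{g}$ is 1-Lipschitz, and in order to obtain inequality (b) we used Lemma (3). We shall now bound the R.H.S. of inequality (26). We do this as follows:

\begin{align}
\mathbb{E}(\hat{g}(w^{*\top} x) - g_\star(w^{*\top} x))^2 &\overset{(a)}{\le} \mathbb{E}(\hat{g}(w^{*\top} x) - y)^2 - \mathbb{E}(g_\star(w^{*\top} x) - y)^2 \tag{27}\\
&\overset{(b)}{\le} \frac{1}{n}\sum_{i=1}^{n}\Big[(\hat{g}(w^{*\top} x_i) - y_i)^2 - (g_\star(w^{*\top} x_i) - y_i)^2\Big] + \epsilon. \tag{28}
\end{align}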

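In inequality (a) we used Lemma 1 with the function class $\mathcal{F} = \mathcal{G} \circ w^*$. In inequality (b) we used Lemma (5) to bound the expectation quantity in terms of its empirical quantity, with $W$ set to the maximum value of $|w^{*\top} x_i|$. We know from Lemma 2 that this max value is $\sqrt{s \log(2d)}$ with probability at least $1-\delta$. Hence by substituting $\sqrt{s \log(2d)}$ for $W$, we get $\epsilon = O\Big(\sqrt{\sqrt{s \log(2d)}/n}\Big)$. Next we shall try to upper bound the empirical term in the above equation. We have

\begin{align}
&\frac{1}{n}\sum_{i=1}^{n} (\hat{g}(w^{*\top} x_i) - y_i)^2 - \frac{1}{n}\sum_{i=1}^{n} (g_\star(w^{*\top} x_i) - y_i)^2 \notag\\
&= \frac{1}{n}\sum_{i=1}^{n} \big(\hat{g}(\hat{w}^\top x_i) - y_i - \hat{g}(\hat{w}^\top x_i) + \hat{g}(w^{*\top} x_i)\big)^2 - \frac{1}{n}\sum_{i=1}^{n} \big(g_\star(\hat{w}^\top x_i) - y_i - g_\star(\hat{w}^\top x_i) + g_\star(w^{*\top} x_i)\big)^2 \notag\\
&= \underbrace{\frac{1}{n}\sum_{i=1}^{n} (\hat{g}(\hat{w}^\top x_i) - y_i)^2 - \frac{1}{n}\sum_{i=1}^{n} (g_\star(\hat{w}^\top x_i) - y_i)^2}_{\le 0} \notag\\
&\quad + \underbrace{\frac{1}{n}\sum_{i=1}^{n} (\hat{g}(\hat{w}^\top x_i) - \hat{g}(w^{*\top} x_i))^2}_{T_1} - \underbrace{\frac{1}{n}\sum_{i=1}^{n} (g_\star(\hat{w}^\top x_i) - g_\star(w^{*\top} x_i))^2}_{\ge 0} \notag\\
&\quad - \underbrace{\frac{2}{n}\sum_{i=1}^{n} (\hat{g}(\hat{w}^\top x_i) - y_i)(\hat{g}(\hat{w}^\top x_i) - \hat{g}(w^{*\top} x_i))}_{T_2} + \underbrace{\frac{2}{n}\sum_{i=1}^{n} (g_\star(\hat{w}^\top x_i) - y_i)(g_\star(\hat{w}^\top x_i) - g_\star(w^{*\top} x_i))}_{T_3}, \tag{29}
\end{align}

where the term marked as $\le 0$ is non-positive because $\hat{g}$ is the solution to a minimization problem that minimizes the empirical squared error under monotonicity and 1-Lipschitz constraints. Since $g_\star$ is also monotonic and 1-Lipschitz, the squared error corresponding to the predictor $\hat{g}(\hat{w}^\top x)$ should be no larger than the squared error corresponding to $g_\star(\hat{w}^\top x)$. The term marked as $\ge 0$ is non-negative because it is an average of squared quantities, so dropping both terms gives an upper bound of $T_1 + |T_2| + |T_3|$. We shall now bound $T_1, T_2, T_3$ as follows:

\begin{align}
T_1 &= \frac{1}{n}\sum_{i=1}^{n} (\hat{g}(\hat{w}^\top x_i) - \hat{g}(w^{*\top} x_i))^2 \tag{30}\\
&\overset{(a)}{\le} \frac{1}{n}\sum_{i=1}^{n} ((\hat{w} - w^*)^\top x_i)^2 \tag{31}\\
&\overset{(b)}{\le} (s + k)\,\eta^2 \log(2d), \tag{32}
\end{align}

where, to obtain inequality (a), we used the fact that $\hat{g}$ is 1-Lipschitz, and to obtain inequality (b) we used Lemma 3 with $e = \hat{w} - w^*$. To upper bound $T_2$ we proceed as follows:

\begin{align}
|T_2| &= \Big|\frac{2}{n}\sum_{i=1}^{n} (\hat{g}(\hat{w}^\top x_i) - y_i)(\hat{g}(\hat{w}^\top x_i) - \hat{g}(w^{*\top} x_i))\Big| \tag{33}\\
&\overset{(a)}{\le} \frac{2}{n}\sum_{i=1}^{n} |\hat{g}(\hat{w}^\top x_i) - \hat{g}(w^{*\top} x_i)| \tag{34}\\
&\overset{(b)}{\le} \tilde{O}\big(\eta\sqrt{(s + k)\log(2d)}\big). \tag{35}
\end{align}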

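To obtain inequality (a) we used the fact that $|y_i - \hat{g}(\hat{w}^\top x_i)| \le 1$, and to obtain inequality (b) we used the fact that $\hat{g}$ is 1-Lipschitz and Lemma 3. The same reasoning can be applied to upper bound $T_3$ to get $|T_3| \le \tilde{O}\big(\eta\sqrt{(s+k)\log(2d)}\big)$. Finally, using Lemma (4), we know that $\|w^* - \hat{w}\|_2^2 \le \eta^2 = \tilde{O}\Big(\frac{1}{\lambda}\sqrt{\frac{s}{n}}\Big)$. Gathering all the terms, we get with probability at least $1-\delta$,

$$\mathcal{E}(\hat{h}) = \tilde{O}\left( \frac{(s + k)\log(2d)}{\lambda}\sqrt{\frac{s}{n}} + \sqrt{\frac{(s + k)\log(2d)}{\lambda}}\Big(\frac{s}{n}\Big)^{1/4} \right), \tag{36}$$

where $\lambda = \mathbb{E}_{\mu \sim N(0,1)}\, g_\star(\mu)\mu$ is a constant that depends on $g_\star$.

C Large Deviation Guarantees for iSILO, ciSILO

Lemma 6. For any hypothesis $h(x) = g(w^\top x)$, where $\mathcal{W} = \{w \in \mathbb{R}^d : \|w\|_1 \le \sqrt{s}, \|w\|_2 \le 1\}$, $g \in \mathcal{G}$, $w \in \mathcal{W}$, we have

$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h) + \tilde{O}\Big(\sqrt{\widehat{\mathrm{err}}(h)}\,\sqrt{\frac{s}{n}}\Big),$$

where the $\tilde{O}$ hides factors (poly)-logarithmic in $d$, $n$, $1/\delta$. In particular the above result also applies to $h_T$, which is the hypothesis obtained by running iSILO or ciSILO for $T$ iterations, and to $\hat{h}$, the hypothesis obtained by running SILO.

Before we give the proof of this lemma, we would like to point out that our assumption that $\hat{w} \in \mathcal{W}$ is not at all restrictive. In practice the iterates of the proximal gradient method used in SILO-M are sparse for a sufficiently large regularization parameter.

Proof. Consider the function class $\mathcal{H} = \{h(x) = g(w^\top x) : w \in \mathcal{W}, g \in \mathcal{G}\}$. By construction, we are guaranteed that $h_T, \hat{h} \in \mathcal{H}$, w.h.p., with $W = \sqrt{s \log(2d)}$. In order to establish a large deviation bound on the risk of $h_T$ we shall first calculate the worst case Rademacher complexity of $\mathcal{H}$. To do this, we establish the $L_2$ covering number of the function class $\mathcal{H}$ by establishing the $L_\infty$ covering number of $\mathcal{G}$, and the $L_2$ covering number of $\mathcal{W}$. Both these results are standard. From Lemma 6 in [5] we have

$$\log N_\infty(\epsilon, \mathcal{G}) \le \log\frac{1}{\epsilon} + \frac{2\sqrt{s \log(2d)}}{\epsilon}. \tag{37}$$

Since $\|w\|_1 \le \sqrt{s}$ and $\|x\|_\infty \le \tilde{O}(\sqrt{\log(2d)})$, we can use Theorem 3 in [16] to conclude that w.h.p.

$$\log N_2(\mathcal{W}, \epsilon, n) \le \frac{s \log^2(2d + 1)}{\epsilon^2}. \tag{38}$$

It is not hard to see that

\begin{align}
\log N_2(\mathcal{H}, \epsilon, n) &\le \log N_2\Big(\mathcal{W}, \frac{\epsilon}{2\sqrt{2}}, n\Big) + \log N_\infty\Big(\frac{\epsilon}{2\sqrt{2}}, \mathcal{G}\Big) \tag{39}\\
&= \tilde{O}\Big(\frac{s \log^2(2d + 1)}{\epsilon^2}\Big). \tag{40}
\end{align}

Using the covering-number bound on Rademacher complexity from Appendix A of [14], we can bound the worst case Rademacher complexity of $\mathcal{H}$ by

$$\hat{R}_n(\mathcal{H}) \le \tilde{O}\left(\sqrt{\frac{s \log^2(2d + 1)}{n}}\right).$$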

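Finally, applying Theorem 1 in [14], we get with probability at least $1-\delta$

$$\mathrm{err}(h_T) \le \widehat{\mathrm{err}}(h_T) + \tilde{O}\left(\sqrt{\widehat{\mathrm{err}}(h_T)}\,\sqrt{\frac{s \log^2(2d + 1)}{n}}\right).$$

D Proof of Theorem (2)

Proof. From Theorem (1) we know that

$$\mathcal{E}(\hat{h}) = \mathrm{err}(\hat{h}) - \mathrm{err}(h_\star) \le \tilde{O}\left( \frac{(s + k)\log(2d)}{\lambda}\sqrt{\frac{s}{n}} + \sqrt{\frac{(s + k)\log(2d)}{\lambda}}\Big(\frac{s}{n}\Big)^{1/4} \right). \tag{41}$$

Using Lemma 6 we can say that with probability at least $1-\delta$

\begin{align}
\widehat{\mathrm{err}}(\hat{h}) &\le \mathrm{err}(\hat{h}) + \tilde{O}\left(\sqrt{\frac{s \log^2(2d + 1)}{n}}\right) \tag{42}\\
&= \mathrm{err}(h_\star) + \mathrm{err}(\hat{h}) - \mathrm{err}(h_\star) + \tilde{O}\left(\sqrt{\frac{s \log^2(2d + 1)}{n}}\right) \notag\\
&\le \mathrm{err}(h_\star) + \tilde{O}\left( \frac{(s + k)\log(2d)}{\lambda}\sqrt{\frac{s}{n}} + \sqrt{\frac{(s + k)\log(2d)}{\lambda}}\Big(\frac{s}{n}\Big)^{1/4} \right) + \tilde{O}\left(\sqrt{\frac{s \log^2(2d + 1)}{n}}\right). \tag{43}
\end{align}

Now consider $\hat{h}_T$ obtained by running iSILO for $T$ iterations, when initialized with $\hat{w}, \hat{g}$ obtained by running SILO first on the data. Since $\hat{h}_T$ is chosen by using a held-out validation set as the iterate corresponding to the smallest validation error, we can claim via Hoeffding's inequality that the empirical error of $\hat{h}_T$ cannot be too much larger than that of $\hat{h}$ (for otherwise $\hat{h}_T$ will not be the iterate with the smallest validation error). Precisely, if the validation set is of size $n$, then with high probability

$$\widehat{\mathrm{err}}(\hat{h}_T) \le \widehat{\mathrm{err}}(\hat{h}) + \tilde{O}\Big(\frac{1}{\sqrt{n}}\Big). \tag{44}$$

Combining Equations (44) and (42)-(43) we get

$$\widehat{\mathrm{err}}(\hat{h}_T) \le \mathrm{err}(h_\star) + \tilde{O}\left( \frac{(s + k)\log(2d)}{\lambda}\sqrt{\frac{s}{n}} + \sqrt{\frac{(s + k)\log(2d)}{\lambda}}\Big(\frac{s}{n}\Big)^{1/4} + \sqrt{\frac{s \log^2(2d + 1)}{n}} + \frac{1}{\sqrt{n}} \right). \tag{45}$$

Now using Lemma 6 to upper bound $\mathrm{err}(\hat{h}_T)$ in terms of $\widehat{\mathrm{err}}(\hat{h}_T)$, and combining it with the above bound, we get the desired result. The same arguments apply even to the ciSILO algorithm.

E Additional Experimental Results

Here we report results on other high dimensional datasets. Figure 2 again shows the advantage of the calibrated and iterative method ciSILO. The dataset table below has the details of the datasets in Figure 2.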

Figure 2: Comparison of different methods (Slisotron, SHL, SLR, SILO, iSILO) over different datasets (Eyedata, Link, PageLink). The results are normalized so that the Slisotron has error = 1.

Table: Dataset details
Dataset: n, d
Leukemia: 729
Eyedata: 20, 200
Link: 526, 80
Page+Link: 526, 80
Gisette: 200, 5000
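The held-out selection step used in the proof of Theorem 2 (Appendix D) can be illustrated with a short sketch. It is not the authors' implementation. The names `select_iterate`, `hoeffding_radius` and `n_val`, the example error values, and the assumption that each per-example loss lies in [0, 1] are all hypothetical choices made for illustration.

```python
import math


def hoeffding_radius(n_val, num_candidates, delta):
    """Uniform Hoeffding deviation for losses in [0, 1].

    With probability at least 1 - delta, every one of num_candidates fixed
    hypotheses has validation error within this radius of its true error.
    """
    return math.sqrt(math.log(2.0 * num_candidates / delta) / (2.0 * n_val))


def select_iterate(validation_errors):
    """Index of the iterate with the smallest validation error (earliest on ties)."""
    return min(range(len(validation_errors)),
               key=validation_errors.__getitem__)


# Hypothetical validation errors: index 0 plays the role of the SILO
# initialization, later indices are successive iterates.
val_errors = [0.31, 0.27, 0.29, 0.27]
best = select_iterate(val_errors)
radius = hoeffding_radius(n_val=1000, num_candidates=len(val_errors), delta=0.05)
```

Because the initialization is itself one of the candidates, the selected iterate never has a larger validation error than the initialization, and `radius` shrinks at the rate of one over the square root of the validation set size, which is the form of the extra term in (44).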