Accelerated Gradient Methods for Stochastic Optimization and Online Learning

Chonghai Hu, James T. Kwok, Weike Pan
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
Department of Mathematics, Zhejiang University, Hangzhou, China
hino.hu@gmail.com, {jamesk,weikep}@cse.ust.hk

Abstract

Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., l1-regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly. In this paper, we develop a novel accelerated gradient method for stochastic optimization while still preserving their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods. Moreover, SAGE can also be extended for online learning, resulting in a simple algorithm but with the best regret bounds currently known for these problems.

1 Introduction

Risk minimization is at the heart of many machine learning algorithms. Given a class of models parameterized by w and a loss function l(·,·), the goal is to minimize E_XY[l(w; x, y)] w.r.t. w, where the expectation is over the joint distribution of the input X and the output Y. However, since the joint distribution is typically unknown in practice, a surrogate problem is to replace the expectation by its empirical average on a training sample {(x_1, y_1), ..., (x_m, y_m)}. Moreover, a regularizer Ω(·) is often added for well-posedness. This leads to the minimization of the regularized risk

  min_w (1/m) Σ_{i=1}^m l(w; x_i, y_i) + λΩ(w),  (1)

where λ is a regularization parameter. In optimization terminology, the deterministic optimization problem in (1) can be considered as a sample average approximation (SAA) of the corresponding stochastic optimization problem:

  min_w E_XY[l(w; x, y)] + λΩ(w).
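As a concrete (purely illustrative) instantiation of the regularized risk (1), the snippet below evaluates it with the square loss and an l1 regularizer on toy data; the data, weights and λ value are assumptions for demonstration, not part of the paper:

```python
import numpy as np

def regularized_risk(w, X, y, lam):
    """Regularized risk (1) with square loss l(w; x, y) = (w.x - y)^2 / 2
    and l1 regularizer Omega(w) = ||w||_1 (one possible instantiation)."""
    residuals = X @ w - y
    empirical_risk = 0.5 * np.mean(residuals ** 2)
    return empirical_risk + lam * np.sum(np.abs(w))

# Toy data: 3 samples, 2 features (illustrative values).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([1.0, 0.0])
print(regularized_risk(w, X, y, lam=0.1))  # here the fit is exact, so only the l1 term remains
```

Since w reproduces y exactly on this toy data, the empirical risk vanishes and the value equals λ‖w‖₁.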
  (2)

Since both l(·,·) and Ω(·) are typically convex, (1) is a convex optimization problem which can be conveniently solved even with standard off-the-shelf optimization packages. However, with the proliferation of data-intensive applications in the text and web domains, data sets with millions or trillions of samples are nowadays not uncommon. Hence, off-the-shelf optimization solvers are too slow to be used. Indeed, even tailor-made software for specific models, such as the sequential minimal optimization (SMO) method for the SVM, has superlinear computational
complexities and thus is not feasible for large data sets. In light of this, the use of stochastic methods has recently drawn a lot of interest, and many of these are highly successful. Most are based on (variants of) stochastic gradient descent (SGD). Examples include Pegasos [1], SGD-QN [2], FOLOS [3], and stochastic coordinate descent (SCD) [4]. The main advantages of these methods are that they are simple to implement, have low per-iteration complexity, and can scale up to large data sets. Their runtime is independent of, or even decreases with, the number of training samples [5, 6]. On the other hand, because of their simplicity, these methods have a slow convergence rate, and thus may require a large number of iterations.

While standard gradient schemes have a slow convergence rate, they can often be accelerated. This stems from the pioneering work of Nesterov in 1983 [7], which is a deterministic algorithm for smooth optimization. Recently, it has also been extended for composite optimization, where the objective has a smooth component and a non-smooth component [8, 9]. This is particularly relevant to machine learning, since the loss l and regularizer Ω in (1) may be non-smooth. Examples include loss functions such as the commonly-used hinge loss in the SVM, and regularizers such as the popular l1 penalty in the Lasso [10] and basis pursuit. These accelerated gradient methods have also been successfully applied in the optimization problems of multiple kernel learning [11] and trace norm minimization [12]. Very recently, Lan [13] made an initial attempt to further extend this to stochastic composite optimization, and obtained the convergence rate of

  O(L/N² + (M + σ)/√N).  (3)

Here, N is the number of iterations performed by the algorithm, L is the Lipschitz parameter of the gradient of the smooth term in the objective, M is the Lipschitz parameter of the non-smooth term, and σ measures the variance of the stochastic subgradient. Note that the first term of (3) is related to the smooth component of the objective, while the second term is related to the non-smooth component.
Complexity results [14, 13] show that (3) is the optimal convergence rate for any iterative algorithm solving stochastic (general) convex composite optimization. However, as pointed out in [15], a very useful property that can improve convergence rates in machine learning optimization problems is strong convexity. For example, (1) can be strongly convex either because of the strong convexity of l (e.g., log loss, square loss) or of Ω (e.g., l2 regularization). On the other hand, [13] is more interested in general convex optimization problems, and so strong convexity is not utilized. Moreover, though theoretically interesting, [13] may be of limited practical use as (1) the stepsize in its update rule depends on the often unknown σ; and (2) the number of iterations performed by the algorithm has to be fixed in advance.

Inspired by the successes of Nesterov's method, we develop in this paper a novel accelerated subgradient scheme for stochastic composite optimization. It achieves the optimal convergence rate of O(L/N² + σ/√N) for general convex objectives, and O((L + µ)/N² + σ²µ⁻¹/N) for µ-strongly convex objectives. Moreover, its per-iteration complexity is almost as low as that of standard (sub)gradient methods. Finally, we also extend the accelerated gradient scheme to online learning. We obtain O(√N) regret for general convex problems and O(log N) regret for strongly convex problems, which are the best regret bounds currently known for these problems.

2 Setting and Mathematical Background

First, we recapitulate a few notions in convex analysis.

Definition 1 (Lipschitz continuity). A function f(x) is L-Lipschitz if ‖f(x) − f(y)‖ ≤ L‖x − y‖.

Lemma 1. [14] The gradient of a differentiable function f(x) is Lipschitz continuous with Lipschitz parameter L if, for any x and y,

  f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖².  (4)

Definition 2 (Strong convexity). A function φ(x) is µ-strongly convex if φ(y) ≥ φ(x) + ⟨g(x), y − x⟩ + (µ/2)‖y − x‖² for any x, y and subgradient g(x) ∈ ∂φ(x).

Lemma 2. [14] Let φ(x) be µ-strongly convex and x* = arg min_x φ(x). Then, for any x,

  φ(x) ≥ φ(x*) + (µ/2)‖x − x*‖².  (5)
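The inequality (4) is easy to check numerically. The sketch below (a numerical sanity check, not part of the paper) verifies the bound for a quadratic f, whose gradient's Lipschitz parameter is the largest eigenvalue of its Hessian; the matrix and sample count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5], [0.5, 1.0]])  # symmetric positive definite Hessian

def f(x):          # smooth quadratic f(x) = x^T A x / 2
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

L = np.linalg.eigvalsh(A).max()  # Lipschitz parameter of grad f

# Check (4): f(y) <= f(x) + <grad f(x), y - x> + (L/2)||x - y||^2 on random pairs.
ok = True
for _ in range(100):
    x, y = rng.normal(size=2), rng.normal(size=2)
    bound = f(x) + grad_f(x) @ (y - x) + 0.5 * L * np.sum((x - y) ** 2)
    ok = ok and bool(f(y) <= bound + 1e-12)
print(ok)
```

Replacing L by anything smaller than the top eigenvalue makes the check fail for some pairs, which is one way to see that L is tight here.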
We consider the following stochastic convex optimization problem with a composite objective function:

  min_x {φ(x) ≡ E[F(x, ξ)] + ψ(x)},  (6)

where ξ is a random vector, f(x) ≡ E[F(x, ξ)] is convex and differentiable, and ψ(x) is convex but non-smooth. Clearly, this includes the optimization problem (1). Moreover, we assume that the gradient of f(x) is L-Lipschitz and φ(x) is µ-strongly convex (with µ ≥ 0). Note that when φ(x) is smooth (ψ(x) = 0), µ lower bounds the smallest eigenvalue of its Hessian.

Recall that in smooth optimization, the gradient update x_{t+1} = x_t − λ∇f(x_t) on a function f(x) can be seen as proximal regularization of the linearized f at the current iterate x_t [16]. In other words, x_{t+1} = arg min_x (⟨∇f(x_t), x − x_t⟩ + (1/(2λ))‖x − x_t‖²). With the presence of a non-smooth component, we have the following more general notion.

Definition 3 (Gradient mapping). [8] In minimizing f(x) + ψ(x), where f is convex and differentiable and ψ is convex and non-smooth,

  x_{t+1} = arg min_x (⟨∇f(x_t), x − x_t⟩ + (1/(2λ))‖x − x_t‖² + ψ(x))  (7)

is called the generalized gradient update, and δ = (1/λ)(x_t − x_{t+1}) is the (generalized) gradient mapping. Note that the quadratic approximation is made to the smooth component only. It can be shown that the gradient mapping is analogous to the gradient in smooth convex optimization [14, 8]. This is also a common construct used in recent stochastic subgradient methods [13, 17].

3 Accelerated Gradient Method for Stochastic Learning

Let G(x_t, ξ_t) ≡ ∇_x F(x, ξ_t)|_{x=x_t} be the stochastic gradient of F(x, ξ_t). We assume that it is an unbiased estimator of the gradient ∇f(x), i.e., E_ξ[G(x, ξ)] = ∇f(x). Algorithm 1 shows the proposed algorithm, which will be called SAGE (Stochastic Accelerated GradiEnt). It involves the updating of three sequences {x_t}, {y_t} and {z_t}. Note that y_t is the generalized gradient update, and x_{t+1} is a convex combination of y_t and z_t. The algorithm also maintains two parameter sequences {α_t} and {L_t}. We will see in Section 3.1 that different settings of these parameters lead to different convergence rates.
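For example, with ψ(x) = c‖x‖₁ (c being a hypothetical regularization weight) the generalized gradient update (7) has the well-known closed form of soft-thresholding, since the quadratic approximation touches only the smooth part. A minimal sketch with illustrative values:

```python
import numpy as np

def generalized_gradient_update(x_t, grad, step, c):
    """Generalized gradient update (7) with psi(x) = c * ||x||_1.
    The minimizer is the soft-thresholding of the plain gradient step."""
    v = x_t - step * grad                                        # ordinary gradient step on f
    return np.sign(v) * np.maximum(np.abs(v) - step * c, 0.0)    # prox of c*||.||_1

x_t = np.array([0.9, -0.2, 0.05])
grad = np.array([1.0, -1.0, 0.0])
y = generalized_gradient_update(x_t, grad, step=0.5, c=0.4)
print(y)  # small coordinates are thresholded exactly to zero
```

This is why l1-regularized iterates are sparse: coordinates whose gradient step lands within the threshold are set exactly to zero rather than merely shrunk.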
Note that the only expensive step of Algorithm 1 is the computation of the generalized gradient update y_t, which is analogous to the subgradient computation in other subgradient-based methods. In general, its computational complexity depends on the structure of ψ(x). As will be seen in Section 3.3, this can often be obtained efficiently in many regularized risk minimization problems.

Algorithm 1 SAGE (Stochastic Accelerated GradiEnt).
Input: Sequences {L_t} and {α_t}.
Initialize: y_{−1} = z_{−1} = 0, α_0 = λ_0 = 1, L_0 = L + µ.
for t = 0 to N do
  x_t = (1 − α_t)y_{t−1} + α_t z_{t−1}.
  y_t = arg min_x {⟨G(x_t, ξ_t), x − x_t⟩ + (L_t/2)‖x − x_t‖² + ψ(x)}.
  z_t = z_{t−1} − (L_tα_t + µ)⁻¹[L_t(x_t − y_t) + µ(z_{t−1} − x_t)].
end for
Output y_N.

3.1 Convergence Analysis

Define ∆_t ≡ G(x_t, ξ_t) − ∇f(x_t). Because of the unbiasedness of G(x_t, ξ_t), E_{ξ_t}[∆_t] = 0. In the following, we will show that the value of φ(y_t) − φ(x) can be related to that of φ(y_{t−1}) − φ(x) for any x. Let δ_t ≡ L_t(x_t − y_t) be the gradient mapping involved in updating y_t. First, we introduce the following lemma.

Lemma 3. For t ≥ 0, φ(x) is quadratically bounded from below as

  φ(x) ≥ φ(y_t) + ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + ⟨∆_t, y_t − x⟩ + ((2L_t − L)/(2L_t²))‖δ_t‖².
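A minimal runnable sketch of Algorithm 1 on a synthetic l1-regularized least-squares instance is given below. The problem, constants (L, b, mini-batch size, number of iterations) and the schedule L_t = b(t+1)^{3/2} + L, α_t = 2/(t+2) suggested by Theorem 1 are illustrative assumptions, not the authors' setup; the mini-batch gradient follows the Pegasos-style subset idea mentioned later in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of (6): f(x) = E[(a.x - b)^2 / 2], psi(x) = lam * ||x||_1.
d, lam = 5, 0.05
L = 4.0                                   # assumed bound on the Lipschitz constant of grad f
x_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])

def stochastic_grad(x, batch=8):
    A = rng.normal(size=(batch, d))
    b = A @ x_true + 0.1 * rng.normal(size=batch)
    return A.T @ (A @ x - b) / batch      # unbiased estimate G(x, xi) of grad f(x)

def prox(v, step):                        # generalized gradient update for lam*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)

b_const, N = 1.0, 2000
y = np.zeros(d)
z = np.zeros(d)
for t in range(N):
    L_t = b_const * (t + 1) ** 1.5 + L    # illustrative schedule from Theorem 1
    alpha = 2.0 / (t + 2)
    x = (1 - alpha) * y + alpha * z
    y_new = prox(x - stochastic_grad(x) / L_t, 1.0 / L_t)
    z = z - (x - y_new) / alpha           # z-update of Algorithm 1 with mu = 0
    y = y_new

print(np.round(y, 2))                     # close to the sparse x_true
```

Note how the three sequences interact: x_t mixes the cautious iterate y with the aggressive iterate z, only y_t passes through the prox (hence stays sparse), and z takes the long momentum steps.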
Proposition 1. Assume that, for each t, ‖∆_t‖ ≤ σ and L_t > L. Then

  φ(y_t) − φ(x) + ((L_tα_t² + µα_t)/2)‖x − z_t‖² ≤ (1 − α_t)[φ(y_{t−1}) − φ(x)] + (L_tα_t²/2)‖x − z_{t−1}‖² + σ²/(2(L_t − L)) + α_t⟨∆_t, x − z_{t−1}⟩.  (8)

Proof. Define V_t(x) = ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + (L_tα_t/2)‖x − z_{t−1}‖². It is easy to see that z_t = arg min_{x∈R^d} V_t(x). Moreover, notice that V_t(x) is (L_tα_t + µ)-strongly convex. Hence, on applying Lemmas 2 and 3, we obtain that for any x,

  V_t(z_t) ≤ V_t(x) − ((L_tα_t + µ)/2)‖x − z_t‖²
   = ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + (L_tα_t/2)‖x − z_{t−1}‖² − ((L_tα_t + µ)/2)‖x − z_t‖²
   ≤ φ(x) − φ(y_t) − ((2L_t − L)/(2L_t²))‖δ_t‖² + (L_tα_t/2)‖x − z_{t−1}‖² − ((L_tα_t + µ)/2)‖x − z_t‖² + ⟨∆_t, x − y_t⟩.

Then, φ(y_t) can be bounded from above as

  φ(y_t) ≤ φ(x) + ⟨δ_t, x_t − z_t⟩ − ((2L_t − L)/(2L_t²))‖δ_t‖² − (L_tα_t/2)‖z_t − z_{t−1}‖² + (L_tα_t/2)‖x − z_{t−1}‖² − ((L_tα_t + µ)/2)‖x − z_t‖² + ⟨∆_t, x − y_t⟩,  (9)

where the non-positive term −(µ/2)‖z_t − x_t‖² has been dropped from its right-hand side. On the other hand, by applying Lemma 3 with x = y_{t−1}, we get

  φ(y_t) ≤ φ(y_{t−1}) + ⟨δ_t, x_t − y_{t−1}⟩ + ⟨∆_t, y_{t−1} − y_t⟩ − ((2L_t − L)/(2L_t²))‖δ_t‖²,  (10)

where the non-positive term −(µ/2)‖y_{t−1} − x_t‖² has also been dropped from the right-hand side. On multiplying (9) by α_t and (10) by 1 − α_t, and then adding them together, we obtain

  φ(y_t) − φ(x) ≤ (1 − α_t)[φ(y_{t−1}) − φ(x)] − ((2L_t − L)/(2L_t²))‖δ_t‖² + A_t + B_t + C_t − (L_tα_t²/2)‖z_t − z_{t−1}‖²,  (11)

where A_t = ⟨δ_t, α_t(x_t − z_t) + (1 − α_t)(x_t − y_{t−1})⟩, B_t = α_t⟨∆_t, x − y_t⟩ + (1 − α_t)⟨∆_t, y_{t−1} − y_t⟩, and C_t = (L_tα_t²/2)‖x − z_{t−1}‖² − ((L_tα_t² + µα_t)/2)‖x − z_t‖². In the following, we upper bound A_t and B_t. First, by using the update rule of x_t in Algorithm 1 and Young's inequality,¹ we have

  A_t = ⟨δ_t, α_t(x_t − z_{t−1}) + (1 − α_t)(x_t − y_{t−1})⟩ + α_t⟨δ_t, z_{t−1} − z_t⟩
   = α_t⟨δ_t, z_{t−1} − z_t⟩ ≤ (L_tα_t²/2)‖z_{t−1} − z_t‖² + ‖δ_t‖²/(2L_t).  (12)

On the other hand, B_t can be bounded as

  B_t = ⟨∆_t, α_t x + (1 − α_t)y_{t−1} − x_t⟩ + ⟨∆_t, x_t − y_t⟩ = α_t⟨∆_t, x − z_{t−1}⟩ + ⟨∆_t, δ_t/L_t⟩ ≤ α_t⟨∆_t, x − z_{t−1}⟩ + (σ/L_t)‖δ_t‖,  (13)

where the second equality is due to the update rule of x_t, and the last step is from the Cauchy–Schwarz inequality and the boundedness of ∆_t. Hence, plugging (12) and (13) into (11),

  φ(y_t) − φ(x) ≤ (1 − α_t)[φ(y_{t−1}) − φ(x)] − ((L_t − L)/(2L_t²))‖δ_t‖² + (σ/L_t)‖δ_t‖ + α_t⟨∆_t, x − z_{t−1}⟩ + C_t
   ≤ (1 − α_t)[φ(y_{t−1}) − φ(x)] + σ²/(2(L_t − L)) + α_t⟨∆_t, x − z_{t−1}⟩ + C_t,

where the last step is due to the fact that −ax² + bx ≤ b²/(4a) for a, b > 0. On re-arranging terms, we obtain (8).

¹Young's inequality states that ⟨x, y⟩ ≤ ‖x‖²/(2a) + (a/2)‖y‖² for any a > 0.
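The Young's inequality quoted in the footnote is easy to confirm numerically; this tiny standalone snippet (arbitrary random values) tests it for random vectors and scalars a > 0:

```python
import numpy as np

rng = np.random.default_rng(2)
# Young's inequality: <x, y> <= ||x||^2/(2a) + (a/2)||y||^2 for any a > 0,
# used above to bound the A_t and B_t terms.
ok = True
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    a = rng.uniform(0.1, 10.0)
    ok = ok and bool(x @ y <= np.sum(x**2) / (2*a) + (a/2) * np.sum(y**2) + 1e-12)
print(ok)
```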
Let the optimal solution of problem (6) be x*. From the update rules in Algorithm 1, we observe that the triple (x_t, y_t, z_t) depends on the random process ξ[t] ≡ {ξ_0, ..., ξ_t} and hence is also random. Clearly, z_{t−1} and x_t are independent of ξ_t. Thus,

  E_{ξ[t]}[⟨∆_t, x* − z_{t−1}⟩] = E_{ξ[t−1]}[E_{ξ[t]}[⟨∆_t, x* − z_{t−1}⟩ | ξ[t−1]]] = E_{ξ[t−1]}[⟨E_{ξ_t}[∆_t], x* − z_{t−1}⟩] = 0,

where the first equality uses E_x[h(x)] = E_y[E_x[h(x) | y]], and the last equality is from our assumption that the stochastic gradient G(x, ξ) is unbiased. Taking expectations on both sides of (8) with x = x*, we obtain the following corollary, which will be useful in proving the subsequent theorems.

Corollary 1.

  E[φ(y_t)] − φ(x*) + ((L_tα_t² + µα_t)/2)E[‖x* − z_t‖²] ≤ (1 − α_t)(E[φ(y_{t−1})] − φ(x*)) + (L_tα_t²/2)E[‖x* − z_{t−1}‖²] + σ²/(2(L_t − L)).

So far, the choice of L_t and α_t in Algorithm 1 has been left unspecified. In the following, we show that with a good choice of L_t and α_t, (the expectation of) φ(y_t) converges rapidly to φ(x*).

Theorem 1. Assume that E[‖x* − z_t‖²] ≤ D² for some D. Set

  L_t = b(t + 1)^{3/2} + L,  α_t = 2/(t + 2),  (14)

where b > 0 is a constant. Then the expected error of Algorithm 1 can be bounded as

  E[φ(y_N)] − φ(x*) ≤ 3D²L/N² + (3D²b + 5σ²/(3b))(1/√N).  (15)

If σ were known, we could set b to the optimal choice of √5σ/(3D), and the bound in (15) becomes 3D²L/N² + 2√5σD/√N.

Note that so far φ(x) is only assumed to be convex. As shown in the following theorem, the convergence rate can be further improved by assuming strong convexity. This also requires a setting of α_t and L_t different from that in (14).

Theorem 2. Assume the same conditions as in Theorem 1, except that φ(x) is µ-strongly convex. Set

  L_t = L + µ/λ_t for t ≥ 0;  α_t = (√(λ_{t−1}² + 4λ_{t−1}) − λ_{t−1})/2 for t ≥ 1,  (16)

where λ_t ≡ Π_{k=1}^t (1 − α_k) for t ≥ 1 and λ_0 = 1. Then, the expected error of Algorithm 1 can be bounded as

  E[φ(y_N)] − φ(x*) ≤ (L + µ)D²/N² + 6σ²/(Nµ).  (17)

In comparison, FOLOS only converges as O(log(N)/N) for strongly convex objectives.

3.2 Remarks

As in recent studies on stochastic composite optimization [13], the error bounds in (15) and (17) consist of two terms: a faster term which is related to the smooth component and a slower term related to the non-smooth component.
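The optimal choice of b in Theorem 1 is a one-line calculus exercise: the coefficient of the 1/√N term, g(b) = 3D²b + 5σ²/(3b), is minimized at b* = √5σ/(3D) with minimum value 2√5σD. A quick numerical check (σ and D are arbitrary illustrative values):

```python
import numpy as np

sigma, D = 2.0, 0.5   # illustrative values of the subgradient noise bound and iterate radius

def g(b):
    """Coefficient of the slow 1/sqrt(N) term in Theorem 1's bound."""
    return 3 * D**2 * b + 5 * sigma**2 / (3 * b)

b_star = np.sqrt(5) * sigma / (3 * D)          # claimed minimizer
bs = np.linspace(0.1 * b_star, 10 * b_star, 10000)  # grid around b_star
print(g(b_star), 2 * np.sqrt(5) * sigma * D)   # the two values coincide
```

Both printed numbers agree, and no grid point improves on b*, confirming the closed-form minimizer.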
SAGE benefits from using the structure of the problem and accelerates the convergence of the smooth component. On the other hand, many stochastic (sub)gradient-based algorithms do not separate the smooth from the non-smooth part, but simply treat the whole objective as non-smooth. Consequently, convergence of the smooth component is also slowed down to O(1/√N). As can be seen from (15) and (17), the convergence of SAGE is essentially encumbered by the variance of the stochastic subgradient. Recall that the variance of the average of p i.i.d. random
variables is equal to 1/p of the original variance. Hence, as in Pegasos [1], σ can be reduced by estimating the subgradient from a data subset.

Unlike the AC-SA algorithm in [13], the settings of L_t and α_t in (14) do not require knowledge of σ and the number of iterations N, both of which can be difficult to estimate in practice. Moreover, with the use of a sparsity-promoting ψ(x), SAGE can produce a sparse solution (as will be experimentally demonstrated in Section 5) while AC-SA cannot. This is because in SAGE, the output y_t is obtained from a generalized gradient update. With a sparsity-promoting ψ(x), this reduces to a (soft) thresholding step, and thus ensures a sparse solution. On the other hand, in each iteration of AC-SA, the output is a convex combination of two other variables. Unfortunately, adding two vectors is unlikely to produce a sparse vector.

3.3 Efficient Computation of y_t

The computational efficiency of Algorithm 1 hinges on the efficient computation of y_t. Recall that y_t is just the generalized gradient update, and so is not significantly more expensive than the gradient update in traditional algorithms. Indeed, the generalized gradient update is often a central component in various optimization and machine learning algorithms. In particular, Duchi and Singer [3] showed how this can be efficiently computed for various smooth and non-smooth regularizers, including the l1, l2, l2², l∞, Berhu and matrix norms. Interested readers are referred to [3] for details.

4 Accelerated Gradient Method for Online Learning

In this section, we extend the proposed accelerated gradient scheme for online learning of (1). The algorithm, shown in Algorithm 2, is similar to the stochastic version in Algorithm 1.

Algorithm 2 SAGE-based Online Learning Algorithm.
Inputs: Sequences {L_t} and {α_t}, where L_t > L and 0 < α_t < 1.
Initialize: z_0 = y_0.
loop
  x_t = (1 − α_t)y_{t−1} + α_t z_{t−1}.
  Output y_t = arg min_x {⟨∇f_t(x_t), x − x_t⟩ + (L_t/2)‖x − x_t‖² + ψ(x)}.
  z_t = z_{t−1} − α_t(L_tα_t² + µα_t)⁻¹[L_t(x_t − y_t) + µ(z_{t−1} − x_t)].
end loop

First, we introduce the following lemma, which plays a similar role as its stochastic counterpart, Lemma 3. Moreover, let δ_t ≡ L_t(x_t − y_t) be the gradient mapping related to the updating of y_t.

Lemma 4.
For t ≥ 1, φ_t(x) can be quadratically bounded from below as

  φ_t(x) ≥ φ_t(y_t) + ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + ((2L_t − L)/(2L_t²))‖δ_t‖².

Proposition 2. For any x and t ≥ 1, assume that there exists a subgradient ĝ(x) ∈ ∂ψ(x) such that ‖∇f_t(x) + ĝ(x)‖ ≤ Q. Then, for Algorithm 2,

  φ_t(y_{t−1}) − φ_t(x) ≤ Q²/(2(1 − α_t)(L_t − L)) + (L_tα_t/2)‖x − z_{t−1}‖² − ((L_tα_t + µ)/2)‖x − z_t‖² + (α_t(1 − α_t)(L_t − L)/2)‖y_{t−1} − z_{t−1}‖² + ((1 − α_t)L_t/2)‖z_t − y_{t−1}‖² − (L_t/2)‖z_t − y_t‖².  (18)

Proof sketch. Define τ_t = L_tα_t. From the update rule of z_t, one can check that

  z_t = arg min_x V_t(x) ≡ ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + (τ_t/2)‖x − z_{t−1}‖².

Similar to the analysis in obtaining (9), we can obtain

  φ_t(y_t) − φ_t(x) ≤ ⟨δ_t, x_t − z_t⟩ − ((2L_t − L)/(2L_t²))‖δ_t‖² − (τ_t/2)‖z_t − z_{t−1}‖² + (τ_t/2)‖x − z_{t−1}‖² − ((τ_t + µ)/2)‖x − z_t‖².  (19)
On the other hand,

  ⟨δ_t, x_t − z_t⟩ − ‖δ_t‖²/(2L_t) = (L_t/2)(‖z_t − x_t‖² − ‖z_t − y_t‖²) ≤ (L_tα_t/2)‖z_t − z_{t−1}‖² + (L_t(1 − α_t)/2)‖z_t − y_{t−1}‖² − (L_t/2)‖z_t − y_t‖²,  (20)

on using the update rule of x_t and the convexity of ‖·‖². Using (20), the inequality (19) becomes

  φ_t(y_t) − φ_t(x) ≤ (L_t(1 − α_t)/2)‖z_t − y_{t−1}‖² − (L_t/2)‖z_t − y_t‖² − ((L_t − L)/(2L_t²))‖δ_t‖² + (τ_t/2)‖x − z_{t−1}‖² − ((τ_t + µ)/2)‖x − z_t‖².  (21)

On the other hand, by the convexity of φ_t(x) and Young's inequality, we have

  φ_t(y_{t−1}) − φ_t(y_t) ≤ ⟨∇f_t(y_{t−1}) + ĝ(y_{t−1}), y_{t−1} − y_t⟩ ≤ Q²/(2(1 − α_t)(L_t − L)) + ((1 − α_t)(L_t − L)/2)‖y_{t−1} − y_t‖².  (22)

Moreover, by using the update rule of x_t and the convexity of ‖·‖², we have

  ‖y_{t−1} − y_t‖² = ‖(y_{t−1} − x_t) + (x_t − y_t)‖² = ‖α_t(y_{t−1} − z_{t−1}) + (x_t − y_t)‖² ≤ α_t‖y_{t−1} − z_{t−1}‖² + ‖x_t − y_t‖²/(1 − α_t) = α_t‖y_{t−1} − z_{t−1}‖² + ‖δ_t‖²/((1 − α_t)L_t²).  (23)

On using (23), it follows from (22) that

  φ_t(y_{t−1}) − φ_t(y_t) ≤ Q²/(2(1 − α_t)(L_t − L)) + (α_t(1 − α_t)(L_t − L)/2)‖y_{t−1} − z_{t−1}‖² + ((L_t − L)/(2L_t²))‖δ_t‖².

Inequality (18) then follows immediately by adding this to (21).

Theorem 3. Assume that µ = 0, and ‖x* − z_t‖ ≤ D for all t. Set α_t = a and L_t = aL√t + L, where a ∈ (0, 1) is a constant. Then the regret of Algorithm 2 can be bounded as

  Σ_{t=1}^N [φ_t(y_{t−1}) − φ_t(x*)] ≤ aLD² + (a²LD²/2 + Q²/(a(1 − a)L))√N.

Theorem 4. Assume that µ > 0, and ‖x* − z_t‖ ≤ D for all t. Set α_t = a/t and L_t = aµt + L + a²(µ − L), where a ∈ (0, 1) is a constant. Then the regret of Algorithm 2 can be bounded as

  Σ_{t=1}^N [φ_t(y_{t−1}) − φ_t(x*)] ≤ ((a² + a³)µ + (a − a³)L)(D²/2) + (Q²/(2a(1 − a)µ)) log(N + 1).

In particular, with a = 1/2, the regret bound reduces to (3/16)(µ + L)D² + (2Q²/µ) log(N + 1).

5 Experiments

In this section, we perform experiments on the stochastic optimization of (1). Two data sets are used (Table 1). The first one is the pcmac data set, which is a subset of the 20-newsgroup data set from [18], while the second one is the RCV1 data set, which is a filtered collection of the Reuters RCV1 from [19].² We choose the square loss for l(·,·) and the l1 regularizer for Ω(·) in (1). As discussed in Section 3.3 and [3], the generalized gradient update can be efficiently computed by soft thresholding in this case. Moreover, we do not use strong convexity and so µ = 0.

We compare the proposed SAGE algorithm (with L_t and α_t in (14)) with three recent algorithms: (1) FOLOS [3]; (2) SMIDAS [4]; and (3) SCD [4]. For fair comparison, we compare their convergence

²Downloaded from http://people.cs.uchicago.edu/~vikass/svmlin.html and http://www.cs.ucsb.edu/~wychen/sc.html.
behavior w.r.t. both the number of iterations and the number of data access operations, the latter of which has been advocated in [4] as an implementation-independent measure of time. Moreover, the efficiency tricks for sparse data described in [4] are also implemented. Following [4], we set the regularization parameter λ in (1) to 10⁻⁶. The η parameter is searched over the range {10⁻⁶, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹}, and the one with the lowest l1-regularized loss is used. As in Pegasos [1], the (sub)gradient is computed from small sample subsets. The subset size p is set to min(0.01m, 500), where m is the data set size. This is used in all the algorithms except SCD, since SCD is based on coordinate descent and is quite different from the other stochastic subgradient algorithms.³ All the algorithms are trained with the same maximum amount of time (i.e., number of data access operations).

Table 1: Summary of the data sets.

  data set   #features   #instances   sparsity
  pcmac      7,511       1,946        .73%
  RCV1       47,236      193,844      .%

Results are shown in Figure 1. As can be seen, SAGE requires much fewer iterations for convergence than the others (Figures 1(a) and 1(e)). Moreover, the additional costs of maintaining x_t and z_t are small, and the most expensive step in each iteration is computing the generalized gradient update. Hence, SAGE's per-iteration complexity is comparable with the other (sub)gradient schemes, and its convergence in terms of the number of data access operations is still the fastest (Figures 1(b), 1(c), 1(f) and 1(g)). Moreover, the sparsity of its solution is comparable with those of the other algorithms (Figures 1(d) and 1(h)).

[Figure 1, panels (a)-(d): l1-regularized loss versus number of iterations and number of data accesses, test error, and density of w on pcmac.]
[Figure 1, panels (e)-(h): the corresponding plots on RCV1.]

Figure 1: Performance of the various algorithms on the pcmac (upper) and RCV1 (lower) data sets.

6 Conclusion

In this paper, we developed a novel accelerated gradient method (SAGE) for stochastic convex composite optimization. It enjoys the computational simplicity and scalability of traditional (sub)gradient methods but is much faster, both theoretically and empirically. Experimental results show that SAGE outperforms recent (sub)gradient descent methods. Moreover, SAGE can also be extended to online learning, obtaining the best regret bounds currently known.

Acknowledgment

This research has been partially supported by the Research Grants Council of the Hong Kong Special Administrative Region under grant 659.

³For the same reason, an iteration in SCD is also very different from an iteration in the other algorithms. Hence, SCD is not shown in the plots of the regularized loss versus the number of iterations.
References

[1] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pages 807-814, Corvallis, Oregon, USA, 2007.
[2] A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10:1737-1754, 2009.
[3] J. Duchi and Y. Singer. Online and batch learning using forward looking subgradients. Technical report, 2009.
[4] S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1 regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning, pages 929-936, Montreal, Quebec, Canada, 2009.
[5] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems. 2008.
[6] S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, pages 928-935, Helsinki, Finland, 2008.
[7] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR (translated as Soviet Math. Dokl.), 269:543-547, 1983.
[8] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76, Catholic University of Louvain, September 2007.
[9] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[10] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58:267-288, 1996.
[11] S. Ji, L. Sun, R. Jin, and J. Ye. Multi-label multiple kernel learning. In Advances in Neural Information Processing Systems. 2009.
[12] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In Proceedings of the International Conference on Machine Learning, Montreal, Canada, 2009.
[13] G. Lan. An optimal method for stochastic composite optimization.
Technical report, School of Industrial and Systems Engineering, Georgia Institute of Technology, 2009.
[14] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2003.
[15] S.M. Kakade and S. Shalev-Shwartz. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems. 2009.
[16] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.
[17] S.J. Wright, R.D. Nowak, and M.A.T. Figueiredo. Sparse reconstruction by separable approximation. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Las Vegas, Nevada, USA, March 2008.
[18] V. Sindhwani and S.S. Keerthi. Large scale semi-supervised linear SVMs. In Proceedings of the SIGIR Conference on Research and Development in Information Retrieval, pages 477-484, Seattle, WA, USA, 2006.
[19] Y. Song, W.Y. Chen, H. Bai, C.J. Lin, and E.Y. Chang. Parallel spectral clustering. In Proceedings of the European Conference on Machine Learning, pages 374-389, Antwerp, Belgium, 2008.