On the k-support and Related Norms

On the k-support and Related Norms
Massimiliano Pontil
Department of Computer Science, Centre for Computational Statistics and Machine Learning, University College London
(Joint work with Andrew McDonald and Dimitris Stamos)
Sestri Levante, Sept 2014

Plan
- Problem
- Spectral regularization
- k-support norm
- Box norm
- Link to cluster norm

Problem
Learn a matrix from a set of linear measurements:
$y_i = \langle W, X_i \rangle + \mathrm{noise}_i, \quad i = 1, \dots, n$
Method:
$\min_{W \in \mathbb{R}^{d \times m}} \sum_{i=1}^n (y_i - \langle W, X_i \rangle)^2 + \lambda \Omega(W)$
- Matrix completion: $X_i = e_r e_c^\top$
- Multitask learning: $X_i = e_r x_i^\top$
The regularizer $\Omega$ encourages matrix structure.
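The two measurement operators can be illustrated in a few lines (a minimal NumPy sketch, not the experimental code from the talk; the convention that rows of $W$ index tasks is an assumption made here):

```python
import numpy as np

d, m = 3, 4
W = np.arange(12.0).reshape(d, m)  # the matrix to be learned

# Matrix completion: X_i = e_r e_c^T, so <W, X_i> picks out the entry W[r, c]
r, c = 1, 2
X = np.outer(np.eye(d)[r], np.eye(m)[c])
assert np.isclose((W * X).sum(), W[r, c])

# Multitask learning: X_i = e_r x_i^T, so <W, X_i> evaluates row r of W on x_i
x = np.array([1.0, -1.0, 2.0, 0.0])
X = np.outer(np.eye(d)[r], x)
assert np.isclose((W * X).sum(), W[r] @ x)
```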

Spectral Regularization
$\min_{W \in \mathbb{R}^{d \times m}} \sum_{i=1}^n (y_i - \langle W, X_i \rangle)^2 + \lambda \Omega(W)$
- $\Omega$ favors matrix structure (low rank, low variance, clustering, etc.)
- Choose an orthogonally invariant (OI) norm: $\|W\| = \|UWV\|$ for all orthogonal $U$, $V$
- von Neumann (1937): $\|W\| = g(\sigma(W))$, where $g$ is a symmetric gauge (SG) function
- A well-studied example is the trace norm: $g(\cdot) = \|\cdot\|_1$
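The orthogonal-invariance property is easy to sanity-check numerically; the sketch below (an illustration, not part of the talk) verifies it for the trace norm using random orthogonal factors obtained from a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))

# random orthogonal U (4x4) and V (3x3) via QR
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))

# trace norm: g = l1 norm applied to the singular values
trace_norm = lambda A: np.linalg.svd(A, compute_uv=False).sum()

# ||U W V|| = ||W|| for any orthogonal U, V
assert np.isclose(trace_norm(U @ W @ V), trace_norm(W))
```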

k-support Norm [Argyriou et al. 2012]
Special case of the group lasso with overlap [Jacob et al. 2009]:
$\|w\|_{(k)} = \inf \Big\{ \sum_{J \in \mathcal{G}_k} \|v_J\|_2 : \sum_{J \in \mathcal{G}_k} v_J = w, \ \mathrm{supp}(v_J) \subseteq J \Big\}$
where $\mathcal{G}_k$ is the collection of subsets $J \subseteq \{1, \dots, d\}$ with $|J| \le k$.
- Includes the $\ell_1$-norm ($k = 1$) and the $\ell_2$-norm ($k = d$)
- The unit ball of $\|\cdot\|_{(k)}$ is the convex hull of $\{w : \mathrm{card}(w) \le k, \ \|w\|_2 \le 1\}$
- Dual norm: $\|u\|_{*,(k)} = \big( \sum_{i=1}^k (|u|^\downarrow_i)^2 \big)^{1/2}$, the $\ell_2$-norm of the $k$ largest-magnitude components of $u$
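The dual norm is easy to evaluate directly, which gives a quick check of the two limiting cases (a hedged sketch; `ksupport_dual` is a name introduced here for illustration):

```python
import numpy as np

def ksupport_dual(u, k):
    """l2 norm of the k largest-magnitude entries of u."""
    top = np.sort(np.abs(np.asarray(u, float)))[::-1][:k]
    return float(np.linalg.norm(top))

u = [3.0, -4.0, 0.0]
assert ksupport_dual(u, 1) == 4.0   # k = 1: l_inf norm (dual of l1)
assert ksupport_dual(u, 2) == 5.0   # top two entries: sqrt(16 + 9)
assert ksupport_dual(u, 3) == 5.0   # k = d: l2 norm (self-dual)
```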

Spectral k-support Norm
The k-support norm is a symmetric gauge function, so it induces the OI-norm
$\|W\|_{(k)} := \|\sigma(W)\|_{(k)}$
Proposition. The unit ball of $\|\sigma(\cdot)\|_{(k)}$ is the convex hull of $\{W : \mathrm{rank}(W) \le k, \ \|W\|_F \le 1\}$.
Includes the trace norm ($k = 1$) and the Frobenius norm ($k = d$).
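The two extreme cases can be checked directly on the singular values (a minimal sketch; the 2x2 matrix is an arbitrary example):

```python
import numpy as np

W = np.array([[1.0, 2.0], [3.0, 4.0]])
s = np.linalg.svd(W, compute_uv=False)

# k = 1: spectral k-support norm = trace norm = sum of singular values.
# Here sigma1*sigma2 = |det W| = 2 and sigma1^2 + sigma2^2 = ||W||_F^2 = 30,
# so (sigma1 + sigma2)^2 = 30 + 2*2 = 34.
assert np.isclose(s.sum(), np.sqrt(34))

# k = d: spectral k-support norm = Frobenius norm
assert np.isclose(np.linalg.norm(s), np.linalg.norm(W, "fro"))
```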

Matrix Completion Experiment

dataset               | norm | test error | r  | k    | a
----------------------|------|------------|----|------|------
ML 100k (ρ = 50%)     | tr   | 0.2017     | 13 | -    | -
                      | en   | 0.2017     | 13 | -    | -
                      | ks   | 0.1990     | 9  | 1.87 | -
                      | box  | 0.1989     | 10 | 2.00 | 1e-5
ML 1M (ρ = 50%)       | tr   | 0.1790     | 17 | -    | -
                      | en   | 0.1789     | 15 | -    | -
                      | ks   | 0.1782     | 17 | 1.80 | -
                      | box  | 0.1777     | 19 | 2.00 | 1e-6
Jester1 (20 per line) | tr   | 0.1752     | 11 | -    | -
                      | en   | 0.1752     | 11 | -    | -
                      | ks   | 0.1739     | 11 | 6.38 | -
                      | box  | 0.1726     | 11 | 6.40 | 2e-5

MTL Experiment
Table: Multitask learning clustering on the Lenk dataset, with simple thresholding.

dataset           | norm  | test error    | k    | a
------------------|-------|---------------|------|-------
Lenk (8 per task) | fr    | 3.7869 (0.07) | -    | -
                  | tr    | 1.9058 (0.04) | -    | -
                  | en    | 1.8974 (0.04) | -    | -
                  | ks    | 1.8933 (0.04) | 1.02 | -
                  | box   | 1.8916 (0.04) | 1.01 | 5.5e-3
                  | c-fr  | 1.8667 (0.08) | -    | -
                  | c-tr  | 1.7904 (0.03) | -    | -
                  | c-en  | 1.7896 (0.03) | -    | -
                  | c-ks  | 1.7775 (0.03) | 1.89 | -
                  | c-box | 1.7754 (0.03) | 1.12 | 9.5e-3

Box Norm
Let $\Theta \subset \mathbb{R}^d_{++}$ be bounded and convex, and consider the norm
$\|w\|^2_\Theta = \inf_{\theta \in \Theta} \sum_{i=1}^d \frac{w_i^2}{\theta_i}$
Box norm: $\Theta = \big\{\theta : a < \theta_i \le b, \ \sum_{i=1}^d \theta_i \le c\big\}$
Dual norm: $\|u\|^2_{*,\Theta} = \sup_{\theta \in \Theta} \sum_{i=1}^d \theta_i u_i^2$
- Includes the k-support norm for $a = 0$, $b = 1$, $c = k$
- The unit ball is the convex hull of $\big\{ w \in \mathbb{R}^d : \sum_{i \in J} \frac{w_i^2}{b} + \sum_{i \notin J} \frac{w_i^2}{a} \le 1, \ J \subseteq \{1,\dots,d\}, \ |J| \le k \big\}$
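The variational definition can be probed numerically (a sketch using `scipy.optimize`, purely for illustration, not the method used in the talk). When $c \ge db$ the sum constraint is inactive, the objective is decreasing in each $\theta_i$, and the infimum is attained at $\theta_i = b$, so $\|w\|_\Theta^2 = \|w\|_2^2 / b$:

```python
import numpy as np
from scipy.optimize import minimize

w = np.array([3.0, 4.0])
a, b, c = 0.1, 1.0, 5.0  # c >= d*b, so the constraint sum(theta) <= c is inactive

obj = lambda theta: np.sum(w**2 / theta)
res = minimize(
    obj,
    x0=np.full(len(w), 0.5),
    method="SLSQP",
    bounds=[(a, b)] * len(w),
    constraints=[{"type": "ineq", "fun": lambda t: c - t.sum()}],
)

# the objective decreases in each theta_i, so the optimum sits at theta_i = b
assert np.isclose(res.fun, np.sum(w**2) / b, atol=1e-4)
```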

Unit Balls
Figure: Unit balls of the box norm in $\mathbb{R}^2$ for $k = 1$, $a \in \{0.01, 0.25, 0.50\}$.
Figure: Unit balls of the dual box norm in $\mathbb{R}^2$ for $k = 1$, $a \in \{0.01, 0.25, 0.50\}$.

Cluster Norm
The box norm is a symmetric gauge function, so it induces the OI-norm
$\|W\|^2_\Theta = \|\sigma(W)\|^2_\Theta = \inf \Big\{ \sum_{i=1}^d \frac{\sigma_i(W)^2}{\theta_i} : \theta \in (a, b]^d, \ \sum_{i=1}^d \theta_i \le c \Big\}$
The associated OI-norm has been used to favour task clustering [Jacob et al. 2008]. It can be written as
$\|W\|^2_\Theta = \inf \big\{ \mathrm{tr}(W \Sigma^{-1} W^\top) : aI \preceq \Sigma \preceq bI, \ \mathrm{tr}\,\Sigma \le c \big\}$
Includes the spectral k-support norm for $a = 0$, $b = 1$, $c = k$.

Interpretation of a
Proposition. If $c = da + k(b - a)$, the solution of the regularization problem is given by $\hat{W} = \hat{V} + \hat{Z}$, where
$(\hat{V}, \hat{Z}) = \arg\min_{V, Z} \sum_{i=1}^n (y_i - \langle V + Z, X_i \rangle)^2 + \lambda \Big( \frac{1}{a} \|V\|_F^2 + \frac{1}{b - a} \|Z\|_{(k)}^2 \Big)$
The parameter $a$ balances the relative importance of the two components.
The cluster norm is the Moreau envelope of the spectral k-support norm:
$\|W\|^2_\Theta = \min_{Z \in \mathbb{R}^{d \times m}} \Big\{ \frac{1}{a} \|W - Z\|_F^2 + \frac{1}{b - a} \|Z\|_{(k)}^2 \Big\}$
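For the corner case $k = d$ (where the spectral k-support norm reduces to the Frobenius norm and $c = da + d(b-a) = db$) the envelope can be evaluated in closed form, giving a quick consistency check against the box norm (a sketch under that assumption; all numbers are illustrative):

```python
import numpy as np

a, b = 0.25, 1.0
rng = np.random.default_rng(1)
W = rng.standard_normal((3, 3))

# k = d: ||Z||_(k) = ||Z||_F, and the quadratic problem
#   min_Z (1/a)||W - Z||_F^2 + (1/(b-a))||Z||_F^2
# has the closed-form minimizer Z* = ((b - a)/b) W
Z = ((b - a) / b) * W
envelope = np.linalg.norm(W - Z)**2 / a + np.linalg.norm(Z)**2 / (b - a)

# with c = d*b the sum constraint is inactive at theta_i = b, so the box
# norm at k = d is ||W||_F^2 / b, matching the envelope
assert np.isclose(envelope, np.linalg.norm(W)**2 / b)
```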

Computation of the Θ Norm
Assume w.l.o.g. $w \ge 0$ with nonincreasing components. Then
$\|w\|^2_\Theta = \frac{1}{b} \|w_{[1:q]}\|_2^2 + \frac{1}{c - qb - \ell a} \|w_{[q+1:d-\ell]}\|_1^2 + \frac{1}{a} \|w_{[d-\ell+1:d]}\|_2^2$
where $q, \ell \in \{0, \dots, d\}$ are uniquely determined.
In particular, for the k-support norm:
$\|w\|_{(k)} = \Big( \|w_{[1:q]}\|_2^2 + \frac{1}{k - q} \|w_{[q+1:d]}\|_1^2 \Big)^{1/2}$
where $q \in \{0, \dots, k-1\}$ is determined by
$w_q > \frac{1}{k - q} \sum_{j=q+1}^d w_j \ge w_{q+1}$ (with the convention $w_0 := +\infty$)
- Computing the norm costs $O(d \log d)$
- For the k-support norm this improves on a previous $O(kd)$ method
- Efficient optimization using proximal-gradient methods
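The k-support special case can be implemented directly from the formula above (a sketch, not the authors' code; `ksupport_norm` is a name introduced here):

```python
import numpy as np

def ksupport_norm(w, k):
    """k-support norm via the O(d log d) formula: sort |w|, find the split q."""
    z = np.sort(np.abs(np.asarray(w, float)))[::-1]  # |w| in nonincreasing order
    for q in range(k):
        tail = z[q:].sum() / (k - q)
        upper = np.inf if q == 0 else z[q - 1]        # convention w_0 = +inf
        if upper > tail >= z[q]:                      # w_q > tail >= w_{q+1}
            return float(np.sqrt(np.sum(z[:q]**2) + (k - q) * tail**2))
    raise ValueError("no valid split found")

w = [3.0, 1.0, 1.0]
assert np.isclose(ksupport_norm(w, 1), 5.0)            # k = 1: l1 norm
assert np.isclose(ksupport_norm(w, 3), np.sqrt(11.0))  # k = d: l2 norm
assert np.isclose(ksupport_norm(w, 2), np.sqrt(13.0))  # intermediate case
```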

Extensions / Open Problems
- Other sets $\Theta$ also allow an exact prox computation, e.g. $\Theta = \{\theta : \theta_1 \ge \dots \ge \theta_d > 0\}$. Can we give a general characterization?
- Online learning / stochastic optimization
- Kernel extensions