Big learning: challenges and opportunities
Francis Bach, SIERRA Project-team, INRIA - Ecole Normale Supérieure
December 2013
Scientific context
- Omnipresent digital media: big data from multimedia, sensors, indicators, social networks
- All levels: personal, professional, scientific, industrial
- Too large and/or complex for manual processing
- Computational challenges: dealing with large databases
- Statistical challenges: what can be predicted from such databases, and how? Looking for hidden information
- Opportunities (and threats)
Machine learning for big data
Large-scale machine learning: large p, large n, large k
- p: dimension of each observation (input)
- n: number of observations
- k: number of tasks (dimension of outputs)
Examples: computer vision, bioinformatics, etc.
Object recognition
Learning for bioinformatics - Proteins
- Crucial components of cell life
- Predicting multiple functions and interactions
- Massive data: up to 1 million for humans!
- Complex data: amino-acid sequence, link with DNA, three-dimensional molecule
Search engines - advertising
Advertising - recommendation
Machine learning for big data
Large-scale machine learning: large p, large n, large k
- p: dimension of each observation (input)
- n: number of observations
- k: number of tasks (dimension of outputs)
Examples: computer vision, bioinformatics, etc.
Two main challenges:
1. Computational: ideal running-time complexity = O(pn + kn)
2. Statistical: meaningful results
Big learning: challenges and opportunities - Outline
- Scientific context
  - Big data: need for supervised and unsupervised learning
- Beyond stochastic gradient for supervised learning
  - Few passes through the data
  - Provable robustness and ease of use
- Matrix factorization for unsupervised learning
  - Looking for hidden information through dictionary learning
  - Feature learning
Supervised machine learning
- Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n
- Prediction as a linear function θ^⊤Φ(x) of features Φ(x) ∈ R^p
- (Regularized) empirical risk minimization: find θ̂ solution of
    min_{θ ∈ R^p} (1/n) Σ_{i=1}^n ℓ(y_i, θ^⊤Φ(x_i)) + µ Ω(θ)
  i.e., a convex data fitting term plus a regularizer
- Applications to any data-oriented field: computer vision, bioinformatics, natural language processing, etc.
- Main practical challenges:
  - Designing/learning good features Φ(x)
  - Efficiently solving the optimization problem
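As a concrete illustration (not from the original slides), here is a minimal sketch of this regularized empirical risk minimization in Python, assuming the logistic loss ℓ(y, s) = log(1 + e^{-ys}), the squared-ℓ2 regularizer Ω(θ) = (1/2)||θ||², and a feature matrix whose rows are the Φ(x_i); all function names are illustrative.

```python
import numpy as np

def erm_objective(theta, Phi, y, mu):
    """(1/n) sum_i log(1 + exp(-y_i theta^T Phi(x_i))) + (mu/2) ||theta||^2."""
    margins = y * (Phi @ theta)              # margins m_i = y_i theta^T Phi(x_i)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * mu * theta @ theta

def erm_gradient(theta, Phi, y, mu):
    """Gradient of the objective above with respect to theta."""
    margins = y * (Phi @ theta)
    coeffs = -y / (1.0 + np.exp(margins))    # derivative of the logistic loss at each margin
    return Phi.T @ coeffs / len(y) + mu * theta
```

The batch and stochastic methods discussed next differ only in how many of these per-observation gradients they use at each iteration.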
Stochastic vs. deterministic methods
Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, θ^⊤Φ(x_i)) + µ Ω(θ)
- Batch gradient descent: θ_t = θ_{t-1} − γ_t g′(θ_{t-1}) = θ_{t-1} − (γ_t/n) Σ_{i=1}^n f_i′(θ_{t-1})
  - Linear (e.g., exponential) convergence rate in O(e^{−αt})
  - Iteration complexity is linear in n (with line search)
- Stochastic gradient descent: θ_t = θ_{t-1} − γ_t f′_{i(t)}(θ_{t-1})
  - Sampling with replacement: i(t) random element of {1, ..., n}
  - Convergence rate in O(1/t)
  - Iteration complexity is independent of n (step size selection?)
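The contrast is easy to see in code. The sketch below is illustrative only: it uses synthetic data, squared loss in place of a generic ℓ, and hand-picked step sizes. Each iteration performs one batch gradient step, which touches all n observations, next to one stochastic gradient step with a decaying step size γ_t, which touches a single random observation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, mu = 1000, 20, 0.1
Phi = rng.standard_normal((n, p))
y = Phi @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

def grad_fi(theta, i):
    """Gradient of f_i(theta) = (1/2)(y_i - theta^T Phi(x_i))^2 + (mu/2)||theta||^2."""
    return (Phi[i] @ theta - y[i]) * Phi[i] + mu * theta

theta_batch = np.zeros(p)
theta_sgd = np.zeros(p)
for t in range(1, 51):
    # Batch gradient descent: average of all n gradients, O(n) cost per iteration
    full_grad = np.mean([grad_fi(theta_batch, i) for i in range(n)], axis=0)
    theta_batch -= 0.1 * full_grad
    # Stochastic gradient descent: one sampled gradient, O(1) cost per iteration,
    # with a decaying step size gamma_t (an illustrative schedule)
    i = rng.integers(n)
    theta_sgd -= (0.5 / (10.0 + t)) * grad_fi(theta_sgd, i)
```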
Stochastic vs. deterministic methods
Goal = best of both worlds:
- Linear rate with O(1) iteration cost
- Robustness to step size
[Schematic plot: log(excess cost) versus time for the stochastic, deterministic, and hybrid methods]
Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012)
Stochastic average gradient (SAG) iteration
- Keep in memory the gradients of all functions f_i, i = 1, ..., n
- Random selection i(t) ∈ {1, ..., n} with replacement
- Iteration: θ_t = θ_{t-1} − (γ_t/n) Σ_{i=1}^n y_i^t, with y_i^t = f_i′(θ_{t-1}) if i = i(t), and y_i^t = y_i^{t-1} otherwise
- Stochastic version of the incremental average gradient method (Blatt et al., 2008)
- Simple implementation
- Extra memory requirement: same size as original data (or less)
- Simple/robust constant step size
Stochastic average gradient - Convergence analysis
- Assume each f_i is L-smooth and g = (1/n) Σ_{i=1}^n f_i is µ-strongly convex
- Constant step size γ_t = 1/(16L). If µ ≥ 2L/n, there exists C ∈ R such that for all t ≥ 0,
    E[g(θ_t) − g(θ_*)] ≤ C exp(−t/(8n))
- Linear convergence rate with iteration cost independent of n
- After each pass through the data, constant error reduction
- Breaking two lower bounds
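Here is a minimal sketch of the SAG iteration, reusing the squared-loss setup of the previous snippet. Storing one gradient per f_i and maintaining their running sum incrementally is one possible implementation, not the authors' reference code, and the step size simply instantiates the constant γ = 1/(16L) with a crude smoothness estimate.

```python
import numpy as np

def sag(Phi, y, mu, n_epochs=10, seed=0):
    """Stochastic average gradient (sketch) for f_i = (1/2)(y_i - theta^T Phi_i)^2 + (mu/2)||theta||^2."""
    rng = np.random.default_rng(seed)
    n, p = Phi.shape
    L = np.max(np.sum(Phi ** 2, axis=1)) + mu   # crude bound on the smoothness constant of each f_i
    gamma = 1.0 / (16.0 * L)                    # constant step size from the analysis above
    theta = np.zeros(p)
    stored = np.zeros((n, p))                   # y_i^t: last gradient computed for each f_i
    grad_sum = np.zeros(p)                      # sum_i y_i^t, updated incrementally in O(p)
    for t in range(n_epochs * n):
        i = rng.integers(n)                     # sampling with replacement
        g_i = (Phi[i] @ theta - y[i]) * Phi[i] + mu * theta
        grad_sum += g_i - stored[i]             # replace the stored gradient of f_i
        stored[i] = g_i
        theta -= (gamma / n) * grad_sum         # theta_t = theta_{t-1} - (gamma_t/n) sum_i y_i^t
    return theta
```

Storing the gradients here costs an n × p array; as the slides note, for linear predictions the requirement can be reduced to roughly the size of the original data or less, since each f_i′(θ) is essentially a scalar multiple of Φ(x_i).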
[Experimental results: spam dataset (n = 92 189, p = 823 470)]
Large-scale supervised learning via convex optimization
- Simplicity: few lines of code
- Robustness: step size, adaptivity to problem difficulty
- On-going work: single pass through the data (Bach and Moulines, 2013); distributed algorithms
- Convexity as a solution to all problems? Need good features Φ(x) for linear predictions θ^⊤Φ(x)!
Unsupervised learning through matrix factorization
- Given data matrix X = (x_1, ..., x_n) ∈ R^{n×p}
- Principal component analysis: x_i ≈ D α_i
- K-means: x_i ≈ d_{k(i)} (each observation approximated by one centroid)
- In both cases, a matrix factorization X ≈ DA
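To make the shared factorization view concrete, here is a small numpy sketch (illustrative, not from the slides) writing both PCA and K-means as X ≈ AD, i.e., the slide's X ≈ DA up to transposition: rows of X are observations, rows of D are atoms (principal directions or centroids), and rows of A are codes (dense loadings for PCA, 0/1 indicators for K-means).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))              # n = 200 observations in dimension p = 30
k = 5

# PCA: D = top-k principal directions (orthonormal atoms), A = dense real-valued codes
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
D_pca = Vt[:k]                                  # k x p dictionary
A_pca = Xc @ D_pca.T                            # n x k codes, X_c ~ A_pca @ D_pca

# K-means: D = k centroids, A = 0/1 assignment matrix (exactly one nonzero per row)
centroids = X[rng.choice(len(X), size=k, replace=False)]
for _ in range(20):                             # plain Lloyd iterations (ignoring possible empty clusters)
    labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
D_kmeans = centroids                            # k x p dictionary
A_kmeans = np.eye(k)[labels]                    # n x k codes, X ~ A_kmeans @ D_kmeans
```

Dictionary learning, discussed next, keeps the same factorization but asks the codes to be sparse rather than orthogonal or 0/1.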
Learning dictionaries for uncovering hidden structure
- Fact: many natural signals may be approximately represented as a superposition of few atoms from a dictionary D = (d_1, ..., d_k)
- Decomposition x = Σ_{i=1}^k α_i d_i = Dα with α ∈ R^k sparse
- Natural signals (sounds, images) (Olshausen and Field, 1997)
- Decoding problem: given a dictionary D, find α through regularized convex optimization
    min_{α ∈ R^k} ||x − Dα||_2^2 + λ ||α||_1
  [Illustration: ℓ1-ball geometry in the (w1, w2) plane]
- Dictionary learning problem: given n signals x_1, ..., x_n, estimate both the dictionary D and the codes α_1, ..., α_n
    min_D Σ_{j=1}^n min_{α_j ∈ R^k} { ||x_j − D α_j||_2^2 + λ ||α_j||_1 }
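One natural way to attack the dictionary learning problem above (a common baseline, not necessarily the method of the cited papers) is to alternate between sparse coding of each signal and a dictionary update. The sketch below uses ISTA for the decoding step, with illustrative choices for the step size, atom normalization, and initialization.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def sparse_code(D, x, lam, n_iters=200):
    """Decoding step: min_alpha ||x - D alpha||_2^2 + lam ||alpha||_1, solved by ISTA."""
    alpha = np.zeros(D.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2)    # 1 / Lipschitz constant of the quadratic term
    for _ in range(n_iters):
        grad = 2.0 * D.T @ (D @ alpha - x)
        alpha = soft_threshold(alpha - step * grad, step * lam)
    return alpha

def dict_learning(X, k, lam, n_iters=20, seed=0):
    """Alternate between sparse coding of all signals and a least-squares dictionary update."""
    rng = np.random.default_rng(seed)
    p, n = X.shape                                    # columns of X are the signals x_1, ..., x_n
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)                    # unit-norm atoms (a common normalization)
    for _ in range(n_iters):
        A = np.stack([sparse_code(D, X[:, j], lam) for j in range(n)], axis=1)
        D = X @ A.T @ np.linalg.pinv(A @ A.T)         # min_D ||X - D A||_F^2 given the codes
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)  # renormalize atoms (codes not rescaled, for brevity)
    return D, A
```

For the large-n regime discussed next, the batch sparse-coding loop over all signals is exactly what the online approach of Mairal et al. (2009) avoids.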
Challenges of dictionary learning
    min_D Σ_{j=1}^n min_{α_j ∈ R^k} { ||x_j − D α_j||_2^2 + λ ||α_j||_1 }
- Algorithmic challenges: large number of signals ⇒ online learning (Mairal et al., 2009)
- Theoretical challenges: identifiability/robustness (Jenatton et al., 2012)
- Domain-specific challenges: going beyond plain sparsity ⇒ structured sparsity (Jenatton, Mairal, Obozinski, and Bach, 2011)
Applications - Digital zooming (Couzinie-Devy et al., 2011)
Applications and extensions - Task-driven dictionaries: inverse half-toning (Mairal et al., 2011)
Big learning: challenges and opportunities - Conclusion
- Scientific context
  - Big data: need for supervised and unsupervised learning
- Beyond stochastic gradient for supervised learning
  - Few passes through the data
  - Provable robustness and ease of use
- Matrix factorization for unsupervised learning
  - Looking for hidden information through dictionary learning
  - Feature learning
References
F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical Report 00831977, HAL, 2013.
D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29-51, 2008.
R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297-2334, 2011.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical report, HAL, 2012.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML), 2009.
B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311-3325, 1997.