Big learning: challenges and opportunities

Transcription

1 Big learning: challenges and opportunities Francis Bach SIERRA Project-team, INRIA - Ecole Normale Supérieure December 2013

2 Scientific context
- Omnipresent digital media
- Big data
  - Multimedia, sensors, indicators, social networks
  - All levels: personal, professional, scientific, industrial
  - Too large and/or complex for manual processing
- Computational challenges
  - Dealing with large databases
- Statistical challenges
  - What can be predicted from such databases and how?
  - Looking for hidden information
- Opportunities (and threats)

4 Machine learning for big data
- Large-scale machine learning: large p, large n, large k
  - p: dimension of each observation (input)
  - n: number of observations
  - k: number of tasks (dimension of outputs)
- Examples: computer vision, bioinformatics, etc.

5 Object recognition

6 Learning for bioinformatics - Proteins
- Crucial components of cell life
- Predicting multiple functions and interactions
- Massive data: up to 1 million for humans!
- Complex data
  - Amino-acid sequence
  - Link with DNA
  - Three-dimensional molecule

7 Search engines - advertising

8 Advertising - recommendation

9 Machine learning for big data
- Large-scale machine learning: large p, large n, large k
  - p: dimension of each observation (input)
  - n: number of observations
  - k: number of tasks (dimension of outputs)
- Examples: computer vision, bioinformatics, etc.
- Two main challenges:
  1. Computational: ideal running-time complexity = $O(pn + kn)$
  2. Statistical: meaningful results

10 Big learning: challenges and opportunities - Outline
- Scientific context
  - Big data: need for supervised and unsupervised learning
- Beyond stochastic gradient for supervised learning
  - Few passes through the data
  - Provable robustness and ease of use
- Matrix factorization for unsupervised learning
  - Looking for hidden information through dictionary learning
  - Feature learning

12 Supervised machine learning
- Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$
- Prediction as a linear function $\theta^\top \Phi(x)$ of features $\Phi(x) \in \mathbb{R}^p$
- (Regularized) empirical risk minimization: find $\hat{\theta}$ solution of
  $$\min_{\theta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$$
  (convex data fitting term + regularizer)
- Applications to any data-oriented field
  - Computer vision, bioinformatics
  - Natural language processing, etc.

13 Supervised machine learning
- Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$
- Prediction as a linear function $\theta^\top \Phi(x)$ of features $\Phi(x) \in \mathbb{R}^p$
- (Regularized) empirical risk minimization: find $\hat{\theta}$ solution of
  $$\min_{\theta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$$
  (convex data fitting term + regularizer)
- Main practical challenges
  - Designing/learning good features $\Phi(x)$
  - Efficiently solving the optimization problem
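
The regularized empirical risk on this slide can be evaluated in a few lines. The sketch below is only an illustration: it assumes the logistic loss for $\ell$ and $\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2$ as regularizer (the slides fix neither choice), and the function names and toy data are hypothetical.

```python
import math

def logistic_loss(y, score):
    """Assumed loss: l(y, s) = log(1 + exp(-y * s)) for a label y in {-1, +1}."""
    z = -y * score
    # Numerically stable evaluation of log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def objective(theta, features, labels, mu):
    """(1/n) * sum_i l(y_i, theta^T Phi(x_i)) + mu * Omega(theta),
    with the assumed regularizer Omega(theta) = 0.5 * ||theta||^2."""
    n = len(labels)
    data_fit = 0.0
    for phi, y in zip(features, labels):
        score = sum(t * f for t, f in zip(theta, phi))  # theta^T Phi(x_i)
        data_fit += logistic_loss(y, score)
    regularizer = 0.5 * sum(t * t for t in theta)
    return data_fit / n + mu * regularizer

# Hypothetical toy data: each row is a feature vector Phi(x_i) in R^2.
features = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
labels = [1, -1, 1]
value_at_zero = objective([0.0, 0.0], features, labels, mu=0.1)
print(value_at_zero)
```

At $\theta = 0$ every prediction is zero, so each logistic loss equals $\log 2$ and the regularizer vanishes; this gives a quick sanity check for the implementation.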

16 Stochastic vs. deterministic methods
- Minimizing $g(\theta) = \frac{1}{n} \sum_{i=1}^n f_i(\theta)$ with $f_i(\theta) = \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$
- Batch gradient descent: $\theta_t = \theta_{t-1} - \gamma_t g'(\theta_{t-1}) = \theta_{t-1} - \frac{\gamma_t}{n} \sum_{i=1}^n f_i'(\theta_{t-1})$
  - Linear (e.g., exponential) convergence rate in $O(e^{-\alpha t})$
  - Iteration complexity is linear in n (with line search)
- Stochastic gradient descent: $\theta_t = \theta_{t-1} - \gamma_t f'_{i(t)}(\theta_{t-1})$
  - Sampling with replacement: i(t) random element of $\{1, \dots, n\}$
  - Convergence rate in $O(1/t)$
  - Iteration complexity is independent of n (step size selection?)
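
The two updates on this slide can be put side by side in a short sketch. It assumes a squared loss with $\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2$, a fixed step for the batch method instead of a line search, and a decaying step of order $1/(\mu t)$ for the stochastic method; the toy data, function names, and step-size constants are all hypothetical choices rather than anything prescribed by the slides.

```python
import random

def grad_fi(theta, phi, y, mu):
    """Gradient of f_i(theta) = 0.5 * (theta^T phi - y)^2 + 0.5 * mu * ||theta||^2."""
    residual = sum(t * f for t, f in zip(theta, phi)) - y
    return [residual * f + mu * t for t, f in zip(theta, phi)]

def objective(theta, data, mu):
    """g(theta) = (1/n) * sum_i f_i(theta)."""
    fit = sum(0.5 * (sum(t * f for t, f in zip(theta, phi)) - y) ** 2 for phi, y in data)
    return fit / len(data) + 0.5 * mu * sum(t * t for t in theta)

def batch_gd(data, mu, step, iterations):
    """Batch gradient descent: every iteration touches all n gradients."""
    n, p = len(data), len(data[0][0])
    theta = [0.0] * p
    for _ in range(iterations):
        full_grad = [0.0] * p
        for phi, y in data:
            full_grad = [a + g / n for a, g in zip(full_grad, grad_fi(theta, phi, y, mu))]
        theta = [t - step * g for t, g in zip(theta, full_grad)]
    return theta

def sgd(data, mu, step0, iterations, rng):
    """Stochastic gradient descent: one gradient per iteration, sampled with replacement."""
    theta = [0.0] * len(data[0][0])
    for t in range(1, iterations + 1):
        phi, y = data[rng.randrange(len(data))]
        step = step0 / (1.0 + step0 * mu * t)  # decaying step size, of order 1/(mu * t)
        theta = [th - step * g for th, g in zip(theta, grad_fi(theta, phi, y, mu))]
    return theta

# Hypothetical toy problem: n = 200 observations in dimension p = 3.
rng = random.Random(0)
true_theta = [1.0, -2.0, 0.5]
data = []
for _ in range(200):
    phi = [rng.uniform(-1.0, 1.0) for _ in true_theta]
    y = sum(t * f for t, f in zip(true_theta, phi)) + 0.1 * rng.gauss(0.0, 1.0)
    data.append((phi, y))

mu = 0.1
g_start = objective([0.0] * 3, data, mu)
g_batch = objective(batch_gd(data, mu, step=0.5, iterations=200), data, mu)
g_sgd = objective(sgd(data, mu, step0=0.3, iterations=2000, rng=rng), data, mu)
print(g_start, g_batch, g_sgd)
```

Here 200 batch iterations cost 200 passes over the data, whereas 2000 stochastic iterations cost 10 passes; the stochastic run is cheap per step but its accuracy depends on the step-size schedule, which is the trade-off the slide points at.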

19 Stochastic vs. deterministic methods
- Goal = best of both worlds: linear rate with $O(1)$ iteration cost
- Robustness to step size
[Figure: log(excess cost) versus time for the stochastic, deterministic, and hybrid methods]

21 Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012)
- Stochastic average gradient (SAG) iteration
  - Keep in memory the gradients of all functions $f_i$, $i = 1, \dots, n$
  - Random selection $i(t) \in \{1, \dots, n\}$ with replacement
  - Iteration: $\theta_t = \theta_{t-1} - \frac{\gamma_t}{n} \sum_{i=1}^n y_i^t$ with $y_i^t = f_i'(\theta_{t-1})$ if $i = i(t)$, and $y_i^t = y_i^{t-1}$ otherwise
- Stochastic version of incremental average gradient (Blatt et al., 2008)
- Simple implementation
  - Extra memory requirement: same size as original data (or less)
- Simple/robust constant step-size
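
The SAG iteration can be sketched directly from the update rule on this slide. The example below is an illustration under stated assumptions, not the authors' reference implementation: it uses a squared loss with $\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2$, initializes every stored gradient $y_i$ to zero, takes $L = \max_i \|\Phi(x_i)\|^2 + \mu$ as smoothness constant for this particular loss, and uses the constant step $1/(16L)$. The toy data and function names are hypothetical.

```python
import random

def grad_fi(theta, phi, y, mu):
    """Gradient of f_i(theta) = 0.5 * (theta^T phi - y)^2 + 0.5 * mu * ||theta||^2."""
    residual = sum(t * f for t, f in zip(theta, phi)) - y
    return [residual * f + mu * t for t, f in zip(theta, phi)]

def objective(theta, data, mu):
    """g(theta) = (1/n) * sum_i f_i(theta)."""
    fit = sum(0.5 * (sum(t * f for t, f in zip(theta, phi)) - y) ** 2 for phi, y in data)
    return fit / len(data) + 0.5 * mu * sum(t * t for t in theta)

def sag(data, mu, passes, rng):
    """Stochastic average gradient with a constant step size."""
    n, p = len(data), len(data[0][0])
    # Smoothness constant of each f_i for this loss: ||phi_i||^2 + mu.
    smoothness = max(sum(f * f for f in phi) for phi, _ in data) + mu
    step = 1.0 / (16.0 * smoothness)
    theta = [0.0] * p
    memory = [[0.0] * p for _ in range(n)]  # stored gradients y_i (assumed zero at start)
    grad_sum = [0.0] * p                    # running value of sum_i y_i
    for _ in range(passes * n):
        i = rng.randrange(n)                # i(t), sampled with replacement
        phi, y = data[i]
        new_grad = grad_fi(theta, phi, y, mu)
        # Refresh only the selected gradient; all other y_i keep their old value.
        grad_sum = [s - old + new for s, old, new in zip(grad_sum, memory[i], new_grad)]
        memory[i] = new_grad
        theta = [t - step * s / n for t, s in zip(theta, grad_sum)]
    return theta

# Hypothetical toy problem: n = 200 observations in dimension p = 3.
rng = random.Random(0)
true_theta = [1.0, -2.0, 0.5]
data = []
for _ in range(200):
    phi = [rng.uniform(-1.0, 1.0) for _ in true_theta]
    y = sum(t * f for t, f in zip(true_theta, phi)) + 0.1 * rng.gauss(0.0, 1.0)
    data.append((phi, y))

mu = 0.1
g_start = objective([0.0] * 3, data, mu)
theta_sag = sag(data, mu, passes=50, rng=rng)
g_sag = objective(theta_sag, data, mu)
print(g_start, g_sag)
```

Keeping a running sum of the stored gradients makes each iteration cost $O(p)$, independent of n. The sketch stores full gradient vectors; for linear predictions the data-dependent part of each gradient is a scalar multiple of $\Phi(x_i)$, so storing one scalar per observation would probably suffice, which is one plausible reading of the "or less" in the memory requirement.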

22 Stochastic average gradient - Convergence analysis
- Assume each $f_i$ is $L$-smooth and $g = \frac{1}{n} \sum_{i=1}^n f_i$ is $\mu$-strongly convex
- Constant step size $\gamma_t = \frac{1}{16L}$. If $\mu \geq \frac{2L}{n}$, $\exists C \in \mathbb{R}$ such that
  $$\forall t \geq 0, \quad \mathbb{E}\big[g(\theta_t) - g(\theta_*)\big] \leq C \exp\Big(-\frac{t}{8n}\Big)$$
- Linear convergence rate with iteration cost independent of n
  - After each pass through the data, constant error reduction
  - Breaking two lower bounds
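
To make the "constant error reduction after each pass" concrete, the bound can be evaluated at multiples of n. This is only arithmetic on the stated bound, with $\varepsilon$ introduced here as a hypothetical target reduction factor; it says nothing about the constant $C$ or about the speed observed in practice.

```latex
\[
\text{one pass } (t = n):\quad C\exp\!\Big(-\frac{n}{8n}\Big) = C e^{-1/8} \approx 0.88\,C,
\qquad
\text{eight passes } (t = 8n):\quad C e^{-1} \approx 0.37\,C .
\]
\[
C\exp\!\Big(-\frac{t}{8n}\Big) \le \varepsilon\, C
\iff t \ge 8n \log\frac{1}{\varepsilon},
\qquad\text{e.g. } \varepsilon = 10^{-3} \;\Rightarrow\; t \ge 8n \log 10^{3} \approx 55.3\,n .
\]
```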

23 spam dataset (n = , p = )

25 Large-scale supervised learning - Convex optimization
- Simplicity
  - Few lines of code
- Robustness
  - Step-size
  - Adaptivity to problem difficulty
- On-going work
  - Single pass through the data (Bach and Moulines, 2013)
  - Distributed algorithms
- Convexity as a solution to all problems?
  - Need good features $\Phi(x)$ for linear predictions $\theta^\top \Phi(x)$!

26 Unsupervised learning through matrix factorization
- Given data matrix $X = (x_1, \dots, x_n) \in \mathbb{R}^{n \times p}$
  - Principal component analysis: $x_i \approx D\alpha_i$
  - K-means: $x_i \approx d_k$
  - $X = DA$
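
The shared factorization can be written out explicitly. The block below assumes the signals are stacked as columns, so that $X \in \mathbb{R}^{p \times n}$; with observations as rows, as in the $\mathbb{R}^{n \times p}$ convention of the slide, the same statement reads $X \approx A^\top D^\top$. The cluster index $c(i)$ and the canonical basis vectors $e_j$ are notation introduced here for illustration, and the PCA line assumes centered data.

```latex
\[
X \approx D A, \qquad
D = (d_1, \dots, d_k) \in \mathbb{R}^{p \times k}, \qquad
A = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^{k \times n}.
\]
\[
\text{PCA: } D^\top D = I_k,\ \ \alpha_i = D^\top x_i;
\qquad
\text{K-means: } \alpha_i = e_{c(i)} \in \{0,1\}^k \ \Rightarrow\ D\alpha_i = d_{c(i)} .
\]
```

The two methods thus differ only in the constraints placed on $D$ and $A$: orthonormal atoms for PCA, one-hot codes for K-means.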

29 Learning dictionaries for uncovering hidden structure
- Fact: many natural signals may be approximately represented as a superposition of few atoms from a dictionary $D = (d_1, \dots, d_k)$
  - Decomposition $x = \sum_{i=1}^k \alpha_i d_i = D\alpha$ with $\alpha \in \mathbb{R}^k$ sparse
  - Natural signals (sounds, images) (Olshausen and Field, 1997)
- Decoding problem: given a dictionary $D$, finding $\alpha$ through regularized convex optimization
  $$\min_{\alpha \in \mathbb{R}^k} \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1$$
- Dictionary learning problem: given n signals $x_1, \dots, x_n$, estimate both dictionary $D$ and codes $\alpha_1, \dots, \alpha_n$
  $$\min_D \sum_{j=1}^n \min_{\alpha_j \in \mathbb{R}^k} \big\{ \|x_j - D\alpha_j\|_2^2 + \lambda \|\alpha_j\|_1 \big\}$$
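
The decoding problem can be solved by many convex solvers; the slides do not name one. As a minimal sketch, the code below uses proximal gradient descent (ISTA), a standard method swapped in here for illustration. It assumes the dictionary is given as a list of rows, and it uses $2\|D\|_F^2$ as a simple upper bound on the Lipschitz constant of the gradient of $\|x - D\alpha\|_2^2$. Function names and the toy input are hypothetical.

```python
import math

def soft_threshold(v, thr):
    """Proximal operator of thr * |.|: shrink v towards zero by thr."""
    return math.copysign(max(abs(v) - thr, 0.0), v)

def ista(x, D, lam, iterations=200):
    """Approximately minimize ||x - D a||_2^2 + lam * ||a||_1 over a in R^k.

    D is a list of p rows with k entries each (columns are the atoms d_1..d_k).
    """
    p, k = len(D), len(D[0])
    # 2 * ||D||_F^2 upper-bounds 2 * lambda_max(D^T D), the Lipschitz constant
    # of the gradient of the quadratic term, so 1 / bound is a safe step size.
    lipschitz_bound = 2.0 * sum(D[r][c] ** 2 for r in range(p) for c in range(k))
    step = 1.0 / lipschitz_bound
    alpha = [0.0] * k
    for _ in range(iterations):
        residual = [sum(D[r][c] * alpha[c] for c in range(k)) - x[r] for r in range(p)]
        grad = [2.0 * sum(D[r][c] * residual[r] for r in range(p)) for c in range(k)]
        # Gradient step on the quadratic term, then soft-thresholding for the l1 term.
        alpha = [soft_threshold(a - step * g, step * lam) for a, g in zip(alpha, grad)]
    return alpha

# Hypothetical toy input: identity dictionary, so the minimizer is known in
# closed form (soft-thresholding of x at lam / 2).
identity = [[1.0, 0.0], [0.0, 1.0]]
code = ista([3.0, -0.2], identity, lam=1.0)
print(code)
```

With an identity dictionary the problem separates across coordinates: entries of x with magnitude at most $\lambda/2$ are set exactly to zero and the others are shrunk by $\lambda/2$, which shows how the $\ell_1$ penalty produces sparse codes.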

30 Challenges of dictionary learning
  $$\min_D \sum_{j=1}^n \min_{\alpha_j \in \mathbb{R}^k} \big\{ \|x_j - D\alpha_j\|_2^2 + \lambda \|\alpha_j\|_1 \big\}$$
- Algorithmic challenges
  - Large number of signals: online learning (Mairal et al., 2009)
- Theoretical challenges
  - Identifiability/robustness (Jenatton et al., 2012)
- Domain-specific challenges
  - Going beyond plain sparsity: structured sparsity (Jenatton, Mairal, Obozinski, and Bach, 2011)

31 Applications - Digital Zooming

32 Digital Zooming (Couzinie-Devy et al., 2011)

33 Applications - Task-driven dictionaries inverse half-toning (Mairal et al., 2011)

34 Extensions - Task-driven dictionaries inverse half-toning (Mairal et al., 2011)

35 Big learning: challenges and opportunities - Conclusion
- Scientific context
  - Big data: need for supervised and unsupervised learning
- Beyond stochastic gradient for supervised learning
  - Few passes through the data
  - Provable robustness and ease of use
- Matrix factorization for unsupervised learning
  - Looking for hidden information through dictionary learning
  - Feature learning

36 References
- F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical Report, HAL.
- D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. 18(1):29-51.
- R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12.
- N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical Report, HAL.
- J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML).
- B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 1997.