Deep Learning for Big Data
Yoshua Bengio, Département d'Informatique et de Recherche Opérationnelle, U. Montréal
30 May 2013, Journée de la recherche, École Polytechnique, Montréal
Big Data & Data Science
Super-hot buzzword; data deluge.
Two sides of the coin:
1. Allowing computers to understand the data (perception)
2. Allowing computers to take decisions (action)
My research: CERC in Data Science and Real-Time Decision-Making, i.e. the necessity to combine 1 and 2.
Big Data: a Growing Torrent
Business executives are faced with a relentless and exponential growth of the data that can be collected by their enterprises.
30 billion pieces of content shared on Facebook every month
5 billion mobile phones in use in 2010
40% projected growth in global data generated per year vs. 5% growth in global IT spending
1 Exabyte = 1 Billion Gigabytes
Data: McKinsey. Figure: The Economist.
Big Data, Big Value
Making sense of this data could unleash substantial value across an array of industries.
$300 billion potential annual value to US health care
$600 billion potential annual consumer surplus from using personal location data globally
€250 billion potential annual value to Europe's public sector
60% potential increase in retailers' operating margins possible with big data
Source: McKinsey
Big Data: in the minds of executives
There are many reasons to believe that, since last year, turning data into a competitive advantage is becoming a top-of-mind C-level issue.
O'Reilly Strata Conference: twice-yearly event, started 2011
McKinsey White Paper, 2011
The Economist Special Report, 2010
"The world of Big Data is on fire" (The Economist, Sept 2011)
#bigdata on Twitter
Data Science: automatically extracting knowledge from data
From: Yann LeCun, Lecture 1 on Big Data, Large Scale Machine Learning, 2013
Decision Science + Machine Learning
The topic of a successful CERC application. Why? Data deluge & real-time online learning:
Learned models are used to take decisions on the fly
The data used to train depends on the decisions taken
Can't separate the learning from the decisions as in traditional OR & ML setups
Examples: online advertising & recommendation systems, online video games, fraud detection, targeted marketing, etc.
Ultimate Goals for AI
AI needs knowledge
Needs learning
Needs generalizing where probability mass concentrates
Needs to fight the curse of dimensionality
Needs to disentangle the underlying explanatory factors ("making sense of the data")
Easy Learning
[Figure: training examples (x, y) sampled from a true but unknown function, and the learned function whose prediction is f(x).]
Local Smoothness Prior: Locally Capture the Variations
[Figure: the learned prediction f(x) interpolates between training examples of the true (unknown) function; at a test point x', the learned value f(x') is obtained by interpolating nearby training examples.]
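To make the interpolation idea on this slide concrete, here is a minimal sketch (my own illustration, not from the talk) of local generalization with a Gaussian kernel smoother; the bandwidth, the sine "true function" and the sample sizes are arbitrary assumptions.

```python
import numpy as np

def local_interpolation(x_train, y_train, x_test, bandwidth=0.2):
    """Nadaraya-Watson kernel smoother: f(x') is a weighted average of the
    training targets, with weights that decay with distance from x'."""
    d2 = (x_test[:, None] - x_train[None, :]) ** 2
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    return (w * y_train[None, :]).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 30))
y_train = np.sin(2 * np.pi * x_train)      # stand-in for the true unknown function
x_test = np.linspace(0, 1, 5)
print(local_interpolation(x_train, y_train, x_test))
```

Such purely local interpolation only works where training examples are nearby, which is exactly the curse-of-dimensionality concern raised on the next slide.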
What We Are Fighting Against: The Curse of Dimensionality
To generalize locally, we need representative examples for all relevant variations!
Manifold Learning Prior
Examples concentrate near a lower-dimensional manifold.
Putting Probability Mass where Structure is Plausible
Empirical distribution: mass at training examples
Smoothness: spread mass around. Insufficient.
Guess structure and generalize accordingly
Representation Learning
Good input features are essential for successful ML (feature engineering = 90% of the effort in industrial ML).
Handcrafting features vs. learning them.
Representation learning: guesses the features / factors / causes = good representation.
Deep Representation Learning
Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction.
When the number of levels can be data-selected, this is Deep Learning.
[Figure: a stack of representations x → h1 → h2 → h3.]
A Modern Deep Architecture
Output layer (optional): here, predicting a supervised target
Hidden layers: these learn more abstract representations as you head up
Input layer: this has the raw sensory inputs (roughly)
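As a hypothetical companion to this slide, the sketch below wires up the architecture just described in plain numpy: an input layer, a stack of rectifier hidden layers computing successively more abstract representations, and an optional output layer predicting a supervised target. The layer sizes, the softmax output and the random batch are illustrative assumptions, not anything specified in the talk.

```python
import numpy as np

def relu(z):
    # Rectifier nonlinearity f(z) = max(0, z), used in the hidden layers
    return np.maximum(0.0, z)

def softmax(z):
    # Turn the output layer's scores into class probabilities
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def init_layer(n_in, n_out, rng):
    # Small random weights, zero biases
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)

def forward(x, layers, output=None):
    """Compute successive representations h1, h2, ... from the raw input x."""
    h, reps = x, []
    for W, b in layers:
        h = relu(h @ W + b)        # each hidden layer re-represents the one below
        reps.append(h)
    if output is not None:         # optional supervised output layer
        W_out, b_out = output
        return reps, softmax(h @ W_out + b_out)
    return reps, None

rng = np.random.default_rng(0)
layers = [init_layer(784, 256, rng), init_layer(256, 128, rng), init_layer(128, 64, rng)]
output = init_layer(64, 10, rng)
x = rng.normal(size=(5, 784))      # a small batch of raw sensory inputs
reps, probs = forward(x, layers, output)
print([h.shape for h in reps], probs.shape)
```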
Google Image Search: different object types represented in the same space
Google: S. Bengio, J. Weston & N. Usunier (IJCAI 2011, NIPS 2010, JMLR 2010, MLJ 2010)
How do humans generalize from very few examples?
Brains may be born with generic priors. Which ones?
Humans transfer knowledge from previous learning: representations, explanatory factors.
Previous learning from: unlabeled data + labels for other tasks.
Learning multiple levels of representation
Theoretical evidence for multiple levels of representation: exponential gain for some families of functions.
Biologically inspired learning: the brain has a deep architecture, and cortex seems to have a generic learning algorithm.
Humans first learn simpler concepts and then compose them into more complex ones.
Learning multiple levels of representation
(Lee, Largman, Pham & Ng, NIPS 2009) (Lee, Grosse, Ranganath & Ng, ICML 2009)
Successive model layers learn deeper intermediate representations.
[Figure: Layer 1 → Layer 2 → Layer 3; parts combine to form objects, up to high-level linguistic representations.]
Prior: underlying factors & concepts are compactly expressed with multiple levels of abstraction.
Deep computer program
[Figure: a call hierarchy: main calls sub1, sub2 and sub3; these call subsub1, subsub2 and subsub3; these call subsubsub1, subsubsub2 and subsubsub3. Each level reuses the routines below it.]
Shallow computer program
[Figure: main calls subroutine1 and subroutine2; subroutine1 includes subsub1 code, subsub2 code and subsubsub1 code; subroutine2 includes subsub2 code, subsub3 code and subsubsub3 code.]
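A toy illustration (mine, not from the slides) of the deep-vs-shallow program analogy: the deep version defines shared low-level routines once and reuses them across levels, while the shallow version duplicates that code inside each top-level routine. The arithmetic itself is arbitrary; only the structure matters.

```python
# Deep program: shared low-level routines are defined once and reused.
def subsubsub1(x): return x + 1
def subsubsub2(x): return x * 2
def subsub1(x): return subsubsub1(x) + subsubsub2(x)
def subsub2(x): return subsubsub2(x) - 1
def sub1(x): return subsub1(x) * subsub2(x)
def sub2(x): return subsub1(x) + subsub2(x)
def deep_main(x): return sub1(x) + sub2(x)

# Shallow program: the same computation with every shared piece inlined,
# so the subsub*/subsubsub* code is duplicated inside each routine.
def shallow_main(x):
    subroutine1 = ((x + 1) + (x * 2)) * ((x * 2) - 1)   # sub1 fully expanded
    subroutine2 = ((x + 1) + (x * 2)) + ((x * 2) - 1)   # sub2 fully expanded
    return subroutine1 + subroutine2

assert deep_main(3) == shallow_main(3)
```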
Major Breakthrough in 2006
Ability to train deep architectures by using layer-wise unsupervised learning, whereas previous purely supervised attempts had failed.
Unsupervised feature learners: RBMs, auto-encoder variants, sparse coding variants.
Empirical successes since then: 2 competitions, Google, Microsoft, IBM, Apple.
[Figure: Hinton (Toronto), Bengio (Montréal), Le Cun (New York).]
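A minimal sketch of the greedy layer-wise idea on this slide, using a tied-weight sigmoid auto-encoder as the unsupervised feature learner. This is a deliberately simple stand-in for the RBM, auto-encoder and sparse coding variants named above; the layer sizes, learning rate and toy data are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(data, n_hidden, lr=0.1, epochs=50):
    """Train one tied-weight sigmoid auto-encoder by plain SGD
    on squared reconstruction error."""
    n_vis = data.shape[1]
    W = rng.normal(0.0, 0.1, size=(n_vis, n_hidden))
    b_h, b_v = np.zeros(n_hidden), np.zeros(n_vis)
    for _ in range(epochs):
        for x in data:
            h = sigmoid(x @ W + b_h)           # encode
            r = sigmoid(h @ W.T + b_v)         # decode (tied weights)
            # Backprop of 0.5*||r - x||^2 through decoder and encoder
            d_r = (r - x) * r * (1 - r)
            d_h = (d_r @ W) * h * (1 - h)
            W -= lr * (np.outer(x, d_h) + np.outer(d_r, h))
            b_h -= lr * d_h
            b_v -= lr * d_r
    return W, b_h

def greedy_pretrain(data, layer_sizes):
    """Greedy layer-wise pretraining: train a feature learner on the raw
    input, then use its hidden representation as data for the next layer."""
    layers, h = [], data
    for n_hidden in layer_sizes:
        W, b_h = train_autoencoder(h, n_hidden)
        layers.append((W, b_h))
        h = sigmoid(h @ W + b_h)   # representation fed to the next layer
    return layers

# Toy unlabeled data; after pretraining, the stack would normally be
# fine-tuned with a supervised objective on top.
X = (rng.random((200, 20)) > 0.5).astype(float)
stack = greedy_pretrain(X, layer_sizes=[16, 8])
print([W.shape for W, _ in stack])
```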
Deep Networks for Speech Recognition: results from Google, IBM, Microsoft
Word error rates (%):

Task                     Hours of training data   Deep net + HMM   GMM+HMM (same data)   GMM+HMM (more data)
Switchboard              309                      16.1             23.6                  17.1 (2k hours)
English Broadcast News   50                       17.5             18.8
Bing voice search        24                       30.4             36.2
Google voice input       5870                     12.3                                   16.0 (lots more)
YouTube                  1400                     47.6             52.3

(Numbers taken from Geoff Hinton's June 22, 2012 Google talk.)
Deep Sparse Rectifier Neural Networks
(Glorot, Bordes & Bengio, AISTATS 2011), following up on (Nair & Hinton, 2010)
Machine learning motivations
Neuroscience motivations: leaky integrate-and-fire model
Sparse representations, sparse gradients
Rectifier: f(x) = max(0, x)
Outstanding results by Krizhevsky et al. 2012, killing the state-of-the-art on ImageNet 1000:

                     1st choice   Top-5
2nd best                          27% err
Previous SOTA        45% err      26% err
Krizhevsky et al.    37% err      15% err
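A quick numerical illustration (not from the paper) of why the rectifier f(x) = max(0, x) yields the sparse representations and sparse gradients mentioned above: every negative pre-activation is mapped to an exact zero. The random pre-activations are an assumption purely for demonstration.

```python
import numpy as np

def rectifier(z):
    # f(z) = max(0, z): exact zeros for all negative pre-activations
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=(1000, 500))   # random zero-mean pre-activations
h = rectifier(pre_activations)
# Roughly half the units are exactly zero -> a sparse representation,
# and the gradient through those units is exactly zero as well.
print("fraction of exactly-zero activations:", np.mean(h == 0.0))
```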
Learning Multiple Levels of Abstraction
The big payoff of deep learning is to allow learning higher levels of abstraction.
Higher-level abstractions disentangle the factors of variation, which allows much easier generalization and transfer.
More abstract representations → successful transfer (domains, languages); 2 international competitions won.
Challenges Ahead
Big data + deep learning = underfitting, local minima, ill-conditioning, difficulty of using 2nd-order methods in the stochastic / online setting.
The challenge of inference with non-unimodal, non-factorial posteriors (can we avoid this altogether?).
Big data + deep learning + parallel computing: our current best training algorithms are highly sequential; big efforts at Google in this respect (Dean et al., ICML 2012, NIPS 2012).
Much remains to be understood mathematically; (Alain & Bengio, ICLR 2013) is one of the few works scratching the tip of the iceberg.
LISA team: Thank you! Questions?