Large-Scale Learning
Data hypergrowth: an example
- Reuters-21578: about 10K docs (ModApte) [Bekkerman et al., SIGIR 2001]
- RCV1: about 807K docs [Bekkerman & Scholz, CIKM 2008]
- LinkedIn job title data: about 100M docs [Bekkerman & Gavish, KDD 2011]
[Figure: dataset size on a log scale, 10K to 100M docs, plotted against year, 2000-2012]
Slide by R. Bekkerman, M. Bilenko, J. Langford
New age of big data
- The world has gone mobile: 5 billion cellphones produce daily data
- Social networks have gone online: Twitter produces 200M tweets a day
- Crowdsourcing is the reality: labeling of 100,000+ data instances is doable, within a week :)
Slide by R. Bekkerman, M. Bilenko, J. Langford
Size matters
- One thousand data instances
- One million data instances
- One billion data instances
- One trillion data instances
Those are not different numbers, those are different mindsets :)
Slide by R. Bekkerman, M. Bilenko, J. Langford
One million data instances
- Currently the most active zone
- Can be crowdsourced
- Can be processed by a quadratic algorithm, once parallelized
- A 1M data collection cannot be too diverse, but can be too homogeneous
- Preprocessing / data probing is crucial
Slide by R. Bekkerman, M. Bilenko, J. Langford
Big datasets cannot be too sparse
- 1M data instances cannot belong to 1M classes, simply because it's not practical to have 1M classes :)
- Here's a statistical experiment in the text domain:
  - 1M documents, each 100 words long
  - Randomly sampled from a unigram language model, no stopwords
  - 245M pairs have a word overlap of 10% or more
- Real-world datasets are denser than random
Slide by R. Bekkerman, M. Bilenko, J. Langford
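The experiment above can be sketched at reduced scale. This is an illustrative reconstruction, not the slide's actual setup: the vocabulary size, number of documents, and the Zipf-style unigram distribution are all assumptions made here so the script runs in seconds.

```python
# Scaled-down sketch of the overlap experiment: sample short documents from
# a unigram language model and count pairs sharing >= 10% of their words.
# VOCAB_SIZE, NUM_DOCS, and the Zipf-like weights are illustrative choices.
import random

random.seed(0)

VOCAB_SIZE = 20_000
DOC_LEN = 100
NUM_DOCS = 500

# Zipf-like unigram distribution: the word of rank r gets weight 1/(r+1)
weights = [1.0 / (rank + 1) for rank in range(VOCAB_SIZE)]

# Each document is the set of distinct words among DOC_LEN sampled tokens
docs = [set(random.choices(range(VOCAB_SIZE), weights=weights, k=DOC_LEN))
        for _ in range(NUM_DOCS)]

overlapping = 0
for i in range(NUM_DOCS):
    for j in range(i + 1, NUM_DOCS):
        # overlap fraction measured relative to the document length
        if len(docs[i] & docs[j]) >= 0.10 * DOC_LEN:
            overlapping += 1

total_pairs = NUM_DOCS * (NUM_DOCS - 1) // 2
print(f"{overlapping} of {total_pairs} pairs overlap by >= 10%")
```

Even at this toy scale, the skewed unigram distribution makes a noticeable fraction of random document pairs share words, which is the slide's point: real corpora are denser still.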
One billion data instances
- Web-scale
- Guaranteed to contain data in different formats: ASCII text, pictures, JavaScript code, PDF documents
- Guaranteed to contain (near-)duplicates
- Likely to be badly preprocessed :)
- Storage is an issue
Slide by R. Bekkerman, M. Bilenko, J. Langford
One trillion data instances
- Beyond the reach of modern technology
- The peer-to-peer paradigm is (arguably) the only way to process the data
- Data privacy / inconsistency / skewness issues:
  - Can't be kept in one location
  - Is intrinsically hard to sample
Slide by R. Bekkerman, M. Bilenko, J. Langford
Not enough (clean) training data?
- Use existing labels as guidance rather than a directive, in a semi-supervised clustering framework
- Or label more data! :) With a little help from the crowd
Slide by R. Bekkerman, M. Bilenko, J. Langford
Crowdsourcing labeled data
- Crowdsourcing is a tough business :)
  - People are not machines
  - Any worker who can game the system will game the system
- A validation framework plus qualification tests are a must
- Labeling a lot of data can be fairly expensive
Slide by R. Bekkerman, M. Bilenko, J. Langford
Let's talk about how we can learn with datasets this large...
Stochastic Gradient Descent
Consider Learning with Numerous Data
Logistic regression objective:
$$J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i)) \right]$$
Fit via gradient descent, where each summand below is $\frac{\partial}{\partial \theta_j} \mathrm{cost}(x_i, y_i)$:
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} (h_\theta(x_i) - y_i)\, x_{ij}$$
What is the computational complexity in terms of n?
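To make the cost concrete, here is a minimal NumPy sketch of the objective and its batch gradient. The random dataset and its dimensions are purely illustrative; the point is that a single full-gradient evaluation touches all n rows, i.e. O(nd) work per update.

```python
# Logistic-regression objective and batch gradient in NumPy.
# Synthetic data; sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1_000, 5
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) + rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(theta):
    # J(theta) = -(1/n) sum_i [y_i log h + (1 - y_i) log(1 - h)]
    h = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

def gradient(theta):
    # (1/n) sum_i (h_theta(x_i) - y_i) x_ij  -- one full pass over all n rows
    return X.T @ (sigmoid(X @ theta) - y) / n

theta = np.zeros(d)
print(objective(theta), gradient(theta).shape)
```

At $\theta = 0$ every prediction is 0.5, so the objective starts at $\log 2 \approx 0.693$ regardless of the data, which is a handy sanity check.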
Gradient Descent

Batch Gradient Descent:
Initialize $\theta$
Repeat {
  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} (h_\theta(x_i) - y_i)\, x_{ij}$  for $j = 0 \dots d$
}
(each update is $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$)

Stochastic Gradient Descent:
Initialize $\theta$
Randomly shuffle the dataset
Repeat (typically 1-10x) {
  For $i = 1 \dots n$, do
    $\theta_j \leftarrow \theta_j - \alpha (h_\theta(x_i) - y_i)\, x_{ij}$  for $j = 0 \dots d$
}
(each update is $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} \mathrm{cost}(x_i, y_i)$)
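The two update rules above can be sketched side by side. This is an illustrative implementation on synthetic data; the learning rate, iteration counts, and data-generation constants are all assumptions, not values from the slide.

```python
# Batch GD updates once per full pass over the data; SGD updates after
# every single example. Synthetic logistic-regression data; all constants
# here are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])  # x_0 = 1
true_theta = np.array([0.5, 2.0, -1.0])
y = (1 / (1 + np.exp(-(X @ true_theta))) > rng.random(n)).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

alpha = 0.1

# Batch gradient descent: each iteration costs O(n * d)
theta_batch = np.zeros(d)
for _ in range(200):
    theta_batch -= alpha * X.T @ (sigmoid(X @ theta_batch) - y) / n

# Stochastic gradient descent: shuffle once, then update per example
theta_sgd = np.zeros(d)
order = rng.permutation(n)
for _ in range(5):  # typically 1-10 passes
    for i in order:
        grad_i = (sigmoid(X[i] @ theta_sgd) - y[i]) * X[i]
        theta_sgd -= alpha * grad_i

print(theta_batch, theta_sgd)
```

Both runs recover the signs of the true coefficients; SGD's estimate is noisier because each step follows a single example's gradient, which is exactly the trade that makes it affordable at scale.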
Batch vs. Stochastic GD
- Learning rate $\alpha$ is typically held constant
- Can slowly decrease $\alpha$ over time to force $\theta$ to converge, e.g.
  $\alpha = \dfrac{\text{constant1}}{\text{iterationNumber} + \text{constant2}}$
Based on slide by Andrew Ng
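The decay schedule above is a one-liner; the particular constants below are arbitrary illustrative choices.

```python
# Decaying learning-rate schedule from the slide:
# alpha = constant1 / (iterationNumber + constant2)
def learning_rate(iteration, constant1=1.0, constant2=50.0):
    return constant1 / (iteration + constant2)

rates = [learning_rate(t) for t in (0, 10, 100, 1000)]
print(rates)  # monotonically decreasing toward 0
```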
Graph- and Data-Parallelism
Map-Reduce
[Diagram: the training set is split across Computers 1-4; each computes a partial result, and the results are combined]
Based on slide by Andrew Ng
Multi-Core Machines
[Diagram: the training set is split across Cores 1-4; each computes a partial result, and the results are combined]
Based on slide by Andrew Ng
Map-Reduce for Batch GD
Split the dataset into chunks (e.g., with n = 400) to compute
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} (h_\theta(x_i) - y_i)\, x_{ij}$$
Each machine computes a partial sum over its chunk of the training set:
- $(x_1, y_1), \dots, (x_{100}, y_{100})$: $\text{temp}_1 = \sum_{i=1}^{100} (h_\theta(x_i) - y_i)\, x_{ij}$
- $(x_{101}, y_{101}), \dots, (x_{200}, y_{200})$: $\text{temp}_2 = \sum_{i=101}^{200} (h_\theta(x_i) - y_i)\, x_{ij}$
- $(x_{201}, y_{201}), \dots, (x_{300}, y_{300})$: $\text{temp}_3 = \sum_{i=201}^{300} (h_\theta(x_i) - y_i)\, x_{ij}$
- $(x_{301}, y_{301}), \dots, (x_{400}, y_{400})$: $\text{temp}_4 = \sum_{i=301}^{400} (h_\theta(x_i) - y_i)\, x_{ij}$
Combine results:
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{400} \sum_{k=1}^{4} \text{temp}_k$$
Based on example by Andrew Ng
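The decomposition above can be verified in a few lines: the four chunked partial sums combine into exactly the same update as a single-machine pass. The dataset and constants below are illustrative; "mapper" here is an ordinary function standing in for a machine.

```python
# One batch-GD step computed map-reduce style: n = 400 split into 4 chunks
# of 100, each "mapper" emitting a partial gradient sum, then a "reduce"
# step combining them. Synthetic data; constants are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, d = 400, 4
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)
theta = np.zeros(d)
alpha = 0.1

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def mapper(X_chunk, y_chunk, theta):
    # partial sum over one chunk: sum_i (h_theta(x_i) - y_i) * x_i
    return X_chunk.T @ (sigmoid(X_chunk @ theta) - y_chunk)

# Map: 4 machines, 100 examples each
temps = [mapper(X[k:k + 100], y[k:k + 100], theta) for k in range(0, n, 100)]

# Reduce: theta_j <- theta_j - alpha * (1/400) * sum_k temp_k
theta_mapreduce = theta - alpha * sum(temps) / n

# Sanity check against the single-machine batch update
theta_single = theta - alpha * X.T @ (sigmoid(X @ theta) - y) / n
print(np.allclose(theta_mapreduce, theta_single))
```

The agreement holds because the gradient is a sum over examples, so it splits cleanly across machines; only the small `temp` vectors, not the data, travel to the reducer.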
Parallelizing k-means
[Figure-only slides illustrating parallel k-means]
Slide by R. Bekkerman, M. Bilenko, J. Langford
k-means on MapReduce
- Mappers read data portions and centroids
- Mappers assign data instances to clusters
- Mappers compute new local centroids and local cluster sizes
- Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids
- Reducers write the new centroids
Slide by R. Bekkerman, M. Bilenko, J. Langford
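The steps above can be sketched as one k-means iteration in pure Python. The data, the number of clusters, and the chunking are illustrative assumptions; real mappers and reducers would run on separate machines, with only the per-cluster sums and sizes crossing the network.

```python
# One k-means iteration, map-reduce style: mappers assign points and emit
# local per-cluster coordinate sums and sizes; the reducer aggregates them
# (weighted by local cluster sizes) into new global centroids.
import random

random.seed(3)
K = 2
# Two well-separated 2-D blobs; purely illustrative
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)] + \
         [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(50)]
centroids = [(0.0, 0.0), (5.0, 5.0)]

def nearest(p, centroids):
    return min(range(K), key=lambda k: (p[0] - centroids[k][0]) ** 2 +
                                        (p[1] - centroids[k][1]) ** 2)

def mapper(chunk, centroids):
    # local coordinate sums and cluster sizes for one data portion
    sums = [[0.0, 0.0] for _ in range(K)]
    sizes = [0] * K
    for p in chunk:
        k = nearest(p, centroids)
        sums[k][0] += p[0]
        sums[k][1] += p[1]
        sizes[k] += 1
    return sums, sizes

def reducer(mapper_outputs):
    # aggregate local sums/sizes into new global centroids
    total_sums = [[0.0, 0.0] for _ in range(K)]
    total_sizes = [0] * K
    for sums, sizes in mapper_outputs:
        for k in range(K):
            total_sums[k][0] += sums[k][0]
            total_sums[k][1] += sums[k][1]
            total_sizes[k] += sizes[k]
    return [(total_sums[k][0] / total_sizes[k],
             total_sums[k][1] / total_sizes[k]) for k in range(K)]

# Map over 4 chunks of 25 points, then reduce
outputs = [mapper(points[i:i + 25], centroids) for i in range(0, len(points), 25)]
new_centroids = reducer(outputs)
print(new_centroids)
```

Emitting sums and sizes rather than local means is what makes the weighted aggregation correct: the reducer recovers the exact global mean per cluster. (A production version would also handle empty clusters, which this sketch does not.)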
Discussion on MapReduce
- MapReduce is not designed for iterative processing: mappers read the same data again and again
- MapReduce looks too low-level to some people: data analysts are traditionally SQL folks :)
- MapReduce looks too high-level to others: a lot of MapReduce logic is hard to adapt (example: grouping documents by words)
Slide by R. Bekkerman, M. Bilenko, J. Langford
GraphLab
- Open-source parallel machine learning
- Developed at Carnegie Mellon Univ.
- Available at www.graphlab.org
For more information...
- Cambridge Univ. Press, released in 2011
- 21 chapters covering: platforms, algorithms, learning setups, applications
Slide by R. Bekkerman, M. Bilenko, J. Langford
Learning Multiple Tasks via Knowledge Transfer
Transfer Learning
Idea: transfer information from one or more source tasks to improve learning on a target task
Step 1 (source tasks): each of Task 1, Task 2, ..., Task N has its own learner that fits a model from that task's data; the resulting models feed into a pool of source knowledge
- Plenty of training data for each source task
Eric Eaton
Transfer Learning
Idea: transfer information from one or more source tasks to improve learning on a target task
Step 2 (new target task): the learner combines the target task's data with the source knowledge to fit the target model
- Insufficient training data on the target task
Eric Eaton
Benefits of Transfer in Learning
- Primary goal: learning the target task T_new better after first learning related source tasks T_1, ..., T_N. "Better" means some combination of:
  - Improved initial performance
  - More rapid learning
  - Higher achievable performance
  [Figures: learning curves of performance vs. # training examples, with transfer vs. without transfer; adapted from (DARPA/IPTO, 2005)]
- Secondary goal: creating chunks of reusable knowledge
Eric Eaton
Multi-Task Learning
Idea: learn all task models simultaneously, sharing knowledge (Caruana, 1997; Zhang et al., 2008; Kumar & Daumé, 2012)
[Diagram: data from Task 1, Task 2, ..., Task N feed into a single multi-task learner, which produces a model for each task]
Eric Eaton