Large Scale Learning

Imogene Phelps
8 years ago
1 Large Scale Learning

2 Data hypergrowth: an example. Reuters: about 10K docs (ModApte), Bekkerman et al., SIGIR 2001. RCV1: about 807K docs, Bekkerman & Scholz, CIKM 2008. LinkedIn job title data: about 100M docs, Bekkerman & Gavish, KDD. Slide by R. Bekkerman, M. Bilenko, J. Langford

3 New age of big data. The world has gone mobile: 5 billion cellphones produce daily data. Social networks have gone online: Twitter produces 200M tweets a day. Crowdsourcing is the reality: labeling of 100,000+ data instances is doable within a week :) Slide by R. Bekkerman, M. Bilenko, J. Langford

4 Size matters. One thousand data instances. One million data instances. One billion data instances. One trillion data instances. Those are not different numbers, those are different mindsets :) Slide by R. Bekkerman, M. Bilenko, J. Langford

5 One million data instances. Currently the most active zone. Can be crowdsourced. Can be processed by a quadratic algorithm, once parallelized. A 1M data collection cannot be too diverse, but can be too homogeneous. Preprocessing / data probing is crucial. Slide by R. Bekkerman, M. Bilenko, J. Langford

6 Big dataset cannot be too sparse. 1M data instances cannot belong to 1M classes, simply because it's not practical to have 1M classes :) Here's a statistical experiment, in the text domain: 1M documents; each document is 100 words long; randomly sampled from a unigram language model; no stopwords. 245M pairs have word overlap of 10% or more. Real-world datasets are denser than random. Slide by R. Bekkerman, M. Bilenko, J. Langford
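The experiment on this slide can be tried at a much smaller scale. The following is a minimal sketch, not the authors' setup: the Zipf-like unigram distribution, the vocabulary size, the number of dropped stopword ranks, and the reading of "word overlap of 10%" as "at least 10 shared distinct words out of 100" are all assumptions made for illustration.

```python
import random
from itertools import accumulate, combinations

def sample_documents(num_docs, doc_len, vocab_size, num_stopwords=100, seed=0):
    """Sample each document as a set of word ids from a unigram model.

    Assumption: word probabilities follow a Zipf-like 1/rank law, and the
    num_stopwords most frequent ranks are dropped to mimic stopword removal.
    """
    rng = random.Random(seed)
    ranks = range(num_stopwords + 1, num_stopwords + vocab_size + 1)
    cum_weights = list(accumulate(1.0 / rank for rank in ranks))
    vocab = range(vocab_size)
    return [set(rng.choices(vocab, cum_weights=cum_weights, k=doc_len))
            for _ in range(num_docs)]

def count_overlapping_pairs(docs, doc_len, threshold=0.10):
    """Count document pairs sharing at least threshold * doc_len distinct words."""
    min_shared = threshold * doc_len
    return sum(1 for a, b in combinations(docs, 2) if len(a & b) >= min_shared)

# Scaled-down run: 200 documents instead of 1M (hypothetical parameters).
docs = sample_documents(num_docs=200, doc_len=100, vocab_size=50_000)
pairs = count_overlapping_pairs(docs, doc_len=100)
total = len(docs) * (len(docs) - 1) // 2
print(f"{pairs} of {total} pairs overlap by 10% or more")
```

The pairwise loop is quadratic in the number of documents, so it only suits small samples; the count it prints depends entirely on the assumed distribution and should not be compared directly with the 245M figure.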

7 One billion data instances. Web-scale. Guaranteed to contain data in different formats: ASCII text, pictures, JavaScript code, PDF documents. Guaranteed to contain (near) duplicates. Likely to be badly preprocessed :) Storage is an issue. Slide by R. Bekkerman, M. Bilenko, J. Langford

8 One trillion data instances. Beyond the reach of modern technology. The peer-to-peer paradigm is (arguably) the only way to process the data. Data privacy / inconsistency / skewness issues. Can't be kept in one location. Is intrinsically hard to sample. Slide by R. Bekkerman, M. Bilenko, J. Langford

9 Not enough (clean) training data? Use existing labels as guidance rather than a directive, in a semi-supervised clustering framework. Or label more data! :) With a little help from the crowd. Slide by R. Bekkerman, M. Bilenko, J. Langford

10 Crowdsourcing labeled data. Crowdsourcing is a tough business :) People are not machines. Any worker who can game the system will game the system. A validation framework plus qualification tests are a must. Labeling a lot of data can be fairly expensive. Slide by R. Bekkerman, M. Bilenko, J. Langford

11 Let's talk about how we can learn with datasets this large...

12 Stochastic Gradient Descent

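13 Consider Learning with Numerous Data. Logistic regression objective: $J(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log h_\theta(x_i) + (1 - y_i)\log\left(1 - h_\theta(x_i)\right)\right] = \frac{1}{n}\sum_{i=1}^{n}\mathrm{cost}_\theta(x_i, y_i)$, with $\mathrm{cost}_\theta(x_i, y_i) = -\left[y_i \log h_\theta(x_i) + (1 - y_i)\log\left(1 - h_\theta(x_i)\right)\right]$. Fit via gradient descent: $\theta_j \leftarrow \theta_j - \alpha\,\frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x_i) - y_i\right)x_{ij}$. What is the computational complexity in terms of n?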
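To make the complexity question concrete, here is a minimal pure-Python sketch of the objective and of one full gradient evaluation. The function names and the toy data are illustrative assumptions; each feature vector is assumed to carry a leading 1 for the intercept term.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def h(theta, x):
    """Logistic regression hypothesis h_theta(x)."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def cost(theta, X, y):
    """J(theta): average negative log-likelihood over the n examples."""
    eps = 1e-12  # clamp probabilities to avoid log(0)
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = min(max(h(theta, x_i), eps), 1.0 - eps)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return -total / len(X)

def batch_gradient(theta, X, y):
    """(1/n) * sum_i (h(x_i) - y_i) * x_ij for every j."""
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):            # n examples ...
        err = h(theta, x_i) - y_i
        for j, x_ij in enumerate(x_i):    # ... times d + 1 features
            grad[j] += err * x_ij
    return [g / len(X) for g in grad]

# Toy data (hypothetical): first feature is the constant 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
theta = [0.0, 0.0]
print(cost(theta, X, y), batch_gradient(theta, X, y))
```

The nested loop in `batch_gradient` touches every example once per update, so a single gradient descent step costs on the order of n times d operations, which is what becomes expensive when n is very large.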

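14 Batch Gradient Descent vs Stochastic Gradient Descent
Batch Gradient Descent: Initialize θ. Repeat { $\theta_j \leftarrow \theta_j - \alpha\,\frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x_i) - y_i\right)x_{ij}$ for $j = 0 \ldots d$ }, where $\frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x_i) - y_i\right)x_{ij} = \frac{\partial}{\partial\theta_j}J(\theta)$.
Stochastic Gradient Descent: Initialize θ. Randomly shuffle dataset. Repeat { (typically 1-10x) For $i = 1 \ldots n$, do $\theta_j \leftarrow \theta_j - \alpha\left(h_\theta(x_i) - y_i\right)x_{ij}$ for $j = 0 \ldots d$ }, where $\left(h_\theta(x_i) - y_i\right)x_{ij} = \frac{\partial}{\partial\theta_j}\mathrm{cost}_\theta(x_i, y_i)$.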
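The two procedures can be put side by side in code. This is a minimal sketch under assumed settings (learning rates, iteration counts, and the toy separable dataset are all illustrative choices, not values from the slides).

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(theta, x):
    """h_theta(x) for logistic regression."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def batch_gd(X, y, alpha=0.5, iterations=200):
    """Each update uses the gradient averaged over all n examples."""
    theta = [0.0] * len(X[0])
    n = len(X)
    for _ in range(iterations):
        errors = [predict(theta, x_i) - y_i for x_i, y_i in zip(X, y)]
        theta = [t - alpha * sum(e * x_i[j] for e, x_i in zip(errors, X)) / n
                 for j, t in enumerate(theta)]
    return theta

def sgd(X, y, alpha=0.1, epochs=10, seed=0):
    """Each update uses the gradient of a single example's cost."""
    data = list(zip(X, y))
    random.Random(seed).shuffle(data)   # randomly shuffle the dataset once
    theta = [0.0] * len(X[0])
    for _ in range(epochs):             # typically 1-10 passes
        for x_i, y_i in data:
            err = predict(theta, x_i) - y_i
            theta = [t - alpha * err * x_ij for t, x_ij in zip(theta, x_i)]
    return theta

# Toy, linearly separable data (hypothetical); first feature is the constant 1.
X = [[1.0, v] for v in (-3.0, -2.0, -1.0, 1.0, 2.0, 3.0)]
y = [0, 0, 0, 1, 1, 1]
theta_batch = batch_gd(X, y)
theta_sgd = sgd(X, y)
print(theta_batch, theta_sgd)
```

Batch GD performs one parameter update per pass over the data, while SGD performs n updates per pass, which is why a handful of passes is often enough on large datasets.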

15 Batch vs Stochastic GD (figure panels: Batch GD, Stochastic GD). Learning rate α is typically held constant. Can slowly decrease α over time to force θ to converge, e.g., $\alpha = \frac{\text{constant1}}{\text{iterationNumber} + \text{constant2}}$. Based on slide by Andrew Ng
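The decaying schedule on this slide is a one-line function. The sketch below uses placeholder values for the two constants; in practice they are tuned per problem.

```python
def decayed_learning_rate(iteration, constant1=1.0, constant2=10.0):
    """alpha = constant1 / (iterationNumber + constant2).

    The default constants are arbitrary placeholders, not recommended values.
    """
    return constant1 / (iteration + constant2)

# The rate shrinks as the iteration count grows.
for iteration in (0, 10, 100, 1000):
    print(iteration, decayed_learning_rate(iteration))
```

Inside an SGD loop, this value would replace the fixed α, with `iteration` counting the individual updates performed so far.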

16 Graph- and Data-Parallelism

17 Map-Reduce (diagram): the training set is split across Computer 1, Computer 2, Computer 3, and Computer 4, and their outputs feed a combine-results step. Based on slide by Andrew Ng

18 Multi-Core Machines (diagram): the training set is split across Core 1, Core 2, Core 3, and Core 4, and their outputs feed a combine-results step. Based on slide by Andrew Ng

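19 Map-Reduce for Batch GD. Split the dataset into chunks (e.g., with n = 400) to compute $\theta_j \leftarrow \theta_j - \alpha\,\frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x_i) - y_i\right)x_{ij}$.
Chunk $(x_1, y_1), \ldots, (x_{100}, y_{100})$: $\mathrm{temp}_1 = \sum_{i=1}^{100}\left(h_\theta(x_i) - y_i\right)x_{ij}$
Chunk $(x_{101}, y_{101}), \ldots, (x_{200}, y_{200})$: $\mathrm{temp}_2 = \sum_{i=101}^{200}\left(h_\theta(x_i) - y_i\right)x_{ij}$
Chunk $(x_{201}, y_{201}), \ldots, (x_{300}, y_{300})$: $\mathrm{temp}_3 = \sum_{i=201}^{300}\left(h_\theta(x_i) - y_i\right)x_{ij}$
Chunk $(x_{301}, y_{301}), \ldots, (x_{400}, y_{400})$: $\mathrm{temp}_4 = \sum_{i=301}^{400}\left(h_\theta(x_i) - y_i\right)x_{ij}$
Based on example by Andrew Ng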

20 Map-Reduce for Batch GD (continued). With the dataset split into four chunks of 100 examples (n = 400), the k-th chunk yields $\mathrm{temp}_k = \sum_{i=100(k-1)+1}^{100k}\left(h_\theta(x_i) - y_i\right)x_{ij}$. Combine results: $\theta_j \leftarrow \theta_j - \alpha\,\frac{1}{n}\sum_{k=1}^{4}\mathrm{temp}_k$, which equals the batch update $\theta_j \leftarrow \theta_j - \alpha\,\frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x_i) - y_i\right)x_{ij}$. Based on example by Andrew Ng
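The split, partial-sum, and combine steps can be sketched in a single process. This is an illustration only: the chunks are processed sequentially here, whereas in a real deployment each call to the hypothetical `partial_gradient` would run on a separate machine or core. The synthetic dataset is an assumption.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(theta, x):
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def split(data, num_chunks):
    """Cut the training set into contiguous chunks of near-equal size."""
    size = math.ceil(len(data) / num_chunks)
    return [data[k:k + size] for k in range(0, len(data), size)]

def partial_gradient(theta, chunk):
    """Map step: temp_k = sum over the chunk of (h(x_i) - y_i) * x_ij."""
    temp = [0.0] * len(theta)
    for x_i, y_i in chunk:
        err = predict(theta, x_i) - y_i
        for j, x_ij in enumerate(x_i):
            temp[j] += err * x_ij
    return temp

def map_reduce_step(theta, data, alpha, num_chunks=4):
    """One batch GD update assembled from per-chunk partial sums."""
    temps = [partial_gradient(theta, c) for c in split(data, num_chunks)]  # map
    total = [sum(column) for column in zip(*temps)]                        # combine
    n = len(data)
    return [t - alpha * g / n for t, g in zip(theta, total)]

# Synthetic data with n = 400 (hypothetical); first feature is the constant 1.
data = [([1.0, (i % 20) / 10.0 - 1.0], 1 if i % 20 >= 10 else 0)
        for i in range(400)]
theta = map_reduce_step([0.0, 0.0], data, alpha=0.1, num_chunks=4)
print(theta)
```

Because the partial sums simply add up to the full sum, the result matches a single-machine batch update up to floating-point rounding; only the per-chunk work is distributed.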

21-23 Parallelizing k-means (three figure-only slides). Slide by R. Bekkerman, M. Bilenko, J. Langford

24 k-means on MapReduce. Mappers read data portions and centroids. Mappers assign data instances to clusters. Mappers compute new local centroids and local cluster sizes. Reducers aggregate local centroids (weighted by local cluster sizes) into new global centroids. Reducers write the new centroids. Slide by R. Bekkerman, M. Bilenko, J. Langford
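One k-means iteration following the mapper and reducer roles above might look like the sketch below. It runs in one process on hypothetical data portions; the function names, the handling of empty clusters (keeping the old centroid), and the example points are assumptions, and no actual MapReduce framework is involved.

```python
def mapper(points, centroids):
    """Assign points to the nearest centroid; return local centroids and sizes."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    sizes = [0] * len(centroids)
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
        )
        sizes[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    # A cluster with no local points keeps the old centroid and size 0.
    local = [[s / n for s in vec] if n else list(old)
             for vec, n, old in zip(sums, sizes, centroids)]
    return local, sizes

def reducer(mapper_outputs, old_centroids):
    """Average local centroids, weighted by local cluster sizes."""
    new_centroids = []
    for k, old in enumerate(old_centroids):
        total = sum(sizes[k] for _, sizes in mapper_outputs)
        if total == 0:
            new_centroids.append(list(old))
            continue
        new_centroids.append([
            sum(local[k][d] * sizes[k] for local, sizes in mapper_outputs) / total
            for d in range(len(old))
        ])
    return new_centroids

# Two data portions, as if read by two mappers (hypothetical points).
portions = [[(0, 0), (0, 2)], [(10, 10), (10, 12), (0, 1)]]
centroids = [(0, 0), (10, 10)]
outputs = [mapper(portion, centroids) for portion in portions]
new_centroids = reducer(outputs, centroids)
print(new_centroids)
```

Repeating this step until the centroids stop moving gives the full algorithm; each repetition re-reads every data portion, which is the iterative-processing cost discussed on the next slide.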

25 Discussion on MapReduce. MapReduce is not designed for iterative processing: mappers read the same data again and again. MapReduce looks too low-level to some people: data analysts are traditionally SQL folks :) MapReduce looks too high-level to others: a lot of MapReduce logic is hard to adapt (example: grouping documents by words). Slide by R. Bekkerman, M. Bilenko, J. Langford

26 GraphLab. Open-source parallel machine learning. Developed at Carnegie Mellon Univ.

27 For more information... Cambridge Univ. Press. Released in chapters, covering: platforms, algorithms, learning setups, applications. Slide by R. Bekkerman, M. Bilenko, J. Langford

28 Learning Multiple Tasks via Knowledge Transfer

29 Transfer Learning. Idea: transfer information from one or more source tasks to improve learning on a target task. Step 1 (diagram): data for the source tasks (Task 1, Task 2, ..., Task N) goes to one learner per task, and the resulting models form the source knowledge. Plenty of training data for each source task. Eric Eaton

30 Transfer Learning. Idea: transfer information from one or more source tasks to improve learning on a target task. Step 2 (diagram): the source knowledge and the new target task's data go to the machine learner, which produces a model. Insufficient training data on the target task. Eric Eaton

31 Benefits of Transfer in Learning. Primary goal: learning the target task $T_{new}$ better after first learning related source tasks $T_1, \ldots, T_N$. Better means some combination of: more rapid learning, improved initial performance, higher achievable performance (three plots of performance vs. # training examples, each comparing "with transfer" and "without transfer"; figures adapted from (DARPA/IPTO, 2005)). Secondary goal: creating chunks of reusable knowledge. Eric Eaton

32 Multi-Task Learning. Idea: learn all task models simultaneously, sharing knowledge (Caruana 1997; Zhang et al. 2008; Kumar & Daumé 2012). Diagram: data for Task 1, Task 2, ..., Task N goes to a single multi-task learner, which produces the models. Eric Eaton

Scaling Up Machine Learning: Parallel and Distributed Approaches. Ron Bekkerman (LinkedIn), Misha Bilenko (MSR), John Langford (Y!R). http://hunch.net/~large_scale_survey Outline: Introduction, Tree Induction, Break