Collaborative Filtering
Scalable Data Analysis Algorithms
Claudia Lehmann, Andrina Mascher

2 Outline
1. Retrospection
2. Stratosphere Plans
3. Comparison with Hadoop
4. Evaluation
5. Outlook

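3 Retrospection
Matrix factorization: find optimal factors W and H such that W x H approximates the rating matrix V
Example entries of V (? = unknown rating): 5 4 ? ? 5 2 2 ? 4
Method: Stochastic Gradient Descent (SGD)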
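The factorization idea on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the 3 x 3 placement of the example entries, the rank, the step size, the squared-error loss without regularization, and all function names are assumptions made for the example.

```python
import math
import random

def predict(W, H, i, j):
    """Predicted rating: dot product of row i of W and column j of H."""
    return sum(W[i][k] * H[k][j] for k in range(len(H)))

def rmse(ratings, W, H):
    """Root mean squared error over the known cells only."""
    total = sum((v - predict(W, H, i, j)) ** 2 for (i, j), v in ratings.items())
    return math.sqrt(total / len(ratings))

def init_factors(n_rows, n_cols, rank, seed=0):
    """Random starting point for W (n_rows x rank) and H (rank x n_cols)."""
    rng = random.Random(seed)
    W = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(n_cols)] for _ in range(rank)]
    return W, H

def sgd_epoch(ratings, W, H, step, rng):
    """One pass over the known cells in random order, updating W and H in place."""
    cells = list(ratings.items())
    rng.shuffle(cells)
    for (i, j), v in cells:
        err = v - predict(W, H, i, j)
        for k in range(len(H)):
            w_ik, h_kj = W[i][k], H[k][j]
            # Gradient step for the squared error (v - w.h)^2
            W[i][k] += 2 * step * err * h_kj
            H[k][j] += 2 * step * err * w_ik

# Known cells of V; the 3 x 3 layout is an assumption, unknown cells are left out.
V = {(0, 0): 5, (0, 1): 4, (1, 1): 5, (1, 2): 2, (2, 0): 2, (2, 2): 4}
W, H = init_factors(3, 3, rank=2)
loss_before = rmse(V, W, H)
rng = random.Random(1)
for _ in range(500):
    sgd_epoch(V, W, H, step=0.01, rng=rng)
loss_after = rmse(V, W, H)
```

Comparing loss_after with loss_before shows whether the chosen step size lets the training loss fall; predict(W, H, i, j) then fills in the unknown cells.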

4 Retrospection
Result of SGD step
Hadoop jobs with judging loss

5 Stratosphere Plans: Multiple Iterations
Hadoop jobs:
for each starting point:
    while loss too high:
        MapReduce: optimize factors
        MapReduce: join triples
        MapReduce: calculate loss (training)
        calculate loss (judging)
Stratosphere plans:
for each starting point:
    while stop file not empty:
        MapMapReduce: optimize factors
        CrossMatch: join triples
        MapReduceCross: calculate loss (training) and decide
        MapMatchReduce: calculate loss (judging)

6 Stratosphere Plans: Subplans in One Iteration
Subplans: Optimize factors + Join triples + Calc loss (training) + Calc loss (judging)
Data sets:
  Triples: 3,7,b,w,h
  Training data: 3,7,b,-,- (keys: row, column)
  Judging data: Saw,Tom,4
  Losses: 0.9; 1.0; 2.4
  Judging loss: 0.96
Stop check: stop? (iterate while the stop file is not empty)

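7 Stratosphere Plans: Optimize Factors
Hadoop: Triples -> Map: SGD -> Reduce: average -> Factors
Stratosphere: Triples -> Map: SGD -> Map: filter w -> Reduce: average -> Factor W
              Triples -> Map: SGD -> Map: filter h -> Reduce: average -> Factor H
Example records:
  Triples: 3,7,b,w,h and 5,7,b,w,h
  After Map: SGD: 3,-,-,w,- and -,7,-,-,h; 5,-,-,w,- and -,7,-,-,h
  Factor W: 3,-,-,w,- and 5,-,-,w,-
  Factor H: -,7,-,-,h
Notes:
  Output of Map: SGD is not materialized
  Pacts are configured to read fields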
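The "Map: SGD, Reduce: average" pattern above could look roughly like the following sketch. It simplifies heavily: b is treated as a single rating rather than a block, w and h are short vectors, and the names sgd_map and average_reduce are hypothetical, not taken from the original plans.

```python
from collections import defaultdict

def sgd_map(triple, step):
    """Map: one SGD update on a triple (row, col, b, w, h).

    Emits the updated row factor keyed by row and the updated
    column factor keyed by col, like 3,-,-,w,- and -,7,-,-,h.
    """
    row, col, b, w, h = triple
    err = b - sum(wk * hk for wk, hk in zip(w, h))
    new_w = [wk + 2 * step * err * hk for wk, hk in zip(w, h)]
    new_h = [hk + 2 * step * err * wk for wk, hk in zip(w, h)]
    return (row, new_w), (col, new_h)

def average_reduce(pairs):
    """Reduce: average all factor vectors that share the same key."""
    groups = defaultdict(list)
    for key, vec in pairs:
        groups[key].append(vec)
    return {key: [sum(col) / len(vecs) for col in zip(*vecs)]
            for key, vecs in groups.items()}

# Two triples that share column 7, as in the example records.
triples = [(3, 7, 3.0, [1.0], [1.0]), (5, 7, 1.0, [1.0], [1.0])]
w_updates, h_updates = [], []
for t in triples:
    w_rec, h_rec = sgd_map(t, step=0.1)
    w_updates.append(w_rec)   # corresponds to "Map: filter w"
    h_updates.append(h_rec)   # corresponds to "Map: filter h"
factor_w = average_reduce(w_updates)
factor_h = average_reduce(h_updates)
```

Column 7 receives two competing updates here, so its new factor is the average of both, while rows 3 and 5 each keep their single update.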

8 Stratosphere Plans: Join Triples (1/2)
Hadoop: Factors + Training data -> Map: replicate factors, keep training data -> Reduce: group -> Triples
  Needs matrix dimensions
Stratosphere: Factor W + Factor H -> Cross -> Match (row,col) with Training data -> Triples
Example records:
  Factor W: 3,-,-,w,-
  Factor H: -,7,-,-,h
  After Cross: 3,7,-,w,h
  Training data: 3,7,b,-,-
  Triples: 3,7,b,w,h

9 Stratosphere Plans: Join Triples (2/2)
Cross-Match:
  Factor W 3,-,-,w,- Cross Factor H -,7,-,-,h -> 3,7,-,w,h
  Match (row,col) with Training data 3,7,b,-,- -> Triples 3,7,b,w,h
Match-Match:
  Factor W 3,-,-,w,- Match (row) Training data 3,7,b,-,- -> 3,7,b,w,-
  Match (col) with Factor H -,7,-,-,h -> Triples 3,7,b,w,h
Compiler hints helpful?
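The second variant (Match on row, then Match on col) behaves like two consecutive equi-joins. The sketch below imitates that with a plain hash join; it is not the Stratosphere API, and the dict-based records, the match helper and the merge function are assumptions made for illustration.

```python
from collections import defaultdict

def merge(a, b):
    """Combine two records field by field, keeping whichever value is set."""
    return {f: a[f] if a[f] is not None else b[f] for f in a}

def match(left, right, key_field):
    """Equi-join: pair each left record with the right records sharing key_field."""
    index = defaultdict(list)
    for rec in right:
        index[rec[key_field]].append(rec)
    return [merge(l, r) for l in left for r in index.get(l[key_field], [])]

# Records follow the slide layout row,col,b,w,h with None standing in for "-".
training = [{"row": 3, "col": 7, "b": 4.0, "w": None, "h": None}]
factor_w = [{"row": 3, "col": None, "b": None, "w": [0.5, 1.0], "h": None}]
factor_h = [{"row": None, "col": 7, "b": None, "w": None, "h": [2.0, 1.0]}]

with_w = match(training, factor_w, "row")   # 3,7,b,w,-
triples = match(with_w, factor_h, "col")    # 3,7,b,w,h
```

Unlike the Cross-Match variant, this never builds the full cross product of row and column factors, which is presumably why the slide lists it as an alternative worth comparing.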

10 Stratosphere Plans: Calculate Loss (Training)
Hadoop: Triples -> Map: local loss -> Reduce: RMSE -> 0.9
  Driver class knows loss history and decides on stopping
Stratosphere: Triples -> Map: local loss -> Reduce: RMSE -> Cross: loss history -> OutputFormat: decide stop -> stop?
Record format: dummy, loss, #points
Losses (epoch e): 1.0; 2.4
Losses (epoch e+1): 0.9; 1.0; 2.4
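A small sketch of this loss pipeline, under stated assumptions: b is a single rating, the loss history is kept newest first as in the example above, and the stopping rule (stop when the loss improves by less than min_gain) is invented for illustration because the slides do not state the actual criterion.

```python
import math

def local_loss(triple):
    """Map: squared error of one triple, emitted as (dummy, loss, #points)."""
    row, col, b, w, h = triple
    pred = sum(wk * hk for wk, hk in zip(w, h))
    return ("dummy", (b - pred) ** 2, 1)

def rmse_reduce(records):
    """Reduce: all records share the dummy key, so this is a global aggregate."""
    total = sum(loss for _, loss, _ in records)
    points = sum(n for _, _, n in records)
    return math.sqrt(total / points)

def decide_stop(history, new_loss, min_gain=0.05):
    """Cross with the loss history (newest first) and decide on stopping.

    The min_gain rule is an assumed criterion, not the one from the slides.
    """
    stop = bool(history) and history[0] - new_loss < min_gain
    return stop, [new_loss] + history

triples = [(3, 7, 4.0, [1.0], [3.0]), (5, 7, 2.0, [1.0], [3.0])]
epoch_loss = rmse_reduce([local_loss(t) for t in triples])
stop, history = decide_stop([1.0, 2.4], 0.9)
```

With the history 1.0; 2.4 and a new loss of 0.9, the assumed rule keeps iterating and the history becomes 0.9; 1.0; 2.4.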

11 Stratosphere Plans: Calculate Loss (Judging)
Hadoop: Factors + Netflix judging files -> 0.96
  No MapReduce job, driver class receives loss directly
Stratosphere: Triples 3,7,b,w,h -> Map: emit cells (Saw,Tom,4.8) -> Match (mid, uid) with Judging data (Saw,Tom,4): local loss (dummy,0.8²,1) -> Reduce: RMSE -> 0.96
  Similar to training loss, Map included in Match
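The Match on (mid, uid) followed by the RMSE reduce can be illustrated with the example record from this slide. The function name judging_rmse and the tuple layout (mid, uid, rating) are assumptions for the sketch; predictions without a judging entry are simply ignored.

```python
import math

def judging_rmse(predictions, judging):
    """Join predicted and judged ratings on (mid, uid) and return the RMSE."""
    preds = {(mid, uid): rating for mid, uid, rating in predictions}
    losses = [(preds[(mid, uid)] - rating) ** 2
              for mid, uid, rating in judging
              if (mid, uid) in preds]
    return math.sqrt(sum(losses) / len(losses))

# Saw,Tom is predicted as 4.8 and judged as 4, giving a local loss of 0.8².
predictions = [("Saw", "Tom", 4.8), ("Saw", "Ann", 2.0)]
judging = [("Saw", "Tom", 4)]
loss = judging_rmse(predictions, judging)
```

With a single matched cell the RMSE equals the absolute error of 0.8; the 0.96 on the slide is the value over the whole judging set.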

12 Comparison: Jobs vs. Plans
Equal results without random...
  Files
  Starting points
  Training sequence
Factors not materialized
Separate factor files possible
Create new file for each iteration
No efficient serialization between plans (yet)
  Either parsing text file
  Or use sequence file singlethreaded

13 Comparison: Data Schema
Hadoop:
  3#7 / TripleStorage (Storage class is tagged union)
  Dummy key / LossStorage
  Getter, setter, toString()
Stratosphere:
  3,7,b,w,h
  Dummy key, loss, #points
  Remember key places
  Reuse pacts with configurations for different keys
  Composite keys possible

14 Comparison: Preprocessing
Requires:
  Line format according to parameters
  Copy to HDFS
Serialize factors and blocks
Use Map file to write serialized values to HDFS
Java process to define lines
Shell script to move to HDFS
Extra pacts to parse lines

15 Stratosphere Preprocessing: Define Line Format
Inputs: Netflix files, Netflix judging files
Files: factorw.txt, factorh.txt, blocks.txt
Steps (plans 0, 1, 2): Reduce: group cells to blocks; Create triples; Calc loss: training; SGD Step
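The "Reduce: group cells to blocks" step could be sketched as follows. The 0-based movie and user indices, the integer-division block key and the function name are assumptions; the block size of 1000 matches the evaluation parameters given later in the deck.

```python
from collections import defaultdict

def group_cells_to_blocks(cells, block_size=1000):
    """Group (movie, user, rating) cells by the block they fall into.

    The block key is (movie // block_size, user // block_size); each cell
    keeps its position relative to the block origin.
    """
    blocks = defaultdict(list)
    for movie, user, rating in cells:
        key = (movie // block_size, user // block_size)
        blocks[key].append((movie % block_size, user % block_size, rating))
    return dict(blocks)

cells = [(0, 0, 5), (999, 999, 4), (1000, 0, 3), (1500, 2500, 2)]
blocks = group_cells_to_blocks(cells)
```

The first two cells land in block (0, 0), while the other two open blocks (1, 0) and (1, 2).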

16 Suggestions
Global aggregation of loss with dummy key:
  Reducer with no key
  Reducer with compiler hint nrofkeys = 1
  No sorting needed
Provide toString() in PactRecord
Provide getter, setter in PactRecord to encapsulate field numbers and class types
Keep configuration options (e.g. reduce: average W or H)
Sequence files for sinks and sources
Move log files to master node

17 Evaluation: Parameters for Netflix Data
starting points: 1
max. iterations: 1
degree of parallelism: 1, 2, 5, 10
step size:
factor size: 5
block size: 1000 movies x 1000 user
data size   1/8    1/4    1/2    1
user        63K    125K   250K   500K
movies      ~2K    ~4K    9K     18K
Plan per iteration:
for each starting point:
    while stop file not empty:
        MapMapReduce: optimize factors
        CrossMatch: join triples
        MapReduceCross: calculate loss (training) and decide
        MapMatchReduce: calculate loss (judging)

18 Evaluation: Run Time for Variable Data Size
Figures: run time (0 h 00 min to 2 h 52 min) over data size (1/8, 1/4, 1/2, 1) for block sizes 50 x 50 and 1000 x 1000
Series: Init, SGD Step; for 1000 x 1000 also Init Blocks (reduce cells), Init Join (cross, match), SGD Step
Run time grows by factor 4
10 nodes vs. DoP = 8
Usually iterations
No sequence file between plans

19 Evaluation: Run Time for Variable Degree of Parallelism
Data size: 1/8
Figure: run time (0 min to 20 min) per subplan for each DoP
Subplans:
  optimize factors: Map Map Reduce
  join triples: Cross Match
  write triples
  calc loss (training): Map Reduce Cross
  calc loss (judging): Map Match Reduce
Reading and writing always DoP=1
DoP: performs best, but data is small
Faster with higher DoP

20 Outlook
Join triples with Cross-Match vs. Match-Match
Degree of parallelism: 10
Inspect judging outcome: RMSE should be equal
Evaluate DoP with bigger data

21 Summary

22 References
Anand Rajaraman and Jeff Ullman. Mining of Massive Datasets. Cambridge University Press,
Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent. IBM Research Report RJ10481, March
November 2011
