E6893 Big Data Analytics Lecture 5: Big Data Analytics Algorithms -- II Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center October 2nd, 2014 1
Course Structure (Class Date | Number | Topics Covered)
09/04/14 | 1 | Introduction to Big Data Analytics
09/11/14 | 2 | Big Data Analytics Platforms
09/18/14 | 3 | Big Data Storage and Processing
09/25/14 | 4 | Big Data Analytics Algorithms -- I
10/02/14 | 5 | Big Data Analytics Algorithms -- II (recommendation)
10/09/14 | 6 | Big Data Analytics Algorithms -- III (clustering)
10/16/14 | 7 | Big Data Analytics Algorithms -- IV (classification)
10/23/14 | 8 | Linked Big Data -- Graph Computing
10/30/14 | 9 | Big Data Visualization
11/06/14 | 10 | Mobile Data Collection, Analysis, and Interface
11/13/14 | 11 | Hardware, Processors, and Cluster Platforms
11/20/14 | 12 | Big Data Next Challenges -- IoT, Cognition, and Beyond
11/27/14 | -- | Thanksgiving Holiday
12/04/14 | 13 | Final Projects Discussion (Optional)
12/11/14 & 12/12/14 | 14-15 | Two-Day Big Data Analytics Workshop -- Final Project Presentations
2
Review Key Components of Mahout 3
Mahout reference book 4
Setting Up Mahout Step 1: Java JVM and IDEs (e.g., Eclipse) Step 2: Maven Step 3: Mahout Eclipse Luna (June 2014) 5
Recommender Inputs Solid lines: positively related Dashed lines: negatively related Input Data: User, Item, Rating 6
User-based Recommendation Scenario I gettofail.com 7
User-based Recommendation Scenario II 8
User-based Recommendation Scenario III 9
User-based Recommendation Algorithms 10
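As a concrete illustration of the user-based scheme, here is a minimal self-contained sketch in plain Java (not Mahout's `GenericUserBasedRecommender`; the class and method names are invented for illustration): the unknown preference of a user for an item is estimated as the similarity-weighted average of the neighbors' ratings of that item.

```java
import java.util.Map;

// Minimal sketch of user-based prediction (illustrative, not Mahout's code):
// estimate user u's rating of an item as the similarity-weighted average
// of the ratings u's neighbors gave to that item.
class UserBasedSketch {
    // ratings.get(user).get(item) -> rating; simToUser: neighbor -> similarity to u
    static double estimate(Map<Integer, Map<Integer, Double>> ratings,
                           Map<Integer, Double> simToUser,
                           int item) {
        double num = 0.0, den = 0.0;
        for (Map.Entry<Integer, Double> e : simToUser.entrySet()) {
            Double r = ratings.getOrDefault(e.getKey(), Map.of()).get(item);
            if (r == null) continue;             // neighbor did not rate the item
            num += e.getValue() * r;
            den += Math.abs(e.getValue());
        }
        return den == 0.0 ? Double.NaN : num / den;
    }
}
```

Neighbors who never rated the item drop out of the average; if no neighbor rated it, no estimate is produced (NaN here).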
Example Recommender Code via Mahout 11
Process and output of the example Recommendation for Person 1: Item 104 > Item 106; Item 107 is not favored 12
Refresh (Reload) Data 13
Update data 14
User Similarity Measurements
- Pearson Correlation Similarity
- Euclidean Distance Similarity
- Cosine Measure Similarity
- Spearman Correlation Similarity
- Tanimoto Coefficient Similarity (Jaccard coefficient)
- Log-Likelihood Similarity
15
Pearson Correlation Similarity Data: note the missing ratings (the correlation is computed only over the items both users rated) 16
On Pearson Similarity Three problems with the Pearson similarity:
1. It does not take into account the number of items over which two users' preferences overlap (e.g., an overlap of only 2 items can yield a correlation of 1; overlapping on more items may not score better).
2. If two users overlap on only one item, no correlation can be computed.
3. The correlation is undefined if either series of preference values is identical (zero variance).
Adding Weighting.WEIGHTED as the 2nd parameter of the constructor pushes the resulting correlation toward 1.0 or -1.0, depending on how many points are used. 17
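The problems above are easy to see in a small self-contained sketch of Pearson correlation over the co-rated items (plain Java, not Mahout's `PearsonCorrelationSimilarity`; names are invented): with fewer than two overlapping items, or a zero-variance series, the result is undefined.

```java
// Sketch of Pearson correlation over the items two users co-rated.
// Returns NaN where the correlation is undefined (problems 2 and 3 above).
class PearsonSketch {
    static double pearson(double[] a, double[] b) { // co-rated pairs only
        int n = a.length;
        if (n < 2) return Double.NaN;              // only one overlapping item
        double ma = 0, mb = 0;
        for (int i = 0; i < n; i++) { ma += a[i]; mb += b[i]; }
        ma /= n; mb /= n;
        double num = 0, da = 0, db = 0;
        for (int i = 0; i < n; i++) {
            num += (a[i] - ma) * (b[i] - mb);
            da  += (a[i] - ma) * (a[i] - ma);
            db  += (b[i] - mb) * (b[i] - mb);
        }
        if (da == 0 || db == 0) return Double.NaN; // identical preference values
        return num / Math.sqrt(da * db);
    }
}
```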
Euclidean Distance Similarity Similarity = 1 / ( 1 + d ) 18
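A minimal sketch of the formula above, computed over the two users' co-rated items (illustrative only, not Mahout's `EuclideanDistanceSimilarity`):

```java
// similarity = 1 / (1 + d), where d is the Euclidean distance between the
// two users' preference vectors over their co-rated items.
class EuclideanSketch {
    static double similarity(double[] a, double[] b) {
        double d2 = 0;
        for (int i = 0; i < a.length; i++) d2 += (a[i] - b[i]) * (a[i] - b[i]);
        return 1.0 / (1.0 + Math.sqrt(d2));
    }
}
```

Identical preferences give distance 0 and thus similarity 1; larger distances shrink the similarity toward 0.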
Cosine Similarity Cosine similarity and Pearson similarity get the same results if data are normalized (mean == 0). 19
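The equivalence claim can be checked directly: mean-center each preference series, and the cosine of the centered vectors equals the Pearson correlation. A small illustrative sketch (names are invented):

```java
class CosineSketch {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / Math.sqrt(na * nb);
    }
    // Subtract the mean; cosine of centered vectors == Pearson correlation.
    static double[] center(double[] x) {
        double m = 0;
        for (double v : x) m += v;
        m /= x.length;
        double[] c = new double[x.length];
        for (int i = 0; i < x.length; i++) c[i] = x[i] - m;
        return c;
    }
}
```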
Spearman Correlation Similarity Computes the Pearson correlation on the relative ranks of the preference values; tied values receive averaged ranks. 20
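A self-contained sketch of the idea (not Mahout's `SpearmanCorrelationSimilarity`): convert each series to ranks, averaging the positions of tied values, then apply Pearson to the ranks.

```java
import java.util.Arrays;
import java.util.Comparator;

class SpearmanSketch {
    // 1-based average ranks; tied values share the mean of their positions.
    static double[] ranks(double[] x) {
        int n = x.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> x[i]));
        double[] r = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            while (j + 1 < n && x[idx[j + 1]] == x[idx[i]]) j++;
            double avg = (i + j) / 2.0 + 1.0;   // mean of positions i..j, 1-based
            for (int k = i; k <= j; k++) r[idx[k]] = avg;
            i = j + 1;
        }
        return r;
    }
    static double spearman(double[] a, double[] b) {
        return pearson(ranks(a), ranks(b));
    }
    static double pearson(double[] a, double[] b) {
        int n = a.length;
        double ma = 0, mb = 0;
        for (int i = 0; i < n; i++) { ma += a[i]; mb += b[i]; }
        ma /= n; mb /= n;
        double num = 0, da = 0, db = 0;
        for (int i = 0; i < n; i++) {
            num += (a[i] - ma) * (b[i] - mb);
            da += (a[i] - ma) * (a[i] - ma);
            db += (b[i] - mb) * (b[i] - mb);
        }
        return num / Math.sqrt(da * db);
    }
}
```

Any monotone relationship between the two series yields a Spearman value of 1, even when the raw values are far from linear.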
Caching User Similarity Spearman Correlation Similarity is time-consuming. Need to use caching ==> remembers user-user similarities that were previously computed. 21
Tanimoto (Jaccard) Coefficient Similarity Discards preference values. Tanimoto similarity is the same as Jaccard similarity, but Tanimoto distance is not the same as Jaccard distance. 22
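Since preference values are discarded, only the two users' item sets matter: intersection size over union size. A minimal illustrative sketch:

```java
import java.util.HashSet;
import java.util.Set;

// Tanimoto (Jaccard) coefficient on the sets of items two users expressed
// any preference for; the rating values themselves are ignored.
class TanimotoSketch {
    static double similarity(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return union == 0 ? 0.0 : (double) inter.size() / union;
    }
}
```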
Log-Likelihood Similarity Assesses how unlikely it is that the overlap between two users is due to chance alone. 23
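Mahout's measure is based on Dunning's log-likelihood ratio over the 2x2 contingency table of the two users' item sets (items both touched, items only one touched, items neither touched). The entropy-based formulation below is a sketch of that statistic, not Mahout's exact class:

```java
// Dunning-style log-likelihood ratio from a 2x2 contingency table.
class LogLikelihoodSketch {
    static double xLogX(long x) { return x == 0 ? 0.0 : x * Math.log(x); }
    static double entropy(long... counts) {
        long sum = 0; double s = 0;
        for (long c : counts) { s += xLogX(c); sum += c; }
        return xLogX(sum) - s;
    }
    // k11: items both users touched; k12/k21: items only one touched;
    // k22: items neither touched.
    static double llr(long k11, long k12, long k21, long k22) {
        double row = entropy(k11 + k12, k21 + k22);
        double col = entropy(k11 + k21, k12 + k22);
        double mat = entropy(k11, k12, k21, k22);
        return Math.max(0.0, 2.0 * (row + col - mat));
    }
}
```

Independent behavior (overlap exactly as expected by chance) yields a ratio near 0; strong co-occurrence yields a large positive value.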
Performance measurements Using GroupLens data (http://grouplens.org): 10 million rating MovieLens dataset. Evaluation scores (average absolute difference between estimated and actual preference; lower is better): Spearman: 0.8 Tanimoto: 0.82 Log-Likelihood: 0.73 Euclidean: 0.75 Pearson (weighted): 0.77 Pearson: 0.89 24
Performance measurements 10 nearest neighbors: 0.98 100 nearest neighbors: 0.89 500 nearest neighbors: 0.75 Evaluated with 95% of the data for training and 5% for testing. 25
Selecting the number of neighbors Based on number of neighbors Based on a fixed threshold, e.g., 0.7 or 0.5 26
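Both strategies reduce to a simple filter over the user-user similarity scores: keep the N most similar users, or keep everyone above a fixed threshold. A small illustrative sketch (plain Java, not Mahout's `NearestNUserNeighborhood` / `ThresholdUserNeighborhood`):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class NeighborhoodSketch {
    // Top-N neighbors by descending similarity.
    static List<Integer> topN(Map<Integer, Double> sims, int n) {
        return sims.entrySet().stream()
                .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
    // All neighbors at or above a fixed similarity threshold (e.g., 0.7 or 0.5).
    static List<Integer> aboveThreshold(Map<Integer, Double> sims, double t) {
        return sims.entrySet().stream()
                .filter(e -> e.getValue() >= t)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```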
Item-based recommendation 27
Item-based recommendation algorithm 28
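The item-based estimate mirrors the user-based one, but the weights are item-item similarities and the ratings averaged are the target user's own. A minimal illustrative sketch (not Mahout's `GenericItemBasedRecommender`):

```java
import java.util.Map;

// Estimate user u's preference for a target item as the item-similarity-
// weighted average of u's own ratings of the other items.
class ItemBasedSketch {
    // userRatings: item -> rating by u; simToItem: item -> sim(target, item)
    static double estimate(Map<Integer, Double> userRatings,
                           Map<Integer, Double> simToItem) {
        double num = 0, den = 0;
        for (Map.Entry<Integer, Double> e : userRatings.entrySet()) {
            Double s = simToItem.get(e.getKey());
            if (s == null) continue;   // no similarity known for this item
            num += s * e.getValue();
            den += Math.abs(s);
        }
        return den == 0 ? Double.NaN : num / den;
    }
}
```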
Code and Performance of Item-Based Recommendation performance 29
Slope-One Recommender 30
Slope-One Algorithm Difference values from the example Slope-One got a result of near 0.65 on the GroupLens data 31
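Slope-One stores, for each item pair, the average rating difference observed across users who rated both items; a prediction shifts the user's known ratings by those differences and averages the results. A small sketch (not Mahout's `SlopeOneRecommender`), assuming the average differences have already been computed:

```java
import java.util.Map;

class SlopeOneSketch {
    // diffToTarget.get(j) = average over co-raters of (rating(target) - rating(j)).
    // Predict the target item's rating for a user from the user's own ratings.
    static double predict(Map<Integer, Double> userRatings,
                          Map<Integer, Double> diffToTarget) {
        double sum = 0; int n = 0;
        for (Map.Entry<Integer, Double> e : userRatings.entrySet()) {
            Double d = diffToTarget.get(e.getKey());
            if (d == null) continue;   // no diff recorded for this item pair
            sum += e.getValue() + d;   // shift u's rating by the average diff
            n++;
        }
        return n == 0 ? Double.NaN : sum / n;
    }
}
```

For example, if other users rated the target item 0.5 higher than item 1 on average, a user who gave item 1 a 2.0 is predicted at 2.5.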
Other recommenders SVD recommender -- parameters: number of features, number of training steps, lambda (regularization factor). The SVD method got 0.69 on the GroupLens data 32
Linear Interpolation Item-based recommender This method got 0.76 on the GroupLens data 33
Cluster-based Recommendation 34
Other Recommenders not in Mahout Groups (SDM 06) A 3rd-party Knowledge Repository: 30K users and 20K documents. Studied the most active 697 users, who each had at least 20 downloads in a year. Results beyond Collaborative Filtering: (1) Collaborative + Content Filtering (53% improvement); (2) CBDR: Collaborative + Content Filtering + Graph Community Analytics (259% accuracy improvement over collaborative filtering) 35
Other Recommenders not in Mahout Info Flow (SIGIR 06) IF: Graphical Information Flow Model; TIF: Joint Topic Detection + Information Flow Model. [Figure: number of recommended users for CF + Similar People (SP), IF, and TIF, broken down by adopter category -- innovators, early adopters, early majority, late majority, laggards; CF + SP mostly reaches people with similar tastes.] Tests: 1 month, 586 new docs, 1,170 users. Compared to Collaborative Filtering (CF) + Similar People -- Precision: IF is 91% better, TIF is 108% better; Recall: IF is 87% better, TIF is 113% better. 36
Distributed Item-based Recommender 37
Distributed recommender get co-occurrence matrix Data: 38
Multiply the co-occurrence matrix by the user preference vector The highest-scoring new item is 103 (items 101, 104, 105, and 107 have already been purchased by user 3) 39
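The multiplication itself is straightforward; a tiny illustrative sketch with a made-up co-occurrence matrix (not the slide's data):

```java
// Score items by multiplying the item co-occurrence matrix with the user's
// (binary) preference vector; items the user already has would then be
// filtered out before recommending the top-scoring remainder.
class CooccurrenceSketch {
    static double[] scores(int[][] cooccurrence, double[] userPrefs) {
        int n = cooccurrence.length;
        double[] r = new double[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                r[i] += cooccurrence[i][j] * userPrefs[j];
        return r;
    }
}
```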
Translating to MapReduce: generating user vectors 40
Translating to MapReduce: calculating co-occurrence 41
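An in-memory stand-in for this MapReduce pass (illustrative only): for each user's item vector, emit every ordered pair of distinct items and sum the counts, which is what the mapper/reducer pair computes at scale.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CooccurrenceCountSketch {
    // "Map": for each user's item vector, emit every ordered pair of distinct
    // items with count 1. "Reduce": sum the counts per pair.
    static Map<List<Integer>, Integer> count(Collection<List<Integer>> userItemVectors) {
        Map<List<Integer>, Integer> counts = new HashMap<>();
        for (List<Integer> items : userItemVectors)
            for (int a : items)
                for (int b : items)
                    if (a != b)
                        counts.merge(List.of(a, b), 1, Integer::sum);
        return counts;
    }
}
```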
Translating to MapReduce: matrix multiplication 42
Translating to MapReduce: partial products 43
Translating to MapReduce: partial product II 44
Running Recommender on MapReduce and HDFS 45
Questions? 46