E6893 Big Data Analytics Lecture 5: Big Data Analytics Algorithms -- II


1 E6893 Big Data Analytics Lecture 5: Big Data Analytics Algorithms -- II. Ching-Yung Lin, Ph.D., Adjunct Professor, Dept. of Electrical Engineering and Computer Science; Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center. October 2nd, 2014


2 Course Structure
Class Date   Number   Topics Covered
09/04/14     1        Introduction to Big Data Analytics
09/11/14     2        Big Data Analytics Platforms
09/18/14     3        Big Data Storage and Processing
09/25/14     4        Big Data Analytics Algorithms -- I
10/02/14     5        Big Data Analytics Algorithms -- II (recommendation)
10/09/14     6        Big Data Analytics Algorithms III (clustering)
10/16/14     7        Big Data Analytics Algorithms IV (classification)
10/23/14     8        Linked Big Data Graph Computing
10/30/14     9        Big Data Visualization
11/06/14     10       Mobile Data Collection, Analysis, and Interface
11/13/14     11       Hardware, Processors, and Cluster Platforms
11/20/14     12       Big Data Next Challenges IoT, Cognition, and Beyond
11/27/14              Thanksgiving Holiday
12/04/14     13       Final Projects Discussion (Optional)
12/11/14 & 12/12/     Two-Day Big Data Analytics Workshop Final Project Presentations


3 Review Key Components of Mahout

4 Mahout reference book

5 Setting Up Mahout
Step 1: Java JVM and IDEs (e.g., Eclipse; Eclipse Luna, June 2014)
Step 2: Maven
Step 3: Mahout

6 Recommender Inputs. Input data: User, Item, Rating. In the figure, solid lines mean positively related and dashed lines mean negatively related.
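The User, Item, Rating input can be pictured as one triple per line. The following is a minimal sketch in Python (used here for brevity; the lecture's own examples use Mahout in Java), assuming a comma-separated layout. The sample values and the `load_preferences` helper are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in "userID,itemID,rating" form; values are illustrative only.
SAMPLE = """1,101,5.0
1,102,3.0
2,101,2.0
2,103,4.5
"""

def load_preferences(text):
    """Parse CSV text into {user_id: {item_id: rating}}."""
    prefs = defaultdict(dict)
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        user, item, rating = int(row[0]), int(row[1]), float(row[2])
        prefs[user][item] = rating
    return dict(prefs)

prefs = load_preferences(SAMPLE)
print(prefs)
```

The nested-dictionary shape makes it cheap to look up all ratings of one user, which is the access pattern user-based recommenders need.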

7 User-based Recommendation Scenario I

8 User-based Recommendation Scenario II

9 User-based Recommendation Scenario III

10 User-based Recommendation Algorithms
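The slide's algorithm figure did not survive extraction, so the following is only a sketch of one common user-based scheme, not Mahout's implementation: find the users most similar to the target user, then score each unseen item by a similarity-weighted average of the neighbors' ratings. The choice of Pearson correlation, the neighborhood size, and the toy data are all assumptions.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation over the items both users rated; None if undefined."""
    common = sorted(set(a) & set(b))
    n = len(common)
    if n < 2:
        return None
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def recommend(prefs, user, n_neighbors=2, how_many=3):
    """Return [(item, estimated_rating)] for items the user has not rated."""
    sims = []
    for other in prefs:
        if other == user:
            continue
        s = pearson(prefs[user], prefs[other])
        if s is not None:
            sims.append((s, other))
    neighbors = sorted(sims, reverse=True)[:n_neighbors]
    totals, weights = {}, {}
    for s, other in neighbors:
        if s <= 0:
            continue  # ignore dissimilar users in the weighted average
        for item, rating in prefs[other].items():
            if item in prefs[user]:
                continue
            totals[item] = totals.get(item, 0.0) + s * rating
            weights[item] = weights.get(item, 0.0) + s
    scored = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [(item, score) for score, item in scored[:how_many]]

# Hypothetical toy data: B rates like A, C rates the opposite way.
PREFS = {
    "A": {1: 5.0, 2: 3.0, 3: 2.5},
    "B": {1: 4.5, 2: 3.0, 3: 2.0, 4: 4.0},
    "C": {1: 2.0, 2: 4.0, 3: 5.0, 5: 4.5},
}
result = recommend(PREFS, "A")
print(result)
```

Only B contributes here, because C's correlation with A is negative, so item 4 is recommended with B's rating.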

11 Example Recommender Code via Mahout

12 Process and output of the example. Recommendation for Person 1: Item 104 > Item 106. Item 107 is not favored.

13 Refresh (Reload) Data

14 Update data

15 User Similarity Measurements: Pearson Correlation Similarity; Euclidean Distance Similarity; Cosine Measure Similarity; Spearman Correlation Similarity; Tanimoto Coefficient Similarity (Jaccard coefficient); Log-Likelihood Similarity


16 Pearson Correlation Similarity. Data: missing data

17 On Pearson Similarity. Three problems with the Pearson similarity: 1. It does not take into account the number of items on which two users' preferences overlap (e.g., two overlapping items can yield a correlation of 1, while a larger overlap may yield a lower value). 2. If two users overlap on only one item, no correlation can be computed. 3. The correlation is undefined if either user's series of preference values is constant (all values identical). Adding Weighting.WEIGHTED as the 2nd parameter of the constructor can cause the resulting correlation to be pushed towards 1.0 or -1.0, depending on how many points are used.
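The three problems can be reproduced with a few lines of Python. This is a plain textbook Pearson correlation restricted to co-rated items, not Mahout's PearsonCorrelationSimilarity class; the function name and the sample ratings are hypothetical.

```python
from math import sqrt

def pearson_similarity(a, b):
    """Pearson correlation over co-rated items; returns None when undefined."""
    common = sorted(set(a) & set(b))
    n = len(common)
    if n < 2:
        return None  # problem 2: a single overlapping item gives no correlation
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # problem 3: a constant series has zero variance
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

# Problem 1: only two overlapping items, yet the correlation is a "perfect" 1.0.
two_items = pearson_similarity({101: 5.0, 102: 3.0}, {101: 4.0, 102: 1.0})
# Problem 2: one overlapping item.
one_item = pearson_similarity({101: 5.0, 102: 3.0}, {101: 4.0, 103: 1.0})
# Problem 3: the second user gave every item the same rating.
constant = pearson_similarity({101: 5.0, 102: 3.0}, {101: 3.0, 102: 3.0})
print(two_items, one_item, constant)
```

Returning `None` for the undefined cases is a design choice of this sketch; a caller has to decide whether to skip such user pairs or treat them as dissimilar.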


18 Euclidean Distance Similarity. Similarity = 1 / (1 + d), where d is the Euclidean distance between the two users' preference values.
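A short Python sketch of the formula, assuming the distance d is taken over the items both users have rated (the slide does not say how missing ratings are handled, so that restriction is an assumption):

```python
from math import sqrt

def euclidean_similarity(a, b):
    """Return 1 / (1 + d), with d the Euclidean distance over co-rated items."""
    common = set(a) & set(b)
    if not common:
        return None  # no shared items, so no distance can be computed
    d = sqrt(sum((a[i] - b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + d)

# Hypothetical ratings: the differences are 3 and 4, so d = 5.
sim = euclidean_similarity({1: 1.0, 2: 4.0}, {1: 4.0, 2: 8.0})
same = euclidean_similarity({1: 2.0, 2: 3.0}, {1: 2.0, 2: 3.0})
print(sim, same)
```

The mapping turns a distance of 0 into a similarity of 1 and lets the similarity approach 0 as the distance grows.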

19 Cosine Similarity. Cosine similarity and Pearson similarity give the same result if the data are centered (mean == 0).
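The equivalence is easy to check numerically. In this Python sketch the two rating vectors are hypothetical; centering each vector on its own mean and then taking the cosine reproduces the Pearson value, while the cosine of the raw vectors does not.

```python
from math import sqrt

def cosine(xs, ys):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(xs, ys))
    return dot / (sqrt(sum(x * x for x in xs)) * sqrt(sum(y * y for y in ys)))

def center(values):
    """Subtract the mean so the vector has mean 0."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def pearson(xs, ys):
    """Textbook Pearson correlation of two equal-length vectors."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [5.0, 3.0, 2.5]
ys = [4.5, 3.0, 2.0]
raw_cosine = cosine(xs, ys)
centered_cosine = cosine(center(xs), center(ys))
pearson_value = pearson(xs, ys)
print(raw_cosine, centered_cosine, pearson_value)
```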

20 Spearman Correlation Similarity. The Pearson value computed on the relative ranks. Example for ties.
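A sketch of "Pearson on the relative ranks" in Python. Tied values receive the average of the ranks they span, which is a common convention and an assumption here; the slide's own tie example is not recoverable, and Mahout may treat ties differently.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # positions i..j are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks of the values."""
    return pearson(average_ranks(xs), average_ranks(ys))

tied = average_ranks([3.0, 3.0, 5.0])
monotone = spearman([1.0, 2.0, 3.0], [10.0, 100.0, 1000.0])
print(tied, monotone)
```

Because only the ordering matters, any monotonically increasing relation between the two users' ratings yields a correlation of 1.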

21 Caching User Similarity. Spearman Correlation Similarity is time-consuming to compute, so use caching: remember the user-user similarities that were previously computed.
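The caching idea can be sketched as a small wrapper that stores each user-user similarity the first time it is computed. The `CachingSimilarity` class below is hypothetical and only illustrates the idea; it is not Mahout's caching class, and the stand-in similarity function is chosen purely to keep the example short.

```python
class CachingSimilarity:
    """Wrap a similarity function and remember results per unordered user pair."""

    def __init__(self, similarity_fn):
        self._fn = similarity_fn
        self._cache = {}
        self.misses = 0  # number of times the wrapped function actually ran

    def __call__(self, prefs, u, v):
        key = (u, v) if u <= v else (v, u)  # similarity is symmetric
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._fn(prefs[u], prefs[v])
        return self._cache[key]

    def clear(self):
        """Drop cached values, e.g. after the underlying data is reloaded."""
        self._cache.clear()

def overlap_ratio(a, b):
    """Cheap stand-in similarity: shared items divided by all items."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

PREFS = {"A": {1: 5.0, 2: 3.0}, "B": {2: 4.0, 3: 1.0}}
cached = CachingSimilarity(overlap_ratio)
first = cached(PREFS, "A", "B")
second = cached(PREFS, "B", "A")  # served from the cache
print(first, second, cached.misses)
```

Cached values become stale when preferences change, which is why the sketch exposes `clear()`.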

22 Tanimoto (Jaccard) Coefficient Similarity. Discards preference values. Tanimoto similarity is the same as Jaccard similarity, but Tanimoto distance is not the same as Jaccard distance.
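Since preference values are discarded, only the sets of items matter. A minimal Python sketch with hypothetical item sets:

```python
def tanimoto_similarity(items_a, items_b):
    """Size of the intersection divided by size of the union of two item sets."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return None  # neither user has expressed any preference
    return len(a & b) / len(union)

# Ratings are ignored: only the item IDs of each user are passed in.
sim = tanimoto_similarity({1: 5.0, 2: 1.0, 3: 4.0}, {2: 2.0, 3: 2.0, 4: 5.0})
print(sim)
```

Iterating over a dictionary yields its keys, so rating dictionaries can be passed directly and their values are dropped.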

23 Log-Likelihood Similarity. Assesses how unlikely it is that the overlap between the two users is just due to chance.
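One way to quantify "unlikely to be due to chance" is the log-likelihood ratio of a 2x2 contingency table: items both users have, items only one of them has, and items neither has. The Python sketch below uses the entropy form of that statistic. The final mapping of the ratio into [0, 1) via 1 - 1 / (1 + LLR) is an assumption of this sketch, as are the function names.

```python
from math import log

def x_log_x(x):
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    """Unnormalized entropy: N * H for counts summing to N."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """k11: both users, k12: only A, k21: only B, k22: neither user."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # guard against rounding below 0

def log_likelihood_similarity(items_a, items_b, n_items):
    a, b = set(items_a), set(items_b)
    k11, k12, k21 = len(a & b), len(a - b), len(b - a)
    k22 = n_items - k11 - k12 - k21
    llr = log_likelihood_ratio(k11, k12, k21, k22)
    return 1.0 - 1.0 / (1.0 + llr)  # assumed mapping into [0, 1)

independent = log_likelihood_ratio(10, 10, 10, 10)  # overlap as expected by chance
associated = log_likelihood_ratio(10, 0, 0, 10)     # overlap far beyond chance
sim = log_likelihood_similarity({1, 2, 3}, {1, 2, 3}, n_items=20)
print(independent, associated, sim)
```

Unlike Tanimoto, the statistic depends on the total number of items: the same overlap is less surprising in a small catalog than in a large one.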

24 Performance measurements. Using GroupLens data (10-million-rating MovieLens dataset). Spearman: 0.8; Tanimoto: 0.82; Log-Likelihood: 0.73; Euclidean: 0.75; Pearson (weighted): 0.77; Pearson:

25 Performance measurements 10 nearest neighbors: nearest neighbors: nearest neighbors: % of training; 5% of testing 25

26 Selecting the number of neighbors Based on number of neighbors Based on a fixed threshold, e.g., 0.7 or
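The two ways of choosing a neighborhood can be sketched in Python as follows. The function names and similarity scores are hypothetical; the 0.7 threshold is the one mentioned on the slide.

```python
def nearest_n(similarities, n):
    """Keep the n most similar users, however similar they are."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]

def above_threshold(similarities, threshold):
    """Keep every user whose similarity reaches the threshold."""
    return sorted(u for u, s in similarities.items() if s >= threshold)

# Hypothetical similarities of other users to the target user.
sims = {"B": 0.95, "C": 0.72, "D": 0.40, "E": -0.30}
by_count = nearest_n(sims, 3)
by_threshold = above_threshold(sims, 0.7)
print(by_count, by_threshold)
```

A fixed count always yields neighbors but may admit weakly similar users (D above), while a threshold guarantees quality but may leave some users with no neighbors at all.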

27 Item-based recommendation

28 Item-based recommendation algorithm
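The algorithm figure is missing from this transcript, so the following Python sketch shows one common item-based scheme rather than the slide's exact pseudocode: estimate a user's rating for a target item as the average of the user's existing ratings, weighted by how similar each rated item is to the target. Using the 1 / (1 + d) Euclidean similarity between items is an assumption, and the data is hypothetical.

```python
from math import sqrt

def item_vectors(prefs):
    """Transpose {user: {item: rating}} into {item: {user: rating}}."""
    items = {}
    for user, ratings in prefs.items():
        for item, rating in ratings.items():
            items.setdefault(item, {})[user] = rating
    return items

def euclidean_similarity(a, b):
    common = set(a) & set(b)
    if not common:
        return None
    d = sqrt(sum((a[k] - b[k]) ** 2 for k in common))
    return 1.0 / (1.0 + d)

def estimate(prefs, user, target):
    """Similarity-weighted average of the user's ratings of other items."""
    items = item_vectors(prefs)
    num = den = 0.0
    for item, rating in prefs[user].items():
        s = euclidean_similarity(items[target], items[item])
        if s is None:
            continue  # no user rated both items
        num += s * rating
        den += s
    return num / den if den else None

PREFS = {
    "u1": {"A": 5.0, "B": 3.0},
    "u2": {"A": 1.0, "B": 2.0, "C": 1.0},
    "u3": {"A": 4.0, "C": 4.0},
}
estimated = estimate(PREFS, "u1", "C")
print(estimated)
```

Item C is rated identically to A by the users who rated both, so u1's rating of A dominates the estimate.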

29 Code and Performance of Item-Based Recommendation

30 Slope-One Recommender

31 Slope-One Algorithm. Difference values from the example. Slope-One scored close to 0.65 on the GroupLens data.
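The role of the difference values can be illustrated with a sketch of the weighted Slope-One scheme in Python: compute the average rating difference between the target item and every other item, then add that difference to the user's own ratings and average, weighting by how many users co-rated each pair. The slide's own example values are not recoverable, so the three-user data set below is hypothetical.

```python
from collections import defaultdict

def slope_one_predict(prefs, user, target):
    """Weighted Slope-One estimate of the user's rating for the target item."""
    diff_sum = defaultdict(float)  # sum of (target rating - item rating)
    count = defaultdict(int)       # number of users who rated both items
    for ratings in prefs.values():
        if target not in ratings:
            continue
        for item, r in ratings.items():
            if item != target:
                diff_sum[item] += ratings[target] - r
                count[item] += 1
    num = den = 0.0
    for item, r in prefs[user].items():
        if count[item]:
            avg_diff = diff_sum[item] / count[item]
            num += (avg_diff + r) * count[item]
            den += count[item]
    return num / den if den else None

PREFS = {
    "John": {"A": 5.0, "B": 3.0, "C": 2.0},
    "Mark": {"A": 3.0, "B": 4.0},
    "Lucy": {"B": 2.0, "C": 5.0},
}
# A is on average 0.5 above B (two users) and 3.0 above C (one user), so
# Lucy's estimate is ((0.5 + 2) * 2 + (3 + 5) * 1) / 3.
lucy_a = slope_one_predict(PREFS, "Lucy", "A")
print(lucy_a)
```

The method is called Slope-One because each item pair is modeled as a line with slope 1: only the offset between the two items is learned.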

32 Other recommenders: SVD recommender. Parameters: number of features; number of training steps; lambda (factor for regularization). The SVD method scored 0.69 on the GroupLens data.
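To show how the three parameters interact, here is a sketch of a regularized matrix factorization trained by stochastic gradient descent. This is a swapped-in technique for illustration, not the factorizer Mahout's SVD recommender uses. The learning rate, the seed, and the ratings are assumptions.

```python
import random
from math import sqrt

def factorize(ratings, n_features=2, n_steps=200, lam=0.02, lr=0.02, seed=0):
    """Learn user and item feature vectors from (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    P = {u: [rng.uniform(0, 0.1) for _ in range(n_features)] for u in users}
    Q = {i: [rng.uniform(0, 0.1) for _ in range(n_features)] for i in items}
    for _ in range(n_steps):
        for u, i, r in ratings:
            err = r - sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            for f in range(n_features):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step; lam shrinks the factors to limit overfitting.
                P[u][f] += lr * (err * qi - lam * pu)
                Q[i][f] += lr * (err * pu - lam * qi)
    return P, Q

def rmse(ratings, P, Q):
    """Root-mean-square error of the reconstructed ratings."""
    se = sum((r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2
             for u, i, r in ratings)
    return sqrt(se / len(ratings))

# Hypothetical (user, item, rating) triples.
RATINGS = [(1, 101, 5.0), (1, 102, 3.0), (2, 101, 4.0), (2, 103, 1.0),
           (3, 102, 2.0), (3, 103, 5.0), (4, 101, 5.0), (4, 103, 2.0)]

P0, Q0 = factorize(RATINGS, n_steps=0)   # untrained starting point
P, Q = factorize(RATINGS, n_steps=200)
initial_rmse = rmse(RATINGS, P0, Q0)
final_rmse = rmse(RATINGS, P, Q)
print(initial_rmse, final_rmse)
```

More features and more training steps fit the training ratings more closely, while a larger lambda pulls the factors toward zero and trades training fit for generalization.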

33 Linear Interpolation Item-based recommender. SVD method got 0.76 on the GroupLens data

34 Cluster-based Recommendation

35 Other Recommenders not in Mahout: Groups (SDM 06). A third-party knowledge repository: 30K users and 20K documents. The study covers the 697 most active users, each with at least 20 downloads in a year. Results beyond collaborative filtering: (1) Collaborative + Content Filtering (53% improvement); (2) CBDR: Collaborative + Content Filtering + Graph Community Analytics (259% accuracy improvement over collaborative filtering).

36 Other Recommenders not in Mahout: Info Flow (SIGIR 06). Methods compared: CF + SP (Collaborative Filtering + Similar People), IF (Graphical Information Flow Model), and TIF (Joint Topic Detection + Information Flow Model). [Figure: network information flow, plotted against the number of recommended users; adopter categories: innovators, early adopters, early majority, late majority, laggards; people with similar tastes.] Tests: 1 month, 586 new docs, 1,170 users. Compared to CF + SP: Precision: IF is 91% better, TIF is 108% better. Recall: IF is 87% better, TIF is 113% better.

37 Distributed Item-based Recommender

38 Distributed recommender: get the co-occurrence matrix. Data:

39 Multiply the co-occurrence matrix with the user preference vector. The highest score is for item 103 (items 101, 104, 105, and 107 have already been purchased by user 3).
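The multiplication can be sketched on a single machine in Python. The data below is hypothetical and much smaller than the slide's example, so the resulting scores differ from the slide. The diagonal of the co-occurrence matrix is left out because it only affects items the user already has.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_matrix(prefs):
    """counts[a][b] = number of users who have a preference for both a and b."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend(prefs, user):
    """Multiply the co-occurrence matrix by the user's preference vector."""
    counts = cooccurrence_matrix(prefs)
    scores = defaultdict(float)
    for seen, rating in prefs[user].items():
        for other, count in counts[seen].items():
            if other not in prefs[user]:  # skip items the user already has
                scores[other] += count * rating
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

PREFS = {
    1: {101: 5.0, 102: 3.0},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 2.5, 104: 4.0},
}
ranked = recommend(PREFS, 3)
print(ranked)
```

Item 102 wins for user 3 because it co-occurs twice with item 101 and once with item 104, both of which user 3 already has.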

40 Translating to MapReduce: generating user vectors

41 Translating to MapReduce: calculating co-occurrence
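The first two jobs can be imitated in memory with plain Python. This is a toy stand-in for the MapReduce model, not Hadoop or Mahout code: `run_mapreduce`, the mapper and reducer names, and the input lines are all hypothetical. Job 1 groups preferences by user into user vectors; job 2 emits every item pair within a user vector and counts pairs per item.

```python
from collections import defaultdict
from itertools import permutations

def run_mapreduce(records, mapper, reducer):
    """Map each record to (key, value) pairs, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Job 1: "user,item,pref" lines -> (user, {item: pref})
def to_user_item(line):
    user, item, pref = line.split(",")
    yield int(user), (int(item), float(pref))

def to_user_vector(user, item_prefs):
    return user, dict(item_prefs)

# Job 2: user vectors -> (item, {other_item: co-occurrence count})
def to_item_pairs(user_vector):
    _, vector = user_vector
    for a, b in permutations(sorted(vector), 2):
        yield a, b

def to_cooccurrence_row(item, others):
    row = defaultdict(int)
    for other in others:
        row[other] += 1
    return item, dict(row)

LINES = ["1,101,5.0", "1,102,3.0", "2,101,2.0", "2,102,2.5",
         "2,103,5.0", "3,101,2.5", "3,103,4.0"]
user_vectors = run_mapreduce(LINES, to_user_item, to_user_vector)
cooccurrence = dict(run_mapreduce(user_vectors, to_item_pairs, to_cooccurrence_row))
print(user_vectors)
print(cooccurrence)
```

Each mapper looks at one record in isolation and each reducer at one key in isolation, which is what lets a real cluster spread the work across machines.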

42 Translating to MapReduce: matrix multiplication

43 Translating to MapReduce: partial products

44 Translating to MapReduce: partial product II
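The partial-product idea can be sketched in Python as follows: instead of computing one dot product per matrix row, each (user, item) preference emits that item's co-occurrence column scaled by the preference value, and the reducer sums the scaled columns per user. The co-occurrence counts and user vectors below are hypothetical inputs, and the function names are illustrative rather than taken from Mahout.

```python
from collections import defaultdict

# Hypothetical inputs: co-occurrence columns and user preference vectors.
COOCCURRENCE = {
    101: {102: 2, 103: 2},
    102: {101: 2, 103: 1},
    103: {101: 2, 102: 1},
}
USER_VECTORS = {
    1: {101: 5.0, 102: 3.0},
    3: {101: 2.5, 103: 4.0},
}

def map_partial_products(cooccurrence, user_vectors):
    """For every (user, item) preference, emit pref * (item's column)."""
    for user, vector in user_vectors.items():
        for item, pref in vector.items():
            column = cooccurrence.get(item, {})
            yield user, {other: pref * count for other, count in column.items()}

def reduce_sum(pairs):
    """Sum the partial product vectors of each user."""
    totals = defaultdict(lambda: defaultdict(float))
    for user, partial in pairs:
        for item, value in partial.items():
            totals[user][item] += value
    return {user: dict(vector) for user, vector in totals.items()}

scores = reduce_sum(map_partial_products(COOCCURRENCE, USER_VECTORS))
# Keep only items the user has no preference for yet.
unseen = {u: {i: s for i, s in v.items() if i not in USER_VECTORS[u]}
          for u, v in scores.items()}
print(unseen)
```

Summing scaled columns gives the same result as the row-by-row product, but every partial product depends on a single column, so no task needs the whole matrix.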

45 Running Recommender on MapReduce and HDFS

46 Questions?
