Hybrid model rating prediction with Linked Open Data for Recommender Systems

Similar documents
Collaborative Filtering. Radek Pelánek

A Logistic Regression Approach to Ad Click Prediction

IPTV Recommender Systems. Paolo Cremonesi

Machine Learning using MapReduce

Advanced Ensemble Strategies for Polynomial Models

Factorization Machines

Big Data Analytics. Lucas Rego Drumond

Using Data Mining for Mobile Communication Clustering and Characterization

QASM: a Q&A Social Media System Based on Social Semantics

The Need for Training in Big Data: Experiences and Case Studies

Addressing Cold Start in Recommender Systems: A Semi-supervised Co-training Algorithm

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Social Media Mining. Data Mining Essentials

Cross-Domain Collaborative Recommendation in a Cold-Start Context: The Impact of User Profile Size on the Quality of Recommendation

How To Cluster On A Search Engine

Bayesian Factorization Machines

How To Make A Credit Risk Model For A Bank Account

Logistic Regression for Spam Filtering

Standardization and Its Effects on K-Means Clustering Algorithm

Content-Based Recommendation

How To Identify A Churner

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

Scalable Machine Learning - or what to do with all that Big Data infrastructure

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

User Behavior Analysis Based On Predictive Recommendation System for E-Learning Portal

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

A Web Recommender System for Recommending, Predicting and Personalizing Music Playlists

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Azure Machine Learning, SQL Data Mining and R

Utility of Distrust in Online Recommender Systems

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Rating Prediction with Informative Ensemble of Multi-Resolution Dynamic Models

An Introduction to Data Mining

Chapter 12 Discovering New Knowledge Data Mining

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

Steven C.H. Hoi. School of Computer Engineering Nanyang Technological University Singapore

Java Modules for Time Series Analysis

Factorization Machines

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL

Predict Influencers in the Social Network

Recommendation Tool Using Collaborative Filtering

Designing a learning system

Use of Data Mining Techniques to Improve the Effectiveness of Sales and Marketing

IJCSES Vol.7 No.4 October 2013 pp Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS

Advances in Collaborative Filtering

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

K-means Clustering Technique on Search Engine Dataset using Data Mining Tool

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Clustering Technique in Data Mining for Text Documents

Big & Personal: data and models behind Netflix recommendations

arxiv: v1 [cs.ir] 12 Jun 2015

How I won the Chess Ratings: Elo vs the rest of the world Competition

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Prediction Model for Crude Oil Price Using Artificial Neural Networks

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Spark: Cluster Computing with Working Sets

Web Document Clustering

Linear Threshold Units

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems

Lecture #2. Algorithms for Big Data

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS

Qi Liu Rutgers Business School ISACA New York 2013

Ensembles and PMML in KNIME

Bisecting K-Means for Clustering Web Log data

Spam Detection Using Customized SimHash Function

Query Recommendation employing Query Logs in Search Optimization

Recommending News Articles using Cosine Similarity Function Rajendra LVN 1, Qing Wang 2 and John Dilip Raj 1

Scalable Hands-Free Transfer Learning for Online Advertising

Chapter 6. The stacking ensemble approach

Server Load Prediction

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Environmental Remote Sensing GEOG 2021

Predict the Popularity of YouTube Videos Using Early View Data

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

A QoS-Aware Web Service Selection Based on Clustering

Distributed forests for MapReduce-based machine learning

Mining the Web of Linked Data with RapidMiner

Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report

Lecture 2: The SVM classifier

Manjeet Kaur Bhullar, Kiranbir Kaur Department of CSE, GNDU, Amritsar, Punjab, India

A Social Network-Based Recommender System (SNRS)

Parallel & Distributed Optimization. Based on Mark Schmidt s slides

Automated Process for Generating Digitised Maps through GPS Data Compression

Transcription:

Hybrid model rating prediction with Linked Open Data for Recommender Systems Andrés Moreno 12 Christian Ariza-Porras 1, Paula Lago 1, Claudia Jiménez-Guarín 1, Harold Castro 1, and Michel Riveill 2 1 School of Engineering, Universidad de los Andes, Bogotá, Colombia 2 Univ. Nice Sophia Antipolis, CNRS, I3S, UMR 7271, 06900 Sophia Antipolis, France {dar-more, cf.ariza975, pa.lago52,cjimenez, hcastro}@uniandes.edu.co, riveill@unice.fr Abstract. We detail the solution of team uniandes1 to the ESWC 2014 Linked Open Data-enabled Recommender Systems Challenge Task 1 (rating prediction on a cold start situation). In these situations, there are few ratings per item and user and thus collaborative filtering techniques may not be suitable. In order to be able to use a content-based solution, linked-open data from DBPedia was used to obtain a set of descriptive features for each item. We compare the performance (measured as RMSE) of three models on this cold-start situation: contentbased (using min-count sketches), collaborative filtering (SVD++) and rule-based switched hybrid models. Experimental results show that the hybrid system outperforms each of the models that compose it. Since features taken from DBPedia were sparse, we clustered items in order to reduce the dimensionality of the item and user profiles. Keywords: semantic web, recommender systems 1 Introduction Recommender systems (RS) are automatic agents that attempt to suggest new or interesting items to users. A number of different algorithms have been proposed to improve the performance of recommender systems, which can be classified in two groups: collaborative-filtering techniques and content-based filtering techniques. Collaborative-filtering techniques (CF) are based on the fact that similar users like similar items and thus base their predictions in the ratings provided by similar users. Content-based techniques (CB) build a user profile of interests based on the features of the items the user has rated. On cold-start situations, when items have few ratings, neither system can perform well. This is because they don t have the amount of data needed to find either true similarities among users (CF) or to construct the user profiles (CB). In these circumstances, more data is needed, either to describe the items or the users. Thanks to linked open data initiatives, information about items can be found on the web. Task 1 of the linked open data enabled recommender systems challenge purpose was to predict the rating a user would give to an item in a

cold-start situation. In order to be able to use a CB solution, linked-open data is used to obtain features that describe items in machine-readable format. The paper is organized as follows: we describe the provided dataset, the performance metric used to evaluate the predictions, give an overview of the proposed solution and discuss the obtained results. Dataset description The DBbook dataset contains 75559 ratings of 6166 books by 6181 users. The possible ratings that a user can assign to an item are O = {0, 1, 2, 3, 4, 5}. The ratings file has 3 fields: a user id, an item id, and the rating. Each item has been rated by at least one user, but the evaluation set includes some books not rated in the training set, representing a cold-start situation. The dataset also provides a mapping of each item id to a DBPedia URI which gives access to a semantic description of items. Given this description, we can define each book with a set of concepts C i taken from DBPedia. We use the following concepts to describe a book: author, categories, literary genres, and subject. Figure 1 depicts the feature extraction process. The feature space size is 14001 concepts. Each book has an average of 16.49 features with standard deviation (std) of 6.18. Each feature appears in an average of 9.62 books with std of 118, and a max of 4030. Fig. 1: Semantic Features Extraction 2 Prediction Model Burke [1] describes different ways in which recommender system models can be combined. The switched strategy maintains different models in parallel and reports to the user the prediction of the model with higher confidence. We use as base model a widely known CF algorithm (SVD++) (Section 2.2). However, since traditional CF systems usually make incorrect predictions when no previous

ratings about the item are known, the prediction of the switched hybrid model on cold-start situations is delegated to the CB model explained in Section 2.1. The measure used to evaluate the predictive performance of the system is the Root Mean Square Error (RMSE). Let T be the rating set of a hold-out set (test set), T ui, the rating that the user u gave to item i and rˆ u i the model prediction, 1 the RMSE is defined as RMSE := T T (ˆr ui T ui T ui ) 2. In the remainder of this section, we will describe the models that take part in our system. 2.1 Content Based model On a CB model, a user u has a profile with a list of non duplicate concepts C u and a set of O vectors w o R Cu, o O. For each example of user-item interaction, each of the concepts that are related to the item (C i ) are considered for addition into the user s list C u. We use an inclusion policy using a sliding window min-count sketch structure [2] based on the work developed in [4]: All concepts seen by the user at least N times during the window duration of the sketch are present in the user s list, and the size of the vectors w o is updated. After modifying the list and the w o vectors length, the weights of the vector are adjusted using a stochastic logistic regression strategy. Let r ui O the rating user u gives to item i and m ui = meta(c i C u ) R Cu a function that takes the concept set of an item and converts it into a binary vector where each coordinate is 1 if the user s concept belong to the items list (m ui [f] = 1 Cu[f] C i ). For each vector w o O, we predict σ( w o, m ui ) and update each of the vectors as in wu o wu o γ(σ( w o, m ui ) 1 rui=o)m ui, where σ(c) is the sigmoid function. The rating prediction under this model is calculated as in ˆr ui = o O σ( wo,m ui ) o o O σ( wo,m ui ) Feature Generation and evaluation We use DBPedia to retrieve book features as described in section 1. Using all the retrieved features the predictor performance was lower than expected and, as shown in Figure 2, if we increase the minimum inclusion rate, the performance declines. A quick evaluation of the features shows that some of them are highly correlated, which led us to consider that clusters of features may provide more information to the predictor. We created clusters of features by co-occurrence, using k-means with cosine distance, convergence delta of 0.01 and 200 iterations. Figure 3 depicts the dataset generation process. We vary the number of clusters (k) and measure the performance against the test set. In Figure 4a, we can see that the predictor performance using clusters is better than using all the extracted features. Although with 23 clusters we have a slightly better result against the test set, we use the 50 clusters because this had better performance using the evaluation tool. With these 50 clusters as features, each book has an average of 2.1 features with a std of 0.57. Each feature appears on average in 344 books, with a std of 1332. When trained with these new features, the predictor improves its performance notably with a min inclusion rate of 2, as shown in Figure 4b. The best RMSE with the content-based predictor on the evaluation tool was 0.969.

Fig. 2: RMSE vs inclusion rate for book features Fig. 3: Content Based Dataset Generation (a) All features vs different cluster s size (b) With 50 clusters changing the minimum inclusion rate Fig. 4: Content Based predictor performance using clusters 2.2 Collaborative Filtering model The SVD++ algorithm [3] prediction rule uses the global average of ratings (µ) and the bias or deviation from the mean for each user (b u ) and each item (b i ) as model parameters. In order to account for the user-item interaction the SVD++ model represents each user as a vector x u and each item as an vector y i R k. Each item is represented by an extra vector z i R k that is used by the prediction rule to represent the items the user has rated into her profile. Let R(u) the set of items the user u has rated, ( the prediction under the SVD++ model is given by ˆr ui = µ + b i + b u + yi T x u + R (u) 1 2 z j ). When an j R(u) item has not been seen by the system, the prediction rule only uses the sum of the global mean and the user bias. Parameters of the model are learned using a regularized stochastic gradient descent strategy. 3 Model validation To test the performance of the hybrid model, we generated 5 datasets with approximately 80% for training and the 20% for testing, each of these datasets had a different percentage of cold-start ratings varying from 5% to 25%. The model delegates the prediction to the CB model only when it has not seen the item before. Fig. 5 shows the RMSE of the hybrid model as the number of new items in the test set increases. The results show that the hybrid model outperforms CF for a low number of cold-start items.

Fig. 5: RMSE of the hybrid model vs SVD++ on cold-start 4 Conclusions We have described our approach to improve the performance of recommender systems using linked open-data that is freely available on the web. Open data provides descriptions of items that help the recommender system understand better why a user likes an item (a user may like a book because of its author, its literary genre, its main subject, etc). This approach can help alleviate the new item cold-start problem. However, users may like items based on subjective features such as tone which are not provided in the open-data repositories used. For this reason, we proposed an hybrid model based on rules that uses a pure collaborative approach when enough ratings are present, and uses a contentbased approach in the other cases. Our model had a RMSE of 0.8787 against the quiz set provided by the challenge. Open-data such as data from social-networks can also be used to describe users and calculate similarities of new users based on this data. This could further improve the performance of recommender systems under a new-user cold start problem. References 1. Burke, R.: Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction 12(4), 331 370 (Nov 2002), http://dx.doi.org/10.1023/a:1021240730564 2. Dimitropoulos, X., Stoecklin, M., Hurley, P., Kind, A.: The eternal sunshine of the sketch data structure. Comput. Netw. 52(17), 3248 3257 (Dec 2008), http://dx.doi.org/10.1016/j.comnet.2008.08.014 3. Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 426 434. KDD 08, ACM, New York, NY, USA (2008), http://dx.doi.org/10.1145/1401890.1401944 4. McMahan, H.B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., Nie, L., Phillips, T., Davydov, E., Golovin, D., Chikkerur, S., Liu, D., Wattenberg, M., Hrafnkelsson, A.M., Boulos, T., Kubica, J.: Ad click prediction: A view from the trenches. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1222 1230. ACM, New York, NY, USA (2013)