Detection of Collusion Behaviors in Online Reputation Systems



Yuhong Liu, Yafei Yang, and Yan Lindsay Sun
University of Rhode Island, Kingston, RI. Email: {yuhong, yansun}@ele.uri.edu
Qualcomm Incorporated, San Diego, CA. Email: yafeiy@qualcomm.com

Abstract

Online reputation systems are gaining popularity. Dealing with collaborative unfair ratings in such systems has been recognized as an important but difficult problem. Current defense mechanisms focus on analyzing the rating values for individual products. In this paper, we propose a scheme that detects collaborative unfair raters based on the similarity in their rating behaviors. The proposed scheme integrates abnormal detection in both the rating-value domain and the user domain. To evaluate the proposed scheme in realistic scenarios, we design and launch a cyber competition (a Rating Challenge) to collect unfair rating data from real human users. The proposed system is evaluated through experiments using these real attack data. It can accurately detect collusion behaviors and therefore significantly reduces the impact from dishonest users.

I. Introduction

Online reputation systems, also known as online feedback-based rating systems, are creating large-scale virtual word-of-mouth networks [1] in which individuals share opinions and experiences by providing ratings to products, companies, digital content, and even other people. For example, Epinions.com encourages Internet users to rate practically any kind of business. Citysearch.com solicits and displays user ratings on restaurants, bars, and performances. YouTube.com recommends video clips based on viewers' ratings. As reputation systems gain increasing influence on consumers' purchasing decisions and on online digital content distribution, manipulation of such systems is rapidly growing. Firms post biased ratings and reviews to praise their own products or bad-mouth the products of their competitors. Political campaigns promote positive video clips and hide negative ones by inserting unfair ratings at YouTube.com. There is ample evidence that such manipulation takes place [2]. For example, some eBay users artificially boost their reputation by buying and selling feedback [3].

Current defense schemes use the majority rule, signal modeling, and/or trust management. Most schemes detect unfair ratings based on the majority rule: ratings that are far away from the majority's opinion are marked as unfair. For example, clustering techniques are used to separate fair ratings and unfair ratings into two clusters [2]. Other representative majority-rule based techniques include statistical filtering [4], an endorsement-based method [5], and an entropy-based method [6]. Recently, signal modeling techniques have been applied to detect collaborative unfair ratings [7]. In particular, statistical detectors are designed to identify the times at which important features of the rating values (e.g. mean, histogram, arrival rate, and AR model errors) change rapidly. These sudden-change time instances indicate the starting and ending times of collaborative attacks. There are also methods that calculate trust in raters. Based on the rating history of users, the system can evaluate how much each rater is trusted to provide honest ratings. The trust information is then used to differentiate ratings from good users and suspicious users [5], [8].
The above methods are effective in different scenarios, but they share one common property: they focus on examining the rating values for individual products. In other words, suspicious rating values are detected and treated independently; it does not matter whether the suspicious ratings come from one user, several uncorrelated users, or several highly correlated users. In practice, it is very common for unfair ratings to be provided by a group of user IDs whose behaviors are highly correlated.

In this paper, we propose to address the unfair rating problem by detecting abnormal signals in both the user domain and the rating domain. In particular, the proposed scheme contains three parts. First, abnormal detection is performed on the rating values for individual products. This yields a set of products that are suspected to be under attack. Second, we calculate the similarity among users who provide ratings for these suspicious products. Third, the similarity scores are used to detect colluding dishonest user IDs, which tend to be clustered together. The ratings from colluding dishonest users are removed. (A pipeline sketch of these three parts follows at the end of this section.)

To evaluate the proposed methods in the real world, we design and launch a Cyber Competition [9] to collect attack data from real human users, and the proposed method is tested against these attacks. The results show that the proposed system can accurately detect collusion behaviors and therefore significantly reduces the impact from the attackers.

The rest of the paper is organized as follows. Related work and attack models are discussed in Section II. The collusion detection system is described in Section III. The rating challenge and experimental results are presented in Section IV, followed by the conclusion in Section V.
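For concreteness, the three parts compose into the following pipeline. This is a sketch of our own, not code from the paper: the three callables are stubs standing in for the detectors described in Section III, and np.nan marks "user never rated this product".

```python
import numpy as np

def collusion_detection_pipeline(R, detect_suspicious_products,
                                 pairwise_distance, cluster_and_flag):
    """R: m-users x n-products rating matrix, np.nan where no rating exists.

    Step 1: per-product abnormal detection -> suspected products.
    Step 2: similarity among users who rated any suspected product.
    Step 3: cluster those users, flag the colluding cluster, drop its ratings.
    """
    suspected = detect_suspicious_products(R)                  # step 1
    suspects = np.where(~np.isnan(R[:, suspected]).all(axis=1))[0]
    D = np.array([[pairwise_distance(R[i], R[j])               # step 2
                   for j in suspects] for i in suspects])
    dishonest = suspects[cluster_and_flag(D)]                  # step 3
    R_clean = R.copy()
    R_clean[dishonest, :] = np.nan                             # remove their ratings
    return np.nanmean(R_clean, axis=0), dishonest              # product reputations
```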

II. Related Work and Attack Models

A. Related Research

In Section I, we briefly introduced three key techniques for detecting and handling dishonest ratings: the majority rule, signal modeling, and trust-in-raters. These techniques are jointly used in various schemes. For example, in [5], a rater gives high endorsement to other raters who provide similar ratings and low endorsement to raters who provide different ratings. In [6], if a new rating leads to a significant change in the uncertainty of the rating distribution, it is considered an unfair rating. In [8], normal ratings are modeled as noise and collaborative unfair ratings are modeled as signal; the AR model error is calculated, and a low model error indicates a high possibility of being under attack. The schemes in [8] and [7] identify time intervals in which unfair ratings are highly suspected, and then reduce the trust values of raters who provide many ratings in these intervals. The final rating score is the weighted average of the rating values, where the weights depend on the trust values. The proposed scheme is compatible with most existing detection schemes, and could even be extended to incorporate trust calculation.

On the other hand, collusion detection has been studied in many fields, including PageRank [10], online auction systems [11], and P2P file-sharing networks [12]. The solutions in these systems mainly deal with colluders who have frequent interactions with one another, and they cannot be directly applied to online reputation systems for two reasons. First, in online reputation systems the users do not necessarily interact with each other. Second, dishonest user IDs can have diverse behaviors, because they can randomly rate many products they are not interested in, whereas honest users may have similar rating behaviors because they share similar interests.

B. Attack Models

In this paper, we study collaborative unfair raters who aim to either boost or downgrade the overall rating of the victim objects. These unfair raters provide ratings not only to the victim objects but also to other objects that they are not interested in. They do not necessarily follow any attack model. Instead, the attack data are collected from real human users through a cyber competition.

III. Collusion Detection System

As discussed in Section I, the majority-rule based detection methods and the signal-modeling based detection methods only examine the rating values for individual products, and both have disadvantages. The majority rule holds under two conditions: first, the number of unfair ratings is less than the number of honest ratings; second, the bias of the unfair ratings (i.e. the difference between the unfair ratings and the honest ratings) is sufficiently large. In practice, it is not difficult for an attacker to register or control a large number of user IDs, and the attacker can introduce a relatively small bias in the unfair ratings. In this case, the majority-rule based methods cannot detect the dishonest ratings unless they tolerate a high false alarm rate, as the toy example below illustrates. The signal-modeling based methods can handle a large number of unfair ratings with moderate bias. They can identify the suspicious time intervals in which unfair ratings are highly likely, but cannot determine exactly which ratings in those intervals are dishonest.
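The toy example below (our own illustration, with made-up numbers) shows the failure mode: sixty colluders with a one-point bias survive a "discard ratings far from the mean" filter, while honest extreme ratings are discarded.

```python
import numpy as np

# Toy illustration: 50 honest raters who think the product deserves a
# 4 or 5, and 60 colluders who all rate it a 3 (small bias, many IDs).
rng = np.random.default_rng(0)
honest = rng.choice([4, 5], size=50, p=[0.4, 0.6])
unfair = np.full(60, 3)
ratings = np.concatenate([honest, unfair])

# Majority-rule filter: discard ratings more than one point from the mean.
kept = ratings[np.abs(ratings - ratings.mean()) <= 1.0]
print(f"raw mean {ratings.mean():.2f}, filtered mean {kept.mean():.2f}")
# The filter discards the honest 5s and keeps every unfair 3: with many
# colluders and a small bias, majority-rule filtering makes things worse.
```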
In this paper, we explore abnormal signals in both the rating-value domain and the user domain. We observe that collaborative dishonest users work together to achieve the same goal, and therefore their rating behaviors can be similar. The similarity analysis in the user domain, however, cannot work alone. Since rating data are often very sparse, it is highly possible that normal users who have given only a few ratings appear similar to other normal users or even to dishonest users. The similarity analysis and the abnormal rating detection therefore need to be combined. As illustrated in Figure 1, the overall design of the proposed system has three components: abnormal detection, similarity detection, and rating aggregation.

[Fig. 1. System Overview. Block diagram: arrival-rate, histogram, mean-change, and model-change detectors feed a combination step; similarity distance calculation and clustering then drive rating filtering and rating aggregation.]

A. Abnormal Detection

As the first step, abnormal detection is applied to the rating values of individual products. The signal-modeling based detector proposed in [7] is applied to detect suspicious intervals. Let W_k denote the union of all suspicious intervals detected for product k; note that there may be more than one suspicious interval for one product. The methods in [7] can also tell whether the attacker's goal is to boost or reduce the rating in a suspicious interval. Let d_k = H when we detect that the attacker's goal is to boost product k; otherwise, set d_k = L. Note that the algorithms in [7] may not work well when the unfair ratings are spread over the entire rating duration. Therefore, for each product, we examine the variance of all rating values. If the variance for product k is larger than a threshold and W_k is empty, we set W_k to the entire rating duration. In summary, this module delivers two outputs for each product: the suspicious intervals (W_k) and the attack direction (d_k).
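A minimal sketch of this module's contract, under the assumption that the interval detector of [7] is available as a callable (it is stubbed here, not reimplemented):

```python
import numpy as np

def abnormal_detection(ratings_by_product, detect_intervals, var_thresh=1.5):
    """Return suspicious intervals W[k] and attack direction d[k] per product.

    `ratings_by_product[k]` is a list of (time, value) pairs for product k.
    `detect_intervals` stands in for the signal-modeling detector of [7];
    it returns (list_of_intervals, direction) for one product's ratings.
    """
    W, d = {}, {}
    for k, pairs in ratings_by_product.items():
        times = [t for t, _ in pairs]
        values = np.array([v for _, v in pairs], dtype=float)
        W[k], d[k] = detect_intervals(pairs)
        # Fallback for unfair ratings spread over the whole duration:
        # no localized change is found, but overall variance is inflated.
        if not W[k] and values.var() > var_thresh:
            W[k] = [(min(times), max(times))]
    return W, d

# Usage with a detector stub that never fires: high-variance products
# still get their entire rating duration marked as suspicious, e.g.
#   W, d = abnormal_detection(data, lambda pairs: ([], "H"))
```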

[Fig. 2. Rating Matrix for Similarity Calculation: rows u1-u3 (users), columns p1-p5 (products); entries are rating values, with blanks for unrated products.]

B. Similarity Detection

The philosophy behind the similarity detector is that the rating behaviors of colluding users who share the same attack goal should be similar. To calculate the similarity between raters, we construct a rating matrix that describes the rater-product relationship. If a user has ever provided ratings in the suspicious intervals of any product, this user's overall rating behavior is represented by one row in the rating matrix. Assume there are m such users. Then the matrix is m x n, where n is the total number of products. The matrix element R_{i,j} is the rating value given by the user associated with row i to product j. The rating values lie on a numerical scale, and an element can be null, indicating that the user has not yet rated that product. It is important to point out that the rating matrix contains only the users who are associated with suspicious intervals, but it contains their ratings for all products. Figure 2 illustrates a rating matrix with 3 users and 5 products.

Based on the rating matrix, for any pair of users we calculate the similarity distance between them. There are various ways to do so; in this work, we investigate the Euclidean distance and the Pearson correlation. For user i and user j, assume that they both provide ratings to products k_1, k_2, ..., k_N. Then the Euclidean distance between user i and user j is

D_{i,j} = \sqrt{ \frac{1}{N} \sum_{t=1}^{N} (R_{i,k_t} - R_{j,k_t})^2 }.   (1)

In the example in Figure 2, the distance between user 1 and user 3 considers their ratings to products 3, 4, and 5. The Pearson correlation distance D^p_{i,j} is calculated as

D^p_{i,j} = \frac{1}{N} \sum_{t=1}^{N} \left( \frac{R_{i,k_t} - \bar{R}_i}{\sigma_i} \right) \left( \frac{R_{j,k_t} - \bar{R}_j}{\sigma_j} \right),   (2)

where \bar{R}_i and \sigma_i are the mean and standard deviation of all of user i's rating values, respectively. In our experiments, we find that the Euclidean distance better captures the behavior similarity among dishonest users; results are shown in Section IV.

Next, we divide users into clusters based on their similarity distances. Three questions need to be answered: (1) which clustering algorithm to use; (2) how many clusters; and (3) which cluster contains the dishonest users. We investigated several clustering algorithms and found that the simple K-median method [13] works well. Intuitively, users should belong to two clusters, one for the honest users and the other for the dishonest users. This intuition, however, does not agree with the experimental results: in fact, K=2 is not a good choice, as it yields a very high false alarm rate. In our experiments, we studied the detection rate and false alarm rate of the proposed scheme for different K values and plotted the ROC curves, which give a good guideline for the selection of K.

After the users are separated into K clusters, we identify the dishonest users' cluster using the attack direction (i.e. d_k). Recall that d_k tells whether the dishonest users are trying to downgrade or boost product k. For cluster c, we calculate the indicator

I_c = \sum_{k: d_k = L} R_avg(c, k) - \sum_{k: d_k = H} R_avg(c, k),   (3)

where R_avg(c, k) is the average rating for product k from the users in cluster c. The cluster with the lowest I_c value is marked as the cluster of dishonest users, and the users in this cluster are marked as dishonest. All ratings from dishonest users are removed, and then the reputations of the products are calculated.
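Equations (1)-(3) translate almost directly into code. The following is a minimal sketch of our own, not the paper's implementation: np.nan marks unrated entries, `labels` is whatever cluster assignment the K-median step produces, and raters are assumed non-constant so each per-user standard deviation is nonzero.

```python
import numpy as np

def euclidean_dist(ri, rj):
    """Eq. (1): RMS difference over the products rated by both users."""
    m = ~np.isnan(ri) & ~np.isnan(rj)
    return np.sqrt(np.mean((ri[m] - rj[m]) ** 2)) if m.any() else np.inf

def pearson_dist(ri, rj):
    """Eq. (2): co-rated values, each normalized by that user's overall
    mean and standard deviation (assumed nonzero)."""
    m = ~np.isnan(ri) & ~np.isnan(rj)
    zi = (ri[m] - np.nanmean(ri)) / np.nanstd(ri)
    zj = (rj[m] - np.nanmean(rj)) / np.nanstd(rj)
    return np.mean(zi * zj)

def cluster_indicator(R, labels, d):
    """Eq. (3): compute I_c per cluster and return the cluster with the
    lowest value, i.e. the one marked dishonest. `d` maps product index
    -> 'H' (boost) or 'L' (downgrade); assumes each cluster rated the
    flagged products at least once."""
    scores = {}
    for c in np.unique(labels):
        avg = np.nanmean(R[labels == c], axis=0)   # R_avg(c, k)
        scores[c] = (sum(avg[k] for k in d if d[k] == "L")
                     - sum(avg[k] for k in d if d[k] == "H"))
    return min(scores, key=scores.get)
```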
In our implementation of the proposed approach, we use the C Clustering Library written by the Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo.

IV. Performance Evaluation

A. Rating Challenge and Experiment Description

For any online reputation system, it is very difficult to evaluate its attack-resistance properties in practical settings due to the lack of realistic attack data. Even if one can obtain data from e-commerce websites or P2P networks, there is no ground truth about which ratings are dishonest. To understand human users' collusion behaviors and to evaluate the proposed scheme against non-simulated attacks, we designed and launched a Cyber Competition. In this contest, we collected real online rating data for a number of products from a well-known e-commerce website. From day 1 to day 150, honest users gave their ratings to these products. Each player could control at most 30 dishonest user IDs, and the total number of ratings from dishonest user IDs was capped. The game goal was to downgrade the reputation score of product 1. The system calculates the reputation score of product 1 every 15 days; in other words, reputation scores are calculated on day 15, day 30, ..., and day 150. The average of these reputation scores is the overall reputation of product 1. The attack power is defined as the difference between the true reputation of product 1 and the calculated reputation after inserting the unfair ratings. The competition rules encourage the players to adjust the resources they use (i.e. the number of dishonest user IDs and the total number of ratings); the evaluation criteria are based on both the attack resources and the attack power. A detailed description of the evaluation criteria can be found at [9]. The competition attracted 63 registered players from 7 universities in China and the United States, and we collected 75,000 valid submissions.

[Fig. 3. Detection rate and false alarm rate with the Euclidean distance: table of the mean and variance of Pd and Pf for each attacker-number group.]

[Fig. 4. Detection rate and false alarm rate with the Pearson correlation distance: table of the mean and variance of Pd and Pf for each attacker-number group.]

B. Data Preparation

The data collected from the cyber competition are pre-processed as follows. First, we calculate the basic features of each submitted attack file, including the number of dishonest user IDs, the total number of dishonest ratings, and the attack power. The underlying rating aggregation algorithm can be seen on the rating competition website [9]. Second, the attack files that use the same number of user IDs are grouped together. Third, in each group, we select the strong attacks, i.e. those whose attack power is greater than 80 percent of the largest attack power in the group. The third step is important because we do not want to test our scheme against weak attacks, in which the players did not find a good way to mislead the rating system. After pre-processing, we have 342 attack files using 5 dishonest user IDs, 536 files using 9 IDs, 68 files using 15 IDs, 88 files using 20 IDs, 244 files using 25 IDs, and 424 files using 30 IDs.

C. Experiment Results

In the first experiment, we study the performance of the proposed scheme, measured by the detection rate (i.e. the number of detected dishonest users divided by the total number of dishonest users) and the false alarm rate (i.e. the number of normal users marked as dishonest divided by the total number of normal users); a code sketch of both metrics appears at the end of this subsection. When the similarity calculation uses the Euclidean distance, the results are shown in Figure 3; both the mean and the variance of the detection rate and false alarm rate are reported. We can clearly see that the proposed method yields a very good detection rate while maintaining a low false alarm rate. It is important to point out that we also applied the well-known Beta-function based unfair rating detection method [4] to the same dataset; its detection rate is almost zero. This is because there are only five rating values, and smart attackers do not give extreme rating values, in order to avoid being detected. Therefore, the majority-rule based methods do not work well.

In the second experiment, we compare different algorithms for calculating user similarity. We conduct the same experiment except that the Pearson correlation distance is used for the similarity calculation. The results are shown in Figure 4. In this case, the detection rate is greatly reduced. The performance of the proposed scheme is thus sensitive to the choice of the similarity calculation algorithm: the Euclidean distance performs much better than the Pearson correlation distance.

In the third experiment, we examine the reputation of the victim product (i.e. product 1). Three cases are compared: (a) the product reputation without unfair ratings; (b) the product reputation under attack without the proposed defense scheme; and (c) the recovered reputation with the proposed scheme. The results are shown in Figure 5, which plots the reputation of product 1 versus the number of dishonest user IDs.

[Fig. 5. Reputation of product 1 vs. number of dishonest user IDs, comparing the reputation under attack, the recovered reputation, and the real reputation.]
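For reference, the two metrics are simple to compute once the ground truth from the competition data is known. A minimal sketch with hypothetical index sets:

```python
def detection_metrics(flagged, dishonest, all_users):
    """Pd = detected dishonest users / total dishonest users;
    Pf = honest users flagged as dishonest / total honest users."""
    flagged, dishonest = set(flagged), set(dishonest)
    honest = set(all_users) - dishonest
    pd = len(flagged & dishonest) / len(dishonest)
    pf = len(flagged & honest) / len(honest)
    return pd, pf

# e.g. detection_metrics(flagged=[1, 2, 7], dishonest=[1, 2, 3],
#                        all_users=range(10))
# -> (0.666..., 0.142...): 2 of 3 colluders caught, 1 of 7 honest flagged.
```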
As seen in Figure 5, the proposed scheme recovers the reputation very effectively.

In the fourth experiment, we investigate how to choose a proper value of K for the clustering algorithm. In the K-median algorithm [13], K is the number of groups into which the raters are divided. If K is too large, the malicious colluders may be split across different groups; if K is too small, normal raters will likely be clustered into the same group as the dishonest raters. Figures 6 and 7 show the detection rate and the false alarm rate as a function of K when there are 5 dishonest user IDs. In this case, the suspicious user set (the users who have ever provided ratings in suspicious intervals) is several times larger than the number of dishonest IDs, and the rating matrix has one row per suspicious user. Figure 8 shows the corresponding ROC curve. Roughly speaking, K values between 5 and 10 are reasonable. The ROC performance depends on both the number of dishonest IDs and the number of suspicious users. By adjusting the parameters in the abnormal detection algorithm, we can widen the suspicious intervals and thereby enlarge the suspicious user set, while there are still only 5 dishonest users. Figure 9 shows that the ROC performance in this case is even better than the curve shown in Figure 8. This observation is interesting: it implies that the abnormal detection algorithm does not need to be perfect. Even if it yields a large suspicious user set, the subsequent collusion detection algorithm can still identify the real dishonest users. Figure 10 plots the ROC curve when there are 25 dishonest user IDs, with a correspondingly larger suspicious user set. In summary, when the number of dishonest users increases, K should be smaller; when the suspicious user set becomes larger, K should increase.
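To see how the ROC points behind Figures 8-10 could be generated, here is a sketch that sweeps K and records (Pf, Pd) pairs. It is illustrative only: SciPy's hierarchical clustering stands in for the paper's K-median routine, and it reuses the hypothetical `cluster_indicator` and `detection_metrics` helpers sketched earlier.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def roc_over_k(D, R, d, dishonest, k_values):
    """Sweep K and return {K: (Pf, Pd)} points for an ROC plot.

    D: symmetric pairwise-distance matrix over the suspicious users
    (finite entries assumed); R, d as in the earlier sketches;
    `dishonest` is the ground-truth set of colluder indices.
    """
    Z = linkage(squareform(D, checks=False), method="average")
    points = {}
    for K in k_values:
        labels = fcluster(Z, t=K, criterion="maxclust")
        bad = cluster_indicator(R, labels, d)       # lowest-I_c cluster
        flagged = np.where(labels == bad)[0]
        pd_, pf_ = detection_metrics(flagged, dishonest, range(len(labels)))
        points[K] = (pf_, pd_)
    return points
```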

[Fig. 6. Detection rate vs. K values (5 dishonest user IDs).]

[Fig. 7. False alarm rate vs. K values (5 dishonest user IDs).]

[Fig. 8. ROC curve with 5 dishonest user IDs and the default suspicious user set.]

[Fig. 9. ROC curve with 5 dishonest user IDs and an enlarged suspicious user set.]

[Fig. 10. ROC curve with 25 dishonest user IDs.]

V. Conclusions

In this paper, we addressed the problem of detecting and handling collusion behaviors in online rating systems. In particular, we observe that it is very common for unfair ratings to be provided by a group of user IDs whose behaviors are highly correlated. Based on this observation, we propose to address the unfair rating problem by integrating abnormal signals from both the user domain and the rating domain. The system includes three modules: unfair rating detection, behavior similarity calculation, and rating aggregation. The proposed scheme is evaluated against attacks created by real human users. The results show that the proposed system can accurately detect collusion behaviors and therefore significantly reduces the impact from the attackers.

VI. Acknowledgement

We sincerely thank Qinyuan Feng and Wenkai Wang for designing and administrating the CANT Cyber Competition, and all participants in this contest. This research is supported in part by grants from NSF (award #0643532), the URI Foundation, and (...).

References

[1] C. Dellarocas, "The digitization of word-of-mouth: Promise and challenges of online reputation systems," Management Science, vol. 49, no. 10, pp. 1407-1424, October 2003.
[2] C. Dellarocas, "Strategic manipulation of internet opinion forums: Implications for consumers and firms," Management Science, October 2006.
[3] J. Brown and J. Morgan, "Reputation in online auctions: The market for trust," California Management Review, vol. 49, no. 1, pp. 61-81, 2006.
[4] A. Whitby, A. Jøsang, and J. Indulska, "Filtering out unfair ratings in Bayesian reputation systems," in Proc. 7th Int. Workshop on Trust in Agent Societies, 2004.
[5] M. Chen and J. Singh, "Computing and using reputations for internet ratings," in Proceedings of the 3rd ACM Conference on Electronic Commerce, 2001.
[6] J. Weng, C. Miao, and A. Goh, "An entropy-based approach to protecting rating systems from unfair testimonies," IEICE Transactions on Information and Systems, vol. E89-D, no. 9, pp. 2502-2511, September 2006.
[7] Y. Yang, Y. Sun, S. Kay, and Q. Yang, "Defending online reputation systems against collaborative unfair raters through signal modeling and trust," in Proceedings of the 24th ACM Symposium on Applied Computing (SAC'09), 2009.
[8] Y. Yang, Y. Sun, J. Ren, and Q. Yang, "Building trust in online rating systems through signal modeling," in Proceedings of the IEEE ICDCS Workshop on Trust and Reputation Management, 2007.
[9] Beijing University and University of Rhode Island, "CANT cyber competition," http://www.ele.uri.edu/nest/challenge/htm/cant.html.
[10] H. Zhang, A. Goel, R. Govindan, K. Mason, and B. van Roy, "Improving eigenvector-based reputation systems against collusions," in Proceedings of the 3rd Workshop on Web Graph Algorithms, 2004.
[11] J. C. Wang and C. C. Chiu, "Recommending trusted online auction sellers using social network analysis," Expert Systems with Applications, vol. 34, pp. 1666-1679, 2008.
[12] Q. Lian, Z. Zhang, M. Yang, B. Zhao, Y. Dai, and X. Li, "An empirical study of collusion behavior in the Maze P2P file-sharing system," in Proceedings of the 27th International Conference on Distributed Computing Systems (ICDCS'07), 2007.
[13] D. L. Massart and L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, John Wiley & Sons, 1983.