# Combating Web Spam with TrustRank

Save this PDF as:

Size: px
Start display at page:

## Transcription

4 Ordered Trust Property T(p) < T(q) Pr[O(p) = ] < Pr[O(q) = ], T(p) = T(q) Pr[O(p) = ] = Pr[O(q) = ]. Another way to relax the requirements for T is to introduce a threshold value δ: Threshold Trust Property T(p) > δ O(p) =. That is, if a page p receives a score above δ, we know that it is good. Otherwise, we cannot tell anything about p. Such a function T would at least be capable of telling us that some subset of pages with a trust score above δ is good. Note that a function T with the threshold property does not necessarily provide an ordering of pages based on their likelihood of being good. 3. Evaluation Metrics This section introduces three metrics that help us evaluate whether a particular function T has some of the desired properties. We assume that we have a sample set X of web pages for which we can invoke both T and O. Then, we can evaluate how well a desired property is achieved for this set. In Section 6 we discuss how a meaningful sample set X can be selected, but for now, we can simply assume that X is a set of random web pages. Our first metric, pairwise orderedness, is related to the ordered trust property. We introduce a binary function I(T,O, p,q) to signal if a bad page received an equal or higher trust score than a good page (a violation of the ordered trust property): if T(p) T(q) and O(p) < O(q), I(T,O, p,q) = if T(p) T(q) and O(p) > O(q), 0 otherwise. Next, we generate from our sample X a set P of ordered pairs of pages (p,q), p q, and we compute the fraction of the pairs for which T did not make a mistake: Pairwise Orderedness pairord(t,o,p) = P (p,q) P I(T,O, p,q). P Hence, if pairord equals, there are no cases when T misrated a pair. Conversely, if pairord equals zero, then T misrated all the pairs. In Section 6 we discuss how to select a set P of sample page pairs for evaluation. Our next two metrics are related to the threshold trust property. It is natural to think of the performance of function T in terms of the commonly used precision and recall metrics [] for a certain threshold value δ. We define precision as the fraction of good among all pages in X that have a trust score above δ: Precision prec(t,o) = {p X T(p) > δ and O(p) = }. {q X T(q) > δ} Similarly, we define recall as the ratio between the number of good pages with a trust score above δ and the total number of good pages in X: Recall rec(t,o) = 4 Computing Trust {p X T(p) > δ and O(p) = }. {q X O(q) = } Let us begin our quest for a proper trust function by starting with some simple approaches. We will then combine the gathered observations and construct the TrustRank algorithm in Section 4.3. Given a limited budget L of O-invocations, it is straightforward to select at random a seed set S of L pages and call the oracle on its elements. (In Section 5 we discuss how to select a better seed set.) We denote the subsets of good and bad seed pages by S + and S, respectively. Since the remaining pages are not checked by the human expert, we assign them a trust score of / to signal our lack of information. Therefore, we call this scheme the ignorant trust function T 0, defined for any p V as follows: Ignorant Trust Function { O(p) if p S, T 0 (p) = / otherwise. For example, we can set L to 3 and apply our method to the example in Figure. A randomly selected seed set could then be S = {,3,6}. Let o and t 0 denote the vectors of oracle and trust scores for each page, respectively. In this case, o = [,,,, 0, 0, 0], t 0 = [,,,,, 0, ]. To evaluate the performance of the ignorant trust function, let us suppose that our sample X consists of all 7 pages, and that we consider all possible 7 6 = 4 ordered pairs. Then, the pairwise orderedness score of T 0 is 7/. Similarly, for a threshold δ = /, the precision is while the recall is /. 4. Trust Propagation As a next step in computing trust scores, we take advantage of the approximate isolation of good pages. We still select at random the set S of L pages that we invoke the oracle on. Then, expecting that good pages point to other good pages only, we assign a score of to all pages that are reachable from a page in S + in M or fewer steps. The appropriate trust function T M is defined as: 579

12 Precision/Recall Precision Recall Threshold TrustRank Bucket Figure 3: Precision and recall. tion retrieval problems, where it is usual to have a large corpus of documents, with only a few of those documents being relevant for a specific query. In contrast, our sample set consists most of good documents, all of which are relevant. This is why the baseline precision score for the sample X is 63/( ) = Related Work Our work builds on existing PageRank research. The idea of biasing PageRank to combat spam was introduced in []. The use of custom static score distribution vectors has been studied in the context of topic-sensitive Page- Rank [6]. Recent analyses of (biased) PageRank are provided by [, ]. The problem of trust has also been addressed in the context of peer-to-peer systems. For instance, [9] presents an algorithm similar to PageRank for computing the reputation or dependability of a node in a peer-to-peer network. The data mining and machine learning communities also explored the topic of web and spam detection (for instance, see [3]). However, this research is oriented toward the analysis of individual documents. The analysis typically looks for telltale signs of spamming techniques based on statistics derived from examples. 8 Conclusions As the web grows in size and value, search engines play an increasingly critical role, allowing users to find information of interest. However, today s search engines are seriously threatened by malicious web spam that attempts to subvert the unbiased searching and ranking services provided by the engines. Search engines are today combating web spam with a variety of ad hoc, often proprietary techniques. We believe that our work is a first attempt at formalizing the problem and at introducing a comprehensive solution to assist in the detection of web spam. Our experimental results show that we can effectively identify a significant number of strongly reputable (non-spam) pages. In a search engine, TrustRank can be used either separately to filter the index, or in combination with PageRank and other metrics to rank search results. We believe that there are still a number of interesting experiments that need to be carried out. For instance, it would be desirable to further explore the interplay between dampening and splitting for trust propagation. In addition, there are a number of ways to refine our methods. For example, instead of selecting the entire seed set at once, one could think of an iterative process: after the oracle has evaluated some pages, we could reconsider what pages it should evaluate next, based on the previous outcome. Such issues are a challenge for future research. Acknowledgement The authors would like to thank David Cossock and Farzin Maghoul for the inspiring discussions and valuable comments. References [] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 999. [] M. Bianchini, M. Gori, and F. Scarselli. Inside Page- Rank. Tech. rep., University of Siena, 003. [3] G. Golub and C. Van Loan. Matrix Computations. The Johns Hopkins University Press, 996. [4] Z. Gyöngyi and H. Garcia-Molina. Seed selection in TrustRank. Tech. rep., Stanford University, 004. [5] T. Haveliwala. Efficient computation of PageRank. Tech. rep., Stanford University, 999. [6] T. Haveliwala. Topic-sensitive PageRank. In Proceedings of the Eleventh International Conference on World Wide Web, 00. [7] J. Hopcroft, R. Motwani, and J. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 00. [8] S. Kamvar, T. Haveliwala, C. Manning, and G. Golub. Extrapolation methods for accelerating PageRank computations. In Proceedings of the Twelfth International Conference on World Wide Web, 003. [9] S. Kamvar, M. Schlosser, and H. Garcia-Molina. The EigenTrust algorithm for reputation management in PP networks. In Proceedings of the Twelfth International Conference on World Wide Web, 003. [0] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604 63, 999. [] A. Langville and C. Meyer. Deeper inside PageRank. Tech. rep., North Carolina State University, 003. [] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Tech. rep., Stanford University, 998. [3] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to filtering junk e- mail. In Learning for Text Categorization: Papers from the 998 Workshop,

### Are Automated Debugging Techniques Actually Helping Programmers?

Are Automated Debugging Techniques Actually Helping Programmers? Chris Parnin and Alessandro Orso Georgia Institute of Technology College of Computing {chris.parnin orso}@gatech.edu ABSTRACT Debugging

### All Your Contacts Are Belong to Us: Automated Identity Theft Attacks on Social Networks

All Your Contacts Are Belong to Us: Automated Identity Theft Attacks on Social Networks Leyla Bilge, Thorsten Strufe, Davide Balzarotti, Engin Kirda EURECOM Sophia Antipolis, France bilge@eurecom.fr, strufe@eurecom.fr,

### Discovering Value from Community Activity on Focused Question Answering Sites: A Case Study of Stack Overflow

Discovering Value from Community Activity on Focused Question Answering Sites: A Case Study of Stack Overflow Ashton Anderson Daniel Huttenlocher Jon Kleinberg Jure Leskovec Stanford University Cornell

### Automatically Detecting Vulnerable Websites Before They Turn Malicious

Automatically Detecting Vulnerable Websites Before They Turn Malicious Kyle Soska and Nicolas Christin, Carnegie Mellon University https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/soska

### An efficient reconciliation algorithm for social networks

An efficient reconciliation algorithm for social networks Nitish Korula Google Inc. 76 Ninth Ave, 4th Floor New York, NY nitish@google.com Silvio Lattanzi Google Inc. 76 Ninth Ave, 4th Floor New York,

### Do Not Crawl in the DUST: Different URLs with Similar Text

Do Not Crawl in the DUST: Different URLs with Similar Text Ziv Bar-Yossef Dept. of Electrical Engineering Technion, Haifa 32000, Israel Google Haifa Engineering Center, Israel zivby@ee.technion.ac.il Idit

### Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations

Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations Jure Leskovec Carnegie Mellon University jure@cs.cmu.edu Jon Kleinberg Cornell University kleinber@cs.cornell.edu Christos

Understanding and Combating Link Farming in the Twitter Social Network Saptarshi Ghosh IIT Kharagpur, India Naveen K. Sharma IIT Kharagpur, India Bimal Viswanath MPI-SWS, Germany Gautam Korlam IIT Kharagpur,

### Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters

Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Fan Deng University of Alberta fandeng@cs.ualberta.ca Davood Rafiei University of Alberta drafiei@cs.ualberta.ca ABSTRACT

### Steering User Behavior with Badges

Steering User Behavior with Badges Ashton Anderson Daniel Huttenlocher Jon Kleinberg Jure Leskovec Stanford University Cornell University Cornell University Stanford University ashton@cs.stanford.edu {dph,

### Indexing by Latent Semantic Analysis. Scott Deerwester Graduate Library School University of Chicago Chicago, IL 60637

Indexing by Latent Semantic Analysis Scott Deerwester Graduate Library School University of Chicago Chicago, IL 60637 Susan T. Dumais George W. Furnas Thomas K. Landauer Bell Communications Research 435

### Automatic segmentation of text into structured records

Automatic segmentation of text into structured records Vinayak Borkar Kaustubh Deshmukh Sunita Sarawagi Indian Institute of Technology, Bombay ABSTRACT In this paper we present a method for automatically

### Speeding up Distributed Request-Response Workflows

Speeding up Distributed Request-Response Workflows Virajith Jalaparti (UIUC) Peter Bodik Srikanth Kandula Ishai Menache Mikhail Rybalkin (Steklov Math Inst.) Chenyu Yan Microsoft Abstract We found that

### Introduction to Data Mining and Knowledge Discovery

Introduction to Data Mining and Knowledge Discovery Third Edition by Two Crows Corporation RELATED READINGS Data Mining 99: Technology Report, Two Crows Corporation, 1999 M. Berry and G. Linoff, Data Mining

### Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights

Seventh IEEE International Conference on Data Mining Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights Robert M. Bell and Yehuda Koren AT&T Labs Research 180 Park

### Why Johnny Can t Encrypt: A Usability Evaluation of PGP 5.0

Why Johnny Can t Encrypt: A Usability Evaluation of PGP 5.0 Alma Whitten School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 alma@cs.cmu.edu J. D. Tygar 1 EECS and SIMS University

### Collective Intelligence and its Implementation on the Web: algorithms to develop a collective mental map

Collective Intelligence and its Implementation on the Web: algorithms to develop a collective mental map Francis HEYLIGHEN * Center Leo Apostel, Free University of Brussels Address: Krijgskundestraat 33,

### Intellectual Need and Problem-Free Activity in the Mathematics Classroom

Intellectual Need 1 Intellectual Need and Problem-Free Activity in the Mathematics Classroom Evan Fuller, Jeffrey M. Rabin, Guershon Harel University of California, San Diego Correspondence concerning

### Less is More: Selecting Sources Wisely for Integration

Less is More: Selecting Sources Wisely for Integration Xin Luna Dong AT&T Labs-Research lunadong@research.att.com Barna Saha AT&T Labs-Research barna@research.att.com Divesh Srivastava AT&T Labs-Research

### A Usability Study and Critique of Two Password Managers

A Usability Study and Critique of Two Password Managers Sonia Chiasson and P.C. van Oorschot School of Computer Science, Carleton University, Ottawa, Canada chiasson@scs.carleton.ca Robert Biddle Human

### Fake It Till You Make It: Reputation, Competition, and Yelp Review Fraud

Fake It Till You Make It: Reputation, Competition, and Yelp Review Fraud Michael Luca Harvard Business School Georgios Zervas Boston University Questrom School of Business May

### Monte Carlo methods in PageRank computation: When one iteration is sufficient

Monte Carlo methods in PageRank computation: When one iteration is sufficient K.Avrachenkov, N. Litvak, D. Nemirovsky, N. Osipova Abstract PageRank is one of the principle criteria according to which Google

### No Free Lunch in Data Privacy

No Free Lunch in Data Privacy Daniel Kifer Penn State University dan+sigmod11@cse.psu.edu Ashwin Machanavajjhala Yahoo! Research mvnak@yahoo-inc.com ABSTRACT Differential privacy is a powerful tool for

### Recommendation Systems

Chapter 9 Recommendation Systems There is an extensive class of Web applications that involve predicting user responses to options. Such a facility is called a recommendation system. We shall begin this

### ACMS: The Akamai Configuration Management System

ACMS: The Akamai Configuration Management System Alex Sherman, Philip A. Lisiecki, Andy Berkheimer, and Joel Wein. Akamai Technologies, Inc. Columbia University Polytechnic University. {andyb,lisiecki,asherman,jwein}@akamai.com

### How to Use Expert Advice

NICOLÒ CESA-BIANCHI Università di Milano, Milan, Italy YOAV FREUND AT&T Labs, Florham Park, New Jersey DAVID HAUSSLER AND DAVID P. HELMBOLD University of California, Santa Cruz, Santa Cruz, California

### Can political science literatures be believed? A study of publication bias in the APSR and the AJPS

Can political science literatures be believed? A study of publication bias in the APSR and the AJPS Alan Gerber Yale University Neil Malhotra Stanford University Abstract Despite great attention to the