# Combating Web Spam with TrustRank

Save this PDF as:

Size: px
Start display at page:

## Transcription

4 2.2 PageRank PageRank is a well known algorithm that uses link information to assign global importance scores to all pages on the web. Because our proposed algorithms rely on PageRank, this section offers a short overview. The intuition behind PageRank is that a web page is important if several other important web pages point to it. Correspondingly, PageRank is based on a mutual reinforcement between pages: the importance of a certain page influences and is being influenced by the importance of some other pages. The PageRank score r(p) of a page p is defined as: r(p) = α q:(q,p) E r(q) + ( α) ω(q) N, where α is a decay factor. The equivalent matrix equation form is: r = α T r + ( α) N N. Hence, the score of some page p is a sum of two components: one part of the score comes from pages that point to p, and the other (static) part of the score is equal for all web pages. PageRank scores can be computed iteratively, for instance, by applying the Jacobi method [3]. While in a strict mathematical sense, iterations should be run to convergence, it is more common to use only a fixed number of M iterations in practice. It is important to note that while the regular PageRank algorithm assigns the same static score to each page, a biased PageRank version may break this rule. In the matrix equation r = α T r + ( α) d, vector d is a static score distribution vector of arbitrary, non-negative entries summing up to one. Vector d can be used to assign a non-zero static score to a set of special pages only; the score of such special pages is then spread during the iterations to the pages they point to. 3 Assessing Trust 3. Oracle and Trust Functions As discussed in Section, determining if a page is spam is subjective and requires human evaluation. We formalize the notion of a human checking a page for spam by a binary oracle function O over all pages p V: { 0 if p is bad, O(p) = if p is good. Figure 2 represents a small seven-page web where good pages are shown as white, and bad pages as black. For this example, calling the oracle on pages through 4 would yield the return value of. Note that there are a number of equivalent definitions of PageRank [] that might slightly differ in mathematical formulation and numerical properties, but yield the same relative ordering between any two web pages. 4

6 then this should indicate that p is less likely to be good than q. Such a function would at least be useful in ordering search results, giving preference to pages more likely to be good. More formally, then, a desirable property for the trust function is: Ordered Trust Property T(p) < T(q) Pr[O(p) = ] < Pr[O(q) = ], T(p) = T(q) Pr[O(p) = ] = Pr[O(q) = ]. Another way to relax the requirements for T is to introduce a threshold value δ: Threshold Trust Property T(p) > δ O(p) =. That is, if a page p receives a score above δ, we know that it is good. Otherwise, we cannot tell anything about p. Such a function T would at least be capable of telling us that some subset of pages with a trust score above δ is good. Note that a function T with the threshold property does not necessarily provide an ordering of pages based on their likelihood of being good. 3.2 Evaluation Metrics This section introduces three metrics that help us evaluate whether a particular function T has some of the desired properties. We assume that we have a sample set X of web pages for which we can invoke both T and O. Then, we can evaluate how well a desired property is achieved for this set. In Section 6 we discuss how a meaningful sample set X can be selected, but for now, we can simply assume that X is a set of random web pages. Our first metric, pairwise orderedness, is related to the ordered trust property. We introduce a binary function I(T, O, p, q) to signal if a bad page received an equal or higher trust score than a good page (a violation of the ordered trust property): 0 if T(p) T(q) and O(p) < O(q), I(T, O, p, q) = 0 if T(p) T(q) and O(p) > O(q), otherwise. Next, we generate from our sample X a set P of ordered pairs of pages (p, q), p q, and we compute the fraction of the pairs for which T did not make a mistake: Pairwise Orderedness pairord(t, O, P) = P (p,q) P I(T, O, p, q). P Hence, if pairord equals, there are no cases when T mis-rated a pair. Conversely, if pairord equals zero, then T mis-rated all the pairs. In Section 6 we discuss how to select a set P of sample page pairs for evaluation. Our next two metrics are related to the threshold trust property. It is natural to think of the performance of function T in terms of the commonly used precision and recall metrics [] 6

7 for a certain threshold value δ. We define precision as the fraction of good among all pages in X that have a trust score above δ: Precision prec(t, O) = {p X T(p) > δ and O(p) = }. {q X T(q) > δ} Similarly, we define recall as the ratio between the number of good pages with a trust score above δ and the total number of good pages in X: Recall rec(t, O) = {p X T(p) > δ and O(p) = }. {q X O(q) = } 4 Computing Trust Let us begin our quest for a proper trust function by starting with some simple approaches. We will then combine the gathered observations and construct the TrustRank algorithm in Section 4.3. Given a limited budget L of O-invocations, it is straightforward to select at random a seed set S of L pages and call the oracle on its elements. (In Section 5 we discuss how to select a better seed set.) We denote the subsets of good and bad seed pages by S + and S, respectively. Since the remaining pages are not checked by the human expert, we assign them a trust score of /2 to signal our lack of information. Therefore, we call this scheme the ignorant trust function T 0, defined for any p V as follows: Ignorant Trust Function T 0 (p) = { O(p) if p S, /2 otherwise. For example, we can set L to 3 and apply our method to the example in Figure 2. A randomly selected seed set could then be S = {, 3, 6}. Let o and t 0 denote the vectors of oracle and trust scores for each page, respectively. In this case, o = [,,,, 0, 0, 0], t 0 = [, 2,, 2, 2, 0, 2 ]. To evaluate the performance of the ignorant trust function, let us suppose that our sample X consists of all 7 pages, and that we consider all possible 7 6 = 42 ordered pairs. Then, the pairwise orderedness score of T 0 is 7/2. Similarly, for a threshold δ = /2, the precision is while the recall is /2. 4. Trust Propagation As a next step in computing trust scores, we take advantage of the approximate isolation of good pages. We still select at random the set S of L pages that we invoke the oracle on. Then, expecting that good pages point to other good pages only, we assign a score of to all pages that are reachable from a page in S + in M of fewer steps. The appropriate trust function T M is defined as: 7

8 M pairord prec rec 9/2 3/ /2 4/5 Table : Performance of the M-step trust function T M for M {, 2, 3}. M-Step Trust Function O(p) if p S, T M (p) = if p / S and q S + : q M p, /2 otherwise, where q M p denotes the existence of a path of a maximum length of M from page q to page p. Such a path must not include bad seed pages. Using the example in Figure 2 and the seed set S = {, 3, 6}, we present the trust score assignments for three different values of M: M = : t = [,,, 2, 2, 0, 2 ], M = 2 : t 2 = [,,,, 2, 0, 2 ], M = 3 : t 3 = [,,,,, 0, 2 ]. We would expect that T M performs better than T 0 with respect to some of our metrics. Indeed, Table shows that for M = and M = 2, both pairwise orderedness and recall increase, and precision remains. However, there is a drop in performance when we go to M = 3. The reason is that page 5 receives a score of due to the link from good page 4 to bad page 5 (marked with an asterisk on Figure 2). As we saw in the previous example, the problem with M-step trust is that we are not absolutely sure that pages reachable from good seeds are indeed good. As a matter of fact, the further away we are from good seed pages, the less certain we are that a page is good. For instance, in Figure 2 there are 2 pages (namely, pages 2 and 4) that are at most 2 links away from the good seed pages. As both of them are good, the probability that we reach a good page in at most 2 steps is. Similarly, the number of pages reachable from the good seed in at most 3 steps is 3. Only two of these (pages 2 and 4) are good, while page 5 is bad. Thus, the probability of finding a good page drops to 2/ Trust Attenuation These observations suggest that we reduce trust as we move further and further away from the good seed pages. There are many ways to achieve this attenuation of trust. Here we describe two possible schemes. Figure 3 illustrates the first idea, which we call trust dampening. Since page 2 is one link away from the good seed page, we assign it a dampened trust score of β, where β <. Since page 3 is reachable in one step from page 2 with score β, it gets a dampened score of β β. We also need to decide how to assign trust to pages with multiple inlinks. For instance, in Figure 3, assume page also links to page 3. We could assign page 3 the maximum trust score, in this case β, or the average score, in this case (β + β β)/2. 8

10 function TrustRank input T transition matrix N number of pages L limit of oracle invocations α B decay factor for biased PageRank M B number of biased PageRank iterations output t TrustRank scores begin // evaluate seed-desirability of pages () s = SelectSeed(...) // generate corresponding ordering (2) σ = Rank({,..., N}, s) // select good seeds (3) d = 0 N for i = to L do if O(σ(i)) == then d(σ(i)) = // normalize static score distribution vector (4) d = d/ d // compute TrustRank scores (5) t = d for i = to M B do t = α B T t + ( α B ) d return t end Figure 5: The TrustRank algorithm. As a first step, the algorithm calls function SelectSeed, which returns a vector s. The entry s(p) in this vector gives the desirability of page p as a seed page. (Please refer to Section 5 for details.) As we will see in Section 5., one version of SelectSeed returns the following vector on the example of Figure 2: s = [ 0.08, 0.3, 0.08, 0.0, 0.09, 0.06, 0.02 ]. In step (2) function Rank(x, s) generates a permutation x of the vector x, with elements x (i) in decreasing order of s(x (i)). In other words, Rank reorders the elements of x in decreasing order of their s-scores. For our example, we get: σ = [ 2, 4, 5,, 3, 6, 7 ]. That is, page 2 is the most desirable seed page, followed by page 4, and so on. Step (3) invokes the oracle function on the L most desirable seed pages. The entries of the static score distribution vector d that correspond to good seed pages are set to. Step (4) normalizes vector d so that its entries sum up to. Assuming that L = 3, the seed set is {2, 4, 5}. Pages 2 and 4 are the good seeds, and we get the following static score 0

12 function SelectSeed input U inverse transition matrix N number of pages α I decay factor M I number of iterations output s inverse PageRank scores begin s = N for i = to M do s = α U s + ( α) N N return s end Figure 6: The inverse PageRank algorithm. difference is that in our case the importance of a page depends on its outlinks, not its inlinks. Therefore, to compute the desirability of a page, we perform a PageRank computation on the graph G = (V, E ), where (p, q) E (q, p) E. Since we inverted the links, we call our algorithm inverse PageRank. Figure 6 shows a SelectSeed algorithm that performs the inverse PageRank computation. Note that the decay factor α I and the number of iterations M I can be different from the values α B and M B used by the TrustRank algorithm. The computation is identical to that in the traditional PageRank algorithm (Section 2.2), except that the inverse transition matrix U is used instead of the regular transition matrix T. For our example from Figure 2, the inverse PageRank algorithm (α I = 0.85, M I = 20) yields the following scores (already shown in Section 4.3): s = [ 0.08, 0.3, 0.08, 0.0, 0.09, 0.06, 0.02 ] For a value of L = 3, the seed set is S = {2, 4, 5}. Correspondingly, the good seed set is S + = {2, 4}, so pages 2 and 4 are used as starting points for score distribution. It is important to note that inverse PageRank is a heuristic (that works well in practice, as we will see in Section 6). First, inverse PageRank does not guarantee maximum coverage. For instance, in the example in Figure 7 and for L = 2, maximum coverage is achieved through the seed set {, 3} or {2, 3}. However, the inverse PageRank computation yields the score vector: s = [ 0.05, 0.05, 0.04, 0.02, 0.02, 0.02, 0.02 ], which leads to the seed set S = {, 2}. Nevertheless, inverse PageRank is appealing because its execution time is polynomial in the number of pages, while determining the maximum coverage is an N P-complete problem. 2 2 The general problem of identifying the minimal set of pages that yields maximum coverage is equivalent to the independent set problem [6] on directed graphs as shown next. The web graph can be transformed in a directed graph G = (V, E ), where an edge (p, q) E signals that page q can be reached from page p. 2

13 t()= c 3 t(3)=c Figure 7: A graph for which inverse PageRank does not yield maximum coverage. A second reason why inverse PageRank is a heuristic is that maximizing coverage may not always be the best strategy. To illustrate, let us propagate trust via splitting, without any dampening. Returning to Figure 7, say we only select page 2 as seed and it turns out to be good. Then pages 4, 5, and 6 each receive a score of /3. Now, assume we only select page 3 as seed and it also happens to be good. Then page 7 gets a score of. Depending on our ultimate goal, it may be preferable to use page 3, since we can be more certain about the page it identifies, even if the set is smaller. However, if we are only using trusts scores for comparing against other trust scores, it may still be better to learn about more pages, even if with less absolute accuracy. 5.2 High PageRank So far we have assumed that the value of identifying a page as good or bad is the same for all web pages. Yet, it may be more important to ascertain the goodness of pages that will appear high in query result sets. For example, say we have four pages p, q, r, and s, whose contents match a given set of query terms equally well. If the search engine uses PageRank to order the results, the page with highest rank, say p, will be displayed first, followed by the page with next highest rank, say q, and so on. Since it is more likely the user will be interested in pages p and q, as opposed to pages r and s (pages r and s may even appear on later result pages and may not even be seen by the user), it seems more useful to obtain accurate trust scores for pages p and q rather than for r and s. For instance, if page p turns out to be spam, the user may rather visit page q instead. Thus, a second heuristic for selecting a seed set is to give preference to pages with high PageRank. Since high-pagerank pages are likely to point to other high-pagerank pages, then good trust scores will also be propagated to pages that are likely to be at the top of result sets. Thus, with PageRank selection of seeds, we may identify the goodness of fewer pages (as compared to inverse PageRank), but they may be more important pages to know about. 6 Experiments 6. Data Set To evaluate our algorithms, we performed experiments using the complete set of pages crawled and indexed by the AltaVista search engine as of August In order to reduce computational demands, we decided to work at the level of web sites instead of individual pages. (Note that all presented methods work equally well for either We argue that such transformation does not change the complexity class of the algorithm, since it involves breadth-first search that has polynomial execution time. Then, finding a minimal set that provides maximal coverage is the same as finding the maximum independent set for G, which is an N P-complete problem. 3

15 unusable usable non-existent 9.6% empty 5.6% unknown 4.3% alias 3.5% personal page host 2.2% spam 3.7% reputable 56.3% advertisement.3% web organization 3.7% Figure 8: Composition of the evaluation sample. directories, reducing the initial set to roughly 7, 900. By sampling the sites that were filtered out, we found that insignificantly few reputable ones were removed by the process. Out of the remaining 7, 900 sites, we manually evaluated the top, 250 (seed set S) and selected 78 sites to be used as good seeds. This procedure corresponded to step (3) in Figure 5. The relatively small size of the good seed set S + is due to the extremely rigorous selection criteria that we adopted: not only did we make sure that the sites were not spam, but we also applied a second filter we only selected sites with a clearly identifiable authority (such as a governmental or educational institution or company) that controlled the contents of the site. The extra filter was added to guarantee the longevity of the good seed set, since the presence of physical authorities decreases the chance that the sites would degrade in the short run. 6.3 Evaluation Sample In order to evaluate the metrics presented in Section 3.2, we needed a set X of sample sites with known oracle scores. (Note that this is different from the seed set and it is only used for assessing the performance of our algorithms.) We settled on a sample of 000 sites, a number that gave us enough data points, and was still be manageable in terms of oracle evaluation time. We decided not to select the 000 sample sites of X at random. With a random sample, a great number of the sites would be very small (with few pages) and/or have very low PageRank. (Both size and PageRank follow power-law distributions, with many sites at the tail end of the distribution.) As we discussed in Section 5.2, it is more important for us to correctly detect spam in high PageRank sites, since they will more often appear high in query result sets. Furthermore, it is hard for the oracle to evaluate small sites due to the reduced body of evidence, so it also does not make sense to consider many small sites in our sample. In order to assure diversity, we adopted the following sampling method. We generated the list of sites in decreasing order of their PageRank scores, and we segmented it into 20 buckets. Each of the buckets contained a different number of sites, with scores summing up to 5 percent of the total PageRank score. Therefore, the first bucket contained the 86 sites with the highest PageRank scores, bucket 2 the next 665, while the 20th bucket contained 5 million sites that were assigned the lowest PageRank scores. We constructed our sample set of 000 sites by selecting 50 sites at random from each bucket. Then, we performed a manual (oracle) evaluation of the sample sites, determining if they were spam or not. The outcome of the evaluation process is presented in Figure 6.3, a 5

17 % Good PageRank Bucket % Good TrustRank Bucket Figure 9: Good sites in PageRank and TrustRank buckets. % Bad PageRank Bucket % Bad TrustRank Bucket Figure 0: Bad sites in PageRank and TrustRank buckets. 3. Ignorant Trust. As another baseline, we generated the ignorant trust scores of sites. All sites were assigned an ignorant trust score of /2, except for the 250 seeds, which received scores of 0 or PageRank versus TrustRank Let us discuss the difference between PageRank and TrustRank first. Remember, the Page- Rank algorithm does not incorporate any knowledge about the quality of a site, nor does it explicitly penalize badness. In fact, we will see that it is not very uncommon that some site created by a skilled spammer receives high PageRank score. In contrast, our TrustRank is meant to differentiate good and bad sites: we expect that spam sites were not assigned high TrustRank scores. Figures 9 and 0 provide a side-by-side comparison of PageRank and TrustRank with respect to the ratio of good and bad sites in each bucket. PageRank buckets were introduced in Section 6.3; we defined TrustRank buckets as containing the same number of sites as PageRank buckets. Note that we merged buckets 7 through 20 both for PageRank and TrustRank. (These last 4 buckets contained the more than 3 million sites that were unreferenced. All such sites received the same minimal static PageRank score and a zero TrustRank score, making it impossible to set up an ordering among them.) The horizontal axes of Figures 9 and 0 mark the PageRank and TrustRank bucket numbers, respectively. The vertical axis of the first figure corresponds to the percentage of good within a specific bucket, i.e., the number of good sample sites divided by the total number of 7

19 Pairwise Orderedness TrustRank PageRank Ignorant Top-PageRank Sample Sites Figure 2: Pairwise orderedness Pairwise Orderedness We used the pairwise orderedness metric presented in Section 3.2 to evaluate TrustRank with respect to the ordered trust property. For this experiment, we built the set P of all possible pairs of sites for several subsets of our evaluation sample X. We started by using the subset of X of the 00 sites with highest PageRank scores, in order to check TrustRank for the most important sites. Then, we gradually added more and more sites to our subset, in their decreasing order of PageRank scores. Finally, we used all pairs of all the 748 sample sites to compute the pairwise orderedness score. Figure 2 displays the results of this experiment. The horizontal axis shows the number of sample sites used for evaluation, while the vertical axis represents the pairwise orderedness scores for the specific sample sizes. For instance, we can conclude that for the 500 top- PageRank sample sites TrustRank receives a pairwise orderedness score of about Figure 2 also shows the pairwise orderedness scores for the ignorant trust function and PageRank. The overlap between our seed set S and sample set X is of 5 good sites, so all but five sample sites received a score of /2 from the ignorant trust function. Hence, the pairwise orderedness scores for the ignorant function represent the case when we have almost no information about the quality of the sites. Similarly, pairwise orderedness scores for PageRank illustrate how much the knowledge of importance can help in distinguishing good and bad. As we can see, TrustRank constantly outperforms both the ignorant function and PageRank Precision and Recall Our last set of experimental results, shown in Figure 3, present the performance of TrustRank with respect to the metrics of precision and recall. We used as threshold values δ the borderline TrustRank scores that separated the 7 TrustRank buckets, discussed in Section The lowest buckets corresponding to each threshold value are presented on the horizontal axis of the figure; we display the precision and recall scores on the vertical. For instance, if the threshold δ is set so that all and only sample sites in TrustRank buckets through 0 are above it, then precision is 0.86 and recall is TrustRank assigned the highest scores to good sites, and the proportion of bad increases gradually as we move to lower scores. Hence, precision and recall manifest an almost linear decrease and increase, respectively. Note the high (0.82) precision score for the whole sample set: such a value would be very uncommon for traditional information retrieval problems, 9

20 Precision/Recall Precision Recall Threshold TrustRank Bucket Figure 3: Precision and recall. where it is usual to have a large corpus of documents, with only a few of those documents being relevant for a specific query. In contrast, our sample set consists most of good documents, all of which are relevant. This is why the baseline precision score for the sample X is 63/( ) = Related Work Our work builds on existing PageRank research. The idea of biasing PageRank to combat spam was introduced in []. The use of custom static score distribution vectors has been studied in the context of topic-sensitive PageRank [5]. Recent analyses of (biased) PageRank are provided by [2, 0]. The problem of trust has also been addressed in the context of peer-to-peer systems. For instance, [8] presents an algorithm similar to PageRank for computing the reputation or dependability of a node in a peer-to-peer network. The data mining and machine learning communities also explored the topic of web and spam detection (for instance, see [2]). However, this research is oriented toward the analysis of individual documents. The analysis typically looks for telltale signs of spamming techniques based on statistics derived from examples. 8 Conclusions As the web grows in size and value, search engines play an increasingly critical role, allowing users to find information of interest. However, today s search engines are seriously threatened by malicious web spam that attempts to subvert the unbiased searching and ranking services provided by the engines. Search engines are today combating web spam with a variety of ad hoc, often proprietary techniques. We believe that our work is a first attempt at formalizing the problem and at introducing a comprehensive solution to assist in the detection of web spam. Our experimental results show that we can effectively identify a significant number of strongly reputable (non-spam) pages. We believe that there are still a number of interesting experiments that need to be carried out. For instance, it would be desirable to further explore the interplay between dampening and splitting for trust propagation. In addition, there are a number of ways to refine our methods. For example, instead of selecting the entire seed set at once, one could think of an 20

### Combating Web Spam with TrustRank

Combating Web Spam with TrustRank Zoltán Gyöngyi Hector Garcia-Molina Jan Pedersen Stanford University Stanford University Yahoo! Inc. Computer Science Department Computer Science Department 70 First Avenue

### Part 1: Link Analysis & Page Rank

Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Exam on the 5th of February, 216, 14. to 16. If you wish to attend, please

### Practical Graph Mining with R. 5. Link Analysis

Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities

### The Second Eigenvalue of the Google Matrix

0 2 The Second Eigenvalue of the Google Matrix Taher H Haveliwala and Sepandar D Kamvar Stanford University taherh,sdkamvar @csstanfordedu Abstract We determine analytically the modulus of the second eigenvalue

### Web Spam Taxonomy. Abstract. 1 Introduction. 2 Definition. Hector Garcia-Molina Computer Science Department Stanford University hector@cs.stanford.

Web Spam Taxonomy Zoltán Gyöngyi Computer Science Department Stanford University zoltan@cs.stanford.edu Hector Garcia-Molina Computer Science Department Stanford University hector@cs.stanford.edu Abstract

### Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

### Web Search Engines: Solutions

Web Search Engines: Solutions Problem 1: A. How can the owner of a web site design a spider trap? Answer: He can set up his web server so that, whenever a client requests a URL in a particular directory

### International Journal of Engineering Research-Online A Peer Reviewed International Journal Articles are freely available online:http://www.ijoer.

RESEARCH ARTICLE SURVEY ON PAGERANK ALGORITHMS USING WEB-LINK STRUCTURE SOWMYA.M 1, V.S.SREELAXMI 2, MUNESHWARA M.S 3, ANIL G.N 4 Department of CSE, BMS Institute of Technology, Avalahalli, Yelahanka,

### Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman 2 Intuition: solve the recursive equation: a page is important if important pages link to it. Technically, importance = the principal

### The Mathematics Behind Google s PageRank

The Mathematics Behind Google s PageRank Ilse Ipsen Department of Mathematics North Carolina State University Raleigh, USA Joint work with Rebecca Wills Man p.1 Two Factors Determine where Google displays

### Removing Web Spam Links from Search Engine Results

Removing Web Spam Links from Search Engine Results Manuel EGELE pizzaman@iseclab.org, 1 Overview Search Engine Optimization and definition of web spam Motivation Approach Inferring importance of features

### So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

### Enhancing the Ranking of a Web Page in the Ocean of Data

Database Systems Journal vol. IV, no. 3/2013 3 Enhancing the Ranking of a Web Page in the Ocean of Data Hitesh KUMAR SHARMA University of Petroleum and Energy Studies, India hkshitesh@gmail.com In today

### Spam Detection with a Content-based Random-walk Algorithm

Spam Detection with a Content-based Random-walk Algorithm ABSTRACT F. Javier Ortega Departamento de Lenguajes y Sistemas Informáticos Universidad de Sevilla Av. Reina Mercedes s/n 41012, Sevilla (Spain)

### Music Lyrics Spider. Wang Litan 1 and Kan Min Yen 2 School of Computing, National University of Singapore 3 Science Drive 2, Singapore ABSTRACT

NUROP CONGRESS PAPER Music Lyrics Spider Wang Litan 1 and Kan Min Yen 2 School of Computing, National University of Singapore 3 Science Drive 2, Singapore 117543 ABSTRACT The World Wide Web s dynamic and

### Machine Learning for Naive Bayesian Spam Filter Tokenization

Machine Learning for Naive Bayesian Spam Filter Tokenization Michael Bevilacqua-Linn December 20, 2003 Abstract Background Traditional client level spam filters rely on rule based heuristics. While these

### New Metrics for Reputation Management in P2P Networks

New for Reputation in P2P Networks D. Donato, M. Paniccia 2, M. Selis 2, C. Castillo, G. Cortesi 3, S. Leonardi 2. Yahoo!Research Barcelona Catalunya, Spain 2. Università di Roma La Sapienza Rome, Italy

### Reputation Network Analysis for Email Filtering

Reputation Network Analysis for Email Filtering Jennifer Golbeck, James Hendler University of Maryland, College Park MINDSWAP 8400 Baltimore Avenue College Park, MD 20742 {golbeck, hendler}@cs.umd.edu

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

### IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

### RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS

ISBN: 978-972-8924-93-5 2009 IADIS RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS Ben Choi & Sumit Tyagi Computer Science, Louisiana Tech University, USA ABSTRACT In this paper we propose new methods for

### Social Media Mining. Network Measures

Klout Measures and Metrics 22 Why Do We Need Measures? Who are the central figures (influential individuals) in the network? What interaction patterns are common in friends? Who are the like-minded users

### A Decision Theoretic Approach to Targeted Advertising

82 UNCERTAINTY IN ARTIFICIAL INTELLIGENCE PROCEEDINGS 2000 A Decision Theoretic Approach to Targeted Advertising David Maxwell Chickering and David Heckerman Microsoft Research Redmond WA, 98052-6399 dmax@microsoft.com

### Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA

### Representation of Electronic Mail Filtering Profiles: A User Study

Representation of Electronic Mail Filtering Profiles: A User Study Michael J. Pazzani Department of Information and Computer Science University of California, Irvine Irvine, CA 92697 +1 949 824 5888 pazzani@ics.uci.edu

### Web Usage in Client-Server Design

Web Search Web Usage in Client-Server Design A client (e.g., a browser) communicates with a server via http Hypertext transfer protocol: a lightweight and simple protocol asynchronously carrying a variety

### Search engine ranking

Proceedings of the 7 th International Conference on Applied Informatics Eger, Hungary, January 28 31, 2007. Vol. 2. pp. 417 422. Search engine ranking Mária Princz Faculty of Technical Engineering, University

### Personalized Reputation Management in P2P Networks

Personalized Reputation Management in P2P Networks Paul - Alexandru Chirita 1, Wolfgang Nejdl 1, Mario Schlosser 2, and Oana Scurtu 1 1 L3S Research Center / University of Hannover Deutscher Pavillon Expo

### Web Search. 2 o Semestre 2012/2013

Dados na Dados na Departamento de Engenharia Informática Instituto Superior Técnico 2 o Semestre 2012/2013 Bibliography Dados na Bing Liu, Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2nd

### Detecting Spam Web Pages through Content Analysis

Detecting Spam Web Pages through Content Analysis Alex Ntoulas 1, Marc Najork 2, Mark Manasse 2, Dennis Fetterly 2 1 UCLA Computer Science Department 2 Microsoft Research, Silicon Valley Web Spam: raison

### Bayesian Spam Filtering

Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary amaobied@ucalgary.ca http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating

### Simple Language Models for Spam Detection

Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to

### A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE

STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LIX, Number 1, 2014 A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE DIANA HALIŢĂ AND DARIUS BUFNEA Abstract. Then

### The Basics of Graphical Models

The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

### Analysis and Computation of Google s PageRank

Analysis and Computation of Google s PageRank Ilse Ipsen North Carolina State University Joint work with Rebecca M. Wills IMACS p.1 Overview Goal: Compute (citation) importance of a web page Simple Web

### Simulating a File-Sharing P2P Network

Simulating a File-Sharing P2P Network Mario T. Schlosser, Tyson E. Condie, and Sepandar D. Kamvar Department of Computer Science Stanford University, Stanford, CA 94305, USA Abstract. Assessing the performance

### Beating the NCAA Football Point Spread

Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over

### An Effective Content Based Web Page Ranking Approach

An Effective Content Based Web Page Ranking Approach NIDHI SHALYA M.Tech (Computer Science), nidz.cs@hotmail.com SHASHWAT SHUKLA Computer Science Department, shashwatshukla10aug@gmail.com DEEPAK ARORA

### Notes on Factoring. MA 206 Kurt Bryan

The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor

### A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks

A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks H. T. Kung Dario Vlah {htk, dario}@eecs.harvard.edu Harvard School of Engineering and Applied Sciences

### Protein Protein Interaction Networks

Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

### Paper Classification for Recommendation on Research Support System Papits

IJCSNS International Journal of Computer Science and Network Security, VOL.6 No.5A, May 006 17 Paper Classification for Recommendation on Research Support System Papits Tadachika Ozono, and Toramatsu Shintani,

### Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

### A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences

### A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis

A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis Yusuf Yaslan and Zehra Cataltepe Istanbul Technical University, Computer Engineering Department, Maslak 34469 Istanbul, Turkey

### MASCOT Search Results Interpretation

The Mascot protein identification program (Matrix Science, Ltd.) uses statistical methods to assess the validity of a match. MS/MS data is not ideal. That is, there are unassignable peaks (noise) and usually

### STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

### Ranking on Data Manifolds

Ranking on Data Manifolds Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 72076 Tuebingen, Germany {firstname.secondname

### SIP Service Providers and The Spam Problem

SIP Service Providers and The Spam Problem Y. Rebahi, D. Sisalem Fraunhofer Institut Fokus Kaiserin-Augusta-Allee 1 10589 Berlin, Germany {rebahi, sisalem}@fokus.fraunhofer.de Abstract The Session Initiation

### An Alternative Web Search Strategy? Abstract

An Alternative Web Search Strategy? V.-H. Winterer, Rechenzentrum Universität Freiburg (Dated: November 2007) Abstract We propose an alternative Web search strategy taking advantage of the knowledge on

### Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

### Identifying Market Price Levels using Differential Evolution

Identifying Market Price Levels using Differential Evolution Michael Mayo University of Waikato, Hamilton, New Zealand mmayo@waikato.ac.nz WWW home page: http://www.cs.waikato.ac.nz/~mmayo/ Abstract. Evolutionary

### Evolutionary Detection of Rules for Text Categorization. Application to Spam Filtering

Advances in Intelligent Systems and Technologies Proceedings ECIT2004 - Third European Conference on Intelligent Systems and Technologies Iasi, Romania, July 21-23, 2004 Evolutionary Detection of Rules

### Domain Classification of Technical Terms Using the Web

Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

### Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

### Construction Algorithms for Index Model Based on Web Page Classification

Journal of Computational Information Systems 10: 2 (2014) 655 664 Available at http://www.jofcis.com Construction Algorithms for Index Model Based on Web Page Classification Yangjie ZHANG 1,2,, Chungang

### Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

### Adaption of Statistical Email Filtering Techniques

Adaption of Statistical Email Filtering Techniques David Kohlbrenner IT.com Thomas Jefferson High School for Science and Technology January 25, 2007 Abstract With the rise of the levels of spam, new techniques

### 6.042/18.062J Mathematics for Computer Science October 3, 2006 Tom Leighton and Ronitt Rubinfeld. Graph Theory III

6.04/8.06J Mathematics for Computer Science October 3, 006 Tom Leighton and Ronitt Rubinfeld Lecture Notes Graph Theory III Draft: please check back in a couple of days for a modified version of these

### SEO Techniques for various Applications - A Comparative Analyses and Evaluation

IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727 PP 20-24 www.iosrjournals.org SEO Techniques for various Applications - A Comparative Analyses and Evaluation Sandhya

### Savita Teli 1, Santoshkumar Biradar 2

Effective Spam Detection Method for Email Savita Teli 1, Santoshkumar Biradar 2 1 (Student, Dept of Computer Engg, Dr. D. Y. Patil College of Engg, Ambi, University of Pune, M.S, India) 2 (Asst. Proff,

### Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007 Naïve Bayes Components ML vs. MAP Benefits Feature Preparation Filtering Decay Extended Examples

### Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

### CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

### The PageRank Citation Ranking: Bring Order to the Web

The PageRank Citation Ranking: Bring Order to the Web presented by: Xiaoxi Pang 25.Nov 2010 1 / 20 Outline Introduction A ranking for every page on the Web Implementation Convergence Properties Personalized

### Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

### Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

### Local outlier detection in data forensics: data mining approach to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential

### Spam Filtering based on Naive Bayes Classification. Tianhao Sun

Spam Filtering based on Naive Bayes Classification Tianhao Sun May 1, 2009 Abstract This project discusses about the popular statistical spam filtering process: naive Bayes classification. A fairly famous

### Spam Filtering with Naive Bayesian Classification

Spam Filtering with Naive Bayesian Classification Khuong An Nguyen Queens College University of Cambridge L101: Machine Learning for Language Processing MPhil in Advanced Computer Science 09-April-2011

### T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

### Link Analysis. Chapter 5. 5.1 PageRank

Chapter 5 Link Analysis One of the biggest changes in our lives in the decade following the turn of the century was the availability of efficient and accurate Web search, through search engines such as

### SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

Load Balancing Backtracking, branch & bound and alpha-beta pruning: how to assign work to idle processes without much communication? Additionally for alpha-beta pruning: implementing the young-brothers-wait

### Groundbreaking Technology Redefines Spam Prevention. Analysis of a New High-Accuracy Method for Catching Spam

Groundbreaking Technology Redefines Spam Prevention Analysis of a New High-Accuracy Method for Catching Spam October 2007 Introduction Today, numerous companies offer anti-spam solutions. Most techniques

### Real Time Traffic Monitoring With Bayesian Belief Networks

Real Time Traffic Monitoring With Bayesian Belief Networks Sicco Pier van Gosliga TNO Defence, Security and Safety, P.O.Box 96864, 2509 JG The Hague, The Netherlands +31 70 374 02 30, sicco_pier.vangosliga@tno.nl

### Classification with Decision Trees

Classification with Decision Trees Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong 1 / 24 Y Tao Classification with Decision Trees In this lecture, we will discuss

DON T FOLLOW ME: SPAM DETECTION IN TWITTER Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA hwang@psu.edu Keywords: Abstract: social

### A THEORETICAL COMPARISON OF DATA MASKING TECHNIQUES FOR NUMERICAL MICRODATA

A THEORETICAL COMPARISON OF DATA MASKING TECHNIQUES FOR NUMERICAL MICRODATA Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University Agency Internal User Unmasked Result Subjects

### A Brief Introduction to Property Testing

A Brief Introduction to Property Testing Oded Goldreich Abstract. This short article provides a brief description of the main issues that underly the study of property testing. It is meant to serve as

### Graph Theory and Complex Networks: An Introduction. Chapter 08: Computer networks

Graph Theory and Complex Networks: An Introduction Maarten van Steen VU Amsterdam, Dept. Computer Science Room R4.20, steen@cs.vu.nl Chapter 08: Computer networks Version: March 3, 2011 2 / 53 Contents

### Reputation Management in P2P Networks: The EigenTrust Algorithm

Reputation Management in P2P Networks: The EigenTrust Algorithm by Adrian Alexa supervised by Anja Theobald 1 Introduction Peer-to-Peer networks is a fast developing branch of Computer Science and many

### Evaluation of a New Method for Measuring the Internet Degree Distribution: Simulation Results

Evaluation of a New Method for Measuring the Internet Distribution: Simulation Results Christophe Crespelle and Fabien Tarissan LIP6 CNRS and Université Pierre et Marie Curie Paris 6 4 avenue du président

### Comparison of Standard and Zipf-Based Document Retrieval Heuristics

Comparison of Standard and Zipf-Based Document Retrieval Heuristics Benjamin Hoffmann Universität Stuttgart, Institut für Formale Methoden der Informatik Universitätsstr. 38, D-70569 Stuttgart, Germany

### Fraud Detection in Electronic Auction

Fraud Detection in Electronic Auction Duen Horng Chau 1, Christos Faloutsos 2 1 Human-Computer Interaction Institute, School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh

### KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,

### Sybilproof Reputation Mechanisms

Sybilproof Reputation Mechanisms Alice Cheng Center for Applied Mathematics Cornell University, Ithaca, NY 14853 alice@cam.cornell.edu Eric Friedman School of Operations Research and Industrial Engineering

### Strategic Online Advertising: Modeling Internet User Behavior with

2 Strategic Online Advertising: Modeling Internet User Behavior with Patrick Johnston, Nicholas Kristoff, Heather McGinness, Phuong Vu, Nathaniel Wong, Jason Wright with William T. Scherer and Matthew

### Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

### 6.207/14.15: Networks Lecture 6: Growing Random Networks and Power Laws

6.207/14.15: Networks Lecture 6: Growing Random Networks and Power Laws Daron Acemoglu and Asu Ozdaglar MIT September 28, 2009 1 Outline Growing random networks Power-law degree distributions: Rich-Get-Richer

### Semantic Search in Portals using Ontologies

Semantic Search in Portals using Ontologies Wallace Anacleto Pinheiro Ana Maria de C. Moura Military Institute of Engineering - IME/RJ Department of Computer Engineering - Rio de Janeiro - Brazil [awallace,anamoura]@de9.ime.eb.br

### Google s PageRank: The Math Behind the Search Engine

Google s PageRank: The Math Behind the Search Engine Rebecca S. Wills Department of Mathematics North Carolina State University Raleigh, NC 27695 rmwills@ncsu.edu May, 2006 Introduction Approximately 9

### Applying Machine Learning to Stock Market Trading Bryce Taylor

Applying Machine Learning to Stock Market Trading Bryce Taylor Abstract: In an effort to emulate human investors who read publicly available materials in order to make decisions about their investments,

### Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

### Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn