# Spam Detection with a Content-based Random-walk Algorithm

Save this PDF as:

Size: px
Start display at page:

## Transcription

4 Fraction of globally popular words: a web page with a high fraction of words within the most popular words in the entire corpus, is likely to be a spam. This metric scores spam self promotion techniques such as the word stuffing mechanisms. Average length of words: non-spam web pages have a bell-shaped distribution of average word lengths, while malicious pages have much higher values. Hence, this heuristic penalizes the use of word stuffing mechanisms. In the next section, we explain how we use these heuristics to obtain the spam-biased vectors for our algorithm. 3.3 Obtaining the seed sets As we previously explained, our approach uses the contentbased heuristics to automatically select the sets of seeds that will be used in our algorithm to obtain a final rankingof web pages by spam likelihood. A given page will be demoted or promoted in the ranking according to its relations with this bad and good web pages. We use seed sets to ensure that negative scores of negative pages will be propagated over the graph, and analogous for the positive seeds. The seed sets are represented in our approach by two spam-biased vectors, e + and e. The vectors contain the spam and non-spam likelihoods of the web pages taken as seeds in our algorithm, giving higher positive or negative scores to those nodes that have higher e + or e (see Equations (1) and (2)). In this section, we introduce three different ways to build the spam-biased vectors, given the heuristics of each web page N most positive/negative pages This first method chooses the N most positive and negative pages in the graph as seeds, according to their metrics. Formally, given a page v j, let M(v j) be a vector with the content-based metrics for v j. The spam likelihood of v j will be determined by the norm of M(v j), as shown in Equation (3): Spaminess(A) = h 2 (3) h M(v j ) where A is a web page, and h represents the heuristics which M(v j) contains. In this way, we take the N nodes with highest Spaminess as negative seeds, and the N nodes with lowest Spaminess as positive seeds. The spam-biased vectors can be defined as follows: { e + 1/N if i S + i = (4) 0 otherwise where N is a parameter that specifies the number of seeds that will be taken. S + is the set of the N nodes with lowest Spaminess in the graph. The formula is analogous for vector e N most positive/negative pages with contentbased weights Following the previous schema, we can take advantage of the content-based metrics by including the actual scores directly in the computation of the weights of the seeds, as shown in Equation (5): e + i = { Spaminess(i) j S + Spaminess(j) if i S + 0 otherwise (5) where Spaminess(i) is the norm of the vector built from the metrics of the page i (see Equation 3) Content-based nodes characterization The last method consists in applying the previous formula to every nodes in the graph. We can rely on the thresholds proposed in the study in [17], and use them to determine whether a page must be a negative or a positive seed. The thresholds for the metrics considered in the present work are shown in Table 1. Given a page, if one of its metrics is Heuristics Threshold Compressibility 6.0 Fraction of globally popular words 0.75 Average length of words 9.0 Table 1: Thresholds for the content-based metrics. above the corresponding threshold, we include the page in the set of negative seeds, and in other case it will be taken as a positive seed. Once the sets of nodes have been defined, we apply the same formulas shown in Equation (5). 4. EXPERIMENTS In this section, we show the experimental design defined to show the performance of our technique, the dataset used and the results obtained. We also detail the values of the parameters for each set of experiments, and the different variants proposed in this work. The aim of the experiments is to show the performance of our approaches in terms of demotion of spam web pages in a ranking, and to compare them to a state-of-art spam detection technique. The experiments are intended to assess the usefulness of including content-based heuristics into a random-walk algorithm, in order to build a ranking of web pages according to their relevance, penalizing the spam web pages. Figure 1: Web pages per bucket according to Page- Rank, in a log scale. 4.1 Dataset The corpus used in the experiments for the paper is the WEBSPAM-UK2006 Dataset [3] for spam detection. It contains more than 98 million pages. The collection is based on 48

5 a crawl of the.uk domain performed in May It was collected by the Laboratory of Web Algorithmics, Università degli Studi di Milano, with the support of the DELIS EU-FET research project. The collection was labeled by a group of volunteers and/or by domain-specific patterns such as.gov.uk or.ac.uk. Of the 11,402 hosts in UK2006 dataset, 7,423 are labeled as spam or non-spam. For the evaluation purposes, we have considered as spam any web page that belongs to a host labeled as spam. There are about 10 million spam web pages in the collection. 4.2 Evaluation The purpose of our algorithm is to build a ranking of web pages, demoting the spam web pages in the ranking and trying to put them as far as possible from the first positions of the ranking. Since our approach does not classify the web pages between spam or non-spam, it do not make sense to perform an evaluation in terms of classification accuracy. We use in our experiments the same evaluation method followed in other works on the application of graph-based algorithms to the spam detection task [2, 8]. Our intuition is that it is more important to correctly detect the spam in high Page- Rank valued sites, because they will often appear in the first positions in the search results for many queries. The aim of this evaluation method is to easily determine the number of spam web pages detected mainly in the highest positions of the ranking of web pages. The evaluation method is implemented as follows. First, a list of pages is generated in decreasing order of their Page- Rank score. This list is segmented into 20 buckets, in such way that each of the buckets contains a different number of sites, with scores summing up to 5% of the total PageRank score. The nodes per bucket are plotted in Figure 1. Once we obtain the size of the buckets according to the PageRank scores of the web pages, we use them to build a set of buckets with the rankings computed in each experiment. For evaluation purposes, buckets of the same sizes as the ones in Table 1 are built with the results of each method. The number of spam web pages per bucket is our evaluation metric. It is obtained by counting the number of pages in each bucket that are labeled as spam in the dataset. The aim of a spam detection technique is to avoid spam pages into the first buckets, demoting these pages in order to put them into the lastest buckets. In the next sections, we show the performance of the proposed approaches. The parameters of the PageRank, the damping factor and the threshold, have been set to 0.85 and 0.01, respectively, for all the experiments shown in this work. 4.3 TrustRank TrustRank algorithm [8] is a link-based algorithm that takes as input the web graph and a set of non-spam web pages, chosen in a semi-supervised way, that are the seeds for the algorithm. The output is a ranking of web pages according to their relevance, in which the spam web pages are demoted. TrustRank computes a score for each web page, as PageRank, using the seed set to include a bias in the random-walk algorithm. In order to build the seed set, they propose an inverse PageRank, taking into account the out-links of the web pages, instead of their in-links. Then they choose by hand a number N of non-spam web pages from that ranking. In this way, they try to take as seeds the N good pages that reach as many nodes as possible. The results shown in Figure 2 have been obtained by performing 20 iterations of the algorithm with a damping factor of 0.85, as suggested in [8]. Since they use a set of 178 seeds with a dataset of 13 million pages, we have taken a seed set with = 534 pages from our dataset of 99 million pages. Figure 2: Spam web pages per bucket, using TrustRank algorithm. We test this technique with the UK2006 dataset, and it achieves good results. There are less than a 3% of spam web pages in the two first buckets. However, more than the 10% of pages in the third bucket are spam. 4.4 N most positive/negative pages The first set of experiments corresponds to the method introduced in Section We have selected the 5% of the most positive and negative nodes as the positive and negative seed sets, respectively. The results are shown in Figure 3. Figure 3: Results using 5% most positive and negative nodes as seeds. In the figure, we can see that the 14% of web pages in the first bucket are spam, and the amount of spam pages per bucket is around the 6% of the total number of pages in the rest of the dataset, except for the last buckets. These results 49

6 are worse than the ones with TrustRank, mainly in the first buckets which are the most important. 4.5 N most positive/negative pages with contentbased weights The results applying the method in Section 3.3.2, that includes the content-based heuristics directly in the algorithm are shown in Figure 4, presenting the number of spam web pages in each bucket, taking the 5% of the most positive and negative nodes as seeds. Figure 5: Results using all the nodes in the web graph as seeds. The seed weight is set according to the content-based metrics. Seeds + Metrics) is our second approach, based on the previous one, but setting the weights of the seeds according to the content-based metrics; and finally, CbC (Content-Based Characterization) corresponds with our third approach that takes every node as a seed for the random-walk algorithm. Figure 4: Results using the 5% most positive and negative nodes as seeds. The seed weight is set according to the content-based metrics. Comparing the charts in Figures 3 and 4 we can see the improvement achieved by the inclusion of the metrics in the weights of the seeds. Our second approach is more effective at demoting the spam web pages in the high PR buckets. Furthermore, it outperformes the results of TrustRank algorithm, allowing a maximum of 4% of spam web pages into the 12 first buckets. 4.6 Content-based seed characterization In this section we present the experimental results achieved with the method explained in Section This approach applies the same method as the previous section, but taking all the nodes in the web graph as seeds for the algorithm. The results can be seen in Figure 5. The results show a noticeable improvement with respect to TrustRank, although the approach in Section achieves a better performance in terms of demotion of spam web pages in the final ranking. 4.7 Comparative study A recap of the results presented in this work is shown in Table 2. The first column represents the buckets of pages and the second one contains the number of web pages from the first bucket to the current one, inclusive. The rest of the columns shows the accumulated number of spam web pages for each technique, that is the total number of spam web pages from the first bucket to the current one, inclusive. In the table, TR corresponds with TrustRank algorithm; PNS (Positive/Negative Seeds) is our first approach, that takes the N most positive and negative pages as seeds and assigns to each of them a weight of 1/N; PNS+M(Positive/Negative # Pages TR PNS PNS+M CbC Table 2: Accumulated number of spam web pages for each method: TrustRank (TR), Positive/Negative seeds (PNS), Positive/Negative Seeds with metric-based weights (PNS+M) and Contentbased characterization (CbC) Since the first positions of the ranking are the most important for us, only the first 10 buckets are presented in the table. We can see that PNS+M achieves the best results within each bucket, outperforming the TrustRank. In contrast, the first approach does not improve the TrustRank algorithm. The relevance of including the content-based metrics in the random-walk algorithm is clear regarding the difference between these two experiments. The contentbased characterization method also presents good results, with only 32 spam web pages in the first 649 pages, outperforming TrustRank in those first buckets as well. 5. CONCLUSIONS In this work, we have introduced a novel method to deal with the web spam pages. Our approach combines concepts from known link-based and content-based techniques to avoid the negative effects of spam web pages in a web 50

### Know your Neighbors: Web Spam Detection using the Web Topology

Know your Neighbors: Web Spam Detection using the Web Topology Carlos Castillo 1 chato@yahoo inc.com Debora Donato 1 debora@yahoo inc.com Vanessa Murdock 1 vmurdock@yahoo inc.com Fabrizio Silvestri 2 f.silvestri@isti.cnr.it

### Methods for Web Spam Filtering

Methods for Web Spam Filtering Ph.D. Thesis Summary Károly Csalogány Supervisor: András A. Benczúr Ph.D. Eötvös Loránd University Faculty of Informatics Department of Information Sciences Informatics Ph.D.

### Comparison of Ant Colony and Bee Colony Optimization for Spam Host Detection

International Journal of Engineering Research and Development eissn : 2278-067X, pissn : 2278-800X, www.ijerd.com Volume 4, Issue 8 (November 2012), PP. 26-32 Comparison of Ant Colony and Bee Colony Optimization

### Spam Host Detection Using Ant Colony Optimization

Spam Host Detection Using Ant Colony Optimization Arnon Rungsawang, Apichat Taweesiriwate and Bundit Manaskasemsak Abstract Inappropriate effort of web manipulation or spamming in order to boost up a web

### Fraudulent Support Telephone Number Identification Based on Co-occurrence Information on the Web

Fraudulent Support Telephone Number Identification Based on Co-occurrence Information on the Web Xin Li, Yiqun Liu, Min Zhang, Shaoping Ma State Key Laboratory of Intelligent Technology and Systems Tsinghua

### New Metrics for Reputation Management in P2P Networks

New for Reputation in P2P Networks D. Donato, M. Paniccia 2, M. Selis 2, C. Castillo, G. Cortesi 3, S. Leonardi 2. Yahoo!Research Barcelona Catalunya, Spain 2. Università di Roma La Sapienza Rome, Italy

### Detecting Spam Web Pages through Content Analysis

Detecting Spam Web Pages through Content Analysis Alex Ntoulas 1, Marc Najork 2, Mark Manasse 2, Dennis Fetterly 2 1 UCLA Computer Science Department 2 Microsoft Research, Silicon Valley Web Spam: raison

### Web Spam Identification Through Language Model Analysis

Web Spam Identification Through Language Model Analysis ABSTRACT Juan Martinez-Romo Dpto. Lenguajes y Sistemas Informáticos UNED 28040 Madrid, Spain juaner@lsi.uned.es This paper applies a language model

### Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA

### Graph Regularization Methods for Web Spam Detection

Noname manuscript No. (will be inserted by the editor) Graph Regularization Methods for Web Spam Detection Jacob Abernethy Olivier Chapelle Carlos Castillo the date of receipt and acceptance should be

### Link-Based Similarity Search to Fight Web Spam

Link-Based Similarity Search to Fight Web Spam András A. Benczúr Károly Csalogány Tamás Sarlós Informatics Laboratory Computer and Automation Research Institute Hungarian Academy of Sciences 11 Lagymanyosi

### Adversarial Information Retrieval on the Web (AIRWeb 2007)

WORKSHOP REPORT Adversarial Information Retrieval on the Web (AIRWeb 2007) Carlos Castillo Yahoo! Research chato@yahoo-inc.com Kumar Chellapilla Microsoft Live Labs kumarc@microsoft.com Brian D. Davison

### Web Spam Detection: link-based and content-based techniques

Web Spam Detection: link-based and content-based techniques Luca Becchetti 2, Carlos Castillo 1, Debora Donato 1, Stefano Leonardi 2, and Ricardo Baeza-Yates 1 1 Yahoo! Research Barcelona C/Ocata 1, 08003

DON T FOLLOW ME: SPAM DETECTION IN TWITTER Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA hwang@psu.edu Keywords: Abstract: social

### Extracting Link Spam using Biased Random Walks From Spam Seed Sets

Extracting Link Spam using Biased Random Walks From Spam Seed Sets Baoning Wu Dept of Computer Science & Engineering Lehigh University Bethlehem, PA 1815 USA baw4@cse.lehigh.edu ABSTRACT Link spam deliberately

### New Metrics for Reputation Management in P2P Networks

New Metrics for Reputation Management in P2P Networks Debora Donato 1 debora@yahoo-inc.com Carlos Castillo 1 chato@yahoo-inc.com Mario Paniccia 3 mario paniccia@yahoo.it Giovanni Cortese 2 g.cortese@computer.org

### A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE

STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LIX, Number 1, 2014 A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE DIANA HALIŢĂ AND DARIUS BUFNEA Abstract. Then

### Blocking Blog Spam with Language Model Disagreement

Blocking Blog Spam with Language Model Disagreement Gilad Mishne Informatics Institute, University of Amsterdam Kruislaan 403, 1098SJ Amsterdam The Netherlands gilad@science.uva.nl David Carmel, Ronny

### Web (2.0) Mining: Analyzing Social Media

Web (2.0) Mining: Analyzing Social Media Anupam Joshi, Tim Finin, Akshay Java, Anubhav Kale, and Pranam Kolari University of Maryland, Baltimore County, Baltimore MD 2250 Abstract Social media systems

### International Journal of Engineering Research-Online A Peer Reviewed International Journal Articles are freely available online:http://www.ijoer.

RESEARCH ARTICLE SURVEY ON PAGERANK ALGORITHMS USING WEB-LINK STRUCTURE SOWMYA.M 1, V.S.SREELAXMI 2, MUNESHWARA M.S 3, ANIL G.N 4 Department of CSE, BMS Institute of Technology, Avalahalli, Yelahanka,

### An Effective Content Based Web Page Ranking Approach

An Effective Content Based Web Page Ranking Approach NIDHI SHALYA M.Tech (Computer Science), nidz.cs@hotmail.com SHASHWAT SHUKLA Computer Science Department, shashwatshukla10aug@gmail.com DEEPAK ARORA

### Incorporating Participant Reputation in Community-driven Question Answering Systems

Incorporating Participant Reputation in Community-driven Question Answering Systems Liangjie Hong, Zaihan Yang and Brian D. Davison Department of Computer Science and Engineering Lehigh University, Bethlehem,

### Part 1: Link Analysis & Page Rank

Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Exam on the 5th of February, 216, 14. to 16. If you wish to attend, please

### Subordinating to the Majority: Factoid Question Answering over CQA Sites

Journal of Computational Information Systems 9: 16 (2013) 6409 6416 Available at http://www.jofcis.com Subordinating to the Majority: Factoid Question Answering over CQA Sites Xin LIAN, Xiaojie YUAN, Haiwei

### Identifying Link Farm Spam Pages

Identifying Link Farm Spam Pages Baoning Wu and Brian D. Davison Department of Computer Science & Engineering Lehigh University Bethlehem, PA 18015 USA {baw4,davison}@cse.lehigh.edu ABSTRACT With the increasing

### Improving Web Spam Classification using Rank-time Features

Improving Web Spam Classification using Rank-time Features Krysta M. Svore Microsoft Research 1 Microsoft Way Redmond, WA ksvore@microsoft.com Qiang Wu Microsoft Research 1 Microsoft Way Redmond, WA qiangwu@microsoft.com

### Analysis and Detection of Web Spam by means of Web Content

Analysis and Detection of Web Spam by means of Web Content Víctor M. Prieto, Manuel Álvarez, Rafael López-García and Fidel Cacheda Department of Information and Communication Technologies, University of

### Revealing the Relationship Network Behind Link Spam

Revealing the Relationship Network Behind Link Spam Apostolis Zarras Ruhr-University Bochum apostolis.zarras@rub.de Antonis Papadogiannakis FORTH-ICS papadog@ics.forth.gr Sotiris Ioannidis FORTH-ICS sotiris@ics.forth.gr

### Enhancing the Ranking of a Web Page in the Ocean of Data

Database Systems Journal vol. IV, no. 3/2013 3 Enhancing the Ranking of a Web Page in the Ocean of Data Hitesh KUMAR SHARMA University of Petroleum and Energy Studies, India hkshitesh@gmail.com In today

### Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

### Graph Mining and its applications to Reputation Management in Networks

Graph Mining and its applications to Reputation Management in Networks Stefano Leonardi Sapienza University of Rome Rome, Italy Plan of the Lecture Link-based Webspam detection Web Spam Topological Web

### A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences

### Bayesian Spam Filtering

Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary amaobied@ucalgary.ca http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating

### Practical Graph Mining with R. 5. Link Analysis

Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities

### Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management

### Combating Web Spam with TrustRank

Combating Web Spam with TrustRank Zoltán Gyöngyi Hector Garcia-Molina Jan Pedersen Stanford University Stanford University Yahoo! Inc. Computer Science Department Computer Science Department 70 First Avenue

### Relaxed Online SVMs for Spam Filtering

Relaxed Online SVMs for Spam Filtering D. Sculley Tufts University Department of Computer Science 161 College Ave., Medford, MA USA dsculleycs.tufts.edu Gabriel M. Wachman Tufts University Department of

### INTERNATIONAL JOURNAL OF ADVANCES IN COMPUTING AND INFORMATION TECHNOLOGY An International online open access peer reviewed journal

INTERNATIONAL JOURNAL OF ADVANCES IN COMPUTING AND INFORMATION TECHNOLOGY An International online open access peer reviewed journal Research Article ISSN 2277 9140 ABSTRACT Web page categorization based

### Topical Authority Identification in Community Question Answering

Topical Authority Identification in Community Question Answering Guangyou Zhou, Kang Liu, and Jun Zhao National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences 95

### Personalized Reputation Management in P2P Networks

Personalized Reputation Management in P2P Networks Paul - Alexandru Chirita 1, Wolfgang Nejdl 1, Mario Schlosser 2, and Oana Scurtu 1 1 L3S Research Center / University of Hannover Deutscher Pavillon Expo

### Link Analysis and Site Structure in Information Retrieval

Link Analysis and Site Structure in Information Retrieval Thomas Mandl Information Science Universität Hildesheim Marienburger Platz 22 31141 Hildesheim - Germany mandl@uni-hildesheim.de Abstract: Link

### Construction Algorithms for Index Model Based on Web Page Classification

Journal of Computational Information Systems 10: 2 (2014) 655 664 Available at http://www.jofcis.com Construction Algorithms for Index Model Based on Web Page Classification Yangjie ZHANG 1,2,, Chungang

### WEB PAGE CATEGORISATION BASED ON NEURONS

WEB PAGE CATEGORISATION BASED ON NEURONS Shikha Batra Abstract: Contemporary web is comprised of trillions of pages and everyday tremendous amount of requests are made to put more web pages on the WWW.

Recap CS276 Lecture 15 Plan for today Wrap up spam Crawling Connectivity servers In the last lecture we introduced web search Paid placement SEO/Spam Link-based ranking Most search engines use hyperlink

### Nullification test collections for web spam and SEO

Nullification test collections for web spam and SEO ABSTRACT Timothy Jones Computer Science Dept. The Australian National University Canberra, Australia tim.jones@cs.anu.edu.au David Hawking Funnelback

### Removing Web Spam Links from Search Engine Results

Removing Web Spam Links from Search Engine Results Manuel Egele Technical University Vienna pizzaman@iseclab.org Christopher Kruegel University of California Santa Barbara chris@cs.ucsb.edu Engin Kirda

### Toward Spam 2.0: An Evaluation of Web 2.0 Anti-Spam Methods

Toward Spam 2.0: An Evaluation of Web 2.0 Anti-Spam Methods Pedram Hayati, Vidyasagar Potdar Digital Ecosystem and Business Intelligence (DEBI) Institute, Curtin University of Technology, Perth, WA, Australia

### Web Spam Detection Using Improved Decision Tree Classification Method

Web Spam Detection Using Improved Decision Tree Classification Method Rashmi R. Tundalwar Computer Engineering Department Modern College of Engineering Pune, India Prof. Manasi Kulkarni Computer Engineering

### Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman 2 Intuition: solve the recursive equation: a page is important if important pages link to it. Technically, importance = the principal

### Sentiment analysis on tweets in a financial domain

Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International

### Web Search Engines: Solutions

Web Search Engines: Solutions Problem 1: A. How can the owner of a web site design a spider trap? Answer: He can set up his web server so that, whenever a client requests a URL in a particular directory

### RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS

ISBN: 978-972-8924-93-5 2009 IADIS RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS Ben Choi & Sumit Tyagi Computer Science, Louisiana Tech University, USA ABSTRACT In this paper we propose new methods for

### Combating Web Spam with TrustRank

Combating Web Spam with TrustRank Zoltán Gyöngyi Hector Garcia-Molina Jan Pedersen Stanford University Stanford University Yahoo! Inc. Computer Science Department Computer Science Department 70 First Avenue

### Web Spam Classification: a Few Features Worth More

Web Spam Classification: a Few Features Worth More Miklós Erdélyi 1,2 András Garzó 1 András A. Benczúr 1 1 Institute for Computer Science and Control, Hungarian Academy of Sciences 2 University of Pannonia,

### Question Routing by Modeling User Expertise and Activity in cqa services

Question Routing by Modeling User Expertise and Activity in cqa services Liang-Cheng Lai and Hung-Yu Kao Department of Computer Science and Information Engineering National Cheng Kung University, Tainan,

### Web Search. 2 o Semestre 2012/2013

Dados na Dados na Departamento de Engenharia Informática Instituto Superior Técnico 2 o Semestre 2012/2013 Bibliography Dados na Bing Liu, Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2nd

### Search engines: ranking algorithms

Search engines: ranking algorithms Gianna M. Del Corso Dipartimento di Informatica, Università di Pisa, Italy ESP, 25 Marzo 2015 1 Statistics 2 Search Engines Ranking Algorithms HITS Web Analytics Estimated

### Dr. Antony Selvadoss Thanamani, Head & Associate Professor, Department of Computer Science, NGM College, Pollachi, India.

Enhanced Approach on Web Page Classification Using Machine Learning Technique S.Gowri Shanthi Research Scholar, Department of Computer Science, NGM College, Pollachi, India. Dr. Antony Selvadoss Thanamani,

### Keywords: blogs, blogosphere, social media, Web mining. Introduction

An Analytical research on social media system by web mining technique: blogs & blogosphere Case Dr. Manoj Kumar Singh 1, Dileep Kumar G 2, Mohammad Kemal Ahmad 3 1,2 Asst Professor, Department of Computing,

### Search and Information Retrieval

Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

### Google: data structures and algorithms

Google: data structures and algorithms Petteri Huuhka Department of Computer Science University of Helsinki Petteri.Huuhka@helsinki.fi Abstract. The architecture of Google s search engine is designed to

### FAST AND EFFICIENT WEB SPIDER

INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 FAST AND EFFICIENT WEB SPIDER Satwinder Kaur 1, Alisha Gupta 2 1 Research Scholar, 2 Lecturer 1-2 Department of Computer

### Automatic Web Page Classification

Automatic Web Page Classification Yasser Ganjisaffar 84802416 yganjisa@uci.edu 1 Introduction To facilitate user browsing of Web, some websites such as Yahoo! (http://dir.yahoo.com) and Open Directory

### SVMs for the Blogosphere: Blog Identification and Splog Detection

SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari, Tim Finin and Anupam Joshi University of Maryland, Baltimore County Baltimore MD {kolari1, finin, joshi}@umbc.edu Abstract

### Building Domain-Specific Web Collections for Scientific Digital Libraries: A Meta-Search Enhanced Focused Crawling Method

Building Domain-Specific Web Collections for Scientific Digital Libraries: A Meta-Search Enhanced Focused Crawling Method Jialun Qin, Yilu Zhou Dept. of Management Information Systems The University of

### A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

Understanding and Combating Link Farming in the Twitter Social Network Saptarshi Ghosh IIT Kharagpur, India Naveen K. Sharma IIT Kharagpur, India Bimal Viswanath MPI-SWS, Germany Gautam Korlam IIT Kharagpur,

### Protein Protein Interaction Networks

Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

### Cross-Lingual Web Spam Classification

Cross-Lingual Web Spam Classification András Garzó Bálint Daróczy Tamás Kiss Dávid Siklósi András A. Benczúr Institute for Computer Science and Control, Hungarian Academy of Sciences (MTA SZTAKI) Eötvös

### Simple Language Models for Spam Detection

Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to

### Music Lyrics Spider. Wang Litan 1 and Kan Min Yen 2 School of Computing, National University of Singapore 3 Science Drive 2, Singapore ABSTRACT

NUROP CONGRESS PAPER Music Lyrics Spider Wang Litan 1 and Kan Min Yen 2 School of Computing, National University of Singapore 3 Science Drive 2, Singapore 117543 ABSTRACT The World Wide Web s dynamic and

### Identifying Best Bet Web Search Results by Mining Past User Behavior

Identifying Best Bet Web Search Results by Mining Past User Behavior Eugene Agichtein Microsoft Research Redmond, WA, USA eugeneag@microsoft.com Zijian Zheng Microsoft Corporation Redmond, WA, USA zijianz@microsoft.com

### University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School

### Prevalence and Mitigation of Forum Spamming

Prevalence and Mitigation of Forum ming Youngsang Shin, Minaxi Gupta, Steven Myers School of Informatics and Computing Indiana University, Bloomington Email: {shiny, minaxi}@cs.indiana.edu, samyers@indiana.edu

### The Splog Detection Task and A Solution Based on Temporal and Link Properties

The Splog Detection Task and A Solution Based on Temporal and Link Properties Yu-Ru Lin, Wen-Yen Chen, Xiaolin Shi, Richard Sia, Xiaodan Song, Yun Chi, Koji Hino, Hari Sundaram, Jun Tatemura and Belle

### Spidering and Filtering Web Pages for Vertical Search Engines

Spidering and Filtering Web Pages for Vertical Search Engines Michael Chau The University of Arizona mchau@bpa.arizona.edu 1 Introduction The size of the Web is growing exponentially. The number of indexable

### Index Terms Domain name, Firewall, Packet, Phishing, URL.

BDD for Implementation of Packet Filter Firewall and Detecting Phishing Websites Naresh Shende Vidyalankar Institute of Technology Prof. S. K. Shinde Lokmanya Tilak College of Engineering Abstract Packet

### Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

### Affinity Prediction in Online Social Networks

Affinity Prediction in Online Social Networks Matias Estrada and Marcelo Mendoza Skout Inc., Chile Universidad Técnica Federico Santa María, Chile Abstract Link prediction is the problem of inferring whether

### The Enron Corpus: A New Dataset for Email Classification Research

The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu

### Spam Detection on Twitter Using Traditional Classifiers M. McCord CSE Dept Lehigh University 19 Memorial Drive West Bethlehem, PA 18015, USA

Spam Detection on Twitter Using Traditional Classifiers M. McCord CSE Dept Lehigh University 19 Memorial Drive West Bethlehem, PA 18015, USA mpm308@lehigh.edu M. Chuah CSE Dept Lehigh University 19 Memorial

### Web Spam, Propaganda and Trust

Web Spam, Propaganda and Trust Panagiotis T. Metaxas Wellesley College Wellesley, MA 02481, USA pmetaxas@wellesley.edu Joseph DeStefano College of the Holy Cross Worcester, MA 01610, USA joed@mathcs.holycross.edu

### Removing Web Spam Links from Search Engine Results

Removing Web Spam Links from Search Engine Results Manuel EGELE pizzaman@iseclab.org, 1 Overview Search Engine Optimization and definition of web spam Motivation Approach Inferring importance of features

### Detecting Arabic Cloaking Web Pages Using Hybrid Techniques

, pp.109-126 http://dx.doi.org/10.14257/ijhit.2013.6.6.10 Detecting Arabic Cloaking Web Pages Using Hybrid Techniques Heider A. Wahsheh 1, Mohammed N. Al-Kabi 2 and Izzat M. Alsmadi 3 1 Computer Science

### Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different

### Semi-Supervised Learning for Blog Classification

Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Semi-Supervised Learning for Blog Classification Daisuke Ikeda Department of Computational Intelligence and Systems Science,

### PRODUCT REVIEW RANKING SUMMARIZATION

PRODUCT REVIEW RANKING SUMMARIZATION N.P.Vadivukkarasi, Research Scholar, Department of Computer Science, Kongu Arts and Science College, Erode. Dr. B. Jayanthi M.C.A., M.Phil., Ph.D., Associate Professor,

### Detection of Internet Scam Using Logistic Regression

Detection of Internet Scam Using Logistic Regression Mehrbod Sharifi mehrbod@cs.cmu.edu Eugene Fink eugenefink@cmu.edu Computer Science, Carnegie Mellon University, Pittsburgh, PA 15217 Jaime G. Cabonell

### Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

### Framework for Intelligent Crawler Engine on IaaS Cloud Service Model

International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 17 (2014), pp. 1783-1789 International Research Publications House http://www. irphouse.com Framework for

### Graph Processing and Social Networks

Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph

### REVIEW ON QUERY CLUSTERING ALGORITHMS FOR SEARCH ENGINE OPTIMIZATION

Volume 2, Issue 2, February 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: A REVIEW ON QUERY CLUSTERING

### Web Spam Taxonomy. Abstract. 1 Introduction. 2 Definition. Hector Garcia-Molina Computer Science Department Stanford University hector@cs.stanford.

Web Spam Taxonomy Zoltán Gyöngyi Computer Science Department Stanford University zoltan@cs.stanford.edu Hector Garcia-Molina Computer Science Department Stanford University hector@cs.stanford.edu Abstract

### Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

### A Novel Method to Detect Fraud Ranking For Mobile Apps

A Novel Method to Detect Fraud Ranking For Mobile Apps Geedikapally Rajesh Kumar M.Tech Student Department of CSE St.Peter's Engineering College. Abstract: Ranking fraud in the mobile App market refers

IADIS International Journal on WWW/Internet Vol. 12, No. 1, pp. 52-64 ISSN: 1645-7641 USER INTENT PREDICTION FROM ACCESS LOG IN ONLINE SHOP Hidekazu Yanagimoto. Osaka Prefecture University. 1-1, Gakuen-cho,

### An Analysis of Factors Used in Search Engine Ranking

An Analysis of Factors Used in Search Engine Ranking Albert Bifet 1 Carlos Castillo 2 Paul-Alexandru Chirita 3 Ingmar Weber 4 1 Technical University of Catalonia 2 University of Chile 3 L3S Research Center

### 1 o Semestre 2007/2008

Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction