Detecting Spam Web Pages through Content Analysis

Similar documents
Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach

Blocking Blog Spam with Language Model Disagreement

Know your Neighbors: Web Spam Detection using the Web Topology

Improving Web Spam Classification using Rank-time Features

Spam Detection with a Content-based Random-walk Algorithm

Bayesian Spam Filtering

DON T FOLLOW ME: SPAM DETECTION IN TWITTER

Comparison of Ant Colony and Bee Colony Optimization for Spam Host Detection

Removing Web Spam Links from Search Engine Results

Adversarial Information Retrieval on the Web (AIRWeb 2007)

Revealing the Relationship Network Behind Link Spam

LOW COST PAGE QUALITY FACTORS TO DETECT WEB SPAM

Spam Host Detection Using Ant Colony Optimization

Removing Web Spam Links from Search Engine Results

Web Spam Taxonomy. Abstract. 1 Introduction. 2 Definition. Hector Garcia-Molina Computer Science Department Stanford University hector@cs.stanford.

Spam Filtering using Signed and Trust Reputation Management

Search engine optimization: Black hat Cloaking Detection technique

Extracting Link Spam using Biased Random Walks From Spam Seed Sets

IMPROVING SPAM FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

Filtering Junk Mail with A Maximum Entropy Model

A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE

Spam Double-Funnel: Connecting Web Spammers with Advertisers

Detecting Linkedin Spammers and its Spam Nets

Identifying Link Farm Spam Pages

Analysis of Web Archives. Vinay Goel Senior Data Engineer

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

An Efficient Spam Filtering Techniques for Account

Spam Filtering in Twitter using Sender-Receiver Relationship

Combating Web Spam with TrustRank

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

WE DEFINE spam as an message that is unwanted basically

Fraudulent Support Telephone Number Identification Based on Co-occurrence Information on the Web

Link-Based Similarity Search to Fight Web Spam

Naive Bayes Spam Filtering Using Word-Position-Based Attributes

Website Standards Association. Business Website Search Engine Optimization

Spam Filter Optimality Based on Signal Detection Theory

A Collaborative Approach to Anti-Spam

Best Website Design March 2010

Naïve Bayesian Anti-spam Filtering Technique for Malay Language

On Attacking Statistical Spam Filters

Security and Trust in social media networks

Impact of Feature Selection Technique on Classification

Internet Marketing Institute Delhi Mobile No.: DIMI. Internet Marketing Institute Delhi (DIMI)

Predicting Flight Delays

Combating Web Spam with TrustRank

Singlefin. protection services. Spam Filtering. Building a More @ By Lawrence Didsbury, MCSE, MASE

Efficient Spam Filtering using Adaptive Ontology

A Quantitative Study of Forum Spamming Using Context-based Analysis

Increasing the Accuracy of a Spam-Detecting Artificial Immune System

Spam Filtering using Naïve Bayesian Classification

Identifying Best Bet Web Search Results by Mining Past User Behavior

Online terminologie 1. % Exit The percentage of users who exit from a page. Active Time / Engagement Time

On Attacking Statistical Spam Filters

A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM FILTERING 1 2

Spam Filtering with Naive Bayesian Classification

Fraud Detection in Electronic Auction

Real Time Traffic Monitoring With Bayesian Belief Networks

Towards better accuracy for Spam predictions

SEO Best Practices Checklist

Detecting Arabic Cloaking Web Pages Using Hybrid Techniques

A Three-Way Decision Approach to Spam Filtering

Abstract. Find out if your mortgage rate is too high, NOW. Free Search

BAYESIAN BASED COMMENT SPAM DEFENDING TOOL *

Chapter 6. Attracting Buyers with Search, Semantic, and Recommendation Technology

Spammer and Hacker, Two Old Friends

Worst Practices in. Search Engine Optimization. contributed articles

T : Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari :

Optimize Your Content

The Impact of Feature Selection on Signature-Driven Spam Detection

EVILSEED: A Guided Approach to Finding Malicious Web Pages

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing Classifier

Toward Spam 2.0: An Evaluation of Web 2.0 Anti-Spam Methods

Spam Filtering Methods for Filtering


77% 77% 42 Good Signals. 16 Issues Found. Keyword. Landing Page Audit. credit. discover.com. Put the important stuff above the fold.

Search and Information Retrieval

Investigation of Support Vector Machines for Classification

Non-Parametric Spam Filtering based on knn and LSA

Search Engine Optimization Glossary

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

Detecting Spam Using Spam Word Associations

St. Bernard Managed Protection Services. White Paper. Spam Filtering Building a More Accurate Filter. By Lawrence Didsbury, MCSE, MASE

Content Filters A WORD TO THE WISE WHITE PAPER BY LAURA ATKINS, CO- FOUNDER

Controlling Spam s at the Routers

Immunity from spam: an analysis of an artificial immune system for junk detection

Fuzzy Logic for Spam Deduction

MEASURING AND FINGERPRINTING CLICK-SPAM IN AD NETWORKS

Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information

How does the performance of spam filters change with decreasing message length? Adrian Scoică (as2270), Girton College

DIGITAL MARKETING BASICS: SEO

Nullification test collections for web spam and SEO

Bayesian Learning Cleansing. In its original meaning, spam was associated with a canned meat from

The Splog Detection Task and A Solution Based on Temporal and Link Properties

Methods for Web Spam Filtering

Use of Ethical SEO Methodologies to Achieve Top Rankings in Top Search Engines

Hoodwinking Spam Filters

Differential Voting in Case Based Spam Filtering

Spam Filtering based on Naive Bayes Classification. Tianhao Sun

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Transcription:

Detecting Spam Web Pages through Content Analysis Alex Ntoulas 1, Marc Najork 2, Mark Manasse 2, Dennis Fetterly 2 1 UCLA Computer Science Department 2 Microsoft Research, Silicon Valley

Web Spam: raison d etre E-commerce is rapidly growing Projected to $329 billion by 2010 More traffic more money Large fraction of traffic from Search Engines Increase Search Engine referrals: Place ads Provide genuinely better content Create Web spam 19 June 2006 WWW 2006 2

Web Spam (you know it when you see it) 19 June 2006 WWW 2006 3

Defining Web Spam Spam Web page: A page created for the sole purpose of attracting search engine referrals (to this page or some other target page) Ultimately a judgment call Some web pages are borderline cases 19 June 2006 WWW 2006 4

Why Web Spam is Bad Bad for users Makes it harder to satisfy information need Leads to frustrating search experience Bad for search engines Wastes bandwidth, CPU cycles, storage space Pollutes corpus (infinite number of spam pages!) Distorts ranking of results 19 June 2006 WWW 2006 5

How pervasive is Web Spam? Dataset Real-Web data from the MSNBot crawler Collected during August 2004 Processed only MIME types text/html text/plain 105,484,446 Web pages in total 19 June 2006 WWW 2006 6

Spam per Top-level Domain 95% confidence 19 June 2006 WWW 2006 7

Spam per Language 95% confidence 19 June 2006 WWW 2006 8

Detecting Web Spam (1) We report results for English pages only (analysis is similar for other languages) ~55 million English Web pages in our collection Manual inspection of a random sample of the English pages Sample size: 17,168 Web pages 2,364 (13.8%) spam 14,804 (86.2%) non-spam 19 June 2006 WWW 2006 9

Detecting Web Spam (2) Spam detection: A classification problem Given salient features of a Web page, decide whether the page is spam Which salient features? Need to understand spamming techniques to decide on features Finding right features is alchemy, not science We focus on features that Are fast to compute Are local (i.e. examine a page in isolation) 19 June 2006 WWW 2006 10

Number of Words in <title> 110 words 19 June 2006 WWW 2006 11

Distribution of Word-counts in <title> Spam more likely in pages with more words in title 19 June 2006 WWW 2006 12

Visible Content of a Page 19 June 2006 WWW 2006 13

Visible Content of a Page Visible Content = size(in bytes)of visible words size(in bytes)of the page 19 June 2006 WWW 2006 14

Distribution of Visible-content fractions Spam not likely in pages with little visible content 19 June 2006 WWW 2006 15

Compressibility of a Page 19 June 2006 WWW 2006 16

zipratio of a page size(in bytes)of uncompressed non - HTML text zipratio = size(in bytes)of compressed non - HTML text 19 June 2006 WWW 2006 17

Distribution of zipratios Spam more likely in pages with high zipratio 19 June 2006 WWW 2006 18

Obscure Content very rare words 19 June 2006 WWW 2006 19

Independence Likelihood Model Consider all n-grams w i+1 w i+n of a Web page, with kn-grams in total Probability that n-gram occurs in collection: P( w i + 1... w i + n) = number of occurences of n - gram total number of n - grams Independence likelihood model IndepLH = 1 1 k k i= 0 log P( w i w i + 1... + n) Low IndepLH high probability of page s existence 19 June 2006 WWW 2006 20

Distribution of 3-gram Likelihoods (Independence) Pages with high & low likelihoods are spam Longer n-grams seem to help 19 June 2006 WWW 2006 21

Conditional Likelihood Model Consider all n-grams w i+1 w i+n of a Web page, with kn-grams in total Probability that n-gram w i+1 w i+n occurs, given that (n-1)-gram w i+1 w i+n-1 occurs P( w Conditional likelihood model CondLH w =... w i+ n i+ 1 i+ n 1 = 1 1 k k i= 0 ) log P( w P( wi + 1... wi + n P( w... w i+ n i+ 1... wi + n 1) Low CondLH high probability of page s existence i+ 1 w 1 w i+ n 1 i+ n ) ) 19 June 2006 WWW 2006 22

Distribution of 3-gram Likelihoods (Conditional) Pages with high & low likelihoods are spam Longer n-grams seem to help 19 June 2006 WWW 2006 23

Spam Detection as a Classification Problem Use the previously presented metrics as features for a classifier Use the 17,168 manually tagged pages to build a classifier We show results for a decision-tree C4.5 (other classifiers performed similarly) 19 June 2006 WWW 2006 24

Putting it all together size of the page static rank link depth number of dots/dashes/digits in hostname hostname length hostname domain number of words in the page number of words in the title fraction of anchor text average length of the words fraction of visible content fraction of top- 100,200,500,1000 words in the text fraction of text in top-100,200,500,1000 words compression ratio (zipratio) occurrence of the phrase Privacy Policy occurrence of the phrase Privacy Statement occurrence of the phrase Customer Service occurrence of the word Disclaimer occurrence of the word Copyright occurrence of the word Fax occurrence of the word Phone likelihood (independence) 1,2,3,4,5-grams likelihood (conditional probability) 2,3,4,5-grams 19 June 2006 WWW 2006 25

A Portion of the Induced Decision Tree lhoodindep5gram <= 13.73377 lhoodindep5gram > 13.73377 fractop1kintext <= 0.062 fractop1kintext > 0.062 non-spam fractexttop500 <= 0.646 fractexttop500 > 0.646 fractop500intext <= 0.154 fractop500intext > 0.154 spam 19 June 2006 WWW 2006 26

Evaluation of the Classifier We evaluated the classifier using 10-fold cross validation (and random splits) Training samples: 17,168 2,364 spam 14,804 non-spam Correctly classified: 16,642 (96.93 %) Incorrectly classified: 526 ( 3.07 %) 19 June 2006 WWW 2006 27

Confusion Matrix & Precision/Recall classified as spam non-spam spam 1,940 424 non-spam 366 14,440 class recall precision spam 82.1% 84.2% non-spam 97.5% 97.1% 19 June 2006 WWW 2006 28

Bagging & Boosting class recall precision spam 84.4% 91.2% non-spam 98.7% 97.5% After bagging class recall precision spam 86.2% 91.1% non-spam 98.7% 97.8% After boosting 19 June 2006 WWW 2006 29

Related Work Link spam Z. Gyongyi, H. Garcia-Molina and J. Pedersen. Combating Web Spam with TrustRank. [VLDB 2004] B. Wu and B. Davison. Identifying Link Farm Spam Pages. [WWW 2005] B. Wu, V. Goel and B. Davison. Topical Trustrank: Using Topicality to Combat Web Spam. [WWW 2006] A.L. da Costa Carvalho, P.A. Chirita, E.S. de Moura, P. Calado, W. Nejdl. Site Level Noise Removal for Search Engines. [WWW 2006] R. Baeza-Yates, C. Castillo and V. Lopez. PageRank Increase under Different Collusion Topologies. [AIRWeb 2005] Content spam G. Mishne, D. Carmel and R. Lempel. Blocking Blog Spam with Language Model Disagreement.[AIRWeb 2005] D. Fetterly, M. Manasse and M. Najork. Spam, Damn Spam, and Statistics. [WebDB 2004] D. Fetterly, M. Manasse and M. Najork. Detecting Phrase-Level Duplication on the World Wide Web. [SIGIR 2005] Cloaking Z. Gyongyi and H. Garcia-Molina. Web Spam Taxonomy. [AIRWeb 2005] B. Wu and B. Davison. Cloaking and Redirection: a preliminary study.[airweb 2005] e-mail spam J. Hidalgo. Evaluating cost-sensitive Unsolicited Bulk Email categorization. [SAC 2002] M. Sahami, S. Dumais, D. Heckerman and E. Horvitz. A Bayesian Approach to Filtering Junk E-Mail. [AAAI 1998] 19 June 2006 WWW 2006 30

Conclusion Studied properties of the spam pages (per top-level domain and language) Implemented and evaluated a variety of spam detection techniques Combined the techniques into a decision-tree classifier We can identify ~86% of all spam with ~91% accuracy 19 June 2006 WWW 2006 31

Thank you