An Analysis of Factors Used in Search Engine Ranking

Similar documents

An Analysis of Factors Used in Search Engine Ranking

An Analysis of Factors Used in Search Engine Ranking

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

1 o Semestre 2007/2008

Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects

International Journal of Engineering Research-Online A Peer Reviewed International Journal Articles are freely available online:

Search and Information Retrieval

Content-Based Recommendation

IJREAS Volume 2, Issue 2 (February 2012) ISSN: STUDY OF SEARCH ENGINE OPTIMIZATION ABSTRACT

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Incorporating Window-Based Passage-Level Evidence in Document Retrieval

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Search engine ranking

Web Search. 2 o Semestre 2012/2013

Data Mining - Evaluation of Classifiers

RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS

Removing Web Spam Links from Search Engine Results

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

PSG College of Technology, Coimbatore Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

Social Business Intelligence Text Search System

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Programming Exercise 3: Multi-class Classification and Neural Networks

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Active Learning SVM for Blogs recommendation

Enhancing the Ranking of a Web Page in the Ocean of Data

Software Engineering 4C03

Supervised Learning (Big Data Analytics)

Removing Web Spam Links from Search Engine Results

Machine Learning Final Project Spam Filtering

Forecasting stock markets with Twitter

LCs for Binary Classification

Introduction to Information Retrieval

Finding Advertising Keywords on Web Pages. Contextual Ads 101

Dynamical Clustering of Personalized Web Search Results

Linear Classification. Volker Tresp Summer 2015

Make search become the internal function of Internet

Optimize Your Content

An Introduction to Machine Learning and Natural Language Processing Tools

Data Mining Algorithms Part 1. Dejan Sarka

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

SEO BASICS. March 20, 2015

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

STA 4273H: Statistical Machine Learning

Corso di Biblioteche Digitali

Improving Search Engines via Classification

A COMPREHENSIVE REVIEW ON SEARCH ENGINE OPTIMIZATION

Machine Learning in Spam Filtering

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network

Semantic Search in Portals using Ontologies

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Homework 2. Page 154: Exercise Page 145: Exercise 8.3 Page 150: Exercise 8.9

Practical Graph Mining with R. 5. Link Analysis

Big Data Analytics CSCI 4030

Disambiguating Implicit Temporal Queries by Clustering Top Relevant Dates in Web Snippets

A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE

Lecture 8 February 4

Social Media Mining. Data Mining Essentials

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Folksonomies versus Automatic Keyword Extraction: An Empirical Study

WHAT IS THE TEMPORAL VALUE OF WEB SNIPPETS?

BIG DATA What it is and how to use?

Bootstrapping Big Data

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

WE DEFINE spam as an message that is unwanted basically

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

Knowledge Discovery from patents using KMX Text Analytics

HydroDesktop Overview

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller

Data Mining Methods: Applications for Institutional Research

Pixels Description of scene contents. Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT) Banksy, 2006

Classification of Bad Accounts in Credit Card Industry

Measuring the Utilization of On-Page Search Engine Optimization in Selected Domain

The 4 Mistakes Content Marketers Make

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.

Parallel Data Selection Based on Neurodynamic Optimization in the Era of Big Data

Machine Learning. CUNY Graduate Center, Spring Professor Liang Huang.

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Identifying Best Bet Web Search Results by Mining Past User Behavior

D-optimal plans in observational studies

Data Mining Practical Machine Learning Tools and Techniques

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2

Projektgruppe. Categorization of text documents via classification

An ontology-based approach for semantic ranking of the web search engines results

Online Marketing Optimization Essentials

Simple and efficient online algorithms for real world applications

Neural Networks and Support Vector Machines

Predict Influencers in the Social Network

Beating the MLB Moneyline

CERN Search Engine Status CERN IT-OIS

Prediction of Stock Performance Using Analytical Techniques

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

Trust and Reputation Management

Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News

Decoding DNS data. Using DNS traffic analysis to identify cyber security threats, server misconfigurations and software bugs

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

Transcription:

An Analysis of Factors Used in Search Engine Ranking Albert Bifet 1 Carlos Castillo 2 Paul-Alexandru Chirita 3 Ingmar Weber 4 1 Technical University of Catalonia 2 University of Chile 3 L3S Research Center 4 Max-Planck-Institute for Computer Science Adversarial Information Retrieval Workshop AIRWeb 14th IWWW Conference Tokyo

Outline 1 Motivation 2 Retrieval Information Features Estimating a Ranking Function 3 Experimental Results Architecture of the system Main Results

Outline 1 Motivation 2 Retrieval Information Features Estimating a Ranking Function 3 Experimental Results Architecture of the system Main Results

Search Engine Ranking. Un buscador es alguien que busca, no necesariamente alguien que encuentra. JORGE BUCAY Why it s Search Engine Ranking so important? 88% of the time we use it, given a new task E-commerce depends on a high ranking How we can get a good ranking? Ranking depends on certain features Don t forget : CONTENT

An Analysis of Factors Used in Search Engine Ranking Influence of different page features on the ranking of search engine results: We use Google Binary classification problem Linear and non-linear methods Training set and a test set

Outline 1 Motivation 2 Retrieval Information Features Estimating a Ranking Function 3 Experimental Results Architecture of the system Main Results

IR Model Definition An information retrieval model is a quadruple [D, Q, F, R(q i, d j )] where 1 D is a set composed of logical representations for the documents in the collection 2 Q is a set composed of logical representations for the user information needs (Queries) 3 F is a framework for modeling documents representations, queries, and their relationships 4 R(q i, d j ) is a ranking function wich associates a real number with a query q i Q and a document representation d j D

IR Model Definition Let t be the number of index terms in the system and k i be a generic index term. 1 K = {k 1,..., k t } is the set of all index terms 2 A weight w i,j is associated with each k i of a document d j 3 With the document d j is associated an index term vector d j = (w 1,j, w 2,j,..., w t,j ) Example 1 Document 1 =" My tailor is rich", Document 2 ="Your chair is red" 2 K = {"tailor","chair","rich","red" } 3 d1 = (1, 0, 1, 0), d 2 = (0, 1, 0, 1)

Vector Model Definition For the vector model, the weight w i,j associated with a pair (k i, d j ) is positive and non-binary. The index terms in the query are also weighted 1 Let w i,q be the weight associated with the pair [k i, q] 2 We define the query vector q as q = (w 1,q, w 2,q,..., w t,q ) 3 With the document d j is associated an index term vector d j = (w 1,j, w 2,j,..., w t,j ) 4 Sim(d j, q) = d j q d j q = i w i,j w i,q i w 2 i,j i w 2 i,q

Vector Model Example 1 Document 1 ="My tailor is rich" 2 Document 2 ="Your chair is red" 3 K = { "tailor", "chair", "rich" } 4 Query= "rich" Example 1 q = (0, 0, 1) 2 d1 = (1, 0, 2), d 2 = (0, 1, 0) 5, 3 d Sim(d 1, q) = 1 q d = 2 1 q Sim(d 2, q) = 0

Vector Model Definition 1 Let N be the total number of documents in the system 2 Let n i be the number of documents in wich k i appears 3 The best known term-weighting schemes use weights wich are given by w i,j = f i,j idf i 4 The inverse document frequency is idf i = log N n i

PageRank The Anatomy of a Search Engine Sergey Brin and Lawrence Page Definition We assume page A has pages T 1...T n which point to it. d is a damping factor between 0 and 1. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = (1 d) + d( PR(T 1) C(T 1 ) +... + PR(T n) C(T n ) ) Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages PageRanks will be one.

PageRank

Features Outline 1 Motivation 2 Retrieval Information Features Estimating a Ranking Function 3 Experimental Results Architecture of the system Main Results

Features Content Features Content features, query independent: FNEN Fraction of terms in the documents which can not be found in an English dictionary NBOD Number of bytes of the original document RFFT Relative frequency of the more frequent term ATLE Average term length Content features, query dependent: SIMT Similarity of the term to the document. AMQT Average matches of the query terms FATT Anchor text term frequency

Features Formatting, Link and Metadata Features Formatting features, query-dependent. TMKY Term in the meta keywords or description (N) Link features, query-independent: ILNK Number of pages linking to a page, in-degree approximated using Google API link: queries PRNK PageRank of the page, or the approximation of the PageRank in a 0-10 scale obtained from Google s toolbar. Metadata features: TURL Term is in the page s URL or not. TDIR Term is listed in a web directory or not.

Estimating a Ranking Function Outline 1 Motivation 2 Retrieval Information Features Estimating a Ranking Function 3 Experimental Results Architecture of the system Main Results

Estimating a Ranking Function Data Set Arts : Albertinelli, Bacchiacca, Botticelli States : Arizona, Arkansas, Connecticut Spam : buy cds, buy dvds, cheap software Multiple : anova bootstrap feature missing principal squared, analysis frequent likelihood misclassification pruning statistical, Training Set (7 queries): learn a linear scoring function or a decision tree. Validation Set (2 queries): used for feature selection and pruning the decision tree. Test Set (3 queries): To estimate the generalization error of our ranking function,

Estimating a Ranking Function Binary classification problem Let q be a query, and u, v be feature vector of pages. Let u < q v represent the ordering returned by the ranking function of a search engine for the given query. We want to find f such that f (u) < f (v) whenever u < v. If we assume that f is linear, then there exists a vector w such that f (u) = w u. Then f (u) < f (v) w u < w v w (v u) > 0.

Estimating a Ranking Function Logistic Regression and SVM Logistic regression models the posterior probabilities of the classes. In the case of only two classes and + the model has the form P(class = + X = x) log P(class = X = x) = β 0 + w x (1) Support Vector Machines typically use linear decision boundaries in a transformed feature space. In this higher dimension space data becomes separable by hyperplanes

Estimating a Ranking Function Binary classification trees f might not be linear Search engine might use several layers of índices A classification tree is built through binary recursive partitioning. Pruning the tree leads to a better performance on general data

Architecture of the system Outline 1 Motivation 2 Retrieval Information Features Estimating a Ranking Function 3 Experimental Results Architecture of the system Main Results

Architecture of the system Architecture. Downloader: software that executes a query and downloads the returned pages using the data set queries Feature Extractor: software that computes the features of the pages downloaded. Analyzer: software that analyzes the features of the returned pages and estimates a function

Main Results Outline 1 Motivation 2 Retrieval Information Features Estimating a Ranking Function 3 Experimental Results Architecture of the system Main Results

Main Results Precision values obtained using only individual features. Feature Arts States Spam Multiple NBOD (-)53.8% 54.4% 51.1% 50.9% FNEN (-)59.4% 53.5% 51.6% (-)59.8% RFFT 52.1% (-)54.3% 50.0% 53.9% ATLE (-)54.2% 54.0% 50.0% 54.4% FATT 56.2% 50.3% 53.0% 51.8% AMQT 55.4% 50.5% 52.1% 56.9% SIMT (N) 56.5% (-)52.0% 52.7% 59.0% SIMT 55.4% (-)50.9% 52.6% 69.7% TMKY (N) 51.4% 51.4% 54.5% 50.0% TMKY 53.0% 51.4% 55.1% 50.0% ILNK 58.0% 66.3% 57.0% 53.9% PRNK 58.7% 60.6% 55.0% 57.2% TDIR 52.3% 53.5% 54.3% 51.9%

Main Results Best precision achieved on all. Table: Best precision achieved on all, shifted and top pairs. We include the performance on the test data as well as on the whole data set, including training, validation and test sets. % all % shifted % top pairs correct pairs correct pairs correct Dataset Test All Test All Test All Best model Arts 63.7% 61.8% 69.1% 66.4% 47.6% 48.0% Log. regr., strongest 3 features States 64.6% 66.3% 73.2% 73.8% 97.6% 98.5% Class. tree, only ILINK feature Spam 62.5% 59.5% 70.5% 62.1% 98.2% 74.8% Log. regr., strongest 10 features Multiple 67.5% 70.9% 78.1% 81.3% 81.0% 87.0% Log. reg., strongest 3 features

Main Results Relevant Features hidden The query logs, which Google obtains through its toolbar. The age of the incoming links and other information related to web link dynamics. The rate of change at which a website changes, obtained by repeated web crawls. The true number of ingoing links, as Google s link:www.abc.com only gives a lower bound. The true PageRank used by Google, as the one displayed in its toolbar is only an approximation, and furthermore, seems to be too strongly correlated to the number of in-links.

Summary Influence of different page features on the ranking of search engine results: We use Google Binary classification problem Training set and a test set Ranking only according to the strongest feature for a category gives is able to predict the order in which any pair of pages will appear in the results with a precision of between 57% and 70%.