Search engines: ranking algorithms

Similar documents
Part 1: Link Analysis & Page Rank

Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time

The PageRank Citation Ranking: Bring Order to the Web

DATA ANALYSIS II. Matrix Algorithms

Perron vector Optimization applied to search engines

Ranking on Data Manifolds

Enhancing the Ranking of a Web Page in the Ocean of Data

Graph Algorithms and Graph Databases. Dr. Daisy Zhe Wang CISE Department University of Florida August 27th 2014

Department of Cognitive Sciences University of California, Irvine 1

Corso di Biblioteche Digitali

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Practical Graph Mining with R. 5. Link Analysis

Top Online Activities (Jupiter Communications, 2000) CS276A Text Information Retrieval, Mining, and Exploitation

International Journal of Engineering Research-Online A Peer Reviewed International Journal Articles are freely available online:

Online edition (c)2009 Cambridge UP

Web Graph Analyzer Tool

Versatile weighting strategies for a citation-based research evaluation model

1 o Semestre 2007/2008

Google: data structures and algorithms

Google s PageRank: The Math Behind the Search Engine

Link Analysis. Chapter PageRank

Walk-Based Centrality and Communicability Measures for Network Analysis

Search engine ranking

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

ELA

Web Link Analysis: Estimating a Document s Importance from its Context

The Mathematics Behind Google s PageRank

THE $25,000,000,000 EIGENVECTOR THE LINEAR ALGEBRA BEHIND GOOGLE

How To Understand The Network Of A Network

The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity

Part 2: Community Detection

arxiv: v3 [math.na] 1 Oct 2012

NEW VERSION OF DECISION SUPPORT SYSTEM FOR EVALUATING TAKEOVER BIDS IN PRIVATIZATION OF THE PUBLIC ENTERPRISES AND SERVICES

Web Search. 2 o Semestre 2012/2013

RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS

Challenges in Running a Commercial Web Search Engine. Amit Singhal

NETZCOPE - a tool to analyze and display complex R&D collaboration networks

The world s largest matrix computation. (This chapter is out of date and needs a major overhaul.)

Structure Preserving Model Reduction for Logistic Networks

A Graph theoretical approach to Network Vulnerability Analysis and Countermeasures

Master s Theory Exam Spring 2006

A Spectral Clustering Approach to Validating Sensors via Their Peers in Distributed Sensor Networks

Social Network Mining

Soft Clustering with Projections: PCA, ICA, and Laplacian

Big Data Analytics. Lucas Rego Drumond

Algorithms for representing network centrality, groups and density and clustered graph representation

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

Power Rankings: Math for March Madness

Search Engine Optimisation (SEO) Factsheet

Removing Web Spam Links from Search Engine Results

An Introduction to the Use of Bayesian Network to Analyze Gene Expression Data

Monte Carlo methods in PageRank computation: When one iteration is sufficient

Introduction. Why does it matter how Google identifies these SEO black hat tactics?

Healthcare Analytics. Aryya Gangopadhyay UMBC

Social Network Analysis

Dynamical Clustering of Personalized Web Search Results

Tutorial, IEEE SERVICE 2014 Anchorage, Alaska

A COMPREHENSIVE REVIEW ON SEARCH ENGINE OPTIMIZATION

2. THE WEB FRONTIER. n + 1, and defining an augmented system: x. x x n+1 (3)

Anna Paananen. Comparative Analysis of Yandex and Google Search Engines

Identifying Link Farm Spam Pages

BOOLEAN CONSENSUS FOR SOCIETIES OF ROBOTS

Rank one SVD: un algorithm pour la visualisation d une matrice non négative

Dynamics of information spread on networks. Kristina Lerman USC Information Sciences Institute

Search Engine Optimisation

[Ramit Solutions] SEO SMO- SEM - PPC. [Internet / Online Marketing Concepts] SEO Training Concepts SEO TEAM Ramit Solutions

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Search Engines. Stephen Shaw 18th of February, Netsoc

Chapter 14. Link Analysis and Web Search Searching the Web: The Problem of Ranking

Combining BibTeX and PageRank to Classify WWW Pages

Exploiting DNS Traffic to Rank Internet Domains

The Perron-Frobenius theorem.

Deliverable D7.2: Dissemination Plan

Search Engine Optimization: Getting your website on the front page of search results

Director of Marketing, Cote Family Companies

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu

Social Media Mining. Graph Essentials

Huy Nguyen. Department of Electrical Engineering and Computer Science May 22, 2009 Certified by / VI-

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Efficient Identification of Starters and Followers in Social Media

Step 3: Go to Column C. Use the function AVERAGE to calculate the mean values of n = 5. Column C is the column of the means.

Doc ID: URCHINB-001 (3/30/05)

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

LECTURE 4. Last time: Lecture outline

Content Delivery Networks. Shaxun Chen April 21, 2009

Analysis of a Production/Inventory System with Multiple Retailers

Incorporating Participant Reputation in Community-driven Question Answering Systems

10. Search Engine Marketing

Google Analytics and Google Analytics Premium: limits and quotas

A Google-like Model of Road Network Dynamics and its Application to Regulation and Control

Mining Social-Network Graphs

graphical Systems for Website Design

PARTITIONING DATA TO INCREASE WEBSITE VISIBILITY ON SEARCH ENGINE

Conductance, the Normalized Laplacian, and Cheeger s Inequality

Healthcare data analytics. Da-Wei Wang Institute of Information Science

Spidering and Filtering Web Pages for Vertical Search Engines

PageRank Citation Ranking

Social Media Mining. Network Measures

CS54100: Database Systems

Transcription:

Search engines: ranking algorithms Gianna M. Del Corso Dipartimento di Informatica, Università di Pisa, Italy ESP, 25 Marzo 2015

1 Statistics 2 Search Engines Ranking Algorithms HITS

Web Analytics Estimated size: up to 45 billion pages Indexed pages: at least 2 billion pages Average size of a web page is 1.3 MB... Up 35% in last year Petabytes (10 15 bytes) for storing just the indexed pages!!

Web Analytics Web pages are dynamical objects Most of web pages content changes daily Average lifespan of a page is around 10 days. A huge quantity of data!! Find a document without search engines it s like looking for a needle in a haystack.

Statistics Market shares https://www.quantcast.com/top-sites

Statistics Market shares https://www.quantcast.com/top-sites Sheet1 Google 67 Microsoft 17.9 Yahoo! 11.3 Ask 2.7 AOL 1.3 Google Microsoft Yahoo! Ask AOL [comscore, July 2013]

Google s Numbers [NASDAQ:GOOG]

Google s Numbers Sheet1 year Ad Revenues Other Rev websites network other 2005 99% 1.20% Revenue Sources 55 44 1 2006 99% 1.06% 60 39 1 2007 99% 1.09% 64 35 1 1.2 2008 97% 3.06% 66 31 3 99% 2009 99% 97% 99% 97% 3.22% 97% 96% 67 96% 30 3 1 95% 2010 96% 3.70% 66 30 4 2011 96% 3.62% 69 27 4 0.8 2012 95% 5.11% 68 27 5 Ad Revenues 0.6 Other Rev 0.4 0.2 1.20% 1.06% 1.09% 3.06% 3.22% 3.70% 3.62% 5.11% 0 2005 2006 2007 2008 2009 2010 2011 2012 [investor.google.com/financial/tables.html]

Search Engines The dream of search engines: catalog everything published on the web Web content fruition by query Rank by relevance to the query

The Anatomy of a Search Engine Page Repository Spiders Indexer Collection Analysis Queries Query Engine Results Ranking Text Structure Utility Spider Control Indexes

Ranking Algorithms... At the very beginning (mid 90s) The order of pages returned by the engine for a given query was controlled by the owner of the page, who could increase the ranking score increasing the frequency of certain terms, font color, font size etc. SPAM

Ranking Algorithms... At the very beginning (mid 90s) The order of pages returned by the engine for a given query was controlled by the owner of the page, who could increase the ranking score increasing the frequency of certain terms, font color, font size etc. SPAM

Ranking Algorithms Modern engines In 1998 two similar ideas HITS (Jon Kleinberg) (S. Brin and L. Page) The importance of a page should not depend on the owner of the page!

Ranking Algorithms Basic intuition Look at the linkage structure p q Adding a link from page p to page q is a sort of vote from the author of p to q. Idea If the content of a page is interesting it will receive many votes, i.e. it will have many inlinks.

Ranking Algorithms Web Graph The Web is seen as a directed graph: Each page is a node Each hyperlink is an edge G = 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 1 1 0 1 0

Ranking Algorithms Web Graph The importance of a web page is determined by the structure of the web graph At first approximation: Contents of pages is not used! Aim: The owner of a page cannot boost the importance of its pages!

HITS An example query=the best automobile makers in the last 4 years Intention get back a list of top car brands and their official web sites Textual ranking pages returned might be very different Top companies might not even use the terms automobile makers.they might use the term car manufacturer instead

HITS HITS Each page has associated two scores a i authority score h i hub score A page is a good authority if it is linked by many good hubs A page is a good hub if it is linked by many good authorities To every page p we associate two scores: { ap = i I(p) h i h p = i O(p) a i { a (i) = G T h (i 1) h (i) = Ga (i) Hub pages advertise authoritative pages!

HITS HITS

HITS HITS G T G and GG T are non negative G T G and GG T are semipositive defined real nonnegative eigenvalues { h (i) = GG T h (i 1) a (i) = G T Ga (i 1) h is the dominant eigenvector of GG T a is the dominant eigenvector of G T G For Perron-Frobenius Theorem the eigenvector associated to the dominant eigenvalue has nonnegative entries!

HITS HITS At query time Find relevant pages Construct the graph starting with those bunch of pages Compute dominant eigenvector of G T G Sort the dominant eigenvalue and return the pages in accordance with the ordering induced

Google s Is a static ranking schema At query time relevant pages are retrieved neighbours The ranking of pages is based on the of pages which is precomputed

PR home doc3 doc1 doc2 doc3 home doc1 1. doc 3 2. doc 1 3. doc 2 doc2

A page is important if is voted by important pages The vote is expressed by a link

A page distribute its importance equally to its neighbors The importance of a page is the sum of the importances of pages which points to it π j = i I(j) π i outdeg(i) G = [ 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 1 1 0 1 0 ] P = [ 0 0 1 0 0 0 0 0 0 0 0 1/2 0 0 1/2 1/2 0 1/2 0 0 1/3 1/3 0 1/3 0 ]

It is called Random surfer model The web surfer jumps from page to page following hyperlinks. The probability of jumping to a page depends of the number of links in that page. Starting with a vector z (0), compute z (k) j = i I(j) z (k 1) p ij, p ij = 1 outdeg(i) Equivalent to compute the stationary distribution of the Markov chain with transition matrix P. Equivalent to compute the left eigenvector of P corresponding to eigenvalue 1.

It is called Random surfer model The web surfer jumps from page to page following hyperlinks. The probability of jumping to a page depends of the number of links in that page. Starting with a vector z (0), compute z (k) j = i I(j) z (k 1) p ij, p ij = 1 outdeg(i) Equivalent to compute the stationary distribution of the Markov chain with transition matrix P. Equivalent to compute the left eigenvector of P corresponding to eigenvalue 1.

It is called Random surfer model The web surfer jumps from page to page following hyperlinks. The probability of jumping to a page depends of the number of links in that page. Starting with a vector z (0), compute z (k) j = i I(j) z (k 1) p ij, p ij = 1 outdeg(i) Equivalent to compute the stationary distribution of the Markov chain with transition matrix P. Equivalent to compute the left eigenvector of P corresponding to eigenvalue 1.

Two problems: Presence of dangling nodes P cannot be stochastic P might not have the eigenvalue 1 Presence of cycles P can be reducible: the random surfer is trapped Uniqueness of eigenvalue 1 not guaranteed

Perron-Frobenius Theorem Let A 0 be an irreducible matrix there exists an eigenvector equal to the spectral radius of A, with algebraic multiplicity 1 there exists an eigenvector x > 0 such that Ax = ρ(a)x. if A > 0, then ρ(a) is the unique eigenvalue with maximum modulo.

Presence of dangling nodes P = P + dv T { 1 if page i is dangling d i = 0 otherwise v i = 1/n; P = 0 0 1 0 0 0 0 0 0 0 0 1/2 0 0 1/2 1/2 0 1/2 0 0 1/3 1/3 0 1/3 0 P = 0 0 1 0 0 1/5 1/5 1/5 1/5 1/5 0 1/2 0 0 1/2 1/2 0 1/2 0 0 1/3 1/3 0 1/3 0

Presence of cycles Force irreducibility by adding artificial arcs chosen by the random surfer with small probability α. P = (1 α) P = (1 α) P + αev T, [ 0 0 1 0 0 1/5 1/5 1/5 1/5 1/5 0 1/2 0 0 1/2 1/2 0 1/2 0 0 1/3 1/3 0 1/3 0 Typical values of α is 0.15. ] + α [ 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 1/5 ].

: a toy eample P = 0.03 0.03 0.88 0.03 0.03 0.2 0.2 0.2 0.2 0.2 0.03 0.45 0.03 0.03 0.45 0.45 0.03 0.45 0.03 0.03 0.31 0.31 0.03 0.31 0.03 Computing the largest eigenvector of P we get z T [0.38, 0.52, 0.59, 0.27, 0.40], which corresponds to the following order of importance of pages [3, 2, 5, 1, 4].

P is sparse, P is full. The vector y = P T x, for x 0, such that x 1 = 1 can be computed as follows y = (1 α)p T x γ = x y, y = y + γv. The eigenvalues of P and P are interrelated: λ 1 ( P) = λ 1 ( P) = 1, λj ( P) = (1 α) λj ( P), j > 1. For the web graph λ 2 ( P) = (1 α).