The Mathematics Behind Google s PageRank



Similar documents
Google s PageRank: The Math Behind the Search Engine

Enhancing the Ranking of a Web Page in the Ocean of Data

THE $25,000,000,000 EIGENVECTOR THE LINEAR ALGEBRA BEHIND GOOGLE

Monte Carlo methods in PageRank computation: When one iteration is sufficient

The PageRank Citation Ranking: Bring Order to the Web

Solution of Linear Systems

Part 1: Link Analysis & Page Rank

Search engines: ranking algorithms

RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS

Web Graph Analyzer Tool

Perron vector Optimization applied to search engines

DATA ANALYSIS II. Matrix Algorithms

The world s largest matrix computation. (This chapter is out of date and needs a major overhaul.)

Operation Count; Numerical Linear Algebra

Parallel Data Selection Based on Neurodynamic Optimization in the Era of Big Data

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman

The Effectiveness of PageRank and HITS Algorithm

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix

An Improved Page Rank Algorithm based on Optimized Normalization Technique

International Journal of Engineering Research-Online A Peer Reviewed International Journal Articles are freely available online:

A Uniform Approach to Accelerated PageRank Computation

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

SOLVING LINEAR SYSTEMS

PageRank Conveniention of a Single Web Pageboard

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

A numerically adaptive implementation of the simplex method

MULTIPLE-OBJECTIVE DECISION MAKING TECHNIQUE Analytical Hierarchy Process

8 Square matrices continued: Determinants

Numerical Methods I Eigenvalue Problems

Ranking on Data Manifolds

Link Analysis. Chapter PageRank

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein

7 Gaussian Elimination and LU Factorization

Identifying Link Farm Spam Pages

Manifold Learning Examples PCA, LLE and ISOMAP

General Framework for an Iterative Solution of Ax b. Jacobi s Method

Nonlinear Iterative Partial Least Squares Method

Practical Graph Mining with R. 5. Link Analysis

Estimating PageRank Values of Wikipedia Articles using MapReduce

CS3220 Lecture Notes: QR factorization and orthogonal transformations

Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time

Anna Paananen. Comparative Analysis of Yandex and Google Search Engines

Combining BibTeX and PageRank to Classify WWW Pages

Combating Web Spam with TrustRank

NEW VERSION OF DECISION SUPPORT SYSTEM FOR EVALUATING TAKEOVER BIDS IN PRIVATIZATION OF THE PUBLIC ENTERPRISES AND SERVICES

MATH APPLIED MATRIX THEORY

CHOOSING A COLLEGE. Teacher s Guide Getting Started. Nathan N. Alexander Charlotte, NC

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

13 MATH FACTS a = The elements of a vector have a graphical interpretation, which is particularly easy to see in two or three dimensions.

Reproducing Calculations for the Analytical Hierarchy Process

x + y + z = 1 2x + 3y + 4z = 0 5x + 6y + 7z = 3

Vector and Matrix Norms

Fast Incremental and Personalized PageRank

STATISTICS AND DATA ANALYSIS IN GEOLOGY, 3rd ed. Clarificationof zonationprocedure described onpp

Reputation Management in P2P Networks: The EigenTrust Algorithm

Matrix Multiplication

ON THE COMPLEXITY OF THE GAME OF SET.

LINEAR ALGEBRA W W L CHEN

CNFSAT: Predictive Models, Dimensional Reduction, and Phase Transition

UNCOUPLING THE PERRON EIGENVECTOR PROBLEM

Google: data structures and algorithms

Numerical Matrix Analysis

Distributed R for Big Data

Web Link Analysis: Estimating a Document s Importance from its Context

Lecture 3: Finding integer solutions to systems of linear equations

Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations

Similarity and Diagonalization. Similar Matrices

CS224W Project Report: Finding Top UI/UX Design Talent on Adobe Behance

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

Math 215 HW #6 Solutions

Big Data Analytics: Optimization and Randomization

Search engine ranking

Yousef Saad University of Minnesota Computer Science and Engineering. CRM Montreal - April 30, 2008

SALEM COMMUNITY COLLEGE Carneys Point, New Jersey COURSE SYLLABUS COVER SHEET. Action Taken (Please Check One) New Course Initiated

HSL and its out-of-core solver

Soft Clustering with Projections: PCA, ICA, and Laplacian

Department of Cognitive Sciences University of California, Irvine 1

Personalized Reputation Management in P2P Networks

Combating Web Spam with TrustRank

Why? A central concept in Computer Science. Algorithms are ubiquitous.

The Ideal Class Group

2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system

by the matrix A results in a vector which is a reflection of the given

Adaptive Stable Additive Methods for Linear Algebraic Calculations

A Negative Result Concerning Explicit Matrices With The Restricted Isometry Property

Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method

Dynamical Clustering of Personalized Web Search Results

Buying and Selling Traffic: The Internet as an Advertising Medium

Online edition (c)2009 Cambridge UP

arxiv: v3 [math.na] 1 Oct 2012

P164 Tomographic Velocity Model Building Using Iterative Eigendecomposition

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Semantic Search in Portals using Ontologies

Number boards for mini mental sessions

Extracting Link Spam using Biased Random Walks From Spam Seed Sets

Continuity of the Perron Root

Time-dependent Network Algorithm for Ranking in Sports

Transcription:

The Mathematics Behind Google s PageRank Ilse Ipsen Department of Mathematics North Carolina State University Raleigh, USA Joint work with Rebecca Wills Man p.1

Two Factors Determine where Google displays a web page on the Search Engine Results Page: 1. PageRank (links) A page has high PageRank if many pages with high PageRank link to it 2. Hypertext Analysis (page contents) Text, fonts, subdivisions, location of words, contents of neighbouring pages Man p.2

PageRank An objective measure of the citation importance of a web page [Brin & Page 1998] Assigns a rank to every web page Influences the order in which Google displays search results Based on link structure of the web graph Does not depend on contents of web pages Does not depend on query Man p.3

More PageRank More Visitors Man p.4

PageRank... continues to provide the basis for all of our web search tools http://www.google.com/technology/ Links are the currency of the web Exchanging & buying of links BO (backlink obsession) Search engine optimization Man p.5

Overview Mathematical Model of Internet Computation of PageRank Is the Ranking Correct? Floating Point Arithmetic Issues Man p.6

Mathematical Model of Internet 1. Represent internet as graph 2. Represent graph as stochastic matrix 3. Make stochastic matrix more convenient = Google matrix 4. dominant eigenvector of Google matrix = PageRank Man p.7

The Internet as a Graph Link from one web page to another web page Web graph: Web pages = nodes Links = edges Man p.8

The Internet as a Graph Man p.9

The Web Graph as a Matrix 1 2 55 4 3 S = 0 1 2 0 1 2 0 0 0 1 1 1 3 3 3 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 Links = nonzero elements in matrix Man p.10

Properties of Matrix S Row i of S: Links from page i to other pages Column i of S: Links into page i S is a stochastic matrix: All elements in [0, 1] Elements in each row sum to 1 Dominant left eigenvector: ω T S = ω T ω 0 ω 1 = 1 ω i is probability of visiting page i But: ω not unique Man p.11

Google Matrix Convex combination Stochastic matrix S G = αs + (1 α)11v T } {{ } rank 1 Damping factor 0 α < 1 e.g. α =.85 Column vector of all ones 11 Personalization vector v 0 v 1 = 1 Models teleportation Man p.12

PageRank G = αs + (1 α)11v T G is stochastic, with eigenvalues: 1 > α λ 2 (S) α λ 3 (S)... Unique dominant left eigenvector: π T G = π T π 0 π 1 = 1 π i is PageRank of web page i [Haveliwala & Kamvar 2003, Eldén 2003, Serra-Capizzano 2005] Man p.13

How Google Ranks Web Pages Model: Internet web graph stochastic matrix G Computation: PageRank π is eigenvector of G π i is PageRank of page i Display: If π i > π k then page i may be displayed before page k depending on hypertext analysis Man p.14

Facts The anatomy of a large-scale hypertextual web search engine [Brin & Page 1998] US patent for PageRank granted in 2001 Google indexes 10 s of billions of web pages (1 billion = 10 9 ) Google serves 200 million queries per day Each query processed by 1000 machines All search engines combined process more than 500 million queries per day [Desikan, 26 October 2006] Man p.15

Computation of PageRank The world s largest matrix computation [Moler 2002] Eigenvector Matrix dimension is 10 s of billions The matrix changes often 250,000 new domain names every day Fortunately: Matrix is sparse Man p.16

Power Method Want: π such that π T G = π T Power method: Pick an initial guess x (0) Repeat [x (k+1) ] T := [x (k) ] T G until termination criterion satisfied Each iteration is a matrix vector multiply Man p.17

Matrix Vector Multiply x T G = x T [ αs + (1 α)11v T] Cost: # non-zero elements in S A power method iteration is cheap Man p.18

Error Reduction in 1 Iteration π T G = π T G = αs + (1 α)11v T [x (k+1) π] T = [x (k) ] T G π T G = α[x (k) π] T S Error: x (k+1) π 1 } {{ } iteration k+1 α x (k) π 1 } {{ } iteration k Man p.19

Error in Power Method π T G = π T G = αs + (1 α)11v T Error after k iterations: x (k) π 1 α k x (0) π 1 } {{ } 2 [Bianchini, Gori & Scarselli 2003] Error bound does not depend on matrix dimension Man p.20

Advantages of Power Method Simple implementation (few decisions) Cheap iterations (sparse matvec) Minimal storage (a few vectors) Robust convergence behaviour Convergence rate independent of matrix dimension Numerically reliable and accurate (no subtractions, no overflow) But: can be slow Man p.21

PageRank Computation Power method Page, Brin, Motwani & Winograd 1999 Bianchini, Gori & Scarselli 2003 Acceleration of power method Kamvar, Haveliwala, Manning & Golub 2003 Haveliwala, Kamvar, Klein, Manning & Golub 2003 Brezinski & Redivo-Zaglia 2004, 2006 Brezinski, Redivo-Zaglia & Serra-Capizzano 2005 Aggregation/Disaggregation Langville & Meyer 2002, 2003, 2006 Ipsen & Kirkland 2006 Man p.22

PageRank Computation Methods that adapt to web graph Broder, Lempel, Maghoul & Pedersen 2004 Kamvar, Haveliwala & Golub 2004 Haveliwala, Kamvar, Manning & Golub 2003 Lee, Golub & Zenios 2003 Lu, Zhang, Xi, Chen, Liu, Lyu & Ma 2004 Ipsen & Selee 2006 Krylov methods Golub & Greif 2004 Del Corso, Gullí, Romani 2006 Man p.23

PageRank Computation Schwarz & asynchronous methods Bru, Pedroche & Szyld 2005 Kollias, Gallopoulos & Szyld 2006 Linear system solution Arasu, Novak, Tomkins & Tomlin 2002 Arasu, Novak & Tomkins 2003 Bianchini, Gori & Scarselli 2003 Gleich, Zukov & Berkin 2004 Del Corso, Gullí & Romani 2004 Langville & Meyer 2006 Man p.24

PageRank Computation Surveys of numerical methods: Langville & Meyer 2004 Berkhin 2005 Langville & Meyer 2006 (book) Man p.25

Is the Ranking Correct? π T = (.23.24.26.27 ) x T = (.27.26.24.23 ) x π =.04 Small error, but incorrect ranking y T = ( 0.001.002.997 ) y π =.727 Large error, but correct ranking Man p.26

What is Important? Numerical value ordinal rank ordinal rank: position of an element in an ordered list Very little research on ordinal ranking Rank-stability, rank-similarity [Lempel & Moran, 2005] [Borodin, Roberts, Rosenthal & Tsaparas 2005] Man p.27

Ordinal Ranking Largest element gets Orank 1 π T = (.23.24.26.27 ) Orank(π 4 ) = 1, Orank(π 1 ) = 4 x T = (.27.26.24.23 ) Orank(x 1 ) = 1 Orank(π 1 ) = 4 y T = ( 0.001.002.997 ) Orank(y 4 ) = 1 = Orank(π 4 ) Man p.28

Problems with Ordinal Ranking When done with power method: Popular termination criteria do not guarantee correct ranking Additional iterations can destroy ranking Rank convergence depends on: α, v, initial guess, matrix dimension, structure of web graph Even if successive iterates have the same ranking, their ranking may not be correct [Wills & Ipsen 2007] Man p.29

Ordinal Ranking Criterion Given: Approximation x, x 0 Error bound β x π 1 Criterion: x i > x j + β = π i > π j Why? (x i π i ) (x j π j ) x π 1 β x i (x j + β) π i π j 0 < x i (x j + β) = 0 < π i π j [Kirkland 2006] Man p.30

Applicability of Criterion 100 90 B1 FP Bound Applies Element Matches Successor 80 70 Percentage of Elements 60 50 40 30 20 10 0 20 40 60 80 100 120 140 160 180 Iteration Number Man p.31

Properties of Ranking Criterion Applies to any approximation, provided error bound is available Requires well-separated elements Tends to identify ranks of larger elements Determines partial ranking Top-k, bucket and exact ranking Easy to use with power method Man p.32

Top-k Ranking Given: Approximation x to PageRank π Permutation P so that x = Px with x 1... x n Write: π = Pπ Suppose x k > x k+1 + β Man p.33

Top-k Ranking x 1 Orank( π 1 ) k.. x k Orank( π k ) k β x k+1 Orank( π k+1 ) k + 1.. x n Orank( π n ) k + 1 Man p.34

Exact Ranking Given: Approximation x to PageRank π Permutation P so that x = Px with x 1... x n Write: π = Pπ If x k 1 > x k + β and x k > x k+1 + β then Orank( π k ) = k Man p.35

Exact Ranking x 1. x k 1 β Orank( π k ) k x k+1. x n x k β Orank( π k ) k Man p.36

Bucket Ranking Given: Approximation x to PageRank π Permutation P so that x = Px with x 1... x n Write: π = Pπ Suppose x k i > x k + β and x k > x k+j + β π k is in bucket of width i + j 1 Man p.37

Bucket Ranking x 1. x k i β Orank( π k ) k i + 1 x k β Orank( π k ) k + j 1 x k+j. x n Man p.38

Experiments n # buckets 1. bucket last bucket 9,914 4,307 1 7% 3,148,440 34,911 1 95% n exact rank exact top 100 lowest rank 9,914 32% 79 9,215 3,148,440 0.76% 100 151,794 Man p.39

Buckets for Small Matrix 200 180 160 Number in Each Bucket 140 120 100 80 60 40 20 0 0 500 1000 1500 2000 2500 3000 3500 4000 Bucket Number Man p.40

Power Method Ranking Simple error bound: Simple ranking criterion: If x (k) i x (k) π 1 2α k }{{} β > x (k) j + 2α k then π i > π j But: 2α k is too pessimistic (not tight enough) Man p.41

Power Method Ranking Tighter error bound: x (k) π 1 α 1 α x(k) x (k 1) 1 } {{ } β More effective ranking criterion: If x (k) i > x (k) j + β then π i > π j Man p.42

Floating Point Ranking If x (k) i > x (k) j β = + β then π i > π j α 1 α x(k) x (k 1) 1 + e e is round off error from single matvec IEEE double precision floating point arithmetic: e cm10 16 m max # links into any web page Man p.43

Expensive Implementation To prevent accumulation of round off Explicit normalization of iterates x (k+1) = x (k+1) / x (k+1) 1 Compute norms, inner products, matvecs with compensated summation Limited by round off error from single matvec Analysis for matrix dimensions n < 10 14 in IEEE arithmetic (ǫ 10 16 ) Man p.44

Difficulties for XLARGE Problems Catastrophic cancellation when computing β = α 1 α x(k) x (k 1) 1 + e Bound β dominated by round off e Compensated summation insufficient to reduce higher order round off O(nǫ 2 ) Doubly compensated summation too expensive: O(n log n) flops Man p.45

Possible Remedies Lump dangling nodes [Ipsen & Selee 2006] Web pages w/o outlinks: pdf & image files, protected pages, web frontier Up to 50%-80% of all web pages Remove unreferenced web pages Use faster converging method Then 1 power method iteration for ranking Relative ranking criteria? Man p.46

Summary Google orders web pages according to: PageRank and hypertext analysis PageRank = left eigenvector of G G = αs + (1 α)11v T Power method: simple, robust, accurate Convergence rate depends on α but not on matrix dimension Criterion for ordinal ranking Round off serious for XLarge problems Man p.47

User-Friendly Resources Rebecca Wills: Google s PageRank: The Math Behind the Search Engine Mathematical Intelligencer, 2006 Amy Langville & Carl Meyer: Google s PageRank and Beyond The Science of Search Engine Rankings Princeton University Press, 2006 Amy Langville & Carl Meyer: Broadcast of On-Air Interview, November 2006 Carl Meyer s web page Man p.48