Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time


Edward Bortnikov & Ronny Lempel
Yahoo! Labs, Haifa
3 May Big Data Technology

Class Outline
- Link-based page importance measures
  - Why link-based?
  - Mathematical background
  - PageRank
- Crawlers
  - What are crawlers
  - Algorithm for computing page importance while crawling
  - Distributed implementation calls for NoSQL key-value database

Link Analysis - Motivation
Traditional text-based ranking methods from the field of Information Retrieval are not sufficient on the Web:
- The Web is huge, with great variation in the quality of pages, even when those pages contain similar text
- Queries are usually very short, making differentiation between pages difficult
- The text on many web pages does not sufficiently describe the page
- Keyword spamming
The emphasis of web search is on precision, not recall!

Link Analysis - Motivation
The connectivity patterns between Web pages contain a gold mine of information.
A link from page a to page b can often be interpreted as:
1. A recommendation, by a's author, of the contents of b
2. Evidence that pages a and b share some topic of interest
A co-citation of a and b (by a third page c) may also constitute evidence that a and b share some topic of interest.
[Figure: page c linking to both a and b]
Analyzing linkage patterns at scale can identify oft-praised pages and topical affinities between pages.

Mathematical Background - Irreducibility
- A directed graph $G=(V,E)$ is called irreducible if for every $i,k \in V$ there is a path in $G$ originating at $i$ and ending in $k$
- A non-negative $N \times N$ matrix $W$ is called irreducible if for every $i,k \in \{1,\dots,N\}$ there exists a non-negative integer $m$ such that $[W^m]_{i,k} > 0$
- The support graph $G_W = (V_W, E_W)$ of a non-negative $N \times N$ matrix $W$ is a directed graph with $N$ vertices, such that $(i \to k) \in E_W$ if and only if $W_{i,k} > 0$
- Lemma: a non-negative square matrix $W$ is irreducible if and only if $G_W$ is irreducible
- Lemma: a directed graph $G$ is irreducible if and only if the adjacency matrix of $G$ is irreducible

Mathematical Background: Primitive & Stochastic Matrices
- Definition: a non-negative $N \times N$ matrix $P$ is stochastic if the sum of every row in $P$ is 1
- Definition: the period of a directed graph $G$ is the greatest common divisor of the lengths of all the cycles in $G$
  - $G$ is called aperiodic if it has a period of 1
- Definition: a non-negative $N \times N$ matrix $M$ is called primitive if its support graph $G_M$ is aperiodic
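The definitions above can be checked mechanically on small matrices. The following is a minimal sketch, not part of the original slides: the function names are illustrative, matrices are assumed to be lists of rows, and the period is computed with the standard BFS-level trick (the gcd of dist[i] + 1 - dist[k] over all edges i -> k), which is valid only for irreducible matrices.

```python
from collections import deque
from math import gcd

def support_graph(W):
    """Adjacency lists of the support graph: edge i -> k iff W[i][k] > 0."""
    n = len(W)
    return [[k for k in range(n) if W[i][k] > 0] for i in range(n)]

def bfs_levels(adj, start):
    """BFS distance from start to every node (None if unreachable)."""
    dist = [None] * len(adj)
    dist[start] = 0
    queue = deque([start])
    while queue:
        i = queue.popleft()
        for k in adj[i]:
            if dist[k] is None:
                dist[k] = dist[i] + 1
                queue.append(k)
    return dist

def is_irreducible(W):
    """True iff every node reaches every node in the support graph."""
    adj = support_graph(W)
    return all(None not in bfs_levels(adj, s) for s in range(len(adj)))

def period(W):
    """Period of an irreducible matrix (assumption: W is irreducible)."""
    adj = support_graph(W)
    dist = bfs_levels(adj, 0)
    g = 0
    for i, outs in enumerate(adj):
        for k in outs:
            g = gcd(g, abs(dist[i] + 1 - dist[k]))
    return g

two_cycle = [[0, 1], [1, 0]]                      # cycles of length 2 only
mixed = [[0, 1, 0], [0, 0, 1], [1, 1, 0]]         # cycles of length 2 and 3
one_way = [[1, 1], [0, 1]]                        # node 1 cannot reach node 0
```

Here `two_cycle` is irreducible with period 2, `mixed` is irreducible and aperiodic (hence primitive in the sense above), and `one_way` is not irreducible.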

Mathematical Background: Ergodic Theorem
- Let $M$ be a non-negative $N \times N$ matrix, and denote by $\lambda_1(M), \lambda_2(M), \dots, \lambda_N(M)$ the $N$ eigenvalues of $M$, ordered by non-increasing absolute value (i.e. $|\lambda_1(M)| \ge |\lambda_2(M)| \ge \dots \ge |\lambda_N(M)|$)
- $|\lambda_1(M)|$ is the spectral radius of $M$ and will simply be denoted by $\lambda(M)$
- Ergodic Theorem: let $P$ be an irreducible and primitive stochastic matrix. Then:
  - $\lambda(P) = \lambda_1(P) = 1$, and any other eigenvalue $\lambda^*$ of $P$ satisfies $|\lambda^*| < 1$
  - There is a unique distribution row-vector $\pi$ which satisfies $\pi P = \pi$
  - $\pi$ is the principal eigenvector of $P$ and is the stationary distribution of the Markov chain defined by the transition matrix $P$
  - For any distribution row-vector $q$, $\lim_{k \to \infty} q P^k = \pi$
- Note that the last bullet defines an iterative method to compute $\pi$

PageRank (Brin & Page, 1998)
- Named after Google's co-founder, Larry Page
- A global, query-independent importance measure of Web pages
- A page is considered important if it receives many links from important pages
- Based on Markov chains and random walks
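The iterative method implied by the last bullet of the Ergodic Theorem (repeatedly multiplying a distribution q by P) can be sketched as follows. This is an illustrative example only: the function name, the tolerance, the iteration cap and the 3x3 matrix are assumptions, not taken from the slides.

```python
def power_iteration(P, q, tol=1e-12, max_iter=10_000):
    """Approximate the stationary row-vector pi of a stochastic matrix P
    by iterating q <- q P until successive vectors differ by less than tol."""
    n = len(P)
    for _ in range(max_iter):
        nxt = [sum(q[i] * P[i][k] for i in range(n)) for k in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, q)) < tol:
            return nxt
        q = nxt
    return q

# A small irreducible, aperiodic stochastic matrix (illustrative values).
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

# Any starting distribution works; here all mass starts on state 0.
pi = power_iteration(P, [1.0, 0.0, 0.0])
```

For this particular P, one can verify by hand that (0.25, 0.5, 0.25) satisfies πP = π, so the iteration should approach that vector regardless of the starting distribution.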

PageRank: Random Surfer Model
A random surfer moves from page to page. Upon leaving page p, the surfer chooses one of two actions:
1. Follows an outgoing link of p (chosen uniformly at random), with probability d
   - See next slide for a discussion of pages that have no outlinks
2. Jumps to an arbitrary Web page (chosen uniformly at random), with probability 1-d
The vector of PageRanks is the stationary distribution of this (ergodic) random walk.

PageRank - Handling Dangling Nodes
PageRank as stated in the previous slide is not well defined with respect to exiting pages that have no outgoing links (dangling nodes).
There are three accepted approaches for treating pages with no outgoing links:
1. Eliminate such pages from the graph (iteratively prune the graph until reaching a steady state)
2. Consider such pages to link back to the pages that link to them
3. Consider such pages to link to all web pages (effectively making an exit out of them equivalent to a random jump)

PageRank: Steady State Equations
The PageRanks obey the following equations:
$$R(p) = \frac{1-d}{N} + d \sum_{j \in I(p)} \frac{R(j)}{D(j)}$$
where:
- $R(p)$: the PageRank of page p
- $d$: a damping factor, $0 < d < 1$
- $N$: number of Web pages
- $I(p)$: the set of pages that point to p
- $D(j)$: number of out-links (out-degree) of page j

PageRank - Algebraic Notation
- Let $W$ denote the $N \times N$ adjacency matrix of the Web's link structure, after some form of handling dangling nodes
- Let $W_{norm}$ denote the matrix that results from dividing each row j of $W$ by j's out-degree (row j's sum)
- Let $T$ be an $N \times N$ matrix whose entries are all equal to $1/N$
- Define $M = (1-d)\,T + d\,W_{norm}$
- $M$ is the transition matrix corresponding to PageRank's random walk
- $M$'s principal eigenvector is the vector of PageRanks
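The steady-state equations can be solved by simple fixed-point iteration. The sketch below is one possible reading rather than a reference implementation: pages are assumed to be numbered 0..N-1, dangling nodes are handled by approach 3 (linking to all pages), and the damping value 0.85 and the iteration count are assumptions, since the slides only require 0 < d < 1.

```python
def pagerank(out_links, d=0.85, iters=100):
    """Iterate R(p) = (1-d)/N + d * sum over j in I(p) of R(j)/D(j).

    out_links[j] lists the pages that page j links to.
    """
    n = len(out_links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - d) / n] * n
        for j, targets in enumerate(out_links):
            # Dangling node: treat it as linking to every page (approach 3).
            targets = targets or range(n)
            share = d * rank[j] / len(targets)
            for p in targets:
                nxt[p] += share
        rank = nxt
    return rank

# Toy graph (illustrative): 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
ranks = pagerank([[1, 2], [2], [0]])
```

In this toy graph page 2 collects links from both other pages, so it ends up ranked above page 1, which is linked only from page 0.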

Crawlers - Introduction
- The role of crawlers is to collect Web content
- Starting with some seed URLs, crawlers learn of additional crawl targets via hyperlinks on the crawled pages
- Several types of crawlers:
  - Batch crawlers crawl a snapshot of their crawl space, until reaching a certain size or time limit
  - Incremental crawlers continuously crawl their crawl space, revisiting URLs to ensure freshness
  - Focused crawlers attempt to crawl pages pertaining to some topic/theme, while minimizing the number of off-topic pages that are collected
- Web-scale crawling is carried out by distributed crawlers, complicating what is conceptually a simple operation
- Resources consumed: bandwidth, computation time (beyond communication, e.g. parsing), storage space

Generic Web Crawling Algorithm
Given a root set of distinct URLs:
- Put the root URLs in a (priority) queue
- While the queue is not empty:
  - Take the first URL x out of the queue
  - Retrieve the contents of x from the web
  - Do whatever you want with it
  - Mark x as visited
  - If x is an HTML page, parse it to find hyperlink URLs
  - For each hyperlink URL y within x:
    - If y hasn't been visited (perhaps lately), enqueue y with some priority
Note that this algorithm may never stop on evolving graphs.
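The generic crawling loop can be sketched in a few lines. This is a simplified batch variant under stated assumptions: `fetch` is a hypothetical callback that retrieves and parses a page and returns its hyperlink URLs, `priority` is a hypothetical scoring function (lower values are crawled first), each URL is enqueued at most once, and `max_pages` bounds the crawl so it always terminates.

```python
import heapq

def crawl(roots, fetch, priority, max_pages=1000):
    """Generic priority-queue crawl loop over an abstract fetch function."""
    queue = [(priority(u), u) for u in roots]
    heapq.heapify(queue)
    seen = set(roots)          # URLs already enqueued or visited
    visited = []               # crawl order
    while queue and len(visited) < max_pages:
        _, x = heapq.heappop(queue)
        links = fetch(x)       # retrieve x and parse out its hyperlinks
        visited.append(x)      # "do whatever you want with it", mark visited
        for y in links:
            if y not in seen:
                seen.add(y)
                heapq.heappush(queue, (priority(y), y))
    return visited

# Toy in-memory "web" standing in for real HTTP fetches (illustrative).
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
order = crawl(["a"], lambda url: toy_web.get(url, []), priority=lambda url: 0)
```

With a constant priority, ties are broken by URL string, so the toy crawl visits the pages in alphabetical discovery order. An incremental crawler would instead re-enqueue visited URLs, which is why the loop on the slide may never stop.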

Adaptive On-Line Page Importance Computation (Abiteboul, Preda, Cobena, WWW 2003)
- Assume we want to prioritize a continuous crawl by a PageRank-like link-based importance measure of Web pages
- One option is to build a link database as we crawl, and use it to compute PageRank every K crawl operations
  - This is both difficult and expensive
- The paper above details a method to compute a PageRank-like measure in an on-line, asynchronous manner with little overhead in terms of computation, I/O and memory
- The description in the following slides is for a simplified version of a continuous crawl over N pages whose links never change (i.e. the set of pages to be continuously crawled is fixed, and only their text is subject to change)

Page Importance - Definition
- Let $G=(V,E)$ be the (directed) link graph of the Web
- We build $G'=(V',E')$ by adding a virtual node y to $G$, with links to and from all other nodes; thus, $G'$ will be strongly connected
  - Formally, $V' = V \cup \{y\}$, $E' = E \cup \{x \to y,\ y \to x : x \in V\}$
- Whenever $|E| > 0$, $G'$ is also aperiodic (every edge of $E$ closes a cycle of length 3 through y, while every node closes a cycle of length 2 with y)
- Let $M(G')$ denote the row-normalized adjacency matrix of $G'$; obviously $M(G')$ is stochastic, and by the above assumption also ergodic
- Let $\pi$ denote the principal eigenvector of $M(G')$ (and the stationary distribution of the Markov chain it represents), i.e. $\pi = \pi M(G')$
- We define $\pi$ as the Importance Vector of all nodes in $G'$ that the algorithm will compute
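The construction of G' and of its row-normalized matrix M(G') can be illustrated on a toy graph. This is a sketch under assumptions: pages are numbered 0..N-1, the virtual node y receives index N, and both function names are hypothetical.

```python
def add_virtual_node(out_links):
    """Return the out-link lists of G': every page links to the virtual
    node y (index N), and y links to every page."""
    n = len(out_links)
    y = n
    augmented = [sorted(set(targets) | {y}) for targets in out_links]
    augmented.append(list(range(n)))   # out-links of y
    return augmented

def row_normalized(out_links):
    """Row-normalized adjacency matrix: M[i][k] = 1/outdegree(i) if i -> k."""
    n = len(out_links)
    M = [[0.0] * n for _ in range(n)]
    for i, targets in enumerate(out_links):
        for k in targets:
            M[i][k] = 1.0 / len(targets)
    return M

# Toy graph (illustrative): 0 -> 1, 1 -> 0, and page 2 is dangling.
g_prime = add_virtual_node([[1], [0], []])
M = row_normalized(g_prime)
```

Because every page gains a link to y, even the dangling page 2 has a non-empty row, so every row of M sums to 1.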

On-Line Algorithm for Computing Page Importance
The algorithm assigns each page v the following variables:
1. Cash money $C(v)$
2. Bank money $B(v)$
In addition, let $L(v)$ denote the outlinks of page v, and let $G$ denote the global amount of money in the bank.
The algorithm itself proceeds as follows:
- Initialization: for all pages, let $C(v) \leftarrow 1/N$, $B(v) \leftarrow 0$; $G \leftarrow 0$
- Repeat forever:
  - Pick a page v to visit according to the Visit Strategy*
  - Distribute (evenly) an amount equal to $C(v)$ among v's children, i.e. $\forall j \in L(v)$: $C(j) \mathrel{+}= C(v)/|L(v)|$
  - Deposit v's cash in the bank: $B(v) \mathrel{+}= C(v)$, $G \mathrel{+}= C(v)$, $C(v) \leftarrow 0$
* Fairness condition: every page is visited infinitely often

Analysis of the Algorithm - Three Lemmas
- Lemma 1: at all times, $\sum_v C(v) = 1$ (proof by easy induction)
- Lemma 2: at all times, $B(v) + C(v) = 1/N + \sum_{j: j \to v} M(G')_{j,v}\, B(j)$
  - Proof: also by induction, which is trivial at time zero. The step analyzes three distinct cases of which page is visited:
    1. v is visited: nothing changes on the RHS, while money is just moved from $C(v)$ to $B(v)$ on the LHS
    2. A page j that links to v is visited: the LHS grows by $C(j)/|L(j)|$, which is the exact increment of the RHS
    3. Otherwise, neither side changes
- Lemma 3: $G$ goes to infinity as the algorithm proceeds
  - Proof: by Lemma 1, at any time t there is at least one page x whose cash amount is at least $1/N$. Since each page is visited infinitely often, there is a finite $t' > t$ at which x is visited; thus $G$ will increase by at least $1/N$ by time $t'$
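The cash/bank algorithm is short enough to simulate directly. The following is a minimal sketch, not the authors' implementation: it assumes every page has at least one out-link (as is the case once a virtual node is added), takes the visit strategy as a hypothetical `pick(step, cash)` callback, and saves the visited page's cash in a temporary before zeroing it so that self-links are handled correctly.

```python
def opic(out_links, steps, pick):
    """Simulate the on-line cash/bank algorithm for a fixed number of visits.

    Returns (cash, bank, total) where total plays the role of G.
    """
    n = len(out_links)
    cash = [1.0 / n] * n
    bank = [0.0] * n
    total = 0.0
    for step in range(steps):
        v = pick(step, cash)
        c = cash[v]
        cash[v] = 0.0
        for j in out_links[v]:              # distribute evenly to children
            cash[j] += c / len(out_links[v])
        bank[v] += c                        # deposit in the bank
        total += c
    return cash, bank, total

# Toy strongly connected graph (illustrative): 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
links = [[1, 2], [2], [0]]
cash, bank, total = opic(links, 30_000, pick=lambda step, cash: step % 3)
importance = [b / total for b in bank]
```

For this toy graph the stationary distribution of the row-normalized matrix is (0.4, 0.2, 0.4), which can be checked by hand, and the normalized bank vector under a round-robin strategy should approach it. Total cash stays at 1 throughout, as Lemma 1 states.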

Analysis of the Algorithm - Main Theorem
- Lemma 1: at all times, $\sum_v C(v) = 1$ (proof by easy induction)
- Lemma 2: at all times, $B(v) + C(v) = 1/N + \sum_{j: j \to v} M(G')_{j,v}\, B(j)$
- Lemma 3: $G$ goes to infinity as the algorithm proceeds
- Let $\bar{B}$ be the normalized bank vector, i.e. $\bar{B}_v = B(v)/G$. Thus, $\bar{B}$ is a distribution vector.
- Theorem: $|\bar{B}\,M(G') - \bar{B}| \to 0$ as the algorithm proceeds
- Proof: examine the v-th coordinate:
$$\left| [\bar{B}\,M(G') - \bar{B}]_v \right| = G^{-1} \left| B(v) - \sum_{j} B(j)\, M(G')_{j,v} \right|$$
$$= G^{-1} \left| B(v) + C(v) - C(v) - \sum_{j: j \to v} B(j)\, M(G')_{j,v} \right|$$
$$= G^{-1} \left| \frac{1}{N} - C(v) \right| < G^{-1} \to 0$$
  The last equality uses Lemma 2, and the limit uses Lemma 3.
- Conclusion: $\bar{B} \to \pi$, the stationary distribution of $M(G')$

Possible Visit Strategies - Crawling Policies
1. Round Robin (obviously fair)
2. Random: choose the next node to visit uniformly at random, thus guaranteeing that the probability of each page being visited infinitely often is 1
3. Greedy: visit the node with maximal cash, thus increasing $G$ in the fastest possible manner
   - Why are all nodes visited infinitely often?
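The three visit strategies can be written as interchangeable selection functions. This is an illustrative sketch: it assumes the cash values are kept in a list indexed by page number, and the function names and the shared `(step, cash)` signature are choices made here, not taken from the slides.

```python
import random

def round_robin(step, cash):
    """Visit pages cyclically: 0, 1, ..., N-1, 0, 1, ..."""
    return step % len(cash)

def uniform_random(step, cash, rng=random):
    """Visit a page chosen uniformly at random."""
    return rng.randrange(len(cash))

def greedy(step, cash):
    """Visit the page currently holding the most cash."""
    return max(range(len(cash)), key=lambda v: cash[v])

example_cash = [0.1, 0.7, 0.2]
choices = (round_robin(5, example_cash), greedy(5, example_cash))
```

With the example values, round robin at step 5 selects page 2 and greedy selects page 1, the page with the most cash.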

Additional Notes
1. There are adaptations of the algorithm to the case of evolving graphs and to a distributed implementation
2. Estimating v's importance by $(B(v)+C(v)) / (G+1)$ is slightly better than just using $B(v)/G$

Distributed Implementation
- As mentioned earlier, Web-scale crawling is a distributed task carried out on multiple machines
- Every visit to page v at crawl time requires access to $B(v)$, $G$, $C(v)$ and $C(j)$ for every neighbor j of v
- The visits will happen across many crawl nodes
  - Even repeat visits to the same page v may happen on different crawl nodes (fault tolerance)
- Consequently, we need read and write access to the above variables across all nodes!

Distributed Implementation
Given a visit to v, the following transaction should be performed:
1. Foreach $j \in L(v)$:
   1. $C(j) \mathrel{+}= C(v) / |L(v)|$
2. $B(v) \mathrel{+}= C(v)$
3. $G \mathrel{+}= C(v)$
4. $C(v) \leftarrow 0$

Equivalently, using a temporary variable t:
1. $t \leftarrow C(v)$
2. $C(v) \leftarrow 0$
3. Foreach $j \in L(v)$:
   1. $C(j) \mathrel{+}= t / |L(v)|$
4. $B(v) \mathrel{+}= t$
5. $G \mathrel{+}= t$

Re-examining the required consistency, we can rewrite as:
1. Lock $C(v)$
2. $t \leftarrow C(v)$
3. $C(v) \leftarrow 0$
4. Unlock $C(v)$
5. Foreach $j \in L(v)$:
   1. $C(j) \mathrel{+}= t / |L(v)|$
6. $B(v) \mathrel{+}= t$
7. $G \mathrel{+}= t$

NoSQL Databases
- "Not Only SQL": a class of database services that trade off most of the functionality of full-blown SQL databases for extreme scalability
- Simplest manifestation: a super-scalable key-value store service
- Pioneered by Google (BigTable); now many other instances
- Representative high-level API of a tabular NoSQL database:
  - lock(row), unlock(row)
  - get(row, [column]), put(row, [column, value]*)
  - increment(row, [column, incr-value]*)
The visit transaction, expressed with this API:
1. lock(v)
2. t = get(v, C)
3. put(v, C, 0)
4. unlock(v)
5. Foreach $j \in L(v)$:
   1. increment(j, C, t/|L(v)|)
6. increment(v, B, t)
7. increment(G, , t)
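To make the API-level transaction concrete, here is a toy in-memory stand-in for such a table. It is a sketch under loud assumptions: `ToyTable` is hypothetical and is not the client of any real NoSQL system, `increment` is assumed to wait while another client holds the row lock, and the row/column names `"G"`/`"total"` for the global bank amount are invented for illustration.

```python
import threading
from collections import defaultdict

class ToyTable:
    """Hypothetical in-memory table exposing lock/unlock/get/put/increment."""

    def __init__(self):
        self._guard = threading.Lock()                  # protects the dicts
        self._cells = defaultdict(float)                # (row, column) -> value
        self._row_locks = defaultdict(threading.Lock)   # one lock per row

    def _row_lock(self, row):
        with self._guard:
            return self._row_locks[row]

    def lock(self, row):
        self._row_lock(row).acquire()

    def unlock(self, row):
        self._row_lock(row).release()

    def get(self, row, column):
        with self._guard:
            return self._cells[(row, column)]

    def put(self, row, column, value):
        with self._guard:
            self._cells[(row, column)] = value

    def increment(self, row, column, delta):
        # Atomic read-modify-write; waits if the row is locked by a visitor.
        with self._row_lock(row):
            with self._guard:
                self._cells[(row, column)] += delta

def visit(table, v, out_links):
    """One visit to page v, following the lock / get / put / unlock steps."""
    table.lock(v)
    t = table.get(v, "C")
    table.put(v, "C", 0.0)
    table.unlock(v)
    for j in out_links:
        table.increment(j, "C", t / len(out_links))
    table.increment(v, "B", t)
    table.increment("G", "total", t)    # "total" is an assumed column name

links = {0: [1, 2], 1: [2], 2: [0]}
table = ToyTable()
for page in links:
    table.put(page, "C", 1.0 / len(links))
for step in range(300):
    visit(table, step % 3, links[step % 3])
```

Only the read-and-reset of C(v) happens under the row lock; the remaining updates are independent atomic increments, which is what lets visits from different crawl nodes interleave without losing cash.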

Next Class
Properties and design of scalable real-time NoSQL key-value stores, primarily BigTable and HBase