Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time"

Transcription

1 Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa Class Outline Link-based page importance measures Why link-based? Mathematical background PageRank Crawlers What are crawlers Algorithm for computing page importance while crawling Distributed implementation calls for NoSQL key-value database 3 May Big Data Technology 2 1

2 Link Analysis - Motivation Traditional text-based ranking methods from the field of Information Retrieval are not sufficient on the Web: The web is huge, with great variation in the quality of the pages even when those pages contain similar text Queries are usually very short, making differentiation between pages difficult The text on many web pages does not sufficiently describe the page Keyword spamming The emphasis of web search is on precision, not recall! 3 May Big Data Technology 3 Link Analysis - Motivation The connectivity patterns between Web pages contain a gold mine of information A link from page a to page b can often be interpreted as: 1. A recommendation, by a s author, of the contents of b 2. Evidence that pages a,b share some topic of interest A co-citation of a and b (by a third page c) may also constitute evidence that a,b share some topic of interest a c b Analyzing linkage patterns at scale can identify oft-praised pages and topical affinities between pages 3 May Big Data Technology 4 2

3 Mathematical Background - Irreducibility A directed graph G=(V,E) is called irreducible if for every i,k V there is a path in G originating at i and ending in k A non-negative NxN matrix W is called irreducible if for every i,k {1,,N} there exists a non-negative integer m such that [W m ] i,k >0 The support graph G W ={V W,E W } of a non-negative NxN matrix W is a directed graph with N vertices, such that i k E W if and only if W i,k >0 Lemma: a non-negative square matrix W is irreducible if and only if G W is irreducible Lemma: a directed graph G is irreducible if and only if the adjacency matrix of G is irreducible 3 May Big Data Technology 5 Mathematical Background: Primitive & Stochastic Matrices Definition: a non-negative NxN matrix P is stochastic if the sum of every row in P is 1 Definition: the period of a directed graph G is the greatest common divisor of the lengths of all the cycles in G G is called aperiodic if it has a period of 1 Definition: a non-negative NxN matrix M is called primitive if its support graph G M is aperiodic 3 May Big Data Technology 6 3

4 Mathematical Background: Ergodic Thoerem Let M be a non-negative NxN matrix, and denote by λ1(m), λ2(m),...λn(m) the N eigenvalues of M, ordered by non-decreasing absolute value (i.e. λ1(m) λ2 (M)... λn(m) ) λ1(m) is the spectral radius of M and will simply be denoted by λ(m) Ergodic Theorem: let P be an irreducible and primitive stochastic matrix λ(p) = λ1(p) = 1, and any other eigenvalue of λ* of P satisfies λ* <1 There is a unique distribution row-vector π which satisfies πp=π π is the principal eigenvector of P and is the stationary distribution of the Markov Chain defined by the transition matrix P For any distribution row-vector q, lim k q P k = π Note that the last bullet defines an iterative method to compute π 3 May Big Data Technology 7 PageRank (Brin & Page, 1998) Named after Google s co-founder, Larry Page A global, query independent importance measure of Web pages A page is considered important if it receives many links from important pages Based on Markov chains and random walks 3 May Big Data Technology 8 4

5 PageRank: Random Surfer Model A random surfer moves from page to page. Upon leaving page p, the surfer chooses one of two actions: 1. Follows an outgoing link of p (chosen uniformly at random), with probability d See next slide for a discussion of pages that have no outlinks 2. Jumps to an arbitrary Web page (chosen uniformly at random), with probability 1-d The vector of PageRanks is the stationary distribution of this (ergodic) random walk 3 May Big Data Technology 9 PageRank Handling Dangling Nodes PageRank as stated in the previous slide is not well defined with respect to exiting pages that have no outgoing links (dangling nodes) There are three accepted approaches for treating pages with no outgoing links: 1. Eliminate such pages from the graph (iteratively prune the graph until reaching a steady state) 2. Consider such pages to link back to the pages that link to them 3. Consider such pages to link to all web pages (effectively making an exit out of them equivalent to a random jump) 3 May Big Data Technology 10 5

6 PageRank: Steady State Equations The PageRanks obey the following equations: R(p) = (1-d)/N + d Σ R(j)/D(j) j I(p) R(p) The PageRank of page p. d A damping factor, 0 < d < 1 N Number of Web pages I(p) The set of pages that point to p D(j) Number of out-links (out-degree) of page j 3 May Big Data Technology 11 PageRank Algebraic Notation Let W denote the NxN adjacency matrix of the Web s link structure, after some form of handling dangling nodes Let W norm denote the matrix that results by dividing each row j of W by j s out-degree (row j s sum) Let T by an NxN matrix whose entries are all equal to 1/N Define M = (1-d)T + dw norm M is the transition matrix corresponding to PageRank s random walk M s principal eigenvector is the vector of PageRanks 3 May Big Data Technology 12 6

7 Crawlers - Introduction The role of crawlers is to collect Web content Starting with some seed URLs, crawlers learn of additional crawl targets via hyperlinks on the crawled pages Several types of crawlers: Batch crawlers crawl a snapshot of their crawl space, until reaching a certain size or time limit Incremental crawlers continuously crawl their crawl space, revisiting URLs to ensure freshness Focused crawlers attempt to crawl pages pertaining to some topic/theme, while minimizing number of off-topic pages that are collected Web scale crawling is carried out by distributed crawlers, complicating what is conceptually a simple operation Resources consumed: bandwidth, computation time (beyond communication e.g. parsing), storage space 3 May Big Data Technology 14 Generic Web Crawling Algorithm Given a root set of distinct URLs: Put the root URLs in a (priority) queue While queue is not empty: Take the first URL x out of the queue Retrieve the contents of x from the web Do whatever you want with it Mark x as visited If x is an html page, parse it to find hyperlink URLs For each hyperlink URL y within x: If y hasn t been visited (perhaps lately), enqueue y with some priority Note that this algorithm may never stop on evolving graphs 3 May Big Data Technology 15 7

8 Adaptive On-Line Page Importance Computation (Abiteboul, Preda, Cobena WWW 2003) Assume we want to prioritize a continuous crawl by a PageRank-like link-based importance measure of Web pages One option is to build a link database as we crawl, and use it to compute PageRank every K crawl operations This is both difficult and expensive The paper above details a method to compute a PageRank-like measure in an on-line, asynchronous manner with little overhead in terms of computations, I/O and memory The description in the following slides is for a simplified version of a continuous crawl over N pages whose links never change (i.e. the set of pages to be continuously crawled is fixed, and only their text is subject to change) 3 May Big Data Technology 16 Page Importance Definition Let G=(V,E) be the (directed) link graph of the Web We build G =(V,E ) by adding a virtual node y to G, with links to and from all other nodes; thus, G will be strongly connected Formally, V = V {y}, E = E { x V: x y, y x } Whenever E >0, G is also aperiodic Let M(G ) denote the row-normalized adjacency matrix of G ; obviously M(G ) is stochastic, and by the above assumption also Ergodic Let π denote the principal eigenvector of M(G ) (and stationary distribution of the Markov chain it represents), i.e. π = π M(G ) We will define π as the Importance Vector of all nodes in G that the algorithm will compute 3 May Big Data Technology 17 8

9 On-Line Algorithm for Computing Page Importance The algorithm assigns each page v the following variables: 1. Cash money C(v) 2. Bank money B(v) In addition, let L(v) denote the outlinks of page v, and let G denote the global amount of money in the bank. The algorithm itself proceeds as follows: Initialization: for all pages, let C(v) 1/N, B(v) 0; G 0 Repeat forever: Pick a page v to visit according to the Visit Strategy* Distribute (evenly) an amount equal to C(v) among v s children, i.e. j L(v), C(j) += C(v) / L(v) Deposit v s cash in the bank: B(v) += C(v), G += C(v), C(v) 0 Fairness condition*: every page is visited infinitely often 3 May Big Data Technology 18 Analysis of the Algorithm Three Lemmas Lemma 1: at all times, Σ v C(v) = 1 (proof by easy induction) Lemma 2: at all times, B(v)+C(v)=1/N + Σ j:j v M(G ) j,v B(j) Proof: also by induction, which is trivial at time zero. The step analyzes three distinct cases of which page is visited: 1. v is visited, and then nothing changes on the RHS while money is just moved from C(v) to B(v) on the LHS 2. A page j that links to v is visited. The LHS grows by C(j)/ L(j), which is the exact increment of the RHS 3. Otherwise neither side changes Lemma 3: G goes to infinity as the algorithm proceeds Proof: at any time t there is at least one page x whose cash amount is at least 1/N. Since each page is visited infinitely often, there is a finite t >t in which x is visited, thus G will increase by at least 1/N by time t 3 May Big Data Technology 19 9

10 Analysis of the Algorithm Main Theorem Lemma 1: at all times, Σ v C(v) = 1 (proof by easy induction) Lemma 2: at all times, B(v)+C(v)=1/N + Σ j:j v M(G ) j,v B(j) Lemma 3: G goes to infinity as the algorithm proceeds Let B be the normalized bank vector, i.e. B v =B(v)/G. Thus, B is a distribution vector. Theorem: B*M(G )-B 0 as the algorithm proceeds Proof: examine the v th coordinate: B*M(G )-B v = G -1 * B(v) - Σ j B(j) M(G ) j,v = G -1 * B(v) + C(v) C(v) - Σ j:j v B(j) M(G ) j,v = G -1 * 1/N C(v) < G -1 0 Conclusion: B π, the stationary distribution of M(G ) 3 May Big Data Technology 20 Possible Visit Strategies Crawling Policies 1. Round Robin (obviously fair) 2. Random choose the next node to visit u.a.r., thus guaranteeing that the probability of each page being visited infinitely often is 1 3. Greedy - visit the node with maximal cash, thus increasing G in the fastest possible manner Why are all nodes are visited infinitely often? 3 May Big Data Technology 21 10

11 Additional Notes 1. There are adaptations of the algorithm to the case of evolving graphs and to a distributed implementation 2. Estimating v s importance by (B(v)+C(v)) / (G+1) is slightly better than just using B(v)/G 3 May Big Data Technology 22 Distributed Implementation As mentioned earlier, Web-scale crawling is a distributed task carried out on multiple machines Every visit to page v at crawl time requires access to B(v), G, C(v) and C(j) for every neighbor j of v The visits will happen across many crawl nodes Even repeat visits to the same node v may happen on different nodes (fault tolerance) Consequently, we need read and write access to the above variables across all nodes! 3 May Big Data Technology 23 11

12 Distributed Implementation Given a visit to v, the following transaction should be performed: 1. Foreach j L(v): 1. C(j) += C(v) / L(v) 2. B(v) += C(v) 3. G += C(v) 4. C(v) 0 1. t C(v) 2. C(v) 0 3. Foreach j L(v): 1. C(j) += t / L(v) 4. B(v) += t 5. G += t Re-examining the required consistency, we can rewrite as: 1. Lock C(v) 2. t C(v) 3. C(v) 0 4. Unlock C(v) 5. Foreach j L(v): 1. C(j) += t / L(v) 6. B(v) += t 7. G += t 3 May Big Data Technology 24 NoSQL Databases Not Only SQL: a class of database services that traded off most functionality of full-blown SQL databases for extreme scalability Simplest manifest: super-scalable key-value store service Pioneered by Google (BigTable); now many other instances Representative high-level API of a tabular NoSQL database: lock(row), unlock(row) get(row, [column]), put(row, [column, value]*) increment(row, [column, incr-value]*) 1. lock(v) 2. t= get(v, C) 3. put(v,c,0) 4. unlock(v) 5. Foreach j L(v): 1. increment (j, C, t/ L(v) 6. increment(v, B, t) 7. increment(g,,t) 3 May Big Data Technology 25 12

13 Next Class Properties and design of scalable real-time NoSQL key-value stores, primarily BigTable and HBase 3 May Big Data Technology 26 13

MATH 56A SPRING 2008 STOCHASTIC PROCESSES 31

MATH 56A SPRING 2008 STOCHASTIC PROCESSES 31 MATH 56A SPRING 2008 STOCHASTIC PROCESSES 3.3. Invariant probability distribution. Definition.4. A probability distribution is a function π : S [0, ] from the set of states S to the closed unit interval

More information

The Mathematics of Internet Search Engines

The Mathematics of Internet Search Engines The Mathematics of Internet Search Engines David Marshall Department of Mathematics Monmouth University April 4, 2007 Introduction Search Engines, Then and Now Then... Now... Pagerank Outline Introduction

More information

Web Graph Analyzer Tool

Web Graph Analyzer Tool Web Graph Analyzer Tool Konstantin Avrachenkov INRIA Sophia Antipolis 2004, route des Lucioles, B.P.93 06902, France Email: K.Avrachenkov@sophia.inria.fr Danil Nemirovsky St.Petersburg State University

More information

DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II. Matrix Algorithms DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

More information

The PageRank Citation Ranking: Bring Order to the Web

The PageRank Citation Ranking: Bring Order to the Web The PageRank Citation Ranking: Bring Order to the Web presented by: Xiaoxi Pang 25.Nov 2010 1 / 20 Outline Introduction A ranking for every page on the Web Implementation Convergence Properties Personalized

More information

Part 1: Link Analysis & Page Rank

Part 1: Link Analysis & Page Rank Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Exam on the 5th of February, 216, 14. to 16. If you wish to attend, please

More information

6.042/18.062J Mathematics for Computer Science October 3, 2006 Tom Leighton and Ronitt Rubinfeld. Graph Theory III

6.042/18.062J Mathematics for Computer Science October 3, 2006 Tom Leighton and Ronitt Rubinfeld. Graph Theory III 6.04/8.06J Mathematics for Computer Science October 3, 006 Tom Leighton and Ronitt Rubinfeld Lecture Notes Graph Theory III Draft: please check back in a couple of days for a modified version of these

More information

Search engines: ranking algorithms

Search engines: ranking algorithms Search engines: ranking algorithms Gianna M. Del Corso Dipartimento di Informatica, Università di Pisa, Italy ESP, 25 Marzo 2015 1 Statistics 2 Search Engines Ranking Algorithms HITS Web Analytics Estimated

More information

The Second Eigenvalue of the Google Matrix

The Second Eigenvalue of the Google Matrix 0 2 The Second Eigenvalue of the Google Matrix Taher H Haveliwala and Sepandar D Kamvar Stanford University taherh,sdkamvar @csstanfordedu Abstract We determine analytically the modulus of the second eigenvalue

More information

Introduction to Flocking {Stochastic Matrices}

Introduction to Flocking {Stochastic Matrices} Supelec EECI Graduate School in Control Introduction to Flocking {Stochastic Matrices} A. S. Morse Yale University Gif sur - Yvette May 21, 2012 CRAIG REYNOLDS - 1987 BOIDS The Lion King CRAIG REYNOLDS

More information

Seminar assignment 1. 1 Part

Seminar assignment 1. 1 Part Seminar assignment 1 Each seminar assignment consists of two parts, one part consisting of 3 5 elementary problems for a maximum of 10 points from each assignment. For the second part consisting of problems

More information

Practical Graph Mining with R. 5. Link Analysis

Practical Graph Mining with R. 5. Link Analysis Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities

More information

Eigenvalues and Markov Chains

Eigenvalues and Markov Chains Eigenvalues and Markov Chains Will Perkins April 15, 2013 The Metropolis Algorithm Say we want to sample from a different distribution, not necessarily uniform. Can we change the transition rates in such

More information

15 Markov Chains: Limiting Probabilities

15 Markov Chains: Limiting Probabilities MARKOV CHAINS: LIMITING PROBABILITIES 67 Markov Chains: Limiting Probabilities Example Assume that the transition matrix is given by 7 2 P = 6 Recall that the n-step transition probabilities are given

More information

1 Limiting distribution for a Markov chain

1 Limiting distribution for a Markov chain Copyright c 2009 by Karl Sigman Limiting distribution for a Markov chain In these Lecture Notes, we shall study the limiting behavior of Markov chains as time n In particular, under suitable easy-to-check

More information

Perron vector Optimization applied to search engines

Perron vector Optimization applied to search engines Perron vector Optimization applied to search engines Olivier Fercoq INRIA Saclay and CMAP Ecole Polytechnique May 18, 2011 Web page ranking The core of search engines Semantic rankings (keywords) Hyperlink

More information

CHAPTER III - MARKOV CHAINS

CHAPTER III - MARKOV CHAINS CHAPTER III - MARKOV CHAINS JOSEPH G. CONLON 1. General Theory of Markov Chains We have already discussed the standard random walk on the integers Z. A Markov Chain can be viewed as a generalization of

More information

Enhancing the Ranking of a Web Page in the Ocean of Data

Enhancing the Ranking of a Web Page in the Ocean of Data Database Systems Journal vol. IV, no. 3/2013 3 Enhancing the Ranking of a Web Page in the Ocean of Data Hitesh KUMAR SHARMA University of Petroleum and Energy Studies, India hkshitesh@gmail.com In today

More information

Bindel, Fall 2012 Matrix Computations (CS 6210) Week 8: Friday, Oct 12

Bindel, Fall 2012 Matrix Computations (CS 6210) Week 8: Friday, Oct 12 Why eigenvalues? Week 8: Friday, Oct 12 I spend a lot of time thinking about eigenvalue problems. In part, this is because I look for problems that can be solved via eigenvalues. But I might have fewer

More information

S = {1, 2,..., n}. P (1, 1) P (1, 2)... P (1, n) P (2, 1) P (2, 2)... P (2, n) P = . P (n, 1) P (n, 2)... P (n, n)

S = {1, 2,..., n}. P (1, 1) P (1, 2)... P (1, n) P (2, 1) P (2, 2)... P (2, n) P = . P (n, 1) P (n, 2)... P (n, n) last revised: 26 January 2009 1 Markov Chains A Markov chain process is a simple type of stochastic process with many social science applications. We ll start with an abstract description before moving

More information

IEOR 6711: Stochastic Models, I Fall 2012, Professor Whitt, Final Exam SOLUTIONS

IEOR 6711: Stochastic Models, I Fall 2012, Professor Whitt, Final Exam SOLUTIONS IEOR 6711: Stochastic Models, I Fall 2012, Professor Whitt, Final Exam SOLUTIONS There are four questions, each with several parts. 1. Customers Coming to an Automatic Teller Machine (ATM) (30 points)

More information

Lecture 6: The Group Inverse

Lecture 6: The Group Inverse Lecture 6: The Group Inverse The matrix index Let A C n n, k positive integer. Then R(A k+1 ) R(A k ). The index of A, denoted IndA, is the smallest integer k such that R(A k ) = R(A k+1 ), or equivalently,

More information

Monte Carlo methods in PageRank computation: When one iteration is sufficient

Monte Carlo methods in PageRank computation: When one iteration is sufficient Monte Carlo methods in PageRank computation: When one iteration is sufficient K.Avrachenkov, N. Litvak, D. Nemirovsky, N. Osipova Abstract PageRank is one of the principle criteria according to which Google

More information

Markov chains and Markov Random Fields (MRFs)

Markov chains and Markov Random Fields (MRFs) Markov chains and Markov Random Fields (MRFs) 1 Why Markov Models We discuss Markov models now. This is the simplest statistical model in which we don t assume that all variables are independent; we assume

More information

The world s largest matrix computation. (This chapter is out of date and needs a major overhaul.)

The world s largest matrix computation. (This chapter is out of date and needs a major overhaul.) Chapter 7 Google PageRank The world s largest matrix computation. (This chapter is out of date and needs a major overhaul.) One of the reasons why Google TM is such an effective search engine is the PageRank

More information

Markov Chains, Stochastic Processes, and Advanced Matrix Decomposition

Markov Chains, Stochastic Processes, and Advanced Matrix Decomposition Markov Chains, Stochastic Processes, and Advanced Matrix Decomposition Jack Gilbert Copyright (c) 2014 Jack Gilbert. Permission is granted to copy, distribute and/or modify this document under the terms

More information

4.9 Markov matrices. DEFINITION 4.3 A real n n matrix A = [a ij ] is called a Markov matrix, or row stochastic matrix if. (i) a ij 0 for 1 i, j n;

4.9 Markov matrices. DEFINITION 4.3 A real n n matrix A = [a ij ] is called a Markov matrix, or row stochastic matrix if. (i) a ij 0 for 1 i, j n; 49 Markov matrices DEFINITION 43 A real n n matrix A = [a ij ] is called a Markov matrix, or row stochastic matrix if (i) a ij 0 for 1 i, j n; (ii) a ij = 1 for 1 i n Remark: (ii) is equivalent to AJ n

More information

Lecture12 The Perron-Frobenius theorem.

Lecture12 The Perron-Frobenius theorem. Lecture12 The Perron-Frobenius theorem. 1 Statement of the theorem. 2 Proof of the Perron Frobenius theorem. 3 Graphology. 3 Asymptotic behavior. The non-primitive case. 4 The Leslie model of population

More information

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

More information

princeton university F 02 cos 597D: a theorist s toolkit Lecture 7: Markov Chains and Random Walks

princeton university F 02 cos 597D: a theorist s toolkit Lecture 7: Markov Chains and Random Walks princeton university F 02 cos 597D: a theorist s toolkit Lecture 7: Markov Chains and Random Walks Lecturer: Sanjeev Arora Scribe:Elena Nabieva 1 Basics A Markov chain is a discrete-time stochastic process

More information

UNCOUPLING THE PERRON EIGENVECTOR PROBLEM

UNCOUPLING THE PERRON EIGENVECTOR PROBLEM UNCOUPLING THE PERRON EIGENVECTOR PROBLEM Carl D Meyer INTRODUCTION Foranonnegative irreducible matrix m m with spectral radius ρ,afundamental problem concerns the determination of the unique normalized

More information

Combining BibTeX and PageRank to Classify WWW Pages

Combining BibTeX and PageRank to Classify WWW Pages Combining BibTeX and PageRank to Classify WWW Pages MIKE JOY, GRAEME COLE and JONATHAN LOUGHRAN University of Warwick IAN BURNETT IBM United Kingdom and ANDREW TWIGG Cambridge University Focused search

More information

1. (First passage/hitting times/gambler s ruin problem:) Suppose that X has a discrete state space and let i be a fixed state. Let

1. (First passage/hitting times/gambler s ruin problem:) Suppose that X has a discrete state space and let i be a fixed state. Let Copyright c 2009 by Karl Sigman 1 Stopping Times 1.1 Stopping Times: Definition Given a stochastic process X = {X n : n 0}, a random time τ is a discrete random variable on the same probability space as

More information

Local Approximation of PageRank and Reverse PageRank

Local Approximation of PageRank and Reverse PageRank Local Approximation of PageRank and Reverse PageRank Ziv Bar-Yossef Department of Electrical Engineering Technion, Haifa, Israel and Google Haifa Engineering Center Haifa, Israel zivby@ee.technion.ac.il

More information

Markov Chains, part I

Markov Chains, part I Markov Chains, part I December 8, 2010 1 Introduction A Markov Chain is a sequence of random variables X 0, X 1,, where each X i S, such that P(X i+1 = s i+1 X i = s i, X i 1 = s i 1,, X 0 = s 0 ) = P(X

More information

Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.

Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom. Some Polynomial Theorems by John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.com This paper contains a collection of 31 theorems, lemmas,

More information

Lecture 4 Online and streaming algorithms for clustering

Lecture 4 Online and streaming algorithms for clustering CSE 291: Geometric algorithms Spring 2013 Lecture 4 Online and streaming algorithms for clustering 4.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line

More information

Approximation Algorithms

Approximation Algorithms Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

More information

Notes on Matrix Multiplication and the Transitive Closure

Notes on Matrix Multiplication and the Transitive Closure ICS 6D Due: Wednesday, February 25, 2015 Instructor: Sandy Irani Notes on Matrix Multiplication and the Transitive Closure An n m matrix over a set S is an array of elements from S with n rows and m columns.

More information

The Goldberg Rao Algorithm for the Maximum Flow Problem

The Goldberg Rao Algorithm for the Maximum Flow Problem The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }

More information

Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

More information

Introduction to Stationary Distributions

Introduction to Stationary Distributions 13 Introduction to Stationary Distributions We first briefly review the classification of states in a Markov chain with a quick example and then begin the discussion of the important notion of stationary

More information

The Perron-Frobenius theorem.

The Perron-Frobenius theorem. Chapter 9 The Perron-Frobenius theorem. The theorem we will discuss in this chapter (to be stated below) about matrices with non-negative entries, was proved, for matrices with strictly positive entries,

More information

Link Analysis. Chapter 5. 5.1 PageRank

Link Analysis. Chapter 5. 5.1 PageRank Chapter 5 Link Analysis One of the biggest changes in our lives in the decade following the turn of the century was the availability of efficient and accurate Web search, through search engines such as

More information

max cx s.t. Ax c where the matrix A, cost vector c and right hand side b are given and x is a vector of variables. For this example we have x

max cx s.t. Ax c where the matrix A, cost vector c and right hand side b are given and x is a vector of variables. For this example we have x Linear Programming Linear programming refers to problems stated as maximization or minimization of a linear function subject to constraints that are linear equalities and inequalities. Although the study

More information

MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors. Jordan canonical form (continued).

MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors. Jordan canonical form (continued). MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors Jordan canonical form (continued) Jordan canonical form A Jordan block is a square matrix of the form λ 1 0 0 0 0 λ 1 0 0 0 0 λ 0 0 J = 0

More information

Continuous-Time Markov Chains - Introduction

Continuous-Time Markov Chains - Introduction 25 Continuous-Time Markov Chains - Introduction Prior to introducing continuous-time Markov chains today, let us start off with an example involving the Poisson process. Our particular focus in this example

More information

1 Short Introduction to Time Series

1 Short Introduction to Time Series ECONOMICS 7344, Spring 202 Bent E. Sørensen January 24, 202 Short Introduction to Time Series A time series is a collection of stochastic variables x,.., x t,.., x T indexed by an integer value t. The

More information

Online edition (c)2009 Cambridge UP

Online edition (c)2009 Cambridge UP DRAFT! April 1, 2009 Cambridge University Press. Feedback welcome. 461 21 Link analysis The analysis of hyperlinks and the graph structure of the Web has been instrumental in the development of web search.

More information

Lesson 3. Algebraic graph theory. Sergio Barbarossa. Rome - February 2010

Lesson 3. Algebraic graph theory. Sergio Barbarossa. Rome - February 2010 Lesson 3 Algebraic graph theory Sergio Barbarossa Basic notions Definition: A directed graph (or digraph) composed by a set of vertices and a set of edges We adopt the convention that the information flows

More information

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Analysis of Web Archives. Vinay Goel Senior Data Engineer Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner

More information

Lecture 9. 1 Introduction. 2 Random Walks in Graphs. 1.1 How To Explore a Graph? CS-621 Theory Gems October 17, 2012

Lecture 9. 1 Introduction. 2 Random Walks in Graphs. 1.1 How To Explore a Graph? CS-621 Theory Gems October 17, 2012 CS-62 Theory Gems October 7, 202 Lecture 9 Lecturer: Aleksander Mądry Scribes: Dorina Thanou, Xiaowen Dong Introduction Over the next couple of lectures, our focus will be on graphs. Graphs are one of

More information

Magnitude of the crawling problem. Introduction to Information Retrieval Basic crawler operation

Magnitude of the crawling problem. Introduction to Information Retrieval  Basic crawler operation Magnitude of the crawling problem Introduction to Information Retrieval http://informationretrieval.org IIR 20: Crawling Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart

More information

Effective Page Refresh Policies For Web Crawlers

Effective Page Refresh Policies For Web Crawlers Effective Page Refresh Policies For Web Crawlers JUNGHOO CHO University of California, Los Angeles and HECTOR GARCIA-MOLINA Stanford University In this paper we study how we can maintain local copies of

More information

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation: CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm

More information

Social Media Mining. Network Measures

Social Media Mining. Network Measures Klout Measures and Metrics 22 Why Do We Need Measures? Who are the central figures (influential individuals) in the network? What interaction patterns are common in friends? Who are the like-minded users

More information

MATH 551 - APPLIED MATRIX THEORY

MATH 551 - APPLIED MATRIX THEORY MATH 55 - APPLIED MATRIX THEORY FINAL TEST: SAMPLE with SOLUTIONS (25 points NAME: PROBLEM (3 points A web of 5 pages is described by a directed graph whose matrix is given by A Do the following ( points

More information

Reading 7 : Program Correctness

Reading 7 : Program Correctness CS/Math 240: Introduction to Discrete Mathematics Fall 2015 Instructors: Beck Hasti, Gautam Prakriya Reading 7 : Program Correctness 7.1 Program Correctness Showing that a program is correct means that

More information

The Characteristic Polynomial

The Characteristic Polynomial Physics 116A Winter 2011 The Characteristic Polynomial 1 Coefficients of the characteristic polynomial Consider the eigenvalue problem for an n n matrix A, A v = λ v, v 0 (1) The solution to this problem

More information

Crawling and web indexes CE-324: Modern Information Retrieval Sharif University of Technology

Crawling and web indexes CE-324: Modern Information Retrieval Sharif University of Technology Crawling and web indexes CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2015 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Similarity and Diagonalization. Similar Matrices

Similarity and Diagonalization. Similar Matrices MATH022 Linear Algebra Brief lecture notes 48 Similarity and Diagonalization Similar Matrices Let A and B be n n matrices. We say that A is similar to B if there is an invertible n n matrix P such that

More information

BIN504 - Lecture XI. Bayesian Inference & Markov Chains

BIN504 - Lecture XI. Bayesian Inference & Markov Chains BIN504 - Lecture XI Bayesian Inference & References: Lee, Chp. 11 & Sec. 10.11 Ewens-Grant, Chps. 4, 7, 11 c 2012 Aybar C. Acar Including slides by Tolga Can Rev. 1.2 (Build 20130528191100) BIN 504 - Bayes

More information

Theorem (The division theorem) Suppose that a and b are integers with b > 0. There exist unique integers q and r so that. a = bq + r and 0 r < b.

Theorem (The division theorem) Suppose that a and b are integers with b > 0. There exist unique integers q and r so that. a = bq + r and 0 r < b. Theorem (The division theorem) Suppose that a and b are integers with b > 0. There exist unique integers q and r so that a = bq + r and 0 r < b. We re dividing a by b: q is the quotient and r is the remainder,

More information

Continuous-time Markov Chains

Continuous-time Markov Chains Continuous-time Markov Chains Gonzalo Mateos Dept. of ECE and Goergen Institute for Data Science University of Rochester gmateosb@ece.rochester.edu http://www.ece.rochester.edu/~gmateosb/ October 31, 2016

More information

Lecture 6 Online and streaming algorithms for clustering

Lecture 6 Online and streaming algorithms for clustering CSE 291: Unsupervised learning Spring 2008 Lecture 6 Online and streaming algorithms for clustering 6.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line

More information

Math Review. for the Quantitative Reasoning Measure of the GRE revised General Test

Math Review. for the Quantitative Reasoning Measure of the GRE revised General Test Math Review for the Quantitative Reasoning Measure of the GRE revised General Test www.ets.org Overview This Math Review will familiarize you with the mathematical skills and concepts that are important

More information

Mathematics Course 111: Algebra I Part IV: Vector Spaces

Mathematics Course 111: Algebra I Part IV: Vector Spaces Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are

More information

THE $25,000,000,000 EIGENVECTOR THE LINEAR ALGEBRA BEHIND GOOGLE

THE $25,000,000,000 EIGENVECTOR THE LINEAR ALGEBRA BEHIND GOOGLE THE $5,,, EIGENVECTOR THE LINEAR ALGEBRA BEHIND GOOGLE KURT BRYAN AND TANYA LEISE Abstract. Google s success derives in large part from its PageRank algorithm, which ranks the importance of webpages according

More information

Report: Declarative Machine Learning on MapReduce (SystemML)

Report: Declarative Machine Learning on MapReduce (SystemML) Report: Declarative Machine Learning on MapReduce (SystemML) Jessica Falk ETH-ID 11-947-512 May 28, 2014 1 Introduction SystemML is a system used to execute machine learning (ML) algorithms in HaDoop,

More information

Web Search Engines. Search Engine Characteristics. Web Search Queries. Chapter 27, Part C Based on Larson and Hearst s slides at UC-Berkeley

Web Search Engines. Search Engine Characteristics. Web Search Queries. Chapter 27, Part C Based on Larson and Hearst s slides at UC-Berkeley Web Search Engines Chapter 27, Part C Based on Larson and Hearst s slides at UC-Berkeley http://www.sims.berkeley.edu/courses/is202/f00/ Database Management Systems, R. Ramakrishnan 1 Search Engine Characteristics

More information

CS6322: Information Retrieval Sanda Harabagiu. Lecture 11: Crawling and web indexes

CS6322: Information Retrieval Sanda Harabagiu. Lecture 11: Crawling and web indexes Sanda Harabagiu Lecture 11: Crawling and web indexes Today s lecture Crawling Connectivity servers Sec. 20.2 Basic crawler operation Begin with known seed URLs Fetch and parse them Extract URLs they point

More information

Power Rankings: Math for March Madness

Power Rankings: Math for March Madness Power Rankings: Math for March Madness James A. Swenson University of Wisconsin Platteville swensonj@uwplatt.edu March 5, 2011 Math Club Madison Area Technical College James A. Swenson (UWP) Power Rankings:

More information

Analysis and Computation of Google s PageRank

Analysis and Computation of Google s PageRank Analysis and Computation of Google s PageRank Ilse Ipsen North Carolina State University Joint work with Rebecca M. Wills IMACS p.1 Overview Goal: Compute (citation) importance of a web page Simple Web

More information

LECTURE 4. Last time: Lecture outline

LECTURE 4. Last time: Lecture outline LECTURE 4 Last time: Types of convergence Weak Law of Large Numbers Strong Law of Large Numbers Asymptotic Equipartition Property Lecture outline Stochastic processes Markov chains Entropy rate Random

More information

Tools for the analysis and design of communication networks with Markovian dynamics

Tools for the analysis and design of communication networks with Markovian dynamics 1 Tools for the analysis and design of communication networks with Markovian dynamics Arie Leizarowitz, Robert Shorten, Rade Stanoević Abstract In this paper we analyze the stochastic properties of a class

More information

Corso di Biblioteche Digitali

Corso di Biblioteche Digitali Corso di Biblioteche Digitali Vittore Casarosa casarosa@isti.cnr.it tel. 050-315 3115 cell. 348-397 2168 Ricevimento dopo la lezione o per appuntamento Valutazione finale 70-75% esame orale 25-30% progetto

More information

Markov Chains. Chapter Introduction and Definitions 110SOR201(2002)

Markov Chains. Chapter Introduction and Definitions 110SOR201(2002) page 5 SOR() Chapter Markov Chains. Introduction and Definitions Consider a sequence of consecutive times ( or trials or stages): n =,,,... Suppose that at each time a probabilistic experiment is performed,

More information

University of Chicago Autumn 2003 CS Markov Chain Monte Carlo Methods

University of Chicago Autumn 2003 CS Markov Chain Monte Carlo Methods University of Chicago Autumn 2003 CS37101-1 Markov Chain Monte Carlo Methods Lecture 2: October 7, 2003 Markov Chains, Coupling, Stationary Distribution Eric Vigoda 2.1 Markov Chains In this lecture, we

More information

A short proof of Perron s theorem. Hannah Cairns, April 25, 2014.

A short proof of Perron s theorem. Hannah Cairns, April 25, 2014. A short proof of Perron s theorem. Hannah Cairns, April 25, 2014. A matrix A or a vector ψ is said to be positive if every component is a positive real number. A B means that every component of A is greater

More information

Random access protocols for channel access. Markov chains and their stability. Laurent Massoulié.

Random access protocols for channel access. Markov chains and their stability. Laurent Massoulié. Random access protocols for channel access Markov chains and their stability laurent.massoulie@inria.fr Aloha: the first random access protocol for channel access [Abramson, Hawaii 70] Goal: allow machines

More information

Recap CS276. Plan for today. Link-based ranking. Link spam. Link farms and link exchanges. Wrap up spam Crawling Connectivity servers

Recap CS276. Plan for today. Link-based ranking. Link spam. Link farms and link exchanges. Wrap up spam Crawling Connectivity servers Recap CS276 Lecture 15 Plan for today Wrap up spam Crawling Connectivity servers In the last lecture we introduced web search Paid placement SEO/Spam Link-based ranking Most search engines use hyperlink

More information

1. Markov chains. 1.1 Specifying and simulating a Markov chain. Page 5

1. Markov chains. 1.1 Specifying and simulating a Markov chain. Page 5 1. Markov chains Page 5 Section 1. What is a Markov chain? How to simulate one. Section 2. The Markov property. Section 3. How matrix multiplication gets into the picture. Section 4. Statement of the Basic

More information

Performance Analysis of a Telephone System with both Patient and Impatient Customers

Performance Analysis of a Telephone System with both Patient and Impatient Customers Performance Analysis of a Telephone System with both Patient and Impatient Customers Yiqiang Quennel Zhao Department of Mathematics and Statistics University of Winnipeg Winnipeg, Manitoba Canada R3B 2E9

More information

Math 312 Lecture Notes Markov Chains

Math 312 Lecture Notes Markov Chains Math 312 Lecture Notes Markov Chains Warren Weckesser Department of Mathematics Colgate University Updated, 30 April 2005 Markov Chains A (finite) Markov chain is a process with a finite number of states

More information

Chapter 4. Trees. 4.1 Basics

Chapter 4. Trees. 4.1 Basics Chapter 4 Trees 4.1 Basics A tree is a connected graph with no cycles. A forest is a collection of trees. A vertex of degree one, particularly in a tree, is called a leaf. Trees arise in a variety of applications.

More information

CS 103X: Discrete Structures Homework Assignment 3 Solutions

CS 103X: Discrete Structures Homework Assignment 3 Solutions CS 103X: Discrete Structures Homework Assignment 3 s Exercise 1 (20 points). On well-ordering and induction: (a) Prove the induction principle from the well-ordering principle. (b) Prove the well-ordering

More information

A simple criterion on degree sequences of graphs

A simple criterion on degree sequences of graphs Discrete Applied Mathematics 156 (2008) 3513 3517 Contents lists available at ScienceDirect Discrete Applied Mathematics journal homepage: www.elsevier.com/locate/dam Note A simple criterion on degree

More information

Optimization of the HOTS score of a website s pages

Optimization of the HOTS score of a website s pages Optimization of the HOTS score of a website s pages Olivier Fercoq and Stéphane Gaubert INRIA Saclay and CMAP Ecole Polytechnique June 21st, 2012 Toy example with 21 pages 3 5 6 2 4 20 1 12 9 11 15 16

More information

A characterization of trace zero symmetric nonnegative 5x5 matrices

A characterization of trace zero symmetric nonnegative 5x5 matrices A characterization of trace zero symmetric nonnegative 5x5 matrices Oren Spector June 1, 009 Abstract The problem of determining necessary and sufficient conditions for a set of real numbers to be the

More information

Link-based Analysis on Large Graphs. Presented by Weiren Yu Mar 01, 2011

Link-based Analysis on Large Graphs. Presented by Weiren Yu Mar 01, 2011 Link-based Analysis on Large Graphs Presented by Weiren Yu Mar 01, 2011 Overview 1 Introduction 2 Problem Definition 3 Optimization Techniques 4 Experimental Results 2 1. Introduction Many applications

More information

2.3 Convex Constrained Optimization Problems

2.3 Convex Constrained Optimization Problems 42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions

More information

1 o Semestre 2007/2008

1 o Semestre 2007/2008 Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction

More information

An Alternative Web Search Strategy? Abstract

An Alternative Web Search Strategy? Abstract An Alternative Web Search Strategy? V.-H. Winterer, Rechenzentrum Universität Freiburg (Dated: November 2007) Abstract We propose an alternative Web search strategy taking advantage of the knowledge on

More information

Cyclotomic Extensions

Cyclotomic Extensions Chapter 7 Cyclotomic Extensions A cyclotomic extension Q(ζ n ) of the rationals is formed by adjoining a primitive n th root of unity ζ n. In this chapter, we will find an integral basis and calculate

More information

Web Usage in Client-Server Design

Web Usage in Client-Server Design Web Search Web Usage in Client-Server Design A client (e.g., a browser) communicates with a server via http Hypertext transfer protocol: a lightweight and simple protocol asynchronously carrying a variety

More information

Math 7 Elementary Linear Algebra MARKOV CHAINS. Definition of Experiment An experiment is an activity with a definite, observable outcome.

Math 7 Elementary Linear Algebra MARKOV CHAINS. Definition of Experiment An experiment is an activity with a definite, observable outcome. T. Henson I. Basic Concepts from Probability Examples of experiments: Math 7 Elementary Linear Algebra MARKOV CHAINS Definition of Experiment An experiment is an activity with a definite, observable outcome.

More information

Linear Programming. March 14, 2014

Linear Programming. March 14, 2014 Linear Programming March 1, 01 Parts of this introduction to linear programming were adapted from Chapter 9 of Introduction to Algorithms, Second Edition, by Cormen, Leiserson, Rivest and Stein [1]. 1

More information

Math 115A HW4 Solutions University of California, Los Angeles. 5 2i 6 + 4i. (5 2i)7i (6 + 4i)( 3 + i) = 35i + 14 ( 22 6i) = 36 + 41i.

Math 115A HW4 Solutions University of California, Los Angeles. 5 2i 6 + 4i. (5 2i)7i (6 + 4i)( 3 + i) = 35i + 14 ( 22 6i) = 36 + 41i. Math 5A HW4 Solutions September 5, 202 University of California, Los Angeles Problem 4..3b Calculate the determinant, 5 2i 6 + 4i 3 + i 7i Solution: The textbook s instructions give us, (5 2i)7i (6 + 4i)(

More information

Notes on Factoring. MA 206 Kurt Bryan

Notes on Factoring. MA 206 Kurt Bryan The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor

More information

Eigenvalues and eigenvectors of a matrix

Eigenvalues and eigenvectors of a matrix Eigenvalues and eigenvectors of a matrix Definition: If A is an n n matrix and there exists a real number λ and a non-zero column vector V such that AV = λv then λ is called an eigenvalue of A and V is

More information