# Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time

Save this PDF as:

Size: px
Start display at page:

Download "Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time"

## Transcription

1 Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa Class Outline Link-based page importance measures Why link-based? Mathematical background PageRank Crawlers What are crawlers Algorithm for computing page importance while crawling Distributed implementation calls for NoSQL key-value database 3 May Big Data Technology 2 1

2 Link Analysis - Motivation Traditional text-based ranking methods from the field of Information Retrieval are not sufficient on the Web: The web is huge, with great variation in the quality of the pages even when those pages contain similar text Queries are usually very short, making differentiation between pages difficult The text on many web pages does not sufficiently describe the page Keyword spamming The emphasis of web search is on precision, not recall! 3 May Big Data Technology 3 Link Analysis - Motivation The connectivity patterns between Web pages contain a gold mine of information A link from page a to page b can often be interpreted as: 1. A recommendation, by a s author, of the contents of b 2. Evidence that pages a,b share some topic of interest A co-citation of a and b (by a third page c) may also constitute evidence that a,b share some topic of interest a c b Analyzing linkage patterns at scale can identify oft-praised pages and topical affinities between pages 3 May Big Data Technology 4 2

3 Mathematical Background - Irreducibility A directed graph G=(V,E) is called irreducible if for every i,k V there is a path in G originating at i and ending in k A non-negative NxN matrix W is called irreducible if for every i,k {1,,N} there exists a non-negative integer m such that [W m ] i,k >0 The support graph G W ={V W,E W } of a non-negative NxN matrix W is a directed graph with N vertices, such that i k E W if and only if W i,k >0 Lemma: a non-negative square matrix W is irreducible if and only if G W is irreducible Lemma: a directed graph G is irreducible if and only if the adjacency matrix of G is irreducible 3 May Big Data Technology 5 Mathematical Background: Primitive & Stochastic Matrices Definition: a non-negative NxN matrix P is stochastic if the sum of every row in P is 1 Definition: the period of a directed graph G is the greatest common divisor of the lengths of all the cycles in G G is called aperiodic if it has a period of 1 Definition: a non-negative NxN matrix M is called primitive if its support graph G M is aperiodic 3 May Big Data Technology 6 3

4 Mathematical Background: Ergodic Thoerem Let M be a non-negative NxN matrix, and denote by λ1(m), λ2(m),...λn(m) the N eigenvalues of M, ordered by non-decreasing absolute value (i.e. λ1(m) λ2 (M)... λn(m) ) λ1(m) is the spectral radius of M and will simply be denoted by λ(m) Ergodic Theorem: let P be an irreducible and primitive stochastic matrix λ(p) = λ1(p) = 1, and any other eigenvalue of λ* of P satisfies λ* <1 There is a unique distribution row-vector π which satisfies πp=π π is the principal eigenvector of P and is the stationary distribution of the Markov Chain defined by the transition matrix P For any distribution row-vector q, lim k q P k = π Note that the last bullet defines an iterative method to compute π 3 May Big Data Technology 7 PageRank (Brin & Page, 1998) Named after Google s co-founder, Larry Page A global, query independent importance measure of Web pages A page is considered important if it receives many links from important pages Based on Markov chains and random walks 3 May Big Data Technology 8 4

5 PageRank: Random Surfer Model A random surfer moves from page to page. Upon leaving page p, the surfer chooses one of two actions: 1. Follows an outgoing link of p (chosen uniformly at random), with probability d See next slide for a discussion of pages that have no outlinks 2. Jumps to an arbitrary Web page (chosen uniformly at random), with probability 1-d The vector of PageRanks is the stationary distribution of this (ergodic) random walk 3 May Big Data Technology 9 PageRank Handling Dangling Nodes PageRank as stated in the previous slide is not well defined with respect to exiting pages that have no outgoing links (dangling nodes) There are three accepted approaches for treating pages with no outgoing links: 1. Eliminate such pages from the graph (iteratively prune the graph until reaching a steady state) 2. Consider such pages to link back to the pages that link to them 3. Consider such pages to link to all web pages (effectively making an exit out of them equivalent to a random jump) 3 May Big Data Technology 10 5

6 PageRank: Steady State Equations The PageRanks obey the following equations: R(p) = (1-d)/N + d Σ R(j)/D(j) j I(p) R(p) The PageRank of page p. d A damping factor, 0 < d < 1 N Number of Web pages I(p) The set of pages that point to p D(j) Number of out-links (out-degree) of page j 3 May Big Data Technology 11 PageRank Algebraic Notation Let W denote the NxN adjacency matrix of the Web s link structure, after some form of handling dangling nodes Let W norm denote the matrix that results by dividing each row j of W by j s out-degree (row j s sum) Let T by an NxN matrix whose entries are all equal to 1/N Define M = (1-d)T + dw norm M is the transition matrix corresponding to PageRank s random walk M s principal eigenvector is the vector of PageRanks 3 May Big Data Technology 12 6

7 Crawlers - Introduction The role of crawlers is to collect Web content Starting with some seed URLs, crawlers learn of additional crawl targets via hyperlinks on the crawled pages Several types of crawlers: Batch crawlers crawl a snapshot of their crawl space, until reaching a certain size or time limit Incremental crawlers continuously crawl their crawl space, revisiting URLs to ensure freshness Focused crawlers attempt to crawl pages pertaining to some topic/theme, while minimizing number of off-topic pages that are collected Web scale crawling is carried out by distributed crawlers, complicating what is conceptually a simple operation Resources consumed: bandwidth, computation time (beyond communication e.g. parsing), storage space 3 May Big Data Technology 14 Generic Web Crawling Algorithm Given a root set of distinct URLs: Put the root URLs in a (priority) queue While queue is not empty: Take the first URL x out of the queue Retrieve the contents of x from the web Do whatever you want with it Mark x as visited If x is an html page, parse it to find hyperlink URLs For each hyperlink URL y within x: If y hasn t been visited (perhaps lately), enqueue y with some priority Note that this algorithm may never stop on evolving graphs 3 May Big Data Technology 15 7

8 Adaptive On-Line Page Importance Computation (Abiteboul, Preda, Cobena WWW 2003) Assume we want to prioritize a continuous crawl by a PageRank-like link-based importance measure of Web pages One option is to build a link database as we crawl, and use it to compute PageRank every K crawl operations This is both difficult and expensive The paper above details a method to compute a PageRank-like measure in an on-line, asynchronous manner with little overhead in terms of computations, I/O and memory The description in the following slides is for a simplified version of a continuous crawl over N pages whose links never change (i.e. the set of pages to be continuously crawled is fixed, and only their text is subject to change) 3 May Big Data Technology 16 Page Importance Definition Let G=(V,E) be the (directed) link graph of the Web We build G =(V,E ) by adding a virtual node y to G, with links to and from all other nodes; thus, G will be strongly connected Formally, V = V {y}, E = E { x V: x y, y x } Whenever E >0, G is also aperiodic Let M(G ) denote the row-normalized adjacency matrix of G ; obviously M(G ) is stochastic, and by the above assumption also Ergodic Let π denote the principal eigenvector of M(G ) (and stationary distribution of the Markov chain it represents), i.e. π = π M(G ) We will define π as the Importance Vector of all nodes in G that the algorithm will compute 3 May Big Data Technology 17 8

9 On-Line Algorithm for Computing Page Importance The algorithm assigns each page v the following variables: 1. Cash money C(v) 2. Bank money B(v) In addition, let L(v) denote the outlinks of page v, and let G denote the global amount of money in the bank. The algorithm itself proceeds as follows: Initialization: for all pages, let C(v) 1/N, B(v) 0; G 0 Repeat forever: Pick a page v to visit according to the Visit Strategy* Distribute (evenly) an amount equal to C(v) among v s children, i.e. j L(v), C(j) += C(v) / L(v) Deposit v s cash in the bank: B(v) += C(v), G += C(v), C(v) 0 Fairness condition*: every page is visited infinitely often 3 May Big Data Technology 18 Analysis of the Algorithm Three Lemmas Lemma 1: at all times, Σ v C(v) = 1 (proof by easy induction) Lemma 2: at all times, B(v)+C(v)=1/N + Σ j:j v M(G ) j,v B(j) Proof: also by induction, which is trivial at time zero. The step analyzes three distinct cases of which page is visited: 1. v is visited, and then nothing changes on the RHS while money is just moved from C(v) to B(v) on the LHS 2. A page j that links to v is visited. The LHS grows by C(j)/ L(j), which is the exact increment of the RHS 3. Otherwise neither side changes Lemma 3: G goes to infinity as the algorithm proceeds Proof: at any time t there is at least one page x whose cash amount is at least 1/N. Since each page is visited infinitely often, there is a finite t >t in which x is visited, thus G will increase by at least 1/N by time t 3 May Big Data Technology 19 9

10 Analysis of the Algorithm Main Theorem Lemma 1: at all times, Σ v C(v) = 1 (proof by easy induction) Lemma 2: at all times, B(v)+C(v)=1/N + Σ j:j v M(G ) j,v B(j) Lemma 3: G goes to infinity as the algorithm proceeds Let B be the normalized bank vector, i.e. B v =B(v)/G. Thus, B is a distribution vector. Theorem: B*M(G )-B 0 as the algorithm proceeds Proof: examine the v th coordinate: B*M(G )-B v = G -1 * B(v) - Σ j B(j) M(G ) j,v = G -1 * B(v) + C(v) C(v) - Σ j:j v B(j) M(G ) j,v = G -1 * 1/N C(v) < G -1 0 Conclusion: B π, the stationary distribution of M(G ) 3 May Big Data Technology 20 Possible Visit Strategies Crawling Policies 1. Round Robin (obviously fair) 2. Random choose the next node to visit u.a.r., thus guaranteeing that the probability of each page being visited infinitely often is 1 3. Greedy - visit the node with maximal cash, thus increasing G in the fastest possible manner Why are all nodes are visited infinitely often? 3 May Big Data Technology 21 10

11 Additional Notes 1. There are adaptations of the algorithm to the case of evolving graphs and to a distributed implementation 2. Estimating v s importance by (B(v)+C(v)) / (G+1) is slightly better than just using B(v)/G 3 May Big Data Technology 22 Distributed Implementation As mentioned earlier, Web-scale crawling is a distributed task carried out on multiple machines Every visit to page v at crawl time requires access to B(v), G, C(v) and C(j) for every neighbor j of v The visits will happen across many crawl nodes Even repeat visits to the same node v may happen on different nodes (fault tolerance) Consequently, we need read and write access to the above variables across all nodes! 3 May Big Data Technology 23 11

12 Distributed Implementation Given a visit to v, the following transaction should be performed: 1. Foreach j L(v): 1. C(j) += C(v) / L(v) 2. B(v) += C(v) 3. G += C(v) 4. C(v) 0 1. t C(v) 2. C(v) 0 3. Foreach j L(v): 1. C(j) += t / L(v) 4. B(v) += t 5. G += t Re-examining the required consistency, we can rewrite as: 1. Lock C(v) 2. t C(v) 3. C(v) 0 4. Unlock C(v) 5. Foreach j L(v): 1. C(j) += t / L(v) 6. B(v) += t 7. G += t 3 May Big Data Technology 24 NoSQL Databases Not Only SQL: a class of database services that traded off most functionality of full-blown SQL databases for extreme scalability Simplest manifest: super-scalable key-value store service Pioneered by Google (BigTable); now many other instances Representative high-level API of a tabular NoSQL database: lock(row), unlock(row) get(row, [column]), put(row, [column, value]*) increment(row, [column, incr-value]*) 1. lock(v) 2. t= get(v, C) 3. put(v,c,0) 4. unlock(v) 5. Foreach j L(v): 1. increment (j, C, t/ L(v) 6. increment(v, B, t) 7. increment(g,,t) 3 May Big Data Technology 25 12

13 Next Class Properties and design of scalable real-time NoSQL key-value stores, primarily BigTable and HBase 3 May Big Data Technology 26 13

### MATH 56A SPRING 2008 STOCHASTIC PROCESSES 31

MATH 56A SPRING 2008 STOCHASTIC PROCESSES 3.3. Invariant probability distribution. Definition.4. A probability distribution is a function π : S [0, ] from the set of states S to the closed unit interval

### The Mathematics of Internet Search Engines

The Mathematics of Internet Search Engines David Marshall Department of Mathematics Monmouth University April 4, 2007 Introduction Search Engines, Then and Now Then... Now... Pagerank Outline Introduction

### Web Graph Analyzer Tool

Web Graph Analyzer Tool Konstantin Avrachenkov INRIA Sophia Antipolis 2004, route des Lucioles, B.P.93 06902, France Email: K.Avrachenkov@sophia.inria.fr Danil Nemirovsky St.Petersburg State University

### DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

### The PageRank Citation Ranking: Bring Order to the Web

The PageRank Citation Ranking: Bring Order to the Web presented by: Xiaoxi Pang 25.Nov 2010 1 / 20 Outline Introduction A ranking for every page on the Web Implementation Convergence Properties Personalized

### Part 1: Link Analysis & Page Rank

Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Exam on the 5th of February, 216, 14. to 16. If you wish to attend, please

### 6.042/18.062J Mathematics for Computer Science October 3, 2006 Tom Leighton and Ronitt Rubinfeld. Graph Theory III

6.04/8.06J Mathematics for Computer Science October 3, 006 Tom Leighton and Ronitt Rubinfeld Lecture Notes Graph Theory III Draft: please check back in a couple of days for a modified version of these

### Search engines: ranking algorithms

Search engines: ranking algorithms Gianna M. Del Corso Dipartimento di Informatica, Università di Pisa, Italy ESP, 25 Marzo 2015 1 Statistics 2 Search Engines Ranking Algorithms HITS Web Analytics Estimated

### The Second Eigenvalue of the Google Matrix

0 2 The Second Eigenvalue of the Google Matrix Taher H Haveliwala and Sepandar D Kamvar Stanford University taherh,sdkamvar @csstanfordedu Abstract We determine analytically the modulus of the second eigenvalue

### Introduction to Flocking {Stochastic Matrices}

Supelec EECI Graduate School in Control Introduction to Flocking {Stochastic Matrices} A. S. Morse Yale University Gif sur - Yvette May 21, 2012 CRAIG REYNOLDS - 1987 BOIDS The Lion King CRAIG REYNOLDS

### Seminar assignment 1. 1 Part

Seminar assignment 1 Each seminar assignment consists of two parts, one part consisting of 3 5 elementary problems for a maximum of 10 points from each assignment. For the second part consisting of problems

### Practical Graph Mining with R. 5. Link Analysis

Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities

### Eigenvalues and Markov Chains

Eigenvalues and Markov Chains Will Perkins April 15, 2013 The Metropolis Algorithm Say we want to sample from a different distribution, not necessarily uniform. Can we change the transition rates in such

### 15 Markov Chains: Limiting Probabilities

MARKOV CHAINS: LIMITING PROBABILITIES 67 Markov Chains: Limiting Probabilities Example Assume that the transition matrix is given by 7 2 P = 6 Recall that the n-step transition probabilities are given

### 1 Limiting distribution for a Markov chain

Copyright c 2009 by Karl Sigman Limiting distribution for a Markov chain In these Lecture Notes, we shall study the limiting behavior of Markov chains as time n In particular, under suitable easy-to-check

### Perron vector Optimization applied to search engines

Perron vector Optimization applied to search engines Olivier Fercoq INRIA Saclay and CMAP Ecole Polytechnique May 18, 2011 Web page ranking The core of search engines Semantic rankings (keywords) Hyperlink

### CHAPTER III - MARKOV CHAINS

CHAPTER III - MARKOV CHAINS JOSEPH G. CONLON 1. General Theory of Markov Chains We have already discussed the standard random walk on the integers Z. A Markov Chain can be viewed as a generalization of

### Enhancing the Ranking of a Web Page in the Ocean of Data

Database Systems Journal vol. IV, no. 3/2013 3 Enhancing the Ranking of a Web Page in the Ocean of Data Hitesh KUMAR SHARMA University of Petroleum and Energy Studies, India hkshitesh@gmail.com In today

### Bindel, Fall 2012 Matrix Computations (CS 6210) Week 8: Friday, Oct 12

Why eigenvalues? Week 8: Friday, Oct 12 I spend a lot of time thinking about eigenvalue problems. In part, this is because I look for problems that can be solved via eigenvalues. But I might have fewer

### S = {1, 2,..., n}. P (1, 1) P (1, 2)... P (1, n) P (2, 1) P (2, 2)... P (2, n) P = . P (n, 1) P (n, 2)... P (n, n)

last revised: 26 January 2009 1 Markov Chains A Markov chain process is a simple type of stochastic process with many social science applications. We ll start with an abstract description before moving

### IEOR 6711: Stochastic Models, I Fall 2012, Professor Whitt, Final Exam SOLUTIONS

IEOR 6711: Stochastic Models, I Fall 2012, Professor Whitt, Final Exam SOLUTIONS There are four questions, each with several parts. 1. Customers Coming to an Automatic Teller Machine (ATM) (30 points)

### Lecture 6: The Group Inverse

Lecture 6: The Group Inverse The matrix index Let A C n n, k positive integer. Then R(A k+1 ) R(A k ). The index of A, denoted IndA, is the smallest integer k such that R(A k ) = R(A k+1 ), or equivalently,

### Monte Carlo methods in PageRank computation: When one iteration is sufficient

Monte Carlo methods in PageRank computation: When one iteration is sufficient K.Avrachenkov, N. Litvak, D. Nemirovsky, N. Osipova Abstract PageRank is one of the principle criteria according to which Google

### Markov chains and Markov Random Fields (MRFs)

Markov chains and Markov Random Fields (MRFs) 1 Why Markov Models We discuss Markov models now. This is the simplest statistical model in which we don t assume that all variables are independent; we assume

### The world s largest matrix computation. (This chapter is out of date and needs a major overhaul.)

Chapter 7 Google PageRank The world s largest matrix computation. (This chapter is out of date and needs a major overhaul.) One of the reasons why Google TM is such an effective search engine is the PageRank

### Markov Chains, Stochastic Processes, and Advanced Matrix Decomposition

Markov Chains, Stochastic Processes, and Advanced Matrix Decomposition Jack Gilbert Copyright (c) 2014 Jack Gilbert. Permission is granted to copy, distribute and/or modify this document under the terms

### 4.9 Markov matrices. DEFINITION 4.3 A real n n matrix A = [a ij ] is called a Markov matrix, or row stochastic matrix if. (i) a ij 0 for 1 i, j n;

49 Markov matrices DEFINITION 43 A real n n matrix A = [a ij ] is called a Markov matrix, or row stochastic matrix if (i) a ij 0 for 1 i, j n; (ii) a ij = 1 for 1 i n Remark: (ii) is equivalent to AJ n

### Lecture12 The Perron-Frobenius theorem.

Lecture12 The Perron-Frobenius theorem. 1 Statement of the theorem. 2 Proof of the Perron Frobenius theorem. 3 Graphology. 3 Asymptotic behavior. The non-primitive case. 4 The Leslie model of population

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### princeton university F 02 cos 597D: a theorist s toolkit Lecture 7: Markov Chains and Random Walks

princeton university F 02 cos 597D: a theorist s toolkit Lecture 7: Markov Chains and Random Walks Lecturer: Sanjeev Arora Scribe:Elena Nabieva 1 Basics A Markov chain is a discrete-time stochastic process

### UNCOUPLING THE PERRON EIGENVECTOR PROBLEM

UNCOUPLING THE PERRON EIGENVECTOR PROBLEM Carl D Meyer INTRODUCTION Foranonnegative irreducible matrix m m with spectral radius ρ,afundamental problem concerns the determination of the unique normalized

### Combining BibTeX and PageRank to Classify WWW Pages

Combining BibTeX and PageRank to Classify WWW Pages MIKE JOY, GRAEME COLE and JONATHAN LOUGHRAN University of Warwick IAN BURNETT IBM United Kingdom and ANDREW TWIGG Cambridge University Focused search

### 1. (First passage/hitting times/gambler s ruin problem:) Suppose that X has a discrete state space and let i be a fixed state. Let

Copyright c 2009 by Karl Sigman 1 Stopping Times 1.1 Stopping Times: Definition Given a stochastic process X = {X n : n 0}, a random time τ is a discrete random variable on the same probability space as

### Local Approximation of PageRank and Reverse PageRank

Local Approximation of PageRank and Reverse PageRank Ziv Bar-Yossef Department of Electrical Engineering Technion, Haifa, Israel and Google Haifa Engineering Center Haifa, Israel zivby@ee.technion.ac.il

### Markov Chains, part I

Markov Chains, part I December 8, 2010 1 Introduction A Markov Chain is a sequence of random variables X 0, X 1,, where each X i S, such that P(X i+1 = s i+1 X i = s i, X i 1 = s i 1,, X 0 = s 0 ) = P(X

### Some Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.

Some Polynomial Theorems by John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.com This paper contains a collection of 31 theorems, lemmas,

### Lecture 4 Online and streaming algorithms for clustering

CSE 291: Geometric algorithms Spring 2013 Lecture 4 Online and streaming algorithms for clustering 4.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line

### Approximation Algorithms

Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

### Notes on Matrix Multiplication and the Transitive Closure

ICS 6D Due: Wednesday, February 25, 2015 Instructor: Sandy Irani Notes on Matrix Multiplication and the Transitive Closure An n m matrix over a set S is an array of elements from S with n rows and m columns.

### The Goldberg Rao Algorithm for the Maximum Flow Problem

The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }

### Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

### Introduction to Stationary Distributions

13 Introduction to Stationary Distributions We first briefly review the classification of states in a Markov chain with a quick example and then begin the discussion of the important notion of stationary

### The Perron-Frobenius theorem.

Chapter 9 The Perron-Frobenius theorem. The theorem we will discuss in this chapter (to be stated below) about matrices with non-negative entries, was proved, for matrices with strictly positive entries,

### Link Analysis. Chapter 5. 5.1 PageRank

Chapter 5 Link Analysis One of the biggest changes in our lives in the decade following the turn of the century was the availability of efficient and accurate Web search, through search engines such as

### max cx s.t. Ax c where the matrix A, cost vector c and right hand side b are given and x is a vector of variables. For this example we have x

Linear Programming Linear programming refers to problems stated as maximization or minimization of a linear function subject to constraints that are linear equalities and inequalities. Although the study

### MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors. Jordan canonical form (continued).

MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors Jordan canonical form (continued) Jordan canonical form A Jordan block is a square matrix of the form λ 1 0 0 0 0 λ 1 0 0 0 0 λ 0 0 J = 0

### Continuous-Time Markov Chains - Introduction

25 Continuous-Time Markov Chains - Introduction Prior to introducing continuous-time Markov chains today, let us start off with an example involving the Poisson process. Our particular focus in this example

### 1 Short Introduction to Time Series

ECONOMICS 7344, Spring 202 Bent E. Sørensen January 24, 202 Short Introduction to Time Series A time series is a collection of stochastic variables x,.., x t,.., x T indexed by an integer value t. The

### Online edition (c)2009 Cambridge UP

DRAFT! April 1, 2009 Cambridge University Press. Feedback welcome. 461 21 Link analysis The analysis of hyperlinks and the graph structure of the Web has been instrumental in the development of web search.

### Lesson 3. Algebraic graph theory. Sergio Barbarossa. Rome - February 2010

Lesson 3 Algebraic graph theory Sergio Barbarossa Basic notions Definition: A directed graph (or digraph) composed by a set of vertices and a set of edges We adopt the convention that the information flows

### Analysis of Web Archives. Vinay Goel Senior Data Engineer

Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner

### Lecture 9. 1 Introduction. 2 Random Walks in Graphs. 1.1 How To Explore a Graph? CS-621 Theory Gems October 17, 2012

CS-62 Theory Gems October 7, 202 Lecture 9 Lecturer: Aleksander Mądry Scribes: Dorina Thanou, Xiaowen Dong Introduction Over the next couple of lectures, our focus will be on graphs. Graphs are one of

### Magnitude of the crawling problem. Introduction to Information Retrieval Basic crawler operation

Magnitude of the crawling problem Introduction to Information Retrieval http://informationretrieval.org IIR 20: Crawling Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart

### Effective Page Refresh Policies For Web Crawlers

Effective Page Refresh Policies For Web Crawlers JUNGHOO CHO University of California, Los Angeles and HECTOR GARCIA-MOLINA Stanford University In this paper we study how we can maintain local copies of

### Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:

CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm

### Social Media Mining. Network Measures

Klout Measures and Metrics 22 Why Do We Need Measures? Who are the central figures (influential individuals) in the network? What interaction patterns are common in friends? Who are the like-minded users

### MATH 551 - APPLIED MATRIX THEORY

MATH 55 - APPLIED MATRIX THEORY FINAL TEST: SAMPLE with SOLUTIONS (25 points NAME: PROBLEM (3 points A web of 5 pages is described by a directed graph whose matrix is given by A Do the following ( points

### Reading 7 : Program Correctness

CS/Math 240: Introduction to Discrete Mathematics Fall 2015 Instructors: Beck Hasti, Gautam Prakriya Reading 7 : Program Correctness 7.1 Program Correctness Showing that a program is correct means that

### The Characteristic Polynomial

Physics 116A Winter 2011 The Characteristic Polynomial 1 Coefficients of the characteristic polynomial Consider the eigenvalue problem for an n n matrix A, A v = λ v, v 0 (1) The solution to this problem

### Crawling and web indexes CE-324: Modern Information Retrieval Sharif University of Technology

Crawling and web indexes CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2015 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

### Similarity and Diagonalization. Similar Matrices

MATH022 Linear Algebra Brief lecture notes 48 Similarity and Diagonalization Similar Matrices Let A and B be n n matrices. We say that A is similar to B if there is an invertible n n matrix P such that

### BIN504 - Lecture XI. Bayesian Inference & Markov Chains

BIN504 - Lecture XI Bayesian Inference & References: Lee, Chp. 11 & Sec. 10.11 Ewens-Grant, Chps. 4, 7, 11 c 2012 Aybar C. Acar Including slides by Tolga Can Rev. 1.2 (Build 20130528191100) BIN 504 - Bayes

### Theorem (The division theorem) Suppose that a and b are integers with b > 0. There exist unique integers q and r so that. a = bq + r and 0 r < b.

Theorem (The division theorem) Suppose that a and b are integers with b > 0. There exist unique integers q and r so that a = bq + r and 0 r < b. We re dividing a by b: q is the quotient and r is the remainder,

### Continuous-time Markov Chains

Continuous-time Markov Chains Gonzalo Mateos Dept. of ECE and Goergen Institute for Data Science University of Rochester gmateosb@ece.rochester.edu http://www.ece.rochester.edu/~gmateosb/ October 31, 2016

### Lecture 6 Online and streaming algorithms for clustering

CSE 291: Unsupervised learning Spring 2008 Lecture 6 Online and streaming algorithms for clustering 6.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line

### Math Review. for the Quantitative Reasoning Measure of the GRE revised General Test

Math Review for the Quantitative Reasoning Measure of the GRE revised General Test www.ets.org Overview This Math Review will familiarize you with the mathematical skills and concepts that are important

### Mathematics Course 111: Algebra I Part IV: Vector Spaces

Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are

### THE \$25,000,000,000 EIGENVECTOR THE LINEAR ALGEBRA BEHIND GOOGLE

THE \$5,,, EIGENVECTOR THE LINEAR ALGEBRA BEHIND GOOGLE KURT BRYAN AND TANYA LEISE Abstract. Google s success derives in large part from its PageRank algorithm, which ranks the importance of webpages according

### Report: Declarative Machine Learning on MapReduce (SystemML)

Report: Declarative Machine Learning on MapReduce (SystemML) Jessica Falk ETH-ID 11-947-512 May 28, 2014 1 Introduction SystemML is a system used to execute machine learning (ML) algorithms in HaDoop,

### Web Search Engines. Search Engine Characteristics. Web Search Queries. Chapter 27, Part C Based on Larson and Hearst s slides at UC-Berkeley

Web Search Engines Chapter 27, Part C Based on Larson and Hearst s slides at UC-Berkeley http://www.sims.berkeley.edu/courses/is202/f00/ Database Management Systems, R. Ramakrishnan 1 Search Engine Characteristics

### CS6322: Information Retrieval Sanda Harabagiu. Lecture 11: Crawling and web indexes

Sanda Harabagiu Lecture 11: Crawling and web indexes Today s lecture Crawling Connectivity servers Sec. 20.2 Basic crawler operation Begin with known seed URLs Fetch and parse them Extract URLs they point

### Power Rankings: Math for March Madness

Power Rankings: Math for March Madness James A. Swenson University of Wisconsin Platteville swensonj@uwplatt.edu March 5, 2011 Math Club Madison Area Technical College James A. Swenson (UWP) Power Rankings:

### Analysis and Computation of Google s PageRank

Analysis and Computation of Google s PageRank Ilse Ipsen North Carolina State University Joint work with Rebecca M. Wills IMACS p.1 Overview Goal: Compute (citation) importance of a web page Simple Web

### LECTURE 4. Last time: Lecture outline

LECTURE 4 Last time: Types of convergence Weak Law of Large Numbers Strong Law of Large Numbers Asymptotic Equipartition Property Lecture outline Stochastic processes Markov chains Entropy rate Random

### Tools for the analysis and design of communication networks with Markovian dynamics

1 Tools for the analysis and design of communication networks with Markovian dynamics Arie Leizarowitz, Robert Shorten, Rade Stanoević Abstract In this paper we analyze the stochastic properties of a class

### Corso di Biblioteche Digitali

Corso di Biblioteche Digitali Vittore Casarosa casarosa@isti.cnr.it tel. 050-315 3115 cell. 348-397 2168 Ricevimento dopo la lezione o per appuntamento Valutazione finale 70-75% esame orale 25-30% progetto

### Markov Chains. Chapter Introduction and Definitions 110SOR201(2002)

page 5 SOR() Chapter Markov Chains. Introduction and Definitions Consider a sequence of consecutive times ( or trials or stages): n =,,,... Suppose that at each time a probabilistic experiment is performed,

### University of Chicago Autumn 2003 CS Markov Chain Monte Carlo Methods

University of Chicago Autumn 2003 CS37101-1 Markov Chain Monte Carlo Methods Lecture 2: October 7, 2003 Markov Chains, Coupling, Stationary Distribution Eric Vigoda 2.1 Markov Chains In this lecture, we

### A short proof of Perron s theorem. Hannah Cairns, April 25, 2014.

A short proof of Perron s theorem. Hannah Cairns, April 25, 2014. A matrix A or a vector ψ is said to be positive if every component is a positive real number. A B means that every component of A is greater

### Random access protocols for channel access. Markov chains and their stability. Laurent Massoulié.

Random access protocols for channel access Markov chains and their stability laurent.massoulie@inria.fr Aloha: the first random access protocol for channel access [Abramson, Hawaii 70] Goal: allow machines

Recap CS276 Lecture 15 Plan for today Wrap up spam Crawling Connectivity servers In the last lecture we introduced web search Paid placement SEO/Spam Link-based ranking Most search engines use hyperlink

### 1. Markov chains. 1.1 Specifying and simulating a Markov chain. Page 5

1. Markov chains Page 5 Section 1. What is a Markov chain? How to simulate one. Section 2. The Markov property. Section 3. How matrix multiplication gets into the picture. Section 4. Statement of the Basic

### Performance Analysis of a Telephone System with both Patient and Impatient Customers

Performance Analysis of a Telephone System with both Patient and Impatient Customers Yiqiang Quennel Zhao Department of Mathematics and Statistics University of Winnipeg Winnipeg, Manitoba Canada R3B 2E9

### Math 312 Lecture Notes Markov Chains

Math 312 Lecture Notes Markov Chains Warren Weckesser Department of Mathematics Colgate University Updated, 30 April 2005 Markov Chains A (finite) Markov chain is a process with a finite number of states

### Chapter 4. Trees. 4.1 Basics

Chapter 4 Trees 4.1 Basics A tree is a connected graph with no cycles. A forest is a collection of trees. A vertex of degree one, particularly in a tree, is called a leaf. Trees arise in a variety of applications.

### CS 103X: Discrete Structures Homework Assignment 3 Solutions

CS 103X: Discrete Structures Homework Assignment 3 s Exercise 1 (20 points). On well-ordering and induction: (a) Prove the induction principle from the well-ordering principle. (b) Prove the well-ordering

### A simple criterion on degree sequences of graphs

Discrete Applied Mathematics 156 (2008) 3513 3517 Contents lists available at ScienceDirect Discrete Applied Mathematics journal homepage: www.elsevier.com/locate/dam Note A simple criterion on degree

### Optimization of the HOTS score of a website s pages

Optimization of the HOTS score of a website s pages Olivier Fercoq and Stéphane Gaubert INRIA Saclay and CMAP Ecole Polytechnique June 21st, 2012 Toy example with 21 pages 3 5 6 2 4 20 1 12 9 11 15 16

### A characterization of trace zero symmetric nonnegative 5x5 matrices

A characterization of trace zero symmetric nonnegative 5x5 matrices Oren Spector June 1, 009 Abstract The problem of determining necessary and sufficient conditions for a set of real numbers to be the

### Link-based Analysis on Large Graphs. Presented by Weiren Yu Mar 01, 2011

Link-based Analysis on Large Graphs Presented by Weiren Yu Mar 01, 2011 Overview 1 Introduction 2 Problem Definition 3 Optimization Techniques 4 Experimental Results 2 1. Introduction Many applications

### 2.3 Convex Constrained Optimization Problems

42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions

### 1 o Semestre 2007/2008

Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction

### An Alternative Web Search Strategy? Abstract

An Alternative Web Search Strategy? V.-H. Winterer, Rechenzentrum Universität Freiburg (Dated: November 2007) Abstract We propose an alternative Web search strategy taking advantage of the knowledge on

### Cyclotomic Extensions

Chapter 7 Cyclotomic Extensions A cyclotomic extension Q(ζ n ) of the rationals is formed by adjoining a primitive n th root of unity ζ n. In this chapter, we will find an integral basis and calculate

### Web Usage in Client-Server Design

Web Search Web Usage in Client-Server Design A client (e.g., a browser) communicates with a server via http Hypertext transfer protocol: a lightweight and simple protocol asynchronously carrying a variety

### Math 7 Elementary Linear Algebra MARKOV CHAINS. Definition of Experiment An experiment is an activity with a definite, observable outcome.

T. Henson I. Basic Concepts from Probability Examples of experiments: Math 7 Elementary Linear Algebra MARKOV CHAINS Definition of Experiment An experiment is an activity with a definite, observable outcome.

### Linear Programming. March 14, 2014

Linear Programming March 1, 01 Parts of this introduction to linear programming were adapted from Chapter 9 of Introduction to Algorithms, Second Edition, by Cormen, Leiserson, Rivest and Stein [1]. 1

### Math 115A HW4 Solutions University of California, Los Angeles. 5 2i 6 + 4i. (5 2i)7i (6 + 4i)( 3 + i) = 35i + 14 ( 22 6i) = 36 + 41i.

Math 5A HW4 Solutions September 5, 202 University of California, Los Angeles Problem 4..3b Calculate the determinant, 5 2i 6 + 4i 3 + i 7i Solution: The textbook s instructions give us, (5 2i)7i (6 + 4i)(