Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time


 Prosper Williams
 2 years ago
 Views:
Transcription
1 Big Data Technology Motivating NoSQL Databases: Computing Page Importance Metrics at Crawl Time Edward Bortnikov & Ronny Lempel Yahoo! Labs, Haifa Class Outline Linkbased page importance measures Why linkbased? Mathematical background PageRank Crawlers What are crawlers Algorithm for computing page importance while crawling Distributed implementation calls for NoSQL keyvalue database 3 May Big Data Technology 2 1
2 Link Analysis  Motivation Traditional textbased ranking methods from the field of Information Retrieval are not sufficient on the Web: The web is huge, with great variation in the quality of the pages even when those pages contain similar text Queries are usually very short, making differentiation between pages difficult The text on many web pages does not sufficiently describe the page Keyword spamming The emphasis of web search is on precision, not recall! 3 May Big Data Technology 3 Link Analysis  Motivation The connectivity patterns between Web pages contain a gold mine of information A link from page a to page b can often be interpreted as: 1. A recommendation, by a s author, of the contents of b 2. Evidence that pages a,b share some topic of interest A cocitation of a and b (by a third page c) may also constitute evidence that a,b share some topic of interest a c b Analyzing linkage patterns at scale can identify oftpraised pages and topical affinities between pages 3 May Big Data Technology 4 2
3 Mathematical Background  Irreducibility A directed graph G=(V,E) is called irreducible if for every i,k V there is a path in G originating at i and ending in k A nonnegative NxN matrix W is called irreducible if for every i,k {1,,N} there exists a nonnegative integer m such that [W m ] i,k >0 The support graph G W ={V W,E W } of a nonnegative NxN matrix W is a directed graph with N vertices, such that i k E W if and only if W i,k >0 Lemma: a nonnegative square matrix W is irreducible if and only if G W is irreducible Lemma: a directed graph G is irreducible if and only if the adjacency matrix of G is irreducible 3 May Big Data Technology 5 Mathematical Background: Primitive & Stochastic Matrices Definition: a nonnegative NxN matrix P is stochastic if the sum of every row in P is 1 Definition: the period of a directed graph G is the greatest common divisor of the lengths of all the cycles in G G is called aperiodic if it has a period of 1 Definition: a nonnegative NxN matrix M is called primitive if its support graph G M is aperiodic 3 May Big Data Technology 6 3
4 Mathematical Background: Ergodic Thoerem Let M be a nonnegative NxN matrix, and denote by λ1(m), λ2(m),...λn(m) the N eigenvalues of M, ordered by nondecreasing absolute value (i.e. λ1(m) λ2 (M)... λn(m) ) λ1(m) is the spectral radius of M and will simply be denoted by λ(m) Ergodic Theorem: let P be an irreducible and primitive stochastic matrix λ(p) = λ1(p) = 1, and any other eigenvalue of λ* of P satisfies λ* <1 There is a unique distribution rowvector π which satisfies πp=π π is the principal eigenvector of P and is the stationary distribution of the Markov Chain defined by the transition matrix P For any distribution rowvector q, lim k q P k = π Note that the last bullet defines an iterative method to compute π 3 May Big Data Technology 7 PageRank (Brin & Page, 1998) Named after Google s cofounder, Larry Page A global, query independent importance measure of Web pages A page is considered important if it receives many links from important pages Based on Markov chains and random walks 3 May Big Data Technology 8 4
5 PageRank: Random Surfer Model A random surfer moves from page to page. Upon leaving page p, the surfer chooses one of two actions: 1. Follows an outgoing link of p (chosen uniformly at random), with probability d See next slide for a discussion of pages that have no outlinks 2. Jumps to an arbitrary Web page (chosen uniformly at random), with probability 1d The vector of PageRanks is the stationary distribution of this (ergodic) random walk 3 May Big Data Technology 9 PageRank Handling Dangling Nodes PageRank as stated in the previous slide is not well defined with respect to exiting pages that have no outgoing links (dangling nodes) There are three accepted approaches for treating pages with no outgoing links: 1. Eliminate such pages from the graph (iteratively prune the graph until reaching a steady state) 2. Consider such pages to link back to the pages that link to them 3. Consider such pages to link to all web pages (effectively making an exit out of them equivalent to a random jump) 3 May Big Data Technology 10 5
6 PageRank: Steady State Equations The PageRanks obey the following equations: R(p) = (1d)/N + d Σ R(j)/D(j) j I(p) R(p) The PageRank of page p. d A damping factor, 0 < d < 1 N Number of Web pages I(p) The set of pages that point to p D(j) Number of outlinks (outdegree) of page j 3 May Big Data Technology 11 PageRank Algebraic Notation Let W denote the NxN adjacency matrix of the Web s link structure, after some form of handling dangling nodes Let W norm denote the matrix that results by dividing each row j of W by j s outdegree (row j s sum) Let T by an NxN matrix whose entries are all equal to 1/N Define M = (1d)T + dw norm M is the transition matrix corresponding to PageRank s random walk M s principal eigenvector is the vector of PageRanks 3 May Big Data Technology 12 6
7 Crawlers  Introduction The role of crawlers is to collect Web content Starting with some seed URLs, crawlers learn of additional crawl targets via hyperlinks on the crawled pages Several types of crawlers: Batch crawlers crawl a snapshot of their crawl space, until reaching a certain size or time limit Incremental crawlers continuously crawl their crawl space, revisiting URLs to ensure freshness Focused crawlers attempt to crawl pages pertaining to some topic/theme, while minimizing number of offtopic pages that are collected Web scale crawling is carried out by distributed crawlers, complicating what is conceptually a simple operation Resources consumed: bandwidth, computation time (beyond communication e.g. parsing), storage space 3 May Big Data Technology 14 Generic Web Crawling Algorithm Given a root set of distinct URLs: Put the root URLs in a (priority) queue While queue is not empty: Take the first URL x out of the queue Retrieve the contents of x from the web Do whatever you want with it Mark x as visited If x is an html page, parse it to find hyperlink URLs For each hyperlink URL y within x: If y hasn t been visited (perhaps lately), enqueue y with some priority Note that this algorithm may never stop on evolving graphs 3 May Big Data Technology 15 7
8 Adaptive OnLine Page Importance Computation (Abiteboul, Preda, Cobena WWW 2003) Assume we want to prioritize a continuous crawl by a PageRanklike linkbased importance measure of Web pages One option is to build a link database as we crawl, and use it to compute PageRank every K crawl operations This is both difficult and expensive The paper above details a method to compute a PageRanklike measure in an online, asynchronous manner with little overhead in terms of computations, I/O and memory The description in the following slides is for a simplified version of a continuous crawl over N pages whose links never change (i.e. the set of pages to be continuously crawled is fixed, and only their text is subject to change) 3 May Big Data Technology 16 Page Importance Definition Let G=(V,E) be the (directed) link graph of the Web We build G =(V,E ) by adding a virtual node y to G, with links to and from all other nodes; thus, G will be strongly connected Formally, V = V {y}, E = E { x V: x y, y x } Whenever E >0, G is also aperiodic Let M(G ) denote the rownormalized adjacency matrix of G ; obviously M(G ) is stochastic, and by the above assumption also Ergodic Let π denote the principal eigenvector of M(G ) (and stationary distribution of the Markov chain it represents), i.e. π = π M(G ) We will define π as the Importance Vector of all nodes in G that the algorithm will compute 3 May Big Data Technology 17 8
9 OnLine Algorithm for Computing Page Importance The algorithm assigns each page v the following variables: 1. Cash money C(v) 2. Bank money B(v) In addition, let L(v) denote the outlinks of page v, and let G denote the global amount of money in the bank. The algorithm itself proceeds as follows: Initialization: for all pages, let C(v) 1/N, B(v) 0; G 0 Repeat forever: Pick a page v to visit according to the Visit Strategy* Distribute (evenly) an amount equal to C(v) among v s children, i.e. j L(v), C(j) += C(v) / L(v) Deposit v s cash in the bank: B(v) += C(v), G += C(v), C(v) 0 Fairness condition*: every page is visited infinitely often 3 May Big Data Technology 18 Analysis of the Algorithm Three Lemmas Lemma 1: at all times, Σ v C(v) = 1 (proof by easy induction) Lemma 2: at all times, B(v)+C(v)=1/N + Σ j:j v M(G ) j,v B(j) Proof: also by induction, which is trivial at time zero. The step analyzes three distinct cases of which page is visited: 1. v is visited, and then nothing changes on the RHS while money is just moved from C(v) to B(v) on the LHS 2. A page j that links to v is visited. The LHS grows by C(j)/ L(j), which is the exact increment of the RHS 3. Otherwise neither side changes Lemma 3: G goes to infinity as the algorithm proceeds Proof: at any time t there is at least one page x whose cash amount is at least 1/N. Since each page is visited infinitely often, there is a finite t >t in which x is visited, thus G will increase by at least 1/N by time t 3 May Big Data Technology 19 9
10 Analysis of the Algorithm Main Theorem Lemma 1: at all times, Σ v C(v) = 1 (proof by easy induction) Lemma 2: at all times, B(v)+C(v)=1/N + Σ j:j v M(G ) j,v B(j) Lemma 3: G goes to infinity as the algorithm proceeds Let B be the normalized bank vector, i.e. B v =B(v)/G. Thus, B is a distribution vector. Theorem: B*M(G )B 0 as the algorithm proceeds Proof: examine the v th coordinate: B*M(G )B v = G 1 * B(v)  Σ j B(j) M(G ) j,v = G 1 * B(v) + C(v) C(v)  Σ j:j v B(j) M(G ) j,v = G 1 * 1/N C(v) < G 1 0 Conclusion: B π, the stationary distribution of M(G ) 3 May Big Data Technology 20 Possible Visit Strategies Crawling Policies 1. Round Robin (obviously fair) 2. Random choose the next node to visit u.a.r., thus guaranteeing that the probability of each page being visited infinitely often is 1 3. Greedy  visit the node with maximal cash, thus increasing G in the fastest possible manner Why are all nodes are visited infinitely often? 3 May Big Data Technology 21 10
11 Additional Notes 1. There are adaptations of the algorithm to the case of evolving graphs and to a distributed implementation 2. Estimating v s importance by (B(v)+C(v)) / (G+1) is slightly better than just using B(v)/G 3 May Big Data Technology 22 Distributed Implementation As mentioned earlier, Webscale crawling is a distributed task carried out on multiple machines Every visit to page v at crawl time requires access to B(v), G, C(v) and C(j) for every neighbor j of v The visits will happen across many crawl nodes Even repeat visits to the same node v may happen on different nodes (fault tolerance) Consequently, we need read and write access to the above variables across all nodes! 3 May Big Data Technology 23 11
12 Distributed Implementation Given a visit to v, the following transaction should be performed: 1. Foreach j L(v): 1. C(j) += C(v) / L(v) 2. B(v) += C(v) 3. G += C(v) 4. C(v) 0 1. t C(v) 2. C(v) 0 3. Foreach j L(v): 1. C(j) += t / L(v) 4. B(v) += t 5. G += t Reexamining the required consistency, we can rewrite as: 1. Lock C(v) 2. t C(v) 3. C(v) 0 4. Unlock C(v) 5. Foreach j L(v): 1. C(j) += t / L(v) 6. B(v) += t 7. G += t 3 May Big Data Technology 24 NoSQL Databases Not Only SQL: a class of database services that traded off most functionality of fullblown SQL databases for extreme scalability Simplest manifest: superscalable keyvalue store service Pioneered by Google (BigTable); now many other instances Representative highlevel API of a tabular NoSQL database: lock(row), unlock(row) get(row, [column]), put(row, [column, value]*) increment(row, [column, incrvalue]*) 1. lock(v) 2. t= get(v, C) 3. put(v,c,0) 4. unlock(v) 5. Foreach j L(v): 1. increment (j, C, t/ L(v) 6. increment(v, B, t) 7. increment(g,,t) 3 May Big Data Technology 25 12
13 Next Class Properties and design of scalable realtime NoSQL keyvalue stores, primarily BigTable and HBase 3 May Big Data Technology 26 13
MATH 56A SPRING 2008 STOCHASTIC PROCESSES 31
MATH 56A SPRING 2008 STOCHASTIC PROCESSES 3.3. Invariant probability distribution. Definition.4. A probability distribution is a function π : S [0, ] from the set of states S to the closed unit interval
More informationThe Mathematics of Internet Search Engines
The Mathematics of Internet Search Engines David Marshall Department of Mathematics Monmouth University April 4, 2007 Introduction Search Engines, Then and Now Then... Now... Pagerank Outline Introduction
More informationWeb Graph Analyzer Tool
Web Graph Analyzer Tool Konstantin Avrachenkov INRIA Sophia Antipolis 2004, route des Lucioles, B.P.93 06902, France Email: K.Avrachenkov@sophia.inria.fr Danil Nemirovsky St.Petersburg State University
More informationDATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
More informationThe PageRank Citation Ranking: Bring Order to the Web
The PageRank Citation Ranking: Bring Order to the Web presented by: Xiaoxi Pang 25.Nov 2010 1 / 20 Outline Introduction A ranking for every page on the Web Implementation Convergence Properties Personalized
More informationPart 1: Link Analysis & Page Rank
Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Exam on the 5th of February, 216, 14. to 16. If you wish to attend, please
More information6.042/18.062J Mathematics for Computer Science October 3, 2006 Tom Leighton and Ronitt Rubinfeld. Graph Theory III
6.04/8.06J Mathematics for Computer Science October 3, 006 Tom Leighton and Ronitt Rubinfeld Lecture Notes Graph Theory III Draft: please check back in a couple of days for a modified version of these
More informationSearch engines: ranking algorithms
Search engines: ranking algorithms Gianna M. Del Corso Dipartimento di Informatica, Università di Pisa, Italy ESP, 25 Marzo 2015 1 Statistics 2 Search Engines Ranking Algorithms HITS Web Analytics Estimated
More informationThe Second Eigenvalue of the Google Matrix
0 2 The Second Eigenvalue of the Google Matrix Taher H Haveliwala and Sepandar D Kamvar Stanford University taherh,sdkamvar @csstanfordedu Abstract We determine analytically the modulus of the second eigenvalue
More informationIntroduction to Flocking {Stochastic Matrices}
Supelec EECI Graduate School in Control Introduction to Flocking {Stochastic Matrices} A. S. Morse Yale University Gif sur  Yvette May 21, 2012 CRAIG REYNOLDS  1987 BOIDS The Lion King CRAIG REYNOLDS
More informationSeminar assignment 1. 1 Part
Seminar assignment 1 Each seminar assignment consists of two parts, one part consisting of 3 5 elementary problems for a maximum of 10 points from each assignment. For the second part consisting of problems
More informationPractical Graph Mining with R. 5. Link Analysis
Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities
More informationEigenvalues and Markov Chains
Eigenvalues and Markov Chains Will Perkins April 15, 2013 The Metropolis Algorithm Say we want to sample from a different distribution, not necessarily uniform. Can we change the transition rates in such
More information15 Markov Chains: Limiting Probabilities
MARKOV CHAINS: LIMITING PROBABILITIES 67 Markov Chains: Limiting Probabilities Example Assume that the transition matrix is given by 7 2 P = 6 Recall that the nstep transition probabilities are given
More information1 Limiting distribution for a Markov chain
Copyright c 2009 by Karl Sigman Limiting distribution for a Markov chain In these Lecture Notes, we shall study the limiting behavior of Markov chains as time n In particular, under suitable easytocheck
More informationPerron vector Optimization applied to search engines
Perron vector Optimization applied to search engines Olivier Fercoq INRIA Saclay and CMAP Ecole Polytechnique May 18, 2011 Web page ranking The core of search engines Semantic rankings (keywords) Hyperlink
More informationCHAPTER III  MARKOV CHAINS
CHAPTER III  MARKOV CHAINS JOSEPH G. CONLON 1. General Theory of Markov Chains We have already discussed the standard random walk on the integers Z. A Markov Chain can be viewed as a generalization of
More informationEnhancing the Ranking of a Web Page in the Ocean of Data
Database Systems Journal vol. IV, no. 3/2013 3 Enhancing the Ranking of a Web Page in the Ocean of Data Hitesh KUMAR SHARMA University of Petroleum and Energy Studies, India hkshitesh@gmail.com In today
More informationBindel, Fall 2012 Matrix Computations (CS 6210) Week 8: Friday, Oct 12
Why eigenvalues? Week 8: Friday, Oct 12 I spend a lot of time thinking about eigenvalue problems. In part, this is because I look for problems that can be solved via eigenvalues. But I might have fewer
More informationS = {1, 2,..., n}. P (1, 1) P (1, 2)... P (1, n) P (2, 1) P (2, 2)... P (2, n) P = . P (n, 1) P (n, 2)... P (n, n)
last revised: 26 January 2009 1 Markov Chains A Markov chain process is a simple type of stochastic process with many social science applications. We ll start with an abstract description before moving
More informationIEOR 6711: Stochastic Models, I Fall 2012, Professor Whitt, Final Exam SOLUTIONS
IEOR 6711: Stochastic Models, I Fall 2012, Professor Whitt, Final Exam SOLUTIONS There are four questions, each with several parts. 1. Customers Coming to an Automatic Teller Machine (ATM) (30 points)
More informationLecture 6: The Group Inverse
Lecture 6: The Group Inverse The matrix index Let A C n n, k positive integer. Then R(A k+1 ) R(A k ). The index of A, denoted IndA, is the smallest integer k such that R(A k ) = R(A k+1 ), or equivalently,
More informationMonte Carlo methods in PageRank computation: When one iteration is sufficient
Monte Carlo methods in PageRank computation: When one iteration is sufficient K.Avrachenkov, N. Litvak, D. Nemirovsky, N. Osipova Abstract PageRank is one of the principle criteria according to which Google
More informationMarkov chains and Markov Random Fields (MRFs)
Markov chains and Markov Random Fields (MRFs) 1 Why Markov Models We discuss Markov models now. This is the simplest statistical model in which we don t assume that all variables are independent; we assume
More informationThe world s largest matrix computation. (This chapter is out of date and needs a major overhaul.)
Chapter 7 Google PageRank The world s largest matrix computation. (This chapter is out of date and needs a major overhaul.) One of the reasons why Google TM is such an effective search engine is the PageRank
More informationMarkov Chains, Stochastic Processes, and Advanced Matrix Decomposition
Markov Chains, Stochastic Processes, and Advanced Matrix Decomposition Jack Gilbert Copyright (c) 2014 Jack Gilbert. Permission is granted to copy, distribute and/or modify this document under the terms
More information4.9 Markov matrices. DEFINITION 4.3 A real n n matrix A = [a ij ] is called a Markov matrix, or row stochastic matrix if. (i) a ij 0 for 1 i, j n;
49 Markov matrices DEFINITION 43 A real n n matrix A = [a ij ] is called a Markov matrix, or row stochastic matrix if (i) a ij 0 for 1 i, j n; (ii) a ij = 1 for 1 i n Remark: (ii) is equivalent to AJ n
More informationLecture12 The PerronFrobenius theorem.
Lecture12 The PerronFrobenius theorem. 1 Statement of the theorem. 2 Proof of the Perron Frobenius theorem. 3 Graphology. 3 Asymptotic behavior. The nonprimitive case. 4 The Leslie model of population
More informationModern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
More informationprinceton university F 02 cos 597D: a theorist s toolkit Lecture 7: Markov Chains and Random Walks
princeton university F 02 cos 597D: a theorist s toolkit Lecture 7: Markov Chains and Random Walks Lecturer: Sanjeev Arora Scribe:Elena Nabieva 1 Basics A Markov chain is a discretetime stochastic process
More informationUNCOUPLING THE PERRON EIGENVECTOR PROBLEM
UNCOUPLING THE PERRON EIGENVECTOR PROBLEM Carl D Meyer INTRODUCTION Foranonnegative irreducible matrix m m with spectral radius ρ,afundamental problem concerns the determination of the unique normalized
More informationCombining BibTeX and PageRank to Classify WWW Pages
Combining BibTeX and PageRank to Classify WWW Pages MIKE JOY, GRAEME COLE and JONATHAN LOUGHRAN University of Warwick IAN BURNETT IBM United Kingdom and ANDREW TWIGG Cambridge University Focused search
More information1. (First passage/hitting times/gambler s ruin problem:) Suppose that X has a discrete state space and let i be a fixed state. Let
Copyright c 2009 by Karl Sigman 1 Stopping Times 1.1 Stopping Times: Definition Given a stochastic process X = {X n : n 0}, a random time τ is a discrete random variable on the same probability space as
More informationLocal Approximation of PageRank and Reverse PageRank
Local Approximation of PageRank and Reverse PageRank Ziv BarYossef Department of Electrical Engineering Technion, Haifa, Israel and Google Haifa Engineering Center Haifa, Israel zivby@ee.technion.ac.il
More informationMarkov Chains, part I
Markov Chains, part I December 8, 2010 1 Introduction A Markov Chain is a sequence of random variables X 0, X 1,, where each X i S, such that P(X i+1 = s i+1 X i = s i, X i 1 = s i 1,, X 0 = s 0 ) = P(X
More informationSome Polynomial Theorems. John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.
Some Polynomial Theorems by John Kennedy Mathematics Department Santa Monica College 1900 Pico Blvd. Santa Monica, CA 90405 rkennedy@ix.netcom.com This paper contains a collection of 31 theorems, lemmas,
More informationLecture 4 Online and streaming algorithms for clustering
CSE 291: Geometric algorithms Spring 2013 Lecture 4 Online and streaming algorithms for clustering 4.1 Online kclustering To the extent that clustering takes place in the brain, it happens in an online
More informationApproximation Algorithms
Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NPCompleteness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms
More informationNotes on Matrix Multiplication and the Transitive Closure
ICS 6D Due: Wednesday, February 25, 2015 Instructor: Sandy Irani Notes on Matrix Multiplication and the Transitive Closure An n m matrix over a set S is an array of elements from S with n rows and m columns.
More informationThe Goldberg Rao Algorithm for the Maximum Flow Problem
The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }
More informationContinued Fractions and the Euclidean Algorithm
Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction
More informationIntroduction to Stationary Distributions
13 Introduction to Stationary Distributions We first briefly review the classification of states in a Markov chain with a quick example and then begin the discussion of the important notion of stationary
More informationThe PerronFrobenius theorem.
Chapter 9 The PerronFrobenius theorem. The theorem we will discuss in this chapter (to be stated below) about matrices with nonnegative entries, was proved, for matrices with strictly positive entries,
More informationLink Analysis. Chapter 5. 5.1 PageRank
Chapter 5 Link Analysis One of the biggest changes in our lives in the decade following the turn of the century was the availability of efficient and accurate Web search, through search engines such as
More informationmax cx s.t. Ax c where the matrix A, cost vector c and right hand side b are given and x is a vector of variables. For this example we have x
Linear Programming Linear programming refers to problems stated as maximization or minimization of a linear function subject to constraints that are linear equalities and inequalities. Although the study
More informationMATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors. Jordan canonical form (continued).
MATH 423 Linear Algebra II Lecture 38: Generalized eigenvectors Jordan canonical form (continued) Jordan canonical form A Jordan block is a square matrix of the form λ 1 0 0 0 0 λ 1 0 0 0 0 λ 0 0 J = 0
More informationContinuousTime Markov Chains  Introduction
25 ContinuousTime Markov Chains  Introduction Prior to introducing continuoustime Markov chains today, let us start off with an example involving the Poisson process. Our particular focus in this example
More information1 Short Introduction to Time Series
ECONOMICS 7344, Spring 202 Bent E. Sørensen January 24, 202 Short Introduction to Time Series A time series is a collection of stochastic variables x,.., x t,.., x T indexed by an integer value t. The
More informationOnline edition (c)2009 Cambridge UP
DRAFT! April 1, 2009 Cambridge University Press. Feedback welcome. 461 21 Link analysis The analysis of hyperlinks and the graph structure of the Web has been instrumental in the development of web search.
More informationLesson 3. Algebraic graph theory. Sergio Barbarossa. Rome  February 2010
Lesson 3 Algebraic graph theory Sergio Barbarossa Basic notions Definition: A directed graph (or digraph) composed by a set of vertices and a set of edges We adopt the convention that the information flows
More informationAnalysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
More informationLecture 9. 1 Introduction. 2 Random Walks in Graphs. 1.1 How To Explore a Graph? CS621 Theory Gems October 17, 2012
CS62 Theory Gems October 7, 202 Lecture 9 Lecturer: Aleksander Mądry Scribes: Dorina Thanou, Xiaowen Dong Introduction Over the next couple of lectures, our focus will be on graphs. Graphs are one of
More informationMagnitude of the crawling problem. Introduction to Information Retrieval Basic crawler operation
Magnitude of the crawling problem Introduction to Information Retrieval http://informationretrieval.org IIR 20: Crawling Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart
More informationEffective Page Refresh Policies For Web Crawlers
Effective Page Refresh Policies For Web Crawlers JUNGHOO CHO University of California, Los Angeles and HECTOR GARCIAMOLINA Stanford University In this paper we study how we can maintain local copies of
More informationCost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:
CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm
More informationSocial Media Mining. Network Measures
Klout Measures and Metrics 22 Why Do We Need Measures? Who are the central figures (influential individuals) in the network? What interaction patterns are common in friends? Who are the likeminded users
More informationMATH 551  APPLIED MATRIX THEORY
MATH 55  APPLIED MATRIX THEORY FINAL TEST: SAMPLE with SOLUTIONS (25 points NAME: PROBLEM (3 points A web of 5 pages is described by a directed graph whose matrix is given by A Do the following ( points
More informationReading 7 : Program Correctness
CS/Math 240: Introduction to Discrete Mathematics Fall 2015 Instructors: Beck Hasti, Gautam Prakriya Reading 7 : Program Correctness 7.1 Program Correctness Showing that a program is correct means that
More informationThe Characteristic Polynomial
Physics 116A Winter 2011 The Characteristic Polynomial 1 Coefficients of the characteristic polynomial Consider the eigenvalue problem for an n n matrix A, A v = λ v, v 0 (1) The solution to this problem
More informationCrawling and web indexes CE324: Modern Information Retrieval Sharif University of Technology
Crawling and web indexes CE324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2015 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS276, Stanford)
More informationSimilarity and Diagonalization. Similar Matrices
MATH022 Linear Algebra Brief lecture notes 48 Similarity and Diagonalization Similar Matrices Let A and B be n n matrices. We say that A is similar to B if there is an invertible n n matrix P such that
More informationBIN504  Lecture XI. Bayesian Inference & Markov Chains
BIN504  Lecture XI Bayesian Inference & References: Lee, Chp. 11 & Sec. 10.11 EwensGrant, Chps. 4, 7, 11 c 2012 Aybar C. Acar Including slides by Tolga Can Rev. 1.2 (Build 20130528191100) BIN 504  Bayes
More informationTheorem (The division theorem) Suppose that a and b are integers with b > 0. There exist unique integers q and r so that. a = bq + r and 0 r < b.
Theorem (The division theorem) Suppose that a and b are integers with b > 0. There exist unique integers q and r so that a = bq + r and 0 r < b. We re dividing a by b: q is the quotient and r is the remainder,
More informationContinuoustime Markov Chains
Continuoustime Markov Chains Gonzalo Mateos Dept. of ECE and Goergen Institute for Data Science University of Rochester gmateosb@ece.rochester.edu http://www.ece.rochester.edu/~gmateosb/ October 31, 2016
More informationLecture 6 Online and streaming algorithms for clustering
CSE 291: Unsupervised learning Spring 2008 Lecture 6 Online and streaming algorithms for clustering 6.1 Online kclustering To the extent that clustering takes place in the brain, it happens in an online
More informationMath Review. for the Quantitative Reasoning Measure of the GRE revised General Test
Math Review for the Quantitative Reasoning Measure of the GRE revised General Test www.ets.org Overview This Math Review will familiarize you with the mathematical skills and concepts that are important
More informationMathematics Course 111: Algebra I Part IV: Vector Spaces
Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 19967 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are
More informationTHE $25,000,000,000 EIGENVECTOR THE LINEAR ALGEBRA BEHIND GOOGLE
THE $5,,, EIGENVECTOR THE LINEAR ALGEBRA BEHIND GOOGLE KURT BRYAN AND TANYA LEISE Abstract. Google s success derives in large part from its PageRank algorithm, which ranks the importance of webpages according
More informationReport: Declarative Machine Learning on MapReduce (SystemML)
Report: Declarative Machine Learning on MapReduce (SystemML) Jessica Falk ETHID 11947512 May 28, 2014 1 Introduction SystemML is a system used to execute machine learning (ML) algorithms in HaDoop,
More informationWeb Search Engines. Search Engine Characteristics. Web Search Queries. Chapter 27, Part C Based on Larson and Hearst s slides at UCBerkeley
Web Search Engines Chapter 27, Part C Based on Larson and Hearst s slides at UCBerkeley http://www.sims.berkeley.edu/courses/is202/f00/ Database Management Systems, R. Ramakrishnan 1 Search Engine Characteristics
More informationCS6322: Information Retrieval Sanda Harabagiu. Lecture 11: Crawling and web indexes
Sanda Harabagiu Lecture 11: Crawling and web indexes Today s lecture Crawling Connectivity servers Sec. 20.2 Basic crawler operation Begin with known seed URLs Fetch and parse them Extract URLs they point
More informationPower Rankings: Math for March Madness
Power Rankings: Math for March Madness James A. Swenson University of Wisconsin Platteville swensonj@uwplatt.edu March 5, 2011 Math Club Madison Area Technical College James A. Swenson (UWP) Power Rankings:
More informationAnalysis and Computation of Google s PageRank
Analysis and Computation of Google s PageRank Ilse Ipsen North Carolina State University Joint work with Rebecca M. Wills IMACS p.1 Overview Goal: Compute (citation) importance of a web page Simple Web
More informationLECTURE 4. Last time: Lecture outline
LECTURE 4 Last time: Types of convergence Weak Law of Large Numbers Strong Law of Large Numbers Asymptotic Equipartition Property Lecture outline Stochastic processes Markov chains Entropy rate Random
More informationTools for the analysis and design of communication networks with Markovian dynamics
1 Tools for the analysis and design of communication networks with Markovian dynamics Arie Leizarowitz, Robert Shorten, Rade Stanoević Abstract In this paper we analyze the stochastic properties of a class
More informationCorso di Biblioteche Digitali
Corso di Biblioteche Digitali Vittore Casarosa casarosa@isti.cnr.it tel. 050315 3115 cell. 348397 2168 Ricevimento dopo la lezione o per appuntamento Valutazione finale 7075% esame orale 2530% progetto
More informationMarkov Chains. Chapter Introduction and Definitions 110SOR201(2002)
page 5 SOR() Chapter Markov Chains. Introduction and Definitions Consider a sequence of consecutive times ( or trials or stages): n =,,,... Suppose that at each time a probabilistic experiment is performed,
More informationUniversity of Chicago Autumn 2003 CS Markov Chain Monte Carlo Methods
University of Chicago Autumn 2003 CS371011 Markov Chain Monte Carlo Methods Lecture 2: October 7, 2003 Markov Chains, Coupling, Stationary Distribution Eric Vigoda 2.1 Markov Chains In this lecture, we
More informationA short proof of Perron s theorem. Hannah Cairns, April 25, 2014.
A short proof of Perron s theorem. Hannah Cairns, April 25, 2014. A matrix A or a vector ψ is said to be positive if every component is a positive real number. A B means that every component of A is greater
More informationRandom access protocols for channel access. Markov chains and their stability. Laurent Massoulié.
Random access protocols for channel access Markov chains and their stability laurent.massoulie@inria.fr Aloha: the first random access protocol for channel access [Abramson, Hawaii 70] Goal: allow machines
More informationRecap CS276. Plan for today. Linkbased ranking. Link spam. Link farms and link exchanges. Wrap up spam Crawling Connectivity servers
Recap CS276 Lecture 15 Plan for today Wrap up spam Crawling Connectivity servers In the last lecture we introduced web search Paid placement SEO/Spam Linkbased ranking Most search engines use hyperlink
More information1. Markov chains. 1.1 Specifying and simulating a Markov chain. Page 5
1. Markov chains Page 5 Section 1. What is a Markov chain? How to simulate one. Section 2. The Markov property. Section 3. How matrix multiplication gets into the picture. Section 4. Statement of the Basic
More informationPerformance Analysis of a Telephone System with both Patient and Impatient Customers
Performance Analysis of a Telephone System with both Patient and Impatient Customers Yiqiang Quennel Zhao Department of Mathematics and Statistics University of Winnipeg Winnipeg, Manitoba Canada R3B 2E9
More informationMath 312 Lecture Notes Markov Chains
Math 312 Lecture Notes Markov Chains Warren Weckesser Department of Mathematics Colgate University Updated, 30 April 2005 Markov Chains A (finite) Markov chain is a process with a finite number of states
More informationChapter 4. Trees. 4.1 Basics
Chapter 4 Trees 4.1 Basics A tree is a connected graph with no cycles. A forest is a collection of trees. A vertex of degree one, particularly in a tree, is called a leaf. Trees arise in a variety of applications.
More informationCS 103X: Discrete Structures Homework Assignment 3 Solutions
CS 103X: Discrete Structures Homework Assignment 3 s Exercise 1 (20 points). On wellordering and induction: (a) Prove the induction principle from the wellordering principle. (b) Prove the wellordering
More informationA simple criterion on degree sequences of graphs
Discrete Applied Mathematics 156 (2008) 3513 3517 Contents lists available at ScienceDirect Discrete Applied Mathematics journal homepage: www.elsevier.com/locate/dam Note A simple criterion on degree
More informationOptimization of the HOTS score of a website s pages
Optimization of the HOTS score of a website s pages Olivier Fercoq and Stéphane Gaubert INRIA Saclay and CMAP Ecole Polytechnique June 21st, 2012 Toy example with 21 pages 3 5 6 2 4 20 1 12 9 11 15 16
More informationA characterization of trace zero symmetric nonnegative 5x5 matrices
A characterization of trace zero symmetric nonnegative 5x5 matrices Oren Spector June 1, 009 Abstract The problem of determining necessary and sufficient conditions for a set of real numbers to be the
More informationLinkbased Analysis on Large Graphs. Presented by Weiren Yu Mar 01, 2011
Linkbased Analysis on Large Graphs Presented by Weiren Yu Mar 01, 2011 Overview 1 Introduction 2 Problem Definition 3 Optimization Techniques 4 Experimental Results 2 1. Introduction Many applications
More information2.3 Convex Constrained Optimization Problems
42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions
More information1 o Semestre 2007/2008
Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction
More informationAn Alternative Web Search Strategy? Abstract
An Alternative Web Search Strategy? V.H. Winterer, Rechenzentrum Universität Freiburg (Dated: November 2007) Abstract We propose an alternative Web search strategy taking advantage of the knowledge on
More informationCyclotomic Extensions
Chapter 7 Cyclotomic Extensions A cyclotomic extension Q(ζ n ) of the rationals is formed by adjoining a primitive n th root of unity ζ n. In this chapter, we will find an integral basis and calculate
More informationWeb Usage in ClientServer Design
Web Search Web Usage in ClientServer Design A client (e.g., a browser) communicates with a server via http Hypertext transfer protocol: a lightweight and simple protocol asynchronously carrying a variety
More informationMath 7 Elementary Linear Algebra MARKOV CHAINS. Definition of Experiment An experiment is an activity with a definite, observable outcome.
T. Henson I. Basic Concepts from Probability Examples of experiments: Math 7 Elementary Linear Algebra MARKOV CHAINS Definition of Experiment An experiment is an activity with a definite, observable outcome.
More informationLinear Programming. March 14, 2014
Linear Programming March 1, 01 Parts of this introduction to linear programming were adapted from Chapter 9 of Introduction to Algorithms, Second Edition, by Cormen, Leiserson, Rivest and Stein [1]. 1
More informationMath 115A HW4 Solutions University of California, Los Angeles. 5 2i 6 + 4i. (5 2i)7i (6 + 4i)( 3 + i) = 35i + 14 ( 22 6i) = 36 + 41i.
Math 5A HW4 Solutions September 5, 202 University of California, Los Angeles Problem 4..3b Calculate the determinant, 5 2i 6 + 4i 3 + i 7i Solution: The textbook s instructions give us, (5 2i)7i (6 + 4i)(
More informationNotes on Factoring. MA 206 Kurt Bryan
The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor
More informationEigenvalues and eigenvectors of a matrix
Eigenvalues and eigenvectors of a matrix Definition: If A is an n n matrix and there exists a real number λ and a nonzero column vector V such that AV = λv then λ is called an eigenvalue of A and V is
More information