Web Information Retrieval. Lecture 11 Graph Structure of the Web for IR

Similar documents
Practical Graph Mining with R. 5. Link Analysis

Social Media Mining. Graph Essentials

Graph Theory and Complex Networks: An Introduction. Chapter 08: Computer networks

Topological Properties

Graph Theory and Complex Networks: An Introduction. Chapter 06: Network analysis. Contents. Introduction. Maarten van Steen. Version: April 28, 2014

1 o Semestre 2007/2008

A discussion of Statistical Mechanics of Complex Networks P. Part I

Distance Degree Sequences for Network Analysis

Discrete Mathematics & Mathematical Reasoning Chapter 10: Graphs

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Part 2: Community Detection

Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations

IE 680 Special Topics in Production Systems: Networks, Routing and Logistics*

Graph Theory and Complex Networks: An Introduction. Chapter 06: Network analysis

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

The PageRank Citation Ranking: Bring Order to the Web

Social Network Mining

Social Media Mining. Network Measures

Midterm Practice Problems

Course on Social Network Analysis Graphs and Networks

Graph Mining and Social Network Analysis

Analysis of Algorithms, I

Why? A central concept in Computer Science. Algorithms are ubiquitous.

DATA ANALYSIS II. Matrix Algorithms

Mining Social Network Graphs

Reti di Calcolatori! Web Search

Graph models for the Web and the Internet. Elias Koutsoupias University of Athens and UCLA. Crete, July 2003

Random graphs and complex networks

Graph Mining Techniques for Social Media Analysis

Analyzing the Facebook graph?

2. (a) Explain the strassen s matrix multiplication. (b) Write deletion algorithm, of Binary search tree. [8+8]

Outline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits

Search Engine Optimization (SEO): Improving Website Ranking

Network (Tree) Topology Inference Based on Prüfer Sequence

Some questions... Graphs

General Network Analysis: Graph-theoretic. COMP572 Fall 2009

B490 Mining the Big Data. 2 Clustering

Problem Set 7 Solutions

V. Adamchik 1. Graph Theory. Victor Adamchik. Fall of 2005

Protein Protein Interaction Networks

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Lecture 8: Routing I Distance-vector Algorithms. CSE 123: Computer Networks Stefan Savage

Distributed Computing over Communication Networks: Maximal Independent Set

Analysis of MapReduce Algorithms

Cpt S 223. School of EECS, WSU

Google: data structures and algorithms

CIS 700: algorithms for Big Data

Lecture 2.1 : The Distributed Bellman-Ford Algorithm. Lecture 2.2 : The Destination Sequenced Distance Vector (DSDV) protocol

CMPSCI611: Approximating MAX-CUT Lecture 20

SPANNING CACTI FOR STRUCTURALLY CONTROLLABLE NETWORKS NGO THI TU ANH NATIONAL UNIVERSITY OF SINGAPORE

Network Metrics, Planar Graphs, and Software Tools. Based on materials by Lala Adamic, UMichigan

Mathematical issues in network construction and security

Graph structure in the web

Search engine ranking

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

Routing with OSPF. Introduction

Follow the Perturbed Leader

Ranking on Data Manifolds

Web Graph Analyzer Tool

Big Ideas in Mathematics

An Alternative Web Search Strategy? Abstract

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory

Graph theory and network analysis. Devika Subramanian Comp 140 Fall 2008

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

Strong and Weak Ties

Outline. EE 122: Interdomain Routing Protocol (BGP) BGP Routing. Internet is more complicated... Ion Stoica TAs: Junda Liu, DK Moon, David Zats

Simplified External memory Algorithms for Planar DAGs. July 2004

Approximation Algorithms

A linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form

! E6893 Big Data Analytics Lecture 10:! Linked Big Data Graph Computing (II)

Introduction to Networks and Business Intelligence

Greedy Routing on Hidden Metric Spaces as a Foundation of Scalable Routing Architectures

Social and Technological Network Analysis. Lecture 3: Centrality Measures. Dr. Cecilia Mascolo (some material from Lada Adamic s lectures)

Handout #Ch7 San Skulrattanakulchai Gustavus Adolphus College Dec 6, Chapter 7: Digraphs

Measurement and Analysis of Online Social Networks

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Applied Algorithm Design Lecture 5

SGL: Stata graph library for network analysis

Sampling Biases in IP Topology Measurements

1. Write the number of the left-hand item next to the item on the right that corresponds to it.

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

Exploring Big Data in Social Networks

Data Structures and Algorithms Written Examination

Online Social Networks and Network Economics. Aris Anagnostopoulos, Online Social Networks and Network Economics

Mining Social-Network Graphs

Web Search. 2 o Semestre 2012/2013

Part 1: Link Analysis & Page Rank

5.1 Bipartite Matching

A Review And Evaluations Of Shortest Path Algorithms

System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1

Chapter 6: Graph Theory

Connected Identifying Codes for Sensor Network Monitoring

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction

A SOCIAL NETWORK ANALYSIS APPROACH TO ANALYZE ROAD NETWORKS INTRODUCTION

Looking at the Blogosphere Topology through Different Lenses

Graph/Network Visualization

1. Nondeterministically guess a solution (called a certificate) 2. Check whether the solution solves the problem (called verification)

Social network analysis with R sna package

Transcription:

Web Information Retrieval Lecture Graph Structure of the Web for IR

Today s lecture Anchor text Graph structure of the web

The Web as a Directed Graph Page A Anchor hyperlink Page B Assumption : A hyperlink between pages denotes author perceived relevance (quality signal) Assumption 2: The anchor of the hyperlink describes the target page (textual context)

Assumption : reputed sites 4

Assumption 2: annotation of target 5

Anchor Text WWW Worm - McBryan [Mcbr94] For ibm how to distinguish between: IBM s home page (mostly graphical) IBM s copyright page (high term freq. for ibm ) Rival s spam page (arbitrarily high term freq.) ibm ibm.com IBM home page A million pieces of anchor text with ibm send a strong signal www.ibm.com

Indexing anchor text When indexing a document D, include anchor text from links pointing to D. Armonk, NY-based computer giant IBM announced today www.ibm.com Joe s computer hardware links Compaq HP IBM Big Blue today announced record profits for the quarter

Indexing anchor text Can sometimes have unexpected side effects - e.g., miserable failure Can score anchor text with weight depending on the authority of the anchor page s website E.g., if we were to assume that content from cnn.com or yahoo.com is authoritative, then trust the anchor text from them

Google Bombing

Anchor Text Other applications Weighting/filtering links in the graph HITS Generating page descriptions from anchor text

The Web Graph Other applications

Web graph Notation: G = (V, E) is given by a set of vertices (nodes) denoted V a set of edges (links) = pairs of nodes denoted E The page graph (directed) V = static web pages (5 B) E = static hyperlinks (35 B?) # pages indexed by Google on March 5

Graph Theory Graph G=(V,E) V = set of vertices E = set of edges 2 3 5 4 undirected graph E={(,2),(,3),(2,3),(3,4),(4,5)}

Graph Theory Graph G=(V,E) V = set of vertices E = set of edges 2 3 5 4 directed graph E={,2, 2,,3, 3,2, 3,4, 4,5 }

Undirected graph degree d(i) of node i number of edges incident on node i 2 degree sequence [d(),d(2),d(3),d(4),d(5)] [2,2,3,2,] 3 degree distribution [(,),(2,3), (3,)] 5 4

Directed Graph in-degree d in (i) of node i number of edges pointing to node i 2 out-degree d out (i) of node i number of edges leaving node i 3 in-degree sequence [,2,,,] out-degree sequence [2,,2,,] 5 4

Paths Path from node i to node j: a sequence of edges (directed or undirected from node i to node j) path length: number of edges on the path nodes i and j are connected 2 2 3 3 5 4 5 4

Shortest Paths Shortest Path from node i to node j also known as BFS path, or geodesic path 2 2 3 3 5 4 5 4

Diameter The longest shortest path in the graph 2 2 3 3 5 4 5 4

Undirected graph Connected graph: a graph where there every pair of nodes is connected Disconnected graph: a graph that is not connected Connected Components: subsets of vertices that are connected 2 3 5 4

Directed Graph Strongly connected graph: there exists a path from every i to every j 2 Weakly connected graph: If edges are made to be undirected the graph is connected 3 5 4

Connected components definitions Weakly connected components (WCC) Set of nodes such that from any node can go to any node via an undirected path Strongly connected components (SCC) Set of nodes such that from any node can go to any node via a directed path. WCC SCC

Adjacency matrix Adjacency matrix symmetric matrix for undirected graphs 2 3 4 5 A

Adjacency matrix Adjacency matrix unsymmetric matrix for undirected graphs A 2 3 4 5

Why is it interesting to study the Web Graph? It is the largest artifact ever conceived by the human Exploit its structure of the Web for Crawl strategies Search Spam detection Discovering communities on the web Classification/organization Predict the evolution of the Web Mathematical models Sociological understanding

Many other web/internet related graphs Physical network graph V = Routers E = communication links The host graph (directed) V = hosts E = There is an edge from a page on host A to a page on host B The cosine graph (undirected, weighted) V = static web pages E = cosine distance among term vectors associated with pages Co-citation graph (undirected, weighted) V = static web pages E = (x,y) number of pages that refer to both x and y Communication graphs (which hosts talks to which hosts at a given time) Routing graph (how packets move) Social networks Biology

Observing Web Graph It is a huge ever expanding graph We do not know which percentage of it we know The only way to discover the graph structure of the web as hypertext is via large scale crawls Warning: the picture might be distorted by Size limitation of the crawl Crawling rules Perturbations of the "natural" process of birth and death of nodes and links

Naïve solution Keep crawling, when you stop seeing more pages, stop Extremely simple but wrong solution: crawling is complicated because the web is complicated spamming duplicates mirrors First example of a complication: Soft 44 When a page does not exists, the server is supposed to return an error code = 44 Many servers do not return an error code, but keep the visitor on site, or simply send him to the home page

The Static Public Web Static not the result of a cgi-bin scripts no? in the URL doesn t change very often Public no password required no robots.txt exclusion no noindex meta tag

Graph properties of the Web Global structure: how does the web look from far away? Connectivity: how many connections? Connected components: how large? Reachability: can one go from here to there? How many hops? Dense Subgraphs: are there good clusters?

Power laws Inverse polynomial tail (for large k) Pr(X = k) ~ /k Graph of such a function is a line on a log-log scale log(pr(x=k)) = - * log(k) Very large values possible & probable Exponential tail (for large k) Pr(X = k) ~ e -k Graph of such function is a line on a log scale log(pr(x=k)) = -k No very large values

Power laws Aris Anagnostopoulos, Online Social Networks and Network Economics

Power laws Examples Inverse poly tail: the distribution of wealth, popularity of books, etc. Inverse exp tail: the distribution of age, etc Internet network topology [Faloutsos & al. 99]

Power laws are ubiquitous

SCC and WCC distribution The SCC and WCC sizes follows a power law distribution the second largest SCC is significantly smaller

The In-degree distribution Altavista crawl, 999 WebBase Crawl 2 Indegree follows power law distribution Pr[ in - degree( u) 2. k] k

Graph structure of the Web Altavista, 999 WebBase 2 Out-degree follow power law distribution? Pages with a large number of outlinks are less frequent.

What is good about 2. Expected number of nodes with degree k ~ n / k 2. Expected contribution to total edges ~ n / k 2. * k = ~ n / k. Summing over all k, the expected number of edges is ~ n This is good news: our computational power is such that we can deal with work linear in the size of the web, but not much about that.

The bow-tie structure of the Web TENDRILS 22 % DISC 8% CORE 27 % OUT 22 % IN 2 %

Papillon inglese: bow-tie italiano: papillon, farfallino, francese: noeud de papillon spagnolo: corbata, pajarita, greco: παπιγιόν tedesco: fliege polacco: mucha

Experiments () WebBase TENDRILS SCC22% IN OUT Regione Altavista 99 OUT 2% DISC. 8% Dimensione (numero di nodi) SCC 28% 44.73.93 4.44.56 IN 53.33.475 2% TENDRILS 3% OUT 39% DISC. 4% IN % SCC 33% TENDRILS DISC. Pr[ sizescc 7.8.859 ( u) i] i 6.72.83

Experiments (2) SCC distribution region by region Power Law 2.7

Experiments (3) Indegree distribution region by region Power Law 2.

Experiments (4) Indegree distribution in the SCC graph Power Law 2.

Experiments (5) WCC distribution in IN Power Law.8

What did we learn from all these power laws? The second largest SCC is of size less nodes! IN and Out have millions of access points to the CORE and thousands of relatively large Weakly Connected Components This may help in designing better crawling strategies at least for IN and OUT, i.e. the load can be splitted between the robots without much overlapping. While power law with exponent 2. is a universal feature of the Web, there is no fractal structure: IN and OUT do not show any Bow Tie Phenomena

Algorithmic issues Apply standard linear-time algorithms for WCC and SCC Hard to do if you can t store the entire graph in memory!! WCC is easy if you can store V (semi-external algorithm) No one knows how to do DFS in semi-external memory, so SCC is hard?? Might be able to do approx SCC, based on low diameter. Random sampling for connectivity information Find all nodes reachable from one given node ( Breadth- First Search ) BFS is also hard. Simpler on low diameter

Find the CORE Iterate the following process: Pick a random vertex v Compute all node reached from v: O(v) Compute all nodes that reach v: I(v) Compute SCC(v):= I(v) O(v) Check whether it is the largest SCC If the CORE is about ¼ of the vertices, after 2 iterations, probability to not find the core < %.

Resources IIR Chapters 2 2.