Sampling Biases in IP Topology Measurements



Similar documents
Evaluation of a New Method for Measuring the Internet Degree Distribution: Simulation Results

Assignment #3 Routing and Network Analysis. CIS3210 Computer Networks. University of Guelph

Estimating Network Layer Subnet Characteristics via Statistical Sampling

Social Media Mining. Graph Essentials

WISE Power Tutorial All Exercises

Mining Social Network Graphs

Distributed Computing over Communication Networks: Topology. (with an excursion to P2P)

Simple Linear Regression Inference

12.5: CHI-SQUARE GOODNESS OF FIT TESTS

Correlational Research

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

Simulation Exercises to Reinforce the Foundations of Statistical Thinking in Online Classes

Introduction to Hypothesis Testing. Hypothesis Testing. Step 1: State the Hypotheses

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

Internet (IPv4) Topology Mapping. Department of Computer Science The University of Texas at Dallas

Adverse Impact Ratio for Females (0/ 1) = 0 (5/ 17) = Adverse impact as defined by the 4/5ths rule was not found in the above data.

8.1 Min Degree Spanning Tree

I. ADDITIONAL EVALUATION RESULTS. A. Environment

What cannot be measured on the Internet? Yvonne-Anne Pignolet, Stefan Schmid, G. Trédan. Misleading stars

High-Frequency Active Internet Topology Mapping

The Joint Degree Distribution as a Definitive Metric of the Internet AS-level Topologies

Estimating the Degree of Activity of jumps in High Frequency Financial Data. joint with Yacine Aït-Sahalia

BGP Prefix Hijack: An Empirical Investigation of a Theoretical Effect Masters Project

Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations

The Coremelt Attack. Ahren Studer and Adrian Perrig. We ve Come to Rely on the Internet

Posit: An Adaptive Framework for Lightweight IP Geolocation

Having a coin come up heads or tails is a variable on a nominal scale. Heads is a different category from tails.

Violent crime total. Problem Set 1

EFFICIENT DETECTION IN DDOS ATTACK FOR TOPOLOGY GRAPH DEPENDENT PERFORMANCE IN PPM LARGE SCALE IPTRACEBACK

Social Media Mining. Network Measures

Network (Tree) Topology Inference Based on Prüfer Sequence

Chi-square test Fisher s Exact test

Calculating P-Values. Parkland College. Isela Guerra Parkland College. Recommended Citation

Distance Degree Sequences for Network Analysis

2013 MBA Jump Start Program. Statistics Module Part 3

Chapter 8 Hypothesis Testing Chapter 8 Hypothesis Testing 8-1 Overview 8-2 Basics of Hypothesis Testing

November 08, S8.6_3 Testing a Claim About a Standard Deviation or Variance

Density Curve. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:

Introduction to Quantitative Methods

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

p-values and significance levels (false positive or false alarm rates)

Chi Square Tests. Chapter Introduction

Detecting Network Anomalies. Anant Shah

Effective Network Monitoring

Part 2: Community Detection

TEST 2 STUDY GUIDE. 1. Consider the data shown below.

WITH THE RAPID growth of the Internet, overlay networks

BA 275 Review Problems - Week 5 (10/23/06-10/27/06) CD Lessons: 48, 49, 50, 51, 52 Textbook: pp

CROSS LAYER BASED MULTIPATH ROUTING FOR LOAD BALANCING

Graph Theory and Complex Networks: An Introduction. Chapter 06: Network analysis

Some Examples of Network Measurements

Variables Control Charts

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

CAB TRAVEL TIME PREDICTI - BASED ON HISTORICAL TRIP OBSERVATION

Exploring Big Data in Social Networks

COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES.

Math 58. Rumbos Fall Solutions to Review Problems for Exam 2

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS

HYPOTHESIS TESTING: POWER OF THE TEST

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Graph Theory and Complex Networks: An Introduction. Chapter 08: Computer networks

Understand the role that hypothesis testing plays in an improvement project. Know how to perform a two sample hypothesis test.

Chapter 7 Notes - Inference for Single Samples. You know already for a large sample, you can invoke the CLT so:

Internet Firewall CSIS Packet Filtering. Internet Firewall. Examples. Spring 2011 CSIS net15 1. Routers can implement packet filtering

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Testing Random- Number Generators

Chapter 7 Section 7.1: Inference for the Mean of a Population

Stats Review Chapters 9-10

Hypothesis Testing for Beginners

In the general population of 0 to 4-year-olds, the annual incidence of asthma is 1.4%

Statistics 2014 Scoring Guidelines

How To Check For Differences In The One Way Anova

Temporal Dynamics of Scale-Free Networks

11. Analysis of Case-control Studies Logistic Regression

Introduction to Regression and Data Analysis

Statistics Review PSY379

Part 2: Analysis of Relationship Between Two Variables

Chapter 3 RANDOM VARIATE GENERATION

BA 275 Review Problems - Week 6 (10/30/06-11/3/06) CD Lessons: 53, 54, 55, 56 Textbook: pp , ,

Transcription:

Sampling Biases in IP Topology Measurements Anukool Lakhina with John Byers, Mark Crovella and Peng Xie Department of Boston University

Discovering the Internet topology Goal: Discover the Internet Router Graph Vertices represent routers, Edges connect routers that are one IP hop apart 212.12.58.3 212.12.5.77 source, A 163.55.221.88 163.55.1.41 destination, B 163.55.1.198 Measurement Primitive: traceroute Reports the IP path from A to B i.e., how IP paths are overlaid on the router graph

Traceroute studies today Sources k sources: Few active sources, strategically located. m destinations: Many passive destinations, globally dispersed. Union of many traceroute paths. Destinations (k,m)-traceroute study

High Variability in node degrees Degree distribution of routers found to be highly variable (degrees span several orders of magnitude). Various studies have even concluded that the degree distribution has a power law tail, Pr[ X > d] d c log(pr[x>d]) Dataset from [PG98] log(degree) [FFF99, GT00, BC01, ]

Our Question How reliable are (k,m)-traceroute methods in sampling graphs? We show that as a tool for measuring degree distribution, (k,m)-traceroute methods exhibit significant bias.

A thought experiment Idea: Simulate topology measurements on a random graph. 1. Generate a sparse Erdös-Rényi random graph, G=(V,E). Each edge present independently with probability p 1 1 Assign weights: w(e) = 1 + ε, where ε in, 2. Pick k unique source nodes, uniformly at random 3. Pick m unique destination nodes, uniformly at random 4. Simulate traceroute from k sources to m destinations, i.e. learn shortest paths between k sources and m destinations. 5. Let Ĝ be union of shortest paths. V V Ask: How does Ĝ compare with G?

Underlying Random Graph, G log(pr[x>x]) Measured Graph, Ĝ Underlying Graph: N=100000, p=0.00015 Measured Graph: k=3, m=1000 log(degree) Ĝ is a biased sample of G with a dramatically different degree distribution. Can high variability be a measurement artifact?

Outline Motivation and Thought Experiments Understanding Bias on Simulated Topologies Detecting Bias in Simulated Scenarios Statistical hypotheses to infer presence of bias Examining Internet Maps

Understanding Bias (k,m)-traceroute sampling of graphs is biased An intuitive explanation: When traces are run from few sources to large destinations, some portions of underlying graph are explored more than others. Edges incident to a node in Ĝ are sampled disproportionately.

Analyzing nonuniform edge sampling Question: Given some vertex in Ĝ that is h hops from the source, what fraction of its true edges are contained in Ĝ? Analysis reveals that: As h increases, fraction of edges discovered falls off sharply. Fraction of node edges discovered 1000dst 600dst 100dst Distance from source

What does this suggest? Destinations Edges close to the source are sampled more often than edges further away. S3 Intuitive Picture: Neighborhood near sources is well explored but, this visibility falls with hop distance from sources. S1 Destinations S2

Inferring Bias Goal: Given a measured Ĝ, is it a biased sample? Why this is difficult: Don t have underlying graph. Don t have criteria for checking bias. General Approach: Examine statistical properties as a function of distance from nearest source. Unbiased sample No change Change Bias

Towards Detecting Bias Examine Pr[D H], the conditional probability that a node has degree d, given that it is at distance h from the source. log(pr[x>x]) Ĝ degrees H=3 log(degree) Ĝ degrees H=2 Two observations: 1. Highest degree nodes are near the source. 2. Degree distribution of nodes near the source differs from those further away.

A Statistical Test for C1 C1: Are the highest-degree nodes near the source? If so, then consistent with bias. H C1 0 The 1% highest degree nodes occur at random with distance to nearest source. Cut vertex set in half: N (near) and F (far), by distance from nearest source. Let v : (0.01) V k : fraction of v highest-degree nodes that lie in N Can bound likelihood k deviates from 1/2 using Chernoff-bounds: Pr[ k δ (1+ δ ) > 2 ] < (1+ δ ) Reject null hypothesis with confidence 1-α if: α e (1+ δ) e (1+ δ) δ (1+δ ) v 2 v 2

A Statistical Test for C2 C2: Is the degree distribution of nodes near the source different from those further away? If so, consistent with bias. H C2 0 Degree distribution of nodes near the source is consistent with that of all nodes. Compare degree distribution of nodes in N and Ĝ, using the Chi-Square Test: 2 χ = l i= 1 ( O i E ) where O and E are observed and expected degree frequencies and l is histogram bin size. Reject hypothesis with confidence 1-α if: i 2 / E i 2 χ 2 > χ[ α, l 1]

Our Definition of Bias Bias (Definition): Failure of a sampled graph to meet statistical tests for randomness associated with C1 and C2. Disclaimer: Tests are binary and don t tell us how biased datasets are. A dataset that fails both tests is a poor choice for making generalizations about underlying graph.

log(pr[x>x]) Introducing datasets Dataset Name Date # Nodes # Links # Srcs # Dsts Reference Pansiot-Grad 1995 3,888 4,857 12 1270 PG98 Mercator 1999 228,263 320,149 1 NA GT00 Skitter 2000 7,202 11,575 8 1277 BBBC01 Pansiot-Grad Mercator Skitter log(degree)

Testing C1 H C1 0 The 1% highest degree nodes occur at random with distance to source. Pansiot-Grad: Mercator: Skitter: 93% of the highest degree nodes are in N 90% of the highest degree nodes are in N 84% of the highest degree nodes are in N

Testing C2 H C2 0 Degree distribution of nodes near the source is consistent with that of all nodes. Pansiot-Grad Mercator Skitter log(pr[x>x]) Far All Near Far All Near Far All Near log(degree)

Summary of Statistical Tests For all datasets, we reject both null hypotheses of no bias. We conclude that it is likely that true degree distribution of sampled routers is different than what is shown in these datasets.

Final Remarks Using (k,m)-traceroute methods to discover Internet topology yields biased samples. Rocketfuel [SMW:02] may avoid some pitfalls of (k,m)- traceroute studies but is limited-scale One open question: How to sample the degree of a router at random?