Virtual Landmarks for the Internet



Similar documents
International Journal of Advanced Research in Computer Science and Software Engineering

Manifold Learning Examples PCA, LLE and ISOMAP

Data Mining: Algorithms and Applications Matrix Math Review

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Binary vs Analogue Path Monitoring in IP Networks

3. INNER PRODUCT SPACES

Applied Algorithm Design Lecture 5

WITH THE RAPID growth of the Internet, overlay networks

Chapter ML:XI (continued)

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

We shall turn our attention to solving linear systems of equations. Ax = b

Topological Properties

Chapter Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling

Section 1.1. Introduction to R n

Going Big in Data Dimensionality:

Definition 8.1 Two inequalities are equivalent if they have the same solution set. Add or Subtract the same value on both sides of the inequality.

Inner Product Spaces

The Value of Visualization 2

Applications to Data Smoothing and Image Processing I

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

Traffic Engineering & Network Planning Tool for MPLS Networks

Visualization of General Defined Space Data

Linear Algebra Notes for Marsden and Tromba Vector Calculus

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!

2.3 Convex Constrained Optimization Problems

HDDVis: An Interactive Tool for High Dimensional Data Visualization

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Nonlinear Iterative Partial Least Squares Method

Similarity Search in a Very Large Scale Using Hadoop and HBase

Performance Metrics for Graph Mining Tasks

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff

Mathematics Course 111: Algebra I Part IV: Vector Spaces

Link Prediction in Social Networks

17. Inner product spaces Definition Let V be a real vector space. An inner product on V is a function

The Power of Asymmetry in Binary Hashing

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1

Approximation Algorithms

Visualization by Linear Projections as Information Retrieval

Chapter 19. General Matrices. An n m matrix is an array. a 11 a 12 a 1m a 21 a 22 a 2m A = a n1 a n2 a nm. The matrix A has n row vectors

Analyzing The Role Of Dimension Arrangement For Data Visualization in Radviz

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

Vector and Matrix Norms

How To Make Visual Analytics With Big Data Visual

Scheduling Home Health Care with Separating Benders Cuts in Decision Diagrams

Xiaoqiao Meng, Vasileios Pappas, Li Zhang IBM T.J. Watson Research Center Presented by: Payman Khani

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

Inner product. Definition of inner product

MA 323 Geometric Modelling Course Notes: Day 02 Model Construction Problem

How To Cluster

A Topology-Aware Relay Lookup Scheme for P2P VoIP System

A Study of Web Log Analysis Using Clustering Techniques

Peer-to-Peer Networks 02: Napster & Gnutella. Christian Schindelhauer Technical Faculty Computer-Networks and Telematics University of Freiburg

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Spazi vettoriali e misure di similaritá

Classification algorithm in Data mining: An Overview

Argonne National Laboratory, Argonne, IL USA 60439

Mathematical Model Based Total Security System with Qualitative and Quantitative Data of Human

Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances

Energy Efficient MapReduce

On real-time delay monitoring in software-defined networks

Neural Networks Lesson 5 - Cluster Analysis

Local outlier detection in data forensics: data mining approach to flag unusual schools

LIST OF FIGURES. Figure No. Caption Page No.

Linear Algebra Review. Vectors

24. The Branch and Bound Method

Optimal Gateway Selection in Multi-domain Wireless Networks: A Potential Game Perspective

Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research

Network Coding for Security and Error Correction

Metric Spaces. Chapter 1

RVS-Seminar Overlay Multicast Quality of Service and Content Addressable Network (CAN)

So which is the best?

Introduction to Multivariate Analysis

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

B490 Mining the Big Data. 2 Clustering

Metric Spaces. Chapter Metrics

Internet Firewall CSIS Packet Filtering. Internet Firewall. Examples. Spring 2011 CSIS net15 1. Routers can implement packet filtering

Tree based ensemble models regularization by convex optimization

Design of LDPC codes

Recall that two vectors in are perpendicular or orthogonal provided that their dot

Standardization and Its Effects on K-Means Clustering Algorithm

Large-Scale Similarity and Distance Metric Learning

Linear Programming. March 14, 2014

Peer-to-Peer Networks. Chapter 6: P2P Content Distribution

Computer Networks - CS132/EECS148 - Spring

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

5.1 Bipartite Matching

Lecture 2 Linear functions and examples

QUALITY OF SERVICE METRICS FOR DATA TRANSMISSION IN MESH TOPOLOGIES

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

Transcription:

Virtual Landmarks for the Internet Liying Tang Mark Crovella Boston University Computer Science

Internet Distance Matters! Useful for configuring Content delivery networks Peer to peer applications Multiuser games Overlay routing networks Server selection

Estimating Distance without Measuring Internet coordinates An Internet location assigned to each node Proposed by Ng and Zhang, IMW 2001 Called Global Network Positioning (GNP) What is distance? In this work, minimum RTT Corresponds to propagation delay in the absence of queueing/congestion Assumed to be stable long enough to be worth estimating Good first-order predictor of path performance

Internet Coordinates: The Basic Idea Assign each node a set of coordinates, such that Euclidean distance approximates network distance (minimum RTT) d x 1 = (3,2,4) x 2 = (-2,5,3) x 1 -x 2 d

But but but This can t work! Internet distances are too irregular! The Internet has arbitrary connectivity with no obvious geometry! And assigning coordinates must be computationally very expensive!

Two Questions 1. Are Internet coordinate schemes really accurate when applied to large sets of measurements spanning the whole Internet? 2. Can Internet coordinates be assigned in a computationally efficient way?

The Embedding Problem A metric space is a pair (X,d) where X is a set of points, and d: (X,X) R is a metric, i.e., it is: symmetric, positive definite, and satisfies the triangle inequality. A Euclidean space R n is a metric space (Y,δ) with Y = a vector set and δ = the Euclidean norm An embedding is a mapping φ: X R n Given some X, d, and n, we seek an accurate embedding, i.e., a φ with δ(φ(x 1 ), φ(x 2 )) d(x 1, x 2 ) for all x 1, x 2 in X

Versions of the Embedding Problem Finite Metric Space (graph) embeddings N. Linial Precise, algorithmic, worst-case Distance geometry X and d are taken from a known Euclidean space Exact solution for φ from linear algebra Multidimensional Scaling (MDS) Using geometric embedding to approximate empirical measurements

Multidimensional Scaling (MDS) The most general kind of embedding problem Arose first in psychology Treated as a nonlinear optimization, ie, φ = arg min f Σ x 1, x 2 in X (δ(f(x 1 ), f(x 2 )) - d(x 1, x 2 )) 2 Method used in first Internet studies (GNP) Solved approximately via iterative methods slow, can be difficult to configure

A different method: Lipschitz embedding Lipschitz embedding: a point s coordinates are the distances to a fixed set of landmarks 1 4 9 7 3 1 x 1 = (1,4,9) x 2 = (7,3,1)

Why does the Lipschitz embedding work? Recall that d obeys the triangle inequality (x 1,y 1,z 1 ) (x 2,y 2,z 2 ) x 1 -x 2 <, y 1 -y 2 <, etc. so, if nodes 1 and 2 are close, their coordinates are similar

Lipschitz embedding of Internet distance Advantages: Fast! Simple! Questions: Triangle inequality doesn t hold does it matter? What is the right number of dimensions? How can we achieve low dimensional embedding? More landmarks generally better results But more landmarks larger coordinate vectors Most importantly is it accurate?

Turning to the data Dataset Dimensions # Msmts Notes GNP 19 869 16,511 50% in NA RON1 13 13 169 Mostly US RON2 15 15 225 Mostly US NLANR AMP 116 116 13,456 Abilene-connected Skitter 12 196,286 2,355,565 50% outside US, attempts to span IP space Sockeye 11 156,359 1,719,949 penultimate hop to a node in each live /24

First question: Triangle Inequality CDF of min (d(i,k) + d(k,j))/d(i,j) over all pairs (i,j) k

Next Question: How many dimensions? Answer via Principal Component Analysis (PCA) PCA: optimal linear projection from higher dimension to lower dimension φ is a linear function, so equivalent to multiplying by a matrix M i.e., φ(x 1 ) Mx 1 Plot of error of projection, as a function of number of dimensions of projected points, is called a scree plot

Exploring Dimensionality via Scree Plots Illustrative experiment: start with 250 points randomly scattered in an n-dimensional unit hypercube Form the 250 250 distance matrix Treat this matrix as a set of 250 points in 250-dimensional space, i.e., as a Lipschitz embedding. What is the error of projecting these points to a low dimensional space?

Scree Plot Exposes Underlying Dimension

Scree Plots of Internet Data Datasets similar, and error dropoff sharp!

Last Question: Achieving Low Dimensional Embedding Scree plots also tell us that we can use PCA to reduce dimensionality of Lipschitz embedding i.e., let x 1, x 2, x 3, each be a set of measurements to n known landmarks Treat each as a vector of length n Then there is an r n matrix M with r 8, such that Mx i Mx j x i -x j M is found easily using PCA Call this method virtual landmarks coordinates are linear combinations of distances to real landmarks

Summary: Implications for Lipschitz Embedding Triangle Inequality violations not severe Embeddings in 7 to 9 dimensions should be sufficient PCA can provide dimensionality reduction of Lipschitz embedding so, is Lipschitz embedding accurate? Evaluate using relative error: δ(φ(x 1 ), φ(x 2 )) - d(x 1, x 2 ) / d(x 1, x 2 )

Lipschitz embedding in 8 dimensions 90% of distances have r.e. less than 0.5 (Skitter: 90% have r.e. less than 0.34)

Virtual Landmarks compared to GNP GNP: 3,626 sec VL: < 1 sec NLANR AMP Dataset

Virtual Landmarks compared to GNP (2) GNP: 182 sec VL: < 1 sec GNP dataset

Scaling Virtual Landmarks So far we have assumed that each node needing coordinates uses measurements to the same set of landmarks presents scaling problems But this is not necessary VL method removes dependence on specific landmarks Different nodes can use different landmark sets As long as transformation between different coordinate systems is known

Scaling via Spanners M 1 M 2 T 21 M 1 x 1 M 2 x 2 Spanners Spanners determine their coordinates in both systems so can compute transformation matrix T 21

Accuracy using Spanners 5 replications, AMP dataset, 2 sets of 20 landmarks

Coordinate Schemes for the Internet Virtual Landmarks (Lipschitz embedding combined with PCA) is a fast and accurate method for assigning Internet coordinates Computation is scalable to millions of nodes Measurement is scalable to millions of nodes Internet distances are surprisingly amenable to geometric embedding Dimension about 7 to 9 Consistent over all datasets

Why do network coordinate schemes work?

Coordinate systems are powerful Coordinate systems open the door to geometric approaches to Internet problems Clustering Partitioning Potential to unify hybrid wired/wireless application configuration Potential to optimize overlays, p2p, multicast, server selection, etc. A new kind of map of the Internet