NETWORK DISCOVERY USING INCOMPLETE MEASUREMENTS


NETWORK DISCOVERY USING INCOMPLETE MEASUREMENTS

by

Brian Eriksson

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy (Electrical Engineering)

at the University of Wisconsin - Madison

2010

© Copyright by Brian Eriksson 2010
All Rights Reserved

To Amy and my parents.

Acknowledgements

First of all, I would like to thank my advisors, Robert Nowak and Paul Barford. This dissertation could not have been written without the extensive professional support given by both, including the countless hours spent editing my paper drafts and presentations. Rob, in particular, thank you for taking a chance on me early in my graduate school career and for teaching me some very important life lessons along the way. And Paul, thank you for spending many late deadline nights helping me with papers at the last minute and for introducing me to a research area that allowed me to be creative and exploit my unique skill set.

I would also like to thank the many other research collaborators who gave me significant help over the course of my graduate studies, including Bruce Maggs (Duke University/Akamai), Aarti Singh (CMU), Nick Duffield (AT&T), Matthew Roughan (University of Adelaide), Joel Sommers (Colgate University), Peyman Milanfar (University of California - Santa Cruz), Mark Crovella (Boston University), Mark Coates (McGill University), and Sina Farsiu (Duke University). Also, the remaining members of my PhD committee, Amos Ron, Barry Van Veen, and William Sethares. Final professional thanks go to the many past and current graduate students here at the University of Wisconsin, including Minglei Huang, Jarvis Haupt, Mike Rabbat, Waheed Bajwa, Rui Castro, Laura Balzano, and Gautam Dasarathy.

On a personal note, I would like to thank my parents and my sister for their support throughout my graduate school career. You did not always understand what exactly I was doing in graduate school, but you did always understand why I was doing it. Also, to the doctors and nurses at the University of Wisconsin hospitals and clinics. While the final months of my Ph.D. were completely different than I ever could have imagined, everyone here has treated me with respect and care during a difficult time. Finally, to my fiancée and best friend, Amy. You have been the only force of sanity in my life during moments when I frequently lost my perspective on things. And I hope that in the future a majority of our conversations no longer revolve around complaining about graduate school.

"I'm not bound to succeed, but I'm bound to live up to what light I have."

Contents

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
ABSTRACT

1 Introduction
   Internet Measurements and Terminology
      Active Measurements
      Passive Measurements
   Motivation and Summary of Major Contributions
      Internet Topology Inference
      IP Geolocation
      Anomaly Detection
   Organization

2 Related Works
   Network Topology Discovery from Incomplete Passive Measurements
      Passive End Host Clustering
      Passive Shared Path Estimation
      Passive Network Embedding
   Inferring Unseen Structure of the Network Core
   Toward the Practical Use of Network Tomography
   IP Geolocation using Population Data
      NBgeo Related Work
      PinPoint Related Work
   Model-based Anomaly Detection

3 Network Topology Discovery from Incomplete Passive Measurements
   Passive Measurement Datasets
   Passive Clustering of End Hosts
      Hop Count Distance Vectors and Network Topology
      Client Clustering
      Gaussian Mixture Model for Subnet Clusters
   The Missing Data Problem - Imputing Missing Hop Counts
      Imputation Methods
   Shared Infrastructure Estimation
      Cluster-Level Shared Path Length Estimation
      Predictive Shared Path Length Estimation
      Shared Path Estimation Analysis
      Topology Estimation Performance with Imputed Data
   End Host to End Host Distance Estimation
      Multidimensional Scaling (MDS)
      Landmark MDS using Active Measurements
      Landmark MDS using Incomplete Passive Measurements
      Exploiting BGP Data
      End Host to End Host Distance Estimation Results
   Summary

4 Inferring Unseen Structure of the Network
   Experimental Dataset
   Core Router Definition
   Core Router Identification
   Inferring Unseen Components of the Core
   Estimating the Number of Unseen Core Routers
      Experimental Performance
   Estimating Unseen Connectivity
      Matrix Completion Algorithm
      Experimental Performance of Matrix Completion
   Inferring Unseen Core Links
      Experimental Performance of Unseen Link Inference
   Adaptively Targeted Core Probing
      Naive Random Selection Probing
      Unseen Target Estimation Probing Algorithm
      Targeted Probing Experiments
   Summary

5 Toward the Practical Use of Network Tomography
   Depth-First Search (DFS) Order
   Logical Topology Discovery using DFS Ordering
      Case A - σ²_{i,i-1} − σ²_{i-1,i-2} < δ
      Case B - σ²_{i,i-1} ≥ σ²_{i-1,i-2} + δ
      Case C - σ²_{i,i-1} + δ < σ²_{i-1,i-2}
   Depth-First Search Ordering Estimation
   Experiments
      Prior Methods
      Datasets
      Synthetic Noise-Free Experiments
      Real World Experiments
   Summary

6 Active Clustering of Hierarchical Data
   The Hierarchical Clustering Problem
   Active Hierarchical Clustering under the CL Condition
   Robust and Efficient Hierarchical Clustering with CL Violations
   Experiments
   Summary

7 IP Geolocation using Population Data
   Geolocation using NBgeo
      Bayesian Geolocation Framework
   Geolocation using PinPoint
      PinPoint Methodology Summary
   Hop-Based Mapping using Landmarks
      NTP Hop-Based Mapping Experiments
      Commercial Node Hop-Based Mapping Experiments
      Passive Hop-based Mapping
      Targeted Distance Estimates using Hop-based Mapping
   Latency to Distance Estimation
      Exponential Latency Weighting
      Sparse Embedding Algorithm
   PinPoint IP Geolocation Algorithm
   Experiments
      Comparison Methodologies
      NBgeo Experiments
      PinPoint Experiments
      Complete Landmark Probing Results
      Latency Probing Budget Results
      PinPoint Component Performance
      Bootstrap Confidence Bounds Results
   Summary

8 Model-based Anomaly Detection
   Anomaly Datasets
      Synthetic Traffic Data
      GEANT Data
      Abilene Real-World Data
   BasisDetect Overview
      Basis Decomposition of Network Data
      Dictionary Construction from Labeled Set
      Anomaly Decomposition using Penalized Basis Pursuit
   Experiments
      GEANT Time-series Network Data
      Wavelet Type Performance Analysis
      Tuning Parameter Performance Analysis
      Synthesized Network-wide Data
      Abilene Real-World Network Data
   Summary

9 Conclusions
   Future Work
   Concluding Statement

Author Bibliography
   Conference Papers
   Journal Papers
   Technical Reports

List of Tables

3.1 Details of honeypot data sets used in our study. All data was collected over a one day period on December 22.
3.2 Counts of occurrences of common source IP addresses in multiple honeypots.
3.3 Shared path estimation results for a 1000 node synthetic topology, assuming that probes from 800 randomly selected end host nodes were observed in 8 randomly selected monitors.
3.4 Shared path estimation results for the Skitter topology, assuming that probes from 700 randomly selected end host nodes were observed in 8 randomly selected monitors.
3.5 Comparison of three different techniques for discovering pairwise distances between end hosts, where N is the number of end hosts and M is the number of monitors (N ≫ M).
4.1 An example hop count matrix using observed hop elements from the single traceroute path p_1 → r_1 → r_2 → r_3 → r_4 → p_2 (where "-" represents an unknown element).
4.2 Hop matrix reconstruction error rates. The RMSE of 100,000 held-out core router to core router hop distances.
4.3 Division of matrix completion errors for holdout data.
4.4 Performance of the unseen link classification algorithm with various threshold values, using λ thresholding on both the bootstrap thresholding and the hop count thresholding methodologies.
5.1 Upper bound probing complexity for the three probing methodologies for a balanced l-ary tree (where p(l) is sublinear in l).
5.2 Comparison of the number of probes needed to estimate the logical topology using synthetic Orbis topologies.
7.1 NTP Dataset - The average geolocation error for various end host to landmark mapping methodologies.
7.2 NTP Dataset - Hop-based Mapping methodology quintile errors.
7.3 Commercial Node Dataset - The average geolocation error for various end host to landmark mapping methodologies.
7.4 Commercial Node Dataset - Hop-based Mapping methodology quintile errors.
7.5 PinPoint algorithm complexity for both probing and computation, where N is the number of end hosts, K is the probing budget, T is the number of monitors, M is the number of landmarks, B is the number of bootstrap iterations, and G is the number of feasible geolocation points.
7.6 The geolocation error for all geolocation methodologies using latency data from all landmarks (error distance in miles).
7.7 The performance of the NBgeo algorithm given additional data (error distance in miles).
7.8 The geolocation error (in miles) for all geolocation methodologies using latency data from all landmarks (for number of landmarks, T = 50, 200).
8.1 GEANT Network Data - Number of false alarms declared for a percentage of the true anomalies detected.
8.2 GEANT Network Data - Number of false alarms declared in order to detect every anomaly in the GEANT dataset (with respect to various wavelet types).
8.3 Synthetic Traffic Matrices - Number of false alarms declared for a percentage of the true anomalies detected.
8.4 Abilene Network Data - Number of false alarms declared for a percentage of the PCA anomalies detected.

List of Figures

1.1 Toy network topology. (Left) - Physical topology, (Right) - Logical topology.
1.2 Example of the three Internet measurement types (ping, traceroute, and passive, where the monitor is at IP6) between two points in the network, where a time-to-live (TTL) value of 60 indicates that there are 64 − 60 = 4 routers between the two end hosts in the network.
3.1 (Left) Example network topology with sources S_i sending packets through a core component to monitors M_k. (Right) Example network where S_1 and S_2 share a border router.
3.2 Comparison of clustering results for Unique Contrast Clustering and Hop Distance Nearest Neighbor in terms of the average number of cluster elements in matching IP subspaces.
3.3 Example of a subnet having multiple egress points.
3.4 2-D histogram of hop count contrast vectors with clusters highlighted in ellipses.
3.5 Comparison of Gaussian mixture clusters to random clusters. Simulated topology, N = 1000.
3.6 Comparison of Gaussian mixture clusters to random clusters. Skitter topology, N = 700.
3.7 Striped dots indicate passive measurement data observed, black dots indicate no information observed. (Left) - Observations where network-centric imputation may perform well, (Right) - Observations where network-centric imputation will fail.
3.8 Imputation accuracy over a range of randomly selected missing values using data from the real-world honeynet dataset.
3.9 Imputation accuracy over a range of randomly selected missing values using data from M = 16 honeypots. (Left) N = 1000, (Center) N = 2000, (Right) N = 3000.
3.10 Spectrum of sharedness (black dots represent routers). (Left) No sharedness, (Center) Intermediate sharedness, (Right) Maximum sharedness.
3.11 Example of cluster-level path estimation.
3.12 The effect of increasing the number of clusters on the shared path estimation performance on the simulated topology using the cluster-level shared path estimation method.
3.13 The effect of increasing the number of clusters on the shared path estimation performance on the Skitter topology using the Gaussian mixture EM cluster-level shared path estimation method.
3.14 Topology estimation performance for two different estimation methods in a 1000 node synthetic topology. (Left) M = 8, (Center) M = 16, (Right) M = 24.
3.15 Performance of the topology estimation algorithm in 1000, 2000, and 3000 node synthetic topologies with M = 16. (Left) - Predictive Function Topology Estimation, (Right) - Cluster-Level Topology Estimation.
3.16 Example mask array W (with N = 4 and M = 2). Note that not all hop-counts from end hosts to monitors are observed, and none of the hop-counts between end hosts are observed.
3.17 Two end hosts with the same hop distance to a single monitor.
3.18 Simulation results for error rates of pairwise hop estimation for the synthetic topology versus amount of available data (N = 1000). (Left) M = 8, (Center) M = 16, (Right) M = 24.
3.19 Simulation results for error rates of pairwise hop estimation for the synthetic topology versus amount of available data (M = 16). (Left) N = 2000, (Right) N = 3000.
3.20 The effect of the embedding dimension on estimating the pairwise distance values for the synthetic topology, N = 1000, with M = 32 and calculated dimension d = 5; confidence bars indicate +/- 1 standard deviation.
3.21 The effect of adding additional monitors on estimating the pairwise distance values for the synthetic topology, observing complete hop count data, N = 3000; confidence bars indicate +/- 1 standard deviation.
3.22 RMSE of pairwise hop estimation simulation results for the Skitter topology (N = 1000). (Left) M = 8, (Center) M = 16, (Right) M = 24.
3.23 Simulation results for asymmetric reverse paths for the synthetic topology (N = 1000, M = 16) versus amount of available data. (Left) Reverse paths off by 1 hop, (Center) Reverse paths off by 2 hops, (Right) Reverse paths off by 3 hops.
4.1 A representation of our pragmatic definition of the Internet's core.
4.2 Empirical cumulative probability for the imputation error using both Matrix Completion and Mean imputation.
4.3 Percentage of total links correctly classified plotted against the threshold of the confidence upper bound (λ) for both the bootstrap upper bound estimate and the hop count estimate.
4.4 (Left) - Number of additional unique core routers found using the two probing techniques, (Right) - Number of additional unique core links found using the two probing techniques.
5.1 Example of Network Radar on a simple logical topology.
5.2 Example simple logical topology in a proper DFS Order.
5.3 (A) - Case A - σ²_{i,i-1} − σ²_{i-1,i-2} < δ. The current end host x_i is attached to the parent of x_{i-1}. (B) - Case B - σ²_{i,i-1} ≥ σ²_{i-1,i-2} + δ. A new router r_i is created with children x_i, x_{i-1} and with parent f(x_{i-2}).
5.4 (A) - Case C-1 - σ²_{r′} − σ²_{i,i-1} < δ. The current end host (x_i) is attached to router r′. (B) - Case C-2 - σ²_{r′} < σ²_{i,i-1} + δ. A new router r_i is attached on the path between routers r′ and f(r′).
5.5 Example of covariance values from a single end host not revealing the entire topology.
5.6 (Left) The first split taken on a balanced l-ary tree. (Right) The second split taken on a balanced l-ary tree. Both splits are indicated by the dotted line; the arrow indicates the randomly chosen end host that covariance values are measured against.
5.7 Real world topology used to test tomography methods.
5.8 Topology reconstruction results for the three algorithms (DFS Ordering, Sequential, and Hierarchical Clustering).
6.1 Resulting ordering of gene microarray reconstructions. (Left) - Standard Agglomerative Clustering, (Center) - Outlier Based Clustering, (Right) - Robust Outlier Based Clustering.
7.1 (Left) - Probability of latency measurements between 10-19 ms being observed given a target's distance from a monitor. Stem plot - Histogram density estimation, Solid line - Kernel density estimation. (Right) - The kernel estimated probability of placement in each county given a latency observation between 10-19 ms from a single monitor marked by x.
7.2 (Left) - Estimated posterior probabilities for all counties in the continental US. (Right) - Estimated posterior probabilities for constraint-based restricted counties.
7.3 Toy example of network routing geography vs. direct line-of-sight geography.
7.4 Geographic placement of NTP servers.
7.5 Example of a network where an end host is C hops away from a landmark, with both sharing the same border router.
7.6 (Left) Hop-based geolocation mean error decay with the number of observed hop counts by each end host. (Right) Hop-based geolocation median error decay with the number of observed hop counts by each end host. (Standard deviations are shown in the error bars.)
7.7 Likelihood distribution of distance to landmark given an observed latency of 10-20 ms. Solid line - Kernel density estimation, Dashed line - Estimated cumulative distribution, Dashed blocks - Histogram.
7.8 Empirical cumulative probability of error distance for both NBgeo with constraint information and the CBG method.
7.9 Median geolocation error (in miles) given a limited probing budget (T = 200).
7.10 Cumulative distribution of geolocation error for both the PinPoint and Octant algorithms (K = 20, T = 200).
7.11 Cumulative distribution of geolocation error for PinPoint removing the improvements of bootstrap estimation and exponential latency weighting (K = 20, T = 200).
7.12 Cumulative distribution of geolocation error for confidence quintiles derived from 95% bootstrap confidence interval size (K = 20, T = 200).
8.1 The BasisDetect framework.
8.2 (Left) - Minutes of packet counts across a single link in the GEANT network. Known anomalies are marked with x. (Center) - The first four atoms found in the signal dictionary consisting of a Discrete Cosine Transformation (DCT). (Right) - Comparison of the observed signal with a representation using the best linear combination of the four atoms.
8.3 Fourier analysis of Abilene data (March 1-15, 2004). (Left) - Abilene traffic, (Center) - Important region of the power spectrum, (Right) - Fourier approximation.
8.4 GEANT Network Data - False alarm anomalies found for a specified level of true anomaly detection for the three time-series detection methodologies (Fourier, EWMA, BasisDetect).
8.5 Tuning parameter performance experiment: examination of how well the BasisDetect algorithm performs as each of the tuning parameters is removed, using the full BasisDetect algorithm (γ, ρ learned from the training set), BasisDetect w/o residual (γ learned from the training set, ρ = 0), and BasisDetect w/o penalty (ρ learned from the training set, γ = 0).
8.6 Synthetic Traffic Matrices - False alarms declared for a specified level of true anomaly detection for the three network-wide detection methodologies (PCA, Distributed Spatial, BasisDetect).
8.7 Abilene Real-World Network Data - Using 15 anomalies found by the PCA methodology, the false alarm rates are displayed for both BasisDetect and the Distributed Spatial methodology.

List of Algorithms

1 Gaussian Mixture EM Imputation Algorithm
2 MDS Algorithm with Incomplete Passive Measurements
3 MDS Algorithm with Incomplete Passive Measurements and BGP Information
4 Unseen Link Estimation Algorithm - Bootstrap Thresholding
5 Unseen Target Estimation Probing Algorithm
6 Ordered Logical Topology Discovery Algorithm
7 Bisection DFS Ordering Algorithm - bisect(x, δ)
8 Outlier-based Clustering Algorithm
9 Robust Outlier Clustering Algorithm
10 NBgeo - Naive Bayes IP Geolocation Algorithm
11 PinPoint IP Geolocation Algorithm
12 Dictionary Construction Algorithm
13 Penalized Basis Pursuit Algorithm

Abstract

Resolving characteristics of the Internet from empirical measurements is important in the development of new protocols, traffic engineering, advertising, and troubleshooting. Internet measurement campaigns commonly involve heavy network load probes that are usually non-adaptive and incomplete, and thus directly reveal only a fraction of the underlying network characteristics. This dissertation addresses the open problem of Internet characteristic discovery in an incomplete measurement regime. Using partially observed measurements, we specifically focus on the problems of Internet topology discovery, inferring the geographic location of Internet resources, and network anomaly detection.

First, we consider the inference of topological characteristics of the Internet from three distinct forms of incomplete measurements. Initial work demonstrates how Passive Measurements, potentially-incomplete passively observed characteristics of the network, can be used to infer topological structure, such as clustering and shared path lengths. The second form of missing measurements comes from a set of traceroute probes, where we obtain partial knowledge of route lengths between routers in the network. Using a novel statistical methodology, we show how unobserved links between routers can be detected. Finally, we develop a novel targeted delay-based tomographic methodology, which resolves the tree topology of a network using a number of directed measurements within a poly-logarithmic factor of derived lower bounds.

The second component of this dissertation focuses on two critical networking problems: geographic location inference of Internet resources and network anomaly detection. In terms of geographic location inference, our methodology exploits a set of landmarks in the network with known geographic location and targeted latency probes to avoid erroneous measurements caused by non-line-of-sight routing of long network paths. The use of a novel embedding algorithm allows for the inferred geolocation of end hosts to be clustered in areas of large population density without explicitly defined population data. Finally, we examine detecting unforeseen anomalous events in a network. Using a limited training set of labeled anomalies, our new anomaly detection framework extracts signal characteristics of anomalous events and detects their occurrence across observed network-wide measurements.

NETWORK DISCOVERY USING INCOMPLETE MEASUREMENTS

Brian Eriksson

Under the supervision of Professor Robert D. Nowak
At the University of Wisconsin - Madison

Resolving characteristics of the Internet from empirical measurements is important in the development of new protocols, traffic engineering, advertising, and troubleshooting. Internet measurement campaigns commonly involve heavy network load probes that are usually non-adaptive and incomplete, and thus directly reveal only a fraction of the underlying network characteristics. This dissertation addresses the open problem of Internet characteristic discovery in an incomplete measurement regime. Using partially observed measurements, we specifically focus on the problems of Internet topology discovery, inferring the geographic location of Internet resources, and network anomaly detection.

The first problem addressed in this work is the inference of topological characteristics of the Internet from three distinct forms of incomplete measurements. Initial work demonstrates how Passive Measurements, potentially-incomplete passively observed characteristics of the network, can be used to infer topological structure, such as clustering and shared path lengths. The second form of missing measurements comes from a set of traceroute probes, where we obtain partial knowledge of route lengths between routers in the network. Using a novel statistical methodology, we show how unobserved links between routers can be detected. Finally, we develop a novel targeted delay-based tomographic methodology, which resolves the tree topology of a network using a number of directed measurements within a poly-logarithmic factor of derived lower bounds.

The second component of this dissertation focuses on two critical networking problems: geographic location inference of Internet resources and network anomaly detection. In terms of geographic location inference, our methodology exploits a set of landmarks in the network with known geographic location and targeted latency probes to avoid erroneous measurements caused by non-line-of-sight routing of long network paths. The use of a novel embedding algorithm allows for the inferred geolocation of end hosts to be clustered in areas of large population density without explicitly defined population data. Finally, we examine detecting unforeseen anomalous events in a network. Using a limited training set of labeled anomalies, our new anomaly detection framework extracts signal characteristics of anomalous events and detects their occurrence across observed network-wide measurements.

Approved:

Professor Robert D. Nowak
Department of Electrical and Computer Engineering
University of Wisconsin - Madison

Chapter 1

Introduction

As science and technology advance, extremely complex systems will become ever more prevalent. This complexity is often in the form of decentralized systems with a very large number of interdependencies. These systems form loosely defined networks, and can be found in areas ranging from social interaction to genetic regulation systems. The analysis of these structures, commonly known as Network Science, has emerged in recent years to develop formal methodologies for predicting and modeling behaviors of these complex systems. This dissertation will focus on developing novel statistical and machine learning techniques in the field of Network Science for application to one of the largest man-made network structures in existence, the Internet.

Over the past quarter century, the Internet has grown into a gigantic, extremely complex infrastructure that connects over a billion users worldwide. The ability to measure, map, and analyze characteristics of the Internet accurately would facilitate network design, network management, and network security processes by exposing the strengths and weaknesses in connectivity and opportunities to improve its robustness and performance. Prior work on discovering network characteristics, such as generating Internet maps [1, 2] or IP geolocation [3, 4], has mainly focused on the engineering problems associated with extensively probing the Internet using high network load probes, and then aggregating the vast quantities of data returned. This approach has inherent shortcomings: the large probing load on the network leads to timeliness issues for the estimated characteristics, and the approach requires frequently updated disambiguation databases. In contrast, the work in this dissertation focuses on transforming the task of resolving network characteristics from an engineering exercise based on exhaustively probing the network into a mathematical inference problem based on exploiting a non-exhaustive subset of observed network measurements.

By validating our methodologies on a known network structure such as the Internet, we have a stepping stone for the vast array of other scientific problems where the existence of a network is implied (e.g., genetic regulatory networks, brain networks, etc.).

Thesis Statement

The complexity of the Internet requires more thorough and intelligent analysis of network measurements than previously performed. By exploiting known network structure and novel data fusion methodologies, latent information in noise-corrupted and incomplete Internet measurements can be revealed. This hidden information exposes important new features in the network that were previously ignored, with applications in the areas of topology discovery, IP geolocation, and anomaly detection.

1.1 Internet Measurements and Terminology

In order to refer to objects in the Internet, throughout this dissertation we will use terminology common to the networking literature. An end host will refer to any object in the Internet that can send or receive information requests. We will focus on the level-3 network layer, where these end hosts are connected through routers, which direct where data packets travel such that the packet destination is eventually reached. An autonomous system (AS) is a partition of end hosts and routers controlled by a single network operator (e.g., Level3 or AT&T). A physical topology route will indicate the specific physical routers that a path between two end hosts contains, and a logical topology route will refer to the path topology containing only routers with either in-degree or out-degree greater than one. A labeled example of this terminology can be seen in Figure 1.1. In order to resolve characteristics of the network, we will rely on information returned through the use of either Active or Passive Internet Measurements.

Figure 1.1: Toy network topology. (Left) - Physical topology, (Right) - Logical topology.

Active Measurements

The vast majority of prior Internet characteristic discovery research is dependent on Active Measurements [1, 2, 3, 4, 5, 6, 7, 8, 9], which here we define as tomographic network measurements performed by specifying a probe destination target from a set origin point in the network. The active measurement output will consist of some characteristic of the network between the origin and destination (e.g., delay, router topology, etc.). In this dissertation, we focus on two specific active measurement probes, ping and traceroute.

Ping Measurements

The most basic active probes considered in this dissertation are simple ICMP ping probes [10]. Using an ICMP echo request packet, the origin host computer will send a probe to a targeted destination end host, returning both the round trip time latency (RTT) in milliseconds and the time-to-live (TTL) value, which indicates the number of routers between the host computer and the targeted end host. The advantage of ping measurements is that the probes are lightweight, with very little load on the network path, while the main disadvantage is that no further useful topology characteristics (shared path lengths of two network routes, specific routers traversed by a path, etc.) are returned by individual ICMP ping measurements.
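As a concrete illustration of this probing primitive, the minimal sketch below issues a single ICMP echo request through the system ping utility and extracts the two quantities just described, the RTT and the TTL. It is only an illustrative helper (not part of the measurement infrastructure used in this dissertation) and assumes a Linux-style ping whose reply line contains "ttl=" and "time=" fields.

import re
import subprocess

def ping_probe(target, timeout_s=2):
    # Send a single ICMP echo request using the system ping utility.
    out = subprocess.run(["ping", "-c", "1", "-W", str(timeout_s), target],
                         capture_output=True, text=True).stdout
    # A Linux ping reply line looks like:
    #   64 bytes from 1.2.3.4: icmp_seq=1 ttl=52 time=24.3 ms
    match = re.search(r"ttl=(\d+).*time=([\d.]+)\s*ms", out)
    if match is None:
        return None            # host unreachable or ICMP probe blocked
    ttl, rtt_ms = int(match.group(1)), float(match.group(2))
    return ttl, rtt_ms

For example, ping_probe("example.edu") would return the observed TTL and RTT for a reachable target, or None when the probe is filtered, which is exactly the limitation discussed in the traceroute section below.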

traceroute Measurements

In prior research (e.g., [1, 2]), the predominant measurement for acquiring Internet topology characteristics has been based on traceroute-like probes. Standard traceroute probes further exploit ICMP packets to return both the number of routers (i.e., the hop count between two points in the network) and the set of router interface IP addresses along the path between the two probe points. This probing methodology allows routing adjacencies along the observed path to be known (e.g., router A is physically connected to router B). In addition, the router interface IP addresses allow for domain name server (DNS) requests for further information about each router along the observed path. This technique, referred to as undns [11], creates location hints that have been used frequently on the problem of estimating an end host's geographic location [3, 4]. Unfortunately, effective use of such hints requires significant and frequently updated databases, which still introduce the possibility of errors [12].

Great strides have been made in mitigating the problems associated with active probe Internet measurements, such as interface disambiguation in [11, 13], which is the problem of resolving multiple IP addresses associated with a single physical router. This has enabled accurate mapping of ISP topologies (e.g., [2]) and of the Internet's core (e.g., [1, 14]) using traceroute probes. However, there are still three important limitations in the use of active probing tools for Internet characteristic discovery. First, the vast size of the Internet means that a set of measurement hosts M and target hosts N, where N ≫ M, must be established in order for the resultant measurements to capture the diverse features of the infrastructure (especially on the edges of the network [14]). Second, active probes sent from monitors to the large set of target hosts result in a significant traffic load on the network. Third, in order to prevent reverse engineering of networks, service providers frequently attempt to thwart structure discovery by blocking ICMP probes to specific routers (and thus blocking both traceroute and ping probes). This results in the acquisition of incomplete active measurements due to (i) the inability to perform exhaustive probing of all objects in the network in a reasonable length of time, and/or (ii) the obfuscation of critical network infrastructure by administrators to avoid reverse engineering.

Passive Measurements

One alternative to the use of active probes is to acquire network information passively, where instead of introducing probe-based traffic into the network we measure existing Internet traffic. The methodology we will consider in this dissertation consists of a series of monitors on network links sampling traffic. From passively sampled packets, we can obtain the IP address and Time-To-Live (TTL) count from the packet header. At its origin, each packet is assigned an operating-system-dependent initial TTL value (e.g., 64, 128, or 255). As the data packet traverses the network, the TTL count is decremented by one at each router encountered. When the TTL count reaches zero, the packet is discarded, thus preventing packets from traversing the network forever. Using the technique from [15], the TTL count can be translated into the number of routers between the end host and the passive monitor. It is not uncommon to observe packets from a single end host source at several of the passive monitors, resulting in a vector of hop-count distances from each monitor to that source. These vectors provide an indication of the topological location of the source relative to the monitors, with no additional load added to the network by measurement probes. Unfortunately, a finite duration passive measurement campaign will likely result in incomplete measurements, because packets from each host are typically only observed at a subset of the monitors. This can be due either to packet sampling restrictions at our monitors (where only a subset of traffic will be observed) or to an end host not directing any traffic towards the locations of specific monitors. Due to the inherently incomplete nature of passive measurements, there is relatively little prior work that addresses passive network monitoring. An example of the three Internet measurement types considered in this dissertation is shown in Figure 1.2.
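A minimal sketch of the TTL-to-hop-count translation just described is given below. It assumes, as in the discussion above, that the initial TTL is one of a small set of operating-system defaults and that the true hop count never exceeds the gap to the next default; the function name is illustrative and this is not the exact procedure of [15].

def ttl_to_hop_count(observed_ttl, initial_ttls=(64, 128, 255)):
    # Assume the initial TTL is the smallest common OS default >= the observed TTL;
    # the hop count is then the number of decrements the packet saw in transit.
    initial = min(t for t in initial_ttls if t >= observed_ttl)
    return initial - observed_ttl

# Example: a passively observed packet arriving with TTL 60 implies
# 64 - 60 = 4 routers between the source end host and the monitor.
print(ttl_to_hop_count(60))   # -> 4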

Figure 1.2: Example of the three Internet measurement types (ping, traceroute, and passive, where the monitor is at IP6) between two points in the network, where a time-to-live (TTL) value of 60 indicates that there are 64 − 60 = 4 routers between the two end hosts in the network.

1.2 Motivation and Summary of Major Contributions

We will focus on three significant network characterization problems: (i) topology inference, (ii) IP geolocation, and (iii) network anomaly detection. Our approach will be to develop methodologies to handle the effects of missing and corrupted measurement data to improve upon prior network characterization performance. This missingness will be the result of either measurements that are unavailable during the probing campaign (e.g., incomplete passive measurements), or the result of a targeted probing algorithm where we will select particular active measurements in order to reduce the total load on the network.

Internet Topology Inference

There are significant challenges in any approach to measurement and characterization of Internet topology. First, the lack of built-in support for topology measurement, coupled with the desire of many Internet Service Providers to keep much of this information private, calls for a distributed measurement infrastructure and structural inference methods that are reliable and robust. Next, the vast size and global footprint of the Internet suggest that a potentially significant number of measurement hosts will be required in order to gather sufficient data to generate comprehensive maps.

Finally, the well-known dynamic nature of the Internet means that measurements must be taken almost continuously in order to identify changes in a timely fashion. Understanding the Internet's structure through empirical measurements is important in the development of new topology generators, new protocols, traffic engineering, and troubleshooting, among other things. Our topology inference results offer the possibility of a greatly expanded perspective of Internet structure with much lower network traffic impact and management overhead.

The first step in our topology inference study will focus primarily on using passive measurements to resolve topology characteristics. There are significant challenges in using passive packet measurements for discovering Internet structure. First, and most importantly, the individual measurements themselves would seem to convey almost no information about network structure. Second, end host IP addresses are often considered sensitive and are typically subject to privacy constraints (we address this by only using end host IP addresses as unique identifiers of hosts, and to resolve which specific Autonomous System (AS) the end host resides in). And finally, passive measurements give no indication of which routers were traversed between two points in the network, making the problem of topology discovery far more difficult when compared to an active measurement methodology. Despite these challenges, we will describe passive measurement-based algorithms that enable (i) automatic clustering or grouping of traffic sources that share network paths accurately, without relying on IP address or autonomous system information, (ii) topological structure to be inferred accurately with only a small number of active measurements, and (iii) missing information to be recovered, which is a serious challenge in the use of passive packet measurements. We demonstrate our techniques using a series of simulated topologies and empirical data sets. Our experiments show that the clusters established by our method closely correspond to sources that actually share paths. We also show the trade-offs between selectively applied active probes and the accuracy of the inferred topology between sources. Finally, we characterize the degree to which missing information can be recovered from passive measurements, which further enhances the accuracy of the inferred topologies.
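To make the clustering idea concrete, the sketch below fits a Gaussian mixture model to hop-count vectors (one row per end host, one entry per monitor) and groups hosts by their most probable mixture component. It is only an illustration on toy data, not the exact algorithm developed in Chapter 3, and it assumes complete hop-count vectors; the imputation of missing entries is precisely what Chapter 3 addresses.

import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_end_hosts(hop_counts, n_clusters=8, seed=0):
    # hop_counts: (N end hosts) x (M monitors) array of hop-count distances.
    gmm = GaussianMixture(n_components=n_clusters,
                          covariance_type="diag",
                          random_state=seed)
    labels = gmm.fit_predict(hop_counts)
    return labels, gmm

# Toy usage: six end hosts observed at three monitors.
hops = np.array([[3, 7, 12], [3, 8, 12], [4, 7, 11],
                 [10, 2, 6], [11, 2, 5], [10, 3, 6]])
labels, _ = cluster_end_hosts(hops, n_clusters=2)
print(labels)   # end hosts with similar hop-count vectors receive the same label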

The second stage of our topology study focuses on the problem that common mapping campaigns using traceroute reveal only a portion of the underlying topology. We will demonstrate that standard probing methods yield datasets that implicitly contain information about much more than just the directly observed links and routers. Each probe yields information that places constraints on the underlying topology, and by integrating a large number of such constraints it is possible to accurately infer the existence of unseen components of the Internet (i.e., links and routers not directly revealed by the probing). Moreover, we show that this information can be used to adaptively re-focus the probing in order to more quickly discover the topology. These findings suggest radically new and more efficient approaches to Internet mapping, specifically on the discovery of the core of the Internet. We define the Internet core as the set of routers that is roughly bounded by ingress/egress routers from stub autonomous systems. We describe a novel data analysis methodology designed to accurately infer (i) the number of unseen core routers, (ii) the unseen hop-count distances between observed routers, and (iii) unseen links between observed routers. We use a large experimental dataset to validate the proposed methods. The validation shows that our methods can predict the number of unseen routers to within a 10% error level, estimate 60% of the unseen distances between observed routers to within a one-hop (i.e., a single router) error, and robustly detect over 35% of the unseen links between observed routers. Furthermore, we use the information extracted by our inference methodology to drive an adaptive active-probing scheme. The adaptive probing method allows us to generate maps using roughly 50% fewer probes than standard non-adaptive approaches.
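A minimal sketch of the low-rank matrix completion idea behind the unseen hop-count distance estimates is shown below. It uses a generic alternating-least-squares factorization on synthetic inputs and is not the dissertation's implementation (which builds on the factorization method discussed in Chapter 4); the partially observed hop matrix H is approximated as U Vᵀ using only the probed entries, and the product supplies estimates for the unobserved router-to-router distances.

import numpy as np

def complete_hop_matrix(H, mask, rank=5, n_iters=50, reg=0.1):
    # H: n x n hop-count matrix, with arbitrary values wherever mask == 0.
    # mask: 1 where a probe observed the hop count, 0 where it is unknown.
    n = H.shape[0]
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n, rank))
    V = rng.normal(scale=0.1, size=(n, rank))
    for _ in range(n_iters):
        # Alternately solve a regularized least-squares problem for each row of U, then V.
        for i in range(n):
            obs = mask[i, :] == 1
            if obs.any():
                A, b = V[obs], H[i, obs]
                U[i] = np.linalg.solve(A.T @ A + reg * np.eye(rank), A.T @ b)
        for j in range(n):
            obs = mask[:, j] == 1
            if obs.any():
                A, b = U[obs], H[obs, j]
                V[j] = np.linalg.solve(A.T @ A + reg * np.eye(rank), A.T @ b)
    return U @ V.T   # estimated hop counts for every router pair

The key assumption, made explicit here by the rank parameter, is that the full hop-count matrix is approximately low rank, so a modest number of observed entries constrains the unobserved ones.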

The focus of our topology study then shifts to the field of delay-based tomographic probing. Topology recovery via tomographic inference is potentially an attractive complement to standard methods that use TTL-limited probes. Unfortunately, prior tomographic techniques (e.g., [5, 8]) have required an infeasible, exhaustive (i.e., quadratic with respect to the number of end hosts considered) number of probes for accurate, large scale topology recovery. We will describe new techniques that aim toward the practical use of tomographic inference for accurate router-level topology measurement. We will focus on a novel Depth-First Search (DFS) Ordering algorithm that clusters end host probe targets based on shared infrastructure, and enables the logical tree topology of the network to be recovered accurately and efficiently without the need for an exhaustive number of measurement probes. We evaluate the capabilities of our DFS Ordering topology recovery algorithm in simulation and find that our method uses 94% fewer probes than exhaustive methods and 50% fewer than the current state-of-the-art. We also present results from a case study in the live Internet where we show that DFS Ordering can recover the logical router-level topology more accurately and with fewer probes than prior techniques.

Finally, we examine theoretical bounds for resolving hierarchical clustering (i.e., the tree topology of a network) from limited and potentially noise-corrupted similarities. Our main contributions prove that a sampling-at-random (i.e., passive sampling) methodology will always require an exhaustive (i.e., quadratic) number of pairwise similarity measurements to resolve the entire clustering hierarchy. This is then contrasted with a targeted, active sampling regime, where we show how, in the presence of uncorrupted measurements, a methodology can be designed that requires only O(N log N) pairwise similarities to accurately reconstruct a tree topology for N objects. These results are then extended to the regime where our similarities are corrupted with noise, where we present a methodology that will reconstruct the clustering with high probability using only O(N polylog N) targeted measurements.

IP Geolocation

The ability to pinpoint the geographic location (or geolocation) of IP hosts is compelling for applications such as on-line advertising and network attack diagnosis. While prior methods (e.g., [3, 4, 9, 16, 17]) can accurately identify the location of hosts in some regions of the Internet, the accuracy of standard IP geolocation techniques can be impaired by noisy measurements (e.g., distance derived from non-line-of-sight Internet routes) or potentially misleading information such as DNS names, which are generated by and must be interpreted by people. The hypothesis of our geolocation work is that the accuracy of IP geolocation can be improved through the creation of a flexible analytic framework that incorporates different types of geolocation information. We introduce the NBgeo framework, a machine-learning classification-based geolocation algorithm. This methodology uses a set of lightweight measurements from a set of known monitors to a target, and then classifies the location of that target based on the most probable geographic region given probability densities learned from a training set. For this study, we employ a Naive Bayes framework that has low computational complexity and enables additional societal information to be easily added to enhance the classification process.

We use explicitly defined (i.e., a priori supplied) population data from the US Census [18] to improve upon our estimation. Our results show that the new NBgeo framework results in geolocation estimates that have median error 50 miles closer than current measurement-based geolocation methods.

We then introduce a second novel methodology for IP geolocation that we call PinPoint. PinPoint is based on two key innovations. First, we use a geographically diverse set of Internet hosts with ground truth geographic coordinates as landmarks, providing our algorithm with implicitly defined population information. PinPoint begins by identifying the subset of landmarks that are geographically nearest to the target host using a novel clustering methodology based on hop count measurements. Next, PinPoint uses latency measurements from landmark subsets to geolocate the targets. Using only the latencies from landmarks closest to a target in hop distance results in highly accurate predictors compared to latency measurements from arbitrary landmarks, which tend to distort distance due to the vagaries of routing. PinPoint estimates geolocation from latencies using a novel sparse embedding algorithm that preserves latency distances and encourages the targets to cluster geographically, which is desirable since targets tend to concentrate in cities. This second innovation serves as an important regularization in the embedding process that further mitigates the effects of noise and errors. We demonstrate that PinPoint performs significantly better than all existing geolocation tools using measurements conducted from a large set of end hosts with ground truth locations. Our results show that PinPoint is able to geolocate hosts with a median error of 38 miles and an average case of less than 97 miles. In contrast, the best commercial IP geolocation database yields an average error of 493 miles, while the previous state-of-the-art measurement-based geolocation methodology yields a median error of 91 miles and an average error of 170 miles.
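A simplified sketch of the classification step behind a Naive Bayes geolocator is given below. The region list, the per-region measurement likelihood functions, and the population-based prior are illustrative placeholders rather than the trained models of Chapter 7; each candidate region is scored by its prior times the product of the (assumed independent) likelihoods of the observed measurements, and the highest-scoring region is returned.

import math

def naive_bayes_geolocate(measurements, regions):
    # measurements: dict monitor_id -> observed latency (ms) to the target.
    # regions: list of dicts, each with "name", a population-based "prior", and a
    #          "likelihood"(monitor_id, latency) function returning a probability density.
    best_region, best_score = None, -math.inf
    for region in regions:
        # Work in log space: log prior plus the sum of log likelihoods.
        score = math.log(region["prior"])
        for monitor_id, latency in measurements.items():
            score += math.log(max(region["likelihood"](monitor_id, latency), 1e-12))
        if score > best_score:
            best_region, best_score = region["name"], score
    return best_region

The population prior is what pulls otherwise ambiguous targets toward densely populated counties, which is the role the Census data plays in NBgeo.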

Anomaly Detection

The ability to detect unexpected events in large networks can be a significant benefit to daily network operations. A great deal of work has been done over the past decade to develop effective anomaly detection tools (e.g., [19, 20]), but they remain virtually unused in live network operations due to an unacceptably high false alarm rate. We seek to improve the ability to accurately detect unexpected network events through the use of BasisDetect, a flexible but precise modeling framework. Using a small dataset with labeled anomalies, the BasisDetect framework allows us to define large classes of anomalies and detect them in different types of network data, both from single sources and from multiple, potentially diverse sources. Network anomaly signal characteristics are learned via a novel basis pursuit based methodology. We demonstrate the feasibility of our BasisDetect framework and compare it to previous detection methods using a combination of synthetic and real-world data. In comparison with previous anomaly detection methods, our BasisDetect methodology shows a 50% reduction in the number of false alarms in a single node dataset, and over a 65% reduction in false alarms for synthetic network-wide data.
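To illustrate the flavor of a penalized basis pursuit decomposition, the sketch below solves a small Lasso-style problem: the observed traffic signal is modeled as a linear combination of dictionary atoms (here, illustrative DCT atoms) plus a sparse anomaly component, with l1 penalties encouraging the anomaly term to be zero except at a few time points. The dictionary, penalty weights, solver, and data are placeholders and do not reproduce the tuned BasisDetect configuration of Chapter 8.

import numpy as np

def dct_dictionary(n, n_atoms):
    # Illustrative dictionary of low-frequency cosine atoms modeling normal traffic behavior.
    t = np.arange(n)
    atoms = [np.cos(np.pi * k * (t + 0.5) / n) for k in range(n_atoms)]
    D = np.stack(atoms, axis=1)
    return D / np.linalg.norm(D, axis=0)

def penalized_basis_pursuit(y, D, gamma=0.1, rho=0.5, n_iters=500):
    # Decompose y into D @ a (modeled traffic) plus s (sparse anomaly term) by
    # proximal gradient descent on 0.5*||y - D a - s||^2 + gamma*||a||_1 + rho*||s||_1.
    n, k = D.shape
    a, s = np.zeros(k), np.zeros(n)
    step = 1.0 / np.linalg.norm(np.hstack([D, np.eye(n)]), 2) ** 2
    soft = lambda v, thresh: np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)
    for _ in range(n_iters):
        r = y - D @ a - s                      # residual under the current decomposition
        a = soft(a + step * (D.T @ r), step * gamma)
        s = soft(s + step * r, step * rho)
    return a, s

# Toy usage: periodic traffic with a single injected spike at time index 300.
rng = np.random.default_rng(0)
t = np.arange(1024)
y = 5.0 * np.cos(2 * np.pi * t / 1024) + 0.1 * rng.standard_normal(1024)
y[300] += 4.0
a, s = penalized_basis_pursuit(y, dct_dictionary(1024, 8))
print(int(np.argmax(np.abs(s))))               # index of the strongest anomaly candidate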

1.3 Organization

This dissertation is organized as follows. The following chapter describes our work in relation to prior research. Chapter 3 describes how to use passive measurements to discover various characteristics of network topology (end host clustering, shared path estimates, end host-to-end host path lengths). In Chapter 4, given an initial set of Internet probes, we present novel statistical techniques to estimate unseen routers and links in the network. Then in Chapter 5, we describe an intelligent network tomography probing procedure that drastically reduces the total number of active delay-based probes needed to resolve the logical topology of a network. Theoretical bounds for tomographic clustering-based methodologies can be found in Chapter 6. The final two methodology chapters deal with two application-based case studies in resolving network characteristics. In Chapter 7, two methodologies for resolving the geographic location of Internet resources are described. For the geographic estimation of network objects, these methodologies use either an implicit or explicit restriction to areas with large population density. Finally, in Chapter 8, a novel model-based anomaly detection procedure is presented. Then in Chapter 9 the contributions of this thesis are again summarized and some future directions are explored.

Chapter 2

Related Works

In this chapter we set the contributions of the thesis in the context of prior research.

2.1 Network Topology Discovery from Incomplete Passive Measurements

While there have been many previous studies that have focused on developing methods for estimating network topology, a great deal of prior work in this area has focused solely on using active traceroute-like probes (e.g., [11, 13, 14, 21]). In each case, these studies highlight several challenges associated with this kind of approach, including the need for widely distributed nodes from which probes can be sent (i.e., to address the need for a broad perspective) and the difficult problem of interface disambiguation. A number of large topology mapping efforts that attempt to address the problem of limited perspective have been active for years, including the well known Skitter [1] and Dimes [22] efforts. While the problem of interface disambiguation has been known since Paxson's work in the mid-1990s [23], the recent study by Sherwood et al. demonstrates how problematic this issue can be when using standard disambiguation techniques [21]. Another study that is related to ours is by Magoni and Hoerdt [24]. In that paper, the authors describe a traceroute-based approach and encounter the same difficulties with perspective and router interfaces. In contrast, the work in Chapter 3 uses passive measurements enhanced with a small subset of active probe-based measurements to examine router topology structure.

Acquisition of passive hop count measurements from packet traffic is a previously studied problem. While the deployment of specialized hardware on TAP'ed links (e.g., [25, 26]) could be used in our work, publicly available data sets almost always anonymize source IP addresses, making it impossible to relate measurements from multiple sites. An alternative form of passive packet measurements are those collected in network honeypots [27, 28, 29, 30]. Honeypots monitor routed but otherwise unused address space, so all traffic directed to these monitors is unwanted and almost always malicious. Honeypots do not solicit traffic; however, low-interaction sensors will respond to incoming connection requests in order to distinguish spoofed addresses. In this way they are not completely passive. However, monitors of large address segments can receive millions of connections per day from systems all over the world and therefore offer an incredibly unique and valuable perspective [31]. The unsolicited nature of honeynet traffic, coupled with the volume and wide deployment of monitors, makes it an attractive source of data for our work. Passive measurements of routing updates have been previously used to establish intra-domain network maps [32]; our goal, in contrast, is to discover Internet-wide structure with simpler and more lightweight hop count measurements (the number of routers between two points in the network). The focus of our passive measurement network discovery work in Chapter 3 is on identifying Internet structure in terms of clusters of clients [33], shared paths [7, 6, 34, 5], and end-host to end-host distances [35, 36, 37, 38, 39].

Passive End Host Clustering

Clustering end hosts in the Internet in a topologically significant manner is a problem relevant to the creation of overlay networks [33, 40] and the geolocation of resources [3, 4, 9, 41]. The most relevant prior research ([33]) uses a BGP routing table based approach to group IP addresses. In contrast to this prior approach, our methodology will rely not on IP addresses (which can be spoofed [42]), but on passively observed incomplete hop count measurements.

Passive Shared Path Estimation

Using delay-based tomographic methods, prior work (e.g., [5, 6, 7, 34]) has shown how, using a series of active measurements, shared logical routing paths can be estimated. The main focus of this prior work is relative comparisons between multiple paths in order to estimate the logical topology of a network. Our methodology will show how, using a small number of active probes, we can estimate the number of physical routers shared between two paths in the Internet.

Passive Network Embedding

Previous network embedding methods have considered the different problem of latency estimation between nodes. In [35, 36, 37, 38, 39], methods are proposed in which a set of M landmark nodes is embedded in a low-dimensional Euclidean space, and then M × N latency measurements are made between each landmark node and all N other nodes. While past studies have identified difficulties with some of the basic assumptions of embeddings (e.g., [43]), more recent work has shown them to perform quite well in practice [44]. Embeddings have also been proposed as a mechanism for topological inference [38, 39]. These approaches are based on hop-count measurements obtained using an exhaustive number of active probes between landmarks and all other nodes. In contrast, our proposed approach relies primarily on incomplete passive measurements between landmarks and our target end hosts, and additionally a negligible number of active probes, resulting in a significantly lighter-weight approach to the problem. Our emphasis on passively collected data avoids the problems of using a large number of active measurements, which include the difficulty in generating real-time Internet topologies from these measurements and the prevalence of blocking of standard active probes by ISPs. The total number of active probes needed for our method will be shown to grow quadratically in the embedding dimension, making the method almost completely dependent on passive measurements. Our embedding methodology, unlike the prior work in [35, 38, 39], is designed to embed IP sources given very incomplete network measurements. In fact, due to their reliance on complete measurements, the previous works in the area of network embedding are not directly comparable to our methodology.
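For reference, the sketch below shows the classical (Torgerson) MDS computation that underlies the landmark embedding idea discussed above: a complete matrix of pairwise landmark distances is double-centered and eigendecomposed to obtain low-dimensional coordinates. This is only the textbook building block and assumes complete measurements; the algorithms of Chapter 3 are what extend the idea to incomplete passive observations.

import numpy as np

def classical_mds(D, dim=2):
    # D: m x m matrix of pairwise hop-count (or latency) distances between landmarks.
    # Returns an embedding of the m landmarks in `dim` Euclidean dimensions.
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dim]    # keep the largest eigenvalues
    scales = np.sqrt(np.maximum(eigvals[order], 0.0))
    return eigvecs[:, order] * scales          # m x dim coordinate matrix

A target end host observed at only a subset of landmarks can then be positioned relative to these landmark coordinates, which is the "landmark MDS" setting that the incomplete-measurement algorithms in Chapter 3 generalize.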

2.2 Inferring Unseen Structure of the Network Core

In Chapter 4 we examine the characteristics of the Internet core given a limited number of measurements. Our consideration of the Internet core is informed by prior topology mapping studies including [13, 14, 45]. While these papers provide various definitions of the core, we believe that a strict definition is of less importance and ultimately arbitrary. The goal of our study is not to find specific boundaries, but to find as much of the central component of the Internet as possible. To that end, our definition is similar to what is given in [46]: roughly, that the core is bounded by routers that are greater than one IP hop beyond end hosts or border routers of stub autonomous systems.

Our work is also informed by our passive measurements work in Chapter 3. In those studies, we propose methods for establishing Internet maps based on passive observations of hop counts in packets. While the idea of using inference methods to estimate incomplete hop counts is similar, the work in Chapter 4 differs in objective (unseen core inference), data (the use of active probes), and methods (unseen router estimation, matrix completion, unseen link estimation). Our network embedding work in Chapter 3 examines the problem of estimating pairwise hop counts using incomplete measurements to a set of landmarks. In contrast, the work in this chapter will demonstrate a methodology for estimating pairwise hops using only massively incomplete pairwise measurements between the objects.

The first component of our unseen core estimation study focuses on determining the number of previously unseen core routers that would be found given an increase in the probing of the network. This problem is motivated by a standard problem in statistics, the unseen species problem, where given an incomplete observation, we try to estimate how much was missed. Classic results in [47] estimated the number of unseen species of moths in an environment given a limited observation, and the work in [48] estimates the total number of words Shakespeare knew given his collective works. Recently, methodologies in both [49] and [50] have examined the problem of unseen species estimation in the context of networking. Both of these methodologies are directed towards finding the total number of routers/links in a network given limited observations.
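To illustrate the flavor of these classical unseen-species estimators, the sketch below computes the Good-Toulmin estimate of how many new routers a fractional increase in probing would reveal, given only the frequency with which each observed router appeared in the existing probes. It is shown purely as an illustration of the statistical idea behind [47, 48]; it is not the estimator developed in Chapter 4, and the data are hypothetical.

from collections import Counter

def unseen_routers_estimate(observation_counts, t):
    # observation_counts: mapping router_id -> number of probes that observed the router.
    # t: fractional increase in probing effort (e.g., t = 0.5 for 50% additional probes).
    # Returns the Good-Toulmin estimate of the number of previously unseen routers the
    # additional probing would reveal (the alternating series is reliable for t <= 1).
    freq_of_freq = Counter(observation_counts.values())   # f_k: routers seen exactly k times
    return sum(((-1) ** (k + 1)) * (t ** k) * f_k for k, f_k in freq_of_freq.items())

# Toy usage: 5 routers seen once, 2 seen twice, 1 seen three times; 50% more probing.
counts = {"r1": 1, "r2": 1, "r3": 1, "r4": 1, "r5": 1, "r6": 2, "r7": 2, "r8": 3}
print(unseen_routers_estimate(counts, t=0.5))   # about 2.1 additional routers expected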

While estimating the total number of unseen routers is an interesting problem, validating the results is an impossible task without the entire network available (infeasible when considering the Internet). Our work focuses on the problem of estimating how many additional routers would be found given a fractional increase in the probing infrastructure. This would be of interest to anyone trying to determine whether or not to continue probing a network to discover additional nodes. To the best of our knowledge, our work is the first attempt to estimate the increased coverage of the network found given a feasible number of additional measurements.

The next component of our unseen core estimation study shows how the unobserved link lengths between core routers can be estimated. The work uses a set of Internet probes to construct a hop count matrix with each element containing the number of routers between two points in the network. Due to the limited number of probes sent throughout the network, this matrix is very sparsely populated. Recent work in [51] has shown that matrices of size N × N and of rank r can be exactly reconstructed from only k known elements, where k = O(N^1.2 r log N). Due to the very large size of the matrix, our work uses an efficient matrix factorization method from [52]. We use these prior results in an attempt to infer the unobserved path lengths between arbitrary core routers. By expanding upon these techniques, we develop a novel methodology for estimating unseen link locations in the network, an issue previously unexplored in the Internet literature.

The final component of the unseen core study is a targeted probing methodology that directs the user towards areas of the network that contain the most uncertainty given the current set of measurements. Our targeted probing methodology has a similar goal to prior work on the DoubleTree algorithm [53], an intelligent probing mechanism devised for the purpose of sampling end hosts in the Internet. The DoubleTree algorithm uses specific tree topology characteristics and specially crafted probes to limit the number of probes needed to discover the topology. In contrast to this prior work, our methodology focuses on the reduction of the number of source-destination pairs used to probe the network using standard off-the-shelf techniques (e.g., traceroute), not on the crafting of special probes to minimize measurement load. We offer our targeted probing techniques as validation that the unseen core techniques of inferring unseen core routers and unseen core links are correctly revealing areas of particular uncertainty in the network, where increased probing would result in greater understanding of the current topology of the Internet.

2.3 Toward the Practical Use of Network Tomography The work in Chapter 5 introduces a new tomographic methodology for resolving the tree-based logical routing topology in a network. The work most directly related to the research in this chapter is the set of hierarchical clustering methodologies explored in [5, 6, 7, 34]. The main limitation of these methodologies is the requirement of acquiring the entire covariance matrix (i.e., O(N^2) measurements given N end hosts in the topology). The hierarchical clustering methodology will be considered the worst-case probing bound, as it performs exhaustive probing on the set of end hosts in the network. This is due to the decoupling of topology measurement and topology inference, where no information from prior measurements is used to inform new measurements, and topology inference is performed completely separately from the measurement process. A more efficient probing methodology is the Sequential Topology Inference algorithm from [54]. This work sequentially builds the logical tree structure and leverages the current estimated logical tree structure to determine where the next probe pair measurements should be performed. This work couples topology inference and measurement into one process by exploiting the tree structure of the topology. For a balanced l-ary tree (a balanced tree where each non-leaf node has exactly l children), this reduces the number of probes needed from O(N^2) for hierarchical clustering to O(N l log_l(N)) for the Sequential Topology Inference algorithm. Our ordering-based method will show how improvements to this performance can be obtained by exploiting not just the structure of the tree topology, but also the structure of the topology measurements. We will show how our methodology can further reduce the number of probes by roughly a factor of two compared to this current state-of-the-art. This improvement results from considering the ordering of the end hosts, previously referred to as a topological sort [55]. The idea of a topological sort has been explored previously in the sensor network literature in [56], where a topological sort of the nodes in a sensor network provides efficient routes through the network with lower power consumption.

Due to the focus on wire-line networks in this work, we are not able to choose the routing. Instead, we will use a modified version of topological sorting to efficiently reconstruct the logical routing topology from Internet measurements. 2.4 IP Geolocation using Population Data Considerable prior work has been done on the subject of IP geolocation [3, 4, 9, 41]. While we are informed by this work and our motivation for highly accurate estimates is the same, the geolocation methodologies described in Chapter 7 take several steps to improve estimation accuracy over prior algorithms. Unlike the methodologies of [3, 4], no traceroute probes are necessary in either geolocation methodology. This avoids the problems of interface disambiguation [11, 21] and the dependency on unreliable undns naming conventions that are sometimes used for geolocation [12]. 2.4.1 NBgeo Related Work Recent work in the machine learning literature has shown how complicated classification problems with many degrees of freedom can be broken down into several lower-dimensional problems using a technique called Naive Bayes [57]. Empirical work in [58] and [59] has shown considerable classification improvement using Naive Bayes, even against more complicated classification techniques. We will exploit Naive Bayes and standard techniques from the nonparametric statistics literature [60] to develop our novel NBgeo geolocation algorithm. The empirical results in [61], showing the superlinear relationship between the population count of a geographic area and the number of routers located in that area, will help inform our Naive Bayes methodology. This property will be exploited using population density and geographic data (size, adjacency) of each county acquired from the publicly available databases on the U.S. Census website [18]. 2.4.2 PinPoint Related Work While our Naive Bayes methodology is a first step in exploiting population information and a known partitioning of the geography, this work is expanded upon in the PinPoint algorithm to account for situations where either no population data is known or a natural partitioning of the geography is unavailable.

Our PinPoint algorithm also avoids the need for latency measurements from a shared infrastructure, common to the methodologies of [9, 41]. Latency measurements from shared measurement infrastructure, such as Planetlab [40], have been found to be highly variable [62], which can cause bias in geolocation results. PinPoint instead relies primarily on hop count values from a set of monitors, avoiding this shared measurement infrastructure latency variability problem. Hop counts can be easily established by examining TTL values in IP packets using the method described in [63]. The PinPoint methodology is also informed by prior work on low-dimensional embedding of observed pairwise distances. Commonly referred to as Network Coordinate Systems, low-dimensional embedding problems in the networking literature have been well studied over the past several years [37, 64, 65, 66, 67]. The goal of these studies is to establish a method for accurately estimating latencies between arbitrary hosts in the Internet. A common problem in previous network coordinate algorithms is triangle inequality violations caused by inaccuracies in the long latencies under consideration [68]. These violations are due to the underlying network manifold structure not returning direct line-of-sight measurements for every observation. To limit or avoid this problem altogether, PinPoint considers only short distance (low latency and small hop count) measurements in the embedding algorithm. We hypothesize that these measurements are more likely to be highly correlated with line-of-sight distances in the network and therefore avoid triangle inequality violations. Finally, the Network Time Protocol infrastructure plays an intrinsic role in our PinPoint algorithm study. NTP itself was developed by Mills to enable hosts to tightly synchronize their clocks [69]. The protocol specifies that synchronization will be facilitated by a widely deployed hierarchical infrastructure of time servers. At the top of the hierarchy are stratum 0/1 servers, which use either GPS or atomic clocks as their source. A direct consequence of GPS in NTP is a large set of servers distributed throughout the world with precisely known locations and with the capability to respond to measurement requests. To the best of our knowledge, this is the first work using NTP for IP geolocation. However, PinPoint does not rely exclusively on NTP, and in future empirical studies we expect to add other nodes to our landmark database, including DNS servers that report their locations.

2.5 Model-based Anomaly Detection The focus of Chapter 8 is on anomaly detection in network time-series data. Initial anomaly detection work considered only single time-series data in isolation (e.g., a single link in a network). This work uses some transformation of the network data to distinguish between the standard operating environment and residual anomaly energy. These methodologies included analysis using wavelets [70, 71], Exponentially Weighted Moving Average filters [72, 73], and Fourier filtering [71]. Initial network-wide anomaly detection work focused on the application of Principal Component Analysis (PCA) [19, 74, 75] to a collection of network time-series data. The methodology, originally described in [74], decomposes a traffic matrix into a set of vector components that capture the variance across all links or flows of the network. The components that resolve the highest variance across all links (i.e., the most standard components) are considered to represent the standard operating characteristics of the network observed in the link data matrix, the modeled traffic. Meanwhile, the less dominant components represent traffic that is abnormal to the links in general, the residual traffic. The amount of traffic energy in this residual component determines whether or not an anomaly has occurred in the observed traffic on each link. The limitations of this PCA approach are well documented in [76]. In addition to the method's high sensitivity to tuning parameters, large anomalies in the network can corrupt the modeled traffic components and therefore cause obvious events to be ignored by the methodology. In addition, anomalies detected by PCA cannot be localized to the specific anomalous link or router. Finally, the approach can lead to masking, where one anomaly hides another. The authors of the Distributed Spatial Anomaly Detection technique described in [20] recognize that one of the main limitations of the PCA approach is the necessity of communicating all flow information back to some centralized computation point. Using non-parametric statistics and False Discovery Rate (FDR) techniques [77], each router in the network generates just a small test statistic that is communicated for anomaly detection. The use of more sophisticated multiple hypothesis detection techniques, like FDR, allows for a better statistical detection rate than simple thresholding.

One of the biggest limitations of this approach is the complete decoupling of the measurements in the time domain. Therefore, any temporal correlation between network anomaly events (the measurements at time t helping inform the events from measurements at time t + 1) is ignored. In addition, the measurements considered are with respect to traffic volume only, with no discussion of how other network information (byte counts, unique IP addresses, entropy measurements) could be intelligently fused into the framework. Finally, the detected anomalies are not necessarily points of interest to a network administrator or anything that might represent the known structure of anomalies in networks; they are simply traffic-volume events unlike anything found in the training set. This could be significantly biased by limited training data, introducing the possibility of a large number of reported false alarms. This situation may occur when events are unlike the training set observations and yet uninteresting from a network administration perspective. Other distributed approaches to anomaly detection exist [78, 79], but our focus is not on distributed computation; rather, we aim to carefully treat the false alarm problem. Our anomaly detection methodology will exploit the same non-parametric statistical techniques as [20] (originally developed in [80]). However, our methodology differs in that we use an estimated feature vector of detected anomaly energy instead of the raw packet counts. Data fusion from different data sources was shown some time ago to reduce false alarm rates (e.g., [73]). In contrast, our anomaly detection methodology develops an approach that can flexibly incorporate various different sources of data. By considering a general feature vector, we can fuse a wide range of link characteristics (packet count, byte count, IP entropy), thereby improving results. For the detection of anomalies in time-series data, our methodology will leverage the significant prior work on basis decomposition of signals [81, 82, 83, 84]. This prior work focused on creating methodologies to exactly represent a signal using a sparse linear combination of signal atoms. The work here deviates in that we are trying to resolve the gross characteristics of the signal, allowing for a non-exact signal representation by our basis atoms. In addition, our novel methodology will allow for the penalization of choosing selected atoms, an application previously unexplored in the basis pursuit literature. The motivation for sparsity has been previously shown in both the

anomaly detection work of [71] and work on the underlying causes of faults in networks [85, 86].

Chapter 3

Network Topology Discovery from Incomplete Passive Measurements

In this chapter, we describe methodologies for discovering characteristics of Internet topology leveraging passive measurements. We present four main contributions:

- The ability to cluster end hosts in a topologically significant manner using incomplete hop count measurements.
- A novel algorithm to impute missing passive measurements by leveraging assumptions on network topology.
- A lightweight methodology for inferring the shared path between pairs of end hosts to a target end host.
- A novel multidimensional scaling algorithm for estimating end host-to-end host hop distance using passive measurements and available Autonomous System information.

Roughly speaking, the methodologies described in this chapter enable topology discovery from a number of active measurements proportional to the number of discovered topology clusters (i.e., we need only make O(1) traceroute measurements from each cluster to each passive monitor site). Since the number of clusters is expected to be drastically smaller than the number of sources, the burden of active measurements is almost inconsequential. This is a measurement regime previously unexplored in the networking literature.

3.1 Passive Measurement Datasets

For the purposes of validating our passive measurement methodologies, we use three different sets of topology measurements. The first is a set of synthetic topologies generated using the Heuristically Optimized Topologies (HOT) constraints originally introduced in [87]. A synthetically generated Heuristically Optimized Topology uses known constraints on both technological and economic factors to create a graph representative of those found in the Internet. Then, using the Orbis [88] toolset, we scale our HOT topologies to show the effects of our topology discovery methodologies on topologies of various sizes with full ground truth. The second data set used is a router-level connectivity map of the Internet based on data collected by Skitter [1]. Measurements in Skitter are based on traceroute-like active probes sent from a set of 24 monitors to a set of nearly 1M target hosts distributed throughout the Internet. We use the openly available router-level map created from data collected between April 21 and May 8. This map consists of 192,224 unique nodes and 609,066 undirected links. It is important to note that the goal of the Skitter target host list is to have one responding node in each /24 prefix. Thus, the characteristics of the Skitter graph with respect to destination subnets are different from those of the synthetically generated topologies, which reflect collections of nodes in subnets. The third data set used in our study was collected over a 24 hour period starting at 00:00 on December 22, 2006 from 15 topologically diverse honeypot sensors. These sensors are located in 11 distinct /8 prefixes that are managed by 10 different organizations. The segments of IP address space monitored by the honeypots varied from /25 to /21, plus one /16. Over 37,000,000 total packets were collected and evaluated in our study. The packets do not contain spoofed source IP addresses since they were the responses to SYN/ACKs from the honeynet [28]. Details of the data set can be found in Table 3.1. In order to preserve the integrity of the honeypots, we cannot disclose their locations in IPv4 address space.

Table 3.1: Details of honeypot data sets used in our study. All data was collected over a one day period on December 22, 2006. (Columns: Node, Total Pkts., Uniq. IPs, Mean Hops, Hop Std. Dev.)

Of particular interest and importance in our evaluation are the occurrences of the same source IP address in multiple honeypots. We found that 93.5% of the unique IP addresses in our data set appear in only one of the honeypots. This is most likely due to the diverse locations of the sensors, coupled with the fact that different instances of malware limit their scans to smaller segments of address space. Nevertheless, this left us with over 22,000 unique IP addresses from which we conducted our analysis. Details of the instances of multiple occurrences of unique IP addresses are listed in Table 3.2 (note that virtually no addresses were seen in more than 10 monitors).

Table 3.2: Counts of occurrences of common source IP addresses in multiple honeypots. (Columns: Num. Honeypots, Num. Sources.)

Our analysis assumes that the only data used to infer network structure is the source IP address (used only to uniquely identify a host and as an active probe target) and the TTL extracted from the header of each packet. In the case of the synthetic and Skitter data sets, we synthesize these values by taking the shortest path lengths between the set of monitors and the end hosts. The monitors in these topologies are a random selection of a small subset of leaf nodes in the graph topology. In the case of the honeynet data, we use the clever technique described in [15] to infer the number of hops between the honeypot monitor and the host.

This inference is based on the facts that (i) there are only a few initial TTL values used in popular operating systems (e.g., 64 for most UNIX variants, 128 for most Microsoft variants, and 255 for several others), and (ii) typical hop counts for end-to-end paths are far smaller than the differences between these standard initial TTL values. Thus, the hop count is inferred by rounding the observed TTL up to the next highest initial TTL value and then subtracting the observed TTL from that initial value.
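To make this recovery step concrete, the following short Python sketch (the function and variable names are our own, not part of any measurement tool) applies the rounding rule described above, assuming the common initial TTL values just listed.

import struct  # placeholder import; in practice the TTL comes from a captured IP header

# Sketch of passive hop-count inference from an observed IP TTL.
# The set of assumed initial TTL values follows the common-OS convention cited above.
COMMON_INITIAL_TTLS = (64, 128, 255)

def infer_hop_count(observed_ttl):
    """Estimate the number of layer-3 hops a packet traversed.

    The observed TTL is rounded up to the next common initial TTL, and the
    difference between that initial value and the observed TTL is the
    inferred hop count.
    """
    for initial_ttl in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial_ttl:
            return initial_ttl - observed_ttl
    raise ValueError("TTL larger than any assumed initial TTL")

# Example: a packet arriving with TTL 115 most likely started at 128,
# implying roughly 13 router hops between the source and the monitor.
print(infer_hop_count(115))  # -> 13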

3.2 Passive Clustering of End Hosts

We assume that the ground truth router-level topology of the Internet resembles the network in Figure 3.1-(left) [14]. In this diagram, packets sent from end host S_i depart from the edge of the network and eventually enter the densely-connected core component through a border router. The packets traverse the core, exit through another border router, and are eventually intercepted by a passive monitor M_k. This configuration enables edge and core mapping, and assumes that monitors, such as honeynets or passive collection points, are collocated near frequently accessed network sites. Given a passive observation of packets between S_i and M_k and the common initial TTL method from [15], we can infer the number of routers between the two points, h_{i,k}. Assuming this structure of the network, one can partition the total layer-3 hop count distance into the distance (or number of router hops) from the end host S_i to the first core border router b, and the distance from b to the monitor M_k. We define x_i = the number of layer-3 hops along the path from source S_i to its first core border router b, and w_{i,k} = the number of layer-3 hops between the first core border router b of source S_i and monitor M_k. This allows us to partition the hop count distance into the two separate path segments,

h_{i,k} = x_i + w_{i,k} = the number of layer-3 hops between end host S_i and monitor M_k.

Figure 3.1: (Left) Example network topology with sources S_i sending packets through a core component to monitors M_k. (Right) Example network where S_1 and S_2 share a border router.

Now consider the situation where two end hosts (S_i, S_j) are connected at the same border router (see Figure 3.1-(right)). Given that these two end hosts share a path through the core to each monitor, we can state:

Theorem 3.2.1. Given two end hosts (S_i, S_j) sharing a common core ingress border router, h_{i,k} - h_{j,k} = C for all monitors M_k with paths through the core (for some integer constant C).

Proof. Given hop count distances h_{i,k} = x_i + w_{i,k} and h_{j,k} = x_j + w_{j,k}, for any monitor M_k with both S_i and S_j having paths through the core to the monitor, there is a common path for both end hosts from the border router to the monitor, so that w_{i,k} = w_{j,k}. Therefore, h_{i,k} - h_{j,k} = (x_i - x_j) + (w_{i,k} - w_{j,k}) = x_i - x_j = C for all k.

It is this constant offset property that will be the basis for our subsequent work on inferring network structure from passive measurements.

3.2.1 Hop Count Distance Vectors and Network Topology

Our initial hypothesis is that hop count distance vectors that are similar (close in a Euclidean sense) do not necessarily correspond to end hosts that are close in the actual network topology. Our intuition is that clustering the raw hop count distance data ignores the network-centric border node information embedded in the distance vectors. So, while two end hosts that share a common border node may have hop count distance vectors that are far apart in a Euclidean sense (due to a constant offset between all of the hop elements), in a network topology sense they are very close. To exploit this network-centric information, we perform preprocessing on the hop count distance vectors such that if h_i and h_j share a common border router, then after the transformation the two vectors are equivalent. The preprocessing we consider converts the hop count distance vectors (h_i = [h_{i,1}, h_{i,2}, ..., h_{i,M}]) to hop count contrast vectors (h'_i), where the mean value of each vector is subtracted from each element of the hop count distance vector:

h'_i = h_i - mu_i 1,   where mu_i = (1/M) sum_{k=1}^{M} h_{i,k} and 1 = [1, 1, ..., 1].

Using Theorem 3.2.1, we can state with certainty that if h_{i,k} - h_{j,k} = C for all k, then h'_{i,k} = h'_{j,k} for all k. The hop-count contrast vectors of two sources from the same area of the Internet should therefore be nearly identical. Slight variations will, of course, persist due to finer-scale routing variations.
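As a minimal illustration of this preprocessing, the Python/NumPy sketch below (using made-up hop counts, not values from our data sets) shows that a constant per-host offset disappears after subtracting each row's mean, exactly as Theorem 3.2.1 predicts.

import numpy as np

# Rows are end hosts, columns are monitors; entries are hop counts h_{i,k}.
# These values are illustrative only.
H = np.array([
    [10, 14,  9, 12],   # source S_1
    [13, 17, 12, 15],   # source S_2: same border router as S_1, offset +3
    [ 7, 20, 11, 16],   # source S_3: a different part of the network
], dtype=float)

def contrast_vectors(hop_counts):
    """Subtract each row's mean, turning distance vectors into contrast vectors."""
    row_means = hop_counts.mean(axis=1, keepdims=True)
    return hop_counts - row_means

Hc = contrast_vectors(H)
# S_1 and S_2 differ by a constant offset, so their contrast vectors coincide.
print(np.allclose(Hc[0], Hc[1]))  # -> True
print(np.allclose(Hc[0], Hc[2]))  # -> False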

The first goal of our work is to develop a method for generating clusters of end hosts that are topologically close to each other from a layer-3 hop count perspective. In this section, we describe our clustering methodologies and demonstrate their capability using synthetically generated network maps and real-world Internet measurements.

3.2.2 Client Clustering

Consider the generation of end host clusters using hop count distance vectors and the simple K-Means algorithm [57]. Here, the K-Means algorithm considers K distance vectors as centroid points in the hop count vector space, and each of the remaining end hosts is mapped to the closest centroid point (in terms of Euclidean distance) given its hop count distance vector. Experiments with synthetic topologies showed that clusters of various sizes could be generated (K-Means requires that the number of clusters be specified a priori), with a clear trade-off between the number of clusters and the number of sources included in each cluster. A larger number of small clusters with minimal differences between contrast vectors might be considered a good choice with this approach. One methodology that creates a large number of small clusters is to form clusters containing all equivalent contrast vectors. This method is in contrast to using the unprocessed hop count distance vectors to cluster end hosts together (here we use a nearest neighbor norm threshold that results in the same total number of clusters in the topology as clustering using unique contrast vectors). The results of these two clustering methodologies can be seen in Figure 3.2 for the real-world honeypot passive data set. As shown in the figure, the contrast vector clustering results in a much higher average number of matches in the same IP subspace (i.e., same /8, /16, /24) when compared to a clustering methodology that uses the unprocessed distance vectors.

Figure 3.2: Comparison of clustering results for Unique Contrast Clustering and Hop Distance Nearest Neighbor in terms of average number of cluster elements in matching IP subspaces.

Unfortunately, these small unique contrast clusters miss the case where sources located in the same area of the network (which we will refer to as a subnet, although this is not related to IP address structure) have slightly differing hop count contrast vectors. This situation occurs when one or more monitor nodes are located in the same subnet as the cluster, or when the subnet has multiple egress points.

This sort of subnet topology produces variability in the contrast vectors. The result is contrast vectors that may not be equivalent to each other, but whose corresponding end hosts are very close in the topology. This observation suggests that rather than clustering sources according to unique contrast vectors, clusters that allow for a bit of variation about a nominal value may better capture subnets of sources.

Figure 3.3: Example of a subnet having multiple egress points.

Consider the subnet topology in Figure 3.3, where there are two egress points to the set of monitors for each end host located in the subnet. The first egress router carries paths from subnet sources to k monitors, M_1 to M_k, and the second egress router carries paths from subnet sources to j monitors, M_{k+1} to M_M (where M = j + k). Every source has a (potentially unique) path of length x to the first egress router, and a (potentially unique) path of length y to the second egress router. The paths from the egress routers to the monitors are common to all sources in the subnet. Using this setup, we can state Theorem 3.2.2.

Theorem 3.2.2. Given a subnet with two egress points (as in Figure 3.3), all sources contained in the subnet will have collinear hop count contrast vectors.

Proof. We first define the nominal distance vector h_bar as the distances from the egress routers to the monitors, where h_bar_1 is the k-length vector containing the distances from the first egress router to the first k monitors, and h_bar_2 is the j-length vector containing the distances from the second egress router to the last j monitors.

h_bar = [h_bar_1, h_bar_2]

Therefore, for each source S_i in the subnet, the hop count distance vector is the sum of the intra-subnet path lengths x_i, y_i and the nominal distance vector h_bar:

h_i = [x_i 1_k, y_i 1_j] + [h_bar_1, h_bar_2]

(where 1_k is the k-length all-ones vector). We define the nominal contrast vector (with the nominal mean value mu_bar = (1/M) sum_{k=1}^{M} h_bar_k) as:

h_bar' = h_bar - mu_bar 1 = [h_bar_1, h_bar_2] - mu_bar 1

Each end host located in the subnet has mean hop count mu_{h_i} = mu_bar + (k/M) x_i + (j/M) y_i, and therefore contrast vector h'_i = h_i - mu_{h_i} 1:

h'_i = ([h_bar_1, h_bar_2] + [x_i 1_k, y_i 1_j]) - (mu_bar + (k/M) x_i + (j/M) y_i) 1
     = ([h_bar_1, h_bar_2] - mu_bar 1) + [x_i 1_k, y_i 1_j] - ((k/M) x_i + (j/M) y_i) 1
     = h_bar' + [(j/M)(x_i - y_i) 1_k, (k/M)(y_i - x_i) 1_j]

Setting r_i = x_i - y_i, the difference between the end host contrast vector and the nominal contrast vector is:

h'_i - h_bar' = [ (j r_i / M) 1_k, -(k r_i / M) 1_j ]

which is a scalar multiple (r_i) of a vector that is fixed for the subnet. Therefore, all end hosts sharing the same egress routers will have collinear contrast vectors.
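A quick numerical check of Theorem 3.2.2 is sketched below (Python/NumPy, with hypothetical path lengths): hop-count vectors for a two-egress subnet are generated, and the contrast-vector differences are verified to span a single direction.

import numpy as np

# Hypothetical two-egress subnet: the first 3 monitors are reached via egress A,
# the last 2 via egress B. The nominal egress-to-monitor distances are made up.
nominal = np.array([6., 9., 7., 11., 8.])   # [h_bar_1 (k=3 entries) | h_bar_2 (j=2 entries)]
k, j = 3, 2

def host_vector(x, y):
    """Hop counts for a host with intra-subnet distances x (to egress A) and y (to B)."""
    return nominal + np.concatenate([x * np.ones(k), y * np.ones(j)])

def contrast(v):
    return v - v.mean()

nominal_contrast = contrast(nominal)
# Differences from the nominal contrast vector for a few hosts:
diffs = [contrast(host_vector(x, y)) - nominal_contrast
         for x, y in [(1, 2), (3, 1), (2, 5)]]

# All difference vectors should be parallel (a rank-1 stack), as the theorem predicts.
print(np.linalg.matrix_rank(np.vstack(diffs), tol=1e-9))  # -> 1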

56 32 From Theorem 3.2.2, we see that sources in subnets with multiple egress points may have slight variations in the hop count contrast vectors. The precise nature of these variations depends on several uncertain factors, including the number of egress points and the nature of the paths to the egress points. Thus, we will account for this uncertainty with a probabilistic model for the variability in hop count contrasts of sources within a subnet Gaussian Mixture Model for Subnet Clusters While the exact nature of the hop count contrast vector distribution for sources in a given subnet is unknown, a multivariate Gaussian model is perhaps the simplest way to capture the variability of the data. The covariance matrix can account for structure in the distribution, such as the collinearity discussed in Theorem 3.2.2, as well as other correlations arising from the idiosyncracies of routing internal to the subnet. Since the hop count data includes sources from many different subnets, the overall distribution of hop count contrast vectors can be modeled with a mixture of Gaussian models, in which each Gaussian component represents the distribution within one subnet. An example of these Gaussian clusters can be seen in Figure 3.4, where a two dimensional histogram of hop count contrast vectors are shown with possible Gaussian Mixture clusters shown by the drawn ellipses. Figure 3.4: 2-D histogram of hop count contrast vectors with clusters highlighted in ellipses. Gaussian mixture models can be fitted to data using the well known Expectation-Maximization (EM) algorithm, and in particular the version proposed in [89] automatically determines the proper number of clusters using an information-theoretic criterion. Once the Gaussian mixture model is

57 33 determined, each hop count contrast vector will be associated most significantly with a given Gaussian component. This provides a clustering of the sources, where the number of clusters is equal to the number of Gaussian components inferred by the EM algorithm [89]. Moreover, we will see later in Section 3.3 that the Gaussian mixture model and EM algorithm provide a powerful tool for imputing missing hop count data. Subnet Cluster Analysis To assess the topological relevance of the clusters determined by the Gaussian mixture model, we consider the problem of shared infrastructure estimation (to be discussed in detail in Section 3.4). The topology relating the sources in a given cluster to the monitors can be estimated by selecting one source from the cluster and performing traceroute measurements from this source to each monitor. If the all the sources in the cluster share the same paths, then this estimate is perfect. We do not, however, expect this to be the case, even for sources located in the same subnet, for the reasons stated above. Nonetheless, these routes should provide good predictions for the routes if the clusters are topologically meaningful. The accuracy of the predictions is measured by calculating the error in predicted shared hops in the paths between pairs of sources and a single monitor. The error rates in the predictions of shared path lengths are shown in Figures 3.5 and 3.6, comparing the performance of the predictions based on the Gaussian mixture clusters with that of predictions based on randomly clustered sets of sources. The clusters determined by the Gaussian mixture model result in significantly better predictions, indicative of the fact that they are indeed grouping sources that have share similar paths to the monitors. 3.3 The Missing Data Problem - Imputing Missing Hop Counts Using our setup for passive measurement acquisition, it is unlikely that packets from a large numbers of end hosts will be seen at a set of widely distributed monitors. Thus for a given set of monitors (such as the honeypots described in Section 3.1), there will be some number of hop count distance observations missing from the observed set. For each end host i, we have a (potentially incomplete)

hop count vector h_{i, I^{(i)}_known}, where I^{(i)}_known is a subset of the indices from the complete set of monitors with observed hop count values for end host i. We assume that this data is Missing-at-Random (i.e., the missing data locations are chosen at random). With some assumptions on the topology and statistical techniques, we can develop methods for imputing this missing data.

Figure 3.5: Comparison of Gaussian mixture clusters to random clusters. Simulated topology, N = 1000, M = 8.

3.3.1 Imputation Methods

Network-centric Imputation

Given missing data, one can easily conceive of simple imputation methods, such as imputing based on the mean value of the element, or using the nearest neighbors based on the observed elements of the distance vector. One problem with such a simplistic approach is that it does not take advantage of the structural characteristics of the network. Given our border router assumption, we can exploit the observed hop count distances from sources that are an integer offset from (e.g., sharing a common border router with) a source missing a hop count distance value. This method can be considered analogous to using nearest neighbor imputation on the hop count contrast vectors. For N sources and M monitors in the network, this imputation method has computational complexity O(N^2 M). The problems with this method can be seen in Figure 3.7, where the location of the passive

measurement observations dictates the performance of the network-centric imputation method. If multiple observations are made per border node cluster (as in Figure 3.7-(left)), then the imputation method may perform well. The case where only a single observation is typically available per border node (as in Figure 3.7-(right)) will result in poor imputation performance, as there are not enough end hosts with similar contrast vector values to extract accurate hop count information.

Figure 3.6: Comparison of Gaussian mixture clusters to random clusters. Skitter topology, N = 700, M = 8.

Figure 3.7: Striped dots indicate passive measurement data observed, black dots indicate no information observed. (Left) Observations where network-centric imputation may perform well. (Right) Observations where network-centric imputation will fail.

Gaussian Mixture EM Imputation

In Section 3.2.3, we reasoned that a mixture of Gaussians model encapsulates the variability found in the hop count contrast vectors.

In [90], a Gaussian mixture EM algorithm was proposed to both learn the parameters (means, variances, prior probabilities, responsibilities) of a group of Gaussian distributions given a set of incomplete data, and then use the estimated Gaussian mixture to impute the missing data values. The only necessary parameter input to this algorithm is the number of Gaussian mixture components to use. In [89], an information-theoretic technique was proposed to determine, given a set of complete data, how many Gaussian mixture components to use to model the data. This method is the hybrid two-step Expectation-Maximization (EM) iterative approach in Algorithm 1, where the first step estimates the number of Gaussians from the imputed data using the method from [89], and the second step estimates the new imputed data values using the method from [90]. For N sources and M monitors in the network, this method has computational complexity O(KNM^4) for K Gaussian modes.

Algorithm 1 - Gaussian Mixture EM Imputation Algorithm
Given: Incomplete hop count matrix H.
Main Body:
1. Replace any missing element in hop count matrix H with the mean of the observed elements of the hop count matrix.
2. Find the mean vector mu, where each element mu_i is the mean of the hop count row vector h_i.
3. Convert hop count matrix H to hop contrast matrix H' by subtracting the mean of each row vector, h'_i = h_i - mu_i.
4. Using the methodology from [89], estimate K, the number of Gaussian clusters that models the data in matrix H'.
5. Perform the Gaussian mixture EM algorithm from [90] to find the estimated hop contrast matrix H-hat'.
6. Convert the estimated hop contrast matrix to an estimated hop count matrix H-hat = H-hat' + mu (such that h-hat_i = h-hat'_i + mu_i).
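The sketch below outlines one possible Python implementation of this imputation loop. It is a simplified variant of Algorithm 1: the number of mixture components is fixed rather than selected with the information-theoretic criterion of [89], and missing contrast entries are re-imputed with the assigned component means rather than full conditional expectations.

import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_impute(H, n_components=5, n_iters=10, rng=0):
    """Simplified sketch of Gaussian-mixture-based hop-count imputation.

    H is an (N sources x M monitors) array with np.nan marking unobserved
    hop counts. This is an approximation of Algorithm 1, not the exact
    procedure used in our experiments.
    """
    H = np.array(H, dtype=float)
    missing = np.isnan(H)

    # Step 1: initialize missing entries with the global mean of observed values.
    H_filled = np.where(missing, np.nanmean(H), H)

    for _ in range(n_iters):
        # Steps 2-3: convert to contrast vectors (subtract each row's mean).
        row_means = H_filled.mean(axis=1, keepdims=True)
        C = H_filled - row_means

        # Step 5 (simplified): fit a Gaussian mixture to the contrast vectors
        # and replace missing contrast entries with the assigned component mean.
        gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                              random_state=rng).fit(C)
        labels = gmm.predict(C)
        C_hat = np.where(missing, gmm.means_[labels], C)

        # Step 6: convert back to hop counts, keeping observed entries fixed.
        H_filled = np.where(missing, C_hat + row_means, H)

    return np.rint(H_filled)  # hop counts are integers

A fuller implementation would also select the number of components by minimizing an information-theoretic criterion (e.g., BIC over candidate values of K), in the spirit of step 4.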

Imputation Performance Analysis

Using the honeynet dataset described in Section 3.1, we simulate the performance of the imputation methodologies on predicting missing real-world passive measurement data. In Figure 3.8, we see the performance of the two developed imputation methodologies (network-centric and Gaussian mixture EM) compared against a naive mean imputation. As expected, both imputation methodologies perform significantly better than the mean method, with the Gaussian mixture EM algorithm performing the best for high levels of incompleteness. Because we assume that any real-world passive measurements will be highly incomplete, we focus solely on imputing missing data with the Gaussian mixture EM algorithm for the remainder of this chapter.

Figure 3.8: Imputation accuracy over a range of randomly selected missing values using data from the real-world honeynet dataset.

Using a synthetic Orbis-derived passive measurement dataset (as described in Section 3.1), we can synthetically generate missing data examples by considering the sources that are seen in M monitors and knocking out (eliminating) a random subset of the hop count measurements for each hop count vector. Here, "X observed monitors" means that each hop count vector retains X randomly selected hop counts, with the rest of the vector incomplete. The new imputation method (Gaussian mixture EM) is compared against a naive mean imputation method in terms of Root Mean Squared Error in Figure 3.9:

RMSE = sqrt( (1/(N M)) sum_{i=1}^{N} sum_{k=1}^{M} ( h-hat_{i,k} - h_{i,k} )^2 )

In this analysis, we consider measurements from sources that were observed in 16 honeypots for the synthetic topology, with three different topology sizes (N = 1000, 2000, 3000).

The results show a clear advantage to using the Gaussian mixture imputation method for even a small number of observed measurements.

Figure 3.9: Imputation accuracy over a range of randomly selected missing values using data from M = 16 honeypots. (Left) N = 1000, (Center) N = 2000, (Right) N = 3000.

3.4 Shared Infrastructure Estimation

The previous section demonstrated that the source clusters identified by our algorithm are topologically meaningful. However, they do not reveal the topological relationship between the shared paths from the clusters to the monitors. In this section, we show that by coupling the passive hop count data with a small number of active measurements, we can identify the topological relationships between clusters. The active measurements take the form of traceroutes from the monitors to a small subset of target hosts, which effectively act as representatives for the clusters. This is in contrast to, e.g., the Skitter methodology [1], where active measurements are taken from all measurement nodes to a large set of target hosts. Consider a triple {S_i, S_j, M_k}, where two sources have a path to a single monitor, as seen in Figure 3.10. There are three potential topologies connecting this triple (two sources to one monitor), with a sharedness spectrum ranging from absolutely no sharedness, with two completely separate paths from each source to the monitor (Figure 3.10-(left)), to complete sharedness, with both sources on a single path to the monitor (Figure 3.10-(right)), with the intermediate stage of some length of shared path between the two sources (Figure 3.10-(center)). It is easy to verify

that if the number of shared hops is known for all such canonical subproblems, then the logical topology relating the sources to the monitors can be determined. This follows by observing that the set of paths from the sources to a given monitor forms a tree. Therefore, this section focuses on estimating P(i, j, k), the length of the shared path from two end hosts i, j to a single monitor k, using the passive data and a limited number of traceroute measurements.

Figure 3.10: Spectrum of sharedness (black dots represent routers). (Left) No sharedness, (Center) Intermediate sharedness, (Right) Maximum sharedness.

3.4.1 Cluster-Level Shared Path Length Estimation

Given a set of end host clusters, discovering shared topology between clusters becomes a straightforward task. For every cluster, randomly choose an end host in the cluster and perform active traceroute measurements between that end host (considered as a representative for its cluster) and the set of monitors (Figure 3.11). Therefore, using a single traceroute probe to each cluster (ignoring interface disambiguation problems), we have an estimate of the shared path lengths between end hosts contained in all other clusters in the topology. There are at least two potential problems with this straightforward approach to topology discovery. First, the source clusters may not be completely correct from a topological perspective, due to the possible existence of multiple egress points and missing hop counts. Second, from an Internet-wide perspective, the number of clusters may still be prohibitively large for exhaustive (cluster-wise) traceroute probing.

Figure 3.11: Example of cluster-level path estimation.

3.4.2 Predictive Shared Path Length Estimation

For two hop count distance vectors, it is necessary to first develop some value indicating the amount of sharedness between the two vectors. As shown in our work on passive end host clustering, similarity of the hop count contrast vectors indicates that the two sources are within the same topological cluster. The greater the similarity, the stronger the evidence for shared infrastructure in the paths to the monitors. To assess the potential for shared infrastructure to a given monitor, we consider the difference in hop count distances to that monitor and calculate the number of other monitors that result in the same hop count difference. Formally, we define

U_{i,j,k} = |T_{i,j,k}|    (3.1)

where T_{i,j,k} = {k' : |(h_{i,k} - h_{j,k}) - (h_{i,k'} - h_{j,k'})| < epsilon} (for some epsilon > 0) is the set of monitors with the same hop offset as monitor k. As the value of U_{i,j,k} becomes closer to the number of monitors, there is a higher likelihood of a longer shared path to each monitor. As U_{i,j,k} becomes closer to one, a long shared path is very unlikely. Given the training set I_k, where each element is the index of an end host for which we have exact knowledge (from active measurements) of the labeled path to monitor M_k, we can then construct sets consisting of pairs of training nodes that share the same offset value for the particular monitor k:

I^c_k = {[x, y] : x, y in I_k, U_{x,y,k} = c}    (3.2)

Considering two paths, from S_i to M_k and from S_j to M_k, the shortest shared path would be of length zero, as shown in Figure 3.10-(left), and the longest shared path would be of length min(h_{i,k}, h_{j,k}), as shown in Figure 3.10-(right). Given this range of shared path lengths, we can estimate the shared path length for any two sources i, j to any monitor k by attenuating the longest possible shared path length, min(h_{i,k}, h_{j,k}), by some value less than one, represented by alpha (where alpha is in [0, 1]):

P(i, j, k) = alpha * min(h_{i,k}, h_{j,k})    (3.3)

The problem becomes estimating the value of alpha. Given some collection of training data where active measurements give observed values for the shared path lengths, we can estimate alpha as a function of the passive measurements of S_i and S_j. We hypothesize that the more hop count distance values that are a constant integer apart, the more sharedness will be observed along the path. This results in learning a function whose domain is the number of hop count elements where the two vectors h_i and h_j are a constant integer apart. We can then learn the attenuation function by taking the average of the observed path length ratios for each integer offset value. Therefore, for each monitor k in {1, 2, ..., M} and uniformity metric value c in {1, 2, ..., M}:

alpha(c, k) = (1 / |I^c_k|^2) sum_{i in I^c_k} sum_{j in I^c_k} [ P(I^c_k(i), I^c_k(j), k) / min( h_{I^c_k(i),k}, h_{I^c_k(j),k} ) ]    (3.4)

Finally, we combine the learned attenuation function to create an estimator of the shared path length for each pair of end hosts, using only our passive measurement data and the function alpha (which we learn from a small set of active measurements):

P-hat(i, j, k) = alpha(U_{i,j,k}, k) * min(h_{i,k}, h_{j,k})    (3.5)

3.4.3 Shared Path Estimation Analysis

In Table 3.3, we show the results for three different methods of shared path length estimation. The methods include:

1. Unique Contrast Cluster-level Estimation - Cluster-level shared path estimation performed where each cluster represents a unique hop count contrast vector in the passive data set.
2. Cluster-level Estimation using the Gaussian Mixture Model - Cluster-level shared path estimation performed on clusters found using the Gaussian mixture algorithm.
3. Predictive Function Estimation - The estimated shared path lengths are found using Equation 3.5.

The results are based on a 1000 node synthetic topology, which was generated by Orbis [88]. We randomly select 800 leaf nodes (sources) and 8 monitors in the graph, and assume complete data, i.e., that probes from all sources are received at all monitors. The error metric that we use to assess the estimation accuracy, the Root Mean Squared Error (RMSE), is defined as:

RMSE(h-hat) = sqrt( sum_{i,j} ( h_{i,j} - h-hat_{i,j} )^2 )    (3.6)

where an RMSE of x indicates that the estimated shared number of hops is on average x hops away from the true number of hops extracted from the graph.
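A compact sketch of the predictive estimator (Equations 3.1-3.5) is given below in Python; the data structures for the training pairs and the measured shared path lengths are illustrative assumptions, not part of our measurement infrastructure.

import numpy as np

def offset_uniformity(H, i, j, k, eps=0.5):
    """U_{i,j,k}: number of monitors whose hop-count offset h_i - h_j matches
    monitor k's offset to within eps (Equation 3.1)."""
    offsets = H[i] - H[j]
    return int(np.sum(np.abs(offsets - offsets[k]) < eps))

def learn_alpha(H, train_pairs, true_shared, M):
    """Learn the attenuation function alpha(c, k) from labeled training pairs.

    train_pairs : list of (x, y) end-host index pairs with traceroute ground truth
    true_shared : dict mapping (x, y, k) -> measured shared path length P(x, y, k)
    Returns an (M+1) x M array alpha[c, k], the average ratio of observed shared
    length to the maximum possible shared length (Equations 3.2 and 3.4).
    """
    sums = np.zeros((M + 1, M))
    counts = np.zeros((M + 1, M))
    for (x, y) in train_pairs:
        for k in range(M):
            c = offset_uniformity(H, x, y, k)
            ratio = true_shared[(x, y, k)] / min(H[x, k], H[y, k])
            sums[c, k] += ratio
            counts[c, k] += 1
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

def predict_shared_path(H, alpha, i, j, k):
    """Estimate P(i, j, k) = alpha(U_{i,j,k}, k) * min(h_{i,k}, h_{j,k}) (Equation 3.5)."""
    c = offset_uniformity(H, i, j, k)
    return alpha[c, k] * min(H[i, k], H[j, k])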

The results in Table 3.3 show that the estimation from the unique contrast clustering performed the best, but required by far the largest number of active measurements. Using the information-theoretic approach from [89], 7 clusters were found (in comparison to the 47 active measurements needed for each monitor when performing unique contrast clustering). Using the Gaussian mixture clustering, the predictive method outperforms the cluster-level method. Simulations with different synthetic topologies provided similar results.

Table 3.3: Shared path estimation results for a 1000 node synthetic topology, assuming that probes from 800 randomly selected end host nodes were observed in 8 randomly selected monitors. (Columns: Estimation Type, # Probes, RMSE; rows: Cluster-level, Predictive Function, Unique Contrast.)

In Figure 3.12, we assess how increasing the number of clusters affects the performance of the Gaussian mixture EM cluster-level algorithm from an RMSE perspective. For this simulated synthetic topology (with N = 800, M = 24, in contrast to the M = 8 results in Figure 3.5), the addition of more clusters (and hence more active measurements needed) causes a significant decrease in the error rate of the path length estimation.

Figure 3.12: The effect of increasing the number of clusters on the shared path estimation performance on the simulated topology using the cluster-level shared path estimation method.

In Table 3.4, we show how the same three methods for shared path length estimation considered above perform on the Skitter topology described in Section 3.1. A random subset of 700 leaf nodes was selected as end hosts, and 8 leaf nodes were randomly selected as monitors. Similar to the simulated topology, the estimation from the unique contrast clustering performed the best from an RMSE perspective, but also required by far the largest number of active measurements.

Using the information-theoretic approach, 9 clusters were found, requiring only 18 active measurements of the topology for each monitor. Again, the predictive function outperforms the cluster-level method when considering the smaller number of clusters found by the Gaussian mixture EM algorithm.

Table 3.4: Shared path estimation results for the Skitter topology, assuming that probes from 700 randomly selected end host nodes were observed in 8 randomly selected monitors. (Columns: Estimation Type, # Probes, RMSE; rows: Cluster-level, Predictive Function, Unique Contrast.)

In Figure 3.13, we assess how increasing the number of clusters affects the performance of the Gaussian mixture EM cluster-level algorithm from an RMSE perspective. For the Skitter topology (with N = 700, M = 24, in contrast to the results for M = 8 in Figure 3.6), the addition of more clusters causes a decrease in the error rate of path length estimation, but not as significant a decrease as seen in the simulated topology example (this is not surprising due to the pruned structure of the Skitter topology).

Figure 3.13: The effect of increasing the number of clusters on the shared path estimation performance on the Skitter topology using the Gaussian mixture EM cluster-level shared path estimation method.

Shared Path Estimation with Imputed Data

Next, we assess how topology estimation is affected by using imputed data. Following the shared path estimation method from Section 3.4.1, the incomplete case derives clusters from imputed hop contrast vectors (using the Gaussian mixture imputation algorithm from Section 3.3.1) and then performs active measurements on the clusters. The use of imputed hop contrast vectors introduces more uncertainty about the topological significance of the clusters. To derive the shared path length estimate using incomplete data, we follow the derivation from Section 3.4.2, replacing all occurrences of the complete hop count distance vectors h_i with the imputed hop count distance vectors h-hat_i:

P-hat(i, j, k) = alpha(U-hat_{i,j,k}, k) * min(h-hat_{i,k}, h-hat_{j,k})    (3.7)

Topology Estimation Performance with Imputed Data

Using synthetic topologies generated by Orbis, we assess the impact of missing data imputation on topology estimation. We generate three different synthetic topologies with 1000, 2000, and 3000 nodes. The monitors are randomly chosen from the set of leaf nodes in the topology, with the passive measurements simulated as the length of the shortest path found in the topology between the end hosts and the monitors. After imputation of the missing data using the mixture of Gaussians technique, 10 clusters are found in the imputed hop count contrast vectors, resulting in an active probe budget of 20 active measurements per monitor. Figure 3.14 shows that the estimation of shared paths using the predictive methodology from Section 3.4.2 is comparable to the cluster-level deterministic estimation method for a majority of missing data percentages. The RMSE error rate for the estimated path lengths is defined as:

RMSE(P-hat) = sqrt( sum_{i,j,k} ( P(i, j, k) - P-hat(i, j, k) )^2 )

When the simulated topology is expanded to 2000 and 3000 total nodes, the effects on the shared path estimation algorithms can be seen in Figure 3.15. As seen from the figures, the increase in graph size improves the estimation for both algorithms.

Figure 3.14: Topology estimation performance for two different estimation methods in a 1000 node synthetic topology. (Left) M = 8, (Center) M = 16, (Right) M = 24.

Figure 3.15: Performance of the topology estimation algorithms in 1000, 2000, and 3000 node synthetic topologies with M = 16. (Left) Predictive Function Topology Estimation, (Right) Cluster-Level Topology Estimation.

3.5 End Host to End Host Distance Estimation

In this section, we describe a method for accurately estimating the hop distances between arbitrary pairs of end hosts in a network, relying primarily on passive hop count measurements. We argue that this characteristic of Internet topology is important and useful in its own right. For example, if link failures are independently and identically distributed, then identifying the shortest paths between nodes is important to robust overlay network design. Pairwise hop distance estimation between arbitrary nodes is also an important step toward our overall objective of accurate and timely generation of Internet-wide topology maps using passive measurements.

Towards this goal, we will present a novel Landmark Multi-Dimensional Scaling (LMDS) algorithm that can generate estimates of hop distances between all nodes using both passive measurements and a small subset of active measurements. This algorithm uses an iterative approach to minimize the error between the embedding of the nodes and the observed distances from the passive and active measurements. We emphasize that, in keeping with our experiments in passive shared path estimation, M is orders of magnitude smaller than N. Thus, the overhead of the active measurement component of the process is almost negligible. We evaluate the feasibility and capabilities of our pairwise hop distance estimation method using a set of synthetic topologies and an empirical data set from Skitter [1]. We consider varying degrees of missing passive measurement data and find that the algorithm is able to consistently produce highly accurate pairwise hop distance estimates. We also consider the impact of different sizes of landmark infrastructures and find that even low-dimensional embeddings (e.g., fewer than 7 dimensions) are able to produce highly accurate estimates.

3.5.1 Multidimensional Scaling (MDS)

Several previous studies have indicated that the salient features (e.g., latencies or hop-count distances) of high-dimensional networks can be captured by low-dimensional embeddings [35, 36, 37, 38, 39]. We pursue this idea by determining a low-dimensional embedding that preserves the distances of the observed hop-counts h_{i,j} as accurately as possible. In turn, the embedding produces estimates (or imputations) of the hop-count distances between sources and monitors that were not observed and between the N sources themselves (also not observed). In our work on passive measurement based clustering of end hosts, it was shown that hop count distance vectors that are similar in a Euclidean sense do not necessarily translate to end hosts that are close in the actual network topology. Therefore, the hop count distance vectors cannot be considered a sufficient embedding coordinate system. This motivates mapping the hop count distance vectors h_i to a low dimensional coordinate space X in which the Euclidean distance between vectors relates to the hop count distance between sources, such that D-hat_{i,j} = ||X_i - X_j||, where D-hat_{i,j} is the estimated number of hops between source S_i and source S_j, and D-hat_{i,j} is close to the true number of hops D_{i,j}.

Classical Multidimensional Scaling (MDS)

First consider the case where N^2 active measurements have been performed between all pairs of the N end hosts in the network. This provides the hop-count distances between all pairs of end hosts, creating the distance matrix D (where D_{i,j} is the number of hops between end hosts S_i and S_j). Using the complete observation of pairwise distances, one can apply classical Multidimensional Scaling (MDS) [91] to map the sources to a lower dimensional space (S_i -> X_i). This lower dimensional space will be of dimension d, where d << N. In this mapped space, the Euclidean distance between points (||X_i - X_j||, where, for this chapter, all norm references are to the l2-norm) corresponds to the estimated hop-count distance between the two end hosts in the network.

Theorem 3.5.1 (Multidimensional Scaling Theorem [91]). The doubly centered squared distance matrix B = -(1/2) C D^2 C (defining the centering matrix as C = I - (1/N) 1 1^T, and with D^2 denoting the matrix of squared distances) is equivalent to the Gram matrix of the array of end host coordinates, X^T X, where each column X_i of the matrix X is the lower dimensional embedding of source S_i.

Proof. By our definition of the lower-dimensional embedding, each pairwise squared distance value (for two arbitrary end hosts S_i, S_j) satisfies:

D^2_{i,j} = ||X_i - X_j||^2 = (X_i - X_j)^T (X_i - X_j) = X_i^T X_i - 2 X_i^T X_j + X_j^T X_j

Therefore, for the centered squared distance D^2_{i,o} (the squared distance from source S_i to the origin):

D^2_{i,o} = ||X_i - X_o||^2 = ||X_i||^2 = X_i^T X_i

Finally, we can state that each element of the doubly centered distance matrix satisfies:

B_{i,j} = -(1/2) ( D^2_{i,j} - D^2_{i,o} - D^2_{j,o} )
        = -(1/2) ( (X_i^T X_i - 2 X_i^T X_j + X_j^T X_j) - X_i^T X_i - X_j^T X_j )
        = -(1/2) ( -2 X_i^T X_j )
        = X_i^T X_j

This implies that B = X^T X.

Using Theorem 3.5.1, we can state that, using the symmetric singular value decomposition of the doubly centered squared distance matrix, B = U^T Lambda U, we can find the lower-dimensional embedding of each end host, X_{[d]} = Lambda^{1/2}_{[d]} U_{[d]} (where Lambda_{[d]} represents the diagonal matrix of the d largest eigenvalues of B, U_{[d]} represents the matrix of the first d eigenvectors of B, and d is the chosen dimension of the embedding).

Embedding Dimension Selection

The embedding dimension d is chosen to be as small as possible while still providing an embedding that accurately preserves the hop-count distances. This is accomplished by considering the energy ratio

e(d) = ( sum_{i=1}^{d} lambda_i ) / ( sum_{i=1}^{M} lambda_i )    (3.8)

where lambda_i is the i-th eigenvalue of the matrix A. The embedding dimension d is selected as the smallest d such that e(d) > 90%. This can be considered the smallest embedding dimension such that 90% of the energy of the original high-dimensional data is retained. Motivation for this threshold of 90% will be shown later in the chapter.
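For reference, a minimal NumPy sketch of classical MDS with the energy-ratio dimension selection is shown below; the matrix and function names are our own, and the code is an illustration rather than the exact implementation used in our experiments.

import numpy as np

def classical_mds(D, energy_threshold=0.90):
    """Sketch of classical MDS on a complete hop-count distance matrix D (N x N).

    Returns the embedded coordinates (d x N) and the chosen dimension d,
    selected as the smallest d whose eigenvalue energy ratio exceeds the
    threshold (Equation 3.8).
    """
    N = D.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N           # centering matrix
    B = -0.5 * C @ (D ** 2) @ C                    # doubly centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)           # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # Keep only non-negative eigenvalues (B is ideally positive semidefinite).
    pos = np.clip(eigvals, 0.0, None)
    energy = np.cumsum(pos) / pos.sum()
    d = int(np.searchsorted(energy, energy_threshold) + 1)

    X = np.sqrt(pos[:d])[:, None] * eigvecs[:, :d].T   # d x N coordinates
    return X, d

# Estimated pairwise hop distance between sources i and j: np.linalg.norm(X[:, i] - X[:, j]).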

3.5.2 Landmark MDS using Active Measurements

Given a very large number of end hosts, the computational complexity of classical MDS may be too large to practically implement (given the O(N^3) computation time and O(N^2) active probes needed). This challenge can be addressed using a technique known as Landmark MDS [92]. Landmark MDS reduces the complexity by considering only the distances from each end host to M landmark points, and then embedding each end host using knowledge of the pairwise distances between the landmarks (i.e., the M monitors in our application). We can partition the distance matrix D into four sections:

D = [ A    B
      B^T  C ]

where we have knowledge of the distances in the M x M matrix A (distances between the monitors) and the M x N matrix B (distances from monitors to end hosts), but no knowledge of the N x N matrix C (distances from end hosts to end hosts). This distance matrix results in a similarly partitioned doubly-centered squared distance matrix, K = -(1/2) C D^2 C (with C here denoting the centering matrix defined earlier):

K = [ E    F
      F^T  G ]    (3.9)

Using only information from submatrices A and B, we can state Theorem 3.5.2.

matrix is a Gram matrix:

    K = X^T X = [ R  S ]^T [ R  S ] = [ R^T R   R^T S ]  =  [ E     F ]
                                      [ S^T R   S^T S ]     [ F^T   G ]

where the columns of R are the embeddings of the M monitors and the columns of S are the embeddings of the end hosts. The known submatrix E can be eigendecomposed as

    E = R^T R = U Γ U^T.

Taking R = Γ^{1/2} U^T and using F = R^T S, we then have

    S = R^{-T} F = Γ^{-1/2} U^T F.

Finally,

    G = S^T S = ( F^T U Γ^{-1/2} ) ( Γ^{-1/2} U^T F ) = F^T R^{-1} R^{-T} F = F^T E^{-1} F.

Given a matrix D with rank M, this estimate is exact. Therefore, using just the pairwise distance information to the M monitors, we can find a lower d-dimensional embedding X of the monitors and end hosts, where

    X = [ Γ_{[d]}^{1/2} U_{[d]}^T ,  Γ_{[d]}^{-1/2} U_{[d]}^T F ]

(where Γ_{[d]} is the diagonal matrix of the d largest eigenvalues of E, U_{[d]} contains the corresponding first d eigenvectors of E, and d is the dimension of the embedding); the first block embeds the monitors and the second block embeds the end hosts.
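As an illustration of the landmark idea, the following is a small numpy sketch of the standard Landmark MDS triangulation of de Silva and Tenenbaum, which is one concrete way to realize the Nystrom-style extension described above. It is not the implementation used in this dissertation; the names and the energy threshold are illustrative assumptions.

    import numpy as np

    def landmark_mds(A, B, energy=0.90):
        # A: M x M monitor-to-monitor hop counts; B: M x K monitor-to-end-host hop counts.
        M = A.shape[0]
        C = np.eye(M) - np.ones((M, M)) / M
        E = -0.5 * C @ (A ** 2) @ C                      # centered Gram matrix of the monitors
        lam, U = np.linalg.eigh(E)
        lam, U = lam[::-1], U[:, ::-1]
        lam = np.clip(lam, 1e-12, None)
        d = int(np.searchsorted(np.cumsum(lam) / lam.sum(), energy)) + 1
        monitors = (U[:, :d] * np.sqrt(lam[:d])).T       # d x M monitor embedding
        # Triangulate each end host from its squared distances to the monitors.
        L_pinv = U[:, :d].T / np.sqrt(lam[:d])[:, None]  # d x M pseudo-inverse of the monitor embedding
        delta_mean = (A ** 2).mean(axis=1)               # mean squared distance among monitors
        hosts = 0.5 * L_pinv @ (delta_mean[:, None] - B ** 2)   # d x K end-host embedding
        return monitors, hosts, d

In the notation above, the end-host column count K corresponds to N - M; only A (M^2 probes) and B are ever required, never the unobserved end-host-to-end-host block C.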

This method reduces the computational complexity from O(N^3) (for classical MDS) to O(M^2 N + M^3) (for Landmark MDS). In a network setting, this requires M^2 active measurements to obtain the pairwise distances between the monitors, and then M N active measurements to obtain the distances from each source to each of the landmarks. The GNP method from [35] can be considered a special case of Landmark MDS with active measurements.

Landmark MDS using Incomplete Passive Measurements

Active probing to determine all of the hop-count distances between end hosts and monitors is impractical. The virtue of our proposed framework is that some of these distances can be obtained passively, which eliminates the requirement of M N active measurements between each IP source and each landmark. Note that if we had a complete set of passive measurements from each of the N end hosts to all M monitors, it would be possible to simply use Landmark MDS to embed both monitors and end hosts in a lower-dimensional space. The challenge of the passive approach is that many of the distances will not be observed, resulting in an incomplete data problem. Thus, for a given set of monitors, many of the hop count distance observations are missing from the observed set. For each end host S_i, we have a (potentially incomplete) hop count subvector h_{i,I(i)}, where I(i) indicates the indices of the passive monitors that observed traffic from end host S_i. We assume that this data is Missing-at-Random (i.e., the missing data locations are chosen at random in each vector); clearly, this Missing-at-Random assumption depends on monitor placement in real topologies. The challenge becomes how to determine the embedding from incomplete hop-count data. Previous work on Landmark MDS with missing data in [95] simply removed any data point with missing features. In our application, this simple approach would be catastrophic, since it is unlikely that many end hosts will be observed at all M monitors. We next adapt a method previously designed to handle noisy distance observations so that it handles the missing data problem.

Stress Function Construction

Given A, the matrix of pairwise hop-count distances between passive monitors, and H, the (potentially incomplete) matrix of hop-count distances from the end hosts to the monitors, we can construct the (N + M) x (N + M) distance matrix D, where each element represents the pairwise distance between two nodes (monitors and end hosts). This matrix can be written in block form as:

    D = [ A    Δ_1 ]
        [ H    Δ_2 ]

where Δ_1 is the unknown M x N reverse hop count matrix (the number of hops from the monitors to the end hosts) and Δ_2 is the N x N array of missing hop-count distances between pairs of the N end hosts. Next, define a mask W to indicate the locations of the observed hop values:

    W_{i,j} = 1                          for 1 <= i, j <= M,

    W_{i,j} = 1  if h_{i,j} is known,
              0  if h_{i,j} is missing,  for M + 1 <= i <= M + N and 1 <= j <= M,

with W_{i,j} = 0 for all remaining entries. An example of the structure of W is depicted in Figure 3.16. The embedding process amounts to assigning a d-dimensional vector to each monitor and end host. Each vector is generically written as X_i for the i-th monitor or end host. The collection of all vectors, or embedded points, is represented by the (M + N) x d matrix X. The embedding is found by minimizing the stress function with respect to the embedded points X:

    stress(X) = Σ_{i=1}^{N+M} Σ_{j=1}^{N+M} W_{i,j} ( ||X_i - X_j|| - D_{i,j} )^2        (3.10)

Intuitively, minimizing the stress function minimizes the squared error between the estimated distances and the observed hop count distances. Note that the stress function places no cost on unobserved distances.

Figure 3.16: Example mask array W (with N = 4 and M = 2). Note that not all hop counts from end hosts to monitors are observed, and none of the hop counts between end hosts are observed.

Stress Optimization Method

The stress function is minimized through an iterative procedure. Let X^(t) denote the embedded points at iteration t. Using the majorization method from the MDS literature [91], we bound the stress function of Equation 3.10 by a convex function and then minimize the bounding function. The procedure guarantees that the stress is reduced at each iteration, i.e., stress(X^(t+1)) <= stress(X^(t)). The bounding function can be minimized iteratively by:

    X^(t+1)_{i,a} = z_i Σ_{j != i} W_{i,j} ( X^(t)_{j,a} + D_{i,j} (X^(t)_{i,a} - X^(t)_{j,a}) / ||X^(t)_i - X^(t)_j|| )        (3.11)

with z_i = 1 / Σ_{j != i} W_{i,j}, for i = 1, 2, ..., N + M and a = 1, 2, ..., d (where d is found using Equation 3.8 on the M x M monitor distance matrix A). A proof that the stress decreases monotonically under this update is given in [96]. The procedure is repeated until the embedding coordinates converge, i.e., until ||X^(t) - X^(t+1)|| < ε for some tolerance ε > 0. The step-by-step methodology for solving this problem is described in Algorithm 2.
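As a small concrete check of Equation 3.10 (a sketch only; the array names are illustrative and not part of the dissertation's code), the weighted stress of a candidate embedding can be evaluated as:

    import numpy as np

    def stress(X, D, W):
        # X: (N+M) x d embedding; D, W: (N+M) x (N+M) distance and mask arrays.
        diff = X[:, None, :] - X[None, :, :]           # pairwise coordinate differences
        dist = np.linalg.norm(diff, axis=2)            # embedded distances ||X_i - X_j||
        return float(np.sum(W * (dist - D) ** 2))      # unobserved pairs have W = 0 and add no cost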

Algorithm 2 - MDS Algorithm with Incomplete Passive Measurements

Initialize: Using the matrix of pairwise distances between passive monitors (A), use Equation 3.8 to find the embedding dimension d. Randomly create a d-dimensional placement vector X^(1)_i for all i = 1, 2, ..., N + M. Set t = 1 and choose a tolerance ε > 0.

Main Body:

1. Adjust the monitor embedding:

       X^(t+1)_{i,a} = (1 / (M - 1)) Σ_{j=1, j != i}^{M} ( X^(t)_{j,a} + A_{i,j} (X^(t)_{i,a} - X^(t)_{j,a}) / ||X^(t)_i - X^(t)_j|| )

   for i = {1, 2, ..., M} and a = {1, 2, ..., d}.

2. Adjust all the end host embeddings with respect to the monitor embedding. Equation 3.11 reduces to:

       X^(t+1)_{i,a} = (1 / |I(i)|) Σ_{j in I(i)} ( X^(t)_{j,a} + h_{i,j} (X^(t)_{i,a} - X^(t)_{j,a}) / ||X^(t)_i - X^(t)_j|| )

   for i = {M + 1, M + 2, ..., N + M} and a = {1, 2, ..., d}, where I(i) is the subset of monitor indices observed for end host S_i.

3. If ||X^(t) - X^(t+1)|| < ε, go to Step 4; otherwise set t = t + 1 and go to Step 1.

4. Estimate the pairwise hop distances using the embedding: d̂_{i,j} = ||X_i - X_j||.
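The following is a minimal Python sketch of the iteration in Algorithm 2 (assumptions: numpy, a boolean observation mask, and an embedding dimension d chosen beforehand). It is illustrative only, not the implementation used for the experiments in this chapter.

    import numpy as np

    def incomplete_passive_mds(A, h, observed, d, eps=1e-4, max_iter=500, seed=0):
        # A: M x M monitor-to-monitor hop counts; h: N x M end-host-to-monitor hop counts;
        # observed: N x M boolean mask marking which entries of h were passively observed.
        M, N = A.shape[0], h.shape[0]
        X = np.random.default_rng(seed).standard_normal((M + N, d))   # rows 0..M-1: monitors
        for _ in range(max_iter):
            X_new = X.copy()
            for i in range(M):                                         # Step 1: monitor updates
                upd = np.zeros(d)
                for j in range(M):
                    if j == i:
                        continue
                    gap = X[i] - X[j]
                    upd += X[j] + A[i, j] * gap / (np.linalg.norm(gap) + 1e-12)
                X_new[i] = upd / (M - 1)
            for i in range(N):                                         # Step 2: end-host updates
                idx = np.flatnonzero(observed[i])
                if idx.size == 0:
                    continue                                           # host seen by no monitor
                upd = np.zeros(d)
                for j in idx:
                    gap = X[M + i] - X[j]
                    upd += X[j] + h[i, j] * gap / (np.linalg.norm(gap) + 1e-12)
                X_new[M + i] = upd / idx.size
            if np.linalg.norm(X - X_new) < eps:                        # Step 3: convergence test
                X = X_new
                break
            X = X_new
        hosts = X[M:]                                                  # Step 4: pairwise estimates
        D_hat = np.linalg.norm(hosts[:, None, :] - hosts[None, :, :], axis=2)
        return X, D_hat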

Table 3.5: Comparison of three different techniques for discovering pairwise distances between end hosts, where N is the number of end hosts and M is the number of monitors (N >> M).

    Method                             Number of Active Measurements    Computational Complexity
    Skitter                            O(N^2)                           O(1)
    LMDS using Active Measurements     O(M^2 + NM)                      O(M^2 N + M^3)
    LMDS using Passive Measurements    O(M^2)                           O(dMN + M^3)

Exploiting BGP Data

Consider the setup in Figure 3.17, where two end hosts have the same observed hop count to a single monitor. Given only this single element of information, the embedding algorithm may place these two end hosts close together (resulting in a small estimated pairwise hop distance), even though they exist in separate autonomous systems (ASes) and may be far apart in the real Internet topology. By exploiting BGP routing data, we can extract the specific AS in which each end host resides, and use that information to improve our pairwise hop estimation technique. Intuitively, given no other distance knowledge, one would prefer to embed end hosts close together if they exist in the same autonomous system (AS), and farther apart if they are in different ASes.

From either examining BGP looking glass servers or our own selection of passive measurement monitors, we can gather characteristics of paths between end hosts in different ASes, including the mean length of these paths (µ_B). We can weakly assume that end hosts in the same AS should have path lengths closer to zero, while end hosts in different autonomous systems should have path lengths closer to this average value, µ_B. This motivates creating a BGP distance matrix:

    D^B_{i,j} = µ_B   if S_i and S_j are in different ASes,
                0     if S_i and S_j are in the same AS,

for 1 <= i, j <= N + M.

Figure 3.17: Two end hosts with the same hop distance to a single monitor.

Instead of relying only on the incomplete hop count distances between the end hosts and the monitors, we can also use weak assumptions based on the AS distance information to embed the end hosts. Intuitively, one would not weight these weak AS-based distance assumptions the same as the observed hop count distances from passive measurements. It follows that the confidence placed in each AS-based distance should be inversely proportional to the variance of the corresponding path lengths. From a small set of probes, we can find the sample variance of the length of paths within the same AS, σ^2_in, and the sample variance of the length of paths between different ASes, σ^2_out. This leads to the definition of a new mask array W^B:

    W^B_{i,j} = λ / σ^2_in    if S_i and S_j exist in the same AS,
                λ / σ^2_out   if S_i and S_j exist in different ASes,

for 1 <= i, j <= N + M, where the weight value λ can be found adaptively using a bisection method, choosing the value that embeds the monitors with the lowest error (with respect to the observed pairwise hop distances to the other monitors). The stress function for this AS-exploiting technique can then be defined as:

    stress_B(X) = Σ_{i=1}^{N+M} Σ_{j=1}^{N+M} W_{i,j} ( ||X_i - X_j|| - D_{i,j} )^2
                + Σ_{i=1}^{N+M} Σ_{j=1}^{N+M} W^B_{i,j} ( ||X_i - X_j|| - D^B_{i,j} )^2

The methodology for minimizing this stress function is described in Algorithm 3.

Algorithm 3 - MDS Algorithm with Incomplete Passive Measurements and BGP Information

Initialize:

1. Using the matrix of pairwise distances between passive monitors (A), use Equation 3.8 to find the embedding dimension d.
2. Using bisection search, find the λ that minimizes the error of embedding the monitors in d dimensions.
3. Randomly create a d-dimensional placement vector X^(1)_i for all i = 1, 2, ..., N + M.
4. Set t = 1 and choose ε > 0.

Main Body:

1. Adjust the monitor embedding:

       X^(t+1)_{i,a} = (1 / (M - 1)) Σ_{j=1, j != i}^{M} ( X^(t)_{j,a} + A_{i,j} (X^(t)_{i,a} - X^(t)_{j,a}) / ||X^(t)_i - X^(t)_j|| )

   for i = {1, 2, ..., M} and a = {1, 2, ..., d}.

2. Adjust all the end host embeddings with respect to the monitor embedding and the AS information:

       X^(t+1)_{i,a} = ( s_i + t_i ) / ( |I(i)| + Σ_{j=1}^{N+M} W^B_{i,j} )

   for i = {M + 1, M + 2, ..., M + N} and a = {1, 2, ..., d}, where I(i) is the set of monitor indices observed for end host S_i, and where:

       s_i = Σ_{j in I(i)} ( X^(t)_{j,a} + h_{i,j} (X^(t)_{i,a} - X^(t)_{j,a}) / ||X^(t)_i - X^(t)_j|| )

       t_i = Σ_{j=1, j != i}^{N+M} W^B_{i,j} ( X^(t)_{j,a} + D^B_{i,j} (X^(t)_{i,a} - X^(t)_{j,a}) / ||X^(t)_i - X^(t)_j|| )

3. If ||X^(t) - X^(t-1)|| < ε, go to Step 4; otherwise set t = t + 1 and go to Step 1.

4. Estimate the pairwise hop distances using the embedding: d̂_{i,j} = ||X_i - X_j||.
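For concreteness, a small sketch of assembling the D^B and W^B matrices used in Algorithm 3. The names are my own illustrative choices; the AS labels, µ_B, and the variances are assumed to be available from BGP data and a small probe set, as described above.

    import numpy as np

    def bgp_matrices(as_labels, mu_B, var_in, var_out, lam):
        # as_labels: length N+M array of AS identifiers, one per monitor/end host.
        as_labels = np.asarray(as_labels)
        same_as = as_labels[:, None] == as_labels[None, :]       # True where two nodes share an AS
        D_B = np.where(same_as, 0.0, mu_B)                       # weak target distances
        W_B = np.where(same_as, lam / var_in, lam / var_out)     # inverse-variance weights
        return D_B, W_B

    # Example: five nodes in two ASes, mean inter-AS path length of 14 hops.
    D_B, W_B = bgp_matrices([7018, 7018, 3356, 3356, 3356], mu_B=14.0,
                            var_in=4.0, var_out=9.0, lam=0.5)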

End Host to End Host Distance Estimation Results

Mean Hop Distance Estimation

To compare the performance of our embedding algorithms, we must find a methodology to contrast against. Due to the reliance of our proposed method on incomplete passive measurements, the previous network embedding algorithms in [39, 38, 35] are not directly comparable. We instead choose the most logical baseline, an elementary mean estimation approach, where the estimated pairwise hop distance for any pair of end hosts in the topology is the mean of the pairwise hop distances found by active measurements between the monitor nodes (represented by the matrix A):

    d_mean = (1 / M^2) Σ_{i=1}^{M} Σ_{j=1}^{M} A_{i,j}

Note that the estimate d_mean is the same for every choice of end hosts i, j.

Embedding Experiments

For each dataset, we obtain a low-dimensional embedding using the MDS procedure described above. Given the low-dimensional embedding of each end host, we estimate (or impute) the hop-count distance between pairs of end hosts. We begin by running the iterative methods of the two preceding subsections (Landmark MDS using incomplete passive measurements, with and without the BGP information) on a series of synthetic Internet topologies generated using the Orbis tool [88]. The monitors are randomly chosen from the set of leaf nodes in the topology, with the passive measurements simulated as the length of the shortest path in the topology between the end hosts and the monitors. In all experiments, the BGP/AS information was synthetically created by randomly choosing a subset of 15 end hosts and classifying each end host's AS as the index of the pairwise closest end host in that random subset. The error metric used to assess the estimation accuracy is the Root Mean Squared Error (RMSE),

defined as:

    RMSE(D̂) = sqrt( (1 / N^2) Σ_{i,j} ( D_{i,j} - D̂_{i,j} )^2 ) = sqrt( (1 / N^2) Σ_{i=M+1}^{N+M} Σ_{j=M+1}^{N+M} ( D_{i,j} - ||X_i - X_j|| )^2 )

If our estimator has an RMSE of 1, then we can estimate the true hop distance (on average) to within a single hop. It also follows that for the mean estimation approach (from Section 3.5.5), the RMSE equals the sample standard deviation of the ground-truth hop count values.

In Figure 3.18, the size of the topology is held constant, while the performance of the algorithm is compared as the number of monitors varies between graphs (from 8 measurement monitors to 32 measurement monitors). The error rate (RMSE) is plotted against the amount of missingness in the passive measurement data; for example, with 16 monitors, "X observed features" refers to observing X out of 16 elements of each hop count vector, with the observed elements chosen at random. Figure 3.19 shows the performance of the algorithm in estimating pairwise distances between end hosts with respect to varying topology sizes, where the Orbis topology generation toolkit is used to rescale the topologies in terms of the total number of nodes in the graph (from 1000 to 2000 to 3000 nodes). In both Figures 3.18 and 3.19, the performance of the algorithm is shown with and without the autonomous system information.

The results show that when there is very little information from the passive measurements (a low total number of monitors, high levels of incompleteness, etc.), the BGP-exploiting algorithm performs considerably better than the embedding algorithm that ignores the BGP data. It is only when there is a large amount of available passive measurement data, as shown in Figure 3.18 (right), that the BGP-exploiting algorithm is moderately outperformed by the algorithm that ignores the BGP data. Meanwhile, both MDS algorithms outperform the naive mean estimation approach given moderate levels of missingness in the hop count vectors.

Figure 3.18: Simulation results for error rates of pairwise hop estimation on the synthetic topology versus the amount of available data (N = 1000). (left) M = 8, (center) M = 16, (right) M = 32.

Figure 3.19: Simulation results for error rates of pairwise hop estimation on the synthetic topology versus the amount of available data (M = 16). (left) N = 2000, (right) N = 3000.

Effects of Dimensionality Selection

Using the dimensionality selection technique described in Section 3.5.1, we now examine how much a reduced-dimension embedding space affects our estimation technique. In Figure 3.20, we show the performance of the MDS algorithm (with no BGP/AS information) as the dimension of the embedding space increases. One can see that beyond the selected embedding dimension d = 5 (found using Equation 3.8), each additional embedding dimension yields very little reduction in either the RMSE or the confidence bounds of the pairwise distance estimation.

Effects of Additional Monitors

Given the improvement in pairwise hop estimation from adding monitors (as seen in Figure 3.19), an important question is what effect each additional monitor added to the experiment

has on the estimation error rate. Using the synthetic topology, Figure 3.21 shows the effect that each additional monitor has on the estimation RMSE and the confidence bounds. As shown in the figure, after the placement of roughly 10 monitors, each additional monitor has relatively little impact on the resulting error rates and confidence bounds.

Figure 3.20: The effect of the embedding dimension on estimating the pairwise distance values for the synthetic topology, N = 1000, with M = 32 and calculated dimension d = 5; confidence bars indicate +/- 1 standard deviation.

Figure 3.21: The effect of adding additional monitors on estimating the pairwise distance values for the synthetic topology, observing complete hop count data, N = 3000; confidence bars indicate +/- 1 standard deviation.

Skitter Topology Experiments

We also consider the performance of the estimation techniques using the Skitter dataset [1]. A random subset of N = 1000 leaf nodes was selected as end hosts, and a randomly selected set of other

leaf nodes was selected as monitors. The pairwise distances were estimated for different levels of missing data in Figure 3.22 using both embedding methodologies. Again, the MDS algorithm exploiting the BGP/AS information outperforms the non-BGP/AS algorithm for experiments with low levels of observed data.

Figure 3.22: RMSE of pairwise hop estimation simulation results for the Skitter topology (N = 1000). (Left) M = 8, (Center) M = 16, (Right) M = 32.

Asymmetric Routing Results

The resulting topology estimates from the Multidimensional Scaling technique describe a symmetric topology, where the forward and reverse paths between end hosts are of equal length. However, prior empirical research on Internet routing has shown that asymmetry of forward and reverse paths between hosts is not uncommon (e.g., [97, 98]), although we were unable to find prior empirical work that broadly quantifies the difference between forward and reverse path lengths in terms of routers. Given the possibility of path length asymmetry, we examine the impact on our estimators when the reverse paths are off by one hop, two hops, and three hops (with equal probability of a positive or negative length offset). The effects of this distribution of path length asymmetry on our estimation methodology can be seen in Figure 3.23. Although the error rate increases as the amount of asymmetry in the forward and reverse hop counts increases, the MDS methodology (both the BGP/AS and non-BGP/AS methods) still outperforms the naive mean estimation methodology for all but the highest levels of missingness.

Figure 3.23: Simulation results for asymmetric reverse paths on the synthetic topology (N = 1000, M = 16) versus the amount of available data. (left) Reverse paths off by 1 hop, (center) reverse paths off by 2 hops, (right) reverse paths off by 3 hops.

3.6 Summary

In this chapter, we introduced novel techniques leveraging passive measurements for (i) clustering end hosts, (ii) inferring missing measurement values, (iii) estimating the shared path between end hosts, and (iv) inferring end-host-to-end-host distances in the network. All of the presented methodologies require only (potentially incomplete) passive hop count measurements and O(1) additional active probes. This lightweight measurement regime allows for timely acquisition of network properties without imposing a large measurement load on the network infrastructure.

Chapter 4

Inferring Unseen Structure of the Network

In this chapter, we address a subset of the general problem of router-level Internet topology mapping. Our objective is to infer the existence of components that have not been observed (unseen routers, unseen route lengths, and unseen links) in a partial probing of the Internet core. We appeal to prior work on Internet mapping to define what is meant by the Internet core. Specifically, the core is composed of the set of routers that are greater than one hop away from end hosts and is roughly bounded by the borders of stub autonomous systems [14]. We expand on this definition in Section 4.1. We believe that this somewhat imprecise definition of core is sufficient, since we would like to identify as large a core component as possible, and strict boundaries between core and edge are likely to be arbitrary and of little practical importance. We argue that unseen core inference is important because an extremely large volume of traffic traverses the core of the Internet.

We assume an infrastructure from which active probe-based measurements can be made and that the infrastructure has a relatively broad deployment. Using this infrastructure as the starting point for gathering data, our inference methodology has three components. The first component addresses the unseen core router problem. Namely, given an increase in the network probing infrastructure, how many extra core routers will we find? Our solution to this problem is related

to solutions for the so-called unseen species problem, but cast in a networking context. The second component of our methodology addresses the problem of inferring unseen links between observed routers. Instead of using measurements with router interface IP addresses, we exploit a matrix completion algorithm that is based only on hop counts between source-destination pairs. This novel approach enables accurate and efficient estimation of connectivity without the need for interface disambiguation, which has proven to be difficult and to have a significant impact on resulting maps [21]. The final component consists of a targeted probing methodology that merges our contributions of inferring unseen core routers and unseen core links to efficiently reveal areas of the core containing the most uncertainty.

4.1 Experimental Dataset

In this section we describe the core router data used in our study and how it was collected. The goal of our data collection effort was to establish a data set from which our methods could be evaluated. Specifically, we required a representative corpus of core Internet routers with disambiguated interfaces (to the extent possible using available techniques) that could act as ground truth for our work.

Core Router Definition

In this chapter, we adopt a pragmatic definition of which routers constitute the Internet's core. Consider the result of performing a traceroute from an end host in a stub network with prefix A to a host in a stub network with prefix B. In the ordered list of routers obtained from tracing the route, the first router considered part of the core is the one with the last occurrence of an address in prefix A. The last router considered part of the core is the first one responding with prefix B. The attempt is to avoid considering any router as part of the core that is fully within a network with the same prefix as an end host. Figure 4.1 shows graphically what our identification methodology is designed to accomplish.

This approach is conservative in the sense that it is likely to omit some core routers from

consideration, e.g., routers that connect to actual core nodes within a single AS. However, as stated above, our intent is not to focus on exactly identifying the core boundary, but rather to accurately capture the gross characteristics of the core so that the performance and stability of Internet applications and services can be improved.

Figure 4.1: A representation of our pragmatic definition of the Internet's core.

Core Router Identification

In order to identify as many core router IPv4 addresses as possible, we leveraged high-quality data provided by an on-going measurement project and collected additional data using the Planetlab [99] infrastructure. The existing data was provided by the iplane project [100], which performs a traceroute from all available Planetlab hosting sites to a set of target prefixes obtained through the Routeviews project [101]. We used four weeks of iplane data collected over the period of 12 December 2008 to 8 January 2009. In addition to the iplane data, we collected traceroute data between a full mesh of Planetlab hosting sites. At the time of our measurement collection there were over 900 hosts that were part of Planetlab, but only about 375 distinct sites. Of these sites, only a subset is available at any given time due to host maintenance or other issues. To perform each traceroute, we used the Paris traceroute tool [102]. Informed by the recent study by Luckie et al. [103], we invoked the tool once using UDP-based probes and a second time using ICMP-based probes for each destination, in order to discover as many core routers as possible through active probing. We set options in the

Paris traceroute tool so that an individual measurement between hosts took longer, but produced a low rate of probe traffic. We collected the full mesh of Planetlab traceroutes three times (roughly evenly distributed) over the same period that we obtained the iplane data set. Due to Planetlab site and host transience, we were able to use approximately 216 Planetlab sites for each of the three rounds of full mesh probing. We gathered three complete measurement data sets beginning on December 11, 2008, December 22, 2008 and January 6, 2009.

Using these two data sets, we were able to discover 125,146 unique core router IPv4 addresses. The total number of hops (links) observed between all of these nodes was 519,273. A standard problem in traceroute-based topological studies is the issue of IP interface disambiguation, which is also referred to in the literature as alias resolution. That is, Internet routers are typically assigned multiple IP addresses (e.g., each interface on a router may have a unique IP address assigned to it). Identifying which addresses correspond to the same physical router is the role of alias resolution. To identify the core routers in (i.e., de-alias) our data set, we used a router alias database published by the iplane project, which builds on previously published alias resolution methodologies, including those used by the Rocketfuel project [11]. After alias resolution, we identified 114,815 core routers. Indeed, our main reason for using the iplane data (as opposed to other widely available topology datasets) was that an IP alias database is also published.

4.2 Inferring Unseen Components of the Core

Our methodology for inferring the unseen components of the core of the Internet is divided into two distinct components. Practically speaking, these components are predicated on having an initial set of source/destination hop count data from which core topology estimates can be made and a measurement infrastructure from which additional probes can be sent. The components of our discovery methodology are:

1. Estimate the population size of unseen routers. Use the initial data set to predict the number of additional core routers that would be discovered using additional probes.

2. Estimate the unseen connectivity between observed core routers. Use the massively incomplete set of observed hop counts between core routers to estimate unseen route lengths and infer unseen links between observed core routers.

We describe each component in the following sections. Toward our goal of being able to discover the core accurately and to maintain core maps over time, we envision these components being run on an on-going basis. It is important to note that, in terms of practical deployment and use, the computational complexity of all of our algorithms is such that the unseen core estimates can be made using a modestly configured workstation, and that the overall probe load is on the order of tens of thousands of probes in total, which for a large distributed infrastructure is minimal.

4.3 Estimating the Number of Unseen Core Routers

Consider sending a series of traceroute probes through the network and observing a collection of core routers (throughout this section, we refer to disambiguated interfaces as core routers). How complete is this set of core routers? Are there significantly more core routers in the network that have not been observed by the traceroute probing set? Determining this missing set size is analogous to the problem of predicting the number of unseen species in an environment given some sample set of observations, or estimating the size of Shakespeare's vocabulary based on the number of unique words appearing in his known works [47, 48]. Using the set of traceroute probes between Planetlab nodes and other points in the network, we can use properties of the occurrence of routers to predict how many more routers will be found in the network given increased probing.

The prediction idea we employ is based on a more sophisticated version of the following simple idea. Suppose we randomly split the traceroute dataset into two halves. A certain number of routers are discovered by the routes in one half of the dataset, and a certain number of additional new routers are discovered in the other half. This gives us a rough idea of how many new routers might be discovered were we to double the original size of the traceroute measurement campaign.

Now let us consider the problem of predicting the number of additional routers found in the core of the Internet as the result of increased measurements, based on the number we discover through

the initial traceroute campaign. Consider traceroute probing the Internet from a set of sources S to a set of destinations D. Let n_i denote the number of core routers that appear in exactly i routes in this traceroute dataset. While the methodologies in [49, 50] have both examined the problem of unseen species estimation in the context of networking, their results are directed towards finding the total number of routers/links in a network given limited observations. For the purposes of this work, we are interested in leveraging the methodology from [47, 48], where the number of unseen routers is estimated for a fractional increase in the number of destinations probed. We consider this a more practical problem than the previously framed unseen-networking research, as it is important to know what it is possible to discover using a feasible increase in probing of the network.

Missing Species Estimator - Given the values n_1, n_2, ..., n_k, where n_i is the number of routers that occur in exactly i routes in the traceroute dataset, the number of additional routers found by increasing the number of destination points by some fraction t in [0, 1] can be estimated as

    r(t) := Σ_{i=1}^{k} (-1)^(i+1) t^i n_i,        (4.1)

where the value t represents the fractional increase in the number of probes, with t = 1 being a doubling of the probing infrastructure. This estimator, proposed in [48], relies on extrapolating the information from the observed data to an unseen fractional increase of the data.

The rationale of the estimator hinges on two key assumptions. The first underlying premise is that the number of times a given router is observed increases roughly linearly as a function of the number of traceroute measurements, modulo a bit of randomness in this growth rate depending on the specific set of traceroute measurements employed. To test the validity of the first assumption, we observe the growth of the number of times a router is encountered as a function of the number of probes, and compare this observation to a linear function fit to the observations.
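Equation 4.1 is simple to compute once the router occurrence counts n_i have been tallied from the traceroute data. The following is a small illustrative sketch (the function name and toy data are my own; the alternating-sign form follows the Good-Toulmin/Efron-Thisted estimator that Equation 4.1 is based on):

    from collections import Counter

    def unseen_routers(occurrences, t):
        # occurrences: how many routes each observed router appeared in;
        # t: fractional increase in the number of destinations probed (t = 1 doubles it).
        n = Counter(occurrences)                      # n[i] = number of routers seen in exactly i routes
        return sum(((-1) ** (i + 1)) * (t ** i) * n_i for i, n_i in n.items())

    # Toy example: routers seen once contribute positively, those seen twice subtract, etc.
    print(unseen_routers([1, 1, 1, 2, 2, 3], t=1.0))  # 3*1 - 2*1 + 1*1 = 2 predicted new routers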

The agreement with a linear trend can be gauged by the correlation between the data and the best linear fit. To evaluate correlation, we use the R^2 coefficient of determination metric [60], which measures the linear relationship between the observed average values ({f_o(1), f_o(2), ..., f_o(N)}) and the best linear fit to these observed values ({f_l(1), f_l(2), ..., f_l(N)}):

    R^2 = ( (1/N) Σ_{i=1}^{N} ( (f_o(i) - f̄_o)(f_l(i) - f̄_l) ) / ( σ_{f_o} σ_{f_l} ) )^2        (4.2)

where f̄_o is the sample mean of f_o, f̄_l is the sample mean of f_l, σ_{f_o} is the sample standard deviation of f_o, and σ_{f_l} is the sample standard deviation of f_l. By definition, R^2 = 1 if there is perfect correlation between the observed values and the best linear fit, and R^2 = 0 if the two sequences are uncorrelated. The average R^2 across all observed routers in our iplane/Planetlab probe dataset was found to be very close to one, with a small standard deviation, indicating that the average router observations are almost perfectly correlated with the best linear fit.

The second assumption is that all traceroute measurements, past and future, are independent and identically distributed. This is reasonable in our situation because the sources and destinations in our measurement campaign are widely distributed end hosts in the Internet, and therefore the traceroute dataset is a fairly random sampling of paths through the Internet core.

Experimental Performance

From the Planetlab/iplane probing infrastructure described in Section 4.1, we observe 114,815 core routers. From the Missing Species Estimator, we can predict from the core router occurrence characteristics that we will discover an additional 46,032 core routers given a doubling (t = 1) of the traceroute probing infrastructure. Next, we would like to assess the accuracy of this estimate. We can test the accuracy by taking ten random realizations of probing only half of our dataset (obtained by maintaining the same number of traceroute sources, but probing only a randomly chosen half partition of the traceroute destinations). Across these ten experiments, the reduced probing sets resulted in finding on average only 91,018 core routers (in contrast to

114,815 core routers found in the full probing set). Using the simple formula in Equation 4.1 and the characteristics of the routers found in these reduced probing sets, we would predict an average of 38,582 additional core routers given the full probing set, for an average total of 129,600 core routers. This is only a 13% deviation from the actual total number of routers observed using the complete set of destinations.

4.4 Estimating Unseen Connectivity

Given that we can now estimate how much of the core was not observed, we next examine what can be said about the unseen topology associated with the core routers that have been observed. Specifically, given an observed set of core routers, we focus on accurately estimating the hop length of the unobserved paths between every pair of observed core routers. Our estimation method is based on the use of traceroute measurements with disambiguated interfaces. Our eventual goal is to minimize our reliance on traceroute measurements in order to simplify and streamline the measurement process. However, the focus of the analysis below is similar to unseen core router estimation, namely to estimate the unseen connectivity between core nodes given traceroute measurements.

A natural way to represent the router-level topology of the Internet is with a hop count matrix H^(I). For the entire IPv4 Internet, H^(I) is a matrix in which each element H^(I)_{i,j} represents the number of routers between IP address i and IP address j. If the matrix H^(I) were known, the router-level topology of the Internet would be completely resolved everywhere. In this section, we focus on reconstructing a portion of this full hop count matrix (H) given the subset of observed core routers and Planetlab nodes from the core measurement data. To fill in even this portion of the hop count matrix completely would require an infeasible N^2 probing of the Internet. Instead, we reconstruct this matrix using the set of core measurements from Section 4.1. We examine how the traceroute measurements (i.e., measurements containing labels of intermediate nodes) can be used to improve our connectivity estimates.

To construct the hop count matrix from traceroute measurements, consider a single

traceroute probe sent between two of the Planetlab nodes (p_1, p_2) that returns the path

    p_1 -> r_1 -> r_2 -> r_3 -> r_4 -> p_2        (4.3)

From this single path, many hop elements of H can be observed (assuming interface disambiguation), with r_1 being one hop away from r_2, two hops away from r_3, and so on. Intuitively, this has a multiplicative effect on our ability to populate the hop count matrix H compared with a single ping-style hop count measurement. In this fashion, we use the large set of traceroute measurements to fill in the observed elements of the hop count matrix H, as illustrated in Table 4.1.

Table 4.1: An example hop count matrix using the hop elements observed from the single traceroute path p_1 -> r_1 -> r_2 -> r_3 -> r_4 -> p_2 (where "-" represents an unknown element).

             p_1   p_2   r_1   r_2   r_3   r_4
    p_1       -     5     1     2     3     4
    p_2       -     -     -     -     -     -
    r_1       -     4     -     1     2     3
    r_2       -     3     -     -     1     2
    r_3       -     2     -     -     -     1
    r_4       -     1     -     -     -     -

It is obvious from Table 4.1 that many of the matrix elements carry no information (i.e., simply a "-"). Using only traceroute probes to fill in the core-router-to-core-router hop elements will result in a hop count matrix that may be highly incomplete, depending on the perspective afforded by the available measurements. Given this incomplete hop count observation matrix, our objective is to impute (or "fill in") the missing observations to resolve the router-level topology of the observed core routers and Planetlab nodes. By partitioning the hop count matrix according to the classification of each node as a Planetlab node or a core router, we can rearrange the hop count matrix into block form:

    H = [ H_pp   H_pr ]
        [ H_rp   H_rr ]

where H_pp is the submatrix of hop counts between Planetlab nodes, H_pr is the submatrix of hop counts from Planetlab nodes to core routers (with H_rp being the submatrix of the reverse paths), and H_rr is the very incomplete collection of hop counts between core routers. Our goal is to estimate the unseen elements of the submatrix H_rr using the small set of observed pairwise measurements.

Matrix Completion Algorithm

Given the observed elements of the hop count matrix H, we now estimate the unseen hop elements. We appeal to Occam's Razor, which states that the simplest explanation for a phenomenon is most likely the correct explanation. Therefore, the full hop count matrix is reconstructed using the solution that gives the simplest matrix while still preserving the values of the observed hop elements. The number of unique linear components that compose a matrix, its rank, can be considered a measure of how complex the matrix is. Recent work in [51] frames the estimation of incomplete matrices in this manner, by finding the matrix with the lowest rank (the number of nonzero eigenvalues) that agrees with the observed elements:

    minimize    rank(X)
    subject to  X_{i,j} = H_{i,j}  for every observed element H_{i,j}

Unfortunately, solving this problem is NP-hard, requiring an infeasibly long computation time. Instead, we focus on the approach described in [104], which shows that the convex envelope of rank(X) is the nuclear norm ||X||_* = Σ_{k=1}^{N} σ_k(X) (where σ_k(X) is the k-th singular value of X). Therefore, the intractable rank minimization problem can be relaxed to the more convenient convex optimization problem:

    minimize    ||X||_*
    subject to  X_{i,j} = H_{i,j}  for every observed element H_{i,j}

This convex optimization problem can be efficiently solved using the Lagrange multiplier approach from [52].
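The experiments below use the Lagrange-multiplier solver of [52]. As a self-contained illustration of the same nuclear-norm idea, the following is a sketch of a simple iterative singular-value soft-thresholding loop (a SoftImpute-style scheme, not the dissertation's solver; the regularization weight and iteration count are illustrative assumptions):

    import numpy as np

    def soft_impute(H_obs, mask, lam=1.0, n_iters=200):
        # H_obs: hop counts with zeros at unobserved entries; mask: 1 where observed, 0 otherwise.
        X = np.zeros_like(H_obs, dtype=float)
        for _ in range(n_iters):
            Y = mask * H_obs + (1 - mask) * X             # keep observed data, current guess elsewhere
            U, s, Vt = np.linalg.svd(Y, full_matrices=False)
            s = np.maximum(s - lam, 0.0)                  # soft-threshold the singular values
            X = (U * s) @ Vt                              # low-rank (small nuclear norm) update
        return X

In practice the holdout RMSE described below can be used to choose the regularization weight lam.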

One important detail is that, due to the matrix factorization used in this methodology, the full estimated hop matrix X does not need to be stored explicitly in memory. This is significant, as the full hop matrix of our network data would contain roughly 2^34 elements (a matrix of this size would require 128 GB of memory using double-precision floating point numbers).

The motivation for using Matrix Completion to impute hop counts is two-fold. First, given k observations from the N x N matrix H (for N end hosts), it was shown in [51] that the Matrix Completion algorithm can exactly reconstruct the matrix provided the number of observations satisfies k >= O(N^1.2 r log N) (where r is the rank of the matrix). Therefore, even for massively incomplete matrices, we may be able to accurately reconstruct the unseen elements. Second, our work on estimating end-host-to-end-host distances using passive measurements (Section 3.5) shows that a hop count matrix can be accurately represented by a low-rank approximation, indicating that Matrix Completion should perform well on these matrices.

Experimental Performance of Matrix Completion

A hop matrix is constructed using 10,276 core routers found by probing between 216 active Planetlab nodes and 375 Planetlab node destinations using the methodology described in Section 4.1. This dataset is massively incomplete, with only 1.94% of the hop elements observed. To assess the estimation error rate, 100,000 observed core-router-to-core-router hop elements (chosen completely at random) were held out of the dataset and used to validate the performance of the Matrix Completion procedure. The error metric used to assess the estimation accuracy is the Root Mean Squared Error (RMSE), defined as:

    RMSE(Ĥ) = sqrt( (1 / |y|) Σ_{{i,j} in y} ( H_{i,j} - Ĥ_{i,j} )^2 )

where y denotes the holdout set of coordinates and |y| is the size of the holdout set. If our estimator has an RMSE of 1, then we can estimate the hop distance (on average) to within a single hop of the true hop distance. In addition to the Matrix Completion results, we compare the hop estimation results against a naive hop estimation methodology. While the end-host-to-end-host distance estimation methodology of Section 3.5 addresses a similar problem, its

reliance on measurements to a set of landmarks (unavailable here) makes a side-by-side comparison impractical.

Mean Hop Estimation

To provide a benchmark for comparison, we consider the following simple approach to imputing the missing hop counts: estimate each missing hop count as the mean of the hop counts that have been observed,

    Ĥ_{i,j} = (1 / |P(H)|) Σ_{x} Σ_{y} P(H_{x,y})

where P(H_{x,y}) is equal to H_{x,y} if the hop length between routers x and y was observed and 0 otherwise, and |P(H)| denotes the total number of observed hop counts.

Missing Hop Count Estimation Results

Table 4.2 shows the error rate for estimating the held-out hop elements in the 10,276 x 10,276 core router hop count matrix. The Matrix Completion algorithm estimates the missing elements with an average accuracy almost 4 hops better than the naive mean estimation procedure. The empirical cumulative distribution of the errors is shown in Figure 4.2, which plots the probability that the imputation deviation (the absolute difference between the true value and the estimated value) is less than or equal to the value on the x-axis for both imputation methodologies. In Table 4.3, the deviation of the estimated held-out hop element values from the true values is shown for both the Matrix Completion method and the mean imputation method. As evident in the table, the Matrix Completion estimates are only slightly biased by the 4.7% of the holdout data estimated with a deviation of more than 4 hops from the observed value; in comparison, the mean imputation method estimates over 43% of the holdout data with a deviation of more than 4 hops. Meanwhile, over 60% of the unobserved hop elements can be estimated within 1 hop by the Matrix Completion method. This level of accuracy directly motivates classifying unseen links in the

topology using this new Matrix Completion methodology.

Figure 4.2: Empirical cumulative probability of the imputation error using both Matrix Completion and mean imputation.

Table 4.2: Hop matrix reconstruction error rates: the RMSE over the 100,000 held-out core-router-to-core-router hop distances.

    Method               RMSE
    Mean                 5.96 hops
    Matrix Completion    2.03 hops

Table 4.3: Division of Matrix Completion errors for the holdout data.

    Error Range                Matrix Completion    Mean Imputation
    Less than one hop off      60.2%                15.8%
    Less than two hops off     82.7%                30.7%
    Less than three hops off   91.3%                44.5%
    Less than four hops off    95.3%                56.4%
    More than four hops off    4.7%                 43.6%

Inferring Unseen Core Links

Using Matrix Completion, we now have an estimated hop count matrix Ĥ containing the predicted hop counts between all of the core routers found by probing. The natural question that now arises

is, given the estimated hop count matrix, can we now infer the existence of unseen links between routers? Using the previously defined notation, we wish to identify source-destination pairs {j, k} where H_{j,k} = 1 but H_{j,k} was not observed.

Given an estimated hop count matrix Ĥ, one simple methodology for estimating unseen links is to threshold the estimated hop count values, classifying all estimated hop values below a certain threshold (λ) as links. This results in the hop count thresholding adjacency matrix A^(hop):

    A^(hop)_{j,k} = 1  if ĥ_{j,k} < λ,
                    0  otherwise        (4.4)

where the chosen value of λ gives an explicit trade-off between the number of links missed and the number of false links erroneously declared as existing (the false alarm rate). A deficiency of this hop thresholding methodology is that it does not account for the variance of our hop estimates. In areas of the network with high uncertainty, this method could erroneously classify links. Instead, we consider the statistical inference methodology of bootstrapping [57] to systematically decide which pairs of observed routers are connected by links in the topology, taking into account the variance of our hop count estimates.

The bootstrap thresholding methodology is executed as follows. Consider repeatedly subsampling the observed hop count data, where each subsample H^(i)_B (for i = {1, 2, ..., M}) contains 95% of the observed hop counts chosen at random (with the other 5% of the observed hop counts held out from consideration). By performing the Matrix Completion algorithm on each subsampled matrix, we obtain M estimates of the full hop count matrix, Ĥ^(i)_B. Performing this repeated subsampling with M = 40, for each unobserved hop count we obtain the set of 40 estimates {ĥ^(1)_{j,k}, ĥ^(2)_{j,k}, ..., ĥ^(40)_{j,k}}. How stable are these multiple hop count estimates? We can find the empirical bootstrap confidence limits [ĥ^l_{j,k}, ĥ^u_{j,k}] by sorting the set of estimates and taking the second smallest hop count value (ĥ^l_{j,k}) and the second largest value (ĥ^u_{j,k}). These are the empirical 95% bootstrap confidence limits on our estimate of this hop count, i.e., given our observed data we are 95% confident that the true hop value h_{j,k} lies in [ĥ^l_{j,k}, ĥ^u_{j,k}].

How do the confidence limits help determine which elements are links? Consider the empirical bootstrap confidence limits [ĥ^l_{j,k}, ĥ^u_{j,k}]. If the confidence upper bound ĥ^u_{j,k} is close to one, this implies that we are confident that the true value of h_{j,k} is one, since the confidence region contains no other possible hop count value. Therefore, we would very likely infer that there is a link in the topology between routers j and k. On the other hand, if the confidence upper bound is much larger than one, this implies that we are not confident that the true value of h_{j,k} is one, and therefore no link likely exists between core routers j and k. This intuition gives rise to a thresholding methodology for finding the bootstrap thresholding adjacency matrix A^(boot), where an unseen adjacency (core link) is inferred to exist if the confidence upper bound ĥ^u_{j,k} is below some value λ, and assumed not to exist otherwise:

    A^(boot)_{j,k} = 1  if ĥ^u_{j,k} < λ,
                     0  otherwise        (4.5)

The complete bootstrap thresholding methodology is described in Algorithm 4.

Figure 4.3: Percentage of total links correctly classified plotted against the threshold on the confidence upper bound (λ), for both the bootstrap upper bound estimate and the hop count estimate.
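A compact sketch of this bootstrap thresholding (the full procedure is given in Algorithm 4 below). It assumes a matrix-completion routine such as the soft_impute sketch shown earlier; the subsample fraction and the 40 replicates follow the text, while everything else is illustrative.

    import numpy as np

    def bootstrap_link_classify(H_obs, mask, complete_fn, lam=2.0, reps=40, keep=0.95, seed=0):
        # complete_fn(H, mask) -> full estimated hop matrix (e.g., a matrix completion solver).
        rng = np.random.default_rng(seed)
        obs_idx = np.argwhere(mask == 1)
        estimates = []
        for _ in range(reps):
            sub = rng.choice(len(obs_idx), size=int(keep * len(obs_idx)), replace=False)
            sub_mask = np.zeros_like(mask)
            rows, cols = obs_idx[sub].T
            sub_mask[rows, cols] = 1                        # keep a random 95% of the observations
            estimates.append(complete_fn(H_obs * sub_mask, sub_mask))
        est = np.sort(np.stack(estimates), axis=0)          # sort the 40 estimates of each entry
        upper = est[-2]                                      # second largest = empirical 95% upper bound
        links = (upper < lam) & (mask == 0)                  # declare unseen links where the bound is tight
        return links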

Experimental Performance of Unseen Link Inference

Using the probing set of 10,726 core routers found between Planetlab nodes, we tested the performance of our unseen link classification methodology. After holding out 500,000 randomly chosen hop count values (containing 6,116 links and 493,884 non-links), bootstrap estimation was repeated 40 times and the unseen core link estimation methodology was tested. This behavior can be seen in Figure 4.3, which shows the percentage of correctly classified links (out of all the links classified) against the threshold value for both the bootstrap thresholding methodology and the hop count thresholding methodology. The results show that the bootstrap thresholding methodology is more accurate at classifying true unseen links than the simpler hop count thresholding methodology, with almost 10% more edges classified correctly for threshold values λ < 2.

For the bootstrap thresholding methodology, the results in Figure 4.3 indicate that link classification performance is good for λ <= 2, with a majority (roughly 70%) of identified links being true links, and that performance degrades for λ > 2. The threshold λ <= 2 can be interpreted as follows: given the observed data, we are confident that there are no intermediate hops between pairs of core routers classified as directly linked. For λ > 2, meaning that the 95% confidence bound contains possible hop values greater than one, the performance of the classifier degrades significantly as λ is increased. Note the sharp knee in the curve at λ = 2, indicating that immediately past this point the fraction of correctly identified links drops off precipitously. Therefore, the logical choice for the threshold under the bootstrap methodology is λ = 2.

The results for this unseen link classification algorithm on our 10,726 core router dataset are shown in Table 4.4 for multiple values of the threshold λ under the two thresholding algorithms. As seen in the table, using the chosen threshold (λ = 2), the bootstrap methodology discovers 1,315 (35.9%) of the true links in the topology, with only 37.4% of the classified links being false alarms. In comparison, the hop count thresholding method finds a greater number of true links (3,388), but more than half of the links it classifies (50.7%) are false links. This significantly higher accuracy motivates the use of the bootstrap thresholding methodology over the simple hop estimate thresholding methodology.

Algorithm 4 - Unseen Link Estimation Algorithm - Bootstrap Thresholding

1. Obtain 40 subsampled versions of the observed hop count matrix, H^(i)_B (i = {1, 2, ..., 40}), where each matrix contains a randomly chosen 95% of the total observed elements.

2. Perform Matrix Completion on each of the subsampled hop count matrices, obtaining the estimated matrices Ĥ^(i)_B (i = {1, 2, ..., 40}).

3. For every unobserved hop count element h_{j,k}:
   (a) Sort the vector of bootstrap estimates {ĥ^(1)_{j,k}, ĥ^(2)_{j,k}, ..., ĥ^(40)_{j,k}} from smallest to largest, obtaining {ĥ^i(1)_{j,k}, ĥ^i(2)_{j,k}, ..., ĥ^i(40)_{j,k}} such that ĥ^i(m)_{j,k} <= ĥ^i(m+1)_{j,k}.
   (b) Eliminate the smallest and largest values (ĥ^i(1)_{j,k}, ĥ^i(40)_{j,k}) to obtain the 95% empirical bootstrap confidence bounds.
   (c) The remaining largest value is the bootstrap confidence upper bound, ĥ^u_{j,k} = ĥ^i(39)_{j,k}.
   (d) If ĥ^u_{j,k} < λ, then declare a link between core routers j and k. Otherwise, no link is declared.

Table 4.4: Performance of the Unseen Link Classification Algorithm with various threshold values λ, for both the hop count thresholding and the bootstrap thresholding methodologies.

                                                      Hop Count Thresholding      Bootstrap Thresholding
                                                      λ=1      λ=2      λ=3       λ=1      λ=2      λ=3
    Number of Classified True Links                   -        3,388    -         -        1,315    -
    Number of Classified False Links                  -        -        -         -        -        -
    Percentage of True Links out of Classified Links  61.0%    49.3%    30.7%     69.1%    62.6%    37.3%
    Percentage of False Links out of Classified Links 39.1%    50.7%    69.3%     30.9%    37.4%    62.7%

4.5 Adaptively Targeted Core Probing

Given the unseen core router and unseen core link estimation procedures described above, we now focus on combining these two algorithms in order to illuminate unseen areas of the Internet. Our approach is to use our previous techniques to identify areas of the network to target for further probing. Specifically, consider the problem of trying to discover the core of the Internet using a series of traceroute probes between a set of possible traceroute end host sources, S = {s_1, s_2, ..., s_N}, and a set of possible traceroute end host destinations, D = {d_1, d_2, ..., d_M}. Each traceroute probe between source s_i and destination d_j can be denoted by the subgraph G_{i,j}.

Therefore, the complete router topology that is visible from this full set of sources and destinations can be represented as

    G = ∪_{i in S} ∪_{j in D} G_{i,j},        (4.6)

i.e., the union of all the traceroute paths from elements in the source set to elements in the destination set. The key problem with this idea is that using all possible sources and all possible destinations requires performing all N M possible traceroute probes, which may be infeasible when N or M is very large. There are also likely to be redundant probes, where particular G_{i,j} subgraphs contribute little or no new information about the full graph G because a majority of their vertices and edges have already been observed along a previous path. This realization motivates the following modification of the problem: given a set of possible traceroute end host sources and a set of possible traceroute end host destinations, our goal is to use the fewest number of source-destination probes to further expose the core topology. In our methodology, exposing the core topology is defined as finding a significant percentage of the unique core routers and a significant percentage of the unique core router edges observed when using the entire source set S and the entire destination set D.

Naive Random Selection Probing

If we make no assumptions or inferences about the utility of probing each source-destination pair, then the only methodology available is to probe by arbitrarily selecting source-destination pairs. We consider this method as the baseline against which to compare the performance of our targeted methodology.

Unseen Target Estimation Probing Algorithm

Given the methodologies for unseen router estimation in Section 4.3 and the techniques for unseen edge estimation in Section 4.4, we can combine these estimates to inform the user as to which source-

destination pair will reveal the most information about the core.

Unseen Species Selection

Consider the case where we have already probed some subset of destinations from each traceroute source in the network. Therefore, for each source S_i, we have a source subgraph GS_i consisting of the set of routers and links found by probing from source S_i to the selected destinations. The current combined probing subgraph is

    G = ∪_{i=1,2,...,k} GS_i

Which source-destination probe pair will return the largest number of unique core routers with respect to the combined probing subgraph G? Given the limited information about the core topology in the combined subgraph G, we do not have enough data to make an informed decision on this problem directly. A problem that can be solved is the slightly modified problem of finding which source will return the largest number of unique core routers with respect to its own source subgraph GS_i. Reframing the problem in this manner makes it very similar to the unseen species problem of Section 4.3. Given a source subgraph GS_i, we estimate the number of unique unseen routers that would be observed given a fractional increase in the number of destinations in this source subgraph (analogous to probing one additional destination from this source). We argue that the source subgraph that is estimated to have the largest increase in the number of observed unique core routers should be considered the best choice for increasing the number of unique core routers in the combined probing subgraph G.

Matrix Completion Selection Algorithm

The unseen species method selects the best source to probe from, but which destination should be chosen? Another possible probing strategy arises from the unseen core connectivity estimation techniques in Section 4.4. Consider the N x M probe hop count matrix H representing the hop distances between the traceroute end host sources S and the traceroute end host destinations D.

From the traceroute probes, we receive the number of hops between the probed sources and destinations, thus filling in the corresponding elements of the probe hop count matrix (with the unprobed pairs having unknown hop count values). Instead of estimating the size of the change in the subgraph (as in the Unseen Species method above), we send probes based on determining which unobserved source-destination pair has the most uncertainty in the probe hop count matrix with respect to the observed hop counts. To determine the uncertainty of the hop counts for each of the missing source-destination pairs, we perform K-fold cross-validation (CV) [57] on the observed hop count values, similar to the holdout methodology used earlier in Section 4.4. Using K-fold CV, the Matrix Completion algorithm is performed K times on the incomplete probe hop count matrix H, yielding K estimates for each unknown source-destination hop count element ({h^(1)_{i,j}, h^(2)_{i,j}, ..., h^(K)_{i,j}}). These collections of estimates are used to obtain the variance of each unobserved hop count estimate. Intuitively, if the variance of the hop count estimate is low, it implies that, given the current observed hop counts, we have a good idea of the topology between the selected source-destination pair. Conversely, if the variance of the hop count estimate is high, it implies that we do not know very much about the topology between the specific source and destination, making it a candidate for probing.

Unseen Target Estimation Probing Algorithm

Notice that the Unseen Species Selection algorithm has a potential flaw. Since it finds the optimal source to probe from (based on the criterion of increasing the size of the source subgraph), it neglects finding an ideal destination for that source to probe. Also note that the Matrix Completion method makes no attempt to ensure that the particular source-destination pairs probed will increase the size of the probed subgraph. Recognizing the flaws in both approaches, we can combine the two previous methods, Matrix Completion Selection and Unseen Species Selection, to form the Unseen Target Estimation method, which offers a best-of-both-worlds solution. To perform the Unseen Target Estimation method, we simply use the Unseen Species selection method to find the best source to probe from (given the prior set of source subgraphs GS_i), and then use the Matrix Completion selection methodology to find the destination for that optimal source that has the

highest uncertainty (average variance) under K-fold cross-validation and the Matrix Completion algorithm. An outline of the approach is shown in Algorithm 5.

Algorithm 5 - Unseen Target Estimation Probing Algorithm

Initialize: From every source, randomly probe some number of destinations. Fill in the observed source-destination pair elements of the probe hop count matrix H.

Main Body:
1. Using Equation 4.1, find î = argmax_i r_i, the source that the unseen species estimator predicts will find the most unseen routers given an increase in probing.
2. Using K-fold cross-validation and Matrix Completion, find the K cross-validation estimates for each destination j of source î, {h^(1)_{î,j}, h^(2)_{î,j}, ..., h^(K)_{î,j}}.
3. Find the destination for source î with the highest cross-validation variance,

   ĵ = argmax_j var({h^(1)_{î,j}, h^(2)_{î,j}, ..., h^(K)_{î,j}})        (4.7)

4. Probe the chosen source-destination pair (î, ĵ), adding the observed hop count value h_{î,ĵ} to the hop count matrix H.
5. If there are still more source-destination pairs to probe, go to 1.
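The selection loop of Algorithm 5 is straightforward to prototype. The sketch below (Python, assuming NumPy) is a minimal illustration of steps 1-5 rather than the measurement system itself; unseen_species_estimate, complete_matrix, and send_probe are placeholders for the unseen species estimator of Equation 4.1, the matrix completion routine of Section 4.4, and an actual traceroute probe, respectively.

```python
import numpy as np

def cv_variance(H, observed_mask, i_hat, complete_matrix, K=5, rng=None):
    """K-fold cross-validation variance of the imputed hop counts in row i_hat.

    The observed entries are split into K folds; each fold is held out in turn
    and the matrix is re-completed, giving K estimates of every missing entry.
    """
    rng = rng or np.random.default_rng(0)
    obs_idx = np.argwhere(observed_mask)
    folds = rng.permutation(len(obs_idx)) % K
    estimates = []                                   # K completed copies of row i_hat
    for k in range(K):
        mask_k = observed_mask.copy()
        held_out = obs_idx[folds == k]
        mask_k[held_out[:, 0], held_out[:, 1]] = False
        H_hat = complete_matrix(H, mask_k)           # placeholder completion routine
        estimates.append(H_hat[i_hat, :])
    var = np.var(np.vstack(estimates), axis=0)
    var[observed_mask[i_hat, :]] = -np.inf           # never re-probe an observed pair
    return var

def unseen_target_probing(H, observed_mask, budget,
                          unseen_species_estimate, complete_matrix, send_probe):
    """Sketch of Algorithm 5: alternate source selection (unseen species)
    and destination selection (matrix completion CV variance)."""
    for _ in range(budget):
        # Step 1: source predicted to reveal the most unseen routers (Eq. 4.1).
        r = np.array([unseen_species_estimate(i) for i in range(H.shape[0])])
        i_hat = int(np.argmax(r))
        # Steps 2-3: destination with the largest cross-validation variance (Eq. 4.7).
        var = cv_variance(H, observed_mask, i_hat, complete_matrix)
        j_hat = int(np.argmax(var))
        # Step 4: probe the chosen pair and record the observed hop count.
        H[i_hat, j_hat] = send_probe(i_hat, j_hat)
        observed_mask[i_hat, j_hat] = True
    return H, observed_mask
```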

Targeted Probing Experiments

Using the 216 active PlanetLab nodes as sources, we sent traceroute probes to a destination set of 360 PlanetLab nodes. These measurements identified a set of 10,276 core routers with 34,859 links between them. To initialize the probing algorithm, we performed five traceroute probes to randomly selected destinations from each PlanetLab node in the source set. Using the two methodologies for selecting source-destination pairs to probe (Unseen Target Estimation and random selection), we examine the performance of the probing methodologies on discovering the unseen core topology with further probing. In this analysis, we compare performance by considering the number of unique core routers and unique core edges found by each probing methodology.

Figure 4.4 shows the results with respect to the number of traceroute probes needed to find previously unseen core routers and unseen core links under the two probing methodologies. For both routers and links, the Unseen Target probing methodology uses roughly half (50%) of the number of source-destination pair probes required by the random methodology to observe the same number of core routers/links. This suggests that the targeted probing methodology correctly selects areas of the network about which the structure is uncertain.

Figure 4.4: (Left) - Number of additional unique core routers found using the two probing techniques. (Right) - Number of additional unique core links found using the two probing techniques.

4.6 Summary

In this chapter, we addressed the unseen core problem by developing a novel unseen discovery inference methodology. Our methodology for inferring unseen core topology has four components. The first estimates unseen core nodes using a technique adapted from the unseen species literature. The second estimates unseen route lengths between observed core nodes using a matrix completion technique. The third uses a combined matrix completion and statistical bootstrapping method to estimate unseen core links. The fourth merges the methodologies of unseen

111 87 core node estimation and unseen core link estimation to develop a targeted probing methodology to efficiently reveal unseen areas of the Internet. We demonstrated the capabilities of our methodology using traceroute datasets collected in PlanetLab and by the iplane project [100]. We show that our unseen core node technique estimates the number of additional core nodes found given increased probing with only a 13% deviation from the actual observed number. We also show that a matrix completion algorithm is able to estimate over 60% of the core links within one hop of actual and roughly 82% of the core links within two hops of their actual value. We further develop an unseen core link classification algorithm, which finds over 35% of the true unseen core links with limited false alarm links. We then validated the performance of both unseen router and unseen edge estimation methodologies by merging the techniques in a targeted probing methodology that uses roughly 50% fewer source-destination probes than a naive random source-destination probing procedure.

112 88 Chapter 5 Toward the Practical Use of Network Tomography While the previous chapters have focused on TTL-based measurements to discover Internet topology, there are alternative methodologies available. One technique that has shown promise is tomographic inference of router-level topology using end-to-end measurements of packet delay or loss. Initial work on network tomography methodologies focused on the use of multicast measurements [7, 6]. While multicast inference is attractive due to total number of probes necessary (probing complexity) is O (N) (where N is the number of end hosts in the topology), the extremely limited deployment of open, multicast-enabled nodes renders these techniques impractical for a wide-scale topology study of the Internet. More recent work has focused on network tomography using unicast probes [8, 34], however these techniques are also impractical due to the quadratic number of probes (O ( N 2) ) needed to resolve the topology. Many unicast tomography techniques also require significant, coordinated measurement infrastructure. The tomographic technique we will focus on for this chapter is Network Radar [105]. Network Radar uses round trip time (RTT) measurements as the basis for topology inference and was developed as an attempt to obviate the need for significant coordinated measurement infrastructure. Consider the simple logical topology in Figure 5.1 with two back-to-back packets sent from end host a, one with destination b and the other with destination c. Both packets originating from end

113 89 host a will encounter the same path until router R. It can be assumed that any delays encountered before router R induced by router queuing delays will cause highly correlated delays for both backto-back packets (due to both packets being in the same router queues). Assuming that any delays encountered between the two packets past router R are uncorrelated, then the level of covariance between the RTT delays found from a series of back-to-back packets (cov (d b, d c )) will inform us to the amount of shared logical topology between paths {a, b} and {a, c}. Figure 5.1: Example of Network Radar on simple logical topology. The main contributions of this chapter are advancements in RTT-based tomography capabilities such that effective and efficient router-level topology discovery is possible. We exploit the idea of arranging end host targets for RTT probes in a Depth-First Search (DFS) Order. For a collection of end hosts in a tree topology, any of the non-unique ordinal lists found from a depth-first search on the end hosts (leaf nodes) of a tree structure can be defined as a DFS ordering. We will show how a DFS ordering clusters target end hosts based on the amount of shared infrastructure. Given this shared infrastructure clustering, we will demonstrate how the resulting covariance matrix has a special structure, and by exploiting this special covariance matrix structure the number of delaybased probes used to resolve the logical topology of a balanced l-ary tree can be reduced from the current state-of-the-art tomography probing methodology [54] by over a factor of 2 using our new DFS Ordering-based methodology. Our resulting probing complexity is the current lowest probing complexity for any developed unicast tomography algorithm. This reduction in the number of probes is an important step towards unicast network tomography being considered a feasible and practical topology discovery mechanism.
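The Network Radar primitive described above is simple to express in code. The sketch below (Python/NumPy) is a minimal illustration rather than the measurement tool itself: it takes paired back-to-back RTT samples to two destinations and returns their sample covariance, which under the assumptions above estimates the delay variance of the shared portion of the two paths. The synthetic delays in the usage example are assumptions for illustration only.

```python
import numpy as np

def shared_path_covariance(rtt_b, rtt_c):
    """Estimate shared-path delay variance from back-to-back RTT samples.

    rtt_b[k] and rtt_c[k] are the round-trip times of the k-th back-to-back
    probe pair sent from the common source toward destinations b and c.
    Queuing delay on the shared prefix of the two paths affects both packets
    of a pair, so the sample covariance of the two RTT series estimates the
    delay variance induced by the shared routers (Figure 5.1).
    """
    rtt_b = np.asarray(rtt_b, dtype=float)
    rtt_c = np.asarray(rtt_c, dtype=float)
    if rtt_b.shape != rtt_c.shape:
        raise ValueError("back-to-back samples must come in pairs")
    # np.cov returns the 2x2 sample covariance matrix; the off-diagonal
    # entry is cov(d_b, d_c).
    return np.cov(rtt_b, rtt_c)[0, 1]

# Toy usage with synthetic delays: a shared queue plus independent per-path noise.
rng = np.random.default_rng(1)
shared = rng.exponential(2.0, size=1000)          # delay on the shared routers
d_b = shared + rng.exponential(1.0, size=1000)    # independent remainder toward b
d_c = shared + rng.exponential(1.5, size=1000)    # independent remainder toward c
print(shared_path_covariance(d_b, d_c))           # close to var(shared) = 4.0
```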

114 Depth-First Search (DFS) Order The foundation for the work in this chapter is the idea of Depth-First Search (DFS) Ordering. A depth-first search (DFS) is a tree search that starts at the tree root and progresses down the tree labeling each node and backtracking only when a node has been explored fully (e.g., every child of that node has been labeled). We will formally define a DFS Ordering as any ordinal list of the end hosts (which will be considered the leaf nodes of the logical routing tree) that would satisfy the ordering found by a depth-first search of the logical tree structure ignoring the labeling of the internal nodes of the tree. In previous literature, this was considered a topological sort [55] on the leaf nodes of a tree structure. Figure 5.2: Example simple logical topology in a proper DFS Order. For the tree structure in Figure 5.2, we can find the following valid DFS orderings all of which would satisfy a depth-first search on the tree topology: {a, b, c, d} {a, b, d, c} {b, a, c, d} {b, a, d, c} {c, d, a, b} {d, c, a, b} {c, d, b, a} {d, c, b, a} There are also many possible end host orderings that would violate a DFS ordering property of the tree. For example, the ordering {a, c, d, b} does not satisfy a depth-first search of the end hosts. The power of considering a depth-first search can be seen when examining the shared logical path matrix S, where S i,j = the number of logical routers shared between end hosts x i and x j in the paths from the root node to the two end hosts. For a proper DFS ordering of the topology in

Figure 5.2 ({a, b, c, d}), the shared path matrix S_proper is indexed by the end hosts in the order (a, b, c, d). Its rows are non-increasing as one moves away from the diagonal: end hosts that are adjacent in the ordering (a and b, or c and d) share more of their root-to-leaf paths than end hosts that are far apart in the ordering. For an improper DFS ordering ({a, c, d, b}), the out-of-order shared path matrix S_improper (indexed by a, c, d, b) no longer has this monotone structure.

Using the intuition from this small example, we can state the following proposition, which shows how a depth-first search ordering clusters end hosts by their degree of shared infrastructure.

Proposition 5.1.1. Given the set of end hosts {x_1, x_2, ..., x_N} in a proper DFS ordering, the resulting shared path matrix S has the following structure:

S_{i,i+j} ≥ S_{i,i+k}  for 0 ≤ j ≤ k.

Proof. Suppose the end hosts are in a proper DFS ordering but S_{i,i+j} < S_{i,i+k} for some 0 ≤ j ≤ k. This says that end hosts x_i and x_{i+k} have more shared infrastructure (a longer shared path from the root) than x_i and x_{i+j}; that is, x_i and x_{i+k} branch apart at a deeper level of the tree (a level with more shared infrastructure) than the level at which x_{i+j} branches away from x_i. But then a depth-first search would encounter x_{i+k} before x_{i+j}, which requires j > k and contradicts the assumption that 0 ≤ j ≤ k. Therefore, if the end hosts are in a proper DFS order, Proposition 5.1.1 must hold.
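A small script makes the property in Proposition 5.1.1 concrete. The sketch below (Python) is an illustrative check rather than part of the measurement system: it builds the shared path matrix S for a toy logical tree resembling Figure 5.2 and verifies that a proper DFS ordering yields the monotone row structure while an improper ordering does not. The specific tree and router labels are assumptions chosen for illustration.

```python
# Toy logical tree: root R0 with children R1 (parent of a, b) and R2 (parent of c, d).
# Each path lists the logical routers from the root down to the end host.
PATHS = {
    "a": ["R0", "R1"], "b": ["R0", "R1"],
    "c": ["R0", "R2"], "d": ["R0", "R2"],
}

def shared_path_matrix(order):
    """S[i][j] = number of logical routers shared by the root-to-host paths."""
    def shared(u, v):
        p, q = PATHS[u], PATHS[v]
        n = 0
        while n < min(len(p), len(q)) and p[n] == q[n]:
            n += 1
        return n
    return [[shared(u, v) for v in order] for u in order]

def is_dfs_ordered(S):
    """Check Proposition 5.1.1: S[i][i+j] >= S[i][i+k] whenever 0 <= j <= k."""
    N = len(S)
    return all(S[i][i + j] >= S[i][i + k]
               for i in range(N)
               for j in range(N - i)
               for k in range(j, N - i))

print(is_dfs_ordered(shared_path_matrix(["a", "b", "c", "d"])))  # True  (proper order)
print(is_dfs_ordered(shared_path_matrix(["a", "c", "d", "b"])))  # False (improper order)
```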

5.2 Logical Topology Discovery using DFS Ordering

Assume that all of the end hosts in an unknown topology are already in a proper DFS order. Given this ordering, we look to estimate the logical topology. Define the unknown logical tree structure T = {V, E}, and the router node path for end host x_i as p^(i) = {v^(i)_1, v^(i)_2, ...} ⊆ V, the set of nodes from the root of the tree to end host x_i. We denote the round-trip-time (RTT) delay variance along a single path as the sum of the delay variances of the router nodes along the path from the tree root to the end host,

σ²_i = σ²(p^(i)) = Σ_{j=1}^{|p^(i)|} σ²(v^(i)_j)        (5.1)

where σ²(v) is the delay variance induced by router v ∈ V. Using previous work on Network Radar [105], the covariance between two end hosts x_i, x_j,

σ²_{i,j} = cov(p^(i), p^(j)) = σ²(p^(i) ∩ p^(j))        (5.2)

is equivalent to the delay variance of the shared path from the root node to the two end hosts. Using the Network Radar probing technique, we can obtain the delay covariance between any two end hosts x_i, x_j. We define the covariance matrix Σ such that Σ_{i,j} = cov(x_i, x_j) = σ²_{i,j}.

The first question we set about answering is: is there any inherent structure to a depth-first search ordered covariance matrix that can be exploited to efficiently estimate the logical topology? Using the ordering result of Proposition 5.1.1, we can state that the covariance matrix Σ has structure similar to the shared path matrix S.

Proposition 5.2.1. Given the set of end hosts {x_1, x_2, ..., x_N} in a proper DFS ordering, the

117 93 covariance matrix Σ will have the following property: σ 2 i,i+j σ 2 i,i+k : for 0 j k Given Proposition and the fact that every router will induce positive delay variance, it is trivial to see this property of covariance matrix Σ. The Hierarchical Clustering algorithm [7, 6] showed that in order to reconstruct the tree topology, the only information needed for each end host is the knowledge of which other end host, out of all the other end hosts, this end host has the most shared topology with. This is equivalent to finding the end host with the largest covariance magnitude. Unfortunately, to acquire this knowledge, it was previously necessary to obtain all possible covariance values. Given the end host DFS ordering assumption and the ordered covariance matrix structure as specified in Proposition 5.2.1, we will state that the only covariance values necessary to infer the logical topology will be (for each end host x i, with i = {1, 2,..., N}) the value of the immediately preceding end host covariance (σi 1,i 2 ) and the immediately successive end host covariance (σ2 i,i+1 ). This is due to the proposition stating that the covariance σ 2 i,i+1 σ2 i,i+j for any j > 1. Therefore, end host x i will share the most infrastructure in the topology with either x i+1 or x i 1. In order to reconstruct the tree topology, only the covariance values associated with these two pairs of end hosts, x i, x i 1 and x i, x i+1 are needed. The magnitude of these two covariance values (σi 1,i 2, σ2 i,i+1 ) will directly inform us as to the structure of the logical topology. In order to distinguish between covariance differences caused by differences in topology and covariance differences induced by noise, we introduce the value δ here to denote the smallest possible delay covariance induced by a router in the topology. 1 We can now state the following proposition: Proposition Using the set of end hosts in a proper DFS Order, only N 1 pair probes (the covariance values σ 2 i,i+1 for i = {1, 2,..., N 1}) are needed to reconstruct the unknown logical topology. Proof. We will now show how every covariance magnitude combination will inform our reconstruc- 1 In a real-world experiment, we could find this term by cross-validation on a partition of known infrastructure. Assume here that it is known a priori.

118 94 tion of the logical topology. Each of the following cases can be found in Algorithm 6. In constructing the tree topology, we will denote f (x i ) as the assigned parent node of end host x i Case A - σi,i 1 2 σi 1,i 2 2 < δ Given a non-significant magnitude difference between the two covariance values, this implies that the shared path between the pairs {x i, x i 1 } and {x i 1, x i 2 } are the exact same set of routers. Therefore, as seen in Figure 5.3-(A), we can infer that the last logical hop is shared for the set of end hosts {x i, x i 1, x i 2 }. This assigns f (x i ) = f (x i 1 ), the parent node of the current end host f (x i ) is the same as the parent of the previous end host f (x i 1 ). (A) (B) Figure 5.3: (A) - Case A - σi,i 1 2 σ2 i 1,i 2 < δ. The current end host x i is attached to the parent of x i 1. (B) - Case B - σi,i 1 2 σ2 i 1,i 2 + δ. A new router r i is created with children x i, x i 1 with parent f (x i 2 ) Case B - σi,i 1 2 σi 1,i δ This implies that there is more shared topology between the end host pair {x i, x i 1 } than the pair {x i 1, x i 2 }. We will then insert a new interior logical router node, r i, with children {x i, x i 1 } (this assigns f (x i ) = f (x i 1 ) = r i ), with the new router r i having the same parent as x i 2 (thereby assigning f (r i ) = f (x i 2 )). The covariance value associated with the shared path to the new logical node, σr 2 i must be recorded for future reference. An example of this structure can be see in Figure 5.3-(B).

119 Case C - σi,i δ < σi 1,i 2 2 In the case that the current covariance pair, σi,i 1 2, is less than the previous covariance pair, σ2 i 1,i 2, this implies that end host x i attaches to a logical router at some point in the topology higher in the tree than the current parent router (f (x i 1 )) attached to end host x i 1. But which logical router should it attach to? To find this router, we must traverse the current logical path from x i 1 to the root node (the set of nodes {f (x i 1 ), f (f (x i 1 )), f (f (f (x i 1 ))),...}) and discover the farthest logical router (r ) from the end host (i.e., the router on the path closest to the root node) that has recorded covariance greater than or equal to the current covariance pair σ 2 r σ2 i,i 1. Once this logical router r is found, one of the following cases will occur. Case C-1 - σr 2 σ2 i,i 1 < δ There is a non-significant difference between the covariance of the found router and the observed covariance. Simply assign the current end host x i as having the parent r (f (x i ) = r ) as seen in Figure 5.4-(A). (A) (B) Figure 5.4: (A) - Case C-1 - σr 2 σ2 i,i 1 < δ. The current end host (x i ) is attached to router r. (B) - Case C-2 - σr 2 < σ2 i,i 1 + δ. A new router r i is attached on the path between routers r and f (r ).

120 96 Case C-2 - σr 2 σ2 i,i 1 + δ This implies that there was a previously unseen logical router on the path between the router r and its parent node f (r ). We must add a new logical router r i between these two nodes, such that, f (r i ) = f (r ) and f (r ) = r i. Finally, we attach the current end host x i to the new router r i, setting f (x i ) = r i. An example of this is seen in Figure 5.4-(B). 5.3 Depth-First Search Ordering Estimation The major problem with the methodology in Section 5.2 is that it is based around the assumption that the end hosts are already correctly arranged in a proper depth-first search order. In any non-trivial problem, this ordering will not be known. Instead, given no a priori knowledge of the topology, we must estimate a proper DFS Ordering from targeted measurements. But how can we infer this ordering using as few targeted probes as possible? Given a random ordering of the set of end hosts, consider choosing a single end host (x 1 ) and obtaining the delay covariance between this end host and all other end hosts in the set (= {σ1,2 2, σ2 1,3, σ2 1,4,..., σ2 1,N }). Some end hosts will have very high delay covariance, while others will have significantly less shared infrastructure with the chosen end host and therefore have low delay covariance. Consider sorting these obtained covariance values, this would place the end hosts that have more shared infrastructure at one end of the list, and the end hosts with little shared infrastructure at the other end of the list. Can this be considered a proper DFS Ordering? No, as seen in Figure 5.5, a significant fraction of the end hosts will have the same observed delay covariance (in this case, σa 2 ) when compared against the chosen end host. While the end hosts with the same covariance values will be clustered together in this ordering, a proper DFS order inside this cluster is unknown using only this single vantage point. This implies that delay covariances from more than a single end host vantage point will be required to correctly order the entire set of end hosts. But how many vantage points will be required? For Figure 5.5, consider that any covariance will take one of three values { ( σ 2 A), ( σ 2 A + σ 2 B), ( σ 2 A + σ 2 B + σ2 C) }. Having correctly ordered the end hosts by these three val-

Algorithm 6 - Ordered Logical Topology Discovery Algorithm

Initialize:
Given set of end hosts (x_1, x_2, ..., x_N) in a proper DFS order with unknown logical topology.
δ = minimum possible delay covariance induced by a single router.
Initial reconstructed topology - T = (V, E). Initial set of nodes - V = {r_1, x_1, x_2}. Initial set of edges - E = {(r_1, x_1), (r_1, x_2)}.

Main Body: For i = {3, 4, ..., N}:

Add the new end host x_i to the set of nodes in the reconstructed topology, V = V ∪ {x_i}.
if |σ²_{i,i-1} − σ²_{i-1,i-2}| < δ then
    Assign the parent of x_{i-1} to also be the parent of the current end host x_i, E = E ∪ {(f(x_{i-1}), x_i)}.
else if σ²_{i,i-1} > σ²_{i-1,i-2} then
    Create new node r_i, V = V ∪ {r_i}. Set r_i as a child of the current parent f(x_{i-1}), E = E ∪ {(f(x_{i-1}), r_i)}.
    Remove the previous edge between x_{i-1} and its assigned parent, E = E \ {(f(x_{i-1}), x_{i-1})}.
    Assign r_i as the parent of the end hosts x_i, x_{i-1}, E = E ∪ {(r_i, x_i), (r_i, x_{i-1})}.
    Record the current covariance value for future reference, σ²_{r_i} = σ²_{i,i-1}.
else
    Find the ancestor router of x_{i-1} closest to the root (denoted r*) such that σ²_{r*} ≥ σ²_{i,i-1}.
    if |σ²_{r*} − σ²_{i,i-1}| < δ then
        Assign r* as the parent of x_i, E = E ∪ {(r*, x_i)}.
    else if σ²_{r*} > σ²_{i,i-1} then
        Create new node r_i, V = V ∪ {r_i}. Place r_i between router r* and its parent f(r*): first remove the current link, E = E \ {(f(r*), r*)}, then add the new edges, E = E ∪ {(f(r*), r_i), (r_i, r*), (r_i, x_i)}.
        Record the current covariance value for future reference, σ²_{r_i} = σ²_{i,i-1}.
    end if
end if
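For readers who prefer running code, the sketch below (Python) is a minimal re-implementation of Algorithm 6 under the stated assumptions: the end hosts arrive in a proper DFS order and cov(a, b) returns the noiseless delay covariance σ²_{a,b} of Equation 5.2. The tree is stored as a parent dictionary; the router labels, the toy covariance oracle, and the decision to record a covariance value for the initial router r_1 are illustrative choices rather than part of the original algorithm.

```python
import itertools

def ordered_topology_discovery(hosts, cov, delta):
    """Reconstruct the logical tree from covariances of DFS-ordered end hosts.

    hosts : list of end host identifiers in a proper DFS order
    cov   : cov(a, b) -> delay covariance between end hosts a and b (Eq. 5.2)
    delta : minimum delay covariance induced by a single router
    Returns (parent, router_cov): parent maps each node to its parent router,
    router_cov records sigma^2_r for every inserted router.
    """
    root = "r1"
    parent = {hosts[0]: root, hosts[1]: root, root: None}
    # Bookkeeping assumption: record cov(x_1, x_2) for r1 so the Case C
    # ancestor search below is self-contained.
    router_cov = {root: cov(hosts[0], hosts[1])}
    new_ids = ("r%d" % k for k in itertools.count(2))

    for i in range(2, len(hosts)):
        x_i, x_prev, x_prev2 = hosts[i], hosts[i - 1], hosts[i - 2]
        c_cur, c_old = cov(x_i, x_prev), cov(x_prev, x_prev2)

        if abs(c_cur - c_old) < delta:             # Case A: same parent as x_{i-1}
            parent[x_i] = parent[x_prev]
        elif c_cur > c_old:                        # Case B: new router below f(x_{i-1})
            r = next(new_ids)
            parent[r] = parent[x_prev]
            parent[x_prev] = parent[x_i] = r
            router_cov[r] = c_cur
        else:                                      # Case C: attach higher in the tree
            r_star = parent[x_prev]
            while parent[r_star] is not None and router_cov[parent[r_star]] >= c_cur:
                r_star = parent[r_star]            # farthest ancestor with cov >= c_cur
            if abs(router_cov[r_star] - c_cur) < delta:   # Case C-1: attach to r*
                parent[x_i] = r_star
            else:                                         # Case C-2: unseen router above r*
                r = next(new_ids)
                parent[r], parent[r_star], parent[x_i] = parent[r_star], r, r
                router_cov[r] = c_cur
    return parent, router_cov

# Toy usage: covariances implied by an assumed tree R0 -> {R1 -> {a, b}, R2 -> {c, d}}
# with unit delay variance at every router.
COV = {frozenset("ab"): 2.0, frozenset("cd"): 2.0}
toy_cov = lambda u, v: COV.get(frozenset(u + v), 1.0)
print(ordered_topology_discovery(list("abcd"), toy_cov, delta=0.5)[0])
```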

ues, we now desire to order the subclusters (e.g., what should the ordering be of all the end hosts with covariance σ²_A?). One could consider dividing the set of end hosts into constant-covariance clusters (e.g., all the end hosts with covariance σ²_A in one cluster, all the end hosts with covariance σ²_A + σ²_B in another cluster, etc.) and repeating this probing process for each cluster. This would be performed by taking a new intracluster vantage point and then reordering the intracluster end hosts based on their delay covariance with this new vantage point. While this methodology is workable in a noise-free environment, when noise is present the division of the end hosts into multiple clusters introduces multiple opportunities for cluster misclassification, thereby introducing error into the topology reconstruction.

Figure 5.5: Example of covariance values from a single end host not revealing the entire topology.

Instead, we use a recursive methodology that at each iteration bisects the ordered set of end hosts into only two clusters. This reduces our objective to the single problem of finding the correct end host at which to bisect the set at each iteration of the algorithm. The simplest approach, for a given value δ (where each router induces at least δ delay covariance), is to sort the covariance values and find all possible bisection candidate end hosts (denoted by the set I), where i ∈ I if the difference between the i-th and the (i+1)-th sorted covariance value is at least δ,

I = {i : σ²_{1,i+1} − σ²_{1,i} ≥ δ}

The bisection point is then the end host that causes the two bisected end host sets X_1 = [x_1, x_2, ..., x_{i*}] and X_2 = [x_{i*+1}, ..., x_N] to be closest in size to each other over all choices of

i ∈ I,

i* = argmin_{i ∈ I} | i − N/2 |        (5.3)

Using this intuition, we present Algorithm 7, which finds a proper DFS ordering for a set of end hosts using this recursive bisection methodology.

Algorithm 7 - Bisection DFS Ordering Algorithm - bisect(X, δ)

Given:
1. Unordered set of end hosts with unknown logical topology, X.
2. δ = minimum possible covariance induced by a single router.

Main Body:
1. Find Y such that Y_i = cov(X_1, X_i) = σ²_{1,i} for i = {2, 3, ..., |X|}.
2. Sort the covariance vector Y, obtaining the ordered index vector I.
3. Find I_δ = {i : Y(I(i+1)) − Y(I(i)) > δ}, the indices where the difference between consecutive sorted covariance values is greater than δ.
4. Bisect the set of sorted end hosts X at the index of I_δ that creates two sets most equal in size using Equation 5.3, creating sorted end host subsets X_1, X_2.
5. If |X_1| > 2, then find I_1 = bisect(X_1, δ).
6. If |X_2| > 2, then find I_2 = bisect(X_2, δ).
7. Reorder X_1 using the new indices I_1.
8. Reorder X_2 using the new indices I_2.
9. Return the final ordered list of indices I = [I_1 I_2].
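A compact recursive implementation of Algorithm 7 is sketched below (Python, assuming the same noiseless covariance oracle as in the Algorithm 6 sketch). It returns the end hosts directly in an estimated proper DFS order rather than an index list, which is an equivalent but slightly simpler interface than the pseudocode above; the choice to sort by decreasing covariance and keep the pivot first is one way to realize the ordering and is an assumption of this sketch.

```python
def bisect_dfs_order(hosts, cov, delta):
    """Recursively bisect `hosts` into a proper DFS order (Algorithm 7 sketch).

    hosts : list of end host identifiers (unordered)
    cov   : cov(a, b) -> delay covariance between end hosts a and b
    delta : minimum covariance induced by a single router
    """
    if len(hosts) <= 2:
        return list(hosts)
    pivot, rest = hosts[0], list(hosts[1:])
    # Steps 1-2: sort the remaining hosts by decreasing covariance with the pivot,
    # so the hosts sharing the most infrastructure with the pivot come first.
    rest.sort(key=lambda h: cov(pivot, h), reverse=True)
    ordered = [pivot] + rest
    y = [cov(pivot, h) for h in rest]
    # Step 3: candidate split points where consecutive covariances drop by more than delta.
    candidates = [i + 1 for i in range(len(y) - 1) if y[i] - y[i + 1] > delta]
    if not candidates:                 # all hosts share the same path with the pivot
        return ordered
    # Step 4 (Eq. 5.3): choose the candidate that splits the set most evenly.
    n = len(ordered)
    split = min(candidates, key=lambda c: abs((c + 1) - n / 2))
    # Steps 5-9: recurse on the two halves and concatenate the orderings.
    left, right = ordered[:split + 1], ordered[split + 1:]
    return bisect_dfs_order(left, cov, delta) + bisect_dfs_order(right, cov, delta)
```

With the toy covariance oracle from the Algorithm 6 sketch above, bisect_dfs_order(["c", "a", "d", "b"], toy_cov, 0.5) returns ['c', 'd', 'a', 'b'], one of the valid DFS orderings of the Figure 5.2 example.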

Proposition 5.3.1. Using Algorithm 7, the number of probes needed to correctly obtain a proper DFS ordering for a balanced l-ary tree (where each non-leaf node has l children) with N end hosts is upper bounded by p(l) N log_l N probe pairs (where p(l) is sublinear in l).

Proof. Consider the first step of Algorithm 7: end host x_1 is chosen and the covariance values are found between x_1 and x_2, x_3, ..., x_N. Given the l-ary balanced property of the tree, after sorting the covariance values, the first iteration of the algorithm divides the set of end hosts into a group of N/l end hosts and a group of (l−1)N/l end hosts, corresponding to the first branch on the first level of the tree, as seen in Figure 5.6-(Left). Consider further subdividing the set of (l−1)N/l end hosts: a random end host is chosen in the set and (l−1)N/l − 1 covariance measurements are taken. Our bisection algorithm then subpartitions into a group of N/l end hosts and a group of (l−2)N/l end hosts, again corresponding to the first level of the tree, as seen in Figure 5.6-(Right). In these initial steps of the algorithm, each iteration resolves a branch off the first level of the tree, clustering into l sets of N/l end hosts, each corresponding to a branch off the first level of the tree.

Figure 5.6: (Left) The first split taken on a balanced l-ary tree. (Right) The second split taken on a balanced l-ary tree. Both splits are indicated by the dotted line; the arrow indicates the randomly chosen end host against which the covariance values are measured.

After the tree has been divided past the first level, the problem reduces to ordering the l subtrees, each with N/l end hosts. Using this recursive property, we can state the number of probes needed for a balanced l-ary tree with N leaf nodes as f_l(N).

f_l(N) ≤ N + ((l−1)/l) N + ((l−2)/l) N + ... + (2/l) N + l f_l(N/l)
       = N ((l+1)/2 − 1/l) + l f_l(N/l)
   (a) = N p(l) + l f_l(N/l)
   (b) ≤ N p(l) + l { (N/l) p(l) + l f_l(N/l²) }
       = 2 N p(l) + l² f_l(N/l²)        (5.4)
       ⋮
   (c) ≤ p(l) N log_l N        (5.5)

where

p(l) := (l+1)/2 − 1/l        (5.6)

Once the ordering is found (using Algorithm 7), resolving the logical topology requires only an additional N pair probes (using Algorithm 6). By intelligent bookkeeping of the covariance values in Algorithm 7, it is possible to obtain these values without additional probing. It is then trivial to prove Proposition 5.3.2 for the total number of probes required by our new DFS Ordering algorithm. Therefore, using our new DFS Ordering methodology, we use roughly half the probes needed by the current state-of-the-art tomography approach in [54], which requires at most N l log_l N pair probes.

Proposition 5.3.2. Using the DFS Ordering Algorithm, the number of probes needed to correctly obtain the logical topology of a balanced l-ary tree (each non-leaf node has l children) with N end hosts is upper bounded by N p(l) log_l N (where p(l) = (l+1)/2 − 1/l is sublinear in l).
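The bound in Proposition 5.3.2 is easy to evaluate numerically. The short sketch below (Python) computes p(l) and compares the DFS Ordering bound p(l) N log_l N against the Sequential bound l N log_l N of [54]; the observation that the ratio p(l)/l peaks at l = 4 follows from this reconstruction of p(l) and matches the 56.25% figure quoted in Section 5.4.

```python
import math

def p(l):
    """Probing constant for a balanced l-ary tree (Eq. 5.6)."""
    return (l + 1) / 2.0 - 1.0 / l

def dfs_ordering_bound(N, l):
    """Upper bound on pair probes for the DFS Ordering algorithm (Prop. 5.3.2)."""
    return p(l) * N * math.log(N, l)

def sequential_bound(N, l):
    """Upper bound on pair probes for the Sequential algorithm of [54]."""
    return l * N * math.log(N, l)

for l in (2, 3, 4, 5, 8):
    ratio = p(l) / l
    print(f"l={l}:  p(l)={p(l):.3f}  DFS/Sequential <= {100 * ratio:.2f}%")
# The ratio p(l)/l is at most 56.25% (attained at l = 4).
```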

126 Experiments Prior Methods Hierarchical Clustering Consider having access to every pairwise covariance value for all N end hosts in the topology. Given complete knowledge of the covariance matrix, we would have knowledge of which set of end hosts have the largest covariance in the entire topology (within some margin δ), and hence, knowledge of which set of end hosts have the most shared infrastructure from the root node. For the Hierarchical Clustering algorithm [7, 6], at each step of the algorithm the current set of end hosts with the largest covariance are found, and a logical router is inserted connecting this set of end hosts together. The corresponding rows/columns in the covariance matrix for this set of end hosts are then merged together. This process is repeated until there are no rows/columns in the matrix left to merge. The main disadvantage to this methodology is that it requires knowledge of all N(N 1) 2 covariance values, this is effectively exhaustive probing of the network. Sequential Logical Topology What happens when acquiring all N(N 1) 2 covariance values is infeasible? Informed by the generic tree structure of the topology, the work in [54] shows that the number of probes needed to reconstruct the topology can be considerably reduced. This methodology depends upon sequentially building the tree topology for each end host. For a given end host, the delay covariance for this end host and all the nodes that are children of the root node are found. Given the child of the root node with the largest covariance (and thus the most shared topology), c i, the delay covariance is found between the end host and the children of the specified child (c i ). The covariance value (and margin δ) determines whether the end host is a sibling, child, or descendant of c i. This process is repeated until the leaf node with the largest delay covariance is found. On a balanced l-ary tree (a balance tree where each non-leaf node has l children), each end host requires at most llog l N pair probes, thus for the entire topology the number of probes needed is upper bounded by ln log l N. A comparison of the probing complexity for all three probing methodologies (hierarchical clus-

127 103 Table 5.1: Upper bound probing complexity for the three probing methodologies for balanced l-ary tree. (Where p (l) is sublinear in l) Methodology Probing Complexity 1 Hierarchical Clustering 2N (N 1) Sequential ln log l N DFS Ordering p (l)n log l N tering, sequential, DFS ordering) is seen in Table 5.1. The new DFS Ordering algorithm is found to have the smallest probing complexity upper bound of the three algorithms. In comparison with the Sequential Topology algorithm, from Equation 5.6 we can see that p(l) l for all choices of l, therefore the new DFS Ordering will use at most 56.25% of the pairs probes needed by the Sequential Topology algorithm Datasets Synthetic Dataset We will compare the performance of these tomographic techniques on synthetic topologies generated by Orbis [88]. Similar to the experiments in Chapter 3, the Orbis-generated synthetic networks enable us to analyze the capabilities of our methods with full ground truth and over a range of network and embedding sizes. For these experiments we will consider three different sized topologies, N = {768, 1497, 2261}. Real World Data To observe the performance of our algorithm on real-world topologies, we chose 9 DNS servers located at small-to-medium sized colleges in the New England geographic area. Using the DNS server addresses and traceroute probes we discovered the following logical tree topology in Figure 5.7 starting at the University of Wisconsin - Madison as the root node. Using the Network Radar methodology [105], the sample covariances were found between pairs of the 9 end hosts in the topology.

128 104 Figure 5.7: Real world topology used to test tomography methods Synthetic Noise-Free Experiments To test the performance of our algorithm in a noise-free environment, several different sized Orbis topologies were generated. With each topology, every node was assigned a random covariance value, with the estimated measured covariance being the sum of the random covariance values (with the smallest router covariance assigned, δ = 0.1) along the shared shortest path from the root node to the two end hosts under consideration. In Table 5.2, we present the resulting number of probes required to resolve the logical topology for both the new DFS Ordering methodology, the previous state-of-the-art Sequential Inference algorithm, and the exhaustive Hierarchical Clustering method. Due to this experiment being noise-free, all methodologies will perfectly reconstruct the topology. As seen in the table, the DFS Ordering methodology does significantly better than the exhaustive Hierarchical Clustering approach, with over 94% fewer probes needed to resolve the topology. In comparison with the more recent Sequential Inference algorithm, from these experiments it was seen that our new method requires on average 50% fewer probes to resolve the topology, matching the derived bounds in Table Real World Experiments Using the Network Radar methodology [105], we observed 1,000 back-to-back round-trip-time delay samples for every end host pair in our real-world topology (Figure 5.7). Due to imperfect round-

Table 5.2: Comparison of the number of probes needed to estimate the logical topology using synthetic Orbis topologies.

Number of        Hierarchical Algorithm   Sequential Algorithm                  DFS Ordering Algorithm
End Hosts (N)    Probe Pairs              Probe Pairs   % of Hierarchical       Probe Pairs   % of Hierarchical   % of Sequential
768              294,528                  38,···        ···                     16,···        ···                 44.09%
1,497            1,119,756                ···           ···                     63,···        ···                 49.34%
2,261            2,554,930                ···           ···                     135,···       ···                 55.85%

trip-time measurements and other delay noise, the sample covariance was found not to be perfectly correlated with the traceroute-observed shared path length. Therefore, for any estimation procedure based on the sample covariance, there will be potential errors in the reconstructed topology². In order to determine the accuracy of our estimated topologies, we must develop a metric that compares our estimated topologies to the ground-truth topologies.

Shortest Path Estimation

Consider a triple of end hosts {a, b, c} that exist in our estimated topology. From our estimated logical topology, we can predict whether there is a longer shared path between end hosts {a, b} or end hosts {a, c}. For the estimated logical topology T̂, these two estimated path lengths will be denoted P̂(a, b) and P̂(a, c), respectively; for the true topology, the two true path lengths will be denoted P(a, b) and P(a, c), respectively. The more accurate our estimated topology, the more often it will return the correct answer as to whether {a, b} has more shared infrastructure than {a, c}. The percentage of times we are correct on this problem will be denoted p. Over all possible triples in our set of end hosts X, this value can be found by

p = (1 / |X|³) Σ_{i∈X} Σ_{j∈X} Σ_{k∈X} f(i, j, k)

² This could be improved upon by taking more back-to-back sample probes or using a DAG card to obtain more accurate timing information, but here we focus on the case where neither improvement is available.

where f(i, j, k) = 1 if the reconstructed topology correctly classifies the triple {i, j, k} and f(i, j, k) = 0 if it classifies the triple incorrectly:

f(i, j, k) = I(P̂(i, j) ≥ P̂(i, k)) I(P(i, j) ≥ P(i, k)) + I(P̂(i, j) < P̂(i, k)) I(P(i, j) < P(i, k))

where I(x) = 1 if the condition x holds and I(x) = 0 otherwise. The baseline for any topology reconstruction algorithm is to outperform a naive random choice between the two possible triple outcomes ({a, b} sharing more than {a, c}, or {a, b} sharing less than or equal to {a, c}); this is roughly equivalent to p = 1/2 (given an equal distribution of the two cases in the topology). This value p is the metric we use to assess how accurate the estimated topologies are.

Results

The performance of the three algorithms is averaged over 2,000 random permutations of the end hosts. Both the Sequential algorithm and the new DFS Ordering algorithm are sensitive to the initial choice of end hosts; averaging over many random permutations eliminates any ordering bias from the results. All three algorithms have a tunable parameter, δ, that must be chosen. To give the two comparison methodologies (Sequential and Hierarchical Clustering) every possible advantage, the performance of both algorithms is shown for the best possible value of δ at each restricted level of probing. Meanwhile, our new DFS Ordering methodology uses a constant value of δ across all levels of probing.

For the real-world topology in Figure 5.7, the corresponding p values for the three topology reconstruction algorithms (DFS Ordering, Sequential, Hierarchical Clustering) are shown in Figure 5.8 versus a restricted total number of delay probes available. All three topology reconstruction techniques do significantly better than the naive random classification technique (with performance p = 1/2).
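The accuracy metric p is straightforward to compute when both the estimated and true topologies are available. The sketch below (Python) is a minimal illustration: shared_len is a hypothetical helper returning the shared root-to-host path length for whatever tree representation is in use, and triple_accuracy simply averages the indicator f(i, j, k) of the equation above over all ordered triples (O(|X|³), which is fine for the small host sets used here).

```python
from itertools import product

def shared_len(topology, a, b):
    """Number of logical routers shared by the root-to-a and root-to-b paths.

    `topology` maps each end host to its root-to-host router list; this is a
    stand-in for however the reconstructed and true trees are represented.
    """
    n = 0
    for u, v in zip(topology[a], topology[b]):
        if u != v:
            break
        n += 1
    return n

def triple_accuracy(estimated, truth, hosts):
    """Fraction of triples (i, j, k) whose shared-path ordering is classified
    correctly by the estimated topology (the metric p)."""
    correct = 0
    for i, j, k in product(hosts, repeat=3):
        est_ge = shared_len(estimated, i, j) >= shared_len(estimated, i, k)
        true_ge = shared_len(truth, i, j) >= shared_len(truth, i, k)
        correct += (est_ge == true_ge)          # the indicator f(i, j, k)
    return correct / float(len(hosts) ** 3)
```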

As shown in the figure, the new DFS Ordering algorithm reconstructs the real-world topology with the highest accuracy for a restricted number of tomography probes, with improvements of several percentage points of accuracy in classifying shared path lengths over the full range of probing budgets. Given the probing complexity in Table 5.1, it is very likely that the accuracy improvements of the new DFS Ordering algorithm will grow as the size of the topology increases.

Figure 5.8: Topology reconstruction results for the three algorithms (DFS Ordering, Sequential, and Hierarchical Clustering).

5.5 Summary

One method for generating router-level Internet topology maps is the application of tomographic inference to network delay measurements. While delay-based tomography for topology discovery has been examined in the past, it has yet to be widely used in practice due to its own set of limitations. In this chapter, we addressed the shortcomings of RTT measurement-based network tomography for discovering Internet logical topology, namely the requirement of an impractical quadratic number of probes to resolve the topology. Here we described algorithms that considerably reduce the number of probes needed to resolve the logical topology. The ability to reduce the number of probes relies on exploiting the idea of a Depth-First Search (DFS) ordering of the end hosts. We analyzed the capabilities of our algorithms on a set of large-scale synthetically generated topologies. The experiments on these topologies show improvements of over 94% fewer probes compared with an exhaustive methodology, and roughly 50% fewer probes compared with the current state-of-the-art. Results from a small-scale real-world Internet experiment further validate the performance of our algorithms. This significant reduction in the number of probes needed opens

delay-based tomographic topology discovery techniques to new, large-scale avenues of application.

133 109 Chapter 6 Active Clustering of Hierarchical Data Hierarchical clustering based on pairwise similarities arises routinely in a wide variety of scientific problems. These problems range from inferring gene behavior from microarray data [106] to resolving the router-level topology in the Internet (see Chapter 5). It is often the case that there is a significant cost associated with obtaining each similarity value. For example, in the case of Internet topology inference the determination of each similarity value requires many probe packets to be sent through the network, which can place a significant burden on the resources of the network. In other situations, the similarities may be the result of expensive experiments or human-based comparisons, again placing a significant cost on their collection. And finally one could envision situations where we have multiple realizations of each item to be clustered, and then use the realizations to compute estimates of expected similarities between the items, however this might be too expensive and pose serious computational burden when the number of items to be clustered is too large. The potential cost of obtaining similarities motivates a natural question: Is it possible to reliably cluster items using less than the complete, exhaustive set of all pairwise similarities? We will show that the answer is yes, under conditions that are reasonable in many practical situations. In fact it is possible to reliably determine the hierarchical clustering of N items using as few as O (N log N) of the total of N(N 1)/2 similarities. Since it is clear that we must obtain at least one similarity for each of the N items, this is about as good as one could hope to do. While it is natural to consider

134 110 the possibility of using a randomly chosen subset of similarity values, we show that this approach is quite limited in general. Instead, we propose an active approach that selects similarities in an adaptive fashion. The active clustering method is first developed under ideal (error-free) conditions, and then robustified to handle errors and outliers in the similarity values. While there have been some attempts at developing robust procedures for hierarchical clustering [107, 108, 109], these works do not try to optimize the number of similarity measurements needed to robustly identify the true clustering, and mostly require all O ( N 2) similarities. 6.1 The Hierarchical Clustering Problem Let X = {x 1, x 2,...,x N } be a collection of N items. In contrast to the previous chapters, these items are considered in a generic sense and do not have to be assumed to be objects in the Internet. Definition A cluster C is defined as any subset of X. A collection of clusters T is called a hierarchical clustering if Ci T C i = X and for any C i, C j T, only one of the following is true (i) C i C j, (ii) C j C i, (iii) C i C j =. The hierarchical clustering T has the form of a tree, where each node corresponds to a particular cluster. The tree is binary if for every C k T that is not a leaf of the tree, there exists proper subsets C i and C j of C k, such that C i C j =, and C i C j = C k. The binary tree is said to be complete if it has N leaf nodes, each corresponding to one of the individual items. Without loss of generality we will assume that T is a complete (possibly unbalanced) binary tree, since any non-binary tree can be represented as a binary tree (e.g., a merging of three clusters can be expressed as a sequence of two pairwise mergings). Let S = {s i,j } denote the collection of all pairwise similarities between the items in X; (i.e., s i,j denotes the similarity between x i and x j ) and we assume s i,j = s j,i. The traditional hierarchical clustering problem uses this complete set of similarities to infer T. In order to guarantee that T can be correctly identified from S, the similarities must conform to the hierarchy of T. We consider the following sufficient condition.

135 111 Definition A set of items X = {x 1, x 2,..., x N } satisfies the Complete Linkage (CL) Condition if for every triple {x i, x j, x k } such that x i, x j C and x k C, for some C T, the pairwise similarities satisfies, s i,j > max (s i,k, s j,k ). In words, the CL condition implies that the similarity between all pairs within a cluster is greater than the similarity of any item within a cluster with an item outside the cluster. It is easy to see that if the CL condition is satisfied, then the standard bottom-up agglomerative clustering algorithms such as single linkage, average linkage and complete linkage will produce T [57]. Given the complete similarity matrix S, agglomerative clustering is a recursive process that begins with singleton clusters (i.e., the N individual items to be cluster), and at each step the pair of most similar clusters are merged. The process is repeated until all items are merged into a single cluster. Different agglomerative clustering algorithms differ in how the similarity between two clusters is defined. Observe that agglomerative clustering requires all N(N 1)/2 similarity values (since all must be compared at the very first step). To properly cluster the items using fewer similarities requires a more sophisticated adaptive approach where similarities are carefully selected in a sequential manner. Before contemplating such approaches, we first demonstrate that adaptivity is necessary, and that simply picking similarities at random will not suffice. Proposition If a subset of n similarities are selected uniformly at random from S, then any clustering procedure will fail to correctly resolve a cluster C T with probability 1 C C 2 /2 (4n/N 2 ) C. Proof. Observe that to resolve a cluster of size C, we need to measure at least C of the C ( C 1)/2 similarities between items in C. Therefore, probability that the algorithm succeeds ( C ( C 1)/2) C (2n/(N(N 1))) C C C 2 /2 (4n/N 2 ) C. This result states that it is not possible to resolve a cluster of size C, if the number of randomly sampled similarities n < cn 2 / C C /2, where 0 < c < 1 is a constant. Consequently, to resolve a cluster of size log κ N for some κ 1, we need n = Ω(N 2 ) random similarities, ignoring log factors, i.e. almost all the similarities.
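The CL condition is easy to test directly on a small instance. The sketch below (Python) checks it for a given hierarchy and similarity matrix; representing the clusters of T as a list of index sets, and the particular toy similarities used, are assumptions chosen for illustration.

```python
from itertools import combinations

def satisfies_cl_condition(clusters, sim):
    """Check the CL condition: for every cluster C and every x_i, x_j in C,
    x_k outside C, require s_ij > max(s_ik, s_jk).

    clusters : iterable of sets of item indices (the clusters of T)
    sim      : symmetric similarity matrix, sim[i][j] = s_ij
    """
    items = set().union(*clusters)
    for C in clusters:
        outside = items - C
        for i, j in combinations(sorted(C), 2):
            for k in outside:
                if sim[i][j] <= max(sim[i][k], sim[j][k]):
                    return False
    return True

# Toy example: T contains the clusters {0,1}, {2,3}, and {0,1,2,3} (singleton
# clusters impose no constraints and are omitted).  Within-cluster similarities
# exceed every cross-cluster similarity, so the condition holds.
S = [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.3, 0.2],
     [0.2, 0.3, 1.0, 0.8],
     [0.1, 0.2, 0.8, 1.0]]
print(satisfies_cl_condition([{0, 1}, {2, 3}, {0, 1, 2, 3}], S))  # True
```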

136 112 The contribution of this chapter is two-fold. First, the main goal is to infer T using only a small subset of S. We show that adaptive or active clustering methods that sequentially select similarities are more effective than methods based on randomly choosing which similarities to use. If the CL condition is satisfied, then no more than 3N log N similarities are required. Note that, at a bare minimum, we will require at least N similarities since we will need at least one similarity for each item, so our active clustering methodology is within a logarithmic factor of this lower bound. Second, in practice the CL condition may be too restrictive, since the similarities may violate the condition due to errors or outliers. Therefore, here we will consider more sophisticated clustering procedures that are robust to violations of the CL condition. If the CL condition is violated by a small fraction of the similarities due to errors, we can still recover the tree T up to a certain resolution using O(N log 4 N) similarities, which is only slightly more than the noiseless case. In addition, the degree to which T is balanced also governs how many similarities are required for the reliable recovery of T. Definition For each (non-leaf) cluster, we define the balance factor η k := min{ C i, C j }\ C k, which quantifies how evenly C i and C j split C k. The quantity η := min k η k reflects the overall degree to which the tree T is balanced. Note that 0 < η 1/2. As we will show, η plays a crucial role in the number similarities required to determine T ; the closer η is to 1/2, the fewer similarities are required. 6.2 Active Hierarchical Clustering under the CL Condition From Proposition 6.1.1, it is clear that random sampling of similarities cannot properly reconstruct the clustering hierarchy with high probability unless almost all similarities are used. Thus we focus on active clustering based on adaptively selected similarities. One active methodology for selecting similarities is based on an algorithm for efficient graphical model identification proposed in [110]. To fully understand how that algorithm applies to our scenario, we need to define the notion of the outlier of a triple of items. Given the collection of N items and the underlying hierarchical clustering T, the outlier of the three items {x i, x j, x k } is x k if there exists a C T such that

137 113 x i, x j C and x k / C. It can be shown using a simple recursive argument (which is omitted here to save space) that if T is a complete (possibly unbalanced) binary tree and the CL condition holds, then every triple of items has an outlier in this sense. Thus, the outlier of the triple {x i, x j, x k } is defined to be outlier (x i, x j, x k ) = x i x j : if max(s i,j, s i,k ) < s j,k : if max(s i,j, s j,k ) < s i,k (6.1) x k : if max(s i,k, s j,k ) < s i,j Assume that the items {x 1,...,x N } are ordered randomly. Using the notion of an outlier we will identify T using an algorithm of the following form. The algorithm begins with x 1 and x 2 and progresses by constructing a sequence of hierarchical clusterings T 2, T 3,...,T N, where T i is a hierarchical clustering of i items from X. The clustering T i+1 is obtained by adding an additional item, x i+1 to the clustering set T i, using both knowledge of the current clustering hierarchy and outlier tests. The goal is that the final clustering T N should be the correct clustering, i.e., T N = T. This process is similar in spirit to Algorithm II.I in [110]; in fact it is essentially the same algorithm properly translated into the context considered here. Suppose that we have formed T i and we wish to include some x i+1 in the hierarchy. We first select a cluster C T i such that the size of this chosen cluster, C, satisfies i 3 < C 2i 3 (where Lemma 1 in [110] guarantees the existence of a cluster C). Now since T i is a binary tree it follows that C can be split into proper subclusters C l, C r C such that C = C l C r and C l C r =. Let x l C l and x r C r and compute outlier(x l, x r, x i+1 ). Either one of x r or x l will be the outlier (and thus x i+1 is clustered with the other), and the process is repeated again beginning only with the subtree resulting from the subset of items in the cluster x i+1 was placed in and its associated subtree. If x i+1 is the outlier itself, then the process is repeated with the subtree resulting from only the items in{x 1,...,x i }\C plus any one of the items in C (some x C). The reason for including some x C is that although x i+1 may be the outlier relative to C, it may still be more similar to C than it is to {x 1,...,x i }\C; keeping one representative from C insures that this situation is detected if it exists. The process above is repeated using the chosen subset at each step until x i+1 is

138 114 paired with just one item in {x 1,...,x i }, which produces the new clustering T i+1. It is not difficult to show that if the CL condition is satisfied, then T N = T. Moreover, placing the (i + 1) th item requires at most log i outlier tests and 3 similarities per test. The algorithm is summarized with pseudocode below and the following theorem formalizes the discussion above. Theorem Assume that T is a complete binary tree and that the similarities S satisfy the CL condition. Then Algorithm 8 identifies T exactly using no more than 3N log N similarities. Algorithm 8 - Outlier-based Clustering Algorithm Given : A set of randomly ordered items, X = {x 1, x 2,...x N }, and an initial clustering containing only two items, T 2 = {x 1, x 2 }. For each x i = {x 3, x 4,..., x N } 1. Set T = T i While T > 2 (a) Chose a cluster, C T, such that T 3 < C 2 T 3 (b) Find C l, C r C, such that C l C r = C and C l C r =. (c) Construct C c = (T \C x ), where x is any item in cluster C. (d) Randomly choose items x l C l and x r C r. C l : if outlier (x l, x r, x i ) = x r (e) Set T = C r : if outlier (x l, x r, x i ) = x l C c : if outlier (x l, x r, x i ) = x i 3. Set T i = T i If T = 1 then pair x i with T. 5. Else If T = 2 then find x l, x r T, such that x l x r = T. pair x i with T : if outlier (x l, x r, x i ) = x i x l : if outlier (x l, x r, x i ) = x r x r : if outlier (x l, x r, x i ) = x l
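To make the outlier test and the insertion procedure concrete, the sketch below (Python) implements Equation 6.1 together with a simplified top-down variant of Algorithm 8. It represents the hierarchy as nested tuples and descends from the root with one outlier test per level, rather than re-selecting a cluster of size between |T|/3 and 2|T|/3 at each step; this simplification is an assumption of the sketch, and it matches the 3N log N probe count only when the resulting tree is balanced. Under the CL condition the descent still places each item correctly.

```python
def outlier(si_j, si_k, sj_k):
    """Equation 6.1: return 0, 1, or 2 for which of (x_i, x_j, x_k) is the outlier."""
    if max(si_j, si_k) < sj_k:
        return 0
    if max(si_j, sj_k) < si_k:
        return 1
    return 2

def leaves(t):
    return [t] if not isinstance(t, tuple) else leaves(t[0]) + leaves(t[1])

def insert(tree, x, sim):
    """Insert item x into the binary hierarchy `tree` using outlier tests.

    `tree` is a nested tuple (left, right) with item indices at the leaves and
    sim[i][j] the similarity s_ij.
    """
    if not isinstance(tree, tuple):                       # single item: pair them
        return (tree, x)
    x_l, x_r = leaves(tree[0])[0], leaves(tree[1])[0]     # one representative per side
    o = outlier(sim[x_l][x_r], sim[x_l][x], sim[x_r][x])
    if o == 2:                                            # x lies outside this cluster
        return (tree, x)
    if o == 1:                                            # x_r is the outlier: descend left
        return (insert(tree[0], x, sim), tree[1])
    return (tree[0], insert(tree[1], x, sim))             # x_l is the outlier: descend right

def outlier_clustering(items, sim):
    """Build the hierarchy by inserting items one at a time (Algorithm 8 sketch)."""
    tree = (items[0], items[1])
    for x in items[2:]:
        tree = insert(tree, x, sim)
    return tree

# With the CL-satisfying similarities S from the previous sketch:
# outlier_clustering([0, 1, 2, 3], S) -> ((0, 1), (2, 3))
```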

139 Robust and Efficient Hierarchical Clustering with CL Violations The CL condition may be restrictive in real-world applications. For example, errors or outliers in the similarities may lead to violations in the CL condition leading to a incorrect clustering. This motivates developing techniques to discover the hierarchical clustering that are robust to cases where a fraction of the similarities violate the condition. Suppose each similarity probably satisfies the CL condition. This can be characterized as the additive noise case when there is a gap between similarities of items inside a cluster and outside a cluster, with the additive noise pushes some fraction of the similarity values over the gap. More generally, consider an arbitrary similarity s i,j and let C be the smallest cluster containing x i and x j with x k C, P (s i,j < max(s i,k, s j,k )) q < 1/2 (6.2) Moreover assume the violations occur independently among {s i,j }. Now recall the Outlier-Based Clustering procedure from Algorithm 8. The ability for this methodology to reconstruct the clustering hierarchy is predicated on the outlier determination technique of (Equation 6.1) always returning the correct item. The outlier-based clustering scheme proposed in the previous section depends crucially on the CL condition, and it will clearly fail to properly cluster the items if violations occur. However, if the frequency of violations is not too large, then it should be possible to correctly cluster the items (at least for clusters in T that are not too small). The intuition is that if the majority of similarities are correct, then appropriately crafted voting schemes should indicate the proper clusterings. In this section, we develop a top-down recursive algorithm that implements this idea. The algorithm begins with the complete set of items X and then recursive splits this set into smaller and smaller clusters. At each step the algorithm determines a good split of the cluster in question, say C, into two reasonably balanced subclusters, denoted C R and C L. The split is determined by a sophisticated voting procedure described next.

140 116 The key to each step is to find a good split of the cluster in question, which we denote by C. Consider finding two items from C which we are reasonably confident are on either side of the cluster split. Denoting these items by x i and x j, the idea is to use these two items as a basis for splitting up the remaining points in C by associating each with whichever of x i or x j it is most similar to. Ideally, x i and x j will be on opposing sides of the top-most split of C. We can robustly and efficiently find two items on opposing sides of the top-most cluster split using at least N P randomly chosen potential split item pairs, determined from a voting scheme using a randomly selected validation set of N C test items. For the remaining items in the cluster, they are assigned to either cluster associated with x i or x j determined from a voting scheme using N S randomly chosen split reinforcement pairs tested against the randomly selected validation set of N C test items. Theorem Let min(n C, N R, N S, N p ) = (log N) 2, where N p is the number of candidate pairs drawn in step 1. If the violation probability q < 1 (1/ 2 + ǫ) for some small ǫ < 1 1/ 2, and balance factor η > (1 (1 q) 2 + δ)/(1 q) 2, then the Robust Outlier clustering methodology in Algorithm 9 will recover all clusters with size C > γ N (log N) 2, where γ N is any function increasing in N, with probability (1 N 2 ) κn log N, where κ κ(η) > 0 is a constant. The number of similarities needed by the Algorithm is O(N log 5 N). Proof: We start with a single cluster of N objects, C. Let us define the top-most split of the hierarchy as CL, C R. The first step of Algorithm 9 requires finding some pair of items, x i, x j, that are on either side of the top-most split, such that x i CL and x j CR 1. Instead of performing a single, possibly erroneous outlier test, we now introduce the voting-based outlier count variable, c i,j. Where N C randomly chosen validation items ({x a1, x a2,..., x anc }) are used to determine the placement of items {x i, x j } in relation to the top-most split of C. N C c i,j = I (outlier (x i, x j, x k ) = x k ) (6.3) k=1 Where the indicator function, I (x) = 1 if x is true, and = 0 if x is false. Definition Ω i,j denotes the event that similarity s i,j satisfies the CL condition. 1 This is stated without loss of generality, as the proof holds for x i CR and x j CL

141 117 Algorithm 9 - Robust Outlier Clustering Algorithm Given : A set of objects, X = {x 1, x 2,...x N }, and violation probability bound q. Initialize : The initial cluster, C = X, and a small constant δ > 0. Methodology : 1. (a) Pick a pair of candidate items x i, x j C at random. (b) Draw a set R of N R Representative items at random from C \ {x i, x j }. (c) For each k R Draw a set of N C test items at random from C \ {x i, x j,r}. Compute Outlier Counts c i,k and c j,k (Eq. 6.3). (d) Compute Representative Outlier Count r i,j (Eq. 6.4). (e) If r i,j > 1 2 N R, go to 1a. 2. Set x L = x i and x R = x j, and initialize the set assignment clusters, C L = {x L } and C R = {x R }. 3. For each x k C\{x L, x R } Draw a set S of N s Validation items at random from C \ {x L, x R, x k }. For each l S Draw a set of N C test items at random from C \ {x L, x R, x k,s}. Compute Outlier Counts c k,l, c L,l and c R,l (Eq. 6.3). Compute Cluster Outlier Counts t L k and tr k (Eq. 6.5). Assign x k C L if t L k < 1 2 N S or x k C R if t R k < 1 2 N S. 4. If C L > 2, then split the constructed left set: set C = C L and go to step 1. If C R > 2, then split the constructed right set: set C = C R and go to step 1. Else, then stop.
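The basic voting primitive used throughout Algorithm 9 is the outlier count of Equation 6.3. The sketch below (Python) computes c_{i,j} against a randomly drawn validation set and applies the threshold γ = (1 − (1−q)² + δ) N_C from the analysis that follows; the outlier rule is the one from Equation 6.1, and the sampling details (e.g., how previously drawn items are excluded) are simplifications for illustration rather than the exact procedure.

```python
import random

def outlier_index(i, j, k, sim):
    """Equation 6.1 specialised to matrix lookups: which of (i, j, k) is the outlier?"""
    if max(sim[i][j], sim[i][k]) < sim[j][k]:
        return i
    if max(sim[i][j], sim[j][k]) < sim[i][k]:
        return j
    return k

def outlier_count(i, j, cluster, sim, n_c, rng=random):
    """Equation 6.3: number of random validation items declared the outlier of (i, j, k)."""
    pool = [k for k in cluster if k not in (i, j)]
    test = rng.sample(pool, min(n_c, len(pool)))
    return sum(outlier_index(i, j, k, sim) == k for k in test)

def opposite_sides(i, j, cluster, sim, n_c, q, delta=0.05, rng=random):
    """Proposition 6.3.1's test: an outlier count below gamma indicates that
    x_i and x_j lie on opposite sides of the top-most split of the cluster."""
    gamma = (1.0 - (1.0 - q) ** 2 + delta) * n_c
    return outlier_count(i, j, cluster, sim, n_c, rng) < gamma
```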

142 118 Proposition Let δ > 0 be a small constant. If the balance factor η > γ/(1 q) 2 = (1 (1 q) 2 + δ)/(1 q) 2, N C 10 log N δ 2 and Ω i,j holds, then with probability > ( 1 N 2) if the outlier count satisfies c i,j < γ, the pair of items x i, x j are in opposing sides of the top-most split of cluster C (where x i C L and x j C R ). Lemma The expected outlier count (conditioned on the choice of x i, x j and event Ω i,j where the corresponding similarity value s i,j does not have a violation) for the Similarity Violation Condition can be stated as, E [c i,j (x i C R and x j C L), Ω i,j ] = E [c i,j (x i C L and x j C R),Ω i,j ] ( 1 (1 q) 2) N C E [c i,j x i, x j C L, Ω i,j ] (1 q) N C E [c i,j x i, x j C R, Ω i,j ] (1 q) 2 ηn C Proof. Using Lemma 6.3.1, consider defining the outlier count threshold, γ = (1 (1 q) 2 + δ)n C. Using Chernoff s Bound, we can state that if x i, x j C R, then, P (c i,j γ x i, x j C R,,Ω i,j ) ( ) 2 exp (γ E [c i,j x i, x j C N R, Ω i,j ]) 2 exp ( 1 ) C 2 δ2 N C where the last step holds if η > γ/(1 q) 2 = (1 (1 q) 2 + δ)/(1 q) 2. ( It follows that we want, exp 2 N C (E [c i,j x i CL and x j CR, Ω i,j] γ) 2) ) exp ( 1 2 δ2 N C Using a Union Bound argument and rearranging both sides, we find that if N C 10 log N, then with high probability ( ( 1 N 2) ) the threshold γ will distinguish values of δ 2 c i,j where x i, x j are in the same cluster and when they are in different clusters. Unfortunately, Proposition is dependent on the event Ω i,j, and from Equation 6.2 we observe that with probability upper bounded by q the event Ω i,j will not occur. To be robust to these potential erroneous similarities, consider the voting-based representative outlier count value, r i,j, where the outlier with respect to a set of N R randomly chosen representative candidate items (R C) is used to determine the placement of items in relation to the top-most split of C. Let

Let γ = (1 - (1-q)^2 + δ) N_C. Then

    r_{i,j} = Σ_{k ∈ R \ {x_i, x_j}} ( I(c_{i,k} > γ and c_{j,k} > γ) + I(c_{i,k} < γ and c_{j,k} < γ) )    (6.4)

Proposition. With high probability > (1 - N^{-2}), if the size of the representative candidate set satisfies N_R ≥ 10 log N / ((1 - N^{-2})^2 (1-q)^2 - 1/2)^2, q < 1 - (1/√2 + ε) for some small ε < 1 - 1/√2, and the representative outlier count satisfies r_{i,j} < (1/2) N_R, then the pair of representative candidates (x_i, x_j) lies on opposing sides of the top-most split of cluster C (where x_i ∈ C_L and x_j ∈ C_R).

Lemma 6.3.2. The expected representative outlier count (conditioned on the choice of x_i, x_j) for the Similarity Violation Condition satisfies

    E[r_{i,j} | (x_i ∈ C_R and x_j ∈ C_L)] = E[r_{i,j} | (x_i ∈ C_L and x_j ∈ C_R)] ≤ (1 - (1 - N^{-2})^2 (1-q)^2) N_R
    E[r_{i,j} | x_i, x_j ∈ C_L] ≥ (1 - N^{-2})^2 (1-q)^2 N_R
    E[r_{i,j} | x_i, x_j ∈ C_R] ≥ (1 - N^{-2})^2 (1-q)^2 N_R

Proof: Let Φ denote the event that neither s_{i,k} nor s_{j,k} has a similarity violation and both outlier counts c_{i,k}, c_{j,k} resolve correctly, so that P(Φ) ≥ (1 - N^{-2})^2 (1-q)^2 and P(Φ^c) ≤ 1 - (1 - N^{-2})^2 (1-q)^2. Then for the conditional expectation E[r_{i,j} | (x_i ∈ C_L and x_j ∈ C_R)] (and likewise for E[r_{i,j} | (x_i ∈ C_R and x_j ∈ C_L)]),

    E[r_{i,j} | (x_i ∈ C_L and x_j ∈ C_R)] = P(Φ^c) E[r_{i,j} | (x_i ∈ C_L and x_j ∈ C_R), Φ^c] + P(Φ) E[r_{i,j} | (x_i ∈ C_L and x_j ∈ C_R), Φ]
        ≤ N_R P(Φ^c) + 0 · P(Φ) = (1 - (1 - N^{-2})^2 (1-q)^2) N_R

For both items x_i, x_j ∈ C_L we can state

    E[r_{i,j} | x_i, x_j ∈ C_L] = P(Φ^c) E[r_{i,j} | x_i, x_j ∈ C_L, Φ^c] + P(Φ) E[r_{i,j} | x_i, x_j ∈ C_L, Φ]
        ≥ P(Φ) E[r_{i,j} | x_i, x_j ∈ C_L, Φ] = (1 - N^{-2})^2 (1-q)^2 N_R

And finally, for both x_i, x_j ∈ C_R we state

    E[r_{i,j} | x_i, x_j ∈ C_R] = P(Φ^c) E[r_{i,j} | x_i, x_j ∈ C_R, Φ^c] + P(Φ) E[r_{i,j} | x_i, x_j ∈ C_R, Φ]
        ≥ P(Φ) E[r_{i,j} | x_i, x_j ∈ C_R, Φ] = (1 - N^{-2})^2 (1-q)^2 N_R

Using Lemma 6.3.2, we can now prove the preceding proposition.

Proof. First note that since q < 1 - (1/√2 + ε) for some small ε, we have (1 - N^{-2})^2 (1-q)^2 > 1/2 for N ≥ N(ε) large enough. Using Lemma 6.3.2 and invoking Chernoff's bound, we can state that if x_i, x_j ∈ C_R (or x_i, x_j ∈ C_L), then

    P(r_{i,j} ≤ N_R/2 | x_i, x_j ∈ C_R) ≤ exp( -(2/N_R)(E[r_{i,j} | x_i, x_j ∈ C_R] - N_R/2)^2 ) ≤ exp( -2 N_R ((1 - N^{-2})^2 (1-q)^2 - 1/2)^2 )

We can also state that if x_i ∈ C_L and x_j ∈ C_R (or conversely, if x_i ∈ C_R and x_j ∈ C_L), then using Chernoff's bound,

    P(r_{i,j} ≥ N_R/2 | x_i ∈ C_L and x_j ∈ C_R) ≤ exp( -(2/N_R)(N_R/2 - E[r_{i,j} | x_i ∈ C_L and x_j ∈ C_R])^2 ) ≤ exp( -2 N_R ((1 - N^{-2})^2 (1-q)^2 - 1/2)^2 )

Using a union bound argument and rearranging both sides, we find that if N_R ≥ 10 log N / ((1 - N^{-2})^2 (1-q)^2 - 1/2)^2, then with high probability (≥ 1 - N^{-2}) the threshold N_R/2 will distinguish values of r_{i,j} where x_i, x_j are in the same cluster from those where they are in different clusters.

Proposition. With probability ≥ (1 - N^{-2}), no more than N_p > 4 log N / log(1/(η^2 + (1 - η)^2)) pairs x_i, x_j ∈ C must be chosen before finding a valid pair such that (WLOG) x_i ∈ C_L and x_j ∈ C_R.

Proof. Given the balance factor bound η, the probability that a randomly chosen pair of items lies in the same cluster (C_L or C_R) is at most η^2 + (1 - η)^2. Therefore, the probability that N_p randomly chosen pairs all lie in the same cluster is at most N^2 (η^2 + (1 - η)^2)^{N_p} ≤ N^{-2}.

Finally, consider the assignment of each remaining item to either cluster C_L or C_R. To defeat the similarity violations inherent in the pairwise observations, we introduce the cluster-based outlier count variables t_k^L, t_k^R, which count the number of times an item is declared to be in cluster C_L or cluster C_R using a voting procedure on an N_S-sized set of randomly chosen split reinforcement items S ⊂ C:

    t_k^L = Σ_{l ∈ S \ {x_k, x_L}} ( I(c_{k,l} > γ and c_{L,l} > γ) + I(c_{k,l} < γ and c_{L,l} < γ) )
    t_k^R = Σ_{l ∈ S \ {x_k, x_R}} ( I(c_{k,l} > γ and c_{R,l} > γ) + I(c_{k,l} < γ and c_{R,l} < γ) )    (6.5)

Proposition. With high probability ≥ (1 - N^{-2}), if the size of the validation set satisfies N_S ≥ 10 log N / ((1 - N^{-2})^3 (1-q)^2 - 1/2)^2 and q < 1 - (1/√2 + ε) for some small ε < 1 - 1/√2, then the cluster-based outlier counts (t_i^L, t_i^R from Equation 6.5) resolve which of the two sets each object belongs to:

    x_i ∈ C_L if t_i^L < (1/2) N_S,    x_i ∈ C_R if t_i^R < (1/2) N_S    (6.6)

Proof. The proof of this proposition follows directly from the proof of the preceding proposition. Therefore, the proof of the theorem follows from the above propositions, since the number of clusters in C is bounded by κN log N, where κ ≡ κ(η) > 0 is a constant.

6.4 Experiments

Our quantitative theoretical results in the preceding section predict the performance of simulations that follow the assumed conditions, and we therefore feel there would be little value in such synthetic experiments. Instead, we focus on a real-world dataset of genetic microarray data provided by [111]. We use a set of 400 yeast genes with 7 expression values each, from which we exhaustively generate the standard Pearson correlation between the expression vectors of every pair of genes. Figure 6.1 depicts the results obtained by three different clustering methods. With no a priori knowledge of the balance factor or the noise level of the similarity values, our Robust methodology uses a very conservative estimate of η = 0.3, q = 0.1.

As a result of the relatively small number of items in this dataset, the robust outlier methodology requires an exhaustive number of similarities for this experiment. Meanwhile, the non-robust outlier methodology uses only 3,256 targeted similarities to reconstruct the clustering hierarchy, only 4.08% of the number of similarities needed by agglomerative clustering. We show the similarity matrix with the items organized according to the inferred clustering of each method. To interpret these images, imagine that a perfect clustering would produce an image with well-defined, coherent blocks of similarity values organized along the diagonal. A poorer clustering would result in an image with fragmented blocks and many large similarity values well off the diagonal. At the other extreme, a random clustering would yield a random-looking image. Qualitatively, our robust outlier methodology (right) appears to yield better results than both standard agglomerative clustering (left) and the non-robust outlier-based method (center). We conjecture that the similarities in this case mostly obey the CL condition, but a few do not, which could explain why agglomerative clustering and non-robust outlier-based clustering perform worse. Of course, without a ground-truth clustering it is impossible to check the validity of the CL condition.

Figure 6.1: Resulting ordering of gene microarray reconstructions. (Left) Standard agglomerative clustering, (Center) outlier-based clustering, (Right) robust outlier-based clustering.

6.5 Summary

Despite the wide-ranging applications of hierarchical clustering (biology, networking, scientific simulation), relatively little work has examined the number of pairwise measurements needed to resolve the hierarchical dependencies in the presence of noise. The goal of our work was to use drastically fewer selected measurements to reduce the total number of pairwise similarities needed to resolve the hierarchical dependency structure while remaining robust to potential

outliers in the data. We showed that in the presence of erroneous similarity values we only require on the order of N log^5 N similarities to robustly recover the clusters. Meanwhile, when there is no similarity noise, we presented a methodology that requires no more than 3N log N similarity values to recover the clustering hierarchy. Whether a robust method can be devised that eliminates the extra log factors is an open question.

Chapter 7

IP Geolocation using Population Data

One characteristic of Internet structure that has significant implications for advertisers, application developers, network operators, and network security analysts is the geographic location (or geolocation) of networked devices such as routers or end hosts. The ultimate goal of IP geolocation is to find the precise latitude/longitude coordinates of a target Internet device. As with discovering other structural characteristics, however, there are considerable challenges in finding the geographic location of a given end host in the Internet. First, the size and complexity of the Internet today, coupled with its highly diffuse ownership, means that there is no single authority with this information. Second, no standard protocol provides the geographic position of an Internet device on the globe (although DNS entries can include a location record). Third, Internet devices are not typically equipped with location identification capability (e.g., GPS), although this may change in the future. Moreover, even GPS-equipped devices may choose not to report location information due to privacy concerns. Finally, measurement-based geolocation can be confused by NAT'ed devices or by users who are trying to anonymize their communications [112].

IP geolocation methods that are currently used largely fall into two categories. The first is a survey-based approach in which a geolocation database is established by examining network address space allocations for providers and associating these with the geographic locations of the providers. While this can be effective for providers that offer service in a restricted geographic region (e.g., a university or a small town), it will fail for providers with a large geographic footprint unless

coupled with additional information. The second method is to use active probe-based measurements to place the target host within some restricted geographic region. At a high level, many of these methods can be considered similar to standard triangulation methods used in geographic surveying. Unfortunately, as we will show later in this chapter, current techniques have relatively high median error and high error variability.

The goal of this chapter is to broadly improve probe-based IP geolocation accuracy over prior methods. Our hypothesis is that the large estimation errors caused by imperfect measurements, sparse measurement availability, and irregular Internet paths can be addressed by expanding the scope of information considered in IP geolocation. We use available societal information, particularly population data, to leverage gains in geolocation accuracy over prior methods. In this chapter we introduce two methodologies for IP geolocation:

NBgeo - A Naive Bayes learning-based geolocation methodology that uses explicitly defined population data derived from census data [18].

PinPoint - A landmark-based geolocation algorithm that exploits implicitly defined population data via a set of known landmark locations.

The dataset considered in this chapter consists of 431 commercial hosts belonging to Akamai Technologies with known geolocation (with accuracy down to the GPS coordinates). During the weekend of January 16-17, 2010, pairwise bidirectional measurements of latency and hop count were performed between servers belonging to Akamai Technologies hosted at 431 distinct physical locations in the United States.¹ The street addresses for these locations are known and were used as the basis for GPS coordinates. The measurements were conducted using the MTR tool [113]. The servers belong to Akamai's production content delivery network and during the measurement period may have also been performing other tasks such as serving HTTP requests.

To evaluate both geolocation algorithms and offer comparisons with existing tools, we validated performance on a collection of hop count and latency measurements using 431 commercial end hosts.

¹ Although our data is based on the United States, the methodologies developed in this chapter are applicable to international end hosts with no modification of the algorithms.

It is important to note that the exact coordinates of all hosts used in this study were known, which gave us a strong foundation for evaluation. The geolocation methodologies in this chapter partition the known IPs into three categories of hosts in the network:

Landmarks - The set of nodes in the network with known and very accurate geolocation information.

End Hosts - The set of nodes we wish to geolocate. These nodes have the ability to target latency probes to a subset of the landmarks.

Monitors - The set of nodes in the network with the ability to target hop measurements to both landmarks and end hosts.

We compared our geolocation estimates of the end hosts with both survey-based commercial IP geolocation packages and measurement-based methodologies. The first experiment, on a partitioned geographic space with explicitly defined population data, shows that our NBgeo algorithm improves both median and average geolocation error over current commercial and measurement-based methodologies given the same measurement data. The second experiment, on an unpartitioned geographic space with no defined population data, shows that the landmark-based PinPoint methodology was able to establish the geolocation of the target hosts with a median error of 27 miles and an average error of less than 124 miles. As might be expected, we found that PinPoint error is proportional to the landmark density, which suggests that estimates can be further improved, e.g., by adding DNS servers to our landmark set. We also compared our geolocation estimates to two commercial IP geolocation services, MaxMind [16] and IP2Location [17]. The best commercial geolocation database yielded a median error of 34 miles, 25% worse than PinPoint. The average error of the best commercial database was 493 miles, which could seriously jeopardize its use in certain applications.

7.1 Geolocation using NBgeo

In this section, we present the NBgeo geolocation algorithm, where the geographic location of an end host is classified via an estimated a posteriori probability. Given the potentially large number of measurements to an IP target, the likelihood estimation is made computationally tractable by the use of a Naive Bayes-based approach. The network measurement data considered in this framework includes latency and hop count from a set of landmarks to an IP target (obtained by lightweight ping probes). We also include explicitly defined population density in the framework as a demonstration of a non-network measurement that can help refine the location estimates.

Bayesian Geolocation Framework

We start with the standard IP geolocation problem: can we determine the geographic location of the target IP? Consider a single target IP address with a set of measurements from a set of monitors with known geolocation to this target IP address. For the purposes of this work, the measurement set M = {m_1, m_2, ..., m_M} is the collection of both latency and hop count values going from the monitor set. Without loss of generality, now consider a set of possible counties in the continental United States, C, such that the target is located in some county c ∈ C. This changes the underlying problem to: given the measurement set M, can we estimate which county c ∈ C the target IP is located in? The best classifier would choose the county ĉ that the target is most probably located in given the measurement set,

    ĉ = arg max_{c ∈ C} P(c | M)    (7.1)

Using Bayes' Theorem [60], P(A|B) = P(B|A)P(A)/P(B), we can restate the classifier as

152 128 Where the value P (M), the probability of observing the set of measurements, can be ignored due to this value being constant across any choice of county c. Next, we expand our estimation framework to consider features other than measurements from monitors to IP targets. We can use the work in [61] to inform where network resources should be geographically located. Specifically, the value P (c), the probability of classifying a target in county c, will be chosen using the results showing that the number of resources in a specific geographic location is strongly correlated with the population of that geographic location. Therefore, we can estimate the probability of classifying into a given county to be the population of that county divided by the total population in all the counties under consideration. P (c i ) = Population of c i j C Population of c j (7.2) Finally we need to estimate the likelihood probability, P (M c), the probability likelihood of a measurement set M being observed given the target is located in county c. Given a set of training data, a set of IP addresses with known measurement sets M and locations c, we could use off-the-shelf techniques (kernel density estimators, histograms, etc.) to estimate the multivariate likelihood density P (M c). A problem is that the set M is most likely of high dimensions (with dimensionality equal to the number of hop count and latency measurements observed to this target, in this case, on the order of 100), and most density estimator techniques have an error rate that increases quickly with the dimension of the problem [60]. If all of the values of M were statistically independent from each other, then the likelihood density could be restated as P (M c) = P ({m 1, m 2,..., m M } c) P (m 1 c)p (m 2 c)...p (m M c) (7.3) This converts the problem from estimating one M-dimensional density to estimating M onedimensional densities. However, it should be assumed that there is a large degree of correlation between measurements, with our prior work in Section 3.5 showing correlation between hop count measurements, and work in [35] showing correlation between latency measurements. The risk

The risk of assuming statistical independence between measurements is informed by empirical studies on highly dependent data in [59]. That work shows that for classification there is little penalty for assuming statistical independence even when the measurements are highly statistically dependent. This is because classification performance depends only on the most-probable class (in this case, county) likelihood being greater than the other class likelihoods, not on the goodness-of-fit of our estimated likelihood to the true likelihood.

The next step in our learning-based framework is to estimate the one-dimensional densities P(m_i | c), the probability of the measurement value m_i being observed given that the target is located in county c. Consider a set of training data, where for each training target end host both the measurement set M and the geolocated county c are known. Given the known monitor placement, for the entire training set we can determine the distance vector d = {d_1, d_2, ..., d_M}, where d_i is the distance between the monitor associated with measurement m_i and county c. These measurements with distance ground truth can then be used to learn the density (the probability of observing measurement m_i given that the target is located distance d_i away from the monitor associated with measurement m_i). Simple density estimators, such as histograms, can be used and will ensure that measurement outliers do not significantly contribute to the density estimation. One drawback of histogram estimators is that the lack of smoothness in the estimated density can hurt performance. Instead, we use Kernel Density Estimators [60], which use the summation of smooth kernel functions to estimate the density. This smoothness in the estimated density allows improved estimation of the true density given the limited size of our training set. For hop count measurements, a one-dimensional density is estimated at each hop count value ranging from one hop away from a monitor to ten hops away (it is assumed that any distance longer than ten hops will not help in estimating distance). For latency measurements, due to the limited amount of training data, the measurements are aggregated into 10ms bins, with a single estimated one-dimensional density for 0-9ms, a separate one-dimensional density for 10-19ms, 20-29ms, and so on.
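The following is a small sketch of how the per-bin kernel density estimates could be fit and evaluated, using SciPy's Gaussian KDE as a stand-in for the kernel estimator described above; the binning details and the fallback value for empty bins are assumptions for illustration, not the exact implementation used in this work.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_latency_bin_densities(train_latencies_ms, train_distances_miles, bin_width=10):
    """For each 10ms latency bin, fit a 1-D kernel density estimate over the
    monitor-to-county distances observed in training (one density per bin)."""
    densities = {}
    bins = (np.asarray(train_latencies_ms) // bin_width).astype(int)
    dists = np.asarray(train_distances_miles, dtype=float)
    for b in np.unique(bins):
        samples = dists[bins == b]
        if np.unique(samples).size > 1:
            densities[b] = gaussian_kde(samples)
    return densities

def latency_likelihood(latency_ms, distance_miles, densities, bin_width=10):
    """Estimate P(l | c) by evaluating the kernel density for the latency's bin
    at the distance between the probing monitor and candidate county c."""
    kde = densities.get(int(latency_ms // bin_width))
    return float(kde(distance_miles)[0]) if kde is not None else 1e-12
```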

An example of a kernel-estimated density for latency measurements can be seen in Figure 7.1, along with the resulting probability distribution across the US counties for observing this latency measurement to a monitor with known geolocation.

Figure 7.1: (Left) Probability of latency measurements between 10-19ms being observed given a target's distance from a monitor. Stem plot - histogram density estimation, solid line - kernel density estimation. (Right) The kernel-estimated probability of placement in each county given a latency observation between 10-19ms from a single monitor marked by 'x'.

The location information from latency measurements is likely to be of more use than the location information derived from hop count measurements or population data. Therefore we introduce two weights, λ_hop and λ_pop, as the weights on the hop count measurements and the population density data, respectively. Informed by the geolocation improvement obtained with measurement weights in the Octant framework [4], the ordering of the measurements should also imply some degree of importance, as the location of the monitor with the shortest latency measurement to the target should inform the classifier more than the monitor with the 30th-closest latency measurement. Therefore, we also weight the ordering of measurement values by an exponential, such that the i-th latency measurement is weighted by exp(-i γ_lat) and the j-th hop count measurement is weighted by exp(-j γ_hop). The weight parameter values (λ_hop, λ_pop, γ_lat, γ_hop) are found as the values that minimize the sum of squared distance errors between the training IPs' known locations and the Naive Bayes estimated locations.

Methodology Summary

Dividing the measurement set M into the set of latency measurements {l_1, l_2, ..., l_m} and the set of hop count measurements {h_1, h_2, ..., h_m} (where the total number of measurements M = 2m),

our learning-based classifier under the independence assumption can be restated using the kernel density estimators (where instead of the true likelihood P(m_i | c) we have the kernel estimate P̂(m_i | c)), the weight terms, and the monotonicity of the logarithm as

    ĉ_i = arg max_{c ∈ C} ( λ_pop log P(c) + f_hop + f_lat )    (7.4)

where f_hop = λ_hop Σ_{j=1}^{m} exp(-j γ_hop) log P̂(h_j | c), f_lat = Σ_{j=1}^{m} exp(-j γ_lat) log P̂(l_j | c), and the term P(c) for the 3,107 counties in the continental United States is found using Equation 7.2.

Constraint-Based Restriction

While our Naive Bayes methodology will find the most probable partition given the measurements, we must ensure that the algorithm chooses a partition that is actually feasible given the measurements. Using the Constraint-Based Geolocation methodology from [9], we find the subset of geographic partitions in which the end host could feasibly be located. In contrast to the geolocation methodology of [9] (where the centroid of the feasible region is the chosen geolocation point), our NBgeo algorithm instead chooses the most probable geographic partition using our Naive Bayes learning-based methodology. An example of this technique can be seen in Figure 7.2.

NBgeo Algorithm

A summary of the complete NBgeo methodology is given in Algorithm 10. Note that all of the computational complexity of the NBgeo algorithm lies in training the parameters (λ_hop, λ_pop, γ_lat, γ_hop). Each target is geolocated using only O(M |C|) multiplications, where |C| is the total number of location classes under consideration (later in this chapter we consider the counties in the continental United States) and M is the total number of measurements to the current target IP. The computational complexity being linear in both the number of locations and the number of monitors demonstrates the feasibility of future large-scale Internet studies using this method.
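A compact sketch of the scoring rule in Equation 7.4, assuming the kernel-estimated likelihoods are available as callables, that measurements are ordered from the closest monitor outward, and that the feasible counties have already been restricted as described above; the names and the small smoothing constant are illustrative.

```python
import numpy as np

def nbgeo_score(county, log_prior, hop_meas, lat_meas, hop_llh, lat_llh,
                lam_pop, lam_hop, gamma_hop, gamma_lat):
    """Score a candidate county with Eq. 7.4.  hop_llh / lat_llh return the
    kernel-estimated likelihoods P̂(h_j | c) and P̂(l_j | c)."""
    score = lam_pop * log_prior[county]
    for j, h in enumerate(hop_meas, start=1):
        score += lam_hop * np.exp(-j * gamma_hop) * np.log(hop_llh(h, county) + 1e-12)
    for j, l in enumerate(lat_meas, start=1):
        score += np.exp(-j * gamma_lat) * np.log(lat_llh(l, county) + 1e-12)
    return score

def nbgeo_classify(feasible_counties, **kw):
    """Pick the feasible county (after the constraint-based restriction)
    with the maximum posterior score."""
    return max(feasible_counties, key=lambda c: nbgeo_score(c, **kw))
```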

Figure 7.2: (Left) Estimated posteriori probabilities for all counties in the continental US. (Right) Estimated posteriori probabilities for the constraint-based restricted counties.

7.2 Geolocation using PinPoint

This section proposes a novel approach to IP geolocation using landmarks that we call PinPoint. PinPoint is based on the procurement of a large set of highly reliable landmarks, ideally hosts with known latitude/longitude coordinates that will respond to measurement probes. For example, landmarks used by PinPoint could be end hosts with known geolocations (e.g., PlanetLab nodes) or domain name system (DNS) servers with location information (via DNS LOC records [114]). The canonical example of a PinPoint landmark, and a focus for our empirical evaluation, are the stratum 0/1 Network Time Protocol (NTP) servers [115] that are deployed throughout the Internet. NTP servers often list their lat/long coordinates [116], which are established via their GPS receivers. Significant for our study is the fact that NTP servers are deployed to respond to measurement requests and actually provide reasonably accurate one-way delay estimates. The geographic distribution of these landmarks allows us to exploit implicitly assumed population data (as the landmarks in the topology can be assumed to be in areas of large population density). This avoids the necessity of the NBgeo algorithm to have explicit (and possibly unreliable) population data.

Algorithm 10 - NBgeo - Naive Bayes IP Geolocation Algorithm

Initialize:
- Measure the hop count and latency from every monitor to a training set with known geographic locations.
- Using a population density database, find P(c) for all c ∈ C using Equation 7.2.
- Using kernel density estimators, estimate the one-dimensional distribution P̂(m | c) for every measurement m ∈ M.
- Find the optimal values for λ_hop, λ_pop, γ_lat, γ_hop that minimize the sum of squared distance errors over the training set.

Main Body:
1. For each target IP with unknown geography, estimate the location ĉ_i using Equation 7.4, where the feasible partitions under consideration (C) are found using the Constraint-Based Geolocation method [9].

Geolocation of target end hosts using PinPoint is a three-step process. First, using hop distances to the set of monitors, PinPoint estimates the landmark nearest to a given IP target. Second, the methodology uses the estimated nearest landmark to find a subset of landmarks assumed to be geographically close to the IP target. Finally, PinPoint uses latency measurements from the specified landmark subset to geolocate the target. We hypothesize that this methodology will result in highly accurate geolocation estimates, since winnowing the field of landmarks to those that are closest in terms of hop distance to a target should make the subsequent latency measurements highly accurate predictors of geographic distance. In contrast, consider latency measurements from arbitrary landmarks, which would tend to distort the correspondence between latency and distance estimates when paths are indirect [68], as shown in the example in Figure 7.3.

PinPoint's latency-based geolocation is based on a novel geographic embedding procedure. In contrast to standard embedding techniques, which aim only to preserve the latency distances in the geographic embedding, our sparse embedding algorithm also encourages the targets to cluster geographically, which is desirable since targets tend to concentrate in cities. This serves as an important regularization in the embedding process that further mitigates the effects of noise and errors. An added benefit of the selective use of landmarks, based on hop count proximity, is

158 134 Figure 7.3: Toy example of network routing geography vs. direct line-of-sight geography. a significant reduction in the number of pairwise ICMP probes necessary to geolocate targets. PinPoint also offers new capabilities including geolocation without reliance on traceroute records and without reliance on ICMP latency probes from shared infrastructure, and the possibility of geolocating targets that do not respond to ICMP probes by using passive measurements instead. 7.3 PinPoint Methodology Summary The PinPoint IP geolocation methodology is divided into three complementary components. 1. Hop-Based Mapping using Landmarks - Using acquired hop vectors and the known geolocation for a set of landmarks, we map the set of end hosts in a geographically significant manner to the closest inferred landmark. 2. Targeted Distance Estimates Using Hop-based Mapping - Using the inferred closest landmark, we estimate which subset of landmarks will be closest to each end host and estimate the distance to the subset of landmarks using latency measurements. 3. Geolocation Using the Sparse Embedding Algorithm - We introduce a novel Sparse Embedding algorithm, a methodology to embed end hosts into latitude/longitude coordinates using

distance estimates. This methodology uses a sparse penalty to regularize the embedding process.

We describe each of these components in detail in the following sections.

7.4 Hop-Based Mapping using Landmarks

Consider T landmarks in the network with known geographic locations. Here we focus on a subset of 80 NTP servers in the continental United States, with the geographic distribution of these nodes shown in Figure 7.4. Given the wide geographic diversity of these landmarks, it is intuitive that mapping each target end host to the geographically closest landmark may result in an accurate geolocation estimate. While the geographically closest landmark is not known, we state here that it can be inferred from measurements between the set of monitors and the set of landmarks and between the end hosts and the set of monitors. Here we consider performing this inference using only hop count measurements.

Figure 7.4: Geographic placement of NTP servers.

We exploit the ability to send ICMP probes from a set of M diverse monitors in the Internet (such as the PlanetLab infrastructure [40]) to inform us as to which landmark is closest to each end host. From these probes, we construct a set of hop count vectors. For end host i = {1, 2, ..., N},

    h_i^end = [ h_{i,1}^end  h_{i,2}^end  ...  h_{i,M}^end ]    (7.5)

where h_{i,k}^end is the observed hop count between end host i and monitor k.

And for landmark j = {1, 2, ..., T},

    h_j^land = [ h_{j,1}^land  h_{j,2}^land  ...  h_{j,M}^land ]    (7.6)

where h_{j,k}^land is the observed hop count between landmark j and monitor k.

Previous work [117] showed that hop count vectors contain enough information to cluster end hosts in a topologically significant manner. This methodology relies on the existence of border routers from [14], where multiple end hosts can share the same egress router to the core of the network. If two end hosts share the same border router, this implies that all paths to the network core from both end hosts are shared past the border router. Given this shared path property, we can resolve that two end hosts (i, j) are topologically close in the network if they have the hop count property h_{i,k} = h_{j,k} + C for every monitor k in the network located past the shared border router. A visual example of this can be seen in Figure 7.5.

Figure 7.5: Example of a network where an end host is C hops away from a landmark, with both sharing the same border router.

Using the methodology from [117], we can cluster each end host i to the most significant landmark c_i by finding the landmark such that the hop count difference vector has the smallest variance,

    c_i = arg min_j σ²( h_i^end − h_j^land )    (7.7)

where σ²(x) denotes the sample variance of the entries of the vector x, σ²(x) = (1/M) Σ_{k=1}^{M} ( x_k − (1/M) Σ_{l=1}^{M} x_l )². For two hop vectors separated by a constant integer offset (indicating a shared border node), σ²(h_i − h_j) = σ²(h_i − (h_i + C)) = σ²(−C) = 0.

While this technique of mapping to a landmark with known geolocation is inspired by the GeoPing algorithm [41], GeoPing finds the landmark with the smallest latency to the target end host, thus requiring exhaustive latency measurements between each target and the set of landmarks (of size T), with each end host performing O(T) latency probes. Our proposed hop-based methodology, in contrast, does not require any direct measurements (latency or hop count) to the landmarks, just hop measurements to a set of monitor nodes (resulting in probing complexity O(M), where M ≪ T).² This distinction is important, as we would like to consider as many landmarks as possible while performing as few probes as necessary for each end host. This goal conflicts with the GeoPing approach, while it can be satisfied using our hop-based methodology, where the total number of probes to each end host remains constant even as the number of landmarks grows.

² Experiments later in the chapter will be in the regime where M = 20 and T = 200, resulting in a factor of ten decrease in the number of probes needed.
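A minimal sketch of the hop-based mapping of Equation 7.7, assuming hop-count vectors are available as NumPy arrays; the returned variance doubles as the confidence score used in the experiments that follow.

```python
import numpy as np

def map_to_landmark(end_host_hops, landmark_hops):
    """Equation 7.7: assign an end host to the landmark whose hop-count
    difference vector has the smallest variance across the monitors.

    end_host_hops : array of shape (M,)    hops from the end host to each monitor
    landmark_hops : array of shape (T, M)  hops from each landmark to each monitor
    Returns (best_landmark_index, variance); smaller variance = higher confidence.
    """
    diffs = np.asarray(landmark_hops) - np.asarray(end_host_hops)[None, :]
    variances = diffs.var(axis=1)
    best = int(np.argmin(variances))
    return best, float(variances[best])
```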

NTP Hop-Based Mapping Experiments

Using a set of ICMP probes between 80 NTP servers and 26 PlanetLab nodes as described in Section 5.4.2, we construct hop count vectors for the observed hop counts between the NTP servers and PlanetLab nodes. Using leave-one-out cross validation [57], we test the preliminary performance of our hop-based mapping methodology by using the hop vectors and Equation 7.7 to determine which of the remaining 79 NTP servers each held-out NTP server should be mapped to. The results for this experiment can be seen in Table 7.1, in comparison with mapping to the best NTP server (given complete geolocation knowledge), the worst NTP server, and a randomly chosen NTP server. As seen in the table, our hop-based mapping methodology performs geolocation with significant improvements over naive random mapping.

Table 7.1: NTP Dataset - The average geolocation error for various end host to landmark mapping methodologies.
    Methodology          Mean Error (in miles)    Median Error (in miles)
    Hop-Based Mapping
    Best Mapping
    Random Mapping
    Worst Mapping

In addition to the geolocation mapping, the methodology in Equation 7.7 also returns the variance between the held-out server and the NTP server it was mapped to, σ²(h_i − h_ĵ). This variance can be considered a level of confidence in our geolocation estimate, with the smallest variance values indicating the NTP servers whose geolocation estimates we are most confident in. This motivates the creation of quintile sets for our geolocation estimates, containing the 20% of the NTP servers we are most confident in (1st quintile, with the smallest variance values) through the 20% of the NTP servers we are least confident in (5th quintile, with the largest variance values). The average geolocation error for each quintile set for the hop-based mapping methodology can be seen in Table 7.2.

Table 7.2: NTP Dataset - Hop-based mapping methodology quintile errors.
    Quintile    Mean Geolocation Error (in miles)    Median Geolocation Error (in miles)
    1st
    2nd
    3rd
    4th
    5th

These results show a very strong correlation between the variance of the mapped server and geolocation accuracy. For the 1st quintile, the set of NTP servers we are most confident in, geolocation performance is on average 1,000 miles more accurate than for the 5th quintile, the set of NTP servers we are least confident in. This matches our intuition that for pairs of nodes that share a common border router (and therefore have very low hop difference variance), we estimate geographic location with high accuracy. The methodology fails when NTP servers do not share a common border router with any of the landmarks, resulting in large hop difference variance.

Commercial Node Hop-Based Mapping Experiments

The performance of hop-based mapping on our small set of NTP servers motivates testing performance on a larger set of landmarks. Consider a random partitioning of the commercial node host set described earlier in this chapter. Here we divide the set of 431 total commercial nodes into 20 monitor nodes, 200 landmark nodes with given geographic location, and the remaining 211 nodes as end hosts whose geographic location we wish to estimate. Using hop counts from the collection of 20 monitor nodes, we attempt to estimate the geolocation of the 211 end hosts by mapping each end host to one of the assigned landmarks. To compare against this hop-based method, we first consider the best-case mapping, where we map to the closest landmark to each end host's true location. Additionally, we consider mapping using GeoPing (the landmark with the smallest latency value) and two commercially available geolocation databases, the MaxMind database [16] and the IP2Location database [17].

Table 7.3: Commercial Node Dataset - The average geolocation error for various end host to landmark mapping methodologies.
    Methodology           Mean Error (in miles)    Median Error (in miles)
    Hop-Based Mapping
    Best Mapping
    GeoPing Mapping
    IP2Location Method
    MaxMind Method

From the results in Table 7.3, it is shown that using just 20 observed hop counts per end host we can geolocate our set of end hosts using our collection of landmarks. In particular, this hop-based method actually has a mean error rate that is lower than the two commercial databases, IP2Location and MaxMind. While the GeoPing method performs better than our hop-based mapping approach, our method uses 180 fewer probes for each end host, which is a considerable measurement load trade-off. The average geolocation error for each quintile set for the hop count mapping methodology on this commercial node dataset can be seen in Table 7.4.

Again we observe that this level of confidence is directly related to the performance of our hop-based mapping methodology, with the most confident quintile geolocating over 280 miles closer than the least confident quintile. Using this metric of confidence, we can state that for the 50% of the end hosts we are most confident in, we can geolocate with a substantially smaller average deviation from the true geographic location using only our small set of hop count measurements.

Table 7.4: Commercial Node Dataset - Hop-based mapping methodology quintile errors.
    Quintile    Mean Geolocation Error (in miles)
    1st
    2nd
    3rd
    4th
    5th

Passive Hop-based Mapping

Consider the case where our set of end hosts has blocked ICMP probes. Also consider the availability of a collection of passively obtained TTL counts from each end host to our set of monitors.³ In this case, we can use the technique described in [15] to infer the number of hops between the monitor and the host. This inference is based on the fact that there are only a few initial TTL values used by popular operating systems (e.g., 64 for most UNIX variants, 128 for most Microsoft variants, and 255 for several others). The hop count is inferred by rounding the observed TTL up to the next highest initial TTL value and then taking the difference between that initial value and the observed TTL. Due to the passive nature of these measurements, each end host would be unlikely to obtain hop measurements to all M of our monitors. Instead, it should be assumed that we only have hop measurements to a random subset of our monitors. Using the complete hop count vectors, the observation of passive measurements can be simulated by withholding a specified number of hop elements in randomly chosen locations.

³ This assumes that our monitors would be co-located in popular network locations capable of observing a large number of passive measurements.
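A small sketch of the passive TTL-to-hop-count inference described above; the set of initial TTL values follows the operating-system defaults mentioned in the text, and the example value is illustrative.

```python
# Common initial TTL values; the observed TTL is rounded up to the next one,
# and the difference gives the inferred hop count (technique of [15]).
INITIAL_TTLS = (64, 128, 255)

def hops_from_ttl(observed_ttl):
    """Infer the hop count from a passively observed TTL value."""
    initial = min(t for t in INITIAL_TTLS if t >= observed_ttl)
    return initial - observed_ttl

# e.g., a packet arriving with TTL 115 most likely started at 128 -> 13 hops.
```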

In Figure 7.6, error results for hop-based mapping using simulated incomplete passive measurements are shown for both mean and median error rates on the commercial node dataset. From the figure, we see that only 8 observed hop counts are needed to obtain mean geolocation performance comparable to the MaxMind database, while only 5 observed hop counts are necessary to obtain mean geolocation performance on the order of the IP2Location database.

Figure 7.6: (Left) Hop-based geolocation mean error decay with the number of observed hop counts for each end host. (Right) Hop-based geolocation median error decay with the number of observed hop counts for each end host. (Standard deviations are shown in the error bars.)

7.5 Targeted Distance Estimates using Hop-based Mapping

In addition to hop count measurements from a set of monitors, now consider the ability to send a limited number of latency probes from each end host to a subset of the landmarks (or, conversely, latency measurements from controlled landmarks to an end host). Assume a limited latency probing budget of K probes. In order to use these K probes most effectively, we would like to find the closest landmarks, which would presumably translate into network routes with nearly direct line-of-sight paths and therefore result in more accurate latency-to-distance estimation. Given only the hop count measurements to our set of monitors, we must determine which K landmarks are closest to an end host with unknown geolocation. Using Equation 7.7, we can consider end host i mapped to the landmark c_i.

Taking the K nearest neighbors of landmark c_i with respect to their known geographic distances, we define the set

    Γ_{c_i} = [ γ_{c_i,1}  γ_{c_i,2}  ...  γ_{c_i,K} ]    (7.8)

where γ_{c_i,k} indicates the k-th nearest-neighbor landmark to landmark c_i. Given the known geographic location of each landmark, this set is trivial to find. For each end host, the nearest neighbors of its mapped landmark (Γ_{c_i}) tell us which other landmarks are possibly geographically close to the end host. We consider that latency measurements between the end host and this subset of landmarks may reveal short-distance, nearly direct line-of-sight paths. Therefore, we probe between end host i and landmark j only if landmark j is contained in the nearest-neighbor set of the mapped landmark c_i (i.e., j ∈ Γ_{c_i}). This restriction significantly reduces the total number of latency probes needed by our geolocation algorithm, by restricting the measurements considered for geolocation to our best estimate of which landmarks will reveal short hop paths.

Latency to Distance Estimation

Standard to all previous IP geolocation algorithms is the use of latency as a proxy for distance measurements. While some algorithms use latency measurements solely as an upper-bound constraint on possible geographic locations [9], others have tried to estimate distance directly from the latency values (e.g., the spline-based method of [4]). Here we exploit more recent work from our NBgeo methodology on nonparametric estimation of the distance probability given observed latency. Consider the observation of a round-trip time (RTT) latency of 15ms to a landmark. Given a training set of observations with RTT latency of 15ms, we can generate a histogram of observed distances to a landmark given an observation of 15ms latency. Due to the likely small size of the training set, this histogram estimate may not be a very accurate predictor of the likelihood of observing a given landmark-to-end-host distance. Instead, we use a kernel density estimator approach [57], which has been shown to improve the accuracy of density estimation when the training set size is limited.

From the estimated kernel densities, we can construct the estimated cumulative distribution functions for various observed latency values, where L_l(d) = P(D ≤ d | l) is the probability of being at most distance d away from a landmark given the observed latency l. An example of this cumulative distribution estimation can be seen in Figure 7.7.

Figure 7.7: Likelihood distribution of distance to a landmark given an observed latency of 10-20ms. Solid line - kernel density estimate, dashed line - estimated cumulative distribution, dashed blocks - histogram.

Performing this methodology across the entire training set results in the estimated cumulative distribution set {L_{l_1}, L_{l_2}, ..., L_{l_N}} for latency values {l_1, l_2, ..., l_N}. We now tackle the problem of estimating distance from the set of estimated cumulative distributions. While a naive approach would be to simply use a central value of each of the estimated distributions (e.g., for all observed latencies of l_i, the distance d* where L_{l_i}(d*) = 1/2), this ignores known landmark-to-landmark distances and the estimated hop-mapping landmark information. Instead, we exploit prior information given to us by the hop-based mapping methodology while sampling from these estimated distributions. A standard method for sampling a random variable from a probability distribution [57] consists of generating a uniform random variable u ~ uniform(0, 1) and then finding the distance d* where L_{l_i}(d*) = u. A problem with this methodology is that there might be a large difference between the two random variables generated (u_1, u_2) for two landmarks that are very close to each other, resulting in a large difference between their estimated distances. Instead, we use knowledge of the distance between a landmark j and the current mapped landmark c_i in order to sample from each distribution.

The estimated distance d̂_{i,j} between end host i and landmark j is found as the distance that satisfies

    L_{l_{i,j}}( d̂_{i,j} ) = d_{j,c_i} / max_k ( d_{k,c_i} )    (7.9)

where max_k (d_{k,c_i}) is the distance to the furthest landmark from the mapped landmark c_i, and l_{i,j} is the observed latency value between end host i and landmark j. This ensures that two landmarks that are relatively close in distance, with similar observed latency measurements, will have similar estimated distances to the end host. This distribution sampling method gives rise to a possible bootstrapping procedure, where we can obtain multiple distance estimates to the landmarks for each end host by resampling the distance likelihood distributions with respect to each of the landmarks in the set Γ_{c_i} from Equation 7.8 (e.g., in Equation 7.9 replacing c_i with an element of the set of K closest landmarks to the hop-based mapping landmark c_i). This idea is explored further in Section 7.7.

Exponential Latency Weighting

We argue that due to non-line-of-sight routing for a vast majority of medium/long paths through the Internet, many latency measurements will be a very poor indication of distance. Informed by the geolocation improvement obtained with measurement weights in prior geolocation methodologies [4], the latency of the path should also imply some degree of importance. Shorter latency values should hold more weight than longer latency values, as the shorter latency values are more likely to be the result of short hop paths through the network with possible direct line-of-sight routing. Therefore, we weight each measurement value using an exponential, constructing the weight array W such that the latency measurement of l_{i,j} milliseconds (between end host i and landmark j) is weighted by

    w_{i,j} = exp( -φ l_{i,j} )    (7.10)

where the tuning parameter φ > 0.
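A minimal sketch of the latency-to-distance lookup in Equation 7.9 and the exponential weight of Equation 7.10, assuming the estimated CDF for the observed latency is available as a callable over a distance grid; the grid spacing and range are arbitrary illustrations.

```python
import numpy as np

def estimate_distance(cdf_for_latency, dist_j_to_ci, max_dist_to_ci,
                      grid_miles=np.arange(0, 3000, 5)):
    """Equation 7.9: pick the distance whose estimated CDF value equals the
    ratio d_{j,c_i} / max_k d_{k,c_i}.  `cdf_for_latency` is the estimated
    cumulative distribution L_l(d) for the observed latency, returning values
    in [0, 1]."""
    target_quantile = dist_j_to_ci / max_dist_to_ci
    cdf_vals = np.array([cdf_for_latency(d) for d in grid_miles])
    return float(grid_miles[np.argmin(np.abs(cdf_vals - target_quantile))])

def latency_weight(latency_ms, phi):
    """Equation 7.10: exponentially down-weight longer (less line-of-sight) paths."""
    return np.exp(-phi * latency_ms)
```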

7.6 Sparse Embedding Algorithm

Given both the estimated distance matrix with respect to the landmarks, D̂, and the distance weight matrix, W, our goal is to estimate each end host's latitude/longitude coordinates X = {x_1, x_2, ..., x_N}. Here, consider that all end hosts lie on a feasible set of latitude/longitude coordinates G = {g_1, g_2, ..., g_G} (with every end host x_i ∈ G). Examples of this set of feasible coordinates could be quantized latitude/longitude coordinates in the continental United States, or the set of all cities in the world with population above a set threshold. The problem of finding low-dimensional coordinates given distances is a standard problem in the Multidimensional Scaling (MDS) literature [91], where a set of embedding points X is found that minimizes the sum of squared errors between the embedded distances and the observed distances,

    X̂ = arg min_X Σ_{i=1}^{N} Σ_{j=1}^{T} ( D̂_{i,j} − d(x_i, y_j) )²    (7.11)

where d(x, y) is the geographic distance between latitude/longitude coordinates x and y, and we assume knowledge of the landmarks' GPS-based latitude/longitude coordinates Y = {y_1, y_2, ..., y_T}. Given the possibility of missing or targeted measurements, work on Weighted MDS [118, 119, 120] includes the weight term w_{i,j} indicating the confidence in the pairwise distance estimate D̂_{i,j}. This modifies the stress optimization to the weighted sum of squared errors between the embedded distances and the observed distances,

    X̂ = arg min_X Σ_{i=1}^{N} Σ_{j=1}^{T} w_{i,j} ( D̂_{i,j} − d(x_i, y_j) )²    (7.12)

170 146 and landmarks and we wish to embed these points in some lower dimensional space, but a space where objects are only mapped to a small subset of places (e.g., the set of landmark locations). To enforce this restriction, a sparse penalty (λd (x i y j ), for some constant λ > 0) is added penalizing embedding points very far from the established landmarks with known location. X = min X N M i=1 j=1 ( ) w i,j (D i,j d (x i,y j )) 2 + λd (x i,y j ) Given prior work on geographic clustering of end hosts [61], consider this sparse embedding penalty to be constraining end hosts to areas of high population density in the geography (where our landmarks are likely to be located). In contrast to previous work [4], this is performed without the need to a priori know the population density of the geography of interest. The formulation of the stress function with a sparse penalty gives rise to the following lemma, Lemma (Sparse Embedding Algorithm Lemma) Using the stress function in Equation 7.13, the global minimum will be found at landmark ŷ {y 1,y 2,...,y M } only if for all possible coordinates x G, λ S W (ŷ) S W (x) Mj=1 d (x,y j ) M j=1 d (ŷ,y j ) (7.13) Where S W (t) = M j=1 w j (t) (D j (t) d (t,y j )) 2, D i (t) is the estimated distance from latitude/longitude coordinate t to landmark j, and w j (t) is the estimated weight for the distance estimate from latitude/longitude coordinate t to landmark j. Proof. Given the sparse stress function minimization function in Equation 7.13, the landmark ŷ will have minimum sparse stress for all possible latitude/longitude coordinates (and therefore be the correct embedding point) if, M M S W (x) + λ d (x,y j ) S W (ŷ) + λ d (ŷ,y j ) j=1 j=1 For any GPS coordinate x G.

Trivially, we can rearrange the terms on both sides of the inequality,

    λ Σ_{j=1}^{M} d(x, y_j) − λ Σ_{j=1}^{M} d(ŷ, y_j) ≥ S_W(ŷ) − S_W(x)

Therefore,

    λ ≥ ( S_W(ŷ) − S_W(x) ) / ( Σ_{j=1}^{M} d(x, y_j) − Σ_{j=1}^{M} d(ŷ, y_j) )

Using Lemma 7.6.1, observe that the sparse penalty λ acts as a control on how often an end host is embedded at a location from our set of landmarks. If λ is too large, end hosts will be mapped to the landmarks very often, potentially ignoring valuable distance information. If λ is too small, the algorithm will be too reliant on potentially inaccurate distance measurements, ignoring the information gained by considering the geographic placement of the landmarks. Due to the focus on latitude/longitude coordinates, our distance metric is non-Euclidean, and the minimization is therefore difficult to solve in closed form. This is in contrast with prior Multidimensional Scaling methods. Here, the coordinates that minimize the sparse stress function in Equation 7.13 are found by a simple grid search over the set of feasible latitude/longitude coordinates G. The computational complexity of embedding each end host is O(GM), where G is the number of feasible latitude/longitude coordinates considered and M is the total number of landmarks in the network, making the computational complexity of our algorithm only linear in the number of landmarks considered. This distinction is important, as it shows that the computational complexity of the algorithm will not unreasonably increase as the number of landmarks grows.
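A minimal sketch of the grid-search minimization of the sparse stress in Equation 7.13 for a single end host, using a great-circle distance as the geographic distance d(x, y); the helper names are illustrative and the feasible set is assumed to be precomputed.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8

def great_circle(p, q):
    """Great-circle distance (miles) between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (p[0], p[1], q[0], q[1]))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

def sparse_embed(feasible_coords, landmark_coords, est_dists, weights, lam):
    """Grid search for the coordinate minimizing the sparse stress (Eq. 7.13)
    for a single end host.

    feasible_coords : list of candidate (lat, lon) points (the set G_i)
    landmark_coords : list of landmark (lat, lon) points
    est_dists, weights : per-landmark distance estimates D-hat and weights W
    lam : sparse penalty encouraging embeddings at/near landmark locations
    """
    best, best_stress = None, np.inf
    for x in feasible_coords:
        stress = 0.0
        for y, d_hat, w in zip(landmark_coords, est_dists, weights):
            d = great_circle(x, y)
            stress += w * (d_hat - d) ** 2 + lam * d
        if stress < best_stress:
            best, best_stress = x, stress
    return best, best_stress
```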

7.7 PinPoint IP Geolocation Algorithm

By combining the hop mapping procedure of Section 7.4, the targeted latency technique of Section 7.5, and the Sparse Embedding methodology of Section 7.6, we create the full PinPoint algorithm. One unaddressed issue is the choice of the feasible latitude/longitude coordinates G. Because the sparse embedding algorithm pushes end hosts towards landmark locations, there is a possibility that the algorithm may embed an end host in a location that is infeasible given the observed latency measurements (e.g., a location that is impossible due to speed-of-light delays). To counteract this issue, we incorporate a conservative constraint-based procedure based on the methodology from [9] to determine the set of feasible coordinates G. In contrast to the Constraint-Based Geolocation procedure in [9], PinPoint simply uses the intersection of speed-of-light constraints derived from the observed latency measurements to determine the feasible set of coordinates.

In addition to the constraint methodology, for each end host we also consider multiple estimates of the latency-to-distance inference. While we could rely solely on Equation 7.9 using the hop-mapped landmark c_i, performing this inference with respect to multiple landmarks in the vicinity of c_i may increase the geolocation accuracy and provide confidence bounds on the end host's location. This is a form of Bootstrap Estimation [57], where the algorithm is repeatedly run with new input resampled from the probability distributions and the final answer is an average aggregate of the computed responses. After estimating the geolocation of an end host with respect to B landmarks, the average aggregate location is returned as PinPoint's final estimated end host geolocation.

The complete PinPoint IP geolocation methodology is summarized in Algorithm 11. The algorithm's complexity in terms of both probing and computation can be seen in Table 7.5. All of these terms are linear, and therefore there is little penalty for increasing either the number of landmarks or the probing budget.

Table 7.5: PinPoint algorithm complexity for both probing and computation, where N is the number of end hosts, K is the probing budget, T is the number of monitors, M is the number of landmarks, B is the number of bootstrap iterations, and G is the number of feasible geolocation points.
    Type                             Complexity Order
    Hop Counts Required              O(TN + TM)
    Latency Measurements Required    O(KN)
    Computational Complexity         O(BNGM)

Algorithm 11 - PinPoint IP Geolocation Algorithm

Given:
- Set of N end hosts with unknown geolocation, X = {x_1, x_2, ..., x_N}.
- Set of T landmarks with known geolocation (e.g., a set of NTP or DNS servers).
- Set of M monitor nodes (e.g., PlanetLab nodes).
- Training set of end hosts with known geolocation and latency measurements to the landmarks.
- Latency probing budget K.
- Number of bootstrap iterations B.

Initialize:
- Use cross validation to find the optimal values of φ (the latency weight coefficient) and λ (the sparse embedding penalty) with respect to the training set geolocation error rate.
- Measure the hop counts from every monitor to the set of end hosts and landmarks.
- From the training set, find the estimated cumulative distributions {L_{l_1}, L_{l_2}, ..., L_{l_N}}.

Methodology:
For each end host i = {1, 2, ..., N}:
1. Determine the feasible set of latitude/longitude coordinates G_i using the conservative speed-of-light constraint region methodology.
2. Using hop counts and Equation 7.7, assign end host i to the landmark it is closest to in the network topology, c_i.
3. For each bootstrap iteration b = {1, 2, ..., B}:
   (a) Choose the b-th closest landmark to c_i, c_i^(b) = Γ_{c_i}(b).
   (b) Find Γ_{c_i^(b)}, the set of K landmarks geographically closest to landmark c_i^(b).
   (c) Measure the round-trip time latency between end host i and the set of landmarks Γ_{c_i^(b)}. Using the cumulative probability estimates {L_{l_1}, L_{l_2}, ..., L_{l_N}} and Equation 7.9, construct the estimated distance matrix D̂.
   (d) Find the weight matrix W using parameter φ.
   (e) Find the estimated latitude/longitude coordinates x̂_i^(b) for end host i by minimizing the sparse stress function in Equation 7.13 with sparse penalty value λ and feasible coordinates G_i.
4. Find the final estimated location of end host i, x̂_i, as the average latitude/longitude coordinate of the B bootstrap iterations {x̂_i^(1), x̂_i^(2), ..., x̂_i^(B)}.
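Step 4 reduces the B per-bootstrap estimates to a single location; below is a minimal sketch of that aggregation, assuming a plain arithmetic mean of latitude/longitude (adequate for targets away from the antimeridian).

```python
import numpy as np

def aggregate_bootstrap(estimates):
    """Average B bootstrap (lat, lon) estimates into a final location."""
    est = np.asarray(estimates, dtype=float)
    return tuple(est.mean(axis=0))
```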

174 Experiments Comparison Methodologies GeoPing Algorithm One of the first IP geolocation methodologies was the GeoPing techniques from [41]. This technique uses a series of latency measurements to a set of landmarks from an end host, and then maps that end host s geolocation to the landmark that has the shortest observed latency value. We expect GeoPing to work well in our evaluation in instances where the monitors are quite near targets. However, in instances where monitors are not near targets GeoPing s accuracy will decline and the strength of our sparse embedding method will be highlighted. Constraint-Based Geolocation Algorithm To generate CBG geolocation estimates, we implemented the algorithm described in [9]. CBG is the current state-of-the-art IP geolocation methodology using only ping-based measurements. The basic intuition behind CBG is that each latency measurement to a set of landmarks with known location can be considered a series of constraints, where given speed-of-light in fiber assumptions and self-calibration using a set of training data, we can determine a feasible geographic region given each latency measurement. Given a series of latency measurements, the possible geographic placement is considered the intersection of many constraint regions, with the estimated location behind the centroid of this intersection region. Octant Geolocation Algorithm Building off the Constraint-based Geolocation approach [9], the Octant algorithm [4] is the current state-of-the-art measurement-based geolocation methodology. Novel components in the Octant framework include the use of both positive and negative constraints from latency measurements, Bezier curves to determine feasible regions, and the iterative refinement of the feasible constraint region. In contrast to the PinPoint algorithm, Octant requires the use of both pingbased measurements to the end hosts and given geographic information from undns [2] of routers

along the path to the end hosts. In our experiments, this information takes the form of undns-derived geographic information for the last hop router encountered along the path before the end host. For our commercial set of 431 nodes, only 71 nodes had last hop undns information available down to the city level. To compensate for this lack of undns data, these 71 nodes will always be classified as end hosts (in order to enhance Octant geolocation accuracy). Additionally, to give the Octant methodology every opportunity, this undns information will not be made available to the PinPoint framework. Similar to the PinPoint experiments, the numerous tuning parameters in the Octant algorithm will be trained by minimizing the geolocation error on the training set.

Commercial Geolocation Methodologies

We will also compare geolocation accuracy with both the Maxmind database [16] and the IP2Location database [17]. Both of these databases are commercially available IP lookup packages. Unfortunately, due to the commercial nature of both products, the methodology used for geolocation is not known.

NBgeo Experiments

To test and evaluate the capabilities of this initial instance of our NBgeo geolocation approach, we consider geographic partitioning at the level of counties in the continental United States. (Finer-grained partitioning on the order of zip codes or city blocks is certainly feasible in our framework, but county-level was selected due to the availability of data for test and evaluation.) While considerable Internet topology lies outside the continental United States, the readily available population data for counties in the continental United States from Census Data [18] motivates initially focusing on this geographic region. The initial validation on this dataset will motivate future work on end hosts located outside the United States. Given our dataset of 431 commercial end hosts with known geolocation, we divide the end hosts into a set of 20 monitors and 411 target IPs (chosen at random). Using 5-Fold Cross Validation [60], we test the performance of the methodology five times, using 20% of the target IPs as our training set (to train the parameters of our NBgeo algorithm, λ_hop, λ_pop, γ_lat, γ_hop), leaving the

remaining 80% of the target IPs to test the accuracy of our methodology. The results presented are the aggregate results of all five cross validation tests. To assess performance of the NBgeo geolocation algorithm, we consider the error distance to be the distance in miles between the centroid of our estimated county and the centroid of the ground truth county. The mean and median performance for both our new NBgeo algorithm and the competing methodologies can be seen in Table 7.6. For this dataset, the performance of NBgeo shows considerable improvements over both the constraint-based (CBG) methodology and the GeoPing algorithm, with median performance 50 miles closer than CBG and over 118 miles closer than GeoPing. Without constraining the NBgeo algorithm to the CBG regions, there is still a significant performance gain over the GeoPing algorithm and a small gain over CBG. Performance of our learning-based NBgeo framework and the CBG method with respect to the empirical cumulative probability can be seen in Figure 7.8-(left). As seen in the figure, the geolocation estimates produced by our learning-based framework are more accurate than CBG for over 76% of our target IPs.

Figure 7.8: Empirical cumulative probability of error distance for both NBgeo with constraint information and the CBG method.

The Impact of Additional Information

To analyze the impact of using multiple features in our learning-based framework, we generate geolocation estimates when population density information is removed (setting the weight of the population density feature to zero, λ_pop = 0) and when hop count information is removed (setting

the weight of using the hop count data to zero, λ_hop = 0). The results of these experiments can be seen in Table 7.7. These two conditions resulted in an average error distance of and miles, for missing population data and missing hop count data respectively. These results indicate that both the hop count data and the population density information significantly contribute to the improved performance of the methodology. Finally, removing both hop count and population data, the mean geolocation error increases to miles on average for each target IP, almost 25% higher than the standard NBgeo algorithm with both features included.

Table 7.6: The geolocation error for all geolocation methodologies using latency data from all landmarks (error distance in miles).

  Methodology                               Mean Error (in miles)   Median Error (in miles)
  NBgeo
  NBgeo w/o Constraint-Based Restriction
  GeoPing
  Constraint-Based
  IP2Location
  MaxMind

Table 7.7: The performance of the NBgeo algorithm given additional data (error distance in miles).

  Methodology                                          Mean Error (in miles)   Median Error (in miles)
  NBgeo
  NBgeo w/o Population Data
  NBgeo w/o Hop Count Data
  NBgeo w/o both Population Data and Hop Count Data

PinPoint Experiments

We now consider the geolocation problem where either population data is unavailable or a natural partitioning of the geographic space does not exist. Using the dataset of commercial nodes, we evaluate the performance of our new PinPoint algorithm with the addition of landmark nodes, for which we are given the geographic location and have the ability to ping to/from. We partition the set of

total commercial nodes into 20 monitor nodes, 200 landmark nodes, and the remaining 211 nodes as end hosts, whose geographic locations we wish to estimate. To properly quantify the performance of the PinPoint algorithm, we again perform 5-way cross validation. This entails partitioning the end hosts into five sets, and repeatedly training the algorithm parameters (the sparse embedding parameter λ and the latency weight parameter φ) using one of the five sets while testing performance on the remaining four sets (held out of the training stage). The algorithm performance is then evaluated across an aggregate of the five experiments.

Complete Landmark Probing Results

The first experiment geolocates using latency measurements to all of the landmarks (where the probing budget for the PinPoint algorithm is K = T, the number of landmarks). We compare results of the PinPoint algorithm with another latency-based geolocation method (GeoPing), the state-of-the-art latency-based methodology enhanced with last hop undns information (Octant), and two commercially available methods (Maxmind and IP2Location). (While these methods are agnostic to our choice of landmarks, the end hosts under consideration are different between the two tests, thus resulting in performance changes.)

Table 7.8: The geolocation error (in miles) for all geolocation methodologies using latency data from all landmarks (for number of landmarks, T = 50, 200).

  Methodology    T = 50: Mean Error    T = 50: Median Error    T = 200: Mean Error    T = 200: Median Error
  PinPoint
  GeoPing
  Octant
  IP2Location
  MaxMind

The results in Table 7.8 show the improvements of the PinPoint methodology over all the other existing geolocation methodologies. For the case where the number of landmarks T = 50, PinPoint results in both the lowest mean and median error compared with the two other measurement-based geolocation algorithms (GeoPing and Octant). When compared against the commercial databases, PinPoint has significantly

lower mean error (roughly 350 miles less mean error than the best competing commercial method, Maxmind). When increasing the number of landmarks to T = 200, PinPoint maintains the lowest mean error rate of any of the geolocation algorithms, with a 14 mile improvement over the next competing methodology (GeoPing). We validated our Octant implementation by training and testing using the same data. In those highly idealized tests, Octant returned a mean error of miles and a median error of 20.13 miles, which is the best possible behavior for that algorithm. It is important to note that this result could never be replicated in a live deployment, since target node positions are not available a priori, and indeed if they were, there would be no need to perform geolocation. When restricted to the realistic measurement environment used to test all other geolocation algorithms, the performance of the Octant algorithm is significantly worse than the PinPoint methodology, with Octant's mean error over 70 miles larger than PinPoint's in the case where the number of landmarks T = 200. One concern might be that not all of the end hosts in our test set have last hop undns information, and that this biases Octant's geolocation results. While we feel that the level of available undns information for the experiments in Table 7.8 is representative of a realistic measurement environment, an additional experiment was performed restricting only to end hosts with available last hop undns information. For the case of T = 200, the resulting geolocation performance of Octant improved to a mean error of miles and median error of miles. Meanwhile, for this same set of end hosts, PinPoint still outperforms Octant, with mean error of miles and median error of miles. Therefore, PinPoint is over 50 miles closer in terms of mean error and over 10 miles closer in terms of median error when compared against Octant for this restricted set.

Latency Probing Budget Results

While the performance of PinPoint has been found to improve over all previous methodologies in the case of a large probing budget, an even more impressive regime for the PinPoint algorithm is where we have only a limited probing budget for each end host. Using the hop mapping methodology from Section 7.4, PinPoint intelligently chooses a subset of landmarks from the 200

possible landmarks. For comparison with both Octant and GeoPing, which do not target selected landmarks, a random selection of landmarks is chosen from the set of T = 200 landmarks for the latency measurements. The results for the Octant algorithm will include the last hop undns data that neither GeoPing nor PinPoint have access to. Using cross validation to train the parameters of both PinPoint and Octant, the median geolocation error in miles can be seen in Figure 7.9 for a range of probing budget values (K). For very small probing budgets (K ≤ 10 latency probes per end host), our new PinPoint algorithm does significantly better than the two competing methodologies with respect to the median error metric. With only 4 available latency probes for each end host, our new PinPoint algorithm has median geolocation error 200 miles closer than the current state-of-the-art Octant geolocation method. Even as the size of the probing budget grows, the median error results show that the PinPoint algorithm has consistently better performance than the competing methodologies. This motivates the application of PinPoint in lightweight regimes where the number of available latency measurements is limited.

Figure 7.9: Median geolocation error (in miles) given a limited probing budget (T = 200).

In terms of the distribution of geolocation errors, the results in Figure 7.10 show that for a limited probing budget K = 20 (while the number of landmarks T = 200), our new PinPoint algorithm geolocates roughly 90% of the end hosts with higher accuracy than the Octant algorithm.
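To illustrate the difference between the two landmark-selection strategies compared above, the sketch below contrasts random selection (used for the GeoPing and Octant baselines) with a hop-count-guided selection in the spirit of the Section 7.4 mapping. It is a hedged illustration, not the dissertation's code: hop_count_to is a hypothetical helper returning a node's monitor-observed hop-count vector, and "closeness" is measured here by a simple vector distance rather than by Equation 7.7 itself.

import random
import numpy as np

def random_budget(landmarks, K, seed=0):
    # Baseline used for GeoPing and Octant: K landmarks chosen uniformly at random.
    rng = random.Random(seed)
    return rng.sample(list(landmarks), K)

def hop_guided_budget(host, landmarks, K, hop_count_to):
    # PinPoint-style selection (sketch): rank landmarks by how similar their
    # monitor-observed hop-count vectors are to the end host's, then keep the top K.
    h = np.asarray(hop_count_to(host))
    dists = {lm: np.linalg.norm(h - np.asarray(hop_count_to(lm))) for lm in landmarks}
    return sorted(landmarks, key=lambda lm: dists[lm])[:K]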

Figure 7.10: Cumulative distribution of geolocation error for both the PinPoint and Octant algorithms (K = 20, T = 200).

PinPoint Component Performance

We state that each component of the PinPoint algorithm is critical to increasing our geolocation accuracy. To test this, we examine the performance of the PinPoint geolocation algorithm with selected components removed. To test the increased performance resulting from our bootstrapping methodology, we consider the case where all the bootstrapping components are removed. This consists of removing both the novel distribution sampling methodology of Equation 7.9 and the coordinate averaging over the multiple bootstrap estimates. Instead, distances to landmarks are derived a single time using the mean of each distance cumulative distribution (resolving d_{i,j} such that L_{l_i,j}(d_{i,j}) = 1/2). In a separate experiment, we examine the improvement that our exponential weighting of the distance measurements brings to geolocation accuracy. Instead of using the observed latency values to infer the length of the routing path (where short latency observations imply shorter paths and thus are weighted more), all paths are weighted equally. To test this configuration, we again run PinPoint, modifying the weight array defined in Equation 7.10 to W^one, such that

    w^one_{i,j} = 1   for all i, j                                    (7.14)

This is the case where the parameter φ = 0. As seen in Figure 7.11, for every end host under consideration both the bootstrap estimation

methodology and exponential latency weighting improve geolocation accuracy.

Figure 7.11: Cumulative distribution of geolocation error for PinPoint with the bootstrap estimation and exponential latency weighting improvements removed (K = 20, T = 200).

Bootstrap Confidence Bounds Results

In addition to improving the accuracy of our location estimates, the bootstrap methodology also returns geographic confidence bounds for the placement of each end host. With the number of bootstrap iterations B = 20, these empirical bootstrap confidence bounds give rise to the geographic area in which we are 95% confident the end host is located. This consists of finding the distance of the second farthest bootstrap estimate from the aggregated bootstrap location average (i.e., the size of the region in which 95%, or 19/20, of our bootstrap estimates are located). The size of this confidence region for each end host can be considered a metric for how confident we are in our geolocation estimate. For end hosts with very large confidence bounds we are relatively unsure of their geographic location, while if the confidence bounds are small then we are fairly confident our geolocation estimate is accurate. Dividing the end hosts into quintile sets, with the first quintile containing the 20% of end hosts we are most confident in (down to the fifth quintile containing the 20% of end hosts we are least confident in), the cumulative distribution of geolocation errors can be seen in Figure 7.12 for the experiment where the probing budget K = 20 and the number of landmarks T = 200. As seen in the figure, our bootstrap confidence value is directly correlated with the accuracy of our geolocation, as the most confident (first) quintile has the highest geolocation accuracy, while the fifth (least confident) quintile has the least accurate geolocation performance.
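A small sketch of the confidence-region computation described above, under the stated assumptions: with B = 20 bootstrap coordinate estimates, the radius of the 95% empirical confidence region is the distance from the averaged location to the second farthest estimate, so that 19 of the 20 estimates fall inside it. The great-circle distance helper below is a standard haversine formula and is not taken from the dissertation.

import numpy as np

def haversine_miles(p, q):
    # Great-circle distance (in miles) between two (lat, lon) points given in degrees.
    lat1, lon1, lat2, lon2 = map(np.radians, (p[0], p[1], q[0], q[1]))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * np.arcsin(np.sqrt(a))   # 3958.8 miles ~ Earth's radius

def bootstrap_confidence_radius(bootstrap_estimates):
    # bootstrap_estimates: array of shape (B, 2) holding one (lat, lon) per iteration.
    center = np.mean(bootstrap_estimates, axis=0)            # aggregated bootstrap location average
    dists = sorted(haversine_miles(center, est) for est in bootstrap_estimates)
    return center, dists[-2]                                 # second farthest: 19/20 estimates inside

# End hosts can then be ranked (e.g., into quintiles) by this radius, with smaller
# radii indicating geolocation estimates we are more confident in.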

Figure 7.12: Cumulative distribution of geolocation error for confidence quintiles derived from the 95% bootstrap confidence interval size (K = 20, T = 200).

In addition to partitioning the end hosts into quintile sets, we can divide the set of end hosts in half, with the most confident end hosts in one set and the least confident end hosts in the other. For the case where K = 20 and T = 200, we geolocate the most confident half with mean error miles, while the least confident half has mean error of miles. These results show that, without any prior information about the end hosts' locations, our bootstrap confidence is directly related to the accuracy of our geolocation estimates.

7.9 Summary

The goal of this chapter was to improve the accuracy of estimates of the geographic location of nodes in the Internet. Our work is based on the hypothesis that the ability to zero in on the geolocation of nodes is improved by considering a potentially broad set of features, including both active measurements and more static societal characteristics associated with locations, specifically population data. To consider this hypothesis, we introduced two algorithms. The first, the NBgeo algorithm, is a learning-based framework using explicitly defined population data over a partitioned geographic location space. The second, the PinPoint algorithm, uses the combination of a novel sparse embedding algorithm and population information implicitly inferred from a set of given landmarks with known geolocation. We assess the capabilities of both algorithms by using a data set of hop count and latency

measurements collected from hundreds of hosts in the Internet with precisely known geographic coordinates. Our experiments show that the NBgeo algorithm estimated geographic location with average error of miles and median error of miles. In terms of median error, this is an improvement of 50 miles over the next competing measurement-based geolocation methodology. We assess the capabilities of PinPoint using a data set of hop count and latency measurements collected from hundreds of hosts in the Internet with precisely known geographic coordinates. Our results show that PinPoint is able to identify the geographic location of target hosts with a median error of only 38 miles. We compare this with the current state-of-the-art methodology, which produces geolocation estimates with median errors of 91 miles. We also compare PinPoint's estimates against two commercial IP geolocation services, which produce mean error estimates that are nearly a factor of 4 higher than PinPoint's. These results highlight the powerful capabilities of our approach.

Chapter 8

Model-based Anomaly Detection

Networks are complex, dynamic, and subject to external factors outside of their operators' control. Network operators must therefore vigilantly monitor their networks for faults and other events that could jeopardize their contractual commitments to customers. The problem is relatively easy when the type of fault is well understood (e.g., link failures). There are standard protocols for alerting an operator to such faults, and although extending these methods, particularly in the context of security, is an ongoing effort, they are not the specific focus of this chapter. Instead, this chapter considers unforeseen faults. These faults are intrinsically more challenging to detect because we do not know a priori what we are looking for. These faults often manifest in unusual measurements that are commonly referred to as anomalies. Being able to find anomalies, and to use them to diagnose network problems quickly and effectively, would significantly enhance network operations. Developing a framework for effective and practical anomaly detection is the objective of our work. A large number of studies over the past decade have focused on developing methods to detect anomalous events in networks. The typical approach begins by measuring network traffic (e.g., flow-export records) and then establishing a profile for normal behavior. Next, a method for detecting deviations from normality is applied. Most prior studies have largely taken a one-size-fits-all approach that has ultimately resulted in problems with accuracy and false alarm rate. It is critically important in any anomaly detection system to have a very low false alarm rate.

False alarms waste operator time and discredit results, leading to a "cry wolf" syndrome, where the anomaly detection system is quickly ignored. Most existing systems suffer from unduly high false-alarm rates. This is exacerbated by anomalies polluting the data used to determine the normal profile. In this chapter, we seek to improve the accuracy of network event detection to the point where it becomes an effective tool for network operators. To approach this problem of anomaly detection, we introduce the BasisDetect framework. The primary intuition behind the BasisDetect framework is that both normal traffic and anomalies have features that we can model and exploit for the purpose of automated detection. For instance, it is well known that traffic has strong diurnal and weekly cycles. Our hypothesis is that by considering traffic as a superposition of waveforms and then breaking these down into their component parts, we can build detection models that offer the opportunity to separate bundles of energy that can be semantically divided into normal and anomalous traffic. The BasisDetect framework is divided into three components. The first step learns potential anomaly signal features from a small set of labeled network data provided to the algorithm. The second step uses a novel basis pursuit methodology to simultaneously decompose traffic into components of both non-anomalous behavior, representing expected network traffic, and anomalous behavior learned from the previous step. This simultaneous estimation avoids the problem of anomalies polluting our normal profile data. The final step of the algorithm exploits known network structure to intelligently merge together the detected anomalous behavior using state-of-the-art statistical techniques. Further objectives of our framework include developing an anomaly detection method that can be applied (i) to different data types, since critical anomalies may be entirely invisible in some data, and (ii) in both a single node and a network-wide context. Prior work has typically fallen into one or the other category due to detection methods that are primarily spatial or temporal. Our initial signal decomposition approach is temporal, and we then combine anomalies across the network using a higher-level reasoning framework. This combined, best-of-both-worlds approach offers a significant opportunity to improve detection accuracy. Intuitively, we treat network-wide detection as a data fusion problem, where one can significantly reduce false alarms through the use of multiple

time-series. It has the secondary advantage that it naturally incorporates different data types, without the need for strong relationships between the different time series such as is required, for example, by PCA [74]. We use both synthetic and real-world data to rigorously assess the capabilities of our model-based detection methodology. The first part of our evaluation considers NetFlow data collected at a single router along with a set of labeled anomalies that include DoS attacks, outages, scans, etc. We isolate a subset of the anomalies in the data and then apply the BasisDetect framework to learn anomaly models using a combination of signal components that isolate key elements of the events. We find that our BasisDetect methodology identifies all the labeled anomalies with a 50% improvement in the false alarm rate when compared with the best competing methodology. Next, we use a set of carefully generated synthetic data to assess the sensitivity of our model-based detection methodology. The data is designed to capture the key low and high frequency and spatial characteristics of non-anomalous traffic flows in a network-wide setting. We insert simple volume anomalies into this data and modulate the relative amplitude and frequency of these anomalies versus the non-anomalous traffic in order to assess sensitivity. While this synthetic data is not as rich as measurements collected in situ, we argue that it provides a powerful and meaningful starting point for assessing detection sensitivity. The results of our analysis show that the BasisDetect methodology detects all of the injected anomalies with a false alarm rate over 65% lower than the current state-of-the-art network-wide anomaly detection methodology. Finally, we consider a set of Internet2 byte count data collected simultaneously across 11 PoPs. While this dataset does not have labeled anomalous events, we can compare the ability of the BasisDetect methodology and a state-of-the-art distributed method [20] to detect the most dominant anomalies found by the standard PCA [74] anomaly detection methodology. Our results show that the BasisDetect method identifies the PCA anomaly locations with 40% fewer false alarms than the competing state-of-the-art network-wide anomaly detection method. We believe that these results, along with the results from the single node and network-wide labeled data sets, make a strong case for the utility of our model-based approach.

Anomaly Datasets

We use three different data sets to evaluate our model-based detection methodology. The intent of our analysis is to assess the capability of our approach as thoroughly as possible. To that end, we use empirical data sets for both a single node setting with labeled anomalies and a network-wide setting without labeled anomalies. We also use a synthetic data set in which we can precisely control both the normal and anomalous traffic in order to carefully assess the sensitivity of our method. Each of the data sets is described in detail below.

Synthetic Traffic Data

In order to accurately test anomaly detection algorithms, we need to be able to simulate reasonable datasets in a controlled way. Ringberg et al. [121] explain in detail why simulation must be used for accurate comparisons of anomaly detection techniques. In brief, the reasons are: (i) accurate and complete ground truth information is needed to form both false-alarm and detection probability estimates; (ii) many more results are needed (than one can obtain from any realistic real dataset) to form accurate estimates of probabilities; and (iii) simulation allows one to vary parameters (say, the anomaly size) in a controlled way in order to see the effect this has on anomaly detection. Our approach to simulation is intended to highlight the features of the different techniques. We make no claim that the simulation is completely realistic, only that it clearly illustrates the properties of the different anomaly detection techniques. The simulations used here were generated in a similar manner to those in [122]. In particular, a spatial traffic matrix is generated using a gravity model and then extended into the temporal domain using a matrix product with a simple periodic signal. The resulting traffic is then augmented with Gaussian noise whose variance is proportional to the traffic mean. (A small code sketch of this generation procedure follows the dataset descriptions below.) The only differences with the previous study are that (i) we consider a range of network sizes, and (ii) we consider a range of anomaly lengths. We should stress that the goal of these simulations is not to produce the most realistic test possible for the algorithms (that will be accomplished later using real data). However, the simulations allow us to obtain exact quantitative comparisons of algorithms in completely controlled

circumstances, so we can explore the properties of the different approaches.

GEANT Data

The second set of data is a collection of time-series data obtained from a GEANT network backbone router [123] located in Vienna, Austria. Collection of data began on January 14th, 2009 and ended on February 26th, 2009, for a total of 43 days of data acquisition. The dataset contains packet counts, byte counts, and IP entropy measured along this single link, extracted using Juniper J-Flow records and sampled in aggregation bins of 1 minute, for a total of 61,440 data samples observed. This dataset contains labeled anomalies, including Denial of Service (DoS) attacks, portscan events, and Distributed Denial of Service (DDoS) attacks. These events were found, validated, and annotated by network engineers.¹ The limitation of this single-link time-series data is that it cannot show the power of strictly network-wide techniques (such as PCA or the Distributed Spatial method). Although we are limited in the comparison methodologies available for this dataset, the single link information has the advantage that a great deal of effort has gone into classifying the anomalies in this data, so that we are closer to having ground truth than in almost any other setting.

Abilene Real-World Data

The final set of data consists of byte counts recorded from the Abilene Internet2 backbone network.² Across 11 PoPs in the continental United States with 41 network links, byte counts were sampled into 10 minute time intervals from April 7th, 2003 to April 13th, 2003, resulting in 1008 byte count samples across all 41 links. Unfortunately, this dataset is completely unlabeled, with no prior annotation of possible anomaly locations. To compensate for this deficiency in the dataset, we will use this real-world network data to study how the new BasisDetect framework detects anomalies that are found by previous network-wide anomaly detection algorithms.

¹ We thank Fernando Silveira from Thompson Research for supplying us with this dataset.
² We thank Mark Crovella for supplying us with this dataset.
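As promised above, the following Python sketch illustrates the kind of synthetic traffic generation described in the Synthetic Traffic Data section: a gravity-model spatial traffic matrix, a temporal extension via a simple periodic signal, and Gaussian noise with variance proportional to the traffic mean. The specific node weights, period, and noise scaling are illustrative assumptions, not the parameters used in [122] or in this dissertation.

import numpy as np

def synthetic_traffic(n_nodes=8, n_bins=2016, period=144, noise_scale=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Gravity model: traffic between i and j proportional to the product of node weights.
    weights = rng.uniform(1.0, 10.0, size=n_nodes)
    spatial = np.outer(weights, weights) / weights.sum()        # (n_nodes, n_nodes) mean rates
    flows = spatial.reshape(-1, 1)                              # one row per origin-destination flow
    # Temporal extension: outer product of the flow rates with a positive periodic signal.
    t = np.arange(n_bins)
    diurnal = 1.0 + 0.5 * np.sin(2 * np.pi * t / period)        # diurnal-like cycle, shape (n_bins,)
    traffic = flows * diurnal                                   # (num_flows, n_bins)
    # Gaussian noise with variance proportional to the traffic mean.
    traffic += rng.normal(0.0, np.sqrt(noise_scale * traffic.mean()), size=traffic.shape)
    return np.clip(traffic, 0.0, None)

# A volume anomaly can then be injected by adding a burst of a chosen amplitude and
# duration to selected flows, which is how detection sensitivity is modulated in the evaluation.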

BasisDetect Overview

Our automated BasisDetect framework for detecting network anomalies is divided into three distinct components. Practically speaking, these components are predicated on having a small initial set of labeled network data from which anomaly characteristics can be learned and against which the algorithm parameters are optimized. The components of the BasisDetect framework are:

1. Dictionary Construction from Labeled Set - Using a training set of labeled anomalies, we extract signal characteristics that have been pre-established as anomalous.

2. Anomaly Decomposition using Penalized Basis Pursuit - Using our novel Penalized Basis Pursuit methodology and the learned anomaly signal atoms from the previous step, the BasisDetect methodology extracts anomaly energy from temporal network data for each link observed in the network.

3. Network-wide Data Fusion - Using knowledge of the network topology structure, the estimated anomaly energy for each link is fused using a False Discovery Rate methodology to extract the probability of an anomaly occurring.

A visual description of the BasisDetect framework can be seen in Figure 8.1.

Figure 8.1: The BasisDetect Framework
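To make the decomposition step concrete, here is a hedged sketch of how traffic on a single link could be split into non-anomalous and anomalous energy using a dictionary of smooth periodic atoms plus learned anomaly atoms. The actual Penalized Basis Pursuit formulation of this chapter is not reproduced here; as a stand-in, the sketch uses an ordinary Lasso from scikit-learn with a single sparsity penalty, and the Fourier dictionary and anomaly atoms are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Lasso

def decompose_link(signal, anomaly_atoms, n_harmonics=4, alpha=0.1):
    # signal: one link's time series of length T; anomaly_atoms: array of shape (T, n_atoms)
    # holding anomaly signal atoms learned from the labeled set (step 1).
    T = len(signal)
    t = np.arange(T)
    # Non-anomalous dictionary: constant plus a few low-frequency sines/cosines
    # (diurnal/weekly-style cycles).
    normal_cols = [np.ones(T)]
    for k in range(1, n_harmonics + 1):
        normal_cols.append(np.sin(2 * np.pi * k * t / T))
        normal_cols.append(np.cos(2 * np.pi * k * t / T))
    normal_dict = np.column_stack(normal_cols)
    # Full dictionary: normal atoms followed by the learned anomaly atoms.
    full_dict = np.column_stack([normal_dict, anomaly_atoms])
    # Simultaneous sparse decomposition; the real method penalizes anomaly atoms differently.
    fit = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(full_dict, signal)
    n_norm = normal_dict.shape[1]
    normal_part = normal_dict @ fit.coef_[:n_norm]
    anomaly_part = anomaly_atoms @ fit.coef_[n_norm:]
    # Per-time-bin anomaly energy, to be fused across links in step 3.
    return normal_part, anomaly_part, anomaly_part ** 2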


More information

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy

More information

Network Flow Data Fusion GeoSpatial and NetSpatial Data Enhancement

Network Flow Data Fusion GeoSpatial and NetSpatial Data Enhancement Network Flow Data Fusion GeoSpatial and NetSpatial Data Enhancement FloCon 2010 New Orleans, La Carter Bullard QoSient, LLC carter@qosient.com 1 Carter Bullard carter@qosient.com QoSient - Research and

More information

Some Examples of Network Measurements

Some Examples of Network Measurements Some Examples of Network Measurements Example 1 Data: Traceroute measurements Objective: Inferring Internet topology at the router-level Example 2 Data: Traceroute measurements Objective: Inferring Internet

More information

RARP: Reverse Address Resolution Protocol

RARP: Reverse Address Resolution Protocol SFWR 4C03: Computer Networks and Computer Security January 19-22 2004 Lecturer: Kartik Krishnan Lectures 7-9 RARP: Reverse Address Resolution Protocol When a system with a local disk is bootstrapped it

More information

Networking Systems (10102)

Networking Systems (10102) Networking Systems (10102) Rationale Statement: The goal of this course is to help students understand and participate in the significant impact of computer networking in their lives. Virtually any career

More information

Scaling 10Gb/s Clustering at Wire-Speed

Scaling 10Gb/s Clustering at Wire-Speed Scaling 10Gb/s Clustering at Wire-Speed InfiniBand offers cost-effective wire-speed scaling with deterministic performance Mellanox Technologies Inc. 2900 Stender Way, Santa Clara, CA 95054 Tel: 408-970-3400

More information

A Catechistic Method for Traffic Pattern Discovery in MANET

A Catechistic Method for Traffic Pattern Discovery in MANET A Catechistic Method for Traffic Pattern Discovery in MANET R. Saranya 1, R. Santhosh 2 1 PG Scholar, Computer Science and Engineering, Karpagam University, Coimbatore. 2 Assistant Professor, Computer

More information

Limitations of Packet Measurement

Limitations of Packet Measurement Limitations of Packet Measurement Collect and process less information: Only collect packet headers, not payload Ignore single packets (aggregate) Ignore some packets (sampling) Make collection and processing

More information

Practical Issues with Using Network Tomography for Fault Diagnosis

Practical Issues with Using Network Tomography for Fault Diagnosis Practical Issues with Using Network Tomography for Fault Diagnosis Yiyi Huang Georgia Institute of Technology yiyih@cc.gatech.edu Nick Feamster Georgia Institute of Technology feamster@cc.gatech.edu Renata

More information

Network Architecture and Topology

Network Architecture and Topology 1. Introduction 2. Fundamentals and design principles 3. Network architecture and topology 4. Network control and signalling 5. Network components 5.1 links 5.2 switches and routers 6. End systems 7. End-to-end

More information

CS268 Exam Solutions. 1) End-to-End (20 pts)

CS268 Exam Solutions. 1) End-to-End (20 pts) CS268 Exam Solutions General comments: ) If you would like a re-grade, submit in email a complete explanation of why your solution should be re-graded. Quote parts of your solution if necessary. In person

More information

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments Contents List of Figures Foreword Preface xxv xxiii xv Acknowledgments xxix Chapter 1 Fraud: Detection, Prevention, and Analytics! 1 Introduction 2 Fraud! 2 Fraud Detection and Prevention 10 Big Data for

More information

Application of Adaptive Probing for Fault Diagnosis in Computer Networks 1

Application of Adaptive Probing for Fault Diagnosis in Computer Networks 1 Application of Adaptive Probing for Fault Diagnosis in Computer Networks 1 Maitreya Natu Dept. of Computer and Information Sciences University of Delaware, Newark, DE, USA, 19716 Email: natu@cis.udel.edu

More information

CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA

CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA Professor Yang Xiang Network Security and Computing Laboratory (NSCLab) School of Information Technology Deakin University, Melbourne, Australia http://anss.org.au/nsclab

More information

A Survey on Pre-processing and Post-processing Techniques in Data Mining

A Survey on Pre-processing and Post-processing Techniques in Data Mining , pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,

More information