Diss. ETH No
TIK-Schriftenreihe Nr. 140

Novel Techniques for Monitoring Network Traffic at the Flow Level

A dissertation submitted to ETH ZURICH for the degree of Doctor of Sciences

presented by
Eduard Glatz
Dipl. El.-Ing. ETH
born July 15, 1955
citizen of Zurich and Basel

accepted on the recommendation of
Prof. Dr. Bernhard Plattner, examiner
Dr. Xenofontas Dimitropoulos, co-examiner
Prof. Dr. Björn Scheuermann, co-examiner
Dr. Walter Willinger, co-examiner

2013
Abstract

Research in Internet measurement provides us with new ways to understand, operate and improve the Internet. Learning from network traffic data requires a well-chosen set of analysis techniques. We envision a rich toolbox available for this task, and delve into novel techniques and their application on large data sets to extend the choice of analysis schemes. In particular, we focus on traffic data at the network level that is readily available from commercial routers in the form of flow metadata (e.g. NetFlow) to enable analyses of ever growing traffic volumes with low demands on the measurement infrastructure.

This thesis consists of two major parts. In the first part, we explore a promising approach to study unsolicited traffic without the need to reserve unpopulated IP address ranges for this task, as has been done in the past. Our approach is to study one-way traffic in live networks, i.e., packets that never receive a reply. We introduce a novel scheme to classify one-way traffic at the flow level into interpretable classes. We validate this scheme based on a data set that we prepare using all informative details available from packet data (e.g. header and payload contents). We use our classifier to shed light on the composition of one-way traffic, and illustrate how the particular class of Unreachable Services can be used to passively detect network service outages by processing flow-level traffic data only. Moreover, to obtain a comprehensive view of one-way traffic, we conduct a large-scale study covering eight years of traffic data, leading to new insights about the evolution of this exotic piece of traffic over time and space.

In part two, we present novel visualization methods following the well-known adage "A picture is worth a thousand words". In particular, we tackle the problems of how to summarize data to extract the most relevant information from big data sets, and how to visualize this information in an easily interpretable way. We envision a top-down workflow that in a first step identifies possibly hidden patterns in a data set captured from a potentially large network, followed by a second step that involves a closer inspection of the traffic of individual end systems or subnets. Specifically, we use frequent itemset mining to obtain a list of the most relevant patterns from the traffic data of a network, which we then visualize through hypergraphs. Then we make use of a graph representation and a domain-specific summarization scheme, which is based on the characteristics of typical host roles (e.g. client, server, P2P), to provide a quick overview of what roles a host assumes and what applications it runs. We demonstrate the usefulness of our approach by using proof-of-concept implementations in a number of illustrative case studies.
Kurzfassung

Research in the field of Internet traffic data analysis shows us new approaches to understanding, operating and improving the Internet. Gaining new insights from traffic data, however, requires the use of well-chosen analysis techniques. Our goal is to provide a rich set of instruments for this purpose, which is why we investigate new analysis techniques and their application to large data sets in order to develop this set of instruments. In particular, we focus on flow data at the network layer (e.g. NetFlow), which commercial routers readily make available, in order to analyze ever growing traffic volumes with little infrastructure effort.

This dissertation is organized in two main parts. In the first part, we explore a promising approach to analyzing unsolicited traffic without having to reserve an unused IP address range for this purpose, as has been done so far. We study the phenomenon of one-way traffic, i.e., network packets that receive no reply in operational networks. We introduce a novel classification scheme to divide one-way traffic at the flow level into interpretable classes. We validate this scheme using additional detailed information (e.g. header data of all packets, payload contents) that only packet-level traffic data can provide. We use our classifier to make the composition of one-way traffic visible, and illustrate the usefulness of the particular class Unreachable Services for passively detecting service outages based exclusively on flow-level traffic data. In addition, we conduct an extensive study in which we analyze one-way traffic from a period of eight years and gain new insights into the properties of this exotic share of traffic and its evolution over time and space.

In the second part of the thesis, we present new visualization methods, following the adage "A picture is worth a thousand words". In particular, we deal with methods for condensing very large data sets and extracting relevant information, and we develop techniques for interpreting and visualizing such information. Our methodology is a top-down procedure in which, in a first step, the potentially present hidden patterns of a traffic data collection captured in a possibly very large computer network are identified, followed by a second step in which the traffic data of individual end systems or subnets are inspected in detail. Specifically, we use frequent itemset mining to extract a list of the most relevant patterns, which we visualize in the form of hypergraphs. We then use a domain-specific approach to data summarization, based on the characteristics of typical end-system roles (client, server, P2P), to selectively present the network traffic of a single host as an overview in a graph, so that its roles and the applications it runs are immediately recognizable. With proof-of-concept implementations, we demonstrate the usefulness of our approach in illustrative case studies.
Contents

Abstract
Kurzfassung
Contents
List of Figures
List of Tables

INTRODUCTION

1 Network Traffic Monitoring
    Relevance
    Application Areas
    State of the Art
    Data Sets
    Research Problems
    Contributions
2 Traffic Data Visualization
    Relevance
    Application Areas
    State of the Art
    Research Problems
    Contributions
3 Outline
References

PART I: ANALYZING ONE-WAY TRAFFIC

4 Classifying Internet One-way Traffic
    Abstract
    Introduction
    Preliminaries
    Datasets and Sanitization
    Data Sanitization
    One-way Traffic Classification
    Signs
    Classifier
    Validation
    Validation Setup
    Validation Criteria
    Validation Results
    Impact on Flow Classifier
    One-way Traffic Composition
    Service Availability Monitoring
    Methodology
    Outages and Misconfigurations
    Related Work
    IBR Traffic in Network Telescopes
    IBR Traffic in Live Networks
    Network Outages
    Conclusions
    References
5 A First Look Into IBR In A Large Greynet
    Abstract
    Introduction
    Related Work
    IBR Traffic in DarkNets
    IBR Traffic in Live Networks
    Datasets and Sanitization
    NetFlow Data Sanitization
    ITU Internet Penetration Data Sanitization
    Targeted Services and Hosts
    Top Target Ports of IBR Traffic Over 8 Years
    Persistently Targeted Ports Over Time
    Distribution of IBR Traffic Over Ports and Hosts
    Source Characterization
    Spacetime Analysis
    Evolution of Geographical Distribution of IBR Traffic
    Evolution of the Spatial Distribution of IBR Traffic
    Conclusions
    References

PART II: VISUALIZING NETWORK TRAFFIC

6 Visualizing Big Network Traffic Data
    Abstract
    Introduction
    Related Work
    Visualization Scheme
    FIM Visualization
    Scaling to Large Data Sets
    Use Cases
    Use case 1: Traffic Profiling
    Use case 2: Attacks and Misconfigurations
    Conclusions
    References
7 Visualizing Host Traffic through Graphs
    Abstract
    Introduction
    HAPviewer
    Host Traffic Representation
    Host Role Summarization
    Flow Classification and Filtering
    Case Studies
    Situational Awareness
    Analysis of an IDS Alarm
    Discussion
    Related Work
    Future Work
    Conclusions
    Acknowledgements
    References

CONCLUSIONS

8 Summary and Conclusions
    One-way traffic Classification and Characterization
    Network Traffic Data Visualization
9 Future Work and Outlook
    Automatic Inference of One-way Flow Classification Rules
    Classifying Outbound One-way Traffic
    Characterizing One-way Traffic on IPv
    Subclassifying One-way Traffic linked to DoS Attacks
    A Long-term Study of the Reachability of Services
    Comparing One-way Traffic Analysis with Network Telescopes
    In-depth Analysis of the One-way Class Suspected Benign
    Extending the Host Application Profile Viewer
    Extending FIM Visualizations of Network Traffic
References

APPENDIX

10 Data Sets
11 Limitations
    NetFlow Data Set
    NetFlow Pre-Processing
    One-way Flow Perspective
    Classification Rules
12 Additional Measurements
    Routing Symmetry Tests
    One-way Traffic Sources
References
List of Publications
    Related
    Non-Related
Acknowledgements
List of Figures

1.1 Network monitoring methods
Two- and one-way traffic illustrated
Network telescope traffic
Five flow types illustrated
Example of a communication graph
Example of a parallel coordinate plot
Example of a treemap visualization
Impact of time interval size on flow metrics
Mixture of one- and two-way flows
Rule refinement stages
Evolution of one- and two-way flow counts
One-way flows as a (mean) fraction of all
Composition of one-way traffic
Coinciding outage on most university services
Illustration that shows security-relevant events
Persistency of TCP destination ports
Persistency of UDP destination ports
IBR flows received by a target port versus port rank
IBR traffic flow volume over time decomposed
5.6 IBR flows received by a target host
Source host activity patterns
Average daily number of IBR flows per source
One-way flows generated per host versus host rank
IBR and regular flow sources and targets
Top-k persistence versus k (unnormalized)
Top-k persistence versus k (normalized)
IPv4 address space distributions of sources by ports
Example of a parallel-coordinate plot
Example of a FIM visualization
Runtime diagram
Traffic profiling example
Attacks and misconfigurations example
Illustrative graph visualization example
Host application profile (HAP) graphlet
Example of a server role summarization
Host roles definitions
Example of a HAP graphlet
Example showing flow directions
Example of a host browse list
Example of a scan target
Example of a scan source
Example of a flow list
Example of a scan pattern
Example of a P2P host pattern
Example of a P2P flow list
List of Tables

4.1 Size of data sets per year
Overview of defined signs
Rules used to classify one-way flows
Results of validation
Composition of one-way flow classes
Flow data set description
List of top-10 target ports of one-way flows
Top-10 countries of IBR scan flows
Top-10 countries of IBR scan (normalized)
Size of data sets per year
Fraction of pot. artificial inbound one-way flows
Unique One-way Traffic Sources
INTRODUCTION
Research in Internet measurement provides us with new ways to understand, operate and improve the Internet. Learning from network traffic data requires a well-chosen set of analysis techniques. We envision a rich toolbox available for this task, and delve into novel techniques and their application on large data sets to extend the choice of analysis schemes. In this introductory part, we illustrate the relevance of network traffic monitoring and traffic data visualization. This includes a survey of application areas, a discussion of the state of the art and a description of the research problems investigated. We conclude with a summary of our contributions and provide an overview of how this thesis is organized.
Chapter 1 Network Traffic Monitoring

1.1 Relevance

Today, Internet usage pervades most areas of our lives, making the Internet an important infrastructure for our society. According to the International Telecommunication Union (ITU), average Internet penetration¹ had reached 35.7% for its member countries by the end of 2010, after a growth of 809% since 1998.² Operating such an infrastructure requires intimate knowledge of its workings and its state at any moment. This is the task of network monitoring, which is performed in a decentralized manner by the network administrators of many organizations, each supervising the networks they are responsible for.

¹ Internet penetration is measured as the percentage of inhabitants using the Internet.
² Similar figures are reported by the Organization for Economic Cooperation and Development (OECD) for industrialized countries. In particular, the OECD estimates the Internet penetration for its member countries to be 25.2% on average by Q2/2011, with a growth of 582% between Q2/2002 and Q2/2011.
However, measuring the Internet is not as easy as it might appear at first sight: there are many pitfalls to be avoided and problems to be solved. Moreover, the extent to which a measurement infrastructure is established today is limited by the decentralized organization of the Internet, by trade-offs between the cost of infrastructure and the measurement support provided, and by privacy concerns. As a consequence, the Internet cannot be characterized precisely, and we only have an incomplete view of its workings and its state. Many quantitative measures of the Internet are still missing or are at least incomplete. Furthermore, handling the huge data sets that result from ever growing traffic volumes calls for new ways to extract interesting information from summarized data, which is readily available in the form of flow metadata (e.g. NetFlow).

1.2 Application Areas

There are many reasons why networks should be monitored. Network administrators want to know how well the network infrastructure is running and whether any traffic anomalies need further attention. At the same time, they observe traffic growth that routinely requires an extension of the network infrastructure to provide sufficient bandwidth to end users at any time. Additionally, emerging new applications can change the character of network traffic and therefore should not escape the attention of network administrators. As part of the quality assurance monitoring process, network traffic data can help to check compliance with service level agreements (SLAs). Furthermore, privacy and availability concerns often call for network monitoring to detect security incidents, preferably at an early stage, or as part of forensic investigations to prosecute offenders and harden the infrastructure. Organizations offering network services to clients can use network traffic
data to measure the actual usage as input data for billing their services. On the other hand, researchers seek new insights into the operation of network infrastructure and of networked applications and protocols, with the goal of improving them or of providing useful information to those in charge of this task.

Figure 1.1: Network monitoring methods can be grouped into active techniques (injecting traffic, on the left) and passive techniques (listening to traffic, on the right). Furthermore, the employed data sets can be ranked by the increasing granularity of the details they provide (top-to-bottom inside of the rectangle).

Network traffic can be monitored across protocol layers (see Fig. 1.1 for an overview). Lower-level protocol data consists primarily of router and link-level data that commonly is collected
by an infrastructure using the Simple Network Management Protocol (SNMP). A popular use of SNMP data is the monitoring of basic information such as packet loss, delay and throughput with tools such as Observium or MRTG [105, 106]. SNMP data can be made available from virtually everywhere in a network, making it a ubiquitous data source. However, SNMP data typically is aggregated information that does not provide details about the sources and destinations of traffic. Besides, SNMP is frequently used to build up-to-date inventories of network infrastructure based on a Management Information Base (MIB).

An alternative is the gathering of packet traces as part of Deep Packet Inspection (DPI), which records packet contents at a configurable granularity of details. This granularity is set by the number of attributes supported and the amount of payload captured per packet. Owing to the encapsulated nature of packets, packet traces provide details at several protocol layers, from the link layer up to the application layer, because recorded packets contain higher-layer packet contents as payload data of the underlying layers. Packet traces support, e.g., the measurement of packet jitter and round-trip time (RTT) through very precise timestamps. However, gathering packet traces places high demands on the infrastructure needed to store and process the collected data, due to the immense volume of such data sets.

A further alternative is the use of flow-level data, e.g. in the form of the popular NetFlow [108, 109]. Flow-level data sets provide per-connection metadata as summaries over all involved packets. This comprises, e.g., the total packet and byte counts alongside the source and destination addresses and the protocol used at the network layer. Still, gathering flow-level data on high-speed links may place too high a demand on the infrastructure. In this situation, sampling is often used that, e.g.,
only includes every n-th packet in the traffic data (other sampling strategies exist). Finally, at the application layer the amount of traffic data typically is much smaller making this kind of network
monitoring a good choice for some monitoring tasks. Such data sets are created by application programs that log important information about their network communications. Application-layer data sets can be collected by client or server programs. Popular examples of such data sets are server logs provided by web and DNS servers.

Besides, there are more specialized measurement techniques that, e.g., focus on traffic routing by inspecting exterior gateway protocol data, such as that created by the Border Gateway Protocol (BGP). BGP data can be used to infer the Internet inter-domain (AS-level) topology and to assess route stability [110, 111]. Security appliances such as firewalls and Intrusion Detection Systems (IDS) create log data entries that list, e.g., permitted and denied connections or alerts about suspicious traffic.

So far, we have surveyed monitoring techniques that work passively, i.e., without intruding on the network traffic. For some tasks, active measurements are more helpful. A popular technique is the use of the ping command to test the reachability of a destination system and to measure the packet round-trip time. The program traceroute allows inspecting the path that packets travel from source to destination, along with the transfer times experienced.

A further source of information about selected details of network traffic are databases. To geo-locate a source or destination system by IP address, the Domain Name System (DNS) can provide the registered name (if any), including the top-level domain name identifying a country (e.g. ch). Some top-level domain names are ambiguous when a country should be identified, e.g. the name org. In this situation, alternative geo-location databases give more detailed information. Such databases are commonly maintained by commercial companies and may provide a finer granularity, e.g. at the region or even city level.
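The use of a top-level domain name as a coarse country hint can be sketched as follows. This is a minimal illustration, not a method used in this thesis: the function name `country_hint` is hypothetical, and the sketch assumes that the host name has already been obtained (e.g. by a reverse DNS lookup). It relies on the convention that two-letter top-level domains are country codes, and returns no hint for generic names such as org.

```python
from typing import Optional


def country_hint(hostname: str) -> Optional[str]:
    """Return the two-letter country-code TLD of a host name, if any.

    Hypothetical helper for illustration only. Generic TLDs such as
    "org" or "com" are ambiguous with respect to the country and
    therefore yield None.
    """
    # Drop an optional trailing dot of a fully qualified name.
    labels = hostname.rstrip(".").lower().split(".")
    tld = labels[-1]
    # Two-letter alphabetic TLDs are country codes by convention.
    if len(tld) == 2 and tld.isalpha():
        return tld
    return None


print(country_hint("www.ethz.ch"))   # a country-code TLD
print(country_hint("example.org"))   # ambiguous generic TLD
```

A host name that yields no hint would then have to be resolved through a geo-location database, as described above.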
To learn more about the size of the network that packets originate from or are destined for, BGP data collected as part of BGP traffic monitoring can be valuable. Such a data set can associate individual IP addresses with subnets (identified by their network address
and prefix length) and the associated Autonomous System (AS) number.

Finally, we note that traffic data in principle can be gathered at any level of the Internet topology and from infrastructure operated by any organization that participates in the Internet. In practice, there are many limitations. For example, it is hard to exchange traffic data between interested organizations due to privacy concerns and legal requirements. This frequently raises the question of whether the measurement results obtained from one network are applicable to another network. A more detailed description of network monitoring issues can be found in the literature.

In summary, surveying the field of network monitoring, we find a considerable diversity of data sources. Depending on the observation point and the measurement methodology, a collected data set can be useful for multiple analysis purposes, but usually not for all. This requires a careful study of exactly what information can be inferred from a data set and, as a consequence, for what tasks it can be useful. Ignoring this preparatory step can lead to insights that are built on sand and are eventually found to be misleading.

Network traffic can be decomposed into two-way traffic, representing dialogs between end systems, and one-way traffic, i.e., packets that never receive a reply (see Fig. 1.2 for examples). In this thesis we are interested in one-way traffic, a very specific and less well-known perspective on network traffic. One-way traffic represents monologs resulting from a number of communication situations that are of utmost interest to operators and researchers because they are associated with interesting events. Examples of such events are unreachable services, misconfigurations, scanning, prefix hijacking, filtering by Network Address Translation (NAT) and firewalls, peer-to-peer applications, congestion and routing loops.
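The decomposition into two-way and one-way traffic can be illustrated on flow records. The following is a minimal sketch under simplifying assumptions and is not the classification scheme developed in this thesis: the `Flow` record and the function `split_by_direction` are hypothetical names, a flow is identified by its 5-tuple, and a flow counts as two-way as soon as any flow with the mirrored 5-tuple is present in the same data set. Time windows, sampling and protocols without ports are ignored.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Flow:
    """Hypothetical flow record: 5-tuple plus packet and byte counts."""
    src: str
    dst: str
    proto: int
    sport: int
    dport: int
    packets: int
    octets: int

    def key(self) -> tuple:
        return (self.src, self.dst, self.proto, self.sport, self.dport)

    def reverse_key(self) -> tuple:
        # The same conversation as seen in the opposite direction.
        return (self.dst, self.src, self.proto, self.dport, self.sport)


def split_by_direction(flows: Iterable[Flow]) -> Tuple[List[Flow], List[Flow]]:
    """Split flows into (two_way, one_way) by looking for a reverse flow."""
    flows = list(flows)
    seen = {f.key() for f in flows}
    two_way = [f for f in flows if f.reverse_key() in seen]
    one_way = [f for f in flows if f.reverse_key() not in seen]
    return two_way, one_way


# Addresses are taken from the documentation ranges; values are made up.
request = Flow("192.0.2.10", "198.51.100.5", 6, 40000, 80, 12, 900)
reply = Flow("198.51.100.5", "192.0.2.10", 6, 80, 40000, 10, 15000)
probe = Flow("203.0.113.7", "198.51.100.5", 6, 51000, 22, 1, 60)

two_way, one_way = split_by_direction([request, reply, probe])
print(len(two_way), len(one_way))
```

In this toy data set, the request and its reply form a dialog, while the unanswered probe towards port 22 remains a monolog.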
Besides, one-way traffic constitutes a large fraction of Internet traffic (in terms of the number of flows). In this work we focus on one-way traffic at
the network layer and use flow-level data to classify and further characterize this exotic piece of traffic. The use of flow-level data for analyzing one-way traffic is particularly interesting for network operators, as flow data can be collected from routers as a by-product of operation. In contrast, prior work used either packet traces or IDS connection logs. Both data types require extra infrastructure that does not scale well to large networks.

Figure 1.2: Two-way traffic is the regular case (A) when there are replies to sent messages. On the other hand, one-way traffic results whenever packets get lost or are blocked on their way to the receiver or on the return path (B, C). Figure (C) illustrates another case of one-way traffic when the receiver is not responsive or does not even exist.

1.3 State of the Art

In the past, one-way traffic has been studied in a number of different ways to gain insight into its nature and causes. We describe the most popular techniques next. A traditional instrument to study one-way traffic is a network telescope, i.e., an infrastructure which routes traffic targeted at an
unpopulated IP address range to a measurement site. The idea of a network telescope is to observe traffic using destination IP addresses not assigned to any end system (IP darkspace). In this measurement setup, incoming traffic stays unanswered as it targets otherwise unused IP addresses. A network telescope enables the monitoring of different kinds of unsolicited traffic. For example, in scenario (A) in Fig. 1.3, an attacker sends a large number of requests to a victim system, trying to create a Denial-of-Service (DoS) situation. To remain hidden, the attacker forges the source IP addresses of these attack flows using random values that to some extent fall into the address space of the network telescope. Therefore, the victim system sends replies towards the network telescope among other random destinations. A network telescope is also known as a darknet, while in contrast a live network can be designated a greynet in the context of one-way traffic analysis.

Figure 1.3: This figure illustrates three scenarios generating traffic that can be seen in a network telescope: backscatter traffic as a result of a Denial-of-Service (DoS) attack using spoofed source IP addresses (A), scanning of randomly chosen target hosts (B), and misdirected traffic due to configuration errors (C).
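The observation that randomly spoofed source addresses fall into the telescope's address space "to some extent" can be quantified with a small sketch. It assumes, for illustration only, that the attacker draws spoofed addresses uniformly from the whole IPv4 space; the prefix used as darkspace is a documentation range chosen as a placeholder, and the function names are hypothetical.

```python
import ipaddress

# Hypothetical darkspace: a documentation prefix used as a placeholder.
TELESCOPE = ipaddress.ip_network("198.51.100.0/24")


def hits_telescope(dst: str, telescope=TELESCOPE) -> bool:
    """True if a destination address lies inside the monitored darkspace."""
    return ipaddress.ip_address(dst) in telescope


def expected_backscatter_share(telescope) -> float:
    """Share of backscatter replies expected at the telescope.

    Assumes spoofed source addresses are uniformly distributed over
    all 2**32 IPv4 addresses, which is an idealization.
    """
    return telescope.num_addresses / 2**32


print(hits_telescope("198.51.100.7"))
print(expected_backscatter_share(ipaddress.ip_network("10.0.0.0/8")))
```

Under this assumption, a telescope covering a /8 prefix would receive 1/256 of the victim's replies, which also indicates why a substantial address range is needed for such measurements.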
An often used term for traffic seen in a network telescope is Internet Background Radiation (IBR), defined either as traffic sent to unused addresses [116, 117] or as unsolicited one-way Internet traffic. Network telescopes have been shown to be useful for studying major worm outbreaks, scanning and probing, misconfigurations, botnets [118, 122] and major network disruptions. Typically, traffic towards a network telescope can be recorded with all packet details, as this setup does not suffer from the same privacy concerns and accompanying restrictions as production networks carrying sensitive information of a company or organization.

At a few places [114, 124], parts of the incoming traffic towards a network telescope are used to actively probe the sources of this traffic, either through manual exploration or through so-called active responders, which are small programs that answer to predefined application protocols [113, 114] with the goal of verifying the kind of targeted service. However, this active approach is considered problematic as it can reveal the location of the network telescope in IP address space to the attacker. If the location of a network telescope cannot be kept secret, it might be avoided by attackers, thereby jeopardizing its value for the analysis of attack traffic. Moreover, even a network telescope that is operated passively can be detected by attackers when analysis results are published. Network telescopes are blind to targeted attacks that use lists of previously identified victim systems and therefore provide an incomplete picture of the attack scene. Besides, a network telescope requires a substantial IP address range dedicated to this purpose, which is hard to obtain for those not yet in possession of one.

An extension of the idea of active responders is the use of so-called honeypots that mimic an important end system, e.g.
a web server, to attract attacks with the goal of analyzing them. In practice, such systems are measurement sites isolated from production networks that help operators and researchers to collect information about emerging attacks and to test countermeasures. A honeypot can be built as a dedicated computer system or as a guest on top of a system virtual machine, which makes it easy to run a farm of honeypots (a honey farm) at reasonable hardware cost. In contrast to network telescopes, honeypots allow the analysis of attacks that are specifically tailored to exploit server-type applications. Moreover, they do not occupy a large number of IPv4 addresses, which in the future are unlikely to be reservable for pure research purposes. However, honeypots will miss attacks that try to exploit vulnerabilities of client applications, e.g. web browsers. Similar to network telescopes, honeypots can be avoided by attackers from the moment their purpose is recognized and their exact IP address is identified. Honeypots can be very useful for analyzing particular attacks in depth, but they are less useful for creating macroscopic statistics on malicious traffic.

A third source of information about the failure or success of connection attempts is the use of dedicated intrusion detection systems (IDS), such as the Bro system, that create connection logs. Connection logs are text files that describe each observed connection with a text line containing connection information and success or failure details. This data source is useful for analyzing TCP traffic [116, 131, 132] and, with extensions, also UDP and ICMP traffic. However, Bro and any similar system are resource-intensive as they require the processing of packet traces.

A fourth approach makes use of IDS alerts to identify malicious or otherwise unwanted traffic. IDS alerts are text files that describe each suspicious observation with a text line comprising an estimated threat priority, connection details and a threat characterization.
An interesting source of such IDS alert data is the Dshield project, which involves many sites contributing their alert data and thus provides a globally distributed perspective. On the other hand, IDS alerts are limited by the employed ruleset that relies on the characteristics of