Visual Analysis of Network Traffic: Interactive Monitoring, Detection, and Interpretation of Security Threats


Dissertation submitted for the academic degree of Doctor of Natural Sciences (Doktor der Naturwissenschaften) at the Universität Konstanz, Department of Computer and Information Science (Fachbereich Informatik und Informationswissenschaft)

Submitted by Florian Mansmann


Abstract

The Internet has become a dangerous place: malicious code spreads to personal computers across the world, creating botnets ready to attack the network infrastructure at any time. Monitoring network traffic and keeping track of the vast number of security incidents or other anomalies in the network are challenging tasks. While monitoring and intrusion detection systems are widely used to collect operational data in real time, attempts to manually analyze their output at a fine-grained level are often tedious, require extensive human resources, or fail completely to provide the necessary insight due to the complexity and volume of the underlying data.

This dissertation represents an effort to complement automatic monitoring and intrusion detection systems with visual exploration interfaces that empower human analysts to gain deeper insight into large, complex, and dynamically changing data sets. In this context, one key aspect of visual analysis is the refinement of existing visualization methods to improve their scalability with respect to a) data volume, b) visual limitations of computer screens, and c) human perception capacities. In addition, the development of innovative visualization metaphors for viewing network data is a further key aspect of this thesis.

In particular, this dissertation deals with scalable visualization techniques for detailed analysis of large network time series. By grouping time series according to their logical intervals in pixel visualizations and by coloring them for better discrimination, our methods enable accurate comparisons of temporal aspects in network security data sets.

In order to reveal the peculiarities of network traffic and distributed attacks with regard to the distribution of the participating hosts, a hierarchical map of the IP address space, which takes both geographical and topological aspects of the Internet into account, is proposed. Since visual clutter becomes an issue when naively connecting the major communication partners on top of this map, hierarchical edge bundles are used for grouping traffic links based on the map's hierarchy, thereby facilitating a more scalable analysis of communication partners. Furthermore, the map is complemented by multivariate analysis techniques for visually studying the multidimensional nature of network traffic and security event data. In particular, interaction with the implemented prototypes reveals the ability of the proposed visualization methods to provide an overview, to relate communication partners, to zoom into regions of interest, and to retrieve detailed information.

For an even more detailed analysis of hosts in the network, we introduce a graph-based approach to tracking behavioral changes of hosts and higher-level network entities. This information is particularly useful for detecting misbehaving computers within the local network infrastructure, which can otherwise substantially compromise the security of the network.

To complete the comprehensive view on network traffic, a Self-Organizing Map was used to demonstrate the usefulness of visualization methods for analyzing not only structured network protocol data, but also unstructured information, e.g., the textual context of e-mail messages. By extracting features from the e-mails, the neural network algorithm clusters similar e-mails and is capable of distinguishing between spam and legitimate e-mails to a certain extent.

In the scope of this dissertation, the presented prototypes demonstrate the applicability of the proposed visualization methods in numerous case studies and reveal the inexhaustible potential of their usage in combination with automatic detection methods. We are therefore confident that, in the fields of network monitoring and security, visual analytics applications will quickly find their way from research into practice by combining human background knowledge and intelligence with the speed and accuracy of computers.

Zusammenfassung (Summary)

The Internet has become a dangerous place: malicious code spreads on personal computers all over the world and thereby creates so-called botnets, which are ready at any time to attack the network infrastructure. Monitoring network traffic and keeping an overview of the enormous number of security-relevant incidents or anomalies in the network are difficult tasks. While monitoring and intrusion detection systems are widely used to collect operational data in real time, efforts to manually analyze their output at a detailed level are often tiring, require a lot of personnel, or fail completely to deliver the necessary insights because of the complexity and volume of the underlying data.

This dissertation is an endeavor to complement automatic monitoring and intrusion detection systems with visual exploration interfaces that enable human analysts to gain deeper insight into huge, complex, and dynamically changing data sets. In this context, one main concern of visual analysis is to refine existing visualization methods in order to improve their scalability with respect to a) the data volume, b) the visual limitations of computer screens, and c) the capacity of human perception. Beyond that, the development of innovative visualization metaphors is a further main concern of this doctoral thesis.

In particular, this dissertation deals with scalable visualization techniques for the detailed analysis of huge network time series. By grouping time series in pixel visualizations according to their logical intervals on the one hand, and coloring them for better discrimination on the other, our methods allow accurate comparisons of temporal aspects in network security data sets.

To uncover the peculiarities of network traffic and distributed attacks with regard to the distribution of the participating hosts, a hierarchical map of the IP address space is proposed, which takes both geographical and topological aspects of the Internet into account. Since naively connecting the most important communication partners on the map would lead to disturbing visual artifacts, hierarchical edge bundles can be used to group the traffic links according to the hierarchy of the map, thereby enabling a more scalable analysis of the communication partners. Furthermore, the map is complemented by a multivariate analysis technique for visually studying the multidimensional nature of network traffic and of security event data. In particular, interaction with the implemented prototypes reveals the ability of the proposed visualization methods to provide an overview, to relate communication partners, to zoom into regions of interest, and to retrieve detailed information.

For an even more detailed analysis of the hosts in the network, we introduce a graph-based approach for observing changes in the behavior of hosts and of more abstract entities in the network. This kind of information is particularly useful for uncovering misbehavior of hosts within the local network infrastructure, which can otherwise considerably endanger the security of the network.

To round off the comprehensive view of network traffic, a Self-Organizing Map was used to demonstrate the suitability of visualization methods for analyzing not only structured data of the network protocols but also unstructured information, such as the textual context of e-mail messages. By extracting characteristic features from the e-mails, the neural network algorithm groups similar e-mails and is able, to a certain degree, to distinguish between spam and legitimate e-mails.

Within this dissertation, the presented prototypes demonstrate the broad applicability of the proposed visualization methods in numerous case studies and show their inexhaustible potential to be used in combination with automatic intrusion detection methods. We are therefore confident that visual analytics applications in the fields of network monitoring and security will quickly find their way from research into practice by combining human background knowledge and intelligence with the speed and accuracy of computers.

Parts of this thesis were published in:

[1] Daniel A. Keim, Florian Mansmann, Jörn Schneidewind, and Tobias Schreck. Monitoring network traffic with radial traffic analyzer. In Proceedings of IEEE Symposium on Visual Analytics Science and Technology.
[2] Daniel A. Keim, Florian Mansmann, Jörn Schneidewind, Jim Thomas, and Hartmut Ziegler. Visual Data Mining: Theory, Techniques and Tools for Visual Analytics, chapter Visual Analytics: Scope and Challenges. Springer, Lecture Notes in Computer Science (LNCS), to appear.
[3] Daniel A. Keim, Florian Mansmann, Jörn Schneidewind, and Hartmut Ziegler. Challenges in visual data analysis. In Information Visualization. IEEE Press.
[4] Daniel A. Keim, Florian Mansmann, and Tobias Schreck. Mailsom - visual exploration of electronic mail archives using self-organizing maps. In Conference on Email and Anti-Spam.
[5] Florian Mansmann, Fabian Fischer, Daniel A. Keim, and Stephen C. North. Visualizing large-scale IP traffic flows. In Proceedings of the 12th International Workshop on Vision, Modeling, and Visualization, Saarbrücken, Germany, November.
[6] Florian Mansmann, Daniel A. Keim, Stephen C. North, Brian Rexroad, and Daniel Shelehedal. Visual analysis of network traffic for resource planning, interactive monitoring, and interpretation of security threats. IEEE Transactions on Visualization and Computer Graphics, 13(6). Proceedings of the IEEE Conference on Information Visualization.
[7] Florian Mansmann, Lorenz Meier, and Daniel A. Keim. Visualization of host behavior for network security. In VizSec 2007 Workshop on Visualization for Computer Security. Springer. To appear.
[8] Florian Mansmann and Svetlana Vinnik. Interactive exploration of data traffic with hierarchical network maps. IEEE Transactions on Visualization and Computer Graphics, 12(6).
[9] Svetlana Vinnik and Florian Mansmann. From analysis to interactive exploration: Building visual hierarchies from OLAP cubes. In Proceedings of the 10th International Conference on Extending Database Technology, 2006.


Contents

1 Introduction
  1.1 Monitoring network traffic
  1.2 Intrusion detection and security threat prevention
  1.3 Visual analysis for network security
  1.4 Thesis outline and contribution
2 Networks, intrusion detection, and data management
  2.1 Network fundamentals
  2.2 Capturing network traffic
  2.3 Intrusion detection
  2.4 Building a data warehouse for network traffic and events
3 Foundations of information visualization for network security
  3.1 Information visualization
  3.2 Visual Analytics
  3.3 Related work on visualization for network monitoring and security
4 Temporal analysis of network traffic
  4.1 Related work on time series visualization
  4.2 Extending the recursive pattern method time series visualization
  4.3 Comparing time series using the recursive pattern
  4.4 Case study: temporal pattern analysis of network traffic
  4.5 Summary
5 A hierarchical approach to visualizing IP network traffic
  5.1 Related work on hierarchical visualization methods
  5.2 The Hierarchical Network Map
  5.3 Space-filling layouts for diverse data characteristics
  5.4 Evaluation of data-driven layout adaptation
  5.5 User-driven data exploration
  5.6 Case studies: analysis of traffic distributions in the IPv4 address space
  5.7 Summary
6 An end-to-end view of IP network traffic
  6.1 Related work on network visualizations based on node-link diagrams
  6.2 Linking network traffic through hierarchical edge bundles
  6.3 Case study: Visual analysis of traffic connections
  6.4 Discussion
  6.5 Summary
7 Multivariate analysis of network traffic
  7.1 Related work on multivariate and radial information representations
  7.2 Radial Traffic Analyzer
  7.3 Temporal analysis with RTA
  7.4 Integrating RTA into HNMap
  7.5 Case study: Intrusion detection with RTA
  7.6 Discussion
  7.7 Summary
8 Visual analysis of network behavior
  8.1 Related work on dimension reduction
  8.2 Graph-based monitoring of network behavior
  8.3 Integration of the behavior graph in HNMap
  8.4 Automatic accentuation of highly variable traffic
  8.5 Case studies: monitoring and threat analysis with behavior graph
  8.6 Evaluation
  8.7 Summary
9 Content-based visual analysis of network traffic
  9.1 Related work on visual analysis of communication
  9.2 Self-organizing maps for content-based retrieval
  9.3 Case study: SOMs for classification
  9.4 Summary
10 Thesis conclusions
  10.1 Conclusions
  10.2 Outlook

List of Figures

2.1 The layers of the OSI, TCP/IP and hybrid reference models
Advertised IPv4 address count daily average [75]
IP datagram
Concept of an SSH tunnel
Port scan using NMap
Concept of a demilitarized zone (DMZ)
A multilayered data warehousing system architecture
Example netflow cube with three dimensions and sample data
Modeling network traffic as an OLAP cube
Navigating in the hierarchical dimension IP address
Mapping values to color using different normalization schemes
The Scope of Visual Analytics
Computer network traffic visualization tool TNV
Line charts of 5 time series of mail traffic over a time span of 1440 minutes
Recursive pattern example configuration: 30 days in each of the 12 months
Recursive pattern parametrization showing a weekly reappearing pattern
Multi-resolution recursive pattern with empty fields for normalizing irregularities in the time dimension
Enhancing the recursive pattern with spacing
Different coloring options for distinguishing between time series
Combination of two time series at different hierarchy levels
Recursive pattern in small multiples mode
Recursive pattern in parallel mode
Recursive pattern in mixed mode
Case study showing the number of SSH flows per minute over one week
Visualizing different characteristics of SSH traffic
Example of a hierarchical data visualization using a rectangle-packing algorithm
Density histogram: distribution of sessions over the IP address space
Multi-resolution approach: Hierarchical Network Map
Scaling effects in the HNMap demonstrated on some IP prefixes in Germany
HNMap on the powerwall
Border coloring scheme
Geographic HistoMap Layout
HistoMap 1D layout
5.9 Strip Treemap layout
Anonymized outgoing traffic connections from the university gateway
Average position change
Average side change
HNMap interactions
Recursive pattern pixel visualization showing individual hosts
Multiple map instances facilitate comparison of traffic of several time spans
Interface for configuring the animated display of a series of map instances
Visual exploration process for resource location planning
Monitoring traffic changes
Employing Radial Traffic Analyzer to find dependencies between dimensions
Rapid spread of botnet computers in China in August
Comparison of different strategies to drawing adjacency relationships
The IP/AS hierarchy determines the control polygon for the B-spline
HNMap with edge bundles showing the 500 most important connections
Assessing major traffic connections through edge bundles
Scatter plot matrix
Design ratio of RTA
Continuous refinement of RTA by adding new dimension rings
RTA display with network traffic distribution at a local computer
Animation over time in RTA in time frame mode
Invocation of the RTA interface from within the HNMap display
Security alerts from Snort in RTA
Normalized traffic measurements of two hosts
Coordinate calculation of the host position
Host behavior graph of 33 prefixes over a timespan of 1 hour
Fine-tuning the graph layout through cohesion forces
Integration of the behavior graph view into HNMap
Automatic accentuation of highly variable /24 prefixes
Overview of network traffic between 12 and 18 hours
Nightly backups and Oracle DB traffic in the early morning
Investigating suspicious host behavior through accentuation
Evaluating SNORT alerts recorded from Jan. 3 to Feb. 26
Splitting the analysis into internal and external alerts reveals different clusters
Analysis of SNORT alerts recorded from January 21 to 27
Performance analysis of the layout algorithm
tf-idf feature extraction on a collection of 100 e-mails
The learning phase of the SOM
Spam histogram of a sample archive
Component plane for term work

1 Introduction

"Computers are incredibly fast, accurate and stupid: humans are incredibly slow, inaccurate and brilliant; together they are powerful beyond imagination."
Albert Einstein

It is a fact that digital communication and the sharing of data have proven to be cheap, efficient, and effective. Over the years, this has turned the Internet into an indispensable resource in our everyday life: in the modern information society, not only private communication, but also education, administration, and business largely depend on the availability of the information infrastructure. To ensure the health of the network infrastructure, the following three aspects play critical roles:

1. Effective monitoring of the network to detect failures and react to overload situations in time.
2. Detection of intrusions and attacks that aim at stealing confidential information, misusing hijacked computers for malicious activities, and paralyzing business or services on the Internet.
3. Human capability to react to unforeseen threats to the network infrastructure.

Network monitoring is an essential task to keep the information infrastructure up and running. It is usually carried out by a system that constantly monitors the hardware and software components crucial for the vitality of the network infrastructure and informs the network administrators in case of outages. Through so-called activity profiling, the monitoring system tries to distinguish between normal and abnormal usage and network behavior. In most cases, it is easy to handle failures that have previously occurred. However, recognition of abnormal network behavior often involves unnoticed misuse of the network and many false alarms, which eventually lead to information overload for the system administrators involved.

Network security has become a constant race against time: so-called 0-day exploits, which are security vulnerabilities that are still unknown to the public, have become a valuable commodity in the hands of hackers. These exploits are used to develop malicious code, which infiltrates various computers on the Internet even before virus scanners and firewalls are capable of offering effective countermeasures. Often, this malicious code communicates with a botnet server and only waits to receive commands to execute code on the hijacked computer. If many of these infected computers are interlinked, they form a botnet, a mighty weapon to harvest websites for e-mail addresses, to send out spam messages, or to jointly conduct a denial-of-service attack against commercial or governmental web servers.

Today, signature-based and anomaly-based intrusion detection are considered the state of the art in network security. However, fine-tuning parameters and analyzing the output of these intrusion detection methods can be complex, tedious, and even impossible when done manually. Furthermore, current malware trends suggest an increase in security incidents and a diversification of malware for the foreseeable future [60].

In general, it is noticeable that systems become more and more sophisticated and make decisions on their own up to a certain degree. As soon as unforeseen events occur, the system administrator or security expert has to intervene to handle the situation. The network monitoring and security fields seem to have profited a lot from automatic detection methods in recent years. However, there is still large potential for visual approaches to foster a better understanding of the complex information through visualization and interaction. In addition, a very promising research field for solving many of today's information overload problems in network security is visual analytics, which aims at bridging the gap between automatic and visual analysis methods.

In the remainder of this chapter, the need for network monitoring will be explained. We proceed by discussing how intrusion detection systems are used to prevent security threats.
The need for visual analysis for network security is then motivated through its potential to bridge the gap between the human analyst and the automatic analysis methods. The last section gives an outline of this thesis.

1.1 Monitoring network traffic

The computer network infrastructure forms the technical core of the Information Society. It transports increasing amounts of arbitrary kinds of information across arbitrary geographic distances. To date, the Internet is the most successful computer network. The Internet has fostered the implementation of all kinds of productive information systems unimaginable at the time it was originally designed. While the wealth of applications that can be built on top of the Internet infrastructure is virtually unlimited, there are fundamental protocol elements which govern how information is transmitted between the nodes on the network. It is an interesting problem to devise tools for visual analysis of key network characteristics based on these well-defined protocol elements, thereby supporting the network monitoring application domain.

Network monitoring in general is concerned with the surveillance of important network performance metrics to a) supervise network functionality, b) detect and prevent potential problems, and c) develop effective countermeasures for networking anomalies and sabotage as they occur. One may distinguish between unintentional defects due to human failures or other malfunctions, referred to as flaws, and the intentional misuses of the system, known as intrusions.

The main focus of most network monitoring systems is to collect operational data from countless network connections, switches, routers, and firewalls. These data need to be stored promptly in central repositories where network operations staff can conveniently access and use them to tackle failures within the network. However, one major drawback of these systems is that they only employ simple business charts to visualize their data, not taking into account the interlinked properties of the multi-dimensional network data. In the course of this dissertation, it will be demonstrated how information visualization techniques can be applied to gain insight into the complex nature of large network data sets.

1.2 Intrusion detection and security threat prevention

Since the Internet has de facto become the information medium of first resort, each host on the network is continuously exposed to a hostile environment. What started as proof-of-concept implementations by a few experts for unveiling security vulnerabilities has become a sport among script kiddies and drawn the attention of criminals. Therefore, network security has turned into one of the central and most challenging issues in network communication for practitioners as well as for researchers. Security vulnerabilities in the system are exploited with the intention to infect computers with worms and viruses, to hack company networks and steal confidential information, to run criminal activities through the compromised network infrastructure, or to paralyze online services through denial-of-service attacks. The frequency and intensity of the attacks prohibit any laxity in monitoring the network behavior of the system.
One of the most famous infections to date was the SQL Slammer worm in January. Due to a vulnerability in the Microsoft SQL Server, the worm was able to install itself on Microsoft servers and started to wildly scan the network in order to propagate itself. It was not the unavailability of the Microsoft SQL servers but the traffic generated by the extensive scans that caused packet loss and, in some instances, completely saturated circuits. Several large Internet transit providers and end-user ISPs were completely shut down. As a result, Bank of America's debit and credit card operations were impacted, denying customers the opportunity to make any transactions using their bank cards [166].

Economies of scale have made usage of the network infrastructure very efficient and extremely cheap. While this allowed the Internet to experience unprecedented growth, it brought about the pitfall that almost every Internet user is exposed to unwanted advertisement messages, so-called spam. In the last few years, more and more of these relatively harmless spam messages have turned into phishing mails, which aim at stealing online banking and e-commerce codes and passwords from naive users.

Various automatic methods, such as virus scanners, spam filters, online surfing controls, firewalls, and intrusion detection systems have emerged as a response to the need to protect systems from harmful network traffic. However, as there will always be human and machine failures, no fully automated method can provide absolute protection.

Intrusion detection is the major preventive mechanism for timely recognition of malicious use of the system endangering its integrity and stability. There are two different concepts for detecting intrusions: a) anomaly-based intrusion detection systems (IDS), which offer a higher potential for discovering novel attacks, and b) signature-based IDS, which target already known attack patterns. Anomaly-based detection is carried out by defining the normal state and behavior of the system, with alerts sent out whenever that state is violated. It is a rather complicated task to define the normal behavior precisely enough to minimize false alerts on the one hand and not to let attacks evolve unnoticed on the other hand.

1.3 Visual analysis for network security

The roots of the field of exploratory data analysis date back to the eighties when John Tukey articulated the important distinction between confirmatory and exploratory data analysis [165] out of the realization that the field of statistics was strongly driven by hypothesis testing at the time. Today, a lot of research deals with an increasing amount of data being digitally collected in the hope that it contains valuable information that can eventually bring a competitive advantage for its owner. Visual data exploration, which can be seen as a hypothesis generation process, is especially valuable, because a) it can deal with highly non-homogeneous and noisy data, and b) it is intuitive and requires no understanding of complex mathematical methods [86]. Visualization can thus provide a qualitative overview of the data, allowing data phenomena to be isolated for further quantitative analysis.

The emergence of visual analytics research suggests that more and more visual methods will be closely linked with automatic analysis methods. The goal of visual analytics is to turn the information overload into the opportunity of the decade [156, 157]. Decision-makers should be enabled to examine this massive, multi-dimensional, multi-source, time-varying information stream to make effective decisions in time-critical situations.
For informed decisions, it is indispensable to include humans in the data analysis process to combine flexibility, creativity, and background knowledge with the enormous storage capacity and the computational power of today's computers. The specific advantage of visual analytics is that decision makers may fully focus their cognitive and perceptual capabilities on the analytical process, while applying advanced computational capabilities to augment the discovery process.

Network computers have become so ubiquitous and easy to access that they are also vulnerable [108]. While extensive efforts are made to build and maintain trustworthy systems, hackers often manage to circumvent the security mechanisms and thereby find a way to infiltrate the systems and steal confidential information, to compromise network computers, and in some cases even to take over control of these systems. In practice, large networks consisting of hundreds of thousands of hosts are monitored by integrating logs from gateway routers, firewalls, and intrusion detection systems using statistical and signature-based methods to detect changes, anomalies, and attacks. Due to economic and technical trends, networks have experienced rapid growth in the last decade, which resulted in more legitimate as well as malicious traffic than ever before. As a consequence, the number of detected anomalies and security incidents becomes too large to cope with manually, which creates a pressing need for more sophisticated tools.

Our objective is to show how visual analysis can foster deep insight into the large data sets describing IP network activity. The difficult task of detecting various kinds of system vulnerabilities, for example, can be successfully solved by applying visual analytics methods.

Whenever machine learning algorithms become insufficient for recognizing malicious patterns, advanced visualization and interaction techniques encourage expert users to explore the relevant data by taking advantage of human perception, intuition, and background knowledge. Through a feedback loop, the knowledge acquired in the process of human involvement can be used as input for advancing automatic detection mechanisms.

1.4 Thesis outline and contribution

The overall goal of this thesis is to show how visual analysis methods can contribute to the fields of network monitoring and security. In many cases, the large amount of available data from network monitoring processes renders many visualization techniques inapplicable. Therefore, a careful selection and extension of current visualization techniques is needed. While the first three chapters motivate our work and introduce the necessary foundations of networking, intrusion detection, data modeling, information visualization, and visual analytics, Chapters 4 to 9 deal with efforts to appropriately represent and interact with the available information in order to gain valuable insight for timely reactions in case of failures and intrusions.

Chapter 2 details basic concepts in networking and intrusion detection that are necessary to comprehend the data sets analyzed throughout the application chapters. Network protocols are discussed at an abstract level along with various tools for monitoring, intrusion detection, and threat prevention. Since in some cases one has to deal with extremely large data sets, performance requirements of the database management system play an important role in our network research efforts.
The underlying data model for storing network traffic and event data was inspired by the OLAP (online analytical processing) approach used in building data warehouses for efficiently managing huge data volumes and computing aggregates under high performance requirements. In Chapter 3, the research fields of information visualization and visual analytics are discussed. Using Shneiderman s data type by task taxonomy, the visualization methods of this thesis are systematically put into context. Furthermore, we show how colors are mapped to values using different scaling functions and propose some literature for further reading. Next, the relatively young field of visual analytics is defined and its potential for network monitoring and security is pointed out. Based on an extensive review of scientific publications in the field, an overview of visual analysis systems and prototypes for network monitoring and security is presented to the reader. Starting from low-dimensional input data as in time series, the used input data increase in dimensionality as we proceed from Chapter 4 to Chapter 9. All these chapters follow the same methodology: after a short motivation, related visualization methods are reviewed, the respective visualization approach is introduced, discussed, and evaluated where applicable. Finally each method s applicability is demonstrated in at least one case studies. Chapter 4 describes the enhanced recursive pattern technique as an alternative to traditional line and bar charts for the comparison of several granular time series. In this visualization technique, through its color attribute each pixel represents the aggregated value of a time series

for the finest displayed granularity level. Long time series are subdivided into groups of logical units, for example, several hours each consisting of 60 minutes. By allowing empty pixels in the recursive patterns, the technique can better cope with irregularities in time series, such as the irregular number of days or weeks in a month. In order to be able to compare several time series, three coloring schemes and three alternative arrangements are proposed. Finally, the applicability of the extended recursive pattern visualization technique is demonstrated on real data of large-scale SSH flows in our network.

In Chapter 5, we propose the Hierarchical Network Map (HNMap), which is a space-filling map of the IP address space for visualizing aggregated IP traffic. Within the map, the positions of network entities are defined through a containment-based hierarchy by rendering child nodes as rectangles within the bounds of their parent node in a space-filling way. While the upper continent and country levels require a space-filling geographic mapping method to preserve geographical neighborhood, node placement in the lower two levels depends on the IP addresses, which are contained within the respective autonomous system or network prefix. Since there exist two alternative layout methods and their combination for these lower two levels, we evaluate their applicability according to a) visibility, b) average rectangle aspect ratio, and c) layout preservation. Visual analysis of network traffic and events essentially involves exploration of the data. Therefore, various means of interaction are implemented within our prototype. Finally, three case studies involving resource location planning, traffic monitoring, and botnet spread propagation are conducted and show how the tool enables insightful analyses of large data sets.
In the scope of Chapter 6, the HNMap is extended through hierarchical edge bundles to convey source-destination relationships of the most important network traffic links. In contrast to straight connection lines, these bundles avoid visual clutter while at the same time grouping traffic with similar properties in the IP/AS hierarchy of the map. In order to communicate the intensity of a connection, we consider both coloring and width of the splines. The case study then assesses changes of the major traffic connections throughout a day of network traffic.

Chapter 7 describes the Radial Traffic Analyzer (RTA), which is a visualization tool for multivariate analysis of network traffic. In the visualization, network traffic is grouped according to joint attribute values in a hierarchical fashion: starting from the inside, each ring represents one dimension of the data set (e.g., source/destination IP or port). While inner rings show the high-level aggregates, outer rings display more detailed information. By interactively rearranging the rings, the aggregation function of the data is changed. By animating the display, it is demonstrated that the RTA can be used for temporal analysis of network traffic. The case study then demonstrates how the tool is applied for the analysis of event data of an intrusion detection system.

In Chapter 8, we propose a novel network traffic visualization metaphor for monitoring the behavior of network hosts. Each host is represented through a number of nodes in a graph, whose positions correspond to the traffic proportions of that particular host within a specific time interval. Subsequent nodes of the same host are then connected through straight lines to denote behavioral changes over time.
In an attempt to reduce overdrawing of nodes with the same projected position, we then apply a force-directed graph layout to obtain compact traces for hosts of unchanged traffic proportions and large extension of the traces that represent hosts with highly variable traffic proportions. Two case studies show how the tool can be used to

gain insight into large data sets by analyzing the behavior of hosts in real network monitoring and security scenarios.

Chapter 9 details the analysis of content-based characteristics of network traffic using the well-known Self-Organizing Map (SOM) visualization technique. This neural network approach orders high-dimensional feature vectors on a map according to their distances. We create text descriptors by extracting the most popular terms from subject and text fields of messages in an archive and by applying the tf-idf information retrieval scheme. Within the case study, it is demonstrated that the SOM learned on these feature vectors can be used for classification tasks by distinguishing between spam and regular e-mails based on the position of the e-mail's feature vector on the map.

Chapter 10 concludes the dissertation by summarizing the contributions and giving an outlook to future work.


2 Networks, intrusion detection, and data management for network traffic and events

"Not everything that is counted counts, and not everything that counts can be counted."
Albert Einstein

Contents
2.1 Network fundamentals
Network protocols
The Internet Protocol
Routing
UDP and TCP
Domain Name System
Capturing network traffic
Network sniffers
Encryption, tunneling, and anonymization
Intrusion detection
Network and port scans
Computer viruses, worms, and trojan programs
Countermeasures against intrusions and attacks
Threat models
Building a data warehouse for network traffic and events
Cube definitions
OLAP Operations and Queries
Summary tables
Visual navigation in OLAP cubes

Computer networks have become an integral part of the IT infrastructure we use every day. It is therefore worth devoting some time to introducing networking concepts and terminology in order to foster the understanding of the data analysis challenges within this dissertation. In particular, methods for capturing network traffic, methods for intrusion detection, as well as data modeling issues are discussed.

2.1 Network fundamentals

Within this dissertation, networking concepts are explained only briefly. Kurose et al. [95] and Tanenbaum [153] have written excellent books on computer networking for a more thorough discussion. In general, one can distinguish between network hardware and software. Today's network hardware has diversified into various technologies: cable, fiber links, wireless, and satellite communication work together seamlessly. This is because flexible network communication protocols (the software part) introduce the necessary abstraction to facilitate communication among various machines running diverse operating systems and widespread applications. Many innovations of today's network communication were first proposed in the Request For Comments (RFC) series. RFCs were intended as an informal, fast way to share ideas with other network researchers. The series was hosted at Stanford Research Institute (SRI), one of the first nodes of the ARPANET, which was the predecessor of the Internet [103].

Network protocols

Computers in a network are called hosts. As mentioned above, different technologies exist to connect the hosts of a network. From a structural point of view, one often distinguishes between Local Area Networks (LANs) and Wide Area Networks (WANs). The main difference between them is the communication distance, not necessarily the size, resulting in the use of different communication hardware. Since long distance links are more expensive than wiring a few local hosts, there is a consolidation effect resulting in one strong link connecting two networks instead of several low-bandwidth links. However, due to availability concerns, many networks are connected through several links. The Internet is a giant network, which consists of countless interconnected networks. Many people confuse the term World Wide Web (WWW) with the term Internet.
In fact, the World Wide Web can be seen as a huge collection of interlinked documents accessed via the Internet. Countless webservers provide access to interconnected dynamic and static websites for arbitrary hosts in the Internet. Therefore, the Internet is the name of the network whereas WWW refers to a particular service running on top of this network infrastructure. E-mails, for example, are sent through the Internet and are another service besides WWW.

There exist two well-known reference models for network protocols: TCP/IP (Transmission Control Protocol/Internet Protocol) and OSI (Open System Interconnection). Neither the OSI nor the TCP/IP model and their respective protocols are perfect. First, the OSI reference model was proposed at a time when many of the competing TCP/IP protocols were already in widespread use, and no vendor wanted to be the first one to implement and support the OSI protocols. The second reason OSI never caught on is that the seven proposed layers were unnecessarily complex: two of the layers are almost empty (session and presentation), whereas two other ones (data link and network) are overloaded. In addition to that, some functions such as addressing, flow control, and error control reappear in each layer. Third, the early implementations of OSI protocols were flawed and were therefore associated with bad quality as opposed to TCP/IP, which was supported by a large user community. Finally, people thought of OSI as the creation of some European telecommunication ministries, the European Union, and later

as the creation of the U.S. government and thus preferred TCP/IP as a solution coming out of innovative research rather than bureaucracy [153]. OSI protocols are rarely used nowadays. Therefore, the focus is on TCP/IP protocols, but a hybrid reference model combining TCP/IP and OSI is considered throughout this dissertation.

Networking protocols are organized in so-called layers. These layers can be implemented in software (highest flexibility), in hardware (high speed), or in a combination of the two. The hybrid model uses the upper three layers of TCP/IP, but splits the Host-to-network layer into the data link and physical layer as illustrated in Figure 2.1.

    OSI             TCP/IP             hybrid
    Application     Application        5 Application
    Presentation
    Session
    Transport       Transport          4 Transport
    Network         Network            3 Network
    Data Link       Host-to-network    2 Data Link
    Physical                           1 Physical

Figure 2.1: The layers of the OSI, TCP/IP and hybrid reference models

The five layers of the hybrid model are as follows:

1. The physical layer is concerned with transmitting raw bits over a communication channel and ensures that if one party sends a 1 bit, the other party actually receives it as a 1 and not as a 0.

2. The data link layer, sometimes also called link layer or network interface layer, is composed of the device driver in the operating system and the corresponding network interface card. These two components are responsible for handling the hardware details with the cable.

3. The network layer, or internet layer, handles the movement of packets in the network and routes them from one computer via several hops to its destination. In the TCP/IP protocol suite this layer is provided by IP (Internet Protocol), ICMP (Internet Control Message Protocol), and IGMP (Internet Group Management Protocol).

4. The transport layer delivers the service of data flows for the application layer above it.
The protocol suite includes two conceptually different protocols: TCP (Transmission Control Protocol) offers a reliable flow of data between two hosts for the application layer by acknowledging received packets and retransmitting erroneous packets. UDP (User Datagram Protocol), on the contrary, only sends packets of data from one host to the other, but no guarantee about the arrival of the packet at the other end is given. It is often used in real-time applications like voice, music, and video streaming where a loss of some packets is acceptable to a certain degree.

5. The communication details of diverse applications are handled in the application layer. Common application layer protocols are Telnet for remote login, File Transfer Protocol (FTP) for file transfer, SMTP for electronic mail, SNMP for network management, etc.

    Protocol  Layer  Name                                 Purpose / applications
    FTP       AL     File Transfer Protocol               file transfer
    HTTP      AL     HyperText Transfer Protocol          hypertext transfer
    IMAP      AL     Internet Message Access Protocol     electronic mailbox with folders, etc.
    POP3      AL     Post Office Protocol version 3       electronic mailbox
    SMTP      AL     Simple Mail Transfer Protocol        e-mail transmission across the Internet
    SNMP      AL     Simple Network Management Protocol   network management
    SSH       AL     Secure Shell                         secure remote login (UNIX, LINUX)
    TELNET    AL     TELetype NETwork                     remote login (UNIX, LINUX)
    TCP       TL     Transmission Control Protocol        lossless transmission
    UDP       TL     User Datagram Protocol               transmission of simple datagrams (packets might be lost), music, voice, video
    ICMP      NL     Internet Control Message Protocol    error messages
    IGMP      NL     Internet Group Management Protocol   manages IP multicast groups
    IP        NL     Internet Protocol                    global addressing amongst computers

Table 2.1: Common protocols that build upon the TCP/IP reference model (NL = Network Layer, TL = Transport Layer, AL = Application Layer)

An important contribution of the OSI model is the distinction between services, interfaces, and protocols. Each layer performs some services for the layer above. A layer's interface, on the contrary, tells the processes above how to access it. The protocols used inside the layers can now be seen independently, and exchanging them will not affect other layers' protocols.
Since the upper layer protocols build upon the services provided by the lower layers, the intermediate routers do not necessarily need to understand the protocols of the application layer, but it suffices when they communicate data using lower level protocols and only the respective source and destination computers are capable of interpreting the used application protocols. A few commonly used protocols are listed in Table 2.1 to convey the intuition about what is done in which layer. Let us consider a short example: After requesting a web page, the HTTP header (Application layer) specifies the status code, modification date, size, content-type, and encoding of the document, among other technical details. The TCP protocol of the Transport layer then subdivides the document into multiple frames and specifies the source (HTTP = port 80) and destination ports on which the requesting host already listens. This protocol

guarantees reliable and in-order delivery of data from sender to receiver by sending requests and acknowledgments. In case of timeouts, TCP retransmits the lost frames and correctly assembles them. The IP protocol (Network layer) then provides global addressing and takes care of the routing of the frames from the source to the destination host. Often, this involves many routers, which each time transfer the packet to the next host. Normally, this host is closer to the destination with respect to network topology. Finally, the communication between two machines in this chain of involved routers and hosts is controlled using the data link layer. This might be handled by Ethernet or other data link and physical layer standards.

The Internet Protocol

As demonstrated in the example above, many upper layer protocols depend upon the Internet Protocol with its global addressing and routing capabilities. Nowadays, version 4 is most commonly used. The Internet's growth has been documented in several studies [107, 120, 75] by means of estimating its network traffic, its users, and the advertised IP addresses. Figure 2.2 illustrates this growth, stating that currently about 40 % of the approximately 4 billion IPv4 addresses are used. Predictions suggest that IANA (Internet Assigned Numbers Authority) will run out of IP addresses. However, Network Address Translation (NAT) and IPv6 technology can compensate for the need of more IPv4 addresses. Both technologies are already in use and ready for a broader deployment.

Figure 2.2: Advertised IPv4 address count daily average [75]

Figure 2.3 shows an IP datagram. The IP header normally consists of at least five 32-bit words (one for each of the first five rows in the figure).
It specifies the IP version used (mostly 4), the IP header length (IHL), the type of service, the size of the datagram (header + data), the identification number, which in combination with the source address uniquely identifies a packet, several flags, the fragmentation offset (byte count from the original packet, set by

the routers which perform IP packet fragmentation), the time to live (maximum number of hops which the packet may still be routed over), the protocol (e.g., 1 = ICMP; 6 = TCP; 17 = UDP), the header checksum, the source address and the destination address. In some cases, additional options are used by specifying a number greater than five in the IHL field. The rest of the datagram consists of the actual data, the so-called payload.

    Version | IHL | Type of Service | Total Length
    Identification | Flags | Fragment Offset
    Time to Live | Protocol | Header Checksum
    Source Address
    Destination Address
    Options (optional)
    Data

Figure 2.3: IP datagram

The IP addressing and routing scheme builds upon two components, namely, the IP address and network prefixes: An IP address is a 32-bit number (in IPv4) which uniquely identifies a host interface in the Internet. For example, is the IP address of a webserver at the University of Konstanz in dot-decimal notation. A prefix is a range of IP addresses and corresponds to one or more networks [65]. For instance, the prefix /16 defines the IP addresses assigned to the University of Konstanz, Germany. Each prefix consists of an IP address and a subnet prefix, which specifies the number of leftmost bits that should be considered when matching an IP address to prefixes.

Routing

When traffic is sent from a source to a destination host, several repeaters, hubs, bridges, switches, routers, or gateways might be involved. To clarify these terms, we need to reconsider the layers of the reference model since these devices operate on different layers. Repeaters were designed to amplify the incoming signal and send it out again in order to extend the maximum cable length (Ethernet: ca. 500 m). Hubs work in a very similar way, but send out the incoming signal on all their other network links. These two devices operate on the physical level, since they do not understand frames, packets, or headers.
Next, we consider switches and bridges, which both operate on the data link layer. Switches are used to connect several computers, similar to hubs, whereas bridges connect two or more networks. When a frame arrives, the software inside the switch extracts the destination address from the frame, looks it up in a table, and sends it out on the respective network link. When a packet enters a router, the header and the trailer are stripped off and the routing software determines by destination address in the header to which output line the packet should

be forwarded. For an IPv4 packet, this address is a 32-bit number (IPv6: 128 bit) rather than the 48-bit hardware address (also called MAC address or Ethernet Hardware Address (EHA)). The term gateway is often used interchangeably for a router. However, transport and application gateways operate one or two layers higher.

Since each network is independently managed, it is often referred to as an Autonomous System (AS). An AS is a connected group of one or more IP prefixes (networks) run by one or more network operators, and has a single and clearly defined routing policy [65]. ASes are indexed by a 16-bit Autonomous System Number (ASN). Usually, an AS belongs to a local, regional or global service provider, or to a large customer that subscribes to multiple IP service providers. The border gateway routers, which connect different ASes, base their routing decision upon each one's so-called routing table. This table contains a list of IP prefixes, the next router, and the number of hops to the destination.

Prefixes underlie Classless Inter-Domain Routing (CIDR) [54], which was preceded by Classful Addressing. Classful Addressing only allowed 128 A class networks (/8) each consisting of 16,777,214 addresses, 16,384 B class networks (/16) with 65,534 addresses each, and 2,097,152 C class networks (/24) of size 254. Note that the number of available addresses is always 2^N - 2, where N is the number of host bits and the -2 adjusts for the invalidity of the first and last addresses because they are reserved for special use. Since many mid-size companies required more than 254 addresses, the fear arose that the B class networks would soon be depleted. CIDR introduced variable prefix length and thus offered more flexibility to vary network sizes for both internal and external routing decisions. Through its bitwise address assignment and aggregation strategy, routing tables are kept small and efficient.
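The address arithmetic and the aggregation strategy can be illustrated with a minimal sketch based on Python's standard `ipaddress` module. The prefixes and next-hop labels below are hypothetical values from private address space, chosen only for illustration, and the helper names `usable_hosts` and `longest_prefix_match` are assumptions of this sketch rather than part of any routing software.

```python
import ipaddress

def usable_hosts(prefix_len: int) -> int:
    """Usable IPv4 addresses in a /prefix_len network: 2^N - 2 with N host bits."""
    host_bits = 32 - prefix_len
    return 2 ** host_bits - 2

# Two adjacent /24 prefixes (hypothetical) that share the same next hop
# can be aggregated into a single /23 routing table entry.
left = ipaddress.ip_network("10.0.0.0/24")
right = ipaddress.ip_network("10.0.1.0/24")
aggregated = list(ipaddress.collapse_addresses([left, right]))

def longest_prefix_match(address: str, table: dict):
    """Return the next hop of the most specific prefix containing the address."""
    addr = ipaddress.ip_address(address)
    candidates = [net for net in table if addr in net]
    if not candidates:
        return None
    best = max(candidates, key=lambda net: net.prefixlen)
    return table[best]

# Hypothetical routing table: prefix -> next router.
routing_table = {
    ipaddress.ip_network("10.0.0.0/8"): "router A",
    ipaddress.ip_network("10.0.0.0/23"): "router B",
}

print(usable_hosts(24), usable_hosts(16), usable_hosts(8))
print(aggregated)
print(longest_prefix_match("10.0.1.7", routing_table))
```

The lookup scans the whole table for clarity; production routers use specialized data structures such as tries for the same decision.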
Contiguous ranges of IP addresses, which are all forwarded to the identical next hop, are aggregated in the routing tables. For example, traffic to the prefixes /24 and /24 both destined for AS 553 can be aggregated to /23. Each time a packet arrives at an intermediate hop, it is forwarded to the router with the most specific prefix entry matching the destination IP address. This is done by checking whether the N initial bits are identical. Note that routing is usually more specific within an AS, whereas external routing is highly aggregated due to the fact that all traffic from a particular source to a destination AS needs to pass through the same border gateway router. Further details of the exterior routing, such as policies, costs, announcement, and withdrawal of prefixes are dealt with in the Border Gateway Protocol (BGP) [106].

UDP and TCP

As mentioned previously, UDP and TCP operate on the transport layer and provide end-to-end data transport over an unreliable internetwork. The connectionless protocol UDP provides the service of sending IP datagrams with a short header for applications. This is done by adding the source and destination port fields to the IP header, thus enabling the transport layer to determine which process on the destination machine is responsible for handling the packet. The destination port specifies which process on the target machine is to handle the packet, whereas the source port details on which port the reply to the request should arrive. In the reply, the former source port is simply copied into the destination port so that the requesting machine knows how to handle the answer.
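The port handling described above can be sketched with the standard `struct` module. The UDP header consists of four 16-bit fields (source port, destination port, length, checksum); the port numbers, payloads, and helper names below are hypothetical and serve only to show how a reply swaps the two ports.

```python
import struct

# Four unsigned 16-bit fields in network byte order:
# source port, destination port, length, checksum.
UDP_HEADER = struct.Struct("!HHHH")

def build_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Prepend a UDP header to the payload."""
    length = UDP_HEADER.size + len(payload)
    # A checksum of 0 means "not computed", which UDP over IPv4 permits.
    return UDP_HEADER.pack(src_port, dst_port, length, 0) + payload

def parse_ports(datagram: bytes):
    """Extract (source port, destination port) from a UDP datagram."""
    src_port, dst_port, _length, _checksum = UDP_HEADER.unpack_from(datagram)
    return src_port, dst_port

def build_reply(request: bytes, payload: bytes) -> bytes:
    """Answer a request: the former source port becomes the destination port."""
    src_port, dst_port = parse_ports(request)
    return build_datagram(dst_port, src_port, payload)

# Hypothetical exchange: a client on port 50000 asks a service on port 53.
request = build_datagram(50000, 53, b"query")
reply = build_reply(request, b"answer")

print(parse_ports(request))
print(parse_ports(reply))
```

In practice the operating system's socket layer builds these headers; the sketch only makes the field layout and the port swap visible.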

In contrast to UDP, the connection-oriented protocol TCP provides a reliable service for sending byte streams over an unreliable internetwork. This is done by creating so-called sockets, which are nothing else but communication end points, and by binding the ports local to the host to the sockets. TCP can then establish a connection between a socket on the source and a socket on the target machine. The IANA [74] is responsible for the assignment of application port numbers for both TCP and UDP. Conceptually, there are three ranges of port numbers:

1. On many systems, well-known port numbers ranging from 0 to 1023 can only be used by system (or root) processes or by programs executed by privileged users. They are assigned by the IANA in a standardization effort.

2. Registered port numbers ranging from 1024 to 49151 can be used by ordinary user processes or programs executed by ordinary users. The IANA registers uses of these ports as a convenience to the community.

3. Dynamic and/or private port numbers ranging from 49152 to 65535 can be used by any process and are not available for registration.

In the analysis types presented in this thesis, application port numbers are often used as an indication of what applications are using the network. Since the application port numbers can be extracted from the packet headers, this kind of analysis does not require looking at the packet content, which might otherwise raise additional privacy concerns. Although the used application ports are a rather good estimate for regular traffic, the application port numbers can be used by other processes than the ones they were originally meant for. The peer-to-peer Internet telephone system Skype, for example, uses port 80 and 443, which are registered to web traffic (http) and secure web traffic (https).
This is done in order to bypass application firewalls, which are an effort to protect the network infrastructure from malicious traffic by blocking unused ports. Naturally, this is only possible if the application ports on that particular machine have not already been bound to a webserver process.

Domain Name System

So far, we have discussed various protocols which all rely on some sort of network address (e.g., the MAC address or the IP address). Whereas machines can perfectly deal with these kinds of addresses, humans find them hard to remember. Due to this fact, ASCII names were introduced to decouple machine names from machine addresses. However, the network itself only understands numerical addresses. Therefore, a mapping mechanism is required to convert the ASCII strings to network addresses. Since the previously used host files could not keep pace with the fast growing Internet, the Domain Name System (DNS) was invented.

The Internet is conceptually divided into a set of roughly 250 top level domains. Each one of these domains is further partitioned into subdomains, which in turn can have subdomains of their own, and so on. This schema spans a hierarchy. One distinguishes between so-called generic (e.g., com, edu, org) and country top-level domains (e.g., de, ch, us). Subdomains can then be registered at the responsible registrar. Each domain is named by the path upward

from it to the (unnamed) root, whereas the components are separated by periods (pronounced "dots"). For example, www.uni-konstanz.de specifies the subdomain www (common convention to name a webserver), which is a subdomain of uni-konstanz, which in turn is registered below the top-level domain de. This naming schema usually follows organizational boundaries rather than the physical network. When a domain name is passed to the DNS system, the latter returns all resource records associated with that name. For simplicity, we restrict ourselves to the address records, which might look like this:

    Domain    TTL    Class    Type    Value
                     IN       A

Resource records contain five values, namely the domain name, time to live (TTL), class, type, and value. In the record above, the TTL value of 86400 (the number of seconds in one day) indicates that the record is rather stable since highly volatile information is assigned a small value. IN specifies that the record contains Internet information, and A that it is an address record. The final field specifies the actual IP address that is mapped to the domain. Other resource records hold information such as the start of authority, responsible mail and name servers for a particular domain, pointers, canonical names, host descriptions, or other informative text. For more details refer to [153].

2.2 Capturing network traffic

To conduct data analysis of network traffic, details of this traffic need to be obtained from hosts, routers, firewalls, and intrusion detection systems. Often, collecting this data turns out to be a practical challenge since some network packets might pass several routers and are thus stored several times. Furthermore, export interfaces of routers might return the so-called netflows, which are detailed information about size, time, source, and destination of the transferred network traffic, in different formats.
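The duplication problem can be sketched as follows. The `Flow` record, its field names, the router labels, and the sample values are hypothetical; the sketch further assumes that the exports have already been normalized to one common format and that duplicates match exactly, which real exports with slightly differing timestamps would not guarantee.

```python
from typing import NamedTuple

class Flow(NamedTuple):
    """Hypothetical normalized flow record; the field names are illustrative."""
    start: int   # start time in seconds since the epoch
    src: str     # source IP address
    dst: str     # destination IP address
    size: int    # transferred bytes

def deduplicate(observations):
    """Map each distinct flow to the list of routers that reported it."""
    merged = {}
    for router, flow in observations:
        merged.setdefault(flow, []).append(router)
    return merged

# The first flow passed two routers and was therefore exported twice.
observations = [
    ("router-a", Flow(1000, "10.0.0.5", "192.0.2.9", 4200)),
    ("router-b", Flow(1000, "10.0.0.5", "192.0.2.9", 4200)),
    ("router-b", Flow(1003, "10.0.0.7", "192.0.2.9", 310)),
]

merged = deduplicate(observations)
total_bytes = sum(flow.size for flow in merged)  # each flow counted once
print(len(merged), total_bytes)
```

Keeping the list of reporting routers per flow preserves the path information while preventing double counting of traffic volume.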
For a more detailed description of problems and solutions to measuring network traffic, we recommend the book Network Algorithmics: an interdisciplinary approach to designing fast networked devices by George Varghese [172].

Network sniffers

There exists an alternative way of monitoring network traffic when access to the export interface of a router is not given. The network card of almost any computer can be set into promiscuous mode, which instructs the network card to pass any traffic it receives to the CPU rather than just packets addressed to itself. In the next step, the packets are passed to programs extracting application-level data. Depending on the network infrastructure, packet sniffing can be very effective or not effective at all: hubs forward all traffic to each of their network interfaces (except for the one where it came in), whereas switches only forward incoming traffic to one network link as long as it is not a broadcast packet. Despite this fact, ARP Poison Routing

(APR) can be used to fool switches by misleadingly announcing the MAC addresses of other hosts in the network.

Today, network administrators and hackers can choose from a wide variety of packet sniffers. A few commonly used freeware tools are listed here:

libpcap is a system-independent interface for user-level packet capturing, which runs on POSIX systems (Linux, BSD, and UNIX-like OSes).

tcpdump is a command line tool that prints out the headers of packets on a network interface matching a boolean expression. It runs on POSIX systems and is built upon libpcap.

WinPcap contains the Windows version of the libpcap API.

JPcap is a Java wrapper for libpcap and WinPcap, which provides a Java interface for network capturing.

Wireshark (formerly Ethereal) is a free graphical packet capture and protocol analysis tool, which runs on POSIX systems, MS Windows, and Mac OS X and uses libpcap or WinPcap depending on the OS.

The OSU Flow-tools are a set of tools for recording, filtering, printing and analyzing flow logs derived from exports of Cisco NetFlow accounting records [55].

Since there has always been a need to monitor network traffic and debug network protocols, many more freeware and commercial products have emerged. An overview can be found in [72]. By simply instructing a router to duplicate the outgoing packets and output them on an additional network interface, it is possible to sniff all network traffic using a machine with a promiscuous network interface that is connected to the router. This reduces the effort of having to deal with various formats of the export interface.

Surprisingly, a lot of commonly used applications, such as POP3, FTP, IMAP, htaccess, and some webstores do not encrypt transferred data, usernames or passwords. For hackers, it therefore often suffices to have a network sniffer installed which listens to the communication between their victims and the servers they use.
The sniffed Ethernet frames are simply passed to an application which searches their payload for passwords. Note that in high-load situations not all packets can be captured due to capacity problems of the machine where the sniffer runs. Routers are built and configured to pass large amounts of packets, whereas sniffers need to analyze the packets at a higher level, which needs more processing power.

Encryption, tunneling, and anonymization

Encryption can be used to prevent password theft and to guarantee the privacy of the communication. More formally, the desirable properties of secure communication can be identified as follows:

Confidentiality is often perceived as the only component of secure communication. However, it only comprises that a message is communicated from the sender to the receiver without anybody else being able to obtain its content.

Authentication implies that both the sender and the receiver should be able to confirm the identity of one another. Naturally, face-to-face communication easily solves that problem, but other forms of communication make authentication a technical challenge.

Message integrity and nonrepudiation deal with the cryptographic property that a message cannot be altered by the so-called man in the middle and with the fact that it can be proven that a certain message was written by the sender and nobody else. In the physical world, we often use signatures, but since it is trivial to make identical copies in the digital world, more sophisticated methods are needed.

Availability and access control. Experiences with denial-of-service (DoS) attacks from the last few years have shown that availability is vital for ensuring secure communication. To further extend this concept, access control to the communication infrastructure can first and foremost prevent the communication from being intercepted.

Confidentiality, authentication, message integrity, and nonrepudiation have already been considered key components of secure communication for some time [117]. More recently, availability and access control were added [111, 15].

There exist two conceptually different encryption methodologies, namely the symmetric and the asymmetric one. Symmetric encryption relies on the fact that sender and receiver share a secret, the so-called key. This key can be used to encrypt and decrypt messages. Asymmetric encryption, also known as public key cryptography, employs two keys per identity: the public key can be obtained from a publicly accessible key repository, whereas the private key should be kept as a secret.
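The difference between the two methodologies can be illustrated with a deliberately insecure toy sketch. The XOR cipher and the tiny RSA-style key pair below are assumptions chosen for readability and must not be used to protect real data; the function names and sample values are hypothetical.

```python
from itertools import cycle

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: the same shared key encrypts and decrypts."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

shared_key = b"k3y"
secret = xor_cipher(b"hello", shared_key)      # sender encrypts
plain = xor_cipher(secret, shared_key)         # receiver decrypts with the same key

# Toy asymmetric key pair built from two tiny primes (trivially breakable).
p, q = 61, 53
n = p * q                   # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                      # public exponent, published
d = pow(e, -1, phi)         # private exponent, kept secret (Python 3.8+)

def apply_key(value: int, exponent: int) -> int:
    """Modular exponentiation, used for both encryption and decryption."""
    return pow(value, exponent, n)

message = 65
ciphertext = apply_key(message, e)    # anyone may encrypt with the public key
recovered = apply_key(ciphertext, d)  # only the private key holder can decrypt

print(plain, recovered)
```

The symmetric half needs a secret that both parties already share, whereas the asymmetric half lets anyone encrypt with the published exponent `e` while only the holder of `d` can decrypt.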
The public key of the recipient suffices to encrypt a message addressed to that recipient. However, once encrypted, this message can no longer be read by the sender since it can only be decrypted using the private key of the recipient. Public keys can also be used to prove the sender's authenticity, the message integrity, and its nonrepudiation. A hash code is generated from the message and encrypted using the sender's private key. This so-called signature is then attached to the original message and encrypted using the recipient's public key. Decryption is also twofold: first, the message and signature are decrypted using the recipient's private key; then, the signature is decrypted a second time using the sender's public key and is compared to the hash code of the original message. This only works due to the fact that the encryption and decryption functions are reversible, which means that the outcome of encrypting and then decrypting a message is the same as decrypting a message and then encrypting it.

While asymmetric encryption is considerably slower than most symmetric encryption methods, it is often used to exchange the keys for symmetric encryption methods. Once this is done, fast symmetric encryption can be used to encrypt large volumes of data.

In many network security applications, the SSH protocol is used to establish a secure channel over which traffic from other applications can be tunneled as sketched in Figure 2.4. In other words, the sender creates a local socket, which is bound to the SSH application and which encrypts all incoming

traffic and sends the encrypted payload over the Internet until it arrives at the recipient where it is decrypted and forwarded to the port on either the recipient's machine or on a host in the network specified by the SSH tunnel. Note that if the tunnel only exists from the sender's to the recipient's machine, the traffic that the recipient forwards onward is unencrypted. Naturally, the response also travels through this secured channel.

Figure 2.4: Concept of an SSH tunnel. Network traffic from diverse applications is automatically encrypted and decrypted, thus preventing eavesdropping by bad guys. [Diagram: sender and recipient connected by an SSH tunnel through the Internet, with an eavesdropping "bad guy".]

Tunneling has proven to be very powerful for hiding private communication, since the network administrators only see how much encrypted data is transferred from a source to a destination host using SSH. So far, it was still possible for the network administrator or the evil man in the middle to analyze the source and destination IP addresses. However, today many service providers offer anonymization services, which forward traffic to conceal one communication partner. Once the log files of this anonymization service are deleted, it becomes impossible to trace back the communication, given that enough people use the anonymization service simultaneously.

While most users employ encryption, tunneling, and anonymization to guard their privacy in an open network infrastructure like the Internet, these mechanisms can be easily misused by criminals to hide their offenses. Enforcing availability and access control as the final property of secure communication in an open network infrastructure can turn out to be a challenge. The next section will explain how today's technology is used to make the distinction between good and bad guys in order to protect the network infrastructure.
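The sign-then-encrypt procedure described earlier in this section (hash the message, encrypt the hash with the sender's private key, encrypt message and signature with the recipient's public key, and reverse both steps on arrival) can be sketched with textbook RSA. This is a toy under loud assumptions: the function names are hypothetical, the keys are built from small, publicly known Mersenne primes, no padding is used, and message and signature are encrypted as two separate numbers rather than as one attached block. It only illustrates the order of operations and must not be used as real cryptography.

```python
import hashlib


def make_keypair(p, q, e=65537):
    """Build a toy RSA key pair from two primes (textbook RSA, no padding)."""
    n, phi = p * q, (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse, requires Python 3.8+
    return (e, n), (d, n)


def digest(message, n):
    """Hash code of the message, reduced so that it fits below the modulus n."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n


def sign_then_encrypt(message, sender_priv, recipient_pub):
    d_s, n_s = sender_priv
    e_r, n_r = recipient_pub
    m = int.from_bytes(message, "big")
    if m >= n_r or n_s >= n_r:
        raise ValueError("message or signature does not fit the recipient modulus")
    signature = pow(digest(message, n_s), d_s, n_s)    # sender's private key
    return pow(m, e_r, n_r), pow(signature, e_r, n_r)  # recipient's public key


def decrypt_then_verify(ciphertext, recipient_priv, sender_pub):
    d_r, n_r = recipient_priv
    e_s, n_s = sender_pub
    c_msg, c_sig = ciphertext
    m = pow(c_msg, d_r, n_r)                           # recipient's private key
    signature = pow(c_sig, d_r, n_r)
    message = m.to_bytes((m.bit_length() + 7) // 8, "big")
    valid = pow(signature, e_s, n_s) == digest(message, n_s)  # sender's public key
    return message, valid


# Assumption: Mersenne primes chosen only because they are easy to write down.
SENDER_PUB, SENDER_PRIV = make_keypair(2**61 - 1, 2**89 - 1)
RECIPIENT_PUB, RECIPIENT_PRIV = make_keypair(2**107 - 1, 2**127 - 1)

if __name__ == "__main__":
    packet = sign_then_encrypt(b"meet at noon", SENDER_PRIV, RECIPIENT_PUB)
    print(decrypt_then_verify(packet, RECIPIENT_PRIV, SENDER_PUB))
```

In practice, as the text notes, the slow asymmetric step would only be used to exchange a symmetric key, and the bulk data would then be encrypted symmetrically.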
2.3 Intrusion detection

The Internet was originally designed in a way that each host can communicate with any other host in the network. Since the number of hosts in the Internet has grown immensely, hosts became more and more anonymous and some started misbehaving. Although it is easily traceable which company owns the IP address of the misbehaving host, it can still take several days until that host is cut off. Due to the lack of international laws, a misbehaving host might even

be placed in a country where it does not violate any law, whereas its misbehavior is liable to prosecution in the victim's country. Therefore, Internet folks often prefer technical solutions, which can be put in place a lot faster than law enforcement. This section briefly describes different kinds of misbehavior and details methods to deal with them. Interested readers are referred to [115, 171] for more details.

Network and port scans

In most attack scenarios, a host or a network is first scanned for vulnerabilities using two principal methods. The first method is called port scan and checks a particular range or sometimes even all ports of the target host for the applications which are listening. As soon as an application answers the request of the attacker, the latter tries to guess the kind of application from its protocol and port. Sometimes it is even possible to infer which version of that particular application is running on the probed port. This knowledge then gives the hacker meaningful hints about unpatched security vulnerabilities to be exploited when launching an attack. Figure 2.5 shows the output of a port scan with NMap [71], detailing open ports and the respective listening applications of the scanned host. Note that this quick scan only took seconds.

Figure 2.5: Port scan using NMap

Network scans are the second scanning method, usually targeting a whole network or a range of addresses. Commonly, the whole network is scanned on a particular port where a known security vulnerability might still be unfixed. Since sequential scans are easily detectable and can be blocked, tool developers have implemented various strategies to obfuscate scans by permuting the scanned addresses and by artificially stretching the scan over a larger time interval. Still, common Intrusion Detection Systems (IDS) easily manage to detect normal and obfuscated scans.
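To make the two scan types concrete, the following minimal sketch flags a port scan when one source touches many distinct ports on a single target, and a network scan when one source touches many distinct hosts on a single port. The function name, the flow tuple layout, and the thresholds are assumptions made for illustration; this is not how any particular IDS is implemented. Because the sketch counts distinct values instead of looking for sequences, permuting the scanned addresses does not hide the scan.

```python
from collections import defaultdict


def detect_scans(flows, port_threshold=20, host_threshold=20):
    """Flag scanners in an iterable of (source, destination, destination_port).

    Returns (port_scans, network_scans):
      port_scans    -- (source, destination) pairs probing many distinct ports
      network_scans -- (source, port) pairs probing many distinct hosts
    The thresholds are hypothetical defaults, not tuned values.
    """
    ports_per_target = defaultdict(set)  # (source, destination) -> ports seen
    hosts_per_port = defaultdict(set)    # (source, port) -> destinations seen
    for source, destination, port in flows:
        ports_per_target[(source, destination)].add(port)
        hosts_per_port[(source, port)].add(destination)
    port_scans = sorted(
        key for key, ports in ports_per_target.items() if len(ports) >= port_threshold
    )
    network_scans = sorted(
        key for key, hosts in hosts_per_port.items() if len(hosts) >= host_threshold
    )
    return port_scans, network_scans


if __name__ == "__main__":
    # Documentation address ranges, used here purely as sample data.
    sample = [("198.51.100.7", "192.0.2.10", port) for port in range(1, 31)]
    sample += [("203.0.113.9", f"192.0.2.{host}", 22) for host in range(1, 26)]
    sample.append(("192.0.2.50", "192.0.2.10", 80))  # ordinary web request
    print(detect_scans(sample))
```

A scan that is stretched over a long time interval would additionally require keeping these counters over a correspondingly long window, which the sketch leaves out.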

Computer viruses, worms, and trojan programs

Computer security terms like computer viruses, worms, and trojan programs are commonly used as buzz words in the media. Despite this fact, their meaning is frequently confused. Therefore, these terms among other specialized vocabulary are briefly reviewed in this section.

Computer virus. A piece of software that copies itself into other programs. Possibly, it also performs other tasks like spying on users of the infected computer or deleting files. Commonly, computer viruses are passed on by infected files on discs, by Internet downloads, or by e-mail attachments.

Computer worm. Worms are viruses that spawn running copies of themselves. In the past, worms have repeatedly demonstrated extremely fast world-wide propagation through known bugs in unpatched clients or other application software.

Trojan. A trojan (derived from the trojan horse of Greek mythology) is a program which is supposed to perform a certain function, but secretly performs another, usually diabolical one.

Botnet. The term comes from robot NETwork and refers to a large number of compromised computers. Normally, hosts are added to botnets through trojan programs or worms. Many of these bot computers communicate with an Internet Relay Chat (IRC) server without the knowledge of their actual owner and receive remote control commands from there. It is common knowledge that bot computers are misused to harvest e-mail addresses in the Internet, send out spam, and participate in DDoS attacks (see below).

0-Day-Exploit. Most of today's known malware exploits one or another known security leak and infects unpatched computers or applications with slow update cycles. 0-Day-Exploits, on the contrary, are security leaks within software, which are yet unknown to both the public and the software producers.
On the black market, these exploits are actively traded since patches do not yet exist and the malware spread is expected to be greater. Furthermore, 0-Day-Exploits can be secretly used to spy on particular target hosts with little risk of being detected.

Denial of Service attack (DoS). This class of attacks is characterized by the goal to make a computer resource unavailable to its intended users, either by consumption of computational resources, disruption of configuration information (e.g., routing information), or disruption of physical network components. A SYN flood, for example, is one common DoS attack targeting webservers or other hosts in the Internet. The concept is to send many TCP/SYN packets with spoofed sender addresses to the victim host. Because each of these packets is handled like a connection request, the attacked host sends back a TCP/SYN-ACK packet and waits for the TCP/ACK packet in response. However, the number of available connections on the attacked host will soon be depleted, since the sender addresses were spoofed and the acknowledgment packets will never arrive. At this point, the victim is no longer reachable by its intended users.

Distributed Denial of Service attack (DDoS). DDoS attacks extend the above introduced concept of DoS attacks by being conducted using multiple compromised hosts. One can easily imagine the challenge of distinguishing normal traffic from the hostile traffic generated by remotely controlled hosts of a botnet. DDoS attacks have made especially those companies whose business model depends on the availability of their e-commerce or e-banking platform vulnerable to blackmailing.

Spam. The term spam ("SPiced HAm") has become a synonym for unwanted advertisement e-mails and originates from a Monty Python sketch where it was extensively used until no communication was possible any more. The countless unwanted advertisement e-mails are still a major concern in network operations, probably less due to the waste of network and storage resources, but more because end users spend a lot of time sorting them out of their inbox. The Symantec Internet Security Threat Report [151] stresses that between July 1 and December 31, 2006 spam made up 59 percent of all monitored e-mail traffic. Economies of scale resulted in the ability of just one single spam server with a good network connection to send out several millions of e-mails per day at almost no cost. Lots of efforts have been made to enhance spam detection software, ranging from centralized spam signature repositories to individually trained artificial intelligence filters for automatic filtering of as many spam e-mails as possible. However, these methods have to deal with the fact that a single falsely removed ham (opposite of a spam) message might potentially be very expensive as opposed to manually sorting out a few spam messages from the inbox.

Phishing. Phishing is a rather recently introduced synonym for fraudulent e-mail messages. Cyber criminals create these messages claiming to be sent from well-known companies.
Links to faked online banking and e-commerce platforms within these e-mails aim at stealing user names, passwords, and online banking numbers from credulous victims.

Although this list of definitions is by no means complete, it gives a general idea about the current hot topics in the field of computer security. The next section deals with automatic methods for coping with the above mentioned issues.

Countermeasures against intrusions and attacks

To date, a lot of effort has been made to develop intrusion prevention and detection technology. Modern personal computers are commonly equipped with antivirus software, which ensures that malicious code is detected, quarantined, and removed. Furthermore, many operating systems have considerably improved their security concept, for example by putting into place a software firewall which only allows traffic from registered applications to pass. Network firewalls serve the same purpose by controlling access to the internal computers of the network. Normally, only internal hosts are allowed to open connections to the wild Internet. Connection requests from the outside are automatically blocked with only few exceptions. In some scenarios, reaching an internal host in the network is crucial for its designated function. Web, mail, and DNS servers, for example, should be reachable from the outside and are thus often placed in a so-called demilitarized zone (DMZ) as depicted in Figure 2.6.

Figure 2.6: Concept of a demilitarized zone (DMZ). [Diagram: DNS, SMTP, and WWW servers in the DMZ subnet; internal network; firewall; router to external network.]

Obviously, hosts within this zone are less protected than the rest of the network. Nevertheless, they are integrated into the overall security concept of the network and are thus better protected than hosts in the wilderness.

Many commercial networks employ Intrusion Detection Systems (IDS), which scan the network traffic for external and internal threats. Note that countless security holes of unpatched operating systems might allow malicious code to nest itself into an internal host. This compromised host can then infect other hosts in the internal network since the firewall only blocks external traffic. Signature-based IDS filter network traffic according to their set of signatures of already known attacks. Commonly, these systems are updated on a regular basis, but due to their design they cannot detect yet unknown attack patterns. More recently, a second class of IDS has evolved, so-called anomaly-based IDS. These systems maintain profiles of the network traffic of subnets and hosts. If, all of a sudden, there are significant changes within the type of traffic of a particular entity, an alert is generated to inform the responsible network administrator. Some installations even automatically initiate countermeasures, such as isolating misbehaving hosts to minimize harm to the rest of the healthy IT infrastructure.

Although IDS have proven to detect and block many threats, hackers and cyber criminals will always find new ways to circumvent them. Honey pots were invented to fight these bad guys with their own methods: a honey pot is a system that simulates a vulnerable host in the network and carefully logs attacks conducted against it. This helps to keep track of novel threats, attacks, as well as actions undertaken by the attacker to break into the computer.
Naturally, all these systems produce alerts, and manually keeping track of all of them is possible only up to a certain network size and activity level. The next section deals with data modeling as well as storage of network traffic and event data in order to quickly find the right data or to calculate aggregates for the visualization module.
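The idea behind the anomaly-based IDS described above, namely maintaining a traffic profile per host and alerting on sudden significant changes, can be sketched as follows. This is only an illustration under stated assumptions: the profile consists of the mean and standard deviation of past hourly byte counts, the alert rule is a simple k-sigma test, and all names and thresholds are hypothetical. Real systems use considerably richer profiles.

```python
from statistics import mean, pstdev


def build_profile(history):
    """Map each host to (mean, standard deviation) of its past traffic volumes."""
    return {host: (mean(values), pstdev(values)) for host, values in history.items()}


def find_anomalies(profile, current, k=3.0, min_sigma=1.0):
    """Return sorted (host, reason) alerts for the current observation.

    A host is reported when it has no profile yet, or when its current value
    deviates from its mean by more than k standard deviations. min_sigma keeps
    perfectly constant histories from triggering on tiny changes.
    """
    alerts = []
    for host, value in current.items():
        if host not in profile:
            alerts.append((host, "no profile"))
            continue
        mu, sigma = profile[host]
        if abs(value - mu) > k * max(sigma, min_sigma):
            alerts.append((host, "deviation"))
    return sorted(alerts)


if __name__ == "__main__":
    history = {
        "10.0.0.5": [100, 110, 90, 105, 95],  # hourly kilobytes, sample data
        "10.0.0.6": [200, 210, 190],
    }
    now = {"10.0.0.5": 5000, "10.0.0.6": 205, "10.0.0.7": 40}
    print(find_anomalies(build_profile(history), now))
```

Every returned tuple corresponds to one alert, which illustrates why the number of alerts grows with the number of monitored hosts.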

Threat models

In the scope of this thesis, the presented visualization approaches support the analyst in the tasks of detecting misuse and threats, filtering out irrelevant alarms, correlating alarms, monitoring the behavior of hosts, and inspecting the payload of network packets. None of the presented approaches is suitable for supporting all of these tasks since each proposed visualization is designed to solve a specific problem with a particular set of threats in mind. Therefore, we consider a particular threat model for each of the visualizations.

In Chapter 4, the threat model assumes that high traffic values, correlations of high values in several traffic parameters, or regularly reappearing patterns indicate an attack or an intrusion. While peaks in time series can be easily spotted and referenced to their exact time of occurrence, not all reappearing temporal patterns can be identified with our proposed visualization. However, the visualization tool can supply the analyst with valuable information for developing algorithms or signatures aimed at detecting more threats of this model.

The next chapter deals with threats that are correlated with respect to their source or destination. For example, an attack could be launched from several computers within the same prefix or AS. The proposed visualization therefore offers an interface for exploring network traffic aggregated at several levels, which correspond to the network infrastructure.

This threat model is further refined for the analysis in Chapter 6 in order to take the relationship between the source and the destination of traffic flows into account. Since the proposed edge bundles build upon the map introduced in Chapter 5, they also consider the hierarchical structure of the IP address dimension. The visualization therefore enables analysis of threats based on relating senders and receivers.
Chapter 7 considers a different threat model by assuming that attacks or misuses can be detected by considering packet details and aggregates. While some threats of this model can be detected using the proposed visualization technique, it becomes infeasible for human analysts to consider all possible aggregation paths of complex data sets.

To explicitly consider temporal changes, the threat model in Chapter 8 defines the host behavior based on the respective network traffic and the proposed graph visualization is used to illustrate these behavioral changes. Under the assumption that hacked computers change their behavior with respect to the quantity or type of network traffic, the proposed visualization enables detection of an enlarged set of threats.

Chapter 9 deals with a completely different type of threat model, which considers the textual content of network traffic in order to filter out harmful traffic. It is demonstrated on the basis of a Self-Organizing Map how a feature-based approach can be used to filter out harmful traffic such as spam messages.

At this point we have to remark that each of the threat models comprises a large set of threats. However, when visualization is applied to detect these threats, it is often up to the user to find peculiarities within the data. Therefore, the human analyst cannot be guaranteed to detect all threats within the considered threat model using the proposed techniques. Furthermore, so-called false positives are likely to occur due to misinterpretation of ambiguous alerts.

2.4 Building a data warehouse for network traffic and events

When a network is kept thoroughly monitored over a longer period of time, a lot of data is accumulated in log files of different formats originating from gateways, firewalls, honey pots, and intrusion detection systems. In order to obtain an overview of all these different data sources and to develop the capability of analyzing these large data sets in the visualization tools presented within the scope of this dissertation, it was necessary to build a data warehouse by identifying common data characteristics and proposing a data modeling scheme suitable for all data sources.

Data warehousing is a field that has grown out of the integration of several technologies and experiences over the last two decades. The term data warehouse was originally coined by Inmon in 1990 with the following definition: "A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions. The data warehouse contains granular corporate data." [70]

Traditional functional and performance requirements of On-Line Transaction Processing (OLTP) applications are unsuitable for the tasks at hand: rather than processing many transactions in real time, we need a system that supports fast calculation of aggregates for analysis tasks while taking cross-dimensional conditions into account. After careful consideration, the OLAP (On-Line Analytical Processing) architecture was found to be an appropriate option for processing huge amounts of network data in a multitude of scenarios. The term OLAP was coined by Codd et al. as follows: "OLAP is the name given to the dynamic enterprise analysis required to create, manipulate, animate, and synthesize information from exegetical, contemplative, and formulaic data analysis models (...).
This includes the ability to discern new or unanticipated relationships between variables, the ability to identify the parameters necessary to handle large amounts of data, to create an unlimited number of dimensions (consolidation paths), and to specify cross-dimensional conditions and expressions." [30]

OLAP-based solutions, initially adopted by traditional business applications, have proven to be beneficial and extendable to serve a much wider range of application domains, such as health care, biology, government, education, and network security, to name a few.

Figure 2.7 places OLAP in the context of the data warehousing system architecture. Each layer of the depicted 5-layer-model encapsulates a different data flow in the system [137]. The Data Sources Layer comprises various data sources, which all store data in their proprietary data formats. Therefore, the ETL layer (Extract, Transform, Load) has to deal with integrating the data from these heterogeneous sources into a consistent state within the target schemas. Next, the transformed data is passed on to the Data Warehouse Layer where it is stored and archived in a special purpose database. Data analysis methodologies and techniques, such as OLAP and Data Mining, form the Analysis Layer. Finally, the Presentation Layer completes the data warehouse system architecture with its frontend analytical applications also known

as BI tools for presentation and exploration of the data. Since data warehouses have different performance requirements than operational database management systems optimized for OLTP and because of the need to consolidate data from many heterogeneous sources, they are normally implemented separately from operational databases.

Figure 2.7: A multilayered data warehousing system architecture. [Diagram of five layers: 5th layer, Presentation (OLAP frontend, data mining tool, DSS frontend, spreadsheet, web frontend); 4th layer, Analysis (OLAP, data mining, DSS methods); 3rd layer, Data Warehouse (data warehouse, metadata, operational data store); 2nd layer, ETL (extractors, staging area, cleansed raw data); 1st layer, Data Sources (enterprise resource management, legacy systems, operational DBs, unstructured data, external sources). Further components shown: data marts, archiving system, monitoring, administration.]

"OLAP technology draws its analytical power from the underlying multidimensional data model [131]. The data is modeled as cubes of uniformly structured facts, consisting of analytical values, referred to as measures, uniquely determined by descriptive values drawn from a set of dimensions. Each dimension forms an axis of a cube, with dimension members as coordinates of the cube cells storing the respective measure values." [113] Figure 2.8(a) shows a strongly simplified example of a three-dimensional data cube, storing the number of IP addresses and the number of flows as measures determined by dimensions

Sender, Time, and DstPort. The pivot table in Figure 2.8(b) illustrates the actual values contained in the cells of the data cube along with some chosen aggregates. Note that in many cases these data cubes store already pre-aggregated values, such as the number of flows per prefix, month, and destination port number in our example. Furthermore, depending on the complexity of the data, the cube's dimensionality can be significantly higher, which results in far more possible analysis questions.

Figure 2.8: Example netflow cube with three dimensions and sample data. (a) Netflow OLAP cube with the dimensions Sender, Time (Jan to Apr), and DstPort (Mail (25), SSH (22), Web (80)) and the measures NumIP (SUM) and NumFlows (SUM); (b) pivot table.

In relational OLAP, each cube is stored as a fact table with tuples containing one or more measure values and the values of their dimensional characteristics. Most data warehouses model dimension hierarchies using the denormalized star schema. The database then consists of a fact table and a single table for each dimension. One entry in the fact table then stores the multidimensional coordinates and the associated numerical values for the measures.
Because star schemas are limited in their capabilities to provide semantic support for attribute hierarchies, normalized snowflake schemas were introduced as a refinement. According to this modeling, a separate table is created for every level of the respective dimension. Each one of these tables stores the dimensional values of one hierarchy level for a particular dimension as well as a reference to the parent level. Index structures are then used to speed up join operations of the fact tables with the dimension tables of various granularities and to efficiently calculate aggregates. For further reading, refer to [21, 22, 131].

Cube definitions

For our data warehouse, we used the relational database technology of the open source database PostgreSQL [136] because it offers a robust, reliable, and efficient approach for storing and managing large volumes of data. While collaborating with network administrators of our university and staff of the network security team at AT&T, we were able to design four types of facts and populate them with data: the netflows cube for storing network traffic extracted from the university gateway, the webserver cube consisting of log entries of the webserver of our working group, the botnets cube containing data from a signature-based botnet detection mechanism, and the snort cube, which is defined through alerts generated by the IDS Snort

that was set up for test purposes. Table 2.2 describes these four cubes, their dimensions, and measures.

Table 2.2: Cubes of the network security data warehouse with their associated measures and dimensions. Inside the table we denoted the aggregation functions for the applicable measures and the number of dimensions of a particular type for the applicable dimensions. [Table: rows netflows, webserver, botnets, snort; measure columns #bytes, #flows, #requests, #IPs, #alerts; dimension columns IP address, Time, Port, Browser, OS, Alert.]

Transactional facts are periodically retrieved from network devices, the webserver log or IDS and stored in the respective fact table. For network traffic, for instance, an entry describes a single network packet consisting of source and target IP addresses and ports, the timestamp, and the size of the payload. When measuring network traffic using several sensors, special precaution has to be taken in order not to count traffic redundantly when it passes multiple hosts on its way to its destination. To reduce the volume of facts to be stored, all packets and connections referring to the same source, destination and time interval are aggregated into a single fact, with the number of sessions and their total size in bytes as its two measures. A single day of network traffic measured on the main gateway of our mid-size university, for example, comprises about 10 million connections, which are already aggregated on hourly intervals to approximately 2 million facts.

An overview of the logical database design of the netflow cube is depicted in Figure 2.9. Using the snowflake schema, dimensional tables (e.g., IPAddress, Port) were normalized into subtables for each granularity level. The time dimension was left denormalized like in the star schema since data warehouse systems provide their own routines for handling temporal characteristics.
IP address dimension. The IP address dimension is common to all of our cubes; some cubes even have two dimensions of this type, for example, the source and the destination IP addresses in the netflow cube. A balanced hierarchy is defined upon the IP address dimension using the following consolidation path: IP address → IP prefix → autonomous system → country → continent, and we thus

42 30 Chapter 2. Networks, intrusion detection, and data management Port Continent Country Auton_System PortID Category Domain Description ContinentID Longitude Latitude Name CountryID Longitude Latitude Name ContinentID ASN Name CountryID PortDomain Domain Description PortCategory Category Description Time Timestamp Millisecond Second Minute Hour Day Month Year netflows Timestamp SourceIP DestinationIP SourcePort DestinationPort NumConn NumBytes Prefix StartIP EndIP ASN Network IPAddress IPAddress Prefix Dimensions Measures Figure 2.9: Modeling network traffic as an OLAP cube. Each entry of the fact table netflows is linked to its dimensional values and stores the measures NumConn and NumBytes obtain the following hierarchy with the number of entries at each level: 7 continents 190 countries autonomous systems prefixes The hosting country of an autonomous system is determined by looking up the geographical positions of the IP addresses of all the networks it contains and choosing the prevailing country. For this, we rely on the GeoIP database of Maxmind, Ltd. [116] (approx. 99% accuracy). A global map from IP prefixes to ultimate AS names and numbers can be somewhat complicated to obtain. Though a local map can be extracted from any border gateway router, due to route aggregation, it is unlikely to list many terminal or leaf-level AS s. Therefore, maps should be obtained from multiple vantage points in the Internet and aggregated, raising the problem of consistency and completeness, especially in the presence of intentional efforts to spoof AS identifiers. This problem has been studied and there are useful heuristic methods based on dynamic programming [114] and public prefix-to-asn tables available. Unfortunately, these tables date back to and we therefore reverted to the data we extracted from a single routing table in September 2006 [7]. Time dimension Time is a very common and important dimension in almost all data analysis scenarios. 
In the data cubes presented here, timestamps are aggregated by millisecond → second → minute → hour → day → month → year. Database systems support this unique dimension in the form of functions for extracting and manipulating temporal properties of interest. However, because these functions differ from product to product, we modeled time as an OLAP dimension to provide the full OLAP analysis capabilities.

1 A possible side effect could be that previously large ASes contain fewer prefixes, which makes the AS level easier to render in the visualizations.

Port dimension

The netflows and snort cubes both share the port dimension. TCP and UDP application ports may be grouped into categories (i.e., well-known, registered, and dynamic ports as mentioned in Section 2.1.4) on the one hand, and into domains (e.g., web, database, etc.) on the other hand. These two consolidation paths allow the calculation of two different upper-level aggregates. Note furthermore that the netflows cube contains two port dimensions, namely SourcePort and DestinationPort.

Other dimensions

Naturally, each application domain has its own dimensions characterizing the facts. In the scenarios discussed in this thesis, the hierarchical dimension alert (consolidation path: alert → alert category) as well as the non-hierarchical dimensions browser and operating system are contained in the snort and netflows cubes. Note that the dimensions presented here can by no means be considered complete or perfectly modeled, since complicated analysis tasks might pose novel requirements (e.g., disk space efficiency, performance, finer granularities, etc.) to the model.

Measures

As mentioned before, measures are defined through a numerical attribute as well as an aggregation function or a set of aggregation functions applicable to it. For all of the measures presented here, the sum aggregation function was used. Evidently, other aggregation functions, such as average and bounds, or derived aggregation functions (e.g., alerts per IP address) can be easily defined.
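The interplay of the port dimension and a sum measure can be illustrated with a minimal Python sketch. It assumes the standard IANA port ranges for the three categories (0 to 1023 well-known, 1024 to 49151 registered, 49152 to 65535 dynamic); the record layout and the names `port_category` and `roll_up` are hypothetical and not taken from the data warehouse described here.

```python
from collections import defaultdict

# Hypothetical flow records: (destination port, number of bytes).
flows = [(80, 1200), (443, 800), (3306, 300), (51000, 50), (22, 150)]


def port_category(port: int) -> str:
    """Map a TCP/UDP port number to its category (assumed IANA ranges)."""
    if not 0 <= port <= 65535:
        raise ValueError(f"invalid port number: {port}")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic"


def roll_up(records, level):
    """Sum the byte measure after mapping each port to a coarser level."""
    totals = defaultdict(int)
    for port, num_bytes in records:
        totals[level(port)] += num_bytes
    return dict(totals)


bytes_by_category = roll_up(flows, port_category)
print(bytes_by_category)
```

Passing a different `level` function (for example, one that maps ports to domains) would follow the second consolidation path without changing the aggregation code.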
As shown in Figure ??, the measures number of bytes, connections, requests, flows, IPs, and alerts were used in the four presented cubes of the network security data warehouse.

OLAP Operations and Queries

The fact table contains a huge volume of raw or slightly aggregated data. OLAP operations are used for computing measure aggregates for a chosen combination of dimensions and their granularity levels. The most common operations are:

Roll-up (aggregating) and drill-down (disaggregating) change the granularity level.

Slice-and-dice defines a subcube of interest or even reduces the dimensionality by flattening single dimensions to one selected value.

Ranking can be applied as a dynamic filter showing the specified number of marginal (top or bottom) measure aggregates.

Since the visualization approaches in this dissertation restrict themselves to a rather limited set of OLAP operations (drill-across, rotating, etc. were not used), some of the classical OLAP restrictions on hierarchy properties, such as balancedness, covering, and strictness, could be relaxed. For instance, one port may belong to multiple domains or not belong to any.

Summary tables

Unfortunately, storing the entire data in a network traffic monitoring or network security scenario might become infeasible, and managing the entire data as it is in the database can turn out to be simply impossible. Besides, the older the data entries get, the less interesting they are for the analysis, since anomaly detection is basically concerned with recent and especially current traffic flows and events.

In order to efficiently query upper-level aggregates, many data warehouses store summary tables containing fact entries pre-aggregated to a specific subset of dimensions and/or a specified granularity level within a dimension. Summary data can be stored either in a separate fact table or in the same fact table, extended to contain the references to the respective upper-level dimension tables. In the latter case, null values are inserted into the columns of the dimension levels inapplicable to the summary facts. While reducing the number of tables, this method has been shown to be error-prone in practice because correctly querying the data becomes more complicated [21].

When the huge data volume in a fact table results in performance degradation, the table can be partitioned into multiple subtables corresponding to different levels of detail.
For instance, the netflows cube data can be managed using the following three fact tables:

ShortTermFlows stores the most recent transactions (e.g., for the last hour) as they are, i.e., without any aggregation. This table is used for the run-time network load visualization.

MiddleTermFlows stores the network data of the current day with the granularity coarsened from millisecond to minute. This table still offers rather detailed information for offline exploration of the current day's network behavior.

LongTermFlows aggregates the transactions even further (e.g., by hour or day) to store them as historical data for less detailed analysis.

Other granularity levels along the time dimension (e.g., month → quarter → year) can be defined depending on the application needs. Furthermore, aggregation along the IP address dimension might also result in a considerable speed-up. Often, these summary tables are realized through so-called materialized views, which get automatically updated once new fact entries are appended to the underlying fact tables.
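The coarsening step from ShortTermFlows to MiddleTermFlows can be sketched in a few lines of Python. This is only an illustration under assumptions: the tuple layout, the sample values, and the function name `coarsen_to_minute` are hypothetical, and a production system would more likely rely on a materialized view inside the database.

```python
from collections import defaultdict

MS_PER_MINUTE = 60_000

# Hypothetical fact entries:
# (timestamp in ms, source IP, destination IP, NumConn, NumBytes)
short_term_flows = [
    (0, "10.0.0.1", "10.0.0.2", 1, 500),
    (59_999, "10.0.0.1", "10.0.0.2", 2, 700),
    (60_000, "10.0.0.1", "10.0.0.2", 1, 100),
    (61_500, "10.0.0.3", "10.0.0.2", 1, 40),
]


def coarsen_to_minute(flows):
    """Roll the time dimension up from millisecond to minute, summing both measures."""
    summary = defaultdict(lambda: [0, 0])
    for ts_ms, src, dst, num_conn, num_bytes in flows:
        minute = ts_ms // MS_PER_MINUTE  # index of the minute the entry falls into
        cell = summary[(minute, src, dst)]
        cell[0] += num_conn
        cell[1] += num_bytes
    return {key: tuple(value) for key, value in summary.items()}


middle_term_flows = coarsen_to_minute(short_term_flows)
for key, measures in sorted(middle_term_flows.items()):
    print(key, measures)
```

Because sum is the only aggregation function involved, the same routine could be applied again to the minute-level result to obtain hourly or daily entries for a table such as LongTermFlows.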

A recent study in visual analytics proposes an alternative data management strategy, coined Smart Aggregation [155]. By combining automatic data aggregation with user-defined controls on what, how, and when data should be aggregated, this approach ensures that a system stays usable in terms of system resources and human perceptual resources. By automatically determining whether aggregation is required, the system proposes candidate fields to the user for effective aggregation based on the cardinality of the data. After detailed values have been replaced by their aggregates, specialized index structures are used to speed up the query process.

Visual navigation in OLAP cubes

There is an abundance of tools and interfaces for exploring multidimensional data. We limit ourselves to naming a few products that offer distinguished features relevant to our work. One system, Polaris [150], extends the Pivot Table interface by allowing users to combine a variety of displays and tools for the visual specification of analysis tasks. Polaris is a predecessor of a commercial business intelligence product called Tableau Software [152]. ProClarity was the first to enhance business intelligence with Decomposition Trees, a technique for iterative visual disaggregation of data cubes. XMLA enriches the idea of hierarchical disaggregation by arranging the decomposed subtotals of each parent value into a nested chart (Bar- and Pie-Chart Trees) in its Report Portal OLAP client [187]. Visual Insights has developed a family of tools, called ADVIZOR, with an intuitive framework for parallel exploration of multiple measures [43].
Probably the most popular paradigm underlying the OLAP navigation structure is that of a file browser, with each cube as a folder containing the list of top-level dimensions and the list of available measures, as found in Cognos PowerPlay [31], BusinessObjects [17], CNS DataWarehouse Explorer [29], and many other commercial OLAP tools. Each hierarchical dimension is itself a folder containing its child entities. Hierarchical entities can be recursively expanded to show the subtrees of their descendants. The entities of the finest granularity (i.e., the leaf nodes) are represented as files and are non-expandable.

Standard OLAP interfaces allow users to navigate directly in the dimensional data rather than in a dimensional hierarchy scheme. Our approach [176], however, pursues a clear distinction between the dimension's structure and its instances. Therefore, expansion of a dimension folder reveals solely the nested folders of its subdimensions, contrary to the standard OLAP navigation, which displays the child-level data. The instances of any subdimension can be retrieved on demand. Figures 2.10(a) and 2.10(b) demonstrate the differences between the standard show-data interface and our proposed show-structure interface, respectively, using the hierarchical dimension IP address as an example. Notice that expanding the top-level dimension IP address in Figure 2.10(b) reveals its entire descendant hierarchy, thus enabling the user to jump directly to the desired granularity level. The data view is available on explicit demand by clicking the preview button of the respective category. Figure 2.10(c) shows the activated preview of continents with the option to drill down into any continent's descendant subtree.

The advantages of our proposed navigation structure for building hierarchies can be summarized as follows:

1. Clear distinction between the dimension's structure and its contents.

2. Immediate overview of all granularity levels in a hierarchical dimension.
3. The ability to drill through directly to any descendant subdimension.
4. On-demand preview of the data as well as any data node's descendant entities.
5. Compactness on the display due to moderate expansion at most steps.
6. The entire navigation is built from a single meta table.
7. The actual data is retrieved only if explicitly requested.
8. It is easier to find the entries of interest even somewhere deep in the hierarchy without knowing the data (e.g., any country can be accessed directly through the preview of countries without searching for and drilling through its ancestor in continents).

Figure 2.10: Navigating in the hierarchical dimension IP address. (a) "show data" approach, (b) "show structure" approach, (c) on-demand preview in the "show structure" approach.

Evidently, finding entries deep in the hierarchy can become a challenge, since the deepest level of the IP dimension might contain up to 2 billion items. In this case, character- or bytewise navigation can make the content accessible.
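One way to picture bytewise navigation is to descend through IPv4 addresses one octet at a time, so that each step offers at most 256 choices. The following Python sketch is an assumption about how such a browser could be organized, not a description of the actual interface; the function name `children` and the sample addresses are hypothetical.

```python
import ipaddress
from collections import Counter


def children(addresses, prefix_octets=()):
    """Count the addresses found under each next-octet value below an octet prefix."""
    depth = len(prefix_octets)
    if depth >= 4:
        raise ValueError("an IPv4 address has only four octets")
    counts = Counter()
    for text in addresses:
        # .packed yields the four address bytes; tuple() turns them into ints.
        octets = tuple(ipaddress.IPv4Address(text).packed)
        if octets[:depth] == tuple(prefix_octets):
            counts[octets[depth]] += 1
    return dict(sorted(counts.items()))


# Hypothetical hosts observed in the traffic.
hosts = ["192.168.1.10", "192.168.1.20", "192.168.7.5", "10.0.0.1"]

print(children(hosts))              # first octet level
print(children(hosts, (192, 168)))  # third octet level below 192.168
```

The counts returned at each level could serve as the on-demand preview, so the full list of leaf addresses never has to be loaded at once.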

3 Foundations of information visualization for network security

"Discovery consists of seeing what everybody has seen and thinking what nobody has thought."
Albert von Szent-Gyorgyi

Contents
3.1 Information visualization
    The task by data type taxonomy
    Mapping values to color
    Further reading
Visual Analytics
    Scope of Visual Analytics
    Visual Analytics challenges in network monitoring and security
Related work on visualization for network monitoring and security
    Monitoring of network traffic between hosts, prefixes, and ASes
    Analysis of firewall and IDS logs for virus, worm and attack detection
    Detection of errors and attacks in the BGP routing system
    Towards Visual Analytics for network security

This chapter briefly introduces the field of information visualization, which originally started with static information displays and currently deals with interactive tools for visual data exploration. Based on Shneiderman's task by data type taxonomy, the visualization techniques of this dissertation will be discussed. Since many of these visualizations map values to colors, we will present different normalization strategies for obtaining an adequate mapping. Afterwards, the emerging field of visual analytics and its influence on network monitoring and security will be sketched. At the end of this chapter, previous visualization studies in the fields of network monitoring and security will be reviewed.

3.1 Information visualization

Visualization can be considered a relatively young research field, since the first IEEE conference on visualization was held in 1990. Nevertheless, some fields such as cartography have a history of more than 2000 years. Likewise, static diagrams for data visualization had been in use for some time already, as documented in Edward R. Tufte's books [161, 162, 163, 164].

Some of the case studies listed in these books, for example, revert to material from the 18th century. In the late 19th century, statistical graphics then started to become common in government, industry, and science. However, not only data analysis was a major concern in the early 20th century, but also human factors. Psychological aspects of human perception, for example, were studied when the Gestalt School of Psychology was founded in Berlin in 1912. The pioneering work of Max Wertheimer, Kurt Koffka, and Wolfgang Köhler resulted in a set of Gestalt laws of pattern perception, which easily translate into a set of design principles (i.e., proximity, similarity, symmetry, closure, continuity) for information displays as described in [178].

The field was revived some decades later in the 1960s when Jacques Bertin, a French cartographer, published his famous book Sémiologie graphique: les diagrammes, les réseaux, les cartes (first in French and 16 years later in English), in which he carefully explains how effective and expressive diagrams are constructed [12, 14]. He identifies eight visual variables, namely x and y position, size, brightness, texture, color, orientation, and form, in order to systematically illustrate how they can be used to convey information in matrices, diagrams, and maps. At approximately the same time, John Tukey, a statistician at Bell Labs, widened the scope of statistics from confirmatory data analysis to exploratory data analysis [165], thereby foreseeing the enormous potential of the then unavailable computer graphics and algorithms to advance the young field.

In the mid-80s and 90s, raster displays and computer graphics became available, facilitating novel research in a variety of areas.
The graph drawing and computational geometry communities, for example, emerged to focus on some of the deeper theoretical problems in creating geometric representations of abstract information. In the same time frame, researchers at Xerox PARC (among them Card, Mackinlay, and Robertson) recognized the great importance of user interfaces for assessing vast stores of information, although details of their vision, such as 3D graphical metaphors, have not been adopted so far. These developments jointly led to the formation of the term information visualization, which nowadays offers enough space for creativity to host its own research community. The term is defined as follows:

"Information visualization: The use of computer-supported, interactive, visual representations of data to amplify cognition." [18]

The task by data type taxonomy

In the last 20 years, many visualization systems have been developed all around the world, and several taxonomies for information visualization have been proposed to systematically classify and discuss the techniques. Based on the task by data type taxonomy for information visualization (TTT) of Ben Shneiderman [145], the concepts proposed in this thesis will be systematically discussed and put into context. Seven tasks and seven data types were identified at a high level of abstraction and are detailed in Table 3.1. Naturally, those analysis tasks and data types can be further refined or extended for more detailed analysis when necessary.

Table 3.1: Tasks and data types of Shneiderman's taxonomy (adapted from [145])

Tasks
Overview: Gain an overview of the entire collection.
Zoom: Zoom in on items of interest.
Filter: Filter out uninteresting items.
Details-on-demand: Select an item or group and get details when needed.
Relate: View relationships among items.
History: Keep a history of actions to support undo, replay, and progressive refinement.
Extract: Allow extraction of sub-collections and of the query parameters.

Data types
1-dimensional: Linear data types are texts, program source code, or lists which are organized in a sequential manner.
2-dimensional: Mostly geographical data, such as maps, floorplans, or abstract 2D text layouts.
3-dimensional: Real-world objects like buildings, the human body, or molecules.
Temporal: Time-related items with a start and finish time, possibly overlapping.
Multi-dimensional: Data sets with many dimensions.
Tree: Hierarchies or tree structures are defined by each item having a link to one parent item (except for the root).
Network: Items can be linked to an arbitrary number of other items.

Gaining an overview over a collection of items is a very common task for today's knowledge workers. There exist several strategies to achieve this goal by means of information visualization techniques. One very common approach is to have a zoomed-out view with a field-of-view box that supports the user in matching it with the adjoined detail view. A completely different approach is the fisheye view, which magnifies one or more areas of the display and thus presents overview and details in the same view [56].

Certainly, zooming can be seen as another basic task, provided that the user is interested in some elements of a collection. The smoother the zoom, the easier it is for users to preserve their sense of orientation in the information space at hand. Zooming can be implemented through mouse interaction and is very natural on one- or two-dimensional data, or on two-dimensional presentations of other kinds of data, such as the embedding of a graph onto a map.

In many situations, we are faced with information overload.
There is simply too much information on the screen to grasp it all at once. In this case, filtering helps to remove unwanted items and to focus on the essential ones. As soon as the information display has been reduced to a few dozen items, users request details on demand to compare and understand the issues at hand. Usually, this kind of interaction is done by simply clicking on an item or hovering over it, which triggers a pop-up window with the values of the item's attributes.

Relating items to each other is a challenging task, since viewing all relationships at once might be too confusing, and viewing only the ones of a particular item with others requires some a priori knowledge about which item to pick. Focusing the analysis on a particular attribute value of an item can also trigger filtering operations; e.g., selecting the director's name of a particular movie in the FilmFinder [2] results in showing all movies by that director.

Since a single user interaction rarely produces the desired outcome, complex explorative tasks consist of several refining and generalizing steps. It is therefore essential that a history of the previous interactions is kept and made available to the user in order to jump back or reapply a previous interaction. A model for recording the history of user explorations in visualization environments, augmented with the capability for users to annotate their explorations, can be found in [61].

After the needle in the haystack has been found, it would be inappropriate to just throw it away and restart the search from scratch the next time. Users might rather want to extract the found items, store them, or use the drag-and-drop capabilities of the operating system to drag them into the next application window. Although this feature is often desired, many applications still come without it or only support a very limited set of extraction capabilities.

Table 3.2: Visualization techniques of this thesis by tasks (columns: Overview, Zoom, Filter, Details, Relate, History, Extract; rows ordered by increasing dimensionality; the column positions of the marks are not preserved in this transcription)

Enhanced Recursive Pattern: x x x x x (Chapter 4)
Hierarchical Network Map: x x x x x (Chapter 5)
Edge Bundles: x x (Chapter 6)
Radial Traffic Analyzer: x x x x x (Chapter 7)
Behavior Graph: x x x x x (Chapter 8)
Self-Organizing Map: x x (Chapter 9)

Application

As shown in Table 3.2, the visualization techniques presented in this dissertation are introduced with increasing dimensionality of the input data.
The first technique is the Enhanced Recursive Pattern, which is used to present one-dimensional temporal data in a space-efficient way in order to gain an overview, filter out unimportant time spans, retrieve more details, and extract data according to the time dimension. Due to the multi-resolution properties of the Enhanced Recursive Pattern, it can also be used in a zooming context. Note that rather than representing overlapping time spans, we reduce the analyzed data to one dimension by aggregating all events of a particular time interval. However, when comparing several time series, we move from a one-dimensional to a multi-dimensional analysis.

In Chapter 5, we present the Hierarchical Network Map (HNMap), a technique to support the tasks overview, zoom, filter, details, and extraction. The space-filling visualization maps the tree data structure to containment relationships within rectangles on the screen. By visualizing the whole IP address space with traffic volumes in each part of the network, the analyst gets an overview of how much traffic is transferred between his network and other networks in the Internet. The visualization then supports zooming on the rectangles which represent continents, countries, or autonomous systems, enables filtering, and offers several options for representing details. At any given time of the analysis, the user can save the current map to a graphics file for documenting, presenting, and disseminating security incidents.

The Edge Bundles technique builds upon the rectangles of HNMap by representing the communication network of traffic measurements through curved lines on top to encode its structural properties. Thus, more details of the traffic flows than only the traffic volume per entity can be shown. Though causing occlusion due to overdrawn rectangles, these lines make it possible to relate network entities with one another.

The next visualization technique is used in the Radial Traffic Analyzer and is very strong when it comes to the exploration history. While in each analysis step another attribute of the multi-dimensional data set is added as a further ring revealing more details of the network traffic, results of the previous step remain visible on the inner rings. Furthermore, basic tasks like filtering, details-on-demand, relating, and extracting are also supported.

The behavior graph, which can be seen as a projection of multi-dimensional data onto a two-dimensional plane, is meant to give an overview of the behavior of several hosts in the network over a previously defined time span. By filtering the presented data with the help of sliders and check boxes, the user can focus the analysis on the interesting aspects of the data sets. Additionally, detailed bar charts of the underlying data can be displayed on demand. The gained insights about misbehaving hosts can then be used to reconfigure hosts within the administrated network or to initiate countermeasures against attacks from external hosts.
Finally, the Self-Organizing Map in Chapter 9 can be used to cope with the high-dimensional feature vectors of messages. Details of each SOM node can be retrieved through mouse interaction. Note that all the visualization techniques presented above could be further expanded to support almost all basic tasks, but due to limited time resources, we restricted our research prototypes to the most important ones.

Other taxonomies

Shneiderman's task by data type taxonomy is not the only taxonomy for visualization. The one which comes closest to his taxonomy is the classification of visual data analysis techniques proposed by Keim and Ward [86]. It classifies information visualization techniques based on three dimensions: a) the data types to be visualized, b) the interaction and distortion techniques, and c) the visualization techniques (standard 2D/3D display, geometrically transformed display, iconic display, dense pixel display, and stacked display).

Keller and Keller presented their taxonomy of visualization goals as early as 1993 [87]. This fundamental work arranged visualization techniques according to nine actions (identify, locate, distinguish, categorize, cluster, rank, compare, associate, and correlate) and seven data types (scalar, nominal, direction, shape, position, spatially extended region or object, and structure).

Chuah and Roth later classified the semantics of interactive visualizations [27], focusing on the semantics of basic visualization interaction by characterizing inputs, outputs, operations, and compositions of these primitives. Card and Mackinlay further extended the early work of Bertin [13, 14] and assessed each elementary visual presentation according to a set of marks (e.g., points, lines, areas, surfaces, or volumes), their retinal properties (i.e., color and size), and their position in space and time (x, y, z, t) [19].

Mapping values to color

Since the visualizations to be presented in Chapters 4, 5, 6, and 9 use the visual variable color in order to convey some kind of numerical measurement to the analyst, we briefly explain how this mapping from values to colors and vice versa is done within the scope of this dissertation. Although color is not the most effective visual variable to convey quantitative measurements [109], we often excluded other visual variables due to overplotting and perceptual issues, which occur because of the high number of data elements displayed simultaneously.

In astronomy, medical imaging, geography, and many other scientific applications, the term pseudocoloring is commonly used for representing continuously varying map values using a sequence of colors. While physicists often use a color sequence that approximates the physical spectrum, some perceptual problems occur, since there is no inherent perceptual ordering of colors. For example, in an experiment where test persons were prompted to order a set of paint chips with the colors red, green, yellow, and blue, the outcomes varied due to the missing order of colors. However, if the same experiment is repeated with a series of gray paint chips, the subjects choose either a dark-to-light ordering or vice versa [178]. Therefore, the color scales used in our work all employ monotonically increasing or decreasing brightness of the color values in order to convey quantitative measurements.
A more detailed description of how these Hue Saturation Intensity (HSI) color scales are created can be found in [79]. For a more holistic view on color usage, the number of data classes, the nature of the data (i.e., sequential, diverging, or qualitative), and the end-user environment (e.g., CRT, LCD, printed, projected, photocopied) have to be considered [64]. However, in the scope of this dissertation, we focus on color schemes for interactive data visualization on computer screens.

Besides the choice of an appropriate color scale discussed above, normalization is one of the most important aspects of color mapping. Numerical measurements in this dissertation usually refer to network traffic, such as a total number of bytes, packets, flows, or alerts within a specified time frame, and can be formalized as a set of statistical values $X = (x_i)_{i=1,\dots,n}$ with $x_i \geq 0$, $x_i \in \mathbb{R}$, and $x_{\max} > 0$. These values are represented by the fill color of some visual entity such as a rectangle. We provide several fixed color scales, and the analyst is free to choose from the following three normalization schemes:
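As a minimal, hedged sketch, the three schemes (linear, square root, and logarithmic normalization of a value against the maximum value) could be written in Python as follows. The helper `color_index`, which turns a normalized value into a position within a color scale of a given length, is an assumption for illustration; the exact index mapping used in the prototypes is not specified in the text.

```python
import math


def _check(x: float, x_max: float) -> None:
    """Enforce the stated preconditions: 0 <= x <= x_max and x_max > 0."""
    if x_max <= 0 or not 0 <= x <= x_max:
        raise ValueError("require 0 <= x <= x_max and x_max > 0")


def color_lin(x: float, x_max: float) -> float:
    """Linear normalization to [0, 1]."""
    _check(x, x_max)
    return x / x_max


def color_sqrt(x: float, x_max: float) -> float:
    """Square-root normalization to [0, 1]."""
    _check(x, x_max)
    return math.sqrt(x) / math.sqrt(x_max)


def color_log(x: float, x_max: float) -> float:
    """Logarithmic normalization to [0, 1]; the +1 keeps x = 0 defined."""
    _check(x, x_max)
    return math.log(x + 1) / math.log(x_max + 1)


def color_index(normalized: float, scale_length: int) -> int:
    """Hypothetical mapping of a normalized value to an index in a color scale."""
    return min(int(normalized * scale_length), scale_length - 1)


# A low value is spread furthest from zero by the logarithmic scheme.
for scheme in (color_lin, color_sqrt, color_log):
    print(scheme.__name__, round(scheme(10, 1000), 3))
```

For a small value such as 10 out of 1000, the logarithmic scheme yields the largest normalized value and the linear scheme the smallest, which illustrates why low values are easier to discriminate under logarithmic normalization.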

Figure 3.1: Mapping values to color using different normalization schemes. (a) Normalization functions y = x/1000, y = sqrt(x)/sqrt(1000), and y = log(x+1)/log(1001). (b) Color scales under linear, square root (sqrt), and logarithmic (log) normalization.

\[ \mathrm{color}_{\mathrm{lin}}(x_i) = \frac{x_i}{x_{\max}} \tag{3.1} \]

\[ \mathrm{color}_{\mathrm{sqrt}}(x_i) = \frac{\sqrt{x_i}}{\sqrt{x_{\max}}} \tag{3.2} \]

\[ \mathrm{color}_{\mathrm{log}}(x_i) = \frac{\log(x_i + 1)}{\log(x_{\max} + 1)} \tag{3.3} \]

The output of the chosen function is then mapped to the index positions within the color scale. Figure 3.1 details the used normalization functions on five different color scales. Due to its improved discrimination for low values and its smoothing effect on outliers, we often used the logarithmic color scale in our studies. However, if high values need to be better distinguished, the square root normalization would be the better choice.

Further reading

Apart from the already mentioned references, the Readings in Information Visualization with its 700 references is a good start to develop an overview of the field up to 1999 [18]. Moreover, the book The grammar of graphics [181] illuminates visualization from a statistical background, whereas the German book Visualisierung - Grundlagen und allgemeine Methoden [144] is a general introduction to information visualization as well as scientific visualization. Bob Spence's book Information Visualization - Design for Interaction enriched the field through its countless case studies on building effective interactive visualization systems [147]. Likewise, perceptual aspects play an important role within visualization systems, and Colin Ware's Information Visualization - Perception for Design is an excellent starting point to deepen one's knowledge about the field [178].

3.2 Visual Analytics

Visual analytics is the science of analytical reasoning supported by interactive visual interfaces [157]. Over the last decades, data has been produced at an incredible rate. However, the ability to collect and store this data is increasing at a faster rate than the ability to analyze it. While purely automatic or purely visual analysis methods were developed in the last decades, the complex nature of many problems makes it indispensable to include humans at an early stage in the data analysis process. Visual analytics methods allow decision makers to combine their flexibility, creativity, and background knowledge with the enormous storage and processing capacities of today's computers to gain insight into complex problems. The goal of visual analytics research is thus to turn the information overload into an opportunity: decision makers should be enabled to examine this massive, multi-dimensional, multi-source, time-varying, and often conflicting information stream through interactive visual representations to make effective decisions in critical situations.

Scope of Visual Analytics

Visual analytics is an iterative process that involves information gathering, data preprocessing, knowledge representation, interaction, and decision making. The ultimate goal is to gain insight into the problem at hand, which is described by vast amounts of scientific, forensic, or business data from heterogeneous sources. To fulfill this goal, visual analytics combines the strengths of machines with those of humans.
On the one hand, methods from knowledge discovery in databases (KDD), statistics and mathematics are the driving force on the part of automatic analysis, while human capabilities to perceive, relate and conclude, on the other hand, turn visual analytics into a very promising field of research. Historically, visual analytics has evolved out of the fields of information and scientific visualization. According to Colin Ware, the term visualization is meanwhile understood as a graphical representation of data or concepts [178], while the term was formerly applied to form a mental image. Nowadays fast computers and sophisticated output devices create meaningful visualizations and allow users not only to mentally visualize data and concepts, but also to see and explore an exact representation of the data under consideration on a computer screen. However, the transformation of data into meaningful visualizations is not a trivial task and will not automatically improve through steadily growing computational resources. Very often, there are many different ways to represent the data and it is unclear which representation is the best one. State-of-the-art concepts of representation, perception, interaction, and decision-making need to be applied and extended to be suitable for visual data analysis. The fields of information and scientific visualization deal with visual representations of data. The main difference between the two fields is that the latter examines potentially huge amounts of scientific data obtained from sensors, simulations or laboratory tests, while the former is defined more generally as the communication of abstract data relevant in terms of

action through the use of interactive visual interfaces. Typical scientific visualization applications are flow visualization, volume rendering and slicing techniques for medical illustrations. In most cases, some aspects of the data can be directly mapped onto geographic coordinates or into virtual 3D environments. There are three major goals of visualization, namely a) presentation, b) confirmatory analysis, and c) exploratory analysis. For presentation purposes, the facts to be presented are fixed a priori, and the choice of the appropriate presentation technique depends largely on the user. The aim is to efficiently and effectively communicate the results of an analysis. For confirmatory analysis, one or more hypotheses about the data serve as a starting point. The process can be described as a goal-oriented examination of these hypotheses. As a result, visualization either confirms these hypotheses or rejects them. Exploratory data analysis, as the process of searching and analyzing databases to find implicit but potentially useful information, is a difficult task as the analyst has no initial hypothesis about the data. According to John Tukey, tools as well as understanding are needed for the interactive and usually undirected search for structures and trends [165]. Visual analytics is more than just visualization. It can rather be seen as an integral approach combining visualization, human factors, and data analysis. Figure 3.2 illustrates the detailed scope of visual analytics. On the visualization side, visual analytics integrates methodologies from information analytics, geospatial analytics, and scientific analytics.

Figure 3.2: The Scope of Visual Analytics (components shown: Information Analytics; Geospatial Analytics; Scientific Analytics; Interaction; Cognitive and Perceptual Science; Presentation, production, and dissemination; Statistical Analytics; Knowledge Discovery; Data Management & Knowledge Representation)
Especially human factors (e.g., interaction, cognition, perception, collaboration, presentation, and dissemination) play a key role in the communication between human and computer, as well as in the decision-making process. In this context, production is defined as the creation of materials that summarize the results of an analytical effort, presentation as the packaging of those materials in a way that helps the audience understand the analytical results in context using terms that are meaningful to them, and dissemination as the process of sharing that information with the intended audience [158]. In matters of data analysis, visual analytics furthermore profits from methodologies developed in the fields of data management & knowledge representation, knowledge discovery, and statistical analytics. Note that visual analytics is not likely to become a separate field of study [184], but its influence will spread over the research areas it comprises. According to Jarke J. van Wijk, visualization is not good by definition, developers of new methods have to make clear why the information sought cannot be extracted automatically [169]. From this statement, we immediately see the need for the visual analytics approach using automatic methods from statistics, mathematics and knowledge discovery in databases (KDD) wherever they are applicable. Visualization is used as a means to efficiently communicate and explore the information space when automatic methods fail. In this context, human background knowledge, intuition and decision-making either cannot be automated or serve as input for the future development of automated processes. The fields of visualization and visual analytics are both built upon methods from scientific analytics, geospatial analytics, and information analytics. They both profit from knowledge out of the field of interaction as well as cognitive and perceptual science.
In contrast to visualization, visual analytics explicitly integrates methodology from the fields of statistical analytics, knowledge discovery, data management & knowledge representation, and presentation, production & dissemination.

Visual Analytics challenges in network monitoring and security

The fields of network monitoring and security largely rely on automatic analysis methods in detecting failures and intrusions. However, with the steadily increasing amount and diversification of threats, these methods often produce enormous amounts of alerts or fail to detect novel attacks. Purely visual approaches to network monitoring suffer from the same shortcomings. However, visual analytics as a combination of automatic analysis techniques with the background knowledge and intuition of human experts through interactive visual displays appears to be a promising research area for solving some of the information overload problems in these fields. Through an appropriate visual communication of the analysis results, security experts can make a better and faster assessment of the current threat situation, which enables them to initiate countermeasures in time. The fact that systems of this kind have not yet been widely used in practice by large network service providers can be explained by the following as yet unsolved challenges:

Large networks produce data streams at enormous rates, which complicates their real-time analysis. However, in many cases, only immediate reaction would save the network resources from a major breakdown. Fast-paced analysis of large amounts of traffic logs and alerts from heterogeneous sources therefore needs to be improved through visual analytics applications in the near future.

Scalability of both automatic analysis methods and visualizations is another major issue. While detailed traffic analysis is computationally infeasible on large traffic links, many data visualization methods are also incapable of visualizing large amounts of data. Since the health of the network largely depends on the capability to analyze its behavior, both scalability issues need to be approached.

In some analysis scenarios, it is not only the amount of data that places a burden on the analyst, but also interpretability issues due to the complexity of the available information. This motivates innovative research on both automatic methods to abstract data and novel visual representations to gain an overview of complex analysis scenarios. Research on semantics is expected to considerably improve analysis tasks by transforming raw data into information useful for the analyst.

While many of the newly proposed visualization systems facilitate analysis tasks, their usage by the intended audience is not guaranteed. Often, user acceptance becomes a challenge since established workflows are replaced and the new tools often do not offer the same features as the old systems from the first release on.

Although these challenges have to be mastered first, the multitude of proposed systems for visual analysis of network traffic and events presented in the next section already indicates a transformation from automatic analysis systems towards visual analytics systems, which integrate the human expert into the analysis process.
3.3 Related work on visualization for network monitoring and security

Visual support for network security has recently gained momentum, as documented by the CCS Workshop on Visualization and Data Mining for Computer Security in 2004 (VizSEC / DMSEC 2004) and by the Workshops on Visualization for Computer Security in 2005, 2006, First results were presented there, but it remains an intriguing endeavor to design visual analysis tools for network monitoring and intrusion detection. While reviewing previous work in the field of visualization for network security, we realized that almost all proposed visualization systems aim at tackling one or more of these three major problems:

1. Monitoring network traffic between hosts, prefixes, and ASes;
2. Analysis of firewall and IDS logs to detect viruses, worms, and attacks;
3. Detection of errors and attacks in the BGP routing system.

Ultimately, all previously proposed methods support the administrators in their task to gain insight into the causes of unusual traffic, malfunctions or threat situations. Besides automatic

analysis means, network operators often relied on simple statistical graphics like scatter plots, pair plots, parallel coordinates, and color histograms to analyze their data [115]. However, to generate meaningful graphics, the netflow data and the countless alerts generated by IDSs and firewalls need to be intelligently pre-processed, filtered and transformed since their sheer amount raises scalability issues in both manual and visual analysis. Although traditional statistical graphics suffer from overplotting problems, they often form the basic metaphor of newly proposed visualization systems since analysts are familiar with the interpretation of the former. Scatter plots, for example, are prone to overplotting when many data points are assigned the same position within the plot. Additional interaction features can then be used to enhance the user's capabilities to discover novel attacks and to quickly analyze threat situations under enormous time pressure. Recently, the book Security data visualization: graphical techniques for network analysis by Greg Conti [33] appeared. The book aims at teaching the reader to design a visualization system for network security, reviews state-of-the-art visualization techniques for network security, and demonstrates in practice how large amounts of traffic, firewall logs, and IDS alerts can be visually analyzed. Furthermore, an outlook on subareas of network security and beyond that could potentially benefit from visual analysis is given.

Monitoring of network traffic between hosts, prefixes, and ASes

Some of the earlier work in this field is the study by Erbacher et al. on intrusion and misuse detection [48]. Their proposed glyph visualization animates characteristics of the connections of a single host with other hosts in the network. In the center, the current workload of the monitored host is shown through the thickness of the center circle.
Spokes extending from its perimeter represent the number of users in multiples of 10. Other hosts, which initiated connections to the monitored system, are then placed on several concentric rings according to their distance to the monitored system in the IP address space. These hosts are connected with the monitored system through different types of lines, thereby encoding the application and the connection status. At a more detailed level, port numbers give an indication of the running network applications that cause the traffic. For example, Lau presented the Spinning Cube of Potential Doom [99], a visualization based on a rotating cube used as a 3D scatterplot. Variables such as local IP address space, port number and global IP address space are assigned to its axes. Each measured traffic packet is mapped to a point in the 3D scatterplot. The cube is capable of intuitively showing network scans due to emerging patterns such as horizontal and vertical lines or areas. However, 3D scatterplots may be difficult to interpret on a 2D screen due to overlay. InetVis reimplements Stephen Lau's spinning cube and is capable of maintaining interactive frame rates with a high number of displayed points in order to detect scan activity [168]. By comparing the visualized scans with the alert output of Snort and Bro sensors, the authors assess the sensors' effectiveness. Another port analysis tool is PortVis described by McPherson et al. in [119]. It implements scatterplots (e.g., port/time or source/port) with zooming capabilities, port activity charts and various means of interaction to visualize and detect port scans as well as suspicious behavior on certain ports.
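The axis assignment of such a cube-based 3D scatterplot can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of any of the cited tools: the function name cube_point, the local prefix 10.0.0.0/16, the example addresses, and the normalization of all axes to the unit cube are hypothetical choices.

```python
import ipaddress


def cube_point(src_ip, dst_ip, dst_port, local_net="10.0.0.0/16"):
    """Map one packet to a point (x, y, z) in the unit cube.

    x: position of the local (destination) address within the local prefix
    y: destination port, scaled by the 65536 possible port numbers
    z: position of the remote (source) address within the IPv4 address space
    The prefix and the axis order are assumptions made for this sketch.
    """
    net = ipaddress.ip_network(local_net)
    dst = ipaddress.ip_address(dst_ip)
    if dst not in net:
        raise ValueError("destination is not inside the local prefix")
    x = (int(dst) - int(net.network_address)) / net.num_addresses
    y = dst_port / 65536
    z = int(ipaddress.ip_address(src_ip)) / 2**32
    return (x, y, z)


# A port scan against a single local host: x and z stay constant while y
# sweeps along the port axis, so all points fall on one straight line.
scan = [cube_point("198.51.100.7", "10.0.3.20", port) for port in range(20, 1024)]
```

Under these assumptions, a scan of many local hosts on one port would instead vary x while keeping y fixed, which corresponds to the line patterns mentioned above.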

Figure 3.3: Computer network traffic visualization tool TNV

The work on PortVis was continued in [122] and resulted in a tool for automatic classification of network scans according to their characteristics, ultimately leading to a better distinction between friendly scans (e.g., search engine webcrawlers) and hostile scans. Wavelet scalograms are used to abstract the scan information at several levels to make scans comparable. Subsequently, these wavelets are clustered and visualized as graphs to provide an intuition about the clustering result. For an even more detailed analysis of application processes, Fink et al. proposed a system called Portall that allows end-to-end visualization of the communication between distributed processes across the network [50]. While previous tools either showed host level activities or network activities, this system enables the administrator to correlate network traffic with the running processes on the monitored machines. The system mainly uses hierarchical diagrams, which are linked with each other through straight connecting lines. A similar linking is also utilized in other applications, for instance, in TNV [59]. As shown in Figure 3.3, the main matrix links local hosts to external hosts through straight and curved lines. On the bottom, there is a time histogram detailing the amount of traffic per time interval. Through interactive selection, the interval of interest can be chosen and a bifocal lens enlarges the focused area while the reduced context is still visible. Color is used to show the activity level for each local host. Details of the used protocols and traffic direction are presented through colored arrowheads. Furthermore, details of the packets can be retrieved for each matrix cell via a popup menu. On the right-hand side, port activity can be visualized through a parallel coordinates view linking source and destination ports.
While this open source tool is excellent for monitoring a small local network, its limitation to displaying approximately 100 hosts at a time might cause scalability issues when monitoring medium or large size networks. Linking detailed data to other visualizations through connecting lines is not the only possibility to enhance scatterplots. The IDGraphs system, for example, maps point density within

a scatterplot to brightness [138]. Plotting time on the x-axis versus the number of received SYN and SYN/ACK packets on the y-axis possibly reveals SYN flooding, IP, port, or hybrid scans when using the respectively appropriate aggregation strategy (e.g., the number of SYN and SYN/ACK packets aggregated by destination IP and port can show SYN flooding attacks). The system also comprises a correlation matrix, which is interactively linked with the scatterplot by means of brushing. Rumint is another tool for visual analysis of network packets and consists of a text display, a parallel coordinates plot, a glyph-based animation display, a thumbnail toolbar, a byte frequency display, and the binary rainfall visualization [34]. This binary rainfall visualization depicts one network packet per line detailing its foremost bits through green and black pixels. Alternatively, a 256-level gray-scale configuration for displaying the information bytewise or a 24-bit RGB configuration for groups of three bytes can be chosen. Whereas textual approaches are limited to displaying the content of about 40 packets per screen, the binary rainfall visualization enables comparison of up to 1000 packets per screen. The tool can be beneficial for comparing packet length and identifying equal values between packets, thus supporting signature development for network-based malicious software. The byte frequency display is very similar, but rather than directly visualizing the bytes, it shows aggregated statistics of the byte values per packet. The VIAssist (Visual Assistant for Information Assurance Analysis) application is another approach to discovering new patterns in large volumes of network security data [37].
On the data side, the tool consists of an expression builder for highlighting qualifying data instances and smart aggregation features, already introduced in Section The visualization frontend is composed of bar charts, a parallel coordinates display, a Table Lens, and a Star Tree graph visualization. Note that the latter two visualizations are adaptations of the products from Inxight Software. Furthermore, the tool offers several mechanisms for collaboration and reporting including sharing of annotations and items of interest as well as communication of hypotheses and analytical findings. Starlight is a software product developed at Pacific Northwest National Laboratory and originally targeting the intelligence community [139]. It allows users to analyze relationships within large data sets. The resulting shapes form clusters on the system s 3D graph display integrating structured, unstructured, spatial, and multimedia data, offering comparisons of information at multiple levels of abstraction, simultaneously and in near real-time. Network security is only one of several application areas of the product. In 1996, Lamm et al. published a study about access patterns of WWW traffic [98]. Their proposed Avatar system can be used for real-time analysis and mapping WWW server accesses to the points of their geographic origin on various projections of the earth using 3D bars on top of a globe and 3D scatter plots in a virtual reality context. Xiao et al. start their analysis in the opposite direction [186] by first visualizing network traffic using scatterplots, Gantt charts,or parallel plots and then allowing the user to interactively specify a pattern to be abstracted and stored using a declarative knowledge representation. A related system NVisionIP [97] employs visually specified rules and comes with the capability to store them for reusage in a modified form of the tcpdump filter language. 
The visual analytics feedback loop implemented in both approaches allows the analyst to build upon previous discoveries in order to explore and analyze more complex and subtle patterns.

To effectively identify cyber threats and respond to them, computer security analysts must understand the scale, motivation, methods, source, and target of an attack. Pike et al. developed a visualization approach called Nuance for this purpose. Nuance creates evolving behavioral models of network actors at organizational and regional levels, continuously monitors external textual information sources for themes that indicate security threats, and automatically determines if behavior indicative of those threats is present in the network [134]. The visual interface of the tool consists of a monitoring dashboard that links events to their geographical reference and a detail view with bar and line charts that are annotated through related real-world information.

Analysis of firewall and IDS logs for virus, worm and attack detection

One of the key challenges of visual analytics is to deal with the vast amount of data from heterogeneous sources. In the field of network security, large amounts of events and traffic are collected in log files originating from traffic sensors, firewalls, and intrusion detection systems. As demonstrated in the application Visual Firewall, consolidation and analysis of these heterogeneous data can be vital for proper system monitoring in real-time threat situations [102]. Because gaining insight into complex statistical models and analytical scenarios is a challenge for both statistical and networking experts, the need for visual analytics as a means to combine automatic and visual analysis methods steadily grows along with increasing network traffic and escalating alerts. A study on firewall logs by Girardin and Brodbeck proposes to use Self-Organizing Maps for automatic classification of log entries [58]. By merging colors, shapes, and textures, multiple attributes are encoded into a single integrated iconic representation on an 8 by 8 grid.
Selecting a cell in the grid triggers a list of the contained log events. The proposed tool further comprises a spring embedder graph layout consisting of 500 events, a parallel coordinates display, and a dynamic querying interface. In many networks, additional intrusion detection sensors are installed behind firewall systems in order to monitor suspicious traffic that attempts to bypass the firewall. One study to visually evaluate the output of such firewalls and IDS was conducted by Alex Wood in 2003 [185], in which destination port vs. time scatterplots with zooming capabilities are created and color is used to encode the source IP. Other statistical visualizations such as boxplots with whiskers, bar charts and pie charts are used in order to visually represent the distribution of the additional variables of the log data. IDS Rainstorm also attempts to bridge the gap between large data sets and human perception [1]. A scatterplot-like visualization of all local IP addresses versus time is provided for analyzing thousands of security events generated daily by the IDS. After zooming into regions of interest, lines appear and link the pictured incidents to other characteristics of the data set. Koike and Ohno argue that many IDS signatures are not detailed enough to avoid false positives and propose to use visualization to distinguish between false positive and true positive alerts [92]. For near real-time monitoring, their system SnortView reads system logs and Snort alerts every two minutes. Its basic visualization framework is a traditional 2D source IP vs.

time matrix overlaid with statistical information, such as the number of alerts per time interval, the number of alerts per source IP, and colored glyphs encoding details of the used protocols and alert priority. Furthermore, this time diagram is extended through a source-destination matrix, which interactively highlights the temporal distribution of alerts for a particular source IP and the associated destination IP. The Starmine system focuses its analysis of IDS alerts on the combination of geographical, temporal, and logical views since some cyber threats can reveal distinct patterns in any of these views. The geographical components, map view and globe view, visualize the geographical source-destination relationships of attacks through arcs on the map and straight lines on the 3D globe. The map view is furthermore capable of showing the number of attacks for each location as bars in a 3D scene on top of the map. In the integrated view, hosts on the map view are linked through straight lines with the respective hosts' positions in the IP matrix, which assigns a position to the respective IP address using the first and the second bytes of the latter. Furthermore, a line chart on the right of the 3D scene in the integrated view details the aggregated number of alerts per time interval. Although the system suffers from occlusion, the authors demonstrated in the course of the analysis of several network viruses that correlations of geographical properties and IP address ranges can be found with their tool. The Starmine system was further extended in another study in order to deal with the geographical and logical properties of a local area network: The map now details the geographical positions of hosts within the university campus and the IP matrix visualizes the third and the fourth octets of the local IP addresses [123].
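The positioning of an address in such an IP matrix can be sketched as follows. This is a minimal illustration, not the Starmine implementation: the function name ip_matrix_cell, the 256 x 256 grid, the example address, and the choice of which byte maps to the column and which to the row are assumptions.

```python
def ip_matrix_cell(ip, octets=(0, 1)):
    """Return the (column, row) cell of an IPv4 address in a 256 x 256 matrix.

    `octets` selects the two bytes that span the matrix: (0, 1) uses the first
    and second byte (Internet-wide view), (2, 3) the third and fourth byte
    (local area network view). The column/row order is an assumption.
    """
    parts = [int(p) for p in ip.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError(f"not a dotted-quad IPv4 address: {ip!r}")
    return parts[octets[0]], parts[octets[1]]


# Hypothetical example address
wide_cell = ip_matrix_cell("192.0.2.17")           # first and second byte
local_cell = ip_matrix_cell("192.0.2.17", (2, 3))  # third and fourth byte
```

With the first two bytes, all hosts of the same /16 prefix share one cell, whereas the third and fourth octets separate the individual hosts of a single /16 network.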
The temporal view was also extended and now consists of several colored line charts, each displaying the aggregated number of alerts per subnet. Furthermore, horizontal bars and a zooming interface show port activity. In order to solve some of the occlusion issues in the integrated 3D view, the analysts can choose the expansion view, which plots the logical, temporal, and geographical view next to each other in a two-dimensional arrangement. It is worth mentioning that some visualization techniques such as parallel coordinates and graphs have meanwhile found their way into commercial products, e.g., the RNA Visualization Module of SourceFire [146]. In this context, the techniques are used to display correlations in the multidimensional characteristics of network traffic and connectivity of network nodes. However, the major drawbacks of the parallel coordinates technique are that it produces visual clutter due to overplotting lines and that only correlations between neighboring axes can be identified.

Detection of errors and attacks in the BGP routing system

As described earlier in Section 2.1, the whole Internet builds upon the BGP routing system. If a network link fails, alternative routes to the destination will be used to route the traffic unless it was the only link connecting the respective AS with the Internet. Due to the huge scale of the Internet, routing updates are frequent and need to be quickly propagated to ensure reachability. Failures within the routing system can thus easily affect large shares of the Internet's network traffic and it is crucial to detect and resolve them in a timely manner. Link Rank is a tool for gaining insight into large amounts of routing changes by converting them into visual indications of the number of routes carried over individual links in an embedded graph [96]. Its filtering mechanism helps to extract a reduced topological graph capturing the most important or most relevant changes. The tool enables network operators to discover otherwise unknown routing problems, understand the scope of impact of a topological event, and detect root causes of observed routing changes. Cortese et al. offer an alternative view on changes in the routing system using a topographic map metaphor [35]. Coloring and contour lines are used to confine ASes at the same level of the ISP hierarchy. These lines are taken into account when calculating the layout for the AS connectivity graph, thus placing top level ASes onto the peak of the fictitious mountain surrounded by gradually less important ASes bounded by contour lines. This kind of metaphor permits an effective visualization of time-animated routing paths within the BGPlay system [32].

Towards Visual Analytics for network security

In general, a trend from rather static displays that visualize the outcome of automatic analysis techniques towards highly interactive Visual Analytics systems for network security can be observed. These novel systems try to tightly integrate the human expert into the analysis process by allowing the expert to intervene in automatic and visual analysis techniques in order to verify old or come up with new hypotheses, to test different views and algorithms on the data, and to reuse gained knowledge to further improve analysis techniques.
Note that while this related work section covered tools and systems for visual analysis of network traffic, the related work sections within the application chapters will focus on reviewing relevant visualization techniques from a broader spectrum of fields: Section 4.1 considers contributions on time series visualization, Section 5.1 reviews hierarchical visualization techniques, Section 6.1 discusses work on node-link diagrams, Section 7.1 deals with work on radial and multivariate representations, Section 8.1 presents some dimension reduction techniques, and Section 9.1 lists related work on visual analysis of communication.


4 Temporal analysis of network traffic

"Everything happens to everybody sooner or later if there is time enough." George Bernard Shaw

Contents
4.1 Related work on time series visualization
4.2 Extending the recursive pattern method time series visualization
The overall algorithm
Empty fields to compensate for irregular time intervals
Spacing
Comparing time series using the recursive pattern
Small multiples mode
Parallel mode
Mixed mode
Case study: temporal pattern analysis of network traffic
Summary
Future work

Time is one of the most important properties of network traffic. It is thus often used to correlate traffic loads and events of one host or higher level network entity with others. When monitoring large networks, the task of comparing time series becomes difficult, especially due to the following two circumstances:

1. Network time series data is recorded at a very fine granularity; aggregation can be used for consolidating the data volume, but important details might get lost.
2. There is a multitude of potentially comparable time series, especially considering the total number (65536) of application ports of the TCP and UDP protocol or the number of individual hosts within the network.

Unfortunately, traditional visualization techniques from statistics (e.g., line charts or bar charts) scale neither to the resolution of large time series nor to the increasing number of time series compared within the same plot. As demonstrated in Figure 4.1, general trends, such as the increase in mail traffic in the morning, are visible, but the details of the exact timing of high or low traffic events get lost. Furthermore, overplotting becomes a serious problem as it makes the time series indistinguishable in some areas of the plot. As a countermeasure, in this chapter we present an overlap-free recursive pattern visualization technique and adapt it to the needs of our analysis.
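The loss of detail caused by aggregation, named as the first circumstance above, can be illustrated with a small Python sketch. The data is entirely hypothetical (a constant load of 100 packets/min with a single one-minute burst), and the helper name aggregate is an assumption made for this example.

```python
def aggregate(counts, width):
    """Sum consecutive samples in non-overlapping windows of `width` samples."""
    return [sum(counts[i:i + width]) for i in range(0, len(counts), width)]


# Hypothetical data: 1440 per-minute packet counts (one day) with a steady
# load of 100 packets/min and one burst of 5000 packets in minute 605.
per_minute = [100] * 1440
per_minute[605] = 5000

per_hour = aggregate(per_minute, 60)

# Contrast between the burst and the normal level before and after aggregation
contrast_minute = max(per_minute) / 100       # 50 times the normal level
contrast_hour = max(per_hour) / (60 * 100)    # less than twice the normal level
```

After aggregation to 24 hourly values, the burst is reduced from 50 times the normal level to less than twice the normal level, and its timing is only known to the hour rather than to the minute.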

Figure 4.1: Line charts of 5 time series of mail traffic over a time span of 1440 minutes as monitored at the university gateway on November 20, 2007 (series: smtp port 25, pop3 port 110, imap port 143, imaps port 993, pop3s port 995; y-axis: packets/min on a logarithmic scale; x-axis: minutes)

The rest of this chapter is structured as follows. First, we discuss related work in the field of time series visualization. Next, details of the recursive pattern algorithm are given and the technique is augmented with empty fields that compensate for irregular time intervals. With spaces added between groups of different levels, the recursive pattern technique becomes both more structured and more readable. Thereafter, we present enhancements for comparing several time series with each other, such as different coloring schemes as well as three configuration modes, namely, small multiples, parallel mode, and mixed mode. A case study is used to demonstrate the capabilities of the extended recursive pattern. Finally, the contributions are summarized in the last section followed by a brief outlook on the future work in this field.

4.1 Related work on time series visualization

Time series are an important type of data encountered in almost every application domain. The field has been intensely studied and received a lot of research attention, especially from the financial sector. When it comes to information visualization, not only highlighting particular patterns is an important aspect, but also arrangement of multiple time series to support comparison between several monitored items as studied in [5, 63]. The study in [5] compares stock prices of 50 companies over time and demonstrates that with only 4 time series the line graph might reach its scalability limit, whereas the pixel-based circle segments technique is capable of showing all 50 companies and their stock price development over a long time period.
The authors of [63] propose a set of layout masks for subdividing the available display space in a regularity-preserving fashion. The instances of the time series are then arranged according to their inter-time-series importance relationships by means of the masks. This work was continued and extended towards extraction of local patterns within multidimensional data sets in [62]. A local pattern can be interactively selected and so-called intelligent queries find similar or inverse patterns within the data set and order them according to their relevance. The

technique has proven to be useful for identification of correlated load situations of database servers and for finding suitable load balancing schemes using the available hardware resources.

Hochheiser and Shneiderman use traditional line graphs in their Time Searcher system [68]. However, the tool's focus is set on the dynamic query interface. Through rectangular boxes, the user can simultaneously specify ranges of values and time intervals to find matching time series. Furthermore, a query-by-example interface, support for queries over multiple time-varying attributes, query manipulation, pattern inversion, similarity search capabilities, and graphical bookmarks characterize the innovative user interface of the tool.

Naturally, insight is mostly achieved through an iterative exploration process where previous results are progressively refined, as emphasized in a study by Phan et al. [132]. The authors demonstrate how they use progressive multiples of timelines and event plots to organize their findings in the process of investigating network intrusions.

Another interactive approach is the LiveRAC system, which is designed to address limitations of traditional network monitoring systems by offering an exploration interface [118]. The system is based on a reorderable matrix where each matrix row represents a monitored system and each matrix column a group of one or more monitored parameters. Comparisons of time series at various levels of detail are enabled through the semantic zoom, which adapts each chart's visual representation to the available display space. In contrast to this work, the approach presented in this chapter is based on a pixel visualization technique to better use the scarce display space.

Other application scenarios deal with the problem of finding usage patterns and visualizing them on larger time scales.
Van Wijk and van Selow, for example, combined a clustering algorithm and a calendar view to identify daily energy consumption patterns [170]. Their methods provide insight into the data set and are suitable for fine-tuning their model's parameters to predict future energy consumption. Since calendars are a well-known metaphor for most users, they provide a good basis for extensions. DateLens is probably the most prominent novel calendar interface for PDAs [10]. It employs fisheye distortion to simultaneously show overview and details of calendar entries in a compact yet detailed fashion.

In contrast to these studies, the overall goal of this chapter is to present a highly scalable method for visualizing large time series and making visual comparisons between them.

4.2 Extending the recursive pattern method time series visualization

We start off by considering different tasks that typically need to be carried out when analyzing time series in a network monitoring context:

1. Find overall trends,
2. Spot repetitive events,
3. Reference an event to the exact time of occurrence,

4. Find co-occurring patterns in several time series.

Since commonly used line charts are capable of supporting the user in solving the above tasks only to a certain extent, we sought alternative, more scalable possibilities of representing temporal data. We draw our inspiration from the recursive pattern visualization [4], which arranges pixels or small rectangles in a line-wise back-and-forth fashion. This arrangement scheme has the property of rendering neighboring data elements right next to each other. When this scheme is repeated through a recursive placement of higher-order groups, this property will in most cases get lost for neighboring data points that do not belong to the same upper-level group, such as the last day of a month and the first day of the subsequent month. Nevertheless, other regularly reappearing patterns can be made visible when using an appropriate configuration of the recursive pattern.

4.2.1 The overall algorithm

The basic recursive pattern algorithm is specified in terms of two array parameters, widths and heights, that define the pattern's rectangle placement scheme. To adapt the recursive pattern to our needs, we extended it by introducing three new parameters, namely direction, spacing_x, and spacing_y, resulting in the following set of parameters in the extended recursive pattern:

widths: Integer array of size n; specifies the horizontal partitioning of the recursive pattern, with elements ordered in descending priority.
heights: Integer array of size n; specifies the vertical partitioning of the recursive pattern, with elements ordered in descending priority.
direction: Boolean array of size n; specifies the start direction of the pattern for each level (0 for horizontal, 1 for vertical). Note that this parameter was not introduced in the original publication [4], but it is necessary for handling top-down alternating recursive patterns.
If this parameter is not set, a horizontal start direction is assumed for each level.
spacing_x: Floating point array of size n; specifies the horizontal space between elements of each level.
spacing_y: Floating point array of size n; specifies the vertical space between elements of each level.

Our implementation of the extended recursive pattern consists of three functions: a) recpos recursively finds the position relative to the pattern hierarchy as specified through widths, heights, and direction, b) abspos determines the absolute position, and c) addspacing introduces spaces between elements and groups of elements of different levels.

Assume a pattern representing the days of one year with 30 days in each of the 12 months (30 × 12 = 360). We could define widths = (4, 6) and heights = (3, 5), which would create a 4 × 3 arrangement of the months and a 6 × 5 arrangement of the days. This leads to the recursive pattern shown in Figure 4.2: the line-wise forward and backward arrangement positions subsequent days within a month next to each other.
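The position calculation performed by recpos and abspos (specified in Algorithms 4.1 and 4.2) can be sketched in Python as follows. This is a minimal illustration and not the implementation used in this work: recpos is written iteratively instead of recursively, the parameters are passed explicitly, and spacing is left out. All signatures are assumptions made for the example.

```python
import math
from typing import List, Optional, Sequence, Tuple


def recpos(pos: int, widths: Sequence[int], heights: Sequence[int]) -> List[int]:
    """Index of element `pos` at every hierarchy level (iterative reading of Algorithm 4.1)."""
    size = math.prod(widths) * math.prod(heights)  # elements in the whole pattern
    rel = []
    for w, h in zip(widths, heights):
        size //= w * h           # elements inside one group at this level
        rel.append(pos // size)  # which group of this level contains pos
        pos %= size              # offset inside that group
    return rel


def abspos(pos: int, widths: Sequence[int], heights: Sequence[int],
           direction: Optional[Sequence[int]] = None) -> Tuple[int, int]:
    """Column and row of element `pos` without spacing (reading of Algorithm 4.2)."""
    n = len(widths)
    # Assumption taken from the parameter description: default is horizontal start.
    direction = list(direction) if direction is not None else [0] * n
    dx, dy = math.prod(widths), math.prod(heights)
    rel = recpos(pos, widths, heights)
    x_abs = y_abs = 0
    for i in range(n):
        dx //= widths[i]   # width of one group at level i, in elements
        dy //= heights[i]  # height of one group at level i, in elements
        if direction[i] == 0:
            # Horizontal start: fill row by row, reversing every second row.
            y_rel, x_rel = divmod(rel[i], widths[i])
            if y_rel % 2 == 1:
                x_rel = widths[i] - 1 - x_rel
        else:
            # Vertical start: fill column by column, reversing every second column.
            x_rel, y_rel = divmod(rel[i], heights[i])
            if x_rel % 2 == 1:
                y_rel = heights[i] - 1 - y_rel
        x_abs += x_rel * dx
        y_abs += y_rel * dy
    return x_abs, y_abs


if __name__ == "__main__":
    print(recpos(50, (4, 6), (3, 5)))  # month index and day index, counted from 0
    print(abspos(50, (4, 6), (3, 5)))  # column and row in the 24 x 15 grid
```

For the example configuration, position 50 maps to the relative position [1, 20] and to column 9, row 3 of the 24 × 15 grid. Because every second row is reversed, consecutive days of the same month always land in adjacent cells.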

Figure 4.2: Recursive pattern example configuration: 30 days in each of the 12 months with parameters widths = (4, 6) and heights = (3, 5)

The recpos function defined in Algorithm 4.1 takes the parameters pos, level, and size as input and returns an array with the position relative to the pattern hierarchy. Querying position 50 with recpos(50, 0, 360) in the time series of the previous example would then return (1, 20) (2nd month and 21st day). Note that we always count from 0 for easier array handling while using the mod and div functions. Array position 50, equivalent to (1, 20) in the recursive pattern hierarchy, is marked white in Figure 4.2 for illustrative purposes.

Algorithm 4.1: Recursive pattern algorithm: calculation of the position relative to the recursive pattern hierarchy

procedure recpos(pos, level, size)
begin
    size /= (widths[level] · heights[level])
    if level < n − 1 then
        return (pos div size) ∘ recpos(pos mod size, level + 1, size)    // ∘: array concatenation
    else
        return pos
end

The abspos function in Algorithm 4.2 is somewhat more complex since it has to transform the relative position array into absolute positions that can be easily mapped on the screen. The function proceeds by retrieving the relative position array and adding the offset of the considered item from the top left corner for each level. Assuming the horizontal start direction, array position 50 in the above example is calculated as y_rel = 1 div 4 = 0 and x_rel = 1 mod 4 = 1

(lines 11 and 13 in the algorithm) in the first run of the loop, resulting in x_abs = 1 · 6 = 6 and y_abs = 0 (lines 22 and 23); the second run of the loop calculates y_rel = 20 div 6 = 3 and x_rel = 6 − 1 − 20 mod 6 = 3 (lines 11 and 15), resulting in x_abs += 3 · 1 = 9 and y_abs += 3 · 1 = 3, which is the final position. The addspacing function is explained in Section 4.2.3.

Algorithm 4.2: Recursive pattern algorithm: calculation of the absolute position from the relative position within the pattern

 1  procedure abspos(pos)
 2  begin
 3      dx = ∏_{i=0}^{n−1} widths[i], dy = ∏_{i=0}^{n−1} heights[i]
 4      rel = recpos(pos, 0, dx · dy)
 5      x_abs = 0, y_abs = 0
 6      for i = 0 to n − 1 do
 7          dx /= widths[i]
 8          dy /= heights[i]
 9          x_rel = 0, y_rel = 0
10          if direction[i] == 0 then
11              y_rel = rel[i] div widths[i]
12              if y_rel mod 2 == 0 then
13                  x_rel = rel[i] mod widths[i]
14              else
15                  x_rel = widths[i] − 1 − rel[i] mod widths[i]
16          else
17              x_rel = rel[i] div heights[i]
18              if x_rel mod 2 == 0 then
19                  y_rel = rel[i] mod heights[i]
20              else
21                  y_rel = heights[i] − 1 − rel[i] mod heights[i]
22          x_abs += x_rel · dx
23          y_abs += y_rel · dy
24      return addspacing(x_abs, y_abs)
25  end

4.2.2 Empty fields to compensate for irregular time intervals

Figure 4.3 demonstrates different configurations of the recursive pattern algorithm for visualizing the days of one month. In (a) we assigned 30 rectangles to one month in a 6 × 5 matrix under the assumption of a space-filling display utilization. Each month consists of 28 to 31 days and, therefore, the value represented by one square in this visualization varies between 22.4 and 24.8 hours (28 · 24 / 30 and 31 · 24 / 30). This unconventional aggregation strategy might lead to confusion since we cannot tell for sure whether the high value of a cell is due to the actual data

distribution or whether it was introduced artificially. Apart from this drawback, the layout destroys the weekly patterns because the Friday squares end up at different horizontal positions in each row. When considered in a multi-resolution context, this layout inevitably complicates both the understanding and the implementation of drill-down operations from months to weeks and days.

Figure 4.3: Recursive pattern parametrization showing a weekly reappearing pattern: (a) 6 × 5 configuration, (b) 7 × 5 configuration, (c) 7 × 1, 1 × 5 configuration

The second possibility is a 7 × 5 configuration, which maintains daily patterns by filling the pattern with empty rectangles at the beginning and at the end, as demonstrated in Figure 4.3(b). Although it is now possible to compare traffic of subsequent days, the capability of easily comparing patterns occurring on the same weekday is still lacking. Figure 4.3(c) shows our final configuration with groups of 7 rectangles per line (first parameter), vertically arranged into 5 lines (second parameter). Note that when analyzing daily-grained time series spanning several months, one has to reserve space for up to six weeks per month because the first week could be almost empty, which creates the need for a sixth week at the end of the month. This layout has two drawbacks: 1) it wastes up to 33 percent of screen space for the month of February (42 rectangles for 28 days), and 2) the property of having subsequent days always appear next to each other within a month is lost due to the line-wise left-to-right arrangement. However, these drawbacks are compensated by the gained advantage of easily spotting weekly reappearing patterns. Technically, empty fields are simply rendered in a color that does not appear in the colormap representing valid entries.
This is demonstrated in Figure 4.4 using the example of mapping the number of e-mails per day for one year of e-mail communication to a recursive pattern. As the information can be displayed at several granularities without changing the positions of the underlying data items, this technique has proven to be suitable for interactive data exploration due to its multi-resolution capabilities. Since the resolution of the displayed data is only limited by the number of pixels on the screen, the number of artificially introduced nodes, and the separating border between the displayed data measurements, this technique turns out to be very scalable and can easily display one million data elements on an ordinary computer screen.
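The empty-field layout for a single month can be illustrated with a short Python sketch. It assumes a grid of six week rows with seven weekday columns (Monday first) and uses None as a hypothetical marker for an empty field; the function name month_grid is made up for this example and is not part of the described implementation.

```python
import calendar
from typing import List, Optional


def month_grid(year: int, month: int, rows: int = 6) -> List[List[Optional[int]]]:
    """Place the days of one month on a rows x 7 grid, one column per weekday.

    Cells that do not correspond to a day of the month stay None; a renderer
    would draw them in a color that is not part of the colormap.
    """
    first_weekday, n_days = calendar.monthrange(year, month)  # Monday == 0
    grid: List[List[Optional[int]]] = [[None] * 7 for _ in range(rows)]
    for day in range(1, n_days + 1):
        cell = first_weekday + day - 1  # running cell index, week by week
        grid[cell // 7][cell % 7] = day
    return grid


if __name__ == "__main__":
    grid = month_grid(2007, 2)
    empty = sum(cell is None for row in grid for cell in row)
    print(f"{empty} of {6 * 7} fields are empty")  # 14 of 42 for a 28-day February
```

The same weekday always ends up in the same column, which is what makes weekly patterns easy to spot; the price is the unused share of the grid, 14 of 42 fields for a 28-day February.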

Figure 4.4: Multi-resolution recursive pattern with empty fields for normalizing irregularities in the time dimension

4.2.3 Spacing

Having realized the difficulty of interpreting recursive patterns that either contain many elements or are highly nested, we introduced spacing as the second innovation aimed at enhancing the recursive pattern technique. With gaps added between the elements of different logical time intervals (e.g., hours, days, weeks, months, years, etc.), the user is able to mentally map the rectangles at arbitrary positions to a time reference. Figure 4.5 exemplifies the difference between the traditional recursive pattern, which in this case might be very useful for comparing hourly reappearing patterns, and a recursive pattern with spacing, both visualizations referring to the same packet data sent on November 20, 2007. Note that this data set only contains the POP3 packets that entered or left the university network. As a result, the traffic from students or employees who check their e-mail at the university server from within the internal network is not included. In both configurations, it is evident that most employees start checking their e-mail at approximately 8:20 h and that the intensity of usage drops at approximately 18:00 h.

While the standard recursive pattern configuration enables the analyst to spot hourly repeating events, it remains difficult to estimate the exact time an event occurred due to the large number of elements (60 × 24) without visual reference points. In the recursive pattern with spacing configuration, one group of rectangles represents the measurements of one hour (6 times 10 minutes). Several patterns can be found, such as the horizontal bars within a group that indicate low or high traffic in subsequent minutes, whereas vertical bars represent traffic events in 10-minute intervals.
In the 6th and 7th hour (5:00 h to 6:59 h), for example, one can nicely spot traffic patterns in 5-minute intervals, which are probably caused by a client machine repetitively fetching e-mails from the server. An analyst interested in two subsequent red cells

in the evening would have a hard time finding out at what time exactly they occurred based only on the visualization depicted in Figure 4.5(a). Figure 4.5(b), in contrast, enables fast and intuitive estimation of the hours and minutes of the pattern of interest, which, in our example, maps to the time interval from 21:21 h to 21:22 h.

Figure 4.5: Enhancing the recursive pattern with spacing (showing logarithmically scaled one-day traffic volume in packets per minute over port 995 (POP3 over TLS/SSL) at the gateway of a university network): (a) standard recursive pattern with an hour-per-line arrangement, (b) recursive pattern with spacing separating 10-minute intervals

In our implementation, the width of spacing for each level is stored in the arrays spacing_x and spacing_y. Algorithm 4.3 adds the spacing level-wise for each column to the left of and for each row above the current element. Note that we refrain from adding spacing if the previous factor in the array widths or heights was equal to 1 since this denotes a line-wise (row-wise) arrangement of the elements rather than an alternating back-and-forth (top-down) pattern.

4.3 Comparing time series using the recursive pattern

Prior to presenting various layout strategies for the combined visualization of several time series, let us consider the criteria by which time series can be distinguished from one another. First of all, the elements of each time series are made distinguishable through their position. However, when comparisons of detailed elements originating from different time series are made, it should be clear which time series each element belongs to. Besides changing

the element position through a different layout function, we propose several visual separation options using the visual attribute color: a) element colors, b) border colors, or c) background colors for each time series, as shown in Figure 4.6. Depending on the analysis task, different data normalization methods, such as linear, square-root, or logarithmic normalization, could be used. Furthermore, the analysis task also determines whether each of the time series should be normalized separately or whether the colormap should be adjusted accordingly. When the analyst searches for a pattern with identical absolute values in several time series, the same minimum and maximum should be used for the normalization of all time series. However, when comparing numerical values of several time series that differ in their value characteristics (e.g., comparing the number of hosts with the number of packets), normalizing each time series with its own minimum and maximum might be the more appropriate option.

Algorithm 4.3: Recursive pattern algorithm: adding spacing between different levels

procedure addspacing(x_abs, y_abs)
begin
    dx = 1, dy = 1
    x_f = x_abs, y_f = y_abs
    dx_old = dx + 1, dy_old = dy + 1
    for i = 0 to n − 1 do
        if dx ≠ dx_old then
            x_f += (x_abs / dx) · spacing_x[n − i − 1]
        if dy ≠ dy_old then
            y_f += (y_abs / dy) · spacing_y[n − i − 1]
        dx_old = dx, dy_old = dy
        dx *= widths[n − i − 1]
        dy *= heights[n − i − 1]
    return {x_f, y_f}
end

We define the following three configuration modes of the recursive pattern for time series comparison: a) the small multiples mode to investigate several time series one by one, b) the parallel mode to facilitate the task of finding parallel developments within two or more time series, and c) the mixed mode, which combines recursive patterns at an intermediate level.
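The spacing step of Algorithm 4.3 can be sketched in Python as shown below. The sketch rests on two assumptions about the algorithm: dx and dy accumulate the group sizes multiplicatively from the finest to the coarsest level, and the division that counts the groups to the left of (or above) an element is an integer division. The explicit parameter list is also an assumption made to keep the example self-contained.

```python
from typing import Sequence, Tuple


def addspacing(x_abs: int, y_abs: int,
               widths: Sequence[int], heights: Sequence[int],
               spacing_x: Sequence[float], spacing_y: Sequence[float]) -> Tuple[float, float]:
    """Shift a grid position by the gaps of all groups to its left and above it."""
    n = len(widths)
    dx = dy = 1                      # group size (in elements) at the current level
    dx_old, dy_old = dx + 1, dy + 1  # differ on purpose so the finest level is processed
    x_f, y_f = float(x_abs), float(y_abs)
    for i in range(n):
        level = n - i - 1            # walk from the finest to the coarsest level
        if dx != dx_old:             # skipped when the previous factor was 1
            x_f += (x_abs // dx) * spacing_x[level]
        if dy != dy_old:
            y_f += (y_abs // dy) * spacing_y[level]
        dx_old, dy_old = dx, dy
        dx *= widths[level]
        dy *= heights[level]
    return x_f, y_f


if __name__ == "__main__":
    # Months in a 4 x 3 arrangement, days in a 6 x 5 arrangement, gaps only between months.
    print(addspacing(9, 3, (4, 6), (3, 5), (1.0, 0.0), (1.0, 0.0)))
```

With a gap of one unit between months and no gap between days, the element at column 9, row 3 moves to (10.0, 3.0): one complete month lies to its left, none above it.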
Figure 4.6: Different coloring options for distinguishing between time series: (a) element color, (b) border color, (c) background color

As

exemplified in Figure 4.7, time series can be combined at different levels of their intrinsic (e.g., days or weeks) or artificially introduced (e.g., 10 min intervals) hierarchy. This combination level indirectly determines the recursive pattern configuration mode. Note that combining time series at the lowest level (e.g., rendering day 1 of time series A, B, and C next to each other, followed by day 2 and so on) makes little sense since the original time series in such a view can be identified only with great cognitive effort. Combination at the second lowest level results in the parallel mode, combination at the intermediate levels results in the mixed mode, and combination at the highest level corresponds to the small multiples mode.

Figure 4.7: Combination of two time series (A and B) at different hierarchy levels: (a) weeks, (b) months, (c) years

4.3.1 Small multiples mode

Small multiples have been extensively used in information visualization as a straightforward way of comparing two or more visualizations. Once the visualization of a single time series has been understood, it is only a small step to transfer this knowledge to the analysis of several time series placed next to each other, as shown in Figure 4.8. Human perception is capable of recognizing patterns co-occurring in several of the displayed visualizations at either exactly the same time or shifted in time. However, due to the long eye movement distances, we suspect that there exist other recursive pattern layouts that are more efficient for spotting co-occurring events across multiple time series.
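The effect of the combination level on the rendering order can be sketched as follows. The function combine and its block_size parameter are hypothetical names chosen for this illustration; block_size stands for the number of values in one group at the chosen combination level (for example 7 for weeks in a daily series).

```python
from typing import Dict, List, Sequence, Tuple


def combine(series: Dict[str, Sequence[float]],
            block_size: int) -> List[Tuple[str, int, List[float]]]:
    """Interleave equally long time series block by block.

    Returns (series name, block index, values) in rendering order. A block as
    long as the whole series places the series one after another (small
    multiples); smaller blocks move the combination to a lower hierarchy level.
    """
    names = list(series)
    length = len(series[names[0]])
    if any(len(values) != length for values in series.values()):
        raise ValueError("all time series must have the same length")
    order = []
    for start in range(0, length, block_size):
        for name in names:
            block = list(series[name][start:start + block_size])
            order.append((name, start // block_size, block))
    return order


if __name__ == "__main__":
    two_weeks = {"A": list(range(14)), "B": list(range(100, 114))}
    for name, week, values in combine(two_weeks, block_size=7):
        print(f"Week{week + 1} {name}: {values}")
```

Combining two daily series at the week level yields the order Week1 A, Week1 B, Week2 A, Week2 B, which corresponds to panel (a) of Figure 4.7.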
Figure 4.8: Recursive pattern in small multiples mode (from left to right: the number of packets per minute on mail ports 25 (SMTP), 110 (POP3), 143 (IMAP), 993 (IMAPS), 995 (POP3S) on November 20, 2007 with square-root normalization)
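The normalization choices discussed in this section (linear, square-root, or logarithmic scaling, with either a shared or a per-series value range) can be sketched as follows. The function normalize is a hypothetical helper; using log(1 + v) for the logarithmic case is an assumption made here so that minutes without any packets remain representable.

```python
import math
from typing import List, Optional, Sequence


def normalize(values: Sequence[float], method: str = "linear",
              vmin: Optional[float] = None, vmax: Optional[float] = None) -> List[float]:
    """Map non-negative values to [0, 1] before a colormap lookup.

    Passing the same vmin/vmax for several time series makes their colors
    directly comparable; leaving them unset normalizes each series on its own.
    """
    transforms = {
        "linear": lambda v: v,
        "sqrt": math.sqrt,
        "log": math.log1p,  # assumption: log(1 + v) so that zero counts are defined
    }
    transform = transforms[method]
    lo = transform(min(values) if vmin is None else vmin)
    hi = transform(max(values) if vmax is None else vmax)
    if hi == lo:
        return [0.0 for _ in values]
    return [(transform(v) - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    packets = [0, 25, 100]
    print(normalize(packets, "sqrt"))              # per-series range
    print(normalize([0, 50], vmin=0, vmax=100))    # shared range across series
```

Square-root and logarithmic scaling spread out the low values, which helps when a few extreme peaks would otherwise push most cells toward the bottom of the colormap.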

Figure 4.9: Recursive pattern in parallel mode

4.3.2 Parallel mode

Parallel mode aligns parallel elements of distinct time series underneath each other, while the recursive placement scheme for the upper-level hierarchies remains unchanged. In order to distinguish between different time series, we use several intensity-varying colormaps as depicted in Figure 4.9. The major advantage of this approach is that only small eye movements are necessary for the comparative analysis of time series since the elements referring to the same point in time are aligned underneath each other. However, it might be perceptually difficult to mentally map the values to the correct time series.

4.3.3 Mixed mode

So far, we have considered the comparison of time series at the highest level with the small multiples mode and at the lowest level with the parallel mode. However, there is another possibility to combine a pair of time series at an intermediate level, namely, by using what we call the mixed mode. Figure 4.10 demonstrates the mixed mode approach using the example of rendering groups of size 10 × 6 underneath each other, whereby each group represents one hour in a single time series. Different background colors are used to mark different time series. The mixed mode layout represents a compromise between the parallel and the small multiples modes. We believe that the user can better follow the individual time series with this approach than with the parallel mode and that the eye movements for comparisons between the time series are considerably smaller than in the small multiples mode.

As shown for treemaps in the work by Tu and Shen [160], there is another comparison mode applicable to the recursive pattern visualization. The authors suggest splitting the rectangles into two and then visualizing changes via the color and the position of the diagonal split line.
For comparison of several attributes, vertical bars visualize the respective changes.

4.4 Case study: temporal pattern analysis of network traffic

One of the current network security problems is the detection of botnet servers within the internal network. While botnet clients are mostly Windows machines that have been hacked through

well-known exploits, the servers commonly run Linux or Unix systems because those systems provide a rich set of tools, good connection, and uninterrupted availability. One way of hacking such a Linux or Unix system is to use a brute force approach by trying out password lists on common user names. While this approach rarely produces a lot of network traffic in absolute terms, it is noticeable through an increased number of flows with destination port 22.

Figure 4.10: Recursive pattern in mixed mode

In this case study, we captured the incoming SSH traffic from November 21 to 27, 2007 and stored it in the database. The acquired data comprises 1.4 million flows, or 162 GB, of network traffic. A single so-called flow of this netflow data is characterized by the source and the destination IP addresses and port numbers, the number of packets, and the number of transferred bytes, where the latter two measures are aggregated over the flow interval. We realized that in our data set the flow interval commonly does not exceed 72 seconds. A longer SSH session therefore results in approximately one flow per minute. In most configurations, the SSH server terminates the connection after three subsequent failed login attempts, which forces the attacker's machine to reconnect in order to continue trying out further user name and password combinations. Such an interruption results in another flow due to the changed source port number. In order to aggregate the flows into minute intervals, all fact table entries that map to the identical timestamp value at minute-wise precision are consolidated into one entry.

Figure 4.11 shows the number of SSH flows per minute over a time frame of one week (November 21 to 27, 2007). The extended recursive pattern visualization here places 12 hours horizontally, each with 6 × 10 minutes.
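The minute-wise consolidation of flows can be sketched in Python as follows. The tuple layout of a flow record, the function name aggregate_per_minute, and the in-memory processing are assumptions made for this illustration; the case study itself performs the aggregation on fact table entries in the database.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (start time, source IP, destination IP, packets, bytes)
Flow = Tuple[datetime, str, str, int, int]


def aggregate_per_minute(flows: Iterable[Flow]) -> Dict[datetime, Dict[str, int]]:
    """Consolidate all flows whose timestamps fall into the same minute."""
    buckets = defaultdict(lambda: {"flows": 0, "sources": set(),
                                   "destinations": set(), "packets": 0, "bytes": 0})
    for start, src, dst, packets, nbytes in flows:
        minute = start.replace(second=0, microsecond=0)  # minute-wise precision
        bucket = buckets[minute]
        bucket["flows"] += 1
        bucket["sources"].add(src)
        bucket["destinations"].add(dst)
        bucket["packets"] += packets
        bucket["bytes"] += nbytes
    return {
        minute: {
            "flows": bucket["flows"],
            "source_hosts": len(bucket["sources"]),            # distinct source hosts
            "destination_hosts": len(bucket["destinations"]),  # distinct destination hosts
            "packets": bucket["packets"],
            "bytes": bucket["bytes"],
        }
        for minute, bucket in sorted(buckets.items())
    }


if __name__ == "__main__":
    sample = [  # made-up records using documentation address ranges
        (datetime(2007, 11, 22, 10, 30, 5), "198.51.100.7", "192.0.2.10", 12, 1800),
        (datetime(2007, 11, 22, 10, 30, 40), "198.51.100.7", "192.0.2.11", 10, 1500),
        (datetime(2007, 11, 22, 10, 31, 2), "203.0.113.9", "192.0.2.10", 4, 600),
    ]
    for minute, measures in aggregate_per_minute(sample).items():
        print(minute, measures)
```

Each minute then carries five measures (flows, distinct source hosts, distinct destination hosts, packets, bytes), each of which can be mapped to its own recursive pattern.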
The corresponding 12-hour intervals of the subsequent days are then placed underneath. Below the middle, the reappearing color marks the second half of each day. Since values are scaled logarithmically prior to their mapping to the grayscale

colormap, dark areas in this plot represent massive amounts of SSH connections caused by attacking machines. In this case, the same normalization is used for each of the seven days since their measurements are directly comparable. While these massive peaks would also be visible in line charts or bar plots, those layouts are less supportive in revealing the exact time of their occurrence. The extended recursive pattern visualization, on the contrary, allows the analyst to see the patterns and to associate them with the exact moments in time by counting the element's position from the top left corner of each hierarchy level. For example, a large-scale attack seen in Figure 4.11 (dark area in the red time series) took place from 13:51 h to approximately 16:29 h on November 21, 2007. Besides those extreme peaks, the visualization reveals other increases in the number of SSH flows, such as the one on November 22 (blue) from approximately 10:30 h to 21:09 h, which we want to investigate more deeply.

Figure 4.11: Case study showing the number of SSH flows per minute captured at the university gateway from November 21 to 27, 2007

Figure 4.12 shows the recursive pattern visualization of the detailed netflow data from November 22 as the number of flows (blue), distinct source hosts (green), distinct destination hosts (violet), network packets (orange), and transferred bytes (yellow). Note that the source host and destination host variables are scaled using their common maximum since they convey the same type of information from the source and the destination perspective, respectively. It is now interesting to observe that the attacks from 22:25 h to 22:45 h and from 23:33 h to 23:59 h are visible in four out of five of the data dimensions. Both attacks generated lots of flows to a large number of destination hosts, thereby sending lots of network packets with heavy payload.
The gray flow pattern from 10:30 h to 21:07 h can be clearly seen in the number of packets and payload but it is not reflected in the number

of destination hosts, which suggests that it is either an exhaustive attack targeting a single host or, more likely, an ordinary SSH session with massive data transfers.

Figure 4.12: Details on the number of flows (blue), external source hosts (green), internal destination hosts (violet), packets (orange), and payload (yellow) of SSH traffic captured at the university gateway on November 22, 2007

In a nutshell, the scalability of the extended recursive pattern visualization technique was demonstrated in Figure 4.11 by presenting the details of more than 10,000 measurements (7 days × 1440 minutes). Despite this large amount of data, the precision of the technique made it possible to visually determine the exact timing of certain events. Moreover, Figure 4.12 showed that the technique enables the analyst to precisely correlate temporal patterns of several time series.

4.5 Summary

Analysts in the field of network monitoring and security have to deal with time series that either contain thousands of measurement values or consist of a multitude of different measurements. In order to properly assess critical network events, the analyst must be able to compare several time series at different levels of detail and to correlate the events contained in the time series. In our opinion, conventional line or bar charts do not scale to the large time series that are characteristic of network monitoring and security, especially with respect to the tasks of spotting repetitive events, determining precise timings of events, and correlating these events across multiple time series. To support the above tasks, in this chapter we introduced the extended recursive pattern visualization for analyzing large or multi-dimensional time series.
The technique has been adapted to match the needs of the analysis by incorporating the following innovative features:

- Empty fields compensate for irregular time intervals to use the technique in a multi-resolution context.

- Spaces are inserted between different hierarchy levels within the time dimension to facilitate the interpretation of large time series.
- Three coloring methods are presented to distinguish between different time series in a comparative view.
- Three layout configuration modes (small multiples, parallel, and mixed mode) enable visual comparisons of time series with the extended recursive pattern.

The case study exemplified the usage of the extended recursive pattern visualization on real network monitoring data sets and demonstrated how SSH attacks can be recognized within large amounts of netflows captured at the university gateway. The detailed analysis of the number of flows, source hosts, destination hosts, packets, and payload showed how a rather short attack targeting many destination hosts can be visually distinguished from an intensive SSH session.

4.5.1 Future work

Although we have extended the recursive pattern technique, no user studies have been conducted to determine and quantify the pros and cons of our approach. A user study could be helpful in estimating whether our technique outperforms bar and line charts on certain tasks, whether users would prefer our technique, and which coloring technique for visual separation of the time series (element, border, or background color) is the most effective. Furthermore, the question of which comparison mode of the recursive pattern is most effective for common analysis tasks is still unanswered. However, due to time limitations, these issues could not be investigated within this work.

5 A hierarchical approach to visualizing IP network traffic

"What really makes it an invention is that someone decides not to change the solution to a known problem, but to change the question."
Dean Kamen

Contents
5.1 Related work on hierarchical visualization methods
5.2 The Hierarchical Network Map
    Mapping values to colors
    Netflow exploration
    Scaling node sizes
    Visual Scalability
5.3 Space-filling layouts for diverse data characteristics
    Requirement analysis
    Geographic HistoMap layout
    One-dimensional HistoMap layout
    Strip Treemap layout
5.4 Evaluation of data-driven layout adaptation
    Visibility
    Average rectangle aspect ratio
    Layout preservation
    Summary
5.5 User-driven data exploration
    Filtering
    Navigation within the map
    Colormap interactions
    Interactions with multiple map instances
    Triggering of other analysis tools
5.6 Case studies: analysis of traffic distributions in the IPv4 address space
    Case study I: Resource location planning
    Case study II: Monitoring large-scale traffic changes
    Case study III: Botnet spread propagation
    Expert feedback
5.7 Summary
    Future work

The main focus of this chapter is to propose an interactive Hierarchical Network Map (HNMap) technique that supports a mental representation of global Internet measurements. HNMap is applied to depict a hierarchy of 7 continents, 190 countries, autonomous systems and IP prefixes. We adopt the Treemap approach [77]: each node in the hierarchy is drawn as a box placed inside its parent. Node sizes are proportional to the number of items contained. Popup and fixed-text labels are used to describe the nodes, and an adjustable color scale encodes the values of the measurement attribute.

An important aspect of this study is to assess the layout stability of Treemap algorithms under large adjustments in the displayed set of nodes. We compare two layout algorithms for ordered data, split-by-middle and strip Treemaps, with respect to squareness, locality preservation, and run time.

Filtering and navigation for interactive exploration of the proposed hierarchy are provided. These operations may force a re-layout of the affected part of the hierarchy. In addition, we implemented interactions with the color scale and with multiple map instances for gaining insight into the traffic distribution over time. Furthermore, we show how other analysis tools can be triggered from within the HNMap interface to study additional dimensions of the data sets.

Three small case studies conducted jointly with security experts are presented as evidence of the suitability of HNMap for visual analysis of security-related data. To the best of our knowledge, this is the first proposed method for visualizing large-scale network data aggregated by prefix, autonomous system, country, and continent.

The chapter is structured as follows. We start by discussing the related work in the field of hierarchical visualization methods in Section 5.1 and then introduce our multi-resolution HNMap approach in Section 5.2.
Section 5.3 presents the involved visualization algorithms, followed by their evaluation with respect to visibility, average rectangle aspect ratio, and layout preservation in Section 5.4. Section 5.5 discusses the user interaction techniques that constitute our exploration tool. Different application fields of HNMap are presented as small case studies with real-world data sets in Section 5.6. Section 5.7 concludes the chapter and summarizes its contributions.

5.1 Related work on hierarchical visualization methods

Viewing our work in a broader context, one quickly realizes that data sets encountered in a multitude of application fields have inherent classification hierarchies imposed onto the data entries at the finest granularity level. Since hierarchical structures play a vital role in the process of analyzing and understanding these data sets, information visualization researchers have sought different ways of presenting and interacting with such data. To put the emphasis on the hierarchical structure, Robertson et al. proposed Cone Trees [140], in which the hierarchy is presented in 3D to effectively use the available screen space and enable visualization of the whole structure. Through interactive animation, some of the user's cognitive load is shifted to the human perceptual system. As an alternative 3D layout,

Kleinberg et al. presented a rather unusual approach by encoding hierarchy information in the stem and branch structure of a botanical tree [89].

A common way of displaying hierarchical data is given by layouts that place child nodes inside the boundaries of their parent nodes. Such displays provide spatial locality for nodes under the same parent and visually emphasize the sizes of sets at all hierarchy levels. Usually, leaf nodes may have labels or additional statistical attributes that may be encoded graphically as relative object size or color. The most prominent layout of this type is the Treemap, a space-filling layout of nested rectangles available in a number of variants. The earliest variant was the Slice-and-dice Treemap [77]. Here, display space is partitioned into slices sized proportionally to the total area of the nodes contained therein. This procedure is repeated recursively at each hierarchy level, rendering child nodes inside parent rectangles while alternating between horizontal and vertical partitioning. The technique is not hard to implement and can run efficiently, but it tends to produce long, thin rectangles, which are hard to perceive and to compare visually. Squarified Treemaps [16] remedy this deficiency by using rectangles with controlled aspect ratios. Rectangles are prioritized by size, so that the large ones are treated as the most critical ones for layout. This improves the appearance of the Treemap, but does not preserve the input node order, which may be an undesirable effect in some applications. This drawback was addressed by the Ordered Treemaps [11]. Most Ordered Treemap variants are pivot-based. Unfortunately, our application often deals with rectangles of highly varying sizes, so that the resulting pivot rectangles can become highly deformed. We therefore investigated an adaptation of pivot-based Ordered Treemaps and compared it to Strip Treemaps.
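The slice-and-dice procedure described above can be illustrated with a short sketch. It is not taken from [77]; the tuple-based tree format `(name, size, children)` and the function name `slice_and_dice` are assumptions made for this example only.

```python
def slice_and_dice(node, x, y, w, h, horizontal=True, out=None):
    """Lay out a (name, size, children) tree inside the rectangle (x, y, w, h).

    Children are placed side by side along one axis, each receiving a share of
    the parent rectangle proportional to its size; the axis alternates per level.
    Returns a list of (name, x, y, w, h) tuples in depth-first order.
    """
    if out is None:
        out = []
    name, _size, children = node
    out.append((name, x, y, w, h))
    total = sum(child[1] for child in children)
    offset = 0.0  # fraction of the parent already consumed
    for child in children:
        share = child[1] / total
        if horizontal:
            slice_and_dice(child, x + offset * w, y, share * w, h, False, out)
        else:
            slice_and_dice(child, x, y + offset * h, w, share * h, True, out)
        offset += share
    return out


# Hypothetical toy hierarchy: sizes of inner nodes equal the sum of their children.
tree = ("root", 8, [("a", 4, []), ("b", 2, []),
                    ("c", 2, [("c1", 1, []), ("c2", 1, [])])])
rects = {name: rect for name, *rect in slice_and_dice(tree, 0, 0, 100, 100)}
```

Even in this toy example the leaves `a`, `b`, and `c` become full-height columns, which hints at the long, thin rectangles that motivated the later Treemap variants.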
Similarly to the Million Items Treemap [49], the goal of our technique is to handle a large hierarchy (more than 200,000 nodes). Unlike the latter study, we do not intend to animate size changes, but rather focus on maintaining a stable layout when redistributing the gained display space after leaving out substantial parts of the hierarchy. An alternative non-treemap and non-space-filling layout algorithm was applied by Itoh et al. to visualize computer security data [76]. Their proposed rectangle packing algorithm maps hosts to rectangles and subnets to larger enclosing rectangles, as demonstrated in Figure 5.1. Yet another possible layout of the IPv4 address space is the QuadTree decomposition of subsequent pairs of bits of the IP address [154], which was demonstrated in the context of analyzing routing changes. For each event, the point representing the IP address is connected by lines with the two involved autonomous systems (placed outside of the QuadTree). Depending on the security events occurring in the network, characteristic patterns appear. The IP matrix approach [93] uses a similar placement schema, but splits each IP address into two matrix representations. This system is supplemented by stacking multiple IP matrices in 3D, one for each parameter of the attack data. Furthermore, space-filling curves have been applied to display the IP address space. In contrast to Treemaps, some of these curves depict continuous ranges of the IP address space through non-rectangular regions while preserving geometric continuity. Munroe, for example, drew a comic map of the registry information of all /8 subnets in the IPv4 address space using the fractal mapping of a Hilbert Curve [124]. Wattenberg's work on jigsaw maps [179] also uses this layout function as well as the H-Curve. Other researchers focused on surveying usage of the IP addresses, ranging from aggregated data on images [135] to wall-sized 600 dpi

printouts of all hosts in the Internet [66] by means of space-filling curves.

Figure 5.1: Example of a hierarchical data visualization using a rectangle-packing algorithm

The major advantage of our Hierarchical Network Map versus the rectangle packing algorithm, the QuadTree approach, the IP matrix, and the space-filling curves is a more meaningful positioning of the IP addresses on the display. We combine network topology with geographic information for an improved orientation and understanding of the analyzed data sets, while at the same time scaling up to the entire IPv4 address space. However, due to the uncertainty involved in the geographic placement of autonomous systems, a few misplacements are to be taken into account.

Our work has been inspired by geographical distortion methods, such as PixelMap [84] and HistoScale [83], that reposition points and polygons. Recent work on cartographic layouts, particularly the rectangular cartograms [67] that optimize the layout of rectangles with respect to area, shape, topology, relative position, and display space utilization, influenced our work. In the latter work, a genetic algorithm has been applied to find a good compromise between these objectives. The algorithm renders the layout offline, not interactively, and does not yet exploit hierarchical structures. An overview of further rectangular cartograms can be found in [167].

Our proposal combines several layout techniques to cope with the large, multilevel AS/IP hierarchy in a tool that runs fast enough for interactive response. A geographic layout method denoted HistoMap is proposed for continents and countries; a fast one-dimensional version of HistoMap for autonomous systems aims at preserving neighborhoods in 1D; and, finally, a Strip Treemap [11] approach that preserves the input order of data is especially appropriate for IP prefixes at the lowest level.

5.2 The Hierarchical Network Map

Hierarchical Network Map is a visual exploration approach aimed at displaying the distribution of source and target data traffic of network hosts in a more expressive way than a simple density diagram like the one shown in Figure 5.2. The intention to provide an overview of

the entire Internet on the screen has forced us to treat every pixel as a valuable asset and to use a space-filling technique. Compared to node-link diagrams, our technique has three major advantages:

1. No display space is wasted.
2. Larger data sets can be visualized by mapping network size to the node's area.
3. Multi-resolution exploration is better supported, since drill-downs have only local effects on the layout.

Figure 5.2: Density histogram showing the distribution of sessions over the IP address space (axes: IP address prefix; number of connections from source)

The whole network structure can be viewed as a hierarchy with the IP addresses of single hosts as the bottom-level nodes. Each host belongs to a local IP network (or prefix), which, in turn, is a member of an autonomous system. The IP address hierarchy is extended by two additional geographic classes on top of the two network levels just mentioned, namely, countries and continents. The display rationale is to recursively nest child node rectangles in their parent rectangles, from single hosts up to local networks, ASes, countries, and continents.

Figure 5.3 illustrates this multi-resolution approach to investigating the outgoing traffic of a chosen local network within the University of Konstanz in terms of the number of packets sent from August 27 to September 4. In the world view with a drill-down on the European continent, shown in Figure 5.3(a), high network traffic targeting Germany can be observed. Figure 5.3(b), showing a continent view of Europe with a drill-down on Germany, reveals that the volume at a single autonomous system (AS 553, BelWue) in Germany exceeds the aggregated traffic to Switzerland. In the view of Germany shown in Figure 5.3(c), a drill-down on BelWue (the research network of the federal state Baden-Württemberg, Germany) confirms the expectation that our university network (a /16 prefix) receives most data packets.
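The nesting described above implies that the size of every node is the sum of the addresses of the prefixes beneath it. The following sketch shows one way such a roll-up could be computed with Python's standard `ipaddress` module. The rows are invented sample data (documentation and benchmarking address ranges with private-use AS numbers), and the function name `rollup` is an assumption of this example, not part of HNMap.

```python
import ipaddress
from collections import defaultdict

# Invented sample rows: (continent, country, AS number, prefix).
ROWS = [
    ("Europe", "DE", 64500, "192.0.2.0/24"),
    ("Europe", "DE", 64500, "198.51.100.0/25"),
    ("Europe", "CH", 64501, "203.0.113.0/24"),
    ("Asia", "JP", 64502, "198.18.0.0/15"),
]


def rollup(rows):
    """Return {path: number of IP addresses} for every node of the hierarchy.

    A path is a tuple such as ("Europe",), ("Europe", "DE"),
    ("Europe", "DE", 64500) or ("Europe", "DE", 64500, "192.0.2.0/24").
    """
    sizes = defaultdict(int)
    for row in rows:
        n = ipaddress.ip_network(row[3]).num_addresses
        # Credit the prefix size to the prefix itself and to all its ancestors.
        for depth in range(1, len(row) + 1):
            sizes[row[:depth]] += n
    return dict(sizes)


sizes = rollup(ROWS)
```

With these sizes, a layout routine only needs to look up `sizes[path]` to decide how much screen area a continent, country, AS, or prefix receives.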
In many space-filling hierarchical layouts such as Treemaps [77], the area is partitioned according to a variable statistical measure so that the sizes of the resulting rectangles correspond to the respective measure values. In our scenario, choosing the per-node traffic volume as a measure would lead to constantly changing sizes and positions of node rectangles on the display, thus hampering the user's orientation. Since the success of our technique is measured by its

applicability for continuous network monitoring and analysis, recognition and familiarity turn into critical requirements. Therefore, we map the screen area and its partitioning to a rather non-volatile measure (at least in the short run), namely, the total size of the network and its components. Given a constant aspect ratio of the display space, this design results in an almost static map layout, as network architectures normally experience little change in the short term. While the containment relationships within rectangles show the IP address hierarchy and class membership, the size of each hierarchical region, from a single IP address all the way up to the whole continent, is proportional to the number of IP addresses that region contains. Furthermore, the geographical nodes, i.e., those at country and continent level, preserve their relative geographical position (similar to a cartogram). It is this geographic awareness of the nodes that allowed us to classify our technique as a map. At the level of autonomous systems and local networks, the contained rectangles appear sorted by IP address in a left-to-right and top-down fashion, thus accelerating the visual lookup of any sub-node.

Figure 5.3: Multi-resolution approach using the Hierarchical Network Map: (a) world view, (b) continent view, (c) country view

Mapping values to colors

Since we used the visual variables area and position for mapping the network size and the respective node's position within the IP address space, respectively, we decided to use color to convey per-node traffic loads measured as a total number of bytes, packets, flows, or sessions. For this color mapping, we rely on methods already discussed in an earlier section. In the HNMap, the output of the chosen normalization function is mapped to the index positions within the red to blue color scale for each displayed rectangle.
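As an illustration of this mapping, the sketch below normalizes a traffic value and converts the result into an index position within a color scale. The function names, the `+ 1` offset that keeps zero traffic defined, and the 256-entry scale are assumptions of this example; the thesis does not prescribe them here.

```python
import math


def linear_normalize(value, vmax):
    """Map value in [0, vmax] linearly to [0, 1]."""
    return value / vmax if vmax > 0 else 0.0


def log_normalize(value, vmax):
    """Map value in [0, vmax] to [0, 1] on a logarithmic scale."""
    if vmax <= 0:
        return 0.0
    return math.log(value + 1) / math.log(vmax + 1)


def color_index(value, vmax, normalize=log_normalize, n_colors=256):
    """Return the index of the color-scale entry used for `value`."""
    t = normalize(value, vmax)
    # t == 1.0 would yield n_colors, so clamp to the last valid index.
    return min(n_colors - 1, int(t * n_colors))
```

For a node with 10 sessions in a data set whose maximum is 1000, the linear variant lands in the lowest few entries of the scale, whereas the logarithmic variant reaches roughly a third of it, so small values remain distinguishable.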
We favor the logarithmic color scale as demonstrated in Figure 5.3 due to its improved discrimination for low values and smoothing

effect on outliers. In some cases, analyzing an absolute measure of network traffic (e.g., number of sessions or transferred bytes) hardly provides any insight into the time-varying data dynamics. It may be more informative to run a backend database query to calculate the first or second derivative over time, depending on the needed level of abstraction for the analysis task, and then to visualize the query results. Further details on derived measurements are given in a separate section.

Processing the entire network data the way it is recorded in the system logs has turned out to be infeasible for the analysis. The generated amount of raw operations is simply too large to store and analyze even on high-performance workstations. For example, an attempt to apply our approach for exploring traffic at a gateway of a medium-sized university would result in storing several gigabytes of log data per hour. To avoid input data overflow, we store aggregated operations, in which the packets are grouped into sessions and the sessions of related flows, i.e., those referring to the same source and target and the same port numbers within close timestamps, are summed up. Therefore, the load recorded by such an aggregated entry is described by the number of sessions, the number of transferred packets, and the payload measured in bytes.

Netflow exploration

Our proposed visualization can be applied both in online mode for actual network traffic monitoring, and in offline mode for exploring the network's behavior in a selected time and space window. Let us consider each operation mode to reveal their constraints and implications in terms of the required input data and performance.

Offline Mode

In offline mode, historical traffic data is loaded from the database. The user specifies the type of load and the time window as compulsory filters. Optional filters can be specified or added in the process of interaction.
Suppose that a network administrator is concerned about the degrading performance of UDP traffic in the network as packets continue to get dropped at an increasing rate. He starts the investigation by choosing the target load filter to trace where the traffic was sent to, and proceeds by selecting a time period starting at the moment the problem was discovered. Finally, the protocol filter is activated to show only UDP traffic. Running the HNMap helps him to formulate a hypothesis about the target distribution of the analyzed data flows. Interactively, he discovers that most of the traffic is sent to Germany and selects another date to check whether such large amounts of traffic are typical for that target node. To verify whether the spotted traffic pattern is the result of continuous growth or appeared all of a sudden, the network administrator runs an animated view of the daily traffic from the previous month. Observation of a continuous upward trend makes it obvious that there is a need for better connectivity to the target network.
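The compulsory and optional filters of this scenario can be pictured as a query over the aggregated entries described earlier. The sketch below is purely illustrative: the record layout `FlowRecord`, the function `target_load`, and the sample values are hypothetical and do not reflect the actual database schema behind HNMap.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FlowRecord:
    """Hypothetical aggregated entry: one row per group of related flows."""
    start: datetime
    protocol: str
    target_country: str
    sessions: int
    packets: int
    payload_bytes: int


def target_load(records, begin, end, measure="payload_bytes", protocol=None):
    """Sum `measure` per target country for records with begin <= start < end.

    `begin`, `end` and `measure` play the role of the compulsory filters;
    `protocol` is an optional filter that is skipped when None.
    """
    totals = Counter()
    for r in records:
        if not (begin <= r.start < end):
            continue
        if protocol is not None and r.protocol != protocol:
            continue
        totals[r.target_country] += getattr(r, measure)
    return totals


# Invented sample data for the administrator scenario.
RECORDS = [
    FlowRecord(datetime(2005, 11, 29, 10), "UDP", "DE", 3, 100, 5000),
    FlowRecord(datetime(2005, 11, 29, 11), "TCP", "DE", 1, 10, 700),
    FlowRecord(datetime(2005, 11, 29, 12), "UDP", "CH", 2, 50, 1200),
    FlowRecord(datetime(2005, 11, 30, 1), "UDP", "DE", 9, 900, 99999),
]
udp_day = target_load(RECORDS, datetime(2005, 11, 29), datetime(2005, 11, 30),
                      protocol="UDP")
```

The resulting per-country totals are what a color mapping would then turn into rectangle colors at the country level.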

Online Mode (Monitoring)

Monitoring network flows in online mode is a challenging task due to the high data arrival rate and run-time requirements. It can be performed by refreshing the display either continuously or periodically. Continuous re-rendering imposes extreme costs since each incoming event affects the visualization. Therefore, it is imperative to define a reasonably short interval in which the visualization gets updated with newly arrived data.

In many networks, daily operations are monitored by measuring the performance of the gateway routers. We believe that extending the mere observation of traffic statistics to also take into account the source and target distribution in terms of the IP addresses involved may result in improved adjustment of the routing policies of large networks. In the long term, considerable cost savings can be achieved if the same network performance is provided with fewer hardware resources due to more intelligent routing policies.

Scaling node sizes

Big differences in sizes between IP prefixes, ASes, countries, and even continents turn visual comparison of the respective rectangles into a challenge, especially when dealing with ordinary computer displays as opposed to wall-sized displays. We opted for a compromise by scaling the IP prefix sizes (number of contained IP addresses), thus indirectly affecting the upper level aggregates, using square-root, logarithmic, or uniform scaling:

$$f_{\mathrm{sqrt}}(w_i) = \sqrt{w_i} \tag{5.1}$$

$$f_{\log}(w_i) = \log(w_i + 1) \tag{5.2}$$

$$f_{\mathrm{uni}}(w_i) = 1 \tag{5.3}$$

The effects of node size scaling are illustrated in Figure 5.4.

Figure 5.4: Scaling effects in the HNMap demonstrated on some IP prefixes in Germany: (a) linear, (b) square-root, (c) logarithmic, (d) uniform

Square-root scaling considerably reduces the size of the larger rectangles, while the latter still remain considerably larger

than the previously smaller rectangles. While logarithmic scaling displays minimal size differences, uniform scaling assigns identical sizes to all rectangles. Note that the size of any upper level rectangle (with gray borders) is determined by the sum of the scaled child nodes. As a consequence, an AS containing many small prefixes can become bigger than an AS with only a few large prefixes, although the latter contains more IP addresses.

Visual Scalability

The term visual scalability describes the capability of visualization tools to display large data sets in terms of the number or the dimensionality of data elements [45]. We designed the HNMap to be highly scalable in order to support a large tree structure (7 continents, 190 countries, autonomous systems and networks) in an effort to facilitate the overview. To what extent this explorative potential can actually be exploited depends on the available display size. Technological advancements such as large display walls enhance scalability while maintaining the overview (see Figure 5.5). The query interface on the top left shows the traffic distribution over time and specifies the selected data, which is in this case the traffic entering the gateway of the University of Konstanz on well-known ports (0-1023) on November 29, 2005, using transferred bytes as a measure with logarithmic color mapping. One recognizes a heavy traffic load from AS 3320 (red) of Deutsche Telekom as well as from neighboring ASes in Germany. A port histogram reveals high activity on the web ports 80 and 443. For security and privacy reasons, the data was aggregated and sanitized.

To separate adjacent rectangles from each other, we implemented two border schemes. In the first scheme, rectangles are visually separated solely by one-pixel thin borders, irrespective of the rectangles' resolution.
No additional borders are placed around parent rectangles in order to maximize the effective display area.

Figure 5.5: HNMap showing all ASes (data from 2005) on a large display wall (5.20 m × 2.15 m, 8.9 megapixels, powered by 8 projectors)

To improve visual perception of the hierarchical

structure represented by the borders and of the depth of rectangle partitions, the borders are highlighted with a distinct color for each granularity level (see Figure 5.6). The second scheme does not draw borders between the nodes of the lowest hierarchy level (prefixes) at all, but uses the saved space to separate nodes at upper levels, incrementing the thickness by 1 pixel per level (ASes: 1 px, countries: 2 px, continents: 3 px).

Labels are displayed in black or white, whichever offers the better contrast to the respective rectangle's background color. If a label is too big to fit into the rectangle boundaries, it is shortened, or not drawn at all if there is not enough space for at least three characters.

Figure 5.6: Improving the visibility of hierarchy levels (continent, country, autonomous system, network) through borders with a distinct color per level (borders are omitted for rectangles that are less than 2 pixels wide or high)

5.3 Space-filling layouts for diverse data characteristics

The hierarchy of the IP address space exhibits a number of particularities: the upper two hierarchy levels should be placed in a space-filling way according to geographic coordinates, and nodes with similar children need to be positioned close to each other at the AS level. The latter goal can be achieved by calculating the middle IP address of the AS and arranging the nodes according to this linear order. At the lowest hierarchy level, the linear order becomes even more meaningful as many subsequent IP prefixes share common owners or routing policies, or are semantically linked (e.g., two universities that joined the Internet at roughly the same time). To visualize the large hierarchy at hand in an interpretable way, we chose three different layout algorithms, namely the geographic HistoMap, a variant of this layout called HistoMap 1D, and the Strip Treemap.
We discovered that the hierarchy levels have different properties and impose different requirements on the space-filling layout. Therefore, we propose a combination of the above-mentioned layout algorithms. However, the preservation of the space-filling criterion comes at the expense of sacrificing geographic neighborhood, the absolute order of the nodes, and the visibility of a few small nodes, and of accepting layout changes when nodes are discarded on purpose, as there is no ideal layout algorithm to date.
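The ordering of AS nodes by their middle IP address, mentioned at the start of this section, could be computed along the following lines. The text does not spell out the exact computation at this point, so the sketch assumes the lower median of the start addresses of an AS's prefixes; the AS numbers and prefixes are invented sample data and the function names are hypothetical.

```python
import ipaddress
from statistics import median_low


def middle_address(prefixes):
    """Lower median of the prefixes' start addresses, as an integer."""
    starts = [int(ipaddress.ip_network(p).network_address) for p in prefixes]
    return median_low(starts)


def order_by_middle_address(as_prefixes):
    """Return AS numbers sorted by middle address; ties are broken by AS number."""
    return sorted(as_prefixes, key=lambda asn: (middle_address(as_prefixes[asn]), asn))


# Invented sample: private-use AS numbers with documentation/benchmark prefixes.
AS_PREFIXES = {
    64500: ["198.51.100.0/24", "203.0.113.0/24"],
    64501: ["192.0.2.0/24"],
    64502: ["198.18.0.0/15"],
}
ordered = order_by_middle_address(AS_PREFIXES)
```

Sorting by such a key, rather than by AS number, keeps ASes that announce neighboring address ranges next to each other in the one-dimensional order fed to the layout.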

Requirement analysis

The starting point of our visualization technique is a squarified treemap [16], which optimizes the aspect ratio of the resulting rectangles to produce nearly square partitions, thus improving the visibility of the mapped hierarchical structure and the comparability of rectangle areas. We extend the above approach by treating the rectangle's position with respect to its neighbors as another relevant optimization criterion and use the color attribute to express the statistical value of each rectangle representing a network or a geographic entity. During our preliminary analysis, we found that the following four criteria have to be taken into account:

1. Full space utilization: A good solution for space-filling is achieved by recursively splitting the display space into two rectangles.
2. Preserving proportions across rectangles: This criterion is fulfilled by setting the proportions of the two resulting rectangles according to the number of IP addresses they represent.
3. Geographic awareness (position): The optimal positions are used to initialize the layout, but are not considered in the further layout optimization process.
4. Aspect ratio: The aspect ratio of rectangle areas is optimized by evaluating alternative split directions.

Some of these criteria are conflicting (see Table 5.1). For instance, aspect ratio conflicts with full space utilization, since the latter cannot avoid rendering rather stretched screen partitions. The rest of this section presents the layout algorithms used.

                     Space utilization   Area   Position   Aspect ratio
Space utilization                                   x           x
Area                                                x           x
Position                     x             x                    x
Aspect ratio                 x             x        x

Table 5.1: Conflicting optimization criteria

Geographic HistoMap layout

The upper two levels of the AS/IP hierarchy consist of geographic entities.
In general, geographic visualization is often very compelling: two-dimensional maps are familiar to most people as a convention for representing three-dimensional reality. Remarkably enough, mental representations derived from maps are effective for many tasks even when extreme scales and nonlinear transformations are involved. Many approaches have been investigated for showing geographically-related as well as more abstract information on maps [40]. To meet the challenging visibility goals of our application, the proposed geographic maps rely on several kinds of abstraction: a) oceans are omitted, b) countries are represented as rectangles, c) these rectangles are sized proportionally to the number of IP addresses in the ASes

assigned to those countries, and d) these geographic entities are repositioned while seeking to preserve neighborhood relationships. Figure 5.7 shows the result of applying these abstractions using the HistoMap algorithm. Note that the unscaled number of IP addresses is used for the size of each country's rectangle.

Figure 5.7: Geographic HistoMap of the upper two levels of the IP hierarchy (size represents the number of IP addresses assigned to each country)

For the geographic layout, the underlying HistoMap algorithm operates on four arrays containing latitudes, longitudes, weights, and an index of the geographic entities. In the initial setting for the partitioning algorithm sketched in Algorithm 5.1, there is one spatial point $p_i$ for each continent, country, autonomous system, or network; $P = \{p_1, \dots, p_n\}$ where $(p_i)_{i=1,\dots,n} \in \mathbb{R}^2$. A weight $w_i \in \mathbb{R}$ is attached to each rectangle and is equal to the number of IP addresses $x_i$ represented by $p_i$. Alternatively, the outcome of the scaling functions described in Section 5.2 applied to $x_i$ can be used for $w_i$. Function area determines the area of each rectangle on the screen as follows ($d_x$ is the width and $d_y$ the height of the display):

$$\mathrm{area}(r_i) = \frac{w_i}{\sum_{j=1}^{n} w_j} \cdot d_x \cdot d_y \tag{5.4}$$

The algorithm searches for a partitioning of the display space into $|P| = n$ rectangles $R = \{r_1, \dots, r_n\}$. Each time a point set $P$ is split into $P_1$ and $P_2$, the representative rectangle $r$ is split first horizontally into two rectangles $r_1$ and $r_2$ and then vertically into $r_3$ and $r_4$. In the next step, the quality function determines the better split, depending on the average squareness of

$(r_1, r_2)$ and $(r_3, r_4)$. The position of the split line between the rectangles is determined by the sum of weights associated with each of the two point sets.

Algorithm 5.1: Partitioning algorithm.

    procedure partition(P)
    begin
        if |P| > 1 then
            (P_1, P_2) ← splitrect(P)
            partition(P_1)
            partition(P_2)
        else
            drawrect(P)
    end

    procedure splitrect(P)
    begin
        (P_1, P_2) = splithorizontal(P)
        (P_3, P_4) = splitvertical(P)
        if quality(P_1, P_2) > quality(P_3, P_4) then
            return (P_1, P_2)
        else
            return (P_3, P_4)
    end

To avoid deep unbalanced recursive calls while operating on these large arrays, we split the arrays in a way similar to the Pivot-by-middle Ordered Treemap [11]. The result of a split is that all elements on one side of the pivot element are less than or equal to the pivot element, and all elements on the other side are greater than or equal to the pivot element. This can be achieved by choosing an arbitrary pivot from the array and applying a modified version of Quicksort. In our algorithm, each successive recursive call of Quicksort is limited to that part of the array which includes the middle position, in order to obtain partitions of the same size ±1. The quality of horizontal and vertical splits in each call of splitrect can be tested by first semi-sorting the data according to longitude and then according to latitude. Obviously, the second sorting destroys the order of the first one. To guarantee the reproducibility of the partitioning, we define an absolute ordering on the data set through the use of the index array. In case of equality, the higher index determines which element is greater. For speed, our quality function assesses the squareness of the two rectangles in a greedy fashion without testing the effects of further splits through look-aheads. In general, this approach is not limited to a two-way split, but can readily be extended to three-or-more-way splits.
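The partitioning scheme of Algorithm 5.1 can be sketched in a few lines of Python. This is an illustrative reading, not the thesis implementation: it uses a full sort instead of the Quicksort-style semi-sort, assumes strictly positive weights, represents each point as a hypothetical tuple `(longitude, latitude, weight, index)`, and interprets the two split directions as a left/right cut by longitude and a top/bottom cut by latitude.

```python
def squareness(w, h):
    """1.0 for a square, approaching 0 for long, thin rectangles."""
    return min(w, h) / max(w, h) if max(w, h) > 0 else 0.0


def split(points, rect, axis):
    """Cut `points` at the middle position and `rect` by the weight share.

    axis 0 orders by longitude (left/right cut), axis 1 by latitude with
    north first (top/bottom cut). The index breaks ties reproducibly.
    """
    x, y, w, h = rect
    if axis == 0:
        ordered = sorted(points, key=lambda p: (p[0], p[3]))
    else:
        ordered = sorted(points, key=lambda p: (-p[1], p[3]))
    mid = len(ordered) // 2
    first, second = ordered[:mid], ordered[mid:]
    share = sum(p[2] for p in first) / sum(p[2] for p in ordered)
    if axis == 0:
        r1, r2 = (x, y, w * share, h), (x + w * share, y, w * (1 - share), h)
    else:
        r1, r2 = (x, y, w, h * share), (x, y + h * share, w, h * (1 - share))
    quality = (squareness(r1[2], r1[3]) + squareness(r2[2], r2[3])) / 2
    return quality, (first, r1), (second, r2)


def partition(points, rect, out):
    """Recursively assign a rectangle to every point (greedy, no look-ahead)."""
    if len(points) == 1:
        out[points[0][3]] = rect
        return
    best = max((split(points, rect, axis) for axis in (0, 1)), key=lambda s: s[0])
    for part, sub in best[1:]:
        partition(part, sub, out)


def histomap(points, width, height):
    """Return {index: (x, y, w, h)} for a display of the given size."""
    out = {}
    partition(points, (0.0, 0.0, float(width), float(height)), out)
    return out


# Invented sample points: (longitude, latitude, weight, index).
POINTS = [(-100.0, 40.0, 4.0, 0), (10.0, 50.0, 2.0, 1),
          (20.0, 0.0, 1.0, 2), (120.0, 30.0, 1.0, 3)]
layout = histomap(POINTS, 800, 400)
```

Because every cut divides the current rectangle in proportion to the weights on either side, each resulting rectangle ends up with the area given by Equation 5.4, which the sample layout can be used to verify.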
For simplicity, we used the 2-bin variant in our experiments. With respect to the use of geographic information, we realized two things: a) in some cases, neighborhood preservation for countries fails, and b) assigning autonomous systems to countries involves uncertainty; thus erroneous information can be communicated. However, until now we have found only one misplaced autonomous system. The technique's strength

lies in the possibility to show network traffic at different clearly defined granularity levels, so that detailed information can be retrieved at continent, country, autonomous system, or network level. Although mapping a network or an AS to multiple countries (either by splitting partially or by rendering twice) is an option, we refrained from it in order to keep the system interpretable.

One-dimensional HistoMap layout

To place autonomous system nodes in a spatially meaningful way, we first attempted to use AS numbers as a sorting criterion, but soon discovered that this ordering hardly reveals interesting patterns. As an alternative, we calculate the median IP address of all prefixes advertised in an AS. When applying the HistoMap 1D layout, ASes containing predominantly low IP prefixes (e.g., /8) are placed near the upper left corner, and those containing high prefixes (e.g., /16) are placed near the lower right corner. Figure 5.8 demonstrates the outcome of this layout algorithm run on the set of all autonomous systems within Germany.

The one-dimensional HistoMap layout is a simplification of the geographic HistoMap algorithm. As in the geographic version, the effects of vertical and horizontal splits of the available display space are assessed, but the split along the single one-dimensional data array is the same for both directions. A speed-up factor of 2.5 is obtained by avoiding the need to re-sort.

Figure 5.8: HistoMap 1D layout of all ASes in Germany (the measure, number of incoming connections, of each item is expressed through color)

It is also possible to apply splitting to multidimensional data. Each split then represents a

hyperplane orthogonal to the chosen split dimension. Furthermore, a look-ahead function can be applied to find better layouts, but is not considered here.

Strip Treemap layout

The Strip Treemap method was chosen as the bottom-level layout for four reasons: a) its linear layout (see Figure 5.9) preserves the order of the rectangles leading to improved readability, b) its local optimization is efficient when processing large arrays, c) it yields good aspect ratios, and d) it is relatively robust against changes [11]. In this algorithm, items of the input list are first sorted according to an index. Then, input items are iteratively added to the current strip. If the average aspect ratio of the rectangles within this strip after recalculating sizes increases, the candidate rectangle is removed from the strip, all other rectangles are finalized on the screen, and a new strip is initialized and made current. The algorithm terminates after processing all rectangles. To obtain better aspect ratios of the rectangles, especially to avoid long, skinny rectangles in the final strip, we apply an optimization proposed by Bederson et al. in [11], which consists in maintaining a look-ahead strip and moving items from this strip to the current strip if the combined aspect ratio improves.

Figure 5.9: Strip Treemap layout emphasizing the prefix order in the AS ATT-INTERNET3
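The strip-filling loop described above can be illustrated with a minimal sketch. It covers only the basic variant without the look-ahead optimization, assumes horizontal strips spanning the full display width and strictly positive weights, and all function and variable names are illustrative rather than taken from the HNMap implementation.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def _strip_rects(areas: List[float], y: float, width: float) -> List[Rect]:
    """Lay out one horizontal strip; all rectangles share the strip height."""
    height = sum(areas) / width
    rects, x = [], 0.0
    for area in areas:
        w = area / height
        rects.append((x, y, w, height))
        x += w
    return rects


def _avg_aspect(rects: List[Rect]) -> float:
    """Unweighted mean of max(w, h) / min(w, h) over the given rectangles."""
    return sum(max(w, h) / min(w, h) for _, _, w, h in rects) / len(rects)


def strip_treemap(weights: List[float], width: float, height: float) -> List[Rect]:
    """Basic Strip Treemap (no look-ahead); the input order is preserved."""
    scale = width * height / sum(weights)
    areas = [w * scale for w in weights]
    result: List[Rect] = []
    strip: List[float] = []
    y = 0.0
    for area in areas:
        if strip:
            old = _avg_aspect(_strip_rects(strip, y, width))
            new = _avg_aspect(_strip_rects(strip + [area], y, width))
            if new > old:
                # Adding the candidate worsens the strip: finalize the strip
                # and start a new one that begins with the candidate.
                finished = _strip_rects(strip, y, width)
                result.extend(finished)
                y += finished[0][3]
                strip = []
        strip.append(area)
    if strip:
        result.extend(_strip_rects(strip, y, width))
    return result


if __name__ == "__main__":
    for rect in strip_treemap([6, 6, 4, 3, 2, 2, 1], 6.0, 4.0):
        print(tuple(round(v, 2) for v in rect))
```

Because items are only ever appended to the current strip, reading the rectangles row by row reproduces the input order, which is the property that makes this layout attractive for ordered data such as IP prefixes.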

5.4 Evaluation of data-driven layout adaptation

For the experiments, we used the previously described geographic HistoMap layout for the upper two hierarchy levels to position continent and country nodes. The one-dimensional data type of the next two dimensions makes it possible to use a wide variety of layout algorithms, but we limited our research to HistoMap 1D, Strip Treemap, and their combinations due to their efficiency and consistent ordering properties. With regard to runtime performance, we measured 1.5 seconds to render the complete IP hierarchy using HistoMap 1D, 1.9 seconds for Strip / HistoMap 1D, 3.1 seconds for HistoMap 1D / Strip, and 4.6 seconds for the Strip layout. However, we still see some optimization potential when applying an array-based sorting algorithm to speed up our implementation of the Strip layout. The remainder of this section is dedicated to assessing the rectangles' visibility at the third and fourth hierarchy level, the average aspect ratio, and layout preservation when applying the HistoMap 1D, the Strip layout, or their combinations.

Visibility

Showing all data in a large hierarchy at once is difficult, especially in the presence of a large variance in size between data elements. On the one hand, we would like the display to accurately convey the sizes of various parts of the hierarchy and, on the other hand, we want to show as many items as possible including small ones. An obvious approach is to simply allocate the screen space according to the number of IP addresses in each prefix, which yields meaningful sizes for nodes at all hierarchy levels. However, this approach does not cope well with large variances in prefix lengths, ranging from $2^{0}$ (subnet mask /32) to $2^{24}$ (subnet mask /8). One improvement can be achieved by realizing that some IP prefixes (networks) are contained in others.
Borders for deeply nested hierarchy levels are costly in terms of display space [143], so we can render overlapping prefixes adjacent to each other, assigning values (such as packet counts) to the most specific prefix, which is a common approach when analyzing routing data. If display space has to be assigned to approximately 2 billion IP addresses, this still works out to half a pixel per 1000 IP addresses on a conventional megapixel display or about 5 pixels on a 9 megapixel powerwall. To further reduce the visibility problem, the size of the nodes corresponding to IP prefixes at the lowest hierarchy level can be normalized using square root, logarithmic, or uniform scaling. In practice, with square root normalization we were unable to show 657 prefixes using the optimal layout at all levels on a pixel screen (net resolution: pixels). Logarithmic normalization $f_{\log}(w_i) = \log_2(w_i + 1)$ combined with two heuristics can show all the prefixes. These heuristics are a) assign a minimum width and height of 1 pixel to each rectangle and b) give precedence to borders of larger rectangles but omit borders of smaller rectangles when there is insufficient space. As labeling is not possible for many rectangles on the screen, detailed information, such as the values of statistical attributes and the path to the root of the hierarchy, is shown when the mouse hovers over a rectangle. Figure 5.10 demonstrates the outcome of the algorithms applied to the whole IPv4 address space with logarithmically scaled network sizes using the HistoMap 1D layout for the 3rd and 4th hierarchy level. Note that large parts of the screen appear blue partly because there is no

traffic to these networks and partly because we abstained from drawing borders on the lowest level. Table 5.2 shows the outcome of our experiments.

Figure 5.10: Anonymized outgoing traffic connections from the university gateway on November 29, 2005 showing all IP prefixes

Algorithm (3rd/4th level) | AS (23054) | Prefixes (197427)
HM 1D | 0.00 % | 0.00 %
Strip | 0.03 % | 0.46 %
Strip / HM 1D | %
HM 1D / Strip | %

Table 5.2: Percentage of invisible rectangles of the IP hierarchy

Finally, it is admittedly very hard to analyze many randomly placed one-pixel rectangles. By removing insignificant nodes with little or no traffic from the hierarchy, more space can be allocated to other items.
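To make the effect of the size normalizations discussed in this subsection concrete, the following sketch compares how a /8 and a /32 prefix are weighted under linear, square root, and logarithmic scaling, and applies the one-pixel minimum heuristic. The helper names (`prefix_size`, `screen_shares`, `clamp_to_pixels`) are hypothetical, and rounding to whole pixels is an assumption made for illustration.

```python
import math
from typing import Callable, List, Tuple


def prefix_size(prefix_len: int) -> int:
    """Number of IPv4 addresses covered by a prefix of the given length."""
    return 2 ** (32 - prefix_len)


def f_linear(w: float) -> float:
    return float(w)


def f_sqrt(w: float) -> float:
    return math.sqrt(w)


def f_log(w: float) -> float:
    # Logarithmic normalization as given in the text: log2(w + 1).
    return math.log2(w + 1)


def screen_shares(sizes: List[int], normalize: Callable[[float], float]) -> List[float]:
    """Fraction of the available area assigned to each item after normalization."""
    scaled = [normalize(s) for s in sizes]
    total = sum(scaled)
    return [s / total for s in scaled]


def clamp_to_pixels(width: float, height: float) -> Tuple[int, int]:
    """Heuristic a): every rectangle is at least 1 pixel wide and high."""
    return max(1, round(width)), max(1, round(height))


if __name__ == "__main__":
    sizes = [prefix_size(8), prefix_size(32)]
    for name, fn in (("linear", f_linear), ("sqrt", f_sqrt), ("log", f_log)):
        big, small = screen_shares(sizes, fn)
        print(f"{name:6s} ratio /8 : /32 = {big / small:.1f}")
    print(clamp_to_pixels(0.2, 3.6))
```

The ratio between the largest and the smallest prefix shrinks by several orders of magnitude under logarithmic scaling, which is why that normalization leaves room for small prefixes to remain visible.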

Average rectangle aspect ratio

As a means to evaluate the squareness of the rectangles, we measure the unweighted arithmetic average of the aspect ratios:

$$\text{aspect ratio} = \frac{1}{N} \sum_{i} \frac{\max(w_i, h_i)}{\min(w_i, h_i)} \qquad (5.5)$$

Table 5.3 shows that the AS-level hierarchy is more difficult to render with respect to squareness. This is explainable through the large variances in size: although the size of bottom-level rectangles was scaled logarithmically, the size of each upper level node is calculated by summing up the sizes of all its child nodes. Better results are generally achieved by HistoMap 1D layout, but the combination of Strip and HistoMap 1D is promising due to good aspect ratios. Here, better results are determined by the better layout (HistoMap 1D) for the lowest level, since the latter contains by far more nodes than the AS level.

Table 5.3: Average aspect ratio of rectangles of the IP hierarchy (columns: AS (23054), Prefixes (197427); rows by algorithm at the 3rd/4th level: HM 1D, Strip, Strip / HM 1D, HM 1D / Strip)

Layout preservation

Showing all details at once is not always advisable from the perceptual point of view. We thus load a one-day data set of network traffic from a particular gateway into the hierarchy and proceed by continuously removing nodes in the order of least traffic, the ratio between the remaining empty nodes and all nodes within the parent node, and the lowest index. This procedure helps to stay close to the real data distribution, to avoid a randomized sampling function, and to create a well-defined node removing order that is tested with all layouts. Figure 5.17(c) demonstrates the effect of removing nodes where traffic falls below a certain threshold, compared to the original state in Figure 5.17(b). We discovered that the layout distance change metric proposed by Bederson et al. [11] obscured the effects that can be seen in the absolute rectangle side changes due to larger positional changes.
To properly assess layout preservation, we split the evaluation into two parts: 1) evaluation of positional changes and 2) evaluation of absolute rectangle side changes.

Positional changes

In the first part, we evaluate the positional change as measured from the center of each original rectangle $r_1$ to the center of its corresponding rectangle $r_2$ in the projected layout of the reduced hierarchy.

$$d_{pos}(r_1, r_2) = \sqrt{\left(\left(x_1 + \tfrac{w_1}{2}\right) - \left(x_2 + \tfrac{w_2}{2}\right)\right)^2 + \left(\left(y_1 + \tfrac{h_1}{2}\right) - \left(y_2 + \tfrac{h_2}{2}\right)\right)^2} \qquad (5.6)$$

Figure 5.11: Average position change (x-axis: fraction of original nodes; y-axis: average position change; series: AS (23054) with HistoMap 1D and Strip; Prefix (197427) with HistoMap 1D, Strip, HistoMap 1D/Strip, and Strip/HistoMap 1D)

Strong correlation between the red and the blue lines as well as between the green and the black dashed lines in Figure 5.11 clearly shows that the layout algorithm at the third hierarchy level is more significant than the one at the fourth level. HistoMap 1D layout is thus preferred over Strip layout for the third level due to its better preservation of the original rectangle positions. For the fourth hierarchy level, we could not find any significant difference between the two layouts with respect to positional changes. The unexpectedly good results for 50 % of the prefix nodes (dashed lines) under all layout algorithms can be explained through a major change towards 40 % in the geographic layout of the first and the second level due to unavoidable recalculation of the weights of the upper level nodes.

Absolute rectangle side changes

Since layout change is expressed not only through positional changes of the nodes, but also through changing width and height of the rectangles, we decided to evaluate the absolute rectangle side change:

$$d_{side}(r_1, r_2) = \sqrt{(w_1 - w_2)^2 + (h_1 - h_2)^2} \qquad (5.7)$$

The results of calculating the average side change for all layouts, shown in Figure 5.12, suggest that rectangle sides change drastically, especially in those cases where only a small fraction of the original nodes is rendered. On the one hand, this effect is explainable through

the fact that more space is allocated to the remaining rectangles and their sides therefore become considerably bigger. On the other hand, the evaluation shows that the HistoMap 1D layout (black) is superior to the Strip layout (red) both at the AS (solid lines) and at the prefix (dashed lines) level. The other two layouts, which are combinations of HistoMap 1D and Strip layout and vice versa, are in between HistoMap 1D and Strip layout at the prefix level. With more than 60 % of the original nodes, these two layouts are hardly distinguishable from HistoMap 1D with respect to the proposed quality measure average side change.

Figure 5.12: Average side change (x-axis: fraction of original nodes; y-axis: average side change; series: AS (23054) with HistoMap 1D and Strip; Prefix (197427) with HistoMap 1D, Strip, HistoMap 1D/Strip, and Strip/HistoMap 1D)

Summary

Within this evaluation section, we demonstrated the benefits of HistoMap 1D over Strip Treemap layout in several ways. First, the latter is not capable of showing all nodes of our test data hierarchy at the target screen resolution due to its sequential display space partitioning. Second, HistoMap 1D yields a better average rectangle aspect ratio at both the AS and the prefix hierarchy level. Third, when filtering out parts of the hierarchy to focus the view on an area of interest, HistoMap 1D does a considerably better job of preserving the original layout. We also demonstrated that a combination of the two algorithms is promising, especially due to the line-wise reading order of the Strip layout. In case of a semantic ordering, like the prefix order, the layout might reveal sequential scanning patterns or failures (continuous or discontinuous values in subsequent prefixes).
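The three quality measures used in this evaluation (average aspect ratio, positional change, and side change) can be computed directly from rectangle geometry. The sketch below is one possible reading of them: it assumes rectangles given as (x, y, width, height) with (x, y) denoting a corner, and it treats both change measures as Euclidean distances. The function names are illustrative.

```python
import math
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def average_aspect_ratio(rects: List[Rect]) -> float:
    """Unweighted mean of max(w, h) / min(w, h), cf. Equation (5.5)."""
    return sum(max(w, h) / min(w, h) for _, _, w, h in rects) / len(rects)


def d_pos(r1: Rect, r2: Rect) -> float:
    """Distance between the centers of two rectangles, cf. Equation (5.6)."""
    x1, y1, w1, h1 = r1
    x2, y2, w2, h2 = r2
    return math.hypot((x1 + w1 / 2) - (x2 + w2 / 2),
                      (y1 + h1 / 2) - (y2 + h2 / 2))


def d_side(r1: Rect, r2: Rect) -> float:
    """Change of width and height between two rectangles, cf. Equation (5.7)."""
    _, _, w1, h1 = r1
    _, _, w2, h2 = r2
    return math.hypot(w1 - w2, h1 - h2)


def average_change(before: List[Rect], after: List[Rect], metric) -> float:
    """Mean of a pairwise metric over corresponding rectangles of two layouts."""
    return sum(metric(a, b) for a, b in zip(before, after)) / len(before)


if __name__ == "__main__":
    original = [(0.0, 0.0, 2.0, 2.0), (2.0, 0.0, 4.0, 1.0)]
    reduced = [(3.0, 4.0, 2.0, 2.0), (0.0, 0.0, 1.0, 5.0)]
    print(average_aspect_ratio(original))
    print(average_change(original, reduced, d_pos))
    print(average_change(original, reduced, d_side))
```

Keeping positional and side changes as separate functions mirrors the decision in the text to evaluate them separately instead of combining them into a single layout distance.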

5.5 User-driven data exploration

Interactivity is the key to many visual analytics applications, and our HNMap is by no means an exception. The analyst chooses which region of the network traffic data set should be investigated in more detail and then continues the explorative search for meaningful patterns depending on the results of the previous steps.

Filtering

Further characteristics of network flows are implemented as compulsory and optional filters:

1. Compulsory filters:
   a) Type of load: Since each IP address functions both as source (sending packets) and as target (receiving packets), the network load to display can refer to either the packets sent, or the packets received, or their total.
   b) Time window: Each visualization instance displays the traffic within a specified time interval.
2. Optional filters:
   a) Port or port cluster: Single ports or their groupings can be used as filter to display only the portion of traffic occurring at those ports (for instance, showing the outgoing load by selecting the source port 25 (SMTP) or the cluster of all common ports 25, 110, 143, 465, 993, 995).
   b) Protocol: It might be useful to filter the traffic of a particular protocol, for instance to separate UDP from TCP traffic.
   c) Node relevance: A relevance slider introduces a visibility threshold by removing low-traffic nodes up to the specified threshold. Empty nodes can be removed by using a checkbox.

In the current implementation, most of these filters are entered as text and form part of an SQL query. In the future, these filters might be integrated in a more user-friendly way through appropriate GUI components.
However, the current implementation has proven to be very flexible when dealing with different data sets with IP address as the only dimension common to them all.

Navigation within the map

Explorative tasks are enabled through typical OLAP database operations such as drill-down (disaggregation), roll-up (aggregation), and slice & dice (filtering). All these basic interactions are mapped to the mouse events: rolling the mouse wheel performs drill-down and roll-up, double click triggers a slice & dice operation and single click opens a context-sensitive popup menu. This popup menu offers support for untrained users as well as access to some expert features such as multiple selection or multiple drill-down (roll-up) operations on the nodes of the same level.
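Both the text-based filters and the OLAP operations mentioned in this section map naturally onto SQL: roll-up and drill-down correspond to grouping by a coarser or finer hierarchy level, and slicing corresponds to a WHERE condition. The following sketch illustrates this with an in-memory SQLite database. The table layout, the column names, and the sample rows are assumptions for illustration and do not reflect the actual HNMap data warehouse schema.

```python
import sqlite3

LEVELS = ("continent", "country", "asn", "prefix")

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE flows (continent TEXT, country TEXT, asn INTEGER, "
    "prefix TEXT, dst_port INTEGER, proto TEXT, bytes INTEGER)"
)
con.executemany(
    "INSERT INTO flows VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        # Documentation prefixes and private-use AS numbers as sample data.
        ("Europe", "DE", 64512, "192.0.2.0/24", 25, "TCP", 1000),
        ("Europe", "DE", 64512, "192.0.2.0/24", 80, "TCP", 500),
        ("Europe", "DE", 64513, "198.51.100.0/24", 53, "UDP", 200),
        ("Asia", "CN", 64514, "203.0.113.0/24", 25, "TCP", 300),
    ],
)


def aggregate(con, level, where="1=1", params=()):
    """Sum the measure at one hierarchy level, restricted by a text filter."""
    if level not in LEVELS:
        raise ValueError(f"unknown hierarchy level: {level}")
    sql = (
        f"SELECT {level}, SUM(bytes) FROM flows "
        f"WHERE {where} GROUP BY {level} ORDER BY {level}"
    )
    return con.execute(sql, params).fetchall()


if __name__ == "__main__":
    # Roll-up: aggregate all traffic at country level.
    print(aggregate(con, "country"))
    # Slice & dice: restrict to one country, then drill down to AS level.
    print(aggregate(con, "asn", "country = ?", ("DE",)))
    # Optional filters: port cluster and protocol.
    print(aggregate(con, "continent",
                    "dst_port IN (25, 110, 143, 465, 993, 995) AND proto = ?",
                    ("TCP",)))
```

Validating the level name against a fixed list and passing values as bound parameters keeps the text-based filter flexible without interpolating arbitrary user values into the query.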

Figure 5.13: HNMap interactions, each shown before (left) and after (right): (a) drill-down, (b) roll-up, (c) drill-down siblings, (d) roll-up siblings, (e) slice & dice, (f) pruning empty nodes

A drill-down replaces a node rectangle with its child rectangles as shown for the red-colored rectangle in the top-left corner of Figure 5.13(a). This interaction is especially useful when exploring traffic distribution details and the need to compare nodes at different granularity levels comes up. The inverse operation of a drill-down is roll-up, which replaces a group of nodes with their common parent node at an upper level (see Figure 5.13(b)). For faster exploration, these two operations are also available on all sibling (i.e., belonging to the same hierarchy level) nodes as demonstrated in Figures 5.13(c) and 5.13(d). Similarly to a zoom function, a double-click on a rectangle triggers the slice & dice interaction, which drills down within the selected node while removing all other nodes from the display as depicted in Figure 5.13(e). Another elegant way of allocating more screen space to relevant nodes is to simply exclude empty nodes from the presentation as shown in Figure 5.13(f) (available via the popup menu and keyboard shortcut). For fast navigation, HNMap offers predefined views of all continents, countries, ASes, and prefixes, which can also be applied to a previously defined subset of network entities (e.g., showing all prefixes within Germany). In addition to that, the analyst is free to export the current HNMap view as a PNG or PDF image for inclusion into security reports.
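The interactions described above can be thought of as operations on the list of currently displayed hierarchy nodes. The following sketch models them on a small tree. The `Node` class, the function names, and the sample hierarchy are hypothetical and only illustrate the semantics of drill-down, roll-up, slice & dice, and pruning of empty nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    name: str
    traffic: int = 0
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

    def total(self) -> int:
        """Traffic of this node including all of its descendants."""
        return self.traffic + sum(c.total() for c in self.children)


def _is_below(node: Node, ancestor: Node) -> bool:
    p = node.parent
    while p is not None:
        if p is ancestor:
            return True
        p = p.parent
    return False


def drill_down(view: List[Node], node: Node) -> List[Node]:
    """Replace a displayed node with its children (no-op for leaf nodes)."""
    if not node.children:
        return list(view)
    out: List[Node] = []
    for n in view:
        out.extend(node.children if n is node else [n])
    return out


def roll_up(view: List[Node], node: Node) -> List[Node]:
    """Replace all displayed descendants of node's parent with that parent."""
    parent = node.parent
    if parent is None:
        return list(view)
    out, placed = [], False
    for n in view:
        if _is_below(n, parent):
            if not placed:
                out.append(parent)
                placed = True
        else:
            out.append(n)
    return out


def slice_and_dice(node: Node) -> List[Node]:
    """Show only the selected node, drilled down one level."""
    return list(node.children) or [node]


def prune_empty(view: List[Node]) -> List[Node]:
    """Remove nodes without any traffic from the display."""
    return [n for n in view if n.total() > 0]


# Sample hierarchy: continent -> country -> AS.
world = Node("world")
europe, asia = world.add(Node("Europe")), world.add(Node("Asia"))
de, fr = europe.add(Node("DE")), europe.add(Node("FR", traffic=3))
as_a, as_b = de.add(Node("AS-A", traffic=5)), de.add(Node("AS-B"))
cn = asia.add(Node("CN", traffic=7))
view = [europe, asia]

if __name__ == "__main__":
    detailed = drill_down(view, europe)
    print([n.name for n in detailed])
    print([n.name for n in roll_up(detailed, de)])
    print([n.name for n in prune_empty(slice_and_dice(de))])
```

Each function returns a new list instead of modifying the view in place, which would make it straightforward to keep a history of views for stepping back during exploration.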

Descending to the pixel level

The technical limit of a visualization is reached when every pixel is used for displaying distinct values, where the latter are mapped to the pixel's color. Further attributes of the data points can be mapped to the pixel's coordinates. Pixel-based visualization builds up the finest granularity view of the Hierarchical Network Map, namely, the behavior of single hosts. For instance, the pixel visualization can be employed for determining whether network traffic comes from just a limited number of IP addresses or is scattered over the considered network. Pixels are arranged according to the recursive pattern, already introduced in Chapter 4, starting at the top-left corner and descending row by row with alternating forward-backward direction for better cluster preservation. Thereby, the displayed IP addresses appear sorted in descending order. A large display wall with 8.9 megapixels is potentially capable of showing up to 8.9 million IP addresses by mapping each IP address to one pixel. If the number of pixels is insufficient, distinct IP addresses have to be replaced by groups of neighboring IP addresses by aggregating their traffic. Figure 5.14 shows active hosts within the network in the recursive pattern pixel visualization, thereby revealing their distribution behavior. The image on the left shows a pattern of a simulated network scan affecting every 10th IP address, whereas the image on the right reveals a pattern with only very few target hosts. Moving the cursor over the pixel (or its surrounding region) triggers the appearance of the represented IP address label.

Figure 5.14: Recursive pattern pixel visualization showing individual hosts

Colormap interactions

In our technique, the visual variable color is used to convey the values of the specified measure for each visible node.
Employing a bipolar color scale (blue to red over white as the switching point) supports the observation that not only the existence of high load values (dark red) can imply interesting findings, but also the non-existence of any traffic (dark blue). Use of square root and logarithmic color scales helps to make the visualization more resistant to outliers, but makes exact comparison of quantitative values more difficult. For exact comparison, linear mapping is therefore also available. Through mouse interaction, the white transition point of the bipolar colormap can be moved in order to facilitate comparative analysis [3]. It is furthermore possible to select a particular node on the map: thereby, white color is assigned to this rectangle and the transition point of

the color scale gets adapted accordingly. Nodes with higher values appear in shades of white and red while nodes with lower values appear in blue and white.

Figure 5.15: Multiple map instances facilitate comparison of traffic of several time spans

Interactions with multiple map instances

In many analysis scenarios, it is not only the traffic volume targeting or leaving a particular network entity that matters, but also its gradual change. To visualize the changes in traffic, multiple map instances can be used as demonstrated in Figure 5.15. Moving the mouse over one of the instances causes a label with the actual traffic values to be displayed in all other instances, facilitating the interpretation of the absolute values of that particular entity in the IP hierarchy. Underneath each instance, there is a label indicating the covered time span. Note that the label position gives an additional hint about the chronology of the map instances. An alternative approach is to animate the HNMap display using the animation interface shown in Figure 5.16. The analyst can simply click through the HNMap instances in forward and reverse chronological order or play them as an animation. Obviously, more screen space is allocated to each map instance in this second visualization approach. While instances of the same rectangle are always placed at the identical spot on the map, it is very difficult to simultaneously monitor several rectangles scattered all over the map. Note that in order to make the maps comparable to each other, the same color mapping scheme needs to be applied to all of them. This is done by calculating the global maximum

value over all nodes in all map instances prior to the invocation of the specified scaling function. Similarly, only those empty nodes that remain free of traffic throughout all map instances may be removed from the series of map instances.

Figure 5.16: Interface for configuring the animated display of a series of map instances

Triggering of other analysis tools

HNMap has proven to be a powerful tool to analyze the IP dimension of network traffic with regard to a previously defined measure. However, network traffic consists of several dimensions, which can be enriched with additional information from intrusion detection systems or other sources. Therefore, a combination of various visualization tools is necessary for exploring network traffic particularities. One possibility to combine HNMap with other analysis tools is to further analyze the traffic of a particular node on the HNMap using bar charts of time, host, and port activity, the Radial Traffic Analyzer (Section 7.2), and the Behavior Graph (Section 8.2), with the respective tool accessed via the pop-up menu called from the context of the node underneath the mouse cursor.

5.6 Case studies: analysis of traffic distributions in the IPv4 address space

To demonstrate the adequacy of our proposed technique, we conducted a series of case studies with the input data obtained from a production web server, a university gateway router, and a large service provider. In all scenarios, the HNMap was the central exploratory tool. Other graphical representations were generated from the data specified on the map.

Case study I: Resource location planning

Positive customer experience is crucial to the success of most businesses. Today, off-the-shelf web analytics software can show the geo-locations of customers who visit a web site. This information can be helpful for inferring customer demographics to optimize logistics or marketing strategies.
Logically, we would also be interested in the ASes from which the customers access a web server, in order to study the consequences of the placement of customer-facing web servers.

Figure 5.17: Visual exploration process for resource location planning: (a) specifying visualization parameters in the HNMap loading interface, (b) HistoMap 1D showing all ASes, (c) HistoMap 1D showing the ASes above threshold, (d) interactive bar chart with a detail slider for hiding low value entries

To conduct this analysis, we processed the Apache log files of our web server and loaded the IP address, a timestamp, and the transferred byte count of each web request into a data warehouse. The system connects to a database within a server and probes the set of available tables, which are prejoined with the IP network tables to save time during interactive exploration. Choosing the most appropriate measure is the key to success for any analysis, and in this case we used the sum of transferred bytes to weight IP addresses. An optional filter can be invoked to ignore large transactions such as huge multimedia downloads that might otherwise skew the results. A checkbox enables removal of ASes without significant traffic. Other potentially distracting nodes with little traffic can be reduced with the detail slider (see Figure 5.17(a)). Figures 5.17(b) and 5.17(c) show IP addresses of clients who visited the web server between November 28, 2006 and March 16, 2007, aggregated by AS. They clearly highlight the significance of the BELWUE and the DTAG systems, both within the green country rectangle representing Germany. At the same time, the high volume requested by webcrawlers such as Google and Yahoo (Inktomi) as identified by their ASes is obvious. A right-click on any node brings up a context menu to either navigate within the hierarchy or trigger other graphical representations of detailed data. Figure 5.17(d) shows a logarithmically scaled bar chart visualizing the previously defined measure built upon the data set selected on the map.
This kind of analysis may be conducted in many other resource planning scenarios, such as choosing an optimal Internet service provider (ISP) for the intranet of a widely-spread company or placing a shared database server at a favorable location.
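The preprocessing step of this case study (extracting the client IP address and the transferred byte count from web server logs and summing the bytes per AS) can be sketched as follows. The sketch assumes log lines in the Apache Common Log Format. The prefix-to-AS table, the AS names, and the sample log lines are invented for illustration, and the lookup picks the most specific matching prefix.

```python
import ipaddress
import re
from collections import defaultdict

# Client IP, timestamp, status code, and byte count of a Common Log Format line.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" (\d{3}) (\d+|-)')

# Hypothetical prefix-to-AS table (documentation prefixes, private-use ASNs).
PREFIX_TO_AS = {
    ipaddress.ip_network("192.0.2.0/24"): "AS64512",
    ipaddress.ip_network("192.0.2.128/25"): "AS64513",
    ipaddress.ip_network("198.51.100.0/24"): "AS64514",
}


def lookup_as(ip: str):
    """Return the AS of the most specific prefix containing the address."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in PREFIX_TO_AS if addr in net]
    if not matches:
        return None
    return PREFIX_TO_AS[max(matches, key=lambda net: net.prefixlen)]


def bytes_per_as(lines, max_bytes=None):
    """Sum transferred bytes per AS, optionally ignoring very large transfers."""
    totals = defaultdict(int)
    for line in lines:
        m = LOG_RE.match(line)
        if not m or m.group(4) == "-":
            continue
        size = int(m.group(4))
        if max_bytes is not None and size > max_bytes:
            continue
        asn = lookup_as(m.group(1))
        if asn is not None:
            totals[asn] += size
    return dict(totals)


SAMPLE_LOG = [
    '192.0.2.10 - - [28/Nov/2006:10:00:00 +0100] "GET /index.html HTTP/1.1" 200 5000',
    '192.0.2.200 - - [28/Nov/2006:10:00:01 +0100] "GET /a.png HTTP/1.1" 200 700',
    '198.51.100.7 - - [28/Nov/2006:10:00:02 +0100] "GET /video.avi HTTP/1.1" 200 900000000',
    '198.51.100.7 - - [28/Nov/2006:10:00:03 +0100] "GET / HTTP/1.1" 304 -',
]

if __name__ == "__main__":
    print(bytes_per_as(SAMPLE_LOG))
    print(bytes_per_as(SAMPLE_LOG, max_bytes=1_000_000))
```

The `max_bytes` parameter plays the role of the optional filter for large transactions mentioned above: with it, a single huge download no longer dominates the per-AS totals.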

Case study II: Monitoring large-scale traffic changes

Another HNMap application is concerned with network traffic monitoring. The two main matters of interest here are 1) to track and predict traffic volumes within one or several interoperating networks to keep the network infrastructure running well and 2) to react quickly in order to recover from failures. For this case study, we used one day of netflows captured at our university gateway and anonymized by removing the internal IP address to ensure privacy. We started by specifying the relevant data, such as the number of failed connections (including those that did not contain any data transfer). This gives an indication of malfunctions or scanning activities: many IP addresses within the probed network are unassigned or protected by a firewall and thus do not reply to the packets sent by scanning computers. Figure 5.18 shows the first derivative of failed connections over time (four successive map instances) aggregated by country. The intense background color of the rectangles in Asia (upper right) attracts attention and raises the question of what might have caused the high number of failed connections in the early morning (first strip). In this scenario, the color was chosen to show change: white for little or no change, blue for decrease, and red for increase of traffic. Opening a bar chart for each of the suspicious countries reveals a conspicuously large number of different destination ports characterizing the traffic from China, which would under normal circumstances imply the invocation of a large variety of application programs. For a more thorough analysis, we opened the Radial Traffic Analyzer (RTA) to correlate the source IP addresses as the origin of this traffic with the used ports.
With the IP addresses contributing relatively small shares of traffic removed through a slider, the resulting view shown in Figure 5.19 stresses that two IP addresses ( and ) were involved in port scanning activities, as seen in the colorful patterns on the outer ring.

Figure 5.18: Monitoring traffic changes: failed incoming connections at the university gateway on November 29, 2005 (square root normalization of the colormap alleviates the visual impact of outliers)

The RTA interface is capable of drawing one ring for each analyzed dimension (e.g., source & destination IP, application port, event type, etc.) of the data set. All rings are grouped by the dimensions of the

inner rings and sorted according to their own dimensional values. Interactive rearrangements of these rings help the analyst to explore the data and gain deeper insight into it. For more details on RTA, refer to Section 7.2.

Figure 5.19: Employing Radial Traffic Analyzer to find dependencies between dimensions

Such a scenario indicates that HNMap is capable of showing the aggregates of the selected measure or, alternatively, its first or second derivative over time. Navigational operations within the map, such as drilling down to deeper hierarchy nodes or rolling up to higher level aggregates, are limited to the IP address dimension. Navigating along this dimension is often not adequate or not sufficient for finding a needle in the haystack. We therefore supplement HNMap with graphical data representations such as bar charts and the RTA.

Case study III: Botnet spread propagation

Between July and December 2006, Symantec observed an average of 63,912 active bot-infected computers per day [151]. Compared with the total number of computers on the Internet, this number is low. However, one should not forget the potential damage these hijacked computers might inflict when remotely controlled to collectively attack a commercial or governmental web server. For this case study, we used a signature-based detection algorithm, which exploits the knowledge about known botnet servers to collect bot-infected IP addresses. Without the visualization, we could only see that the list of IP addresses was continually changing, but we could neither map this change to the network infrastructure nor build a mental representation of what was going on. Therefore, the goal of our analysis was to identify the IP prefixes, ASes, or countries with a high infection rate and to investigate whether they build a cluster at

a higher level of the hierarchy. The captured data was stored in a data warehouse and loaded into HNMap. The HNMap view disclosed the severity of the spread, drawing attention to the red rectangle representing China in the country view. Animating the map over time (one image per day) helps to confirm the dynamics of the developments in China. Figure 5.20 shows the results of zooming in to detailed IP address ranges grouped by AS (yellow borders) to identify which prefixes and ASes played a major role in the infection. We can observe that the infection was widely spread in the large AS on the upper left and in the smaller and thinner AS on the lower left. Furthermore, a few red colored prefixes outside these ASes probably contributed considerably to the spread of the worm. With this knowledge, network operators can adapt firewall configurations to block or filter traffic from the affected IP prefixes.

Expert feedback

HNMap visualization was presented to the security experts at our university and at AT&T. From the university expert's perspective, it was very important to show detailed host activities within a network, an autonomous system, or a country, for the task of comparing the number of connections with the transferred bytes of active hosts to find suspicious patterns. After getting this valuable feedback, we integrated bar charts as well as the Radial Traffic Analyzer showing host and port activity into our application. An informal evaluation by AT&T's visualization and domain experts who tested HNMap yielded numerous improvements, such as an interactive feature to move the transition point in a bi-color scale, time-series animation, linked parallel instances of the map, and intelligent label placement, to name a few. The experts showed particular interest in the botnet spread and worm propagation as demonstrated in Case Study III.
They see the potential to explore their vast amount of IDS events to gain deeper insight into malicious network traffic. Further use of the tool in AT&T's global networks operations center (GNOC) is currently under discussion.

5.7 Summary

In this chapter, we presented the Hierarchical Network Map as a hierarchical space-filling mapping method for visualizing traffic of IP hosts. Our work can be seen as a novel application in the field of computer security visualization. To the best of our knowledge, such large amounts of IP-related network traffic data have never been shown in a geographically aware hierarchical and space-filling context. The proposed algorithm is an application of the space-filling variant of the RecMap [67] algorithm which was modified to fit the analysis needs with respect to one-dimensional (IP addresses) and hierarchical (network hierarchies) data. The approach aims at finding a compromise between four optimization criteria: using display area to represent network size, full utilization of the screen space, position awareness, and aspect ratio. The performance challenges are mastered by employing OLAP cube operations commonly used in data warehouses to efficiently aggregate over large volumes of detailed data. To adapt the visualization to the large hierarchy at hand, we used one-pixel-thin colored

borders between the nodes of each level or, alternatively, a zero- to three-pixel-thin border, incrementing the border's thickness at each upper hierarchy level. In our experiments, we compared two layout algorithms and their combinations against each other with respect to visibility, average rectangle aspect ratio, and layout preservation. While our proposed algorithm HistoMap 1D was superior to the Strip layout, we still see valid use cases that justify usage of the latter due to its intuitive sorting order. Furthermore, combinations of the two layouts (a different one for each level) have shown that some desirable features of HistoMap 1D can be retained to a certain degree. Apart from generating insightful graphics, the HNMap tool offers a variety of user interactions to explore large netflow and IDS data sets. Mouse interactions are implemented to facilitate explorative tasks on the map, to adapt the color mapping to better highlight relevant patterns, and to trigger other analysis tools on the data specified on the map. In order to show the tool's applicability, three case studies were presented with traffic data from the university's gateway router, from a webserver, and IDS data of botnet IPs. These case studies demonstrated that visual exploration of large IP-related data sets can reveal valuable insights into network security issues.

Future work

In the future, we hope to improve the proposed layouts at the AS level by exploiting information about border gateway router connectivity. Methods for avoiding deformation of small rectangles that are placed next to large ones in the Geographic HistoMap and HistoMap 1D layouts may also be beneficial. Because this study emphasized the importance of the upper hierarchy levels to layout stability, we might also take a closer look at the extent to which sacrificing squareness in the geographic hierarchy levels improves layout stability.
Furthermore, a search feature is planned in order to find countries, ASes, prefixes, and hosts on the map.

(a) Day 1 (b) Day 3 (c) Day 5 (d) Day 7 (e) Day 9
Figure 5.20: Rapid spread of botnet computers in China in August 2006 as seen from the perspective of a large service provider, with yellow boxes grouping prefixes by AS (prefix labels were anonymized for privacy reasons)


6 An end-to-end view of IP network traffic

"By three methods we may learn wisdom: first, by reflection, which is noblest; second, by imitation, which is easiest; and third, by experience, which is the bitterest."
Confucius

Contents
6.1 Related work on network visualizations based on node-link diagrams
    Abstract graph visualizations
    Geographic graph visualizations
6.2 Linking network traffic through hierarchical edge bundles
    Hierarchical edge bundles
    Edge Coloring
    Interaction design
    Data Simulation
6.3 Case study: Visual analysis of traffic connections
6.4 Discussion
6.5 Summary
    Future Work

As discussed in the previous chapter, Hierarchical Network Maps are an approach to the presentation of IP-related measurements on the global Internet. This pixel-conservative approach is appropriate considering the scarcity of the display space when showing about IP prefixes at once. However, in the previous chapter, the network statistics were observed at a single vantage point (arriving at or leaving from a particular gateway) and displayed according either to source or to destination in the IP address space. In this study, we consider traffic being transferred through multiple routers, such as in service provider networks. The use of edge bundles enables visual display and detection of patterns through accumulative effects of the bundles while avoiding visual clutter. The novelty of this approach lies in its capability to support the analyst in the formation of a mental representation of the network traffic situation through visualization. Each autonomous system or IP prefix is placed on a map while nodes are linked according to their network traffic relationships.

The remainder of this chapter is structured as follows. We briefly discuss related work dealing with node-link diagrams and proceed by extending our previously introduced HNMap approach with so-called edge bundles for linking traffic sources and destinations on the map.
Thereupon, we present a case study to demonstrate the technique and conclude by assessing the overall contribution in the summary.

6.1 Related work on network visualizations based on node-link diagrams

Node-link diagrams are the most prominent representations of graph data structures. These diagrams typically consist of nodes, which represent the vertices, and line segments representing edges, whereby each edge connects two vertices. Note that in large layouts nodes are usually not labeled due to display space limitations. Instead, labels are shown interactively, popping up when the user moves the mouse pointer over a node. While node-link diagrams have been extensively used in information visualization and other research fields, their usage in network monitoring and security has so far been limited to expressing connectivity and traffic intensity between hosts or higher-level elements of the network infrastructure. To gain an overview of the contributions related to node-link diagrams, we recommend Manuel Lima's weblog "visual complexity", a visual exploration on mapping complex networks [105]. It is an extensive effort to collect network visualization studies from various fields and currently features about 500 projects including both historic and state-of-the-art research and design.

Abstract graph visualizations

While geographic nodes have clearly defined node positions, one of the key issues of abstract graph representations is the positioning of the nodes. A multitude of graph layout methods have been proposed with different optimization goals ranging from minimizing edge crossings to emphasizing structural properties of graphs [8]. Abstract graph representations normally seek a way to effectively use the available screen space. This implies that linked nodes are rendered close to each other in order to avoid visual clutter caused by crossing edges.
Cheswick et al., for example, mapped a graph of about networks as nodes having more than connecting edges [24], obtained by measuring the quality of network connections in the Internet from different vantage points. Visualizing this information in graphs is challenging in terms of the layout calculation as well as in terms of visibility of nodes and links of such a graph. For better visualization, the authors reduced the graph by calculating a minimal distance spanning tree and then mapped the nodes and edges onto 2D using a spring-embedder layout, which places adjacent nodes in the vicinity of each other. Because large-scale graphs cannot be perceived in all details, nor can layout methods be calculated on the fly, Ganser et al. have studied a method for reducing the number of displayed objects while preserving structural information essential for understanding the graph [57]. Their method is based on the precomputation of a hierarchy of coarsened graphs, which are combined into concrete renderings. Attack graphs are another related topic; they aim at showing how an attacker might combine individual vulnerabilities to seriously compromise a network. Early attack graphs represented transitions of a state machine, with network security attributes as states and the exploits of attackers as their transitions. Following a path through the graph generates a possible attack path. However, these graphs had serious scalability issues due to the large state space,

resulting in the development of simplified graphs showing dependencies among exploits and security conditions, which grow only quadratically with the number of exploits. Noel and Jajodia presented a method for visual hierarchical aggregation of attack graphs, in which nonoverlapping attack subgraphs are recursively collapsed to single vertices to make attack graph analysis scalable both computationally and cognitively [127]. Williams et al. focused on a different visual approach to attack graphs [182]. Having figured out that mapping the attacks back onto the network topology was a cognitively difficult task, they introduced a combined attack graph and reachability display using a treemap visualization in combination with a user-defined semantic abstraction. Since human creativity is not limited to purely abstract or purely geographic graph layouts, there exist hybrid approaches that partly take geographic information into account while calculating the graph layout on the screen. One such approach is the visualization interface of the Skitter application that uses polar coordinates to visualize the Internet infrastructure [28]. Each AS node's polar coordinate is determined by the geographical longitude of its headquarters and by the hierarchical connectivity information.

Geographic graph visualizations

Early network mapping projects, such as the visualization study of the NSFNET [36], put their focus on geographic visualization where each network node had a clearly defined position on a map. This principle was also applied in a study to map the multicast backbone of the Internet in 1996 [125], in which the authors used a 3-dimensional representation of the world and drew curved edges on top of it to show the global network topology.
Other research efforts focused on visual scalability issues in 2-dimensional representations ranging from matrices to embeddings of the network topology in a plane as reviewed in [44]. In [9], for example, the concept of line shortening was introduced to prevent the visual clutter from obfuscating the whole image. This technique only draws a short fragment of each connection line at the start and the end point of the line. The Atlas of Cyberspace is another attempt to collect all kinds of maps that were made of cyberspace, including geographic, 3D, and abstract graph layouts to represent the network infrastructure [40]. At this point, it is worth mentioning that the study presented in [35] (already discussed in Section 3.3.3) also uses a geographic metaphor, but visualizes abstract AS connectivity information without real spatial reference. Flow maps are another approach to showing the movement of objects from one location to another. Traditionally, flow maps were hand drawn to reduce the visual clutter introduced by overlapping flows. Phan et al. presented a method for generating well-drawn maps, which allowed users to see the differences in magnitude among the flows while minimizing the amount of clutter [133]. However, interpretation of flow maps with multiple vantage points remains a challenge. A recent work on hierarchical edge bundles handles the problem of visual clutter elegantly: a hierarchical classification of nodes is exploited to draw the bundling lines that connect the leaf nodes, thereby visually emphasizing correlations [69]. Inspired by this work and the decision to stick to the node positions fixed by the HNMap layout, we describe an approach to drawing edge bundles on top of HNMaps in this chapter. Such edge bundles visualize end-to-end relationships in network data sets, rather than limiting the analysis to the outgoing or the incoming traffic of a single vantage point.

(a) Straight connection lines (b) Exploiting hierarchical structure (c) Compromise through edge bundles
Figure 6.1: Comparison of different strategies to drawing adjacency relationships among the IP/AS hierarchy nodes on top of the HNMap

6.2 Linking network traffic through hierarchical edge bundles

With the HNMap technique described in the previous chapter, we faced the problem that drawing straight connecting lines between the most prominent sources and destinations on the map introduces heavy visual clutter, which makes it difficult to follow even a single straight line. The question arose how the hierarchical structure of the map could be exploited to organize the connecting lines without losing too much detailed information. Figure 6.1 shows three approaches to drawing adjacency relationships among the nodes of the IP/AS hierarchy. The problem of visual clutter becomes obvious when straight lines are drawn as shown in (a). In (b), one observes that connecting with the center of the upper-level hierarchy nodes (rectangles) removes visual clutter, but adds a lot of ambiguity, e.g., the lines connecting the left star (U.S.) and the stars in Europe (middle), one for each country, are overdrawn more than a hundred times. An interesting compromise is to use so-called hierarchical edge bundles and transparency effects to combine the advantages of the latter two approaches, as shown in (c).

Hierarchical edge bundles

A recent contribution proposed a way of using spline curves to draw adjacency relationships among nodes organized in a hierarchy [69]. Figure 6.2 illustrates how we use this technique to exploit the hierarchical structure of the AS/IP hierarchy in order to draw a spline curve between two leaf nodes.
P_start (green) is the center of the source rectangle representing a node in the IP/AS hierarchy and P_end (red) is the center of the destination rectangle. All points P_i (green, blue, and red) from P_start over LCA(P_start, P_end) (least common ancestor) to P_end form

the cubic B-spline's control polygon. Because many splines share the same LCA, which is in this case the center point of the world rectangle, we do not use the LCA in the control path. Note that the hierarchy in Figure 6.2(a) does not exhibit the prefix level, which would be placed beneath the AS level and could possibly add two more points to the control polygon. However, the user is free to drill down or roll up arbitrary nodes on the HNMap and thereby to modify the number of available control points for the splines. Drilling down to prefixes in one area of the map while keeping another area summarized as a continent in no way complicates spline rendering. The result is a set of splines with a varying number of control points.

(a) IP/AS hierarchy (levels: world, continents, countries, ASes; the world node is the least common ancestor) (b) Spline (red) with control polygon (blue)
Figure 6.2: The IP/AS hierarchy determines the control polygon for the B-spline

Essentially, HNMap represents an IP world map, in which traffic from the United States to Asia or Oceania would probably not make a detour over Europe. We thus draw these splines going out on the left side and coming in again on the right side, or vice versa. Since all control points have an influence on the course of the line, we need to duplicate all control points and then shift these by 360 degrees either to the left or the right depending on their original position. Since all of these duplicated points lie outside of the drawing area, they are not visible on the rendered image but guarantee the same curvature at the borders of the image. In our implementation, the degree of the B-spline is used to control the bundling strength. A degree of six proved experimentally to be a good choice for the cases with a sufficient number of control points.
Otherwise, the degree is reduced to the number of available control points. When drawing many of these splines, the bundling effects, which are caused by a number of common control points, reveal high-level information about the traffic intensity on top of the HNMap. Having this information right in the visualization is more useful than presenting a separate visualization, since mentally combining the information on the map with these abstract traffic relationships otherwise becomes a difficult task for the analyst.

Edge Coloring

In general, we considered two options for coloring and sizing the edges, namely (a) the use of color to distinguish between the edges and the use of width to judge the edge's relevance, and (b) the use of both color and edge width to convey the relevance of the edge.
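The control-polygon construction described in the subsection on hierarchical edge bundles can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the HNMap implementation: the `Node` class, the `NODES` table with its coordinates, and the functions `control_polygon` and `spline_degree` are hypothetical names introduced here. In particular, `spline_degree` assumes that "reducing the degree" means capping it at the number of control points minus one, which is the largest degree a B-spline over that many control points admits.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Node:
    """One rectangle of the IP/AS hierarchy (hypothetical structure)."""
    center: Point
    parent: Optional[str] = None


# Hypothetical miniature hierarchy; coordinates are made up for illustration.
NODES: Dict[str, Node] = {
    "World": Node((0.0, 0.0)),
    "North America": Node((-100.0, 40.0), "World"),
    "United States": Node((-95.0, 38.0), "North America"),
    "AS123": Node((-90.0, 35.0), "United States"),
    "Europe": Node((10.0, 50.0), "World"),
    "Germany": Node((10.0, 51.0), "Europe"),
    "AS553": Node((9.0, 48.0), "Germany"),
}


def path_to_root(nodes: Dict[str, Node], name: str) -> List[str]:
    """Return the node names from `name` up to the root, inclusive."""
    path: List[str] = []
    current: Optional[str] = name
    while current is not None:
        path.append(current)
        current = nodes[current].parent
    return path


def control_polygon(nodes: Dict[str, Node], start: str, end: str,
                    drop_lca: bool = True) -> List[str]:
    """Names of the control points from `start` over the LCA to `end`.

    With drop_lca=True the least common ancestor itself is left out,
    following the remark in the text that the shared LCA is not used.
    """
    up = path_to_root(nodes, start)
    down = path_to_root(nodes, end)
    ancestors_of_end = set(down)
    lca = next(name for name in up if name in ancestors_of_end)
    ascent = up[:up.index(lca)]               # start ... child of LCA
    descent = down[:down.index(lca)][::-1]    # child of LCA ... end
    middle = [] if drop_lca else [lca]
    return ascent + middle + descent


def spline_degree(num_control_points: int, preferred: int = 6) -> int:
    """Assumed rule: use the preferred degree unless too few points exist."""
    return max(1, min(preferred, num_control_points - 1))


polygon = control_polygon(NODES, "AS123", "AS553")
points = [NODES[name].center for name in polygon]
degree = spline_degree(len(points))
```

The resulting `points` and `degree` could then be handed to any B-spline routine. Rolling up a region of the map corresponds to removing levels from `NODES`, which shortens the control polygon and, under the assumed rule, lowers the degree and hence the bundling strength.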

For the first case (see Figure 6.3(a)), we chose an HSI color map with constant saturation and intensity, which is silhouetted against the background. While largely varying colors make tracing of single splines easier, users tend to feel strongly distracted by the colorful display. For the second case (see Figure 6.3(b)), we employed a heat map color scale from yellow to red using 50 to 0 percent alpha blending to visually weight the high-traffic links. Naturally, less important splines were drawn first. Because only the most important traffic connections are displayed, we used square-root normalization for both width and color of the splines in order to assign a larger spectrum of colors to these high values. An alternative option would be to adjust the minimum of the colormap to the smallest of the top traffic connections and then to apply log-scaled mapping.

Interaction design

In our opinion, the task of tracing splines becomes unsolvable by means of coloring once their number exceeds about one hundred. We therefore attempted to tackle the problem through interaction. Basically, there are two possible interaction scenarios. The first scenario is to select, refine, and visually highlight a particular spline or region. Selecting a spline or a region by the start and the end points of the splines is relatively straightforward to implement using the spatial data structure of the HNMap application. However, the second scenario of marking a whole bundle of splines had to be discarded due to its tedious implementation. It probably boils down to a point-in-polygon test for each spline against the selected pixels. After specifying the start or the end points of the splines via a mouse interaction, we redraw all splines using their previous RGB color values: non-selected splines with higher alpha blending and the selected ones without transparency effects on top.
This allows the user to easily trace the few highlighted splines to their ends. Alternatively, non-selected splines can be completely removed.

Data Simulation

Since our university network is not a very central node in the Internet backbone, we only route traffic from or to our own network and a few other nodes in the BelWü autonomous system. As a consequence, most incoming traffic at our gateway is headed towards the internal prefix /16 and most outgoing traffic comes from there. If we visualized this traffic on the HNMap, all splines would run into this AS or another prefix node in Germany. Because traffic information is generally considered confidential, it was impossible to obtain real traffic data from service providers for publication. We therefore settled on using real netflow data from our university gateway and substituted the internal source IP prefix with a randomized one according to Algorithm 6.1. The goal of this randomization schema is to distribute the measured traffic of the top n hosts over the connections among them in such a way that the sum of weights of the connections for each host is equal to its total traffic. Implicitly, nodes with more traffic are more likely to communicate with several other nodes due to their higher probability of being repeatedly selected in the loop, whereas the opposite applies to low-traffic nodes.
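The randomization schema described above (Algorithm 6.1) can be sketched in Python as follows. This is a hedged illustration, not the original implementation: the function name `simulate_connections` and the example traffic values are hypothetical, the random partner is assumed to be drawn uniformly from the remaining nodes, a node whose remaining weight drops to zero is assumed to be removed, and the loop is assumed to stop once fewer than two nodes remain (a single leftover node has no partner). Instead of drawing a spline, the sketch records each connection as a tuple.

```python
import random
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]


def simulate_connections(traffic: Dict[str, int], seed: int = 0) -> List[Edge]:
    """Distribute per-host traffic over randomly chosen host pairs.

    Repeatedly takes the host with the least remaining traffic, connects it
    to a randomly sampled other host with that weight, and subtracts the
    weight from the sampled host's remaining traffic.
    """
    rng = random.Random(seed)
    remaining = {host: w for host, w in traffic.items() if w > 0}
    edges: List[Edge] = []
    while len(remaining) > 1:
        # removeMin(): host with the smallest remaining traffic
        n_min = min(remaining, key=lambda host: remaining[host])
        w_min = remaining.pop(n_min)
        # RandomSample(): uniform choice among the other hosts (assumption)
        n_rand = rng.choice(sorted(remaining))
        # stands in for drawspline(n_min, n_rand, w_min)
        edges.append((n_min, n_rand, w_min))
        # update(n_rand, w_rand - w_min); never negative, since w_min is minimal
        remaining[n_rand] -= w_min
        if remaining[n_rand] == 0:
            del remaining[n_rand]
    return edges


# Hypothetical traffic totals (e.g. bytes) of the top hosts.
example_traffic = {"A": 50, "B": 30, "C": 20, "D": 10}
example_edges = simulate_connections(example_traffic, seed=1)
```

Under these assumptions, every host that leaves the list has its full traffic assigned to connections; at most one host (the last one remaining) may keep an unassigned remainder. Hosts with large totals survive more iterations and therefore tend to end up with more partners, which matches the behavior described in the text.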

(a) Randomly colored edge bundles make splines more distinguishable (the amount of traffic is expressed only through spline width) (b) Color combined with transparency and line width to show the traffic amount of a spline
Figure 6.3: HNMap with edge bundles showing the 500 most important connections of one-day incoming and outgoing traffic at the university gateway with semi-randomly selected anonymized destination/source

Algorithm 6.1: Randomization schema to simulate backbone provider traffic
begin
    SortedList ← all pairs of unprocessed nodes and weights
    while |SortedList| > 0 do
        (n_min, w_min) ← SortedList.removeMin()
        (n_rand, w_rand) ← SortedList.RandomSample()
        drawspline(n_min, n_rand, w_min)
        SortedList.update(n_rand, w_rand - w_min)
end

6.3 Case study: Visual analysis of traffic connections

To determine the changes of the most important traffic connections over a period of one day, we loaded traffic data from our university gateway into the HNMap visualization and ran the randomization algorithm to simulate a destination or a source AS outside of our network. Figures 6.3(a) and 6.3(b) demonstrate the outcome of our randomization schema based upon a one-day traffic load of our university gateway by displaying the top 500 traffic connections. The images display the high-level connectivity information while still being capable of providing the low-level relations to a large extent. Three major bundles of traffic from the United States to Europe are identifiable, namely to Southern Europe, to Germany, and to Northern Europe. Furthermore, major traffic bundles from Europe to Asia and from the United States to China as well as to South Korea characterize this traffic. Because of overdrawing effects, shorter splines, and decreased structure of the bundles due to an insufficient number of control points, traffic bundles within a single continent are harder to analyze on the map. The strong bundling effects in North America (center left) can be explained by the proximity of the two control points of North America and the United States to each other. Figure 6.4 shows the result of analyzing the same data with four HNMap instances, each representing an equally sized time frame. To enlarge the important rectangles, we removed nodes with no traffic in all map instances.
For better visibility of the splines, the colormap of the underlying HNMaps was flipped, resulting in a better contrast with the dark background. Because each map instance displays 100 splines, it is hard to estimate whether there is more or less traffic when comparing two instances. However, the color of the red splines gives an indication of the time frame in which the traffic connections with the highest load occurred. The largest connection in this image is in the third frame between the Deutsche Telekom in Germany and China-Backbone No. 3 (detailed labels could only be obtained through mouse interaction due to space limitations). The fourth map instance in particular shows an absence of thick red splines. In general, it contains more connections with Europe, in particular with Germany (top right of the center), than the other images. In the first frame, it is noticeable that more connections with Canada exist than in all the other frames. Furthermore, interaction allows the analyst to better trace individual connections or, alternatively, to monitor the changing connections of a particular continent, country, AS, or prefix in all four map instances.

Figure 6.4: Assessing major traffic connections through edge bundles (100 splines / 4 HNMap instances)

6.4 Discussion

Edge bundles offer an opportunity to communicate source-destination relationships on top of the HNMap, thereby elevating the application from a purely measurement-based analytical approach to a more complete view of large-scale network traffic. Since tracing single splines becomes very challenging once a certain number of traffic links is placed on the map, we experimented with the RGB and alpha values of the spline colors. Mouse interaction is used to select splines at their start or end points in order to set them apart from the remaining splines. The most obvious drawback of our approach is that edge bundles occlude important information in the background of the map, thus making it inaccessible to the analyst. Furthermore, conveying significance of splines or distinguishing them through color disqualifies the visual variable color from being used for indicating the direction of traffic flows. So far, we have not found an acceptable solution to communicate traffic directions through the splines. When monitoring larger networks, focusing the analysis on a particular type of traffic, for example, communication of hijacked computers that belong to a botnet, helps to significantly reduce the number of nodes and connections to be displayed. Limiting the number of visible splines as well as the HNMap option to remove empty nodes make the visualization more usable by focusing on the important nodes and traffic links.

6.5 Summary

HNMaps visually represent network traffic aggregated on IP prefixes, ASes, countries, or continents and can be used for exploration of large IP-related data sets. In this chapter, we extended HNMaps through edge bundles to emphasize the source-destination relationship of network traffic. Rather than presenting new visualization or analysis techniques, we combined two existing methods and applied them to large-scale network traffic to gain insight into communication patterns. In order to facilitate tracing of splines, which represent source-destination relationships between networks or abstract nodes of the IP hierarchy, we compared two alternative coloring schemes. Mouse interaction can be used to highlight network traffic links of interest and offers the analyst a tool to explore details of large data sets while still being able to see other potentially relevant links and high-level patterns.

Future Work

We plan to enhance the HNMap edge bundles concept by displaying detailed traffic information of the prefix of interest. By considerably enlarging this prefix and by mapping single IP addresses in it, we could show details about traffic connections to and from internal hosts. However, since almost all traffic links will share the nodes from our prefix up to the European continent in the IP/AS hierarchy, their points need to be removed from the control polygon to create more meaningful bundling effects. To avoid overdrawing issues in dense spline regions, we considered introducing empty nodes in the HNMap layout and thereby repositioning the otherwise overdrawn rectangles. However, due to time constraints, we leave the implementation and further elaboration of this idea open at this point.

7 Multivariate analysis of network traffic

"Above all else show the data."
Tufte

Contents
7.1 Related work on multivariate and radial information representations
    Multivariate information representations
    Radial visualization techniques
Radial Traffic Analyzer
Temporal analysis with RTA
Integrating RTA into HNMap
Case study: Intrusion detection with RTA
Discussion
Summary

In many network security analysis scenarios, it is not only the relations at various levels of the IP/AS hierarchy or between the sender and the recipient that play an important role, but also other dimensions, such as port numbers, alerts, time, protocols, etc. Since there exist a number of techniques and tools that can be used for analyzing local and global correlations within the attributes of a multivariate data set, we had to focus our effort on implementing and integrating such promising analysis options into our HNMap tool. After careful consideration, we chose a radial hierarchical visualization technique because we found it appropriate for interpreting detailed information along with aggregated information in one display. The chosen radial layout has the property of allocating progressively more screen space to the outer rings, which represent more granular data than the inner rings. Therefore, these details can be explored while an overview of the data is maintained through the neighborhood relations with the inner ring segments, which represent high-level information. Interaction capabilities make the Radial Traffic Analyzer (RTA) a powerful tool for dynamic filtering and aggregation in network data sets.

The chapter is structured as follows. We start by reviewing related work on multivariate and radial information representations and proceed by introducing the Radial Traffic Analyzer. After demonstrating how the RTA can be used for temporal analysis, the integration into HNMap is discussed. Subsequently, a short case study applies the approach to an intrusion detection data set.
Finally, the last section summarizes the chapter's contributions.

7.1 Related work on multivariate and radial information representations

Since this chapter analyzes the multivariate nature of network data sets, related work discussed here addresses multivariate information representations. In addition, due to the choice of a radial visualization technique for our implementation, we also review some related radial layouts with a focus on the network monitoring and security application fields.

Multivariate information representations

Although the design space for multivariate information representations seems to be almost unlimited at first glance, only a couple of visualization techniques have managed to gain a foothold in everyday use. Some of these techniques are already quite old, such as scatterplots, bar charts, or line graphs. A multitude of applications extend these standard plots (e.g., 3D or animated scatterplots). One prominent example is GapMinder [141], a visualization tool that has become quite popular recently. Although the scatterplot-like visualization metaphor behind it is not new, its animation mode and the underlying data from the World Health Organisation make it a fascinating web-based exploration tool. Another possible way to extend these visualization techniques is the use of small multiples as demonstrated in the scatterplot matrix in Figure 7.1. It is a powerful technique, but comparing all possible dimension pairs results in a huge number of plots for data sets with more than a few dimensions. Likewise, when considering large real-world data sets, a large extent of overlap makes scatterplot visualizations hard to interpret and sometimes even misleading. A further extension of traditional visualization techniques for dealing with multivariate data is dimensional stacking [101], which involves recursive embedding of images defined by a pair of dimensions within the area of a higher-level image.
Figure 7.1: Scatter plot matrix over the dimensions time, srcaddr, dstaddr, and app_port; color represents the number of flows on a logarithmic scale

Glyphs are another popular technique for expressing multivariate data and comparing several objects with one another. By mapping the attribute values to shape, color, size, orientation, or position, correlations can be found by searching for similar characteristics in these glyphs. Probably the most famous research on glyph visualization is the work on Chernoff Faces [23]. In this work, points with up to 18 dimensions were mapped onto the features of a face, such as nose length, mouth curvature, etc. In the network security field, the idea of using glyphs for visual comparison of network events was demonstrated in the Intrusion Detection toolkit [94]. Its visual interface assigns variables of network statistics and intrusion detection data sets to visualization attributes, producing a glyph visualization of past or current events. On the one hand, this approach is very flexible, but, on the other hand, the many possible parameter settings make it difficult to choose a good visualization. One of the most famous geometric visualization techniques is parallel coordinates [73]. The basic idea is to represent each dimension of the data set through a vertical axis and to connect the points of one multidimensional data point with each other through straight lines. Parallel coordinates have become a popular analysis technique when dealing with network data. VisFlowConnect, for example, uses the parallel axis view to display netflow records as incoming and outgoing links between two machines or domains [190]. This tool allows the analyst to discover a variety of interesting network traffic patterns, such as virus outbreaks, denial of service attacks, or network traffic of grid computing applications. Correlations among groups of data tuples can be easily spotted on neighboring axes.
Therefore, the neighborhood configuration of these dimension axes strongly affects the correlations seen by the analyst. However, in high-dimensional data sets, there is a huge number of possible configurations since any permutation of these axes is valid. Furthermore, overplotting becomes a serious issue when dealing with medium-sized or large data sets.

For visual analytics, not only the capability of pattern discovery is important, but also the means for refining and storing the insights gained in the process of analysis. The XmdvTool, well-known for its multivariate visualizations including glyphs, parallel coordinates, and dimensional stacking as well as for interaction techniques like brushing and linking [177], recently introduced the Nugget Management System that supports refinement, storage, and sharing of valuable insights [188].

In pixel-oriented visualization techniques, each pixel represents one data value through its color, thus enabling visualization of large amounts of data [80]. The positional attribute commonly sets the data values into context by grouping them according to either dimensions or data tuples. Similarly to parallel coordinates, many pixel visualization techniques such as Pixel Bar Charts [81] either can be run on a subset of dimensions or provide several parameters for fine-tuning the visualization, which results in a large number of possible pixel visualizations. Smart visual analytics techniques such as Pixnostics can be used in this context to automatically filter out uninteresting plots in order to focus on those plots with potentially interesting features [142].

In business applications, the pivot table interface is especially popular. The Polaris system, for example, is a visual interface for large multidimensional databases that extends the pivot table [149]. It allows the analyst to interactively construct visual presentations of table-based graphical displays from visually specified database queries.
The experience gained from the Polaris prototype led to the development of the commercial analysis tool Tableau Software [152]. Its recent innovations feature an integrated set of interface commands and defaults that incorporate automatic presentation into their software suite [110].

In contrast to these general-purpose visualization techniques, the aim of our visualization approach is to simultaneously show the aggregates computed at different granularity levels along with the detailed information of the network monitoring and security data sets. Note that this goal can be achieved through many possible layouts. In our case, we found the radial approach useful since more display space at the outer rings gives more space to the detailed information at low levels. Thereby, we take into account that circular layouts waste a lot of space in rectangular displays.

Radial visualization techniques

In the last few years, radial visualization techniques have become popular, including in the network security field. Since we also decided to use a radial layout for our technique, some of the related visualization approaches will be discussed here. Fink and North introduced the Root Polar Layout of the IP address space [51]. Based on the assumption that classifying the IP address space into several trust levels can reveal useful information to the analyst, the tool visually separates these trust levels through concentric rings. IP addresses of the highest trust level are drawn in the inner circle, with IP addresses of progressively lower trust levels arranged on concentric rings around this circle. While this approach is certainly capable of displaying many different IP addresses on a single screen, a major drawback is that visual correlation of those IP addresses with other dimensions of the data becomes difficult. The work by Erbacher et al. detailed the analysis of multivariate network traffic through a visualization that is conceptually close to parallel coordinates [47].
Circles around the middle encode time intervals from the latest events on the outer rings to historical events on the inner ones. The angle from the middle determines the internal IP address and the cutting points with the circles are then connected through straight lines with the external IP address (top and bottom side) and the application port (left and right side), which were mapped as axes on the sides of the display. Note that the redundant encoding of the respective opposite side is used to declutter the display by excluding lines that cross more than 50% of the display in the respective direction. VisAware focuses on situational awareness and also uses concentric circles to express time intervals [52]. It was built upon the w³ premise that every incident has at least three attributes, namely, what, when, and where. In this display, the location attribute is placed on a map in the center, the time attribute is indicated on concentric circles around this map (with the current time spans inside and the historical ones outside), and the classification of the incident is represented through the angle around the circle. For each incident, the attributes were linked through lines that connect at its location on the map. While the first approach used concentric rings for trust levels of IP addresses and the latter two used the rings to map the time attribute, our approach uses the rings in a hierarchical fashion by showing highly aggregated information on the inner rings and detailed information on the outer rings. The visualization metaphor of our Radial Traffic Analyzer (RTA) consists of

several concentric rings subdivided into sectors and is inspired by the space-filling hierarchical visualization techniques of Solar Plot [25], Sunburst [148], Interring [189], and FP-Viz [85].

7.2 Radial Traffic Analyzer

Our analysis focuses on the network layer of the Internet protocol stack. The network layer provides source and destination IP addresses, whereas the transport layer provides source and destination ports. Additionally, we collect information about the used protocols (mostly TCP and UDP) and the payload (transferred bytes). In short, we store a tuple t = (time, ip_src, ip_dst, port_src, port_dst, protocol, payload) for each transferred packet. For simplicity, we restrict ourselves to UDP (used by connection-less services) and TCP (used by connection-oriented services) packets.

In the visualization presented in this chapter, we try to bring together the complementing pieces of information through extensive use of the visualization attribute position. Different variables of the data set are mapped to rings (see Figure 7.2), and the positioning scheme facilitates the analysis of a single data item (green) by following a straight line (red) from its position on one of the rings to the circle in the middle. Each ring segment that crosses this line (blue) is a higher level aggregate that contains the data item. Furthermore, sorting and grouping operations are applied to bring similar data tuples close to each other. Color is used to visually link identical data characteristics, such as a host that appears once as the destination of incoming traffic and once as the source of outgoing traffic.

Figure 7.2: Design ratio of RTA. Figure labels: attribute 1 (group by attribute 1, order by attribute 1), attribute 2 (group by attribute 1, attribute 2; order by attribute 1, attribute 2), country or continent, data item, higher level aggregate.

We assume that a radial layout provides better support for the task of finding suspicious patterns, because the user does not misinterpret the item's position (left or right) as an indication of that item's importance, whereas in linear layouts, the natural reading order from left to right might cause such false impressions [26]. As users might tend to minimize eye movements, the cost of sampling will be reduced if items are spatially close [178]. Therefore, in the radial layout of RTA, the most important attribute, as chosen by the user, is placed in the inner circle, with the values arranged in ascending order to allow better comparisons of close and distant items. The subdivision of this ring is conducted according to the proportions of the measurement (i.e., number of packets or connections) using an aggregation function over all tuples with identical values for this attribute. Each further ring displays another attribute and uses the attributes of the rings further inside for grouping and sorting, prioritized by the order of the rings from inside to outside.

RTA displays can also be created interactively. By adding one dimension after another as demonstrated in Figure 7.3, the analyst obtains a better feeling for the granularity of the data. If the granularity is too fine, there is still an option to remove or reorder rings. While each added ring triggers a new database query, reordering the rings influences the aggregates of the concerned ring as well as of all the outer rings, thus triggering a series of queries.

Figure 7.3: Continuous refinement of RTA by adding new dimension rings

The default configuration consists of four rings. The visualization is to be read from the center outwards, with the innermost ring for the source IP addresses, the second ring for the destination IP addresses, and the remaining two rings for the source and the destination ports, respectively.
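The grouping and proportional subdivision of the rings can be illustrated with a small sketch. It is only an approximation under stated assumptions: the function name `ring_segments`, the record layout (dictionaries keyed by attribute name), and the choice of degrees starting at 0 are hypothetical and not taken from the RTA implementation, which issues database queries instead.

```python
from collections import defaultdict


def ring_segments(records, ring_attrs, measure="payload"):
    """Compute angular segments for each ring (hypothetical helper).

    Ring i groups the records by the first i+1 attributes in ring_attrs,
    sums the chosen measure per group, and sorts the groups by the same
    attributes. Because of this prefix ordering, every segment of an outer
    ring lies inside the angular range of its aggregate on the inner ring.
    Returns one list per ring with (group_key, start_deg, end_deg) entries.
    """
    total = sum(r[measure] for r in records)
    if total == 0:
        return [[] for _ in ring_attrs]
    rings = []
    for depth in range(1, len(ring_attrs) + 1):
        attrs = ring_attrs[:depth]
        sums = defaultdict(float)
        for r in records:
            sums[tuple(r[a] for a in attrs)] += r[measure]
        angle = 0.0
        segments = []
        for key in sorted(sums):  # order by attribute 1, attribute 2, ...
            extent = 360.0 * sums[key] / total
            segments.append((key, angle, angle + extent))
            angle += extent
        rings.append(segments)
    return rings


if __name__ == "__main__":
    packets = [
        {"ip_src": "a", "ip_dst": "x", "payload": 30},
        {"ip_src": "a", "ip_dst": "y", "payload": 10},
        {"ip_src": "b", "ip_dst": "x", "payload": 60},
    ]
    for ring in ring_segments(packets, ["ip_src", "ip_dst"]):
        print(ring)
```

Switching the measure from transferred bytes to a per-record connection count only requires a different `measure` key, which mirrors the choice of measure discussed for RTA.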
Starting on the right and proceeding counter-clockwise, the fractions of the payloads for each group of network traffic are mapped to the ring segments while sorting the groups according to ip_src, ip_dst, port_src, and port_dst. Having grouped the traffic according to ip_src, we add the next grouping criterion for each outer ring, which results in a finer subdivision of each sector compared to the inner rings.

Figure 7.4 shows the RTA display with network traffic distribution at a local computer. An overview is maintained by grouping the packets from inside to outside. The inner two circles represent the source and the destination IP addresses, the outer two circles represent the source and destination ports. Traffic originating from the local computer can be recognized by the lavender colored circle segment in the inner ring, and traffic to this host is shown in the same color in the second ring. Normally, ports reveal the application type of the respective traffic. This display is evidently dominated by web traffic (port 80 colored green), remote desktop and login applications (ports 3389 and 22 colored dark red and red, respectively), and e-mail traffic (ports 993 and 25 colored blue and dark blue, respectively).

Figure 7.4: RTA display with network traffic distribution at a local computer. Legend: IP addresses (localhost, others); application ports (web, secure web, mail, secure mail, Telnet/MS Remote Desktop, SSH, FTP/Netbios, others).

To facilitate the understanding and correct interpretation of the rings, sectors representing identical IP addresses (inner two rings) are drawn in the same color. A common coloring scheme also applies to the ports (outer two rings). To further enhance the coloring concept, we created a mapping function for ordinal attributes that translates a given port or IP address number x to the index of an appropriate color scale: c(x) = x mod n (n is the total number of distinct colors used). Prominent ports (e.g., HTTP=80, SMTP=25) are mapped to colors not included in the color scale so that they are easier to identify. The mapping function facilitates relating close IP addresses or ports to each other. To distinguish between the traffic transferred over an unsecured and over a secured channel, we modify the brightness of the respective color (i.e., HTTP/80 = green, HTTPS/443 = light green, etc.). For mapping numeric attributes (e.g., number of connections or time) to color, it makes more sense to normalize the data values and then to map them to a color scale with light to dark colors or vice versa. Different color scales were used for the attributes and should clarify the comparability of rings. Thereby, an IP address appearing as a sending host in the innermost circle and reappearing as a receiving host in the second circle should be colored identically, whereas this color should then be exempted from being used for mapping a port number in the outer rings.
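A minimal sketch of the color mapping c(x) = x mod n described above follows. The concrete RGB values, the `PROMINENT` and `SECURED` tables, and the `lighten` helper are illustrative assumptions rather than the colors used in RTA; only `ipaddress` from the Python standard library is used to turn an address into a number.

```python
import ipaddress

# Assumed reserved colors for prominent ports (RGB in [0, 1]); these values
# are illustrative and are kept out of the general color scale.
PROMINENT = {80: (0.0, 0.5, 0.0), 25: (0.0, 0.0, 0.5), 22: (0.9, 0.1, 0.1)}
# Assumed pairs of secured port -> unsecured counterpart.
SECURED = {443: 80}


def lighten(rgb, amount=0.4):
    """Move a color towards white to mark the secured variant."""
    return tuple(c + (1.0 - c) * amount for c in rgb)


def port_color(port, scale):
    """Color for a port: reserved color, lightened variant, or scale[x mod n]."""
    if port in SECURED:
        return lighten(port_color(SECURED[port], scale))
    if port in PROMINENT:
        return PROMINENT[port]
    return scale[port % len(scale)]  # c(x) = x mod n


def ip_color(address, scale):
    """Color for an IP address, using its integer value as x in c(x)."""
    return scale[int(ipaddress.ip_address(address)) % len(scale)]


if __name__ == "__main__":
    scale = [(0.8, 0.6, 0.2), (0.6, 0.4, 0.8), (0.2, 0.7, 0.7), (0.7, 0.7, 0.3)]
    print(port_color(8080, scale), port_color(443, scale))
    print(ip_color("10.0.0.5", scale))
```

Because the index is the number modulo n, numerically adjacent ports or addresses receive adjacent colors of the scale, which is what allows close values to be related to each other.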
Mapping hierarchical data to a radial layout results in a generally favorable display allocation: each outer ring shows more detailed information and also consumes more display space. Depending on the analysis task at hand, different dimensional combinations are useful. The grouping is specified by assigning the chosen dimension (i.e., source IP, destination IP, source port, destination port) to one of the rings. A grouping according to the hosts might be useful when determining high-load hosts communicating on different ports, while a grouping according to the destination ports clearly reveals the load of each traffic type. To compensate for the strict importance imposed by the choice of the inner circle, the positioning and, thus, the importance within the sorting order can be changed interactively by dragging the circles into the desired positions.

With the increasing number of circle segments in a ring, some of them may become too small to plot labels into them. Therefore, we cut long labels and employ Java tooltip popups showing the complete label and additional information like the host name for a given IP address and the possible applications corresponding to the respective ports (footnote 1). As filtering is a common task, we implemented a filter for discarding all traffic with the chosen attribute values, which can be triggered with a simple mouse click. Detailed information about the data items represented through a particular circle segment is accessible via a popup menu.

The payload is not the only available measure for network traffic analysis. To investigate failed connections, for example, the measure transferred bytes would not show the data entries of interest on the ring, as all failed connections have the value of 0 for this measure. In this situation, number of connections would be a more appropriate measure for sizing the circle segments. Experts often compare the number of transferred bytes to the number of sessions on a set of active hosts.
High traffic with only few sessions is considered to be a download resource, whereas medium traffic distributed over many sessions is typical for medium-bandwidth activities like surfing the world wide web. Plotting two RTA displays using different measures and the same ring order would enable this comparison at both aggregated and detailed levels.

Normally, the data to be examined is abundant and commonly known patterns dominate over exceptional traffic patterns. Therefore, filters are crucial for the task of finding malfunctions and threats within large network security data sets. In our tool, we implemented rules to discard ordinary traffic (e.g., web traffic), but also to select just certain subsets of the traffic (e.g., traffic on ports used by known security exploits). The user interactively applies, combines, and refines these filters to confirm or disprove his hypotheses about the data.

One obvious disadvantage of radial layouts is the unavoidable waste of a large portion of the available display space in the background, since the layout is circular whereas the display is rectangular.

Footnote 1: A list of port numbers with their applications is provided by the Internet Assigned Numbers Authority [74].

7.3 Temporal analysis with RTA

For real-time capturing of network packets on a local PC, we used the packet capturing libraries libpcap [100] and WinPcap [39], as well as the Java wrapper JPcap [20] for accessing those libraries using a Java interface. To meet the high performance requirements of monitoring large networks, we decided to manage the data in the popular open-source relational database PostgreSQL [136]. For the analysis of large data sets, a more intelligent preprocessing is employed by merging individual packets into sessions, thus significantly reducing the data size. One way of doing this preprocessing is to take advantage of the exporting features implemented in commercial routers (e.g., the Cisco NetFlow protocol). Thereby, packets with the same source and destination IPs, the same port numbers, and very close timestamps are grouped into one flow record.

We found that our tool is useful for observing network traffic characteristics over time using three different real-time modes:

1. In the aggregate mode, all traffic is aggregated and new traffic is continuously added.
2. In the time frame mode, a constant time frame is specified and the old traffic is continuously dropped.
3. In the constant payload mode, the same amount of traffic is displayed by specifying a flexible time frame, which always includes the latest packets.

Grouping the captured packets using a time frame up to the current moment, we can display a smooth transition by continuously updating the screen. Figure 7.5 contains a series of RTA displays for observing the evolution of the network traffic: (a) the user checks her e-mail (blue) on two different mail servers and sends out one e-mail using an unsecured channel (dark blue); (b) the user surfs on some web pages (port 80, dark green), whereas the blue mail traffic is still visible in the bottom left corner; (c) the user logs into her online banking account using HTTPS (bright green); (d) finally, a large file is accessed on the local file server using the netbios protocol (orange).

Another possibility is to scale the radius of the circles according to the traffic load they represent. In this way, the network monitoring analyst gets a visual clue about the load situation.
However, the major drawback of this option is that the display might become too small or too large to analyze because of strong variations in the network traffic.

Figure 7.5: Animation over time in RTA in time frame mode. Panels: (a) time frame 1, (b) time frame 2, (c) time frame 3, (d) time frame 4.
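The three real-time modes of this section can be sketched as selection functions over a list of captured packets. This is a simplified illustration: the function names, the packet representation (dictionaries with `time` in seconds and `payload` in bytes), and the parameters `frame_seconds` and `payload_budget` are assumptions made for the example, not part of the RTA tool.

```python
def aggregate_mode(packets):
    """Aggregate mode: keep all traffic seen so far."""
    return list(packets)


def time_frame_mode(packets, now, frame_seconds):
    """Time frame mode: keep only packets inside a constant, sliding frame."""
    return [p for p in packets if now - frame_seconds <= p["time"] <= now]


def constant_payload_mode(packets, payload_budget):
    """Constant payload mode: walk back from the latest packet until roughly
    payload_budget bytes are covered, so the time frame is flexible."""
    selected = []
    total = 0
    for p in sorted(packets, key=lambda q: q["time"], reverse=True):
        if total >= payload_budget:
            break
        selected.append(p)
        total += p["payload"]
    selected.reverse()  # restore chronological order
    return selected


if __name__ == "__main__":
    captured = [
        {"time": 0, "payload": 100},
        {"time": 10, "payload": 200},
        {"time": 20, "payload": 300},
        {"time": 30, "payload": 400},
    ]
    print(len(aggregate_mode(captured)))
    print(time_frame_mode(captured, now=30, frame_seconds=15))
    print(constant_payload_mode(captured, payload_budget=600))
```

Each function returns the subset of packets that would then be grouped and drawn, so a display refresh amounts to re-running the chosen mode on the current capture.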

Figure 7.6: Invocation of the RTA interface from within the HNMap display. (a) Triggering RTA from a node on the HNMap; (b) RTA user interface for specifying the rings.

7.4 Integrating RTA into HNMap

An RTA display can be triggered from any rectangle in HNMap through a popup menu as demonstrated in Figure 7.6(a). This allows the analyst to visualize more details about the hosts of the selected node or additional attributes of the data sets. After this interaction, a blank RTA display appears and the user can interactively add attributes as rings to the display and remove them again. Figure 7.6(b) shows the user interface triggered on China's traffic node within the data set from our university gateway and details the accumulated number of failed inbound connections on 11/29/2005. A port scan from a host is visible at the upper left, and a large number of failed attempts to open SMTP connections (e-mail delivery) from a host can be observed on the lower left. Due to the sorting order, the whole spectrum of colors from the color scale appears several times on the second ring. This pattern visualizes the probing of a continuous range of ports, which is typical for a port scan. Network traffic of normal applications varies the used source ports only infrequently, with normally just a few target ports employed.

7.5 Case study: Intrusion detection with RTA

RTA provides a flexible display for many different data sets and can be adjusted to a particular data set on the fly. Figure 7.7 shows a sample configuration with the inner two rings as the source and the target IP addresses and the outer ring with security alerts generated by an IDS. Having loaded IDS alerts, we discarded ICMP Router Advertisements, ping and echo alerts. The resulting view clearly revealed that one host (yellow) was attacked by another host (green) using various methods, indicated by a large number of small slices on the outermost ring.
Each of the outer slices represents a distinct class of alerts, and its width corresponds to the number of these alerts. In the RTA display, the IP address dimension can be consolidated by replacing single hosts with the associated higher-level network entities (e.g., IP prefix or AS), for example, to investigate whether a denial of service attack originates from a particular AS or to assess the danger of a virus spreading from neighboring ASes.

Figure 7.7: Security alerts from Snort IDS in RTA

7.6 Discussion

The feedback provided by a number of undergraduate students was encouraging. Mapping network data to a radial layout seems to present an effective overview of the composition of network communication at the level of network packets. It was recognized that the technique is applicable to small data sets captured on a local computer, as well as to traffic monitored at the university gateway, preceded by intelligent preprocessing (we obtained anonymous, cumulated statistics). However, the technique cannot show all details due to the visual limitations inherent in radial layouts. We can compensate for this shortcoming by discarding some obvious traffic, such as web and mail traffic, and by offering fast interactive filtering capabilities.

7.7 Summary

The main contribution of this chapter lies in the development of an analytical visual interface based on the adaptation and application of radial layout techniques to the analysis needs of the network traffic monitoring and network security domains. We presented the Radial Traffic Analyzer, a technique capable of visually monitoring network traffic, relating communication partners, and identifying the type of traffic being transferred. The data about the network traffic was collected, stored, and aggregated in order to be presented in a meaningful way. The RTA display is suitable for showing aggregated information on the inner rings while presenting related information in more detail on the outer rings. It is complemented by appropriate interaction techniques, such as fading in hints on mouse-over, drag & drop to adapt the order of the rings, filtering by clicks, and details accessible via a popup menu. By specifying a time frame for refreshing the view, the user is capable of continuously monitoring network traffic. Due to the applied grouping characteristics, changes within the visualization are smooth in many realistic scenarios. Integration of RTA into HNMap enables the analyst to receive additional information about arbitrary nodes in the IP/AS hierarchy. In a short case study, we demonstrated how IDS alerts can be intelligently grouped in order to see the types of alerts in proportion to their number. Radial Traffic Analyzer is not only suitable for monitoring purposes, but can also be used in the educational context for teaching and learning networking concepts, such as the flipped use of source and destination ports (TCP and UDP protocols) in the replies to network requests demonstrated in Figure 7.4.

8 Visual analysis of network behavior

"Nothing has such power to broaden the mind as the ability to investigate systematically and truly all that comes under thy observation in life."
Marcus Aurelius

Contents
8.1 Related work on dimension reduction
8.2 Graph-based monitoring of network behavior
    Layout calculation
    Implementation
    User interaction
    Integration of the behavior graph in HNMap
    Automatic accentuation of highly variable traffic
Case studies: monitoring and threat analysis with behavior graph
    Case study I: Monitoring network traffic
    Case study II: Analysis of IDS alerts
Evaluation
Summary
Future Work

This chapter focuses on tracking behavioral changes in the traffic of hosts or higher level network entities as one of the most essential tasks in the domains of network monitoring and network security. We propose a new visualization metaphor for monitoring time-referenced host behavior. Our method is based on a force-directed graph layout approach, which allows for a multi-dimensional representation of several hosts in the same view. The proposed metaphor emphasizes changes in traffic over time and is therefore well suited for detecting atypical system behavior. We use the visual variable position to give an indication about traffic proportions at a host at a particular moment in time: high traffic proportions of a particular protocol attract the observation nodes, resulting in clusters of similar nodes. So-called traces are used to connect these host snapshots in chronological order. Various interaction capabilities allow fine-tuning of the layout, highlighting of hosts of interest, and retrieval of traffic details. As a contribution in the area of visual analytics, an automatic accentuation of hosts with high variations in the used application protocols of network traffic was implemented in order to guide the interactive exploration process. In our work, we abstract from actual date and time values by aggregating network data into intervals.
Each monitored network host (or, more abstractly, a network entity) is represented through a group of connected nodes in a graph, and the position of each one of these nodes represents the network traffic for the particular host in the specified time interval. The brightness of the traces is used to convey the nodes' time references.

The chapter is structured as follows. We start by discussing related work in the field of dimension reduction and proceed by introducing the behavior graph approach, detailing layout calculation, implementation, user interaction, integration in HNMap, and automatic accentuation of highly variable hosts and networks. The applicability of the presented methods for network monitoring and intrusion detection is demonstrated in two case studies, followed by the discussion of the performance as well as of the scalability limitations of the approach in the evaluation section. The last section concludes the chapter by summarizing its contributions.

8.1 Related work on dimension reduction

In this chapter, a graph-based approach to monitoring the behavior of network hosts is proposed. In essence, this approach can be described as a dimension reduction technique since it projects high-dimensional points onto the two-dimensional display space. We therefore briefly review some related work in this field. Rather than visualizing all dimensions of a data set, dimension reduction techniques reduce the high-dimensional data points to two or three dimensions, which can then be plotted in a visual layout. One such technique is Principal Component Analysis (PCA) [130], which displays the data as a linear projection onto a subspace of the original data in a way that best preserves the variance in the data. However, since the data is described in terms of a linear subspace, this technique cannot handle arbitrarily shaped clusters. Multidimensional Scaling (MDS) [159] is an alternative approach, which operates on a matrix of pairwise dissimilarities of entries.
The goal is to optimize the representations in such a way that the distances between the items in low-dimensional space are as close as possible to the original distances. Note that some MDS approaches can also be applied to non-metric spaces. One problem with nonlinear MDS methods is their computational cost for large data sets. However, Morrison et al. presented a subquadratic MDS algorithm in 2002 [121]. Since even subquadratic algorithms are computationally too expensive for interactive analysis of large data sets, the MDSteer algorithm that refines selected subregions of the MDS plot was presented as a compromise [183].

In contrast to PCA and MDS methods, the goal of the method presented in this chapter is not only to plot high-dimensional data points in such a way that high-dimensional clusters are preserved in the two-dimensional visualization, but also to convey information about the characteristics of dimensional values of particular data points. This goal is achieved by introducing a few fixed nodes in a force-directed spring embedder graph layout [53]. In the proposed layout, each dimension is modeled as a vertex named dimension node and each host in a particular time interval as a vertex named observation node. Thereby, dimension nodes are fixed a priori and the position of each observation node is determined by traffic measurements, which are expressed through attraction forces between the respective node and the dimension nodes. Furthermore, repulsion forces between nodes in high density regions indirectly result in the assignment of more display space to these interesting data regions.

Figure 8.1: Normalized traffic measurements defining the states of each network entity (host A or host B) for the intervals 1 to 6. Traffic types in the legend: SSH, FTP, DNS, HTTP, Undefined, IMAP, SMTP.

8.2 Graph-based monitoring of network behavior

In this section we deal with the problem of monitoring the behavior of hosts or higher level network entities through their network traffic. The goal of the proposed visualization is to effectively discover changes in the network entities' behavior by comparing their states over time. This kind of analysis is mostly centered on a set of monitored hosts within the administrated local area network. While there are many attacks from outside the network, it is difficult to see when these attacks are successful. Our threat model therefore assumes that hosts that suddenly experience drastic changes in their network traffic could have been hacked and therefore pose a serious threat to the network infrastructure. In particular, the internal network might be at risk since the hacked computer can be used to bypass security measures such as a central firewall.

Figure 8.1 shows the states of host A and host B at time intervals from 1 to 6 by calculating the normalized traffic proportions for each type of traffic within the interval. We interpret these states as points in a high-dimensional space (one dimension per traffic type). Although the charts show all the relevant information, their scalability is limited since displaying such detailed information for many hosts and time intervals would make it difficult to keep the overview. To account for the above deficiency, we represent each network entity in a two-dimensional plot through several connected points, which all together compose the entity's trace. Both color and shape are used to make entities distinguishable from one another.
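The state of an entity, that is, the normalized traffic proportions per interval, can be computed as in the following sketch. The function name `host_states` and the record layout `(host, interval, traffic_type, bytes)` are assumptions chosen for illustration; the dissertation stores and aggregates the data in a relational database instead.

```python
from collections import defaultdict


def host_states(records, traffic_types):
    """Return {(host, interval): [proportion per traffic type]}.

    Each state vector sums to 1 (or is all zeros if no bytes were seen) and
    can be read as a point in a space with one dimension per traffic type.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for host, interval, traffic_type, nbytes in records:
        totals[(host, interval)][traffic_type] += nbytes
    states = {}
    for key, per_type in totals.items():
        volume = sum(per_type.values())
        states[key] = [
            per_type.get(t, 0.0) / volume if volume else 0.0
            for t in traffic_types
        ]
    return states


if __name__ == "__main__":
    sample = [
        ("host A", 1, "HTTP", 300),
        ("host A", 1, "SSH", 100),
        ("host B", 1, "SMTP", 50),
    ]
    print(host_states(sample, ["HTTP", "SSH", "SMTP"]))
```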
Each node represents the state of a single network entity for a specific time interval, and its position is calculated from the entity's state during that interval. We basically map a high-dimensional space onto a distorted two-dimensional space. If the nodes for one entity are not in the same place, the entity's state has changed from one time interval to the other. This leads to some intended effects, which help to visually filter the image. Entities that do not change form small clusters or might even be visible only as a single point, whereas entities that have changed reveal visible trails, either locally or across the view. These long and prominent lines, which represent considerable changes in the behavior of a network entity, are more likely to catch the user's attention.

To be able to visualize more than two dimensions in a two-dimensional plot, we use a force-directed layout approach to approximate distance relationships from high-dimensional space in 2D. Each data dimension is represented by a dimension node. In the first step, the layout of these nodes is calculated. Although arbitrary layouts are possible for positioning these dimension nodes, the current implementation uses a circular force-directed layout to distribute the nodes over the available space. This chosen layout now defines the distortion of the projected space. After fixing the positions of the dimension nodes, the observation nodes are placed in the plane and connected to their corresponding dimension nodes through virtual springs. All observation nodes of the same entity are also tied together with virtual springs. The forces are calculated in an iterative fashion until an equilibrium is approximated. Figure 8.2 sketches the layout calculation for the two hosts from the previous figure. The analyst can now trace the state changes of each host at all intervals.

Figure 8.2: Coordinate calculation of the host position at a particular point of time (the final graph layout is calculated using a force-based method). Figure labels: dimension nodes (FTP, SSH, SMTP, DNS, IMAP, HTTP, Undefined), observation nodes, attraction forces, traces, host A, host B.

Fine-tuning the graph layout with respect to trace visibility is done by attaching additional attraction forces to the trace edges, which are then taken into consideration during layout calculation. To visually highlight the time dependency of the nodes, we mapped time to the alpha value of the connecting traces: older traces gradually fade out while newer ones are clearly visible. For many analysis scenarios, not only traffic proportions matter, but also absolute traffic measures.
In other words, the graph layout will assign almost the same position to two nodes that each have 50% IMAP and 50% SMTP traffic, ignoring the fact that one of them has transferred several megabytes whereas the other one transferred only a few bytes. Therefore, we decided to vary node size according to the absolute value of the traffic measure (normally, the sum of the transferred bytes), using logarithmic scaling to compensate for large variations in traffic volumes.
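The placement scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names (`circle_layout`, `node_size`, `layout`), all parameter values, and the simple linear springs with inverse-distance repulsion are hypothetical choices made for the example.

```python
import math
import random


def circle_layout(names, radius=1.0):
    """Place dimension nodes evenly on a circle; they stay fixed afterwards."""
    n = len(names)
    return {name: (radius * math.cos(2 * math.pi * i / n),
                   radius * math.sin(2 * math.pi * i / n))
            for i, name in enumerate(names)}


def node_size(total_bytes, base=2.0):
    """Logarithmic scaling, so large transfers do not dwarf small ones."""
    return base * math.log10(1.0 + total_bytes)


def layout(observations, dims, iterations=200, step=0.05,
           cohesion=0.5, repulsion=0.001, seed=0):
    """Return one (x, y) position per observation node.

    observations: list of (entity, interval_index, {dimension: proportion}).
    dims: fixed dimension node positions, e.g. from circle_layout().
    """
    rng = random.Random(seed)
    pos = [(rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1))
           for _ in observations]
    for _ in range(iterations):
        new_pos = []
        for i, (entity, t, weights) in enumerate(observations):
            x, y = pos[i]
            fx = fy = 0.0
            # Springs to the dimension nodes, weighted by traffic proportion.
            for dim, w in weights.items():
                dx, dy = dims[dim]
                fx += w * (dx - x)
                fy += w * (dy - y)
            for j, (other, u, _) in enumerate(observations):
                if i == j:
                    continue
                ox, oy = pos[j]
                # Springs between consecutive observations of the same entity.
                if other == entity and abs(t - u) == 1:
                    fx += cohesion * (ox - x)
                    fy += cohesion * (oy - y)
                # Repulsion keeps distinct nodes from collapsing onto a point.
                d2 = (x - ox) ** 2 + (y - oy) ** 2 + 1e-9
                fx += repulsion * (x - ox) / d2
                fy += repulsion * (y - oy) / d2
            # Forces are applied as displacements, seeking a static equilibrium.
            new_pos.append((x + step * fx, y + step * fy))
        pos = new_pos
    return pos
```

With four dimension nodes on the unit circle, a host whose traffic is entirely SMTP settles next to the SMTP node, while a host with equal IMAP and SMTP shares settles near the midpoint between the two, regardless of how many bytes it transferred. That is why a separate size channel such as `node_size` is needed for absolute volume.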

Layout calculation

The weights of the attraction edges of each observation node represent the proportions of the employed application protocols within the network traffic in a particular time interval. The first node of host B in Figure 8.2, for example, is only connected to the SMTP dimension node. Since node positions are calculated step-wise using a spring-embedder graph layout and since all observation and dimension nodes push each other away due to additional repulsion forces, a consistent graph layout is generated in which each node has a unique position.

We used the spring embedder algorithm to calculate the forces between the nodes [53]. The force calculation follows the idea of a physical model of atomic particles that exert attractive and repulsive forces depending on their distance. While every node repels other nodes, only those nodes that are connected by an edge attract each other. It is important to note that the forces calculated by this algorithm result in speed, not acceleration as in physical systems. The reason is that the algorithm seeks a static equilibrium, not a dynamic one. There are several other algorithms that could solve our layout problem, such as the force-directed algorithm in [42], its variant in [78], and the simulated annealing approach in [38]. We chose the Fruchterman-Reingold algorithm for its efficiency, speed, and the robustness of its force and iteration parameters. As weighted edges were needed, we extended the Fruchterman-Reingold implementation of the JUNG [129] graph drawing library to support additional factors on the forces.

Implementation

To build a flexible and fast analysis system, we relied on the relational database PostgreSQL [136]. Data loading scripts extract the involved IP addresses along with port numbers, the transferred bytes, and a timestamp from tcpdump files and store them in the database.
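The edge weights described under Layout calculation can be derived from such records as shown in the following sketch. It is only an illustration: the record layout `(timestamp, host, port, bytes)`, the function name `edge_weights`, the 600-second default interval, and the port-to-name mapping are assumptions made for the example, and the database-backed implementation is not reproduced here.

```python
from collections import defaultdict


def edge_weights(records, interval_seconds=600, port_names=None):
    """Compute per-protocol traffic proportions for each observation node.

    records: iterable of (timestamp_seconds, host, port, n_bytes).
    Returns {(host, interval_index): {dimension: proportion}}.
    """
    port_names = port_names or {}
    volume = defaultdict(lambda: defaultdict(int))
    for ts, host, port, n_bytes in records:
        interval = int(ts // interval_seconds)
        # Ports without a known name fall into a catch-all dimension.
        dim = port_names.get(port, "Undefined")
        volume[(host, interval)][dim] += n_bytes
    weights = {}
    for key, per_dim in volume.items():
        total = sum(per_dim.values())
        if total > 0:
            # Each weight is the share of the interval's bytes on one protocol.
            weights[key] = {d: b / total for d, b in per_dim.items()}
    return weights
```

A host that only sends SMTP traffic in an interval receives a single edge of weight 1.0 to the SMTP dimension node, which corresponds to the situation of the first node of host B in Figure 8.2.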
To speed up query execution, traffic with identical IPs and ports can be aggregated into 10-minute intervals in a new fact table. The actual behavior graph application is implemented in Java.

User interaction

Since node positions depend on the traffic occurring in the respective time interval and on the repulsion forces of nearby nodes, our visualization gives only an approximation of the actual load situation. Furthermore, estimating traffic proportions from node positions becomes difficult or even impossible due to the ambiguity introduced by projecting multi-dimensional data points into 2D. Figure 8.2, for example, shows that hosts A and B have almost the same position in the sixth interval although their traffic load is completely different (see Figure 8.1). This can happen because several sets of traffic loads are mapped to the same 2D location. We resolve this ambiguity through user interaction: moving the mouse over a node triggers a detail view. Alternatively, dimension nodes can be moved by dragging to estimate their influence on a particular node or a group of nodes.

Figure 8.3 shows a behavior graph of 33 prefixes over a time span of one hour, in which the user selected three prefixes to trace their behavior. Interaction is used as a means to retrieve traffic details for a particular node (bar chart in the middle), and the configuration panel on the right allows for fine-tuning the layout. A simple click on a dimension node highlights all observation nodes containing the respective traffic. This highlighting is realized by showing all normal nodes in grayscale and the highlighted ones in color. Using the configuration panel, further dimension nodes and observation node groups can be added to or removed from the visualization. Because our application was designed to be useful for a variety of analysis scenarios, the user can flexibly choose the attributes representing attraction nodes and observation node groups depending on the attributes available in the considered data set. To abstract from the technical details, the user can simply select from the available data attributes in the two drop-down menus shown in Figure 8.3.

Figure 8.3: Host behavior graph showing the behavior of 33 prefixes over a timespan of 1 hour

In addition to the above manipulation options, the configuration panel provides four sliders:

1. Movement accentuation slider highlights suspicious hosts with highly variant traffic. Further details of this feature are given in Section 8.4.
2. A slider controlling the number of observation nodes by increasing or decreasing the time intervals for aggregating traffic. Changing the granularity of time intervals is a powerful means to remove clutter (fewer nodes due to larger time intervals) or to show more details (more nodes) in order to understand traffic situations better.
3. A slider for fine-tuning the strength of inter-object forces. Since each distinct node represents the state of a particular host during a time interval, we use edges to enable the user to trace a node's behavior over time. However, following these edges can become a challenge since nodes can end up in widely varying places. In order to make these observation node groups more compact, additional attraction forces can be defined on neighboring nodes of a chain and adjusted with the given slider. Figure 8.4 demonstrates the effect of changing the forces.
4. Last but not least, the attraction forces between observation and dimension nodes play an important role in ensuring the graph's interpretability. Attraction forces that are too strong result in dense clusters around the dimension nodes, whereas attraction forces that are too weak make traffic proportions ambiguous, since repulsion forces between observation nodes push some nodes closer to unrelated dimension nodes.

Figure 8.4: Fine-tuning the graph layout through cohesion forces between the nodes of each host for improving the compactness of traces. (a) Low inter-object forces. (b) High inter-object forces.

While adjusting the first slider merely re-colors the observation nodes and their traces, changing the lower three sliders changes the calculation of the graph layout. In particular, the time interval slider affects the number of observation nodes and therefore triggers a database query to determine the dimensional values of each observation node. Since the relationships between the old and the new observation nodes are difficult to figure out, the layout is recomputed from scratch in this case, whereas changes of the inter-object and dimension forces can be handled by taking the current graph, adjusting the forces, and realigning the node positions in a few iterations of the spring embedder layout.

8.3 Integration of the behavior graph in HNMap

In Section 5 we presented the HNMap as a hierarchical view on the IP address space. Hosts are grouped by prefixes, autonomous systems, countries, and continents using a space-filling hierarchical visualization.
This scalable approach enables the analyst to retrieve the details of a quantitative measure of network traffic leaving and entering the hosts in the visualization at the above-mentioned aggregation levels. While this approach takes the IP address dimension as well as a user-defined quantitative traffic measurement into account, the behavior graph focuses on behavioral changes, which are usually reflected in the dimensions of the data sets orthogonal to the IP dimension, e.g., application port or type of IDS alert. The combination of these two visualization modules promises a more comprehensive view of these complex data sets.

Figure 8.5: Integration of the behavior graph view into HNMap

Figure 8.5 shows the HNMap view drilled down to the AS level. Through a pop-up menu, a behavior graph for any one of the shown AS rectangles can be displayed. Since the detailed information necessary for building the behavior graph is available for all hierarchy levels, the user is free to choose the appropriate one. Thereby, the behavior of the selected node can be analyzed at any granularity by showing the node's constituent child or descendant nodes in the graph. Note that only the two lowest hierarchy levels are available in the menu because the selected node (red node at the upper left corner of the pop-up menu) is an AS node. Higher-level behavior graphs can be triggered from coarsely grained nodes, e.g., in country- or continent-level HNMap views.

While the behavior graph on prefixes, ASes, countries, or continents represents less detailed information about particular substructures of the Internet, it has proven beneficial since the aggregated views significantly reduce the information overload a network administrator faces when dealing with large-scale network traffic monitoring. Hence, finding the relevant subset using HNMap in combination with aggregation in detail views can be seen as a possible way to achieve scalability. For example, visualizing the behavior of all prefixes within a particular AS can be understood better than visualizing the behavior of all available prefixes at once.

8.4 Automatic accentuation of highly variable traffic

In the behavior graph, clusters immediately stand out. However, to minimize the threat of misbehaving hosts within the administrated network, the analyst is in many scenarios interested in nodes with highly variable traffic or, in other words, in nodes that jump from one position to another in the visualization. This corresponds to the assumption in our threat model that we can see changes in the behavior of hacked computers by analyzing their network traffic. Since our visualization spans an n-dimensional metric space, it is possible to calculate the normalized positional changes $pc_{norm}$ of all subsequent observations $o_t$ ($1 \le t \le t_{max}$) of a host in this metric space:

$$o^r = \frac{o}{\|o\|} \qquad (8.1)$$

$$pc_{norm} = \frac{\sum_{t=1}^{t_{max}-1} \left\| o_t^r - o_{t+1}^r \right\|}{t_{max}-1}, \qquad 0 \le pc_{norm} \le 2 \qquad (8.2)$$

Note that the relative position $o^r$ of an observation node needs to be calculated first as the graph layout tries to place nodes with identical relative positions close to each other. After calculating $pc_{norm}$ for every node in each observation group, the groups with the highest values are accentuated. The bounds of $pc_{norm}$ follow from the fact that the maximum difference between two vectors $o_t^r$ of length 1 is 2.

To demonstrate the capabilities of our tool in a reproducible way, we used traffic data from our university network. In particular, we loaded all netflows passing the university gateway into the database and aggregated their traffic on the internal /24-prefixes. A database query calculates the data for each node and loads it into the visualization. Figure 8.6 shows the behavior of 96 /24-prefixes in the data set with time intervals of 12 minutes. Note that nodes with the highest variations in their traffic are automatically accentuated in accordance with the outcome of our calculations. Likewise, these calculations can be used as a filter to remove nodes with no significant changes over time.
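The positional change measure can be computed directly from the traffic vectors of a host. The following sketch is an illustration under one explicit assumption: it uses the Euclidean norm both for the normalization and for the distance between consecutive observations, which the text does not spell out. The function names are hypothetical.

```python
import math


def relative_position(o):
    """Scale an observation vector to length 1 (Euclidean norm assumed)."""
    norm = math.sqrt(sum(v * v for v in o))
    if norm == 0:
        raise ValueError("observation without traffic has no relative position")
    return [v / norm for v in o]


def pc_norm(observations):
    """Mean distance between consecutive relative positions of one host."""
    if len(observations) < 2:
        return 0.0
    rel = [relative_position(o) for o in observations]
    total = sum(math.dist(a, b) for a, b in zip(rel, rel[1:]))
    return total / (len(rel) - 1)
```

Because each vector is normalized first, a host that only scales its traffic volume up or down without changing the protocol mix obtains a value of 0, whereas a host that switches completely from one protocol to another between two intervals contributes a distance of $\sqrt{2}$ for that step under the Euclidean assumption.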
Since this filtering step reduces the number of observation groups, less computational power is needed for the layout calculations of the remaining nodes.

8.5 Case studies: monitoring and threat analysis with behavior graph

In this section we present two case studies validating the design of the behavior graph. The first case study deals with gaining an overview of network traffic, thereby identifying groups of hosts with similar behavior as well as abnormal behavior of individual hosts. In the second case study, alerts from an IDS are visualized in order to further explore, relate, and disseminate security-relevant data.

Figure 8.6: Automatic accentuation of highly variable /24-prefixes using 1 hour traffic from the university network

Case study I: Monitoring network traffic

In this case study, we used a data set containing the traffic of the most active hosts of our university network within a time frame of one day. The IP addresses of both the internal hosts and their external communication partners were anonymized to protect the privacy of the network users. In the absence of a packet inspection sensor, we relied on the used application ports in order to infer the application program from the network traffic. Since most registered applications use small port numbers, we restricted our analysis to application ports below [...]. Due to the difficulty of determining for each packet which one of the two hosts initiated the connection, the application port cannot always be identified with absolute confidence. Despite the fact that many applications use dynamic or private ports ranging from [...] to [...] as the source port of their network communication, on which they will later wait for the replies, we did not automatically determine the application port by choosing the smaller port number between the source and the destination port since this would introduce additional uncertainty to the behavior graph. As a result, many observation nodes are strongly attracted by the unspecified node.

Figure 8.7 shows the 50 most active nodes in the data set between 12 and 18 hours, consolidated into 10-minute intervals. This results in 1715 observation nodes, which comes close to the scalability limit of our visualization since both layout calculation and interactions causing layout changes become slow and tedious. Despite these drawbacks, observation node groups can be interactively highlighted in order to track their behavioral changes.
The generated visualization displays several prominent clusters, such as the cluster of university employees who use the Network File System (NFS) protocol or the cluster of Oracle database hosts. The largest cluster represents network traffic of standard applications, such as DNS, HTTP or the unspecified dimension nodes. Apart from these clusters, host details can be easily tracked, e.g., the green host strongly attracted by the SMTP node is probably the central outgoing mail server.

Figure 8.7: Overview of network traffic from the University of Konstanz between 12 and 18 hours (showing 50 hosts with the highest traffic volume). Annotated clusters: Standard (DNS, HTTP and Unspecified), NFS of employees, Oracle DBs.

The same behavior analysis can be conducted for the night time. As shown in Figure 8.8, fewer hosts are active between 0 and 6 hours of the same day; 10 of the top 50 daily hosts are not active at all. There are some behavioral changes during night time, such as the nightly backups, which result in movements towards the backup node and back again, as seen on some hosts from the large cluster as well as on the green nodes of the DNS server. Furthermore, there is an increase of Oracle database traffic of both the green and the blue server. In general, specialized servers have less variant traffic since they predominantly serve a very limited range of application ports. Because of the large behavioral changes of the Oracle servers in the early morning, these two hosts are among the three hosts with the highest change according to their $pc_{norm}$ value, as demonstrated in Figure 8.9(a).

In general, there are two different approaches to investigating suspicious traffic with the behavior graph. The first one is to use the automatic accentuation feature presented in Section 8.4, which is shown in Figure 8.9(a). A slider is used to adjust the number of colored hosts to be inspected in detail. This approach is based on the behavioral changes defined in the high-dimensional space, whereas the second approach is based on the observations in the low-dimensional visualization. As demonstrated in Figure 8.9(b), clicking on a particular node causes the associated observation node group to become colored, whereas the remaining nodes and their traces are shown in gray.
In this exploration approach, the analyst is more likely to pick out hosts that visually stand out not only due to their highly variant traffic, but also due to their prominent position outside of common clusters, such as the blue and the green hosts, which correspond to two employees working at night time. A double click on the background either colors all nodes or sets them back to gray. A single click colors a node group or sets it back to gray depending on the state of the group before the click.

Figure 8.8: Nightly backups and Oracle DB traffic in the early morning. Annotations: Nightly backup of the DNS server; Increasing Oracle DB traffic at 6 AM; Nightly backup; Return to normal operations.

Figure 8.9: Investigating suspicious host behavior through accentuation. (a) Three hosts with the largest behavioral changes. (b) Two employee machines active at night time.

Having demonstrated the capability of the behavior graph to visualize host behavior of raw network traffic in a monitoring scenario, we proceed to the next case study, which demonstrates this technique's applicability to the analysis of large amounts of IDS alerts.

Case study II: Analysis of IDS alerts

For this case study, we evaluated the alerts generated by a SNORT intrusion detection sensor within a subnet of the university network from January 3 to February 26, 2008. The alerts referred to 684 hosts that scanned and attacked the network or generated suspicious network traffic. The attraction nodes in this case were initialized not with the application port numbers, but rather with the 15 most prominent SNORT alerts of our data set and an unspecified traffic node for the remaining 19 rarely occurring alerts. Figure 8.10 shows the outcome of the behavior graph, in which each node represents the aggregated observations for a particular host within 3 days. Larger nodes indicate a higher number of alerts, thus helping to quickly identify the most active hosts. Colors and shapes make nodes of different observation groups more distinguishable. For explanatory purposes, some observation groups were highlighted manually.

Figure 8.10: Evaluating SNORT alerts recorded from January 3 to February 26, 2008

While some clusters can be clearly seen, long colored straight lines immediately stand out. These lines can be explained either through alternating scanning or attack patterns, such as the alternating pattern at the top left between ICMP PING CYBERKIT 2.2 WIN and ICMP PING, or through a transition from one attack pattern to another. For example, it is quite common to first scan the target network with various pings or ICMP echo replies in order to infer the operating systems of the victims from the results. Afterwards, more specialized attacks that exploit security leaks of the victims' operating systems can be launched. In general, our visualization reveals correlations of alerts to the security analyst through these connecting lines and thereby supports the analyst in gaining an overview of thousands of security alerts.

During interactive exploration, we discovered that the clusters were distinguishable by internal and external hosts since they generated quite dissimilar alert patterns. We therefore created an internal and an external view as shown in Figure 8.11. Especially router advertisements seem to be an internal problem, which might be partially caused by the fact that the SNORT sensor does not explicitly exclude the internal routers and therefore logs any advertisement as a possible intrusion. Furthermore, some internal nodes (see Figure 8.11(b)) are considerably larger since they trigger more alerts. In particular, the large green host on the left triggered more than 4000 alerts in each of the 3-day intervals.

Figure 8.11: Splitting the analysis into internal and external alerts reveals different clusters. (a) Behavior of external hosts. (b) Behavior of internal hosts.
A closer examination revealed that the host is monitored externally and replies to three ICMP echo requests every 3 minutes. But not only large numbers of alerts are interesting to observe. Small clusters like the one on the left near the ICMP DESTINATION UNREACHABLE PORT alert might indicate hosts that silently scan the internal network infrastructure, especially when a host is represented in more than one time interval, like the red node group.

Figure 8.12: Analysis of SNORT alerts recorded from January 21 to 27, 2008. (a) Behavior of attacking hosts. (b) Behavior of victim hosts.

To further evaluate the behavior graph, we loaded a subset of a large data set containing SNORT alerts, which were generated in the week from January 21 to 27, 2008. Figure 8.12(a) shows the behavior of the attacking hosts by focusing the analysis on the source addresses. There are relatively few traces due to the choice of rather long 24-hour time intervals. Traces are therefore only drawn if a host generated alerts on two different days. Figure 8.12(b) shows more traces since this behavior graph monitors local hosts within the network and these hosts get attacked over and over again. This second plot is strongly biased by the ICMP PING node on the upper right because almost all hosts in the network are scanned daily. While the attacking hosts in Figure 8.12(a) form relatively homogeneous clusters, Figure 8.12(b) is dominated by the large cluster, which is linked to several outlier nodes. Having detected potential victims via a network scan using an ICMP PING, the attackers employ more specialized methods to find the victim's vulnerabilities.

It is interesting to observe that, for example, the large red node in (b) is scanned between 2800 and 4300 times per day using up to five different ping or echo reply methods. Due to the similarity of these attacks, this attack pattern also forms a cluster in (a) (left of the center), which consists of large heterogeneous nodes. Having inspected these nodes in detail, we realized that they originate from the same IP address range, which is an indication of a monitoring script that checks those hosts' availability on a regular basis. Note that some providers do not assign static IP addresses, thus forcing their clients to change the IP address every 24 hours. Due to the long time intervals of 24 hours, we observe no traces within this cluster.

8.6 Evaluation

To evaluate our tool, we measured the performance for laying out 100 to 1000 nodes on a dual 2 GHz PowerPC. While conducting the performance measurements, we realized that the second CPU stayed almost completely unused, which indicates that additional performance can be gained by parallelizing the algorithm. Since we used the Fruchterman-Reingold algorithm to calculate the graph layout, the time complexity is $\Theta(|V|^2)$ for each of the 100 iterations. Additional computational resources are used for drawing the preliminary results after each iteration. The resulting performance charts in Figure 8.13 show the expected quadratic runtime. Although real-world data was used in the experiments, the number of edges was proportional to the number of nodes by a factor of approximately 2.5. At this point we have to acknowledge the existence of faster algorithms for calculating the graph layout, for example, in the GraphViz software [46]. However, for fast implementation of our prototype we preferred an algorithm from the JUNG framework [129], which was written completely in Java.

Our tool can calculate a stable layout within approximately seven seconds for 100 observation nodes and within 2 minutes for 1000 observation nodes. The number of actual observation nodes depends on the number of monitored network entities, the time interval over which the data is aggregated, and the monitored time span. Each one of these can be seen as a factor in estimating the number of observation nodes (e.g., monitoring 50 hosts over six 10-minute intervals results in 300 or fewer observation nodes). Above 1000 observation nodes, layout calculation becomes tedious, and fine-tuning layout parameters such as the host cohesion force, the repulsion force, and the dimension force turns into a challenge of its own.
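The estimate of the number of observation nodes can be written down as a small worked calculation. The helper names below are hypothetical and serve only to make the arithmetic explicit. The pair count reflects the fact that every node repels every other node in each iteration; it says nothing about drawing overhead or constant factors.

```python
import math


def max_observation_nodes(entities, time_span_minutes, interval_minutes):
    """Upper bound: one node per entity and interval (inactive ones drop out)."""
    intervals = math.ceil(time_span_minutes / interval_minutes)
    return entities * intervals


def repulsion_pairs(nodes):
    """Number of node pairs whose repulsion is evaluated per iteration."""
    return nodes * (nodes - 1) // 2
```

For 50 hosts monitored over one hour in 10-minute intervals, `max_observation_nodes(50, 60, 10)` yields the bound of 300 nodes mentioned above. Growing from 100 to 1000 nodes raises the number of repelling pairs per iteration from 4950 to 499500.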
In our case study, we used up to 15 data dimensions and one dimension to aggregate the remaining sparsely populated dimensions of the data sets. Although this might be appropriate for many data sets, there is definitely a limit to the number of used dimensions. With less sparse high-dimensional data, interpretation of the projected data points becomes more challenging. However, there are various possibilities to aggregate the data prior to analyzing it with the plot: 1) several correlated dimensions, such as different kinds of ping, router advertisements, etc., could be aggregated into a single dimension, 2) the time interval can be adjusted to reduce the number of observation nodes in order to fit within the scalability limits of our visualization, as small time intervals generate many observation nodes, whereas the opposite applies to long intervals, 3) user-defined filters can remove considerable amounts of well-studied parts of the data.

Figure 8.13: Performance analysis of the layout algorithm (two charts plotting runtime in ms against the number of nodes and against the number of edges; tabulated layout runtimes in seconds: 7.2, 12.9, 20.6, 29.3, 39.4, 54.7, 65.8, 80.9, 102.1, 122.1)

8.7 Summary

In the scope of this chapter, we discussed a novel network traffic visualization metaphor for monitoring host behavior. The technique is called the behavior graph and uses an adaptation of the force-driven Fruchterman-Reingold graph layout to place host observation nodes with similar traffic proportions close to each other in a two-dimensional node-link diagram. Various means of interaction with the graph make the tool suitable for exploratory data analysis. Nodes with specific traffic or the temporal behavior of a few chosen hosts can be highlighted. Furthermore, the analyst can move the dimension nodes and thereby sort the observation nodes according to his own criteria. Since the behavior graph can be used to analyze both low-level host behavior and the behavior of more abstract network entities, we integrated it into the HNMap tool, from which it can be triggered through a pop-up menu on network entities of various granularity levels (e.g., hosts, prefixes, ASes). To enhance our tool with a visual analytics feature, we introduced a normalized measure for positional changes in n-dimensional metric space to automatically accentuate suspicious hosts with highly variant traffic. The usefulness of the tool was demonstrated in two case studies using traffic measurements from our university network and IDS alerts from a SNORT sensor.
Thereby, it was shown that the tool is useful for typical traffic monitoring scenarios as well as for analysis of intrusion detection alerts. In the evaluation section, the scalability limitations of the visualization with respect to runtime performance were discussed.

Future Work

We noticed that enabling the user to select a certain point in time for visualization would provide additional functionality to the tool. One possibility would be to use a histogram of the amount of traffic over time. The user could then select an interval on this histogram to view the traffic. Another interesting possibility would be the option to visualize network flows in realtime with a sliding time window starting at the present and extending to some time in the past. As our layout is calculated iteratively, realtime visualization should be possible with a standard processor.

Another direction for further work is the integration of an option to automatically select interesting dimensions. For data sets with very high dimensionality, the view gets cluttered. As our technique focuses on a general view, it would make sense to use methods like PCA to eliminate dimensions that have only a minor effect on the resulting visualization layout.


9 Content-based visual analysis of network traffic

"Exploration really is the essence of the human spirit, and to pause, to falter, to turn our back on the quest for knowledge, is to perish." Frank Borman

Contents
9.1 Related work on visual analysis of communication
9.2 Self-organizing maps for content-based retrieval
    Use cases
    Feature Extraction
    SOM generation
9.3 Case study: SOMs for classification
Summary
Future work

While the previous chapters mainly dealt with the analysis of network traffic meta data, this chapter's focus is on the analysis of the actual content of network communication. For this purpose, we consider electronic mail, which has become one of the most widely used means of communication. While mailing volumes have shown high growth rates since the introduction of email as an Internet service and considerable work has been done to improve the efficiency of email management, the effectiveness of email management from a user perspective has not received a comparable amount of research attention. Typically, users are given little means to intelligently explore the wealth of cumulated information in their email archives. In this chapter, we extend our framework with a visualization module based on Self-Organizing Maps (SOMs) [90] generated from a term occurrence descriptor. We apply this module to an email archive and enhance the functionality of an email management system by offering powerful visual analysis features.

The rest of this chapter is structured as follows. In Section 9.1, related efforts on enhancing visual analysis of email communication are discussed. The concept of self-organizing maps is introduced in Section 9.2 by presenting use cases, demonstrating tf-idf feature extraction on emails, and giving an intuition about how SOMs are generated. Section 9.3 shows in a case study how the SOM can be used to explore emails classified according to spam and ordinary email. Finally, we sum up our contributions and possible directions for future work in the concluding section.

9.1 Related work on visual analysis of communication

The main reason for improving the graphical user interface of email applications is that the success and popularity of email communication have led to high daily volumes of messages being sent and received, resulting in overload situations where important messages get overlooked or lost in archives [180]. Although usage of email has changed significantly over the years, since nowadays many people use it to manage their appointments and tasks, for file exchange, and as a personal archive, email clients have stayed very much the same [41]. Several research groups felt motivated to propose novel features for email applications to make email clients more adequate for these tasks.

Mandic and Kerne, for example, recognized that email communication is actually a diary we were never aware we were keeping [112] and, therefore, regard personal email archives as a potential source of valuable insight into the structure and dynamics of one's social network. Acknowledging the fact that traditional email interfaces have undergone little evolution despite their intensive usage, they propose famailiar, a visualization interface for revealing intimacy and rhythm of personal email communication to the user. The system combines user-defined categorization of contact intimacy with message intimacy, computed using the presence of intimate and anti-intimate syntagmata, in order to visualize communication patterns of email messages through glyphs in a calendar view.

The IBM Remail project also focused on chronological aspects of email communications with Thread Arcs, which visualize the reply patterns of communication threads in a compact way [88]. Apart from this key feature, the prototype was designed to integrate several sources, such as chat communication, news, calendar events, reminders, etc., into one communication platform.
Through the scatter-and-gather feature, the inbox can be quickly reduced to the latest messages of ongoing communication threads. So-called collections can be used to order the contents of email communication according to user-defined preferences without moving the messages out of the inbox. In particular, the pivoting interface allows rapid switching between different views of the inbox, collections, and the to-do list without losing context. Microsoft also attempted to innovate email interfaces with the Social Network and Relationship Finder (short: SNARF) [126]. The prototype aggregates social meta data about email correspondents to aid email triage, which is the process of viewing unhandled email and deciding what to do with it. In the prototype, emails can be sorted according to several social metrics that capture the nature and the strength of the relationship between the user and each correspondent. In addition to exploring relationships between users and groups of users, the Email Mining Toolkit (EMT) can be used to investigate chronological flows of emails for detection of misuse, such as virus propagation and spam spread [104]. The tool comes with a clique panel for visualizing the relationships in a circular node-link diagram, which is extended through concentric rings to depict the time dimension. Besides a node-link diagram for social network exploration called Social Network Fragment, the tool by Viégas et al. also offers a temporal visualization of email communication named PostHistory, which uses a calendar metaphor to visualize overall email volumes as well as highlighted communication from user-selected contacts [174]. To facilitate interactive management of large volumes of email, we investigated techniques for visualizing temporal and geo-related attributes of email archives in previous work [82]. These techniques were based on a recursive pattern pixel visualization for displaying temporal aspects and on a map distortion technique to visualize the distribution of emails according to their geographic origin.

In contrast to the methods reviewed above, which all visualize structured attributes from the email headers (e.g., sender, date, and in-reply-to fields) or other meta data (e.g., geography), this chapter focuses on the analysis of the unstructured text content of email messages. Themail, for example, is a typographic visualization of an individual's email content over time [175]. The tool displays the most frequently used terms as yearly words in a large font in the background and a more detailed selection of monthly words in a small font in the foreground, where those terms are chosen according to their frequency and distinctiveness using an adapted tf-idf feature extraction approach. A recent approach to visualizing emails in a self-organizing map [128] is probably closest to our own research. The authors propose an externally growing SOM with the aim of providing an intuitive visual profile of the considered mailing lists and a navigation tool in which similar emails are located close to each other. While both their approach and ours use tf-idf feature vectors, the focus of the former is to adapt the map to the distribution of the underlying data by a growing process in order to avoid the time-consuming retraining process. Our approach, in contrast, deals with the question of how well additional classification attributes, such as the classification into spam and ordinary email, are preserved within the SOM.
9.2 Self-organizing maps for content-based retrieval

The Self-Organizing Map [90] is a neural network algorithm that is capable of projecting a distribution of high-dimensional input data onto a regular grid of map nodes in a low-dimensional (usually two-dimensional) output space. This projection is capable of (a) clustering the data and (b) approximately preserving the topology of the input data. The algorithm is therefore especially useful for data visualization and exploration purposes. Attached to each node on the output SOM grid is a reference (codebook) vector. The SOM algorithm learns the reference vectors by iteratively adjusting them to the input data by means of a competitive learning process. SOMs have previously been applied in various data analysis tasks. An example of the application to a large collection of text documents is the well-known WebSom project. Several visualization techniques supporting different SOM-based data analysis tasks exist [173]. The U-matrix, for example, visualizes the distribution of inter-node dissimilarity, supporting cluster analysis. Component planes are useful for visualizing the distribution of individual components in the reference vectors in order to support correlation analysis. If the input data points are mapped to their respective best matching map nodes, histograms of map population, such as the distribution of object classes on the map, are possible.

Use cases

Conceptually, we identify several interesting use cases for SOM-based visualization support in an email client:

Classification. Using either automatic or manual methods, the SOM can be partitioned into regions representing different types of email, e.g., spam and non-spam email, business and private mail, and so on. For incoming email, the best matching region can then be identified and the mail can be assigned the label of that region.

Retrieval. The user can search for email messages by mapping a query to the SOM node that best matches the query, followed by exploring the emails mapped to the neighborhood of that node using a technique like the U-matrix or a histogram-based visualization to guide the search.

Organization. The user can employ the SOM generated from his/her email archive to learn about the overall structure of the emails contained in the archive. The user might then create a directory hierarchy for organizing emails that reflects the SOM structure information.

Feature extraction

To obtain feature vectors from email data, we employ a well-known scheme from information retrieval. The n most frequent terms from the subject fields of all emails in the archive are determined after having filtered out irrelevant terms using a stop-word list in order to avoid inclusion of non-discriminating terms in the description. Then, the tf-idf document indexing model [6] is applied, considering each email to be a document titled by its subject field. The model assigns to each document and each of the terms a weight indicating the relevance of the term in the given document with respect to the document collection. By concatenating the term weights for a given document we obtain a feature vector (descriptor) for that document. The tf-idf vectors can be calculated by counting the frequencies of the terms. Usually, the term frequency count is normalized to prevent a bias towards longer documents, which naturally contain terms with higher frequencies due to the overall document length.
Therefore, the following formula is used to calculate the term frequency $\mathrm{tf}_{i,j}$ of term $t_i$ in email $e_j$:

\[ \mathrm{tf}_{i,j} = \frac{n_{i,j}}{\max_k n_{k,j}} \tag{9.1} \]

where $n_{i,j}$ is the number of occurrences of the considered term in $e_j$ and the denominator is the maximum occurrence of any single term in $e_j$. Note that there exist variations of the normalized term frequency, for example, using the sum instead of the maximum function in the denominator. The inverse document frequency $\mathrm{idf}_i$ measures the general importance of term $t_i$ with respect to the whole email collection $E$:

\[ \mathrm{idf}_i = \log \frac{|E|}{|\{e_j : t_i \in e_j\}|} \tag{9.2} \]

Thereby, terms that rarely occur in the whole document collection get high idf values since those terms are characteristic for that collection, whereas terms that occur in almost every document only result in small idf values. The tf-idf value of term $t_i$ in $e_j$ is then calculated as the product of its term frequency and its inverse document frequency:

\[ \text{tf-idf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i \tag{9.3} \]

The tf-idf vector of a document is composed of the respective tf-idf values of all terms.

[Figure 9.1: tf-idf feature extraction on a collection of 100 emails. Panels: two toy example emails, term frequency counts, and tf-idf vectors for the terms "buy", "new", "car", "today", "internet", and "toy". Slide credit: Florian Mansmann, 21st July 2005, CEAS.]

For the two sample emails in Figure 9.1, we first count the term frequencies of the terms "buy", "new", "car", "today", "internet", and "toy". From these values and the number of emails in the collection in which each term occurs, we can calculate first the normalized term frequencies, such as $\mathrm{tf}_{buy,1} = 2/2 = 1$ and $\mathrm{tf}_{internet,2} = 1/1 = 1$, and then the inverse document frequencies $\mathrm{idf}_{buy} = \log(100/13)$ and $\mathrm{idf}_{internet} = \log(100/7)$, given the collection size of 100 emails. Therefore, the relevance of the term "buy" is 0.89 for $e_1$ and that of the term "internet" is 1.15 for $e_2$ (these values correspond to a base-10 logarithm). Note that the term "internet" is more important for the second email than the term "buy" is for the first, since the former term occurs in only 7 emails whereas the latter appears in 13 emails.

On the resulting vectors, several distance measures for calculating the similarity between documents can be used. Baeza-Yates and Ribeiro-Neto [6], for example, propose to use the cosine of the angle between a document vector $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$ and a query vector $q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$, where $|d_j|$ and $|q|$ are the norms of the document and query vector, respectively:

\[ \mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^2}} \tag{9.4} \]

\[ 0 \le \mathrm{sim}(d_j, q) \le 1 \tag{9.5} \]
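As an illustration, the feature extraction steps of this section (stop-word filtering, selection of the n most frequent terms, Equations 9.1 to 9.3, and the cosine similarity of Equation 9.4) can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used for the experiments: the function names, the tiny stop-word list, and the regular-expression tokenizer are hypothetical, and the base-10 logarithm is inferred from the example values 0.89 and 1.15 rather than stated in the text.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; a real list is much longer.
STOP_WORDS = {"a", "the", "you", "why", "don", "t", "get"}


def tokenize(subject):
    """Lower-case a subject line, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", subject.lower()) if w not in STOP_WORDS]


def vocabulary(docs, n):
    """Return the n most frequent terms over all tokenized documents."""
    counts = Counter(term for doc in docs for term in doc)
    return [term for term, _ in counts.most_common(n)]


def tf(doc, terms):
    """Normalized term frequency (Eq. 9.1): count / largest count in the document."""
    counts = Counter(doc)
    max_count = max(counts.values(), default=0)
    if max_count == 0:
        return [0.0] * len(terms)
    return [counts[t] / max_count for t in terms]


def idf(docs, terms):
    """Inverse document frequency (Eq. 9.2); base-10 logarithm is an assumption."""
    n_docs = len(docs)
    weights = []
    for t in terms:
        doc_freq = sum(1 for doc in docs if t in doc)
        weights.append(math.log10(n_docs / doc_freq) if doc_freq else 0.0)
    return weights


def tfidf_vectors(docs, terms):
    """One tf-idf feature vector per document (Eq. 9.3)."""
    weights = idf(docs, terms)
    return [[f * w for f, w in zip(tf(doc, terms), weights)] for doc in docs]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (Eq. 9.4); 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    subjects = ["Buy a new car!", "Cheap internet today", "Why don't you buy a toy car?"]
    docs = [tokenize(s) for s in subjects]
    terms = vocabulary(docs, 6)
    vectors = tfidf_vectors(docs, terms)
    print(terms)
    print(cosine_similarity(vectors[0], vectors[2]))
```

Because all tf-idf weights are non-negative, the similarity computed this way stays within the bounds of Equation 9.5. Returning 0.0 for a zero vector is a design choice of this sketch, so that emails containing none of the selected terms do not cause a division by zero.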

The set of all feature vectors of the email collection serves as input for the SOM generation process. We acknowledge that more sophisticated descriptors can be thought of. Specifically, in addition to body text, email data usually contains a wealth of meta data and attributes which are candidates for inclusion in the description. In this section, we chose to start with a basic feature extractor, leaving the design of more complex descriptors for future work.

SOM generation

Self-Organizing Maps can be seen as a nonlinear projection of feature vectors of any dimensionality onto a usually two-dimensionally arranged set of reference vectors. Assume a set of 16 randomly initialized reference vectors. During the learning phase, the best matching reference vector $m_i$, also denoted the winner neuron ("winner"), is chosen and adapted to the input vector $x$. In the diagrams in Figure 9.2, the angles of the reference nodes represent the values of the reference vectors. The winner neuron was chosen according to the best match with the input vector $x$. In this example, the input vector is identical to the vector inside the dark gray circle in the second diagram. Note that the original reference vector can be found in the respective previous frame. Only minor changes need to be made to the chosen reference vectors since the input vectors were already close to them.

[Figure 9.2: The learning phase of the SOM, with the most similar reference vector (dark gray circle) adapting to the input vector and influencing its neighborhood (light gray). Six frames (top-down, left to right) show the process of transforming the unordered features (first frame) into the ordered features (last frame).]

In addition to the above vector adaptation, the neighborhood function on the topology determines the adjacent map nodes and the degree of their adaptation to $x$. In Figure 9.2, this neighborhood function is represented through gray lines connecting each of the neurons with up to eight neighbors. Note that other, more complicated neighborhood functions are possible. The neurons along with their interconnections form a neural network. In the example, all reference vectors in light gray circles were adjusted from their original position halfway towards $x$. This process is repeated several times with all input vectors and results in smooth and ordered-looking $m_i$ values like the ones shown in the last frame of Figure 9.2. In general, this competitive learning algorithm can be applied analogously to high-dimensional input and reference vectors, resulting in an ordering of the high-dimensional input data. For the generation of the SOM in our experiments, we relied on the SOM_PAK software and rules of thumb for good parameter choices found in the literature [91, 90].

9.3 Case study: SOMs for classification

As a proof of concept for the proposed technique, we present the results of two experiments. We generated a SOM from an archive of emails using the 500 most frequent terms in the subject field for the tf-idf descriptor. All emails were labeled as belonging to either the spam or the non-spam class, as judged by a spam filter in combination with manual classification. By applying the competitive learning algorithm to 108 reference vectors using the tf-idf vectors of the emails as input, we obtained a map layout for an ordering of the high-dimensional input data. Figure 9.3 shows a spam histogram on the generated SOM.
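To make the competitive learning step and the data behind such a spam histogram more concrete, the following Python sketch trains a small SOM and computes, per map node, the fraction of spam among the inputs mapped to it. It is a simplified illustration under stated assumptions and does not reproduce SOM_PAK: the function names, the linearly decaying learning rate, the rule that neighbors receive half of the winner's update, and the fixed eight-neighbor grid are all hypothetical choices.

```python
import random


def best_matching_unit(nodes, x):
    """Index of the reference vector closest to x (squared Euclidean distance)."""
    return min(
        range(len(nodes)),
        key=lambda i: sum((m - v) ** 2 for m, v in zip(nodes[i], x)),
    )


def grid_neighbors(index, rows, cols):
    """Indices of the up to eight grid neighbors of a node on a rows x cols map."""
    r, c = divmod(index, cols)
    return [
        rr * cols + cc
        for rr in range(max(0, r - 1), min(rows, r + 2))
        for cc in range(max(0, c - 1), min(cols, c + 2))
        if (rr, cc) != (r, c)
    ]


def train_som(data, rows, cols, epochs=20, rate=0.5, seed=0):
    """Competitive learning: move the winner and its neighbors towards each input."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Randomly initialized reference vectors, one per map node.
    nodes = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    for epoch in range(epochs):
        alpha = rate * (1.0 - epoch / epochs)  # assumed linear decay
        for x in data:
            winner = best_matching_unit(nodes, x)
            # The winner moves by alpha, its grid neighbors by half of that.
            updates = [(winner, alpha)]
            updates += [(n, alpha / 2) for n in grid_neighbors(winner, rows, cols)]
            for i, a in updates:
                nodes[i] = [m + a * (v - m) for m, v in zip(nodes[i], x)]
    return nodes


def spam_fraction_per_node(nodes, data, labels):
    """Fraction of spam among inputs mapped to each node; None for empty nodes."""
    spam = [0] * len(nodes)
    total = [0] * len(nodes)
    for x, is_spam in zip(data, labels):
        i = best_matching_unit(nodes, x)
        total[i] += 1
        spam[i] += int(is_spam)
    return [s / t if t else None for s, t in zip(spam, total)]


if __name__ == "__main__":
    # Toy two-dimensional vectors standing in for tf-idf descriptors.
    data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
    labels = [False, False, True, True]
    som = train_som(data, rows=3, cols=3)
    print(spam_fraction_per_node(som, data, labels))
```

The same best-matching-unit lookup would also serve the classification use case: a new email could be assigned the majority label of the node it maps to, although how regions are actually labeled is left open here.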
For each map node (gray), the coloring indicates the fraction of spam emails within all emails mapped to the respective node, with shades of red indicating high degrees of spam and shades of blue indicating low degrees of spam (the latter are the "good" regions on the SOM). The coloring of the fields between the nodes is determined by interpolating the values of the adjacent nodes. Clearly, the SOM, which was learned from our basic tf-idf descriptor, is capable of discriminating spam from non-spam emails to a certain degree. Dark red or dark blue nodes indicate good classification results, whereas light red, light blue, or white nodes contain both spam and non-spam emails.

[Figure 9.3: Spam histogram of a sample email archive (shades of red indicate spam)]

The image in Figure 9.4 illustrates the component plane for the tf-idf term "work", with shades of yellow indicating high weight magnitude. Combining both images, we learn that this specific term occurs in emails of both the spam and the non-spam type. The rightmost "work" cluster belongs to the "good" region and comprises university-related emails from a PhD student in our working group.

[Figure 9.4: Component plane for the term "work" (shades of yellow indicate high term weights)]

Interaction capabilities help the user to explore different regions on the SOM. A click on a gray node returns a list view of all emails that were assigned to that particular reference vector. For explanatory purposes, an additional list, containing the tf-idf values of the terms of the reference vector $m_i$ sorted in descending order, can be displayed.

9.4 Summary

In this chapter, we presented a method for extracting text-based tf-idf features from email messages in order to be able to apply content-based visualization techniques. In particular, a self-organizing map (SOM) was created on top of the feature vectors to enable visual exploration of large email collections based on the tf-idf similarity metric from the information