Using Honeypots to Analyse Anomalous Internet Activities


Using Honeypots to Analyse Anomalous Internet Activities

Saleh Ibrahim Bakr Almotairi
Bachelor of Science (Computer Science), KSU, Saudi Arabia, 1992
Master of Engineering (Software Engineering), UQ, Australia, 2004

Thesis submitted in accordance with the regulations for the Degree of Doctor of Philosophy

Information Security Institute
Faculty of Science and Technology
Queensland University of Technology

June 2009


Keywords

Internet traffic analysis, low-interaction honeypots, packet inter-arrival times, principal component analysis, square prediction error, residual space.


Abstract

Monitoring Internet traffic is critical to acquiring a good understanding of threats to computer and network security and to designing efficient computer security systems. Researchers and network administrators have applied several approaches to monitoring traffic for malicious content. These techniques include monitoring network components, aggregating IDS alerts, and monitoring unused IP address spaces. Another method for monitoring and analyzing malicious traffic, which has been widely tried and accepted, is the use of honeypots. Honeypots are very valuable security resources for gathering artefacts associated with a variety of Internet attack activities. As honeypots run no production services, any contact with them is considered potentially malicious or suspicious by definition. This unique characteristic of the honeypot reduces the amount of collected traffic and makes it a more valuable source of information than other existing techniques.

Currently, there is insufficient research in the honeypot data analysis field. To date, most of the work on honeypots has been devoted to the design of new honeypots or the optimization of existing ones. Approaches for analyzing data collected from honeypots, especially low-interaction honeypots, are presently immature: analysis techniques are manual and focus mainly on identifying existing attacks. This research addresses the need to develop more advanced techniques for analyzing Internet traffic data collected from low-interaction honeypots. We believe that characterizing honeypot traffic will improve the security of networks and, if the honeypot data is handled in time, give early signs of new vulnerabilities or outbreaks of new automated malicious code, such as worms. The outcomes of this research include:

- identification of repeated use of attack tools and attack processes through grouping activities that exhibit similar packet inter-arrival time distributions, using the cliquing algorithm;
- application of principal component analysis to detect the structure of attackers' activities present in low-interaction honeypots and to visualize attackers' behaviors;
- detection of new attacks in low-interaction honeypot traffic through the use of the principal components' residual space and the square prediction error statistic;
- real-time detection of new attacks using recursive principal component analysis; and
- a proof of concept implementation for honeypot traffic analysis and real-time monitoring.

Dedication

This thesis is dedicated to my parents, Ibrahim and Fatima, who have inspired and encouraged me throughout my life, and to my wife, Medawi, for her understanding and constant support over all these years of my PhD study.


Contents

Keywords
Abstract
Dedication
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Declaration
Previously Published Material
Acknowledgment

1 Introduction
    Motivation
    Research Outcomes
    Thesis Outline

2 Background
    Internet Protocols
        TCP/IP Suite
        Traffic Attacks
    Network Monitoring and Traffic Collection Techniques
        Network Firewall
        Intrusion Detection Systems
        Network Flow Monitoring
        Black Hole Monitoring
    Global Monitoring Projects
        DShield
        Network Telescopes
        The Internet Motion Sensor
        The Leurré.com Project
        SGNET
    Traffic Analysis Techniques
        Data Visualization
        Data Mining
        Statistical Techniques
    Honeypots
        Low-interaction vs High-interaction Honeypots
        Production vs Research Honeypots
        Physical vs Virtual Honeypots
        Server Side vs Client Side Honeypots
        Improving Honeypots While Lowering Their Risks
        Honeypot Traffic Anomalies
        Existing Honeypot Solutions
    Related Work
        Research Outcomes from the Leurré.com Project
        Application of Principal Component Analysis to Internet Traffic
        Research Challenges
    Summary

3 Traffic Analysis Using Packet Inter-arrival Times
    Information Source
        The Leurré.com Honeypot Platform
        Data Manipulation
    Preliminary Investigation of Packet Inter-arrival Times
    Cluster Correlation Using Packet Inter-arrival Times
        Data set
        Measuring Similarities
        Cliquing Algorithm
    Experimental Results
        Type I Cliques
        Type II Cliques
        Type III Cliques
        Supercliques
    Summary

4 Honeypot Traffic Structure
    Motivation
    Principal Component Analysis
    Data set and Pre-Processing
        Data set
        Pre-processing
        Candidate Feature Selection
    PCA on the Honeypot Data set
        Number of Principal Components to Retain
        Interpretation of the Results
        Interrelations Between Components
        Identification of Extreme Activities
        A Discussion of the Detected Outliers
    Summary

5 Detecting New Attacks
    Introduction
    Principal Component's Residual Space
        Square Prediction Error (SPE)
    Data set and Pre-Processing
        Data set
        Processing the Flow Traffic via PCA
        Robustness
        Setting up Model Parameters
    Model Architecture
        Illustrative Example
        PCA Model Construction
        Future Traffic Testing
    Results and Evaluation
        Detection and Identification
        Stability of the Monitoring Model Over Time
        Computational Requirements
        Evaluation
    Summary

6 Automatic Detection of New Attacks
    Introduction
    Principal Component Analysis Model
        Building the Initial PCA Detection Model
        Recursive Adaptation of the Detection Model
        Setting the Thresholds
    Model Architecture
        Detecting New Attacks and Updating the Model
        Model Sensitivity to New Attacks
    A Proof of Concept Implementation
        Flow Aggregator
        Monitoring Desktop: HoneyEye
        Deployment Scenario: Single Site
        Limitation
    Experimental Results
        Projection of the Testing Data: No Adaptation
        Projection of the Testing Data: With Adaptation
        The Effects of Adaptation on Threshold Values
        The Effect of Adaptation on Variables
    Summary

7 Conclusion and Future Work
    Improving the Leurré.com Clusters
    Structuring Honeypot Traffic
    Detecting New Attacks
    Conclusion

A Matlab Code
    A.1 Extracting the Principal Components
    A.2 Robustification Using the Squared Mahalanobis Distance
    A.3 Estimate Parameters
    A.4 Recursive Mean
    A.5 Recursive Variance
    A.6 Recursive Normalization


List of Figures

2.1 Network telescope setup
2.2 SGNET architecture
2.3 An example of a virtual honeypot setup that emulates two operating systems
3.1 Leurré.com honeypot platform architecture
3.2 Illustration of the port sequence of an attack
3.3 Illustration of packet inter-arrival times
3.4 A global distribution of all IATs < … seconds
3.5 IAT distribution values that range from 0 to … seconds
3.6 A time series conversion using SAX
3.7 An example of finding cliques
3.8 The different steps of the cliquing algorithm
4.1 Directions of maximal variance of principal components (Z_1, Z_2)
4.2 Scree plot of eigenvalues
4.3 The scatter plot of TCP scan (PC2) vs live machine detection (PC5)
4.4 The scatter plot of the first two principal components
4.5 The scatter plot of the last two components
4.6 The ellipse of a constant distance
4.7 The scatter plot of the statistics D_i vs. (M_i^2 - D_i)
5.1 Scree plot of eigenvalues
5.2 Robustification of the correlation matrix through multivariate trimming
5.3 Detection model architecture
5.4 Steps for building the PCA model (Phase I)
5.5 Steps for detecting new attacks (Phase II)
5.6 Plot of SPE values of the training and testing traffic
5.7 Plot of four-month attack data projected onto the residual space
6.1 Adaptive detection model process flow
6.2 Detecting new attacks
6.3 Residual space sensitivity to new attacks
6.4 HoneyEye interface
6.5 Overview of a real-time deployment
6.6 Detection charts, with no adaptation: using the SPE statistic (upper chart) and the T^2 statistic (lower chart)
6.7 Two detection charts with 14-day adaptation: using the SPE statistic (upper chart) and the T^2 statistic (lower chart)
6.8 SPE and T^2 limit evolution over time using 14-day adaptation
6.9 SPE limit evolution over time using 14-day adaptation (top), along with the mean of six selected variables

List of Tables

3.1 Distinct sources and destinations of the top ten IATs
3.2 Bin values of IAT ranges
3.3 A summary of Type I Cliques
3.4 A summary of Type II Cliques
3.5 A summary of Type III Cliques
3.6 Representative properties of Supercliques
4.1 Summary of the data set used in this study
4.2 Variables used in the analysis
4.3 The extracted principal components and their variances
4.4 The extracted communalities of variables
4.5 The Varimax rotation of principal components
4.6 Interpretations of the first seven components
4.7 The top five extreme observations
5.1 Summary of the data sets used in the study
5.2 Extracted principal components variance
5.3 Sample traffic matrix
5.4 Standardized traffic matrix
5.5 Eigenvectors
5.6 Eigenvalues
5.7 Scores of the residuals
5.8 SPE values
5.9 Future traffic matrix
5.10 Standardized future traffic matrix
5.11 New traffic PC scores
5.12 Average execution times of the major tasks (seconds)
5.13 Classes of detected attack activities
6.1 Summary of the data sets

List of Abbreviations

CAIDA   Cooperative Association for Internet Data Analysis
CERT    Computer Emergency Response Team
CPU     Central Processing Unit
CUSUM   Cumulative Sum
DDoS    Distributed Denial of Service
DoS     Denial of Service
EWMA    Exponentially Weighted Moving Average
FTP     File Transfer Protocol
IAT     Packet Inter-arrival Time
ICMP    Internet Control Message Protocol
IDS     Intrusion Detection System
IMS     Internet Motion Sensor
IP      Internet Protocol
ISC     Internet Storm Center
KNN     K-Nearest Neighbors
LAN     Local Area Network
NIDS    Network Intrusion Detection System
OD      Origin-Destination
OS      Operating System
OSI     Open Systems Interconnection
PAA     Piecewise Aggregate Approximation
PC      Principal Component
PCA     Principal Component Analysis
RPCA    Recursive Principal Component Analysis
SAX     Symbolic Aggregate Approximation
SMTP    Simple Mail Transfer Protocol
SPE     Square Prediction Error
SSH     Secure Shell
SVD     Singular Value Decomposition
TCP     Transmission Control Protocol
TCP/IP  Transmission Control Protocol / Internet Protocol
UCL     Upper Control Limit
UDP     User Datagram Protocol
UML     User Mode Linux

Declaration

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signed:

Date:


Previously Published Material

The following papers have been published or presented, and contain material based on the content of this thesis:

S. Almotairi, A. Clark, M. Dacier, C. Leita, G. Mohay, V. H. Pham, O. Thonnard, and J. Zimmermann, "Extracting Inter-arrival Time Based Behaviour from Honeypot Traffic using Cliques", in the 5th Australian Digital Forensics Conference, Perth, Australia, 2007.

S. Almotairi, A. Clark, G. Mohay, and J. Zimmermann, "Characterization of Attackers' Activities in Honeypot Traffic Using Principal Component Analysis", in Proceedings of the 2008 IFIP International Conference on Network and Parallel Computing, Shanghai, China: IEEE Computer Society, 2008.

S. Almotairi, A. Clark, G. Mohay, and J. Zimmermann, "A Technique for Detecting New Attacks in Low-Interaction Honeypot Traffic", in Proceedings of the Fourth International Conference on Internet Monitoring and Protection, Venice, Italy: IEEE Computer Society, 2009.


Acknowledgment

Praise and thanks be to Allah for His help in accomplishing this work. This thesis would not have been successful without the assistance and support of the following individuals and organizations. I am grateful to them all.

I would like to thank my supervisors, Associate Professor Andrew Clark, Adjunct Professor George Mohay, and Dr. Jacob Zimmermann, for their patience, guidance, and support in the completion of this work. Thank you, Andrew and George, for all the help you have given me during my research. Indeed, your suggestions and advice made the completion of this work possible.

Acknowledgment is due to the National Information Center at the Ministry of Interior in Saudi Arabia for sponsoring my research. I am also very grateful to the Information Security Institute (ISI) at the Queensland University of Technology for providing the resources and environment for conducting this research. Additional thanks go to the Leurré.com honeypot project, led by Marc Dacier, for providing the honeypot data that made this work possible. I would also like to thank my colleagues at the Information Security Institute for providing an environment in which I could learn and work as a researcher.


Chapter 1

Introduction

People and businesses alike depend on the Internet to communicate and to conduct their affairs. This growing dependence on the Internet is matched by a rising rate of attacks. Computers and networks connected to the Internet are vulnerable to a variety of threats that can compromise their intended operation, such as viruses, worms, and denial of service attacks. There are many reasons for the growing number and severity of attacks, including increased connectivity and the increased availability of vulnerability information and attack scripts via the Internet.

As the nature of Internet attacks is unpredictable, security managers need to implement multiple layers of security defence as part of a Defence-in-Depth protection strategy, including firewalls, monitoring tools, vulnerability scanning tools, and intrusion detection systems. Firewalls are commonly used to protect local networks from the outside world by controlling the traffic flow between the local network and the Internet. While firewalls protect local networks from the Internet, they have many limitations: they cannot see local traffic; they are vulnerable to misconfiguration; and they stop only network level attacks, being less effective against application level attacks that target open ports such as TCP port 80.

Network intrusion detection systems (NIDS) are another component in the Defence-in-Depth protection strategy. NIDS are used to detect malicious traffic within networks, based on predefined attack signatures or, less commonly, on anomaly-based methods.

NIDS also have their own limitations: they need accurate attack signatures in order to work properly; signature-based NIDS cannot detect new and unseen attacks; they generate a large number of alerts that need to be investigated; and they cannot handle encrypted traffic.

Recently, honeypots have gained popularity within the security community as an additional layer of network security. Honeypots are decoy computers that run no real services and can complement other security systems through their ability to capture new attacks and to see encrypted traffic. In addition, honeypots, by definition, collect only malicious traffic and therefore generate few false alarms. Honeypot applications for network security include the automatic collection of malware [12], detection of zero-day attacks [1], detection of worms [50], and the automatic generation of intrusion detection signatures [83].

Low-interaction honeypots are the simplest form of honeypot. They run no real operating system and usually offer only an emulated network stack with limited or no service interaction. The advantages of using low-interaction honeypots are their ease of deployment and their low level of risk. The Leurré.com project is a worldwide deployment of low-interaction honeypots [9] for collecting attack data that targets machines and networks connected to the Internet. The honeypot traffic data used in this thesis comes from the Leurré.com project. Analyzing this traffic has proved very useful in characterizing global malicious Internet activity. Various types of analysis have been carried out on honeypot traffic data collected from the project, to characterize different Internet attack activities and to unveil useful attack patterns [151, 112]. This research extends previous work on improving honeypot traffic analysis, introduces new techniques for characterizing malicious Internet activities, and automates the analysis and discovery of new attacks.

The rest of this chapter is organized as follows. Section 1.1 identifies the motivation for this research. Outcomes achieved by this research are identified in Section 1.2. Finally, Section 1.3 presents the outline of the thesis.

1.1 Motivation

This thesis examines the problem of analyzing traffic data collected by low-interaction honeypots, with the goal of identifying anomalous Internet traffic. This research is motivated by:

- the relative absence of research analyzing traffic data collected by honeypots in general, and by low-interaction honeypots in particular;

- the need for new detection techniques that suit the type of traffic data collected by low-interaction honeypots, which is considered suspicious by definition, and which is both multidimensional and sparse;

- the need for a real-time capability for detecting new attacks with reduced or no human intervention, one that is:
    - able to capture new trends and adapt to the dynamic nature of the Internet; and
    - low in computational resource requirements and thus suitable for real-time application.

1.2 Research Outcomes

The aim of this thesis is to research and develop advanced techniques for identifying and analyzing anomalous Internet activities in honeypot traffic. This research has resulted in a number of significant improvements to honeypot traffic analysis. The outcomes of this research are five-fold:

- improving the Leurré.com clusters through the use of packet inter-arrival time (IAT) distributions and the cliquing algorithm to group similar attack activities, or clusters of attacks, based on similar IAT behaviors. The results were published in: S. Almotairi, A. Clark, M. Dacier, C. Leita, G. Mohay, V. H. Pham, O. Thonnard, and J. Zimmermann, "Extracting Inter-arrival Time Based Behaviour from Honeypot Traffic using Cliques", in the 5th Australian Digital Forensics Conference, Perth, Australia, 2007;

- the successful application of principal component analysis (PCA) in detecting the structure of attackers' activities in honeypot traffic, in visualizing these activities, and in identifying different types of outliers. The findings were presented in: S. Almotairi, A. Clark, G. Mohay, and J. Zimmermann, "Characterization of Attackers' Activities in Honeypot Traffic Using Principal Component Analysis", in Proceedings of the 2008 IFIP International Conference on Network and Parallel Computing, Shanghai, China: IEEE Computer Society, 2008;

- the proposal of a detection technique that is capable of detecting new attacks in low-interaction honeypot traffic using the residuals of principal component analysis and the square prediction error (SPE) statistic (the idea is sketched in code after this list). The results were published in: S. Almotairi, A. Clark, G. Mohay, and J. Zimmermann, "A Technique for Detecting New Attacks in Low-Interaction Honeypot Traffic", in Proceedings of the Fourth International Conference on Internet Monitoring and Protection, Venice, Italy: IEEE Computer Society, 2009;

- the design of an automatic detection model that is capable of detecting new attacks, capturing new changes, and updating its parameters automatically; and

- the implementation of a proof of concept system for analyzing honeypot traffic and providing a real-time monitoring application.
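To make the residual-space idea concrete before the detailed chapters, the following is a minimal sketch of PCA-plus-SPE detection. It is not the thesis implementation (the thesis code is Matlab, listed in Appendix A): the feature layout, the number of retained components, and the percentile-based control limit used here are illustrative assumptions.

    import numpy as np

    def fit_pca_spe(X_train, k, quantile=0.995):
        """Fit a PCA model on past traffic and derive an SPE control limit.

        X_train: (n_samples, n_features) matrix of per-interval traffic features.
        k: number of principal components to retain.
        quantile: empirical quantile used as the SPE limit (an illustrative
                  choice; a control limit can also be derived analytically).
        """
        mu = X_train.mean(axis=0)
        sigma = X_train.std(axis=0, ddof=1)
        Z = (X_train - mu) / sigma                   # standardize each feature
        corr = np.corrcoef(Z, rowvar=False)          # correlation matrix
        eigvals, eigvecs = np.linalg.eigh(corr)
        order = np.argsort(eigvals)[::-1]            # decreasing variance
        P = eigvecs[:, order[:k]]                    # loadings of retained PCs
        residual = Z - Z @ P @ P.T                   # projection onto residual space
        spe = np.sum(residual ** 2, axis=1)          # SPE per observation
        limit = np.quantile(spe, quantile)
        return mu, sigma, P, limit

    def spe_alarms(X_new, mu, sigma, P, limit):
        """Return True where the SPE of new traffic exceeds the limit."""
        Z = (X_new - mu) / sigma
        residual = Z - Z @ P @ P.T
        return np.sum(residual ** 2, axis=1) > limit

New attacks do not conform to the correlation structure of past traffic, so their energy falls into the residual space and inflates the SPE statistic.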

1.3 Thesis Outline

The rest of this thesis is organized as follows:

Chapter 2: Background. This chapter provides an overview of honeypot concepts and technologies. It also explores existing data collection techniques and methods for monitoring anomalous Internet traffic, identifies research in analyzing anomalous Internet traffic, and discusses related work relevant to that described in Chapters 4-6.

Chapter 3: Traffic Analysis Using Packet Inter-Arrival Times. This chapter gives a brief introduction to the Leurré.com project setup and its methodology for collecting and processing honeypot traffic. In addition, it details a methodology for improving the Leurré.com clusters by grouping clusters that share similar types of activities, based on packet inter-arrival time (IAT) distributions. A number of cliques were generated using the IATs of clusters, representing a variety of interesting activities targeting the Leurré.com environments.

Chapter 4: Honeypot Traffic Structure. This chapter introduces the concept of principal component analysis and presents a technique for characterizing attackers' activities in honeypot traffic using principal component analysis. Attackers' activities in honeypot traffic are decomposed into seven dominant clusters. In addition, a visualization technique based on principal component plots is presented to unveil the interrelationships between activities and to identify outliers. Finally, experimental results on real traffic data from the Leurré.com project are discussed.

Chapter 5: Detecting New Attacks. This chapter presents a technique for detecting new attacks in low-interaction honeypot traffic through the use of the principal component residual space and the square prediction error (SPE) statistic. The effectiveness of the proposed technique is demonstrated and evaluated through the analysis of real traffic data from the Leurré.com project. Two data sets are used in this analysis: data set I to construct the PCA model, and data set II to test and evaluate the detection model.

Chapter 6: Automatic Detection of New Attacks. This chapter addresses the challenges of real-time detection of new attacks and proposes an adaptive detection model that captures changes in Internet traffic and updates its parameters automatically. Moreover, a proof of concept implementation of the proposed detection system for real-time and offline applications is described.

Chapter 7: Conclusion and Future Work. Conclusions and directions for future research are presented in this chapter.


Chapter 2

Background

The goals of the thesis, as described in Chapter 1, are to research and develop improved techniques for characterizing anomalous Internet activities present in low-interaction honeypot traffic. This chapter provides an overview of the honeypot concept and the different types of honeypot technologies. It also explores the data collection techniques and monitoring methods used in identifying anomalous Internet traffic. Research in analyzing anomalous Internet traffic is also identified, with a particular emphasis on honeypots.

This chapter is divided into seven sections. Internet protocols are discussed in Section 2.1. Existing methods for monitoring network traffic for malicious activities are discussed in Section 2.2. Section 2.3 highlights existing global monitoring systems. Section 2.4 provides an overview of traffic analysis techniques. The concept of honeypots is presented in Section 2.5. Section 2.6 presents previous research related to the work described in Chapters 3 to 6. Finally, Section 2.7 concludes the literature review.

2.1 Internet Protocols

All Internet traffic is handled by the Internet protocol suite, commonly known as the TCP/IP (Transmission Control Protocol/Internet Protocol) protocol suite. As this thesis aims to research and develop advanced techniques for identifying and analyzing anomalous Internet activities in honeypot traffic, this section provides a brief overview of the TCP/IP protocol suite, traffic anomalies, and traffic analysis techniques.

2.1.1 TCP/IP Suite

The TCP/IP (Transmission Control Protocol/Internet Protocol) protocol suite is a set of communication protocols for transmitting data over the Internet [59, 60], maintained by the Internet Engineering Task Force (IETF) [8]. The TCP/IP protocol suite is hierarchical and comprises four interacting layers: link, Internet, transport, and application. Network communication is achieved through the interaction between the different layers, where higher layers draw on the services of lower layers. While the TCP/IP suite defines many protocols to achieve its functionality, the main protocols of the suite are the following (a minimal illustration of the two transport protocols follows this list):

- Internet Protocol (IP) [57]: IP is the main network layer protocol. It is a connectionless protocol, mainly responsible for delivering data between source and destination devices. IP's main functionalities include formatting data into packets (datagrams), IP addressing, handling fragmentation, and network routing.

- Internet Control Message Protocol (ICMP) [56]: ICMP is a network layer protocol that complements the IP protocol by providing error reporting and querying mechanisms for testing and diagnosing networks.

- Transmission Control Protocol (TCP) [58]: TCP is a connection-oriented transport layer protocol. TCP is responsible for providing reliable communications for application layer protocols and for ensuring end-to-end delivery of data between communicating application programs. As IP is a connectionless and unreliable protocol, TCP provides the functionality that enables several applications to share the same IP address at the same time and to perform bi-directional communications over the network. TCP's main functions include connection control, error control, and flow control. Application layer protocols that utilize TCP include HTTP and SMTP.

- User Datagram Protocol (UDP) [55]: UDP is a connectionless and unreliable transport layer protocol. Like TCP, UDP is responsible for providing end-to-end communication for applications, but with a minimal level of error checking and no flow control mechanism. UDP provides applications with a lightweight method for sending small amounts of data where reliability is not important, such as broadcasting applications where the loss of some bytes will not be noticed, or applications that resend requests whenever a response is not received, for example, voice over IP.
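The difference between the two transport protocols is easy to see at the socket API level. The following self-contained sketch (loopback addresses and port numbers are arbitrary choices for illustration) echoes one message over TCP, which requires a connection, and one over UDP, which is a single unacknowledged datagram:

    import socket
    import threading
    import time

    def tcp_echo_once(port=9000):
        # TCP: connection-oriented; ordered, reliable delivery per connection.
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.bind(("127.0.0.1", port))
        srv.listen(1)
        conn, _ = srv.accept()
        conn.sendall(conn.recv(1024))        # echo back over the same connection
        conn.close()
        srv.close()

    def udp_echo_once(port=9001):
        # UDP: connectionless; each datagram stands alone and may be lost.
        srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        srv.bind(("127.0.0.1", port))
        data, addr = srv.recvfrom(1024)
        srv.sendto(data, addr)
        srv.close()

    threading.Thread(target=tcp_echo_once).start()
    threading.Thread(target=udp_echo_once).start()
    time.sleep(0.3)                          # let both servers start listening

    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.connect(("127.0.0.1", 9000))           # three-way handshake happens here
    c.sendall(b"hello tcp")
    print(c.recv(1024))
    c.close()

    u = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    u.sendto(b"hello udp", ("127.0.0.1", 9001))   # no handshake, no guarantee
    print(u.recvfrom(1024)[0])
    u.close()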

2.1.2 Traffic Attacks

Vulnerabilities exist in almost all layers of the TCP/IP suite [37], as the suite was not designed with security in mind. These vulnerabilities have been exploited by attackers against networks and systems, using attacks such as TCP SYN and ICMP flooding. Statistics [131] show that the number of network attacks and the number of new vulnerabilities are both on the rise, despite increased efforts in the areas of software engineering and security management practices. Several factors have contributed to the rise in attacks, including increased connectivity, increased financial and other incentives to launch attacks, the availability of vulnerability information and attack tools, the high prevalence of exploitable vulnerabilities, and the lack of patches from vendors or long delays before patches are made available. The sources of vulnerabilities include the design of the protocols themselves and the flawed implementation of these protocols.

Several studies have classified network anomalies and security threats in order to resolve confusion in describing particular attacks. Howard [71] developed a process-based taxonomy of computer and network attacks. His approach was intended to describe the process of attacks rather than to provide an attack classification, establishing a link between attackers and their objectives through an operational sequence of tools, access, and results. Hansman's taxonomy [69] aims instead at classifying and grouping attacks based on their similarities rather than the attack process. Hansman proposed four dimensions for attack classification. The first dimension categorizes attacks by their attack vector or method of propagation. The second dimension identifies attacks according to their targets. The third dimension deals with the vulnerabilities that the attack exploits, based mainly on the Common Vulnerabilities and Exposures (CVE) standardized names of vulnerabilities [3]. The fourth dimension deals with attacks that have extra effects or that are able to launch other attacks, such as a worm carrying a Trojan in its payload.

The next section examines methods for monitoring network traffic in order to detect these malicious activities.

2.2 Network Monitoring and Traffic Collection Techniques

The Internet has become essential for governments, universities, and businesses to conduct their affairs. The reliability and availability of networks and the security of the Internet are critical for organizations to conduct their daily work. These networks are under constantly increasing threat from different types of attacks, such as worms and denial of service attacks. Monitoring and characterizing these threats is crucial for protecting networks and guaranteeing smooth organizational activity. Broadly speaking, two methods exist for monitoring network traffic for malicious activities: live network monitoring and unsolicited traffic monitoring. Live monitoring techniques include data collected by policy-enforcing systems such as firewalls, network intrusion detection system (NIDS) logs, and traffic from network management tools such as NetFlow [132]. Unsolicited traffic monitoring techniques include passive monitoring of unused IP address spaces, such as darknets, and the use of active decoy services, such as honeypots.

2.2.1 Network Firewall

A network firewall comprises software and hardware that protect one network from another. Firewalls are mainly used to filter incoming Internet traffic according to a predefined organizational security policy. A firewall can provide protection at different levels of the Open Systems Interconnection (OSI) networking model, such as the application and network layers. Firewalls deployed at the boundary of a network have a view of inbound and outbound traffic, which makes them very useful for monitoring. Firewall logs are a rich source of information about network traffic, including traffic volume, successful and rejected connections, traffic arrival times, IP addresses, ports, and services.

2.2.2 Intrusion Detection Systems

Intrusion detection systems complement firewalls in monitoring network traffic and provide another level of protection for systems. Network intrusion detection systems (NIDS) are passive in their monitoring of network traffic. They detect attacks by capturing and analyzing traffic and generating alarms when the level of suspicion about the traffic is high. Two types of NIDS currently exist, based on their detection methodologies: signature-based and anomaly-based NIDS. Signature-based NIDS rely on a knowledge base of predefined attack patterns to identify attacks in the monitored network traffic. In contrast, anomaly-based NIDS measure deviation from normality and raise alarms whenever a predefined threshold level is exceeded. A normality profile is constructed by training the detection model on historical network traffic that is believed to be attack free, or normal, over a period of time. An alert is then signaled whenever a large deviation from the normality profile is encountered. While signature-based NIDS detect only known attacks, anomaly-based NIDS are capable of detecting zero-day attacks, but usually with a high rate of false positive alarms. NIDS are passive systems that are capable of logging very detailed information about suspicious traffic.

2.2.3 Network Flow Monitoring

A network flow [132] is a unidirectional stream of packets between a given source and destination. A network flow is identified by seven key fields: source IP address, destination IP address, source port number, destination port number, protocol type, type of service, and router input interface. If a packet differs from another packet in a single key field, it is considered to belong to a different flow. While flow data was originally used for resource management and accounting, it contains enough information to detect a variety of network anomalies [33, 63]. Moreover, network flows excel in performance for real-time analysis and detection of attacks. In this research, traffic flows were used to analyze honeypot traffic. A sketch of how packets aggregate into flows is given below.
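The seven-field flow key can be made concrete in a few lines. The sketch below aggregates a handful of fabricated packet records into flows; the record layout and the sample values are assumptions made for illustration only:

    from collections import defaultdict

    # Packet record layout (illustrative):
    # (src_ip, dst_ip, src_port, dst_port, proto, tos, in_iface, length)
    packets = [
        ("203.0.113.7", "192.0.2.9", 4242, 445, "tcp", 0, "eth0", 60),
        ("203.0.113.7", "192.0.2.9", 4242, 445, "tcp", 0, "eth0", 1500),
        ("203.0.113.7", "192.0.2.9", 4243, 445, "tcp", 0, "eth0", 60),  # differs in one key field
    ]

    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = p[:7]                   # the seven key fields identify the flow
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p[7]

    for key, stats in flows.items():
        print(key, stats)             # two flows: the source ports differ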

Barford et al. [33] conducted a visual analysis of traffic flows and categorized traffic anomalies into three types, based on the statistical characteristics of their traffic features: operational anomalies, resulting from network device outages and misconfiguration; flash crowd anomalies, described as a sudden rise in traffic to a host for a short period of time; and network abuse anomalies, resulting from malicious intent, which include a variety of anomalous activities such as denial of service (DoS) attacks and worms.

Analysis of flow data has been widely used for real-time network monitoring [66, 63, 101, 33, 81]. Barford et al. [34] proposed an anomaly detection technique that uses network traffic flows as input. Kim et al. [81] proposed a flow-based method to detect abnormal traffic using traffic patterns in flow header information. Munz et al. [101] proposed a framework for real-time detection of attacks using traffic flows.

2.2.4 Black Hole Monitoring

Black holes [48], or darknets [19], are blocks of routable IP addresses with no legitimate hosts deployed. Traffic targeting these blocks must be the result of misconfiguration, backscatter from spoofed source addresses, or port scanning. Thus, black hole networks provide an excellent method for studying Internet threats. The size of the unused address space has little effect on the usability of this method; however, more address space increases the visibility and accuracy of statistical inference. Monitoring of a darknet can include [102] monitoring backscatter, which helps in the analysis of denial of service attacks [100], or monitoring requests for access to unallocated spaces, which helps in the analysis of worms. Traffic data can be acquired in various ways, such as exporting flow logs from routers and switches, placing a passive network monitor at the entrance of the network, or listening on a router interface that serves the unallocated network space. One drawback of the darkspace monitoring technique is the difficulty of hiding the deployment so that attackers cannot avoid it. Another interesting approach is to monitor the grey IP address space: IP addresses that are not active for a period of time within a large IP class. Jin et al. [76] proposed a correlation technique for identifying and tracking potentially suspicious hosts using grey IP space. In the following section, we review existing projects for monitoring Internet threats.

2.3 Global Monitoring Projects

This section highlights a number of projects whose aim is to monitor Internet threats, along with the techniques they utilize.

2.3.1 DShield

DShield.org is a non-profit organization that was launched at the end of 2000 and is funded by the SANS Institute as part of its Internet Storm Center (ISC) [4, 73]. DShield.org's goal is to collect records of malicious Internet activity from all over the world, analyze activity trends, and improve firewall rules. DShield's data set consists of firewall logs and IDS alerts submitted by a variety of networks from around the world. Attack trends, such as top source IP addresses and destination ports, are published on a daily basis. Yegneswaran et al. [149] presented an analysis of Internet intrusion activity based on data from DShield.org. In their study, they utilized packets rejected by firewalls and port scan logs recorded by network intrusion detection systems. They investigated several features of intrusion activity, including the daily volume of intrusion attempts, the sources and destinations of intrusion attempts, and specific types of intrusion attempts. They then used their results to predict intrusion activity for the entire Internet. However, research shows that attempts to extrapolate results from small network IP spaces in order to predict global Internet traffic lead to insignificant results, since different IP spaces observe different traffic patterns [48].

2.3.2 Network Telescopes

The network telescope is an initiative of the Cooperative Association for Internet Data Analysis (CAIDA) for monitoring routable but unused IP address space [14]. The network telescope assumes that no legitimate traffic should be sent to the monitored space, which provides a unique basis for studying security events; any traffic that arrives at a network telescope is either a result of malicious activity (such as backscatter from denial of service attacks, Internet worms, and network scanning) or a misconfiguration. The network telescope has helped in studying worm activity, such as Slammer and Code Red II, and has assisted in the analysis of large-scale denial of service attacks and backscatter [100]. Figure 2.1 depicts the network telescope setup used for backscatter collection.

[Figure 2.1: Network telescope setup: a passive monitor attached via a hub to an unused /8 network.]
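The thesis text does not spell out why a large monitored block matters, but the arithmetic is standard in backscatter studies such as [100] and is worth stating as a hedged aside. Assuming a denial of service victim replies to packets whose spoofed source addresses are drawn uniformly from the IPv4 address space:

    % Probability that one backscatter packet lands inside a /8 telescope:
    \[ p = \frac{2^{24}}{2^{32}} = \frac{1}{256} \]
    % A victim emitting backscatter at rate R packets/s is therefore seen at
    % an expected rate R/256, and the attack rate can be estimated as
    \[ \hat{R} = 256 \, R_{\text{observed}} \]

Smaller monitored blocks shrink p proportionally, which is why more address space increases the visibility and accuracy of such statistical inference.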

2.3.3 The Internet Motion Sensor

The Internet Motion Sensor (IMS) is a global threat monitoring system for measuring and tracking Internet threats, such as worms and denial of service attacks [20]. The IMS is managed by the University of Michigan and consists of over 28 sensors at 18 different locations. These sensors monitor blocks of routable but unused IP addresses and are deployed across the globe at major Internet service providers, large organizations, and universities. The IP address spaces these sensors monitor range in size from class C (256 addresses) to class A (16,777,216 addresses) networks. Each sensor consists of an active and a passive component: the passive component collects packets sent to the sensor's monitored address space, and the active component manages replies to the sources of received packets. A study of traffic targeting ten IMS sensors showed significant differences in the traffic observed between sensors [30]. These differences were observed across all protocols and services, for specific protocols, and for particular worm signatures.

2.3.4 The Leurré.com Project

The Institute of Eurécom launched the Leurré.com project in 2004 for the purpose of collecting malicious traffic using globally distributed environments [9]. The Leurré.com environments consist of similar honeypot sensors deployed at different locations around the world; currently 40 platforms are deployed in 25 different countries. Data from these honeypots was used in this research.

On a daily basis, traffic logs are transferred to a centralized machine where the raw traffic data is processed, enriched with external data, and inserted into relational database tables. The most important tables in the database are the following [109]:

- host: contains all attributes that characterize one honeypot virtual machine;
- environment: contains all attributes that characterize one honeypot platform; each platform consists of three hosts;
- source: gathers all attributes required to characterize one attacking IP address within one day;
- large_session: contains all attributes required to characterize the activity of one source observed against one platform;
- tiny_session: contains all attributes required to characterize one source observed against one host;
- hacker_honeypot_packets: tracks all packets sent by attackers to honeypots;
- honeypot_hacker_packets: tracks all packets sent by honeypots to attackers.

The Leurré.com project provides two types of interface for accessing the database: a protected web interface, which provides useful predefined queries for extracting data from the database, and direct access through secure shell (SSH). The Leurré.com platform architecture and data manipulation are explained in detail in Chapter 3, while research outcomes from the project are presented in Section 2.6.1. An illustrative query against this schema is sketched below.
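The sketch counts distinct attacking sources per platform. The join-column names (environment_id, source_id) are guesses made for illustration; the actual Leurré.com schema may name its columns differently:

    import sqlite3  # stand-in engine; the project stores data in a relational database

    conn = sqlite3.connect("leurrecom.db")  # hypothetical local copy of the tables
    query = """
    SELECT e.environment_id,
           COUNT(DISTINCT ls.source_id) AS attacking_sources
    FROM   large_session AS ls
    JOIN   environment   AS e ON e.environment_id = ls.environment_id
    GROUP  BY e.environment_id
    ORDER  BY attacking_sources DESC;
    """
    for row in conn.execute(query):
        print(row)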

2.3.5 SGNET

SGNET is another initiative from the Institute of Eurécom: an open, distributed framework of honeypots for collecting suspicious Internet traffic data and analyzing it for malicious activity [92]. The framework focuses mainly on self-propagating code and code injection attacks. The architecture of SGNET is divided into three parts (Figure 2.2):

1. SGNET sensor: a low-interaction honeypot daemon that interacts with attackers by mimicking real services. The interactions are handled by a ScriptGen system [93]. ScriptGen is a system for emulating services with no prior knowledge of their behavior, through incremental learning drawn from samples of previous interactions with high-interaction honeypots. ScriptGen uses state machines to replay previously learned responses to attackers' requests. When a new request arrives that is not captured by the emulator knowledge base, ScriptGen proxies the request to a high-interaction honeypot to continue the conversation, which is later incorporated to refine the emulator's knowledge state machines.

2. Gateway: the gateway acts as an interface between the SGNET sensors and the service provider components, and balances the load across the service provider components. Communication between the gateway and the SGNET components is achieved through Peiros, a TCP-based protocol [92].

3. Service provider: currently, the service provider has two components, the sample factories and the shellcode handlers. The sample factory is a high-interaction honeypot implemented using a modified version of the Argos virtualization system [1]. Any requests that cannot be answered by the SGNET sensors, because they fall outside their state machines, are proxied to the sample factory to continue the conversation with the attacker; the results are then pushed back through the gateway to refine the sensors' state machines. The shellcode handler is a modified implementation of Nepenthes [12] for handling shellcode behavior and network interactions. When shellcode is detected, the payload is handed by the sensor to the shellcode handler for further analysis.

SGNET collects attack information at different levels, including a tcpdump [24] log at the SGNET sensor and the gateway log. The information is then extracted and stored in a relational database. One disadvantage of the SGNET sensors is that each can monitor only a few IP addresses (currently four), which limits their view of Internet threats. Currently, only two sensors have been deployed across the globe. While SGNET is limited to code injection detection, it has the appealing feature of not depending on manually crafted static responses to retrieve attack payloads, relying instead on valid responses extracted from real servers. There has not been any analysis of data collected by the project.

[Figure 2.2: SGNET architecture: SGNET sensors connect through a gateway to the service provider components (sample factories and shellcode handlers).]

2.4 Traffic Analysis Techniques

Characterizing and analyzing anomalous network traffic is a first step toward increasing our knowledge of attack threats and, in turn, protecting production networks from them. Traffic analysis is a wide research field, which can be roughly divided, based on the technique utilized, into three categories: data visualization, data mining, and statistical techniques. This section reviews some of the previous research in each category.

2.4.1 Data Visualization

Traffic visualization can be very helpful in assisting administrators to make effective decisions. Complex attack patterns can be easily detected and interpreted by humans if they are represented properly in visual form. A number of researchers have investigated the applicability of visual techniques for identifying attack tools without relying on intrusion detection system signatures or statistical anomalies.

Abdullah et al. [47] used the Parallel Coordinate Plots visualization technique [141], a technique for displaying multi-dimensional data in one representation, to visualize captured packets in real time in order to fingerprint popular attack tools from the top 75 network security tools [61]. By focusing on complex patterns that are not easily automated by systems, they found that visual representation allows traffic attacks to be more easily detected and interpreted by humans. Different systems have been designed to help visualize network traffic for security analysis [90, 26]. Krasser et al. [82] designed a network traffic visualization system for real-time and forensic network data analysis. Their system supports a real-time monitoring mode and a playback mode for previously captured data for forensics; archived traffic captured by honeypots was used to evaluate and test the system. Zhang et al. [150] explored the use of singular value decomposition (SVD) plots and scatter plots between eigenvectors for detecting traffic patterns. Examples of traffic flow visualization for traffic anomaly detection include FlowScan and NVisionIP [106, 89].

2.4.2 Data Mining

Data mining refers to the extraction of knowledge from large amounts of data. This knowledge serves two main objectives [68]:

- Description: finding patterns that describe the current data, such as clustering;
- Prediction: predicting the behavior of new data sets given the current data set, such as classification.

Julisch [78] surveyed the data mining techniques most used in the field of intrusion detection and found that four types have been widely applied:

- Association rules, which search for interesting relationships among the items in large data sets, based on the frequency of items occurring together. This technique is widely used in market basket analysis to study customers' buying habits by finding associations among items placed in the shopping basket, helping decision makers and analysts find sets of products that are frequently bought together and develop marketing strategies. From the presence of certain items in the shopping basket, one can infer with high probability the presence of other items (a small worked example follows this list);

- Frequent episode rules, which are similar to association rules but take record order into account;

- Classification, which is the process of learning from given data to build classification models for each class of data, based on features in the data; these models are then used to predict the classes of new data;

- Clustering, which is a method of partitioning data into groups (clusters) for the purpose of data simplification.
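To make the support and confidence notions concrete, here is a toy market-basket computation; the baskets and the 0.5 support threshold are fabricated for illustration:

    from itertools import combinations

    baskets = [
        {"bread", "milk"},
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"milk", "butter"},
    ]

    def support(itemset):
        """Fraction of baskets that contain every item of the itemset."""
        return sum(itemset <= b for b in baskets) / len(baskets)

    def confidence(antecedent, consequent):
        """How often the rule holds among baskets where it applies."""
        return support(antecedent | consequent) / support(antecedent)

    items = sorted(set().union(*baskets))
    for a, c in combinations(items, 2):
        s = support({a, c})
        if s >= 0.5:                 # keep only frequently co-occurring pairs
            print(f"{a} -> {c}: support={s:.2f}, confidence={confidence({a}, {c}):.2f}")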

A number of researchers have applied data mining techniques to problems in intrusion detection. Lee [91] used data mining techniques to discover consistent and useful patterns of system features that describe program and user behavior, and used these features to develop classifiers that can recognize anomalies and known intrusions. Two data mining algorithms were implemented: association rules and frequent episode algorithms. Hierarchical tree classification clustering has also been used to eliminate intrusion detection false alarms and identify the root causes of attacks [79].

2.4.3 Statistical Techniques

Statistical analysis techniques have been widely used for characterizing and classifying network traffic and for detecting attack patterns. The basic concept of statistical anomaly detection is to build a profile of normal behavior and then measure large deviations from it. Deviations from the normal profile are tested against a predefined threshold value, and behavior is flagged as anomalous once a deviation exceeds the threshold. Ye et al. [145] presented a host-based anomaly detection technique based on the chi-square (X^2) statistical significance test. A system profile was built from events in normal system audit data, and the upper limit threshold was estimated from the empirical distribution of the normal event data using 3-sigma: the mean plus three times the standard deviation. New events were tested against the normal system profile, and large deviations were flagged as intrusions. The details of applying the chi-square (X^2) test to the identification of intrusions are given in [67]. A sketch of this scheme follows.
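The event categories and counts below are fabricated; the mean-plus-3-sigma limit mirrors the description above:

    import numpy as np

    def chi_square_distance(observed, expected):
        """X^2 = sum over categories of (O_i - E_i)^2 / E_i."""
        observed = np.asarray(observed, dtype=float)
        expected = np.asarray(expected, dtype=float)
        return np.sum((observed - expected) ** 2 / expected)

    # Training: per-session frequencies of four audit-event types (made up).
    normal_sessions = np.array([
        [50, 30, 15, 5],
        [48, 33, 14, 5],
        [52, 29, 16, 3],
        [49, 31, 15, 5],
    ])
    profile = normal_sessions.mean(axis=0)            # expected frequencies

    # Upper limit: mean + 3 * standard deviation of the normal X^2 values.
    train_scores = [chi_square_distance(s, profile) for s in normal_sessions]
    threshold = np.mean(train_scores) + 3 * np.std(train_scores)

    # Testing: a session with an unusual mix of events.
    suspect = np.array([20, 10, 15, 55])
    score = chi_square_distance(suspect, profile)
    print(f"X^2 = {score:.1f}, limit = {threshold:.1f}, "
          f"{'intrusion' if score > threshold else 'normal'}")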

The use of a k-nearest neighbor (KNN) classifier for detecting intrusive program behavior was presented by Liao et al. [95]. A program behavior vector is built from the frequencies of system calls, and the KNN classifier categorizes a new program behavior as either normal or intrusive based on its distance from the k nearest normal profile vectors, using a threshold value. Barbará et al. [32] proposed the use of a Naive Bayes classifier, built from a training data set, to reduce the number of false alarms in ADAM [31], an anomaly detection system.

Change point detection, typically via a cumulative sum (CUSUM) algorithm, has been used to monitor the statistical properties of network features and detect abrupt changes, i.e., deviations from normal behavior resulting from anomalous traffic or attacks. Wang et al. [138] proposed the use of sequential change point detection for detecting TCP SYN flooding attacks: attacks are detected as a violation of the normal behavior of the TCP SYN and FIN flags, an abrupt change in the difference between the number of SYN packets and the number of FIN packets. Ahmed et al. [28] proposed a technique for detecting anomalous activities on a darknet, a class C address block, using sliding windows and a non-parametric cumulative sum. A sketch of such a detector follows.
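This is a one-sided non-parametric CUSUM in the spirit of the SYN/FIN scheme; the drift and threshold values, and the synthetic traffic, are illustrative assumptions:

    import numpy as np

    def cusum_alarms(x, drift, threshold):
        """One-sided non-parametric CUSUM over a sequence of observations.

        x: per-interval statistics, e.g. normalized (#SYN - #FIN) counts.
        drift: allowance subtracted each step so the statistic stays near
               zero under normal behavior.
        threshold: alarm once the accumulated positive deviation exceeds it.
        """
        s, alarms = 0.0, []
        for value in x:
            s = max(0.0, s + value - drift)   # accumulate only upward drift
            alarms.append(s > threshold)
        return alarms

    rng = np.random.default_rng(0)
    normal = rng.normal(0.0, 1.0, 200)   # SYNs and FINs roughly balance
    flood = rng.normal(6.0, 1.0, 50)     # SYN flood: SYNs outnumber FINs
    series = np.concatenate([normal, flood])

    alarms = cusum_alarms(series, drift=2.0, threshold=10.0)
    print("first alarm at interval", alarms.index(True))  # shortly after t=200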

In the context of worm detection, change point detection has been used to detect Internet worms [44]. Yan et al. [143] used change point detection to detect two classes of worms that target Internet messaging systems, by monitoring surges in file transfer requests or URL-embedded chat messages. Feinstein et al. [52] proposed the use of chi-square and entropy statistics for detecting distributed denial of service (DDoS) attacks. The application of the exponentially weighted moving average (EWMA) to intrusion detection was explored by Ye et al. [147]. Finally, Barford et al. [34] proposed the use of wavelet analysis, a signal processing technique, for detecting network traffic anomalies.

2.5 Honeypots

The first use of the honeypot concept was by Cliff Stoll in his book The Cuckoo's Egg [129]. This book describes the author's experience, over a ten-month period in 1986, with an attacker who succeeded in compromising his system. When the attack was discovered, the attacker was allowed to stay, while being monitored, in order to learn more about his tactics, interests, and identity. Stoll used a production system to create the lure that was used to study the attacker's activities, unlike current honeypots, which are decoy computers that run no legitimate services. Another early use of the honeypot concept was described by Bill Cheswick [45]. Cheswick discussed his experience, over several months, with an attacker who broke into his lure system, which had been built with several vulnerable services for the purpose of monitoring threats to his system. The paper is considered the first technical work on building and controlling a lure system, or what would later be called a honeypot.

The term honeypot was first introduced by Lance Spitzner [129]. Spitzner defines a honeypot as "a security resource whose value lies in being probed, attacked, or compromised". Provos et al. [117] define a honeypot as "a closely monitored computing resource that one wants to be probed, attacked, or compromised". These definitions imply that a honeypot can be any type of computing resource, such as a firewall, a web server, or even an entire site. Other properties implied by the definition include that a honeypot runs no real production services and that any contact with it is considered potentially malicious; traffic sent to or from a honeypot indicates either an attack or a compromise of the honeypot. Figure 2.3 shows an example of a virtual honeypot setup that emulates two operating systems, a Windows 2000 server and a Linux server. The whole honeypot setup, including the logging mechanism, is hosted on a single Linux machine. The open source daemon Honeyd [116] is used to emulate the two operating systems.

Honeypots are valuable security resources that are widely known and used. Several characteristics have contributed to their popularity in the security community, the most appealing being the low rates of false positives and false negatives in their collected data. The low-noise nature of a honeypot's collected traffic results from its design concept of running no production services, so that all traffic is considered suspicious. Notable features include:

- honeypots collect small volumes of higher value traffic;
- honeypots are capable of observing previously unknown attacks;
- honeypots detect and capture all attacker activity, including encrypted traffic and commands; and

- honeypots require minimal resources and can be deployed on surplus machines.

[Figure 2.3: An example of a virtual honeypot setup that emulates two operating systems: a host machine running Honeyd emulates a Linux server and a Windows 2000 server, with traffic logged by tcpdump.]

There are several types of honeypots, which can be grouped into four broad categories [129, 117] based on:

- their level of interaction (low- and high-interaction honeypots);
- their intended use (production and research honeypots);
- their hardware deployment type (physical and virtual honeypots); and
- their attack role (server side and client side honeypots).

In the following subsections, we discuss these categories in more detail. In addition, we highlight some of the available honeypot technologies and solutions.

2.5.1 Low-interaction vs High-interaction Honeypots

A low-interaction honeypot is the simplest form of honeypot. It runs no real operating system and offers an emulated network stack with limited or no service interaction. Advantages of low-interaction honeypots include their ease of deployment and the low risk of their being compromised by attackers. Their main disadvantages are the limited amount of information in their collected traffic and the ability of attackers to detect their presence. An example of a low-interaction honeypot is a port listener such as Netcat [5]. The Netcat command

    nc -v -l -p 445 > port445.log

opens a listener on TCP port 445 and accepts connections, logging all activity to the file port445.log (a slightly extended listener in this spirit is sketched at the end of this subsection). More sophisticated examples of low-interaction honeypots include Honeyd [116] and LaBrea [97]. Honeyd is capable of emulating different operating systems (OSs) at the same time and supports emulation scripts for basic protocol behaviors such as FTP (TCP port 21) and SMTP (TCP port 25).

In contrast to low-interaction honeypots, high-interaction honeypots are full systems with real operating systems and real applications. The main advantages of using high-interaction honeypots are the rich information collected from their attack traffic and the difficulty attackers have in detecting their presence. On the other hand, high-interaction honeypots are complex to deploy, overwhelm administrators with vast amounts of collected data, and introduce high risk to networks should they be compromised by attackers. Examples of high-interaction honeypots include Generation II Honeynets [70] and Argos [1]. While high-interaction honeypots provide more information for studying attackers' activities through their full system functionality, they do not scale well to large deployments in terms of hardware and software requirements and cost of maintenance. In contrast, low-interaction honeypots scale very well: thousands of low-interaction honeypots can run in parallel on a single machine. However, low-interaction honeypots suffer from limited (or absent) support for emulation scripts. A detailed discussion of this topic is presented later in this chapter.
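The Netcat one-liner above can be reproduced, with a timestamp and the peer address added to each log entry, in a few lines. This is a toy illustration of a low-interaction port listener, not a hardened honeypot; the port and log format are arbitrary choices:

    import socket
    from datetime import datetime, timezone

    PORT = 445  # binding low ports needs elevated privileges; use a high port to test

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", PORT))
    srv.listen(5)

    with open(f"port{PORT}.log", "ab") as log:
        while True:
            conn, (ip, sport) = srv.accept()
            conn.settimeout(5.0)                  # do not hang on silent peers
            stamp = datetime.now(timezone.utc).isoformat()
            try:
                payload = conn.recv(4096)         # first bytes of the probe
            except socket.timeout:
                payload = b""
            log.write(f"{stamp} {ip}:{sport} ".encode() + payload + b"\n")
            log.flush()
            conn.close()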

Honeyd

Honeyd, a honeypot daemon, is a low-interaction honeypot that was developed in 2002 by Niels Provos of the University of Michigan [116]. Honeyd is an open source distribution that was originally designed to run on UNIX systems and was later ported to the Windows environment. Honeyd is built on the open source packet-capture library libpcap and the packet-manipulation library libdnet. It can detect and log connections to any TCP or UDP port and can monitor up to 60,000 victim IP addresses at the same time. When an attacker tries to connect to a non-existent IP address of a computer system, Honeyd assumes the identity of the non-existent system and replies to the attacker's connection attempts.

Honeyd can emulate different operating systems at the same time, at both the application and IP stack levels. To emulate the TCP/IP stack of a specific operating system (OS), Honeyd relies on the fingerprint database files of Nmap [62] and Xprobe2 [144], the tools most commonly used by hackers for fingerprinting OSs and for manipulating and creating traffic. Because its stack emulation is derived from the very databases these fingerprinting tools consult, Honeyd can fool many attackers. One of Honeyd's most flexible features is the ability to add emulation scripts that mimic applications and network services, either by writing custom scripts in any scripting language or by downloading ready-made scripts from the Honeyd project web page. Ready-made scripts include IIS, FTP, POP, SMTP, and telnet emulators. Moreover, as Honeyd relies on these open source fingerprinting tools for its OS TCP/IP stack emulation, it can be updated simply by refreshing the fingerprint databases. Limitations of Honeyd include the difficulty of writing programs that completely emulate the behaviors and vulnerabilities of network services.

Production vs Research Honeypots

Honeypots can be divided further, based on their intended use, into production and research honeypots. Production honeypots are used by many organizations to protect their production services [70]. They are usually deployed to mirror some or all of an organization's production services in order to study the techniques and tools that attackers use against the organization's real networks, to expose unknown vulnerabilities, and to assess security measures. Moreover, by analyzing data collected by production honeypots, organizations can

build better systems, more easily assess damage to compromised systems, and collect forensic evidence that is not mixed with production traffic. An example of a production honeypot is Honeynet [115].

Research honeypots are usually deployed by universities and research centers to collect information on threats. This information is then used for a variety of purposes, including studying attackers' motivations and tools, and researching better techniques for analyzing honeypot traffic. The Leurré.com project is an example of a research honeypot deployment (see Section 3.1).

Honeynets

Honeynets are high-interaction honeypots that were developed by the Honeynet project [70, 115]. The concept behind the Honeynet is to build a complete network of production systems where all activity is controlled, captured, and analyzed. The Honeynet controls the attacker's activity using a Honeywall gateway. This gateway allows inbound traffic to the victim systems but controls the outbound traffic using intrusion prevention technologies and a connection limiting mechanism. Three Honeynet generations currently exist: Generation I, Generation II, and Generation III. Generation I was developed in 1999 to capture beginner-level attacker activities, with limited data control. The Generation II Honeynet, a derivative of Generation I, was developed in 2002 to address several weaknesses of its predecessor, to improve data control, and to make the Honeynet more difficult to fingerprint, that is, to determine the type of operating system used. Finally, the Generation III Honeynet was released in 2004 with further refinements.

Physical vs Virtual Honeypots

A physical honeypot is a single machine running a real OS and real services, connected to a network and accessible through a single IP address. Physical honeypots are always associated with the concept of high interaction. However, physical honeypots are less practical in real network environments, due to the limited view of their single IP address and the high cost of maintaining a farm of physical machines. Honeynets are examples of physical honeypots. In contrast, virtual honeypots are more cost effective for monitoring large IP address spaces and for emulating different operating systems at the same time. Virtual honeypots are usually implemented using a single physical machine that

hosts several virtual honeypots. User Mode Linux (UML) [21] is a well-known tool for deploying virtual honeypots in Unix environments. The commercial tool VMware [22] allows more flexibility by running different operating systems at the same time. An example of a virtual honeypot is Argos [1]. Other examples implemented using VMware include Collapsar [142], Potemkin [137], and HoneyStat [50].

Argos

Argos [1] presents a new method of deploying virtual honeypots, in which the emulating host monitors and detects attacks against the emulated guests, the honeypots. Argos was designed on top of the open source emulator QEMU [17]. Argos extends QEMU's capability of running multiple operating systems to detect attacks targeting the emulated guests, without any modification of the guest operating systems, through dynamic taint analysis. In dynamic taint analysis, network data from an untrusted source is tagged, and its propagation and execution are tracked. When execution of the tainted data leads to unexpected system behavior, such as a buffer overflow attack, Argos identifies and prevents the use of the tainted data. It dumps the memory block, along with the tagged data and some extra information, for further analysis of the vulnerability. The use of Argos to detect new attacks differs from traditional honeypots in that it applies dynamic taint analysis to the external traffic targeting the emulated hosts. Furthermore, the emulated hosts' IP addresses are advertised.

Server Side vs Client Side Honeypots

Conventional honeypots are server side honeypots, which are set up to lure attackers involved in malicious activities. These honeypots are passive by design and do not initiate any traffic unless they are compromised. An example of a server side honeypot is the low-interaction honeypot Honeyd [116], discussed earlier in this section. Server side honeypots have proved to be useful in detecting new exploits, collecting malware, and enriching threat analysis research.

An emerging trend, in response to client-side attacks, is the active or client-side honeypot [127]. Client-side attacks are attacks that target vulnerable client applications, such as a web browser, when these applications interact with malicious servers. The aim of client-side honeypots is to search for and detect these malicious servers. An example of a client-side honeypot is Strider HoneyMonkey

[140], a Microsoft project to detect and analyze web sites that host malicious code exploiting web browsers. Honeyclient [11] and HoneyC [6] are further examples.

HoneyMonkey

HoneyMonkey [140] is a client-side honeypot developed by Microsoft Research for detecting malicious web content that exploits vulnerabilities in Internet Explorer. HoneyMonkey works by using monkey programs, running under virtual machines, that drive high-interaction honeypots with different configurations and patch levels so as to mimic humans browsing the Internet. HoneyMonkey crawls Internet web sites looking for malicious content that exploits browser vulnerabilities. When web content succeeds in exploiting the browser, HoneyMonkey generates a detailed report of the vulnerability, including the web site URL, the Windows registry changes, and a log of the infected virtual machine.

Improving Honeypots While Lowering Their Risks

One major drawback of low-interaction honeypots is their inability to interact with the attacker to the level needed to reveal the characteristics of the attack. High-interaction honeypots have this capability, but carry an increased risk of being fully compromised. A high-interaction honeypot needs constant monitoring to decrease the legal risks associated with it either being used against other networks or exposing the local production systems to attacks.

Increasing honeypot interactivity while reducing the risk associated with deployment is a useful but challenging task. Early attempts used scripts to mimic services [116]. However, this method proved time consuming and impractical for complex protocols, as it requires a full understanding of the protocol concerned. Several systems have been proposed to address these challenges and to raise the level of honeypot interactivity while keeping the risk low [49, 93]. ScriptGen [93] is a system that extends the capability of the well-known low-interaction honeypot Honeyd with the automatic generation of emulation scripts of protocol behaviors, without requiring prior protocol knowledge. The system starts with a limited emulation capability, extracted by training the system through real interaction with a real server, using a high-interaction honeypot. A state machine is then built on these data in an incremental way to react to attackers' requests. When a request is not recorded in ScriptGen's state machine, the whole conversation is replayed against a high-interaction honeypot

to extract the required response. The emulation script is then refined with this new data in order to respond to similar requests in the future. The use of high-interaction honeypots is thus limited to cases where the response is not present in ScriptGen's current state machine.

Another challenge with honeypots is to keep their deployment hidden, so as to preserve their value and maximize their role in tracking attackers. When an attacker detects that he is dealing with a honeypot, he will generally try to avoid it or feed it with bogus data. Yegneswaran et al. [148] described a technique for defending against honeynet mapping by randomly changing the locations of honeypots in the address space whenever the number of probes exceeds a predefined threshold.

Honeypot Traffic Anomalies

Honeypots are passive machines which, by definition, run no production services, and their deployments are not advertised. These properties make any traffic targeting honeypots suspicious. Analysis of honeypot traffic reveals that honeypots may collect traffic related to other Internet phenomena, such as misconfigured servers and backscatter, and also some legitimate traffic, such as vulnerability scans by local administrators. While it is easy to filter out backscatter [100] and to detect local administration scans (as the scanning IPs are known), detecting and eliminating misconfiguration traffic is very challenging [151]. The remaining categories of traffic seen by honeypots are considered malicious and fall under the broad categories of threats facing computers and networks connected to the Internet, such as denial of service attacks, scans, and worms.

Anomalous traffic collected from different IP address spaces, including honeypots, also shows that the patterns and volumes of this traffic vary from one location to another [30]. These differences have been observed both across all protocols and services in aggregate and within individual protocols. The variability in traffic volumes and patterns can be attributed to many factors, including the filtering policy, the configuration of the monitored address space, the propagation strategy of the malicious code, and limited global reachability due to poor or absent routing.
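To make the backscatter-filtering step concrete: backscatter consists of unsolicited responses (for example, SYN-ACK or RST segments) provoked by attacks elsewhere on the Internet that spoofed the monitored address space as their source. Since a honeypot initiates no connections, such packets can be recognized from their TCP flags alone. The following is a minimal sketch under that assumption; the packet representation (dictionaries with "proto" and "flags" keys) is illustrative, not the data format used in this thesis.

```python
# A minimal sketch of flag-based backscatter filtering. A honeypot
# never opens connections, so inbound SYN-ACK and RST segments are
# replies to traffic it never sent, i.e. likely backscatter.

BACKSCATTER_FLAGS = {
    "SA",  # SYN-ACK: reply to a connection request we never made
    "R",   # RST: abort of a connection we never opened
    "RA",  # RST-ACK
}

def is_backscatter(pkt: dict) -> bool:
    return pkt.get("proto") == "tcp" and pkt.get("flags") in BACKSCATTER_FLAGS

packets = [
    {"proto": "tcp", "flags": "S"},   # inbound SYN: a probe, keep it
    {"proto": "tcp", "flags": "SA"},  # unsolicited SYN-ACK: filter out
    {"proto": "tcp", "flags": "R"},   # unsolicited RST: filter out
]
suspicious = [p for p in packets if not is_backscatter(p)]
print(len(suspicious))  # -> 1
```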

Existing Honeypot Solutions

Honeypots are a relatively recent security technology, yet they have drawn the attention of significant numbers of researchers and network administrators. This increased interest in honeypots has led to the development of a variety of honeypot technologies. The following subsections review several existing honeypot-based solutions for countering different types of security threats.

Automatic Generation of IDS Signatures

Honeycomb [83] is a honeypot system that automatically generates intrusion detection signatures for unknown attacks. Honeycomb is built as an extension to the open source honeypot Honeyd; this integration enables it to see the traffic sent and received by Honeyd, as well as Honeyd's connection states. Honeycomb uses pattern-detection techniques and packet-header conformance tests to generate signatures for the two popular intrusion detection systems Bro and Snort. Honeycomb is one of the few systems that exploit the honeypot's characteristic of collecting only malicious traffic, automating what is otherwise a manual process of analyzing the collected data. Honeycomb shares the limitations inherited from Honeyd.

Worm Detection Systems

Computer worms are defined as independent replicating and autonomous infection agents, capable of seeking out new host systems and infecting them via the network [102]. By using worms, attackers can do massive damage to the Internet by compromising vast numbers of hosts; such damage includes distributed denial of service (DDoS) attacks and the access and corruption of sensitive data [130]. Honeypot solutions aimed at detecting worms include HoneyStat [50] and SweetBait [107].

HoneyStat [50] is a worm detection system for local networks using honeypots. It is implemented using a VMware GSX Server running virtual machines of several operating systems, such as Windows and Linux. It represents the third generation of honeypot deployments at the Georgia Institute of Technology, following the deployments of Honeynets Gen I and Gen II. HoneyStat was designed based on modeling worm infections in a honeypot. It monitors unused address spaces and generates three types of alerts: memory, disk, and network alerts. These streams of alerts are automatically collected and

statistically analyzed, using logistic regression, to detect worm outbreaks. HoneyStat uses data from the local network only, which limits the amount of traffic that can be observed by its nodes.

Malware Collection

Malware is malicious software for exploiting vulnerabilities in computer systems. Types of malware include viruses, worms, and trojan horses. Several honeypot projects collect malware, including Nepenthes [12], Honeytrap [103], and IBM Billy Goat [124].

Nepenthes [12] is an open source low-interaction honeypot for collecting malware. As honeypots are the most effective tool for collecting malware, Nepenthes was developed specifically to fill a gap that existed in honeypot technology for collecting automated malicious software. Nepenthes inherits the main characteristic of low-interaction honeypots, emulating thousands of honeypots at the same time with low hardware requirements, and it excels in efficiency because it emulates only the vulnerable parts of services. Another appealing feature of Nepenthes with regard to capturing malware is the flexibility of its emulation process: Nepenthes is able to decide at run time the configuration required for an exploit to succeed, for example, whether Unix or Windows is required. Finally, the deployment and maintenance cost of Nepenthes is minimal, and it carries very low risk, as all systems and services are emulated. However, the use of Nepenthes is limited to self-propagating malware that first scans for vulnerabilities and then attempts to exploit them.

2.6 Related Work

The previous sections have established the background for the thesis as a whole. This section reviews research outcomes from the Leurré.com project and provides other background and related material relevant to the work described in Chapters 3 to 6. Research challenges in honeypot traffic analysis are also identified.

Research Outcomes from the Leurré.com Project

Various types of analysis have been carried out on honeypot traffic obtained from the Leurré.com project. The aims of this analysis were to characterize different

Internet attack activities and to unveil useful attack patterns [109, 114, 108, 113, 105].

Pouget et al. [112] applied association rule mining [27] to different features of low-interaction honeypot traffic, with the port sequence of a large session as the main clustering feature. The aim of their study was to group traffic that shares similar activity fingerprints into clusters in order to find the root causes of attacks, that is, the attacking tools. In their research, each cluster is assumed to represent one attacking tool or a re-configuration of it. The clustering of honeypot traffic was investigated further by Pouget et al. [110], and the notion of cliques was introduced to identify the inter-relationships between clusters, that is, clusters that share strong similarities in one or more dimensions, such as targeted environments and origin of attacks.

The use of packet Inter-Arrival Times (IATs) for characterizing anomalous honeypot traffic was introduced by Zimmermann et al. [151]. The study was conducted on six months of honeypot traffic data. The usefulness of the IAT in characterizing anomalous honeypot traffic was demonstrated through the discovery of several anomalous activities, from different IP sources, that share similar IAT peak distributions.

Thonnard et al. [134] proposed a framework for identifying attacks that share similar patterns based on the selection of different traffic features. The model was demonstrated using time signatures to find temporally correlated attacks. The framework utilizes a clique-based clustering algorithm to group pre-clustered honeypot traffic.

Application of Principal Component Analysis to Internet Traffic

Principal component analysis (PCA) is a statistical technique for reducing the dimensionality of data into a few uncorrelated variables that retain most of the variation in the original data. These newly derived variables are called principal components (PCs), and they can be used in place of the original variables. The reduction in the number of variables serves as a basis for many data analysis techniques, including data reduction, data visualization, and outlier detection [75, 74]. The applications of PCA to computer network traffic fall roughly into three categories: detecting the latent structure of the traffic data, reducing the dimension

of the traffic data, and identifying anomalies. A number of researchers have used principal component analysis to reduce the dimensionality of variables and to detect anomalous network traffic.

The use of PCA to structure network traffic flows was introduced by Lakhina [87], whereby principal component analysis is used to decompose the structure of Origin-Destination flows from two backbone networks into three main constituents, namely periodic trends, bursts, and noise. Labib et al. [86] utilized PCA for reducing the dimension of traffic data and for visualizing and identifying attacks. To detect different types of attacks, the loadings of the attack features on the retained PCs were compared against a predefined threshold and visualized using bi-plots. Bouzida et al. [42] presented a performance study of two machine learning algorithms, nearest neighbors and decision trees, when used on traffic data with and without PCA. They found that when PCA is applied to the KDD 99 data set to reduce the dimension of the data, the learning speed improves while accuracy remains the same. Terrell et al. [133] used principal component analysis on features of the aggregated network traffic of a link connecting a university campus to the Internet in order to detect anomalous traffic. Sastry et al. [126] proposed the use of singular value decomposition and the wavelet transform for detecting anomalies in self-similar network traffic data.

Wang et al. [139] proposed an anomaly intrusion detection model for monitoring network behaviors based on principal component analysis. The model utilizes PCA to reduce the dimensions of the historical data and to build the normal profile, represented by the first few components that account for most of the variation in the data. An anomaly is flagged when the distance between a new observation and the normal profile exceeds a predefined threshold. Ye et al. [146] studied the performance of Hotelling's T² test, a multivariate statistical process control technique equivalent to retaining all components in the PCA model, against a chi-squared distance test for host-based anomaly detection. The study was conducted on two data sets of different sizes, and it was concluded that the chi-squared test scales well for real-time detection, while Hotelling's test detects counter-relationships, or changes in the structure, of the variables.

Shyu et al. [128] proposed an anomaly detection scheme based on robust principal component analysis. Two classifiers were implemented to detect anomalies: one based on the major components that capture most of the variation in the

data, and the second based on the minor components, or residuals. A new observation is considered an outlier, or anomalous, when the sum of squares of its weighted principal components exceeds a threshold in either of the two classifiers.

Lakhina et al. [88] applied the principal component analysis technique to Origin-Destination (OD) flow traffic counts of link data bytes. The network traffic was separated into normal and anomalous subspaces by projecting the data onto the resulting PCs one at a time, ordered from high to low. PCs are assigned to the normal subspace as long as a predefined threshold (3-sigma) is not exceeded; when the threshold is exceeded, that PC and all subsequent PCs are assigned to the anomalous subspace. New OD flow traffic is projected onto the anomalous subspace, and an anomaly is flagged if the value of the square prediction error, or Q-statistic, exceeds a predefined limit. The subspace method was extended by the same authors [25] to detect anomalies in multivariate time series of OD flow traffic with three features (number of bytes, number of packets, and number of flows); their extended model tests for anomalies in both the normal space and the anomalous space. Guangzhi et al. [119] proposed a real-time detection system based on multivariate statistics. The normal profile of the network system was built from attributes of the network hierarchy using Hotelling's T² statistic; new traffic triggers an alarm if its distance from the normal region exceeds the predefined upper and lower control limits. Terrell et al. [133] used singular value decomposition (SVD), a different method for extracting principal components, to detect attacks in near real-time Internet traffic collected every hour from a university's main link. Attack traffic is aggregated into bins of different sizes, and features are extracted from these bins. Attack detection is achieved by measuring the weighted sum of squares of the least significant component scores against a predetermined threshold value extracted from a gamma distribution, under a normality assumption for the network traffic.

Research Challenges

As discussed in Section 2.2, the amount of traffic collected by monitoring methods is tremendous, which makes it very difficult to handle, store, and analyze. Adding to this challenge is the dynamic nature of anomalous traffic, which changes frequently due to factors including changes in software configurations and the deployment of new protocols and services.

While honeypots excel in collecting smaller traffic volumes compared to other traffic collection techniques, extracting useful anomalous patterns or detecting new attacks in honeypot traffic necessitates research into better analysis techniques to summarize and process this traffic. This study has identified several research challenges that need to be addressed for efficient detection of anomalous honeypot traffic:

- The proposed technique should be capable of extracting useful attack patterns from traffic gathered by low-interaction honeypots, which has a low level of detail;
- The proposed technique must have the capacity to handle and summarize traffic data sets with multiple features;
- The proposed technique must have low computational requirements and be suitable for real-time application;
- The proposed technique should adapt to the dynamic nature of Internet attacks and capture new trends with little or no human intervention or tuning.

Two aspects of traffic analysis need to be considered to address these research challenges: traffic representation and traffic analysis techniques. Broadly speaking, traffic features are extracted either from packet level data or from flow level data. While packet level analysis provides a wealth of information on all aspects of attacks, including access to payloads [51], it suffers from performance issues when handling large amounts of data. In contrast, network flows, such as NetFlow [132] and Argus [2], provide an aggregated summary of connection information [63]. Traffic flows provide enough information to characterize a variety of Internet anomalies [33], excel in their performance for real-time analysis, and are able to aggregate large amounts of traffic data in a summarized form.

In our search for analysis techniques that address the previously mentioned research challenges, we have selected data reduction techniques, a subset of multivariate statistical techniques, for analyzing honeypot traffic. Data reduction refers to the process of reducing the number of variables in a data set while retaining enough information for the intended type of analysis. Examples of linear dimension reduction techniques are principal component analysis, projection pursuit, factor analysis, and independent component analysis.
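Of these, PCA, together with its residual-space statistics, underpins the detection methods developed in Chapters 4 to 6. As a concrete illustration of the subspace idea reviewed above (projecting observations onto a residual subspace and thresholding the square prediction error), the following is a minimal sketch on synthetic data; the feature matrix, the number of retained components, and the percentile-based threshold are all illustrative assumptions, not the model developed in this thesis.

```python
# A minimal sketch of PCA-based detection using the residual subspace
# and the square prediction error (SPE, or Q-statistic). Synthetic
# data; the control limit here is a simple empirical percentile.
import numpy as np

rng = np.random.default_rng(0)

# Training data: 200 observations of 6 correlated "traffic features"
# that lie close to a 2-dimensional subspace.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(200, 6))

# Centre the data and extract PCs from the covariance eigendecomposition.
mu = X.mean(axis=0)
cov = np.cov(X - mu, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # components from high to low variance
P = eigvecs[:, order[:2]]              # retain k = 2 major components

def spe(x: np.ndarray) -> float:
    """Squared distance of x from the subspace of the retained PCs."""
    xc = x - mu
    residual = xc - P @ (P.T @ xc)     # projection onto the residual space
    return float(residual @ residual)

# Empirical control limit taken from the training data.
limit = np.percentile([spe(x) for x in X], 99)

x_new = 3 * rng.normal(size=6)         # an observation off the model
print(spe(x_new) > limit)              # True -> flag as new/unseen activity
```

In the honeypot setting developed later in this thesis, the model is fitted to past attack traffic rather than to normal traffic, so an SPE exceedance signals an attack that is not represented in the model rather than a departure from normality.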

Principal component analysis (PCA), often computed using singular value decomposition (SVD), is the best known and most widely used linear dimension reduction technique [54, 36]. PCA is easy to implement and has low computational requirements. The basic idea of PCA is to reduce the dimensionality of a data set to a few uncorrelated variables, or principal components (PCs). The resulting principal components are linear combinations of the original variables and retain most of the variance in the original data. Principal component analysis is explained in detail in Chapter 4.

The aim of the Leurré.com project's clustering approach is to find the root causes of attacks, where each cluster is assumed to represent one attacking tool or a re-configuration of it. The aim of this research, in using PCA, is different: it seeks to determine the factors that contribute to the variations in traffic patterns, and to detect new attacks in real-time applications.

Previous applications of PCA to network traffic treat the traffic as either normal or anomalous, and the detection model is built on what is believed to be normal. The notion of normal and anomalous does not apply to honeypot traffic, where all traffic is suspicious. Thus, our technique is fundamentally different from previous applications of PCA in the following ways. Firstly, traffic features are extracted from aggregated flows, where standard flows from a single IP address are grouped together to provide sufficient information on attack patterns. Secondly, PCA is used to build a model of existing attacks that have been seen in the past, rather than a model of normal behaviors; any large deviation from the attack model is considered either a new attack vector or an attack that is not present in the model. Finally, the use of recursive principal component analysis (RPCA) is introduced in order to design a real-time adaptive detection model that both captures new changes in anomalous Internet traffic and updates its parameters automatically.

Honeypot Traffic Validation

Broadly speaking, there are two methods for validating an attack detection model. The first consists of manually labeling attacks in a data set and then testing the model's performance, in terms of false positive and false negative criteria,

against this labeled data. The second validation approach is based on testing the model's performance against synthetically crafted attacks that are manually injected into the data set. Applying these classical IDS validation methods to a low-interaction honeypot detection model is not possible, as the notions of normal and abnormal do not apply: the collected traffic is considered suspicious per se, and the low level of detail available makes labelling and categorizing the collected traffic difficult or impossible.

To address these challenges, in the absence of a well established methodology for validating a honeypot traffic detection model, manual inspection of the traffic flagged by the detection method is used. Although manual inspection of traffic is expensive, it allows a better understanding of the nature, significance, and classes of the flagged traffic. Manual inspection of traffic has been used previously to validate and inspect low-interaction honeypot traffic [111, 151, 134].

2.7 Summary

In this chapter, current research in monitoring anomalous network traffic has been presented, with emphasis on honeypots as essential tools for gathering useful information on a variety of malicious activities. Honeypots come in three basic varieties: fully dedicated real systems such as Honeynets, emulated service honeypots such as Honeyd, and virtual honeypots such as Argos. Low-interaction honeypots pose minimal risks to the network and require little administration effort when compared with high-interaction honeypots, making them ideal for research.

Running honeypots results in the collection of a huge amount of traffic data. Extracting useful information and knowledge from this data requires efficient techniques for discovering hidden attack patterns and detecting new attacks. Three types of data analysis techniques have been widely used in traffic analysis: data mining, statistical analysis, and visualization. Research in applying these techniques to network traffic has been reviewed. In addition, research challenges in the field of honeypot traffic analysis have been identified.

The next chapter presents this study's first contribution to analyzing low-interaction honeypot traffic, using data from the Leurré.com project, and details the methodology for improving the Leurré.com clusters by grouping clusters that share similar types of activities based on packet inter-arrival time distributions. The use of principal component analysis to characterize and visualize

low-interaction honeypot traffic is described in Chapter 4. Detecting new attacks in low-interaction honeypot traffic through the use of the principal component residual space and the square prediction error (SPE) statistic is detailed in Chapter 5. Finally, Chapter 6 proposes an adaptive detection model that captures changes in Internet traffic and updates its parameters automatically.


Chapter 3

Traffic Analysis Using Packet Inter-arrival Times

The Leurré.com project is a world-wide deployment of identical low-interaction honeypot platforms. This chapter gives a brief introduction to the Leurré.com project setup and its methodology for collecting and processing honeypot traffic, with a focus on Leurré.com's attack clusters. A new methodology is then proposed that overcomes a limitation of Leurré.com's clustering technique, namely its production of a large number of clusters that share some similarities. The new method is based on grouping clusters that share similar packet inter-arrival time (IAT) distributions.

This chapter is organized as follows. Section 3.1 provides a brief overview of Leurré.com's terminology, platform architecture, and data manipulation. Section 3.2 provides a preliminary investigation of packet inter-arrival times. Section 3.3 discusses our methodology for analyzing pre-clustered honeypot traffic using packet inter-arrival times. Experimental results are presented in Section 3.4. Finally, Section 3.5 summarizes the chapter.

The work described in this chapter is a joint project with the Leurré.com team and has led to the following publication: S. Almotairi, A. Clark, G. Mohay, O. Thonnard, M. Dacier, C. Leita, V. Pham, J. Zimmermann, "Extracting Inter-arrival Time Based Behaviour from Honeypot Traffic using Cliques", in the Proceedings of the 5th Australian Digital Forensics Conference, Perth, Australia, Dec 2007.

3.1 Information Source

All analyses in this thesis use data that comes directly from the Leurré.com project, a world-wide deployment of low-interaction honeypots. This section gives a brief overview of Leurré.com's terminology, platform architecture, and data manipulation.

The Leurré.com Honeypot Platform

The Leurré.com project is a world-wide deployment of identical low-interaction honeypot platforms. Each platform, based on the open source low-interaction honeypot Honeyd [116], runs on a single machine and emulates three operating systems at the same time: Windows 2000 Professional, Windows 2000 Server, and Linux RedHat 7.3. The Windows virtual hosts have the following open services to provide interaction with attackers: TCP ports 21, 23, 80, 139, and 445, and UDP port 137. The UNIX virtual honeypot has the following open TCP ports: 21, 22, 25, 80, 111, 514, 515, and 8080. The platform architecture is presented in Figure 3.1.

Figure 3.1: Leurré.com honeypot platform architecture. A single Unix host machine runs Honeyd, which emulates the three virtual honeypots (Windows 2000 Professional, Windows 2000 Server, and Linux) on consecutive IP addresses behind a router, with all traffic recorded by a Tcpdump logger.

Data Manipulation

The data manipulation of Leurré.com's honeypot traffic is done offline. On a daily basis, traffic logs from all platforms are transferred to a centralized machine, where they are processed, enriched with external data, and inserted into relational database tables. In this section, some of the data manipulation tasks and terminology are reviewed.

Port Sequences

The port sequence is the main feature of Leurré.com's clustering algorithm. A port sequence is the list of targeted honeypot ports generated by a single IP address during the attack period. For example, the port sequence of an attack generated by an IP address that targeted TCP ports 139, 445, 998, 139, and 445, UDP port 137, and sent ICMP traffic would be: {T139,T445,T998,U137,I}. Figure 3.2 provides an illustration of the port sequence of an attack.

Figure 3.2: Illustration of the port sequence of an attack. The attack vector ICMP, TCP 139, TCP 445, TCP 998, TCP 139, TCP 445, UDP 137 against the honeypot platform yields the port sequence I,T139,T445,T998,U137.

Large vs. Tiny Sessions

Two notions of sessions are currently used in the Leurré.com project: large and tiny sessions. While a large session is the set of all activities and packets exchanged by one source against one platform, a tiny session is a subset of the large session

covering the activities of one source against a single virtual host (every platform runs three virtual hosts). Accordingly, a large session consists of three tiny sessions. A large session is terminated when the next packet from the same source arrives more than 25 hours later.

Traffic Clusters

Large sessions that share similar traffic fingerprints are grouped together according to a hierarchy-based clustering approach [111]. The aim of the clustering algorithm is to discriminate between different attacking activities based on their distinct cluster signatures, where each cluster represents an attacking tool. The features currently utilized by the clustering algorithm include: the number of targeted virtual machines on the honeypot platform; the sequence of ports; the number of packets sent by the attacking source; the number of packets sent to each honeypot virtual machine; the duration of the attack; and the ordering of attacks against the virtual machines. A final refinement step of the incremental clustering approach is payload validation: the payloads sent by an attacker within a large session are ordered according to their arrival and concatenated, and the Levenshtein-based distance [29] is then used to check the consistency of clusters for a possible further split, where attack payloads are available.

3.2 Preliminary Investigation of Packet Inter-arrival Times

Packet inter-arrival times (IATs) are the time intervals between packets arriving from the same attacker's IP address at a single honeypot machine; see Figure 3.3. The IAT has been widely used in network traffic analysis to infer denial of service (DoS) attacks [72], to study network congestion [136], and to study unsolicited Internet traffic [151].
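Computing IATs from a packet trace is straightforward: group the packets by source IP address, sort each group's arrival timestamps, and take first differences. The following is a minimal sketch; the function name and the sample timestamps are illustrative, not taken from the Leurré.com data.

```python
# A minimal sketch of IAT extraction: packets are grouped by source IP,
# and the IAT vector is the sequence of differences between consecutive
# arrival timestamps.
from collections import defaultdict

def iat_vectors(packets: list[tuple[str, float]]) -> dict[str, list[float]]:
    """packets: (source_ip, arrival_time_in_seconds) pairs."""
    arrivals: dict[str, list[float]] = defaultdict(list)
    for src, ts in packets:
        arrivals[src].append(ts)
    return {
        src: [t2 - t1 for t1, t2 in zip(times, times[1:])]
        for src, times in ((s, sorted(t)) for s, t in arrivals.items())
    }

trace = [("10.0.0.1", 0.0), ("10.0.0.1", 2.0),
         ("10.0.0.1", 902.0), ("10.0.0.1", 3620.0)]
print(iat_vectors(trace))  # {'10.0.0.1': [2.0, 900.0, 2718.0]}
```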

While certain traffic features excel at characterizing particular types of attack activity, this study investigates the application of the IAT as a meaningful and discriminatory feature for identifying traffic that shares similarities (i.e., traffic caused by the same attacking tools or originating from the same sources) but has been placed in different attack clusters. Other cluster features, such as the geographical location of the attacker or the location of the targeted platform, might be relevant for studying other attack phenomena, such as the popularity of certain tools with certain IPs or the observation of specific tools being used against particular environments. The main focus of this work is the identification of the repeated use of attack tools that exhibit similar packet inter-arrival time distributions. However, this methodology can be applied to other types of analysis.

Figure 3.3: Illustration of packet inter-arrival times. The arrival times of six packets P1 to P6 from one attacker are converted into the IAT vector {2, 900, 2718, 30, 13450} (in seconds).

Prevalence of IATs in Honeypot Traffic

We have carried out a simple frequency analysis of the IATs of honeypot traffic observed by Leurré.com for the period of time from January 2003 until June. Traffic data was collected without the notion of large and tiny sessions, which are used to classify sessions, and all packets from one source to one destination were arranged in a single vector. Figure 3.4 shows the global distribution of IATs across all platforms for IATs that are less than seconds. In contrast, Figure 3.5 provides a zoom into the IAT values that range from 0 to seconds.

Figure 3.4: The global distribution of all IATs of less than seconds.

Figure 3.5: The IAT distribution for values that range from 0 to seconds.

The figures show the prevalence of IAT peaks as multiple spikes of various heights, locations, and spacings. Table 3.1 lists the top ten IAT peaks, sorted by number of packets in descending order. As Table 3.1 shows, IAT peaks are caused by different IP addresses (distinct IP

addresses) that have repeatedly attacked the honeypot platforms (distinct source Ids) and targeted different hosts (distinct host Ids). When the same attacking IP returns after 25 hours, it is assigned a new source Id. Although these attackers used different IP addresses and targeted different honeypot platforms, they generated similar IAT fingerprints in terms of IAT peaks.

Table 3.1: Distinct sources and destinations of the top ten IATs (columns: IAT Peak; Distinct IP Addresses; Distinct Source Ids; Distinct Host Ids; Number of Packets).

3.3 Cluster Correlation Using Packet Inter-arrival Times

Preliminary analysis of the Leurré.com clusters showed that the clustering algorithm results in a large number of clusters, some of which share common attack features. This study was therefore carried out to group clusters that share similar types of activities. The method for grouping clusters was based on finding clusters that share similar packet inter-arrival time (IAT) distributions.

All traffic collected by the distributed platforms of the Leurré.com project was classified into clusters according to the clustering approach utilized by the project. The IAT distribution of each cluster was represented by a vector in which every element corresponded to the IAT frequency of a pre-defined bin (a range of time values). The ranges were chosen to be more fine-grained for the shorter IATs and for certain peak values. The IAT ranges corresponding to the bin values (per cluster) are listed in Table 3.2. The result was an IAT vector of 152 bins,

with the first bin grouping IATs that fall in the interval 0-3 seconds, and the last bin corresponding to IATs of 25 hours or more.

Bin   Start Time   Stop Time   Comment
1     0:00:00      0:00:03
2     0:00:04      0:00:08     [5 second increments]
7     0:00:29      0:00:33
8     0:00:34      0:00:43     [10 second increments]
17    0:02:04      0:02:13
18    0:02:14      0:02:43     [30 second increments]
43    0:14:44      0:14:57
44    0:14:58      0:15:02     15 minute peak
45    0:15:03      0:15:32
46    0:15:33      0:29:57
47    0:29:58      0:30:02     30 minute peak
48    0:30:03      0:45:02
49    0:45:03      0:59:57
50    0:59:58      1:00:02     1 hour peak
                               [15 minute increments]
54    1:45:03      1:59:57
55    1:59:58      2:00:02     2 hour peak
                               [15 minute increments]
63    3:45:03      3:59:57
64    3:59:58      4:00:02     4 hour peak
                               [15 minute increments]
80    7:45:03      7:59:57
81    7:59:58      8:00:02     8 hour peak
                               [15 minute increments]
97    11:45:03     11:59:57
98    11:59:58     12:00:02    12 hour peak
                               [15 minute increments]
114   15:45:03     15:59:57
115   15:59:58     16:00:02    16 hour peak
                               [15 minute increments]
152   25:00:03     -           25 hours or more

Table 3.2: Bin values of IAT ranges.
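The binning step itself is a simple range lookup. The sketch below maps an IAT value (in seconds) to a bin index using a sorted list of bin lower edges; the edges shown are a small illustrative subset of Table 3.2, not the full 152-bin definition.

```python
# A minimal sketch of IAT binning: bin i covers [EDGES[i], EDGES[i+1])
# seconds, and anything at or beyond the last edge falls into the final
# open-ended bin. Only a handful of boundaries are reproduced here.
import bisect

EDGES = [0, 4, 9, 14, 19, 24, 29, 34,   # the first few 5-second bins
         25 * 3600 + 3]                  # start of the ">= 25 hours" bin

def iat_bin(iat_seconds: float) -> int:
    """Return the 0-based bin index for a single IAT value."""
    return bisect.bisect_right(EDGES, iat_seconds) - 1

def iat_histogram(iats: list[float]) -> list[int]:
    """Accumulate the per-cluster frequency vector, one counter per bin."""
    counts = [0] * len(EDGES)
    for iat in iats:
        counts[iat_bin(iat)] += 1
    return counts

print(iat_bin(2))           # -> 0 (the 0-3 second bin)
print(iat_bin(26 * 3600))   # -> 8 (the ">= 25 hours" bin)
```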

Data set

The data set used in this study covers three months of traffic (March-May 2007) collected from all environments of the Leurré.com project. This was the most recent data available at the time the research was conducted. A three-month period was chosen as it provided enough data to demonstrate the effectiveness of the proposed technique while remaining manageable in size.

For each cluster, the IAT frequencies (from the tiny sessions which took place within that three-month period) were extracted, and the values in the corresponding vector bins (as described above) were incremented. Only those clusters which had at least one bin after the 21st bin (the 22nd bin corresponds to around five minutes) with a count of more than 10 were considered. This means that clusters which did not have more than 10 occurrences of a particular IAT value greater than five minutes were ignored. Indeed, small IAT values are less meaningful for this analysis because of network artefacts such as congestion, packet loss, and transmission latency. As might be expected, the vast majority of tiny sessions (and clusters) contain packets with only relatively small inter-arrival times. As a result, the cliquing algorithm focuses on differentiating the behavior exhibited by clusters which contain large, regular IATs.

Measuring Similarities

Most pattern matching and data mining techniques rely heavily on the concept of similarity, or closeness, in grouping objects into clusters. One common way of measuring similarity is through the use of distance (or dissimilarity) measures: as the similarity between two objects increases, the distance between them decreases, and the two objects are considered similar when the distance between them becomes zero. There are many types of distances, such as the Canberra and Manhattan distances, but by far the most commonly used is the Euclidean distance. In its simple form, in two-dimensional space, the Euclidean distance represents a straight line between two points P1 and P2, and is given by a single non-negative number that measures the number of units between the two points. The Euclidean distance between the two points P_1 = (x_1, y_1) and P_2 = (x_2, y_2) is:

d(P_1, P_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}    (3.1)

As the previous example shows, the Euclidean distance treats coordinates equally and does not account for differences in the variability contributed by each coordinate, which makes it very sensitive to the scale of the variables. This limitation of the Euclidean distance, together with the nature of the application (clustering time series vectors of packet inter-arrival times), necessitates the search for an alternative distance suited to finding temporal similarities between vectors of IATs.

The Symbolic Aggregate Approximation (SAX) distance, introduced by Lin et al. [96], is applied to the IAT vectors with the aim of finding temporal similarities. The SAX representation of a time series is obtained by first converting the time values to a Piecewise Aggregate Approximation (PAA), that is, dividing the signal into equal segments and calculating the mean value of each segment. The PAA is then transformed into symbols. Figure 3.6 illustrates the SAX technique by converting a time series into a word of 16 symbols (DDDDDCBABCDEDDEF).

Figure 3.6: A time series conversion using SAX.

To find the minimum distance between two time series Q and C of the same length n, the time series are transformed into PAA representations \hat{Q} and \hat{C} of w symbols each. The MINDIST() function is then given by:

MINDIST(\hat{Q}, \hat{C}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left(dist(\hat{q}_i, \hat{c}_i)\right)^2}    (3.2)

The dist() function returns the distance between two PAA symbols and can

be implemented using a table look-up for better computational efficiency. Details of the function implementation, and source code, can be found on the SAX home page [18].

The similarity matrix M, of size m × m, is constructed using the SAX distance described above, where m is the number of time series vectors. Given two clusters i and j, M(i,j) represents the similarity between these two clusters in a symmetrical way, where M(i,j) = M(j,i) and the diagonal entries are equal to zero.

Cliquing Algorithm

Due to the large quantity of data collected, it was necessary to rely on an automated methodology able to extract relevant information about the attack processes. The correlative analysis relies on concepts from graph and matrix theory. In this context, a clique (also called a complete graph) is an induced subgraph of an (un)directed graph in which the vertices are fully connected. In this case, each node represents a cluster, while an edge between a pair of nodes represents a similarity measure between the two clusters. Figure 3.7 provides an illustrative example of finding the cliques of a graph.

Figure 3.7: An example of finding cliques.
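The following sketch ties these pieces together: it converts IAT vectors into SAX words, computes a pairwise distance in the spirit of MINDIST (Equation 3.2), and then groups vectors with a greedy threshold pass like the one detailed in Figure 3.8 below. The alphabet size, word length, breakpoints, and threshold are illustrative choices, and the code is a simplified approximation of SAX rather than the reference implementation. Note also that Figure 3.8 is phrased in terms of a similarity exceeding a threshold, whereas with a distance measure the test inverts.

```python
# A simplified sketch of SAX-style distance and threshold cliquing.
# BREAKPOINTS are the standard Gaussian cut points for a 4-symbol
# alphabet; word length, threshold, and input data are illustrative.
import math

BREAKPOINTS = [-0.67, 0.0, 0.67]   # alphabet size a = 4 (symbols 0..3)

def sax_word(series: list[float], w: int = 8) -> list[int]:
    """Z-normalize, reduce to w PAA segments, discretize to symbols."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n) or 1.0
    z = [(x - mean) / std for x in series]
    paa = [sum(z[i * n // w:(i + 1) * n // w])
           / max(1, (i + 1) * n // w - i * n // w) for i in range(w)]
    return [sum(1 for b in BREAKPOINTS if v > b) for v in paa]

def symbol_dist(r: int, c: int) -> float:
    """Lookup-table distance between symbols: 0 for adjacent symbols."""
    if abs(r - c) <= 1:
        return 0.0
    return BREAKPOINTS[max(r, c) - 1] - BREAKPOINTS[min(r, c)]

def mindist(q: list[float], c: list[float], w: int = 8) -> float:
    """Distance between two equal-length series, as in Equation 3.2."""
    qw, cw = sax_word(q, w), sax_word(c, w)
    return math.sqrt(len(q) / w) * math.sqrt(
        sum(symbol_dist(a, b) ** 2 for a, b in zip(qw, cw)))

def cliques(vectors: list[list[float]], threshold: float) -> list[list[int]]:
    """Greedy grouping as in Figure 3.8: seed with the first unassigned
    vector and pull in every remaining vector within the threshold."""
    remaining = list(range(len(vectors)))
    groups = []
    while remaining:
        seed = remaining.pop(0)
        group = [seed] + [j for j in remaining
                          if mindist(vectors[seed], vectors[j]) <= threshold]
        remaining = [j for j in remaining if j not in group]
        groups.append(group)
    return groups
```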

Determining the largest clique in a graph is often called the maximal clique problem, and it is a classical NP-complete problem in graph theory [43]. Although numerous exact algorithms [84, 85, 40] and approximate methods [41, 104] have been proposed to solve this problem, this study addressed the computational complexity of the clique problem by applying our own heuristic to generate sets of cliques very efficiently. While this technique is relatively straightforward, it possesses two significant features. Firstly, it delivers very coherent results with respect to the analyzed similarities. Secondly, regarding computational speed, it out-performs other algorithms by several orders of magnitude. For example, we applied the approximate method proposed in [104], which consists of iteratively extracting dominant sets of maximally similar nodes from a similarity matrix: on our data set, the total computation was very expensive (several hours), whereas the custom cliquing algorithm took only a few minutes to generate the same cliques of clusters from the same data set. On the other hand, our heuristic imposes a constraint on the similarity measure, namely that it has to be transitive. With this restriction, it is sufficient to compute the correlation between one specific node and all other nodes in order to find a maximal clique of similar nodes. The transitive property was achieved by carefully setting a global threshold on the measurement of similarities between clusters. The algorithm takes advantage of the already created cliques to progressively decrease the search space, so in the average case the algorithmic complexity will be less than O(n²), and a complexity of order O(n log n) would typically be expected. The clique algorithm is detailed in Figure 3.8.

Input:  C, a list of n IAT vectors of clusters; Q, a threshold value
Output: Cliques
1  i = 0                               % clique index
2  Cliques = {}                        % start with an empty list of cliques
3  while (C is not empty) do {
4      Move the first vector in the list C to V:  V = C(0)
5      Remove V from the vector list C:  C = C - V
6      Compute the similarities S between V and all vectors in C:  S = similarity(V, C)
7      Find the vectors whose similarity to V exceeds the threshold and form the
       clique from them and V:  Clique(i) = {V} + {c in C where S(c) > Q}
8      Update the list C by removing the clusters in Clique(i):  C = C - Clique(i)
9      i = i + 1
   }

Figure 3.8: The different steps of the cliquing algorithm.

3.4 Experimental Results

In this section, our analysis of the IAT-based cliques obtained by applying the above approach to the Leurré.com data set is described. A data set covering three months of traffic (from March to May 2007) collected from the Leurré.com environment was considered. For the sake of conciseness, only those clusters which had at least one bin after the 21st bin (the 22nd bin corresponds to around five minutes) with a count of more than 10 were considered. This means that clusters which did not have more than 10 occurrences of at least one IAT value greater than five minutes were ignored. After this filtering, 1475 vectors were obtained,

representing the IAT frequency distributions of the corresponding clusters. The clique algorithm described above was then applied to these vectors, yielding 111 IAT-based cliques comprising 875 clusters. The remaining 600 clusters did not fall into any clique. Each clique contained a group of clusters which, based upon their IAT distributions (and the parameters of the cliquing algorithm), were similar.

Prior to the detailed analysis of the cliques obtained, three types of cliques that were expected to be represented in the results are presented:

Type I: Cliques which contain clusters of large sessions targeting the same port sequences. The difference between the various clusters contained within such a clique lies in the number of packets sent to the targeted ports. These cliques are mostly symptomatic of classes of attacks where the attacker repeatedly tries a given attack a varying number of times.

Type II: Cliques composed of clusters of large sessions targeting different port sequences but exhibiting the same IAT profile. These cliques are symptomatic of tools that send packets to their target according to a very specific timing and that have been used in several distinct campaigns targeting different ports.

Type III: Cliques which contain clusters grouped together based upon the presence of long IATs (longer than 25 hours), representing sources which are

observed on one platform, then, within 25 hours, detected on another platform, before again returning to the original platform. Such behavior would be indicative of a source scanning large numbers of devices across the Internet in a predictable manner, resulting in it repeatedly returning to the same platform.

We also found many similarities across the different cliques that were generated. A number of so-called supercliques were identified as a result, which suggests that the IAT-based analysis focused on in this study is good at automatically identifying very specific types of activity within a very large data set. Analysis of these supercliques is presented below.

Type I Cliques

Type I cliques are expected to contain clusters which are very similar with respect to most traffic features, including port sequence, the exception being that the large and tiny sessions within the clusters have varying durations (both in terms of time and in the number of packets sent by the source). The variation in the duration of the sessions accounts for such traffic being arranged in different clusters. Two particular cliques that fall clearly into the Type I category are Clique 7 and Clique 49, summarized in Table 3.3.

                                   Clique 7        Clique 49
Number of Clusters                 8               11
Number of Large Sessions           9               285
Number of Packets                  821             3274
Number of Platforms Targeted       5               37
Number of Source IPs               6               248
Number of Countries                4               46
Targeted Port Sequence             TCP/135         TCP/22
Peak IATs (bin)                    (32)            (49)
Min, Avg Durations (Seconds)       4657, 70491     1035, 9922
No. of Targeted Virtual Hosts      3               3

Table 3.3: A summary of Type I cliques.

Clique 7 is composed of 8 clusters, 9 large sessions, and a total of 821 packets. In this clique, 5 platforms were targeted by 6 distinct IP addresses originating from 4 different countries (China, Germany, Japan, and France). The peak IAT bin was bin 32, and the average

duration was 70491 seconds, with a minimum duration of 4657 seconds. All three virtual hosts on each of the targeted platforms were hit with the same number of packets, with the average number of packets per session equal to 35. Also, several IP addresses were found to occur in multiple clusters within the clique. While these sources were grouped in different clusters due to their varying durations, there were strong similarities in the IAT characteristics of the sessions, resulting in these clusters being grouped in the same clique.

Clique 49 contains 11 clusters, 285 large sessions, and 3274 packets, and the targeted port sequence is TCP/22. There were 248 distinct IP addresses, which attacked 37 different platforms. The sources of the IPs are widely spread among 46 different countries. Despite the widespread locations of the sources of the traffic in this clique, there were a number of similarities in the behavior observed. Firstly, large sessions in this clique always targeted all three virtual hosts on each platform, and the number of packets sent to each virtual host was similar in each case (one packet for the Windows hosts and an average of 10 packets for the UNIX host). The average duration of attacks was 9922 seconds, with a minimum duration of 1035 seconds. The IAT sequences of these clusters were similar, with all IATs in each session being short except one belonging to bin 49.

Cliques 7 and 49 are typical examples of Type I cliques, where attack traffic ends up in different clusters due to variations in either the duration of the attack or the number of packets sent. In each case, the duration and number of packets varied significantly between the sessions, while the IAT behavior remained consistent. Also, a number of IP addresses were shared between clusters within each clique, with over 50% of the clusters sharing IP addresses or class C networks. The identification of cliques of Type I addresses a weakness of the original clustering algorithm, which was, by design, unable to group together activities that clearly were related to each other and should, therefore, have been analyzed together.

Type II Cliques

Type II cliques are those which contain a large variety of targeted port sequences, yet where each cluster exhibits similar IAT characteristics. It was hypothesized that clusters belonging to this type of clique correspond to the same attack tool using

the same strategy to probe a variety of ports (such as a worm which targets multiple vulnerable services, or some other type of systematic scanner targeting a number of different ports). Two cliques which exhibit this type of behavior are Cliques 92 and 69 (see Table 3.4).

                                   Clique 92                         Clique 69
Number of Clusters                 40                                64
Number of Large Sessions           502                               1336
Number of Packets                  4234                              -
Number of Platforms Targeted       1                                 2
Number of Source IPs               502                               1300
Number of Countries                25                                37
Targeted Port Sequence             TCP {6769, 7690, 12293, 18462,    TCP {4662, 6769, 7690, 12293,
                                   29188, 64697, 64783}              29188, 38009, 64697, 64783}
Peak IATs (bin)                    (46) (48)                         (46)
Min, Avg Durations (Seconds)       953, 9278                         133, -
No. of Targeted Virtual Hosts      1                                 1

Table 3.4: A summary of Type II cliques.

Clique 92 consists of 40 clusters, 502 large sessions, and 4234 packets in total. While a variety of ports were targeted by these clusters, the traffic within each cluster targeted only a single port. The TCP ports targeted within this clique were: 6769, 7690, 12293, 18462, 29188, 64697, and 64783. This clique is the result of 502 distinct source IP addresses, originating from 25 different countries and targeting only a single platform. Additionally, only one virtual host was targeted on this platform. The average number of packets per large session was 16 (minimum 3 and maximum 103), and the average duration was 9278 seconds. Clique 92 has peak IAT bins of 46 and 48, where the IAT sequences were repeated patterns of short and long IATs. A possible explanation for the traffic which constitutes this clique is that it corresponds to the same tool being used to scan for the existence of services on unusual ports (such as peer-to-peer related services), where the scan used a regular (long) delay between retransmissions.

Clique 69 is similar to Clique 92 in that it also contains a variety of clusters,

where each cluster contains traffic targeting a single, unusual port. This clique contains 64 clusters, 1336 large sessions and packets. It was a result of 1300 distinct attacking IP addresses that originated from 37 different countries and targeted 2 platforms (all but one cluster targeted the same platform as that targeted by the traffic in Clique 92). The targeted TCP ports were: 4662, 6769, 7690, 12293, 29188, 38009, 64697, and 64783. The durations of attacks ranged from 133 to seconds, with an average of seconds. The number of packets sent in each large session was in the range 2 to 135, with an average of 25 packets. The IAT sequences were repeated patterns of short, short, and long IATs, with a peak IAT bin of 46.

The traffic in Cliques 92 and 69 represents a large number of distinct sources from a variety of countries targeting a variety of ports, predominantly (with one cluster being the exception) targeting the same platform in China. These cliques represent very interesting activity which is difficult to characterize in further detail due to the lack of interactivity of the honeypots on these ports. The significance of the ports being targeted was unclear, but might be easier to determine if packet payloads were available. The fact that all of these sources exhibited a very distinct fingerprint in terms of their IAT characteristics made the activity all the more unusual. The identification of cliques of Type II enabled the highlighting, in a systematic way, of the existence of tools with a specific IAT profile that were reused to launch different attack campaigns against various targets. Without such analysis, the link that existed between the IPs belonging to different clusters in a given clique would have remained hidden.

Type III Cliques

Based upon observation of the Leurré.com data over a long period of time, it was found that there were a number of large sessions which continued for an extended duration (sometimes many weeks). Of these, there were a number which targeted multiple platforms within a 25-hour period, where the intervening time before returning to the same platform was more than 25 hours. These very long IATs were placed into bin 152 during the cliquing process. A number of cliques that resulted from the cliquing algorithm were characterized by these long IATs, and here two of them are investigated in detail, Cliques 31 and 66 (see Table 3.5).

Clique 31 is a large clique of 150 clusters, 3456 large sessions, and a total of packets. The port sequence for Clique 31 is the single port UDP/1434

(MS SQL).

                                   Clique 31              Clique 66
Number of Clusters                 150                    3
Number of Large Sessions           3456                   13
Number of Packets                                         171
Number of Platforms Targeted       39                     12
Number of Source IPs               277                    9
Number of Countries                22                     2
Targeted Port Sequence             UDP 1434               UDP 1026, UDP 1027
Peak IATs (bin)                    >25 hours (152)        Very large (152)
Min, Average, Max Durations (s)    132,                   1,
No. of Targeted Virtual Hosts      varies                 3

Table 3.5: A summary of Type III cliques.

In Clique 31, there were 277 distinct IP addresses originating from 22 different countries which targeted 39 different platforms. Characteristics of clusters in this clique include a varying number of hosts targeted, with the average number of packets sent per host equal to 12 (minimum 2 and maximum 85) and an average duration equal to seconds. These sessions are indicative of a very slow scanner, seen on multiple platforms, returning to the same platform only after an extended delay of more than 25 hours.

Clique 66 contains 3 clusters, 13 large sessions and 171 packets. These sessions were characterized by sending multiple packets, alternating between UDP ports 1026 and 1027 repeatedly. In Clique 66, 12 platforms were targeted by 9 distinct IP addresses originating from 2 different countries. All clusters within this clique contained sessions which targeted all three virtual hosts on the target platforms, with only a small number of packets sent per session (on average 4, with a minimum of 3 and a maximum of 6). The average session duration was seconds.

Cliques 31 and 66 represent examples of activities where a source IP was scanning the globe, targeting different honeypot platforms in less than 25 hours. UDP port 1434 is used by the MS SQL Monitor service and is the target of several worms, such as W32.SQLExpWorm and Slammer. It is likely that traffic targeting this port is the result of worms that scan for vulnerable servers. UDP ports 1026 and 1027 are common targets for Windows Messenger spammers, who have been repeatedly targeting these ports since June 2003.
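As a minimal sketch of how such Type III behavior can be flagged (our own Python illustration, not the Leurré.com tooling; the session representation and function names are assumptions), a session can be shortlisted as a bin-152 candidate when any of its packet inter-arrival times exceeds 25 hours:

    # Sketch: flag sessions whose very long IATs would fall into bin 152
    # (gaps of more than 25 hours between consecutive packets).

    BIN_152_THRESHOLD = 25 * 3600  # seconds

    def inter_arrival_times(timestamps):
        """Return IATs (in seconds) from a list of packet epoch times."""
        ts = sorted(timestamps)
        return [b - a for a, b in zip(ts, ts[1:])]

    def is_type_iii_candidate(timestamps):
        """True if any gap between consecutive packets exceeds 25 hours."""
        return any(iat > BIN_152_THRESHOLD for iat in inter_arrival_times(timestamps))

Such a filter would only shortlist candidate sessions; actual clique membership still depends on the full IAT distribution of the cluster.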

Superclique   Peak Bins     Port Sequence
1             152           1434U
2             152           1026U, 1027U
3             46, 48, 49    135T
4             46, 48, 49    22T
5             46, 48        Unusual TCP ports
6             31, 32        135T

Table 3.6: Representative properties of supercliques.

Supercliques

It was observed that, across all of the obtained cliques, only a relatively small number of peak IAT bin values were represented. Indeed, from the point of view of the peak bin values, it was found that a limited number of combinations existed. This suggests that the cliques we obtained possessed a high level of uniformity in terms of the activities that they represent. Based upon the small set of common peak bins, and the dominant port sequences targeted within those cliques, the cliques were manually grouped together into 6 supercliques, which are summarized in Table 3.6. As can be seen from the table, the supercliques accounted for just over half of the cliques generated. The cliques not represented within the supercliques were not considered in the remaining analysis.

Representative examples of each of the first five supercliques have been presented in the previous three sections. The Type I Cliques 7 and 49 are examples of Supercliques 3 and 4, respectively. Superclique 6 contains Type I cliques which target port TCP/135, similar to Superclique 3, with the difference being that the dominant IATs for cliques from Superclique 6 are in bins 31 and 32, rather than 46, 48, and 49 (for Superclique 3). Cliques 92 and 69 (Type II) are examples of cliques from Superclique 5. The Type III Clique 31 is an example of a clique that belongs to Superclique 1, while Type III Clique 66 is an example of a clique from Superclique 2.
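As a rough illustration of this grouping step (our own sketch; the input encoding is hypothetical and not part of the thesis), cliques can be keyed by their set of peak IAT bins together with their dominant port sequence:

    from collections import defaultdict

    def group_supercliques(cliques):
        """Group cliques that share a peak-bin set and a dominant port sequence.

        cliques: iterable of (clique_id, peak_bins, port_sequence) tuples,
        e.g. (49, (46, 48, 49), "22T") -- a hypothetical encoding of Table 3.6.
        """
        groups = defaultdict(list)
        for clique_id, peak_bins, port_sequence in cliques:
            key = (tuple(sorted(peak_bins)), port_sequence)
            groups[key].append(clique_id)
        return groups

Each resulting key then corresponds to one candidate superclique, which would still be reviewed manually, as was done in this chapter.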

3.5 Summary

Due to the low-interaction nature of the honeypots used by the Leurré.com project, attempts to cluster the low-interaction honeypot traffic at the packet level result in a huge number of clusters. Consequently, it becomes very difficult to interpret these clusters or to reach accurate conclusions about the exact nature of the tools that generate them.

In this chapter, we have presented a methodology that overcomes the weaknesses in Leurré.com's clustering algorithm. The use of packet inter-arrival time distributions has generated a number of cliques that represent sets of clusters and a variety of interesting activities which target the Leurré.com environments. It was shown that more than half of the cliques can be easily characterized as one of the three major types. In accordance with the supercliques that were manually identified, there are six major classes of activity that the cliquing algorithm extracted for the time period that was examined. The strong similarities within the supercliques highlight the usefulness of the cliquing algorithm for identifying very particular kinds of traffic observed by the honeypots.

While the proposed method was effective in improving the existing clustering approach, there are several limitations to the work described in this chapter. The first limitation concerns the manual extraction of the IAT distributions of clusters and the manual grouping and interpretation of the results. Secondly, the methodology requires moderate to intensive computational power. Consequently, it is necessary to address the need for a better traffic analysis technique that suits the sparse nature of low-interaction honeypot traffic, is capable of extracting useful attack patterns automatically, and is suitable for real-time application.

In the next chapter, a new methodology will be presented which bypasses the traffic clustering imposed by the Leurré.com project. The new methodology overcomes the previous method's limitations and is capable of working on a massive data set. The method is based on a well-established data reduction technique and works on aggregated traffic flows rather than at the packet level.

Chapter 4

Honeypot Traffic Structure

Running a honeypot that is connected to the Internet results in collecting a massive amount of malicious Internet traffic. The collected traffic is very representative of global Internet artefacts, such as different types of attacks, traces of misconfigured traffic, and backscatter. The first step towards better understanding the nature of this traffic is detecting the different classes of activity and measuring their prevalence. However, the analysis of honeypot traffic comes with several challenges, which include: the high dimensionality of the data, resulting in a large number of features; the large amounts of collected traffic, resulting in high storage and computational requirements; and Internet noise, such as scans, which obfuscates useful attack patterns.

In the previous chapter, the Leurré.com methodology of manipulating honeypot traffic was explored, and it was demonstrated that their clustering algorithm has resulted in a large number of clusters, over 27,000. Furthermore, a new approach was proposed for improving cluster interpretation through grouping clusters that share similar IAT distributions. In this chapter, the study seeks to explore questions related to deploying a honeypot from a strategic viewpoint, which are:

What types of activities can be detected with a low-interaction honeypot?

What are the relative frequencies of the detected activities?

What are the interrelationships between these activities?

To answer these questions, we embarked upon an analysis of the use of a multivariate statistical technique, principal component analysis (PCA), for characterizing

attackers' activities present in honeypot traffic data in terms of structure and size. The use of PCA in this study is motivated by: the popularity of PCA as one of the best exploratory and data reduction techniques [54]; the facts that the extracted principal components are uncorrelated and that the first few principal components retain most of the variation in the original data; the ease of implementation and the low computational requirements; and the lack of any distributional assumptions, which makes PCA suitable for many types of data.

The main contribution of this chapter is the application of principal component analysis (PCA) in three areas: in detecting the structure of attackers' activities in low-interaction honeypot traffic; in visualizing these activities; and in identifying different types of outliers. The following chapters, Chapters 5 and 6, build on this work for further applications of PCA to analyze honeypot traffic. The research findings were presented in:

S. Almotairi, A. Clark, G. Mohay, and J. Zimmermann, "Characterization of Attackers' Activities in Honeypot Traffic Using Principal Component Analysis", in Proceedings of the International Conference on Network and Parallel Computing (IFIP), Shanghai, China, 2008.

This chapter is organized as follows. Section 4.1 provides an introduction and discusses the motivation behind the work. Section 4.2 introduces the concept of principal component analysis. Section 4.3 describes the data set used and the pre-processing that has been applied to the traffic data. Principal component analysis on the honeypot data set is described in Section 4.4. Interpretations of the major principal components are presented in Section 4.5. The interrelations between components are presented in Section 4.6. The identification of extreme activities is discussed in Section 4.7. Finally, Section 4.8 summarizes the chapter.

4.1 Motivation

Monitoring and characterizing Internet threats is critical in order to better protect production systems by gaining an understanding of how attacks work, and

consequently protecting systems from them. Honeypots are a valuable tool for collecting different types of attack traffic. However, characterizing attackers' activities present in honeypot traffic data can be challenging due to the high dimensionality of the data (or large number of variables) and the large volumes of traffic data collected. The large amount of background noise, such as scans and backscatter, adds to the challenge by hiding interesting abnormal activities that require immediate attention from security personnel. Detecting these activities can potentially be of high value and give early signs of new vulnerabilities or breakouts of new automated malicious codes, such as worms, but only if the honeypot data is handled in time.

Principal component analysis (PCA) is a widely used multivariate statistical technique for reducing the dimensionality of variables, unveiling latent structures and detecting outliers in data sets [74, 75]. In this research, principal component analysis (PCA) is used to detect the structure of attackers' activities in honeypot traffic, to visualize these activities, and to identify different types of outliers.

4.2 Principal Component Analysis

Principal component analysis (PCA) is a multivariate statistical technique that has been widely used in multi-disciplinary research areas such as Internet traffic analysis, economics, image processing, and genetics, to name only a few. PCA is mainly used to reduce the dimensionality of a data set into a few uncorrelated variables, the principal components (PCs), which retain most of the variation in the original data. The resulting principal components are a linear combination of the original variables, are orthogonal, and are ordered with the first principal component having the largest variance. Although the number of resulting principal components is equal to the number of original variables, much of the variance in the original set of p variables can be retained by the first k PCs, where k < p. Thus, the original p variables can be replaced by the new k principal components.

Let X = (X_1, ..., X_p)^T be a matrix of p-dimensional data variables, where C is the covariance or correlation matrix of X. We seek to find a lower-dimensional matrix A = (A_1, ..., A_k) of C that solves the following equation:

A^{-1} C A = L    (4.1)

where A is the matrix of eigenvectors of C and is orthogonal (A^{-1} A = I), and

L is the diagonal matrix of eigenvalues of C, all of which are greater than or equal to zero. The principal component transformation then becomes:

Z = A^T X    (4.2)

The first linear combination Z_1 of X, having maximum variance, is:

Z_1 = A_1^T X = a_{11} X_1 + a_{12} X_2 + a_{13} X_3 + \cdots + a_{1p} X_p    (4.3)

The second linear combination of X, Z_2, is uncorrelated with Z_1 and has the second largest variance, and so on until the k-th function, Z_k, is found which is uncorrelated with Z_1, ..., Z_{k-1}:

Z_2 = A_2^T X = a_{21} X_1 + a_{22} X_2 + a_{23} X_3 + \cdots + a_{2p} X_p
\vdots
Z_k = A_k^T X = a_{k1} X_1 + a_{k2} X_2 + a_{k3} X_3 + \cdots + a_{kp} X_p    (4.4)

Geometrically, principal component analysis represents a shift of the origin of the original coordinates (X_1, ..., X_p) to the centroid (\bar{X}_1, ..., \bar{X}_p), and then a rotation of these original coordinate axes into new axes (Z_1, ..., Z_p) that are orthogonal and which represent the directions of maximum variability. Figure 4.1 illustrates a linear transformation of two random vectors X_1 and X_2 into the directions of maximum variance, Z_1 and Z_2.

Figure 4.1: Directions of maximal variance of principal components (Z_1, Z_2).
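As a minimal illustration of this transformation (our own Python sketch under the definitions above, not the thesis implementation), the principal components and their scores can be obtained from the eigendecomposition of the correlation matrix of standardized data:

    import numpy as np

    def principal_components(X):
        """PCA on the correlation matrix of X (rows = observations, columns = variables).

        Returns the eigenvalues in decreasing order, the matching eigenvectors
        (one per column), and the principal component scores Z.
        """
        # Standardize each variable to zero mean and unit variance.
        Y = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
        R = np.corrcoef(X, rowvar=False)        # sample correlation matrix
        eigvals, eigvecs = np.linalg.eigh(R)    # eigh: R is symmetric
        order = np.argsort(eigvals)[::-1]       # reorder by decreasing variance
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        Z = Y @ eigvecs                         # principal component scores
        return eigvals, eigvecs, Z

Retaining only the first k columns of the score matrix then gives the reduced representation discussed above.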

In addition, the principal components define the axes of a p-dimensional ellipsoid that is centered at the mean and has its semi-major and semi-minor axis lengths equal to half the square root of the eigenvalues (l_1^{1/2}, l_2^{1/2}, ..., l_p^{1/2}). The equation of the ellipsoid becomes:

\sum_{k=1}^{p} \frac{Z_{ik}^2}{l_k} = (X_i - \bar{X})^T S^{-1} (X_i - \bar{X})    (4.5)

where Z_{ik} is the score of the k-th PC of the i-th observation and l_k is the k-th eigenvalue. It will be seen later in this chapter how the previous equations can be utilized in setting up the threshold for detecting outliers.

In the following sections, PCA will be utilized in detecting different groups of activities found in the honeypot traffic, without any assumptions about these groups or the interrelationships between them. In addition, some of our objectives in this study are to explore the usefulness of PCA in visualizing honeypot traffic and in detecting outliers. While the application of PCA to network traffic analysis is not new, to the best of our knowledge this is the first time that PCA has been used to analyze low-interaction honeypot traffic. As will be demonstrated, the technique shows much promise in extracting value from this sparse data and paves the way for further applications of the technique in the following chapters.

4.3 Data set and Pre-Processing

In this section, the data set used in this study and the pre-processing that has been applied to the data are described. The traffic features used in the analysis and the steps for applying PCA to these traffic features are also discussed in the following subsections.

Data set

The honeypot traffic data used in this analysis came from the Leurré.com project [9]. For the purpose of this study, only one low-interaction honeypot sensor's data was used, due to the availability of log files. Traffic data for the period of September 15 until November 30, 2007, for two of the honeypot environments, was included, namely Windows 2000 Professional and Windows 2000 Server. Both environments are identical in terms of open TCP and UDP ports. The traffic traces consisted of packets in total, which were the result of attacks from over 5400 different

IP addresses (see Table 4.1). This data set was the most recent honeypot traffic data set available from the Leurré.com project when this study was conducted, and it represented the current patterns of attackers' behaviors.

Table 4.1: Summary of the data set used in this study (15/09/2007 to 30/11/2007).

Pre-processing

Before applying PCA to the traffic data, the following steps were performed to process the raw traffic data. First, raw TCPDump [24] files of daily honeypot data were collected and merged into a single traffic file. Packets were then grouped together into basic flows (according to the notion of a flow [46]). A basic flow conformed to the standard definition of an IP flow as packets that share five keys: source IP address, destination IP address, source port, destination port, and protocol type. If a packet differed from another packet in any key field, it was considered to belong to another flow [2, 132]. Other features associated with flows were also extracted to enrich the analysis. These features include the number of packets, number of bytes, total activities, and durations. For the purpose of this study, the timeout of basic flows was set to a maximum of five minutes. The five-minute timeout parameter was selected based on our experiments and on the nature of low-interaction honeypots, where the majority of flows were less than 300 seconds. A higher timeout value has little influence on the final results.

The second step was to group the basic flows again into activity flows, where the newly generated flows were combined based upon the source IP address of the attacker, with a maximum of sixty minutes inter-arrival time between basic flows. The aggregation of basic flows into activity flows was necessary to overcome the low level of detail in low-interaction honeypot collected traffic, to be representative of the behavior of the three monitored protocols (TCP, UDP and ICMP), and to account for a variety of network anomalies. Finally, the data was filtered to remove Internet backscatter by examining each flow individually against common backscatter flags, such as TCP RST and TCP SYN/ACK [100]. The Leurré.com project platform is less effective in collecting backscatter because of the limited view of its platform to the Internet; the number

of monitored IP addresses is only three in total.

Candidate Feature Selection

Eighteen features were extracted from the activity flows. These traffic features were selected as being representative of the behavior of the three protocols that are monitored by the honeypot, namely TCP, UDP and ICMP. Table 4.2 lists the selected variables and their descriptions. Traffic features were selected to account for a variety of network anomalies [33]. Since these traffic features were extracted from aggregated traffic flows, they were very efficient in detecting different types of attacks using single or multiple connections, as well as attacks spanning different protocols. For example, the total number of flows, source packets, and source bytes allows the detection of anomalies in traffic volume; total activities and IAT help in detecting denial of service attacks [72] and mis-configurations [151]; and TCP and UDP ports allow the characterization of a wide range of network anomalies such as scans and worms. Finally, since no work is performed directly on these traffic features, principal component analysis is relied upon to remove redundancies and correlations among these variables.

4.4 PCA on the Honeypot Data set

Principal component analysis can be calculated using either the covariance matrix or the correlation matrix. However, PCs defined using the covariance matrix are very sensitive to the units of measurement of the variables. In addition, when the variances of the variables differ widely, which is the case for the honeypot data, the first few PCs will be dominated by the variables with high variances, even when these contribute little information to the structure of the data set. Moreover, one drawback of PCs computed on covariance matrices, with different units of measurement, is the difficulty in interpreting the PC scores. Thus, the use of the correlation matrix rather than the covariance matrix for deriving the PCs was preferred in this analysis.

Calculating the PCA based on the correlation matrix involves the following steps:

1. Arrange the n extracted traffic vectors of p features into a data matrix X, where each observation is represented by a single column of p rows.

No.  Variable       Description
1    TF             Total number of basic flows generated by individual IPs, aggregated over sixty minutes
2    TCP_O          Total number of open TCP ports targeted
3    D_TCP_O        Total number of distinct open TCP ports targeted
4    TCP_C          Total number of closed TCP ports targeted
5    D_TCP_C        Total number of distinct closed TCP ports targeted
6    UDP_O          Total number of open UDP ports targeted
7    D_UDP_O        Total number of distinct open UDP ports targeted
8    UDP_C          Total number of closed UDP ports targeted
9    D_UDP_C        Total number of distinct closed UDP ports targeted
10   ICMP           Total number of ICMP flows
11   TM             Total number of machines targeted
12   DUR            Total duration of basic flows
13   SPKTS          Total number of source packets
14   SBYTES         Total number of source bytes
15   SRATE          Total of the source rates of basic flows, where a source rate is the number of source packets in a basic flow divided by the duration of that flow
16   AVG_PK_SIZE    Sum of average packet sizes
17   T_ACT          Total activities, the summation of source and destination rates
18   IAT            Total inter-arrival times between basic flows

Table 4.2: Variables used in the analysis.

Then the traffic matrix X becomes:

X_{(p \times n)} = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix} = \begin{pmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,n} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,n} \\ \vdots & & & \vdots \\ X_{p,1} & X_{p,2} & \cdots & X_{p,n} \end{pmatrix}    (4.6)

2. Standardize the p-dimensional matrix X = (X_1, ..., X_p)^T to have zero mean and unit variance, by:

Y_{ij} = \frac{X_{ij} - \bar{X}_i}{\hat{\sigma}_i}    (4.7)

where \bar{X}_i is the sample mean of X_i, for i = 1, ..., p, and \hat{\sigma}_i is the sample standard deviation of X_i.

3. Compute the sample correlation matrix R_{p \times p} of Y_{n \times p}:

R_{(p \times p)} = \begin{pmatrix} 1 & r_{1,2} & \cdots & r_{1,p} \\ r_{2,1} & 1 & \cdots & r_{2,p} \\ \vdots & & & \vdots \\ r_{p,1} & r_{p,2} & \cdots & 1 \end{pmatrix}    (4.8)

4. Find the matrix of eigenvectors A = (A_1, ..., A_p) and the eigenvalue vector L = (l_1, ..., l_p) of R:

A = \begin{pmatrix} A_1 & A_2 & \cdots & A_p \end{pmatrix} = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,p} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,p} \\ \vdots & & & \vdots \\ a_{p,1} & a_{p,2} & \cdots & a_{p,p} \end{pmatrix}    (4.9)

L = \begin{pmatrix} l_1 & l_2 & \cdots & l_p \end{pmatrix}    (4.10)

The eigenvectors are arranged in a matrix where each column represents an eigenvector, and they are reordered to match their corresponding eigenvalues, which are in decreasing order, l_1 > l_2 > \cdots > l_p > 0.

5. Select a subset E = (A_1, ..., A_k) of A, according to criteria that will be discussed later in this section, where k < p:

E = \begin{pmatrix} E_1 & E_2 & \cdots & E_k \end{pmatrix} = \begin{pmatrix} e_{1,1} & e_{1,2} & \cdots & e_{1,k} \\ e_{2,1} & e_{2,2} & \cdots & e_{2,k} \\ \vdots & & & \vdots \\ e_{p,1} & e_{p,2} & \cdots & e_{p,k} \end{pmatrix}    (4.11)

6. Calculate the principal component scores by projecting the standardized data Y onto the reduced matrix E:

Z = E^T Y    (4.12)

The first principal component then equals:

Z_1 = E_1^T Y = e_{11} Y_1 + e_{12} Y_2 + \cdots + e_{1p} Y_p    (4.13)

and the k-th principal component becomes:

Z_k = E_k^T Y = e_{k1} Y_1 + e_{k2} Y_2 + \cdots + e_{kp} Y_p    (4.14)

Number of Principal Components to Retain

In PCA, the components are in decreasing order, where the most important component, listed first, has the highest variance. Consequently, only the first few PCs are retained, as they explain most of the variance in the data. Several methods exist for deciding how many PCs to retain. Kaiser's rule [120] for eliminating PCs with an eigenvalue less than one suggests retaining the first six components (see Table 4.3 for the eigenvalues). The cumulative total variance of these components is 80% of the total variance of the original data, which means that the majority of the variance in the data has been accounted for in the extracted components.

Table 4.3: The extracted principal components and their variances (eigenvalues, % of variance, and cumulative %).

The Scree plot of the energy contributed by each component is shown in Figure 4.2. This plot suggests that six components can be retained, as a sharp drop occurs between the sixth and seventh component, where the eigenvalues are

greater than or equal to 1. The sharp drop in the curve indicates a typical cut-off for selecting the correct number of components to be considered in the analysis.

Figure 4.2: Scree plot of eigenvalues.

However, the extracted communalities of the variables, i.e. the amount of variance within each variable accounted for by the components, indicate that one of the variables, namely the total number of distinct open TCP ports targeted, has a low extraction value (see Table 4.4). This suggests the inclusion of more components. After the inclusion of the seventh component, all the communalities are high, which indicates that the extracted components represent the variables well. All of the above supports the decision to retain seven components, with the rest of the components being eliminated. Table 4.3 shows the accumulated percentages of the total variances of the 18 extracted components. The first seven components contribute over 85% of the total variance in the original data, which suggests that the extracted components are very representative of the data.

4.5 Interpretation of the Results

As mentioned earlier, components are listed in decreasing order of importance, where components with larger variances are more important and give more information about the data. The components were rotated to simplify the analysis and make the interpretation easier.

Table 4.4: The extracted communalities of the variables under extraction 1 (six components) and extraction 2 (seven components).

Interpretation of the components is achieved by examining the loadings of the variables for each component, as variables with high loadings are of high significance in the interpretation. Each PC's interpretation was then validated by inspecting sample traffic against the original data. For this study, variables with a loading value over 0.6 were selected, as they are the most significant in the analysis. Table 4.5 shows the Varimax rotation of the principal components. The goal of the rotation is to ease interpretation by increasing the contrasts between the variables' loadings.

Interpretations of the first seven PCs (PC1-PC7) for the honeypot data are given in Table 4.6. The first component (PC1) is highly correlated with the total number of basic flows, total number of TCP ports targeted, total duration of basic flows, total number of source packets, and total number of source bytes. The first component indicates high interactions between attackers and the honeypot on open ports and, as the variance suggests, is the most important component. PC2 is highly correlated with closed TCP ports. This component suggests vertical and horizontal scan activities which focus on very specific ports. In PC3, activities target closed UDP ports and could be interpreted as spam, worm activities, or

mis-configured servers. PC4 is related to repeated activities over a short period of time; this is explained by the high correlations between total activities and the first PC's variables, such as SPKTS, SBYTES, DUR, TCP_O, and TF. PC5 represents the total machines targeted and ICMP traffic. It can be inferred that these activities are from IPs sweeping the globe seeking live machines. PC6 represents activities that target open UDP ports. PC7 is a subset of the first component and represents short attacks against specific open ports, mainly ports 80, 139, and 445.

Table 4.5: The Varimax rotation of the principal components (loadings of the 18 variables on PC1-PC7).

The principal component analysis of the data shows that there are at least seven clusters of activities represented in the data. These clusters of activities can be separated, and then PCA can be applied further to find new sub-clusters of activities and the process repeated.

4.6 Interrelations Between Components

Plots of PCs can serve two main purposes: to define the interrelations between components and to identify outliers. As discussed in Section 4.5 (interpretation of

the results), the two components PC2 and PC5 represent two types of activities: TCP scanning and live machine detection, respectively. The interrelationships between these two components are presented in Figure 4.3.

Principal Component   Interpretation
PC1                   Targeted attacks against open ports
PC2                   Scan activities
PC3                   Spam or mis-configuration
PC4                   Repeated short activities
PC5                   Detection activities
PC6                   Targeted attacks against open UDP ports
PC7                   Short attacks

Table 4.6: Interpretations of the first seven components and their percentages of variation.

Figure 4.3: The scatter plot of TCP scan (PC2) vs live machine detection (PC5).

The figure shows that there are at least two clusters of activities: detection with very few scans, on the upper left side of the figure along the PC5 axis; and scans with very few machine detection activities, at the bottom of the figure along the PC2 axis. Mixed activities, with a moderate rate of scans and moderate live machine detection, are located in the middle part of the figure. Extreme scanning and live machine detection activities are also visible as distant points along both PC axes.
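Such plots are straightforward to produce from the score matrix. A minimal sketch (ours, using matplotlib and the Z score matrix from the earlier PCA sketch; the column indices and axis labels are assumptions):

    import matplotlib.pyplot as plt

    def plot_pc_pair(Z, i=1, j=4):
        """Scatter one PC score vector against another (0-based columns,
        so columns 1 and 4 correspond to PC2 and PC5)."""
        plt.scatter(Z[:, i], Z[:, j], s=8)
        plt.xlabel("PC%d (TCP scan)" % (i + 1))
        plt.ylabel("PC%d (live machine detection)" % (j + 1))
        plt.show()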

An example of scan-only activities is observation 4253, which originated in Germany. The IP scanned all machines for closed port 2967 and then, two weeks later, scanned a second closed port. Observation 304 is an example of the second type, a live machine detector. The IP originated in Japan and was only involved in detection activities.

4.7 Identification of Extreme Activities

Outliers, in statistics, can be defined as observations that deviate significantly from the rest of the data [35, 135]. In honeypot traffic, outliers are extreme activities that are distant from the p-dimensional hyperspace defined by the variables. Detecting extreme activities in honeypot traffic is analogous to outlier detection in multivariate statistics. In this study, we are concerned with two types of extreme activities (outliers) in low-interaction honeypot traffic: model and structure extreme activities. The first type (type I), model extreme activities, represents activities that have high values across some or all of the variables. In contrast, the second type (type II), structure extreme activities, represents traffic activities that violate the structure of the data represented by the main principal components. The aim of detecting these extreme activities is to help in searching for the root causes of variations in the patterns of the defined structures, and to take measures to protect production networks against them. Extreme activities in honeypot traffic might arise from the introduction of new malicious network activities or from intensified existing activities, such as releases of new automated malicious codes (worms), the discovery of new vulnerabilities, or even mis-configured servers.

One of the challenges in detecting outliers in high-dimensional data, such as honeypot traffic, is the difficulty of inspecting large numbers of variables in the data set simultaneously. In addition, inspecting each variable by itself, or even inspecting plots of pairs of variables, might not reveal any extreme behavior when it is the combination of multiple variables that makes an observation an outlier. This study provides a preliminary investigation of utilizing principal component analysis in detecting extreme observations, through graphical inspection of plots of the first few and last few principal components, and through the statistic of the sum of squares of the weighted principal component scores against the squared Mahalanobis distance.

Inspecting two and three dimensional scatter plots of the first few and last few

PCs for detecting outlying observations was suggested by Gnanadesikan [65]. This is justified since the first few PCs are good at detecting outliers that inflate the correlations (model outliers), while the last few PCs are useful in detecting outliers that add unimportant dimensions to the data and which are hard to distinguish from the original variables (structure outliers).

Figure 4.4: The scatter plot of the first two principal components.

The scatter plot of the first two principal components is illustrated in Figure 4.4. These two components, PC1 and PC2, account for 43% of the total variance in the data. The first component has high loading values on multiple variables: total number of basic flows, total number of open TCP ports targeted, total duration of basic flows, total number of source packets, and total number of source bytes. The second component has high loadings on two variables: the total number of closed TCP ports and the number of distinct closed TCP ports targeted. Outlying observations (circled on the plot) can be spotted in Figure 4.4 as points with extreme values along the principal component axes, near the edges and far from the body of the data. Observations 4124, 4900, 3929, 4892, 4131, 3720, 426, 4890 and 428 are extreme on the first principal component (PC1), while observations 426 and 428 appear extreme on the second principal component (PC2). Specific outliers are discussed below.

The scatter plot of the last two principal components, PC17 and PC18, which account for less than 1% of the total variance, is illustrated in Figure 4.5. There

are two observations, 4131 and 1193, that appear extreme for PC17 and PC18, near the edges of the graph.

Figure 4.5: The scatter plot of the last two components.

Although scatter plots of principal components are very useful for spotting outlying observations visually, automatic detection of outlying observations can be achieved through the construction of a control ellipse. As the contours of constant probability for a p-dimensional normal distribution are ellipsoids [74], the ellipsoid defined by the random vectors X has the following characteristics: the constant probability contour for the distribution of X is defined by

\sum_{k=1}^{p} \frac{Z_{ik}^2}{l_k} = \text{const}    (4.15)

where Z_{ik} is the score of the k-th PC of the i-th observation and l_k is the k-th eigenvalue. The ellipsoid is centered at the mean, and its axes lie along the principal components, where half the square root of the eigenvalues (l_1^{1/2}, l_2^{1/2}, ..., l_p^{1/2}) are the lengths of its semi-major and semi-minor axes. The ellipsoid of the p-dimensional space of x values satisfies

\sum_{k=1}^{p} \frac{Z_{ik}^2}{l_k} \le \chi_p^2(\alpha)    (4.16)

where \chi_p^2(\alpha) is the percentile of a chi-square distribution with p degrees of freedom. Setting a threshold for detecting outlying observations based on \chi_p^2(\alpha) requires the distribution of X to be multivariate normal. However, since we do not make any assumptions about the distribution of our data, the population ellipsoid in Equation 4.15 remains valid without any normality assumption, although the ellipsoid loses its interpretation as a contour of constant probability [74], and the threshold can be computed from the empirical distribution of the principal components. Figure 4.6 provides a zoom into Figure 4.4, omitting the very clear outliers, together with a sketch of the control ellipse for the first two principal components.

Figure 4.6: The ellipse of a constant distance.

Jolliffe [74] discussed the use of the sum of the squares of the weighted principal component scores of the last few principal components in detecting outliers that are hard to distinguish from the original variables, which is given by:

D_i = \sum_{k=p-q+1}^{p} \frac{Z_{ik}^2}{l_k}

where q < p, Z_{ik} is the score of the k-th PC of the i-th observation, and l_k is the k-th eigenvalue. When q = p, the equation represents the squared Mahalanobis distance of the i-th observation from the mean of the data, which is given by:

M_i^2(x_i) = (X_i - \bar{X})^T S^{-1} (X_i - \bar{X})    (4.17)

Figure 4.7: The scatter plot of the statistics D_i (for q = 7) vs. (M_i^2 - D_i).

Figure 4.7 provides a scatter plot of the statistics D_i vs. (M_i^2 - D_i) for detecting outliers that differ in the first p - q components [53]. Finally, most of the detected outlying observations were identified by more than one statistic, but with different orderings. Table 4.7 lists the top five outliers, ordered according to their significance (from high to low).

Table 4.7: The top five extreme observations under each statistic (PC1, PC2, PC17, PC18, M^2, D_i, and M_i^2 - D_i).
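A minimal sketch of how these statistics could be computed (ours, not the thesis code; it assumes the standardized data and the sorted eigendecomposition from the earlier PCA sketch):

    import numpy as np

    def outlier_statistics(Y, eigvals, eigvecs, q):
        """Compute D_i (last q weighted PC scores) and the squared Mahalanobis
        distance M_i^2 (the q = p case) for every observation.

        Y: standardized observations (rows); eigvals/eigvecs: eigendecomposition
        of the correlation matrix, sorted in decreasing order of eigenvalue.
        """
        Z = Y @ eigvecs                    # principal component scores
        weighted = Z**2 / eigvals          # Z_ik^2 / l_k for every component k
        D = weighted[:, -q:].sum(axis=1)   # sum over the last q components
        M2 = weighted.sum(axis=1)          # q = p gives the squared Mahalanobis distance
        return D, M2

Observations with large D or M2 values are then the candidates of the kind listed in Table 4.7.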

A Discussion of the Detected Outliers

To judge the significance of the detected outliers, sample points of the outlying observations were manually inspected against the original data set to explain why these points were selected as outliers.

Observation 4124, which is extreme in PC1 (Figure 4.4), was the result of an attack from an IP in the USA targeting one machine on a single open TCP port, port 80. The attack started on Wednesday, 21 November 2007 at 06:18:44 GMT and ended on Friday, 23 November 2007 at 08:01:08 GMT. The attack generated over 150,062 packets. Observation 4124 was also extreme on M^2, (M_i^2 - D_i), and PC17. While it is very hard to reach a definitive conclusion about its exact nature, due to the low level of detail in low-interaction honeypot traffic, this observation resembles denial of service attacks and falls under the first type, extreme existing activities.

Observation 2228 is extreme on both the PC2 and (M_i^2 - D_i) statistics. The attack came from an IP originating in China and lasted for less than 10 seconds. It was a combination of ICMPs and moderate scans of seven unusual closed TCP ports and one open TCP port, port 80. The attacker targeted all machines in the honeypot environment. This observation represents an intensive version of existing scanning activities.

Observation 1193 is extreme on both PC17 and PC18. The activity came from an IP address in Thailand and lasted for 40 minutes. It mainly consisted of alternating connections to two open TCP ports (445 and 139) and one closed TCP port. This observation has a large value on the TCP_O variable, moderate values on the TF, DUR, and SPKTS variables, and low values on the TCP_C and ICMP variables.

Observation 614 is an outlier on the M^2 and D_i statistics. It was caused by an attack from an IP address that originated in Romania and lasted for 40 minutes. It was mainly connections to two open TCP ports (445 and 139) and one closed TCP port (9988). This observation shows behavior similar to observation 1193, with the same duration, but from a different IP in a different country a week later. Observations 1193 and 614 represent structure extreme activities (type II). The pattern of these

activities is the same as that of a class of worms that targets the Microsoft LSASS vulnerability.

Observation 1980 generated a large amount of UDP traffic (two packets every 30 minutes) against port 137. The attack took place between Thursday, 18 Oct 2007 and Wednesday, 24 Oct 2007, and has large UDP_O and IAT values. This observation is at the top of the outlier lists for both M^2 and (M_i^2 - D_i). Observation 1980 was most likely caused by worms or other malicious activities that scan for and exploit the NetBIOS Name Service. Observation 1980 represents a type II, or structure extreme, activity.

The main source of difference between the two statistics M^2 and (M_i^2 - D_i) was the value of q in the D_i statistic. More experiments are needed to select an appropriate value for the current data set. Moreover, setting a higher value for the activity flow timeout, currently 60 minutes, would improve the detection of attacks that propagate slowly over an extended period, such as observations 1980 and 4124.

4.8 Summary

In this chapter, the use of principal component analysis (PCA) on the traffic flows of low-interaction honeypots has been proposed. PCA proved to be a very powerful tool for detecting the structure of attackers' activities and for decomposing the traffic into dominant clusters. Moreover, scatter plots of the PCs were very effective for examining the interrelationships between components, or groups of activities, and for identifying outliers. Experimental results on real traffic data showed that principal component analysis provides very simple and efficient visual summaries of honeypot traffic and attackers' activities. In the next chapters, it will be shown how PCA can be used to detect new attacks that have not previously been seen by the honeypot.


Chapter 5

Detecting New Attacks

The previous chapter proposed the use of principal component analysis (PCA) for characterizing honeypot traffic. The strength of principal component analysis has been shown in unveiling honeypot traffic structure, visualizing attackers' activities, and detecting outliers. The analysis is now extended further to benefit from PCA's strength in detecting different types of outliers. This chapter presents a technique for detecting new attacks in low-interaction honeypot traffic through the use of the principal components' residual space. The main contribution of this chapter is the detection of new attacks using the residuals of principal component analysis (PCA) and the square prediction error (SPE) statistic. The research work described in this chapter has led to the publication of the following paper:

S. Almotairi, A. Clark, G. Mohay, and J. Zimmermann, "A Technique for Detecting New Attacks in Low-Interaction Honeypot Traffic", in the Proceedings of the Fourth International Conference on Internet Monitoring and Protection, Venice, Italy: IEEE Computer Society.

The chapter is structured as follows. Section 5.1 provides a brief introduction to the methodology of detecting attacks in honeypot traffic. Section 5.2 introduces the principal components' residual space and the square prediction error (SPE) statistic. The data set used in this study, the pre-processing, and the detection model architecture are described in Section 5.3. A practical step-by-step illustrative example is demonstrated in Section 5.4. The results and the evaluation

of the detection technique are discussed in Section 5.5. Finally, the chapter is summarized in Section 5.6.

5.1 Introduction

The method presented in this chapter for detecting attacks draws its roots from anomaly-based intrusion detection: a model of the honeypot's traffic profile is built, and the capabilities of a multivariate statistical technique, namely principal component analysis, are used to detect different types of outliers. New observations are projected onto the residual space of the least significant principal components, and their distances from the main PCA hyperspace, defined by the first k principal components, are measured using the square prediction error (SPE) statistic. A high SPE value indicates that the new observation represents a new direction that has not been captured by the PCA model of attacks seen in the historical honeypot traffic.

A number of researchers have used principal component analysis (PCA) to identify attacks [88, 25, 128, 42, 86]. However, previous applications of PCA treat the network traffic as a composition of normal and anomalous activities, and the detection model is then built from the normal part. The notion of normal and anomalous does not apply to honeypot traffic, where all traffic is potentially malicious. Thus, our technique differs from those techniques in using PCA to build a model of existing attacks that have been seen in the past, and then using the residual space to detect any large deviation from the attack model as either a new attack vector or an attack that is not present in the model's historical data.

5.2 Principal Component's Residual Space

Two different types of space are defined by principal component analysis: the main PC space, defined by the first k PCs, which captures most of the variation in the original data (most applications of PCA, such as data reduction, are based on utilizing this space); and the residual space, defined by the last (p - k) principal components with the smallest eigenvalues, which represents the insignificant variations.

Principal component analysis can be expressed as:

Z = A^T X    (5.1)

where Z contains the principal component scores of projecting the observations in X onto the eigenvector matrix A. The previous equation can be represented in the original coordinates by projecting Z back onto A; then X becomes:

X = \sum_{i=1}^{k} A_i Z_i + \sum_{j=k+1}^{p} A_j Z_j    (5.2)

X = \sum_{i=1}^{k} A_i Z_i + E = \hat{X} + E    (5.3)

and the residual matrix E represents the difference between X and \hat{X}:

E = X - \hat{X}    (5.4)

Outliers detected by principal component analysis are divided into two categories, based on the principal component space [75]: general outliers (type I in the previous chapter), which inflate the variance; PCA is very effective in detecting outliers of this type, but they are also detectable through inspecting variables individually or by using other multivariate techniques such as the Mahalanobis distance; and specific outliers (type II), which represent new directions that are not captured by the PC model and can be detected using the sum of squares of the residuals, or Q statistic.

Square Prediction Error (SPE)

Outliers that represent new directions relative to the PCA model can be tested using the Q statistic, or square prediction error (SPE), which is defined as [75]:

Q = E^T E    (5.5)

The square prediction error (SPE) measures the sum of squares of the distance of E from the main space defined by the PCA model. Alternatively, the square prediction error (SPE) can be calculated as:

Q = \sum_{i=k+1}^{p} \frac{Z_i^2}{l_i}    (5.6)

A new observation is considered an outlier in the model if its Q statistic exceeds a predefined threshold.

5.3 Data set and Pre-Processing

This section describes the data set used in this study, the pre-processing that has been applied to the data, and the architecture of the detection model.

Data set

The honeypot traffic data used in the analysis in this chapter also comes from the Leurré.com project [9]. Two data sets were extracted for the purpose of this study: Data set I, for constructing the PCA model (this is the same data set used in Chapter 4); and Data set II, for evaluating the model. Table 5.1 gives a brief summary of the data sets used.

Table 5.1: Summary of the data sets used in the study (Data set I: 15/09/2007 to 30/11/2007; Data set II: 01/12/2007 to March 2008).

The first data set was used in the previous chapter to study the structure of attackers' activities in low-interaction honeypot traffic. The reliability and accuracy of the data set encouraged further usage of the same data set in building the PCA profile of honeypot traffic. The data set was adequate in size for building the PCA model [23] and represented a trade-off between performance and size. The second data set, which was used to test the model, was the most recent data at the time of the study.

Processing the Flow Traffic via PCA

The processing of raw traffic data and the extraction of traffic features were described in Chapter 4. In this section, we describe the methodology for performing principal component analysis on the correlation matrix of honeypot activity flows.

Table 5.2: Extracted principal components' variance.

To calculate the PCs from the correlation matrix, the p-dimensional matrix X = (X_1, ..., X_p)^T is first standardized by:

C_{ij} = \frac{X_{ij} - \bar{X}_i}{\hat{\sigma}_i}    (5.7)

for i = 1, ..., p, where \bar{X}_i is the sample mean and \hat{\sigma}_i is the sample standard deviation of X_i. Let R be the sample correlation matrix of C; then the principal component analysis

becomes:

Z = A^T C    (5.8)

where A = (A_1, ..., A_k) is the matrix of eigenvectors of R, with the first component equal to:

Z_1 = a_{11} C_1 + a_{12} C_2 + \cdots + a_{1p} C_p    (5.9)

Several factors were considered for selecting the number of principal components (PCs) that are representative of the variables. First, Kaiser's rule [120, 74] for eliminating PCs with an eigenvalue less than one was considered (see Table 5.2 for the eigenvalues). Second, an inspection was made of the Scree plot of the energy contributed by each PC (see Figure 5.1), where a sharp drop in the curve indicates a typical cut-off for selecting the correct number of components (between six and seven PCs). Finally, consideration was given to adding the seventh component to achieve 90% of the total variance of the original data for representing the main space (90% reflects the total variance after the robustification process, described in the following section, where extreme activities are eliminated, compared to 85% in the previous chapter).

Robustness

Extracting principal components from a standard correlation matrix is very sensitive to outliers, as the directions of the resulting principal components might be determined by them. An effective technique for improving the principal component analysis and reducing the effect of these outliers is the robustification of the correlation matrix during the model building phase, Phase I. The robustification works by eliminating observations with large squared Mahalanobis distances M^2 in an iterative process, until the data is believed to be clean or the given number of iterations is reached [77, 65].

Given a p-dimensional random matrix X = (X_1, ..., X_p)^T of n samples, where \bar{X}_i is the sample mean and S_i is the sample variance of X_i, then M^2, given by Equation (5.10), defines an ellipsoid in the p-dimensional space which is centered at the mean \bar{X}, and the distance to its surface is given by the M^2 values (a contour of constant probability density) [77].

M_i^2 = (X_i - \bar{X})^T S^{-1} (X_i - \bar{X})    (5.10)

Figure 5.1: Scree plot of eigenvalues.

The constant probability contour for the distribution of X satisfies M^2 \le \chi_p^2(\alpha), where \chi_p^2(\alpha) is the percentile of a chi-square distribution with p degrees of freedom. There is a probability α that x_i falls outside the ellipsoid defined by M^2. Setting a threshold for detecting outlying observations based on \chi_p^2(\alpha) requires the distribution of X to be multivariate normal. However, since we do not make any assumptions about the distribution of our data, the threshold for the robustification process can be determined from the empirical distribution of M^2. The robustification algorithm is detailed in Figure 5.2, and the corresponding Matlab code is listed in the Appendices.

Setting up Model Parameters

A critical step in designing a detection approach is setting the limit for judging new observations, since this has a dramatic effect on the quality of the detection. When the limit value is very small, it will frequently be exceeded, resulting in a high rate of false positive alarms; when the limit is very large, it will rarely be exceeded, resulting in many false negative alarms.

Let X be a data matrix of n samples of p-dimensional data, where \bar{X} = (\bar{X}_1, ..., \bar{X}_p)^T

is the sample mean vector of X and R is the sample correlation matrix.

Input: X_{p x n} data matrix of p variables and n observations; N number of iterations
Output: X data matrix
1   c = 0   % iteration counter
    while (c < N)
2   {
3      c = c + 1
4      Estimate the sample mean \bar{X} of X
5      Estimate the sample variance S of X
6      Calculate the Mahalanobis distance M^2 (Eq. 5.10) for all n observations
7      Calculate the threshold UCL, based on Eq. 5.14
8      Find the O observations with M_i^2 > UCL, for i = 1, 2, 3, .., n
9      Update the data matrix X by trimming the observations in O: new X = X - O
10  }

Figure 5.2: Robustification of the correlation matrix through multivariate trimming.

The sum of the squares of the weighted principal component scores of the last q principal components (the residual space), used in detecting outliers, is given by:

Q_i = \sum_{k=p-q+1}^{p} \frac{Z_{ki}^2}{l_k}    (5.11)

where q < p, Z_{ki} is the score of the k-th PC of the i-th observation, and l_k is the k-th eigenvalue. When q = p, the previous equation can be represented by the distance of the i-th observation from the mean of the data, which is given by:

M_i^2 = (X_i - \bar{X})^T S^{-1} (X_i - \bar{X}) \le UCL    (5.12)

Then M^2 follows a chi-square distribution, for larger sample sizes, with p degrees of freedom [38]. Thus, the upper control limit becomes:

UCL = \chi_{1-\alpha, p}^2    (5.13)

where \chi_p^2(\alpha) is the percentile of a chi-square distribution with p degrees of freedom.
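A compact Python rendering of the trimming loop in Figure 5.2 might look as follows (our own sketch, assuming the empirical mean-plus-three-standard-deviations control limit of Eq. 5.14, given below):

    import numpy as np

    def robustify(X, n_iter=3, k=3.0):
        """Multivariate trimming: iteratively drop observations whose squared
        Mahalanobis distance exceeds an empirical upper control limit."""
        X = np.asarray(X, dtype=float)
        for _ in range(n_iter):
            mean = X.mean(axis=0)
            cov = np.cov(X, rowvar=False)
            diff = X - mean
            # Squared Mahalanobis distance of every observation (Eq. 5.10).
            m2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
            ucl = m2.mean() + k * m2.std()   # empirical control limit (Eq. 5.14)
            X = X[m2 <= ucl]                 # trim the extreme observations
        return X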

Although we do not make any assumption about the exact distribution of each of the p variables, and we are only interested in large values of M^2, the upper limit can be computed from the empirical distribution of the M^2 population as follows:

UCL = u + 3s    (5.14)

where u is the sample mean and s is the sample standard deviation of the M^2 values. Even if X is not normally distributed, setting the control limit as a multiple of the standard deviation, usually 3, is an acceptable practice and gives good practical results [99, 15]. If the data is normally distributed, then over 99.7 percent of the data will fall under the control limit, or the probability of an observation falling outside the limit is equal to about one in a thousand (0.001). Alternatively, Chebyshev's inequality states that, regardless of the distribution of the data X, at least 89 percent of the observations fall under a control limit of 3 standard deviations from the mean [98]. The M^2 test is equivalent to using all principal components in Equation 5.11. The M^2 test was used in Phase I to clean the data and to reduce the effect of extreme observations before conducting the analysis and estimating the model parameters.

The square prediction error (SPE), or Q-statistic, is a test of how well a particular observation fits the principal component model. SPE is calculated from the sum of squares of the residuals, and it measures the distance from the observation to the k-dimensional hyperspace defined by the PCA model. A high value of SPE indicates that the new observation represents a new direction not included in the PCA model. The Q-statistic of the residual space can be represented by the sum of the squares of the weighted principal component scores of the last p - q principal components in Equation 5.11. The upper limit for Q is given by [75]:

Q_\alpha = \theta_1 \left[ \frac{C_\alpha \sqrt{2 \theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2} \right]^{1/h_0}    (5.15)

where C_\alpha is the normal deviate corresponding to the upper (1 - α) percentile, \theta_j = \sum_{i=k+1}^{p} l_i^j for j = 1, 2, 3, and

h_0 = 1 - \frac{2 \theta_1 \theta_3}{3 \theta_2^2}

The use of the upper limit in Equation 5.15 assumes that the data is normally distributed. Alternatively, the upper limit can be set based on the empirical distribution of Q.
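As an illustration, the threshold of Eq. 5.15 can be computed directly from the residual eigenvalues (our own Python sketch using scipy's normal quantile; variable names are ours):

    import numpy as np
    from scipy.stats import norm

    def q_alpha(eigvals, k, alpha=0.01):
        """Upper control limit for the Q statistic (Eq. 5.15).

        eigvals: all p eigenvalues in decreasing order; k: retained PCs."""
        residual = eigvals[k:]                  # eigenvalues of the residual space
        theta1, theta2, theta3 = (np.sum(residual ** j) for j in (1, 2, 3))
        h0 = 1 - (2 * theta1 * theta3) / (3 * theta2 ** 2)
        c_alpha = norm.ppf(1 - alpha)           # normal deviate for the upper percentile
        term = (c_alpha * np.sqrt(2 * theta2 * h0 ** 2) / theta1
                + 1 + theta2 * h0 * (h0 - 1) / theta1 ** 2)
        return theta1 * term ** (1 / h0)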

The Q statistic is used in Phase II to detect new attacks in the detection model.

Model Architecture

The architecture of the detection model is depicted in Figure 5.3.

[Figure 5.3: Detection model architecture. Honeypot traffic passes through basic and aggregated flow extraction, filtering, and feature extraction; Phase I standardizes the historical observations, extracts the PCs, and generates the model parameters; Phase II standardizes each new observation and tests it against the PCA model parameters to detect new attacks.]

As the figure shows, the model consists of three main components:

Traffic Flow Aggregator: the traffic flow aggregator accepts Argus traffic flows [2], set to a 5-minute maximum expiration, and groups them into activity flows. The newly generated activity flows are formed by combining the original flows by the source IP address of the attacker, with a maximum of 60 minutes inter-arrival time between the original flows (a sketch of this grouping follows the list below). Internet noise, such as backscatter, is filtered out in this model.

PCA Model Extraction: the PCA profile is built from historical honeypot data. This includes the calculation of the correlation matrix, the extraction of the eigenvectors and eigenvalues, and the generation of the principal components.

Detection: in the detection model, new observations are tested against the predefined PCA model parameters to detect new attacks.
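A minimal sketch of the aggregation rule, assuming basic flows arrive sorted by start time as dictionaries with illustrative field names (src_ip, start in epoch seconds); this is not the thesis's own aggregator:

```python
from collections import defaultdict

MAX_GAP = 60 * 60  # 60-minute maximum inter-arrival time, in seconds

def aggregate_flows(basic_flows):
    """Group Argus basic flows into per-attacker activity flows: a basic
    flow opens a new activity flow whenever more than MAX_GAP seconds
    separate it from the previous flow of the same source IP."""
    last_seen, counter = {}, defaultdict(int)
    activities = defaultdict(list)     # (src_ip, activity index) -> basic flows
    for flow in basic_flows:
        ip = flow['src_ip']
        if ip in last_seen and flow['start'] - last_seen[ip] > MAX_GAP:
            counter[ip] += 1           # gap too large: start a new activity flow
        last_seen[ip] = flow['start']
        activities[(ip, counter[ip])].append(flow)
    return activities
```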

The methodology for detecting new attacks in low-interaction honeypot traffic is adapted from multivariate statistical process control (MSPC), a statistical technique widely used to monitor production processes in industry (e.g. chemical plants) and to detect manufacturing process faults [99]. The proposed detection model is performed in two phases.

Phase I: Building a PCA profile of the honeypot traffic from historical data over a defined period of time. This includes the calculation of the correlation matrix, the extraction of the eigenvectors and eigenvalues, and the generation of the principal component scores. Figure 5.4 lists the steps required to construct the detection model.

Input: X (p × n) data matrix of aggregated flows of p variables and n observations; N, the number of iterations
Output: X̄ (mean), S (variance), E (residual space), UCL (upper control limit)

1   Clean the data matrix of extreme observations, as described above (Figure 5.2)
2   Compute the mean vector X̄ of X
3   Compute the variance S of X
4   Compute Y, the standardized observation vectors of X
5   Estimate the correlation matrix from the standardized data
6   Calculate the eigenvalues (L) and eigenvectors (A)
7   Extract the PC scores
8   Find the number of significant PCs, K, and the residual space E
9   Compute the Q statistics (Eq. 5.11)
10  Compute the upper control limit UCL for judging future attacks (Eq. 5.14)

Figure 5.4: Steps for building the PCA model (Phase I).

Phase II: Detecting new attacks, where new observations are standardized, projected onto the residuals of the predefined PCA model, and their SPE values tested against a predefined threshold. Figure 5.5 lists the steps required to apply the detection model.

Input: X (p × 1) attack vector; model parameters from Phase I (X̄ mean, S variance, E residual space, UCL upper control limit)

1   Standardize the new attack vector
2   Project the standardized attack vector Y onto E
3   Compute the Q statistic (Eq. 5.11)
4   If Q > threshold (UCL), investigate X as a possible new attack

Figure 5.5: Steps for detecting new attacks (Phase II).
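The two phases can be sketched compactly in Python with numpy, assuming the empirical 3-sigma threshold; build_pca_model and test_observation are illustrative names, not the thesis implementation (which split the work between Python and Matlab):

```python
import numpy as np

def build_pca_model(X, n_significant):
    """Phase I (Figure 5.4): fit the PCA attack model on a cleaned
    (n, p) block of historical activity-flow features X."""
    mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Y = (X - mean) / std                        # standardize observations
    R = np.corrcoef(Y, rowvar=False)            # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]           # descending order of variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    E = eigvecs[:, n_significant:]              # residual space (last PCs)
    l_res = eigvals[n_significant:]
    spe = np.sum((Y @ E) ** 2 / l_res, axis=1)  # Q statistic (Eq. 5.11)
    ucl = spe.mean() + 3 * spe.std(ddof=1)      # empirical threshold
    return {'mean': mean, 'std': std, 'E': E, 'l_res': l_res, 'ucl': ucl}

def test_observation(x, model):
    """Phase II (Figure 5.5): flag an activity vector whose SPE exceeds
    the Phase I threshold as a possible new attack."""
    y = (x - model['mean']) / model['std']      # standardize with Phase I stats
    z = y @ model['E']                          # project onto the residual space
    spe = np.sum(z ** 2 / model['l_res'])       # Q statistic of the projection
    return spe, spe > model['ucl']
```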

To test the model's ability to detect traffic that is not present in its training data set, proof of concept testing was conducted. This testing consisted of two parts: the manual generation of new traffic not present in the training data set, and the testing of the detection model against this new traffic. A set of new traffic was generated, consisting of SYN-flooding attacks using Hping3 [7], Nmap SYN scans [62], operating system identification using Xprobe2 [144], and Nessus vulnerability scans [13]. The testing results showed that all of the new traffic was detected by our technique as not having been seen in the training data and was assigned high SPE values. To confirm that the new traffic did not exist in the original training data set, a new attack model was constructed that included the newly generated traffic. When the generated traffic was projected onto the residual space of this new model, the Q values were small and below the threshold value. While it is not claimed that the technique is capable of detecting all new attacks, the testing results confirm that this detection model is capable of detecting traffic that is either new or not present in the training data set.

5.4 Illustrative Example

To illustrate this methodology of modeling low-interaction honeypot traffic and using the principal component residual space to detect new attacks, the following example provides a practical step-by-step demonstration on a sample of honeypot traffic. The sample traffic has been reduced in its feature space from 18 to only 10 variables, to simplify both the calculation and the demonstration. First, let X be the data matrix of the historical honeypot traffic flows (see Table 5.3). The features used are: the total number of basic flows generated by individual IPs (V1); the total number of open TCP ports targeted (V2); the total number of distinct open TCP ports targeted (V3); the total number of closed TCP ports targeted (V4); the total number of distinct closed TCP ports targeted (V5); the total number of ICMP flows (V6); the total number of machines targeted (V7); the total duration of basic flows (V8); the total number of source packets (V9); and the summation of inter-arrival times between basic flows (V10).

[Table 5.3: Sample traffic matrix of 20 observations of the features V1 to V10.]

The example is divided into two parts: the first part demonstrates the PCA model construction, which includes data cleaning, standard principal component analysis extraction, and the isolation of the principal component residual space; the second part shows how new traffic vectors are manipulated and then tested against the principal component residual space for new attack detection.

PCA Model Construction

First, the data set is arranged into a data matrix X, where the rows represent variables and the columns constitute observations:

$$X_{(10 \times 20)} = (X_1, \ldots, X_{10})^T = \begin{pmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,20} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ X_{10,1} & X_{10,2} & \cdots & X_{10,20} \end{pmatrix} \qquad (5.16)$$

Then the mean vector $\bar{X} = (\bar{X}_1, \ldots, \bar{X}_{10})^T$ is computed as:

$$\bar{X}_i = \frac{1}{n} \sum_{j=1}^{n} X_{ij}, \quad i = 1, \ldots, 10, \; j = 1, \ldots, 20 \qquad (5.17)$$

Before conducting any analysis on the data set, a cleaning process is needed to robustify the analysis and to reduce the effect of extreme values. Using the algorithm described in Figure 5.2, iterative ellipsoidal trimming is applied to the data set. The iterative trimming uses the Mahalanobis distance to measure the distance of each observation from the center of the data; extreme observations are those that exceed a predefined limit. Let S be the covariance matrix of the sample data:

$$S = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T \qquad (5.18)$$

The squared Mahalanobis distance $M^2$ of each observation in the data set is computed as:

$$M_i^2 = (X_i - \bar{X})^T S^{-1} (X_i - \bar{X}), \quad i = 1, \ldots, 20 \qquad (5.19)$$

The upper control limit UCL for testing extreme values then needs to be estimated. First, the mean $\bar{M}$ and the standard deviation $S_M$ of the $M^2$ values are calculated. Then the upper control limit is derived from the empirical distribution of $M^2$ using the following formula:

$$UCL = \bar{M} + 3 S_M \qquad (5.20)$$

Testing the $M^2$ values in Vector 5.21 against the threshold value reveals that none of the observations qualifies for elimination. Had any observations been eliminated, the mean $\bar{X}$ would have needed to be recalculated on the reduced data set.

$$M^2 = (M_1^2, \ldots, M_{20}^2)^T \qquad (5.21)$$

The next step is to standardize the data matrix to have zero mean and unit variance. The variance vector $\hat\sigma = (\hat\sigma_1, \ldots, \hat\sigma_{10})^T$ of the data set is calculated (Vector 5.22). Then the standardized data matrix $Y = (Y_1, \ldots, Y_{10})$ is calculated as:

$$Y = \begin{pmatrix} \dfrac{X_1 - \bar{X}_1}{\hat\sigma_1} \\ \dfrac{X_2 - \bar{X}_2}{\hat\sigma_2} \\ \vdots \\ \dfrac{X_{10} - \bar{X}_{10}}{\hat\sigma_{10}} \end{pmatrix} \qquad (5.23)$$

Table 5.5 depicts the standardized values of the data set. The correlation matrix R of the data is then computed.

[Table 5.5: Standardized traffic matrix.]

The eigen decomposition is then performed on R, and the results are rearranged so that the eigenvalue and eigenvector pairs are in descending order of variance. The eigenvectors and eigenvalues of R are shown in Tables 5.6 and 5.7, respectively.

[Table 5.6: Eigenvectors (PC1 to PC10).]

[Table 5.7: Eigenvalues, with the percentage and cumulative percentage of total variance explained by each component.]

$$E = (A_5, \ldots, A_{10}) \qquad (5.24)$$

The values in Table 5.7 suggest that the main PCA space is represented by the first four PCs, as their eigenvalues are high, or close to one, and their cumulative percentage of the total variance is close to 90%. Thus, the residual space E (Vector 5.24) is represented by the last six PCs. To calculate the SPE values for the first phase, which are needed to derive the detection threshold, we first project the standardized data onto the residual space. The resulting PC scores $Z_R$ of the projection become:

$$Z_R = E^T Y \qquad (5.25)$$

Table 5.8 depicts the resulting PC scores $Z_R$ of the projection.

[Table 5.8: Scores of the residuals.]

The SPE statistics are then calculated using Equation 5.11 as follows:

$$SPE_i = \sum_{k=1}^{6} \frac{Z_{ki}^2}{l_k} \qquad (5.26)$$

Table 5.9 shows the resulting SPEs.
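In numpy terms, with Y the 10 × 20 standardized matrix, E the 10 × 6 residual eigenvector matrix, and l_res the six residual eigenvalues, this step reduces to a few lines; the random data below is only a stand-in so the sketch of Equations 5.25 to 5.27 runs on its own:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((10, 20))          # stand-in for the standardized data
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Y))
E = eigvecs[:, np.argsort(eigvals)[:6]]    # eigenvectors of the 6 smallest eigenvalues
l_res = np.sort(eigvals)[:6]               # their eigenvalues

Z_R = E.T @ Y                                    # residual scores (Eq. 5.25), 6 x 20
spe = np.sum(Z_R**2 / l_res[:, None], axis=0)    # one SPE value per observation (Eq. 5.26)
ucl = spe.mean() + 3 * spe.std(ddof=1)           # empirical threshold (Eq. 5.27)
```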

[Table 5.9: SPE values of the 20 training observations (total 114), with their mean (M) and standard deviation (STD).]

Finally, the upper control limit UCL, or threshold, for the SPE values becomes:

$$UCL = \bar{M}(SPE) + 3\,STD(SPE) \qquad (5.27)$$

Future Traffic Testing

Suppose that new traffic has been collected (Table 5.10) and that there is a need to test whether this traffic is new or has been seen before by the honeypot.

[Table 5.10: Future traffic matrix of four observations of the features V1 to V10.]

First, the new traffic is standardized using the mean $\bar{X}$ and variance $\hat\sigma$ from the first phase (Vectors 5.17 and 5.22). Using Equation 5.23, the standardized first observation of the new traffic becomes:

$$Y_1 = \frac{X_1 - \bar{X}_1}{\hat\sigma_1} \qquad (5.28)$$

Table 5.11 illustrates the standardized new traffic vectors. The next step is to project Y onto the residual space E (see Equation 5.25). The new PC scores $Z_F$ of the projection then become:

$$Z_F = E^T Y$$

Table 5.12 shows the PC scores of projecting the new observations onto the residual space. Then, the SPE values of the projected traffic are computed using Equation 5.11 as:

$$SPE_i = \sum_{k=1}^{6} \frac{Z_{ki}^2}{l_k} \qquad (5.29)$$

[Table 5.11: Standardized future traffic matrix.]

[Table 5.12: New traffic PC scores ($Z_1$ to $Z_4$).]

$$SPE = (SPE_1, \ldots, SPE_4) \qquad (5.30)$$

Finally, the SPE values of the new observations are compared with the threshold computed in stage one (Equation 5.27). The SPE values of observations 3 and 4 exceed the threshold and are considered new attacks. As can be seen from Table 5.3, the first two observations of the new traffic set have been seen before and are included in the training data, while the last two observations are new.

[Figure 5.6: Plot of SPE values of the training and testing traffic, with the SPE limit marked.]

5.5 Results and Evaluation

Two and a half months of real attack data were used to build the detection model in Phase I, using Data set I in Section 5.3.1. Parameters from this phase were then used to detect future attacks. The following subsections analyze the detection technique in terms of results, stability, and performance, and evaluate the detected results.

Detection and Identification

Four months of real attack evaluation data were extracted from the honeypot environment (Data set II in Section 5.3.1) and projected onto the residual space of the detection model. Figure 5.7 illustrates the SPE values of the projection. Observations with high SPE values appear as spikes rising above the threshold value; these observations were possibly new attacks and required further investigation. As the figure shows, 81 observations were flagged by our detection algorithm as violating the structure of the attack model. Moreover, the figure shows intense attack activity concentrated along the X axis; these observations reflect a single class of attack that one of our honeypot sensors experienced in late February and early March of 2008.

[Figure 5.7: Plot of four-month attack data projected onto the residual space.]

Details of these attacks and of the rest of the attack activities are discussed in the Evaluation subsection below.

Stability of the Monitoring Model Over Time

As our technique uses a historical block of data to construct the PCA detection model, it is very important to evaluate the stability of the PCA model over time. Preliminary investigation, which included the stability of the estimated mean vectors and the correlation matrix, suggested a slight variation in some of the variables' means and in the amount of variance contained in the first seven components. The number of significant components was the same for both data sets; however, the amount of variation in the first seven components had increased slightly in Data set II. These changes were not significant and had little effect on the residual space. Nevertheless, in Chapter 6 the current design is developed further, and an adaptive real-time detection model is devised that updates its parameters automatically over time to incorporate any changes in the traffic.

Computational Requirements

The detection model was developed using a combination of the open-source programming language Python [16] and the high-level scientific computing environment Matlab [10]. Python was used to develop the flow aggregator, while Matlab was used to develop the detection engine. The tasks required by the detection model include standardizing the data, calculating the correlation matrix, finding the eigenvectors and eigenvalues, extracting the PC scores, and computing the $M^2$ and Q statistics. If X is a data matrix of n samples of p-dimensional random variables, then the computational cost of computing the correlation matrix of X is O(np²), and that of extracting the eigenvectors and eigenvalues is O(p³) [125].

[Table 5.13: Average execution times, in seconds, of the major tasks: calculating the correlation matrix for all data; finding the eigenvectors and eigenvalues of the correlation matrix; calculating the PC scores of X; the M² test of one vector for all PCs (18 PCs); and the Q test of one vector for the residual (11 PCs).]

The computational requirements are mainly matrix manipulations and are not considered expensive, particularly given the massive reduction in data records achieved by our flow aggregation technique. Moreover, to detect new attacks, only the Q statistic needs to be calculated, using parameters from the detection model (Phase I). Table 5.13 shows the empirical execution times required by a number of components of the detection model, measured on a personal computer with a 2.0 GHz Intel dual-core processor and 2 GB of RAM.

Evaluation

In this section, we detail our methodology for evaluating the detection model using the data set described in Section 5.3.1. To help better understand the nature of the detected observations and judge their significance, a manual inspection was carried out for every observation flagged by our detection algorithm, 81 observations in total.

The aim was to explain the reasons for these observations being flagged by the detection algorithm and to group them, according to their similarities, into different classes. The process consisted of manual inspection and manual classification of these detected attack observations. Firstly, we examined all of the 18 traffic features that had originally been used by the algorithm, and then went further by checking the basic flows for other patterns of attack, such as destination ports, protocols, and flags. Moreover, flagged observations were also checked against the original traffic logs. Secondly, observations were grouped together into different types of attack clusters, based on their attack port similarities, or port sequences. The port sequence is the list of targeted honeypot ports generated by a single IP address during the attack period. The manual inspection of the detected traffic found eight clusters of attack activities. Table 5.14 provides a brief summary of the clusters.

Activities Class    | Distinct Behaviors                                      | Possible Type                        | No.
Worm Activity I     | Moderate: TF, TCP_O; Low: TCP_C, IAT                    | W32.Rahack.W worm                    | 26
Worm Activity II    | Moderate: TF, TCP_C; Low: TCP_O                         | Mydoom worm family                   | 6
Worm Activity III   | Moderate: TF, TCP_C; High: AVG_PK_SIZE                  | Bobax worm family                    | 1
Worm Activity V     | Low: TF; High: IAT                                      | Backdoor.Evivinc                     | 2
Denial of Service   | High: TF, Dur, TCP_O, SPackets, T_ACT; Short: IAT       | Distributed DoS or DoS               | 21
Scan Activities     | Large: TF, TCP_C, SPackets; Moderate: TCP_O, ICMP       | Horizontal scan or machine detection | 2
Misconfiguration    | Low: TF, UDP_C                                          | DHCP request                         | 8
Miscellaneous       | Low: TF, TCP_C, UDP_C                                   | Unknown                              | 15

Table 5.14: Classes of detected attack activities.

As Table 5.14 shows, four types of activities were classified as worm attacks. The first class, Worm Activity I, was the largest, with the port sequence (T139, T445, T9988, ICMP).

This class represents repeated attempts targeting two open TCP ports, 139 and 445, and a single closed TCP port, 9988. These activities resemble the well-known Rahack worm [123], which targets Microsoft operating systems. The second class of worm activities was distinguished by its port sequence (T1080, T3128, T80, T8080, T10080, ICMP); this pattern matches that of the Mydoom worm family [121]. The third class of worm activity, with port sequence (T445, T135, T1433, T139, T5000), is another type of automated exploit that targets a Microsoft Windows LSASS vulnerability [39]. The last worm class targeted TCP port 5900; this class of activities mainly scans for Trojans that listen for remote connections on TCP port 5900, such as Backdoor.Evivinc [122].

The denial of service class came second in terms of the number of observations. The attacking IPs targeted a single machine on a single open TCP port, port 80, with very short times between packets. These attacks were detected by our algorithm because the total activity of each source IP was huge, in addition to other parameters such as the number of source packets sent. The attack was mainly caused by a few IP addresses over the period from 20/02/2008 to 04/03/2008.

The third class of detected activities was scan activities. While low-to-moderate scanning activities were very common in our log files, these activities were flagged by our algorithm because they generated large values on single or multiple features. The misconfiguration class of activities consisted mainly of DHCP requests on UDP port 53. The last class, miscellaneous, consisted of all observations that we were not able to explain and which did not fit in any other class. This class represents short attacks on non-standard single TCP ports, single UDP ports, or both.

5.6 Summary

This chapter has presented a technique for detecting new attacks in low-interaction honeypot traffic. The proposed detection is performed in two phases. Firstly, an attack model is constructed and model parameters are estimated using principal component analysis of historical honeypot traffic. Secondly, new traffic vectors are projected onto the residual space of the PCA model from the first phase, and their square prediction error (SPE) statistics are computed.

Traffic vectors are flagged as new attacks if their SPE values exceed a predefined threshold. Traffic with a large SPE value represents a new direction that has not been captured by the PCA attack model and needs further investigation. The effectiveness of the proposed technique is demonstrated through the analysis of real traffic data from the Leurré.com project. The evaluation results show that this technique is capable of detecting different types of attacks with no prior knowledge of those attacks. In addition, the technique has low computational requirements, which makes it suitable for on-line detection systems.

The promising capability of the proposed technique, both in detecting new attacks and in requiring low computational resources, motivated our investigations into improving it further to suit an on-line monitoring system. Further investigation was required to overcome some of the limitations identified in the work described in this chapter, namely:

- the need for manual extraction of the model parameters, such as the number of PCs required by the main PCA model and the residual space;

- the use of a static PCA attack model to detect new attacks, since it is built from a fixed block of historical data; and

- the lack of a mechanism for improving the detection technique by inspecting traffic that exhibits high SPE values, either because it is new to the PCA model or because it is an extreme example of traffic that has been previously observed by our honeypot.

The next chapter describes how to overcome these limitations, how to automate the detection model to adapt to the dynamically changing nature of Internet attack traffic, and the implementation of a proof of concept detection system.

Chapter 6

Automatic Detection of New Attacks

Detecting emerging Internet threats in real time presents several challenges, including the high volume of traffic and the difficulty of isolating legitimate from malicious traffic. As previously noted, an efficient way of collecting and detecting these threats is to deploy honeypots, since they are decoy computers that run no legitimate services, and any contact with them can therefore be considered suspicious. In the previous chapter, a technique was proposed for detecting new attacks in low-interaction honeypots using principal component analysis (PCA). The technique flags new attacks by detecting changes in the residual space of the PCA model. While the technique is very efficient at detecting these changes, it suffers from several limitations that make it inefficient for the real-time detection of anomalous honeypot traffic. In addition, Internet traffic is very dynamic and changes very rapidly, which necessitates a real-time detection model capable of capturing these changes automatically. In this chapter, these limitations are addressed, and a real-time adaptive detection model is proposed that captures new changes and updates its parameters automatically. The main contributions of this chapter include:

- a method for the automatic extraction of model parameters, such as the number of components representing the main PCA space and the residual space, and the threshold values;

- a method for automatically differentiating between the two types of activities that exhibit high SPE values: genuinely new activities, and extreme examples of existing activities that have been observed by our honeypot before;

- a method for automatically updating the model's correlation structure without the need to retain the old traffic data, based on the work of Li et al. [94]; and

- a proof of concept implementation of the proposed detection system for real-time and offline applications.

The remainder of the chapter is structured as follows. Section 6.1 provides an introduction and discusses the motivation behind this work. Section 6.2 details the methodology of constructing the attack model using principal component analysis. The model architecture is described in Section 6.3. Section 6.4 presents a proof of concept implementation of the proposed attack detection model. Experimental results are discussed in Section 6.5. Finally, Section 6.6 summarizes the chapter.

6.1 Introduction

No matter how extensive the training data used to extract the detection model, that data has a limited scope of the attack space and cannot be considered representative of the entire attack space. In addition, the attack space is very dynamic and changing: new attacks are reported every day. A major limitation of building an attack model from historical data is that it produces a fixed model. One consequence of this limitation is a high number of false alarms, with previously seen attacks continuing to be identified as new. A reliable traffic detection model is required to capture new changes in Internet threats and to adapt to these changes automatically, which, in the context of our proposal, involves the following:

- automatic update of the mean and standard deviation vectors, and the correlation matrix;

- automatic extraction of the model parameters, such as the number of PCs representing the main PCA model and the residual space; and

- automatic adjustment of the threshold values for flagging new attacks and eliminating extreme observations.

When a new block of traffic data becomes available, the PCA model needs to be updated using all traffic data accumulated up to that point in time, which requires storing all historical traffic data. Although this methodology is correct and works for small data sets or short intervals, accumulating and handling large amounts of traffic is very difficult in terms of storage and computational requirements. Different methods exist for updating models over time, such as the exponentially weighted moving average (EWMA) and moving-window schemes. However, EWMA uses a forgetting factor, or decay factor, that gives more weight to recent data while the weight of old data declines over time [99]. While this method works for many applications, it does not suit our need to detect new attacks, because old attack data is neglected over time. Alternatively, the detection model can be updated recursively, giving equal weight to old and new data. This method accounts for all data and only requires that the most recent block of data be retained. A complete algorithm for updating a generic PCA model recursively is described by Li et al. [94] and is utilized here to update the proposed detection model.
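For contrast, a minimal sketch of an EWMA-style covariance update with forgetting factor lambda; this is illustrative only and is not the scheme adopted here:

```python
import numpy as np

def ewma_cov_update(cov_old, x_new, mean, lam=0.99):
    """EWMA covariance update: each step multiplies the weight of all older
    data by lam, so old attack directions fade from the model over time --
    the reason this scheme is unsuitable for detecting new attacks here."""
    d = np.atleast_2d(x_new - mean)
    return lam * cov_old + (1 - lam) * (d.T @ d)
```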

6.2 Principal Component Analysis Model

The proposed detection model operates in two stages. The initial stage involves standard PCA model extraction, as described in the previous chapter, and the second stage performs recursive PCA for adaptive real-time monitoring. The next sections describe these two stages in detail.

Building the Initial PCA Detection Model

The methodology for building the attack model is to accumulate historical traffic data and then build the initial detection model from this block of traffic data. The main steps (from Chapter 5) are briefly summarized here. First, historical honeypot traffic is grouped into the initial traffic data block. Let X be this initial data matrix of n observations of p variables, let $\bar{X} = (\bar{X}_1, \ldots, \bar{X}_p)^T$ be the sample mean vector of X, let $\hat\sigma = (\hat\sigma_1, \ldots, \hat\sigma_p)$ be the sample standard deviation vector of X, and let R be the sample correlation matrix. Then principal component analysis can be expressed as:

$$Z = A^T X \qquad (6.1)$$

where Z contains the principal component scores of projecting the observations in X onto the eigenvector matrix A. The previous equation can be represented in the original coordinates by projecting Z back through A, so that X becomes:

$$X = \sum_{i=1}^{k} A_i Z_i + E = \hat{X} + E \qquad (6.2)$$

where the residual matrix E represents the difference between X and $\hat{X}$:

$$E = X - \hat{X} \qquad (6.3)$$

The Q-statistic, or square prediction error, is defined as [75]:

$$Q = E^T E \qquad (6.4)$$

and measures the sum of squares of the distance of E from the main space defined by the PCA model. Alternatively, the square prediction error can be calculated as:

$$SPE = \sum_{i=k+1}^{p} \frac{Z_i^2}{l_i} \qquad (6.5)$$

An observation is considered new to the PCA model if its Q-statistic exceeds a predefined threshold. Finally, the upper control limit UCL, or threshold, for the SPE value is computed from the empirical distribution of the historical data set as:

$$UCL = \bar{M}(SPE) + 3\,STD(SPE) \qquad (6.6)$$

where $\bar{M}$ is the mean and STD is the standard deviation.

Recursive Adaptation of the Detection Model

The previous section showed how the initial detection model is constructed from a block of historical data X. To update the PCA model when a new block of data becomes available, the following model parameters need to be recursively updated: the mean vector, the standard deviation vector, the correlation matrix, and the threshold.

Our notation for the recursive updating of the PCA detection model follows Li et al. [94], who proposed a recursive PCA algorithm for adaptive process monitoring. Let $X_1$ be the first block of data, of $n_1$ observations, used to build the detection model. The sample mean vector $\bar{X}_1$ is:

$$\bar{X}_1 = \frac{1}{n_1} (X_1)^T I_1 \qquad (6.7)$$

where $I_{1(n_1,1)} = (1, 1, \ldots, 1)^T$. The standardized data matrix $Y_1$ is then:

$$Y_1 = (X_1 - I_1 \bar{X}_1^T)\, \Sigma_1^{-1} \qquad (6.8)$$

where $\Sigma_1 = \mathrm{diag}(\hat\sigma_{1.1}, \ldots, \hat\sigma_{1.p})$ is built from the sample standard deviation vector $\hat\sigma_1 = (\hat\sigma_{1.1}, \ldots, \hat\sigma_{1.p})$, which can be estimated as:

$$\hat\sigma_{1.i}^2 = \frac{1}{n_1 - 1} \left\| X_1(:, i) - I_1 \bar{X}_1(i) \right\|^2 \qquad (6.9)$$

where $X_1(:, i)$ is the $i$th column of the matrix $X_1$ and $\bar{X}_1(i)$ is the $i$th element of the mean vector. The correlation matrix $R_1$ is calculated using:

$$R_1 = \frac{1}{n_1 - 1} Y_1^T Y_1 \qquad (6.10)$$

Let $X_k$ be the current data block of $n_k$ observations, which has been used to estimate the detection model, with sample mean vector $\bar{X}_k$, sample standard deviation $\hat\sigma_k$, standardized data matrix $Y_k$, and sample correlation matrix $R_k$. To augment the current model with a new block of data $X_{k+1}$, the new model parameters $\bar{X}_{k+1}$, $\hat\sigma_{k+1}$, $Y_{k+1}$, $R_{k+1}$ are calculated recursively as:

$$\bar{X}_{k+1} = \frac{N_k}{N_{k+1}} \bar{X}_k + \frac{1}{N_{k+1}} (X_{k+1})^T I_{k+1} \qquad (6.11)$$

where $I_{k+1(n_{k+1},1)} = (1, 1, \ldots, 1)^T$ and $N_k = \sum_{i=1}^{k} n_i$. The standard deviation vector $\hat\sigma_{k+1} = (\hat\sigma_{k+1.1}, \ldots, \hat\sigma_{k+1.p})$ is estimated as:

$$\hat\sigma_{k+1.i}^2 = \frac{(N_k - 1)\,\hat\sigma_{k.i}^2 + N_k\, \Delta\bar{X}_{k+1}(i)^2 + \left\| X_{k+1}(:, i) - I_{k+1} \bar{X}_{k+1}(i) \right\|^2}{N_{k+1} - 1} \qquad (6.12)$$

where $\Delta\bar{X}_{k+1} = \bar{X}_{k+1} - \bar{X}_k$. The standardized data matrix $Y_{k+1}$ becomes:

$$Y_{k+1} = (X_{k+1} - I_{k+1} \bar{X}_{k+1}^T)\, \Sigma_{k+1}^{-1} \qquad (6.13)$$

And the correlation matrix $R_{k+1}$ is computed as:

$$R_{k+1} = \frac{N_k - 1}{N_{k+1} - 1}\, \Sigma_{k+1}^{-1} \Sigma_k R_k \Sigma_k \Sigma_{k+1}^{-1} + \frac{N_k}{N_{k+1} - 1}\, \Sigma_{k+1}^{-1} \Delta\bar{X}_{k+1} \Delta\bar{X}_{k+1}^T \Sigma_{k+1}^{-1} + \frac{1}{N_{k+1} - 1}\, Y_{k+1}^T Y_{k+1} \qquad (6.14)$$

Setting the Thresholds

Two types of limits are used in building the detection model: a Mahalanobis distance limit for eliminating extreme observations and robustifying the analysis, and an SPE limit for detecting new attacks. As mentioned in Chapter 5, no assumptions are made here about honeypot traffic distributions; the robustification and detection limits are therefore derived from the empirical distribution of the historical data set. The SPE and Mahalanobis upper control limits are derived according to the following 3-sigma equations:

$$UCL(SPE) = \bar{M}(SPE) + 3\,STD(SPE), \qquad UCL(T^2) = \bar{M}(T^2) + 3\,STD(T^2) \qquad (6.15)$$

where $T^2$ is equivalent to the Mahalanobis distance for n = 1 (and is used henceforth to describe this statistic), $\bar{M}$ is the mean, and STD is the standard deviation.

When new traffic data becomes available and the criteria for updating the PCA model are met, based on the number of days or the number of packets, the PCA model is recursively updated. As a result, the control limits change, and this necessitates updating the control limits every time the PCA model is updated. The recursive calculation of these control limits follows Equation 6.11 for the recursive calculation of the mean and Equation 6.12 for the recursive calculation of the variance.
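A minimal numpy sketch of one recursive update step (Equations 6.11, 6.12, and 6.14), assuming per-block updates; the state dictionary and all names are illustrative, not the thesis implementation:

```python
import numpy as np

def recursive_pca_update(state, X_new):
    """Update mean, variance, and correlation with a new (m, p) block X_new,
    giving equal weight to old and new data; only the new block is kept."""
    N_k, mean_k = state['N'], state['mean']
    var_k, R_k = state['var'], state['R']
    m = X_new.shape[0]
    N_k1 = N_k + m
    mean_k1 = (N_k * mean_k + X_new.sum(axis=0)) / N_k1     # Eq. 6.11
    d = mean_k1 - mean_k                                    # Delta x-bar
    resid = X_new - mean_k1
    var_k1 = ((N_k - 1) * var_k + N_k * d**2
              + (resid**2).sum(axis=0)) / (N_k1 - 1)        # Eq. 6.12
    s_k, s_k1 = np.sqrt(var_k), np.sqrt(var_k1)
    scatter = ((N_k - 1) * R_k * np.outer(s_k, s_k)         # Sigma_k R_k Sigma_k
               + N_k * np.outer(d, d)
               + resid.T @ resid)
    R_k1 = scatter / (N_k1 - 1) / np.outer(s_k1, s_k1)      # Eq. 6.14
    return {'N': N_k1, 'mean': mean_k1, 'var': var_k1, 'R': R_k1}
```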

[Figure 6.1: Adaptive detection model process flow. Real-time honeypot traffic passes through flow aggregation and feature extraction into the new attack detection engine; detected new attacks feed a new attacks repository, and the PCA model generation stage supplies the detection parameters.]

6.3 Model Architecture

Detecting new attacks in low-interaction honeypot traffic is achieved by detecting changes in the PCA model. These changes are detected by a statistical test of the projection of new traffic onto the predefined PC residuals, using the square prediction error (SPE). Traffic that violates the structure of the PCA model produces high SPE values and represents a new direction that has not been captured by the detection model. The previous chapter presented a model that uses PCA and the SPE to detect new attacks; however, that model is static and does not anticipate new changes in the monitored traffic. For real-time monitoring of honeypot traffic, the model needs to feed back any new changes that occur and recalculate the model parameters accordingly. Figure 6.1 shows the process flow of the proposed adaptive attack model. As the figure shows, the system consists of three main functions:

Traffic Flow Aggregator: the traffic flow aggregator accepts Argus traffic flows, set to a 5-minute maximum expiration, and groups them into what we call activity flows. The newly generated flows are combined by the source IP address of the attacker, with a maximum of 60 minutes inter-arrival time between the original flows as generated by Argus.

PCA Model Generator: the PCA model generator consists of two components, the initial PCA and the recursive PCA model generators. It provides tools for the analysis of honeypot traffic and for the generation and update of the principal component analysis model.

New Attack Detection Engine: the detection engine works in two modes: a live mode, where traffic flows arrive in real time from the honeypot, and an offline mode, where traffic flows are read from a file.

Detecting New Attacks and Updating the Model

While our main detection criterion is observations with high SPE values, our experiments show that some extreme observations also have high SPE values. Since the aim is to detect only new attacks, traffic is first verified using the $T^2$ statistic:

$$T^2 = n (X_i - \bar{X})^T S^{-1} (X_i - \bar{X}) \qquad (6.16)$$

where n = 1 for testing a single observation. Traffic with a high $T^2$ value is considered extreme, that is, existing traffic with high values across some or all of the variables, and is eliminated. New traffic is first tested using Equation 6.16 and is discarded if its $T^2$ statistic exceeds the predefined threshold (see Section 6.2.3). Figure 6.2 illustrates the detection steps: only attack traffic with a low $T^2$ statistic is retained to further update the model, while traffic with a high SPE value is flagged as a new attack.

[Figure 6.2: Detecting new attacks. New traffic with a high T² value is discarded; otherwise its SPE is computed, and the observation is either flagged as a new attack (high SPE) or accumulated for the next model update.]
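A sketch of this decision flow; the model dictionary fields (mean, cov_inv, std, E, l_res, and the two control limits) are illustrative names, not the thesis implementation:

```python
import numpy as np

def classify_activity(x, model):
    """Decision flow of Figure 6.2: discard extreme traffic, flag genuinely
    new traffic, and keep the rest for the next recursive model update."""
    d = x - model['mean']
    t2 = d @ model['cov_inv'] @ d                # Hotelling T^2, Eq. 6.16 (n = 1)
    if t2 > model['ucl_t2']:
        return 'discard'                         # extreme case of known traffic
    z = (d / model['std']) @ model['E']          # project onto the residual space
    spe = np.sum(z ** 2 / model['l_res'])        # Q statistic, Eq. 6.5
    if spe > model['ucl_spe']:
        return 'new_attack'                      # direction unseen by the model
    return 'accumulate'                          # retained for the recursive update
```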

Model Sensitivity to New Attacks

The residual space is very sensitive to traffic that was not present in the main PCA model during the model building phase; such traffic produces high SPE values. However, once these attacks are included in the model update stage and the model parameters are recalculated, the residual space's sensitivity to them decreases dramatically, eventually approaching zero as the number of inclusions of these attacks in the main model increases.

[Figure 6.3: Residual space sensitivity to new attacks. SPE values of the same attack vector are plotted against the number of times it has been included in the model, together with the SPE limit.]

Figure 6.3 illustrates the sensitivity of the residual space to a single new attack vector (see Section 6.6 for more details of this example). The first projection of the new observation produced a high SPE value (119.4). Then, to test the sensitivity of the residual space, the new attack was included in the training data for the first time and the PCA model was recalculated. Projecting the same attack onto the updated residual space generated a much lower SPE value (14.6). Further inclusions of the same attack in the PCA model resulted in declining SPE values, reaching 0.26 after the tenth inclusion. This indicates that including a new attack in the model for the first time lowers the SPE values of similar attacks; however, as the illustration shows, the attack's SPE value is still high after the first inclusion and only decreases further after several inclusions in the PCA model. In conclusion, this trial supports our decision to adapt the detection model recursively and periodically in a way that accounts for historical data.
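This trial can be reproduced with the Phase I/II helpers sketched in Chapter 5 (build_pca_model and test_observation); the synthetic data and component count below are assumptions made only so the sketch stands alone:

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.standard_normal((200, 10))           # stand-in for historical traffic
x_new = 8 * rng.standard_normal(10)                # a direction unseen in training
model = build_pca_model(X_train, n_significant=4)  # helper from the Chapter 5 sketch

spe_history = []
for _ in range(10):
    spe, _ = test_observation(x_new, model)        # project the same attack again
    spe_history.append(spe)
    X_train = np.vstack([X_train, x_new])          # include the attack once more
    model = build_pca_model(X_train, n_significant=4)
# spe_history is expected to fall steeply after the first inclusion (cf. Figure 6.3)
```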

6.4 A Proof of Concept Implementation

A proof-of-concept system was implemented for aggregating flows and for analyzing, visualizing, and monitoring honeypot activities. The following sections present the different aspects of this system in more detail.

Flow Aggregator

The flow aggregator is a utility, developed in Python, that filters and aggregates basic flows into activity flows by combining basic flows by the source IP address of the attacker, with a maximum of 60 minutes inter-arrival time between basic flows. The flow aggregator works in two modes: offline aggregation for use with traffic log files, and live aggregation to support the on-line monitoring of honeypots. It accepts Argus client flows [2] and processes them to produce activity flows. In live mode, the flow aggregator creates a socket and listens on TCP port 6000 for connections from the monitoring station, where flows are exported in real time as they arrive from the Argus client (a minimal sketch of this live mode is given at the end of Section 6.4). In addition, in offline mode, the flow aggregator generates detailed information to help the system operator interpret the detected observations. The detail file contains additional information for every generated activity flow, including the list of the original basic flows, the start and end times of the attack, and the protocols and ports targeted.

Monitoring Desktop: HoneyEye

The HoneyEye monitoring system provides a complete system for analyzing and detecting attacks in a low-interaction honeypot environment. HoneyEye was developed in the high-level scientific computing environment Matlab [10], selected for its capabilities in matrix manipulation, statistics, and general mathematics. The system implements three models: PCA extraction, residual analysis, and monitoring (see Figure 6.4). The PCA extraction model provides the following functionality:

- it imports activity files into the system;
- it provides a robustification mechanism to clean the log files of extreme activities that might mislead the analysis;
- it performs standard PCA extractions, with the capability to inspect the extracted eigenvalues, eigenvectors, and principal component scores; and
- it provides visual tools to inspect the scree plot and pairs of principal components.

The residual analysis model contributes the following to the analysis:

- it provides a means to inspect and test the automatically generated number of components that constitutes the main PCA model and, consequently, the residual space;
- it provides visual inspection of the SPE values and the threshold;
- it provides the capability to select different threshold values based on three-sigma, chi-square, or user-defined limits; and
- it saves the results in a file to be used by the monitoring model.

The monitoring model has the following capabilities. It:

- imports saved detection parameters, logged traffic to be tested, and an investigation file for further interpretation of the detected attacks;
- provides offline detection of attacks from log files;
- monitors remote honeypots in real time over a TCP/IP network;
- adapts to new changes in traffic, based on the number of packets or the number of days;
- provides a visual projection of the attack that shows the SPE values and the threshold limit; and
- provides an investigation panel that depicts detailed information about the detected attacks.

[Figure 6.4: HoneyEye interface.]

Deployment Scenario: Single Site

The detection system works in two modes: real-time monitoring of traffic, and offline analysis based on logs of collected traffic data. The system architecture consists of two parts: the honeypot sensor and the monitoring station. Figure 6.5 gives an overview of the system components of the proposed deployment architecture. The honeypot sensor is based on the open-source low-interaction honeypot Honeyd [22]. It runs on a single Unix host and emulates three operating systems at the same time. On the same machine, Argus [2] is configured to capture all packets sent to and from the honeypots when attackers interact with its IP addresses. The monitoring station (HoneyEye) provides a complete system for analyzing and detecting attacks in a low-interaction honeypot environment (see Section 6.4.2). It connects to a remote honeypot sensor in real time over a TCP/IP network and detects new attacks.
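As noted in Section 6.4.1, a minimal sketch of the live-mode export over TCP port 6000 follows; the framing of activity flows (one serialized line per flow) and the serialize helper are assumptions, not the thesis implementation:

```python
import socket

PORT = 6000   # the flow aggregator listens here for the monitoring station

def serve_activity_flows(flow_source, serialize):
    """Live mode: accept one connection from HoneyEye and stream activity
    flows to it as they are produced by the aggregator."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(('', PORT))
        srv.listen(1)
        conn, addr = srv.accept()            # monitoring station connects
        with conn:
            for flow in flow_source:         # e.g. aggregated Argus flows
                conn.sendall(serialize(flow).encode() + b'\n')
```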


Dos & DDoS Attack Signatures (note supplied by Steve Tonkovich of CAPTUS NETWORKS) Dos & DDoS Attack Signatures (note supplied by Steve Tonkovich of CAPTUS NETWORKS) Signature based IDS systems use these fingerprints to verify that an attack is taking place. The problem with this method

More information

HOW TO PREVENT DDOS ATTACKS IN A SERVICE PROVIDER ENVIRONMENT

HOW TO PREVENT DDOS ATTACKS IN A SERVICE PROVIDER ENVIRONMENT HOW TO PREVENT DDOS ATTACKS IN A SERVICE PROVIDER ENVIRONMENT The frequency and sophistication of Distributed Denial of Service attacks (DDoS) on the Internet are rapidly increasing. Most of the earliest

More information

Denial of Service attacks: analysis and countermeasures. Marek Ostaszewski

Denial of Service attacks: analysis and countermeasures. Marek Ostaszewski Denial of Service attacks: analysis and countermeasures Marek Ostaszewski DoS - Introduction Denial-of-service attack (DoS attack) is an attempt to make a computer resource unavailable to its intended

More information

Two State Intrusion Detection System Against DDos Attack in Wireless Network

Two State Intrusion Detection System Against DDos Attack in Wireless Network Two State Intrusion Detection System Against DDos Attack in Wireless Network 1 Pintu Vasani, 2 Parikh Dhaval 1 M.E Student, 2 Head of Department (LDCE-CSE) L.D. College of Engineering, Ahmedabad, India.

More information

Countermeasure for Detection of Honeypot Deployment

Countermeasure for Detection of Honeypot Deployment Proceedings of the International Conference on Computer and Communication Engineering 2008 May 13-15, 2008 Kuala Lumpur, Malaysia Countermeasure for Detection of Honeypot Deployment Lai-Ming Shiue 1, Shang-Juh

More information

The Leurre.com Project: Collecting Internet Threats Information using a Worldwide Distributed Honeynet

The Leurre.com Project: Collecting Internet Threats Information using a Worldwide Distributed Honeynet The Leurre.com Project: Collecting Internet Threats Information using a Worldwide Distributed Honeynet C. Leita 1, V.H. Pham 1, O. Thonnard 2, E. Ramirez-Silva 1 F. Pouget 3, E. Kirda 1, M. Dacier 1 Institut

More information

Project Proposal Active Honeypot Systems By William Kilgore University of Advancing Technology. Project Proposal 1

Project Proposal Active Honeypot Systems By William Kilgore University of Advancing Technology. Project Proposal 1 Project Proposal Active Honeypot Systems By William Kilgore University of Advancing Technology Project Proposal 1 Project Proposal 2 Abstract Honeypot systems are readily used by organizations large and

More information

How to build and use a Honeypot. Ralph Edward Sutton, Jr. DTEC 6873 Section 01

How to build and use a Honeypot. Ralph Edward Sutton, Jr. DTEC 6873 Section 01 How to build and use a Honeypot By Ralph Edward Sutton, Jr DTEC 6873 Section 01 Abstract Everybody has gotten hacked one way or another when dealing with computers. When I ran across the idea of a honeypot

More information

Advanced Honeypot Architecture for Network Threats Quantification

Advanced Honeypot Architecture for Network Threats Quantification Advanced Honeypot Architecture for Network Threats Quantification Mr. Susheel George Joseph M.C.A, M.Tech, M.Phil(CS) (Associate Professor, Department of M.C.A, Kristu Jyoti College of Management and Technology,

More information

INCREASE NETWORK VISIBILITY AND REDUCE SECURITY THREATS WITH IMC FLOW ANALYSIS TOOLS

INCREASE NETWORK VISIBILITY AND REDUCE SECURITY THREATS WITH IMC FLOW ANALYSIS TOOLS WHITE PAPER INCREASE NETWORK VISIBILITY AND REDUCE SECURITY THREATS WITH IMC FLOW ANALYSIS TOOLS Network administrators and security teams can gain valuable insight into network health in real-time by

More information

Intrusion Detection Systems and Supporting Tools. Ian Welch NWEN 405 Week 12

Intrusion Detection Systems and Supporting Tools. Ian Welch NWEN 405 Week 12 Intrusion Detection Systems and Supporting Tools Ian Welch NWEN 405 Week 12 IDS CONCEPTS Firewalls. Intrusion detection systems. Anderson publishes paper outlining security problems 1972 DNS created 1984

More information

NETWORK SECURITY (W/LAB) Course Syllabus

NETWORK SECURITY (W/LAB) Course Syllabus 6111 E. Skelly Drive P. O. Box 477200 Tulsa, OK 74147-7200 NETWORK SECURITY (W/LAB) Course Syllabus Course Number: NTWK-0008 OHLAP Credit: Yes OCAS Code: 8131 Course Length: 130 Hours Career Cluster: Information

More information

Detection of illegal gateways in protected networks

Detection of illegal gateways in protected networks Detection of illegal gateways in protected networks Risto Vaarandi and Kārlis Podiņš Cooperative Cyber Defence Centre of Excellence Tallinn, Estonia firstname.lastname@ccdcoe.org 1. Introduction In this

More information

Architecture Overview

Architecture Overview Architecture Overview Design Fundamentals The networks discussed in this paper have some common design fundamentals, including segmentation into modules, which enables network traffic to be isolated and

More information

Overview. Securing TCP/IP. Introduction to TCP/IP (cont d) Introduction to TCP/IP

Overview. Securing TCP/IP. Introduction to TCP/IP (cont d) Introduction to TCP/IP Overview Securing TCP/IP Chapter 6 TCP/IP Open Systems Interconnection Model Anatomy of a Packet Internet Protocol Security (IPSec) Web Security (HTTP over TLS, Secure-HTTP) Lecturer: Pei-yih Ting 1 2

More information

Intrusion Detection and Prevention System (IDPS) Technology- Network Behavior Analysis System (NBAS)

Intrusion Detection and Prevention System (IDPS) Technology- Network Behavior Analysis System (NBAS) ISCA Journal of Engineering Sciences ISCA J. Engineering Sci. Intrusion Detection and Prevention System (IDPS) Technology- Network Behavior Analysis System (NBAS) Abstract Tiwari Nitin, Solanki Rajdeep

More information

Computer Security CS 426 Lecture 36. CS426 Fall 2010/Lecture 36 1

Computer Security CS 426 Lecture 36. CS426 Fall 2010/Lecture 36 1 Computer Security CS 426 Lecture 36 Perimeter Defense and Firewalls CS426 Fall 2010/Lecture 36 1 Announcements There will be a quiz on Wed There will be a guest lecture on Friday, by Prof. Chris Clifton

More information

On-Premises DDoS Mitigation for the Enterprise

On-Premises DDoS Mitigation for the Enterprise On-Premises DDoS Mitigation for the Enterprise FIRST LINE OF DEFENSE Pocket Guide The Challenge There is no doubt that cyber-attacks are growing in complexity and sophistication. As a result, a need has

More information

Intrusion Detection in AlienVault

Intrusion Detection in AlienVault Complete. Simple. Affordable Copyright 2014 AlienVault. All rights reserved. AlienVault, AlienVault Unified Security Management, AlienVault USM, AlienVault Open Threat Exchange, AlienVault OTX, Open Threat

More information

Network Monitoring Tool to Identify Malware Infected Computers

Network Monitoring Tool to Identify Malware Infected Computers Network Monitoring Tool to Identify Malware Infected Computers Navpreet Singh Principal Computer Engineer Computer Centre, Indian Institute of Technology Kanpur, India navi@iitk.ac.in Megha Jain, Payas

More information

Fuzzy Network Profiling for Intrusion Detection

Fuzzy Network Profiling for Intrusion Detection Fuzzy Network Profiling for Intrusion Detection John E. Dickerson (jedicker@iastate.edu) and Julie A. Dickerson (julied@iastate.edu) Electrical and Computer Engineering Department Iowa State University

More information

Security Technology: Firewalls and VPNs

Security Technology: Firewalls and VPNs Security Technology: Firewalls and VPNs 1 Learning Objectives Understand firewall technology and the various approaches to firewall implementation Identify the various approaches to remote and dial-up

More information

How To Prevent Hacker Attacks With Network Behavior Analysis

How To Prevent Hacker Attacks With Network Behavior Analysis E-Guide Signature vs. anomaly-based behavior analysis News of successful network attacks has become so commonplace that they are almost no longer news. Hackers have broken into commercial sites to steal

More information

Application Security Backgrounder

Application Security Backgrounder Essential Intrusion Prevention System (IPS) & DoS Protection Knowledge for IT Managers October 2006 North America Radware Inc. 575 Corporate Dr., Lobby 1 Mahwah, NJ 07430 Tel: (888) 234-5763 International

More information

HONEYPOT SECURITY. February 2008. The Government of the Hong Kong Special Administrative Region

HONEYPOT SECURITY. February 2008. The Government of the Hong Kong Special Administrative Region HONEYPOT SECURITY February 2008 The Government of the Hong Kong Special Administrative Region The contents of this document remain the property of, and may not be reproduced in whole or in part without

More information

Networking for Caribbean Development

Networking for Caribbean Development Networking for Caribbean Development BELIZE NOV 2 NOV 6, 2015 w w w. c a r i b n o g. o r g N E T W O R K I N G F O R C A R I B B E A N D E V E L O P M E N T BELIZE NOV 2 NOV 6, 2015 w w w. c a r i b n

More information

INTRODUCTION TO FIREWALL SECURITY

INTRODUCTION TO FIREWALL SECURITY INTRODUCTION TO FIREWALL SECURITY SESSION 1 Agenda Introduction to Firewalls Types of Firewalls Modes and Deployments Key Features in a Firewall Emerging Trends 2 Printed in USA. What Is a Firewall DMZ

More information

ΕΠΛ 475: Εργαστήριο 9 Firewalls Τοίχοι πυρασφάλειας. University of Cyprus Department of Computer Science

ΕΠΛ 475: Εργαστήριο 9 Firewalls Τοίχοι πυρασφάλειας. University of Cyprus Department of Computer Science ΕΠΛ 475: Εργαστήριο 9 Firewalls Τοίχοι πυρασφάλειας Department of Computer Science Firewalls A firewall is hardware, software, or a combination of both that is used to prevent unauthorized Internet users

More information

A1.1.1.11.1.1.2 1.1.1.3S B

A1.1.1.11.1.1.2 1.1.1.3S B CS Computer 640: Network AdityaAkella Lecture Introduction Networks Security 25 to Security DoS Firewalls and The D-DoS Vulnerabilities Road Ahead Security Attacks Protocol IP ICMP Routing TCP Security

More information

Detecting Threats in Network Security by Analyzing Network Packets using Wireshark

Detecting Threats in Network Security by Analyzing Network Packets using Wireshark 1 st International Conference of Recent Trends in Information and Communication Technologies Detecting Threats in Network Security by Analyzing Network Packets using Wireshark Abdulalem Ali *, Arafat Al-Dhaqm,

More information

Transformation of honeypot raw data into structured data

Transformation of honeypot raw data into structured data Transformation of honeypot raw data into structured data 1 Majed SANAN, Mahmoud RAMMAL 2,Wassim RAMMAL 3 1 Lebanese University, Faculty of Sciences. 2 Lebanese University, Director of center of Research

More information

Introduction... Error! Bookmark not defined. Intrusion detection & prevention principles... Error! Bookmark not defined.

Introduction... Error! Bookmark not defined. Intrusion detection & prevention principles... Error! Bookmark not defined. Contents Introduction... Error! Bookmark not defined. Intrusion detection & prevention principles... Error! Bookmark not defined. Technical OverView... Error! Bookmark not defined. Network Intrusion Detection

More information

How To Protect Your Network From Attack From A Hacker On A University Server

How To Protect Your Network From Attack From A Hacker On A University Server Network Security: A New Perspective NIKSUN Inc. Security: State of the Industry Case Study: Hacker University Questions Dave Supinski VP of Regional Sales Supinski@niksun.com Cell Phone 215-292-4473 www.niksun.com

More information

THE ROLE OF IDS & ADS IN NETWORK SECURITY

THE ROLE OF IDS & ADS IN NETWORK SECURITY THE ROLE OF IDS & ADS IN NETWORK SECURITY The Role of IDS & ADS in Network Security When it comes to security, most networks today are like an egg: hard on the outside, gooey in the middle. Once a hacker

More information

How To Protect A Network From Attack From A Hacker (Hbss)

How To Protect A Network From Attack From A Hacker (Hbss) Leveraging Network Vulnerability Assessment with Incident Response Processes and Procedures DAVID COLE, DIRECTOR IS AUDITS, U.S. HOUSE OF REPRESENTATIVES Assessment Planning Assessment Execution Assessment

More information

CS 356 Lecture 16 Denial of Service. Spring 2013

CS 356 Lecture 16 Denial of Service. Spring 2013 CS 356 Lecture 16 Denial of Service Spring 2013 Review Chapter 1: Basic Concepts and Terminology Chapter 2: Basic Cryptographic Tools Chapter 3 User Authentication Chapter 4 Access Control Lists Chapter

More information

Internet Firewall CSIS 3230. Internet Firewall. Spring 2012 CSIS 4222. net13 1. Firewalls. Stateless Packet Filtering

Internet Firewall CSIS 3230. Internet Firewall. Spring 2012 CSIS 4222. net13 1. Firewalls. Stateless Packet Filtering Internet Firewall CSIS 3230 A combination of hardware and software that isolates an organization s internal network from the Internet at large Ch 8.8: Packet filtering, firewalls, intrusion detection Ch

More information

RAVEN, Network Security and Health for the Enterprise

RAVEN, Network Security and Health for the Enterprise RAVEN, Network Security and Health for the Enterprise The Promia RAVEN is a hardened Security Information and Event Management (SIEM) solution further providing network health, and interactive visualizations

More information

A Review on Network Intrusion Detection System Using Open Source Snort

A Review on Network Intrusion Detection System Using Open Source Snort , pp.61-70 http://dx.doi.org/10.14257/ijdta.2016.9.4.05 A Review on Network Intrusion Detection System Using Open Source Snort Sakshi Sharma and Manish Dixit Department of CSE& IT MITS Gwalior, India Sharmasakshi1009@gmail.com,

More information

2010 White Paper Series. Layer 7 Application Firewalls

2010 White Paper Series. Layer 7 Application Firewalls 2010 White Paper Series Layer 7 Application Firewalls Introduction The firewall, the first line of defense in many network security plans, has existed for decades. The purpose of the firewall is straightforward;

More information

Advancement in Virtualization Based Intrusion Detection System in Cloud Environment

Advancement in Virtualization Based Intrusion Detection System in Cloud Environment Advancement in Virtualization Based Intrusion Detection System in Cloud Environment Jaimin K. Khatri IT Systems and Network Security GTU PG School, Ahmedabad, Gujarat, India Mr. Girish Khilari Senior Consultant,

More information

Fuzzy Network Profiling for Intrusion Detection

Fuzzy Network Profiling for Intrusion Detection Fuzzy Network Profiling for Intrusion Detection John E. Dickerson (jedicker@iastate.edu) and Julie A. Dickerson (julied@iastate.edu) Electrical and Computer Engineering Department Iowa State University

More information

FIREWALLS. Firewall: isolates organization s internal net from larger Internet, allowing some packets to pass, blocking others

FIREWALLS. Firewall: isolates organization s internal net from larger Internet, allowing some packets to pass, blocking others FIREWALLS FIREWALLS Firewall: isolates organization s internal net from larger Internet, allowing some packets to pass, blocking others FIREWALLS: WHY Prevent denial of service attacks: SYN flooding: attacker

More information

SURVEY OF INTRUSION DETECTION SYSTEM

SURVEY OF INTRUSION DETECTION SYSTEM SURVEY OF INTRUSION DETECTION SYSTEM PRAJAPATI VAIBHAVI S. SHARMA DIPIKA V. ASST. PROF. ASST. PROF. MANISH INSTITUTE OF COMPUTER STUDIES MANISH INSTITUTE OF COMPUTER STUDIES VISNAGAR VISNAGAR GUJARAT GUJARAT

More information

DDoS Protection Technology White Paper

DDoS Protection Technology White Paper DDoS Protection Technology White Paper Keywords: DDoS attack, DDoS protection, traffic learning, threshold adjustment, detection and protection Abstract: This white paper describes the classification of

More information

Basic Vulnerability Issues for SIP Security

Basic Vulnerability Issues for SIP Security Introduction Basic Vulnerability Issues for SIP Security By Mark Collier Chief Technology Officer SecureLogix Corporation mark.collier@securelogix.com The Session Initiation Protocol (SIP) is the future

More information

Complete Protection against Evolving DDoS Threats

Complete Protection against Evolving DDoS Threats Complete Protection against Evolving DDoS Threats AhnLab, Inc. Table of Contents Introduction... 2 The Evolution of DDoS Attacks... 2 Typical Protection against DDoS Attacks... 3 Firewalls... 3 Intrusion

More information

White paper. TrusGuard DPX: Complete Protection against Evolving DDoS Threats. AhnLab, Inc.

White paper. TrusGuard DPX: Complete Protection against Evolving DDoS Threats. AhnLab, Inc. TrusGuard DPX: Complete Protection against Evolving DDoS Threats AhnLab, Inc. Table of Contents Introduction... 2 The Evolution of DDoS Attacks... 2 Typical Protection against DDoS Attacks... 3 Firewalls...

More information

LOGIIC Remote Access. Final Public Report. June 2015 1 LOGIIC - APPROVED FOR PUBLIC DISTRIBUTION

LOGIIC Remote Access. Final Public Report. June 2015 1 LOGIIC - APPROVED FOR PUBLIC DISTRIBUTION LOGIIC Remote Access June 2015 Final Public Report Document Title LOGIIC Remote Monitoring Project Public Report Version Version 1.0 Primary Author A. McIntyre (SRI) Distribution Category LOGIIC Approved

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 21 CHAPTER 1 INTRODUCTION 1.1 PREAMBLE Wireless ad-hoc network is an autonomous system of wireless nodes connected by wireless links. Wireless ad-hoc network provides a communication over the shared wireless

More information