Mining Frequency Content of Network Traffic for Intrusion Detection

Mian Zhou and Sheau-Dong Lang
School of Computer Science, and National Center for Forensic Science
University of Central Florida, Orlando, FL 32816
E-mail: {mzhou, lang}@cs.ucf.edu

Abstract

This paper presents a novel network intrusion detection method that searches for frequency patterns within the time series created by network traffic signals. The new strategy is aimed at, but not limited to, detecting DOS and Probe attacks. The detection method is based on the observation that such attacks are most likely driven by scripted code, which often results in periodicity patterns in either the packet streams or the connection arrivals. Thus, by applying Fourier analysis to the time series created by network traffic signals, we can identify whether periodicity patterns exist in the traffic. We demonstrate the effectiveness of this frequency-mining strategy on the synthetic network intrusion data from the DARPA datasets. The experimental results indicate that the proposed intrusion detection strategy is effective in detecting anomalous traffic in large-scale time series data that exhibit patterns over time. Our strategy does not depend on prior knowledge of attack signatures, so it has the potential to supplement signature-based intrusion detection systems (IDS) and firewalls.

Keywords: Network intrusion detection, time series, Fourier transform

1. Introduction

Network-based intrusion detection (NID) focuses on network traffic signals passing through the communication infrastructure in an attempt to stop the attacks before they infect the host systems. One approach to intrusion detection is to look for suspicious patterns in the network traffic signals. The network traffic, when recorded in terms of individual network packets, is a series of time-based events from which various time series can be extracted.
Typically, a network packet consists of the header information and the packet payload. The packet header carries a rich amount of information, including the arrival time, packet length, payload size, protocol, source and destination ports, flags, window size, etc. Based on the header information, various types of time series can be constructed, such as the following:

(1) The rate of packet arrivals (in a unit time).
(2) The inter-arrival time of the packets.
(3) The size of the packet payloads.
(4) The time interval of the initial TCP connection attempt.
(5) The number of distinct IP addresses reached during a time period, etc.

Many network attacks are executed by running prewritten scripts, which automate the processes of attempting connections to various ports, sending packets with fabricated payloads, etc. Based on this observation, our intrusion detection strategy looks for periodicity patterns within the above time series. The first four types of time series given above are particularly relevant to DOS, probe, and password-guessing attacks. The last type may be used to detect traffic patterns when a worm or virus is spreading itself from an infected computer to other computers. All five time series of traffic signals could exhibit certain frequency patterns over time that distinguish the attack traffic from the normal traffic.

We adapted the Fourier analysis technique from signal processing to detect periodic frequency patterns for network intrusion detection. Our experimental results on synthetic network intrusion data from the DARPA datasets indicate that the proposed intrusion detection strategy is effective in detecting anomalous traffic in large-scale time series data that exhibit patterns over time. To reduce the processing time for large traffic datasets, the wavelet transform was applied for dimensionality reduction, similar to the approach used in [1, 2]. The results of applying the wavelet transform are not reported in this paper.
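As a rough illustration of how two of the time series listed above could be derived from packet arrival times, the following minimal Python sketch may help. The function names, the fixed unit time of one second, and the sample packet list are assumptions made for this example only; they are not taken from our preprocessing tools.

```python
from collections import Counter

def inter_arrival_times(timestamps):
    """Gaps (in seconds) between consecutive packet arrival times."""
    ts = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]

def arrival_rates(timestamps, unit):
    """Packet counts in consecutive windows of `unit` seconds,
    starting at the first packet."""
    if not timestamps:
        return []
    ts = sorted(timestamps)
    start = ts[0]
    counts = Counter(int((t - start) // unit) for t in ts)
    n_bins = int((ts[-1] - start) // unit) + 1
    return [counts.get(i, 0) for i in range(n_bins)]

# Hypothetical capture: (arrival time in seconds, payload size in bytes).
packets = [(0.0, 40), (0.5, 40), (1.0, 40), (2.5, 60), (3.0, 40)]
times = [t for t, _ in packets]
sizes = [s for _, s in packets]   # the payload-size series is read off directly

print(inter_arrival_times(times))   # [0.5, 0.5, 1.5, 0.5]
print(arrival_rates(times, 1.0))    # [2, 1, 1, 1]
```

Note that the arrival-rate series depends on the chosen unit time, whereas the inter-arrival series does not.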
The remainder of the paper is organized as follows. Section 2 describes how to use Fourier analysis to search for periodic frequency patterns. Section 3 describes the process of collecting and preprocessing the traffic data. Section 4 reports the experimental results evaluating the effectiveness of the detection strategy using the DARPA TCPDUMP datasets. Section 5 comments on related work and some issues that need further work. Finally, Section 6 concludes the paper.

2. Mining Periodic Frequency Patterns

[Figure 1. Strategy of mining periodic frequency patterns. Flowchart labels: traffic data; parsing the new connection; new connection history; variance analysis (average variance of packet size for each connection); ignore the trusted connections; generate the time-series data; data sequences for each connection; data sequences for all connections; DFT; local frequency pattern; global frequency pattern; compare with a threshold value; report attacks.]

We adapted the techniques of signal processing and data mining on time-series data to design intrusion detection strategies. In our research, we focus on three types of time series derived from the network traffic. The first is the time series of the rates of packet arrivals. The second is the time series of the inter-arrival times of network packets. The last is the time series of the packet payload sizes. Our observation is that network attacks generated by a brute-force approach, such as DOS, probes, and password guessing, create a large number of network packets from scripted code, which often exhibit regular patterns in the traffic data. In addition, such attacks often use fabricated payloads of a constant size. By identifying these patterns, our intrusion detection algorithm attempts to capture anomalous traffic behaviors without prior knowledge of the specific intrusion or attack signatures.

Our intrusion detection strategy could be used in combination with other intrusion detection strategies. For instance, before applying our strategy we could use the technique introduced in [3] to differentiate the clean and dirty traffic, so as to reduce the amount of suspicious traffic that needs to be carefully examined by our IDS. The central idea of [3] is based on the observation that under normal conditions a machine makes a rather low rate of outgoing connections to new or different machines. Instead, it is more likely to connect to the same IP regularly than to different IPs [4, 5], which is referred to as the locality property of machine interaction.
Thus, a connection queue will be built for the protected computers to record all the incoming connections, in which connections from familiar IPs have a higher priority than connections from new IPs, and the connections with a higher priority are passed to their destination without much delay. Our frequency-based detection strategy could be used to watch the unfamiliar connections.

There are three major steps in our detection strategy, as depicted in Figure 1. First, we construct the connection history for all connections from new IPs and record their traffic measures, including the packet size, the inter-arrival time between packets, and the rate of packet arrivals (per unit time) within or among the connections. Second, we apply the discrete Fourier transform (DFT) to the time series data and collect the resulting frequency information. Both the global frequency patterns for all connections in the connection history and the local frequency patterns for individual connections are computed using the DFT. Finally, by identifying the sharp peaks in the spectrum of the DFT results, the algorithm determines whether periodic frequency patterns exist in the network traffic. Details of the algorithm are presented in the following subsections.

2.1 History of IP connections

In our studies, a network connection includes all the network traffic (packets) sent between two connected IP addresses in a certain time period, and has the following properties:

(1) A pair of source and destination IP addresses.
(2) A single protocol, or a set of multiple protocols used at different times during the connection.
(3) A set of consecutive packets within the connection, which are not necessarily sent or received consecutively in time by a machine.

For each protected computer, we use a firewall to control the connection queue.
The firewall passes connections from familiar IPs and maintains the history of recent new connections; that is, connections from new IPs are observed, and their frequency patterns are computed and recorded. For each connection within a connection history, all three types of time series mentioned earlier are generated.

2.2 Types of Frequency Patterns

The generated time series essentially reflects the way in which packets are sent by the attacker. The packet delivery process falls into one of three categories. First, the packets are sent out randomly, that is, the inter-arrival time between two consecutive packets does not conform to any repeated, periodic pattern. Second, the packets are sent out at a constant rate. Third, the packets are not sent out at a constant rate, but a periodic pattern exists in the inter-arrival time. We use a simple example to illustrate these three different patterns:

Random inter-arrival time sequence: 1 23 4 10 1000 45 320 12 33445 6 23
Constant inter-arrival time sequence: 5 5 5 5 5 4.9 5 5 5 5 5
Periodic inter-arrival time sequence: 6 7 3 23 6 7 3 23 6 7 3 23 6 7 3 23
Each number in the above sequences stands for an inter-arrival time between two consecutive packets. The case of a constant inter-arrival time is easy to detect by analyzing either the packet arrival rates or the inter-arrival time. When the inter-arrival time is not constant but periodic, the packet arrival rates will not necessarily produce periodic patterns, since the time unit is an adjustable value that depends on the duration of the connection and the sending speed of the traffic. However, applying the DFT to the series of inter-arrival times is appropriate for capturing the periodicity pattern.

For a series with random inter-arrival times, our intrusion detection strategy will fail. Such traffic could be legitimate or could come from a manually operated attack. Another possibility is that the traffic flow is still an automated attack, but is created in a purely random manner. We simulated attacks using various DOS and probe tools downloaded from the Internet. Some of them use a simple attack strategy, such as flooding a machine by sending packets at a rate based on a user-specified time. Some of them allow the user to specify the scan delay or the time range for an RTT timeout before processing the next packet, which creates a uniformly distributed packet stream. Periodic inter-arrival time patterns appear in the more sophisticated attacks, which loop through the same random process. However, for attacks that generate the entire traffic stream from a long, single random process (i.e., no repeated use of the same random process), detection by purely looking for periodic patterns will fail. Possible remedies are discussed in Section 5.

2.3 Frequency Extraction

The Fourier transform is a well-known technique used in signal processing [6]. The Discrete Fourier Transform (DFT) takes the original time series in the time domain and transforms it into the associated frequency data in the frequency domain.
For a given data sequence $s(n)$, where $n \ge 0$ is a discrete value representing the time, the DFT coefficients $F(k)$ are defined as follows:

$$F(k) = \sum_{n=0}^{N-1} s(n)\, e^{-j 2\pi k n / N}$$

where $0 \le k \le N-1$, and $N$ is the length of the sequence $s(n)$. Expanding the right-hand side yields

$$F(k) = \sum_{n=0}^{N-1} s(n) \cos(2\pi k n / N) \;-\; j \sum_{n=0}^{N-1} s(n) \sin(2\pi k n / N)$$

Using the Fast Fourier Transform (FFT) procedure, the frequency coefficients $F(k)$, $0 \le k \le N-1$, can be computed in $O(N \log N)$ time.

2.4 Global Frequency vs. Local Frequency

As depicted in Figure 1, we create a local time series for each separate connection as well as a time series over multiple connections. Both the local frequency patterns within each single new connection and the global frequency patterns over multiple connections in the connection history list are analyzed. The local frequency patterns can be used to detect attacks that originate from a single source IP and target a single victim IP. However, when an attack sends packets using multiple spoofed IPs as source addresses in an attempt to evade intrusion detection systems, we will see many incoming connections to the target. In fact, these connections are not independent and possibly originate from one source computer, so we search for frequency patterns in the arrivals of those connections.

The reverse of the above phenomenon is also possible. When a single attacker sends out packets to exploit a group of machines, each victim receives only a part of the exploit traffic. In this situation, the firewalls installed on the gateways of the victim LAN will record all the exploit traffic heading to the different victim IPs. Both cases need the global frequency analysis to detect the one-to-many or many-to-one attacks.
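To make the DFT definition of Section 2.3 concrete, the sketch below evaluates F(k) directly from the defining sum and applies it to the random and periodic example sequences of Section 2.2. The direct O(N^2) evaluation is used only for clarity; an FFT routine would give the same coefficients. The `peak_ratio` score is a hypothetical stand-in for "identifying sharp peaks" and is an assumption of this example, not the test used in our experiments.

```python
import cmath

def dft(s):
    """Direct O(N^2) evaluation of F(k) = sum_n s(n) * exp(-j*2*pi*k*n/N)."""
    size = len(s)
    return [
        sum(s[n] * cmath.exp(-2j * cmath.pi * k * n / size) for n in range(size))
        for k in range(size)
    ]

def peak_ratio(s):
    """Largest magnitude among F(1)..F(N//2) divided by their mean.

    Illustrative score only (an assumption of this sketch). F(0) is
    skipped because it only reflects the sum of the sequence.
    """
    if len(s) < 2:
        return 0.0
    mags = [abs(c) for c in dft(s)[1 : len(s) // 2 + 1]]
    mean = sum(mags) / len(mags)
    return max(mags) / mean if mean > 0 else 0.0

# Example inter-arrival sequences from Section 2.2.
random_seq = [1, 23, 4, 10, 1000, 45, 320, 12, 33445, 6, 23]
periodic_seq = [6, 7, 3, 23] * 4

print(round(peak_ratio(random_seq), 2))    # close to 1: nearly flat spectrum
print(round(peak_ratio(periodic_seq), 2))  # well above 1: energy only at k = 4 and k = 8
```

For the 16-sample periodic sequence, whose period is four samples, all coefficients other than multiples of k = 4 vanish, which is what produces the sharp peaks.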
In reality, when connections are established between different pairs of computers, the traffic of one connection is statistically uncorrelated with the traffic of the others, so there should be no frequency patterns across the different connections. Therefore, the local frequency patterns are mainly used to identify the uncoordinated attacks occurring in the different connections.

3. Preparing Input Data

We have two sources of network traffic data. The first is the DARPA TCPDUMP files, which contain intrusion traffic simulating various network attacks [7]. The second is synthetic data captured by the network sniffer tool Ethereal [8], which was installed on our local networks and captured both intrusion traffic and normal traffic. We used the network exploration and auditing tool NMAP [9] to generate the intrusion traffic. For both the TCPDUMP files and the Ethereal files, we used our own pre-processing tools to extract the traffic information. After parsing the files, we have the information on data packet headers, the flag information, the time distribution, etc.

The DARPA traffic datasets normally contain a whole day's traffic each, with a file size of around 500 megabytes. To process a traffic file and generate the time series, we chose a sliding window with a 20-minute span to slice each dataset into small pieces, with an overlap of three minutes between two consecutive windows. At the end of data parsing and extraction, the time-series data generated within a sliding window include the following items:

(1) Global data sequence of packet arrival rates.
(2) Global data sequence of packet inter-arrival time.
(3) Global data sequence of packet size.
(4) Local data sequence of packet arrival rates for each connection.
(5) Local data sequence of inter-arrival time for each connection.
(6) Local data sequence of packet size for each connection.

4. Experimental Studies

We report the results of using the DARPA datasets, which are intended to simulate attacks across the Internet. In the following we describe the experimental results for three different attacks from the DARPA 1999 week 4 and week 5 datasets (abbreviated as L1999w4&w5). The three attacks are ProcessTable, Portsweep, and Dictionary. The first is a DOS attack; the other two are Probe and remote-to-root attacks, respectively. A detailed description of the attacks can be found in [10].

[Figure 2: Frequency patterns on inter-arrival time for each connection in the ProcessTable attack. Panels: Connection 0 through Connection 5.]

[Figure 3: (a) Local frequency patterns on inter-arrival time for the first 9 connections in the Portsweep attack. (b) The global frequency pattern on inter-arrival time of the Portsweep attack, which includes 18 connections.]

4.1 The ProcessTable Attack

The ProcessTable attack is a TCP connection based attack, which may target various network services by launching a huge number of TCP/IP connections to a particular port in a short period. For each incoming TCP/IP connection, the underlying Unix system allocates a new process to handle it. Therefore, it is possible to completely fill a target machine's process table with a large number of network service instantiations, eventually rendering the system lifeless until the attack terminates or is killed by the system administrator [10].

In our studies, we used the DARPA TCPDUMP file with the ProcessTable attack packets, which contains slightly less than 2 minutes of data with a total of 5526 data packets. Our intrusion detection (frequency mining) algorithm constructs the connection history table for the target computer zeno.eyrie.af.mil. The record of each connection includes all the statistical features of the connection and the IP address that identifies the connection. The DFT is then applied to the two time series created: the packet arrival rate series and the inter-arrival time series. The variance of the packet size is also calculated. Based on the DFT coefficients, the frequency patterns of each connection are plotted in Figure 2.
There are a total of 12 connections in the connection history table. Five of them have fewer than 20 packets. We plot the frequency information for six connections that are long enough to bear meaningful DFT analysis. From Figure 2 we can see that the patterns for connections 0 and 1 show distinct frequency peaks, whereas the rest do not. In fact, the first connection is the ProcessTable attack that we expected to find, and as a byproduct, we also detected a Probe attack in connection 1, which probes the ports ranging from 1794 to 2631. The attacker's IP address was identified as hobbes.eyrie.af.mil (172.16.112.20).

4.2 The Portsweep Attack and Dictionary Attack

[Figure 4: Frequency patterns on inter-arrival time for each connection in the Dictionary attack. Panels: Connection 0 through Connection 5.]

The source data for the Portsweep traffic is from the Tuesday outside data of L1999w5. The attacker sends a small number of packets to each of 38 different destinations. The local frequency patterns for each separate connection do not reveal much information, as depicted in Figure 3(a). Since the attacking process for each target lasts a very short time (about 3 seconds) and uses very few packets (around 10 packets), there is no obvious peak in the local frequency patterns, although all the local frequency patterns are very similar to each other. However, the global frequency pattern for all connections originating from the same IP (the attacker) shows an obvious sharp peak, as depicted in Figure 3(b). The variances of the packet size for the separate connections are all zero, except for the connection with 192.168.1.20, which has a variance of 10.63. We identified the IP address 195.73.151.50 as a suspicious attacker.

In addition to the DOS and Probe attacks, other types of attack such as the Dictionary attack, which belongs to the remote-to-root type, also fall into the scripted attack category.
Figure 4 shows the frequency patterns for the Dictionary attack, using data from the Monday TCPDUMP data of L1999w5. There are a total of 11 connections to the victim. There is an obvious frequency pattern for connection 1. For this connection, the variance of the packet size is 0.117088 and the sending speed is very high: 5265 packets were sent out in 834.067566 seconds. We confirmed that this connection is a password-guessing attack, which tried different username/password combinations against the telnet service.
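The packet-size variance and sending rate quoted in Sections 4.1 and 4.2 are simple per-connection statistics. A minimal sketch of how they could be computed follows. The use of the population variance is an assumption of this example (the text does not say whether the population or sample variance was used), and the connection labels and packet sizes are hypothetical.

```python
from statistics import pvariance

def size_variance(packet_sizes):
    """Population variance of the packet sizes on one connection
    (assumption: population rather than sample variance)."""
    return pvariance(packet_sizes) if len(packet_sizes) > 0 else 0.0

def sending_rate(n_packets, duration_seconds):
    """Average number of packets per second over the connection."""
    return n_packets / duration_seconds

# Hypothetical per-connection packet sizes in bytes.
connections = {
    "conn_a": [60, 60, 60, 60, 60, 60],        # constant size, script-like
    "conn_b": [60, 1500, 52, 420, 1500, 64],   # mixed sizes
}
for name, sizes in connections.items():
    print(name, size_variance(sizes))

# Packet count and duration quoted above for the Dictionary attack connection.
print(sending_rate(5265, 834.067566))  # about 6.31 packets per second
```

A variance at or near zero matches the observation in Section 2 that scripted attacks often use fabricated payloads of a constant size.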
| Attack Name | Type | Manual/Auto | Attack Duration | Total Time (DFT) | Frequency found? |
|---|---|---|---|---|---|
| pingofdeath | DOS | Man | 0:00:06 | 2.79400 | Y: packets/unit time |
| Dosnuke | DOS | Auto | 0:16:41 | 10.65500 | Y: inter-arrival time |
| Apache2 | DOS | Man | 0:11:00 | 92.39300 | Y: packets/unit time |
| Syslogd | DOS | Auto | 0:15:01 | 7.56100 | N: no apparent frequency pattern found |
| Neptune | DOS | Auto | 0:06:51 | 2.97500 | Y: inter-arrival time |
| Crashiis | DOS | Auto | 0:00:06 | 16.90700 | N: only two connections to the victim 172.16.112.100, with only 7 and 5 packets |
| Selfping | DOS | Auto | 0:03:03 | 4.56700 | Y: packets/unit time |
| ProcessTable | DOS | Auto | 0:02:02 | 12.98600 | Y: inter-arrival time |
| Sshprocesstable | DOS | Auto | 0:00:08-0:00:50 | 3.75600 | Y: inter-arrival time & packets/unit time |
| Back | DOS | Auto | 0:05:00-0:20:38 | 904.60100 | Y: packets/unit time |
| Udpstorm | DOS | Auto | 0:15:00 | 8.37200 | N: no apparent frequency pattern found |
| mailbomb | DOS | Auto | 0:04:27 | 10.21500 | Y: packets/unit time |
| Dict | RLA | Auto | 0:00:10-0:08:40 | 12.11700 | Y: inter-arrival time |
| Guesstelnet | RLA | Auto | 0:03:19 | 10.61500 | Y: inter-arrival time & packets/unit time |
| Ipsweep | Probes | Auto | 0:00:01*6 | 16.37300 | N: single packet sent to each IP from different source IPs |
| Portsweep | Probes | Auto | 0:04:00 | 10.10500 | Y: inter-arrival time & packets/unit time |
| queso | Probes | Auto | 0:00:01*7 | 2.93400 | Y: inter-arrival time (a large time gap exists in the middle of the attack, though) |
| tcpreset | Probes | Auto | 0:10:38 | 12.17700 | Y: packets/unit time |
| nmap | Probes | Auto | 0:04:24 | 8.54600 | Y: packets/unit time & inter-arrival time |
| teardrop | Probes | Auto | 0:15:01 | 4.76700 | Y: packets/unit time & inter-arrival time |
| satan | Probes | Auto | 0:02:12 | 82.72900 | Y: packets/unit time |
| ntinfoscan | Probes | Auto | 0:16:09 | 26.82800 | N: no apparent frequency pattern found |

Table 2: DOS and Probe Attacks in Lincoln Lab IDS evaluation data, 1999 weeks 4 and 5

4.3 Experimental Results of DOS and Probe Attacks in 1999 DARPA Data

We attempted a thorough study of the DOS and Probe attacks from the L1999w5&w4 datasets.
Since our frequency-based approach is sensitive to the attack duration and to whether an attack is automated, we are not reporting those attacks that are either manual or have a very short attack time. Table 2 lists all the attacks we considered, which include eight probes, twelve DOS attacks, and two Remote-to-Local attacks. The unit time used in the frequency analysis is defined as $\mu = 0.1\,(T/n)$, where $T$ is the duration of a connection or of multiple connections, depending on whether a local or a global frequency is computed, and $n$ is the number of packets.

Out of 22 attacks, our detection algorithm failed on five: Ipsweep, crashiis, syslogd, udpstorm and ntinfoscan. The data sources for the first three attacks are from the Monday L1999w5 data; ntinfoscan is from the Thursday L1999w5 data. Among them, Ipsweep is conducted by sending a single packet to each target IP from different source IPs, which is contrary to our expectation that an attack originates from a single IP. The crashiis attack sent only seven and five packets to the victim IP in two separate connections, which are not long enough to bear frequency analysis. Syslogd, ntinfoscan and Udpstorm are three automated attacks with long attack durations, but we could not identify any frequency patterns in the traffic data.

We were also able to detect certain DOS attacks even though they are manual attacks. The ping of death and apache2 attacks are from the Monday dataset of week five and were labeled as manual attacks. The apache2 attack created a huge amount of traffic: 53931 packets were sent in 603.971863 seconds. The ping of death attack sent out 91 packets in 30.114 seconds. Both showed frequency patterns in local connections. A possible explanation is that the traffic sent out in these attacks is mostly generated automatically, without much human interaction involved.
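As a small worked example of the unit time μ = 0.1 (T / n) defined in this section, the sketch below computes μ for the two manual attacks whose packet counts and durations are quoted above. The helper names are assumptions made for this illustration. Because μ is one tenth of the average time per packet, the duration T always spans about 10n unit-time bins, so the arrival-rate series is roughly ten times longer than the packet count.

```python
def unit_time(duration, n_packets):
    """mu = 0.1 * (T / n): one tenth of the average time per packet."""
    if n_packets <= 0:
        raise ValueError("n_packets must be positive")
    return 0.1 * (duration / n_packets)

def number_of_bins(duration, n_packets):
    """Number of unit-time bins covering the duration: T / mu, i.e. about 10 * n."""
    return round(duration / unit_time(duration, n_packets))

# Packet counts and durations (seconds) quoted in the text.
for name, n, T in [("apache2", 53931, 603.971863), ("ping of death", 91, 30.114)]:
    print(name, unit_time(T, n), number_of_bins(T, n))
```

For apache2 this gives a unit time of roughly 1.1 milliseconds, and for ping of death roughly 33 milliseconds.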
For all these attacks, our detection strategy does not need any prior knowledge of the attack signatures, and the false alarm rate is fairly low. In practice, periodic but legitimate traffic flows are common in the real world. For example, network traffic sent by a router, which broadcasts periodically to the machines on its subnet to look for the shortest routing path, shows frequency patterns. (Details of Internet routing protocols can be found in [11].) The DNS service sometimes creates traffic with frequency patterns. The traffic under the Network Time Protocol for time synchronization certainly creates periodic patterns as well. For most of these legitimate periodic traffic flows, both the sender and the receiver are inside the LAN, and the traffic occurs routinely. Thus, the generated traffic should be filtered out as coming from trusted IPs under the locality strategy mentioned in Section 2, and should not raise false alarms under our detection algorithms. However, intrusion detection becomes a tricky problem when these trusted machines have been compromised and are attempting to infect others.

5. Discussions and Related Work

Developing countermeasures for random exploration is crucial to our work. To begin with, we will study in detail the effects of the TCP protocol on the traffic time series pattern. We concentrate on TCP instead of other protocols because TCP provides connection-oriented stream services, and because of its popularity on the Internet. TCP traffic flows are initialized by a three-way handshake protocol [12]. After a connection is established, the interactive bulk data flow is controlled by the sliding window protocol. Therefore, the TCP-based attacks that exploit the establishment/termination of a TCP flow, and those that exploit the bulk data flow, will have different traffic characteristics.

As a simple example, a TCP SYN attack is not affected by the victim's response time: all the attacker needs to do is send a SYN/FIN packet and, without waiting for any feedback, the next SYN packet can be sent immediately to the target. The traffic-timing patterns at the receiver end should closely match the timing patterns with which the attack tools generate the traffic. However, for a Back attack, which sends many HTTP requests with a URL containing many slashes, the packet arrival process is also affected by the sliding window protocol. Since the sender machine has to wait for the acknowledgement of the last group of packets before it can process the next packet, the traffic flow is closely related to the Round Trip Time (RTT) [13].

For the first type of attack, when the random exploit mentioned in Section 2.2 is used, a possible solution is to study the timing characteristics of legitimate TCP connections; e.g., the inter-arrival times of TCP connections approximately follow a Weibull distribution [14, 15]. The manipulated random arrivals are expected to deviate from the Weibull distribution or the exponential distribution. For the second type of attack, analyzing the power spectral density of the traffic signal within a TCP connection would be appropriate for differentiating normal traffic from traffic mixed with an attack [13]. Both approaches, nevertheless, require an analysis of traffic that lasts a relatively long duration.

6. Conclusions and Future Work

In this paper, we proposed a time-series frequency mining technique for detecting automated, scripted network attacks, which typically exhibit frequency patterns over time. The technique is based on the sound theory of Fourier analysis from signal processing research.
We used the DARPA datasets in our simulation studies; the experimental results demonstrated that the frequency-based intrusion detection algorithm is effective in, but not limited to, detecting the DOS and probe attacks that typically run from pre-written scripts and have a relatively long duration. Some limitations of our frequency-based intrusion detection approach include its sensitivity to the attack's duration and to the degree to which the attack is automated. Also, the computation time of the algorithm may place a high demand on the detection hardware. We have yet to apply the frequency mining technique to live network traffic signals in real time. Another piece of future work is to integrate our intrusion detection (frequency mining) strategy into a firewall device and evaluate its effectiveness in real network environments against other strategies that are based on the temporal properties of network traffic.

References

[1] R. Agrawal, C. Faloutsos, A. Swami, Efficient Similarity Search in Sequence Databases, Proceedings of the 4th Conference on Foundations of Data Organization and Algorithms, 1993.
[2] E. Keogh, K. Chakrabarti, M. Pazzani, S. Mehrotra, Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases, Journal of Knowledge and Information Systems, 2000.
[3] M. Williamson, Throttling Viruses: Restricting Propagation to Defeat Malicious Mobile Code, Proceedings of 18th Annual Computer Security Applications Conference, 2002.
[4] L. Heberlein, G. Dias, K. Levitt, B. Mukherjee, J. Wood, and D. Wolber, A Network Security Monitor, Proceedings of IEEE Symposium on Security and Privacy, 1990.
[5] S. Hofmeyr, An Immunological Model of Distributed Detection and its Application to Computer Security, Ph.D. thesis, Department of Computer Science, University of New Mexico, Apr. 1999.
[6] E. Brigham, The Fast Fourier Transform, Prentice-Hall, 1974.
[7] DARPA Intrusion Detection Evaluation project, at http://www.ll.mit.edu/ist/ideval/data/1999/1999_data_index.html.
[8] G. Combs, Ethereal Network Protocol Analyzer, http://www.ethereal.com/
[9] NMAP, at http://www.insecure.org.
[10] K. Kendall, A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems, Master's Thesis, Massachusetts Institute of Technology, 1998.
[11] J. Slone, editor, Handbook of Local Area Networks, CRC Press LLC, 1999.
[12] W. R. Stevens, TCP/IP Illustrated, Volume 1, Addison-Wesley, 1994.
[13] C. Cheng, H. Kung, K. Tan, Use of Spectral Analysis in Defense Against DOS Attacks, Proceedings of IEEE GLOBECOM 2002.
[14] A. Feldmann, Characteristics of TCP Connection Arrivals, Technical Report, AT&T, 1998.
[15] W. S. Cleveland, D. Lin, D. Sun, IP Packet Generation: Statistical Models for TCP Start Times Based on Connection-Rate Superposition, Proceedings of SIGMETRICS 2000.