Data Mining for Network Intrusion Detection: A Comparison of Alternative Methods *

Decision Sciences, Volume 32, Number 4, Fall 2001. Printed in the U.S.A.

Dan Zhu and G. Premkumar
Department of Logistics, Operations and MIS, Iowa State University, Ames, IA 50011, dzhu@iastate.edu, prem@iastate.edu

Xiaoning Zhang
Tellabs Operations, Inc., 4951 Indiana Avenue, Lisle, IL 60532, mzhang@tellabs.com

Chao-Hsien Chu
School of Information Sciences and Technology, Pennsylvania State University, University Park, PA 16802, chu@ist.psu.edu

ABSTRACT

Intrusion detection systems help network administrators prepare for and deal with network security attacks. These systems collect information from a variety of systems and network sources, and analyze it for signs of intrusion and misuse. A variety of techniques have been employed for analysis, ranging from traditional statistical methods to new data mining approaches. In this study the performance of three data mining methods in detecting network intrusion is examined. An experimental design (3 × 2 × 2) is created to evaluate the impact of three data mining methods, two data representation formats, and two data proportion schemes on the classification accuracy of intrusion detection systems. The results indicate that data mining methods and data proportion have a significant impact on classification accuracy. Within data mining methods, rough sets provide better accuracy, followed by neural networks and inductive learning. Balanced data proportion performs better than unbalanced data proportion. There are no major differences in performance between binary and integer data representation.

Subject Areas: Data Mining, Inductive Learning, Intrusion Detection, Network Security, Neural Networks, Rough Sets, and Telecommunications.

INTRODUCTION

Information technology has become a key component to support critical infrastructure services in various sectors of our society.
*This research is supported in part by the Chinese National Science Foundation, grant number
Corresponding Author.

In an effort to share information and streamline operations, organizations are creating complex networked systems and opening their networks to customers, suppliers, and other business partners (Chin, 1999). While most users of these networks are legitimate users, an open network exposes the network to illegitimate access and use. Increased network complexity, greater access, and a growing emphasis on the Internet have made network security a major concern for organizations. The number of computer security breaches has risen significantly in the last three years. In February 2000, several major web sites including Yahoo, Amazon, E-Bay, Datek, and E-Trade were shut down due to denial-of-service attacks on their web servers. The U.S. General Accounting Office (GAO) disclosed that approximately 250,000 break-ins into Federal computer systems were attempted in one year, and 64% of these attacks were successful (Durst, Champion, Witten, Miller, & Spagnuolo, 1999). Worse yet, the number of attacks is doubling every year. Based on previous studies, the GAO estimates that only 1-4% of these attacks will be detected, and only about 1% will be reported.

While traditional approaches to network security have focused on prevention, network intrusion detection has become increasingly important in recent years to enable firms to reduce undetected intrusion. Typically, network intrusion is detected by examining the data trail left by users and searching for abnormal user behavior. Data mining has become a very useful technique to reduce information overload and improve decision making by extracting and refining useful knowledge through a process of searching for relationships and patterns from the extensive data collected by organizations. The extracted information is used to predict, classify, model, and summarize the data being mined (Fayyad, Piatetsky-Shapiro, & Smyth, 1996).
Data mining technologies, such as rule induction, neural networks, genetic algorithms, fuzzy logic, and rough sets, are used for classification and pattern recognition in many industries (Tam & Kiang, 1992; Chu & Widjaja, 1994; Bigus, 1996; Desai, Crook, & Overstreet, 1996; Zhu & Padman, 1997). They have been extensively used in discriminating normal from abnormal behavior in a variety of contexts (Chung & Tam, 1993; Spangler, May, & Vargas, 1999; Fox, Henning, Reed, & Simonian, 1990; Fanning & Cogger, 1998; Mé, 1998). In recent years data mining techniques have been successfully used in the context of network intrusion detection (Hofmeyr, Forrest, & Somayaji, 1998; Hosmer, 1995; Lee & Stolfo, 2000). Studies have compared the performance of various data mining methods, but the results have been conflicting (Sen & Gibbs, 1994; Sung, Chang, & Lee, 1999; Spangler et al., 1999). Identifying the appropriate method is important in network intrusion detection, since performance in terms of detection accuracy, false alarm rate, and detection time becomes critical for near real-time network monitoring. Very few studies have examined the relative performance of various methods in the context of intrusion detection (Warrender, Forrest, & Pearlmutter, 1999). This study addresses this research gap.

The primary objective of this paper is to compare the detection accuracy of three data mining methods: neural networks, inductive learning, and rough sets. Comparing multiple data mining methods will provide us with significant insights into selecting the appropriate model for detecting intrusions. This research, therefore, will have a tremendous impact on electronic commerce and information security.

In the next section, we provide the research background on data mining and network security, the two areas that form the foundation for this study. Following that, we introduce the three data mining methods used in this study. We then describe the experiments, including the network security problem, the data creation and representation, and the experimental design, followed by the results of the study. The final section discusses the results and their implications for future research.

BACKGROUND

Network Security

There are two broad types of techniques for network security: protection and detection (Lunt, 1993). Protection techniques are designed to guard hardware, software, and user data against threats from both outsiders and malicious insiders. A common protection device is a firewall that sets up a barrier at the point of connection between the external and the corporate internal networks, and ensures that only valid data are allowed to pass through it. However, firewalls are not foolproof. They require accurate configuration of numerous and confusing access control lists, and continuous updating to allow access to new network services and to keep up with changing security policies. Even properly configured firewalls are known to have weak spots. Also, firewalls cannot prevent attacks from inside the network, which is also a frequent source of break-ins. In addition to firewalls, operating systems provide user authentication through passwords and multilevel access control to information. Unfortunately, most mechanisms are powerless against misbehavior by legitimate users who perform unauthorized actions.

Intrusion detection systems (IDS) and vulnerability assessment systems (also known as scanners) are good complements for network security. Vulnerability assessment systems, such as SATAN and COPS (Farmer & Spafford, 1990), perform rigorous examinations of systems to identify weaknesses that might allow security violations.
These products inspect system configuration files for problematic settings, system password files for weak passwords, and other system objects for security policy violations. Although these systems cannot reliably detect an attack in progress, they can determine the possible weak points for attack. Intrusion detection systems (IDS) collect information from a variety of systems and network sources, and then analyze the information for signs of intrusion and misuse. They can be host-based systems or network-based systems (Lippmann, Haines, Fried, Korba, & Das, 2000). The major functions performed by IDS are: (1) monitoring and analyzing user and system activity, (2) assessing the integrity of critical system and data files, (3) recognizing activity patterns reflecting known attacks, (4) responding automatically to detected activity, and (5) reporting the outcome of the detection process. The success of an intrusion detection system can be characterized by both false alarm rates and detection efficiency (Stillerman, Marceau, & Stillman, 1999). Because the audit services record the occurrence of all security-relevant events, which result in an enormous quantity of audit data, using the correct method to analyze appropriate system data becomes an important issue in having high detection efficiency and a low level of false alarm rate.

Intrusion detection can be broadly divided into two categories: misuse detection and anomaly detection. Misuse detection systems detect attacks based on well-known vulnerabilities and intrusions stored in a database (a.k.a. signatures), while anomaly detection systems detect deviations in activity from normal profiles.

Misuse detection systems use various techniques including rule-based expert systems, model-based reasoning systems, state transition analysis, genetic algorithms, fuzzy logic, and keystroke monitoring. Rule-based expert systems have been used to encode knowledge about vulnerabilities and past intrusions (Snapp & Smaha, 1992; Porras & Valdes, 1998). This approach has been extensively used in commercial systems. Model-based reasoning systems identify intrusions by comparing user actions with a database of attack scenarios specified in terms of sequences of user behavior (Garvey & Lunt, 1991). State transition analysis is a method in which penetration is viewed as a sequence of actions that take the system from the initial state prior to an attack to the final compromised state after the attack (Porras, 1992). The system converts penetration scenarios into state transition diagrams where successive states are connected by arcs that represent the events required for changing the state. Fuzzy logic is a set of concepts, techniques, and theorems designed to handle vagueness and imprecision. RETISS (Carrettoni, Castano, Martella, & Samarati, 1991) uses fuzzy logic to evaluate the probability of a given threat by using knowledge about all the possible attempts against security in the target system. Keystroke monitoring is based on pattern matching against specific keystroke sequences, but has not been extensively used. A major shortcoming of misuse detection is that it is only based on existing knowledge of attacks and vulnerabilities.
The rules and logic have to be continuously updated as new forms of attacks are identified.

Anomaly detection systems detect deviations from normal behavior and, based on a threshold value, determine whether the behavior is normal or abnormal. There are various approaches for anomaly detection including statistical analysis, sequence analysis, neural networks, machine learning, and artificial immune systems. In statistical analysis, profiles are created for various system objects (users, files, directories, and devices) using various attributes of normal use (number of accesses, time of day, number of logon failures), and intrusions are detected if the values fall outside the normal range. EMERALD (Porras & Valdes, 1998) and NIDES (Lunt et al., 1992) are examples of this approach. Statistical approaches provide well-researched, robust procedures, but have their shortcomings. They are insensitive to the order of occurrence of events, and it is possible to train the system to consider abnormal behavior as normal behavior over a period of time. Sequence analysis addresses some of those shortcomings by examining the sequence of activities over a fixed window size and comparing normal data with test data to identify anomalous behavior. The unit of analysis is a sequence of system calls rather than individual calls. Hofmeyr et al. (1998) used lookahead pairs (time-delay embedding, TIDE) and fixed-length sequences of system calls (sequence time-delay embedding, STIDE) to set up a normal profile, and compared them with actual system traces using statistical indexes such as local frame frequency and Hamming distance. Neural network approaches train the network on both normal and abnormal data sets and then apply it to actual system traces. It is a robust approach with limited assumptions on data distribution, and automatically

accounts for correlations between various input measures. However, training neural nets is a time-consuming task. Many studies have explored the use of neural networks for intrusion detection (Bonifacio, Causian, Carvalho, & Moreira, 1997; Debar, Becker, & Siboni, 1992; Fox et al., 1990). Machine learning has been extensively used in pattern recognition and optimization problems. Lee, Stolfo, and Chan (1997) used a machine learning tool called RIPPER for intrusion detection. The system extracts a collection of decision rules from the information provided using an inductive learning algorithm. Forrest, Hofmeyr, Somayaji, and Longstaff (1996), Forrest, Jovornik, Smith, and Perelson (1993), and Kim and Bentley (1999) proposed a unique artificial immune system, simulating human immunology, to detect intrusions. Their primary premise was that a security system, much like the human immune system, should be able to protect itself from unauthorized intruders.

Comparison of Data Mining Methods

Data mining methods search through a database using specialized algorithms to identify general patterns that are useful in classifying individual observations and in making reasoned predictions about outcomes (Fayyad et al., 1996). A variety of algorithms are used including statistical analysis, multidimensional analysis, neural networks, expert systems, fuzzy logic, rough set theory, intelligent agents, genetic algorithms, machine learning, data visualization, and inductive learning or decision trees (Chung & Gray, 1999; Berry & Linoff, 1997). Each method uses a different search algorithm for searching, extracting, and exploring different kinds of knowledge. Chen, Han, and Yu (1996) classified the techniques for knowledge discovery into six categories: (1) mining of association rules, (2) data generation and summarization, (3) classification, (4) data clustering, (5) pattern-based similarity search, and (6) mining path traversal patterns.
Prior research on comparison of data mining methods in different domains has provided mixed results. Many studies have compared the performance of these methods in the context of bankruptcy prediction, but the results have been conflicting. For example, while Tam and Kiang (1992) reported that backpropagation neural nets performed better than discriminant analysis, logit analysis, k-nearest neighbor, and ID3, Weiss and Kapouleas (1989) found the opposite: Inductive learning performed better than neural networks, discriminant analysis, and other statistical methods. Messier and Hansen (1998) and Sung et al. (1999) found that inductive learning outperformed discriminant analysis in bankruptcy prediction. In other contexts, Fanning and Cogger (1998) found that neural networks performed better than traditional statistical methods in identifying fraudulent financial statements. In contrast, Sasisekharan, Seshadri, and Weiss (1994) found that inductive learning performed better than neural networks, discriminant analysis, and nearest neighbor in the network performance field. Chung and Tam (1993) compared three data mining methods across five managerial tasks in construction project assessment and concluded that performance was generally task dependent, although neural networks tended to perform better across task domains. Sen and Gibbs (1994) compared neural networks and logistic regression for analyzing corporate takeover and found little difference in performance among them.

There has been very limited research comparing the performance of various data mining methods in the intrusion detection domain. Recently, Warrender et al. (1999) examined the performance of four different methods on a suite of data sets consisting of different types of programs and intrusion techniques. Hidden Markov models (HMM) provided better accuracy but at a higher computational cost. They found that no single method consistently gave the best results on all programs, and the results between programs varied more than the results between methods. Hofmeyr et al. (1998) and Hofmeyr and Forrest (1999) conducted a series of experiments evaluating various intrusion detection methods using data from sequences of system calls. They found that a short sequence of system calls could detect some common sources of anomalous behavior in some Unix programs.

DATA MINING METHODS

Three data mining methods were used in this study: neural networks, inductive learning, and rough sets. We chose these three methods based on prior research and relevance to our problem context. Two of the three methods, neural networks and inductive learning, have been used in prior studies on intrusion detection. Neural networks have been widely used for data mining and have also been found to be effective in intrusion detection (Lippman & Cunningham, 2000; Bonifacio et al., 1997; Debar et al., 1992; Fox et al., 1990). Inductive learning systems have recently been used in intrusion detection with much success (Lee, Stolfo, & Mok, 1998). Rough sets have been very successful as a data mining technique in many fields including medicine, business, market research, conflict analysis, and other areas. Prior studies on rough sets have found that they consistently performed better than statistical approaches (Dimitras, Slowinski, Susmaga, & Zopounidis, 1999). However, there has been very little research on the use of rough sets for intrusion detection systems.
Given that the rough set technique is useful for data reduction, data classification, pattern discovery, and other data mining applications, it should be well suited for intrusion detection. Using rough sets will provide a unique contribution to research on intrusion detection, as it is a new method with limited prior research in this context. Hence, the rough set technique was included as the third method in our study.

Neural Networks

Neural networks were first inspired by an attempt to mimic the neural functions of the human brain (Rumelhart, Hinton, & Williams, 1986). They are powerful prediction and classification tools, and provide new opportunities for solving difficult problems that have been traditionally modeled using statistical approaches. Among the numerous neural networks that have been proposed, backpropagation networks, as shown in Figure 1, are probably the most popular and widely used. As shown, the network consists of several components: (1) a set of neurons or processing units, arranged in three layers (input, hidden, and output), that receive and send signals from an outside environment or other neurons in the network; (2) connectivity, which shows the interactivity between neurons; (3) propagation rules, which aggregate input signals from other neurons; (4) activation/transfer functions, which convert the aggregated inputs to output to be sent to other connected

neurons; and (5) learning algorithms, which update the patterns and strength of connectivity.

Figure 1: Backpropagation neural network architecture. [Diagram: input neurons (i) receive inputs $a_1, a_2, \ldots, a_n$; hidden neurons (j), connected through weights $w_{j1}, w_{j2}, \ldots, w_{jn}$, apply the propagation rule $S_j = \sum_{i=1}^{n} a_i w_{ji}$ and the activation/transfer function $O_j = f_j(a_j, s_j)$; output neurons (k), connected through weights $w_{1j}, w_{2j}, \ldots, w_{mj}$, produce computed outputs $O_1, O_2, \ldots, O_m$, which are compared with desired outputs $a_1, a_2, \ldots, a_m$ to give the errors $(a_1 - O_1), (a_2 - O_2), \ldots, (a_m - O_m)$ used for learning.]

Typically, the network starts with a random set of weights, $w_{ij}$, and adjusts the weights each time an input-output pair produces an error. This process is called learning. During the training period, various classes of training data are fed into the network. Activation flows from the input layer, through the hidden layer, and then to the output layer. Each neuron receives as input the outputs of all neurons from the previous layer. After input data is applied as a stimulus to the input layer of the network, it is propagated through neurons in each upper layer until an output, $O_k$, is generated. The error signals $(a_k - O_k)$ are then transmitted backward from the output layer to the middle layer. This process repeats layer by layer and, based on the error signal received, connection weights are then updated to cause the network to converge toward a stable state. A detailed description of the process can be found in Rumelhart et al. (1986).

Overall, neural networks are comparable to their statistical counterparts. For real-world problems with high nonlinearity and short memory dynamics, neural networks usually perform better at prediction and classification accuracy. Furthermore, neural network models are more robust, more easily adaptive to a changing environment, and less sensitive to changes in sample size, number of variables, and data distribution. They work well when the form of the mapping function is unknown (Sun, Wang, & Zhu, 1997).
Due to their fault tolerance and adaptability to noisy data, neural networks are being used in a growing number of industrial and research applications including pattern recognition in engineering, control, manufacturing, and financial investment (Tam & Kiang, 1992; Chu & Widjaja, 1994; Desai et al., 1996; Zhu & Padman, 1997).
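
The learning cycle described above (forward propagation, error computation at the output layer, and backward weight updates) can be sketched in a few lines of Python. This is only an illustrative sketch, not the network used in the study: the sigmoid transfer function, the bias inputs, the learning rate of 0.5, the class name `BackpropNet`, and the toy OR data are all assumptions introduced here.

```python
import math
import random


def sigmoid(s):
    """Logistic transfer function (an assumed choice for f)."""
    return 1.0 / (1.0 + math.exp(-s))


class BackpropNet:
    """Three-layer network (input, hidden, output) with per-pair updates."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Random initial weights; the extra column is an assumed bias input.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, inputs):
        """Propagate activation from the input layer to the output layer."""
        a = list(inputs) + [1.0]
        h = [sigmoid(sum(w * x for w, x in zip(row, a)))
             for row in self.w_hidden] + [1.0]
        out = [sigmoid(sum(w * x for w, x in zip(row, h)))
               for row in self.w_out]
        return a, h, out

    def predict(self, inputs):
        return self.forward(inputs)[2]

    def train_pair(self, inputs, desired, rate=0.5):
        """One learning step for a single input-output pair."""
        a, h, out = self.forward(inputs)
        # Error signal at the output layer: (desired - computed) * f'(s).
        d_out = [(t - o) * o * (1.0 - o) for t, o in zip(desired, out)]
        # Error signal sent backward to the hidden layer (bias unit excluded).
        d_hid = [h[j] * (1.0 - h[j])
                 * sum(d * row[j] for d, row in zip(d_out, self.w_out))
                 for j in range(len(self.w_hidden))]
        for d, row in zip(d_out, self.w_out):
            for j in range(len(row)):
                row[j] += rate * d * h[j]
        for d, row in zip(d_hid, self.w_hidden):
            for i in range(len(row)):
                row[i] += rate * d * a[i]


def total_error(net, data):
    """Sum of squared differences between desired and computed outputs."""
    return sum((t - o) ** 2
               for inputs, desired in data
               for t, o in zip(desired, net.predict(inputs)))


def train(net, data, epochs=2000, rate=0.5):
    for _ in range(epochs):
        for inputs, desired in data:
            net.train_pair(inputs, desired, rate)


# Toy training pairs (logical OR), purely for illustration.
OR_DATA = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [1])]
```

Calling `train(BackpropNet(2, 3, 1), OR_DATA)` and comparing `total_error` before and after training shows the error shrinking as the weights settle, which is the convergence behavior the text describes.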

Inductive Learning

Inductive learning attempts to induce general concepts from examples by creating a decision-tree-like knowledge structure. Each node in the decision tree is labeled with an attribute, each edge with an attribute value, and each leaf with a class. The ID3 algorithm (Quinlan, 1984) and its descendants, such as C4.5 (Quinlan, 1993), are simple yet powerful algorithms for learning from examples. ID3 performs a top-down heuristic search through a problem space and uses information gain as the criterion for selecting the branching attribute of a node. Let the node contain a set $T$ of cases, with $|C_j|$ of the cases belonging to the predefined class $C_j$. The information needed for classification in the current node is:

$$\operatorname{inf}(T) = -\sum_{j} \frac{|C_j|}{|T|} \log_2 \frac{|C_j|}{|T|} \tag{1}$$

The value measures the average amount of information needed to identify the class of a case. Assume that using attribute $X$ as the branching attribute will divide the cases into $n$ subsets. Let $T_i$ denote the set of cases in subset $i$. The information required for subset $i$ is $\operatorname{inf}(T_i)$. Thus, the expected information required after choosing attribute $X$ as the branching attribute is the weighted average of the subtree information:

$$\operatorname{inf}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \operatorname{inf}(T_i) \tag{2}$$

Thus, the information gain will be:

$$\operatorname{gain}(X) = \operatorname{inf}(T) - \operatorname{inf}_X(T) \tag{3}$$

After the branching attribute is selected, the training cases are divided by the different values of the branching attribute. If all examples in one branch belong to the same class, this branch becomes a leaf labeled with that class. If all branches are labeled with a class, the algorithm terminates. ID3 uses the chi-square test to avoid overfitting due to noise. If the $\chi^2$ value is lower than a threshold, then the attribute will not be used. This avoids creating unnecessary branches and complicating the tree. The use of information gain in ID3 has a serious deficiency: it favors tests with many outcomes.
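
As a small illustration of Equations (1) to (3), and of the bias toward many-valued attributes, the following Python sketch computes information gain on a four-case toy table. The attribute names `flag` and `record_id` and the data are hypothetical, chosen only to show that an attribute with a distinct value per case obtains the maximum possible gain even though it generalizes nothing.

```python
import math
from collections import Counter


def info(labels):
    """Equation (1): average information needed to identify a case's class."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def info_after_split(rows, labels, attr):
    """Equation (2): weighted average information of the subsets T_i."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    total = len(labels)
    return sum(len(subset) / total * info(subset)
               for subset in subsets.values())


def gain(rows, labels, attr):
    """Equation (3): information gain of branching on attribute attr."""
    return info(labels) - info_after_split(rows, labels, attr)


# Hypothetical toy cases: 'record_id' is unique per case, 'flag' is not.
ROWS = [
    {"flag": "a", "record_id": 1},
    {"flag": "a", "record_id": 2},
    {"flag": "b", "record_id": 3},
    {"flag": "a", "record_id": 4},
]
LABELS = ["normal", "normal", "abnormal", "abnormal"]

for attribute in ("flag", "record_id"):
    print(attribute, round(gain(ROWS, LABELS, attribute), 4))
```

Here `record_id` splits the cases into four pure single-case subsets, so its gain equals the full information of the node, while the more meaningful `flag` attribute scores lower.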
C4.5 improves this by using a gain ratio:

$$\text{gain ratio}(X) = \frac{\operatorname{gain}(X)}{\text{split inf}_X(T)}, \tag{4}$$

where

$$\text{split inf}_X(T) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}. \tag{5}$$

The attribute with the maximum value of gain ratio($X$) is selected as the branching attribute. In addition, to avoid overfitting, C4.5 does not use the $\chi^2$ test. Instead, it allows the tree to grow and prunes the unnecessary branches later. A detailed discussion of the algorithm can be found in Quinlan (1993).

The C4.5 algorithm is able to generate a decision tree based on the data samples. It constructs classification rules in the form of a decision tree, recursively starting at the root. At each node, attribute $a_i$ is selected to split the training data into examples where $a_i = 0$ or $a_i = 1$. The algorithm is then invoked recursively on the two subsets of training data until all examples in one node belong to the same class. At this point, a leaf node is created and labeled as the expected value of the categorical attributes for the records described by the path from the root to that leaf. C4.5 has been tested in many domains and has been demonstrated to be a good classification model in machine learning.

Rough Sets

Rough sets use a mathematical approach to extract knowledge from imprecise and uncertain data. Rough set theory was introduced by Pawlak (1982) in the early 1980s and was motivated by practical needs in concept formation (Pawlak, Grzymala-Busse, Slowinski, & Ziarko, 1995). A brief tutorial on rough sets is provided in the Appendix. The essence of rough set theory is that objects may be indiscernible in terms of the values of their attributes. A rough set is a set of objects that cannot be precisely characterized based on a set of available attributes. In this case, a pair of lower and upper approximations replaces any vague concept in the set. These two approximations are two basic operations in rough set theory. Rough sets can identify and characterize non-deterministic systems and incorporate probabilistic information in decision making.
Rough set theory is characterized by its knowledge representation system (KRS), indiscernible relations, approximation of sets, dependency of attributes, reduction of attributes, and decision rules. Let $S = \langle U, Q, V, f \rangle$ be a KRS, where

$U$ = a non-empty, finite set of objects, the universe of data; for example, $U = \{u_1, u_2, \ldots, u_n\}$;
$Q$ = the set of attributes, including a non-empty set of condition attributes $C$ and a non-empty set of decision attributes $D$, where $Q = C \cup D$ and $C \cap D = \emptyset$;
$V = \bigcup_{q \in Q} V_q$, where for each $q \in Q$, $V_q$ is the domain of attribute $q$, and the elements of $V_q$ are called values of attribute $q$;
$f$ = the information function that assigns a unique value of attribute $q$ to each object $u_i \in U$.

Suppose $P$ is a non-empty subset of $Q$, and $u_i$ and $u_j$ are members of $U$. We can associate an approximation space in $S$ by defining a binary indiscernible relation as follows:

$$\mathrm{IND}(P) = \{ (u_i, u_j) \in U \times U : \forall q \in P,\ f(u_i, q) = f(u_j, q) \}.$$

We say that $u_i$ and $u_j$ are indiscernible or equivalent by a set of condition attributes $P$ in $S$ if and only if $f(u_i, q) = f(u_j, q)$ for all $q \in P$. This indiscernible relation partitions $U$ into several elementary sets. Each elementary set in $\mathrm{IND}(P)$ consists of a group of objects that have the same attribute values; thus, $u_i$ and $u_j$ are in one elementary set in terms of the attribute subset $P$.

Based on the concept of the indiscernible relation, a universe $U$ can be divided into several elementary sets by any subset of the attributes $Q$. Suppose $X$ is a non-empty subset of $C$. $U$ is divided into $A = \{A_1, A_2, \ldots, A_i\}$ in terms of $X$, and divided into $D = \{D_1, D_2, \ldots, D_j\}$ in terms of $D$. The lower approximation of set $D_n$, denoted by $\underline{X}D_n$, is the union of all $A_m$ in the positive regions (the elementary sets wholly contained in $D_n$). The upper approximation of set $D_n$, denoted by $\overline{X}D_n$, is the union of all $A_m$ in the positive regions and boundary regions (the elementary sets that overlap $D_n$).

The dependency of attributes is the relationship between condition attributes $C$ and decision attributes $D$. Analysis of dependency is used to determine whether $D$ can be characterized by the values of $C$. It is of primary importance in rough sets to discover data regularities for deriving rules. The dependency of the decision attribute ($D$) on the condition attribute ($C$) equals the ratio of the number of objects in the positive regions to the number of objects in the universe $U$. It can range from zero to one.

Another important issue is the identification and elimination of redundant conditions. The objective is to find a subset of attributes that have the same discriminating power as the set of original attributes without losing any essential information.
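
The elementary sets, the lower and upper approximations, and the dependency ratio defined above can be made concrete with a short Python sketch. It is a minimal illustration under assumptions introduced here: the six-object table, the attribute names `c1`, `c2`, and `d`, and the function names are all hypothetical.

```python
from collections import defaultdict


def elementary_sets(table, attrs):
    """Partition the universe into sets of objects indiscernible on attrs."""
    groups = defaultdict(set)
    for name, row in table.items():
        groups[tuple(row[a] for a in attrs)].add(name)
    return list(groups.values())


def approximations(table, attrs, target):
    """Return the (lower, upper) approximations of target with respect to attrs."""
    lower, upper = set(), set()
    for block in elementary_sets(table, attrs):
        if block <= target:   # block lies wholly inside the target set
            lower |= block
        if block & target:    # block overlaps the target set
            upper |= block
    return lower, upper


def dependency(table, cond_attrs, dec_attr):
    """Ratio of objects in the positive regions to objects in the universe."""
    positive = set()
    for decision_class in elementary_sets(table, [dec_attr]):
        positive |= approximations(table, cond_attrs, decision_class)[0]
    return len(positive) / len(table)


# Hypothetical toy table: u3 and u4 agree on c1, c2 but differ on d.
TABLE = {
    "u1": {"c1": 0, "c2": 0, "d": "normal"},
    "u2": {"c1": 0, "c2": 0, "d": "normal"},
    "u3": {"c1": 0, "c2": 1, "d": "normal"},
    "u4": {"c1": 0, "c2": 1, "d": "abnormal"},
    "u5": {"c1": 1, "c2": 1, "d": "abnormal"},
    "u6": {"c1": 1, "c2": 0, "d": "abnormal"},
}
NORMAL = {name for name, row in TABLE.items() if row["d"] == "normal"}
```

In this table `approximations(TABLE, ["c1", "c2"], NORMAL)` places u1 and u2 in the lower approximation and additionally u3 and u4 in the upper approximation, because u3 and u4 are indiscernible yet belong to different decision classes; four of the six objects fall in positive regions, so the dependency is 4/6.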
After all redundant attributes have been eliminated, the remaining subset of attributes is called a minimal subset or reduct. Decision rules are generalized based on the non-redundant attributes contained in the chosen reduct. Values for these attributes are then analyzed to identify patterns in the data. The patterns are then expressed as logical statements that link the value of specific conditions with an outcome. The decision rules can be employed to analyze new objects and partition them into different classes. If the new object matches one possible rule, strength for all suggested decision classes in Dec_D in this rule will be assessed, and the new object will be included in the class with the most strength. The performance of decision rules can be measured by the accuracy of decision rules and/or decision coverage. The main advantage of rough sets is that they do not require any preliminary or additional information about the data. The method can work with missing values, switch between different reducts, and use less expensive or alternative sets of measurements. It is able to discover important facts hidden in the data and express them in the natural language of decision rules. The rough sets method offers the ability to handle large amounts of both quantitative and qualitative data. Its ability to model highly nonlinear or discontinuous functional relationships provides a

powerful method for characterizing complex, multidimensional patterns. It offers transparency of classification decisions, allowing for their argumentation. The rough sets method has been successfully applied in knowledge acquisition, forecasting and predictive modeling, and decision support (Pawlak et al., 1995; Hashemi, LeBlanc, Rucks, & Rajaratnam, 1998; Dimitras et al., 1999; Slowinski & Zopounidis, 1995).

EXPERIMENT

Data

Systems can be monitored at various levels. Various factors including cost, accuracy, and ability to differentiate normal from abnormal behavior influence the choice. Typically, intrusion detection systems monitor either user behavior or privileged processes. Although the former method was more popular earlier (Denning, 1987), recent studies have used the latter method (Lee, Stolfo, & Mok, 1998; Hofmeyr et al., 1998). Privileged processes are programs that require access to system resources that are usually inaccessible to ordinary users. Privileged processes are easier to detect, since, unlike a user with a wide latitude of actions, they perform a specific limited function; the range of behaviors is limited compared to that of users and is fairly stable over time. In Unix, the user has to be granted super-user status to run a privileged process. The normal user with super-user status gets a broad range of permissions to perform tasks that are typically not allowed for the user. Normally, the processes are trusted to access only relevant system resources, but can be misused due to improper configuration or modification of code. The privileged process is observed through system calls that the Unix process uses to access system resources. Hofmeyr et al. (1998) found that short sequences of system calls are a good discriminator for several types of intrusion.
A detection system should be reliable and efficient: reliable in discriminating between acceptable and unacceptable behavior, and efficient in detecting intrusion with nominal use of computer resources. We can record a variety of information from the system calls, including timing, parameters passed, instruction sequence, and interactions with other processes. Hofmeyr et al. used the temporal ordering of system calls for intrusion detection. In intrusion detection, as in most data mining operations, a database of normal behavior is developed, and data from system calls are compared with this database to detect abnormal behavior. Traces of system calls generated by a particular program (e.g., the sendmail program) are analyzed, and a database is created of all unique sequences of a given length.

The following example, based on Hofmeyr et al., illustrates the creation of the database. Let us assume we have the following trace of system calls:

open, read, mmap, mmap, open, read, mmap

For a window size of 3 we get four unique sequences:

open, read, mmap
read, mmap, mmap
mmap, mmap, open
mmap, open, read
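The example above can be reproduced with a short sketch. It is only an illustration under stated assumptions: the function names (`windows`, `build_database`, `count_mismatches`) and the second trace are hypothetical, and a plain Python set stands in for whatever storage structure an actual system would use. The sketch slides a window over a trace, collects the unique sequences, and counts how many windows of a new trace are absent from that collection.

```python
from typing import Iterable, List, Set, Tuple

Window = Tuple[str, ...]


def windows(trace: List[str], k: int) -> List[Window]:
    """Return all overlapping windows of length k, in order of appearance."""
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]


def build_database(traces: Iterable[List[str]], k: int) -> Set[Window]:
    """Collect the unique length-k sequences seen in the normal traces."""
    database: Set[Window] = set()
    for trace in traces:
        database.update(windows(trace, k))
    return database


def count_mismatches(trace: List[str], database: Set[Window], k: int) -> Tuple[int, int]:
    """Return (number of windows not in the database, total number of windows)."""
    candidate_windows = windows(trace, k)
    misses = sum(1 for w in candidate_windows if w not in database)
    return misses, len(candidate_windows)


# Trace taken from the example in the text.
normal_trace = ["open", "read", "mmap", "mmap", "open", "read", "mmap"]
db = build_database([normal_trace], k=3)

# Hypothetical new trace, invented here purely for illustration.
new_trace = ["open", "read", "mmap", "open", "open", "read", "mmap"]
mismatches, total = count_mismatches(new_trace, db, k=3)
print(sorted(db))
print(f"{mismatches} of {total} windows are not in the normal database")
```

With the trace from the text, the database holds exactly the four sequences listed above, because the window open, read, mmap occurs twice and is stored once.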

These sequences are stored as trees, with each tree rooted at a particular system call. To evaluate a new trace, overlapping sequences of length K in the new trace are compared with the database of normal traces, and those that do not occur in the database are considered mismatches. The number of mismatches, both the raw number and the percentage of total number of matches, is an indicator of abnormal behavior. Abnormal behavior can include both legal and illegal actions. The normal database may not contain all possible actions, and some legal but infrequent sequences may be termed abnormal by the comparison procedure. Identifying them may be as important as identifying illegal behavior, since they may signal other non-security problems with the system.

In this study we used the above procedure to capture data. The data used in this study is based on an immune system developed at the University of New Mexico (Lee & Stolfo, 2000; Lee et al., 1998). It covers one privileged program, sendmail. The data includes both normal and abnormal traces. The normal trace is a trace of the sendmail daemon and several invocations of the sendmail program. No intrusions or suspicious activities occurred while these traces were collected. The abnormal traces contain several traces, including intrusions that exploit well-known problems in Unix systems. For example, Sunsendmailcp (SSCP) is a script that sendmail uses to append a message to a file, but when used on a file such as /.rhosts, a local user may obtain root access. The syslog attack uses the syslog interface to overflow a buffer in sendmail. Forwarding loops occur in sendmail when a set of files in $home/.forward form a logical circle.
In our study, intrusion traces include five error conditions of forwarding loops, three sunsendmailcp (sscp) attacks, two traces of the syslog-remote attacks, two traces of the syslog-local attacks, two traces of the decode attacks, and two traces of unsuccessful intrusion attempts (sm5x and sm565a). Detailed descriptions of these intrusions can be found in Hofmeyr et al. (1998). Each trace has two attributes: the first is the process ID, indicating the process the system call belongs to, and the second is the system call value. There are a total of 182 kinds of system calls. The system calls are converted from strings to integer values using a lookup table. Table 1 illustrates a section of one sendmail trace from a single process, with the process ID, the actual system calls, and their integer values in the three rows. These traces can be normal or abnormal.

Data Preprocessing

Prior research indicates that short sequences of system calls made by a program during its normal execution are very consistent and can be used for anomaly detection (Hofmeyr et al., 1998). Our objective is to recognize the different patterns of normal and abnormal behavior by using various learning algorithms. First, we need to set up these sequences from the original data sets. One system call and the N − 1 subsequent system calls in the same process comprise one sequence of length N. Sequences of system calls from normal traces are normal sequences, while those from suspicious traces are compared with sequences from normal traces and, if no match is found, are termed abnormal sequences. Table 2 shows two labeled sequences. Using a sliding window of length N, all traces are searched and two data sets are created, one consisting of normal sequences and another consisting of abnormal ones. The selection of sequence length is determined by

two conflicting criteria. While a short sequence helps to minimize computation and database size, it may not be adequate to discriminate normal from abnormal behavior. Prior research (Hofmeyr et al., 1998) indicates that a window size of around 6-7 is appropriate for most instances. The length of the sliding window was set to 7 to facilitate easy comparison of results with prior studies. Eventually, 1,112 normal sequences and 1,576 abnormal sequences were identified from the data. The number of abnormal sequences for each intrusion type is listed in Table 3.

Table 1: Sendmail trace.
Rows: Process ID; System Call Number; System Call*
System Call*: write, fork, sstk, sstk, write, sethostid, sstk
*The system calls in the last row will not appear in the data set.

Table 2: Normal and abnormal sequence.
Columns: System Call Sequences (Length 7); Class Labels (normal, abnormal)

Experimental Design

While the primary objective of our study was to compare the performance of the three data mining methods, we were also interested in evaluating the impact on performance of two other variables, data proportion and data representation. Hence, an experimental design (3 × 2 × 2) incorporating all three variables was used. The design includes:

Data mining method (3): neural networks, inductive learning, and rough sets
Data set representation (2): binary and integer
Data set proportion (2): balanced and unbalanced

The three data mining methods were discussed in the earlier section. For neural networks, 42 input neurons were used since we had seven attributes, each represented by six bits. Output neurons were set at 2 since it was a binary decision of yes or no. The number of hidden layers was set at 1, and hidden neurons were set at 15. A training cycle of 1,000 with a learning rate of 1.0 was used.

The data representation can be in binary or integer form.
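The arithmetic behind the 42 input neurons (seven system calls, six bits each) can be illustrated with a small sketch. This is not the authors' encoding procedure, which the text does not spell out; the helper names and the sample values are hypothetical. Six bits can distinguish only 64 values, so the sketch assumes that the integer codes actually passed to the encoder fall in the range 0 to 63, and it rejects anything larger rather than guessing how larger codes were handled.

```python
from typing import List

BITS_PER_CALL = 6   # stated in the text; six bits cover the values 0..63 only
WINDOW_LENGTH = 7   # sliding-window length used in the study


def to_bits(call_id: int, width: int = BITS_PER_CALL) -> List[int]:
    """Encode one integer system-call code as a fixed-width list of 0/1 values."""
    if not 0 <= call_id < 2 ** width:
        raise ValueError(f"call id {call_id} does not fit in {width} bits")
    # Most significant bit first.
    return [(call_id >> shift) & 1 for shift in range(width - 1, -1, -1)]


def encode_sequence(sequence: List[int]) -> List[int]:
    """Concatenate the bit patterns of every call in one sequence."""
    bits: List[int] = []
    for call_id in sequence:
        bits.extend(to_bits(call_id))
    return bits


# Hypothetical integer-coded sequence of length 7 (values invented for illustration).
integer_record = [4, 2, 50, 50, 4, 27, 50]
binary_record = encode_sequence(integer_record)

print(len(integer_record))   # 7 attributes in integer form
print(len(binary_record))    # 42 attributes in binary form
```

The integer record keeps one attribute per system call, while the binary record spreads each call over six 0/1 attributes, which is where the 42 inputs come from.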
In integer representation the original integer values of the system calls were used directly in the training system, but they were treated as qualitative attributes instead of quantitative numbers. Since the window size was 7, the number of attributes or input units in each record was 7. In binary representation each system call was changed to 6 bits valued at 0 or 1. Hence, the number of condition attributes was 42 instead of 7. In both representations the output unit had one value: a normal sequence was classified as 0, and an abnormal sequence was classified as 1.

Table 3: Sequences of system calls in different traces.

Traces                # of Sequences
Normal                1112
  Total               1112
Abnormal
  Decode              16
  Forwarding loops    258
  Sunsendmailcp       219
  Syslog-local        359
  Syslog-remote       439
  Sm565a              23
  Sm5x                262
  Total               1576

The proportion of normal and abnormal sequences in the training and testing data set is another important variable (Wilson & Sharda, 1994). The proportion rate can affect performance in multiple ways. Some methods do not perform well when the number of records of abnormal traces (base rate) is very low, since they may not be able to identify all the features necessary for classification. The base rate proportion in the training data set could be different from that in the testing data set. For a system to be robust it should be able to work with different proportions in the testing data after learning all the classification rules. In this study we used two data proportions, balanced and unbalanced, based on whether the proportion of normal and abnormal sequences is equal or not.

Multiple data points are required in each of the 12 cells of the experimental design to statistically evaluate the research model. Hence, a three-fold cross-validation approach was used to create the data sets. For the balanced proportion we split the normal data set into three parts, each consisting of about 370 sequences. Each time, we put two parts in the training set and combined them with the same number of abnormal sequences. The remaining normal data were placed in the testing set, combined with 300 abnormal sequences that were different from those in the training set. For the unbalanced proportion, the training data consists of 80% normal sequences and 20% abnormal sequences. The testing data consists of the remaining 20% of normal sequences and the remaining abnormal sequences.
The three-fold approach generates data sets for three experiments. The exercise is repeated three times by randomly generating three different sets of partitions. Hence, data sets for nine experiments in each cell are created for a total of 108 data points. However, since only binary representation is used in neural networks, only 10 cells are feasible, thereby providing 90 data points for this study (see Table 4).
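The balanced three-fold scheme and the resulting data-point counts can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `balanced_splits`, the use of integer record ids as stand-ins for sequences, and the random sampling of abnormal sequences are all assumptions. Only the balanced scheme is shown, since the text leaves the exact composition of the unbalanced sets open to interpretation.

```python
import random
from typing import List, Sequence, Tuple

# (train_normal, train_abnormal, test_normal, test_abnormal)
Split = Tuple[List[int], List[int], List[int], List[int]]


def balanced_splits(normal: Sequence[int], abnormal: Sequence[int],
                    n_test_abnormal: int = 300, folds: int = 3,
                    seed: int = 0) -> List[Split]:
    """Build one balanced train/test split per fold of the normal data."""
    rng = random.Random(seed)
    shuffled = list(normal)
    rng.shuffle(shuffled)
    parts = [shuffled[i::folds] for i in range(folds)]  # near-equal parts

    splits: List[Split] = []
    for held_out in range(folds):
        train_normal = [s for i, part in enumerate(parts)
                        if i != held_out for s in part]
        needed = len(train_normal) + n_test_abnormal
        if needed > len(abnormal):
            raise ValueError("not enough abnormal sequences for this split")
        # Sampling without replacement keeps train and test abnormal sets disjoint.
        drawn = rng.sample(list(abnormal), needed)
        train_abnormal = drawn[:len(train_normal)]
        test_abnormal = drawn[len(train_normal):]
        splits.append((train_normal, train_abnormal, parts[held_out], test_abnormal))
    return splits


# Record ids standing in for the 1,112 normal and 1,576 abnormal sequences.
normal_ids = list(range(1112))
abnormal_ids = list(range(10_000, 11_576))
splits = balanced_splits(normal_ids, abnormal_ids)

# Data-point arithmetic from the text: three repetitions of three folds per cell.
REPETITIONS, FOLDS = 3, 3
runs_per_cell = REPETITIONS * FOLDS
full_design_points = 3 * 2 * 2 * runs_per_cell   # 12 cells -> 108
feasible_points = 10 * runs_per_cell             # 10 feasible cells -> 90
print(runs_per_cell, full_design_points, feasible_points)
```

Calling `balanced_splits` with three different seeds would mirror the three random repartitions described above, giving nine splits per cell.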

Table 4: Data proportion: balanced and unbalanced.
Columns: Proportion Type; Training (Normal, Abnormal); Testing (Normal, Abnormal)
Rows: Balanced; Unbalanced

Table 5: Accuracy rates for different learning methods.
Columns: Binary Representation (Neural Networks, Inductive Learning, Rough Sets); Integer Representation* (Inductive Learning, Rough Sets)
Rows: Balanced Proportion; Unbalanced Proportion
*Integer representation is not possible for neural networks.

Table 6: ANOVA analysis.
Columns: Mean Square; df; F-Value; Significance
Rows: Model; Residual; Main Effects (Method, Representation, Proportion); Method * Representation; Method * Proportion; Representation * Proportion; Method * Representation * Proportion

RESULTS

Table 5 shows the average classification accuracy rate for various combinations of data proportion, data representation, and data mining methods. The columns represent the two data representation methods and each data mining method under them, and the rows represent the two data proportions. A cursory analysis of the data indicates that there are differences in performance due to the three variables. The results of the ANOVA, examining the impact of the three variables on performance, are presented in Table 6. The results indicate that the overall model is significant at p < . Two of the three variables, data mining method and data proportion, are significant at p < . All the interaction effects, except method*proportion, are also significant.

DISCUSSION

Although ANOVA tests the overall model and evaluates the impact of the three variables on performance, it does not explain the reasons for the differences or clarify the interactions. A more detailed analysis of the impact of the individual variables provides these explanations. Table 7 provides the results of t-tests for each of the individual variables, which are discussed in detail below.

Data Mining Method

The results of ANOVA provide sufficient empirical evidence that the data mining method has the most influence on performance. The results of Duncan's test, shown in Table 7a, indicate that the performance of the three data mining methods is significantly different. Rough sets had the best performance, followed by neural networks and, finally, inductive learning. The finding that the neural network model is better than inductive learning is consistent with a few prior studies (Tam & Kiang, 1992; Chung & Tam, 1993). However, it should be noted that other studies (Weiss & Kapouleas, 1989; Sashisekharan et al., 1994) have found conflicting results, and Chung and Tam (1993) claimed that the results are dependent on the problem context. Since prior studies have not compared the performance of these two methods in the context of intrusion detection, our results suggest that neural networks perform better than inductive learning in the IDS context. More studies may have to be conducted to conclusively validate this finding.

There have been no studies comparing rough sets with the other two methods. The results of this study are very encouraging since rough sets perform better than neural networks, which are a very popular method in this area. The performance of rough sets in the context of intrusion detection is also noteworthy since studies have not explored the use of this method for intrusion detection.
Data Proportion

Balanced proportion was significantly better than unbalanced proportion in classification accuracy, which is consistent with the results from prior studies (Wilson & Sharda, 1994). Balanced proportion had equal amounts of normal and abnormal sequences in the training set. This indicates that a greater number of sequences of one class in the training set leads to better learning and a more accurate classification. Another important difference between the two proportions is the source of abnormal sequences in the training set. While the balanced proportion contains all the intrusion traces for abnormal sequences, there are only four traces of abnormal sequences in the unbalanced proportion. For all classifiers, an adequate number of training samples is required to achieve satisfactory performance. The lower accuracy rate in the unbalanced proportion could be attributed to the possibility of losing some important abnormal patterns.

Data Representation

The results of the t-test, shown in Table 7c, comparing the two data representation schemes, indicate that there is no significant difference in classification accuracy

between the two representations. Data mining methods are able to learn from the data set regardless of the representation. Data representation does not significantly impact performance since we are capturing the same phenomenon in different representation formats. The data in this study were primarily discrete values of qualitative variables (system calls), which could be one reason for the lack of significance. The results may be different in other contexts where the values are continuous variables (Zhu & Padman, 1997).

Table 7: Performance comparison I.

a. Comparison of data mining method.
Columns: Variable; Mean (SD); Inductive-Significance; Rough Sets-Significance
Neural networks: (2.30)
Inductive learning: (7.08), .0001
Rough sets: (3.87)

b. t-test proportion.
Columns: Variable; Mean (SD); t-value (df); Significance
Balanced: (9.6), 2.91 (88), .005
Unbalanced: (13.88)

c. t-test representation.
Columns: Variable; Mean (SD); t-value (df); Significance
Integer: (13.85), .64 (88), .521
Binary: (11.49)

Overall Performance

Although the individual analysis provides an understanding of the impact of each variable, it is useful to study the interactions among the variables. Table 8 provides the mean values for each of the 10 variations of the three factors and the results of t-tests for statistical significance. Table 8a indicates that while data proportion has a significant impact on all three methods, it is least pronounced in the context of neural networks. The ability to achieve good classification accuracy with unbalanced data proportion provides greater flexibility in practice, where in some contexts one may not be able to obtain large data sets with abnormal cases for training. As expected, the impact of data representation for the two data mining methods is negligible (Table 8b). An interesting analysis is the comparison of data representation and data proportion.
The difference between balanced and unbalanced proportion is more significant for integer representation than binary representation. Binary representation performs better than integer representation for unbalanced proportion, but the difference is not statistically significant. Data mining methods need to perform well in both normal and abnormal cases. Classification accuracy is determined by the ability of the system to identify normal cases as normal and abnormal cases as abnormal. The system should minimize errors of identifying normal cases as abnormal behavior, which could lead to

many false alarms, or of identifying abnormal as normal behavior, which could lead to intrusions. The classification accuracy for the three data mining methods and two data proportions for both cases (normal and abnormal) is provided in Table 9.

Table 8: Performance comparison II.

a. Comparison of method and proportion.
Columns: Neural Networks Mean (SD); Inductive Learning Mean (SD); Rough Sets Mean (SD)
Balanced: (1.21); (2.54); (2.23)
Unbalanced: (2.77); (3.61); (3.77)
Rows also reported: t-value; Significance

b. Comparison of method and representation.
Columns: Inductive Learning Mean (SD); Rough Sets Mean (SD)
Integer: (5.96); (5.08)
Binary: (8.23); 76 (2.19)
Rows also reported: t-value; Significance

c. Comparison of representation and proportion.
Columns: Binary Mean (SD); Integer Mean (SD); t-value; Significance
Balanced: (10.15); (9.06)
Unbalanced: (11.94); (16.64)
Rows also reported: t-value; Significance

Table 9: Classification: normal and abnormal cases.
Columns: Correct Classification; Neural Networks; Inductive Learning; Rough Sets
Rows: Balanced (Normal, Abnormal); Unbalanced (Normal, Abnormal)

Neural networks classify 65.52% of normal cases as normal and 77.37% of abnormal cases as abnormal. This method is rather conservative in classification because the probability of accepting normal behavior as abnormal is greater than the probability of accepting abnormal behavior as normal. Inductive learning has a very high classification accuracy of normal cases in unbalanced proportion but its accuracy

drops significantly in identifying abnormal cases. This is a serious issue since it accepts almost 74% of abnormal cases as normal cases, thereby creating a false sense of security even when intrusions are taking place in the network. Rough sets are very good at identifying normal cases as normal, thereby considerably reducing false alarm rates. However, their performance in identifying abnormal cases is lower, and they have lower prediction accuracy than neural networks. All three methods have lower accuracy in classifying abnormal sequences for unbalanced proportion. The relatively limited number of sequences in the data set may be inadequate to learn all the patterns and develop classification rules.

CONCLUSIONS

The tremendous growth in the Internet and electronic commerce has created serious challenges to network security. Advances in data mining and knowledge discovery provide new approaches to network intrusion detection. In this study, experiments were conducted to evaluate the prediction accuracy of three different data mining methods applied to sequences of system calls in sendmail privileged processes. An experimental design (3 × 2 × 2) was created to evaluate the impact on the classification accuracy of intrusion detection systems of three data mining methods, two data representation formats, and two data proportion schemes. The results indicated that data mining method and data proportion had a significant impact on classification accuracy. Among data mining methods, rough sets provided better accuracy, followed by neural networks, and then inductive learning. Balanced data proportion performed better than unbalanced data proportion. There were no major differences in performance between binary and integer data representation. To the best of our knowledge, our research was the first attempt to evaluate and compare multiple data mining methods including rough sets in the IDS context.
This study provides opportunities for exploring new directions for future research. The data used in this study was created from a limited set of programs in a single environment. The data set can be expanded to include more variations in settings, to enable us to generalize the results for a broader set of parameters. We could expand it to include more programs and processes within the Unix operating system. We could also include new versions of intrusions. Another possibility is to extend the testing to other operating systems such as Linux, NT, or other Unix versions.

Data mining techniques are only as good as the training data that help these techniques make decisions. A key issue is where we should get the training data: real-life data, artificially created data, or a combination of both? While real-life data provides a better representation of the activity, data from certain features of the program that are rarely invoked may be missing. Artificial data could be used to incorporate those features as a supplement to real-life data. Also, the feasibility of self-learning systems needs to be explored so that systems learn from the detection of new intrusions in their daily operations.

In this study the testing was primarily offline, using data collected earlier. Ideally, an IDS should be real-time, so that it can detect intrusions as and when they are occurring. False alarm rates and computational efficiency become very significant in these situations. The system should not be a drain on the computing

resources of the server, and also should not generate so many false alarms that it becomes ineffective. System architecture issues need to be addressed while designing those systems. [Received: October 2, Accepted: August 22, 2001.]

REFERENCES

Berry, M. J. A., & Linoff, G. (1997). Data mining techniques for marketing, sales, and customer support. New York: John Wiley & Sons, Inc.

Bigus, J. (1996). Data mining with neural networks: Solving business problems from application development to decision support. New York: McGraw-Hill.

Bonifacio, J. M. Jr., Cansian, A. M., Carvalho, A. C. P. L. F., & Moreira, E. S. (1997). Neural networks applied in intrusion detection systems. Proceedings of the International Conference on Computational Intelligence and Multimedia Application. Gold Coast, Australia,

Carrettoni, F., Castano, S., Martella, G., & Samarati, P. (1991). RETISS: A real time security system for threat detection using fuzzy logic. Proceedings of 25th IEEE International Carnahan Conference on Security Technology, Taipei, Taiwan, ROC.

Chen, M. S., Han, J., & Yu, S. (1996). Data mining: An overview from database perspective. IEEE Transactions on Knowledge and Data Engineering, 8,

Chin, S. K. (1999). High confidence design for security. Communications of the ACM, 42(7),

Chu, C. H., & Widjaja, D. (1994). A neural network system for forecasting method selection. Decision Support Systems, 12,

Chung, H. M., & Gray, P. (1999). Special section: Data mining. Journal of Management Information Systems, 16(1),

Chung, H. M., & Tam, K. Y. (1993). A comparative analysis of inductive learning algorithms. Intelligent Systems in Accounting, Finance and Management, 2(1),

Debar, H., Becker, M., & Siboni, D. (1992). A neural network component for an intrusion detection system. Proceedings of the 1992 IEEE Symposium on Research in Security and Privacy. Oakland, CA: IEEE Computer Society Press,

Denning, D. E. (1987). An intrusion-detection model. IEEE Transactions on Software Engineering, 12(2),

Desai, V. S., Crook, J. N., & Overstreet, G. A. (1996). A comparison of neural networks and linear scoring models in the credit union environment. European Journal of Operational Research, 95,

Dimitras, A. I., Slowinski, R., Susmaga, R., & Zopounidis, C. (1999). Business failure prediction using rough sets. European Journal of Operational Research, 114,