Data Mining For Intrusion Detection Systems Monique Wooten Professor Robila December 15, 2008
Wooten 2
ABSTRACT
The paper discusses the use of data mining techniques in intrusion detection systems. The goal of these data mining-based intrusion detection systems is to discover patterns of program and user behavior and to determine which sets of events indicate an attack. The paper covers what intrusion detection and data mining are, the significance of data mining-based IDS, and the major data mining techniques that have been applied to preexisting intrusion detection systems.
Introduction
Several layers of security are necessary to reduce the potential for malicious attacks on a system. An Intrusion Detection System (IDS) is one of these layers of defense. In an IDS, a stream of data is inspected and rules are applied to determine whether an attack is taking place. Intrusion detection systems typically operate within a managed network, between a firewall and internal network elements.

The idea of intrusion detection systems has been around since the 1980s, beginning with James P. Anderson's study on ways to improve computer security auditing and surveillance at customer sites [7]. The IDS field has made significant advancements over the years, and today a large number of security measures are available. Even so, there are still many instances in which malicious users succeed in attacking systems, and these attacks can sometimes result in the loss of crucial data. In some cases this is due to errors in configuring security systems or to insider attacks, which can operate behind security walls. Security policies that aren't sufficiently structured can make a system more susceptible to insider attacks.

Intrusion Detection Systems
An Intrusion Detection System (IDS) is software that detects unwanted behavior. There are different types of intrusion detection systems. These include Network Intrusion Detection (NIDS), Host-based Intrusion Detection (HIDS), Hybrid Intrusion Detection, and Network Node Intrusion Detection (NNID) [8]. A NIDS analyzes packets flowing through the network, while a HIDS runs on a host and monitors activity on that host. These intrusion detection systems search for two kinds of abnormal activity: misuse and anomalies. Misuse detection involves sorting through data, examining sequences of calls, and comparing them to a list of signatures for known attacks.
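The comparison of call sequences against known signatures can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the signature names, the call sequences, and the function find_signature_matches are hypothetical, and matching is assumed to mean an exact, contiguous match of the signature within the observed trace.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical signatures for illustration only: each attack name maps to
# the contiguous sequence of calls that is assumed to identify it.
SIGNATURES: Dict[str, Tuple[str, ...]] = {
    "example-shell-spawn": ("open", "setuid", "execve"),
    "example-log-tamper": ("open", "truncate", "close"),
}


def find_signature_matches(
    calls: Sequence[str],
    signatures: Dict[str, Tuple[str, ...]] = SIGNATURES,
) -> List[Tuple[str, int]]:
    """Return (signature name, start index) for every contiguous match."""
    matches: List[Tuple[str, int]] = []
    for name, pattern in signatures.items():
        width = len(pattern)
        # Slide a window of the signature's length over the observed calls.
        for start in range(len(calls) - width + 1):
            if tuple(calls[start:start + width]) == pattern:
                matches.append((name, start))
    return matches


if __name__ == "__main__":
    trace = ["read", "open", "setuid", "execve", "write"]
    print(find_signature_matches(trace))
```

Exact matching of this kind only recognizes attacks whose signatures are already in the list, and a trace that differs from a signature by a single call produces no match.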
Anomaly detection looks at the state of the network and checks that it matches a predefined normal state.

Advancements in IDS
The earliest IDSs were built on string-matching rules that looked for command sequences used by known attacks. These string-matching approaches have limited use and are easy to foil. Other attacks focus on communication protocols (TCP/IP, for example) and seek to exploit vulnerabilities in specific protocol implementations. These more sophisticated attacks are dynamic in nature. Some current IDSs (GrIDS, STAT, etc.) that are based upon dynamic. [7] The rapid increase in network bandwidth from megabits to gigabits per second is making it progressively more difficult to carry out the analysis needed to detect network attacks in a timely and accurate manner [7].

Studies using Data Mining based IDS
Applying Data Mining to Snort
This section discusses instances in which data mining techniques have been applied to a well-known intrusion detection system, Snort.
What is Snort?
Snort is a popular open-source IDS created by Martin Roesch. It monitors network traffic and uses content searching and matching to detect denial of service, buffer overflow, and other attacks. The main difference between Snort and a commercial IDS is that Snort does not come with customer support: users have to teach themselves how to install, configure, and maintain it. Some researchers have used Snort to apply data mining IDS techniques. According to [1], Snort is known for triggering a large number of false alerts, and when it warns of an attack it doesn't state what kind of attack it is.

Clustering Approach used on Snort Alerts [1]
The paper [1] presents a clustering approach for handling Snort alerts more effectively. One of the problems with using Snort is that a large percentage of the alerts it generates are false positives. The cause of this problem was believed to be the way in which packets were analyzed. It was suggested that, instead of looking at each packet individually, all the alerts should be assembled into an XML document, which would allow patterns of alerts to be analyzed. All the alerts in a session are placed in a file in the Intrusion Detection Message Exchange Format (IDMEF), an XML format. This XML file represents a pattern of alerts that may be used to identify an attack. The information in multiple files is then grouped into clusters using a distance measure. The following is the example given of an IDMEF file.
<?xml version="1.0"?>
<IDMEF-Message version="1.0">
  <Alert ident="12773">
    <Analyzer analyzerid="snort00" model="snort">
    </Analyzer>
    <CreateTime ntpstamp="0xb9225b23.0x9113836a">1998-06-05T11:55:15Z</CreateTime>
    <Source>
    </Source>
    <Target>
    </Target>
    <Classification origin="vendor-specific">
      <name>msg=ICMP PING</name>
      <url>none</url>
    </Classification>
    <Classification origin="vendor-specific">
      <name>sid=384</name>
      <url>http://www.snort.org/snortdb/sid.html?sid=384</url>
    </Classification>
    <Classification origin="vendor-specific">
      <name>class=misc-activity</name>
      <url>none</url>
    </Classification>
    <Classification origin="vendor-specific">
      <name>priority=3</name>
      <url>none</url>
    </Classification>
    <Assessment>
      <Impact severity="high" />
    </Assessment>
    <AdditionalData meaning="sig_rev" type="string">5</AdditionalData>
    <AdditionalData meaning="Packet Payload" type="string">2a2a20202020202020202000aaea020097a4020075da</AdditionalData>
  </Alert>
</IDMEF-Message>

Implementation
After reading several articles in which researchers used an existing intrusion detection system together with data mining methods to evaluate how well data mining techniques work with IDS, I decided to attempt an implementation. The goal was not to create a data mining-based intrusion detection system, but to run intrusion data sets through data mining software, observe whether the software classified the data correctly or incorrectly, and then evaluate the results. Implementation was attempted with two data sets, KDD and DARPA, and two data mining packages, WEKA and CART. WEKA was chosen because it's an open-source data mining package
that was used in the research experiment in [6]. CART was chosen because its speed and the clarity of its results are greater than those of WEKA.

Preprocessing the data
The data had already been preprocessed, but the original file formats were incompatible with the software being used. Since CART appeared to be a more powerful data mining tool than WEKA, implementation began with an attempt to load DARPA's tcpdump data set into CART. It was immediately clear that DARPA's data set could not be used for this experiment: the data appeared as a table of two unlabeled columns containing various symbols. Initially it was not possible to use WEKA with the data sets either, because they required more memory than WEKA had available, so it was necessary to increase the Java heap size for WEKA. Once this was done, I attempted to load the KDD data set in WEKA, but the software never finished loading the data. This problem was most likely caused by the limitations of the computer on which the experiments were performed. Next, I attempted to load the KDD data into CART. The data loaded successfully, but the labels for the data did not appear. Attempts to edit the data file involved Excel, Notepad, and Microsoft Word. Excel was able to open an editable view of the data, but less than 10% of it loaded because the data set files were so large. Limited memory was the main problem, so I downloaded TextPad, a text editor whose memory use has no upper bound.

Results of Experiment
The goal of running tests on the intrusion detection data sets was not attained. After some of the KDD data sets had been preprocessed successfully, the CART software encountered an error, and I have not been able to get it running again. Attempts to find other open-source or trial versions of data mining software with CART's capabilities have been successful. Qt Orange Canvas is slower but appears to have more capabilities than CART.
There are still some bugs to be worked out. For example, I was able to run tests but was not able to choose the target attribute. Without identifying the target attribute, testing is pointless.
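Two of the obstacles described above, missing column labels and an unselected target attribute, can be illustrated with a short Python sketch. It is not the procedure used in this paper and involves neither WEKA nor CART. The column list is a hypothetical, shortened stand-in for the real KDD fields, the sample records are made up, and the functions labeled_rows, split_target, and class_counts are names introduced here for illustration. Records are read one at a time, so the whole file never has to fit in memory.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical, shortened column list (an assumption for illustration);
# the real KDD files contain many more fields per record.
COLUMNS: List[str] = ["duration", "protocol_type", "service", "src_bytes", "label"]
TARGET = "label"  # the attribute a classifier would be asked to predict


def labeled_rows(lines: Iterable[str],
                 columns: List[str] = COLUMNS) -> Iterator[Dict[str, str]]:
    """Attach column names to headerless CSV records, one record at a time."""
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(row)}")
        yield dict(zip(columns, row))


def split_target(record: Dict[str, str],
                 target: str = TARGET) -> Tuple[Dict[str, str], str]:
    """Separate the target attribute from the remaining features."""
    features = {key: value for key, value in record.items() if key != target}
    return features, record[target]


def class_counts(lines: Iterable[str],
                 columns: List[str] = COLUMNS,
                 target: str = TARGET) -> Counter:
    """Count how many records carry each value of the target attribute."""
    counts: Counter = Counter()
    for record in labeled_rows(lines, columns):
        _, label = split_target(record, target)
        counts[label] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample records in a KDD-like comma-separated layout.
    sample = io.StringIO(
        "0,tcp,http,181,normal\n"
        "0,icmp,ecr_i,1032,smurf\n"
        "0,tcp,http,239,normal\n"
    )
    print(class_counts(sample))
```

For a file on disk, the same functions could be given a file object opened with open(path, newline="") in place of the in-memory sample.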
Conclusion
Due to a series of complications, I was unable to draw a conclusion from my own experiments. However, conclusions can be drawn from others' experiments. Data mining can be implemented as an added component of a preexisting IDS. When implemented properly, data mining can improve the classification process, resulting in fewer false positive alerts [2]. Data mining allows the IDS to analyze a sequence of events as opposed to one event at a time. Data mining will continue to be researched and applied, and more beneficial results will come from the implementation of data mining-based intrusion detection systems.
References
[1] Distinguishing False From True Alerts in Snort by Data Mining Patterns of Alerts, 2006, Florida State University. <http://ww2.cs.fsu.edu/~jidolong/publications/finalspie.pdf>
[2] A Data Mining Framework for Building Intrusion Detection Models. <http://www.snort.org/docs/ieee_sp99_lee.ps>
[3] Data Mining for Network Intrusion Detection. <http://wwwusers.cs.umn.edu/~kumar/presentation/minds.ppt>
[4] Data Mining Approaches for Intrusion Detection. <http://www1.cs.columbia.edu/~sal/hpapers/usenix/usenix.html>
[5] Data Mining-based Intrusion Detectors: An Overview of the Columbia IDS Project, 2001.
[6] Application of Data Mining to Network Intrusion Detection: Classifier Selection Model. <http://www.apnoms.org/2008/data/papers/technical/10-3.pdf>
[7] The History and Evolution of Intrusion Detection. <http://www.sans.org/reading_room/whitepapers/detection/344.php>
[8] The Evolution of Intrusion Detection Systems. <http://www.securityfocus.com/infocus/1514>