Classification Algorithms in Intrusion Detection System: A Survey

V. Jaiganesh 1, Dr. P. Sumathi 2, A. Vinitha 3

1 Doctoral Research Scholar, Department of Computer Science, Manonmaniam Sundaranar University, Tirunelveli, Tamil Nadu, India. jaiganeshree@gmail.com
2 Doctoral Research Supervisor, Assistant Professor, PG & Research Department of Computer Science, Government Arts College and Science College, Coimbatore, Tamil Nadu, India. sumathirajes@hotmail.com
3 M.Phil Scholar, Department of Computer Science, Dr. N.G.P Arts and Science College; Assistant Professor, Sasurie Arts & Science College, Erode, Tamilnadu, India. vinithasmsc@gmail.com

Abstract

An intrusion detection system (IDS) is software that helps to protect a system when another party tries to access it through the network. It secures system resources by denying access to unauthorized systems. As the Internet has become more popular and more widely used, many people try to gain unauthorized access to the resources of others for business advantage. This paper surveys data mining algorithms that help to secure a system. Within data mining, classification algorithms are well suited to this task, because classification predicts the output class of future data. Intrusion detection can be used for both hosts and networks. There are two types of detection methods: misuse detection and anomaly detection. The two algorithms surveyed are ID3 and C4.5.

Keywords: Intrusion Detection System Architecture, Detection types, Attacks, Protocols, KDD cup data set, ID3 algorithm, C4.5 algorithm, Decision trees, Classification.

1. Introduction

Intrusion detection systems and intrusion prevention systems are closely related. Both are used to detect malicious programs that enter a network or host. The only difference is that a prevention system also responds to the malicious program, for example by using a firewall or anti-spam measures, or by blocking the malicious activity. Intrusion detection can be performed on a network and on a host.
There are two types of intrusion detection methods: signature based and anomaly based detection. An intrusion prevention system must be provided with the proper software and hardware; only then can the system be secured.

Predictive modeling is used to predict an output based on historical data. Classification is a predictive technique of this kind. It has two steps: building the model, and then examining the resulting model. It is mainly used in customer segmentation, business modeling, credit risk analysis, biomedical research and drug response modeling.

2. Intrusion Detection Systems Architecture

An intrusion detection system is a software program that helps to identify malicious programs that enter a system or network. It helps to secure the system by responding to the malicious program. It is divided into two types: host based intrusion detection systems and network based intrusion detection systems. An IDS can also be active or passive. An active system responds to the malicious program, whereas a passive system only detects whether any malicious packets have entered the system.

Figure 2.1: IDS Architecture (components shown: Internet, Firewall, Router, IDS, Company Network)

Host Based Intrusion Detection System
A host based intrusion detection system detects only the malicious packets that enter the host on which it runs. It monitors that host only, not the whole network.

Network Based Intrusion Detection System

A network based intrusion detection system monitors the whole network and alerts the network administrator about malicious activity. It secures the whole network.

3. Detection Types

There are two types of detection: anomaly detection and signature detection.

Anomaly detection
It monitors normal system activity such as network bandwidth, ports, protocols and device connections. If there is any abnormal activity in the system or network, it informs the administrator.

Signature detection
It compares all network packets against previously known attack patterns, which are called signatures. The signatures are stored in a database.

4. Attacks in IDS

There are four different types of attacks.

Denial of Service Attack (DoS): An attack in which the attacker makes a computing or memory resource too busy or too full to handle legitimate requests.
User to Root Attack (U2R): An attack in which the attacker starts with access to a normal user account and tries to gain root access.
Remote to Local Attack (R2L): An attack in which the attacker sends packets to a machine over a network but does not have an account on that machine, and tries to gain local access to it.
Probing Attack: An attempt to gather information about a network of computers.

5. Protocol Attacks in IDS

TCP (Transmission Control Protocol)
When one application wants to connect with another application, TCP is used. It sets up a communication line between the two systems. An attacker may try to gain access to this connection.

ICMP (Internet Control Message Protocol)
It is used by the Internet Protocol layer to send one-way messages to a host. There is no authentication in ICMP, which can lead to denial of service attacks.

UDP (User Datagram Protocol)
Using UDP, a user can send messages to another host without first setting up a transmission channel. Messages may arrive out of order. An attacker may send messages using this protocol.

Detection Rate
The detection rate is the number of intrusions detected by the system divided by the total number of intrusions present in the sample data.

False Alarm Rate
It is the number of normal patterns detected as attacks, usually expressed as a fraction of the total number of normal patterns.

6. Data Mining

Data mining is used to extract information from large databases. It is divided into two types: predictive and descriptive. The predictive method predicts an output using historical data. The descriptive method describes what the data contains and the relationships within it. We have chosen the predictive technique for intrusion detection.

Classification
Classification assigns each data item to one of a set of predetermined target classes. For example, it can be used to classify credit risk as low, medium or high.

Figure 4.1: Classification Task (a learning algorithm learns a model from the training set by induction; the model is then applied to the test set by deduction)
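Once a learned model has been applied to a test set, as in Figure 4.1, the detection rate and false alarm rate defined in Section 5 can be computed from the true and predicted labels. The following is a minimal sketch, not a method taken from the surveyed papers. The function name, the label strings "attack" and "normal", and the sample data are hypothetical, and the false alarm count is normalized by the total number of normal patterns, which is the common convention rather than something the text states.

```python
def detection_and_false_alarm_rate(actual, predicted, attack_label="attack"):
    """Return (detection_rate, false_alarm_rate) for paired label sequences.

    Assumption: every label other than `attack_label` counts as normal.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    intrusions = detected = normals = false_alarms = 0
    for truth, guess in zip(actual, predicted):
        if truth == attack_label:
            intrusions += 1
            if guess == attack_label:
                detected += 1          # intrusion correctly flagged
        else:
            normals += 1
            if guess == attack_label:
                false_alarms += 1      # normal pattern flagged as attack

    # Guard against empty classes so the sketch never divides by zero.
    detection_rate = detected / intrusions if intrusions else 0.0
    false_alarm_rate = false_alarms / normals if normals else 0.0
    return detection_rate, false_alarm_rate


# Hypothetical test-set labels, for illustration only.
actual = ["attack", "attack", "normal", "normal",
          "normal", "attack", "normal", "attack"]
predicted = ["attack", "normal", "normal", "attack",
             "normal", "attack", "normal", "attack"]

dr, far = detection_and_false_alarm_rate(actual, predicted)
print(dr, far)  # 0.75 0.25
```

Here three of the four intrusions are detected and one of the four normal patterns raises a false alarm.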
Examples of Classification Task
1. Predicting tumor cells as benign or malignant.
2. Classifying credit card transactions as legitimate or fraudulent.
3. Classifying secondary structures of protein as alpha helix, beta sheet, or random coil.
4. Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification techniques:
1. Decision tree based methods
2. Rule based methods
3. Memory based reasoning
4. Neural networks
5. Naïve Bayes and Bayesian Belief networks
6. Support vector machines

Decision Tree

Decision trees are used in statistics, machine learning and data mining. A decision tree is a predictive model that maps observations about a data item to a conclusion about its target value. Leaves represent class labels and branches represent conjunctions of features that lead to those class labels. A decision tree does not describe data or decisions; it simply performs the classification. It generates rules that are easy for humans to understand and that can be used to look up a record in a database. These rules provide model transparency. Rules have two properties, support and confidence, which help to rank the rules and predict the output.

Figure: Example of a decision tree (node and leaf labels shown: Pain, Abdomen, Throat, Chest, None, Appendicitis, Heart attack, Fever, Cough, Yes, No, Flu, Strep, Cold)

The complexity of a tree is measured using one of the following metrics: total number of leaves, total number of nodes, number of attributes used, or depth of the tree. Decision tree algorithms fall into two groups: top down approaches and bottom up approaches. ID3 and C4.5 are top down approaches. C4.5 has two phases, a growing phase and a pruning phase. ID3 has only a growing phase. Both are greedy algorithms: at each node they choose the split that looks best at that point.

7. ID3 Algorithm

ID3 stands for Iterative Dichotomiser 3. It is the precursor of the C4.5 algorithm and was invented by Ross Quinlan. Given a set C of training elements, it proceeds as follows:

1. Create a root node. If all the elements in C are positive, create a "yes" node and stop. If all the elements in C are negative, create a "no" node and stop. Otherwise, select a feature F with values v1, ..., vn.
2. Divide the training elements in C into subsets C1, C2, ..., Cn according to the values of F.
3. Apply the algorithm recursively to each subset Ci.

To select the feature for a node, a selection heuristic is needed. ID3 uses a greedy search to select the best available attribute. If the chosen attribute classifies the training elements perfectly, the algorithm stops; otherwise it repeats on the subsets until the stopping condition is satisfied.

Data Description
The data used by ID3 must meet the following requirements:
1. Attribute value description.
2. Predefined classes.
3. Discrete classes.

ID3 decides on the best attribute using a statistical property called information gain. Gain measures how well an attribute separates the training examples into the target classes. The attribute with the highest information gain is selected. Gain is defined in terms of entropy, a concept from information theory that measures the amount of information in an attribute. Given a collection S of c outcomes,

\[ \mathrm{Entropy}(S) = \sum_{I=1}^{c} -p(I)\,\log_2 p(I) \]

where p(I) is the proportion of S belonging to class I, the sum is over the c classes, and log2 is the logarithm to base 2. Note that S is not an attribute but the entire sample set.

Advantages of ID3 Algorithm
1. Understandable prediction rules are generated from the training data.
2. It builds the tree quickly.
3. It builds a short tree.
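The entropy formula and the recursive steps of Section 7 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not Quinlan's implementation. The gain formula Gain(S, A) = Entropy(S) - sum over values v of (|Sv| / |S|) Entropy(Sv) is the standard definition and is not written out in the text above. The function names, the record fields (protocol, flag, label) and the sample records are hypothetical, and the majority-vote fallback used when no attributes remain is an added assumption.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(S) = sum over classes of -p(I) * log2 p(I)."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the whole set minus the weighted entropy of its subsets."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder


def best_attribute(rows, attributes, target):
    """Greedy choice: the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


def id3(rows, attributes, target):
    """Grow a tree as nested dicts; leaves are class labels."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:          # all elements in one class: stop
        return labels[0]
    if not attributes:                 # assumption: fall back to majority class
        return Counter(labels).most_common(1)[0][0]
    best = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = id3(subset, remaining, target)
    return {best: branches}


# Hypothetical connection records, for illustration only.
rows = [
    {"protocol": "tcp", "flag": "SF", "label": "normal"},
    {"protocol": "tcp", "flag": "S0", "label": "attack"},
    {"protocol": "udp", "flag": "SF", "label": "normal"},
    {"protocol": "icmp", "flag": "SF", "label": "attack"},
    {"protocol": "tcp", "flag": "S0", "label": "attack"},
    {"protocol": "udp", "flag": "SF", "label": "normal"},
]

tree = id3(rows, ["protocol", "flag"], "label")
print(tree)
```

On this toy data the protocol attribute has the higher gain, so it becomes the root, and only the tcp branch needs a further split on flag.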
8. C4.5 Algorithm

C4.5 was developed by Quinlan. It builds decision trees from a set of training data using concepts from information theory. The training data is a set S = S1, S2, ... of already classified samples. Each sample Si is a p-dimensional vector (x1, x2, ..., xp), where xj represents an attribute of the sample. At each node of the tree, C4.5 chooses the attribute that most effectively splits the samples into subsets. The splitting criterion is the normalized information gain (gain ratio). The attribute with the highest normalized information gain is chosen to make the decision. To build the decision tree:

1. Check for base cases.
2. For each attribute a, find the normalized information gain from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node.

C4.5 can handle both continuous and discrete data, and it can handle missing attribute values. After the tree is built, C4.5 goes back through it for pruning. The newer version is C5.0.

9. KDD Cup Dataset

The KDD Cup dataset is a benchmark dataset used to evaluate intrusion detection methods. The raw training data is about 4 gigabytes of compressed TCP dump data from 7 weeks of network traffic, which was processed into about 5 million connection records. Two further weeks of test data yielded about 2 million connection records. Using this data set, each connection can be classified as either normal or an attack.

10. Weka Data Mining Tool

Weka (Waikato Environment for Knowledge Analysis) is machine learning software. It is free software available under the GNU General Public License. It is a collection of algorithms for data analysis and predictive modeling. It is easy to use. It is implemented entirely in the Java programming language and therefore runs on almost any platform.

11. Conclusion

Security is essential for protecting our files. Many attackers try to access files without authorization. For protecting data, decision tree algorithms are among the easier techniques to apply. In this paper, the ID3 and C4.5 algorithms are compared to find which gives the best results.
Of the two, C4.5 is better suited for intrusion detection because it handles both numeric and nominal data. The C4.5 algorithm is also easy to understand.
13. Author Biographies

Mr. V. JAIGANESH is working as an Assistant Professor in the Department of Computer Science, Dr. N.G.P. Arts and Science College, Coimbatore, Tamilnadu, India. He is pursuing his Ph.D. at Manonmaniam Sundaranar University, Tirunelveli, Tamilnadu, India. He completed his M.Phil in the area of Data Mining at Periyar University, and his postgraduate degrees, MCA and MBA, at Periyar University, Salem. He has presented and published a number of papers in reputed conferences and journals. He has about twelve years of teaching and research experience, and his research interests include Data Mining and Networking.

Dr. P. SUMATHI is working as an Assistant Professor in the PG & Research Department of Computer Science, Government Arts College, Coimbatore, Tamilnadu, India. She received her Ph.D. in the area of Grid Computing from Bharathiar University. She completed her M.Phil in the area of Software Engineering at Mother Teresa Women's University and received her MCA degree from Kongu Engineering College, Perundurai. She has published a number of papers in reputed journals and conferences. She has about sixteen years of teaching and research experience. Her research interests include Data Mining, Grid Computing and Software Engineering.

Ms. A. VINITHA is working as an Assistant Professor in the Department of Computer Science and Applications, Sasurie College of Arts & Science, Vijayamangalam, Erode, Tamilnadu, India. She is pursuing her M.Phil degree in the area of Data Mining under the guidance of Mr. V. JAIGANESH of Dr. N.G.P Arts & Science College, Coimbatore. She completed her MSc at Dr. N.G.P Arts & Science College, Coimbatore. She has attended many conferences and has 2 years of teaching experience. She is interested in Data Mining and Networking.