FUZZY DATA MINING AND GENETIC ALGORITHMS APPLIED TO INTRUSION DETECTION Susan M. Bridges (Bridges@cs.msstate.edu), Rayford B. Vaughn (vaughn@cs.msstate.edu). 23rd National Information Systems Security Conference, October 16-19, 2000
OUTLINE AI and Intrusion Detection; Intrusion Detection System Design; Fuzzy Logic and Data Mining; Fuzzy Association Rules; Fuzzy Frequency Episodes; Intrusion Detection via Fuzzy Data Mining; GAs for System Optimization; Summary and Future Work
AI TECHNIQUES AND INTRUSION DETECTION Long history of AI techniques applied to intrusion detection. For example: Rule-Based Expert Systems (Lunt and Jagannathan 1988); State Transition Analysis (Ilgun and Kemmerer 1995); Genetic Algorithms (Me 1998); Inductive Sequential Patterns (Teng, Chen and Lu 1990); Artificial Neural Networks (Debar, Becker, and Siboni 1992). Data mining applied to intrusion detection is an active area of research. Examples include: Lee, Stolfo, and Mok (1998); Barbara, Jajodia, Wu, and Speegle (2000).
UNIQUE FEATURES OF OUR WORK Combines fuzzy logic with data mining Overcomes sharp boundary problems of many systems Reduces false positive errors Can be used for both anomaly detection and misuse detection Includes real-time components Uses genetic algorithms for system optimization
FUZZY LOGIC AND SECURITY Many security-related features are quantitative, e.g., temporal statistical measurements (Porras and Valdes 1998; Lee and Stolfo 1998):
  SN: number of SYN flags in TCP headers during the last 2 seconds
  FN: number of FIN flags in TCP headers during the last 2 seconds
  RN: number of RST flags in TCP headers during the last 2 seconds
  PN: number of distinct destination ports during the last 2 seconds
[Figure: timeline of timestamps from 0.0 to 6.0 seconds, divided into the first 2 seconds, the second 2 seconds, the third 2 seconds, and so on.]
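The four windowed measurements can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the packet record layout `(timestamp, flags, dst_port)` and the function name `window_features` are hypothetical, and the sketch assumes consecutive, non-overlapping 2-second windows as the timeline suggests.

```python
from collections import defaultdict

WINDOW_SECONDS = 2.0  # window length stated on the slide


def window_features(packets, window=WINDOW_SECONDS):
    """Return {window_index: {"SN": ..., "FN": ..., "RN": ..., "PN": ...}}.

    `packets` is an iterable of (timestamp, flags, dst_port) tuples, where
    `flags` is a set of TCP flag names such as {"SYN", "ACK"}.
    This record layout is an assumption made for the example.
    """
    counts = defaultdict(lambda: {"SN": 0, "FN": 0, "RN": 0})
    ports = defaultdict(set)
    for timestamp, flags, dst_port in packets:
        index = int(timestamp // window)      # which 2-second window
        counts[index]["SN"] += "SYN" in flags  # bool adds as 0 or 1
        counts[index]["FN"] += "FIN" in flags
        counts[index]["RN"] += "RST" in flags
        ports[index].add(dst_port)             # distinct destination ports
    return {i: {**counts[i], "PN": len(ports[i])} for i in sorted(counts)}


# Made-up packets, purely for illustration.
packets = [
    (0.1, {"SYN"}, 80),
    (0.5, {"SYN", "ACK"}, 443),
    (1.9, {"FIN"}, 80),
    (2.2, {"RST"}, 22),
    (3.0, {"SYN"}, 23),
]
features = window_features(packets)
```

Each resulting record (one per window) is the kind of quantitative input that the fuzzy sets are later applied to.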
FUZZY LOGIC ALLOWS OVERLAPPING CATEGORIES [Figure: two panels plotting the categories Low, Medium, and High (vertical axis 0 to 1) against the number of destination ports (0 to 30). Panel a: non-fuzzy sets. Panel b: fuzzy sets.]
FUZZY LOGIC OVERCOMES SHARP BOUNDARY PROBLEMS [Figure: points A and B on an axis marked 0, x1, x2, x3, x4, ..., shown once against a single Medium category and once against the categories LOW, MEDIUM, and HIGH.] If Medium is the normal pattern, then without fuzzy sets, A and B are both outside of the normal pattern. Fuzzy logic allows degrees of normality.
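The sharp boundary problem can be illustrated with a small sketch. All numbers here are made up for the example (a crisp Medium interval of 10 to 20 and a trapezoidal fuzzy Medium set), and the trapezoid is only a convenient stand-in, not necessarily the membership shape the authors use.

```python
def crisp_medium(x, low=10, high=20):
    """Non-fuzzy set: a value is either Medium (1.0) or not (0.0)."""
    return 1.0 if low <= x <= high else 0.0


def fuzzy_medium(x, a=5, b=10, c=20, d=25):
    """Trapezoidal fuzzy set: rises on [a, b], is 1 on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# A and B sit just outside the crisp Medium interval (hypothetical values).
A, B = 9, 21
crisp_degrees = (crisp_medium(A), crisp_medium(B))
fuzzy_degrees = (fuzzy_medium(A), fuzzy_medium(B))
```

With the crisp set, A and B are simply "not Medium"; with the fuzzy set, both remain mostly Medium, which is the notion of degrees of normality.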
INTELLIGENT INTRUSION DETECTION MODEL Integration of fuzzy logic with data mining Fuzzy association rules Fuzzy frequency episodes Preliminary architecture Includes both misuse detection and anomaly detection Integrates machine-level and network-level information Optimization using genetic algorithms
[Figure: preliminary architecture diagram. Components shown: Network Traffic or Audit Data sources (1) to (m); Background Unit; Machine Learning Component (by mining fuzzy association rules and fuzzy frequency episodes); Experts; Core Component; Intrusion Detection Modules for Anomaly Detection and Misuse Detection; Decision-Making Module; Administrator; Communication Module; Server; Clients; Intrusion Detection Sentries 1 to s, each on a Host or Network Device.]
MINING FUZZY ASSOCIATION RULES Association rules represent commonly found patterns in data. Association rule format: R1: X → Y, c, s. X and Y are disjoint sets of items; s (support) tells how often X and Y co-occur in the data; c (confidence) tells how often Y is associated with X. Our system is unique: X and Y are fuzzy variables that take fuzzy sets as values.
FUZZY ASSOCIATION RULES Sample fuzzy association rule: { SN=LOW, FN=LOW } → { RN=LOW }, c = 0.924, s = 0.49. Interpretation: SN, FN, and RN are fuzzy variables. The pattern { SN=LOW, FN=LOW, RN=LOW } occurred in 49% of the training cases. When the pattern { SN=LOW, FN=LOW } occurs, there is a 92.4% probability that { RN=LOW } occurs at the same time.
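Support and confidence for a fuzzy rule can be sketched as follows. The slides do not say how membership degrees are combined, so this sketch assumes one common choice: a record's contribution to an itemset is the product of its membership degrees, averaged over all records. The function names and the sample records are hypothetical.

```python
from math import prod


def fuzzy_support(records, itemset):
    """Mean, over all records, of the product of membership degrees."""
    total = sum(prod(record[item] for item in itemset) for record in records)
    return total / len(records)


def fuzzy_confidence(records, antecedent, consequent):
    """support(X and Y) / support(X); 0.0 when X never occurs."""
    support_x = fuzzy_support(records, antecedent)
    if support_x == 0:
        return 0.0
    return fuzzy_support(records, antecedent + consequent) / support_x


# Each record maps (variable, fuzzy set) to a membership degree in [0, 1].
# The degrees below are made up for illustration.
records = [
    {("SN", "LOW"): 1.0, ("FN", "LOW"): 1.0, ("RN", "LOW"): 1.0},
    {("SN", "LOW"): 0.5, ("FN", "LOW"): 1.0, ("RN", "LOW"): 0.0},
    {("SN", "LOW"): 0.0, ("FN", "LOW"): 0.5, ("RN", "LOW"): 1.0},
    {("SN", "LOW"): 0.5, ("FN", "LOW"): 1.0, ("RN", "LOW"): 1.0},
]
X = [("SN", "LOW"), ("FN", "LOW")]
Y = [("RN", "LOW")]
s = fuzzy_support(records, X + Y)    # support of the whole pattern
c = fuzzy_confidence(records, X, Y)  # how often Y accompanies X
```

Because each record contributes a degree rather than a 0 or 1, a value near a category boundary changes the support gradually instead of flipping it.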
SAMPLE FUZZY FREQUENCY EPISODE RULES { E1: PN=LOW, E2: PN=MEDIUM } → { E3: PN=MEDIUM }, c = 0.854, s = 0.108, w = 10 seconds. E1, E2, and E3 are events occurring within a time window of 10 seconds. PN is a fuzzy variable. The events occur in the order E1, E2, E3, but there may also be intervening events. The episode { PN=LOW, PN=MEDIUM, PN=MEDIUM } occurred in 10.8% of all training cases. When { PN=LOW, PN=MEDIUM } occurs, { PN=MEDIUM } follows with 85.4% probability.
FUZZY DATA MINING FOR INTRUSION DETECTION Modification of non-fuzzy methods developed by Lee, Stolfo, and Mok (1998) Anomaly Detection Approach Mine a set of fuzzy association rules from data with no anomalies. When given new data, mine fuzzy association rules from this data. Compare the similarity of the sets of rules mined from new data and normal data.
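The comparison step can be sketched as below. The slides do not state the similarity formula, so everything here is an assumption: two rules with the same antecedent and consequent are scored by the larger relative difference in confidence or support, and the set-level score multiplies the matched fraction of each rule set. The names `rule_similarity` and `rule_set_similarity` are hypothetical.

```python
def rule_similarity(rule_a, rule_b):
    """Similarity in [0, 1] of two rules given as (confidence, support).

    Assumes the baseline rule has positive confidence and support,
    which holds for any rule that was actually mined.
    """
    (c1, s1), (c2, s2) = rule_a, rule_b
    return max(0.0, 1.0 - max(abs(c1 - c2) / c1, abs(s1 - s2) / s1))


def rule_set_similarity(baseline, observed):
    """Compare two rule sets keyed by (antecedent, consequent)."""
    if not baseline or not observed:
        return 0.0
    shared = baseline.keys() & observed.keys()  # rules present in both sets
    total = sum(rule_similarity(baseline[k], observed[k]) for k in shared)
    return (total / len(baseline)) * (total / len(observed))


# Made-up rule sets for illustration.
baseline = {
    ("SN=LOW,FN=LOW", "RN=LOW"): (0.9, 0.5),
    ("SN=LOW", "FN=LOW"): (0.8, 0.4),
}
observed = {
    ("SN=LOW,FN=LOW", "RN=LOW"): (0.9, 0.5),
    ("PN=HIGH", "RN=HIGH"): (0.7, 0.2),
}
same = rule_set_similarity(baseline, baseline)
different = rule_set_similarity(baseline, observed)
```

A low score means the new data produced rules unlike those mined from normal data, which is the signal used for anomaly detection.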
[Figure: similarities between the training data set and different test data sets, obtained by mining fuzzy association rules on SN, FN, and RN.]
Test data set   Similarity
T1              0.773
T2              0.795
T3              0.775
T4              0.564
T5              0.513
T6              0.288
T7              0.1
T8              0.116
T9              0.0804
Training data collected in the afternoon. T1-T3: afternoon; T4-T6: evening; T7-T9: late night. Data source: MSU CS network.
[Figure: similarities between the training data set and different test data sets, obtained by mining fuzzy frequency episodes on PN.]
Test data set   Similarity
T1              0.681
T2              0.883
T3              0.89
T4              0.207
T5              0.171
T6              0.0645
T7              7.37E-05
T8              8.91E-05
T9              1.67E-05
Training data collected in the afternoon. T1-T3: afternoon; T4-T6: evening; T7-T9: late night. Data source: MSU CS network.
[Figure: two bar charts of similarities between the training data set and different test data sets, one obtained by mining fuzzy association rules on SN, FN, and RN, the other by mining fuzzy frequency episodes on PN.]
Test data set   Fuzzy association rules (SN, FN, RN)   Fuzzy frequency episodes (PN)
Baseline        0.744                                  0.885
Network1        0.309                                  0
Network3        0.315                                  0.000155
Training data: no intrusions. Test data: baseline (no intrusions); network1 (includes simulated IP spoofing intrusions); network3 (includes simulated port scanning intrusions).
REAL-TIME INTRUSION DETECTION Given a fuzzy episode rule R: { e1, ..., ek-1 } → { ek }, c, s, w: if { e1, ..., ek-1 } has occurred in the current event sequence, then { ek } can be predicted to occur next with confidence c. If the next event does not match any prediction from the rule set, it is flagged as an anomaly. Define anomaly percentage = number of anomalies / number of events.
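The prediction check and the anomaly percentage can be sketched as follows. This is a simplified illustration under stated assumptions: events are plain labels, a rule's antecedent must match the most recent events contiguously (the slides also allow intervening events within the window w, which is ignored here), and scoring starts only once enough history exists for the longest antecedent. The function names and the `min_confidence` parameter are hypothetical.

```python
def matches(events, position, antecedent):
    """True if `antecedent` equals the events immediately before `position`."""
    start = position - len(antecedent)
    return start >= 0 and tuple(events[start:position]) == tuple(antecedent)


def anomaly_percentage(events, rules, min_confidence=0.0):
    """Fraction of scored events that no rule predicted.

    `rules` is a list of (antecedent, consequent, confidence) tuples.
    """
    usable = [r for r in rules if r[2] >= min_confidence]
    warmup = max((len(a) for a, _, _ in usable), default=0)
    scored = range(warmup, len(events))
    if not scored:
        return 0.0
    anomalies = 0
    for position in scored:
        predicted = {c for a, c, _ in usable if matches(events, position, a)}
        anomalies += events[position] not in predicted  # bool adds as 0 or 1
    return anomalies / len(scored)


# Made-up rules and event stream over the fuzzy values of PN.
rules = [
    (("LOW", "MEDIUM"), "MEDIUM", 0.854),
    (("MEDIUM", "MEDIUM"), "MEDIUM", 0.9),
    (("MEDIUM", "MEDIUM"), "LOW", 0.6),
]
events = ["LOW", "MEDIUM", "MEDIUM", "MEDIUM", "HIGH", "LOW"]
percentage = anomaly_percentage(events, rules)
```

In this example four events are scored; "HIGH" and the final "LOW" are not predicted by any rule, so half of the scored events are flagged.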
Experiments and Results [Figure: anomaly percentages of different test data sets in real-time intrusion detection, obtained by mining fuzzy frequency episodes on PN.]
Test data set   Anomaly percentage
T1'             8.99%
T2'             9.55%
T3'             7.30%
T4'             25.60%
T5'             33.71%
T6'             32.39%
Training data: no intrusions. Test data: T1-T3 no intrusions; T4-T6 simulated mscan.
FUZZY VS. NON-FUZZY Comparing the false positive error rates of fuzzy episode rules with non-fuzzy versions for real-time intrusion detection.
Test data set   Fuzzy    Non-Fuzzy
T1'             8.99%    17.98%
T2'             9.55%    12.92%
T3'             7.30%    15.25%
T4'             4.44%    11.11%
T5'             6.67%    17.78%
T6'             8.89%    11.11%
USING GENETIC ALGORITHMS FOR OPTIMIZATION Problem with NIDS: the system uses a fixed set of features for all kinds of situations, and fuzzy membership functions must be predefined. Hypothesis: different features may be useful for different classes of intrusion attacks and for different situations. The performance of the system can be improved by using a GA to evolve an optimal set of features and fuzzy membership functions.
GENETIC ALGORITHMS APPROACH Optimization goals: maximize the similarity of rules mined from normal data with the baseline rule set; minimize the similarity of rules mined from abnormal data with the baseline rule set. Parameters to change: features available from audit data; fuzzy membership function parameters.
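The two optimization goals can be folded into a single fitness value and handed to a small GA, as sketched below. This is a toy, not the authors' algorithm: the fitness is assumed to be the difference of the two similarities, candidates are lists of real-valued genes in [0, 1] (which could stand for membership function parameters or feature-selection weights), and the two similarity functions in the demo are invented placeholders rather than rule mining.

```python
import random


def fitness(candidate, similarity_to_normal, similarity_to_abnormal):
    """Higher is better: reward similarity on normal data, penalize it on abnormal data."""
    return similarity_to_normal(candidate) - similarity_to_abnormal(candidate)


def evolve(score, n_genes, generations=30, population_size=20, rng=None):
    """Tiny elitist GA over real-valued genes in [0, 1]; returns the best candidate."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    population = [[rng.random() for _ in range(n_genes)]
                  for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: population_size // 2]  # survivors are kept unchanged
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes) if n_genes > 1 else 0
            child = mother[:cut] + father[cut:]       # one-point crossover
            i = rng.randrange(n_genes)                # mutate one gene, clamped to [0, 1]
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        population = parents + children
    return max(population, key=score)


# Placeholder similarity functions, invented for the demo only.
def toy_normal(genes):
    return 1.0 - abs(genes[0] - 0.3)


def toy_abnormal(genes):
    return abs(genes[1] - 0.8)


def score(genes):
    return fitness(genes, toy_normal, toy_abnormal)


best = evolve(score, n_genes=2)
```

In a real system, evaluating one candidate would mean mining rules with that candidate's features and membership functions and comparing them with the baseline rule set, which is far more expensive than the placeholders used here.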
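FUZZY SETS ARE DEFINED BY PARAMETERIZED MEMBERSHIP FUNCTIONS [Figure: Z, S, and PI membership curves plotted on X and Y axes, with the parameters a', c', a, c, b, and d marked on the X axis.]

$$
S(\mu, a, c) =
\begin{cases}
0 & \mu \le a \\
2\left(\dfrac{\mu - a}{c - a}\right)^{2} & a < \mu \le \dfrac{a + c}{2} \\
1 - 2\left(\dfrac{\mu - c}{c - a}\right)^{2} & \dfrac{a + c}{2} < \mu \le c \\
1 & c < \mu
\end{cases}
$$

$$
Z(\mu, a', c') = 1 - S(\mu, a', c')
$$

$$
PI(\mu, b, d) =
\begin{cases}
S(\mu, b - d, b) & \mu \le b \\
Z(\mu, b, b + d) & b < \mu
\end{cases}
$$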
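The Z, S, and PI functions named on this slide can be sketched in code. The sketch assumes the standard quadratic S-shaped curve, Z as its complement, and PI built from an S half and a Z half, with `b` taken as the PI center and `d` as its half-width; that parameter reading is an assumption, and `x` stands for the input value.

```python
def s_function(x, a, c):
    """S-shaped curve: 0 at or below a, 1 above c, smooth in between (requires a < c)."""
    if x <= a:
        return 0.0
    if x > c:
        return 1.0
    midpoint = (a + c) / 2
    if x <= midpoint:
        return 2 * ((x - a) / (c - a)) ** 2
    return 1 - 2 * ((x - c) / (c - a)) ** 2


def z_function(x, a, c):
    """Z-shaped curve: the complement of the S-shaped curve."""
    return 1 - s_function(x, a, c)


def pi_function(x, b, d):
    """Bell-shaped curve: rises on [b - d, b], falls on [b, b + d] (requires d > 0)."""
    if x <= b:
        return s_function(x, b - d, b)
    return z_function(x, b, b + d)


# Hypothetical fuzzy sets for a feature such as FN (parameter values made up).
def low(x):
    return z_function(x, 0, 10)


def medium(x):
    return pi_function(x, 10, 5)


def high(x):
    return s_function(x, 10, 20)
```

Each fuzzy set is fully described by two numbers, which is what makes the membership functions convenient to tune automatically.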
EXAMPLE RUN Membership functions before and after the GA optimization. [Figure: two panels, "FN Original Membership Functions" and "FN Learned Membership Functions", each plotting Membership (0 to 1) against Number (1 to 46) for the low, medium, and high fuzzy sets.]
OPTIMIZATION RESULTS [Figure: bar chart of Similarity to Normal (axis 0 to 1.2) for baseline, network1, and network3, comparing pre-selected features and fuzzy membership functions with optimized features and fuzzy membership functions.] Test data: baseline (no intrusions); network1 (includes simulated IP spoofing intrusions); network3 (includes simulated port scanning intrusions).
FEATURES SELECTED FOR IP SPOOFING AND PORT SCANNING ATTACKS
IP Spoofing. Features selected: Source IP, FIN, Data Size, Port number. Luo's features: SYN, FIN, RN.
Port Scanning. Features selected: Source IP, Destination IP, Source Port, and Data size. Luo's features: SYN, FIN, RN.
CONCLUSIONS Developed an architecture for integrating machine learning methods with other intrusion detection methods. Extended data mining techniques by integrating fuzzy logic. Demonstrated that these methods are superior to their non-fuzzy counterparts. Developed a method for real-time intrusion detection using fuzzy frequency episodes. Used GAs to improve the performance of the system by selecting the best set of features and by tuning the fuzzy membership function parameters.
Current and Future Work Further work with fuzzy frequency episodes and real-time intrusion detection Using fuzzy logic for data fusion by the decision module Generating misuse modules from association rules Using incremental data mining to deal with drift in normality Investigating intrusion detection in high speed clusters of workstations