Network Payload-based Anomaly Detection and Content-based Alert Correlation. Ke Wang


Network Payload-based Anomaly Detection and Content-based Alert Correlation

Ke Wang

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY
2006

© 2006 Ke Wang
All Rights Reserved

ABSTRACT

Network Payload-based Anomaly Detection and Content-based Alert Correlation

Ke Wang

Every computer on the Internet is nowadays a potential target for a new attack at any moment. The pervasive use of signature-based anti-virus scanners and misuse detection Intrusion Detection Systems has failed to provide adequate protection against a constant barrage of zero-day attacks. Such attacks may cause denial of service, system crashes, or information theft resulting in the loss of critical information. In this thesis, we consider the problem of detecting these zero-day intrusions quickly and accurately upon their very first appearance.

Most current Network Intrusion Detection Systems (NIDS) use simple features, like packet headers and derived statistics describing connections and sessions (packet rates, bytes transferred, etc.), to detect unusual events that indicate a system is likely under attack. These approaches, however, are blind to the content of the packet stream, and in particular to the packet content delivered to a service that contains the data and code exploiting the vulnerable application software. We conjecture that fast and efficient detectors that focus on network packet content anomaly detection will improve defenses and identify zero-day attacks far more accurately than approaches that consider only header information.

We therefore present two payload-based anomaly detectors, PAYL and Anagram, for intrusion detection. They are designed to detect attacks that are otherwise normal connections except that the packets carry bad (anomalous) content indicative of a new exploit. These payload-based anomaly sensors can augment other sensors and enrich the view of network traffic to detect malicious events. Both PAYL and Anagram create models of site-specific normal network application payload as n-grams in a fully automatic, unsupervised and very efficient fashion. PAYL computes, during a training phase, a profile of the byte (1-gram) frequency distribution, and its standard deviation, of the application payload flowing to a single host and port. PAYL produces a very fine-grained model that is conditioned on payload length. Anagram models higher-order n-grams (n > 1), which capture the sequential information between bytes. We experimentally demonstrate that both of these sensors are capable of detecting new attacks with high accuracy and low false positive rates.

Furthermore, in order to detect the very early onset of a worm attack, we designed an ingress/egress correlation function, built into the sensors, to quickly identify the worm's initial propagation. The sensors are also designed to generate robust signatures of validated malicious packet content. The technique does not depend upon the detection of probing or scanning behavior or the prevalence of common probe payload, so it is especially useful for the detection of slow and stealthy worms.

An often-cited weakness of anomaly detection systems is that they suffer from mimicry attacks: clever adversaries may craft attacks that appear normal to an anomaly detector and hence will go unnoticed as false negatives. A mimicry attack against a site defended by a content-based anomaly detector can be executed by sniffing the target site's traffic flow, modeling the byte distributions of that flow, and blending the exploit with normal-appearing byte padding. To defend against such attacks, we further propose the techniques of randomized modeling and randomized testing. Under randomized modeling/testing, each sensor randomly partitions the payload into several subsequences, each of which is modeled/tested separately, thus building a model/test diversity on each host that is unknown to the mimicry attacker.
This raises the bar for attackers, as they have no means of knowing how and where to pad the exploit code to appear normal within each randomly computed partition, even if they have global knowledge of the target site's content flow.

Finally, PAYL/Anagram's speed and high detection rate make them valuable not only as stand-alone network-based sensors, but also as host-based data-flow classifiers in an instrumented, fault-tolerant host-based environment; this enables significant cost amortization and the possibility of a symbiotic feedback loop that can improve accuracy and reduce false positive rates over time.

Besides building stand-alone anomaly sensors, we also demonstrate a collaborative security strategy whereby different hosts may exchange payload alerts to increase the accuracy of the local sensor and reduce false positives. We propose and examine several new approaches to enable the sharing of suspicious payloads via privacy-preserving technologies. We detail the work we have done with PAYL and Anagram to support generalized payload correlation and signature generation without releasing identifiable payload data. The important principle demonstrated is that correlation of multiple alerts can identify true positives from the set of anomaly alerts, reducing incorrect decisions and producing accurate mitigation against zero-day attacks.

A new wave of cleverly crafted polymorphic attacks has substantially complicated the task of automatically generating string-based signatures to filter newly discovered zero-day attacks. Although the payload anomaly detection techniques we present are able to detect these attacks, correlating the individual packet contents delivering distinct instances of the same polymorphic attack is shown to have limited value, requiring new approaches for generating robust signatures.
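To make the two models concrete, here is a minimal, illustrative Python sketch of both ideas: a length-conditioned 1-gram centroid scored with a simplified Mahalanobis-style distance (the PAYL idea), and a binary higher-order n-gram model backed by a Bloom filter (the Anagram idea). The class names, the smoothing constant, and the Bloom filter parameters are illustrative choices, not the thesis implementation.

```python
import hashlib
from collections import defaultdict

ALPHA = 0.001  # smoothing factor to avoid division by zero (an assumed value)

def byte_freq(payload: bytes):
    """Relative frequency of each of the 256 byte values (the 1-gram model)."""
    counts = [0] * 256
    for b in payload:
        counts[b] += 1
    n = max(len(payload), 1)
    return [c / n for c in counts]

class PaylModel:
    """Toy PAYL: one centroid per (port, payload length) of mean/std byte frequencies."""
    def __init__(self):
        self.samples = defaultdict(list)  # (port, length) -> list of frequency vectors

    def train(self, port: int, payload: bytes):
        self.samples[(port, len(payload))].append(byte_freq(payload))

    def score(self, port: int, payload: bytes) -> float:
        """Simplified Mahalanobis-style distance to the matching centroid."""
        vecs = self.samples[(port, len(payload))]
        if not vecs:
            return float("inf")  # no model for this length: maximally suspicious
        n = len(vecs)
        mean = [sum(v[i] for v in vecs) / n for i in range(256)]
        std = [(sum((v[i] - mean[i]) ** 2 for v in vecs) / n) ** 0.5 for i in range(256)]
        x = byte_freq(payload)
        return sum(abs(x[i] - mean[i]) / (std[i] + ALPHA) for i in range(256))

class BloomFilter:
    """Minimal Bloom filter: k hashes over an m-bit array.
    False positives are possible; false negatives are not."""
    def __init__(self, m=2**20, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _hashes(self, item: bytes):
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes):
        for h in self._hashes(item):
            self.bits[h // 8] |= 1 << (h % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(item))

def ngrams(payload: bytes, n: int):
    return [payload[i:i + n] for i in range(len(payload) - n + 1)]

class AnagramModel:
    """Toy Anagram: binary model of observed n-grams; score = fraction never seen."""
    def __init__(self, n=5):
        self.n, self.seen = n, BloomFilter()

    def train(self, payload: bytes):
        for g in ngrams(payload, self.n):
            self.seen.add(g)

    def score(self, payload: bytes) -> float:
        grams = ngrams(payload, self.n)
        if not grams:
            return 0.0
        return sum(g not in self.seen for g in grams) / len(grams)
```

Trained on a handful of ASCII requests, a sled-like binary payload scores far higher than a fresh but structurally similar request; the real sensors train on large traffic traces and compare scores against calibrated thresholds.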

Contents

1 Introduction
   Problem Statement and Our Approach
   Thesis Contributions
   Thesis Outline
2 Related Work
   Network Anomaly Detection
   Worm Detection and Signature Generation
   Polymorphic Worms, Mimicry and Learning Attacks
   Privacy-preserving Correlation
3 PAYL: 1-gram Payload Modeling and Anomaly Detection
   Length Conditioned n-gram Payload Model
   Simplified Mahalanobis Distance
   Learning Issues
      Incremental Learning
      Reduced Model Size by Clustering
      Unsupervised Learning
   Z-String
   Finer-grained Modeling: Multiple Centroids
   Detection Evaluation
      Experiments with 1999 DARPA IDS Data Set
      Experiments with CUCS Data Set
   Summary
4 Anagram: Higher Order n-gram Payload Modeling and Anomaly Detection
   Higher Order N-gram Payload Model
   Model Size and Bloom Filters
      Memory overhead
      Computation overhead
   Discussion
   Implementation
   Summary
5 Randomization against Mimicry Attack
   Mimicry Attack
   Anagram against Mimicry Attack
   Randomization
      Randomized Modeling
      Randomized Testing
   Threshold reduction and extreme padding
   Summary
6 Learning Strategies
   Epoch-based Learning
   Semi-supervised Learning
   Adaptive Learning
      Training Attacks versus Mimicry Attacks
      Feedback-based learning and filtering using instrumented shadow servers
      Adaptive model training with shadow servers
   Summary
7 Content-based Alerts Correlation and Signature Generation
   Ingress/Egress Correlation for Worm Propagation
      Ingress and Egress Traffic Correlation
      Evaluation
      Buffer Size vs. Stealthiness
   Data Diversity across Sites
   Anomalous Payload Correlation among Sites
      Evaluating correlation techniques
      Alert correlation techniques
         Baseline: Raw payload correlation
         Frequency-modeled 1-gram alert correlation
         Binary-modeled n-gram alert correlation
         Similarity Score
      Testing results with real traffic
   Cross-Domain Alert Correlation Evaluation
   Signature Generation
      Ingress/Egress Correlation for Worm Signature Generation
      Correlation among Sites for Signature Generation
      Accuracy of Generated Signatures
      Signature for Polymorphic Worms
   Summary
8 Conclusion
   Summary
   Thesis Contributions
   Future Work
   Closing Remarks
A Pseudo Code for PAYL and Anagram
   A.1 Pseudo Code for Anagram
   A.2 Pseudo Code for PAYL
Bibliography

List of Figures

3.1 Example byte distributions for different ports. For each plot, the X-axis is the ASCII byte 0-255, and the Y-axis is the average byte frequency.
3.2 Example byte distribution for different payload lengths for port 80 on the same host server.
3.3 Raw packet of CRII; only the first 301 bytes are shown for brevity.
3.4 Example of CodeRed II packet (in Figure 3.3), and its payload distribution against the normal traffic at the same packet length.
3.5 The average relative frequency of each byte, and the standard deviation of the frequency of each byte, for payload length 185 of port
3.6 Payload distribution for the CRII packet (in Figure 3.3) appears in the top plot, re-ordered to the rank-ordered count frequency distribution in the bottom plot.
3.7 The signature Z-String computed from the CRII packet (in Figure 3.3). The top numerical part is the ASCII representation of the Z-String, with the middle omitted for brevity. The lower byte sequence is the byte representation of the Z-String, while the non-printable character is displayed as
3.8 One-pass clustering algorithm.
3.9 Merge two cluster sets.
3.10 First six centroids for a dataset at length. Upper plot: one-pass online clustering; lower plot: semi-batched one-pass online clustering (batch size 50). The number under each subplot is the number of packets in that centroid during training.
3.11 ROC curves for ports 21, 23, 25, 80 for the five different models. Note that the x-axis scale is different for each plot and does not span to 100%, but is limited to the worst false positive rate for each plot.
3.12 ROC of PAYL detecting incoming worms, false positive rate restricted to less than 0.5%.
4.1 General architecture of the Anagram sensor, the training phase.
4.2 General architecture of the Anagram sensor, the detection phase.
4.3 ROC curves comparing the frequency-based and binary-based n-gram approaches.
4.4 Inserting n-grams (n=5) into a Bloom filter.
5.1 The structure of the blended exploit buffer. The variable parts depend on the exploit used. The buffer may be split into several packets by the network stack when transmitted. The Maximum Segment Size (MSS) on our system was 1460, so each packet above, including headers, was no larger than 1460 bytes.
5.2 Comparison of frequency distributions of attack packet (unpadded left and padded right) and normal port 80 traffic. The padded worm packet matches the normal traffic well.
5.3 Comparison of the payload model computed using the global traffic and local partial traffic. The upper plot shows the global model computed using all the traffic that web server W received, and the lower one gives the local model from observing the traffic to server W originating from several local IP addresses, for payload length
5.4 Randomized Modeling.
5.5 Payload distribution examples for randomized modeling. The top subplot is the byte distribution using the whole packet, and the bottom two subplots are for each of the two random sub-partitions.
5.6 Randomized Testing.
5.7 The average false positive rate and standard deviation with 100% detection rate for randomized testing of Anagram, with normal training (left) and semi-supervised training (right).
6.1 The pseudo code of epoch-based training.
6.2 Evolution of stability metrics of epoch-based training for PAYL. The left plot gives the number of centroids over epochs, and the right one gives the sum of Manhattan distances between corresponding centroids after each new epoch.
6.3 The likelihood of seeing new n-grams as training time increases.
6.4 False positive rate (with 100% detection rate) as training time increases.
6.5 Distribution of bad content scores for normal packets (left) and attack packets (right).
6.6 The false positive rate (with 100% detection rate) for different n-grams, under both normal and semi-supervised training.
6.7 Shadow server architecture.
7.1 The buffer size (in number of anomalous data units) against the number of alerts the worm needs to wait before spreading to avoid detection, for 5 different worms.
7.2 Example byte distribution for payload length 536 of port 80 for the three sites EX, W, W1, in order from top to bottom.
7.3 Similarity score comparison of 80 random pairs of good-vs-good alerts.
7.4 Methods comparison. The correlation methods are, from 1 to 8: Raw-LCS; Raw-LCSeq; Raw-ED; Frequency-MD; Zstr-LCS; Zstr-LCSeq; Zstr-ED; N-grams with n =
7.5 The initial portion of the generated signature for CodeRed II by LCS on raw payload.
7.6 The initial portion of the raw packet of CRII; only the first 301 bytes are shown for brevity.
7.7 Frequency distribution for the CRII packet.
7.8 First 20 bytes of the Z-String computed from the CRII packet.
7.9 Generated 5-gram signature from the CRII packet; only the first 172 bytes are shown for brevity.
7.10 The cumulative frequency of the signature match scores computed by matching normal traffic against different worm signatures. The first plot shows the signatures of the phpbb worm, and the second plot those of the CodeRedII worm.

List of Tables

3.1 Detection performance comparison for sorted and unsorted centroids.
3.2 Overall detection rate of each model when the false positive rate is lower than 1%.
3.3 Speed and memory measurements of each model. The training and testing time is in units of seconds per 100M of data, including I/O time. Memory consumption is measured in the number of centroids kept after clustering or learning.
3.4 False positive rate of PAYL using different modeling algorithms on datasets W, W1 and EX, when detecting all the worms.
4.1 The percentage of observed unique n-grams for different frequencies of occurrence for 90 hours of training traffic.
4.2 The false positive rate (%) of the two approaches using different n-grams when achieving a 100% detection rate, www1-06 train/test dataset.
4.3 Hit rate of different cache strategies for Anagram speed-up.
5.1 The maximum possible padding length for a packet, for different varieties of the mimicry attack.
5.2 The detection performance of PAYL with randomized testing on the mimicry attack designed to target it.
7.1 Different fragmentation for CodeRed and CodeRed II.
7.2 Results of ingress/egress correlation for different metrics.
7.3 The detailed numbers for Figure 7.1. The number in parentheses following each worm name is its anomaly score.
7.4 For each pair of sites, the 3 packet lengths with the largest Manhattan distance between distributions.
7.5 The number of unique 5-grams in datasets W, W1 and EX, and the number of common 5-grams between each pair of sites.
7.6 Manhattan distance from Raw-LCSeq; lower is better.
7.7 The similarity scores between different versions of shellcode 2 of the CLET polymorphic engine, as average and standard deviation.

ACKNOWLEDGEMENTS

First of all, I would like to express my sincere appreciation to my advisor, Prof. Salvatore J. Stolfo, for his invaluable guidance and support throughout my whole PhD study. His enthusiasm, broad knowledge and deep insights in computer science have made the research experience exciting and intriguing, and his patience and humor made my graduate study a fun and enjoyable journey. I also thank Prof. Angelos Keromytis, Prof. Vishal Misra, Prof. Moti Yung, and Dr. Niels Provos for serving on my committee and providing helpful suggestions.

My special thanks to two of my main collaborators and best friends: Janak J. Parekh and Gabriela Cretu. They are always helpful whenever I have questions, not only on research but also on other matters, and their enthusiasm and dedication to science always inspire me. The collaboration with them was truly a joyful experience.

Many thanks to the members of the IDS and network security group, past and present, for their help over the past four years. In particular, I would like to thank Dr. Shlomo Hershkop, Vanessa Frias-Martinez, Wei-Jen Li, and Michael Locasto for their collaboration in my work. I would also like to thank Angelos Stavrou, Stelios Sidiroglou, Phil Gross, and many other people for their helpful discussions about my research.

Finally, my most heartfelt thanks go to my parents and my husband. Their constant love and support made my graduate studies easier and much more joyful.

This thesis is dedicated to my dear mom and dad.

Chapter 1

Introduction

Computers and the Internet have become a vital part of our everyday life; at the same time, they have often also become targets of malicious attackers. Attacks can destroy files, wipe the hard disk, or even bring down part of the Internet if the routing infrastructure is targeted, or if the attack succeeds in reaching a broad collection of hosts with common vulnerabilities. As new attacks appear every day, it is crucial to find effective ways to detect these zero-day attacks and defend networked systems.

Many anti-virus scanners and Intrusion Detection Systems (IDS) are available to help detect possible attacks. There are two major categories of IDS: misuse-based and anomaly-based. Misuse-based IDSes are primarily based on signatures that identify previously detected attacks. Although these systems are effective at detecting known intrusion attempts and exploits, they fail to recognize new attacks and carefully crafted variants of old exploits. A newer generation of systems, based upon anomaly detection, has been the subject of research for at least a decade and is now appearing in standard security products. Anomaly detection systems model normal or expected behavior in a system, and detect deviations of interest that may indicate a security breach or an attempted attack.

Some attacks exploit the vulnerabilities of a protocol, while other attacks seek to survey a site by scanning and probing. These attacks can often be detected by analyzing network packet headers or by monitoring network traffic connection attempts and session behavior. Many well-known examples of worms have been described that propagate at very high speeds on the Internet [67]. These are easy to detect by analyzing the rate of scanning and probing from external sources, which would indicate a worm propagation is underway. Unfortunately, while this approach detects the early onset of a propagation, by then the worm has already successfully penetrated a number of victims, infected them, and begun its damage and propagation. It should also be evident that slow and stealthy worm propagations may go unnoticed if one depends entirely on the detection of rapid or burst changes in flows or probes.

There are other attacks that display normal protocol behavior except that they may carry malicious content in an otherwise normal connection. For example, slow-propagating parasitic worms targeting specific sites may follow the connection pattern of a host and thus may not exhibit any unusual volume of connection attempts. Misuse and anomaly detectors that analyze packet headers and traffic flow statistics may be blind to these attacks, or they may be too slow to react and reliably detect worms that are designed to evade detection by shaping their behavior to look like legitimate traffic patterns [49]. Furthermore, signature scanners are vulnerable to zero-day exploits [67] and polymorphic worms/stealthy attacks with obfuscated exploit code [6]. Consequently, there has been an increasing focus on payload analysis to detect the early onset of a worm or targeted attack. A number of researchers have focused on payload-based anomaly detection. Approaches that have been studied include specification-based anomaly detection [61] as well as techniques that aim to detect code-like byte sequences in network payloads [30, 81].
In our work, we focus on automated statistical learning approaches to efficiently train content models on a site's normal traffic flow without requiring significant semantic analysis. Ideally, we seek to design a sensor that automatically learns the characteristics of normal attack-free data for any application, service, network or host. Consequently, a model learned from normal attack-free data may be used to identify abnormal or suspicious traffic that would be subjected to further analysis to validate whether the data embodies a new attack. Ideally, we aim to detect the first occurrences of a worm attack either at a network system gateway or within an internal network, and to prevent its propagation. Although we cast the payload anomaly detection problem in terms of worms, the method is useful for a wide range of exploit attempts against many if not all services and ports. We have cast the problem as a network packet level analysis problem; however, the techniques are equally applicable as a host-level anomaly detector.

1.1 Problem Statement and Our Approach

This thesis studies the following problem: we seek to accurately detect zero-day attacks upon their very first appearance, or very soon thereafter, using network packet payload anomaly detection.

New attacks against network services occur every day, attacks for which no signatures have yet been produced and deployed. These so-called zero-day attacks can cause great harm. We conjecture that the normal content stream, both ingress and egress, of a site can be effectively and efficiently modeled to detect abnormal content indicative of a zero-day attack against some network service. The ideal case is 100% accurate detection of the network packets delivering the attack vector, with a 0% false positive rate. In this work we seek to approach this ideal level of performance, and conduct several experiments using real network traffic to demonstrate how close we may come to this ideal.

The approach we take is to compute a normal model of content flow for each available network service during a training phase, and then to use this learned normal model to detect abnormal, never-before-seen content. These suspicious network connections may or may not be attacks; hence, the approach presumes that other correlated information will separate the suspicious abnormal content that represents true attacks from false positives.
We propose to correlate content alerts under the principle that the attack vector will likely be presented at multiple sites or in multiple connections. In the case of worms, we expect that a successful attack will create a propagation phase of worm execution whose egress packet streams contain substantially the same abnormal content as the original ingress attack vector. Hence, correlation includes testing and comparing suspicious ingress and egress content alerts.

There are many design choices in modeling payload in network flows. The primary design criteria and operating objectives of any anomaly detection system entail:

- automatic, hands-free deployment requiring little or no human intervention,
- generality for broad application to any service or system,
- incremental update to accommodate changing or drifting environments,
- accuracy in detecting truly anomalous events (here, anomalous payload), with low (or controllable) false positive rates,
- resistance to mimicry attack, and
- efficiency to operate in high-bandwidth environments with little or no impact on throughput or latency.

These are difficult objectives to meet concurrently, yet they suggest an approach that may balance these competing criteria for payload anomaly detection. The primary modeling technique proposed in this thesis is language-independent n-gram modeling of payload, using machine learning techniques to automate the modeling of normal content flow. The techniques are completely general, and we discuss several strategies to automatically calibrate, update and improve models over time. An important primary area of work we present is a new and general method to resist mimicry attack, applicable to any technique one may choose for modeling normal content. The modeling techniques presented in this thesis are designed for high efficiency, so that they may scale to high-speed networks, and may even be applicable to low-bandwidth, low-power environments such as MANETs.
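The ingress/egress correlation described earlier in this section (a newly infected host emits outbound content substantially similar to the inbound attack vector) can be pictured with a toy sketch. The class name, threshold, and Jaccard-style n-gram similarity below are illustrative stand-ins; the thesis evaluates several concrete metrics, such as longest common substring and longest common subsequence.

```python
def ngram_set(payload: bytes, n: int = 5):
    """Set of all n-byte substrings of a payload."""
    return {payload[i:i + n] for i in range(len(payload) - n + 1)}

def similarity(a: bytes, b: bytes, n: int = 5) -> float:
    """Jaccard overlap of n-gram sets (an illustrative similarity metric)."""
    sa, sb = ngram_set(a, n), ngram_set(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

class EgressCorrelator:
    """Buffer anomalous ingress payloads; flag outbound packets matching one."""
    def __init__(self, threshold: float = 0.5, buffer_size: int = 100):
        self.threshold, self.buffer_size = threshold, buffer_size
        self.ingress_alerts = []

    def ingress_alert(self, payload: bytes):
        """Remember a suspicious inbound payload (bounded buffer)."""
        self.ingress_alerts.append(payload)
        self.ingress_alerts = self.ingress_alerts[-self.buffer_size:]

    def check_egress(self, payload: bytes):
        """Return the matching ingress payload if outbound content looks like
        worm propagation, else None."""
        for alert in self.ingress_alerts:
            if similarity(alert, payload) >= self.threshold:
                return alert
        return None
```

An outbound packet that largely reproduces a buffered anomalous ingress payload is flagged on the very first propagation attempt, before any volume-based scan detector could trigger.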

It is important to note that we focus on clear-text content channels, and do not address encrypted content flows. Anomaly detection of encrypted content has been studied by others by computing various entropy-based or Kolmogorov complexity-based models; we speculate about the effectiveness of these approaches in the future research section. For our work, we presume the content is available for inspection; the techniques proposed in this thesis can be applied at the point of decryption, either by using a network service proxy architecture or on the host, where network content can be passively sniffed at the point where it is decrypted and delivered to the targeted application software.

We also restrict our analysis and experimental evaluations to HTTP traffic. Such traffic poses fewer privacy restrictions, since most web traffic content is usually public. Hence, we were able to acquire significant amounts of web traffic for use in our thesis research, although we believe the algorithms and technology presented in this thesis are applicable to other content flows. We have also chosen to limit our study to web traffic because web services have historically been a common target of worm attacks, and offer a comprehensive model of monoculture problems [43]. Our work on cross-site content alert sharing is thus applicable to a wide variety of systems on the Internet.

In this thesis, we detail two lightweight, real-time network anomaly detection sensors we designed, PAYL and Anagram, and experimentally demonstrate that they can successfully detect inbound worm packets with high accuracy and a low false positive rate. We then show that if the worm has already infected a machine and starts to propagate to other victims, PAYL/Anagram can quickly detect the propagation and concurrently generate a signature that can be distributed to other machines in the local LAN or across domains to filter the zero-day attack vector.
This signature is accurate, and won't block normal traffic (thus exhibiting a low false positive rate).

It is important to note that in this line of work we assume there is an active adversary who seeks to attack sites without notice. Hence, the threat model we assume is that the attacker will know that a content-based anomaly detector is in use, and will seek means to blind the sensor to their attack. An external research group led by Wenke Lee at Georgia Tech fashioned an automated system that was designed to blind PAYL, the first sensor we proposed, to attacks, creating unwanted false negatives [28]. Anagram, the second sensor presented in this thesis, was devised with a different modeling strategy from PAYL to thwart these automatically generated mimicry attacks levied at PAYL.

Under this threat model, we assume that the attacker will attempt to train the anomaly sensor to consider attack data as normal data. Under this learning attack strategy, the attacker may send a stream of ingress data that is successively distant from normal attack-free data but closer to their intended crafted exploit data. One means of thwarting these training attacks is to notice whether or not the target service replies with an error message for unrecognized requests. In such cases, the anomaly detector in its training phase can simply ignore such ingress packets and hence not include the purposely crafted training data in its model. We do not treat this case in detail in this thesis, since to date no one has demonstrated a successful training attack; however, we discuss approaches to thwart training attacks.

Alternatively, the attacker may analyze the true content stream of the intended target and craft the content of their attack to appear to be normal content. For this mimicry attack strategy, we propose a method of computing randomized models to thwart the attacker. Hence, even if the attacker knew the true content model of a target site, they would not know exactly how to craft their attack to avoid detection. We present an alternative strategy we call randomized testing, which is less expensive to implement than randomized modeling, but provides substantially the same anti-mimicry capabilities.
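As a rough illustration of randomized testing, a sensor can secretly assign each byte position to one of k sub-partitions and test each partition separately; an attacker who pads the whole packet to match the global distribution cannot know which bytes fall into which partition. The partitioning scheme, `score_fn`, and threshold below are hypothetical placeholders, not the thesis implementation.

```python
import random

def random_partition(payload: bytes, k: int, seed: int):
    """Assign each byte to one of k subsequences, driven by a secret seed.
    (One plausible partitioning scheme; the real scheme is kept secret per host.)"""
    rng = random.Random(seed)
    parts = [bytearray() for _ in range(k)]
    for b in payload:
        parts[rng.randrange(k)].append(b)
    return [bytes(p) for p in parts]

def randomized_test(payload: bytes, k: int, seed: int, score_fn, threshold: float) -> bool:
    """Alert if ANY secretly chosen sub-partition scores as anomalous."""
    return any(score_fn(part) > threshold
               for part in random_partition(payload, k, seed))
```

Because each host draws its own seed, padding that makes the full packet look normal globally can still leave one hidden partition with a conspicuously abnormal distribution.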
1.2 Thesis Contributions

This thesis research makes the following contributions:

- Demonstrate the usefulness of analyzing network payload for anomaly detection. We systematically study the possibility and approaches of modeling normal network payload for use in anomaly detection to detect likely network attacks.

- A new statistical, semantics-independent, efficient content-based anomaly detector based on 1-gram analysis that is shown to be effective at detecting abnormal content and attacks. The sensor does not rely upon a specification or semantic analysis of the target applications; it learns a model of normal content in a completely automated fashion.

- A binary-based model representation of a mixture of high-order n-grams that detects abnormal content surprisingly well. Such modeling in the Anagram sensor can capture the sequential information between bytes and is resistant against existing mimicry attacks; the technique is particularly efficient in space and computational costs, and does not incur infeasible amounts of computation, unlike building the full frequency distribution of higher-order n-grams. The implementation of Anagram models using Bloom filters provides fast and effective correlation while also preserving the privacy of shared content.

- Development of a run-time measurement of the stability of a network's content flow, providing an automatic and reasonable estimate of when the sensor has been sufficiently trained and is ready for deployment.

- A bad content model created from known old attack signatures and collected virus samples that can be used to perform semi-supervised learning, improving the accuracy of the anomaly detector. This information was acquired from publicly available sources such as Snort rules and online malware collections.

- Identification of the data diversity of network payload across sites, which can be used to thwart large-scale attacks. The so-called Monoculture Problem (a large population of hosts sharing the same vulnerability exploited by a single attack) is the fundamental reason why worm attacks spread broadly with great efficiency and speed.
Even though each potential target may still have the exact same vulnerable software application available for attack, we demonstrate that each site's diverse content flow produces different and diversified payload models at different sites, making it hard for a single common attack to evade all of the collaborating content anomaly detectors.

- A new defensive strategy showing how a symbiotic relationship between host-based sensors and a content-based sensor can adapt over time to improve the accuracy of modeling a site's content flow.

- Novel techniques of randomized modeling/testing that can help thwart mimicry attacks. There is a new class of smart worms that launch their attack by first sniffing traffic and shaping the datagram to the statistics specific to a given target site in order to appear normal. By randomizing the portion of packets that the anomaly sensor extracts for modeling and testing, it is difficult for the attacker to guess where or how to pad content. This technique gives the sensors artificial diversity and robustness against future mimicry attacks. The techniques are general and applicable to any content modeling technique one may wish to use.

- A technique of correlating ingress/egress payload to capture a worm's initial propagation attempt. A key feature of worms is their self-propagation strategy: a newly infected host will begin sending outbound traffic that is substantially similar (if not exactly the same) to the original content that attacked the victim. Instead of waiting until the volume of outgoing scans suggests full-blown propagation attempts, we can stop the worm spread from the very first attempt. This technique is especially good for catching stealthy or targeted worms which do not display scanning or probing behavior.

- Novel techniques for efficient privacy-preserving payload collaboration across sites, and automatic signature generation. As the data diversity results suggest, different sites have different normal payload models.
This implies from a statistical perspective that they should also have different false positive alerts. Any

common or highly similar anomalous payloads detected among two or more sites logically would be caused by a common attack exploit targeting two or more sites. Cross-site or cross-domain sharing may thus reduce the false positive rate at each site, and may more accurately identify worm outbreaks in the earliest stages of an infection.

Robust and privacy-preserving means of representing content-based alerts for cross-site alert sharing and signature generation. The best candidates, as detailed later, include the 1-gram based Z-String produced by PAYL, and the Bloom filter representation of anomalous n-grams detected by Anagram.

Highly accurate signatures that are automatically generated. Such signatures detail and capture the core invariant part of the attacks, even for polymorphic/metamorphic attack vectors.

1.3 Thesis Outline

The rest of the thesis is organized as follows. Chapter 2 discusses related work in intrusion detection, worm detection, automatic signature generation and collaborative security. In Chapter 3, we describe the PAYL anomaly detection sensor, the modeling and detection techniques employed in PAYL, and demonstrate how well it can detect attacks. Chapter 4 describes the Anagram anomaly detection sensor, designed to ameliorate the core mimicry attack problems that PAYL exhibits. The techniques described in the design of Anagram to resist mimicry attack are general, and in Chapter 5 we propose the randomized modeling and testing approach to help thwart mimicry attacks against any content-based anomaly sensor. The sensors described in this thesis are based upon machine learning algorithms applied to content flows. Metrics need to be found to decide when the model is well enough trained. Furthermore, the content flow environment may change over time, and hence we seek ways to improve the learning phase and to produce accurate normal models. We explore several learning strategies in Chapter 6, including epoch-based learning,

and adaptive learning that uses the sensors together with shadow servers for feedback from the server whose data flow is modeled. Chapter 7 introduces the idea of content-based alert correlation, which includes local (or host-level) ingress/egress correlation, and cross-site correlation. We examine several different ways to achieve privacy-preserving payload alert correlation, and demonstrate their effectiveness in detecting true alerts and reducing false positives. Chapter 8 summarizes the thesis and outlines ideas for future work.

Chapter 2

Related Work

2.1 Network Anomaly Detection

There are two types of systems that are called anomaly detectors: those based upon a specification (or a set of rules) of what is regarded as good/normal behavior, and others that learn the behavior of a system under normal operation. The first type relies upon human expertise and may be regarded as a straightforward extension of typical misuse detection IDS systems. Here we regard the latter type, where the behavior of a system is automatically learned, as a true anomaly detection system. Rule-based network intrusion detection systems such as Snort and Bro use hand-crafted rules to identify known attacks, for example, virus signatures in the application payload, and requests to nonexistent services or hosts. They can do little to stop zero-day worms: they depend upon signatures known only after the worm has been launched successfully, essentially disclosing its new content and method. Rules can also be specified for good behavior instead of bad behavior. Sekar [61] proposed specification-based anomaly detection, in which all the legitimate behavior of a server (for instance, IP, TCP or SMTP) is specified using finite state machines (FSM), and any execution path that fails to track the specified FSM may be flagged as anomalous. But it is hard to guarantee that a manually-developed state machine is accurate and complete, and hard to update if

the service has any modification. Network anomaly detection systems, such as NIDES [22], PHAD [47] and ALAD [45], compute (statistical) models for normal network traffic and generate alarms when there is a large deviation from the normal model. Such model-learning systems have the ability to detect new attacks never seen before, but they may generate false alerts on unusual content that is otherwise normal. We take the same approach in this thesis. However, these systems differ in the features extracted from available audit data and the particular algorithms they use to compute the normal models. Most use features extracted from the packet headers. ALAD and NIDES model the distribution of the source and destination IP and port addresses and the TCP connection state. PHAD uses many more attributes, a total of 34, which are extracted from the packet header fields of Ethernet, IP, TCP, UDP and ICMP packets. Some systems use some payload features, but in a very limited way. NATE is similar to PHAD; it treats each of the first 48 bytes of a packet as a statistical feature starting from the IP header, which means it can include at most the first 8 bytes of the payload of each network packet. ALAD models the incoming TCP request and includes as a feature the first word or token of each input line out of the first 1000 application payload bytes, restricted only to the header part for some protocols like HTTP and SMTP. PAYL and Anagram model the entire packet datagram, and may also be applied to long session data. The work of Kruegel et al. [31] describes a service-specific intrusion detection system that is most similar to our work. They combine the type, length and payload distribution of the request as features in a statistical model to compute an anomaly score of a service request. However, they treat the payload in a very coarse way.
They first sort the 256 ASCII characters by frequency and aggregate them into 6 groups: byte value ranges 0, 1-3, 4-6, 7-11, 12-15, and 16-255, and compute one single uniform distribution model of these 6 segments for all requests to one service over all possible length payloads. They use a chi-square test against this model to calculate the anomaly score of new requests. In contrast, PAYL models the full byte distribution conditioned on the length of payloads, and we use

Mahalanobis distance as fully described in the following discussion. Furthermore, the modeling we introduce includes automatic clustering of centroids, which is shown to increase accuracy and dramatically reduce resource consumption. The method is fully general and does not require any parsing, discretization, aggregation or tokenizing of the input stream (e.g., [46]). Rieck [60] recently proposed another payload-based network intrusion detection approach using n-grams. Unlike Anagram, which builds binary-based n-gram models using Bloom filters, they store the n-grams in a trie data structure, and the similarity between two tries is defined according to the number of matching and mismatching nodes. Another difference is that Anagram builds a universal model for the whole dataset, while [60] builds a trie for each connection in the dataset. During detection, they compare the test data against each of the pre-computed tries using a method similar to k-nearest neighbor, or simplified Mahalanobis distance. One major shortcoming of this work is its high computation overhead, considering the number of tries that are compared in the model for each packet tested. Early intrusion anomaly sensors focused on system calls. Forrest's [15] detection of foreign sequences of system calls in a binary-based anomaly detector is quite similar to the modeling implemented in Anagram. Tan and Maxion [72] show why Forrest's work produced optimal results when the size of the token window was fixed at 6 (essentially a 6-gram). Forrest's grams were sequences of tokens each representing a unique system function, whereas Anagram models n-grams of byte values, a much more complicated situation with a huge feature space. Anagram also employs a semi-supervised training strategy whereby models of previously known attacks are employed in the sensor to improve accuracy. Network intrusion detection systems can also be classified according to the semantic level of the data that is analyzed and modeled.
Some of the systems such as MADAMID [33], Bro [55], EMERALD [57], STAT [74], ALAD [45], etc., reconstruct the network packets and extract features that describe the higher level interactions between end hosts. For example, session duration time, service type, bytes transferred, and so forth are regarded as higher level, temporally ordered features not discernible by inspecting only the

packet content. Other systems are purely packet-based, like PHAD [47], NATED [21] and NATE [37]. They detect anomalies in network packets directly, without reconstruction of connection data. This approach has the important advantage of being simple and fast to compute, and such systems are generally quite good at detecting attacks that do not result in valid connections or sessions, for example, scanning and probing attacks.

2.2 Worm Detection and Signature Generation

There has been much work done on worm detection and signature generation, most of it based on detecting abnormal communication patterns or abnormal content. Fast spreading and a large number of connections are among the most typical symptoms of a worm attack. Many models have been proposed to predict worm propagation speed and possible defense mechanisms [67, 5, 66]. Helped by network telescopes, Pang et al. [53] use macro symptoms such as Internet background radiation to generate early alerts of possible Internet-wide worm propagation. Other techniques are based on blocking or rate limiting traffic from hosts that exhibit abnormal local traffic patterns: number of connections to new destinations [83], the ratio of failed to successful connections [82], destination address dispersion [23], etc. But these techniques cannot detect worms that have normal traffic patterns, for example, targeted worms, parasitic worms, or slow, stealthy worms. Many researchers have also considered the use of content invariance or content prevalence for worm detection, and generate content-based signatures. Honeycomb [29] is a host-based IDS that automatically creates signatures by applying longest common substring (LCS) to malicious traffic captured by a honeypot targeting dark space. Computed substrings are used as candidate worm signatures. Similarly, EarlyBird [65] uses Rabin fingerprints to find the most frequent substrings for signatures.
Polygraph [52] extends the work done in Autograph [25]; both are signature generators that assume traffic is separated

into two flow pools, one with suspicious scanning traffic and one with non-suspicious traffic. Instead of assuming signatures are contiguous, like Autograph, Polygraph allows a signature composed of multiple noncontiguous substrings (tokens), particularly to accommodate polymorphic worms. Tokens may be generated as a set (of which all must be present), as an ordered sequence, or as a probabilistic set (Bayes signature). Hamsa [35] further improves on Polygraph in terms of efficiency, accuracy and attack resilience. Like Polygraph, Anagram is capable of identifying multiple tokens. However, Anagram's design does not assume an external flow classifier, being one itself. PADS [73], or Position-Aware Distribution Signatures, seeks to blend frequency distributions and packet signature positioning. Each of the aforementioned projects is based on detecting frequently occurring payloads delivered by a source IP that is suspicious, either because the connection targeted dark IP space or because the source IP address exhibited pre-scanning behavior. These approaches imply that the detection occurs some time after the propagation of the worm has executed, and they can easily be misled by deliberate noise [56]. Unlike these approaches, the ingress/egress correlation approach we propose does not depend on scanning behavior or payload prevalence. Instead, we can detect the first propagation attempt of the worm immediately and generate its signature by extracting the common part of the incoming and outgoing traffic of the worm. Another substantial advantage of the Anagram anomaly detector is its ability to compute robust signatures. Instead of considering the whole payload or the most frequent part, Anagram first identifies the malicious n-grams, so that the signatures composed from them capture only the malicious exploit parts and are more accurate.
Even under purposefully obfuscated content, the small invariant decoder regions of the payload are still identifiable across multiple suspicious payloads. More recently, work has focused on building semantic-aware or vulnerability-based signatures to handle multiple (or polymorphic) attacks for the same exploit. Kruegel et al. [30] use structural analysis of binary code and generate control-flow graphs to catch worm mutations. Shield [77] provides vulnerability signatures instead of string-oriented content

signatures, and blocks attacks that exploit that vulnerability. A shield is manually specified for a vulnerability identified in some network-available code; the time lag to specify, test and deploy shields from the moment the vulnerability is identified favors the worm writer, not the defenders. Vigilante [8] introduces the notion of vulnerability-specific self-certifying alerts that focus on filtering undesirable execution control, code execution, or function arguments, and that can be exchanged via P2P systems. VSEF [51] builds execution-based filters that filter out vulnerable processor instruction-based traces. COVERS [36] analyzes attack-triggered memory errors in C/C++ programs and develops structural memory signatures; this is a primarily host-specific approach, while in this thesis PAYL/Anagram focus on network traffic. However, as mentioned, PAYL/Anagram may also be used as host-based sensors with little effort. SigFree [81] uses a different approach, focusing on generic code detection; as its name implies, it does not rely on signatures, preferring to disassemble instruction sequences and identify, using data flow anomaly detection, whether requests contain sufficient code to merit being flagged as suspicious. PAYL/Anagram do not explicitly differentiate between code and data, although they are often able to do so based on training. Additionally, PAYL/Anagram monitor content flows, not just requests, and can apply to a broader variety of protocols.

2.3 Polymorphic Worms, Mimicry and Learning Attacks

Polymorphic viruses are nothing new; 1260 and the Dark Avenger Mutation Engine were considered the first two polymorphic virus engines, written in the early 90s [70].
However, these early viruses focused on evading detection by COTS signature scanners; they would be easily detected by an anomaly detector, as they contain significantly different byte distributions than non-malicious code, and they were primarily targeted for manual execution and so did not incorporate exploit mechanisms like common Internet worms. Polymorphic worms with vulnerability-exploiting shellcode, e.g., ADMmutate [24] and CLET [13], do support exploit vectors and are primarily designed to fool signature-based

IDSes. CLET does feature a form of padding, which its authors call cramming, to defeat simple anomaly detectors. However, cram bytes are derived from a static source, i.e., instructions in a file included with the CLET distribution; while this may be customized to approach a general mimicry attack, it must be done by hand. Wagner and Dean [75] were among the first to demonstrate a mimicry attack on an anomaly detection system, but these initial efforts to generate mimicry attacks, including [76] and [71], focused on host-based system-call anomaly detection. With the advent of effective network payload-based anomaly detection techniques, researchers have begun building smart worms that employ a combination of polymorphism and mimicry attack mechanisms. Kolesnikov, Dagon and Lee [28] built a worm specifically designed to target network anomaly detection approaches, including PAYL. They use a number of techniques, including polymorphic decryption, normal traffic profiling and blending, and splitting, to effectively defeat PAYL and several other IDSes. Defeating learning attacks by training attacks is also a current research theme; [3] discusses the problem for anomaly detectors from a theoretical perspective, categorizes different types of learning attacks (e.g., causative vs. exploratory), and speculates as to several possible solutions. We independently proposed the randomized modeling employed in Anagram, which implements some of the techniques proposed in [3]. Anagram uses randomization to hide key parameters of the model from the attacker, and the idea may be extended to any learning-based anomaly sensor. Our ongoing work includes exploring several other strategies, including the randomization of n-gram sizes, and various strategies to test whether an attacker is polluting learning traffic at given points in time. We discuss several of these ideas in the future work section.
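The randomization idea can be illustrated with a toy sketch. This is not Anagram's implementation; the partitioning scheme, function name, and parameters below are our own illustrative assumptions. The sensor keeps a secret seed, splits each payload at secretly chosen boundaries, and models each partition separately, so an attacker cannot tell which regions of the packet to pad.

```python
import random

def random_partition(payload: bytes, num_parts: int, seed: int):
    """Split a payload into num_parts contiguous chunks at secret,
    randomly chosen boundaries. The seed is the sensor's secret; an
    attacker who cannot guess it cannot tell which bytes fall into
    which sub-model, and so cannot reliably pad content to mimic any
    one partition's normal statistics."""
    rng = random.Random(seed)
    if num_parts < 2 or len(payload) < num_parts:
        return [payload]
    # choose num_parts-1 distinct cut points strictly inside the payload
    cuts = sorted(rng.sample(range(1, len(payload)), num_parts - 1))
    bounds = [0] + cuts + [len(payload)]
    return [payload[a:b] for a, b in zip(bounds, bounds[1:])]
```

Each chunk would then be tested against its own content model; changing the seed per site (or per epoch) gives the artificial diversity described above.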
2.4 Privacy-preserving Correlation

There have been recent efforts to focus on the privacy of content alerts to enable effective correlation without leaking confidential information. Lincoln et al. [37] suggest hash-based

sanitization of several header fields, enabling equality matching (e.g., identifying the same source IP) while removing other features, including payloads; our techniques instead keep (and analyze) these payloads. Kissner [26] describes the notion of privacy-preserving set operations using cryptographic techniques; this achieves stronger privacy guarantees than the hashing approaches described by Lincoln, but it is restricted to set union, intersection, and element reduction (set count difference), which could still potentially be used with n-gram analysis. The Privacy-Preserving Friends Troubleshooting Network [20, 19] extends earlier work on PeerPressure [78], a collaborative model for software configuration diagnosis, with a privacy-preserving architecture utilizing a "friend"-based neighbor approach to collaboration, including the use of secure multiparty computation to vote on configuration outliers and homomorphic encryption to protect privacy. Xu [84] introduces the notion of concept hierarchies to abstract low-level concepts, along with the use of entropy, to balance the sanitization and information gain of alerts; a similar use of entropy may also be applicable here. Our work focuses more on privacy-preserving representations of network payload and the techniques to correlate them. Our purpose is to find those alerts containing very similar payload among different sites, which are likely to be common attack vectors, and to reduce the false positives generated by local sensors. At the same time, we use the correlated payload based on these representations to generate accurate attack signatures. There is also a tremendous volume of work on privacy-preserving data mining, e.g., [1, 38]; these efforts primarily assume secure querying, perturbation, and aggregate computation of values amongst one or two databases, and do not generally scale to the collaboration described here.
Additionally, most of the research in this field is more concerned with offline analysis of database query processing.
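The flavor of correlation pursued here can be sketched in a few lines. This is a simplification, not the thesis's Z-String or Bloom filter machinery: each site reduces an alert payload to the set of hashed n-grams it contains, exchanges only the hashes, and high overlap between two sites' alert sets suggests a common attack. Function names and the choice of n are illustrative.

```python
import hashlib

def hashed_ngrams(payload: bytes, n: int = 5) -> set:
    """Reduce a payload to a set of hashed n-grams. Only the hashes
    leave the site, so the raw content is not revealed (a stand-in
    for the privacy-preserving representations discussed above)."""
    return {
        hashlib.sha256(payload[i:i + n]).hexdigest()
        for i in range(len(payload) - n + 1)
    }

def alert_similarity(a: set, b: set) -> float:
    """Jaccard similarity between two sites' hashed alert contents;
    a high score suggests the same attack payload hit both sites."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Two sites hit by the same exploit share many invariant n-grams and score high; two unrelated local false positives share almost none, which is why cross-site agreement filters them out.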

Chapter 3

PAYL: 1-gram Payload Modeling and Anomaly Detection

In this chapter, we present the payload-based anomaly detector PAYL. We choose to consider language-independent statistical modeling of sampled data streams, best exemplified by well-known n-gram analysis. Many have explored the use of n-grams in a variety of tasks. The method is well understood, efficient and effective. The simplest model one can compose is the 1-gram model. A 1-gram model is certainly efficient (requiring a linear time scan of the data stream and an update of a small 256-element histogram), but whether it is accurate requires analysis and experimentation. This technique has worked surprisingly well in our experiments.

3.1 Length Conditioned n-gram Payload Model

Network payload is just a stream of bytes. Unlike the network packet headers, payload doesn't have a fixed format, a small set of keywords or expected tokens, or a limited range of values. Any character or byte value may appear at any position of the datagram stream. To model the payload, we need to divide the stream into smaller clusters or groups according to some criteria that associate similar streams for modeling. The port number and the length

are two obvious choices. We may also condition the models on the direction of the stream, thus producing separate models for inbound traffic and outbound responses. Usually the standard network services have fixed pre-assigned port numbers: 20 for FTP data transmission, 21 for FTP commands, 22 for SSH, 23 for Telnet, 25 for SMTP, 80 for Web, etc. Each such application has its own special protocol and thus its own payload type. Each site running these services would have its own typical payload flowing over these services. Payload to port 22 should be encrypted and appear as a uniform distribution of byte values, while the payload to port 21 should be primarily printable characters entered by a user at a keyboard. Within one port, the payload length also varies over a large range. The most common TCP packets have payload lengths from 0 to 1460. Different length ranges have different types of payload. The larger payloads are more likely to have non-printable characters indicative of media formats and binary representations (pictures, video clips or executable files, etc.). Thus, we compute a payload model for each different length range, for each port and service, and for each direction of payload flow. This produces a far more accurate characterization of the normal payload than would otherwise be possible by computing a single model for all traffic going to the host. However, many centroids might be computed for each possible payload length, creating a detector with large resource consumption. To keep our model simple and quick to compute, we model the payload using n-gram analysis, and in particular the byte value distribution, exactly when n=1. An n-gram is a sequence of n adjacent bytes in a payload unit. A sliding window of width n is passed over the whole payload and the occurrence of each n-gram is counted. N-gram analysis was first introduced in [12] and has been exploited in many language analysis tasks, as well as security tasks.
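The sliding-window counting just described can be sketched in a few lines of Python (the function names are ours, not PAYL's):

```python
from collections import Counter

def ngram_counts(payload: bytes, n: int = 1) -> Counter:
    """Slide a window of width n over the payload and count each
    n-gram; n=1 reduces to a 256-bin byte-value histogram."""
    return Counter(payload[i:i + n] for i in range(len(payload) - n + 1))

def relative_frequencies(payload: bytes) -> list:
    """1-gram feature vector: relative frequency of each of the 256
    possible byte values in the payload."""
    counts = ngram_counts(payload, 1)
    total = max(len(payload), 1)  # guard against empty payloads
    return [counts.get(bytes([b]), 0) / total for b in range(256)]
```

For n=1 this is a single linear scan and a 256-element histogram, which is what makes the 1-gram model so cheap to compute.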
The seminal work of Forrest [15] on system call traces uses a form of n-gram analysis (without the frequency distribution and allowing for wildcards in the gram) to detect malware execution as uncharacteristic sequences of system calls. For a payload, the feature vector is the relative frequency count of each n-gram, which is calculated by dividing the number of occurrences of each n-gram by the total number of

n-grams. The simplest case, a 1-gram, computes the average frequency of each ASCII character. Some stable character frequencies and some very variant character frequencies can result in the same average frequency, but they should be characterized very differently in the model. Thus, we compute, in addition to the mean value, the variance and standard deviation of each frequency as another characterizing feature. So for the payload of a fixed length of some port, we treat each character's relative frequency as a variable and compute its mean and standard deviation as the payload model. Figure 3.1 provides an example showing how the payload byte distributions vary from port to port, and between source and destination flows. Each plot represents the characteristic profile for that port and flow direction (inbound/outbound). Notice also that the distributions for port 22 (inbound and outbound) show no discernible pattern; the statistical distribution for such encrypted channels entails a more uniform frequency distribution across all of the 256 byte values, each with low variance. Hence, encrypted channels are fairly easy to spot. Notice that this figure is actually generated from a dataset with only the first 96 bytes of payload in each packet, and there is already a very clear pattern with the truncated payload. Figure 3.2 displays the variability of the frequency distributions among different length payloads. The two plots characterize two different distributions of the incoming traffic to the same web server, port 80, for two different lengths: one of 200-byte payloads, the other of 1,460 bytes. Clearly, a single monolithic model for both length categories will not represent the distributions accurately. Most importantly, and this also serves as the basic assumption of this work, an attack's payload distribution is very different from the normal ones. Figures 3.3 and 3.4 below give such an example.
Figure 3.3 shows a portion of the first packet of CodeRed II, and Figure 3.4 compares the byte distribution of the worm packet (upper plot) against the normal traffic (lower plot) of the same packet length.

Given a training data set, we compute a set of models M_ij. For each specific observed length i of each port j, M_ij stores the average byte frequency and the standard deviation of each byte's frequency.

Figure 3.1: Example byte distributions for different ports. For each plot, the X-axis is the ASCII byte value 0-255, and the Y-axis is the average byte frequency.

The combination of the mean and variance of each byte's frequency can characterize the payload within some range of payload lengths. So if there are 5 ports, and each port's payload has 10 different lengths, there will be in total 50 centroid models computed after training. As an example, we show the model computed for payloads of length 185 for port 80 in Figure 3.5, which is derived from a dataset described in the evaluation section 3.8. (We also provide an automated means of reducing the number of centroids via clustering, as described in Section 3.4.) PAYL operates as follows. We first observe many exemplar payloads during a training phase and compute the mean and variance of the byte value distribution, producing model M_ij. During detection, each incoming payload is scanned and its byte value distribution is computed. This new payload distribution is then compared against model M_ij; if the distribution of the new payload is significantly different from the norm, the detector flags the packet as anomalous and generates an alert. The means to compare the two distributions, the model and the new payload, is described next.
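This train-then-detect loop can be sketched as a toy implementation. This is illustrative, not PAYL's code: the class and function names are ours, the distance anticipates the simplified measure defined in the next section, and the smoothing constant is an arbitrary placeholder.

```python
import statistics

def freq(payload: bytes) -> list:
    """256-element relative byte-frequency vector for one payload."""
    counts = [0] * 256
    for b in payload:
        counts[b] += 1
    total = max(len(payload), 1)
    return [c / total for c in counts]

class Centroid:
    """Toy model M_ij for one (port, length-bin): per-byte mean and
    standard deviation of the frequencies seen during training."""

    def __init__(self):
        self.samples = []

    def train(self, payload: bytes):
        self.samples.append(freq(payload))

    def finalize(self):
        columns = list(zip(*self.samples))
        self.mean = [statistics.fmean(col) for col in columns]
        self.std = [statistics.pstdev(col) for col in columns]

    def score(self, payload: bytes, alpha: float = 0.001) -> float:
        # simplified Mahalanobis distance with smoothing factor alpha
        return sum(abs(x - m) / (s + alpha)
                   for x, m, s in zip(freq(payload), self.mean, self.std))
```

A detector would keep one such centroid per (port, length-bin, direction) and flag any payload whose score exceeds a calibrated threshold.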

Figure 3.2: Example byte distribution for different payload lengths for port 80 on the same host server

GET /default.ida?XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX%u9090
%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%
u9090%u6858%ucbd3%u7801%u9090%u9090%u8190%u
00c3%u0003%u8b00%u531b%u53ff%u0078%u0000%u0

Figure 3.3: Raw packet of CRII; only the first 301 bytes are shown for brevity.

3.2 Simplified Mahalanobis Distance

Mahalanobis distance is a standard distance metric for comparing two statistical distributions. It is a very useful way to measure the similarity between the (unknown) new payload sample and the previously computed model. Here we compute the distance between the byte distribution of the newly observed payload and the profile from the model computed for the corresponding length range. The higher the distance score, the more likely this

payload is abnormal.

Figure 3.4: Example of the CodeRed II packet (in Figure 3.3), and its payload distribution against the normal traffic at the same packet length 1360

The formula for the Mahalanobis distance is:

d^2(x, y) = (x - y)^T C^{-1} (x - y)    (3.1)

where x and y are two feature vectors, and each element of the vector is a variable. Here x is the feature vector of the new observation, and y is the averaged feature vector computed from the training examples, each of which is a vector. C^{-1} is the inverse covariance matrix, with C_ij = Cov(y_i, y_j), where y_i and y_j are the ith and jth elements of the training vectors. The advantage of the Mahalanobis distance is that it takes into account not only the average value but also its variance and the covariance of the variables measured. Instead of simply computing the distance from the mean values, it weights each variable by its standard deviation and covariance, so the computed value gives a statistical measure of how well the new example matches (or is consistent with) the training samples. In our problem, we use the naive assumption that the bytes are statistically independent. Thus, the covariance matrix C becomes diagonal and the elements along the diagonal

are just the variances of each byte's frequency.

Figure 3.5: The average relative frequency of each byte, and the standard deviation of the frequency of each byte, for payload length 185 of port 80

Notice that when computing the Mahalanobis distance, we pay the price of having to compute multiplications and square roots after summing the differences across the byte value frequencies. To further speed up the computation, we derive the simplified Mahalanobis distance:

d(x, y) = sum_{i=0}^{n-1} |x_i - y_i| / sigma_i    (3.2)

where the variance is replaced by the standard deviation. Here n is fixed at 256 under the 1-gram model (since there are only 256 possible byte values). Thus, we avoid the time-consuming square and square-root computations (in favor of a single division operation), and the whole computation time is now linear in the length of the payload with a small constant to compute the measure. This produces an exceptionally fast detector (recall our objective to operate in high-bandwidth environments). For the simplified Mahalanobis distance, there is the possibility that the standard deviation sigma_i equals zero and the distance will become infinite. This will happen when a character or byte value never appears in the training samples or, oddly enough, appears with exactly

the same frequency in each sample. To avoid this situation, we add a smoothing factor alpha to the standard deviation, similar to a prior observation:

d(x, y) = sum_{i=0}^{n-1} |x_i - y_i| / (sigma_i + alpha)    (3.3)

The smoothing factor alpha reflects the statistical confidence in the sampled training data. The larger the value of alpha, the less confidence we have that the samples are truly representative of the actual distribution, and thus the byte distribution can be more variable. Over time, as more samples are observed in training, alpha may be decremented automatically. The formula for the simplified Mahalanobis distance also suggests how to set the threshold to detect anomalies. If we set the threshold to 256, this means we allow each character to have a fluctuation range of one standard deviation from its mean. Thus, logically we may adjust the threshold in increments of 128 or 256, which may be implemented as an automatic self-calibration process.

3.3 Learning Issues

3.3.1 Incremental Learning

The 1-gram model with Mahalanobis distance is very easy to implement as an incremental version with only slightly more information stored in each model. An incremental version of this method is particularly useful for several reasons. A model may be computed on the fly in a hands-free, automatic fashion. That model will improve in accuracy as time moves forward and more data is sampled. Furthermore, an incremental online version may also age out old data from the model, keeping a more accurate view of the most recent payloads flowing to or from a service. This drift in environment can be handled via incremental or online learning [32]. To age out older examples used in training the model, we can specify a decay parameter for the older model and emphasize the frequency distributions appearing in the new samples.

This provides the means of automatically updating the model to maintain an accurate view of the normal payloads seen most recently. To compute the incremental version of the Mahalanobis distance, we need to update the mean and the standard deviation of each ASCII character for each new sample observed. For the mean frequency of a character, we compute \bar{x}_N = \sum_{i=1}^{N} x_i / N from the training examples. If we also store the number of samples processed, N, we can update the mean as

\bar{x}_{N+1} = (\bar{x}_N \cdot N + x_{N+1}) / (N + 1) = \bar{x}_N + (x_{N+1} - \bar{x}_N) / (N + 1)    (3.4)

when we see a new example x_{N+1}, a clever update technique described by Knuth [27]. Since the standard deviation is the square root of the variance, the variance computation can be rewritten using the expected value E as:

Var(X) = E(X - EX)^2 = E(X^2) - (EX)^2    (3.5)

We can update the standard deviation in a similar way if we also store the average of the x_i^2 in the model. This requires maintaining only one more 256-element array in each model, storing the average of the x_i^2, along with the total number of observations N. Thus, the n-gram byte distribution model can be implemented as an incremental learning system easily and very efficiently. Maintaining this extra information can also be used when clustering samples, as described in the next section.

3.3.2 Reduced Model Size by Clustering

When we described our model, we said we compute one model M_ij for each observed length bin i of payloads sent to port j. Such fine-grained modeling might introduce several problems. First, the total size of the model can become very large. (Payload lengths are associated with media files that may be measured in gigabytes, so many length bins may be defined, causing a large number of centroids to be computed.) Further, the byte

distribution for payloads of length bin i can be very similar to that of payloads of length bins i−1 and i+1; after all, they vary by only one byte. Storing a model for each length may therefore be redundant and wasteful. Another problem is that for some length bins there may not be enough training samples. Sparseness implies the data will generate an empirical distribution that is an inaccurate estimate of the true distribution, leading to a faulty detector. There are two possible solutions to these problems. One solution to the sparseness problem is to relax the models by assigning a higher smoothing factor to the standard deviations, which allows higher variability of the payloads. The other is to borrow data from neighboring bins to increase the number of samples; i.e., data from neighboring bins is used to compute other, similar models. We compare two neighboring models using the simple Manhattan distance to measure the similarity of their average byte frequency distributions. If their distance is smaller than some threshold t, we merge the two models. This clustering step is repeated until no more neighboring models can be merged. The merging is easily computed using the incremental algorithm described in Section 3.3.1: we update the means and variances of the two models to produce a new, updated distribution. Now, for newly observed test data of length i sent to port j, we use the model M_ij, or the model it was merged into. There is still the possibility that the length of the test data is outside the range of all the computed models. For such test data, we use the model whose length range is nearest to that of the test data; in these cases, the mere fact that the payload has an unusual length unobserved during training may itself be cause to generate an alert.
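The bin-merging step can be sketched as follows. Models are represented here as (sample count, mean, mean-of-squares) triples, so that merging reuses the incremental statistics of Section 3.3.1. This is an illustrative sketch with our own names, not the PAYL code:

```python
import numpy as np

def manhattan(a: np.ndarray, b: np.ndarray) -> float:
    """Manhattan (L1) distance between two byte-frequency distributions."""
    return float(np.sum(np.abs(a - b)))

def merge_neighbor_bins(models, t):
    """models: list of (n, mean, mean_sq) triples, one per ascending length bin.
    Adjacent bins whose average byte distributions are within Manhattan
    distance t are combined, weighting the statistics by sample counts."""
    merged = [models[0]]
    for n2, mu2, sq2 in models[1:]:
        n1, mu1, sq1 = merged[-1]
        if manhattan(mu1, mu2) < t:
            n = n1 + n2
            merged[-1] = (n, (n1 * mu1 + n2 * mu2) / n,
                             (n1 * sq1 + n2 * sq2) / n)
        else:
            merged.append((n2, mu2, sq2))
    return merged
```

Because the triples carry counts and sums, the merged model's mean and variance are exactly those that would have been computed from the union of the two bins' training samples.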
The reader should note that the modeling algorithm and the model merging process are each linear time computations, and hence the modeling technique is very fast and can be performed in real time. The online learning algorithm also assures us that models will improve over time, and their accuracy will be maintained even when services are changed and new payloads are observed.

3.3.3 Unsupervised Learning

Our model, together with the Mahalanobis distance, can also be applied as an unsupervised learning algorithm. Thus, training the models is possible even if noise is present in the training data (for example, if training samples include payloads from past worms still propagating on the Internet). This is based on the assumption that anomalous payloads are a minority of the training data and that their distribution differs from that of the normal payloads. These abnormal payloads can be identified in the training set and their distributions removed from the model. This is accomplished by applying the learned models to the training dataset to detect outliers: anomalous payloads will have a much larger distance to the profile than the average normal samples, and thus will likely appear as statistical outliers. After identifying these anomalous training samples, we can either remove the outliers and retrain the models, or update the frequency distributions of the computed models by removing the counts of the byte frequencies appearing in the anomalous training data.

3.4 Z-String

Consider the string of bytes corresponding to the sorted, rank-ordered byte frequency of a model. Figure 3.6 displays a view of this process. The frequency distribution of the payload of one packet of the worm CRII is plotted in the top graph. The lower graph represents the same information, but the plot is reordered to the rank ordering of the distribution: the first bar in the lower plot is the frequency of the most frequently appearing ASCII character, the second bar is likewise the second most frequent, and so on. This rank-ordered distribution surprisingly follows a Zipf-like distribution (an exponential function or a power law, where a few values appear many times and a large number of values appear very infrequently). The rank-ordered distribution also defines what we call a Z-string.
The byte values ordered from most frequent to least frequent serve as a representative of the entire distribution. Figure 3.7 displays the Z-string for the plot in Figure 3.6. Notice that only 175 distinct byte values appear in this distribution; thus, the Z-string has length 175. Furthermore, as we shall see later, the rank-ordered byte value distribution of a new payload deemed anomalous may also serve as a simple representation of a new worm signature that may be rapidly deployed to other sites to better detect the appearance of the worm at those sites: if an anomalous payload appears at those sites and its rank-ordered byte distribution matches a Z-string provided from another site, the evidence is very good that a worm has appeared. This distribution mechanism is part of an ongoing project called Worminator [41, 68] that implements a collaborative security system on the Internet. A full treatment of this work is beyond the scope of this thesis; the interested reader is referred to [41, 68] for details.

Figure 3.6: Payload distribution for the CRII packet (in Figure 3.3) appears in the top plot, re-ordered to the rank-ordered count frequency distribution in the bottom plot.
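Computing a Z-string from a payload (or, equally, from a model's average distribution) is a simple sort. A minimal sketch with our own function name:

```python
import numpy as np

def z_string(payload: bytes) -> bytes:
    """Byte values ordered from most to least frequent; bytes that never
    appear are omitted, so the Z-string length equals the number of
    distinct byte values in the payload."""
    counts = np.bincount(np.frombuffer(payload, dtype=np.uint8), minlength=256)
    order = np.argsort(-counts, kind="stable")  # descending frequency, ties by byte value
    return bytes(int(b) for b in order if counts[b] > 0)
```

Two payloads with the same rank ordering of byte values produce identical Z-strings even if their absolute frequencies differ, which is what makes the Z-string usable as a compact, shareable signature.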

X..u0U%tej.d.P..E Ta..f..hr.g...1cno... <=bil.$567c GI\py&ADLS sx...-/fjkm...,.4:;?ry )2>@BHMNOQWZ]kw...

Figure 3.7: The signature Z-string computed from the CRII packet (in Figure 3.3). The top numerical part is the ASCII representation of the Z-string, with the middle omitted for brevity. The lower byte sequence is the byte representation of the Z-string, with non-printable characters displayed as '.'.

3.5 Finer-grained Modeling: Multiple Centroids

In the first implementation, PAYL computes one centroid per length bin, followed by a stage of clustering similar centroids across neighboring bins. But when we examined the traffic, we found that payloads of the same length can be further classified into several categories, such as pure text requests, .pdf files, .jpeg files, MS Office files, executable files, etc. A natural way to separate them is to cluster the payloads of one length into multiple centroids, which can represent the normal content flow more accurately and reveal anomalous data with greater clarity. This modeling idea can be extended to include centroids for the different media types that may be transmitted in packet flows. Different file and media types follow their own characteristic 1-gram distributions; including models for standard file types can help reduce false positives. (See [34] for a detailed analysis of this approach.) Previously we computed a model M_ij for each specific observed packet payload length i on each port j. Now we compute a set of models M_kij, k ≥ 1. The clustering is again executed across the neighboring length bins to substantially reduce the memory requirements for the models. The approach we use for building multiple centroids is clustering:

similar packets can be modeled together as one distinct centroid. But since PAYL is a lightweight sensor passively listening to network traffic, we cannot store many packets and apply a batch clustering algorithm, and we do not know a priori how many clusters there are. Traditional clustering algorithms such as K-means and EM therefore cannot be applied here. We decided to adapt the one-pass online clustering algorithm [17]. Basically, it merges a packet into an existing cluster if they are similar; otherwise it creates a new cluster centered at that packet. If the total number of clusters exceeds a threshold, the nearest two are merged. However, the clustering result of this approach is affected by the arrival order of the packets. So we improve the one-pass online clustering algorithm with a small buffer, which we call semi-batched one-pass online clustering. The difference is that we first buffer N packets, for example 50, then apply optimal hierarchical clustering on these N packets locally, and merge the resulting clusters into the existing ones from previous batches. In this way, the ordering problem is ameliorated by the small buffer. The batch size N needs to be chosen properly: a larger N can improve the clustering accuracy, but also incurs higher memory and computation costs.

while (more packets) {
    p = next packet;
    if (p is similar to one of the existing centroids)
        merge p into that centroid;
    else
        create a new centroid; use p as its center;
    if (total number of centroids > MaxSize)
        merge the two nearest centroids;
}

Figure 3.8: One-pass clustering algorithm

Figure 3.10 compares the first six centroids generated by these two algorithms. We can see that the semi-batched algorithm (right plot) is better at grouping out different payload patterns, while the original one-pass online clustering algorithm is more likely to lump them together (especially in the first subplot of the left plot).

merge(c_set1, c_set2) {
    for (each c in c_set1) {
        if (c is similar to one of the centroids in c_set2)
            merge c into that centroid;
        else
            add c as a new centroid to c_set2;
    }
    if (size of c_set2 > MaxNum)
        merge the two nearest centroids until (size == MaxNum);
}

Figure 3.9: Merging two cluster sets

Table 3.1: Detection performance comparison for sorted and unsorted centroids (average and standard deviation of the number of centroid comparisons per packet, for datasets W, W1 and EX)

In the evaluation section we show the detection performance comparison of the two algorithms.

The multi-centroid strategy requires a different test methodology. During testing, PAYL generates an alert if a test packet matches none of the centroids within its length bin. To avoid slowing down testing when there are many centroids to compare against, we sort the centroids according to their popularity, i.e., the number of packets merged into each centroid during training. Under normal traffic conditions, which should be similar to the traffic used for training, most packets quickly find their matching centroid and stop comparing; only the anomalous ones go through all the centroids. Thus the average test time is not greatly delayed. Table 3.1 compares the detection performance in terms of how many centroids are compared for sorted and unsorted centroids; for single-centroid modeling it is always one. W, W1 and EX are the three datasets used for evaluation, described later in the evaluation section.

Figure 3.10: First six centroids for a dataset at one payload length. Upper plot: one-pass online clustering; lower plot: semi-batched one-pass online clustering (batch size 50). The number under each subplot is the number of packets in that centroid during training.
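The one-pass algorithm of Figure 3.8 can be made concrete. In this sketch, "similar" means a Manhattan distance between 1-gram frequency vectors below a threshold, and centroids are running means; these are our own illustrative choices, not necessarily PAYL's exact criteria:

```python
import numpy as np

def byte_freq(payload: bytes) -> np.ndarray:
    """Relative frequency of each of the 256 possible byte values."""
    counts = np.bincount(np.frombuffer(payload, dtype=np.uint8), minlength=256)
    return counts / max(len(payload), 1)

def one_pass_cluster(packets, max_size: int, sim_threshold: float):
    """Figure 3.8 as code: centroids are [count, mean-frequency] pairs."""
    centroids = []
    for p in packets:
        x = byte_freq(p)
        dists = [float(np.sum(np.abs(x - mu))) for _, mu in centroids]
        if dists and min(dists) < sim_threshold:
            i = int(np.argmin(dists))      # merge into the nearest similar centroid
            n, mu = centroids[i]
            centroids[i] = [n + 1, mu + (x - mu) / (n + 1)]
        else:
            centroids.append([1, x])       # new centroid centered at p
        if len(centroids) > max_size:      # merge the two nearest centroids
            pairs = [(i, j) for i in range(len(centroids))
                            for j in range(i + 1, len(centroids))]
            i, j = min(pairs, key=lambda ij: np.sum(
                np.abs(centroids[ij[0]][1] - centroids[ij[1]][1])))
            ni, mi = centroids[i]
            nj, mj = centroids[j]
            centroids[i] = [ni + nj, (ni * mi + nj * mj) / (ni + nj)]
            del centroids[j]
    return centroids
```

The order sensitivity discussed above is visible here: an early outlier packet seeds a centroid that later packets may be absorbed into, which is exactly what the semi-batched variant mitigates.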

3.6 Detection Evaluation

We conducted two sets of experiments to test the effectiveness of the 1-gram models. The first experiment was applied to the 1999 DARPA IDS Data Set, which is the most complete dataset with full payload publicly available for experimental use. The experiments here can be repeated by anyone using this data set to verify the results we report. The second experiment used the CUCS dataset, the inbound network traffic to the web servers of the Computer Science department of Columbia University. Unfortunately, this dataset cannot be shared with other researchers due to the privacy policies of the university. (In fact, the dataset has been erased to avoid a breach of anyone's privacy.)

3.6.1 Experiments with the 1999 DARPA IDS Data Set

The 1999 DARPA IDS data set was collected at MIT Lincoln Labs to evaluate intrusion detection systems. All the network traffic, including the entire payload of each packet, was recorded in tcpdump format and provided for evaluation. In addition, there are also audit logs, daily file system dumps, and BSM (Solaris system call) logs. The data consists of three weeks of training data and two weeks of test data; in the training data there are two weeks of attack-free data and one week of data with labeled attacks. This dataset has been used in many research efforts, and results of tests against this data have been reported in many publications. Although there are problems due to the nature of the simulation environment that created the data, it still remains a useful data set for comparing techniques. The top results were reported by [39]. In our experiments on payload anomaly detection we only used the inside network traffic data, which was captured between the router and the victims.
Because most public applications on the Internet use TCP (web, e-mail, telnet, and ftp), and to reduce the complexity of the experiment, we only examined the inbound TCP traffic to the ports of the hosts xxx.xxx, which contain most of the victims, and to the ports that cover the majority of the network services. For the DARPA 99 data, we conducted experiments

using each packet as the data unit and each connection as the data unit. We used tcptrace to reconstruct the TCP connections from the network packets in the tcpdump files. We also experimented with the idea of truncated payloads, both per packet and per connection: for truncated packets, we tried the first N bytes and the tail N bytes separately, where N is a parameter. Using truncated payloads saves considerable computation time and space. We report the results for each of these models. We trained the payload distribution model on the DARPA dataset using week 1 (5 days, attack free) and week 3 (7 days, attack free), then evaluated the detector on weeks 4 and 5, which contain 201 instances of 58 different attacks, 177 of which are visible in the inside tcpdump data. Because we restrict the victims' IP and port range, there are 14 other attacks we ignore in this test. In this experiment we focus on TCP traffic only, so attacks using only UDP, ICMP, ARP (address resolution protocol) or IP cannot be detected. These include: smurf (ICMP echo-reply flood), ping-of-death (over-sized ping packets), UDPstorm, arppoison (corrupts ARP cache entries of the victim), selfping, ipsweep, and teardrop (mis-fragmented UDP packets). Also, because our payload model is computed from only the payload part of the network packet, attacks that do not contain any payload are impossible to detect with the proposed anomaly detector. Thus, there are in total 97 attacks to be detected by our payload model in the weeks 4 and 5 evaluation data. After filtering, there are in total 2,444,591 packets, and the reconstructed connections, with nonzero-length payloads to evaluate. We build a model for each payload length observed in the training data, for each monitored port, for every host machine. The smoothing factor is set to the value that gives the best result for this dataset (see the discussion in Section 3.2). This helps avoid over-fitting and reduces the false positive rate.
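The training loop just described, one incremental 1-gram model per observed (port, payload length) pair, might look like the following sketch. The (port, payload) input shape and all names are ours, chosen for illustration:

```python
import numpy as np

def byte_freq(payload: bytes) -> np.ndarray:
    """Relative frequency of each of the 256 possible byte values."""
    counts = np.bincount(np.frombuffer(payload, dtype=np.uint8), minlength=256)
    return counts / max(len(payload), 1)

def train_models(flows):
    """flows: iterable of (dst_port, payload) pairs. Returns a dictionary
    mapping (port, payload length) to an incrementally updated
    (sample count, mean, mean-of-squares) model."""
    models = {}
    for port, payload in flows:
        key = (port, len(payload))
        n, mean, mean_sq = models.get(key, (0, np.zeros(256), np.zeros(256)))
        x = byte_freq(payload)
        n += 1
        mean = mean + (x - mean) / n              # running mean (Eq. 3.4)
        mean_sq = mean_sq + (x * x - mean_sq) / n  # for Var = E[X^2] - (EX)^2
        models[key] = (n, mean, mean_sq)
    return models
```

The per-host dimension of the real experiment is omitted here; it would simply extend the dictionary key.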
Also, due to an inadequate number of training examples in the DARPA99 data, we apply clustering to the models as described previously. Clustering the models of neighboring length bins means that similar models can provide more training data for a model whose own training data is too sparse, making it less sensitive and more accurate. But there is also the risk that the

detection rate will be lower when the model allows more variance in the frequency distributions. Based on the models for each payload length, we performed clustering with a threshold of 0.5: if two neighboring models' byte frequency distributions are within a Manhattan distance of 0.5, we merge the models. Because of the simplicity of the DARPA dataset, which contains very regular payload content, multi-centroid modeling does not bring any benefit, while incurring more expensive computation. We also experimented with both unclustered and clustered models; the results indicate that the clustered model is always better than the unclustered model. So in this section we only show the results of the clustered models using a single centroid for each length bin. Different ports carry traffic with different byte variability. For example, the payloads to port 80 (HTTP requests) are usually less variable than those to port 25 (e-mail). Hence, we set a different threshold for each port and check the detector's performance per port. The attacks used in the evaluation may target one or more ports. Hence, we calibrate a distinct threshold for each port and generate ROC curves including all appropriate attacks as ground truth. Packets with distance scores higher than the threshold are flagged as anomalies. Figure 3.11 shows the ROC curves for the four most commonly attacked ports: 21, 23, 25, and 80. For the other ports, e.g., 53, 143, 513, etc., the DARPA99 data does not provide large enough training and testing samples, so the results for those ports are not very meaningful. For each port, we used five different data units for both training and testing. The legends in the plots and their meanings are:

1. Per Packet Model, which uses the whole payload of each network packet;
2. First 100 Packet Model, which uses the first 100 bytes of each network packet;
3. Tail 100 Packet Model, which uses the last 100 bytes of each network packet;
4. Per Conn Model, which uses the whole payload of each connection;

5. Truncated Conn Model, which uses the first 1000 bytes of each connection.

Figure 3.11: ROC curves for ports 21, 23, 25, 80 for the five different models. Note the x-axis scale differs for each plot and does not span to 100%, but is limited to the worst false positive rate of each plot.

From Figure 3.11 we can see that the payload-based model is very good at detecting attacks on port 21 and port 80. For port 21, attackers often first upload some malicious code onto the victim machine and then log in to crash the machine or gain root access, as in casesen and sechole. The test data also includes attacks that upload/download illegal copies of software, like warezmaster and warezclient. These attacks were detected easily because of their content: rarely seen executable code, quite different from the common files going through FTP. For port 80, the attacks are often malformed HTTP requests and are very different from normal requests. For instance, crashiis sends the request

GET ../..; apache2 sends a request with many repetitions of "User-Agent: sioux\r\n"; etc. Using the payload to detect these attacks is more reliable than detecting anomalous headers, simply because these attacks' packet headers are entirely normal: they must establish a good connection in order to deliver their poison payload. Connection-based detection gives better results than the packet-based models for ports 21 and 80. It is also important to notice that the truncated payload models achieve results nearly as good as the full payload models, but are much more efficient in time and space. For port 23 and port 25 the results are not as good as for ports 21 and 80, because their content is quite free-style and some of the attacks are well hidden. For example, the framespoofer attack is a fake e-mail from the attacker that misdirects the victim to a malicious web site, and the website URL looks entirely normal. Malformed e-mail and telnet sessions are successfully detected, like the perl attack, which runs some bad perl commands in telnet, and the sendmail attack, which is a carefully crafted e-mail message with an inappropriately large MIME header that exploits a buffer overflow error in some versions of the sendmail program. For these two ports, the packet-based models are better than the connection-based models. This is likely because the actual exploit is buried within the larger context of the entire connection data, and its particular anomalous character distribution is swamped by the statistics of the other data portions of the connection; the per-packet model detects this anomalous payload more easily. Many attacks involve multiple steps aimed at multiple ports. If we can detect one of the steps at any one port, then the attack can be detected successfully. Thus we correlate the detector alerts from all the ports and plot the overall performance.
When we restrict the false positive rate of each port (during calibration of the threshold) to be lower than 1%, we achieve about a 60% detection rate, which is quite high for the DARPA99 dataset. The results for each model are displayed in Table 3.2.

Per Packet Model:        57/97 (58.8%)
First 100 Packet Model:  55/97 (56.7%)
Tail 100 Packet Model:   46/97 (47.4%)
Per Conn Model:          55/97 (56.7%)
Truncated Conn Model:    51/97 (52.6%)

Table 3.2: Overall detection rate of each model when the false positive rate is lower than 1%

Modeling the payload to detect anomalies is useful for protecting servers against new attacks. Furthermore, careful inspection of the detected attacks in the tables and from other sources reveals that correlating this payload detector with other detectors increases the coverage of the attack space. There is a large non-overlap between the attacks detected via payload and those detected by other systems that have reported results for this same dataset, for example PHAD [47]. This is expected because the data sources and modeling used are totally different: PHAD models packet header data, whereas payload content is modeled here. Our payload-based model has a small memory footprint and is very efficient to compute. Table 3.3 displays measurements of the speed and the resulting number of centroids for each of the models, both unclustered and clustered. The results were derived by measuring PAYL on a 3GHz P4 Linux machine with 2GB of memory, using non-optimized Java code. These results do not indicate how well a professionally engineered system would behave (re-engineering in C would probably gain a factor of 6 or more in speed); rather, they are provided to show the relative efficiency among the alternative modeling methods. The training and test times reported in the table are in seconds per 100MB of data, including I/O time. The number of centroids computed after training approximates the total amount of memory consumed by each model. Notice that each centroid has a fixed size: two 256-element double arrays, one storing the averages and the other the standard deviations of the 256 ASCII byte values. A re-engineered version of PAYL would not consume as much space as a Java byte stream object. From the table we can see that clustering reduces the number of centroids, and the total consumed memory, by a factor of about 2 to 16 with little or no hit in computational performance.
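The per-port threshold calibration against a false-positive budget, as used above, can be sketched as follows. This is our own formulation for illustration; PAYL's actual calibration procedure may differ:

```python
def calibrate_threshold(normal_scores, max_fp_rate=0.01):
    """Return a threshold such that the fraction of normal (non-attack)
    validation scores strictly above it is at most max_fp_rate; a packet
    alerts when its anomaly score exceeds the threshold."""
    s = sorted(normal_scores)
    k = int(len(s) * max_fp_rate)   # number of normal samples allowed to alert
    return s[-(k + 1)] if k < len(s) else s[0] - 1
```

Sweeping max_fp_rate and recording the resulting detection rate at each setting is one way to trace out ROC curves like those in Figure 3.11.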
Combining Figure 3.11, Table 3.2 and Table 3.3, users can choose the proper model for their application according to their environment and performance requirements.

Table 3.3: Speed and memory measurements of each model (Per Packet, First 100, Tail 100, Per Conn., and Truncated Conn.), in unclustered and clustered form. Training and testing times are in units of seconds per 100MB of data, including I/O time; memory consumption is measured as the number of centroids kept after clustering or learning.

This result is surprisingly good for such a simple modeling technique. Most importantly, this anomaly detector can easily augment existing detection systems. It is not intended as a stand-alone detection system but as a component in a larger system aiming for defense in depth. Hence, the detector would provide additional and useful alert information to correlate with other detectors that in combination may generate an alarm and initiate a mitigation process. The DARPA 99 dataset was used here so that others can verify our results. However, we also performed experiments on a live stream, which we describe next.

3.6.2 Experiments with the CUCS Data Set

In the previous section we showed PAYL's good performance on the DARPA99 dataset, which contains many artifacts that make the data too regular [39]. One of the most difficult aspects of doing research in this area is the lack of real-world datasets with full packet content available to researchers for formal scientific study; privacy policies typically prevent sites from sharing their content data. However, we were able to use data from three sources, and we show the distribution for each. The first is an external commercial organization that wishes to remain anonymous, which we call EX. The others are the two web servers of the CS Department of Columbia, www.cs.columbia.edu and www1.cs.columbia.edu; we call

these two datasets W and W1, respectively. Here we report how PAYL performs on the three real-world datasets using known worms available for our research. Since all three datasets were captured from real traffic, there is no ground truth, and measuring accuracy was not immediately possible. We thus needed to create test sets with ground truth, and we applied Snort for this purpose. Each dataset was split into two distinct chronologically-ordered portions, one for training and the other for testing, following the 80%-20% rule. For each test dataset, we first created a clean set of packets free of any known worms still flowing on the Internet as background radiation. We then inserted the same set of worm traffic into the cleaned test set using tcpslice. Thus, we created ground truth in order to compute the accuracy and false positive rates. The worm set includes CodeRed, CodeRed II, WebDAV, and a worm that exploits the IIS Windows media service through the nsiislog.dll buffer overflow vulnerability (MS03-022). These worm samples were collected from real traffic as they appeared in the wild, from both our own dataset and a third party. Because PAYL only considers the packet payload, the worm set is inserted at random places in the test data. The ROC plots in Figure 3.12 show the detection rate versus false positive rate over varying threshold settings of the PAYL sensor, using semi-batched multi-centroid modeling with the number of centroids set to 10 in each length bin. The detection rate and false positive rate are both computed per packet. The test set contains 40 worm packets, although there are only 4 actual worms in our zoo. The plots show the results for each data set, where each graphed line is the detection rate of the sensor when all 4 worms were detected. (This means more than half of each worm's packets were detected as anomalous content.)
From the plots we can see that although the three sites are quite different in payload distribution, PAYL successfully detects all the worms at a very low false positive rate. To provide a concrete example, we measured the average false alerts per hour for these three sites. At a 0.1% false positive rate, the EX dataset has 5.8 alerts per hour, W1 has 6 alerts per hour, and W has 8 alerts per hour. Although at first

Figure 3.12: ROC curves of PAYL detecting incoming worms, with the false positive rate restricted to less than 0.5%

blush, 5-8 alerts per hour may seem too high, a key contribution of this thesis is a method to correlate multiple alerts to extract true worm events from the stream of alerts, which will be covered in a later chapter. We manually checked the packets that were deemed false positives. Indeed, most of them are actually quite anomalous, containing very odd abnormal payloads. For example, in the EX dataset there are weird file uploads; in one case a whole packet contained nothing but a repetition of a character with byte value E7, as part of a Word file. Other packets included unusual HTTP GET requests, with the referrer field padded with many Y characters (via a product providing anonymization). To demonstrate the accuracy improvements of the finer modeling techniques, Table 3.4 gives the false positive rates using the different modeling algorithms, while successfully detecting all the worms. While multi-centroid modeling gives more accurate payload models and reduces the false positive rate, it also brings several problems. First, the memory cost and the trained model size

                                  W        W1       EX
Single-centroid                 0.66%    0.487%   0.982%
Multi-centroid (one-pass)       0.42%    0.225%   0.32%
Multi-centroid (semi-batched)            0.029%   0.107%

Table 3.4: False positive rate of PAYL using the different modeling algorithms on datasets W, W1 and EX, when detecting all the worms.

are considerably bigger. Secondly, the computation is slower for both training and testing, especially for training, because of the clustering algorithm. Thirdly, it takes more training data and time to get a well-trained model, since the model is now fine-grained. But the biggest concern is whether separating out the models, possibly by media type, makes mimicry attacks easier: since each media type's distribution is relatively fixed, it may be easier for attackers to shape their payload to evade the sensor. So there is a trade-off between single-centroid and multi-centroid modeling. We also tested the detection rate of the W32.Blaster worm (MS03-026) on TCP port 135 using real RPC traffic inside Columbia's CS department. Despite this traffic being much more regular than HTTP traffic, the worm packets were in each case easily detected with zero false positives.

3.7 Summary

In this chapter we described a new approach that uses statistical analysis of payload information for fast and accurate intrusion detection. We developed the fully automatic, real-time network payload anomaly detection sensor PAYL, and demonstrated its effectiveness on the DARPA99 dataset and on real traffic from environments inside and outside of campus. The 1-gram payload model is length-conditioned, and specific to a site and service. It is simple, state-free, and quick to compute, in time linear in the payload length. It also has the advantage of being implementable as an incremental, unsupervised learning method. The fine-grained modeling by building multiple centroids for

each length bin greatly improves the detection accuracy, and clustering similar centroids of neighboring lengths helps reduce the model size substantially.

Chapter 4

Anagram: Higher Order n-gram Payload Modeling and Anomaly Detection

In the previous chapter we described PAYL (short for "PAYLoad anomaly detection"), an anomaly sensor that models the normal, attack-free traffic of a network site as 1-gram, byte-value frequency distributions, and we demonstrated its ability to effectively detect attacks. The sensor was designed to be language-independent, requiring no syntactic analysis of the byte stream. Furthermore, PAYL was designed to be efficient and scalable for high-speed networks and applicable to any network service. Various experiments demonstrated that PAYL achieved a high detection rate with low false positives for typical worms and exploits available at the time. This approach is effective at capturing attacks that display abnormal byte distributions, but it is likely to miss well-crafted attacks that focus on simple CPU instructions and that are crafted to resemble normal byte distributions. For instance, although a standard CodeRed II buffer overflow exploit uses a large number of "N" or "X" characters and so appears as a peak in the frequency distribution, [10] shows that the buffer can instead be padded with nearly any random byte sequence without affecting the attack vector. Another example that

does not display abnormal byte distributions is the following simple phpbb forum attack:

GET /modules/forums/admin/admin styles.php?phpbb root path= /cmd.gif?&cmd=cd%20/tmp;wget% /criman;chmo d%20744%20criman;./criman;echo%20yyy;echo..http/1.1.host: User Agent:.Mozilla/4.0.(compatible;.MSIE.6.0;.Windows.NT.5.1;).

In such situations, the normal byte distribution model is insufficient by itself to identify these attack vectors as abnormal data. However, invariants remain in the packet payloads: the exploit code, the sequence of commands, or the special URL that should not appear in the normal content flow to the target application. By modeling higher order n-grams, Anagram captures the order dependence of byte sequences in the network payload, enabling it to capture more subtle attacks. The core hypothesis is that any new, zero-day exploit will contain a portion of data that has never before been delivered to the application. These subsequences of new, distinct byte values will manifest as anomalous n-grams that Anagram is designed to efficiently and rapidly detect.

Furthermore, most researchers correctly suspected that PAYL's simplicity would be easily blinded by mimicry attacks. Kolesnikov, Dagon and Lee [28] demonstrated a new blended, polymorphic worm designed to evade detection by PAYL and other frequency distribution-based anomaly detectors. This demonstration represents a new class of smart worms that launch their attacks by first sniffing traffic and shaping the datagram to the statistics specific to a given site in order to appear normal. The same principles may be applied to the propagation strategy as well, as in, for example, parasitic worms. Since PAYL only models 1-gram distributions, it can be easily evaded with proper padding that avoids detection of anomalous byte sequences. As a countermeasure, we conjecture that higher order n-gram modeling may likely detect these anomalous byte sequences.
Unfortunately, computing a full frequency distribution for higher order n-grams is computationally and memory-wise infeasible, and would require a prohibitively long training period even for modest gram

sizes.

In this chapter we present a new sensor, Anagram, which introduces the use of Bloom filters and a binary-based detection model. Anagram's approach to network payload anomaly detection uses a mixture of higher order n-grams (n > 1) to model and test network traffic content. N-gram analysis is a well-known technique that has been used in a variety of tasks, such as system call monitoring [48, 15, 72]. In Anagram, the n-grams are generated by sliding windows of arbitrary lengths over a stream of bytes, which can be a network packet, a request session, or another type of data unit. Anagram does not compute frequency distributions of normal content flows; instead, it trains its model by storing all of the distinct n-grams observed during training in a Bloom filter, without counting the occurrences of these n-grams. In this chapter, we demonstrate that this binary-based higher order n-gram approach attains remarkably high detection rates and low false positive rates. The use of Bloom filters makes Anagram memory efficient and allows for the modeling of a mixture of different sizes of n-grams extracted from packet payloads, i.e., an Anagram model need not contain samples of a fixed-size gram. This strategy is demonstrated to exceed PAYL in both detection and false positive rates. Furthermore, Anagram's modeling technique is easier to train, and allows for an estimate of when the sensor has been trained enough for deployment. The Bloom filter model representation also provides the added benefit of preserving the privacy of shared content models and alerts for cross-site correlation.

In figure 4.1 and figure 4.2, we show the overall architecture of the Anagram sensor, for both training and testing. The general architecture, apart from the detailed modeling/testing techniques, can also be applied to the PAYL sensor. The bad content model is an n-gram model pre-computed from known attacks, and its utility will be described in section 6.2.
The alert correlation techniques will be covered in chapter 7. In the following sections we give a detailed description of Anagram, which outperforms PAYL in the following respects:

- Accuracy in detecting anomalous payloads, even carefully crafted mimicry attacks, with a demonstrably lower false positive rate;

- Computational efficiency in detection by the use of binary-based modeling and fast (and incremental, linear-time) hashing in its Bloom filter implementation;

- Model space efficiency, since PAYL's multiple-centroid modeling is no longer necessary, and Bloom filters are compact;

- Fast correlation of multiple alerts while preserving privacy, as collaborating sites exchange Bloom filter representations of common anomalous payloads;

- The generation of robust signatures via cross-site correlation for early warning and detection of new zero-day attacks.

Figure 4.1: General architecture of the Anagram sensor, the training phase.

In the following sections, we describe these mechanisms in detail and present experimental results of testing Anagram against network traces sniffed from our local LAN. The last two points listed above, concerning alert correlation and signature generation, will be covered in chapter 7 on collaborative security.

Figure 4.2: General architecture of the Anagram sensor, the detection phase.

4.1 Higher Order N-gram Payload Model

While higher order n-grams contain more information about payloads, the feature space grows exponentially as n increases. Comparing an n-gram frequency distribution against a model is infeasible, since the training data is simply too sparse; the length of a packet is too small compared to the total feature space size of a higher order n-gram. One TCP packet may contain only a thousand or so n-grams, while the feature space size is 256^n. Clearly, with increasing n, generating sufficient frequency statistics to estimate the true empirical distribution accurately is simply not possible in a reasonable amount of time.

Frequency-based modeling. During the training phase, we compute the appearance frequency of each n-gram as f(g_i) = t(g_i) / Σ_i t(g_i), where t(g_i) is the number of occurrences of n-gram g_i. Obviously, f(g_i) = 0 for n-grams never seen in the training dataset. Then, during the testing phase, we compute the average appearance frequency of all the n-grams in a test data unit as the detection score. Let g_p denote an n-gram that appears in data unit p, and let T be the total number of n-grams in p; the detection score of p is

    score = Σ_p f(g_p) / T ∈ [0, 1]    (4.1)
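A minimal sketch of this frequency-based scheme (equation 4.1), using a plain dictionary in place of any real sensor machinery; the function names are illustrative:

```python
from collections import Counter

def ngrams(data, n):
    """Slide an n-byte window over a data unit (packet, request, etc.)."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]

def train_frequency_model(training_payloads, n=5):
    """f(g_i) = t(g_i) / sum_j t(g_j): relative frequency of each n-gram
    observed in training; unseen n-grams implicitly get f = 0."""
    counts = Counter()
    for p in training_payloads:
        counts.update(ngrams(p, n))
    total = sum(counts.values())
    return {g: t / total for g, t in counts.items()}

def frequency_score(model, payload, n=5):
    """Equation 4.1: average model frequency of the T n-grams in the payload.
    A low score means the content consists mostly of rare or unseen n-grams."""
    grams = ngrams(payload, n)
    if not grams:
        return 0.0
    return sum(model.get(g, 0.0) for g in grams) / len(grams)
```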

Binary-based modeling. In this approach, we no longer consider the appearance frequency of each n-gram. Instead, we simply consider the presence or absence of the test n-grams in the set of n-grams learned from the training data. The detection score is now the percentage of never-seen n-grams out of the total number of n-grams in a data unit. With N_new the number of new n-grams not seen before and T the total number of n-grams in data unit p, the score is

    score = N_new / T ∈ [0, 1]    (4.2)

As the formulas suggest, both approaches are straightforward. The frequency-based approach assumes that attacks contain n-grams that are not often seen, while the binary-based approach assumes that attacks usually contain some never-seen n-grams used for exploitation. At first glance, the frequency-based approach retains more information about packet content; one might suspect it would model data more accurately and perform better at detecting anomalous data. Interestingly, our experiments show the opposite. Further analysis shows that this result is a direct consequence of the huge feature space of higher order n-grams: the feature space is so large that it is impractical to obtain a good frequency model during training within a reasonable time period. As a concrete example, given a 100Mbps network with sustained traffic, it would take approximately 24 hours to see all possible 5-grams, even if each 5-gram were to appear only once. As n increases, the time grows much longer; for 7-grams, it takes about 178 years! Given the same amount of training data, the binary-based model performs significantly better than the frequency-based approach.

We analyzed the network traffic of the Columbia Computer Science website and, as expected, a small portion of the n-grams appear frequently while there is a long tail of n-grams that appear very infrequently.
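The binary-based score (equation 4.2) reduces to set membership. The sketch below uses a Python set where Anagram uses a Bloom filter, and the function names are illustrative:

```python
def ngram_set(data, n):
    """Distinct n-grams in a data unit, via a sliding n-byte window."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def train_binary_model(training_payloads, n=5):
    """Record only the presence of each distinct n-gram seen in training;
    no occurrence counts are kept."""
    seen = set()
    for p in training_payloads:
        seen |= ngram_set(p, n)
    return seen

def binary_score(seen, payload, n=5):
    """Equation 4.2: N_new / T, the fraction of never-seen n-grams among the
    T n-grams of the payload; a high score indicates anomalous content."""
    grams = [payload[i:i + n] for i in range(len(payload) - n + 1)]
    if not grams:
        return 0.0
    n_new = sum(1 for g in grams if g not in seen)
    return n_new / len(grams)
```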
This can be seen in table 4.1, which displays the percentage of the n-grams by their frequency counts for 90 hours of CS web traffic. Since a significant number of n-grams have a small frequency count, and the number of n-grams in a packet is very small relative to the whole feature space, the frequency-distribution model incurs relatively high false positives. Thus, the binary-based model provides a reasonable

frequency count        3-grams    5-grams    7-grams
—                           —%     39.13%     32.53%
2 to —                      —%     28.22%     28.48%
—                           —%     32.65%     38.99%

Table 4.1: The percentage of the observed unique n-grams at different frequencies of occurrence for 90 hours of training traffic.

estimate of how normal a packet may be. This is a rather surprising observation; as we will demonstrate, it works very well in practice. The conjecture is that true attacks will be delivered in packets that contain many more n-grams not observed in training than the normal packets used to train the model. After all, a true zero-day attack must deliver data to a server application that has never been processed by that application before. Hence, the data exercising the vulnerability is very likely to contain an n-gram of some size never before observed. By modeling a mixture of n-grams, we increase the likelihood of observing these anomalous grams.

To validate this conjecture, we compare the ROC curves of the frequency-based and binary-based approaches for the same datasets (representing equivalent training times), as displayed in figure 4.3. We collected the web traffic of two CS departmental web servers, www and www1; the former serves the department webpage, while the latter serves personal web pages. Traffic was collected for two different time periods: a period of sniffed traffic from 2004 and another dataset sniffed in 2006. The 2004 datasets (www-04 and www1-04) contain 160 hours of traffic; the 2006 datasets (www-06 and www1-06) contain about 560 hours. We tested for the detection of several real worms and viruses: CodeRed, CodeRed II, WebDAV, Mirela, a phpbb forum attack, and a worm that exploits the IIS Windows media service via the nsiislog.dll buffer overflow vulnerability (MS03-022). These worm samples were collected from real traffic as they appeared in the wild, both from our own dataset and from a third party.

For the first experiment, we used 90 hours of www1-06 for training and 72 hours for

testing. (Similar experiments on the other datasets display similar results, and we omit them here for brevity.) To make it easier for the reader to see the plots, the curves are plotted for the cases where the false positive rate is less than 0.03%. Both the detection rate and false positive rate are calculated over packets with payloads; non-payload (e.g., control) packets were ignored. Notice that the detection rate in figure 4.3 is computed per packet, rather than per attack. Some attacks comprise multiple packets; while fragmentation can result in a few packets appearing normal, we can still guarantee reliable attack detection over the entire set of packets. For example, for the IIS5 WebDAV attack, 5-grams detect 24 out of 25 packets as anomalous. The only missed packet is the first one, which contains the buffer overflow string SEARCH /AAA...AA, because the 5-gram AAAAA appeared in the training data; this packet is not the key exploit part of the attack. For further comparison, we list in table 4.2 the minimum false positive rates when detecting all attack attempts (where an attack is detected if at least 80% of its packets are classified as anomalous) for both the binary-based and frequency-based models.

Figure 4.3: ROC curves comparing the frequency-based and binary-based n-gram approaches.

The binary-based approach yields significantly better results than the frequency-based approach. When a 100% detection rate is achieved for the packet traces analyzed, the false

               3-grams   4-grams   5-grams   6-grams   7-grams   8-grams
Freq-based
Binary-based

Table 4.2: The false positive rate (%) of the two approaches using different n-grams when achieving a 100% detection rate, on the www1-06 train/test dataset.

positive rate of the binary-based approach is at least one order of magnitude less than that of the frequency-based approach. The relatively high false positive rate of the frequency-based approach suggests that much more training is needed to capture accurate statistical information and be competitive. In addition, the extremely high false positive rate of the 3-gram frequency-based approach is due to the fact that the 3-grams of the phpbb attack all appear frequently enough to make them hard to distinguish from normal content packets. The binary-based approach used in Anagram, on the other hand, results in far better performance. The 0.01% false positive rate averages to about 1 alert per hour for www1 and about 0.6 alerts per hour for www. The results also show that 5-grams and 6-grams give the best performance, and we have found this to be true for other datasets as well. In Anagram, we therefore adopt the binary-based approach, which performs better than the frequency-based one.

Another great benefit of the binary-based approach is its much smaller memory consumption. Instead of storing a floating-point number for the appearance frequency, we need only one bit to record whether an n-gram has been seen, using a Bloom filter. As previously stated, Anagram may easily model a mixture of different n-grams simply by storing them in the same Bloom filter. However, larger n-grams may require additional training; as we shall describe shortly, our modeling approach allows us to estimate when the sensor has been sufficiently trained.

4.2 Model Size and Bloom Filters

As previously stated, one significant issue when modeling with higher order n-grams is memory overhead. By leveraging the binary-based approach, we can use more memory-efficient set-based data structures to represent the set of observed n-grams. In particular, the Bloom filter (BF) [4] is a convenient tool to represent the binary model. Instead of using n bytes to represent an n-gram, or even 4 bytes for a 32-bit hash of the n-gram, the Bloom filter can represent a set entry with just a few bits, reducing memory requirements by an order of magnitude or more.

A Bloom filter is essentially a bit array of m bits, where an individual bit i is set if the hash of an input value, mod m, is i. As with a hash table, a Bloom filter acts as a convenient one-way data structure that can contain many items, but generally is orders of magnitude smaller. Operations on a Bloom filter are O(1), keeping computational overhead low. A Bloom filter contains no false negatives, but may contain false positives if collisions occur; the false positive rate can be optimized by changing the size of the bit array and by using multiple hash functions (and requiring all of them to be set for an item to be verified as present in the Bloom filter; in the rare case where one hash function collides between two elements, it is highly unlikely that a second or third would also simultaneously collide). By using universal hash functions [50], we can minimize the probability of multiple collisions for the n-grams in one packet (assuming each n-gram is statistically independent); the Bloom filter is therefore safe to use and does not negatively affect detection accuracy.
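One concrete member of this family is tabulation ("H3-style") hashing: precompute a table of random words indexed by window position and byte value, and XOR together one entry per byte of the n-gram. The sketch below is a simplified byte-level illustration (the table size and word width are arbitrary choices, not our implementation); its XOR structure is also what makes the hash cumulative, a property exploited later in this chapter for incremental hashing of overlapping n-grams.

```python
import random

random.seed(1)                  # fixed seed so the toy table is reproducible
MAX_N = 8                       # largest n-gram window we will hash
# T[i][b]: a random 32-bit word for byte value b at window position i
T = [[random.getrandbits(32) for _ in range(256)] for _ in range(MAX_N)]

def h3_hash(gram):
    """h(x) = XOR_i T[i][x_i]: a simplified universal tabulation hash."""
    v = 0
    for i, b in enumerate(gram):
        v ^= T[i][b]
    return v

def extend_hash(prefix_hash, prefix_len, suffix):
    """Cumulative property: the hash of prefix||suffix is obtained from the
    prefix's hash by XOR-ing in only the new bytes' table entries, e.g.
    extending a 5-gram hash to a 7-gram hash costs just 2 lookups."""
    v = prefix_hash
    for j, b in enumerate(suffix):
        v ^= T[prefix_len + j][b]
    return v
```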
Figure 4.4 gives an example of inserting 5-grams into a Bloom filter using two hash functions.

Memory overhead

While Bloom filters are comparatively small even when inserting a large number of entries, choosing the optimal size of a Bloom filter is nontrivial, since Anagram is not aware of a site's distribution (and the number of unique n-grams) before building its model. Additionally, a Bloom filter cannot be dynamically resized, as the hash values cannot be

recomputed without the original underlying training data. A large Bloom filter will waste memory, but small Bloom filters saturate more quickly, yielding higher false positive rates. It is worth pointing out that "large" is relative; a Bloom filter of 2^24 bits is capable of holding 2^24/n_h elements, where n_h is the number of hash functions used, in only 2MB of memory, e.g., each n-gram inserted uses about 2.7 bits when 3 hash functions are used. Additionally, we can use traditional compression methods (e.g., LZW) to store a sparse Bloom filter, which significantly reduces storage and transport costs. As discussed later in this chapter, our experiments anecdotally suggest this Bloom filter size is large enough for at least 5-grams, assuming a mostly textual distribution. The presence of binary data does significantly increase Bloom filter requirements; if allocating extra initial memory is undesirable, a layered approach can be employed, where new Bloom filters are created on demand as previous Bloom filters saturate, with a small constant-time overhead. It should be evident that Bloom filters can be trivially merged via bitwise OR-ing and compared via bitwise AND-ing.

Figure 4.4: Inserting n-grams (n=5) into a Bloom filter.
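A minimal Bloom filter illustrating the operations described above: k hash positions per item, no false negatives, and model merging via bitwise OR. The digest-based hashing here is only for clarity (Anagram uses fast universal hashes), and the sizes are arbitrary:

```python
import hashlib

class BloomFilter:
    """Bit array of m bits with k hash positions derived from seeded digests."""

    def __init__(self, m_bits=2**20, k=3):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        for seed in range(self.k):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All k bits must be set; a clear bit proves absence (no false
        # negatives), while a false positive needs all k bits to collide.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

    def merge(self, other):
        """Two models (e.g., from collaborating sites) merge by bitwise OR."""
        assert self.m == other.m and self.k == other.k
        self.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))
```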

Computation overhead

The overhead of inserting or verifying a single item in a Bloom filter is O(1). However, the constant-time overhead of processing an n-gram can become significant when inserting or checking n-grams over a large population of packets. This effect is magnified when an Anagram model uses different-size n-grams, as the same data is repeatedly hashed into different positions in the Bloom filter for the various n-gram windows being processed. Coupled with the fact that Bloom filters need a good source of hashes, Anagram's largest computational overhead in both training and testing rapidly becomes the hash operation. To reduce this overhead, we make use of a cumulative universal hash function. A cumulative hash function fulfills the requirement that h(c(a, b)) = d(h(a), h(b)), where h is our hash function, a and b are data (n-grams or fragments of them), c is a (bitwise) concatenation function, and d is a composition function. Given such a hash function, we can avoid re-computing hashes when sliding the n-gram window and when using different window sizes; e.g., if we have hashed a 5-gram and need the hash of a 7-gram, we can just hash the incremental 2 grams and combine the result with the 5-gram hash value. A class of universal hash functions known as H3 [59] uses XOR as the composition function, which is very fast and lends itself well to our application. Thus, Anagram's modeling engine may perform fast enough to scale to very high bandwidth environments. Dharmapurikar et al. describe a similar technique [14] for static signature scanning in hardware.

4.3 Discussion

In our experiments, we noticed that some packets always receive similarly high anomaly scores and are deemed false positives, even as we keep increasing the training time. We manually inspected these packets to determine which foreign n-grams cause the high scores and where they come from.
The analysis shows that these foreign n-grams are mostly located in one of the following fields of an HTTP request: Cookie, Referer, or some type of session/authorization ID. The following fragment of a Referer field, extracted from a false-positive packet, is a typical example:

:.*/*..Referer:. start=s2v5d29yzhm9bw9ua2v5cyuymgzvciuymhnhbgumegfyz3m 9MTJLUGpnMTlKU29JZTltdmluRjl5MVdlR0h3RnNQNWNQcHNONXND celoshbsdnhyuvven

All such fields share a common feature: high-entropy content. Anagram performs poorly here because, statistically, every n-gram has almost equal probability of appearing, so a high percentage of never-seen n-grams does not necessarily indicate an anomaly. Instead, low entropy, or a low percentage of never-seen n-grams, might be an indicator of abnormality in such areas. There are several possible ways to handle this. The intuitive way, which we used, is an application-specific protocol normalizer for the content, with special handling for such fields. A separate model can also be built that measures the entropy level instead of normal n-grams. There has also been work [62] using compression rate, entropy, or Kolmogorov complexity (a measure of the randomness of strings based on their information content) to detect anomalies in encrypted channels, which we can adapt to detect anomalies within such high-entropy areas.

A more extreme case is that of encrypted channels. The payload anomaly detector clearly needs the clear text to model the traffic appropriately. Thus, our sensors would require placement at the point of decryption. In the case of a LAN appliance, the system would serve as a man-in-the-middle proxy, decrypting requests, testing the payload, and re-encrypting normal traffic as necessary to forward the data on to its intended service. In this thesis, we only consider clear-text modeling and detection of unencrypted channels, assuming the data is available for modeling.
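Returning to the high-entropy fields above, one simple realization of the entropy-based check is to compute the Shannon entropy of a suspect field and route high-entropy values (random-looking cookies and session IDs) to special handling instead of n-gram scoring. The threshold below is a hypothetical value for illustration, not a tuned parameter of our sensor:

```python
import math
from collections import Counter

def byte_entropy(data):
    """Shannon entropy in bits per byte, from 0 (constant) to 8 (uniform)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def needs_entropy_handling(field_value, threshold=4.5):
    """Hypothetical cutoff: fields above it (e.g., base64 session IDs) would
    be scored by an entropy model rather than by never-seen n-gram rate."""
    return byte_entropy(field_value) >= threshold
```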

4.4 Implementation

There are several problems to consider when deploying the sensor in real time. As described before, we use Bloom filters to store the n-gram information, but a Bloom filter may have a high collision rate if it becomes over-saturated, and dynamic resizing is difficult. Alternatively, we could pre-allocate a huge filter, at the risk of wasting memory, or pre-sample to estimate the needed Bloom filter size, at the risk of poor performance if the content shifts or the sampling is inaccurate. To avoid such problems, we use a stack of Bloom filters in our implementation. Each Bloom filter has a fixed size, for example 2^24 bits, and a new one is added when the existing ones start to saturate; we search each Bloom filter in turn. This use of Bloom filters keeps memory consumption low, but at the cost of computation speed: multiple hash values (three in our implementation), instead of one, must be computed to check an entry in one Bloom filter, and with a stack of M Bloom filters the cost is M times larger in the worst case. While we have recently built an optimized Bloom filter implementation using H3 [59], Anagram's speed is still only about 10Mbps including I/O time.

To speed it up, we built a small cache, implemented as a hash set, for the most frequent n-grams. This comes from the observation that a small set of n-grams appears repeatedly with high frequency, as a Zipf-like distribution suggests. Note that if an n-gram appears in the cache, there is no need to check the Bloom filters, and the computation cost is reduced from 3 or more hashes to 1 lookup. To validate this idea, we compared several cache strategies:

- Caching the first N n-grams that appear. The cache is static once filled and costs nothing to maintain, but a bad starting point might give a skewed cache with a low hit rate.

- Caching the N most frequently appearing n-grams, calculated offline from a sample dataset.
- Caching the N most recently appearing n-grams. This cache needs to be dynamically maintained.
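The "most recent N" strategy is essentially an LRU cache placed in front of the Bloom filter stack. A sketch, with illustrative names; caching negative answers as well is a simplification for the example:

```python
from collections import OrderedDict

class NgramCache:
    """LRU cache of recent n-gram lookups in front of a slower membership
    check (e.g., probing a stack of Bloom filters with 3 hashes each)."""

    def __init__(self, slow_lookup, capacity=1000):
        self.slow_lookup = slow_lookup      # callable: gram -> bool
        self.capacity = capacity
        self.cache = OrderedDict()          # gram -> cached answer
        self.hits = self.misses = 0

    def contains(self, gram):
        if gram in self.cache:
            self.hits += 1
            self.cache.move_to_end(gram)    # refresh recency
            return self.cache[gram]
        self.misses += 1
        answer = self.slow_lookup(gram)     # the expensive path
        self.cache[gram] = answer
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least-recently-used entry
        return answer
```

A cache hit answers in one hash-set lookup instead of three or more Bloom filter hashes, which is where the speed-up reported below comes from.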

Cache size                      —        —        —        —
First N n-grams             23.4%    57.5%    69.1%    75.0%
Most frequent N n-grams     76.1%    80.5%    82.9%    84.5%
Most recent N n-grams       77.7%    80.8%    82.9%    84.4%

Table 4.3: Hit rates of different cache strategies for Anagram speed-up.

We applied each strategy to two 12-hour datasets and summarize their average hit rates in table 4.3. For the offline most-frequent-n-gram calculation, we used 6 hours of data preceding the test data. The total number of unique n-grams in the test datasets is … and …, respectively. Using a small cache greatly improves the computation speed at a tiny memory cost. When using the 1000 most frequent n-grams as the cache, the speed more than doubles, achieving 26.3Mbps including I/O time. As the current Anagram prototype is implemented in approximately 3,000 lines of Java code, we expect further improvement from an optimized C++ implementation, and we expect to be able to handle a 100Mbps network subnet in real time.

4.5 Summary

In this chapter, we presented Anagram, a content anomaly detector based upon higher order n-gram (n > 1) analysis using a binary-based modeling technique. Compared to PAYL, Anagram can detect significant anomalous byte sequences and capture more subtle attacks that do not display a strange byte distribution. The sensor models the distinct content flow of a network or host using a semi-supervised training regimen. Previously known exploits, extracted from the signatures of an IDS, are likewise modeled in a Bloom filter and are used during training as well as at detection time. Our tests suggest Anagram has less than a 0.01% false positive rate along with a 100% detection rate for a variety of worms and viruses detected in traces of our local network traffic. Anagram's use of Bloom

filters reduces space requirements, and will play a further important role in effective privacy-preserving cross-site correlation and signature generation, which we will cover in later chapters.

Chapter 5

Randomization against Mimicry Attack

5.1 Mimicry Attack

As mentioned earlier, mimicry attacks are one of the most significant threats to any anomaly detector. If attackers can gain information about the normal profile, they can mimic the normal environment and hide the exploit to evade the sensor easily. The notion of a mimicry attack on an anomaly detection system was first introduced in 2001 by Wagner and Dean [75], and initial efforts to generate mimicry attacks, including [76] and [71], focused on system-call anomaly detection. With the advent of effective network payload-based anomaly detection techniques, researchers have begun building smart worms that employ a combination of polymorphism and mimicry attack mechanisms. Kolesnikov, Dagon and Lee [28] built a worm specifically designed to target network anomaly detection approaches, including PAYL. They use a number of techniques, including polymorphic decryption, normal traffic profiling and blending, and splitting, to effectively defeat PAYL and several other IDSes. Figure 5.1 shows the detailed worm structure.¹

¹Many thanks to Wenke Lee for his permission to use the figures from his paper.

The resulting worm displays a byte frequency distribution very similar to the normal profile and successfully evades several anomaly IDS systems, including PAYL, while before

padding PAYL can easily catch it.

Figure 5.1: This figure shows the structure of the blended exploit buffer. The variable parts depend on the exploit used. The buffer may be split into several packets by the network stack when transmitted. The Maximum Segment Size (MSS) on our system was 1460, so each packet above, including headers, was no larger than 1460 bytes.

Figure 5.2, copied from the paper [28], shows the comparison of the byte frequency distribution of the worm packet against the normal port 80 traffic, before and after padding. The byte frequencies of the normal traffic are sorted first, and the attack packet's plot is then shown following the same byte order. The right plot shows how similar the attack packet's distribution is to that of the normal traffic.

Mimicry attacks are possible if the attacker has access to the same information as the victim, or has knowledge of the normal profile to mimic. This is simple in some cases where the environment is standard. In the case of application payloads, attackers (including

worms) would not know the distribution of the normal flow to their intended victim. The attacker would need to sniff each site for a long period of time and analyze the traffic in the same fashion as the detector described herein, and would then also need to figure out how to pad the poison payload to mimic the normal model. The attacker would have to be clever indeed to guess the exact distribution as well as the threshold logic needed to deliver attack data that would go unnoticed. Additionally, any attempt to do this via probing, crawling or other means is very likely to be detected.

Figure 5.2: Comparison of frequency distributions of the attack packet (unpadded, left; padded, right) and normal port 80 traffic. The padded worm packet matches the normal traffic well.

This sounds quite daunting initially, especially since the attacker would need to sniff for a long time without being noticed. But after a more careful examination, it turns out that this may not be so difficult a task to accomplish. An anomaly detector like PAYL learns the traffic for a long enough time to cover as much of the normal content it sees as possible, since we wish to have a low false positive rate; that is one of the reasons why we use both the average and the standard deviation of the byte distribution in modeling. A mimicry attack, however, does not need to cover every possible case. As long as the attack is similar to one of the normal cases (there is no need to be similar to the average centroid, in the case of PAYL), that strategy is good enough for the mimicry attack to succeed. So, if the mimicry attack can invade a host and sniff the traffic in its local LAN for profile learning, it can perform mimicry easily. Otherwise, it

can sniff some external environment, for example the attacker's own local LAN, for regular traffic going to the targeted host network. Since sniffed normal traffic from that local environment to the target should be judged non-malicious, a mimicry attack simulating such traffic has a high chance of being treated as normal too. If there is no traffic at all to the target from the attacker's local environment, the attacker can even issue some normal requests to the target server first and use those as a baseline to mimic, although that may not be accurate enough to shape a normally appearing attack vector.

To verify this assumption, we first computed the byte distribution model from dataset W, which contains the traffic to the www web server of our CS department, using 36 hours of traffic. Then we limited the traffic to that originating from a subset of IP addresses within our department ( xx) to web server W, and recomputed the payload model using another 24 hours of traffic. In Figure 5.3, we show the global and local payload models for packet length 525, which have 9534 and 26 samples, respectively. As the plot shows, although the local model is computed from far fewer samples than the global one, the two are quite similar. Mimicking based on observing a small local range of normal traffic might therefore give the attacker a good enough estimate to shape a successful mimicry attack. Thus the mimicry attack engine need not be embedded in the target environment, and is a much more serious threat than first suspected.

Besides mimicry attacks, clever worm writers may find a way to launch training attacks against anomaly detectors such as PAYL. In this case, the worm may send a stream of content with increasing diversity to its next victim site in order to train the content sensor to produce models in which its exploit no longer appears anomalous. This too is a daunting task for the worm.
The worm would have to be fortunate indeed to launch its training attack while the sensor is in training mode, and to have a stream of diverse data go unnoticed while the sensor is in detection mode. Furthermore, the worm would have to be extremely lucky for each of the content examples it sends to train the sensor to produce a non-error response from the intended victim; indeed, PAYL ignores content that does not produce a normal service response. These two evasion techniques, mimicry and

training attacks, are an active area of research on anomaly detection, and a formal treatment of the range of counter-evasion strategies will be discussed later.

Figure 5.3: Comparison of the payload model computed using the global traffic and local partial traffic. The upper plot shows the global model computed using all the traffic that web server W received, and the lower one gives the local model built by observing the traffic to server W originating from several local IP addresses, for payload length 525.

Anagram against Mimicry Attack

The mimicry attack mentioned in the previous section mimics the normal byte distribution and successfully evaded PAYL. But this mimicry attack pads the attack payload without considering the sequence of bytes, so Anagram can easily detect any variants of the crafted attacks. We adapted the PAYL-evading worm engine to launch a mimicry attack against Anagram. Instead of padding the packet to simulate the byte frequency, we padded attack packets with normal strings; in this case, long URLs of the target website, which should

be, by first principles, composed of normal n-grams that the site sees often. Although the anomaly scores are greatly reduced by this padding, the remaining portions of the crafted attack packets still retain enough abnormal n-grams to be detected by Anagram. Besides the sled, which provides the opportunity for crafted padding, the attack packet still requires a byte sequence for the polymorphic decryptor, the encrypted exploit, encoded attacks, and the embedded mapping table. Since the amount of space in each packet is limited, the mimicked worm content containing the exploit vector is purposely spread over a long series of fragmented packets. Thus, the worm is fragmented so that each packet on its own does not appear suspicious. This strategy is described in the aforementioned paper and is akin to a multi-partite attack strategy where the protocol processor assembles all of the distributed pieces necessary for the complete attack. Using the blended polymorph worm engine, we generated different variants of the worm. Table 5.1 shows the maximum padding length of each version. Each cell in the top row contains a tuple (x, y), representing a variant sequence of y packets of x bytes each. The second row represents the maximum number of bytes that can be used for padding in each packet.

Version          (418, 10)   ...   (1460, 100)
Padding length   ...

Table 5.1: The maximum possible padding length for a packet, for different variants of the mimicry attack.

It's obvious that a substantial chunk of each packet must be reserved for the exploit, where we conjecture malicious higher-order n-grams will appear to encode the encrypted exploit code or the decryptor code. We tested Anagram over these modified mimicry attacks where the padding contained normal, non-malicious n-grams, and all of the attacks were successfully detected with less than 0.1% false positive rate.
This is the case because the crafted attack packets still require at least 15%-20% of their n-grams for code, which were detected as malicious. The false positive rate grows, however, as the packet length gets longer. The worst case for the

(1460, 100) experiment yields a false positive rate of around 0.1%. This experiment demonstrates that Anagram raises the bar for attackers, making mimicry attacks harder, since the attacker now has the task of carefully crafting the entire packet to exhibit normal n-grams throughout its whole content. Further effort is required for mimicry attacks to encode the attack vectors or code in a way that appears as normal high-order n-grams. Without knowing exactly which value of n, the size of the modeled grams, they should plan for, the attacker's problem becomes even harder. We take this uncertainty and extend it in the next section into a more thorough strategy to thwart mimicry attacks.

Randomization

The general idea of a payload-based mimicry attack is simply to evade detection by combining small pieces of exploit code with a large amount of normal padding data to make the whole packet look normal. But as we've seen in the example above, no matter what techniques are used for padding, there have to be some non-padded, exposed sections of data to decode the exploit of the target vulnerability. Since our current payload-based anomaly sensors use the whole packet payload as a data unit for modeling and testing, the attacker can craft the whole packet to look normal, without considering locally where to place the padding and where to put the exploit data. The key idea to thwart these mimicry attacks is to introduce some randomness into the anomaly detector's modeling or testing, and to keep this information secret. Instead of using the whole payload, the sensor can randomly choose secret sub-portions of the packet payload to model and test separately. When mimicking, the attacker has to determine where to pad with normal data and where to hide the exploit code. Without knowing exactly which portions of the packet are tested by the detector, the task is complicated, and possibly reduced to guessing, even if the attacker knew what padding would be considered normal.
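The idea of secretly selecting sub-portions can be sketched as follows; this is a minimal illustration, where the function names and the per-sensor seed are hypothetical stand-ins rather than PAYL's or Anagram's actual code:

```python
# Sketch of a secret random partition of payload byte positions, as a
# sensor instance might draw once and keep private. Names and the seed
# are illustrative; this is not the actual sensor implementation.

import random

def make_secret_partition(max_len, n_parts, seed):
    """Assign each byte position (mod max_len) to one of n_parts groups."""
    rng = random.Random(seed)            # per-sensor secret seed
    return [rng.randrange(n_parts) for _ in range(max_len)]

def split_payload(payload, partition, n_parts):
    """Divide payload bytes into interleaved subsequences S1..SN."""
    parts = [bytearray() for _ in range(n_parts)]
    for pos, byte in enumerate(payload):
        parts[partition[pos % len(partition)]].append(byte)
    return [bytes(p) for p in parts]

partition = make_secret_partition(1460, n_parts=2, seed=0xC0FFEE)
subseqs = split_payload(b"GET /index.html HTTP/1.0\r\n", partition, 2)
# Every byte lands in exactly one secret subsequence:
assert sum(len(s) for s in subseqs) == 26
```

Each subsequence would then be modeled separately (randomized modeling) or tested separately against one model (randomized testing), as described next.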

Randomized Modeling

We first discuss the notion of randomized modeling. As illustrated in Figure 5.4, instead of modeling and testing the whole packet payload, we randomly partition packets into several (possibly interleaved) substrings or subsequences S1, S2, ..., SN, and model each of them separately. The incoming test packet's payload is divided accordingly for testing. We conjecture that these randomly chosen substring/subsequence locations would produce distinct normal models to thwart mimicry attacks; the attacker would not know precisely which byte positions to pad to appear normal. This provides a much higher level of diversity in the site-specific payload modeling. Since the partition is randomly chosen by each sensor instance and kept secret, we assume the attackers cannot gain this information before they compromise the machine. The attempted mimicry attack would be thwarted since the attacker has no means of knowing precisely how a remote payload anomaly sensor at its target location has chosen to partition the space of the data flow. The attackers could only succeed if they craft an attack vector ensuring that the data is normal with respect to each randomly selected portion of a packet; this makes the attacker's task much harder than before.

Figure 5.4: Randomized Modeling

For the above-described approach, one big assumption is that the partitions will produce models different from the one built using the whole payload. If not, there is no need to build multiple models, which would incur higher computation and storage costs. To test this

conjecture, we performed experiments using randomized modeling for PAYL, where we computed multiple models based on different secret partitions of the data stream. Figure 5.5 gives examples of the partitioned payload modeling. For length 258 and length 1460, the top subplot shows the payload distribution computed using the whole packet, and the lower two subplots show the distributions for each of the two random sub-partitions. In this example, we simply partitioned the packet into two parts.

Figure 5.5: Payload distribution examples for the randomized modeling. The top subplot is the byte distribution using the whole packet, and the bottom two subplots are for each of the two random sub-partitions.

Unfortunately, the result is not very encouraging. For both lengths, the models from the sub-partitions do not display a very different distribution from the one built using the full payload. The likely reason is that the modeling is packet-based instead of session-based, and the partitioning is done without considering the protocol specification, so there is no strict relationship between the packet location and the modeled content.² Furthermore, there are several shortcomings with randomized modeling. First, as each packet is partitioned into N parts, there will be N times as many models computed, which increases the overhead of the sensor, both for training and testing. Secondly, once training is done, the test packet has to be partitioned in the same way as in the training phase,

² The beginning parts of packets are more likely to contain words such as GET and POST than other locations, but there is no decisive relationship.

which reduces the randomness level of the sensor. And if we want to change the partitioning, we have to retrain the sensor each time for each selected partition. Considering all these problems, we devised a simpler and more flexible randomization approach, randomized testing, which is described next.

Randomized Testing

Alternatively, we can employ randomized testing, which has great flexibility and does not incur substantial overhead, as shown in Figure 5.6. Instead of testing and scoring the whole packet payload, we randomly partition packets into several (possibly interleaved) substrings or subsequences S1, S2, ..., SN, and test each of them separately against the same single normal model. As before, since the partition is randomly chosen by each sensor instance and kept secret, we assume the attackers cannot gain this information before they compromise the machine. The attackers could only succeed if they craft an attack vector ensuring that the data is normal with respect to any randomly selected portion of a packet. This technique can be applied generally to any content anomaly detector.

Figure 5.6: Randomized Testing

To demonstrate the effectiveness of this counter-evasion tactic, we first developed a

simple randomization framework for Anagram. We generate a random binary mask of some length (say, 100 bytes), and repeatedly apply this mask to the contents of each packet to generate test partitions. The mask corresponds to subsequences of contiguous bytes in the packet tested against the model; each is scored separately. The sequence of bytes appearing at locations where the mask bits are set to 1 is tested separately from the byte sequences where the mask has zero bits, randomly separating the payload into two non-overlapping parts. The packet anomaly score is adapted from Section 4.1 to maximize the score over all partitions, i.e., score = max_i(N_i^new / T_i), where N_i^new and T_i are the number of new and total n-grams in partition i, respectively. This can easily be extended to overlapping regions of the packet, and to more than just two randomly chosen portions. There are several issues to consider. First, we want to optimize the binary mask generation. While the mask could be a purely random binary string, we would then lose information about sequences of bytes. Since Anagram models n-grams, it's not surprising that this strategy performs poorly. Instead, we randomize the mask using a chunked strategy: any run of contiguous 0's or 1's in the mask must be at least 10 bits long (corresponding to 10 contiguous bytes in a partition), enabling us to preserve most of the n-gram information for testing. The results achieved with this strategy are far better than with a purely random binary mask. Usually the chunk length needs to be greater than n to ensure enough valid n-grams are tested, and a chunk size of 10 provided good results empirically in our test case. How to optimize the binary mask is an interesting open problem for future research. We believe it can be formally cast as an optimization problem, perhaps requiring an additional offline training phase before the sensor is fully trained.
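A minimal sketch of this chunked-mask randomized testing follows, with a Python set of previously seen n-grams standing in for Anagram's Bloom filter model; the helper names, seeds, and sample strings are illustrative only:

```python
# Sketch of randomized testing with a chunked binary mask: runs of
# identical bits at least min_run long, applied repeatedly over the
# payload; score = max_i(N_i_new / T_i) over the two partitions.

import random

def chunked_mask(length, min_run=10, seed=42):
    """Random bit mask whose runs of identical bits are >= min_run long."""
    rng = random.Random(seed)            # per-sensor secret seed
    mask, bit = [], rng.randrange(2)
    while len(mask) < length:
        mask.extend([bit] * (min_run + rng.randrange(min_run)))
        bit ^= 1
    return mask[:length]

def ngrams(data, n=5):
    return [data[i:i + n] for i in range(len(data) - n + 1)]

def randomized_score(payload, model, mask, n=5):
    """Maximum fraction of never-seen n-grams over the mask partitions."""
    parts = [bytearray(), bytearray()]
    for i, byte in enumerate(payload):
        parts[mask[i % len(mask)]].append(byte)   # mask applied repeatedly
    scores = [0.0]
    for part in parts:
        grams = ngrams(bytes(part), n)
        if grams:
            new = sum(1 for g in grams if g not in model)
            scores.append(new / len(grams))
    return max(scores)

normal = b"GET /index.html HTTP/1.0\r\nHost: www\r\n"
model = set(ngrams(normal * 4))          # "trained" normal model
mask = chunked_mask(100)

normal_score = randomized_score(normal, model, mask)
attack_score = randomized_score(bytes(range(256)), model, mask)
assert attack_score > normal_score
```

Because the attacker does not know the mask, padding that makes the whole packet appear normal can still leave one secret partition with a high fraction of unseen n-grams.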
Another observation is that the lengths of the randomly chosen partitions are best kept balanced: the two parts should have roughly the same size, i.e., equal numbers of 1's and 0's, to avoid extremely short snippets that do not carry enough statistical information. The false positive rate is usually much higher when the partitions have extremely unbalanced lengths; for example, a partition where one fragment is 10% of the total length and the other is 90% produces a poor false positive rate.

For the results in Figure 5.7, we use this chunk-based binary mask strategy and guarantee that one partition of the packet datagram is at most double the size of the other. Again, we measure the false positive rate when we achieve a 100% detection rate on our test traces. For each n-gram size, we repeated the experiment 10 times and show the average (the middle line) and standard deviation (the vertical bars), both for unsupervised learning (left plot) and semi-supervised learning (right plot) using the malicious n-gram content model. Semi-supervised learning is described in Section 6.2. The experiment was performed on dataset www1-06, trained on 90 hours of traffic and tested on the following 72 hours.

Figure 5.7: The average false positive rate and standard deviation with 100% detection rate for randomized testing of Anagram, with normal training (left) and semi-supervised training (right).

On average, the randomized testing strategy produces false positive rates comparable to the non-randomized approach, especially when using semi-supervised learning. The lower-order n-grams are more sensitive to partitioning, so they exhibit a high standard deviation, while higher-order n-gram models are relatively stable. Next we apply this framework to the PAYL sensor in a similar way. The only difference is that because PAYL models 1-grams and does not need contiguity information between

bytes, a pure binary mask can also be applied here, though the balancing requirement still applies. Because the mimicry attack we studied earlier was designed specifically to target PAYL, we first examine how well PAYL can detect these attacks with the randomized testing technique. The mimicry worm³ has 10 packets in total, each 418 bytes long. We used the www1-06 dataset as background traffic. First we prepared the mimicry attack and trained PAYL using 24 hours of data, then tested on the next 24 hours of data. For the pure binary mask and the chunked binary mask, we ran 20 trials each, and summarize the testing results in Table 5.2. Out of the 20 runs, the pure random mask approach detects the worm 16 times, and the chunked random mask approach detects it 14 times. Here a successful detection is defined as any one of the 10 worm packets being detected by PAYL. The average false positive rate and its standard deviation are calculated over those runs with successful detection.

                      Detection Times   Avg. FP   Std. FP
Pure random mask      16/20             ...       0.375%
Chunked random mask   14/20             ...       0.409%

Table 5.2: The detection performance of PAYL with randomized testing on the mimicry attack designed to target it.

This experiment demonstrates the value of randomized testing. Even though the mimicry attack is designed to display an entirely normal byte distribution with respect to PAYL's models, randomized PAYL testing can still capture most of the attack instances. This is already a big improvement, but problems remain. First, detection is not 100% guaranteed. Both approaches produce acceptable false positive rates, but their standard deviations are relatively high. Further research is needed to improve the detection rate while still employing randomness within the sensor. A possible choice is to use more than one random mask

³ Many thanks to Wenke for providing us the mimicry worm to study.

and then combining the test results over all subpartitions. Another problem is to determine a good partitioning approach that reduces the deviation while keeping the essential randomization strategy intact. Although problems remain, we believe randomization is the correct direction, and a valuable step toward complicating the task of the mimicry attacker.

5.2 Threshold reduction and extreme padding

In the experiments reported above, we noticed that the randomized models' false positive rates, while comparable on average, exhibit a higher variance than the case where models are not randomized. Consider an extreme mimicry attack: suppose an attacker crafts packets with only one instruction per packet, and pads the rest with normal data. If a 100% detection rate is desired, lowering the score threshold to some minimum nonzero value might be the only way to achieve it. Unsurprisingly, this approach corresponds to a direct increase in the false positive rate, which may vary from 10% to 25% of all packets, depending on the n-gram sizes chosen and the amount of training of the model. Such a false positive rate may be viewed as impractical, rendering the sensor useless. We believe this view is wrong. The false positive rate is not always the right metric, especially if the anomaly detector is used in conjunction with other sensors. For example, Anagram may be used to shunt traffic to a host-based, heavily instrumented shadow server used as a detector; in other words, we do not generate Anagram alarms to drop traffic or otherwise mitigate against (possibly false) attacks, but rather we may validate whether traffic is indeed an attack by making use of a host-based sensor system.
If we can shunt, say, 25% of the traffic for inspection by an (expensive) host-based system, while leaving the remaining normal traffic unabated to be processed by the operational servers, the proposed host-based shadow servers can amortize their overhead costs far more economically. Thus, the false positive rate is not as important as the true positive rate, and false positives cause no harm. We describe this approach in greater detail in the next chapter.
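The shunting decision itself is simple; here is a toy sketch, where the scorer and threshold are illustrative placeholders rather than values from our experiments:

```python
# Toy sketch of the shunting classifier: anomalous-looking requests go to
# the instrumented shadow server, the rest to the fast production pool.
# The scorer and threshold here are hypothetical placeholders.

def route(payload, score_fn, threshold=0.05):
    """Pick which pool serves this request based on its anomaly score."""
    return "shadow" if score_fn(payload) > threshold else "production"

# A false positive merely adds latency: the shadow server still serves
# the request, so no traffic is dropped.
toy_score = lambda p: 1.0 if b"\x90\x90" in p else 0.0
assert route(b"GET /index.html HTTP/1.0\r\n", toy_score) == "production"
assert route(b"\x90\x90\x90\x90 exploit bytes", toy_score) == "shadow"
```

With this design the threshold can be set aggressively low, since misrouted normal requests cost only response time, not availability.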

Summary

In this chapter, we discussed the problem of mimicry attacks against payload-based anomaly detectors, and proposed the randomized modeling/testing approaches that we believe can help thwart mimicry attacks. We first analyzed the sample mimicry attack created by [28], which targets and evades detection by PAYL. We then adapted the same blended polymorph engine against Anagram, and demonstrated that Anagram is resilient to such simple mimicry attacks that blend exploits with normal-appearing byte padding. We then proposed the general idea of introducing randomization into anomaly detectors: each sensor may randomly partition the payload into several subpartitions to model and test, so that attackers have no means of knowing how to pad the exploit code to appear normal within each local part. We compared the ideas of randomized modeling and randomized testing, and demonstrated the effectiveness of the randomized testing approach.

Chapter 6

Learning Strategies

Training is relatively easy for the anomaly detectors when we have offline, pre-collected datasets. But if we plan to deploy such sensors on live traffic to compute a model in real time, many challenging problems arise. When is the model well trained and ready to use? How well can the sensors handle noise within the training data? And when do models need to be retrained because the environment has shifted?

6.1 Epoch-based Learning

Whether a sensor is trained offline or online, there is always the problem of determining how much training data is enough, or whether the model is ready for use. If we can analytically guarantee that the training data is more than enough, the problem is less serious. But in online environments, we cannot execute an unbounded learning phase before deploying a detector. We therefore propose an epoch-based training approach, with several measurements computed against models to judge their stability and indicate when training can be completed. Here an epoch can be measured in terms of the number of packets analyzed, or by means of a time period. The basic assumption is that if the currently computed model has changed little over several consecutive epochs, we can conclude that the training phase is sufficiently complete. Thus, we need to define metrics to compare the

similarity between models computed in two successive epochs. The pseudo-code of epoch-based training is displayed in Figure 6.1.

    stable = false;
    old_model = new_model = null;
    while (!stable) {
        new_model = Model.Train(old_model, N_packets);
        stable = compare_similar(old_model, new_model);
        old_model = new_model;
    }

Figure 6.1: The pseudo-code of epoch-based training.

For PAYL, the models are length-conditioned 1-byte frequency distributions and standard deviations, and multiple such models may be computed for each packet length. The stability of two such PAYL models is estimated by two metrics: the first is the number of new centroids produced in the current epoch, and the second is the Manhattan distance of each centroid to the corresponding model computed in the prior epoch. In the multi-centroid case, where there can be multiple centroids for one length and their number can change dynamically in each epoch, the Manhattan distance is computed between the most similar models in each pair of adjacent epochs. The decision to stop the training phase can be adjusted by thresholding these two metrics. (This is one area of future research: casting the problem as an optimization problem that selects the threshold parameters by optimizing the accuracy of the learned models.) As an example, we display in Figure 6.2 how the two metrics evolve over successive training epochs. The left plot shows the total number of centroids, and the right one shows the sum of the Manhattan distances of all centroids to the models computed in the previous epoch. This result is from a PAYL run on the live traffic going to the www1 web server of our CS department, using 2000 packets as an epoch length and single-centroid modeling. The length of an epoch needs to be chosen according to the environment; a busier server usually needs a longer epoch, and some offline sampling may help find a good bound on the epoch length. From the plot, we can clearly see how the numbers converge

and the models become stable after a series of training epochs. We stop training when we see no new centroids and each of the Manhattan distance calculations falls below a small fixed threshold. In practical use, there is no need to compute the Manhattan distance before the number of centroids becomes stable.

Figure 6.2: Evolution of the stability metrics of epoch-based training for PAYL. The left plot gives the number of centroids over epochs, and the right one gives the sum of Manhattan distances between corresponding centroids after each new epoch.

For Anagram, estimating the stability of models is simpler. There is a vast number of distinct n-grams and, as n increases, many longer grams may take days or weeks to observe in any content flow. The rate of seeing new grams is therefore a key statistic for estimating the stability of an Anagram content model. Here we devised a simple strategy: we check the likelihood of seeing a new n-gram with each additional epoch of training. Figure 6.3 shows the percentage of new distinct n-grams out of every 10,000 content packets when we train on up to 500 hours of traffic data. The key observation is that during the initial training period, one should expect to see many distinct n-grams. Over time, however, fewer never-before-seen n-grams should be observed. Hence, for a given value of n, a particular site should exhibit some rate of observing new distinct n-grams within its normal content flow. By estimating this rate, we can estimate how well the Anagram model has been trained. When the likelihood of seeing new n-grams in normal traffic is low and stable, the model is deemed stable; we

can then use the rate estimate to help detect attacks, as they should contain a higher percentage of unseen n-grams. Figure 6.4 plots the false positive rates of different models, varying in n-gram size and length of training time, when tested on the 72 hours of traffic immediately following the initial training epoch.

Figure 6.3: The likelihood of seeing new n-grams as training time increases.

From these plots, we can see that as the training time increases, the false positive rate generally goes down as the model becomes more complete. After some point, e.g., 4 days in Figure 6.4, there is no further significant gain, and the FP rate is sufficiently low. Higher-order n-grams need a longer training time to build a good model, so 7-grams display worse accuracy than 5-grams given the same amount of training. While the 3-gram model is likely more complete with the same amount of training, it scores significantly worse: 3-gram false positives do not stem from inadequate training, but rather from the fact that 3-grams are not long enough to distinguish malicious byte sequences from normal ones with great accuracy. Under the principle of mutual information, longer gram sizes serve better as a means of distinguishing between the two classes of payload. In theory, Anagram should always improve with further training if we can guarantee a clean training dataset, which is crucial for the binary-based approach. However, obtaining clean training data is not an easy task in practice. During our experiments, increased training eventually crosses a threshold where the false positive rate starts increasing, even if the training traffic has been filtered of all known attacks. The binary-based approach has significant advantages in speed and memory, but it's not very tolerant of noise in the training data, and manual training-data cleanup is infeasible for large amounts of data. We therefore introduce semi-supervised training in the next section to help Anagram be more robust against noisy data.

Figure 6.4: False positive rate (with 100% detection rate) as training time increases.

6.2 Semi-supervised Learning

Another big problem facing unsupervised online training systems such as those explored in this thesis is the level of noise in the training data. We can run the traffic through Snort to filter out old, known attacks, but we still run the risk of including new exploits as part of the normal traffic model. Since PAYL models the average byte frequency of normal content, and includes its standard deviation, malicious traffic seen during training won't skew the model too much

as long as it constitutes a small minority of the traffic. But for Anagram, this is a really serious problem. The binary-based approach is simple and memory-efficient, but too sensitive to noisy training data. If any attacks are hidden in the training data, the n-grams of the attack will be recorded in the model and the detector will easily miss similar attacks during testing. However, it is hard to generate a perfectly clean dataset, as running traffic through any existing IDS can only filter out old, known attacks. The most reliable way is to use feedback from the service to decide whether a request is valid, which is a fairly expensive operation and will be discussed later. Here we first adopt a simple form of supervised training. We utilize the signature content of Snort rules obtained from [21] and a collection of about 500 virus samples to precompute a known bad content model. We build a bad content Bloom filter containing the n-grams that appear in the two collections, using all possible n that we may eventually train on (e.g., n = 2...9 in our experiments). This model can be incrementally updated when new signatures are released. The bad n-gram Bloom filter model may be viewed as a means of generalizing the set of string-based signatures commonly used in virus scanners and IDS systems. It's important to note, however, that signatures and viruses often contain some normal n-grams, e.g., the GET keyword for HTTP exploits. To remove these normal n-grams, we can maintain a small, known-clean dataset used to exclude normal traffic when generating the bad content BF. This helps to exclude, at a minimum, the most common normal n-grams from the bad content model. This approach is an approximation of the mutual information metrics used in information retrieval and other supervised training strategies. In one experiment, we used 24 hours of clean traffic to filter out normal n-grams from the bad content BF.
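The construction just described can be sketched as follows, with a Python set standing in for the Bloom filter and tiny illustrative samples in place of the real Snort rule set and virus collection:

```python
# Sketch of building the bad content model: n-grams (n = 2..9) drawn from
# known attack signatures, minus n-grams also seen in known-clean traffic.
# The signature and traffic strings are illustrative stand-ins.

def ngrams(data, sizes=range(2, 10)):
    return {data[i:i + n] for n in sizes for i in range(len(data) - n + 1)}

signatures = [b"GET /scripts/..%255c../winnt/system32/cmd.exe"]
clean_traffic = [b"GET /index.html HTTP/1.0\r\n"]

bad_bf = set()
for sig in signatures:
    bad_bf |= ngrams(sig)
for pkt in clean_traffic:
    bad_bf -= ngrams(pkt)               # drop common normal n-grams ("GET ")

def bad_score(payload):
    """Fraction of a packet's distinct n-grams that hit the bad model."""
    grams = ngrams(payload)
    return len(grams & bad_bf) / len(grams) if grams else 0.0

assert bad_score(clean_traffic[0]) == 0.0
assert bad_score(signatures[0]) > 0.5
```

A training packet whose score exceeds a small threshold can then be excluded from the normal model, keeping recycled exploit code out of the learned n-gram set.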
Figure 6.5 shows the distribution of normal packets and attack packets matched against the bad content model. The X axis represents the matching score, the percentage of a packet's n-grams that match the bad content model, while the Y axis shows the percentage of packets whose match score falls within that score range. The difference between normal and attack packets is obvious; whereas normal traffic barely

matches any of the bad content model, attack packets have a much higher match percentage, so we can reliably apply the model for accurate detection. The resulting bad content BF contains approximately 46,000 Snort n-grams and 30 million virus n-grams (for n = 2...9).

Figure 6.5: Distribution of bad content scores for normal packets (left) and attack packets (right).

The bad content model is used during both training and detection. During training, the incoming data stream is first filtered through Snort to ensure it is free of known, old attacks. Packets are then compared against the bad content model; any n-gram that matches the bad content model is dropped. The whole packet is also dropped if it matches too many n-grams from the bad content model, as new attacks often reuse old exploit code; this avoids modeling new malicious n-grams. In our experiment, we established a 5% bad n-gram threshold before ignoring a training packet. While this is rather strict, ignoring a packet during training is harmless as long as relatively few packets are dropped, as Figure 6.5 shows. During detection, if a never-before-seen n-gram also appears in the bad content model, its detection score is further weighted by a factor t relative to other malicious n-grams; in our experiment, we set t to 5. This enables us to further separate malicious packets from normal ones and achieve higher detection accuracy. To show the improvement gained by using the bad content model, Figure 6.6 compares the false positive rate before and after using it, for different n-gram sizes on two datasets. The false positive rates are significantly

reduced with the help of this bad content model.

Figure 6.6: The false positive rate (with 100% detection rate) for different n-grams, under both normal and semi-supervised training.

6.3 Adaptive Learning

Training Attacks versus Mimicry Attacks

We distinguish between training attacks and mimicry attacks. A mimicry attack is the willful attempt to craft and shape an attack vector to look normal with respect to a model computed by an anomaly detector. The attacker would need to know the modeling algorithm used by the sensor and the normal data it was trained on. The polymorphic blended attack engine discussed in Section 5.1 is assumed to know both by sniffing an environment's normal data (an open question is whether the normal data of one site produces models sufficiently similar to those of other sites targeted by a mimicry attack). A training attack, by contrast, is one whereby the attacker sends a stream of data incrementally or continuously distant from the normal data at a target site in order to influence the anomaly detector to

model data consistent with the attack vector. Attack packets would then appear normal since they were modeled. This type of attack would succeed if the attacker were lucky enough to send the stream of data while the anomaly detector was in training mode. Furthermore, the attacker would presume that the stream of malicious training data would go unnoticed by the sensor while it was training. We presented the concept of randomization in order to thwart mimicry attacks. Even if the attacker knew the normal data distribution, the attacker would not know the actual model used by the sensor. However, we have not addressed training attacks. [3] explores this problem and suggests several theoretical defenses. For example, Anagram's semi-supervised learning mechanism can help protect the model if learning attacks recycle old exploit code. However, if the learning attack does not contain any known bad n-grams, Anagram cannot detect it by itself. We conjecture that coupling the training sensor with an oracle that informs the sensor whether the data it is modeling is truly normal can thwart such training attacks. For example, if the attacker sends packets that do not exploit a server vulnerability but produce an error response, this should be noticed by the sensor in training mode; we discuss this further in the next section. Such feedback-based learning does not address all cases, e.g., a learning attack embedded in an HTTP POST payload, which would generate a legitimate server response. Randomization may also be valuable against learning attacks; we leave the exploration of such defense mechanisms for future research.

Feedback-based learning and filtering using instrumented shadow servers

Host-based fault detection and patch generation techniques (such as StackGuard/MemGuard [9], STEM [64], DYBOC [63], and many others) hold significant promise in improving worm and attack detection, but at the cost of significant computational overhead on the host.
The performance hit on a server could render such technologies of limited operational value. For instance, STEM [64] imposes a 200% overhead for an entirely instrumented

application. DYBOC [63] is more proactive and is designed to be deployed on production servers to provide faster response, but it still imposes at least a 20% overhead on practical web servers. If one can find a means of reducing this overhead, the technology will have far greater value to a larger market. We envision an architecture consisting of both production servers and an instrumented shadow server, with the latter executing both valid and malicious requests securely but with significant overhead. A network anomaly flow classifier is placed in front of these pools and shunts traffic based on the anomalous content of incoming requests, as shown in Figure 6.7.

Figure 6.7: Shadow server architecture

In order for the flow classifier to be appropriate for this architecture, we need to ensure that no malicious requests are sent to the production pool, as those machines may be vulnerable to zero-day attacks. It is acceptable for a small fraction of the traffic, flagged as false positives, to be shunted to the shadow server, because this server will still serve the requests, only with greater latency. Nothing is lost except some response time for a small portion of the traffic flow. In other words, an anomaly detector acting as this classifier should have a 100% true positive detection rate and a reasonably low false positive rate. We can characterize the latency in such an architecture
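The routing decision made by such a flow classifier can be sketched with a toy n-gram scorer standing in for an Anagram-style payload model. This is a minimal illustration, not the thesis implementation: the names (NGramModel, route_request, THRESHOLD) and the scoring rule (fraction of never-before-seen byte n-grams) are assumptions chosen for clarity.

```python
# Toy sketch: shunt requests whose payloads look anomalous to the
# instrumented shadow server; send the rest to the production pool.

def ngrams(payload: bytes, n: int = 5):
    """Yield the byte n-grams of a packet payload."""
    for i in range(len(payload) - n + 1):
        yield payload[i:i + n]

class NGramModel:
    """Records the set of n-grams observed during (clean) training."""
    def __init__(self, n: int = 5):
        self.n = n
        self.seen = set()

    def train(self, payload: bytes):
        self.seen.update(ngrams(payload, self.n))

    def score(self, payload: bytes) -> float:
        """Fraction of never-seen n-grams; higher means more anomalous."""
        grams = list(ngrams(payload, self.n))
        if not grams:
            return 0.0
        unseen = sum(1 for g in grams if g not in self.seen)
        return unseen / len(grams)

# Threshold tuned so that nothing anomalous reaches the production pool;
# false positives only cost the extra latency of the shadow server.
THRESHOLD = 0.3

def route_request(payload: bytes, model: NGramModel) -> str:
    return "shadow" if model.score(payload) > THRESHOLD else "production"

model = NGramModel()
model.train(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n")

# A request resembling the training traffic stays on the fast path.
print(route_request(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n", model))  # -> production
# A payload full of unseen byte sequences (e.g., shellcode-like bytes)
# is shunted to the instrumented shadow server.
print(route_request(b"\x90\x90\x90\x90\x31\xc0\x50\x68//sh", model))  # -> shadow
```

The asymmetry in the text above is reflected in the threshold choice: misrouting normal traffic to the shadow pool only adds latency, whereas misrouting an attack to the production pool is unacceptable, so the threshold errs on the side of shunting.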


More information

SECURITY TERMS: Advisory Backdoor - Blended Threat Blind Worm Bootstrapped Worm Bot Coordinated Scanning

SECURITY TERMS: Advisory Backdoor - Blended Threat Blind Worm Bootstrapped Worm Bot Coordinated Scanning SECURITY TERMS: Advisory - A formal notice to the public on the nature of security vulnerability. When security researchers discover vulnerabilities in software, they usually notify the affected vendor

More information

Intrusion Detection and Prevention System (IDPS) Technology- Network Behavior Analysis System (NBAS)

Intrusion Detection and Prevention System (IDPS) Technology- Network Behavior Analysis System (NBAS) ISCA Journal of Engineering Sciences ISCA J. Engineering Sci. Intrusion Detection and Prevention System (IDPS) Technology- Network Behavior Analysis System (NBAS) Abstract Tiwari Nitin, Solanki Rajdeep

More information

LASTLINE WHITEPAPER. Why Anti-Virus Solutions Based on Static Signatures Are Easy to Evade

LASTLINE WHITEPAPER. Why Anti-Virus Solutions Based on Static Signatures Are Easy to Evade LASTLINE WHITEPAPER Why Anti-Virus Solutions Based on Static Signatures Are Easy to Evade Abstract Malicious code is an increasingly important problem that threatens the security of computer systems. The

More information

Multi-phase IRC Botnet and Botnet Behavior Detection Model

Multi-phase IRC Botnet and Botnet Behavior Detection Model Multi-phase IRC otnet and otnet ehavior Detection Model Aymen Hasan Rashid Al Awadi Information Technology Research Development Center, University of Kufa, Najaf, Iraq School of Computer Sciences Universiti

More information

Fighting Advanced Threats

Fighting Advanced Threats Fighting Advanced Threats With FortiOS 5 Introduction In recent years, cybercriminals have repeatedly demonstrated the ability to circumvent network security and cause significant damages to enterprises.

More information

Defending Against Cyber Attacks with SessionLevel Network Security

Defending Against Cyber Attacks with SessionLevel Network Security Defending Against Cyber Attacks with SessionLevel Network Security May 2010 PAGE 1 PAGE 1 Executive Summary Threat actors are determinedly focused on the theft / exfiltration of protected or sensitive

More information

Firewalls, Tunnels, and Network Intrusion Detection. Firewalls

Firewalls, Tunnels, and Network Intrusion Detection. Firewalls Firewalls, Tunnels, and Network Intrusion Detection 1 Firewalls A firewall is an integrated collection of security measures designed to prevent unauthorized electronic access to a networked computer system.

More information

Firewalls & Intrusion Detection

Firewalls & Intrusion Detection Firewalls & Intrusion Detection CS 594 Special Topics/Kent Law School: Computer and Network Privacy and Security: Ethical, Legal, and Technical Consideration 2007, 2008 Robert H. Sloan Security Intrusion

More information