IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 5, SEPTEMBER 2005

Statistical Analysis of Network Traffic for Adaptive Faults Detection

Hassan Hajji, Member, IEEE

Abstract: This paper addresses the problem of normal-operation baselining for automatic detection of network anomalies. A model of network traffic is presented in which the studied variables are viewed as sampled from a finite mixture model. Based on a stochastic approximation of the maximum-likelihood function, we propose baselining network normal operation using the asymptotic distribution of the difference between successive estimates of the model parameters. The baseline random variable is shown to be stationary, with mean zero under normal operation. Anomalous events are shown to induce an abrupt jump in this mean. Detection is formulated as an online change-point problem, where the task is to process the realizations of the baseline random variable sequentially, and to raise alarms as soon as anomalies occur. An analytical expression for the false alarm rate allows us to choose the design threshold automatically. Extensive experimental results on a real network show that our monitoring agent is able to detect unusual changes in the characteristics of network traffic and to adapt to diurnal traffic patterns, while maintaining a low alarm rate. Despite large fluctuations in network traffic, this work shows that tailoring traffic modeling to specific goals can be achieved efficiently.

Index Terms: Normal operation baselining, online anomaly detection, online change detection, remote monitoring (RMON) management information base (MIB), traffic model.

I. INTRODUCTION

Networks and distributed processing systems have become an important substrate of modern information technology. The rapid growth of these systems throughout the workplace has opened a gap between the systems and the expertise of the human operators who manage them. There is a need to automate the management functions in order to reduce network operations and management costs.
Detection of network problems is a crucial step in automating network management. It has a direct impact on the accuracy of the fault, performance, and security management functions. Well-designed anomaly detection algorithms enhance the network control capability by providing timely indication of incipient network problems. Early warnings from monitoring agents can trigger preventive actions, and serious and expensive outages can be avoided. A large amount of work has gone into developing mechanisms and protocols for collecting traffic statistics. Indeed, most current work in the simple network management architecture focuses on defining detailed system and network traffic objects. Comparatively little work has been done to support the analysis of the collected statistics. Most of the interpretation is left to the common sense of network operators. Unless control mechanisms are driven by objective measures based on well-tested network traffic models, the benefits and results of network traffic analysis will remain biased by the common sense of human operators. On the other hand, existing work on fault and performance management has assumed that the alarm-generating mechanism is accurate and that network problems are given a priori. Current practice in network management relies on user-defined thresholds for detection. Alarms are generated when some variable of interest crosses a predefined threshold. Generally, the value of the threshold is no more than an estimate of the normal range within which the measured feature is believed to operate. Not only is there little objective insight into how to choose these thresholds, but there is also a risk of missing subtle changes in the network state.

Manuscript received October 31, 2003; revised April 29. The author is with IBM Business Consulting, Marunochi, Chiyoda-ku, Tokyo, Japan.
In addition, the complexity and size of current network systems make them vulnerable to novel, previously unseen problems. A natural approach to anomaly detection in network systems is to generate residuals, obtained from the discrepancy between the observed and the predicted behavior. A change in the properties of these signals is a symptom of abnormal behavior of the system. The main difficulty in network anomaly detection is the lack of a generally accepted definition of what constitutes normal behavior. The dynamics of normal operation need to be identified from routine operation data. Earlier work characterizes normal behavior by different templates, obtained by taking the standard deviations of observations (typically Ethernet load and packet counts) at different operating times. An observation is declared abnormal if it exceeds the upper bound of the envelope. Given the nonstationary nature of network traffic, the standard deviation estimates are likely to be distorted, making subtle changes in the network state go undetected. To mitigate the effect of this nonstationarity, later work considered the model formed by segmenting time series obtained from management information base (MIB) objects; observations are declared abnormal if they do not fit an autoregressive model of the traffic inside segments. In another approach, observations are declared abnormal after a statistical test against the mean of a 24-h period sample. In these approaches, alarms are raised too frequently, and ad hoc correlation schemes are introduced to reduce the number of generated alarms. In this paper, we address the problem of change in the characteristics of network traffic, and its relation to anomalies in local area networks. Here, anomaly means a change from a normal baseline. It should be understood as encompassing fault, performance, and security problems, insofar as they induce changes in the characteristics of network traffic.
The emphasis
is on fast detection, an important requirement for reducing the potential impact of problems on users of network services. Knowledge of the anomalies to be detected is not required. We parameterize network traffic variables using finite Gaussian mixtures. Based on this parametric model, we propose a baseline of network normal operation as the asymptotic distribution of the difference between successive estimates of the model parameters. This difference is shown to be approximately multivariate Normal, with mean zero under normal operation, and sudden jumps in this mean are characteristic of abnormal conditions. The detection problem is formulated as a change-point problem. A real-time online change detection algorithm is designed to process the realizations of the baseline random variable sequentially, and to raise an alarm as soon as an anomaly occurs. We motivate this formulation through a real scenario. The proposed approach requires neither the set of anomalies to be detected, nor the thresholds, to be supplied by the user. Experimental results on a real network show the effectiveness of the method: a low alarm rate and high detection accuracy have been demonstrated. Our earlier work reported some of the experimental results obtained; it provided some motivation for the model used, but limited explanation of the experimental results in relation to the objective of fast detection. In this paper, we build on that earlier work, and motivate our approach of using a mixture model for network traffic through a real-world example of broadcast packets. We explain the experimental results in more detail in relation to our objectives. We rigorously establish the formulas for residual generation, and introduce a mathematical simplification that makes calculation of the detection threshold easier.
In addition, we report new results on the behavior of alarms and on the CPU overhead consumed by the monitoring. This paper is arranged as follows. Section II introduces our proposed parametric model of network traffic, and shows how the network normal operation baseline is built. Section III shows the characteristics of the baseline model under abnormal conditions, and introduces the formulation of network anomaly detection. In Section IV, we present the results of our experiments on a real network. We conclude in Section V.

II. NORMAL OPERATIONS BASELINING

The goal of this section is to characterize network normal behavior. We first present a parametric model of traffic variables, and then show how this model can be used to build a baseline of normal operations. The paper assumes that we have collected a sample large enough to account for all possible normal behaviors of the network. No part of the training set is due to anomalies; it is assumed that the sample used for training contains only normal operation data.

A. Parametric Model of Traffic Variables

Our approach to network model parameterization is to view each variable as switching between different regimes, where each regime is a Gaussian distribution. This is a form of what is referred to in the literature as a finite mixture model. The observations are generated by one of the Gaussian distributions, as shown in the following equations:

$$x_t = m_{y_t} + \varepsilon_t \qquad (1)$$

$$p(x_t) = \sum_{j=1}^{K} \pi_j\, \mathcal{N}\big(x_t;\, m_j,\, \sigma_j^2\big) \qquad (2)$$

The errors $\varepsilon_t$ are assumed to be Gaussian, with mean 0 and variance $\sigma_{y_t}^2$. The constant $m_{y_t}$ is the mean of the distribution in the regime indicated by the latent variable $y_t$. The integer $K$ denotes the number of regimes. Each regime $j$ has a mixing probability, denoted by $\pi_j$. The parametric finite mixture model has an attractive interpretation: it is useful in the analysis of data that are believed to come from a finite number of distinct subpopulations, indicated by a latent variable.

Fig. 1. Traffic fluctuations of broadcast packets. Time is relative to the start of data collection and not time of the day.
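To make the regime-switching view concrete, the sketch below (our own illustration, not the paper's implementation) generates synthetic per-interval packet counts from two regimes, takes logarithms, and fits a two-component Gaussian mixture with plain batch EM. All names, the data, and the one-dimensional simplification are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-interval packet counts from two regimes (e.g. night vs.
# day). Traffic counts are often roughly lognormal, so we model log(x)
# with a two-component Gaussian mixture.
counts = np.concatenate([
    rng.lognormal(mean=2.0, sigma=0.3, size=500),  # quiet regime
    rng.lognormal(mean=4.0, sigma=0.3, size=500),  # busy regime
])
x = np.log(counts)

def em_gmm_1d(x, k=2, iters=100):
    """Plain batch EM for a 1-D Gaussian mixture (illustration only)."""
    n = len(x)
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))  # spread-out initial means
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior responsibility of each regime for each point.
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
               / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means, variances, mixing weights.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / n
    return mu, var, pi

mu, var, pi = em_gmm_1d(x)
print(np.sort(mu))  # regime means recovered near 2.0 and 4.0
```

The log transform is what lets a Gaussian mixture serve for lognormally distributed traffic variables, as discussed next.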
Given the known fact that network traffic changes as a function of the time of the day, the latent variable in this case has the natural interpretation of time of the day. For instance, Fig. 1 shows how the number of broadcast packets that arrived within a 10-s sampling interval changes as a function of time. It can be seen clearly that broadcast packets decrease during night hours, picking up again during working days. It is important to note, however, that the dependency between time and traffic does not always hold. It is possible for traffic to decrease to a level similar to late-night hours, even during the busiest day of the week. If the dependency between time of day and traffic is explicitly modeled, alarms are raised mainly because of structural breaks in this dependency model. With a finite mixture model, however, each observation is assigned to the closest regime, and not necessarily to the regime a given time of the day may imply. Given the obvious observation that if a random variable is lognormal then its logarithm follows a Normal distribution, the random variable whose realizations are the logarithms of the original data can be modeled by the finite mixture model of (2). This model then provides a good parametric characterization for many important traffic variables, such as call holding times in telephony, Telnet originator and responder bytes, data transmitted in a given FTP connection, FTP session bytes, and the distribution of message length in public access mobile radio (PAMR), since these variables have been shown to follow a lognormal distribution, or a mixture of lognormal distributions. Note that, in general, finite Gaussian mixture models are general enough to approximate any continuous function with a finite number of
discontinuities, providing a general first approximation to other network traffic variables. More importantly, we note that the ultimate goal of modeling in this work is anomaly detection. This has a subtle but important difference from current work on traffic analysis. Most of the work reported recently on self-similar traffic, and its impact on performance analysis, is done with the goal of using the estimated parameters of hypothesized models in traffic engineering applications. For example, buffer sizing applications are interested in estimating the probability of the queue size exceeding a certain value. This quantity does depend on the true parameters of the model used (the Hurst parameter, for example). Under these conditions, an appropriate traffic model and an accurate estimation procedure are needed (a requirement hardly met by current models of self-similar traffic, however). In our case, the only requirement on the model is the capability to discriminate between the normal and faulty states. The key idea we stress throughout this work is to rely minimally on the model assumptions, and to tackle anomaly detection directly. As we shall see in the remaining part of this paper, it is effectively possible to detect network problems without even using the estimated parameters of the model. It should be noted, however, that this model may well not be useful for other traffic engineering applications.

B. Normal Operation Baselining

Our approach to operation baselining starts by recognizing that any parametric model of network traffic is, at best, an approximation to reality. Approximation errors and model misspecification become very pronounced in any inference that uses the estimated parameters as if they were the true ones.
Given that our ultimate goal is anomaly detection, formulated as change detection in the baseline model parameters, we claim that this task can be achieved without using these true parameters. The goal is to avoid solving the more general problem of accurate parameter estimation as an intermediate step to change detection. Our approach to realizing this idea is shown pictorially in Fig. 2. Observations are passed through the learning algorithm to produce the parameter estimate $m_t$. As a new data point is sequentially added, the learning algorithm outputs a refined new estimate $m_{t+1}$. The idea is to characterize network normal operations using the difference $m_{t+1} - m_t$. There are two major advantages of a baseline built this way. First, the difference does not depend on the true value of the parameter. This is very important since, in practice, we do not know this true value, and the only available information is an estimate $\hat{m}$ computed from the data. Approximating the true parameter with $\hat{m}$, and studying the difference $m_t - \hat{m}$, is possible, but our experiments showed that this approach is inefficient, as shown later. Second, the learning algorithm can be designed to adaptively track local changes in the model parameters. It is unrealistic to assume that the model parameters will remain exactly the same over all the operating times of the network. Intuitively, under normal conditions, the difference between the successive values $m_t$ and $m_{t+1}$ is expected to fluctuate around zero. This difference should not drift constantly in a fixed direction. On the other hand, if this difference drifts systematically over a long duration, then the new observations are generated by a different model, induced by a pattern not present in the training data. The learning algorithm will alter the parameter to its new value. The idea, then, is to characterize normal behavior based on the random variable $m_{t+1} - m_t$.

Fig. 2. Normal operation baselining based on repeated estimation of model parameters.
The mean value of this difference is a good indicator of the health of the network. To realize this principle, two issues need to be addressed. First, there is the issue of how to design an adaptive learning algorithm: since the detection is required to be online, the learning algorithm of Fig. 2 should be as fast as possible. Second, we have to work out the distribution of the difference under normal conditions. The remainder of this section addresses these two issues.

C. Learning Algorithm and Baselining Normal Behavior

A well-known algorithm for parameter identification in finite mixture models is the expectation-maximization (EM) algorithm. Given a sample $x_1, \ldots, x_n$ and an initial parameter $\theta^{(0)}$, the EM algorithm estimates the unknown parameter by iteratively maximizing the expected log-likelihood, given by

$$\ell(\theta) = \sum_{t=1}^{n} \log p(x_t \mid \theta) \qquad (3)$$

$$Q\big(\theta \mid \theta^{(k)}\big) = \mathbb{E}\big[\log p(x, y \mid \theta) \,\big|\, x,\, \theta^{(k)}\big] \qquad (4)$$

$$\theta^{(k+1)} = \arg\max_{\theta}\; Q\big(\theta \mid \theta^{(k)}\big) \qquad (5)$$

The EM is a batch-oriented algorithm: it requires the whole data set to be available in memory before a new refined estimate is produced. If we had to run this algorithm online (i.e., each time a new observation is made available for inspection), too much time and memory would be consumed. Worse yet, the time and memory consumed keep growing as monitoring goes on. We do not follow this approach. Instead, a stochastic approximation to the problem of maximizing the likelihood function is used to turn the EM into an online algorithm. Let the vectors $x_1, x_2, \ldots$ be a sequence of observations whose joint probability distribution depends on the unknown parameter $\theta$. The goal is to derive an online algorithm for
estimating the parameter $\theta$. We define the recursive likelihood function as follows:

$$\ell_{t+1}(\theta) = \ell_t(\theta) + \mathbb{E}_{\theta^*}\big[\log p(x_{t+1}, y_{t+1} \mid \theta) \,\big|\, x_1, \ldots, x_{t+1}\big] \qquad (6)$$

where $\mathbb{E}_{\theta^*}$ denotes the expectation with respect to the true parameter, and $y$ is the latent, unobservable variable. This is basically the same recursive likelihood used in earlier stochastic approximation work, except that the expectation is taken conditionally on the whole set of observations. It can be shown, given appropriate regularity conditions, that

$$\hat{\theta}_{t+1} = \hat{\theta}_t + \frac{1}{t+1}\, I_c\big(\hat{\theta}_t\big)^{-1}\, s\big(x_{t+1};\, \hat{\theta}_t\big) \qquad (7)$$

$$s(x;\, \theta) = \frac{\partial \log p(x \mid \theta)}{\partial \theta} \qquad (8)$$

where the scores $s(\cdot)$ are defined by the differential in (8), and $I_c$ denotes the Fisher information matrix for the complete data, in which the separation variable is given. Similar results are obtained by minimizing the Kullback-Leibler divergence instead of maximizing the recursive likelihood. The proof of this result is omitted, because even when the expectation is taken conditionally on the whole set of observations, the original proof still holds. Working out the scores and the matrix $I_c$ leads to the following recursive formulas for updating the model parameters:

$$w_{t+1,j} = w_{t,j} + P\big(y_{t+1} = j \mid x_{t+1},\, \hat{\theta}_t\big) \qquad (9)$$

$$\hat{m}_{t+1,j} = \hat{m}_{t,j} + \frac{P\big(y_{t+1} = j \mid x_{t+1},\, \hat{\theta}_t\big)}{w_{t+1,j}}\, \big(x_{t+1} - \hat{m}_{t,j}\big) \qquad (10)$$

Now let us verify our design goals. First, note that the parameter updating is fast enough to be implemented online. Sequentially acquired data points are merged with the statistics of already-processed data, and do not require recomputation over all collected data. Time and memory requirements are kept minimal. To allow the learning algorithm to track slight changes in the model parameters, we introduce an exponential forgetting factor $0 < \lambda < 1$ that reduces the effect of old observations, much in the same way as in recursive identification. Evaluating a sum of past contributions is then replaced by evaluating an exponentially weighted sum, with the contribution of observation $i \le t$ weighted by $\lambda^{t-i}$. This way, the learning algorithm adapts to small changes in the model parameters. Also, note that our implementation assigns each observation to the regime with the highest probability, and not to all regimes, but the weight of the regime with the highest probability is kept in the computation.

Fig. 3. Comparison of the drifts $(m_{t+1} - m_t)$ and $(m_t - \hat{m})$ for a duration of 5 h, under the same network conditions.
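The recursive update with exponential forgetting and hard assignment to the closest regime can be sketched as follows. This is a minimal scalar-mean version under our own naming; the paper's exact update formulas may differ in detail.

```python
import numpy as np

def make_online_mean_tracker(mu0, lam=0.8):
    """Track regime means online with exponential forgetting (sketch).

    Each new observation is hard-assigned to the closest regime, whose
    mean is nudged toward it; lam < 1 down-weights old observations.
    The scalar-mean simplification and all names are ours, not the
    paper's exact formulas.
    """
    mu = np.array(mu0, dtype=float)  # current regime means
    w = np.ones_like(mu)             # effective sample weight per regime
    def update(x):
        j = int(np.argmin(np.abs(mu - x)))  # closest regime wins the point
        w[j] = lam * w[j] + 1.0             # forget old weight, add the point
        mu[j] += (x - mu[j]) / w[j]         # stochastic-approximation step
        return mu.copy()
    return update

update = make_online_mean_tracker([2.0, 4.0], lam=0.95)
for x in [2.1, 3.9, 2.0, 4.2]:
    mu = update(x)
print(np.round(mu, 3))  # means drift slightly toward the new observations
```

Because old weight decays geometrically, a sustained shift in the data eventually pulls the affected regime mean to its new value, which is exactly the adaptive behavior the baseline relies on.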
As we are not interested in inferring the latent variable, the mixing weights themselves are not updated. In the sequel, we shall be interested only in changes in the $K$-dimensional mean $m = (m_1, \ldots, m_K)$. As stated earlier, approximating $m_t$ with a fixed estimate $\hat{m}$, and studying the difference $m_t - \hat{m}$, is possible, but our experiments showed that this approach is inefficient. Fig. 3 compares both differences for a duration of 5 h, under the same network conditions. Here we show results taken from the number of outbound broadcast packets from a network file server, because this variable in our network is very smooth, and therefore likely to give good results for $m_t - \hat{m}$. As can be concluded from this figure, however, $m_t - \hat{m}$ is not symmetric around zero, while the difference $m_{t+1} - m_t$ is both symmetric and has a mean very close to zero under normal conditions.

Fig. 4. Behavior of the random variable $e$ under the same conditions as Fig. 3.

For the distribution of the baseline random variable, we showed empirically that the $K$-variate random variable given by

$$e_t = \hat{\Sigma}_t^{-1/2}\, \big(m_{t+1} - m_t\big) \qquad (11)$$

$$\hat{\Sigma}_t = \operatorname{Cov}\big(m_{t+1} - m_t\big) \qquad (12)$$

is approximately Normal, with mean zero under normal network conditions. Note that $e_t$, as given in (11), is simply the difference $m_{t+1} - m_t$, scaled such that its variance-covariance matrix becomes the identity. Fig. 4 shows the behavior of $e$ under the same network conditions as Fig. 3. It can be seen that the baseline random variable is stable, and its mean is very close to 0. It should be said that, in this experiment, the very smooth behavior of $e$ is also a result of the data used. In general, we have always been able to obtain a mean very close to zero, but a less smooth behavior of the baseline variable. To summarize the results of this section: we showed how the learning algorithm transforms the raw data into a stationary multivariate variable. The baseline random variable $e$ has the desirable property of being Normal, with mean zero and identity covariance matrix, under normal network operations. The mean of $e$ serves as the baseline of normal operation. The next section
shows the behavior of this baseline random variable under abnormal conditions, and how we formulate and solve the detection problem.

Fig. 5. Behavior of the baseline variable under abnormal network conditions. (a) Abnormal conditions induce a sudden change in the mean. (b) A slight change in the mean preceded the sudden jump.

III. ONLINE ANOMALY DETECTION

Given that we know the behavior of the baseline under normal conditions, an important point to investigate is how the variable behaves under abnormal conditions. Fig. 5 shows the behavior of the baseline random variable under a real abnormal condition. The problem was caused by a streaming network interface card sending an excessive number of badly formatted packets. As shown in Fig. 5(a), this abnormal condition causes a sudden jump in the mean of the baseline variable. Fig. 5(b) shows the behavior of the baseline random variable just before the sudden jump in the mean. Interestingly, we notice that the sudden jump is preceded by a slight change in the mean of the baseline. If the detection approach is designed to be sensitive to slight changes in the operating characteristics of the network, we could have predicted the problem of Fig. 5 before it became serious. The problem could have been avoided, or at least addressed immediately after its occurrence. Generally, one may not assume that all problems present signs that allow their prediction. In that case, we require our detection method to raise an alarm as soon as the change in the mean occurs.

A. Online Anomaly Detection

Consider the realizations $e_1, \ldots, e_n$ obtained by observing the baseline random variable sequentially from time point 1 to $n$. Under normal operation of the network, the sample follows a $K$-variate Normal distribution with mean 0 and identity covariance matrix (Section II). At some unknown time point $r$, a change happens in the model, and the newly generated realizations shift to a new distribution.
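The scaling that gives the baseline variable its identity covariance under normal conditions can be illustrated numerically. The sketch below is our own: it simulates zero-mean but correlated drifts of the estimated means, then whitens them with the sample covariance, in the spirit of (11).

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated drifts d_t = m_{t+1} - m_t of the estimated means under
# normal conditions: zero mean, but correlated across K = 3 components.
K = 3
A = rng.normal(size=(K, K))
cov = A @ A.T / K                       # an arbitrary drift covariance
d = rng.multivariate_normal(np.zeros(K), cov, size=5000)

# Whiten the drift so that its variance-covariance matrix becomes the
# identity, as required of the baseline random variable e.
S = np.cov(d, rowvar=False)
L = np.linalg.cholesky(S)
e = np.linalg.solve(L, d.T).T           # e_t = L^{-1} d_t

print(np.round(e.mean(axis=0), 2))      # mean close to zero
print(np.round(np.cov(e, rowvar=False), 2))  # covariance close to identity
```

After this step, any persistent departure of the mean of e from zero is attributable to a change in the underlying model, which is what the detection procedure below exploits.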
The goal is to find a decision function and a stopping rule that detect this change and raise an alarm as soon as possible, under a controlled false alarm rate. This formulation is known in the sequential analysis literature as the disruption problem. The main difference from classical hypothesis testing is that the sample size is a function of the observations made so far (i.e., it is not fixed a priori), and the distribution of the baseline random variable is known under normal operation. The goal is to achieve fast change detection, by using the minimum sample size possible to decide whether an alarm is to be raised or not. It is well known that, for a known probability distribution after the change, the Page-Lorden cumulative sum (CUSUM) test is optimal, in the sense that it minimizes the delay to detection among all tests with a given false alarm rate. The idea is to inspect, sequentially, for each of the possible change points, the logarithm of the likelihood ratio, and to stop the first time this ratio exceeds a predefined threshold. The resulting decision function and stopping rule are

$$g_n = \max_{1 \le r \le n}\; \sum_{i=r}^{n} \log \frac{f_{\theta_1}(e_i)}{f_{\theta_0}(e_i)} \qquad (13)$$

$$T = \inf\{\, n \ge 1 : g_n \ge h \,\} \qquad (14)$$

In the present case of network anomaly detection, we do not have a priori knowledge of the probability distribution after the change, or of the change point. The common extension of the Page-Lorden CUSUM test consists of estimating the postchange distribution mean, and the change point, from the data. This approach is known as the generalized likelihood ratio (GLR) test. That is, the unknown parameter $\theta_1$ of the distribution of $e$ after the change, and the change point $r$, are estimated from the data using the maximum likelihood estimator. The resulting decision function and stopping rule are given by

$$g_n = \max_{1 \le r \le n}\; \sup_{\theta_1}\; \sum_{i=r}^{n} \log \frac{f_{\theta_1}(e_i)}{f_{\theta_0}(e_i)} \qquad (15)$$

$$T = \inf\{\, n \ge 1 : g_n \ge h \,\} \qquad (16)$$

In our case, where the prechange and postchange distributions are Normal, the maximization problem of (15) can be worked out explicitly. It has a simple form, given by

$$g_n = \max_{1 \le r \le n}\; \frac{n - r + 1}{2}\, \bigg\|\, \frac{1}{n - r + 1} \sum_{i=r}^{n} e_i \,\bigg\|^2 \qquad (17)$$

$$T = \inf\{\, n \ge 1 : g_n \ge h \,\} \qquad (18)$$
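In this Gaussian case, the decision function reduces to scanning candidate change points and scoring the squared norm of the sample mean after each candidate. The sketch below is our own code, using a moving horizon of fixed length (as introduced in Section III-B); names and defaults are illustrative assumptions.

```python
import numpy as np

def glr_stat(e, horizon=50):
    """Moving-horizon GLR statistic for a mean shift in N(0, I) data.

    For each candidate change point r inside the horizon, the
    log-likelihood ratio maximized over the unknown post-change mean
    equals (n - r)/2 * ||mean(e[r:])||^2; the decision function is the
    maximum over r. Names and parameters are ours, for illustration.
    """
    e = np.atleast_2d(np.asarray(e, dtype=float))
    n = len(e)
    g = 0.0
    for r in range(max(0, n - horizon), n):
        m = e[r:].mean(axis=0)
        g = max(g, (n - r) / 2.0 * float(m @ m))
    return g

rng = np.random.default_rng(2)
in_control = rng.normal(0.0, 1.0, size=(300, 2))  # baseline e ~ N(0, I)
shifted = in_control.copy()
shifted[250:] += 1.0                              # mean jumps at t = 250

print(round(glr_stat(in_control), 1), round(glr_stat(shifted), 1))
```

On in-control data the statistic stays small; once the mean jumps, it grows roughly linearly in the number of post-change samples, so an alarm fires as soon as it crosses the design threshold.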
The equation assumes that, after the change, the distribution of the baseline realizations is still Normal, but with a different mean. In the abnormal case, it is not possible to obtain an unbiased fit of the postchange distribution. Fortunately, such accurate estimation is not crucial. What is needed is that, when the anomaly occurs, the closest Normal distribution, obtained by maximum likelihood estimation, has a mean significantly different from zero.

B. Tuning the Threshold

So far we have introduced the decision function and the stopping rule used for online detection of anomalous network behavior. The remainder of our problem setup concerns the choice of the design threshold $h$. It can be shown that the expectation of the stopping rule under no change, denoted $\mathbb{E}_\infty[T]$, is given by (19)-(20), where $\Phi$ denotes the Normal distribution function. For computational purposes, there is some theoretical justification for using the small-threshold approximation of (21)-(22). Not surprisingly, (19) turns out to be the mean time between false alarms. It follows that, given a desired false alarm rate, we can recover the design threshold $h$ by solving (19). Unfortunately, (18) cannot be written recursively. Consequently, the number of realizations to be inspected can grow large. To circumvent this difficulty, we use a moving horizon of fixed length, where the starting point of the horizon moves one step forward as new observations are made available for inspection.

IV. EVALUATION AND RESULTS

The network monitoring algorithms described earlier have been implemented in a real network. This section discusses how the data are collected, and presents the results that validate the agent's capabilities.

A. Data Collection

The implementation of the monitoring software consists of two modules: the statistics collection module, and the monitoring module introduced in this paper.
The statistics collection module interfaces with the network to collect protocol operation statistics. The monitoring module observes management objects for online anomaly detection. Statistics collection is implemented as a remote monitoring (RMON) agent, running as a user-level process. This solution is particularly appealing in the sense that dependency on the operating system kernel is reduced to the interface used to access the data link layer. This way, we have full control over all aspects of network statistics, as opposed to depending on whether the operating system kernel keeps track of traffic statistics. Our earlier experience implementing an SNMP agent revealed that some kernels do not have entries for all management objects defined in MIB-II. The network monitoring module operates on top of RMON. It accesses raw measurements through the RMON MIB. All monitoring computation is done locally. In contrast to network management station (NMS)-based polling of network statistics, the bandwidth and computation time consumed in transferring and processing raw measurements are kept minimal. Distributed management organizational models have addressed these issues, but failed to address the critical issue of how to use the collected statistics. In this sense, our approach complements distributed management by providing the details of the monitoring tasks to be carried out by distributed agents.

TABLE I. DEFINITION OF VARIABLES USED IN EXPERIMENTAL RESULTS

B. Experimental Setup

To illustrate each of the capabilities of our proposed monitoring approach, Table I lists the variables studied. Some of these variables are not part of the standard RMON MIB definitions, but can be easily obtained using appropriate filters. All these variables are modeled using a finite Gaussian mixture model. Slightly better results are obtained by modeling the TCP connection requests as a mixture of lognormal distributions, but all of the experiments reported here use Gaussian mixtures.
For each of these variables, Table III shows the experimental configuration. The normal network model is obtained by training the model on a sample of normal traffic, with a predefined number of components, as shown in Table III. The number of components was determined in an ad hoc manner, based on the observed fluctuation of traffic in one week of training data. We are now studying how to choose the number of components automatically from the data. In the current implementation, the forgetting factor was chosen after testing a number of values, and a value of 0.8 provided satisfactory results in our case. Arguably, this is a subjective choice, and a more robust method is needed. A potential extension of our work might be to adapt existing methods for setting the forgetting factor automatically to our model. The variables used in this study can detect a number of anomalies, as listed in Table II. This is in no way an exhaustive list of all possible anomalies, but it should be enough to prove the concepts introduced in this paper. There are a large number of anomalies, such as ICMP echo request floods and fragmentation anomalies, that affect other variables easily obtained from the large set of MIB variables the RMON agent collects, and our method should be readily extensible to bear on these problems.
TABLE II. EXAMPLES OF ANOMALIES THAT CAN BE DETECTED BY STUDIED VARIABLES

TABLE III. EXPERIMENTAL CONFIGURATION OF EACH OF THE VARIABLES STUDIED

The rest of this section evaluates the following capabilities of our approach.

1) Detection Accuracy: detection accuracy is shown through four anomalous events, induced by injecting appropriately assembled packets, affecting: the total number of broadcast packets on the segment, the total number of TCP connection requests on a web server, ARP packet operations, and the total number of unknown protocols. Packets are assembled and injected into the network by a program written using the data link provider interface (DLPI) on the Solaris operating system.

2) Adaptability to Normal Traffic Fluctuations: the goal here is to show that traffic fluctuations that are part of the normal operation of the network are not interpreted as anomalies. That is, the monitoring agent adapts to diurnal changes in network traffic, and no alarm is raised.

3) Subtle Change Detection: as stated in the introduction, thresholds applied directly to traffic variables may miss subtle changes in network traffic. In this experiment, the monitoring agent's capability to detect subtle changes in the network state is investigated through two real cases.

4) Alarm Rate: gives the average alarm rate of the monitored variables, evaluated over a duration of one week, and then one month. This allows us both to assess the validity of the monitoring agent over an extended amount of time, and to give a general sense of how frequently alarms are generated.

5) Monitoring Overhead: evaluates the overhead associated with continuously monitoring the variables. The goal here is to provide an idea of the scalability of our detection algorithm.

In the sequel, the details of these experiments are presented.
The term test statistic will be used to denote the statistic of (17)-(18) used for detection.

C. Detection Accuracy

Fig. 6(a) shows how the test statistic reacts to excessive broadcasts, created by injecting two additional broadcast packets every second. As shown in the figure, the test statistic exhibits a sharp increase, crossing the threshold after a delay of approximately 16 min. Fig. 6(b) shows how the test statistic reacts to a sustained rate of TCP connection requests, created by injecting 10 additional packets every second. The problem is detected with a delay of approximately 17 min. Our RMON implementation allows us to study per-protocol statistics. Here we focus on the address resolution protocol (ARP) as an illustrative example, given both the lack of counters for ARP packet operations, and the range of problems that manifest themselves as changes in the statistical characteristics of this protocol's traffic. Fig. 6(c) shows the results of monitoring ARP operation for 24 h, and then perturbing network operations by injecting two additional ARP request packets per second. It can be clearly seen that the anomaly is detected. The delay to detection is approximately 16 min. Fig. 6(d) shows how the agent reacts to excessive unknown protocols. Most of this traffic is caused by packets with a type field value less than 1500, which is typical of Ethernet frame formats other than Ethernet-II (that is, IEEE 802.3/802.2 and Novell raw 802.3). In this case, it took 50 s for the anomaly to be detected. As shown in this section, the detection delay can range from very short (50 s) to relatively long (17 min). It should be stressed that the theoretical formulation of our algorithm attempts to minimize the delay among all tests with a given false alarm rate. However, the observed delay to detection is a function of the degree of abnormality.
If the abnormal behavior induces a sample of observations with a mean very close to an existing regime, the test statistic requires a sample large enough to cross the predefined threshold, resulting in a relatively long delay to detection. While we have yet to derive a mathematical proof that our test is the fastest, it is reasonable to expect that, on average, another test statistic with no optimality considerations and the same false alarm rate, examining the same sample, would take even longer. On the other hand, if the anomaly induces a mean significantly apart from the existing regimes, detection takes a short time; here too, on average, another nonoptimal test statistic examining the same sample would take even longer. The point is that, for a given false alarm rate, our mathematical formulation does try to minimize the delay to detection among all tests with that false alarm rate, and the observed delays (17 min, or 50 s) are to be interpreted probabilistically, as realizations of the delay-to-detection random variable. In summary, we note that in all the cases tested the detection is accurate. Recall that we are not modeling particular anomalies. The agent contrasts the baselined normal behavior with the observed traffic, making it possible to detect novel network problems. It follows that, in principle, our approach can be easily deployed across different networks. It is also important to note that, in all of the previous cases, analyzing the captured packets after an alarm is raised immediately reveals the problem. This is to be contrasted with methods that take alarms as given and yet have to match fault signatures against prestored patterns. Such an approach is knowledge-intensive and lacks the capability to learn new, unseen faults.
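The dependence of detection delay on the degree of abnormality can be illustrated with a generic one-sided CUSUM test. This is only a stand-in for the paper's statistic in (18), with hypothetical reference value and threshold, but it exhibits the same qualitative effect: the closer the post-change mean is to the baseline, the longer the threshold crossing takes.

```python
import random

def cusum_delay(shift, k=0.25, h=8.0, seed=0, max_n=1000):
    """Number of N(shift, 1) samples a one-sided CUSUM needs to cross h."""
    rng = random.Random(seed)
    s = 0.0
    for n in range(1, max_n + 1):
        x = rng.gauss(shift, 1.0)
        s = max(0.0, s + x - k)  # reference value k tunes sensitivity
        if s > h:
            return n
    return max_n  # censored: no crossing within max_n samples

delay_subtle = cusum_delay(shift=0.5)  # post-change mean close to baseline
delay_gross = cusum_delay(shift=3.0)   # post-change mean far from baseline
```

Because both runs see the same noise sequence, the large shift crosses the threshold no later than the subtle one, mirroring the 50 s versus 17 min observations.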
Fig. 6. Behavior of the test statistic corresponding to (a) excessive broadcasts, (b) a sustained rate of TCP connection requests, (c) excessive ARP packets, and (d) excessive unknown packets.

Fig. 7. Adaptability to diurnal traffic patterns. (a) Normal increase in ARP traffic volume. (b) Behavior of the corresponding test statistic. (c) Normal decrease in ARP traffic volume. (d) Behavior of the corresponding test statistic.

D. Adaptability to Normal Traffic Fluctuations

Network traffic exhibits clear diurnal patterns. Night hours and the less busy days of the week show a decrease in network traffic; traffic picks up again during working days. The purpose of this section is to show that the monitoring agent learns these patterns and does not mistake them for anomalies. Fig. 7(a) shows an increase in ARP volume that is part of the normal operation of the network: the ARP packet count increases as the transition is made from night hours to day hours. Fig. 7(b) shows that the test statistic remains within the normal range for this pattern. Similarly, Fig. 7(c) shows a decrease in ARP traffic volume as the network becomes less busy in the late evening hours. Fig. 7(d) plots the reaction of the test statistic around the time this transition took place; here also, the test statistic remains within its normal range. What actually happens is that the ARP traffic model is a mixture with three components, each modeling a given level of ARP traffic. Depending on the time of day (a latent variable), observations are assigned to the corresponding component, making it possible to adapt to these traffic fluctuations. One can conclude, then, that given training data large enough to contain all possible regimes of operation, the monitoring agent can adapt to these patterns, and they will not be taken for anomalies.
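The component-assignment mechanism described above can be sketched as a posterior (responsibility) computation for a univariate Gaussian mixture. The three regimes and their weights, means, and variances below are hypothetical stand-ins for the fitted ARP model, not the paper's estimates.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def responsibilities(x, weights, mus, sigmas):
    """Posterior probability of each mixture component given observation x."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical three-regime model (e.g., night, transition, busy-day ARP counts).
weights = [0.4, 0.3, 0.3]
mus = [2.0, 10.0, 25.0]
sigmas = [1.0, 3.0, 5.0]

r = responsibilities(22.0, weights, mus, sigmas)  # a busy-day observation
```

An observation from daytime traffic is assigned, with near-certainty, to the high-volume component, which is why the diurnal transition does not perturb the test statistic.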
E. Subtle Changes Detection

As stated in the introduction, thresholds applied directly to traffic variables risk missing subtle changes. Here, we test the capability of our approach to detect subtle changes in network traffic. Detection is shown through two real scenarios, obtained by studying the number of connection requests on a web server. Fig. 8(a) shows the raw statistics of tcpinsyn. As indicated by the arrow in the figure, the rate of inbound TCP connection requests is sustained at an almost fixed level for a duration of approximately 6 h. Closer inspection of the captured packets revealed that this pattern was created by two Web robots downloading the
whole set of documents hosted by our web server.

Fig. 8. Reaction of the detection algorithm to subtle changes in network traffic. (a) Subtle change in incoming TCP connection requests. (b) Reaction of the test statistic. (c) A clearer change in incoming TCP connection requests. (d) Reaction of the test statistic.

Fig. 8(b) shows that the test statistic reacts with a sharp increase, eventually crossing the threshold. Another result is shown in Fig. 8(c), in which the change in TCP behavior is obvious; Fig. 8(d) shows the reaction of the test statistic. The experiment was running unattended, but this behavior is likely explained by web robots, as the traffic pattern shown here is similar to that of the previous problem. Given that the training data did not contain such a pattern, and that the web robot apparently did not respect the robots.txt file instructions supposed to exclude all robots, the anomaly was detected. It is important to note that these subtle changes would have been missed by static thresholds applied to the raw traffic.

F. Alarm Rate

Ideally, we would like to estimate the false alarm rate given that we know for sure that the network is operating normally. Unfortunately, it is difficult to gain perfect knowledge of all the subtle changes in network behavior. Instead, Tables IV and V show the average alarm rate per hour, evaluated after the monitoring agent is set to run for one week and for one month, respectively. The goal is to give a sense of how frequently alarms are generated. It should be noted that most of the alarms generated by the variable tcpinsyn in Table V are caused by one particular anomalous activity. The problem shown earlier in Fig.
8(c) generated 29 of the 38 total alarms raised during the one-month duration of the experiment. In the remaining 29 days, only nine alarms were generated.

TABLE IV AVERAGE ALARM RATE PER HOUR FOR A DURATION OF ONE WEEK (168 H)

TABLE V AVERAGE ALARM RATE PER HOUR FOR A DURATION OF ONE MONTH (720 H)

The same also applies to broadcast packets. The test statistic for this variable generated 12 alarms; 11 of them, shown in Fig. 9, were caused by a failure of the file server in the neighboring subnet. Only one alarm was generated over the remaining 29 days. The results reported here show, first, that our method can run without alarms for a very long time and that, once anomalies are injected, the test statistic reacts by crossing the threshold. Since
we are not modeling the anomalies themselves, it is safe to affirm that our method has good detection capability across all anomalies that induce a change in the statistical characteristics of the monitored variables. Second, the fact that our algorithm ran for an extended amount of time without raising alarms shows that it can adapt to normal changes in traffic.

Fig. 9. Distribution of alarms generated by the variable counting broadcast packets.

Fig. 10. Percentage of CPU consumed by monitoring 10, 20, and 30 variables.

These results are to be contrasted with previously reported research, in which the algorithms produced a large number of alarms and ad hoc combination techniques had to be introduced to limit their number; there, variables produced as many as 2.23 alarms per hour. In our approach, over an extended amount of testing (720 h), the highest rate was as small as 0.12 alarms per hour. A comparison with methods that use data to train a model for a set of anomalies is probably in order. Methods that start with a model of known anomalies face the problem of how to characterize the monitored variables given the anomaly. Apart from the fact that these methods cannot detect new, previously unseen problems, anomalies induce disparate behavior in the monitored variables. These approaches have to make a choice: take the characteristics induced by the anomaly literally, maximizing the fit with a predefined model, but at the risk of a higher false alarm rate; or build a model that only loosely captures the statistical essence of the anomaly and, hence, negatively affects the detection capability of the algorithm. There is an inherent tradeoff between detection capability and false alarm rate. Most of the current work does not spell out this tradeoff clearly in the formulation of the problem.
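To make the tradeoff concrete: once a tolerable false alarm rate is fixed, the threshold follows from the null distribution of the test statistic. The sketch below assumes, purely for illustration, a standard-normal statistic checked once per 10-s sample (360 samples/h); the paper's actual threshold is derived from (20) and the asymptotic distribution established there.

```python
from statistics import NormalDist

def threshold_for_alarm_rate(alarms_per_hour, samples_per_hour=360):
    """Threshold h such that a standard-normal statistic, checked once per
    sample, exceeds h at roughly the requested rate under normal operation."""
    p_per_sample = alarms_per_hour / samples_per_hour
    return NormalDist().inv_cdf(1.0 - p_per_sample)

h = threshold_for_alarm_rate(0.12)  # 0.12 alarms/h: the highest rate in Table V
```

Tightening the acceptable alarm rate raises the threshold, which in turn lengthens the delay to detection; fixing one quantity determines the other.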
The key question is how to achieve good detection capability within a reasonable false alarm rate. Compared to these methods, the mathematical formulation of our approach captures this inherent tradeoff between detection capability and false alarm rate explicitly. As shown in (20), by fixing a false alarm rate, we derive the threshold to be associated with the test statistic. Compared to methods that may have modeled anomalies by overfitting the observed data, our method has the advantage of minimizing the impact of model assumptions. As shown in Fig. 3, if the estimated parameters had been used as the true parameters of the model, approximation and model misspecification errors would build up and cause a high false alarm rate. Instead, our test statistic avoids using the true parameters and is based on the difference between successive estimates of the model parameters. The result is a very stable test statistic under normal conditions, in which drifts due to estimation inaccuracies are minimized.

G. Monitoring Overhead

As explained in Section IV-A, statistics collection is implemented as an RMON agent running as a user-level process. Packets are intercepted at the kernel level in raw format and sent directly to our RMON agent for statistics logging. Network traffic is sampled every 10 s. Fig. 10 shows the amount of CPU time it took our online algorithm to calculate the test statistic for 10, then 20, then 30 variables. In this experiment, CPU time usage is sampled every 10 s. The agent runs on SunOS Release 5.8, with a SPARC CPU (333 MHz). Our method uses (10) to calculate residuals. The computation involves only a linear combination of the new observation with a function of the old value of the parameter, which generates very minimal CPU load. Most of the CPU time is consumed by the detection statistic (18), as it involves a computation over the entire moving window. Overall, our method is scalable, consuming a little over 2% of the CPU to monitor 30 variables.
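A minimal sketch of the per-sample work just described, under simplifying assumptions: a scalar parameter, a plain stochastic-approximation mean update standing in for the recursion in (10), and a windowed mean of residuals standing in for the statistic in (18).

```python
from collections import deque

def make_monitor(eta=0.05, window=50):
    """Recursive scalar-mean update (one multiply-add per sample) plus a
    moving-window mean of residuals as the detection statistic."""
    state = {"theta": 0.0, "resid": deque(maxlen=window)}

    def update(x):
        resid = x - state["theta"]      # residual against the current estimate
        state["theta"] += eta * resid   # stochastic-approximation step
        state["resid"].append(resid)
        # Windowed statistic: recomputed over the whole moving window,
        # which is where most of the per-sample CPU cost goes.
        return sum(state["resid"]) / len(state["resid"])

    return update

mon = make_monitor()
stats = [mon(1.0) for _ in range(200)]  # constant input: statistic decays toward zero
```

The recursive update is O(1) per sample, while the windowed statistic is O(window), matching the observation that the detection statistic dominates the CPU load.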
Note that the CPU results reported in this section are based on artificially created variables, to simplify the implementation. In addition, we report only the CPU time used for detection; we do not include the overhead of capturing and updating the different RMON variable statistics. This instrumentation overhead is incurred by any other method interested in studying network traffic, and we are not claiming that our implementation is optimal.

V. CONCLUSION

In this paper, we developed an online technique for real-time detection of network anomalies. We introduced a parametric model of network traffic in which the studied variables are viewed as sampled from a finite mixture model. A new method for baselining network operation, based on successive parameter identification, was introduced. The baseline random variable is shown to be approximately Normal, with mean zero under
normal operation, while sudden jumps in this mean are characteristic of abnormal conditions. A real-time online change detection algorithm is designed to process, sequentially, the realizations of the baseline random variable and raise an alarm as soon as an anomaly occurs. The proposed approach requires neither the set of anomalies to be detected nor the thresholds to be supplied by the user. Experimental results showed the effectiveness of the method on real data; a low alarm rate and high detection accuracy have been demonstrated. The key point that allowed efficient detection of network problems was to avoid solving the more general problem of accurate parameter estimation as an intermediate step between traffic modeling and change detection.

REFERENCES

F. Barceló and J. Jordán, "Channel holding time distribution in public telephony system (PAMR and PCS)," IEEE Trans. Veh. Technol., vol. 49, no. 5, Sep.
M. Basseville and I. V. Nikiforov, Detection of Abrupt Changes: Theory and Application. Englewood Cliffs, NJ: Prentice-Hall.
V. A. Bolotin, "Telephone circuit holding time distributions," in Proc. 14th Int. Conf. Fundamental Role of Teletraffic in the Evolution of Telecommunications Networks, vol. 1a, Amsterdam, The Netherlands, 1994.
V. A. Bolotin, "Modeling call holding time distributions for CCS network design and performance analysis," IEEE J. Sel. Areas Commun., vol. 12, Apr.
V. A. Bolotin, Y. Levi, and D. Liu, "Characterizing data connection and messages by mixtures of distributions on logarithmic scale," in Proc. 16th Int. Conf. Teletraffic Engineering in a Competitive World, Amsterdam, The Netherlands.
T. J. Brailsford, J. H. W. Penm, and R. D. Terrell, "Selecting the forgetting factor in subset autoregressive modeling," J. Time-Series Anal., vol. 23, no. 6, pp. 1–21.
A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J.
Roy. Statist. Soc. B, vol. 39, pp. 1–38.
A. Erramilli, O. Narayan, and W. Willinger, "Experimental queueing analysis with long-range dependent packet traffic," IEEE/ACM Trans. Netw., vol. 4, no. 2.
G. Goldszmidt and Y. Yemini, "Distributed management by delegation," in Proc. 15th Int. Conf. Distributed Computing Systems.
H. Hajji and B. H. Far, "Continuous network monitoring for fast detection of performance problems," in Proc. Int. Symp. Performance Evaluation of Computer and Telecommunication Systems.
H. Hajji, B. H. Far, and J. Cheng, "Detection of network faults and performance problems," in Proc. Internet Conf., Japan.
H. Hajji, "Baselining network traffic and online faults detection," in Proc. IEEE Int. Conf. Communications, vol. 26, May 2003.
C. S. Hood and C. Ji, "Proactive network fault detection," in Proc. INFOCOM '97, 1997.
G. Jakobson and M. D. Weissman, "Alarm correlation," IEEE Network.
I. Katzela and M. Schwartz, "Schemes for fault identification in communication networks," IEEE/ACM Trans. Netw., vol. 3.
W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson, "On the self-similar nature of Ethernet traffic," IEEE/ACM Trans. Netw., vol. 2, no. 1, pp. 1–15, Feb.
V. Paxson and S. Floyd, "Wide-area traffic: The failure of Poisson modeling," IEEE/ACM Trans. Netw., vol. 3, no. 3, Jun.
V. Paxson, "Empirically-derived analytic models of wide-area TCP connections," IEEE/ACM Trans. Netw., vol. 2, no. 4, Aug.
L. LaBarre, "Management by exception: OSI event generation, reporting, and logging," in Proc. 2nd Int. Symp. Integrated Network Management.
G. Lorden, "Procedures for reacting to a change in distribution," Ann. Math. Statist., vol. 42.
R. A. Maxion and F. E. Feather, "A case study of Ethernet anomalies in a distributed computing environment," IEEE Trans. Reliab., vol. 39, no. 4.
J. P. Martin-Flatin, S. Znaty, and J. P. Hubaux, "A survey of distributed enterprise network and systems management paradigms," J. Netw. Syst. Manage., vol.
7, no. 1, pp. 9–26.
E. S. Page, "Continuous inspection schemes," Biometrika, vol. 41.
R. A. Redner and H. F. Walker, "Mixture densities, maximum likelihood and the EM algorithm," SIAM Rev., vol. 26, no. 2.
D. Siegmund and E. S. Venkatraman, "Using the generalized likelihood ratio statistic for sequential detection of a change-point," Ann. Statist., vol. 23, no. 1.
D. Siegmund, Sequential Analysis: Tests and Confidence Intervals. New York: Springer-Verlag.
M. Thottan and C. Ji, "Statistical detection of enterprise network problems," J. Netw. Syst. Manage., vol. 7, no. 1.
D. M. Titterington, "Recursive parameter estimation using incomplete data," J. Roy. Statist. Soc., ser. B, vol. 46, no. 2.
N. A. Vlassis, "A kurtosis-based dynamic approach to Gaussian mixture modeling," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 29, no. 4.
E. Weinstein, M. Feder, and A. V. Oppenheim, "Sequential algorithms for parameter estimation based on the Kullback-Leibler information measure," IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 9, Sep.
S. Yemini, S. Kliger, E. Mozes, Y. Yemini, and D. Ohsie, "High speed and robust event correlation," IEEE Commun. Mag.

Hassan Hajji (M'05) received the B.Sc. degree from Mohamed Premier University, Morocco, in 1995, and the M.Sc. and Ph.D. degrees from Saitama University, Japan, in 1999 and 2002, respectively. He is currently with IBM Business Consulting, Tokyo, Japan, where he works in the On Demand Innovation Services service line. Prior to that, he was with IBM Research, Tokyo Research Laboratory. His research interests include enterprise risk management, security and privacy, and network and system management.