A New Statistical Approach to Network Anomaly Detection

Christian Callegari (1), Sandrine Vaton (2), and Michele Pagano (1)
(1) Dept. of Information Engineering, University of Pisa, ITALY. E-mail: {christian.callegari,m.pagano}@iet.unipi.it
(2) Dept. of Computer Science, ENST Bretagne, FRANCE. E-mail: sandrine.vaton@enst-bretagne.fr

Abstract: In the last few years, the number and impact of security attacks over the Internet have been continuously increasing. To face this issue, the use of Intrusion Detection Systems (IDSs) has emerged as a key element in network security. In this paper we address the problem considering a novel statistical technique for detecting network anomalies. Our approach is based on the use of different families of Markovian models (namely high order and non homogeneous Markov chains) for modeling network traffic running over TCP. The performance results shown in the paper justify the proposed method and highlight the improvements over commonly used statistical techniques.

Index Terms: Intrusion Detection System, High Order Markov Chain, Mixture Transition Model, Non-Homogeneous Markov Chain

I. INTRODUCTION

In the last few years the Internet has experienced an explosive growth. Along with the wide proliferation of new services, the quantity and impact of attacks have been continuously increasing. The number of computer systems and their vulnerabilities have been rising, while the level of sophistication and knowledge required to carry out an attack has been decreasing, as much technical attack know-how is readily available on Web sites all over the world.

Recent advances in encryption, public key exchange, digital signature, and the development of related standards have set a foundation for network security. However, security on a network goes beyond these issues. Indeed it must include security of computer systems and networks, at all levels, top to bottom.

Since it seems impossible to guarantee complete protection to a system by means of prevention mechanisms (e.g. authentication techniques), the use of an Intrusion Detection System (IDS) is of primary importance to reveal intrusions in a network or in a system. IDSs are usually classified on the basis of several criteria [1]. The state of the art in the field of intrusion detection is mostly represented by misuse based IDSs. Considering that most attacks are realized with known tools, available on the Internet, a signature based IDS could seem a good solution. Nevertheless hackers continuously come up with new attacks that a misuse based IDS is not able to block. This is the main reason why our work has focused on the development of an anomaly based IDS. In particular our goal is to reveal intrusions carried out exploiting TCP bugs, by using Markovian models (high order and non homogeneous Markov chains) to describe the behavior of network traffic.

The use of first order homogeneous Markov chains is a well-known approach to detect two distinct kinds of anomalies: masqueraders (analyzing the command stream of a host) and intruders (analyzing the evolution of TCP flows in the network traffic) [2]. Vardi and Ju in [3] describe the use of high order Markov chains to detect masqueraders at the host level, and in [4][5] the authors compare the performance of first order models and generic high order models.

After an extensive survey, to the best of our knowledge, there is no work directly related either to the use of high order Markov chains to detect anomalies in the TCP traffic or to the application of non-homogeneous Markov chains to anomaly detection in general. Moreover no study at all compares the performance achievable with Markov chains of different orders and with a simple independent model.

The paper is structured as follows: the next section provides a detailed description of the implemented system, while the subsequent section presents the experimental results. Finally Section IV concludes the paper with some final remarks.

II. SYSTEM DESIGN

In this section we provide a detailed description of the proposed anomaly based NIDS. The aim of our work is to perform a comparison between several statistical models, which can be used to describe the behavior of TCP connections. More in detail we take into account the use of:
- first order homogeneous Markov chains
- first order non-homogeneous Markov chains
- high order homogeneous Markov chains
- stationary ECDF (Empirical Cumulative Distribution Function)
- non-stationary ECDF

The next subsections describe the training phase and the detection phase of our IDS.

A. Training Phase

To build the model which represents the normal behavior of the network, the system needs a training phase during which
it analyzes some network traffic, supposed to be attack free. The system analyzes raw traffic traces in libpcap format, the standard used by publicly available packet sniffer software, such as Tcpdump or Ethereal. First of all the IDS performs a filtering phase so that only TCP packets are passed as input to the detection blocks. The IDS only considers some fields of the packet headers, more precisely the IP source address, the IP destination address, the source port number, the destination port number, and the TCP flags. The IP addresses and the port numbers are used to identify a connection, while the value of the flags is used to build the profile.

Experimental results have shown that the stochastic models associated to the different applications strongly differ one from the other. Thus, before constructing the model, the system isolates the different services, on the basis of the server port number, and the following procedure is realized once per each application.

After that the IDS reconstructs the single connections on the basis of the 5-tuple (source and destination addresses, source and destination ports, and protocol). A value $s_i$ is associated to each packet, according to the configuration of the TCP flags:

$$s_i = syn + 2 \cdot ack + 4 \cdot psh + 8 \cdot rst + 16 \cdot urg + 32 \cdot fin \tag{1}$$

Thus each mono-directional connection is represented by a sequence of symbols $s_i$, which are integers in $\{0, 1, \ldots, 63\}$. The training phase, as well as the detection phase, varies according to the stochastic model we are taking into account.

1) ECDF: In the case of the stationary ECDF the training phase simply consists of evaluating the probabilities $P(s_i)$ that the TCP flags assume the value $s_i$, independently of the position of the packet in the TCP connection. For the non-stationary ECDF the system has to compute the probabilities $P_j(s_i)$ that the TCP flags of the $j$-th packet of the connection assume the value $s_i$. Taking into account the nature of the security attacks, for reducing the complexity of the system, we have decided to evaluate such probabilities only for the first 10 packets of a connection, i.e. $j = 1, 2, \ldots, 10$.

2) Markov Chains: In the case of Markovian models the symbols $s_i$ are considered as the states of a hidden discrete time finite state Markov chain. Since not all the TCP flags configurations are observable in real traffic, the system only considers the states observed in the training phase. Moreover, to take into account the possibility that some new flags configurations could be observed during the detection phase, a rare state is added. This procedure allows us to reduce the cardinality of the state space from 64 (all the possible configurations of the six TCP flags bits) to a number $K$, usually smaller than ten. Then the system estimates the transition probabilities of the Markov chain.

Since the computation of such probabilities is quite straightforward in the case of first order Markov chains (homogeneous and non-homogeneous), in the following we consider a Markov chain of order $l$. The main problem related to this kind of models is the explosion of the number of parameters, which grows exponentially with the order, according to the rule $K^l (K-1)$. This entails the need of a parsimonious representation of the transition probabilities. The approach used in this paper is the Mixture Transition Distribution (MTD) model, first proposed in [6]. Under the MTD model, the transition probabilities of an $l$-th order Markov chain can be expressed as follows:

$$P(C_t = s_{i_0} \mid C_{t-1} = s_{i_1}, C_{t-2} = s_{i_2}, \ldots, C_{t-l} = s_{i_l}) = \sum_{j=1}^{l} \lambda_j \, r(s_{i_0} \mid s_{i_j}) \tag{2}$$

where $C_t$ represents the state of the chain at step $t$ and the quantities

$$R = \{ r(s_i \mid s_j);\ i, j = 1, 2, \ldots, K \}, \qquad \Lambda = \{ \lambda_j;\ j = 1, 2, \ldots, l \} \tag{3}$$

satisfy the following constraints:

$$r(s_i \mid s_j) \ge 0,\ \ i, j = 1, 2, \ldots, K \qquad \text{and} \qquad \sum_{i=1}^{K} r(s_i \mid s_j) = 1,\ \ j = 1, 2, \ldots, K \tag{4}$$

$$\lambda_j \ge 0,\ \ j = 1, 2, \ldots, l \qquad \text{and} \qquad \sum_{j=1}^{l} \lambda_j = 1 \tag{5}$$

A consequence of the use of the MTD model is the reduction of the number of parameters from $K^l (K-1)$ to $K(K-1) + l - 1$.

To take into account the presence of the rare state (labelled $K$), we have to fix the following quantities:

$$r(\text{rare} \mid s_i) = \varepsilon,\ \ i = 1, 2, \ldots, K \ \ \text{and } \varepsilon \text{ small } (\varepsilon = 10^{-6}); \qquad r(s_i \mid \text{rare}) = \frac{1 - \varepsilon}{K - 1},\ \ i = 1, 2, \ldots, K-1 \tag{6}$$

According to the MTD model the log-likelihood of a sequence $(c_1, c_2, \ldots, c_T)$ of length $T$ is

$$LL(c_1, c_2, \ldots, c_T) = \sum_{i_0=1}^{K} \cdots \sum_{i_l=1}^{K} N(s_{i_0}, s_{i_1}, \ldots, s_{i_l}) \log \left( \sum_{j=1}^{l} \lambda_j \, r(s_{i_0} \mid s_{i_j}) \right) \tag{7}$$

where $N(s_{i_0}, s_{i_1}, \ldots, s_{i_l})$ represents the number of times the transition $s_{i_l} \to \cdots \to s_{i_1} \to s_{i_0}$ is observed.

Maximum likelihood estimation (MLE) of the chain parameters requires to maximize the right hand side of eq. (7), with respect to $R$ and $\Lambda$, taking into account the constraints (4) and (5). Since the original solution [7] seems to be too computationally demanding, we have applied the procedure proposed in [3], which consists in an alternate maximization with respect to $R$ and to $\Lambda$. This process leads to a global maximum, since $LL$ is concave in $R$ and $\Lambda$. For the part when $R$ is fixed, we maximize $LL$ with respect to $\Lambda$, and vice-versa. In the first step (estimation of $\Lambda$) we have used the sequential quadratic programming, while the second maximization step (estimation of $R$) is a linear inverse problem with positivity constraints (LININPOS) that we have solved applying the expectation maximization (EM) algorithm [8].

Since the first maximization step is quite trivial, in what follows we discuss the second step, i.e. the estimation of the matrix $R$, with the vector $\Lambda$ fixed. First of all we have re-indexed the log-likelihood in the following way (each state being identified with its index):

$$\varphi(s_{i_0}, s_{i_1}, \ldots, s_{i_l}) = 1 + \sum_{j=0}^{l} (i_j - 1) K^j \equiv k \tag{8}$$

which takes to

$$a_k \equiv N(s_{i_0}, s_{i_1}, \ldots, s_{i_l}) \qquad \text{and} \qquad b_k \equiv \sum_{j=1}^{l} \lambda_j \, r(s_{i_0} \mid s_{i_j}) \tag{9}$$

Thus, at first we estimate the quantities $b_k$ (MLE) and then we solve the linear system

$$b_k = \sum_{j=1}^{l} \lambda_j \, r(s_{i_0} \mid s_{i_j}) \tag{10}$$

which is a LININPOS problem. At this point, the log-likelihood can be expressed as:

$$\sum_{k=1}^{K^{l+1}} a_k \log b_k \tag{11}$$

where [3]

$$\sum_{k=1}^{K^{l+1}} a_k = T \qquad \text{and} \qquad \sum_{k=1}^{K^{l+1}} b_k = K^l \tag{12}$$

Thus a simple Lagrange method argument shows that the log-likelihood is maximized when

$$\hat{b}_k = a_k \frac{\sum_k b_k}{\sum_k a_k} = a_k \frac{K^l}{T}, \quad \forall k \tag{13}$$

or, equivalently, when

$$\sum_{j=1}^{l} \lambda_j \, r(s_{i_0} \mid s_{i_j}) = \frac{K^l}{T} N(s_{i_0}, s_{i_1}, \ldots, s_{i_l}), \quad \forall (i_0, \ldots, i_l) \tag{14}$$

Thus, if we consider these equations as a linear system subject to the constraints (4), we obtain a LININPOS problem, which can be solved, in the sense of the minimum Kullback-Leibler distance, using the EM algorithm. More in detail we have [9]

$$\begin{pmatrix} A \\ B \end{pmatrix} R = \begin{pmatrix} \frac{K^l}{T} N \\ \mathbf{1} \end{pmatrix} \tag{15}$$

where

$$R^T = \big( r(s_1 \mid s_1), r(s_2 \mid s_1), \ldots, r(s_K \mid s_1), \ldots, r(s_K \mid s_K) \big) = (r_1, r_2, \ldots, r_{K^2}) \tag{16}$$

are the unknowns, $r(s_i \mid s_j) = r_{i + K(j-1)}$, and

$$N = (N_1, N_2, \ldots, N_{K^{l+1}})^T, \quad \text{where } N_i = N(s_{i_0}, s_{i_1}, \ldots, s_{i_l}),\ \ i = \varphi(s_{i_0}, s_{i_1}, \ldots, s_{i_l}) \tag{17}$$

$$A = \{a_{ij}\}_{K^{l+1} \times K^2}, \quad \text{where } a_{ij} = \sum_{k=1}^{l} \lambda_k \, I[\, j = i_0 + K(i_k - 1) \,], \quad (i_0, \ldots, i_l) = \varphi^{-1}(i) \tag{18}$$

The matrix $B$ looks like:

$$B = \begin{pmatrix} 1 \cdots 1 & 0 \cdots 0 & \cdots & 0 \cdots 0 \\ 0 \cdots 0 & 1 \cdots 1 & \cdots & 0 \cdots 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 \cdots 0 & 0 \cdots 0 & \cdots & 1 \cdots 1 \end{pmatrix}_{K \times K^2} \tag{19}$$

At this point the EM iteration step is the following:

$$r_j \leftarrow \frac{a_{\cdot j}}{a_{\cdot j} + b_{\cdot j}} \, \hat{r}_j\Big(A, \frac{K^l}{T} N, R\Big) + \frac{b_{\cdot j}}{a_{\cdot j} + b_{\cdot j}} \, \hat{r}_j(B, \mathbf{1}, R) \tag{20}$$

where

$$\hat{r}_j(W, u, v) \equiv \frac{v_j}{w_{\cdot j}} \sum_i \frac{w_{ij} \, u_i}{\sum_k w_{ik} v_k}, \quad j = 1, 2, \ldots, K^2 \tag{21}$$

for matrix $W = \{w_{ij}\}$ and vectors $u = \{u_i\}$, $v = \{v_i\}$, and

$$a_{\cdot j} = \sum_i a_{ij} = \sum_{(i_0, \ldots, i_l) = \varphi^{-1}(i)} \sum_{k=1}^{l} \lambda_k \, I[\, j = i_0 + K(i_k - 1) \,] = \sum_{k=1}^{l} \lambda_k \sum_{i_0} \cdots \sum_{i_l} I[\, j = i_0 + K(i_k - 1) \,] = \sum_{k=1}^{l} \lambda_k K^{l-1} = K^{l-1} \tag{22}$$

and $b_{\cdot j} = \sum_i b_{ij} = 1$.

The choice of the initial values for $R$ and $\Lambda$ is a key point. Experimental tests have shown that good results are obtained choosing $\lambda_i = 1/l$, $i = 1, 2, \ldots, l$, and setting $R$ to the first order transition probabilities.
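The building blocks of the training phase can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the implementation used in the paper: the function names are hypothetical, states are assumed to be already mapped to indices 0..K-1, `R[prev][nxt]` is assumed to store r(nxt | prev), and the update of Eq. (21) assumes a strictly positive vector `v` and a matrix `W` with no all-zero column.

```python
import math

# Bit weights of Eq. (1): s = syn + 2*ack + 4*psh + 8*rst + 16*urg + 32*fin
FLAG_WEIGHTS = {"syn": 1, "ack": 2, "psh": 4, "rst": 8, "urg": 16, "fin": 32}


def encode_flags(flags):
    """Map a collection of TCP flag names to the integer symbol of Eq. (1)."""
    return sum(FLAG_WEIGHTS[f] for f in flags)


def mtd_prob(history, nxt, lambdas, R):
    """MTD transition probability of Eq. (2).

    history[-j] is the state observed j steps back (assumed layout),
    lambdas[j-1] is lambda_j and R[prev][nxt] stands for r(nxt | prev).
    """
    return sum(lam * R[history[-j]][nxt]
               for j, lam in enumerate(lambdas, start=1))


def mtd_log_likelihood(seq, lambdas, R):
    """Log-likelihood of Eq. (7), accumulated transition by transition."""
    order = len(lambdas)
    return sum(math.log(mtd_prob(seq[t - order:t], seq[t], lambdas, R))
               for t in range(order, len(seq)))


def lininpos_step(W, u, v):
    """One multiplicative update of Eq. (21) for the system W v = u."""
    n_rows, n_cols = len(W), len(v)
    # Current projection (W v)_i, the denominator inside the sum of Eq. (21).
    proj = [sum(W[i][k] * v[k] for k in range(n_cols)) for i in range(n_rows)]
    updated = []
    for j in range(n_cols):
        col_sum = sum(W[i][j] for i in range(n_rows))
        back = sum(W[i][j] * u[i] / proj[i]
                   for i in range(n_rows) if W[i][j] > 0)
        updated.append(v[j] * back / col_sum)
    return updated
```

For instance, `encode_flags({"syn", "ack"})` gives the symbol 3, and with a single weight `lambdas = [1.0]` the function `mtd_prob` reduces to an ordinary first order transition probability. The full step of Eq. (20) would combine two calls to `lininpos_step`, one with A and one with B, weighted by the column sums.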

B. Detection Phase

Once the training phase has been performed, the IDS has a model of the normal behavior of the network, represented by the computed profile. As for the training phase, the input is given by raw traffic traces in libpcap format, which are processed so as to extract sequences of TCP flags configurations. Thus, given an observed sequence $(c_1, c_2, \ldots, c_T)$, the system has to decide between the two hypotheses:

$$H_0 : \{ (c_1, c_2, \ldots, c_T) \sim \text{computed model} \} \qquad H_1 : \{ \text{anomaly} \} \tag{23}$$

The problem is to choose between a single hypothesis $H_0$, which is associated to the estimated stochastic model, and the composite hypothesis $H_1$, which represents all the other possibilities. No optimal result is presented in the literature about this decision theory problem, thus the best solution is represented by the use of the Generalized Likelihood Ratio (GLR) test [10]. Since the problem is quite straightforward for ECDF, in the following we only consider the case of Markovian models, for which the GLR test is defined as follows:

$$H(X) = \begin{cases} H_0 & \text{if } X < \xi \\ H_1 & \text{if } X > \xi \end{cases} \tag{24}$$

where the threshold $\xi$ is set by means of Monte Carlo simulations and the quantity $X$ is given by:

$$X = \left( \frac{\max_{v \neq u} L(c_1, c_2, \ldots, c_T \mid \Lambda_v, R_v)}{L(c_1, c_2, \ldots, c_T \mid \Lambda_u, R_u)} \right)^{1/T} \tag{25}$$

where the vector $(\Lambda_u, R_u)$ represents the parameters corresponding to the model computed during the training phase (hypothesis $H_0$) and the exponent $1/T$ is introduced to take into account that each observed sequence is characterized by a different length $T$. It is worth noticing that this test is equivalent to deciding on the basis of the Kullback-Leibler divergence between the model associated to $H_0$ and the one computed for the observed sequence.

III. EXPERIMENTAL RESULTS

In this section we compare the performance of the different statistical models over the 1999 DARPA evaluation project [11] data set. For sake of brevity, in the following we only present the results related to the Telnet traffic, since they appear to be representative of the overall performance.

To test the correctness of the computed models we have calculated the log-likelihood function of some sequences. Figure 1 corresponds to a normal connection. As expected from the theory, the function decreases almost linearly with the number of packets; its slope is equal, in magnitude, to the entropy of the model, which, for a first order Markov chain, is defined as:

$$H(MC) = - \sum_i \sum_j \pi(i) \, P(s_j \mid s_i) \log P(s_j \mid s_i)$$

where $\pi(i)$ is the stationary distribution of the Markov chain. The given definition can be easily extended to higher order Markov chains. On the other hand the effect of an anomaly is an abrupt jump in the log-likelihood function, as highlighted by figure 2. Both these figures refer to a first order model, but the behavior of the log-likelihood function does not significantly vary with the order of the Markov chain.

Fig. 1. Log-likelihood function of a normal connection (log-likelihood function vs. number of packets).

Fig. 2. Log-likelihood function of an anomalous connection (log-likelihood function vs. number of packets; the anomaly is marked on the curve).

To evaluate the performance we have used a Receiver Operating Characteristic (ROC) curve, which plots detection rate vs. false positive rate, obtained varying the value of the threshold $\xi$. Figure 3 shows the ROC curves for Markov chains of different orders. We have considered Markov chains of order up to 4, since higher orders imply a heavy processing time, not suitable for on-line detection. Since the results obtained using a model based on a Markov chain of order 1 are already very good for these traffic traces, it is not easy to realize that we achieve some improvements with high order models. It should be noted that the ROC curves are almost ideal, since we have a detection rate close to 100% with a negligible false alarm rate. Nevertheless the zoomed area inside the figure shows that with the model of order 4 we are able to achieve the best results, obtaining a detection rate of 53% with a false alarm rate which is about one half of that related to the Markov chain of order 1.

Figure 4 shows the performance of the ECDF, while figure 5 presents a comparison between the first order
Markov chains and the time dependent models described in the paper. Since a detection performed analyzing only the first packets of each connection is obviously worse than the one based on the entire connections, also the time independent model has been computed only considering the first ten packets of each connection. It is easy to conclude that the homogeneous Markov chain achieves a detection rate almost …% bigger than the other two models. This apparent paradox can be justified by the fact that the non-homogeneous models have been computed with a relatively short, and so incomplete, training phase. Indeed, on one side the whole training data set has been used to compute only one homogeneous model, while on the other side, the same quantity of data is partitioned into ten subsets corresponding to the first ten steps in the time evolution of each connection. In particular this can lead to almost deterministic probabilities for the first steps of the non homogeneous models, thus a single flag configuration at step $i$, present in the training data set only at steps $j \neq i$ (and hence captured by the time independent model), may generate a false alarm.

Finally, we have taken into account that an intrusion should be detected as soon as the anomaly appears. Thus, in figure 6, we show the performance of the homogeneous Markov chain model as a function of the number of analyzed packets for each connection (both for building the model and for the detection phase). The results highlight that good performance is achieved with a small number of packets, demonstrating that such statistical models are suitable for on-line anomaly detection.

IV. CONCLUSIONS

In this paper we have presented an anomaly based network intrusion detection system, which detects anomalies using statistical characterizations of the TCP traffic. We have compared several stochastic models, such as first order homogeneous and non-homogeneous Markov chains, high order homogeneous Markov chains, and stationary and non-stationary ECDF. We have detailed the estimation of the parameters of the models and we have shown the results obtained with the DARPA 1999 data set. The performance analysis has highlighted that the best results are obtained with the use of homogeneous Markov chains and that some improvements can be achieved using high order Markovian models: for instance, 4th order Markov chains lead to the same detection rate of first order models, with almost one half of false alarms. Moreover, we have shown that, since only a small quantity of packets is sufficient to reveal intrusions in the TCP traffic, this kind of approach is suitable for on-line detection.

V. ACKNOWLEDGMENTS

This work was partially supported by the Euro-NGI Network of Excellence funded by the European Commission and partly by the RECIPE project funded by MIUR.

REFERENCES

[1] Kemmerer, R.A., Vigna, G., Intrusion Detection: a Brief History and Overview, IEEE Security and Privacy (supplement to Computer, vol. 35, no. 4), pp. 27-30, April 2002.
[2] Ye, N., Zhang, Y., and Borror, C.M., Robustness of the Markov-Chain for Cyber-Attack Detection, IEEE Transactions on Reliability, vol. 53, no. 1, pp. 116-123, March 2004.
[3] Ju, W.-H. and Vardi, Y., A Hybrid High-order Markov Chain Model for Computer Intrusion Detection, NISS, Technical Report Number 92, February 1999.
[4] Schonlau, M., et al., Computer Intrusion: Detecting Masquerades, NISS, Technical Report Number 95, March 1999.
[5] Ye, N., Ehiabor, T., and Zhang, Y., First-order Versus High-order Stochastic Models for Computer Intrusion Detection, Quality and Reliability Engineering International, 18:243-250, 2002.
[6] Raftery, A.E., A model for high-order Markov chains, Journal of the Royal Statistical Society, series B, 47, 528-539, 1985.
[7] Raftery, A.E. and Tavare, S., Estimation and modelling repeated patterns in high-order Markov chains with the mixture transition distribution (MTD) model, Journal of the Royal Statistical Society, series C - Applied Statistics, 43, 179-200, 1994.
[8] Vardi, Y. and Lee, D., From Image deblurring to Optimal investments: Maximum Likelihood Solutions for Positive Linear Inverse Problem, Journal of the Royal Statistical Society, series B, 55, 569-612, 1993.
[9] Iusem, A.N. and Svaiter, B.F., A New Smoothing-Regularization Approach for a Maximum-Likelihood Estimation Problem, Applied Mathematics and Optimization, 29:225-241, 1994.
[10] Mood, A.M., Graybill, F.A., and Boes, D.C., Introduction to the Theory of Statistics, 3rd ed., Tokyo, Japan: McGraw-Hill, 1974.
[11] Lippmann, R., et al., The 1999 DARPA Off-Line Intrusion Detection Evaluation, Computer Networks, Volume 34, Issue 4, October 2000, Pages 579-595.

Fig. 3. Performance of Markovian models of different orders.

Fig. 4. Performance of ECDF model (detection rate vs. false alarm rate).

Fig. 5. Performance comparison of the analyzed time dependent models (10 packets only). Curves: non stationary ECDF, non homogeneous MC, homogeneous MC (detection rate vs. false alarm rate).

Fig. 6. Performance of the homogeneous Markov chain model, as a function of the number of processed packets. Curves: 5, 10, 15, and 20 packets (detection rate vs. false alarm rate).
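As a complement to the detection rule of Eqs. (24) and (25), the following Python sketch shows how the GLR statistic could be computed in the simplest case of a first order homogeneous Markov chain. It is an illustration under explicit assumptions, not the authors' implementation: the function names are hypothetical, the maximization in the numerator is replaced by the unconstrained maximum likelihood estimate obtained from the observed sequence itself, the exponent uses the number of observed transitions, and transitions unseen in training are given a small floor probability in the spirit of the rare state.

```python
import math
from collections import Counter


def glr_statistic(seq, P_train, floor=1e-6):
    """GLR statistic X of Eq. (25) for a first order chain (sketch).

    seq      observed sequence of symbols, at least two elements
    P_train  dict of dicts, P_train[a][b] = trained probability of a -> b
    floor    assumed probability for transitions unseen during training
    """
    pairs = Counter(zip(seq, seq[1:]))   # counts of observed transitions
    outgoing = Counter(seq[:-1])         # how often each state is left
    n_transitions = len(seq) - 1

    # Numerator: likelihood under the model fitted to the sequence itself.
    log_num = sum(n * math.log(n / outgoing[a])
                  for (a, b), n in pairs.items())
    # Denominator: likelihood under the trained model (hypothesis H0).
    log_den = sum(n * math.log(max(P_train.get(a, {}).get(b, 0.0), floor))
                  for (a, b), n in pairs.items())

    return math.exp((log_num - log_den) / n_transitions)


def is_anomalous(seq, P_train, xi):
    """Decision rule of Eq. (24): choose H1 when X exceeds the threshold."""
    return glr_statistic(seq, P_train) > xi
```

Sweeping the threshold `xi` over a range of values and recording, for each value, the fraction of attack and attack-free connections flagged by `is_anomalous` would yield the points of a ROC curve such as those of Figs. 3 to 6.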