Hierarchical Clustering and Sampling Techniques for Network Monitoring

S. Sindhuja, ME

ABSTRACT: Network monitoring applications are used to monitor network traffic flows. Clustering techniques are used to extract network traffic patterns. Anomaly detection schemes are used to detect network attacks. Hierarchical and partitional clustering schemes are used to analyze network traffic data values. The hierarchical data analysis uses the structure and data values for the clustering process. The patterns are minimized using summarization techniques.

Keywords: Categorical, cluster feature, itemset, sampling.

S. Sindhuja is working as Asst. Professor in SS College of Engineering, Tamilnadu. Email: sindhugubi@gmail.com

I. INTRODUCTION

The hierarchical clustering based network monitoring system uses a distance measure to calculate the distance between hierarchical structures. Numerical, categorical and hierarchical data values are compared using the distance measure. The network pattern analysis system uses the data values from the database. All network traffic information is collected and updated into the database. The traffic analysis is carried out on the stored data values. The proposed system is designed to perform instant network traffic pattern analysis. All network traffic information is captured and transferred to the data analysis tool. The data values are transferred through data streams. TCP data streams are used to transfer network traffic information. The pattern extraction tool analyzes the data values and assigns the relevant data into the hierarchy. Traffic flow, access pattern and service information are extracted from the system. Data sampling techniques are used in high-speed data communications.

There is a growing need for efficient algorithms to detect important trends and anomalies in network traffic data. For example, network managers need to understand user behavior in order to plan network capacity. As network capacities increase, traffic analysis tools face the problem of scalability due to high packet arrival rates and limited memory. In this paper, we present a hierarchical clustering technique for identifying significant traffic flow patterns. In particular, we present a novel way of exploiting the hierarchical structure of traffic attributes, such as IP addresses, in combination with categorical and numerical attributes.

A key challenge in clustering multi-dimensional network traffic data is the need to deal with various types of attributes: numerical attributes with real values, categorical attributes with unranked nominal values and attributes with hierarchical structure [6]. For example, byte counts are numerical, protocols are categorical and IP addresses have hierarchical structure. A key issue for these schemes is how to represent a distance function incorporating hierarchical attributes to help find meaningful clusters. We have proposed a hierarchical approach to clustering network traffic data that exploits the hierarchical structure present in real life data. In network traffic, a hierarchical relation between two IP addresses can reflect traffic flow to or from a common sub-network. The hierarchical representation of such attributes thus gives more meaning to a general cluster, which can reflect a trend in traffic flows. We propose a common framework to incorporate such hierarchical attributes in the distance function of our clustering algorithm.

This algorithm addresses the above problems in previous approaches [3] to network traffic analysis, as it is a one-pass, fixed memory algorithm. We demonstrate the advantages of our clustering in terms of improved accuracy and significantly reduced computation time in comparison to an earlier approach, on a standard benchmark dataset.

II. RELATED WORK

The problem of identifying network trends has been studied by [3]. One of the earlier works studied several sampling techniques for traffic analysis.
In [9], a probabilistic model of network traffic is estimated using the Expectation Maximization (EM) technique. There are several existing tools for network traffic analysis []. Some depict the traffic graphically [], like flow-scan [3], while others provide top K reports, like flowd [3] and flow-tools [4]. A typical graphical report might look at the source IP addresses that are sending the most traffic, or the ports that receive the most connection attempts. A problem with this approach to reporting is that it tells us nothing about sources that only send a small volume of traffic. If these small flows are combined, they may form a large portion of the overall traffic. Consequently, these trends may be overlooked unless we can identify patterns among traffic flows. The graphical tools cannot in general cope well with high dimensions and fail to generalize patterns.

In [6], the authors address the problem of finding patterns in network traffic by proposing a frequent itemset mining algorithm. Their tool, called AutoFocus [5], describes the traffic mix on a network link by using textual reports as well as time series plots. It also produces concise reports that can show general trends in the data. In Section 2.1 we describe frequent itemset mining of network data in more detail. In [], the authors build on the work of [6] and suggest a technique for traffic anomaly detection based on analyzing correlations of destination IP addresses in outgoing traffic. This address correlation data is modeled using discrete wavelet transformation for detection of anomalies. In [], the authors use the combination of a rule based flow header detection and a traffic aggregation based pattern detection algorithm. Although our work shares a common philosophy of traffic aggregation using traffic headers with the techniques mentioned above, our approach differs in the efficiency and expressiveness of our clustering technique.

International Journal of Scientific Research in Computer Science (IJSRCS) Vol., Issue. 4, Nov. 2013

A. Frequent Itemset Clustering using AutoFocus

AutoFocus identifies significant patterns in traffic flows by using frequent itemset mining. It first creates a report based on unidimensional clusters of network flows and then combines these unidimensional clusters in a lattice to create a traffic report based on multidimensional clusters.

Unidimensional clustering: For each attribute, AutoFocus builds a one-dimension tree by counting frequent itemsets in the network traffic data [6]. This is straightforward for Protocols and Ports: only those values with more than a certain frequency need to be tallied. For IP addresses, it builds a tree of counters to reflect the structure of the IP address space. Counters at the leaves of the tree correspond to the original IP addresses that appeared in the traffic. Higher level nodes in the tree correspond to clusters of addresses that have the same common prefix, i.e., addresses with the first l bits in common, where l is the level of the node in the tree. In order to prune the tree, only those nodes having traffic volumes above a threshold are retained.

Multidimensional clustering: For multidimensional clustering, AutoFocus uses unidimensional cluster trees to create an m-dimensional lattice structure. Building the complete lattice would be expensive since it involves all possible combinations among the values of different attributes. Instead, AutoFocus uses certain properties of the lattice structure to avoid brute force enumeration.
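The unidimensional IP address tree described above can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the AutoFocus implementation: the function name `significant_prefixes`, the `(ip, volume)` input format and the dictionary representation of tree nodes are hypothetical choices made here. Each node is identified by a prefix value and its level l, and only nodes whose traffic volume exceeds the threshold are retained.

```python
from collections import Counter
import ipaddress


def significant_prefixes(flows, threshold):
    """Count traffic volume per IPv4 prefix and keep nodes above a threshold.

    flows: iterable of (ip_string, volume) pairs (hypothetical input format).
    Returns a dict mapping (prefix_value, level) to volume, where level l
    means that the first l bits of the address are shared.
    """
    counts = Counter()
    for ip, volume in flows:
        addr = int(ipaddress.IPv4Address(ip))
        # Level 0 is the root (all addresses); level 32 is the full address.
        for level in range(33):
            counts[(addr >> (32 - level), level)] += volume
    # Prune: retain only nodes whose traffic volume is above the threshold.
    return {node: vol for node, vol in counts.items() if vol > threshold}


if __name__ == "__main__":
    flows = [("10.0.0.1", 100), ("10.0.0.2", 50), ("192.168.1.1", 10)]
    kept = significant_prefixes(flows, threshold=120)
    # The /8 node for 10.x.x.x aggregates two small flows into one large node.
    print(kept[(10, 8)])
```

In this example neither 10.0.0.1 nor 10.0.0.2 exceeds the threshold alone, but their common prefixes do, which is the kind of aggregated trend that a top K report would miss.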
Despite these optimizations, AutoFocus still requires multiple passes through the network traffic dataset in order to generate significant multidimensional clusters. An open issue for research is how to find multidimensional clusters in network traffic in a computationally efficient manner. To address this problem, we consider the use of a hierarchical clustering algorithm.

B. Hierarchical Clustering

Our approach to finding multidimensional clusters of network data builds on the BIRCH framework [7], which is a clustering algorithm that uses a Cluster Feature (CF) to represent a cluster of records. Instead of individual records, a CF only keeps their sufficient statistics as a vector <n, LS, SS>, where n is the number of records in the cluster, LS is the linear sum and SS is the square sum of the attributes of the records. Clusters are built using a hierarchical tree called a Cluster Feature Tree (CF-tree) to summarize the input records. The tree is built in an agglomerative hierarchical manner. Each leaf node consists of l clusters, where each cluster is represented by its CF record. These CF records can themselves be clustered at the non-leaf nodes in a recursive manner, due to the additive property of the statistics in the record. Consequently, the clusters in the root of the tree represent the most abstract summary of the dataset. As the CF-tree grows, more nodes are allocated to the tree. An important advantage of the CF-tree structure is that the cluster hierarchy can be maintained in a fixed memory size M by re-clustering the leaf-level CF records if necessary. If P denotes the size of a node in the tree, then it takes only $O(B(1 + \log_B(M/P)))$ comparisons to find the closest leaf node in the tree for a given record [7]. An open issue for using the BIRCH approach to cluster network traffic records is how to cope with the numerical, categorical and hierarchical attributes which are used to describe the network traffic.
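The additive property of the CF vector <n, LS, SS> can be made concrete with a minimal sketch. It assumes purely numerical attributes and keeps SS per attribute; the class name `ClusterFeature` and its methods are hypothetical and are not taken from BIRCH or from this paper.

```python
from dataclasses import dataclass
import math


@dataclass
class ClusterFeature:
    n: int      # number of records in the cluster
    ls: list    # linear sum of each attribute
    ss: list    # square sum of each attribute

    @classmethod
    def from_record(cls, x):
        """Build the CF of a cluster that holds the single record x."""
        return cls(1, list(x), [v * v for v in x])

    def merge(self, other):
        """Additivity: the CF of a union is the component-wise sum."""
        return ClusterFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            [a + b for a, b in zip(self.ss, other.ss)],
        )

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        """Root of the summed per-attribute variances, from n, LS, SS only."""
        var = sum(s / self.n - (l / self.n) ** 2
                  for l, s in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))  # guard against rounding below zero


if __name__ == "__main__":
    cf = ClusterFeature.from_record([1.0, 2.0]).merge(
        ClusterFeature.from_record([3.0, 4.0]))
    print(cf.n, cf.centroid(), cf.radius())
```

Because merging only adds three components, a non-leaf node can summarize its children without revisiting the individual records, which is what allows a one-pass, fixed memory algorithm.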
In traffic records, for example, the number of bytes in a flow is represented as an integer, the type of network protocol is represented using a categorical attribute and IP addresses are attributes with a hierarchical structure. This requires a method for representing the summary statistics and calculating distances for clusters based on these types of attributes. We also require a method for extracting significant clusters from the CF-tree in order to generate a concise and informative report on the given network traffic data. In the next section we propose several modifications to the BIRCH clustering approach to address these problems.

III. CLUSTERING NETWORK TRAFFIC USING ECHIDNA

The key problem in clustering network traffic data is the representation of various distances for different data types in order to accurately describe the relationships among different records and clusters. We propose a framework that exploits the natural hierarchy of an attribute and combines different distance functions for our clustering algorithm. The input data is extracted from network traffic as 6-tuple records <SrcIP, DstIP, Protocol, SrcPort, DstPort, bytes>, where SrcIP and DstIP are hierarchical attributes, bytes is numerical and the rest are categorical attributes. Our algorithm takes each record and iteratively builds a hierarchical tree of clusters called a CF-Tree. As the tree is built, each record is inserted into the closest cluster using a combined distance function for all attributes (Section 3.1). The radius of a cluster determines if a record should be absorbed into the cluster or if the cluster should be split (Section 3.2). In Section 3.3, the cluster formation process is explained in detail. Once the cluster tree is created, significant nodes are identified in the summarization process (Section 3.4). Next, the cluster tree is further compressed (Section 3.5) to create a concise and meaningful report. All of these require one additional pass over the cluster tree.

A. Distance Functions

When clustering network traffic records we need to consider three kinds of attributes: numerical, categorical and hierarchical. For each type of attribute, we define the representation that we have used for records, clusters and distance functions in each case.

a) Numerical Attributes: A numerical attribute is represented by a scalar $x[i] \in \mathbb{R}$. The centroid $C[i]$ of a numerical attribute $i$ in cluster $C$ having $N$ points is given by

$$C[i] = \frac{1}{N} \sum_{j=1}^{N} X_j[i] \qquad (1)$$

We calculate the distance $d_n$ between the centroids of two clusters $C_1$ and $C_2$ by using the Euclidean distance metric:

$$d_n(C_1, C_2) = \left( (C_1[i] - C_2[i])^2 \right)^{1/2} \qquad (2)$$

b) Categorical Attributes: In the case of a categorical attribute, $x[i]$ is a vector in $\mathbb{Z}^c$, where $c$ is the number of possible values that the categorical attribute $i$ can take. For a d-dimensional categorical attribute vector $X$, let the $i$-th attribute $x[i]$ be represented as $x[i] = \{a_1, a_2, \ldots, a_c\}$, where $a_k \in \{0, 1\}$. Any categorical attribute of a record can only take one value at a time. If $a_k = 1$, then $a_j = 0$ for $j \neq k$, so that the sum of the $a_k$ in a single record is always 1, i.e., $\sum_k a_k = 1$, where $a_k \in x[i]$. The centroid $C[i]$ of a categorical attribute $i$ in cluster $C$ having $N$ points is represented by a histogram of the frequencies of the attribute values:

$$C[i] = \sum_{j=1}^{N} X_j[i] = \{A_1, A_2, \ldots, A_c\} \qquad (3)$$

where $A_k = \sum_{j=1}^{N} a_{k,j}$ and $x_j[i] = \{a_{1,j}, a_{2,j}, \ldots, a_{c,j}\}$, and

$$d_c(C_1, C_2) = \left[ \sum_{k=1}^{c} (A_{k,1} - A_{k,2})^2 \right]^{1/2} \qquad (4)$$

For example, consider an attribute $x[i]$ that represents Color, which has values {Red, Green, Blue}. A value $X[i] = \{1, 0, 0\}$ represents the color Red. In addition, consider a cluster $C$ that contains 5 records described by the Color attribute. If there are 3 instances of Red, 2 instances of Green and 0 instances of Blue, then $C[i] = \{3, 2, 0\}$.

c) Hierarchical Attributes: We define a hierarchical attribute $H$ as a generalization hierarchy in the form of an L-level tree applied to a domain of values $H_L$, which form the leaves of the tree at level L. A non-leaf node $h_{l,i} \in H_l$ in the hierarchy represents a generalization of the leaf nodes in the subtree rooted at $h_{l,i}$. Similarly, IP addresses can be represented as a hierarchical attribute using a 32-level generalization hierarchy (i.e. L = 32) on the domain of integer IP address values in the range $[0, 2^{32} - 1]$. Before describing the IP address hierarchy in detail we need to define the following:

B. Radius Calculation

In order to control the variance of data records within a cluster, we need some measure of the radius of a cluster. This gives us a sense of how compact the cluster is.
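The numerical and categorical definitions from subsection A (Distance Functions) can be illustrated with a small sketch that reproduces the Color example. The function names are hypothetical, and the categorical distance is computed on raw histogram counts; whether the original scheme normalizes the histograms by cluster size cannot be recovered from the text, so treat that choice as an assumption.

```python
import math


def numeric_centroid(values):
    """Mean of a numerical attribute over the records of a cluster."""
    return sum(values) / len(values)


def numeric_distance(c1, c2):
    """Euclidean distance between two scalar centroids."""
    return math.sqrt((c1 - c2) ** 2)


def one_hot(value, domain):
    """Encode a categorical value as a 0/1 vector over its domain."""
    return [1 if value == v else 0 for v in domain]


def categorical_centroid(records):
    """Histogram of value frequencies: component-wise sum of one-hot vectors."""
    return [sum(column) for column in zip(*records)]


def categorical_distance(h1, h2):
    """Euclidean distance between two histograms (raw counts, an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))


if __name__ == "__main__":
    colors = ["Red", "Green", "Blue"]
    cluster = [one_hot(c, colors)
               for c in ["Red", "Red", "Red", "Green", "Green"]]
    print(categorical_centroid(cluster))  # histogram for the Color example
```

Keeping the categorical centroid as a histogram means it is additive in the same way as the linear sum of a numerical attribute, so two clusters can be merged by adding their histograms.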
The radius for numerical and categorical attributes can be represented in a straightforward manner as the standard deviation of the attribute values of the records in the cluster. In the case of hierarchical attributes, we need to define a new measure of radius. The final radius value of the cluster is simply a linear combination of the individual radius values of the different attribute types.

C. Cluster Formation

We follow the general approach of BIRCH [5] for cluster formation, which results in a hierarchical set of clusters in the form of a CF-tree. The leaves of the CF-tree represent the most detailed set of clusters, and the root of the CF-tree represents the most general set of clusters describing the input records. Cluster formation proceeds by adding each input data record $X$ to a cluster $C_l$ in a leaf node of the CF-tree, such that $C_l$ is closest to $X$ using the distance measure $D(X, C_l)$ given in Equation 7. If the addition of $X$ to $C_l$ would cause the radius $R_l$ of the cluster to grow above a threshold $T$, then a new cluster is created. We now summarize the main steps in this process. Each data record $X$, corresponding to a 6-tuple traffic flow record, is compared to the closest cluster starting from the root along a path P to a leaf node, where the closest cluster is $C = \arg\min_{C_l} D(X, C_l)$. At the leaf node, the data record $X$ is inserted into the closest $C_l$ and the radius $R_l$ of the updated cluster is calculated. If $R_l > T$, where $T$ is a threshold value in the range [0, 1], and if the number of CF entries in the node is less than a minimum m, then $X$ is inserted into the node as a new cluster.

D. Summarization

Because of the hierarchical nature of the CF-Tree, we can think of the clusters at the index nodes as holding summaries of the leaf nodes. Thus the levels of the CF-Tree form a hierarchical structure with clusters at the leaf level and super-clusters at the index levels. Note that each node corresponds to a cluster $C$, and the CF-entries in the node correspond to the sub-clusters $C_1, \ldots, C_l$ of $C$.
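The absorb-or-create rule from the cluster formation step can be sketched for a single leaf node. This is a simplified illustration, not the scheme's actual procedure: it assumes numerical attributes only, uses plain Euclidean distance to the centroid in place of the combined distance D(X, C_l), and ignores node capacity, splitting and the descent from the root. The names `insert_record` and `radius` are hypothetical.

```python
import math


def radius(n, ls, ss):
    """Radius of a cluster from its sufficient statistics (n, LS, SS)."""
    var = sum(s / n - (l / n) ** 2 for l, s in zip(ls, ss))
    return math.sqrt(max(var, 0.0))  # guard against rounding below zero


def insert_record(clusters, x, threshold):
    """Insert record x into the closest cluster or open a new one.

    clusters: list of [n, ls, ss] entries of one leaf node, updated in place.
    Returns the index of the cluster that received x.
    """
    if clusters:
        # Closest cluster by Euclidean distance to its centroid
        # (a stand-in for the combined distance measure).
        def dist(c):
            n, ls, _ = c
            return math.dist(x, [l / n for l in ls])

        idx = min(range(len(clusters)), key=lambda i: dist(clusters[i]))
        n, ls, ss = clusters[idx]
        candidate = [n + 1,
                     [l + v for l, v in zip(ls, x)],
                     [s + v * v for s, v in zip(ss, x)]]
        # Absorb x only if the updated radius stays within the threshold T.
        if radius(*candidate) <= threshold:
            clusters[idx] = candidate
            return idx
    # Otherwise x starts a new cluster of its own.
    clusters.append([1, list(x), [v * v for v in x]])
    return len(clusters) - 1


if __name__ == "__main__":
    leaf = []
    for record in ([0.10, 0.10], [0.12, 0.10], [0.90, 0.90]):
        print(insert_record(leaf, record, threshold=0.2))
```

A smaller threshold T yields more, tighter clusters at the leaf level, while a larger T absorbs more records per cluster and keeps the tree smaller.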
The hierarchical structure of the CF-Tree provides a natural framework to create a summary report of the hierarchical clusters. The clusters at each level represent a generalized set of traffic flows, which can be used to describe the traffic flows in the network. Since there is redundant information between different levels, the summary report should contain only those nodes of any level having significant additional information compared to their descendant levels. In order to keep the report concise but meaningful, the summary contains only significant nodes.

IV. ONLINE NETWORK MONITORING WITH SAMPLING TECHNIQUE

The proposed system is designed to perform stream based network traffic analysis. All network traffic information is collected and transferred to the monitoring application. The clustering technique is used to extract network traffic patterns. The anomalies are identified with respect to the cluster information. The system uses a new hierarchical distance measure to compare network transactions. The distance measure compares numerical, categorical and hierarchical data values. The clustering results are hierarchically assigned in a tree view. All traffic flow information is compared using the distance measure. Traffic information is assigned to a cluster based on the distance value. Patterns are extracted with reference to the cluster intervals.

The network monitoring system is divided into four modules. The transaction observer module is designed to collect network traffic information. The clustering module is used to group the network traffic information. The data sampling module is designed to perform pattern extraction using the sample data values. The summary analysis module is used to extract summary information about the network.

V. EVALUATION

Our aim was to test the accuracy and scalability of our hierarchical traffic summarization algorithm. As a basis for comparison, we have compared the performance of our algorithm to AutoFocus [6]. In order to provide a quantitative comparison of the algorithms, and to test their ability to generalize patterns of usage, we require network data containing known traffic patterns. In the absence of such publicly available data with known patterns, we decided to use a widely accepted synthetic dataset for intrusion detection from the MIT Lincoln Lab. The 1998 DARPA dataset [6] provides traffic files that contain raw traffic flow records with an associated label to indicate whether the record was part of a normal or attack flow.

Testing methodology

We have conducted tests to compare our algorithm with AutoFocus: cluster accuracy, detection accuracy and run-time performance.

Cluster Accuracy: The aim of this test is to check if the clusters contain records that reflect known patterns present in the traffic files. The CF-Tree represents a hierarchical model of the traffic data built without supervision. The objective of this test is to determine how accurately the clustering algorithm can group traffic records from two different populations, using two measures: 1) DR = TP/(TP+FN) and 2) FPR = FP/(TP+FP). It can be seen that 3 out of 4 traffic files achieve a cluster accuracy that is above the diagonal of a classifier that chooses classes at random. In 6 out of 4 cases, the DR exceeds 50%, while the FPR is less than 50% in 4 out 6 cases.
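The two cluster accuracy measures can be computed directly from the confusion counts. The sketch below follows the formulas exactly as they are stated in the text; note that the FPR given here divides by TP+FP, which differs from the more common definition FP/(FP+TN). The function names and the zero-denominator handling are assumptions made for illustration.

```python
def detection_rate(tp, fn):
    """DR = TP / (TP + FN): share of attack records that were detected."""
    total = tp + fn
    return tp / total if total else 0.0


def false_positive_rate(tp, fp):
    """FPR = FP / (TP + FP), as defined in the text (not FP / (FP + TN))."""
    total = tp + fp
    return fp / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical counts for a single traffic file.
    print(detection_rate(tp=8, fn=2), false_positive_rate(tp=8, fp=2))
```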
Given that we are clustering on a very limited 6-tuple of traffic features, the cluster accuracy result supports the reliability of our algorithm.

Figure: Cluster accuracy of Echidna in terms of Detection Rate and False Positive Rate on 4 files from weeks 3 through 7 of the 1998 DARPA traces.

The figure shows the cluster accuracy for each network traffic file in terms of the Detection Rate and False Positive Rate. The diagonal line corresponds to a classifier that chooses classes at random.

VI. PERFORMANCE ANALYSIS

Our aim was to test the accuracy and scalability of our hierarchical traffic summarization algorithm. As a basis for comparison, we have compared the performance of our algorithm to Echidna. In order to provide a quantitative comparison of the algorithms, and to test their ability to generalize patterns of usage, we require network data containing known traffic patterns.

Testing methodology

We have conducted two types of tests to compare our algorithm with Echidna: detection accuracy and run-time performance.

Detection Accuracy: Our aim is to generate a summary traffic report that identifies important flows in network traffic. In this case, we use the DARPA traces to test whether the reports generated by Echidna or Hierarchical stream clustering () identify specific attacks that appear in the traces. We assess the comparative utility of each approach in finding significant traffic patterns by comparing the number and type of attacks reported by each algorithm.

Figure: Detection accuracy of Echidna and Hierarchical stream clustering () by attack types (panel title: Detection Ratio of Attacks; x-axis: Attack Type; y-axis: Number of Attacks).

Table No: Detection accuracy of Echidna and Hierarchical stream clustering () by attack types (Number of Detected Attacks)

Attack Total Echidna
Ipsweep 5 3 4
Neptune 6 4 5
Nmap 3
Pod 7 4
Satan 4 3
Syslog 3
Smurf 7 3 5

In order to evaluate detection accuracy, we applied Echidna and Hierarchical stream clustering () to 5 daily network traffic files, which correspond to weeks 3-5 of the DARPA traces. For each file, we identified the number and type of attacks reported as clusters in the summary reports from Echidna and Hierarchical stream clustering (). We then identified the total number of occurrences of these attack types in the traces. An attack occurrence was counted as detected by Echidna or Hierarchical stream clustering () if the description of a cluster generated by the system matched the description of the attack in the DARPA attack signature database. The figure and table show the number of times each type of attack was detected by Echidna and Hierarchical stream clustering (). Note that Hierarchical stream clustering produced more accurate results in the detection process than Echidna.

Run-time performance: A key motivation for using a hierarchical clustering scheme such as Echidna is to reduce the amount of time required to find the closest cluster for a traffic record. Hierarchical stream clustering () reduces the time needed to detect traffic flows considerably further. This is particularly important when the number of traffic flows is large.

Figure 3 shows the run-time performance of Echidna and Hierarchical stream clustering () with traffic samples varying from approximately 50x10^3 to 750x10^3 traffic records. The input file is combined from all days of week 6 of the 1998 DARPA traffic traces. For each sample size, the tests were repeated 3 times to give the average execution time and error bars. The computational complexity of Hierarchical stream clustering () is linear with respect to the number of input traffic flow records.

Table No: Comparison of Run-time Performance on DARPA traffic

Echidna 40 35 90 80 45 5 0 85 40 340

Figure 3. Comparison of Run-time Performance on DARPA traffic (x-axis: Number of Traffic Records (x1000); y-axis: Time (s)).

VII. CONCLUSION

In this paper, we have presented a clustering scheme called Echidna for generating summary reports of significant traffic flows in network traces. The key contributions of our scheme are the introduction of a new distance measure for hierarchically-structured attributes, such as IP addresses, and a set of heuristics to summarize and compress reports of significant traffic clusters from a hierarchical clustering algorithm. Based on an evaluation using standard benchmark traffic traces, we have demonstrated that our clustering scheme achieves greater cluster accuracy and computational efficiency in comparison to previous work.

REFERENCES

[1] http://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
[2] http://www.caida.org/projects/internetatlas/viz/viztools.html
[3] http://www.caida.org/tools/measurement/flow
[4] http://www.splintered.net/sw/flow-tools/
[5] http://www.caida.org/tools/measurement/autofocus
[6] C. Estan, S. Savage and G. Varghese. Automatically Inferring Patterns of Resource Consumption in Network Traffic. In Proceedings of SIGCOMM 2003.
[7] Abdun Mahmood, Christopher Leckie, Parampalli Udaya. Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis. 2008.

S. Sindhuja is working as an Assistant Professor in the Dept. of CSE, SSCE. She has published in international journals and international conferences.