Behavior Analysis-Based Learning Framework for Host Level Intrusion Detection

Haiyan Qiao, Jianfeng Peng, Chuan Feng, Jerzy W. Rozenblit
Electrical and Computer Engineering Department
University of Arizona
Tucson, Arizona, USA
{haiyanq, jpeng, feng,

Abstract

Machine learning has great utility within the context of network intrusion detection systems. In this paper, a behavior analysis-based learning framework for host level network intrusion detection is proposed, consisting of two parts, anomaly detection and alert verification. The anomaly detection module processes unlabeled data using a clustering algorithm to detect abnormal behaviors. The alert verification module adopts a novel rule learning based mechanism which analyzes the change of system behavior caused by an intrusion to determine whether an attack succeeded and therefore lower the number of false alarms. In this framework, the host behavior is not represented by a single user or program activity; instead, it is represented by a set of factors, called behavior set, so that the host behavior can be described more accurately and completely.

1. Introduction

With the growing number of network attacks, intrusion detection systems (IDSs) are becoming an integral part of any complete security package of a modern network system. The IDSs perform surveillance and security monitoring of the network infrastructure. A number of different intrusion detection systems have been developed for particular domains (e.g., hosts or networks), in specific environments (e.g., Windows NT or Solaris), and at different levels of abstractions (e.g., kernel-level tools or application level tools). However, computer systems and networks still suffer from an increased threat of intrusions. The existing IDSs are far from perfect and may generate false positive and nonrelevant positive alerts [1].
The most widely deployed and commercially available methods for intrusion detection employ signature-based techniques, where the signatures or patterns of well-known attacks are provided by human experts. The system or network traffic is scanned for attacks using well-known vulnerabilities and any instances that match the signatures are detected as intrusions. The advantage of this method is a low rate of false positives. The disadvantage is that the signature database has to be revised manually and the system is vulnerable to new types of attack until the revision is done. This limitation leads to active research on intrusion detection techniques based on data mining.

Data mining based network intrusion detection techniques are generally classified into two categories: misuse detection and anomaly detection [2]. In misuse detection, each instance in the training data set is labeled as either normal or intrusion. A machine learning algorithm is trained over the labeled data and then the trained model is applied to classify new data. Misuse intrusion detection is fast, requires little state information, and has a low false-positive rate. With different input data including new types of attacks, the intrusion detection modules are retrained automatically without manual intervention. However, misuse detection cannot detect novel, previously unseen attacks.

Anomaly detection, on the other hand, builds models of normal data to measure a "baseline" of such statistics as CPU utilization, disk activity, user logins, file activity, and so on. When there is a deviation from this baseline, an alert is triggered. Currently, almost all commercial intrusion detection systems use misuse detection techniques. Yet, anomaly detection is getting more attention because of its capability to detect novel or unforeseen attacks. Essentially, anomaly detection is the machine learning problem of modeling a normal network or system behavior.
Although anomaly detection is becoming an active research topic, widespread adoption of this method faces numerous obstacles, including complexity and a high false positive rate. In order to reduce false and irrelevant alerts, alert verification has to be a part of an IDS. Alert verification is a process to determine whether an attack has been successful or not. This information is passed to the IDS to help differentiate the type of alerts [16]: 1) the sensor has correctly identified a successful attack; 2) the sensor has correctly identified an attack but the attack failed to meet its objectives; and 3) the sensor incorrectly identified an event as an attack. Alert verification effectively lowers the number of false alarms that an administrator or the decision support system has to deal with.

In this paper, a behavior analysis-based learning framework for host based intrusion detection is proposed, which includes anomaly detection and alert verification. The framework has two characteristics. First, it is learning-based. In the anomaly detection module, a cluster-based outlier detection algorithm detects anomalous data. In the alert verification module, a rule-learning algorithm is applied to learn the behavior changes of the targeted machine. The rules developed serve as an index of alert verification. Second, the framework is behavior analysis oriented. Instead of using a single activity as the indicator of the host behavior, a set of indicators, also called behavior set, is defined and applied to describe the host behavior. The intrusions are detected and verified based on the analysis of the behavior set. Compared to other research work in host-based anomaly detection, this method does not need high-dimension data since the host baseline is represented by a refined behavior set and each element in the behavior set is of low dimensionality. Thus, state-of-the-art data normalization is not required when outlier detection is applied to detect anomalous data.

In Section 2, host based anomaly detection is reviewed. In Section 3, the behavior analysis-based learning framework is presented and discussed. In Section 4, the experimental results are given.
Finally, conclusions and future work are summarized in Section 5.

2. Related Work

In host based anomaly detection research, various features are used to model the system behavior baseline, e.g., keystroke characteristics, user command data, system call sequences, file activities, etc. Generally these features fall into three categories: user profile, program profile, and system resource access.

Denning [3] first attempted to build anomaly detection by comparing previous user profiles to current user activity. Sequeira and Zaki [4] designed and implemented a user-profile dependent and temporal sequence clustering-based intrusion detection system by collecting and processing UNIX shell command data. Although analysis of user activity is a natural approach to detect intrusions, experience shows that it is far from accurate. This is because user behavior typically lacks strict patterns.

User dynamics allowed more reflection on the features that define host behavior. All actions carried out by users involve using programs. Programs obtain the required services by executing the specific system call that provides the needed function. Since the code of a given application should not change, the sequence of system calls executed by a program should be regular and predictable. Most existing research on anomaly detection uses system calls to model system behavior. A number of approaches based on system calls are proposed. Forrest et al. [5] established an analogy between the human immune system and intrusion detection. Lee et al. [6] applied a rule learning program to study a sample of system call data. Wagner and Dean [7] proposed to statically generate a non-deterministic finite automaton (NDFA) from the global control-flow graph of the program and simulated the NDFA on the observed system call trace. Ghosh et al. [10] utilized return address information extracted from the call stack to generate the execution path of a program for anomaly detection.
Liao and Vemuri [9] used the k-nearest neighbor classifier to classify program behavior represented by frequencies of system calls instead of system call sequences. Eskin et al. [11] applied outlier detection algorithms to anomaly detection using system call data, where the system call data has to be mapped into feature space and the choice of feature space is application specific.

Because the system-call level data is fine grained, it increases overhead and decreases system performance. Thus, some researchers study anomaly detection by modeling file activities. Stolfo et al. [13] studied anomaly detection by learning file system access patterns. Apap et al. [12] noticed that in Windows OS almost all system activities interact with the registry, so they analyzed anomaly detection through modeling normal registry access.

In reality, an attack is usually unpredictable, and it is difficult to know which aspect of the system behavior is associated with the attack. For this reason, modeling system behavior based on only a single category is unreliable, no matter whether the category is user profile, program profile, or system resource access. In what follows, we try to model behavior from all three categories to increase the reliability of system behavior modeling, and thus to increase the accuracy of anomaly detection.

3. Behavior Analysis-Based Learning Framework

A system consists of both the user and the host machine. It is appropriate to describe the system behaviors using a set of factors of user profile, program profile, and system resource usage. We call the set of factors the behavior set. In this section, a learning framework of intrusion detection is proposed based on analysis of the behavior set, as shown in Fig. 1. This framework includes three modules: anomaly detection, alert fusion, and alert verification. The input of the framework is an event, modeled as a triple {subject, verb, object} for user/program behavior, and the output is alerts with features of alert ID, alert type, timestamp, priority level, confidence factor, and verification status. In the anomaly detection module, the normal behavior baseline is modeled adaptively and stored in the database. When a new event comes, it is classified as normal or intrusion by the module. If an alert is triggered, the alert is fused with other existing alerts to decrease the number of alerts with the same cause. Then the fused alerts are sent to the alert verification module to exclude false or unrelated alerts. In this paper, we discuss only the learning related modules: anomaly detection and alert verification.

To model normal system behavior, both supervised and unsupervised learning algorithms can be applied. We choose unsupervised learning over supervised learning. The main reason is that unsupervised learning does not need labeled (normal/abnormal) data, which is not readily available in reality. Labeled data is generally obtained by simulation or experiments. If labeled data is obtained by simulated intrusions, we are limited to the set of known attacks that we simulated. New types of attacks are not reflected in the training data set. If labeled data is obtained by experiments, then we must face the difficulties in manually classifying a large volume of audit data. In addition, if the experimental data labeled normal have buried intrusions, then future instances of those intrusions will not be detected because they are assumed normal in the training set.
Unsupervised learning algorithms take as input a set of unlabeled data and attempt to find noise and intrusions buried within the data. If anomalies are rare, unsupervised learning can be treated as a variant of the outlier detection problem. Outlier-based anomaly detection methods cluster the data based on certain metrics, and the data located in sparse regions are claimed as intrusions. Not all intrusions can be detected using outlier-based anomaly detection. For example, SYN-flood DoS attacks cannot be detected using outlier detection since they are not rare in the data distribution. Outlier detection is applicable only when system behavior deviates significantly from average behavior.

3.1. Outlier detection algorithm

The outlier detection algorithm we propose is given in Table I. The algorithm is an improvement of fixed-width cluster estimation [11]. For each point, the algorithm approximates the density of points near the given point. The algorithm makes this approximation by counting the number of points that are within a sphere of radius w around the point. Points that are in a dense region of the feature space and contain many points within the circle or ball are considered normal. Points that are in a sparse region of the feature space and contain few points within the circle or ball are considered anomalies.

In the original fixed-width cluster estimation, data that does not belong to any existing cluster is assigned to a new cluster and serves as its central point. So the central point of a cluster is sensitive to the sequence of data to be clustered, and new data might not be assigned to the closest cluster. In the improved fixed-width clustering algorithm, the central point of a cluster is adjusted as new data is added so data are always assigned to the closest cluster. The idea of updating the central point of a cluster is close to K-means clustering, one of the most popular statistical clustering algorithms.
Unlike K-means clustering, the improved fixed-width cluster estimation specifies the cluster width instead of fixing the number of clusters a priori. Thus the results are sensitive to the value of the cluster width. However, this problem can be easily solved by interaction with the clustering results through the GUI. Since the clustering is performed offline, it is easy to adjust the width adaptively based on the visual clustering results when the clustering data is of low dimension.

Using outlier detection, it is preferable that the training data be of low dimensionality for the following reasons. First, when distances are measured in all dimensions, it is more difficult to detect outliers effectively because of the average behavior of the noisy and irrelevant dimensions. Second, it is not easy to intuitively explain and understand the clustering results with high dimensions. Third, in high dimensional space, the data is sparse and the notion of proximity fails to retain its meaningfulness. Finally, with increasing dimensionality, it becomes increasingly difficult and inaccurate to estimate the multidimensional distribution of the data points.

Actually, in the learning framework we propose, we do not need to use high dimensionality to represent the feature space of the system behavior. Because the system behavior is described by a set of factors instead of a single complex indicator, all the elements in the behavior set are ensured to be represented by low dimensional data, e.g., the login time, access frequencies of the files, etc.
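The improved fixed-width clustering described above, together with the sparse-cluster rule listed in Table I, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (which the paper reports was written in Java): the function names `fixed_width_clusters` and `flag_sparse_clusters`, the dictionary layout of a cluster, and the sample login-hour data are all hypothetical. The 20% ratio is the one given in Table I, and the centroid update is read as a running mean.

```python
import math


def fixed_width_clusters(points, width):
    """Cluster points with a fixed width, updating centroids as a running mean.

    Each point joins the closest cluster if its centroid lies within `width`;
    otherwise the point starts a new cluster with itself as centroid.
    """
    clusters = []  # each cluster: {"centroid": [...], "members": [...]}
    for point in points:
        closest, closest_dist = None, math.inf
        for cluster in clusters:
            dist = math.dist(cluster["centroid"], point)  # Euclidean distance
            if dist < closest_dist:
                closest, closest_dist = cluster, dist
        if closest is not None and closest_dist <= width:
            m = len(closest["members"])  # cluster size before insertion
            closest["centroid"] = [
                (m * c + x) / (m + 1) for c, x in zip(closest["centroid"], point)
            ]
            closest["members"].append(point)
        else:
            clusters.append({"centroid": list(point), "members": [point]})
    return clusters


def flag_sparse_clusters(clusters, ratio=0.2):
    """Return True for every cluster smaller than `ratio` times the average size."""
    if not clusters:
        return []
    average_size = sum(len(c["members"]) for c in clusters) / len(clusters)
    return [len(c["members"]) < ratio * average_size for c in clusters]


# Hypothetical one-dimensional data: login hours around 8:00 and 13:00,
# plus a single login at 3:00.
login_hours = (
    [(8 + 0.1 * i,) for i in range(10)]
    + [(13 + 0.1 * i,) for i in range(10)]
    + [(3.0,)]
)
clusters = fixed_width_clusters(login_hours, width=1.0)
print([len(c["members"]) for c in clusters])  # [10, 10, 1]
print(flag_sparse_clusters(clusters))         # [False, False, True]
```

Table I sorts the clusters by size and stops at the first one that is not sparse; flagging every cluster below the threshold, as done here, selects the same clusters. Note that with this rule a singleton cluster is flagged only when the average cluster size exceeds 5, so very small training sets may produce no outliers at all.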

Figure 1. Behavior analysis-based learning framework for host-based intrusion detection

3.2. Behavior analysis-based intrusion detection

Anomaly detection can only detect intrusions that make the host behave differently from normal. Thus, before we attempt to model normal behavior, the key question is how to define host behavior. As analyzed at the beginning of the section, we use the behavior set to describe system behavior in order to accurately and completely reflect the system characteristics. The behavior set is specified as a set {user login time, frequency of applications launched, I/O activities, frequency of file access, network activity stats, CPU usage pattern over time, memory usage pattern, data bandwidth, network connection speed}. In the set, user login time helps to locate normal login time intervals; frequency of applications launched locates the most and the least used applications; frequency of file access can locate the most and the least accessed files; I/O activities indicate the average bytes written to the files; network activity stats locate the normal sequence, frequency, time interval, and other indexes of various network activities, e.g., web browsing, ftp, etc.; CPU and memory usage, etc., are related to system performance.

To describe the host behavior, we need metadata. A relational database is constructed to store system profiling metadata. The database constructed with WAMP is shown in Fig. 2. In the database, the table features are designed to reveal the information defined in the behavior set. For example, the table USER has features of user ID, user name, login time, remote/local login, and logoff time. The table PROCESS has features of process ID, process name, owner of the process, memory usage of the process, CPU usage of the process, and time. To maintain data in real time, the data is updated periodically. The time interval to collect data can be modified by the user via a sliding bar in the GUI as shown in Fig. 2.
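As an illustration of how rows of the USER and PROCESS tables map to low-dimensional clustering inputs, the sketch below mirrors the listed features as Python records. It is only a sketch under stated assumptions: the paper stores these features in a relational database, and the class names, field names, types, and units used here are hypothetical, as is the helper `login_hour`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class UserRecord:
    """One row of the USER table (field names are assumptions)."""
    user_id: int
    user_name: str
    login_time: datetime
    remote_login: bool                      # True for remote, False for local
    logoff_time: Optional[datetime] = None  # None while the session is open


@dataclass
class ProcessRecord:
    """One row of the PROCESS table (field names and units are assumptions)."""
    process_id: int
    process_name: str
    owner: str
    memory_usage_kb: int
    cpu_usage_percent: float
    sampled_at: datetime


def login_hour(record: UserRecord) -> float:
    """Reduce a USER row to a one-dimensional feature: login time in hours."""
    return record.login_time.hour + record.login_time.minute / 60.0


session = UserRecord(1, "alice", datetime(2007, 1, 1, 8, 30), remote_login=False)
print(login_hour(session))  # 8.5
```

Each behavior-set element reduced this way stays one- or two-dimensional, which is what allows the outlier detection algorithm to be applied without normalization.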
In addition, the user can easily select a table and look at the data through the database interface. When the record size overflows, the new record overwrites the oldest record.

With the profiling database, the outlier detection algorithm is applied to each single factor in the behavior set if applicable. For instance, the user login time is trained to get the normal login time intervals, e.g., [8:00, 9:00] and [13:00, 14:00]. The CPU usage pattern is learned to model the normal distribution pattern, e.g., peak time at [9:00, 11:00] and [14:00, 16:00].

Typically, only a few indicators of the behavior set are detected as abnormal. So the significance of the host's deviation from normal behavior has to be measured in some way. In this framework, we adopt the idea of the weighted sum. First, the deviation of each behavior indicator is calculated using the outlier detection algorithm. Then we compute the overall deviation from the host baseline model:

$$\frac{w_1 d_1 + w_2 d_2 + \cdots + w_m d_m}{w_1 + w_2 + \cdots + w_m},$$

where $m$ is the size of the behavior set, $w_i$ is the predefined weight of indicator $i$ in the behavior set, and $d_i$ is a binary value, 0 or 1, i.e., normal or abnormal. When the total deviation is beyond a threshold, an alert is generated. The values of $w_i$ ($i = 1, 2, \ldots, m$) can be customized based on the host function and the user type. For example, a software developer and an administrator will have very different file activities. If we know the user of the host is a software developer and the machine is mainly used for software development, we can put more weight on deviations of frequency of applications launched, I/O activities, and frequency of file access.

TABLE I. ANOMALY DETECTION ALGORITHM

Inputs:
    Cluster width $w$
    The unlabeled data set $D$ with $n$ dimensions
Initialization:
    Set of clusters $S$ is empty
Loop until $D$ is empty
    For each data $d$ in data set $D$:
        If $S$ is empty
            create a cluster $C$ with one element $d$, add the cluster $C$ to $S$: $S \cup \{C\} \to S$, and set $d$ as centroid $(x_0, y_0)$ of $C$.
        Otherwise
            find the closest cluster $C$ in $S$ such that for any cluster $C'$ in $S$, $dist(C, d) \le dist(C', d)$, where $dist(C, d)$ is the Euclidean distance from $d$ to the centroid of cluster $C$.
            If $dist(C, d) \le w$
                insert $d$ into cluster $C$.
                adjust the old centroid $(x_0, y_0)$ to $(x_0', y_0')$, where $x_0' = (m x_0 + x_d)/(m + 1)$, $y_0' = (m y_0 + y_d)/(m + 1)$, $(x_d, y_d)$ are the coordinates of $d$, and $m = |C| - 1$.
            Otherwise
                a new cluster $C_d$ with $d$ as centroid is created, $S \cup \{C_d\} \to S$.
        $D - \{d\} \to D$
Find outliers:
    Sort the clusters based on sizes in ascending sequence, $|C_1| \le |C_2| \le \cdots \le |C_l|$, where $l$ is the number of clusters. Let $|C_1| + |C_2| + \cdots + |C_l| = p$.
    For $i = 1$ to $l$
        If $|C_i| < (p / l) \times 20\%$, set the cluster $C_i$ as abnormal, $i$++
        Otherwise, set any cluster $C_j$, $j \ge i$, as normal, break

When a new service is installed or the user's behavior changes suddenly, false alerts are triggered. In this case, the alert fusion module will fuse alerts generated for the same reason. When new data instances are treated as intrusions by the anomaly detection module at the beginning of a pattern change, the new data are not discarded. As the data forming the new pattern increases, the host profiling database is updated, and the clustering algorithm is retrained over the new set of data for modeling the emerging new pattern.

Figure 2. System performance database

When the behavior set is modified, the learning algorithm remains the same, but the database has to be modified or reconstructed.

3.3. Behavior analysis-based alert verification

Conventional alert verification approaches include active and passive mechanisms [1].
Passive alert verification mechanisms compare the configuration of the victim machine to the requirements of a successful attack, e.g., a vulnerable version of the Microsoft IIS server. They gather configuration data before an attack occurs and determine whether the target machine is vulnerable to the attack. Active mechanisms model the expected outcome of attacks, checking the visible and checkable traces that a certain attack leaves at a host, e.g., a temporary file or an outgoing network connection. They gather configuration data or forensic traces after an alert occurs. Neither approach can detect unknown attacks. They assume that either the system vulnerability required by an attack or the outcome of an attack is known in advance. If this is true, the attack can be prevented by installing a new service pack. Then there is no need to detect the attack.

In the proposed learning framework, we focus on the verification of unknown and new types of attacks based on host behavior analysis. The after-attack performance measured in quantity might be different in different systems. However, the variance between prior-attack behavior and post-attack behavior is consistent on various machines. For example, an attack causes the network connection to be 50% slower no matter what type of CPU and bandwidth the machine has. Therefore, the basic idea of the alert verification module is to learn the behavior features associated with an attack and check the change of the value of those features. It is applicable to the attacks that cause system behavior change.

In the training phase, once an alert is verified manually by the system administrator, the system behaviors both before and after the attack are sampled from the system performance database and stored in a separate system metadata database. Since each alert has a timestamp $t$, we just need to probe the system performance database and get the records at time $t$ and the last sample time $t-1$. For each system performance-related feature in the records, we calculate the change of feature values:

$$\Delta \mathit{feature}_i = \frac{\mathit{feature}_i(t) - \mathit{feature}_i(t-1)}{\mathit{feature}_i(t-1)}.$$

The system performance features include CPU usage, memory usage, I/O usage, network stats, etc. A record in the form of $\{\mathit{alertid}, \mathit{alerttype}, \Delta\mathit{feature}_1, \Delta\mathit{feature}_2, \ldots, \Delta\mathit{feature}_l\}$ is added to the configuration change baseline database, where $l$ is the number of total features and $\Delta\mathit{feature}_i$ is the relative value change of feature $i$.

When the configuration change baseline database is constructed, the rule learning algorithm RIPPER [14] will be applied to the database. When applying RIPPER, we get rules for different types of alerts separately. To extract rules for alert type $i$ using RIPPER, all the records with that alert type are treated as positive data, and all the other records are treated as negative data. RIPPER takes the positive and negative data sets as input, and outputs a rule set represented as a conjunction of conditions in the form of $A_n = v$, $A_c \le \theta$, or $A_c \ge \theta$, where $A_n$ is a nominal attribute and $v$ is a legal value for $A_n$, and $A_c$ is a continuous variable and $\theta$ is some value for $A_c$.

4. Experimental Results

There are some benchmark data available for network intrusion detection research. For host-based intrusion detection, one data set is from the 1999 DARPA intrusion detection evaluation data, which consists of BSM (Basic Security Module) data of all processes run on Solaris machines. Another set of data, obtained from Stephanie Forrest's group at the University of New Mexico [15], contains normal traces for certain programs as well as intrusion traces of system calls for several processes.

In our study, these benchmark data are not applicable: first, because we use a novel concept, the behavior set, to describe system behavior instead of using system call traces, and second, because all behavior elements are low dimension data for the accuracy of clustering. System call data cannot be used for clustering directly and the preprocessing of data is expensive. Eskin et al. [11] adopted spectrum kernels to map the system call data into feature space. However, the feature space corresponding to system calls is of large dimension, e.g., 26 possible system calls and sub-sequences of length 4 will give a dimension of the feature space of $26^4$, close to 500,000.

To implement the learning framework proposed, we started with system metadata collection. Based on the behavior set we defined, as described in Section 3, we constructed the WAMP database server and wrote C++ code to collect the system performance data from a host and store the data into the database. The data will be used as a training set to model the normal behavior of the system. The data is collected in real time and the database is updated dynamically. The data sampling rate is predefined and can be modified through a sliding bar on the GUI as shown in Fig. 2. Fig. 3 shows one of the tables in the system profiling database: CPU usage data collected from the host.

Figure 3. CPU usage data collected from host

The unsupervised anomaly detection algorithm is implemented in Java. Given the training data set and cluster width, the cluster-based outlier detection algorithm is applied and the simulation result is shown in Fig. 4. Using this algorithm, data instances in sparse regions are treated as noise and intrusions buried in the training data, and data instances in dense regions are treated as normal data. When a new data instance comes, whether it is labeled as normal or intrusion depends on which region it is located in. The algorithm needs to specify the cluster width instead of the number of clusters. Because the clustering is performed offline, the cluster width can be adjusted easily through the GUI for accurate results.

Figure 4. Experimental result of clustering

The connection with the database server is implemented with Java JDBC, and the anomaly detection algorithm is also applied to real data from the database. The initial experimental results on a few factors in the behavior set, e.g., the user login time and CPU usage distribution over time, are satisfactory. Further experiments need to be carried out. To adapt to system behavior pattern changes, the database is updated periodically and the algorithm is applied over the new input data to update the clusters that represent normal behavior.

5. Conclusion

In this paper, behavior analysis-based intrusion detection at the host level has been discussed and a learning framework has been proposed. Two parts in the framework, anomaly detection and alert verification, have been designed using machine learning techniques.
The behavior analysis-based framework has the following advantages: 1) the host behavior is described more accurately and comprehensively with a set of indicators, i.e., user profile, program profile, and system resource access; 2) the anomaly detection module does not need labeled data, which are difficult to obtain; 3) each indicator in the behavior set is represented by data of low dimensionality since the host behavior is refined into a set of indicators; 4) with these low dimension data, the clustering algorithm can be applied over the data without normalization, which is data-dependent and application specific; 5) the clustering algorithm does not need to predefine the number of clusters, and the width of clusters can be adjusted visually offline; and 6) a novel alert verification approach can check the changes in the host behavior caused by an attack and learn rules associated with the attack.

Currently, the anomaly detection module has been simulated and the host profiling database has been constructed. In the future, the anomaly detection module will be tested on data from a real-world database and the testing results will be carefully examined. In addition, we will implement the alert verification mechanism.

References

[1] F. Valeur, G. Vigna, C. Kruegel, R. A. Kemmerer. A comprehensive approach to intrusion detection alert correlation. 2004.
[2] S. Axelsson. Intrusion detection systems: A survey and taxonomy. Technical Report 99-15, Department of Computer Engineering, Chalmers University, March.
[3] D. E. Denning. An intrusion-detection model. IEEE Transactions on Software Engineering, Vol. SE-13, No. 2, Feb.
[4] K. Sequeira, M. Zaki. ADMIT: Anomaly-based Data Mining for Intrusions. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, July.
[5] S. Forrest, S. A. Hofmeyr, and A. Somayaji. Computer immunology. Communications of the ACM, 40(10):88-96, October.
[6] W. Lee, S. Stolfo, and P. K. Chan. Learning patterns from Unix process execution traces for intrusion detection. In Proceedings of AAAI97 Workshop on AI Methods in Fraud and Risk Management.
[7] D. Wagner and D. Dean. Intrusion Detection via Static Analysis. IEEE Symposium on Security and Privacy, Oakland, CA, 2001.
[8] Henry Hanping Feng, Oleg M. Kolesnikov, Prahlad Fogla, Wenke Lee, and Weibo Gong. Anomaly Detection Using Call Stack Information.
[9] Yihua Liao, V. Rao Vemuri. Use of K-Nearest Neighbor classifier for intrusion detection. Computers & Security, Volume 21, Issue 5, 1 October 2002.
[10] A. K. Ghosh, A. Schwartzbard, and M. Schatz. Learning program behavior profiles for intrusion detection. In Proceedings of the 1st USENIX Workshop on Intrusion Detection and Network Monitoring. USENIX Association, April.
[11] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, S. J. Stolfo. A geometric framework for unsupervised anomaly detection: detecting intrusions in unlabeled data. In Data Mining for Security Applications.
[12] F. Apap, A. Honig, S. Hershkop, E. Eskin, and S. Stolfo. Detecting Malicious Software by Monitoring Anomalous Windows Registry Accesses. Fifth International Symposium on Recent Advances in Intrusion Detection (RAID), Zurich, Switzerland.
[13] S. J. Stolfo, L. Bui, S. Hershkop. Unsupervised Anomaly Detection in Computer Security and an Application to File System Access. Proc. ISMIS.
[14] W. W. Cohen. Fast effective rule induction. In Proceedings of the 12th International Conference on Machine Learning, Lake Tahoe, CA.
[15] C. Warrender, S. Forrest, and B. Pearlmutter. Detecting intrusions using system calls: alternative data models. In 1999 IEEE Symposium on Security and Privacy. IEEE Computer Society, 1999.
[16] C. Kruegel and W. Robertson. Alert Verification: Determining the Success of Intrusion Attempts. Proc. First Workshop on the Detection of Intrusions and Malware and Vulnerability Assessment (DIMVA 2004), July.
