
Anomaly Traffic Analysis and The Experiment Statistic Model Based on Honeypot

1 Wang Xin-Liang, 2 Lu Nan, 3 Li Hui, 4 Gao Qing-Hua
*1, First Author: School of Electrical Engineering and Automation, Henan Polytechnic University, wangxl@hpu.edu.cn
2 The Research Institution of China Mobile, lunan@chinamobile.com
3 School of Electrical Engineering and Automation, Henan Polytechnic University, li20042007@163.com
4 School of Electrical Engineering and Automation, Henan Polytechnic University, gaoqh@hpu.edu.cn

Abstract

To better assess the scale of suspicious hosts in a botnet, this paper continuously monitored a honeypot deployed in the network for 470 days and derived a corresponding experimental statistical model. The model shows that the daily number of attackers can be described by a normal distribution over a relatively short period, but does not follow a normal distribution over a relatively long period. The statistical distribution model of the attacker number can be used to evaluate the security status of the network and to focus limited network resources on suspicious targets, thereby improving network security.

Keywords: Honeypot, Scan IP Distribution, Attack Frequency, Botnet Detection

1. Introduction

A botnet [1-4] is a group of compromised computers controlled by a server. Botnets pose a serious threat to network security and to national information security. According to the protocol used, botnets can be divided into three types: IRC-based, HTTP-based, and P2P-based. IRC-based botnets mainly include SDBot, Agobot, GT-Bot, and Rbot; HTTP-based botnets mainly include Bobax, Rustock, and Clickbot; P2P-based botnets mainly include Phatbot, Nugache, and Storm [5,6]. Detecting these botnets directly in a real high-speed network would overburden system resources.
Sampling techniques can effectively reduce the network traffic that anomaly detection algorithms must process, but they can significantly affect the performance of those algorithms. Reference [7] proposed a packet sampling algorithm that preserves the fingerprint characteristics of worms and botnets, providing a basis for subsequent deep packet inspection. Reference [8] compared scan detection results on network traffic before and after sampling, using three scan detection algorithms: TRWSYN [9], TAPS [10], and an entropy-based detection algorithm [11, 12]. The experimental results showed that sampling severely reduced detection accuracy because part of the traffic was lost. Reference [13] focused on how packet sampling influences detection metrics, and its experimental results showed that an entropy-based worm detection algorithm still achieved good detection results on sampled traffic. Reference [14] proposed a flow-based sampling technique that retains as many anomalous flows as possible, which preserves the effectiveness of the detection algorithm to a certain extent.

To address the problems caused by sampling, a set of suspicious botnet IP addresses can be obtained from the high-speed network and used to filter network traffic. The subsequent botnet detection algorithm then has far less traffic to process, and the filtered traffic provides a basis for further botnet detection. This paper obtains the suspicious IP set with honeypot technology and analyzes in depth the traffic statistics of that set. By analyzing the anomalous traffic of the honeypot, we found that scanning was the major type of attack.
Journal of Convergence Information Technology (JCIT), Volume 8, Number 4, Feb 2013, doi:10.4156/jcit.vol8.issue4.17

Because network scanning is a prelude to botnet propagation, the set of abnormal IP addresses obtained by the honeypot can, to some extent, serve as the suspicious IP set of a botnet and thus provide a basis for further botnet anomaly detection. The honeypot used in this paper is built from a real host, operating system, and applications. Through more than a year of uninterrupted monitoring, we analyzed in depth the flow characteristics of suspicious IPs and the statistical regularities of the daily number of suspicious IPs, and derived statistical models. The models can assess the scale of daily suspicious IPs, providing an experimental basis for the detection, control, and classification of botnets.

2. Anomaly data acquisition and preprocessing based on honeypot

2.1. Data acquisition

The anomaly data is obtained from a honeypot deployed in an internet data centre. Because the data centre is a sensitive location, the honeypot suffers a variety of attacks. When the honeypot is attacked, other servers in the same network are also very likely to suffer the same attack, so the attack data collected by the honeypot reflects the state of network attacks to a certain extent.

The anomaly data cannot be collected directly on the honeypot. Once the honeypot is compromised, the attacker is likely to shut down the data acquisition program and destroy the collected data, which would cause data acquisition to fail. To avoid this, this paper uses a dedicated data acquisition device installed at the network egress of the internet data centre. Packets sent from external networks to attack the honeypot are thus collected in real time, and anomaly data collection is unaffected even if the honeypot is destroyed by the attacker.

2.2. Honeypot configuration

The operating system installed on the honeypot is Windows XP SP2, and four applications are installed: Serv-U 6.4, Tomca.0, Remote AnyWhere, and VNC. The system also opens port 3389 for the remote login service, and the access passwords of the software are combinations of letters, numbers, and special characters. Tomca.0 serves only the server's own pages and does not run external web applications. During the monitoring period, the software ran normally and no attacker succeeded in intruding.

2.3. Preprocessing

Between March 11, 2009 and June 24, 2010, the data acquisition device monitored continuously and collected all raw packets of anomaly data. The network traffic analysis system computed flow statistics over the collected packets in chronological order and stored the results in a PostgreSQL database in flow format. The analysis showed 566631 flows in total, of which 548238 were TCP flows and 510500 were abnormal TCP flows. The normal flows were produced by operating system updates, open services, and related software applications. No anomalies were found in the collected UDP packets, so UDP packets were not considered in assessing network security. The address of the honeypot was not published, so all flows initiated by external hosts to access the honeypot were abnormal. The abnormal flows discussed in this paper are TCP flows.

Definition 1: attacker number. The attacker number in this paper refers to the number of abnormal IP addresses that access the honeypot.
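Definition 1 and the preprocessing rules above can be illustrated with a minimal Python sketch that counts distinct attacker IPs per day. The record layout `(start_time, src_ip, dst_ip, protocol)`, the function name `daily_attacker_counts`, and the example addresses are assumptions made for illustration; they are not the paper's database schema or data.

```python
from collections import defaultdict
from datetime import datetime


def daily_attacker_counts(flows, honeypot_ip):
    """Count distinct external source IPs per day (Definition 1, applied daily).

    `flows` is an iterable of (start_time, src_ip, dst_ip, protocol) tuples.
    This layout is a hypothetical stand-in for the stored flow records.
    """
    ips_per_day = defaultdict(set)
    for start_time, src_ip, dst_ip, protocol in flows:
        # Keep only TCP flows initiated by external hosts toward the honeypot.
        if protocol != "TCP" or dst_ip != honeypot_ip:
            continue
        ips_per_day[start_time.date()].add(src_ip)
    return {day: len(ips) for day, ips in sorted(ips_per_day.items())}


# Made-up example records using documentation address ranges.
HONEYPOT = "192.0.2.10"
example_flows = [
    (datetime(2009, 3, 11, 8, 0), "198.51.100.1", HONEYPOT, "TCP"),
    (datetime(2009, 3, 11, 9, 30), "198.51.100.1", HONEYPOT, "TCP"),  # same attacker again
    (datetime(2009, 3, 11, 10, 0), "198.51.100.2", HONEYPOT, "TCP"),
    (datetime(2009, 3, 11, 11, 0), HONEYPOT, "203.0.113.5", "TCP"),   # outbound update traffic
    (datetime(2009, 3, 12, 1, 0), "198.51.100.3", HONEYPOT, "UDP"),   # UDP is ignored
    (datetime(2009, 3, 12, 2, 0), "198.51.100.3", HONEYPOT, "TCP"),
]
counts = daily_attacker_counts(example_flows, HONEYPOT)
```

Using a set per day means that an attacker who opens many flows on the same day is counted once, which matches counting IPs rather than flows.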

3. Anomaly traffic analysis and experimental statistical model

3.1. Anomaly traffic analysis

3.1.1. Attack type

The abnormal data captured by the honeypot can be divided into scanning attacks and non-scanning attacks, as follows:

1) Scan attack SCAN1. The attacker sends a SYN packet to request a connection, and the server returns a <RST,ACK> packet to reject the connection.
2) Scan attack SCAN2. The attacker sends a SYN packet to request a connection, the server returns a <SYN,ACK> packet, and the attacker sends a RST packet to close the connection.
3) Scan attack SCAN3. The attacker sends a <SYN,ACK> packet, and the server returns a RST packet to reject the request.
4) Scan attack SCAN4. The attacker sends a SYN packet to request a connection, and the server returns a <SYN,ACK> packet. The attacker never sends the confirmation, which leaves a half-open connection.
5) Remote AnyWhere attack. The attacker launches a variety of attacks on ports 2000 and 22, which Remote AnyWhere uses.
6) Tomcat attack. The attacker launches a variety of attacks on port 8080, which Tomcat uses.
7) Remote login attack. The attacker launches a variety of attacks on port 3389, which remote desktop uses.
8) VNC attack. The attacker launches a variety of attacks on port 5900, which VNC uses.
9) FTP attack. The attacker launches a variety of attacks on port 21, which FTP uses.

In this paper, SCAN1 through SCAN4 are collectively referred to as scanning attacks.

3.1.2. Analysis of attacker num

Table 1. Time distribution table of attacker number

| Type of attack | Days | Mean | Standard deviation | Minimum | Maximum |
|---|---|---|---|---|---|
| All types of attack | 470 | 133 | 76 | 48 | 902 |
| Scanning attack | 470 | 115.6 | 28.3 | 39 | 184 |
| Scanning proportion | 470 | 0.9286 | 0.1418 | 0.0876 | 1 |
| Attack number of remote login attack | 470 | 0.34 | 0.61 | 0 | 4 |
| Attack number of FTP attack | 470 | 0.5 | 0.78 | 0 | 5 |
| Attack number of remote anywhere attack | 470 | 19 | 76.28 | 0 | 830 |
| Attack number of tomcat attack | 470 | 6.45 | 3.5 | 0 | 18 |
| Attack number of VNC attack | 470 | 0.49 | 1.1 | 0 | 11 |

Figures 1 and 3 show the probability distribution of the daily attackers for all attack types and for scanning attacks, respectively. Figure 1 shows that the daily number of attackers is concentrated below 200. Only on certain days is the number much larger, reaching about 900. Analysis of the raw data shows that these spikes occurred when the server suffered intensive attacks launched by many attackers within a short period, for example password-guessing attacks, which may have been initiated by many zombie hosts in the same botnet. In figure 2, the normal P-P plot deviates clearly from the diagonal; that is, the observed and expected cumulative probabilities differ significantly. The actual distribution also differs clearly from the fitted normal curve, so the attacker number of all types does not follow a normal distribution.
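The attack types of Section 3.1.1 could be expressed as a small classifier over the TCP flag sequence of a flow. The sketch below is only an illustration under stated assumptions: the packet representation (a direction marker "A" for attacker or "S" for server, plus a set of flag names), the function name `classify_flow`, and the label strings are hypothetical, and treating repeated server <SYN,ACK> packets as retransmissions within SCAN4 is an added assumption rather than something the paper states.

```python
SYN = frozenset({"SYN"})
SYN_ACK = frozenset({"SYN", "ACK"})
RST = frozenset({"RST"})
RST_ACK = frozenset({"RST", "ACK"})

# Service ports listed in Section 3.1.1 (label names are illustrative).
PORT_LABELS = {
    2000: "remote_anywhere",
    22: "remote_anywhere",
    8080: "tomcat",
    3389: "remote_login",
    5900: "vnc",
    21: "ftp",
}


def classify_flow(packets, dst_port):
    """Classify one flow from its ordered (direction, flags) packets.

    direction is "A" (attacker to server) or "S" (server to attacker).
    """
    seq = [(direction, frozenset(flags)) for direction, flags in packets]
    if seq == [("A", SYN), ("S", RST_ACK)]:
        return "SCAN1"  # SYN rejected with <RST,ACK>
    if seq == [("A", SYN), ("S", SYN_ACK), ("A", RST)]:
        return "SCAN2"  # attacker resets after <SYN,ACK>
    if seq == [("A", SYN_ACK), ("S", RST)]:
        return "SCAN3"  # unsolicited <SYN,ACK> answered with RST
    if len(seq) >= 2 and seq[0] == ("A", SYN) and all(p == ("S", SYN_ACK) for p in seq[1:]):
        return "SCAN4"  # half-open: no packet from the attacker after the SYN
    # Anything else is labelled by the service port it targets.
    return PORT_LABELS.get(dst_port, "other")


example = classify_flow([("A", {"SYN"}), ("S", {"RST", "ACK"})], 445)
```

The scan patterns are checked before the port lookup so that a bare probe of port 3389, for example, is counted as a scan and not as a remote login attack.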

Figure 1. Probability distribution plot of the daily attackers (all types)

Figure 2. Normal P-P plot of the daily attackers (all types)

Figures 3 and 4 show that the number of scanning attackers the server encountered each day was relatively stable. A likely reason is that scanning is mainly used to collect vulnerability information, so a botnet does not organize its zombie hosts to launch large-scale scans against the same server once it has obtained that information. In figure 3, the normal distribution curve fits the actual distribution well, and in figure 4 the observed and expected cumulative probabilities differ only slightly. The actual distribution of the number of scanning attackers is therefore similar to the fitted normal curve; whether it can actually be described by a normal distribution is examined in Section 3.2.

Figure 3. Probability distribution plot of the daily attackers (scanning attack)

Figure 4. Normal P-P plot of the daily attackers (scanning attack)

3.2. Experimental statistical model of attacker num

The experimental statistical model is established through a distribution analysis of the 470 days of anomaly data. The statistical method is the single-sample K-S test provided by the SPSS software, a goodness-of-fit test that uses sample data to infer whether the population follows a given theoretical distribution. Let N denote the daily number of attackers. The significance level is set to 0.05, and the null hypothesis is that there is no significant difference between the actual distribution and the normal distribution. In this paper, 60 days are taken as the basic unit of time, and the periods are denoted t1, t2, ..., tn, where n = ⌈470/60⌉ = 8; the last period, t8, contains the remaining 50 days.

Statistical analysis of the daily attacker numbers shows that, over the whole 470-day period, neither the number of scanning attackers nor the number of all attackers follows a common distribution such as the normal or Poisson distribution, and the attacker number fluctuates considerably. Let x denote the daily number of all attackers, y the daily number of scanning attackers, and z the daily proportion of scanning attackers, so that z = y/x. Each of x, y, and z is tested against the hypothesis that it follows a normal distribution, and the test results are shown in table 2, table 3 and table 4.

From table 2, the P value of x over the full 470 days is 0, so x does not follow a normal distribution. The P values of x_t1, x_t2, x_t3 and x_t4 are larger than 0.05, so there is no significant difference between the actual distribution and the normal distribution in these four periods. The P values of x_t5, x_t6, x_t7 and x_t8 are smaller than 0.05, so the hypothesis is rejected; that is, x does not follow a normal distribution in these four periods. In the first four periods, the number of all attackers is relatively stable, and the daily number of attackers on the honeypot stays close to the mean. This shows that during this time the network security environment was relatively stable and the frequency of attacks on the honeypot did not change significantly. In the last four periods, the standard deviation of the attacker number fluctuates much more relative to the mean, and the attacker number does not follow a normal distribution.
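The single-sample K-S procedure described above can be sketched in plain Python. This is a minimal illustration, not a reproduction of the SPSS computation: the function name `ks_normal_test`, the asymptotic Kolmogorov series with a commonly used small-sample correction, and both example samples are assumptions, and SPSS may report different P values for the same data.

```python
import math
import statistics


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_normal_test(sample):
    """Return (D, p) for a one-sample K-S test against a fitted normal.

    The mean and standard deviation are estimated from the sample itself.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    std = statistics.stdev(sample)
    d = 0.0
    for i, x in enumerate(sorted(sample)):
        f = normal_cdf(x, mean, std)
        # Largest gap between the empirical and the fitted CDF.
        d = max(d, (i + 1) / n - f, f - i / n)
    # Asymptotic Kolmogorov distribution with a small-sample correction.
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return d, 1.0  # the series converges slowly here; p is essentially 1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, 101))
    return d, min(1.0, max(0.0, 2.0 * total))


# Synthetic 60-day samples (not the paper's data).
stable = [statistics.NormalDist(120, 11).inv_cdf((i + 0.5) / 60) for i in range(60)]
bursty = [100.0] * 55 + [900.0] * 5  # a few days with attack spikes

d_stable, p_stable = ks_normal_test(stable)
d_bursty, p_bursty = ks_normal_test(bursty)
```

With a significance level of 0.05, the hypothesis of normality is kept when p is larger than 0.05 and rejected otherwise. Because the parameters are estimated from the same sample, the plain K-S P value tends to be conservative; a Lilliefors-type correction would be the stricter choice.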

Table 2. The distribution of attacker number (all attack types)

| Variable | Days | Mean | Standard deviation | Probability P value |
|---|---|---|---|---|
| x | 470 | 132.6191 | 76.18168 | 0 |
| x_t1 | 60 | 124.95 | 11.5148 | 0.24 |
| x_t2 | 60 | 140.5833 | 11.43633 | 0.79 |
| x_t3 | 60 | 143.5 | 14.54217 | 0.275 |
| x_t4 | 60 | 115.35 | 29.87932 | 0.056 |
| x_t5 | 60 | 100.25 | 31.08306 | 0.01 |
| x_t6 | 60 | 131.1833 | 160.2181 | 0 |
| x_t7 | 60 | 123.6 | 43.0661 | 0 |
| x_t8 | 50 | 191.32 | 118.475 | 0 |

Table 3. The distribution of attacker number (Scanning attack)

| Variable | Days | Mean | Standard deviation | Probability P value |
|---|---|---|---|---|
| y | 470 | 115.6085 | 28.2524 | 0.007 |
| y_t1 | 60 | 120.9167 | 11.41407 | 0.603 |
| y_t2 | 60 | 136.0833 | 11.47036 | 0.781 |
| y_t3 | 60 | 140.3833 | 14.71 | 0.169 |
| y_t4 | 60 | 111.8167 | 29.171 | 0.075 |
| y_t5 | 60 | 81.1 | 16.733 | 0.84 |
| y_t6 | 60 | 83.6167 | 13.82 | 0.228 |
| y_t7 | 60 | 113.93 | 15.52 | 0.687 |
| y_t8 | 50 | 141.3 | 21.836 | 0.233 |

Table 4. The distribution of scanning attacker proportion

| Variable | Days | Mean | Standard deviation | Probability P value |
|---|---|---|---|---|
| z | 470 | 0.9286 | 0.142 | 0 |
| z_t1 | 60 | 0.9677 | 0.01705 | 0.879 |
| z_t2 | 60 | 0.9679 | 0.01572 | 0.743 |
| z_t3 | 60 | 0.9781 | 0.01512 | 0.297 |
| z_t4 | 60 | 0.9689 | 0.01929 | 0.526 |
| z_t5 | 60 | 0.8402 | 0.15490 | 0.005 |
| z_t6 | 60 | 0.8663 | 0.2221 | 0 |
| z_t7 | 60 | 0.9528 | 0.10388 | 0 |
| z_t8 | 50 | 0.8789 | 0.25184 | 0 |

From table 3, the P value of y over the full 470 days is 0.007, which is smaller than 0.05, so y does not follow a normal distribution over the whole period. However, the P values of y_t1 through y_t8 are all larger than 0.05, so within each of the eight time periods the number of scanning attackers follows a normal distribution. That is to say, the number of scanning attackers follows a normal distribution over a shorter period of time and tends to fluctuate around the mean. However, the mean and variance of the number of scanning attackers change from one period to the next, so it follows a non-stationary distribution, and its probability distribution is as follows:

$$ f_y(y, t) = \frac{1}{\sqrt{2\pi}\,\sigma_t} \exp\left(-\frac{(y(t) - m_t)^2}{2\sigma_t^2}\right), \quad t = t_1, t_2, t_3, \ldots, t_8 $$

The mean m_t and standard deviation σ_t of the number of scanning attackers in each period t are shown in figure 5. The X axis represents the time period t = 1, 2, 3, ..., 8 in units of 60 days; the Y axis represents the mean and standard deviation of the number of scanning attackers.

From table 4, the P values of z_t1, z_t2, z_t3 and z_t4 are larger than 0.05, so the scanning attacker proportion follows a normal distribution in the first four periods. However, the P values of z_t5, z_t6, z_t7 and z_t8 are smaller than 0.05, so the hypothesis is rejected; that is, the proportion does not follow a normal distribution in these four periods. In the first four periods, the number of other types of attackers is relatively stable, the scanning attacker proportion fluctuates around the mean, and very large or very small proportions are rare. In the last four periods, both the number of other types of attackers and the scanning proportion fluctuate more strongly, so the proportion does not follow a normal distribution.

Over a shorter period (2 months), the number of scanning attackers the server encounters stays close to the mean, and serious deviations from the mean are rare. This suggests that the network environment and network security measures are relatively stable within a short period, so the number of attackers the server encounters is also relatively stable. Over time, the network environment and network security measures change; for example, the security measures taken during the Olympics inevitably reduce network attacks overall, which in turn changes the distribution of the attacker number of the server. How the attacker number is influenced by network security measures is the target of our future work.

Figure 5. The mean and standard deviation of scan attacker num in different period

4. Conclusions

In this paper, we analyze in depth the statistical regularities of the abnormal flows of suspicious IPs and obtain experimental statistical models that provide a basis for assessing the network security situation and the scale of suspicious botnet IPs. However, when the suspicious IP queue is used to filter high-speed network traffic, continually adding the suspicious IPs obtained by the honeypot will eventually make the queue so large that little network traffic is filtered out. Based on the statistical regularities of daily suspicious IPs and the traffic-handling capacity of the subsequent botnet detection module, further study is needed on how to construct appropriate control strategies for the scale of the suspicious IP queue.
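As an illustration of how the per-period normal model of Section 3.2 might be used to assess the daily scale of scanning attackers, the following Python sketch evaluates the density with the means and standard deviations reported in Table 3. The day numbering (day 1 is the first monitored day), the function names, and the 1.96 standard deviation band are assumptions added for illustration and are not part of the paper's method.

```python
import math

# (mean, standard deviation) of the daily scanning attacker number for
# periods t1..t8, copied from Table 3.
SCAN_PARAMS = [
    (120.9167, 11.41407),
    (136.0833, 11.47036),
    (140.3833, 14.71),
    (111.8167, 29.171),
    (81.1, 16.733),
    (83.6167, 13.82),
    (113.93, 15.52),
    (141.3, 21.836),
]


def period_index(day):
    """Map a monitoring day (1..470) to its 60-day period index (0..7)."""
    if not 1 <= day <= 470:
        raise ValueError("day must be between 1 and 470")
    return (day - 1) // 60


def scan_density(y, day):
    """Normal density of y scanning attackers on the given day."""
    mean, std = SCAN_PARAMS[period_index(day)]
    return math.exp(-((y - mean) ** 2) / (2.0 * std * std)) / (math.sqrt(2.0 * math.pi) * std)


def expected_band(day, k=1.96):
    """Range mean +/- k standard deviations for the day's period."""
    mean, std = SCAN_PARAMS[period_index(day)]
    return mean - k * std, mean + k * std


low, high = expected_band(1)
```

A daily count far outside the band for its period would be unusual under the fitted model, which is one possible way to flag days that deserve closer inspection.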

5. Acknowledgements

This paper is supported by the National Natural Science Foundation of China (Grant No. 41074090), the Doctor Fund of Henan Polytechnic University (Grant No. B2012-073) and the Henan Province Open Laboratory for Control Engineering Key Disciplines (Grant No. KG2009-20).

6. References

[1] Du Yue-Jin, Cui Xiang, Malicious botnet and its illumination on computer security, China Data Communication, vol. 7, no. 5, pp.9-12, 2005.
[2] Genevieve Bartlett, John Heidemann, Christos Papadopoulos, Low-rate, flow-level periodicity detection, 2011 IEEE Conference on Computer Communications Workshops, pp.804-809, 2011.
[3] Tung-Ming Koo, Hung-Chang Chang, Wen-Chi Liao, Estimating the Size of P2P Botnets, International Journal of Advancements in Computing Technology, vol. 4, no. 12, pp.386-395, 2012.
[4] Cheng Binlin, Fu Jianming, Yin Zhiyi, Heap Spraying Attack Detection Based on Sled Distance, International Journal of Digital Content Technology and its Applications, vol. 6, no. 14, pp.379-386, 2012.
[5] Julian B Grizzard, Vikram Sharma, Chris Nunnery, Brent ByungHoon Kang, and David Dagon, Peer-to-peer botnets: Overview and case study, In Proceedings of USENIX HotBots 07, pp.1-8, 2007.
[6] Thorsten Holz, Moritz Steiner, Frederic Dahl, Ernst Biersack, and Felix Freiling, Measurements and mitigation of peer-to-peer-based botnets: A case study on storm worm, In Proceedings of the First USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET 08), pp.1-9, 2008.
[7] Lothar Braun, Gerhard Münz, Georg Carle, Packet Sampling for Worm and Botnet Detection in TCP Connections, Network Operations and Management Symposium (NOMS), pp.264-271, 2010.
[8] Jianning Mai, Ashwin Sridharan, Chen-nee Chuah, et al., Impact of packet sampling on portscan detection, IEEE Journal on Selected Areas in Communication, vol. 24, no. 12, pp.2285-2298, 2006.
[9] Jaeyeon Jung, Vern Paxson, Arthur W Berger, and Hari Balakrishnan, Fast Portscan Detection Using Sequential Hypothesis Testing, In Proceedings of the 2004 IEEE Symposium on Security and Privacy, pp.211-225, 2004.
[10] Avinash Sridharan, Tao Ye, Supratik Bhattacharyya, Connectionless port scan detection on the backbone, Performance, Computing, and Communications Conference, pp.567-576, 2006.
[11] Anukool Lakhina, Mark Crovella, Christophe Diot, Mining anomalies using traffic feature distributions, Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications, pp.217-228, 2005.
[12] Kuai Xu, Zhi-Li Zhang, Supratik Bhattacharyya, Profiling internet backbone traffic: Behavior models and applications, In ACM Sigcomm, vol. 35, no. 4, pp.169-180, 2005.
[13] Daniela Brauckhoff, Bernhard Tellenbach, Arno Wagner, et al., Impact of packet sampling on anomaly detection metrics, Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, pp.159-164, 2006.
[14] George Androulidakis, Symeon Papavassiliou, Improving network anomaly detection via selective flow-based sampling, Communications, IET, vol. 2, no. 3, pp.399-409, 2008.