Real-Time Analysis of Non-stationary and Complex Network Related Data for Injection Attempts Detection

Michał Choraś 1,2 and Rafał Kozik 2

1 ITTI Ltd., Poznań, Poland
mchoras@itti.com.pl
2 Institute of Telecommunications, UT&LS Bydgoszcz, Poland

Summary. The growing use of cloud services, the increasing number of users, novel mobile operating systems, and changes in the network infrastructures that connect devices create new challenges for cyber security. To counter the emerging threats, network security mechanisms and protection schemes also evolve and use sophisticated sensors and methods. The drawback is that the more sensors (probes) are deployed and the more information they acquire, the larger the volume of data that has to be processed. In this paper, we present a real-time network data analysis mechanism. We also show results for the detection of SQL Injection Attacks.

1 Rationale

The number of security incidents reported all over the world has been increasing recently. National CERTs (e.g. CERT Poland [1]) report that the number of attacks in 2011 increased significantly compared to 2010. In their annual reports they explain that most of the network events submitted by automated feeds concern botnets, spam, malicious URLs, and brute-force attacks. The increased number of incidents is strongly related to the growing number of mobile device users, who form a population of connect-from-anywhere terminals that regularly test the traditional boundaries of network security. The so-called BYOD (bring your own device [2][3]) movement also exposes the traditional security of many enterprises to novel and emerging threats. Many current malware families, such as ZITMO (Zeus In The Mobile), no longer target the mobile device itself but aim to gather information about users and to gain access to remote services such as online banking. This significantly expands the network security perimeter in cyberspace.
A significant number of reported incidents are also connected with the widespread adoption of social media. Today, users themselves provide the content and, at the same time, drive the growth. This trend significantly accelerates the spread of different kinds of malware and viruses. As reported

by SophosLabs [2], the number of malware samples they analyzed has doubled since 2010. Moreover, as more and more cloud services and SaaS offerings are adopted by small and medium enterprises, a big challenge for network security arises, since data crucial to these companies is now stored, maintained, and transported by third-party infrastructure where traditional points of inspection cannot be deployed. According to the Cisco 2011 report [3], this trend attracts criminals, who see the potential for a higher return on their investment from cloud attacks, since they only need to hack one to hack them all.

Other well-known problems, such as attacks on web applications to extract data or to distribute malicious code, remain unsolved. Cybercriminals continuously steal data and distribute their malicious code via legitimate web servers they have compromised. Emerging technologies such as HTML5 also bring new cyber threats that web service providers have to deal with.

To counter all of the threats mentioned above, network security mechanisms and protection schemes also evolve and use sophisticated sensors and methods. The drawback is that the more sensors (probes) are deployed and the more information they acquire, the larger the volume of data that has to be processed. Therefore, in this paper, we present a real-time network data analysis mechanism and demonstrate its effectiveness for the detection of SQL Injection Attacks.

The paper is structured as follows. SQL injection attacks and detection tools are briefly presented in Section 2. Our own solution for detecting SQL injection attempts, based on an evolutionary algorithm, is presented in Section 3. The experimental setup and results are provided in Sections 4 and 5. Conclusions are given thereafter.

2 Current SQLIA Attack Detection Methods and Tools

One of the most important network threats is the SQL Injection Attack (SQLIA), which ranks as the top threat in the OWASP list [4].
SQL injection and similar exploits result from one language (for example, a scripting language) passing information directly into another, and are ultimately caused by insufficient input validation. SQL Injection Attacks (SQLIA) are a category of code-injection attacks in which part of the user's input is treated as SQL code. Such code, if executed on the database, may change, erase, or expose sensitive data stored in the database. The most significant examples of SQL Injection Attacks include: hacking the Royal Navy's website and recovering the user names and passwords of the site's administrators (November 2010) [5]; stealing information related to almost 100,000 accounts of subscribers registered on the ISP news and review site DSLReports.com (April 2011) [6];

and exploiting SQL injection vulnerabilities of approximately 500,000 web pages (April-August 2008) [7].

Several publications provide surveys and analyses that evaluate and compare injection detection and prevention techniques. For example, more than twenty detection and prevention techniques are examined in [10]. In that publication, the authors identified various types of SQLIAs and investigated the ability of the most commonly used current techniques to stop SQL injection. Similar approaches are presented in [11] and [12], where prevention techniques and security tools for the detection of SQL injection attacks were investigated.

The set of tools used in this paper for detecting SQL injection attacks consists of both an algorithm proposed by the authors and known (state-of-the-art) solutions. The tools evaluated in our tests are:
1. Apache Scalp. It is an analyzer of the Apache server access log file. It is able to detect several types of attacks targeting web applications. The detection is signature-based. The signatures take the form of regular expressions borrowed from the PHP-IDS project.
2. Snort. It is the most widely deployed IDS; it uses a set of rules for detecting web application attacks. However, most of the available rules are intended to detect very specific types of attacks that usually exploit very specific web application vulnerabilities.
3. ICD (Idealized Character Distribution [15]). The method is similar to the one proposed by C. Kruegel in [15]. It uses a character distribution model to describe the genuine traffic sent to a web application. The Idealized Character Distribution (ICD) is obtained during the training phase from perfectly normal requests sent to the web application. The ICD is calculated as the mean value of all character distributions.
During the detection phase, the probability that the character distribution of a query is an actual sample drawn from its ICD is evaluated. For that purpose, the Chi-Square metric is used.
4. SQL ADS based on the Genetic Algorithm (proposed by the authors in [16] and described in Section 3).

3 Genetic Algorithm Description

To detect anomalies in SQL queries, a novel method is proposed. It uses a genetic algorithm in which the individuals in the population explore the log file generated by the SQL database. Each individual aims at delivering a generic rule (a regular expression) that will describe

the visited log line. It is important for the algorithm to have a set of genuine SQL queries during the learning phase. The algorithm is divided into the following steps:
1. Initialization. Each individual is assigned a line from the log file. Each newly selected individual is compared to the previously selected ones in order to avoid duplicates.
2. Adaptation phase. Each individual explores a fixed number of lines in the log file (the number is predefined and adjusted to obtain a reasonable processing time for this phase).
3. Fitness evaluation. The fitness of each individual is evaluated. The global population fitness as well as the rule's level of specificity are taken into consideration, because we want to obtain a set of rules that describes the lines in the log file.
4. Cross over. Two randomly selected individuals are crossed over using a string alignment algorithm. If the newly created rule is too specific or too general, it is dropped in order to keep the false positive and false negative rates low.

To obtain a regular expression from two strings, a modified version of the Needleman and Wunsch algorithm [13] is proposed. Its authors used the algorithm to find the best match between two biological sequences (amino acid sequences in the original work [13]), which can diverge over time (e.g. by insertion or deletion) for different organisms. To find the correspondence between the two sequences, it is allowed to modify them by inserting gaps. However, each gap (and each mismatch) incurs a penalty, while each genuine match earns a reward. For the Needleman and Wunsch algorithm, the most important goal is to find the best alignment between the two sequences (the one with the highest reward). From the anomaly detection point of view, the parts where gaps are inserted are also important, because they are the points of injection. These parts are described with regular expressions using the guidelines proposed in [14].
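The alignment-based cross-over described above can be illustrated with a minimal sketch. It is not the authors' modified algorithm: it uses the textbook Needleman and Wunsch recurrence at character level, and the scoring values, the function names, and the way a varying region is generalised (a character class built from the characters seen in that region, repeated zero or more times) are assumptions made only for illustration.

```python
import re


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b.

    Returns a list of columns (x, y); None marks a gap on that side.
    The scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    columns, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                columns.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            columns.append((a[i - 1], None))
            i -= 1
        else:
            columns.append((None, b[j - 1]))
            j -= 1
    columns.reverse()
    return columns


def char_class(chars):
    """Character class covering the characters seen in a varying region."""
    return "[" + "".join(re.escape(c) for c in sorted(chars)) + "]*"


def alignment_to_regex(columns):
    """Keep identical columns literally; generalise mismatches and gaps."""
    pattern, literal, varying = "", "", set()
    for x, y in columns:
        if x == y:  # aligned and identical
            if varying:
                pattern += char_class(varying)
                varying = set()
            literal += x
        else:  # mismatch or gap: a possible injection point
            if literal:
                pattern += re.escape(literal)
                literal = ""
            varying.update(c for c in (x, y) if c is not None)
    if literal:
        pattern += re.escape(literal)
    if varying:
        pattern += char_class(varying)
    return pattern


# Hypothetical genuine queries, used only to exercise the sketch.
q1 = "SELECT name FROM patient WHERE name like bob"
q2 = "SELECT id FROM patient WHERE name like alice"
rule = alignment_to_regex(needleman_wunsch(q1, q2))
print(rule)
```

By construction, the generated rule matches both parent queries in full, while a query containing a character that occurs in neither parent (for example a quote introduced by an injection) cannot match it.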
The result obtained from such an alignment can therefore be represented by a regular expression such as:

SELECT [a-z,]+ FROM patient WHERE name like [a-zA-Z]+

Needleman and Wunsch first suggested that a dynamic programming (DP) approach can be used to find the match with the highest reward. More details on how this is implemented can be found in [13].

The fitness function used to evaluate each individual takes into account the effectiveness of the particular regular expression (the number of times it fires), the level of specificity of the rule, and the overall effectiveness of the whole population. The fitness function is given by Equation 1, where I denotes a particular individual (regular expression), E_population denotes the fitness of the whole population, E_f the effectiveness of the regular expression (the number of times the rule fires), and E_s the level of specificity. The constants α, β, and γ normalize the overall score and balance the importance of each coefficient.
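The weighted sum behind the fitness function (Equation 1) can be sketched as follows. The sketch is illustrative only: the text does not give a formula for the level of specificity, so the ratio used here (literal positions divided by all aligned positions) is an assumption, as are the `Rule` type, the function names, and the default values of the constants.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical container for one individual of the population."""
    pattern: str  # regular expression describing a log line
    matches: int  # aligned positions kept literally
    gaps: int     # aligned positions generalised by a wildcard


def effectiveness(rule, log_lines):
    """E_f(I): number of log lines on which the rule fires."""
    compiled = re.compile(rule.pattern)
    return sum(1 for line in log_lines if compiled.fullmatch(line))


def specificity(rule):
    """E_s(I): assumed here to be the share of literal positions."""
    total = rule.matches + rule.gaps
    return rule.matches / total if total else 0.0


def fitness(rule, population, log_lines, alpha=0.2, beta=0.5, gamma=0.3):
    """E(I) = alpha * E_population + beta * E_f(I) + gamma * E_s(I)."""
    e_population = sum(effectiveness(r, log_lines) for r in population)
    return (alpha * e_population
            + beta * effectiveness(rule, log_lines)
            + gamma * specificity(rule))


# Toy log and rules, invented for this example.
log = ["SELECT a FROM t", "SELECT b FROM t", "DELETE FROM t"]
select_rule = Rule(r"SELECT [a-z]+ FROM t", matches=14, gaps=1)
delete_rule = Rule(r"DELETE FROM t", matches=13, gaps=0)
print(fitness(select_rule, [select_rule, delete_rule], log))
```

In this form, a rule that fires often raises both its own score and the population term, while the specificity term penalises rules that consist mostly of wildcards.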

\[ E(I) = \alpha E_{population} + \beta E_f(I) + \gamma E_s(I) \tag{1} \]

\[ E_{population} = \sum_{I \in Population} E_f(I) \tag{2} \]

The level of specificity indicates the balance between the number of matches and the number of gaps.

4 Experiments

In this section, our evaluation methodology is described. The SQL Injection Attacks are conducted on a PHP-based web service with state-of-the-art tools for service penetration and SQL injection. The traffic generated by the attacking tools is combined with normal traffic (genuine queries) in order to estimate the effectiveness of the proposed methods. The genuine queries are both man-made and generated by web crawlers.

The web service used for the penetration tests is a so-called LAMP (Linux + Apache + MySQL + PHP) server with a MySQL back-end. It is one of the most commonly used server configurations worldwide, and therefore it was used for validation purposes. The server was deployed on the Ubuntu Linux operating system. For the penetration tests, the example services written in PHP and shipped by default with the server are examined.

The attack injection methodology is based on the known SQL injection methods, namely: boolean-based blind, time-based blind, error-based, UNION query, and stacked queries. For that purpose, the sqlmap tool is used. It is an open source penetration testing tool that allows the user to automate the process of validating the tested services against SQL injection flaws.

To avoid double-counting the same attack patterns during the evaluation process, we first gathered the malicious SQL queries generated by sqlmap (several hundred different injection trials). After that, genuine traffic (generated by crawlers and during normal web service usage) was gathered. The data prepared in this way is used in the evaluation, the results of which are presented in Section 5.

5 Results

The conducted experiments were aimed at estimating the effectiveness of different tools commonly used for injection attack detection.
Namely, these are: Apache-Scalp (HTTP access log), Snort (HTTP packet content), ICD (HTTP access log), and the proposed SQL ADS (SQL DB log).
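To make the ICD detector listed above more concrete, the following minimal sketch trains an idealized character distribution on genuine requests and scores a new request with a chi-square statistic. It is not the implementation evaluated in this paper: the grouping of character ranks into six bins, the small floor on expected counts, the function names, and the toy requests are all assumptions of the sketch.

```python
from collections import Counter

# Bins over character ranks (most frequent character first); an assumption.
BINS = [(0, 1), (1, 4), (4, 7), (7, 12), (12, 16), (16, 256)]
EPS = 1e-6  # floor for expected counts, avoids division by zero


def sorted_distribution(text):
    """Relative byte frequencies in descending order, padded to 256 values."""
    data = text.encode("utf-8")
    counts = sorted(Counter(data).values(), reverse=True)
    total = len(data) or 1
    freqs = [c / total for c in counts]
    return freqs + [0.0] * (256 - len(freqs))


def binned(dist):
    """Aggregate a 256-value distribution into the rank bins."""
    return [sum(dist[lo:hi]) for lo, hi in BINS]


def train_icd(genuine_requests):
    """ICD: mean of the sorted character distributions of genuine requests."""
    dists = [sorted_distribution(r) for r in genuine_requests]
    mean = [sum(d[i] for d in dists) / len(dists) for i in range(256)]
    return binned(mean)


def chi_square(icd_bins, request):
    """Chi-square statistic between the request and the ICD (higher = odder)."""
    length = len(request.encode("utf-8"))
    observed = [b * length for b in binned(sorted_distribution(request))]
    expected = [max(p * length, EPS) for p in icd_bins]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Toy requests, invented for this example.
icd = train_icd(["/index.php?id=12", "/index.php?id=7", "/list.php?page=3"])
print(chi_square(icd, "/index.php?id=5"))
print(chi_square(icd, "/index.php?id=1 UNION SELECT 0x41414141414141414141"))
```

A request would be flagged when its statistic exceeds a threshold chosen on held-out genuine traffic; converting the statistic into a probability requires a chi-square distribution table or a statistics library, which is omitted here.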

It should be noted that neither Apache-Scalp nor Snort requires a learning phase, since the signatures of anomalous SQL queries (those showing symptoms of SQL injection) and of malicious HTTP requests are provided together with these tools. The signatures are developed by security experts in the form of regular expressions.

Table 1. Effectiveness of injection attack detection (shown separately for genuine and malicious requests).

                 SQL ADS   SNORT    ICD      SCALP
Attack           87.8%     66.3%    97.9%    50.9%
Genuine          97.7%     80.5%    94.5%    96.1%
Weighted Avg.    96.2%     78.3%    95.0%    89.0%

The ICD and SQL ADS require a dedicated learning phase and are trained only on genuine HTTP requests and SQL queries. The evaluation uses classic 10-fold cross-validation. As shown in Table 1, the proposed SQL ADS algorithm slightly outperforms the other state-of-the-art approaches when it comes to modelling genuine queries. For queries showing symptoms of an attack, SQL ADS is about 10 percentage points worse than ICD.

Another experiment investigated whether combining the above methods can further improve the overall effectiveness of injection attack detection. For that purpose, 10-fold cross-validation is used. The information obtained from SQL ADS, ICD, and SNORT is used to build a classifier for attack detection. The following classifiers were considered in this experiment: PART, NB (Naive Bayes), REPTree, J48, and RIDOR. The effectiveness of these classifiers is shown in Table 2. It can be noticed that the overall weighted average effectiveness increased (from 96% to 99%) when the proposed methods for injection attack detection are combined.

Table 2. Effectiveness of different classifiers (with SQL ADS).

           Attack   Genuine   Weighted Avg.
PART       0.967    0.995     0.991
NB         0.982    0.967     0.97
REPTree    0.955    0.997     0.99
J48        0.967    0.995     0.991
RIDOR      0.961    0.996     0.991
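The combination experiment can be illustrated with a small, self-contained sketch: each request is represented by the binary verdicts of the individual detectors, and a classifier is trained on these verdicts and evaluated with k-fold cross-validation. The sketch hand-codes a Bernoulli Naive Bayes classifier purely for illustration; it does not reproduce the classifiers or the data used in the experiments, and the verdict vectors below are synthetic.

```python
import math
import random


def train_nb(rows, labels):
    """Bernoulli Naive Bayes with Laplace smoothing over binary verdicts."""
    model = {}
    for cls in set(labels):
        subset = [r for r, y in zip(rows, labels) if y == cls]
        prior = len(subset) / len(rows)
        probs = [(sum(r[k] for r in subset) + 1) / (len(subset) + 2)
                 for k in range(len(rows[0]))]
        model[cls] = (prior, probs)
    return model


def predict_nb(model, row):
    """Return the class with the highest log-posterior for one verdict row."""
    def log_posterior(cls):
        prior, probs = model[cls]
        return math.log(prior) + sum(
            math.log(p if v else 1 - p) for p, v in zip(probs, row))
    return max(model, key=log_posterior)


def cross_validate(rows, labels, folds=10, seed=0):
    """Accuracy of the classifier under k-fold cross-validation."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test = idx[f::folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = train_nb([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_nb(model, rows[i]) == labels[i] for i in test)
    return correct / len(rows)


# Synthetic verdicts (sql_ads, icd, snort); 1 means "flagged as attack".
rows = ([(1, 1, 1)] * 20 + [(1, 1, 0)] * 10
        + [(0, 0, 0)] * 60 + [(0, 0, 1)] * 10)
labels = ["attack"] * 30 + ["genuine"] * 70
print(cross_validate(rows, labels))
```

The design point is that the classifier learns how much to trust each detector: in the synthetic data the third detector raises false alarms on genuine requests, and the trained model discounts its verdict accordingly.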

Table 3 shows that without SQL ADS the effectiveness of attack detection is worse, but it is still about 2 percentage points better than the strongest single detector (in this case ICD).

Table 3. Effectiveness of different classifiers (without SQL ADS).

           Attack   Genuine   Weighted Avg.
PART       0.902    0.981     0.969
NB         0.881    0.983     0.967
REPTree    0.887    0.984     0.969
J48        0.902    0.981     0.969
RIDOR      0.887    0.978     0.963

6 Conclusions

In this paper, an innovative correlation-based approach for injection attack detection was proposed. The described algorithm aims at efficient processing of the large volumes of data generated by web applications. The advantage of the proposed solution is that it allows existing efficient detectors for injection attack detection to be reused (e.g. SNORT, SCALP, character distribution approaches). Our experiments show that combining several weak injection attack detectors and applying machine learning techniques can improve overall effectiveness.

In this paper, we also proposed a novel evolutionary algorithm for modelling genuine traffic with regular expressions. The presented results show that the proposed algorithm, when combined with other approaches, can increase the effectiveness of injection attack detection. The experiments show that the proposed approach can achieve high effectiveness and can outperform other state-of-the-art approaches such as SNORT and SCALP.

References

1. CERT Polska Annual Report 2011. http://www.cert.pl/pdf/report CP 2011.pdf
2. SOPHOS homepage. http://www.sophos.com
3. Cisco Annual Report 2011.
4. OWASP Top 10 2010, The Ten Most Critical Web Application Security Risks. (2010)
5. Royal Navy Website Attacked by Romanian Hacker. http://www.bbc.co.uk/news/technology-11711478 (2010)
6. Mills, E., DSL Reports Says Member Information Stolen. (2011)
7. Keizer, G., Huge Web Hack Attack Infects 500,000 Pages. (2008)

8. Rao, T.K., Kum, G.Y., Reddy, E.K., Sharma, M., Major Issues of Web Applications: A Case Study of SQL Injection. Journal of Current Computer Science and Technology, Vol. 2, Issue 1, 16-20 (2012)
9. Halfond, W., Orso, A., AMNESIA: Analysis and Monitoring for Neutralizing SQL-Injection Attacks. Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (2005)
10. Tajpour, A., JorJor Zade Shooshtari, M., Evaluation of SQL Injection Detection and Prevention Techniques. CICSyN 2010 Second International Conference on Computational Intelligence, Communication Systems and Networks (2010)
11. Amirtahmasebi, K., Jalalinia, S.R., Khadem, S., A Survey of SQL Injection Defense Mechanisms. ICITST International Conference for Internet Technology and Secured Transactions (2009)
12. Elia, I.A., Fonseca, J., Vieira, M., Comparing SQL Injection Detection Tools Using Attack Injection: An Experimental Study. 2010 IEEE 21st International Symposium on Software Reliability Engineering (2010)
13. Needleman, S.B., Wunsch, C.D., A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology (1970)
14. Conrad, E., Detecting Spam with Genetic Regular Expressions. SANS Institute InfoSec Reading Room (2007)
15. Kruegel, C., Toth, T., Kirda, E., Service Specific Anomaly Detection for Network Intrusion Detection. Proceedings of the ACM Symposium on Applied Computing, pp. 201-208 (2002)
16. Choraś, M., Kozik, R., Puchalski, D., Hołubowicz, W., Correlation Approach for SQL Injection Attacks Detection. In: Herrero, A. et al. (Eds.), Advances in Intelligent Systems and Computing, 189, 177-186, Springer (2012)