Spam filtering using the Kolmogorov complexity analysis

G. Richard
IRIT, University of Toulouse, France
E-mail: richard@irit.fr

A. Doncescu*
LAAS-CNRS, University of Toulouse, France
E-mail: andrei.doncescu@laas.fr
*Corresponding author

Abstract: One of the most unwelcome side effects of e-commerce technology is the development of spamming as an e-marketing technique. Spam e-mails (or unsolicited commercial e-mails) are a burden for everybody with an electronic mailbox: detecting and filtering spam is a challenging task and many approaches have been developed to identify spam before it reaches the end user's mailbox. In this paper, we focus on a relatively new approach whose foundations rely on the works of A. Kolmogorov. The main idea is to give a formal meaning to the notion of information content and to provide a measure of this content. With such a quantitative approach, it becomes possible to define a distance, which is a major tool for classification purposes. To validate our approach, we proceed in two steps: first, we use the classical compression distance over a mix of spam and legitimate e-mails to check whether they can be properly clustered without any supervision; this turns out to be the case, highlighting a kind of underlying structure of spam e-mails. In the second step, we have implemented the k-nearest neighbours algorithm, providing an accuracy of about 85%.

Keywords: spam; Kolmogorov complexity; compression; clustering; k-nearest neighbours.

Reference to this paper should be made as follows: Richard, G. and Doncescu, A. (xxxx) 'Spam filtering using the Kolmogorov complexity analysis', Int. J. Web and Grid Services, Vol. X, No. Y, pp.000-000.
1 Introduction

Information security is one of the main conditions to ensure a safe electronic marketplace. This is achieved through the use of multiple techniques to prevent unauthorised use: encryption, authentication/password protection and login policies all provide some level of security. It is well known that, when accessing certain websites, we become obvious targets for spammers. Spam is just a (commercial) unsolicited e-mail and can be considered dangerous as soon as we reply to it. If we do not reply, it is only a kind of rubbish mail polluting our mailbox, and we need to spend some time separating relevant e-mails from spam. Simple calculations show that, with the increasing number of spam messages received every day, the cost for business in terms of time and money is not negligible, simply because while we are filtering our mailboxes, we do not perform the task we are supposed to do. That is why there is a real need for automatic filters.

Detecting and filtering spam is a difficult challenge due to the dynamic nature of junk e-mail. A truly effective spam filter must block the maximum amount of unwanted e-mail with the minimum number of false positives (messages wrongly identified as spam). Of course, this is further complicated by the fact that individual users have different views on what they consider to be spam; spam is definitely a personal subject. The most basic methods are quite similar in terms of philosophy (see Graham, 2002 for a survey): textual analysis of the content, blacklists of hosts and domains, etc. For instance, one compares the sender domain with a blacklist of domain names known to issue spam e-mails; the blacklist has to be regularly updated. This simple scheme can be improved by adding an analysis of the body of the e-mail, looking for specific keywords based on a dictionary approach. The user can define his personal dictionary and can tune the system for his personal use.

From a network infrastructure viewpoint, the system classifying the incoming e-mails can be:

• on the server side (SpamAssassin, for instance, is a well-known software package, generally provided with any standard Linux distribution); e-mails are then tagged when they arrive on the server, without taking into account the individual user profile

• on the client side; in that case, the filtering system is generally included as a plug-in within the mail client (Outlook Express, Thunderbird, Eudora, etc.).

It is good practice to set up anti-spam on the main e-mail server and to use an e-mail client that incorporates a filtering system. Whatever the chosen architecture, the problem of filtering spam e-mails remains a classification problem, and the methods coming from the machine learning field are quite effective. From this viewpoint, the main philosophy is to train a learner on a sample set of e-mails (the witness set) clearly identified as junk or not junk, and then to use the output of the system as a classifier for the next incoming e-mails. One of the most sophisticated approaches to deal with spam e-mail is the so-called Bayesian technique. This approach is based on the probabilistic notion of the Bayesian network (see Pearl and Russell, 2003 for an in-depth definition) and is considered one of the most efficient techniques, providing few false positives. It seems that the future of anti-spamming systems relies on a mix between traditional practices (blacklists, dictionaries, etc.)
and techniques coming from the field of machine learning.

When it comes to classification, the mathematical concept of distance is always
very helpful. From a heuristic viewpoint, if two objects are close to each other with regard to a suitable distance, they should be similar in some sense. Roughly speaking, if an e-mail m is close to a junk e-mail, then m is likely a junk e-mail as well. That is why there have been many attempts to define a distance suited to a particular situation. In the case of spam, these distances generally take into account parameters such as the domain of the sender, the occurrence of specific words, etc. An e-mail is then represented as a point in a vector space and, using the proper distance and the previous heuristic, we can identify spam e-mails.

Despite the fact that the origin of an e-mail can obviously give some hints about its status, we basically think that a spam e-mail is defined by its content. Thus, in our approach, we decide to focus on the body of an e-mail: an e-mail is classified as spam because the informative content of its body is (or is close to) the informative content of spam e-mail. So we are interested in a distance between two e-mails ensuring that, if they are close, they have a similar informative content. It remains to give a formal meaning to the concept of informative content. This is not a new problem, and the Kolmogorov (or Kolmogorov-Chaitin) complexity (see Kolmogorov, 1965) is a strong candidate for this purpose. The main idea behind the Kolmogorov complexity is to consider the informative content of a string s as the size of the ultimate compression of s, denoted K(s).

Surprisingly, complexity plays a critical role in security and can be broadly applied as the basis for security analysis and design. For instance, in the field of network security (Kulkarni and Bush, 2001; Bush and Stephen, 2002), the advantage of using complexity as a fundamental parameter to monitor system health and achieve information safety lies in its objectivity. Any given string (or file) has a Kolmogorov complexity without regard to the details of the system on which it is processed. The operating system, the protocol being used and the meaning or type of data represented by a particular string, while related to string complexity, need not be known in order to measure string complexity. Tools based upon this paradigm do not require detailed a priori information about known attacks, but rather compute vulnerability based upon an inherent, underlying property of information itself, namely, its Kolmogorov complexity.

In this paper, our aim is to investigate the use of the Kolmogorov complexity and its associated distance as a tool to classify junk e-mails:

• without any analysis of the content

• without any analysis of the header.

The remainder of this paper is structured as follows: in Section 2, we give a brief flavour of the Kolmogorov theory and its very meaning. In Section 3, we describe the main idea behind the practical applications of the Kolmogorov theory, which is the definition of a suitable information distance. Since K(s) is an ideal number, Section 4 explains how to estimate K and the relationship with compression techniques. In Section 5, we experiment with the compression distance to cluster a set of e-mails, a mixture of spam and legitimate e-mails: this shows that there is a common underlying structure in spam e-mails. In Section 6, we describe the implementation, our experiments and results, and we discuss our approach as well. Finally, we conclude in Section 7.
2 The Kolmogorov complexity

Starting from the fact that all the information we deal with in a computer is digitised, it is relevant to consider every object as a finite string of 0s and 1s. The Kolmogorov complexity K(x) is a measure of the descriptive complexity contained in an object or string x. It refers to the minimum length of a program such that a universal computer can generate the specific string x. A good introduction to the Kolmogorov complexity can be found in Bennett et al. (1998), with a solid treatment in Li and Vitanyi (1997). The Kolmogorov complexity is related to the Shannon entropy, in that the expected value of K(x) for a random sequence is approximately the entropy of the source distribution for the process generating the sequence. However, the Kolmogorov complexity differs from entropy, in that it relates to the specific string being considered rather than to the source distribution.

The Kolmogorov complexity can be roughly described as follows, where T represents a universal computer (the Turing machine), p represents a program and x represents a string: K(x) is the size |p| of the smallest program p such that T(p) = x. We can thus consider p as the essence of x: there is no shorter program than p to output x, and we can view p as the most compressed form of x. The size of p, i.e., K(x), can thus be considered as the ultimate amount of information contained in x. Having said that, K(x) becomes the lower bound of all the possible compressions of x: the Kolmogorov complexity of a string x should be considered as the ultimate compression size of the string x. Since all data can be coded as a binary string, we can say that every piece of data has a Kolmogorov complexity, which is the complexity of its digital representation.

Random strings have a rather high Kolmogorov complexity, on the order of their length, as no patterns can be discerned to reduce the size of a program generating such a string. On the other hand, strings with a large amount of structure have a fairly low complexity. Universal computers can be equated through programs of constant length; thus, a mapping can be made between universal computers of different types, and the Kolmogorov complexity of a given string on two computers differs by known or determinable constants. In some sense, this complexity is a kind of universal characteristic attached to a given piece of data.

For a given piece of data x, K(x) is a strange number which cannot be computed, just because it is an ideal lower limit related to an ideal machine (the universal Turing machine or calculator). This is a major difficulty, which can be overcome by using the fact that the length of any program producing x is an upper bound of K(x). We will develop this fact in Section 4. But now, we have to understand how we can build a suitable distance starting from K; this is the aim of the next section.
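As an aside (not part of the original experiments), this upper-bound property is easy to observe with any off-the-shelf lossless compressor. The short Python sketch below, using the standard zlib module purely for illustration, shows that a highly structured string compresses to a small fraction of its length, while a random string of the same length barely compresses at all:

import os
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data: an upper bound of K(data)."""
    return len(zlib.compress(data, 9))

structured = b"spam" * 2500          # 10,000 bytes with an obvious pattern
random_like = os.urandom(10000)      # 10,000 bytes with (almost) no pattern

print(compressed_size(structured))   # small: a short "program" regenerates it
print(compressed_size(random_like))  # about 10,000 or slightly more: no real compression possible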
3 Information distance

As previously explained, from a data mining or knowledge discovery viewpoint, mathematical distances are powerful tools to classify new data. In our case, given an incoming e-mail s, we want to compute the distance between s and another e-mail s', previously classified as junk or not junk; if this distance is sufficiently small, we could decide that s belongs to the same class as s' and transmit this information to the end user. Starting from the Kolmogorov complexity, it is relatively easy to define a distance called the information distance or Bennett's distance (see Bennett et al., 1998 for a complete study). The informal idea is as follows: given two strings or files a and b, we can write:
K(a) = K(a\b) + K(a∩b)
K(b) = K(b\a) + K(a∩b).    (1)

The first equation simply says that the complexity of a (its information content) is the sum of the proper information content of a, denoted K(a\b), and of the content it has in common with b, denoted K(a∩b). Now, we can concatenate a with b and we get a new file denoted a.b. If we compute K(a.b), we get:

K(a.b) = K(a\b) + K(b\a) + K(a∩b)    (2)

since there is no redundancy left by the Kolmogorov compression. So, the number:

m(a, b) = K(a) + K(b) - K(a.b)    (3)

is just a measure of the information content common to a and b. This is a very approximate reasoning leading to two very formal notions:

d(a, b) = K(a.b) - min(K(a), K(b))    (4)

and

d_N(a, b) = (K(a.b) - min(K(a), K(b))) / max(K(a), K(b))    (5)
          = 1 - m(a, b) / max(K(a), K(b)).    (6)

Both of them are distances, but d_N is normalised and takes into account the relative size of a and b. Let us understand the very meaning of d_N. If a = b, then K(a) = K(b) and m(a, b) = K(a), thus d_N(a, b) = 0. At the opposite extreme, if a and b do not have any common information, a∩b is empty and K(a.b) = K(a) + K(b), thus m(a, b) = 0 and then d_N(a, b) = 1. Formally speaking, d_N is a (semi) distance over the set of finite strings: d_N is called the normalised information distance or the Bennett distance. If d_N(a, b) is very small, it means a and b are very similar. If d_N(a, b) is close to 1, it means a and b have very little information in common. This is exactly what we need to start with in classification. Let us examine in the next section how we can use K in real life.
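To fix ideas, here is a purely hypothetical numerical example (the values are illustrative and are not measurements from our experiments): take K(a) = 1000 bits, K(b) = 800 bits and K(a.b) = 1200 bits. Equation (3) gives m(a, b) = 1000 + 800 - 1200 = 600 bits of shared information, and equation (6) gives d_N(a, b) = 1 - 600/1000 = 0.4: the two files overlap substantially but are far from identical. If instead K(a.b) = 1800 (almost nothing in common), then m(a, b) = 0 and d_N(a, b) = 1.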
4 Complexity estimation

Now that we have a formal tool, it remains to see how it can be practically used for real business purposes. As discussed above, the exact measurement of the Kolmogorov complexity is not achievable, but if we remember that K(x) is the lower limit of all the compressions of x, we understand that every compression C(x) of x gives an estimation of K(x). Back in the theoretical framework, a decompression algorithm is considered as our universal Turing machine T, such that:

T(C(x)) = x.    (7)

It is now clear that we are looking for lossless compression algorithms and, fortunately, we have a variety of such algorithms coming from the compression community. It is out of the scope of this paper to investigate that field and we will just give a quick overview of some classical compression tools available on the market. On one side, we have the formal algorithms: Lempel-Ziv-Welch (LZW) (Welch, 1984), Huffman (Huffman, 1952) and Burrows-Wheeler (Burrows and Wheeler, 1994). All of them are lossless compression techniques, which is exactly what we need. On the other side, we have the real implementations, adding some clever manipulations to the previous algorithms: the Unix compress utility, based on LZW; zip and gzip, which combine LZ and Huffman encoding (for instance, Adobe Reader contains an implementation of LZW); and bzip2, which first applies the Burrows-Wheeler transform, then Huffman coding. In terms of the compression ratio, we generally have the following inequality, where LZMA is a variant of LZ and each term denotes the size of the corresponding compressed file:

0 < K(x) < LZMA(x) < bzip2(x) < gzip(x).    (8)

Instead of dealing with the ideal number K(x), we will deal with the size of the compressed file. The better the algorithm compresses the data, the better the estimation of K(x). When replacing K(x) by C(x), where C is one of the previous compressors, it becomes quite clear that the previous information distance is not a true mathematical distance anymore, but it remains sufficient for our purpose. Today, LZ-based algorithms are considered among the most efficient ones for text compression and, as a consequence, they are the suitable tools for our implementation.

5 Compression for clustering

Before going on with our idea, we need to have at least a flavour of the power of the compression distance to discriminate between legitimate and spam e-mails without any supervision. To do that, we use the technique of Cilibrasi and Vitanyi (2005), which we briefly recall here in the context of spam filtering:

1 we start from a mix of 50 spam and legitimate e-mails

2 for each couple of e-mails (m, m'), we compute the compression distance d(m, m') and we get a symmetric distance matrix

3 we provide a two-dimensional graphic representation of the 50 e-mails as a binary tree respecting the relative distances between the e-mails.

Since the distances cannot be respected exactly in a two-dimensional view, the generated tree is associated with a confidence level (a number between 0 and 1) describing how well the tree represents the original data. There is no real difficulty here, since we use the works and tools of Cilibrasi and Vitanyi (2005) to handle our data. It is well known that clustering using a quartet tree heuristic is not tractable and it took around 20 hours (on a standard laptop) to get the tree below, where our test set contains 50 e-mails. For Point 1, we have an e-mail folder containing 50 e-mails, 25 spam and 25 legitimate e-mails. For Point 2, we use the ncd tool: ncd -d mail50 mail50 dist50.txt. The text file, dist50.txt, contains the distance matrix. This process is quite immediate. For Point 3, we use maketree: maketree dist50.txt. We then get a file, treefile.dot, which is a symbolic textual representation of the quartet tree T plus its confidence level S(T). We stop after 20 hours, when S(T) = 0.917, which is quite acceptable. To provide a two-dimensional view, we used Graphviz¹ (mainly developed within the AT&T lab). The tree we got is shown below, where we have darkened the spam e-mails. We can clearly distinguish four homogeneous areas (delimited with rectangles), although the last one (without delimitation) is quite mixed. Of course, it would be interesting to run the same experiment with a bigger test set, but the execution time would be prohibitive.
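For readers who wish to reproduce this step without the CompLearn tools, the sketch below (written in Python for brevity, whereas our own implementation is in C; the file names are hypothetical and bzip2 simply stands in for any lossless compressor) computes the normalised compression distance of equation (5) and fills the pairwise distance matrix:

import bz2
from itertools import combinations

def csize(data: bytes) -> int:
    """Compressed size C(x), used as an estimate of K(x)."""
    return len(bz2.compress(data))

def ncd(a: bytes, b: bytes) -> float:
    """Normalised compression distance, equation (5)."""
    ca, cb, cab = csize(a), csize(b), csize(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

# Hypothetical corpus: e-mail bodies read as raw bytes.
mails = {name: open(name, "rb").read()
         for name in ["mail001.txt", "mail002.txt", "mail003.txt"]}

# Symmetric distance matrix, analogous to the one produced by the ncd tool.
dist = {(m, m): 0.0 for m in mails}
for m1, m2 in combinations(mails, 2):
    dist[(m1, m2)] = dist[(m2, m1)] = ncd(mails[m1], mails[m2])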
Anyway, we consider this picture, highlighting a basic unsupervised clustering of our test set, as sufficiently indicative to implement a classification process using the compression distance technique.

Figure 1  Quartet tree obtained over the 50 test e-mails (spam e-mails darkened)

6 Results

In order to validate our approach, we start from a classical k-nearest neighbours (k-nn) algorithm, where all neighbours equally contribute to the determination of the value of the class variable. Using the D'Amato et al. (2004) extension, we weigh the contribution of each of the k-nn according to their distance from the new case, giving greater weight to the closer neighbours. The witness set S is constituted of both spam and normal e-mails. Let us describe our definition a little bit more: given an e-mail m_i in the witness set, we compute d(m, m_i) and, if d(m, m_i) ≠ 0, we define w(m_i) = 1/d(m, m_i) (this is the weight of the mail m_i). Now, we have to distinguish between two cases:

1 there are some neighbours m_i such that d(m, m_i) = 0, and then

P(m is spam) = (1/k) * #{m_i | d(m, m_i) = 0 and m_i is spam}.

Of course,

P(m is legitimate) = (1/k) * #{m_i | d(m, m_i) = 0 and m_i is legitimate} = 1 - P(m is spam)

2 the k-nn m_i of m are all such that d(m, m_i) ≠ 0, and then we compute the probability for m to be spam as:

P(m is spam) = A/B

where

A = (1/k) * (Σ w(m_i) for m_i spam) * #{m_i | m_i is spam}

B = A + A', with A' = (1/k) * (Σ w(m_i) for m_i legitimate) * #{m_i | m_i is legitimate}.

It is clear that P(m is legitimate) = A'/B. Normalisation by the summation B is required to guarantee that the above measure P satisfies the properties of a probability measure. By weighting the distances, we ensure that very distant examples will have little effect on the classification of the incoming e-mail m. When m has some null-distance neighbours, we consider only these neighbours because they are viewed as the best representatives of the class of m. As a complexity estimator, we use the LZW algorithm, whose C source code is freely available.² This technique performs without information loss and is quite good in terms of runtime efficiency.
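The decision rule above can be summarised by the following sketch (again in Python for readability, reusing the hypothetical ncd() helper from Section 5; it illustrates the weighted rule rather than reproducing our C program):

def classify(mail: bytes, witnesses: list[tuple[bytes, bool]], k: int = 9) -> float:
    """Return P(mail is spam) following the two-case weighted k-nn rule.

    witnesses: list of (body, is_spam) pairs; ncd() is the distance used.
    """
    # k nearest witnesses according to the compression distance
    neigh = sorted(((ncd(mail, body), is_spam) for body, is_spam in witnesses))[:k]

    zero = [is_spam for dist, is_spam in neigh if dist == 0]
    if zero:  # case 1: some witnesses are at distance 0
        return sum(zero) / k

    # case 2: weight each neighbour by the inverse of its distance
    spam_w = [1 / d for d, is_spam in neigh if is_spam]
    legit_w = [1 / d for d, is_spam in neigh if not is_spam]
    a = (1 / k) * sum(spam_w) * len(spam_w)
    a_prime = (1 / k) * sum(legit_w) * len(legit_w)
    return a / (a + a_prime)

The incoming e-mail would then be tagged as spam when, say, classify(new_mail, witness_set) exceeds 0.5.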
Our implementation gives the choice between the weighted version of k-nn and the classical one, despite the fact that the weighted one is generally more accurate. Basically, starting from an incoming e-mail, we compute its k-nn belonging to a set of witness e-mails, previously classified as junk/not junk. Then, we choose the class (according to the previous weighted formula) as the class for the incoming e-mail. The distance we use is just the information distance estimated through LZW compression. Our algorithm is implemented in C and our programs (source and executable) are freely available on request.

For the first experiment, with the non-normalised distance d and the previous software, our experimentation has been done with a set of 150 incoming e-mails to classify. We then performed four series of tests using four different witness sets: 25 junk/25 legitimate, 50 junk/50 legitimate, 75 junk/75 legitimate and 100 junk/100 legitimate. The only parameter to be tuned is k, and we choose k as 11, 21, 31, 41, 51, 61, 71, 81 and 91 when we have enough witness e-mails; that is why, with only 50 witness e-mails, we go up to 49 only. We provide only the weighted version results, since the other version of k-nn (without weights) gives lower accuracy. Below, we provide the four graphics which summarise our results:

Figure 2: 25 junk/25 legitimate witness e-mails
Figure 3: 50 junk/50 legitimate witness e-mails
Figure 4: 75 junk/75 legitimate witness e-mails
Figure 5: 100 junk/100 legitimate witness e-mails.

Figure 2  50 witness e-mails, 150 checked e-mails (accuracy, false positives and false negatives against the value of k)
Figure 3  100 witness e-mails, 150 checked e-mails (accuracy, false positives and false negatives against the value of k)

Figure 4  150 witness e-mails, 150 checked e-mails (accuracy, false positives and false negatives against the value of k)
Figure 5  200 witness e-mails, 150 checked e-mails (accuracy, false positives and false negatives against the value of k)

The visual inspection of our curves provides clear evidence that the information distance really captures underlying semantics. Since our samples contain e-mails in diverse languages (French, English), we can think that there is a kind of common underlying information structure for spam. Except in one case (k = 41 and 200 witness e-mails), the accuracy rate remains in the range of 80%. Let us note that the runtime is about 14 min to classify 150 e-mails (on a standard PC) when we have 200 witness e-mails. This runtime does not really rely on K, since we just have a table of distances to check: the main influence comes from the size of the witness set (here, 200 e-mails). To compute the distances, we have to open and then close 200 files exactly 150 times, and this is quite a time-consuming operation. When checking 150 e-mails with a witness set of 50 spam and 50 legitimate e-mails, the runtime becomes more acceptable, at around 6 min. Of course, this kind of algorithm, plugged into an e-mail client (Thunderbird, Outlook, etc.), will have to classify incoming e-mail on the fly and, in that case, this is quite a fast process.

The second experiment was run with the normalised distance d_N. Based on our previous batch, we decided to work on a fixed set of 100 witnesses and 150 incoming e-mails to filter. Our sets (both witness and test) are different from those of the previous experiment. We have optimised our process in two ways:

1 from a theoretical viewpoint, k can vary between 1 and the cardinality of the witness set; an empirical limit is just the square root of the size of the witness set, in our case exactly 10, so we decided to experiment with k = 7, 9, 11, 13 and 15, which are numbers in the range of 10

2 we modified the compression distance to take into account the relative size of the e-mails: instead of using the compression distance d, we use the normalised compression distance d_N.

Below is a graphical view of our results (the weighted and nonweighted versions provide the same results).
Figure 6  100 witness e-mails, 150 checked e-mails, normalised distance (accuracy, false positives and false negatives against the value of k)

It is quite clear that the normalised distance, coupled with a clever choice of k, performs better than the standard one. With regard to the previous experiments, the most interesting features are:

• the accuracy rate is now around 86%

• the number of false positives is now below the number of false negatives, which is a sought-after feature for a spam filtering system

• the best accuracy rate is obtained with k = 9 and decreases when k increases; there is no need to look for more neighbours than the square root of the size of the witness set.

It is interesting to note that the weighted version of k-nn does not perform better than the nonweighted one. At the moment, it is not clear whether this is due to our particular sample set or whether this is a general property when using the normalised distance. We now have to investigate in order to find a kind of optimal number of witness e-mails. There is probably no universal theoretical value for such a parameter, but with more experiments, we should be able to provide evidence for a realistic empirical value.

7 Conclusion

Despite their relative novelty, complexity-based approaches have proved quite successful in numerous fields of IT: data mining, classification, clustering, fraud detection, etc. Within the field of IT security, we can refer to the works of Kulkarni and Bush (2001), Bush and Stephen (2002; 2003) and, more recently, Wehner (2006): these works are mainly devoted to analysing the data flow coming from network activities and detecting an intrusion or a virus. In this paper, we investigate the field of anti-spamming and we consider the Kolmogorov complexity K(m) of an e-mail m as a tool to classify it as junk or legitimate. Instead of dealing with a sophisticated clustering algorithm, we have implemented a weighted version of k-nn using the information distance deduced
from K. Despite the fact that our tests have been limited, the first results we have obtained clearly show that the compression distance is meaningful in that situation. This does not completely come as a surprise: the seminal works of Bennett et al. (1998) and Cilibrasi and Vitanyi (2005) have shown the way and have highlighted the real power of compression methods. In our context, one of the main advantages is that there is no need to deeply analyse the diverse parts of the e-mail (header and body). We can, of course, refine our algorithm, but simply having a more powerful compression technique would probably bring more accurate results. Of course, multiple complexity estimators, i.e., compressors, could be combined in order to improve the discrimination accuracy. Another advantage of this kind of blind technique is that there is no need to update a dictionary or a blacklist of domain names. The system automatically updates itself as the witness base (i.e., the database of junk/legitimate e-mails, which is basically unique to a given user) evolves over time: it can be trained on a per-user basis, like Bayesian spam filtering. Another similarity with Bayesian filtering is that complexity-based methods do not bring any explanation of their results. Fortunately, this is generally not a requirement for an e-mail filtering system, so in this particular case, it cannot be considered a drawback. This is definitely a point that should be investigated more deeply in the future.

Finally, in order to quickly validate the complexity approach for spam filtering, we have chosen to use a k-nn algorithm as a classifier. Nevertheless, it is well known that more sophisticated approaches could be used, for instance, the support vector machine technique (see Burges, 1998), which is known to be very successful for classification purposes. It is quite clear that these kinds of approaches are not the ultimate Grail, but we strongly believe that, when mixed with other classical strategies, they will definitely bring us a step further in the field of spam filtering. It remains for us to investigate how to combine complexity with the other information (header, domain, etc.) to provide a more accurate system and to ultimately tend towards a spam-free internet.

References

Bennett, C., et al. (1998) 'Information distance', IEEE Transactions on Information Theory, Vol. 44, No. 4, pp.1407-1423.

Burges, C.J.C. (1998) 'A tutorial on support vector machines for pattern recognition', Data Mining and Knowledge Discovery, Vol. 2, No. 2, pp.121-167.

Burrows, M. and Wheeler, D. (1994) A Block Sorting Lossless Data Compression Algorithm, Technical Report 124, Digital Equipment Corporation.

Bush, S. and Stephen, F. (2002) 'Active virtual network management prediction: complexity as a framework for prediction, optimization, and assurance', Proceedings of the 2002 DARPA Active Networks Conference and Exposition, San Francisco, CA: IEEE Computer Society Press, pp.534-553.

Bush, S. and Stephen, F. (2003) 'Extended abstract: complexity and vulnerability analysis', Complexity and Inference, 2-5, DIMACS Center, Rutgers University.

Cilibrasi, R. and Vitanyi, P. (2005) 'Clustering by compression', IEEE Transactions on Information Theory, Vol. 51, No. 4.

D'Amato, C., et al. (2004) 'Extending the knn classification algorithm to symbolic objects', Proceedings of ECML/PKDD, Italy, http://citeseer.ist.psu.edu.

Graham, P. (2002) 'A plan for spam', http://www.paulgraham.com/spam.html.

Huffman, D.
(1952) 'A method for the construction of minimum redundancy codes', Proceedings of the IRE.
Kolmogorov, A.N. (1965) 'Three approaches to the quantitative definition of information', Problems of Information Transmission, Vol. 1, No. 1, pp.1-7.

Kulkarni, P. and Bush, S. (2001) 'Active network management and Kolmogorov complexity', OpenArch, Anchorage, Alaska.

Li, M. and Vitanyi, P. (1997) An Introduction to Kolmogorov Complexity and Its Applications, Springer-Verlag, ISBN: 0-387-94053-7.

Pearl, J. and Russell, S. (2003) 'Bayesian networks', in M.A. Arbib (Ed.) Handbook of Brain Theory and Neural Networks, Cambridge, MA: MIT Press, pp.157-160.

Wehner, S. (2006) 'Analyzing worms and network traffic using compression', http://arxiv.org/abs/cs.cv/0504045.

Welch, T. (1984) 'A technique for high-performance data compression', IEEE Computer, Vol. 17, No. 6.

Notes

1 http://www.graphviz.org/
2 http://www.complearn.org