On Using SVM and Kolmogorov Complexity for Spam Filtering
Proceedings of the Twenty-First International FLAIRS Conference (2008)

On Using SVM and Kolmogorov Complexity for Spam Filtering

Sihem Belabbes² and Gilles Richard¹,²
¹ British Institute of Technology and E-commerce, Avecina House, Romford Road, London E7 9HZ, UK
² Institut de Recherche en Informatique de Toulouse, 118 Rte de Narbonne, Toulouse, France
[email protected], {belabbes,richard}@irit.fr

Abstract

As a side effect of e-marketing strategies, the number of spam emails is rocketing, and so are the time and cost needed to deal with spam. Spam filtering is one of the most difficult tasks among the diverse kinds of text categorization, a sad consequence of spammers' dynamic efforts to escape filtering. In this paper, we investigate the use of Kolmogorov complexity theory as a backbone for spam filtering, avoiding the burden of text analysis and of keyword and blacklist updates. Exploiting the fact that we can estimate a message's information content through compression techniques, we represent an email as a multidimensional real vector and then implement a support vector machine classifier to classify new incoming emails. Our first results exhibit interesting accuracy rates and emphasize the relevance of our idea.

Introduction

Detecting and filtering spam emails faces a number of complex challenges due to the dynamic and malicious nature of spam. A truly effective spam filter must block the maximum of unwanted emails while minimizing the number of legitimate messages wrongly identified as spam (namely false positives). However, individual users may not share the same view of what a spam really is. A great deal of existing methods, generally rule-based, proceed by checking the content of incoming email looking for specific keywords (dictionary approach), and/or by comparing with blacklists of hosts and domains known to issue spam (see (Graham 2002) for a survey). In the first case, the user can define his own dictionary, thus adapting the filter to his own use. In the second, the blacklist needs to be regularly updated. In any case, getting rid of spam remains a classification problem, and it is quite natural to apply machine learning methods. The main idea on which they rely is to train a learner on a sample email set (known as the witness or training set) clearly identified as spam or ham (legitimate email), and then to use the system output as a classifier for the next incoming emails. Among the most successful approaches for dealing with spam is the so-called Bayesian technique, which is based on the probabilistic notion of Bayesian networks (Pearl & Russell 2003) and has so far proved to be a powerful filtering tool. First described by P. Graham (Graham 2002), Bayesian filters block over 90% of spam. Due to their statistical nature they are adaptive, but they can suffer statistical attacks achieved simply by adding a huge number of relevant ham words to a spam message (see (Lowd & Meek 2005)). Next-generation spam filtering systems may thus depend on merging rule-based practices (like blacklists and dictionaries) with machine learning techniques. A useful underlying classification notion is the mathematical concept of distance. Roughly speaking, it relies on the fact that when two objects are close to one another, they share some similarity. In a spam filtering setting, if an email m is close to a junk email, then m is likely junk as well. Such a distance takes parameters like the sender's domain or the occurrences of specific words.
An email is then represented as a point in a vector space, and spam is identified by simply implementing a distance-based classifier. At first glance, an email can be identified as spam by looking at its origin. We definitely think that this can be refined by considering the whole content, including header and body. An email is classified as spam when its informative content is similar or close to that of another spam. So we are interested in a distance between two emails ensuring that, when they are close, we can conclude that they have similar informative content, without distinction between header and body. One can wonder about the formal meaning of this informative content concept. Kolmogorov (also known as Kolmogorov-Chaitin) complexity is a strong tool for this purpose. In its essence, Kolmogorov's work considers the informative content of a string s to be the size of the ultimate compression of s, noted K(s). In this paper we aim at showing that Kolmogorov complexity and its associated distance can be pretty reliable in classifying spam. Most important is that this can be achieved:
- without any body analysis,
- without any header analysis.
The originality of our approach is the use of support vector machines for classification, such that we represent every email in the training set and in the testing set as a multidimensional real vector. This can be done by considering a set of typical emails previously identified as spam or ham.
The rest of this paper is organized as follows: we briefly review Kolmogorov theory and its very meaning for a given string s. Then we describe the main idea behind practical applications of Kolmogorov theory, which is defining a suitable information distance. Since K(s) is an ideal number, we explain how to estimate K and the relationship with commonly used compression techniques. We then exhibit our first experiment with a simple k-nearest-neighbors algorithm and show why we move to more sophisticated tools. After introducing support vector machine classifiers, we describe our multidimensional real vector representation of an email. We exhibit and comment on our new results. We discuss related approaches before concluding and outlining future perspectives.

What is Kolmogorov complexity?

The data a computer deals with are digitalized, namely described by finite binary strings. In our context we focus on pieces of text. Kolmogorov complexity K(s) is a measure of the descriptive complexity contained in an object or string s. A good introduction to Kolmogorov complexity is contained in (Kolmogorov 1965), with a solid treatment in (Li & Vitányi 1997; Kirchherr, Li, & Vitányi 1997). Kolmogorov complexity is related to Shannon entropy (Shannon 1948) in that the expected value of K(s) for a random sequence is approximately the entropy of the source distribution for the process generating the sequence. However, Kolmogorov complexity differs from entropy in that it relates to the specific string being considered rather than to the source distribution. Kolmogorov complexity can be roughly described as follows, where T represents a universal computer (Turing machine), p represents a program, and s represents a string:

K(s) = min { |p| : T(p) = s }

that is, K(s) is the size |p| of the smallest program p such that T(p) = s. Thus p can be considered the essence of s, since there is no way to get s with a shorter program than p. It is logical to consider p as the most compressed form of s. The size of p, K(s), is a measure of the amount of information contained in s. Consequently, K(s) is the lower bound of all possible compressions of s: it is the ultimate compression size of the string. Random strings have rather high Kolmogorov complexity, on the order of their length, as no patterns can be discerned to reduce the size of a program generating such a string. On the other hand, strings with a large amount of structure have fairly low complexity and can be massively compressed. Universal computers can be equated through programs of constant length; thus a mapping can be made between universal computers of different types, and the Kolmogorov complexity of a given string on two computers differs by known or determinable constants. In some sense, this complexity is a kind of universal characteristic attached to a given piece of data. For a given string s, K(s) is a strange number which cannot be computed, just because it is an ideal lower limit related to an ideal machine (a universal Turing machine). This is a major difficulty, which can be overcome by using the fact that the length of any program producing s is an upper bound on K(s). We develop this fact in a further section. But first we need to understand how to build a suitable distance starting from K. This is our aim in the next section.
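Before formalizing the distance, it is worth seeing the upper-bound idea at work. The following is a minimal illustration (not from the paper), assuming Python's standard bz2 module as the compressor; the paper's actual estimator choice is discussed later. A highly regular string compresses to a tiny fraction of its length, while a random string barely compresses at all:

```python
import bz2
import os

def complexity_bound(s: bytes) -> int:
    """Length of a lossless compression of s: an upper bound on K(s)."""
    return len(bz2.compress(s))

structured = b"spam ham " * 1000    # 9000 bytes with an obvious pattern
random_bytes = os.urandom(9000)     # 9000 bytes with (almost surely) no pattern

print(complexity_bound(structured))    # small: the pattern is factored out
print(complexity_bound(random_bytes))  # close to 9000: nothing to exploit
```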
Information distance

When considering data mining and knowledge discovery, mathematical distances are powerful tools for classifying new data. The theory around Kolmogorov complexity helps to define a distance called the information distance or Bennett's distance (see (Bennett et al. 1998) for a complete study). The informal idea is that given two strings or files a and b, we can write:

K(a) = K(a \ b) + K(a ∩ b)
K(b) = K(b \ a) + K(a ∩ b)

It is important to point out that these two equations do not belong to the theoretical framework but are approximate formulas allowing one to understand the final computation. The first equation says that a's complexity (its information content) is the sum of a's proper information content, denoted a \ b, and the content common with b, denoted a ∩ b. Concatenating a with b yields a new file, denoted a.b, whose complexity is:

K(a.b) = K(a \ b) + K(b \ a) + K(a ∩ b),

since there is no redundancy with Kolmogorov compression. So the following number:

m(a, b) = K(a) + K(b) − K(a.b) = K(a ∩ b)

is a relevant measure of the information content common to a and b. Normalizing this number, in order to avoid side effects due to string sizes, helps in defining the information distance:

d(a, b) = 1 − m(a, b) / max(K(a), K(b))

Let us understand the very meaning of d. If a = b, then K(a) = K(b) and m(a, b) = K(a), thus d(a, b) = 0. On the opposite side, if a and b do not have any common information, m(a, b) = 0 and then d(a, b) = 1. Formally, d is a metric over the set of finite strings, satisfying d(a, a) = 0, d(a, b) = d(b, a) and d(a, b) ≤ d(a, c) + d(c, b). d is called the information distance or the Bennett distance. If d(a, b) is very small, a and b are very similar. d(a, b) close to 1 means a and b have very little information in common. This is exactly what we need to start with classification. As we understand now, the basic quantity we need to compute or estimate for a given string s is K(s). This is addressed in the next section.

Kolmogorov complexity estimation

In this section we are interested in adapting the formal tool introduced so far to real-world applications. We have mentioned above that the exact value of Kolmogorov complexity can never be computed. Still, K(s) being the lower limit of all possible compressions of s means that every compression C(s) of s approximates K(s). A decompression algorithm is considered as our universal Turing machine T, such that T(C(s)) = s. Hence it is obvious that we focus our attention on lossless compression algorithms, so as to preserve the validity of the previous equality.
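As a concrete reading of the formulas above, here is a minimal sketch (not the authors' C implementation) of the compression-estimated information distance, again assuming bz2 as a stand-in for the ideal compressor:

```python
import bz2

def C(s: bytes) -> int:
    """Compressed size of s, used as an estimate of K(s)."""
    return len(bz2.compress(s))

def information_distance(a: bytes, b: bytes) -> float:
    """Estimate d(a, b) = 1 - m(a, b) / max(K(a), K(b)),
    with m(a, b) = K(a) + K(b) - K(a.b) and a.b the concatenation of a and b."""
    ca, cb, cab = C(a), C(b), C(a + b)
    m = ca + cb - cab  # estimate of the shared information content K(a ∩ b)
    return 1.0 - m / max(ca, cb)
```

Note that with a real compressor d(a, a) is not exactly 0; this point is revisited in the conclusion.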
Fortunately, a variety of such algorithms exists, coming from the compression community. Discussing those tools falls out of the scope of this paper; we briefly overview some classical compression tools available on the market. On one side there are formal lossless compression algorithms: LZW, standing for Lempel-Ziv-Welch (Welch 1984), Huffman (Huffman 1952), and Burrows-Wheeler (Burrows & Wheeler 1994). On the other side there are real implementations adding some clever manipulations to the previous algorithms:
- the Unix compress utility, based on LZ, which is a less elaborate version of LZW;
- zip and gzip, which combine LZ and Huffman encoding (for instance, Adobe Reader contains an implementation of LZW);
- bzip2, which first applies the Burrows-Wheeler transform and then Huffman coding.
The bzip2 tool is known to achieve interesting compression ratios, and we choose it to approximate Kolmogorov complexity. In the following, instead of dealing with the ideal number K(s) where s is a file, we deal with the size of the corresponding compressed file. The better the algorithm compresses the data s, the better the estimation of K(s). When replacing K(s) by C(s), it becomes clear that the previous information distance is no longer a true mathematical distance, but it remains sufficient for our purpose.

Our first tests with k-NN

In order to quickly validate our approach, we start from a classical k-nearest-neighbors (k-NN) algorithm where all neighbors of a given incoming email contribute equally to determining the class to which it belongs (either spam or ham). The training set S is constituted of both spam and ham emails. Complexity estimation is obtained with the bzip2 compression tool, whose C source code is freely available and easy to integrate. This technique performs without information loss and is quite good in terms of runtime efficiency, which is exactly what we need. Basically, starting from an incoming email, we compute its k nearest neighbors among a set of witness emails previously classified as spam/ham. Then we choose the most frequent class among those k neighbors as the class for the incoming email. The distance we use is just the information distance estimated through bzip2 compression, as in the sketch after this paragraph. Our algorithm is implemented in C, and our programs (source and executable) are freely available on request. Using this software, our experimentation has been run over a set of 200 new emails to classify. We performed 4 test series using 4 different training sets with the following cardinalities: 25 spam/25 ham, 50 spam/50 ham, 75 spam/75 ham, 100 spam/100 ham. The only parameter we tune is k, for which we choose the values 11, 21, 31, 41, 51, 61, 71. As expected, when increasing k beyond a certain threshold, the quality of the results decreased. That is why we finally choose the value of k as the square root of the training set cardinality. This is just an empirical value which works well. Below we provide only one graphic (with 200 checked emails) which gives an intuition of our results:

Figure 1: 50 spam/50 ham witness emails.

Visual inspection of our curve provides clear evidence that the information distance really captures an underlying semantics. Since our samples contain emails in diverse languages (French, English), we can think that there is a kind of common underlying information structure for spam. The accuracy rate remains in the range of 80%-87%, whatever the number of witness emails, and the false negative level remains low.
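A sketch of this majority-vote classifier, reusing the information_distance function from the earlier sketch (witness-set handling and I/O omitted; the names are ours, not the paper's):

```python
from collections import Counter

def knn_classify(message: bytes,
                 witness: list[tuple[bytes, str]],
                 k: int) -> str:
    """Label `message` by majority vote among the k closest witness emails.
    `witness` pairs each raw email with its known label, 'spam' or 'ham'."""
    ranked = sorted(witness,
                    key=lambda pair: information_distance(message, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Following the empirical rule above, k is taken as the square root of the
# witness set cardinality, e.g. k = 10 for 50 spam + 50 ham.
```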
Up to now, no existing filtering method ensures a perfect 100% accuracy; a nonzero false positive level implies that the individual user still needs to check his junk folder to avoid losing important emails. Since our first results are quite encouraging, we decide to move to a completely different and more sophisticated technique. Instead of investigating the tuning of the k-NN algorithm, we work with a powerful training component, namely Support Vector Machines (SVM), and we adapt our implementation. We briefly introduce SVM in the next section.

SVM for spam classification

Support vector machines are powerful classifiers. It is out of the scope of this paper to review SVM (see (Burges 1998)), and we only give a flavor. Consider a set of k-dimensional vectors x_i, each one labeled y_i = 1 or −1 depending on its classification. A discriminating hyperplane w is one that creates a decision function satisfying the following constraint for all i:

y_i (x_i · w + b) − 1 ≥ 0.

The learning problem is called linearly inseparable when no such hyperplane exists. This drawback can be dealt with by following several strategies. Indeed, support vector machines can use a nonlinear kernel function defining a new inner product on the input space. This inner product may be used to calculate many higher-power terms of combinations of samples in the input space. This yields a higher-dimensional space such that, when it is large enough, there will be a separating hyperplane. An SVM tool is controlled by two parameters. The main one relates to the kernel function. The Radial Basis Function (RBF) kernel is widely used since it somewhat captures other kernel functions. It takes the value 1 when the two input vectors are equal; otherwise its value slowly decreases towards 0 in a radially symmetric fashion:

K(x_i, x_j) = e^(−‖x_i − x_j‖² / 2α²)

Here, α is a parameter controlling the decrease rate and is set prior to the SVM learning procedure.
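A direct transcription of this kernel, as a small sketch with α as a user-chosen width:

```python
import math

def rbf_kernel(x: list[float], y: list[float], alpha: float) -> float:
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * alpha^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * alpha ** 2))

print(rbf_kernel([0.2, 0.7], [0.2, 0.7], alpha=1.0))  # equal vectors -> 1.0
```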
Other parameters may be considered as well, but it is not our aim to tune such parameters in the present paper. That is why we leave this task to the SVM package we use, since it provides a way to automatically set suitable values for these parameters. As we understand, running an SVM algorithm depends on a vector representation of emails. We obtain one by exploiting a quite simple idea. We choose a set of typical emails, Typ = {t_i | i ∈ [1, n]}, mixing 50% spam and 50% ham. For every email m not belonging to Typ, we compute the information distances m_i = d(m, t_i). We get #Typ coordinates, building a #Typ-dimensional vector representing m (here, #Typ denotes the cardinality of the set Typ). The choice of Typ sets the dimension of the vector space we deal with. Later on we will see that the quality of Typ is crucial for the filter's performance. In terms of implementation, we use the LIBSVM software (Chang & Lin 2001). When it comes to evaluating the relevance of a spam filtering system, the standard accuracy rate (which is a usual measure for classification purposes) is not really sufficient, since all errors are treated on an equal footing. With a 90% accuracy rate, it is possible to have 10% of errors only by misclassifying ham. It means that your inbox is clean, but you have to check your junk box to retrieve the missing ham. So it is important for an end user to know the rate of ham (legitimate emails) lost during the process. If this rate (the false positive rate) is very low (for instance around 1%), then there is no need to examine the junk box deeply. Vice versa, if the number of unidentified spam emails is important, then a lot of spam remains in the inbox: so we want a high rate of identified spam (the true positive rate). Another interesting value is the number of spam emails properly identified among the total number of emails considered as spam (true spam and false positives): this is the positive predictive value (or precision). Let us give formal definitions:

fn = number of spam identified as ham
fp = number of ham identified as spam
s = number of spam in the test set
h = number of ham in the test set (s + h is the test set cardinality)

accuracy = 1 − (fn + fp)/(s + h)
FPR = fp/h (false positive rate, or fall-out)
TPR = (s − fn)/s (true positive rate, or recall)
PPV = (s − fn)/(s − fn + fp) (positive predictive value, or precision)

It is quite clear that these numbers are not independent, and it can sometimes be interesting to represent TPR as a function of FPR (ROC analysis). Due to space limitations, we do not present graphical representations of our results.
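Putting the pieces together, here is a hedged end-to-end sketch of the pipeline described above: vectors of information distances to the typical set, an RBF-kernel SVM, and the four metrics. It reuses information_distance from the earlier sketch and scikit-learn's SVC (which wraps the LIBSVM library) rather than LIBSVM's own bindings; the function names and data layout are our own:

```python
from sklearn.svm import SVC  # SVC is scikit-learn's wrapper around LIBSVM

def vectorize(message: bytes, typical: list[bytes]) -> list[float]:
    """An email becomes a #Typ-dimensional vector of distances m_i = d(m, t_i)."""
    return [information_distance(message, t) for t in typical]

def train_filter(train: list[tuple[bytes, str]], typical: list[bytes]) -> SVC:
    X = [vectorize(m, typical) for m, _ in train]
    y = [1 if label == "spam" else -1 for _, label in train]
    clf = SVC(kernel="rbf")  # RBF kernel; parameter tuning left to the package
    clf.fit(X, y)
    return clf

def evaluate(clf: SVC, test: list[tuple[bytes, str]],
             typical: list[bytes]) -> dict:
    """Compute accuracy, FPR, TPR and PPV exactly as defined above."""
    fn = fp = s = h = 0
    for m, label in test:
        pred = clf.predict([vectorize(m, typical)])[0]
        if label == "spam":
            s += 1
            if pred != 1:   # spam identified as ham
                fn += 1
        else:
            h += 1
            if pred == 1:   # ham identified as spam
                fp += 1
    return {"accuracy": 1 - (fn + fp) / (s + h),
            "FPR": fp / h,
            "TPR": (s - fn) / s,
            "PPV": (s - fn) / (s - fn + fp)}
```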
Our results with SVM

We have run a set of experiments on the publicly available SpamAssassin corpus, in which 2397 emails are spam. We performed no pre-processing on these emails, which we consider as raw input data for our system. Our protocol is the following. First of all, we defined 4 training sets of 50, 100, 150 and 200 emails. We then defined 5 sets of typical emails with cardinalities 8, 10, 12, 16 and 20, each one on a 50/50% spam/ham basis. These 5 typical sets gave rise to 5 experiment types with the same training and testing sets: each type is associated with an 8-, 10-, 12-, 16- or 20-dimensional vector space. Then, instead of working on the full dataset, we chose 3 test sets: two with 500 emails and one with 1000 emails of both types. We give here the results for the final 1000-email set. Our experiment series aims to:
- allow choosing the most effective vector space dimension;
- allow choosing the most effective training set cardinality;
- validate the combination of Kolmogorov complexity and SVM as a really effective framework for spam filtering.

Table 1 presents our results (all numbers are percentages; the testing set contains 1000 new emails to classify). For each of the four training set sizes, it reports the accuracy, FPR and TPR obtained with each vector space dimension.

Table 1: Classification results with bzip2
In our experiment, and except with the training set of 150 emails, it is clear that the best dimensional representation should be in the range of 10. We think there is a relationship between the size of the vector representation and the training set size. It is well known that over-fitting can occur with a huge training set, thus reducing the prediction power of the classifier. In fact, when we increase the vector dimension (e.g. 16 or 20 in our experiment), we increase the quantity of information carried by an email. It is realistic to consider that relatively small training sets perform better than bigger ones, since they are less sensitive to over-fitting. This relationship needs to be investigated more deeply. Regarding the training set size, 100 seems to be quite effective. It can be seen from our results that, when dealing with a bigger training set (150 or 200), some kind of over-fitting effect harms the classifier's accuracy, and in particular the false positive and true positive rates. Obviously, a smaller training set (50 emails) is not sufficient for the SVM tool to output a precise classifier. Moreover, our experiment shows that, in that case, the false positive rate reaches its worst values as the vector space dimension increases (12, 16 and 20). It is worth noting that, except for the smallest training set, our accuracy rates range from 93% to 97% with an FPR between 1% and 3%, which is usually quite acceptable. Although more investigation has to be performed, it is obvious that the pair Kolmogorov complexity-SVM provides a clean theoretical framework to accurately deal with practical applications like spam filtering. In that case, our simple implementation is equivalent in terms of performance to more mature tools (such as SpamBayes or SpamAssassin). It becomes definitely clear that a deep analysis of the incoming email itself, or a clever pre-processing, is not a prerequisite for accurate results. Better performance could be expected (e.g. by combining diverse approaches); however, we are not sure this is feasible, since Kolmogorov complexity theory is a strong substratum. Of course there is room for improving our model, given the number of parameters controlling it!

Related works

Despite their relative novelty, complexity-based approaches have proved quite successful in numerous fields of IT: data mining, classification, clustering, fraud detection, and so on. Within the IT security field, we can refer to the works of (Kulkarni & Bush 2001; Bush 2002; 2003) and, more recently, (Wehner 2006): they are mainly devoted to analyzing the data flow coming from network activities and to detecting intrusions or viruses. From a spam filtering perspective, using the compression paradigm is not a completely new idea, even though our way of mixing Kolmogorov complexity with an SVM training component is quite new (as far as we know, and except for other applications such as protein sequence classification (Kocsor et al. 2006)). Probably the most closely related work is that of Spracklin and Saxton (Spracklin & Saxton 2007): they use Kolmogorov complexity without referring to the information distance. In each email, they consider only the body, which is first pre-processed and cleaned up (only lower-case letters, removal of common words and HTML tags, etc.). Then each word is converted into 0 if it appears frequently in spam, or 1 if it appears more frequently in ham. The final result is a string of 0s and 1s which is then compressed. If the complexity is below a certain threshold, the email is classified as ham, otherwise as spam. Their accuracy rates vary in the range of 80%-96%, depending on their test sets.
This is a good result, similar to ours, with the difference that we do not pre-process our emails. In terms of complexity, our process is much simpler, since we only compute distances. The work of Bratko et al. (Bratko et al. 2006) also has to be cited. It is not exactly based on Kolmogorov complexity, but it is compression-based. They use an adaptive statistical data compression process. The main idea is to consider the messages as issued by an unknown information source and to estimate the probability of a symbol x being produced. Such a probability then exhibits a way to encode the given symbol and to compress. So, in order to filter spam, it is sufficient to compress the training ham set into a file Ham, and to do the same for the training spam set, getting a file Spam. When a new email m arrives, it is added to the ham folder and the resulting folder is compressed into a file Ham+m. The same is done for the spam folder, getting Spam+m. If the difference between the sizes of Spam+m and Spam is small, it means m does not bring new information to the training spam set and is likely a spam. Otherwise m is considered a ham. This method, which does not use the information distance, provides good results on the standard databases, whatever the compression model. Our rates are currently similar, but we have not yet investigated the whole collection of available databases.
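For contrast with our distance-based scheme, the increment test at the heart of Bratko et al.'s method can be sketched in a few lines. Note that they use adaptive statistical compression models; the off-the-shelf bzip2 below is only a stand-in, and the function name is ours:

```python
import bz2

def bratko_style_classify(message: bytes,
                          spam_corpus: bytes,
                          ham_corpus: bytes) -> str:
    """Assign the message to the class whose compressed corpus grows least
    when the message is appended: a small increment means the message adds
    little new information to that class."""
    csize = lambda s: len(bz2.compress(s))
    spam_gain = csize(spam_corpus + message) - csize(spam_corpus)
    ham_gain = csize(ham_corpus + message) - csize(ham_corpus)
    return "spam" if spam_gain < ham_gain else "ham"
```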
Conclusion and future works

In this paper we have investigated the anti-spamming field and have considered the Kolmogorov complexity of an email as a tool for classifying it as spam or ham. In order to quickly validate our idea, we first implemented a basic version of the k-nearest-neighbors algorithm, using the information distance deduced from K as a closeness measure. Our tests were processed on relatively small data sets. Still, the first results obtained clearly show that the compression distance is meaningful in that situation. This does not come completely as a surprise: the seminal works of (Bennett et al. 1998) and then (Cilibrasi & Vitányi 2005) already paved the way and highlighted the real practical power of compression methods. One major advantage of using compression for email classification is that there is no need to deeply analyze the diverse parts of an email (header and body). Our algorithm can of course be refined, and simply having a more powerful compression technique would probably bring more accurate results. Multiple complexity estimators could also be combined in order to improve discrimination accuracy. A more sophisticated machine learning component based on SVM could then be implemented to further investigate our compression-based classification method. In the present work, using SVM required a real vector representation of emails, and it appears that 10 is a suitable dimension for the vector space model. We then trained our SVM on diverse sets: examining our experimental results, it appears that a suitable size for the training set is in the range of 100. On our test sets (coming from the SpamAssassin repository), we got accuracy rates in the range of 93%-97%, with false positive rates between 1% and 2%. It is quite clear that the compression method coupled with the SVM training component is successful, and our results emphasize the idea that there is no need to separate the header from the body and to analyze them separately. A great advantage of this kind of blind technique is that there is:
- no need to update a dictionary or a blacklist of domain names. The system automatically updates as the training base evolves over time (i.e. the database of junk/legitimate emails, which is basically unique to a given user): it can be trained on a per-user basis, like Bayesian spam filtering;
- no need to pre-process the emails (tokenization, HTML tag and capital letter removal, and so on).
A similarity with Bayesian filtering is that complexity-based methods do not bring any explanation of their results. Fortunately, this is generally not a requirement for an email filtering system; in this particular case it cannot be considered a drawback. Even if some commercial software claims 99% success in spam elimination, it is well known that end users are not entirely satisfied with the current performance of anti-spam tools. For instance, in (SPAMfighter 2007) people estimate at around 15% the amount of daily spam they receive after the spam filters have done their job. This means 85% accuracy, which is clearly outperformed in our experiments. It is quite clear that compression-based approaches are not the ultimate Grail. Still, we strongly believe that, when mixed with other classical strategies, they will definitely bring the spam filtering field a step further. It remains for us to investigate how to overcome the fact that we only estimate the Kolmogorov complexity. Using compression makes the information distance calculation approximate, since d(m, m) is definitely not 0 but a small value between 0.1 and 0.3. This error has to be properly managed to provide accurate vectors for the training components. This is our next step towards a more accurate system, ultimately tending towards a spam-free Internet!

Acknowledgment

The authors are grateful to Ivan José Varzinczak for his help in integrating the bzip2 code into our programs.

References

Bennett, C.; Gács, P.; Li, M.; Vitányi, P.; and Zurek, W. 1998. Information distance. IEEE Transactions on Information Theory 44(4).
Bratko, A.; Cormack, G. V.; Filipič, B.; Lynam, T. R.; and Zupan, B. 2006. Spam filtering using statistical data compression models. Journal of Machine Learning Research 7.
Burges, C. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2).
Burrows, M., and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation.
Bush, S. F. 2002. Active virtual network management prediction: Complexity as a framework for prediction, optimization, and assurance. In Proceedings of the 2002 DARPA Active Networks Conference and Exposition (DANCE).
Bush, S. F. 2003. Extended abstract: Complexity and vulnerability analysis. In Complexity and Inference.
Chang, C.-C., and Lin, C.-J. 2001. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Cilibrasi, R., and Vitányi, P. 2005. Clustering by compression. IEEE Transactions on Information Theory 51(4).
Graham, P. 2002. A plan for spam. Available online.
Huffman, D. 1952. A method for the construction of minimum redundancy codes. In Proceedings of the IRE.
Kirchherr, W.; Li, M.; and Vitányi, P. 1997. The miraculous universal distribution. The Mathematical Intelligencer 19(4).
Kocsor, A.; Kertész-Farkas, A.; Kaján, L.; and Pongor, S. 2006. Application of compression-based distance measures to protein sequence classification: a methodological study. Bioinformatics 22(4).
Kolmogorov, A. N. 1965. Three approaches to the quantitative definition of information. Problems in Information Transmission 1(1):1-7.
Kulkarni, P., and Bush, S. F. 2001. Active network management and Kolmogorov complexity. In Proceedings of OpenArch 2001.
Li, M., and Vitányi, P. 1997. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag.
Lowd, D., and Meek, C. 2005. Good word attacks on statistical spam filters. In Proceedings of the Second Conference on Email and Anti-Spam (CEAS).
Pearl, J., and Russell, S. 2003. Bayesian networks. In Arbib, M. A., ed., Handbook of Brain Theory and Neural Networks. Cambridge, MA: MIT Press.
Shannon, C. 1948. A mathematical theory of communication. Bell System Technical Journal 27.
SPAMfighter. 2007. Anti-spam products give unsatisfactory performance. Available online.
Spracklin, L., and Saxton, L. 2007. Filtering spam using Kolmogorov complexity estimates. In 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW).
Wehner, S. 2006. Analyzing worms and network traffic using compression. Available online at arXiv.org.
Welch, T. 1984. A technique for high performance data compression. IEEE Computer 17(6).