Spam Sender Detection with Classification Modeling on Highly Imbalanced Mail Server Behavior Data
Yuchun Tang, Sven Krasser, Dmitri Alperovitch, Paul Judge
Secure Computing Corporation
4800 North Point Parkway, Suite 300, Alpharetta, GA 30022, USA
{ytang, skrasser, dalperovitch,

Abstract

Unsolicited commercial or bulk emails, or emails containing viruses, pose a great threat to the utility of email communications. A recent solution for filtering is reputation systems that can assign a value of trust to each IP address sending email messages. By analyzing the query patterns of each node utilizing reputation information, reputation systems can calculate a reputation score for each queried IP address. In this research, we explore a behavioral classification approach based on features extracted from such global messaging patterns. Due to the large number of bad senders, this classification task has to cope with highly imbalanced data. First, for each observed sender, we calculate periodicity properties using a discrete Fourier transform and global breadth information reflecting message volume and recipient distribution. Second, a Granular Support Vector Machine - Boundary Alignment algorithm (GSVM-BA) is implemented to solve the class imbalance problem and is compared to cost-sensitive learning. Last, we determine the performance of support vector machine, C4.5 decision tree, naïve Bayesian decision tree, and multinomial logistic regression classifiers on the resulting data set. The best performance is observed when GSVM-BA is used for rebalancing and SVM is then used for classification.

I. INTRODUCTION

Traditional content filtering anti-spam systems can provide highly accurate detection rates but are usually prohibitively slow and scale too poorly to deploy in high-throughput enterprise and ISP environments. An early approach for efficient filtering of unwanted spam traffic has been the use of real-time blacklists (RBLs).
An RBL is a service containing a list of IPs from which servers should not accept messages. Typically, RBLs can be queried over the DNS protocol. Whenever a host on the Internet tries to send a message to an email server, the server can query an RBL to check whether the IP of the sending host is listed. If the IP is listed, then the sending host is a known source of malicious traffic and the receiving server can reject the message. Alternatively, the information that the sender is listed can be combined with a local analysis result to make a final decision.

RBLs generally receive information about malicious senders from spam messages delivered to spamtrap addresses, manual listings, or user feedback. There are several drawbacks to this approach. To be automatically listed, a particular spam run needs to target a spamtrap address. All messages that have been delivered to other addresses beforehand cannot be stopped. Manual listings and user feedback require a human in the loop and are inherently non-automatic. This results in a slow reaction time, which is especially problematic since most machines sending spam are zombie machines with only a few hours of sending activity.

One approach to counter these shortcomings is to take more sources of feedback into account. In particular, the real-time queries sent over DNS can yield additional insight. Fig. 1 shows the data that can be gathered from such a query. A sending host with IP address Q tries to send a message to a receiving server. The server queries an RBL for the IP address of the sending host (Q), which we therefore call the queried IP. The query is relayed over DNS to an RBL server, which will see the query packet coming from the IP address S of the DNS server, which we call the source IP. In addition, the time T at which the query has been received is stored. This results in a tuple $\langle Q, S, T \rangle$ generated by every query.

Ramachandran et al. analyze the source IPs querying an RBL [1].
They investigate a dataset of RBL queries to spot exploited machines (zombies) in botnets that are sending queries to test whether members of the same botnet have been blacklisted. Therefore, additional information can be obtained with regard to the maliciousness of the source IP. Ramachandran et al. propose another approach that is able to detect malicious IPs if information on the destination domain is available by looking at the distribution of domains to which a sending IP sends [2]. However, such classifiers cannot always make a definite decision. Introducing a continuous reputation value in contrast to a discrete yes/no decision allows the user of such a system to define his or her own thresholds. More importantly, this continuous value can be used as a feature in a local classification engine.

Fig. 1. Data gathered from query.

One such reputation system is Secure Computing's TrustedSource [3]. TrustedSource operates on a wide range of data sources including proprietary and public data. The latter includes public information obtained from DNS records, WHOIS data, or real-time blacklists (RBLs). The proprietary data is gathered by over 5000 IronMail appliances that give a unique view into global enterprise patterns. Based on the needs of the IronMail administrator, the appliance can share different levels of data and can query for reputation values for different aspects of an email message. Currently, TrustedSource can calculate a reputation value for the IP address a message originated from, for URLs in the message, or for message fingerprints. This research uses queries for IPs as input to detect spam sender IPs among them.

In previous work, we outlined the detection of spam senders based on query patterns [4]. This approach allows gaining additional information on the queried (sending) IP by aggregating query patterns into spectral and breadth features, with a focus on spectral features. In this paper, we focus on the extraction of breadth features. In addition, we compare the performance of GSVM-BA, which we proposed previously, with four standard algorithms.

The rest of the paper is organized as follows. Section II describes the data we use for classification modeling. Section III gives a brief review of research on imbalanced classification. In Section IV, GSVM-BA is presented in detail as a solution to the class imbalance challenge. Section V compares performance between GSVM-BA and cost-sensitive learning with four classification algorithms on IP classification. Finally, Section VI concludes the paper.

II. QUERY DATA ANALYSIS

The TrustedSource reputation system provides an RBL-style interface to retrieve reputation values for IP addresses over DNS.
While TrustedSource can also be queried by other means that can include additional information such as message fingerprints or header metadata, we limit this classification problem to $\langle Q, S, T \rangle$ tuples (QST data for short), since this type of data is the most generic one and also applies to standard RBLs widely used on the Internet today.

A. Sources of Noise

Generally, an IP that is queried has sent a message shortly before the query has been logged (at time T). Note that this assumption does not always hold. IPs can also be queried by hand by humans trying to find information regarding a listing. Also, as noted previously, some zombie machines try to automatically send queries to evade IP-based blocking mechanisms [1]. Zombies could also attempt to actively poison the data feed by submitting bogus queries. Lastly, not all messages result in queries since DNS results can be cached. This can occur even when an RBL server specifically returns low time-to-live (TTL) values with its responses, because DNS servers relaying these responses do not always honor the TTL values.

Normally, a well-designed classifier will still be able to yield useful results in the presence of such noise, but there are a number of techniques to avoid these problems. These include moving away from DNS (not desirable for RBLs since DNS queries are an accepted standard and widely supported), adding a unique string to the query (to prevent caching; this needs to be supported by both the querying server, acting as the client, and the RBL server), and adding authentication information (to prevent data poisoning).

B. Feature Extraction

For this research, we are interested in classifying senders based on their sending behavior. For each sender observed in the data set (i.e. each queried IP), we calculate a number of features based on the distribution of source IPs (S) and timestamps (T). These fall into two general categories: spectral features and breadth features.
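As a rough illustration of the starting point for both feature families, the sketch below groups QST tuples by queried IP so that each sender's source IPs and timestamps can be analyzed together. The record layout (strings for IPs, seconds for timestamps), the function name `group_by_queried_ip`, and the sample addresses are assumptions made for this example; the paper does not describe its data structures.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One QST record: (queried IP Q, source IP S, timestamp T in seconds).
# This layout is an assumption for illustration only.
QST = Tuple[str, str, float]


def group_by_queried_ip(records: Iterable[QST]) -> Dict[str, List[Tuple[str, float]]]:
    """Collect the (S, T) pairs observed for each queried IP Q, sorted by T."""
    per_sender: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for q, s, t in records:
        per_sender[q].append((s, t))
    for pairs in per_sender.values():
        pairs.sort(key=lambda pair: pair[1])  # order each sender's queries by time
    return dict(per_sender)


# Hypothetical sample data (documentation-range addresses, made-up timestamps).
records = [
    ("192.0.2.10", "198.51.100.1", 120.0),
    ("192.0.2.10", "198.51.100.2", 60.0),
    ("192.0.2.99", "198.51.100.1", 30.0),
]
grouped = group_by_queried_ip(records)
```

Keeping the pairs sorted by time is a convenience for any later per-sender pass over the timestamps; it is not required by the grouping itself.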
1) Spectral Features: Spectral features are based on the distribution of the set of timestamps observed for each queried IP. Currently, we do not take source IP information into account for this set of features. Under the assumptions outlined previously, each timestamp seen from a particular IP $Q_i$ corresponds to a message sent from $Q_i$. We consider a time interval $T$ that we split into $N$ equally sized intervals $t_n$, where $n = 0, \ldots, N-1$. The number of timestamps falling into interval $t_n$ is denoted $c_n$, which corresponds to the total number of messages observed from $Q_i$ in that interval. This sequence is then transformed into the frequency domain using a discrete Fourier transform (DFT). Since we do not consider time zones or time shifts, we are only interested in the magnitude of the complex coefficients of the transformed sequence and discard the phase information. This results in the sequence $C_k$. $C_0$ is the constant component that corresponds to the total message count. The message count will be considered as part of the breadth features, and we use it here only to normalize the coefficients, yielding a new sequence $C'_k = C_k / C_0$. Furthermore, since the input sequence $c_n$ is real, all output coefficients with $k > N/2$ are redundant due to the symmetry properties of the DFT and can be ignored.

We have chosen $T = 24$ h and $N = 48$, resulting in 30 minute time slots $t_n$. This results in 24 usable raw spectral features, $C'_1$ to $C'_{24}$. An evaluation of these raw features indicated that for ham senders there are three distinct groups: $C'_1$, $C'_2$ to $C'_4$, and $C'_5$ to $C'_{24}$ each lie within specific ranges. We use $C'_1$, the mean and the standard deviation of $C'_2$ to $C'_4$, and the mean and the standard deviation of $C'_5$ to $C'_{24}$ as the first five spectral features. In addition, we add log-scaled versions of $C'_1$ to $C'_{24}$ to the spectral feature set. The selection of these features is mostly based on heuristics at this point, and we expect to be able to achieve a vastly improved performance after a careful analysis. Spectral features have been previously covered in our preliminary results presented in [4].

2) Breadth Features: Breadth features are also calculated based on data gathered in a 24 hour time window. In that window, the sending behavior of each queried IP Q is analyzed.

First, the number of source IPs S querying for Q is calculated, which we denote $B_{src}$. Under the assumptions made, this count reflects the number of recipients that Q attempted to send to. Since S is the IP of the DNS server relaying the query, and multiple recipients may use the same DNS server or one recipient may use multiple DNS servers, this count does not perfectly reflect the number of recipients. It does, however, give a good idea of the breadth of recipients across the Internet. For the sake of brevity, we refer to the source IP of a query (i.e. the IP address of the DNS server relaying that query) as the virtual recipient, keeping in mind that the actual recipient is not exposed in the QST data.

Second, the number of queries for each IP Q (and therefore the number of messages sent by that IP), denoted $B_{msgs}$, is calculated. As previously explained, a major source of noise in this feature is caching.

Third, we introduce the notion of a session. A session is a 30 minute period in which an IP $Q_i$ sends to a particular virtual recipient $S_j$. When $S_j$ queries for $Q_i$ (i.e. when $Q_i$ sends a message to the virtual recipient $S_j$) for the first time, a new session is counted. This session is valid for 30 minutes, and all additional queries involving the pair $(Q_i, S_j)$ do not result in new sessions (while queries involving a different virtual recipient in the same time frame will).
After 30 minutes the session is closed, and a subsequent query will result in a new session. All sessions for $Q_i$ across all virtual recipients are summed up, yielding the session count $B_{ssn}$. Note that this feature mimics the query pattern that would be observed if the relaying DNS server cached all queries with a 30 minute TTL.

Fourth, we calculate the average number of messages per session per particular source IP, sum all averaged values up, and divide by the total number of sessions. We denote this feature as $B_{mpss}$.

Fifth, we calculate the number of global sessions. A global session is akin to a regular session but does not distinguish between different virtual recipients and therefore reflects the global time of activity for each queried IP. We denote the global session count as $B_{gssn}$.

Lastly, we heuristically incorporate two derived features, $B_{s/m}$ (derived from $B_{src}$ and $B_{mpss}$) and $B_{s/s}$ (derived from $B_{ssn}$ and $B_{src}$).

The signal-to-noise ratio (S2N), defined as the distance of the arithmetic means of the spam and non-spam (ham) classes divided by the sum of the corresponding standard deviations [5], that is, $S2N = |\mu_{spam} - \mu_{ham}| / (\sigma_{spam} + \sigma_{ham})$, for each of these five features is presented in Table I. The S2N values show that these breadth features are informative and can be utilized for classification modeling. The $B_{gssn}$ feature achieves the best ratio. The probability densities for the spam and ham classes of this particular feature are shown in Figure 2.

TABLE I. BREADTH FEATURE CHARACTERISTICS
Columns: S2N, $\mu_{spam}$, $\sigma_{spam}$, $\mu_{ham}$, $\sigma_{ham}$
Rows: $B_{src}$, $B_{ssn}$, $B_{gssn}$, $B_{msgs}$, $B_{mpss}$, $B_{s/m}$, $B_{s/s}$

Fig. 2. Probability density for $B_{gssn}$ for spam and ham senders.

III. IMBALANCED CLASSIFICATION

How to build an effective and efficient model on a huge and complex dataset is a major concern of the science of knowledge discovery and data mining. With the emergence of new data mining application domains such as messaging security, e-business, and biomedical informatics, more challenges are arising.
Among them, highly skewed data distribution has been attracting noticeably increasing interest from the data mining community due to its ubiquity and importance [6], [7].

A. Class Imbalance

Class imbalance happens when the distribution of the available dataset is highly skewed. For a binary classification problem, this means that there are significantly more samples from one class than from the other. Class imbalance is ubiquitous in data mining tasks, such as diagnosing rare medical diseases, credit card fraud detection, and intrusion detection for national security. For spam filtering, each IP is classified as spam or non-spam based on its sending behavioral patterns. This classification is highly imbalanced, as shown in our experiments. As our current target in this research is to detect spam IPs, we define spam IPs as positive and non-spam IPs as negative.
B. Methods for Imbalanced Classification

Many methods have been proposed to handle imbalanced classification, and some good results have been reported [7]. These methods can be categorized into three different kinds: cost-sensitive learning, oversampling the minority class, or undersampling the majority class. Interested readers may refer to [8] for a good survey.

For a real-world classification task like spam IP detection, there is usually a large number of IP samples. These samples need to be classified quickly so that spam messages from those IPs can be blocked in time. However, cost-sensitive learning or oversampling usually increases decision complexity and hence slows down classification. On the other hand, undersampling is a promising method to improve classification efficiency. Unfortunately, random undersampling may not generate accurate classifiers because informative majority samples may be removed. In this paper, granular computing and SVM are utilized for undersampling by keeping informative samples while eliminating irrelevant, redundant, or even noisy samples. After undersampling, the data is cleaned and hence a good classifier can be modeled for IP classification in terms of both effectiveness and efficiency.

C. SVM for Imbalanced Classification

SVM approximates the structural risk minimization principle that minimizes an upper bound on the expected risk [9], [10]. Because structural risk is a reasonable trade-off between the training error and the model complexity, SVM has a great generalization capability. Geometrically, the SVM modeling algorithm works by constructing a separating hyperplane with the maximal margin.

Compared with other standard classifiers, SVM performs better on moderately imbalanced data. The reason is that only Support Vectors (SVs) are used for classification and many majority samples far from the decision boundary can be removed without affecting classification [11].
However, the performance of SVM deteriorates significantly on highly imbalanced data [11], [12]. For this kind of data, SVM is prone to finding the simplest model that best fits the training dataset. Unfortunately, the simplest model is exactly the naïve classifier that identifies all samples as part of the majority class.

SVM is usually much slower than other standard classifiers [13], [14], [15]. The speed of SVM classification depends on the number of SVs. For a new sample X, K(X, SV) is calculated for each SV. The sample is then classified by aggregating these kernel values with a bias. To speed up SVM classification, one potential method is to decrease the number of SVs.

IV. THE GSVM-BA ALGORITHM

In this work, the Granular Support Vector Machines - Boundary Alignment algorithm (GSVM-BA) is designed and targeted at improving both effectiveness and efficiency for IP classification based on the principles of granular computing.

Fig. 4. Flow chart of the GSVM-BA algorithm. (Training data, SVM learning, SVM prediction on validation data, then "Performance improved?": if yes, remove positive support vectors and repeat; if no, output the second last SVM.)

A. Granular Computing and GSVM

Granular computing represents information in the form of aggregates (called information granules) such as subsets, subspaces, classes, or clusters of a universe. It then solves the targeted problem in each information granule [16]. There are two principles in granular computing. The first principle is divide-and-conquer, which splits a huge problem into a sequence of granules (granule split). The second principle is data cleaning, which defines the suitable size for one granule to comprehend the problem at hand without getting buried in unnecessary details (granule shrink). As opposed to traditional data-oriented numeric computing, granular computing is knowledge-oriented [17]. By embedding prior knowledge or prior assumptions into the granulation process for data modeling, better classification can be achieved.
A granular computing-based learning framework called Granular Support Vector Machines (GSVM) was proposed in [18]. GSVM combines the principles from statistical learning theory and granular computing theory in a systematic and formal way. GSVM works by extracting a sequence of information granules with granule split and/or granule shrink, and then building an SVM on some of these granules when necessary. The main potential advantages of GSVM are:
1) GSVM is more sensitive to the inherent data distribution by trading off between local significance of a subset of data and global correlation among different subsets of data, or trading off between information loss and data cleaning. Hence, GSVM may improve the classification performance.
2) GSVM may speed up the modeling process and the classification process by eliminating redundant data locally. As a result, it is more efficient and scalable on huge datasets.
Fig. 3. GSVM-BA can push the boundary back close to the ideal position with fewer SVs.

B. GSVM-BA

Armed with the data cleaning principle of granular computing, GSVM-BA is well suited to spam IP detection on highly imbalanced data derived from QST data.

SVM assumes that only SVs are informative to classification and other samples can be safely removed. However, for highly imbalanced classification, the majority class pushes the ideal decision boundary toward the minority class [11], [12]. As demonstrated in Fig. 3(a), positive SVs that are close to the learned boundary may be noisy. Some really informative samples may hide behind them.

To find these informative samples, we can conduct cost-sensitive learning to assign higher penalty values to false positives (FP) than false negatives (FN). This method is named SVM+CS. However, to counter the increased weights on the minority side, SVM+CS increases the number of majority SVs (Fig. 3(b)), and hence slows down the classification process.

In contrast to this, GSVM-BA looks for these informative samples by repeatedly removing positive support vectors from the training dataset and rebuilding another SVM. GSVM-BA is knowledge-oriented in that it embeds the boundary push assumption ("prior knowledge") into the modeling process. After an SVM is modeled, a granule shrink operation is executed to remove the corresponding positive SVs, generating a smaller training dataset on which a new SVM is modeled. This process is repeated to gradually push the boundary back to its ideal location, where the optimal classification performance is achieved.

Empirical studies in the next section show that GSVM-BA can compute a better decision boundary with far fewer samples involved in the classification process (Fig. 3(c)). Consequently, classification performance can be improved in terms of both effectiveness and efficiency. Fig. 4 sketches the GSVM-BA algorithm. The first SVM is always the naïve one by default.

V. EXPERIMENTS

Classification modeling is carried out on a workstation with a Pentium M CPU at 1.73 GHz and 1 GB of memory. The experiments are targeted at comparing the performance between GSVM-BA undersampling and cost-sensitive learning with different classification algorithms on highly imbalanced IP classification.

A. Data Preparation

We build classifiers on daily server behavioral data, which are retrieved from over 7000 sensors located in 51 countries. We see several million IPs every day. About 32% of the IPs can be labeled as spam or ham based on information from the real production system. Because the TrustedSource system has already been running for over 5 years, the labeling information is very reliable. Although our inherent task is to build classifiers on these 32% known IPs to catch spam senders among the remaining 68% unknown IPs, we use QST data from labeled IPs for performance comparison.

The QST data gathered on 08/14/2006 is used for training. The QST data gathered between 08/07/2006 and 08/13/2006 is used for validation. Data used for testing has been collected from 08/15/2006 to 09/11/2006. The validation dataset and the testing dataset pass through a stratified random undersampling process so that each of them has a size similar to one day's data. This makes it convenient to estimate classification efficiency in the real production system, in which QST features are retrieved from the past 24 hours of data. As shown in Table II, the datasets are highly imbalanced, with over 93% of the IPs being spam IPs.

TABLE II. QST DATA DISTRIBUTION
                   spam      non-spam
training data      93.96%    6.04%
validation data    93.09%    6.91%
testing data       94.99%    5.01%

The training dataset is normalized so that the value of each input feature falls into the interval [-1, 1]. The validation dataset and the testing dataset are normalized correspondingly.
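The normalization step can be sketched as follows. This is a minimal illustration, assuming that "normalized correspondingly" means the per-feature minimum and maximum are taken from the training data and then reused unchanged for the validation and testing data; the paper does not spell this out, and the function names `fit_min_max` and `scale` are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_min_max(train: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Return the (min, max) of every feature column of the training data."""
    return [(min(col), max(col)) for col in zip(*train)]


def scale(rows: Sequence[Sequence[float]],
          ranges: Sequence[Tuple[float, float]]) -> List[List[float]]:
    """Map each feature linearly so the training range becomes [-1, 1]."""
    scaled = []
    for row in rows:
        out = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                out.append(0.0)  # constant feature: no spread to scale (assumed convention)
            else:
                out.append(2.0 * (value - lo) / (hi - lo) - 1.0)
        scaled.append(out)
    return scaled


# Hypothetical two-feature example.
train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
ranges = fit_min_max(train)
train_scaled = scale(train, ranges)
unseen_scaled = scale([[15.0, 10.0]], ranges)
```

Under this assumption, validation or testing values outside the training range map outside [-1, 1], as the first feature of the unseen row does here.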
In the modeling phase, multiple algorithms are applied on the training dataset to build classifiers that are tuned to get the best performance on the validation dataset. The performance of these classifiers on the testing dataset is then reported.
Fig. 5. ROC analysis (true positive rate against false positive rate) for GSVM, SVM+CS, C4.5+GSVM, C4.5+CS, NBTree+GSVM, NBTree+CS, Logistic+GSVM, and Logistic+CS: (a) validation performance; (b) testing performance.

B. Evaluation Metrics

The performance evaluation is directly related to our spam detection application. To make the classification result reliable, it is required that fpr, as defined in (1), be less than 1%. This threshold was chosen based on feedback from the technical support team. They observed that, with fpr under 1%, the FP reports from our customers stay at a level acceptable to the business. In our modeling process, we decided to control the fpr at 0.8% on the validation dataset to be slightly more conservative. With this prerequisite in mind, we also try to increase the tpr, which is defined in (2).

$$ fpr = \frac{FP}{FP + TN} \qquad (1) $$

$$ tpr = \frac{TP}{TP + FN} \qquad (2) $$

C. Classification Algorithms

Four classification algorithms are used in this study.
1) The C4.5 algorithm for building a decision tree [19].
2) The NBTree algorithm for building a decision tree with naïve Bayes classifiers at the leaves [20].
3) The Logistic algorithm for building a multinomial logistic regression model with a ridge estimator [21].
4) The SVM algorithm for building a support vector machine [9].

These algorithms represent the state of the art in machine learning as well as the most popular and widely deployed classification solutions. For the first three algorithms, we choose the Weka software package (available at nz/ml/weka/). We also choose LIBSVM (available at cjlin/libsvm) for SVM modeling with the RBF kernel.

A cost-sensitive meta-learning strategy is utilized to handle class imbalance between spam IPs and non-spam ones.
The misclassification cost for a FN is always 1, while the misclassification cost for a FP is tuned such that the fpr is close to 0.8% on the validation dataset.

As classification with SVM+CS is extremely slow, we designed GSVM-BA. A sequence of granule shrink operations is executed to recursively remove positive support vectors and thus to align the classification boundary until the fpr is close to 0.8% on the validation dataset.

For a thorough comparison between GSVM-BA and cost-sensitive learning, we also run the C4.5, NBTree, and Logistic algorithms on the training dataset after GSVM-BA undersampling, in addition to running them on the original training dataset.

D. Result Analysis

TABLE III. EFFECTIVENESS COMPARISON
Columns: validation tpr%, validation fpr%, testing tpr%, testing fpr%
Rows: SVM+CS, GSVM-BA, C4.5+CS, C4.5+GSVM, NBTree+CS, NBTree+GSVM, Logistic+CS, Logistic+GSVM

TABLE IV. EFFICIENCY COMPARISON
Columns: #SVs, validation (seconds), testing (seconds)
Rows: SVM+CS, GSVM-BA, C4.5+CS, C4.5+GSVM, NBTree+CS, NBTree+GSVM, Logistic+CS, Logistic+GSVM (#SVs is N/A for the C4.5, NBTree, and Logistic rows)
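The granule shrink loop of GSVM-BA described in Sections IV.B and V.C can be sketched as follows. This is only an outline under stated assumptions, not the authors' implementation: the SVM trainer is passed in as a hypothetical callable `fit` that returns a model together with the indices of its support vectors, and `validate` is a hypothetical callable that turns validation performance into a single score (higher is better), for example one that rewards an fpr close to the 0.8% target. The function name `gsvm_ba` and the `max_rounds` guard are also introduced here for illustration.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple, TypeVar

Model = TypeVar("Model")
Sample = Tuple[Sequence[float], int]  # (features, label); +1 = spam, -1 = ham


def gsvm_ba(
    train: List[Sample],
    fit: Callable[[List[Sample]], Tuple[Model, Set[int]]],
    validate: Callable[[Model], float],
    max_rounds: int = 100,
) -> Optional[Model]:
    """Repeat granule shrink until the validation score stops improving.

    fit(samples) must return a trained model and the indices (into samples)
    of its support vectors; validate(model) must return a score where
    higher means better validation performance.
    """
    current = list(train)
    best_model: Optional[Model] = None
    best_score = float("-inf")
    for _ in range(max_rounds):
        model, sv_indices = fit(current)
        score = validate(model)
        if score <= best_score:
            break  # no improvement: keep the previous ("second last") model
        best_model, best_score = model, score
        positive_svs = {i for i in sv_indices if current[i][1] == +1}
        if not positive_svs:
            break  # no positive support vectors left to remove
        # Granule shrink: drop the positive support vectors, then retrain.
        current = [s for i, s in enumerate(current) if i not in positive_svs]
    return best_model
```

Passing the trainer and the scoring rule in as callables keeps the sketch independent of any particular SVM library; with LIBSVM, for instance, `fit` would wrap training and report which training samples became support vectors.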
Table III compares the effectiveness of the four classification algorithms, combined with the two rebalancing techniques. For each combination, the tpr and the fpr are reported both on the validation dataset and on the testing dataset. Modeled on simple QST data, the classifiers demonstrate a significantly high tpr and are therefore effective at catching a large number of spam senders. With the fpr threshold in mind, GSVM-BA is the most effective algorithm, with SVM+CS in second place.

Fig. 5 reports ROC analysis [22] results. Only curves with fpr below 1% are plotted so that the comparison is meaningful for the real system. GSVM-BA has the largest area under the ROC curve. Moreover, the figures show that GSVM-BA always has the largest tpr for any fpr under 1%. The comparison between the validation performance and the testing performance shows that no overfitting happens.

Although slower than C4.5, NBTree, or Logistic, GSVM-BA is much faster than SVM+CS for classification, as demonstrated in Table IV. The reason is that GSVM-BA extracts far fewer SVs than SVM+CS (only 2%). It takes about 4 minutes for classification. Considering that it takes about 30 minutes to retrieve QST data in the real production system, the efficiency improvement is critical for decreasing the response time to detect spam senders.

In our experiments, we observed similar modeling time for GSVM-BA and SVM+CS. For GSVM-BA, most of the modeling time is consumed in the first few rounds of granule shrink operations. After that, SVM modeling converges much faster because the decision boundary becomes much clearer as the positive SVs are removed.

We do GSVM-BA modeling once every month (offline learning) and then use the model for classification in the next month (online classification). Classification efficiency is therefore critical in the real production system, while modeling efficiency is not. The system has been running for one year and has demonstrated stable performance.
In terms of effectiveness, it catches about 28% of spam senders at a false positive rate (FPR) below 1%. In terms of efficiency, it takes only about 4 minutes to classify the latest samples.

VI. CONCLUSION

Four state-of-the-art classification algorithms are used to detect malicious servers on weakly informative, highly imbalanced QST data based on global messaging patterns. From simple QST records that do not appear informative at first glance, breadth and spectral features are extracted. Our study demonstrates that these features, combined with advanced classification techniques, can be reliable for spam detection, provided that methods for handling class imbalance are applied.

SVM with cost-sensitive learning suffers from a long run time because of the large number of support vectors, which is problematic for detecting malicious senders that typically send for only a few hours. To improve efficiency, GSVM-BA is proposed to handle class imbalance by undersampling: it recursively eliminates positive support vectors and rebuilds the SVM. Our study demonstrates that GSVM-BA greatly speeds up classification by extracting far fewer support vectors. GSVM-BA is also more effective than SVM with cost-sensitive learning: at the same FP rate, at a level acceptable for real applications, GSVM-BA catches more spam servers. The resulting classifier contributes to TrustedSource as one source of input and prevents malicious or unwanted messages from being delivered to users.
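The comparison "at the same FP rate" used above can be made concrete with a small sketch. The Python below is illustrative only and is not the evaluation code used in the paper: the function name detection_rate_at_fpr, the toy scores, and the labeling convention (1 = spam sender, 0 = legitimate sender, higher score = more spam-like) are all assumptions introduced here. It picks the most permissive score threshold whose false positive rate stays within a budget and reports the fraction of spam senders caught at that threshold.

```python
def detection_rate_at_fpr(scores, labels, max_fpr=0.01):
    """Return (threshold, tpr, fpr) for the lowest threshold whose false
    positive rate does not exceed max_fpr.

    scores: classifier outputs, higher means more spam-like (assumption).
    labels: 1 = spam sender (positive), 0 = legitimate sender (negative).
    A sender is flagged when its score is >= threshold.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one sample of each class")

    # Sweep candidate thresholds from the highest score downward.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    best = (float("inf"), 0.0, 0.0)  # fallback: flag nothing
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples sharing a score are flagged together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr = fp / n_neg
        if fpr > max_fpr:
            break  # lowering the threshold further can only add false positives
        best = (threshold, tp / n_pos, fpr)
    return best


# Toy data, invented for illustration only.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
threshold, tpr, fpr = detection_rate_at_fpr(scores, labels, max_fpr=0.0)
print(threshold, tpr, fpr)
```

Two classifiers can then be compared by calling the function with the same max_fpr on each classifier's scores and comparing the returned detection rates, which mirrors how the FP budget is held fixed in the comparison above.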
