Applications of Support Vector Machine Based on Boolean Kernel to Spam Filtering




Modern Applied Science, Vol. 3, No. 10, October 2009

Shugang Liu & Kebin Cui
School of Computer Science and Technology, North China Electric Power University, Hebei 071003, China
E-mail: lsg69@sohu.com, cepuckb@163.com

Abstract
Spam is now so widespread that it seriously interferes with the daily use of E-mail. Among the primary spam filtering technologies, the support vector machine (SVM) is widely applied because it is efficient and classifies accurately. The main problem in applying an SVM is how to choose the kernel function. To address this problem, this paper proposes an SVM spam filtering algorithm based on a Boolean kernel. The algorithm filters on attributes such as IP address, subject words, content keywords and attachment information. These attributes compose the feature vectors, which are classified by SVM-MDNF, an SVM based on a Boolean kernel. The experimental results show that the algorithm achieves high classification accuracy, a high recall ratio and a high precision ratio. The algorithm has both theoretical and practical value.

Keywords: Spam, Support Vector Machine, Boolean Kernel

1. Introduction
E-mail is one of the main means by which people exchange information on the Internet. As the Internet is so widely used, sending and receiving E-mail has become part of daily life for a considerable number of people. However, along with this convenience, the Internet has also brought the emergence and wide spread of spam, which causes people a lot of trouble. People's work efficiency and mood evidently suffer if they have to spend time and effort every day identifying E-mails. Automatically distinguishing spam therefore has both significance and practical value (Shawe-Taylor J, Cristianini N. 2005).

Spam means promotional E-mails, containing all kinds of publicity such as advertisements and electronic publications, that receivers have not requested or accepted in advance. Spam filtering technologies can be divided into two kinds according to where the filter is executed: server-side spam filtering and client-side spam filtering.
Classified by filtering method instead, there are three approaches: spam filtering based on blacklists/whitelists, spam filtering based on rules, and spam filtering based on content.

1) Spam Filtering Based on Blacklist/Whitelist
Any E-mail sent by a sender on the whitelist is considered legitimate, while any E-mail sent by a sender on the blacklist is treated as spam. This method is widely used in current spam filtering. Usually a blacklist and a whitelist are collected; their entries can be E-mail addresses, the DNS names of E-mail servers, or IP addresses. The lists help receivers check senders in real time.

2) Spam Filtering Based on Rules
This method requires people to set some rules; an E-mail that meets one or several of the rules is spam. The rules typically include header analysis, filtering of bulk sending, exact keyword matching and other features of the E-mail.

3) Spam Filtering Based on Content
In practice, the senders of spam change continuously, so blacklists/whitelists have great limitations. Rule-based filtering also has disadvantages: the rules are made by people, and users who lack experience will reduce the validity and accuracy of the rules. Therefore, many experts have proposed analyzing the content of an E-mail first and then deciding whether it is spam. This method combines spam filtering with other technologies, such as text classification and information filtering, and requires that text classification and information filtering algorithms be introduced into spam filtering.

To solve the spam problem, a great number of measures have been adopted, such as extension of E-mail protocols, certification of E-mail servers, spam filtering and legislation. Among these measures, spam filtering is the most practical. Many text classification algorithms have now been applied to content-based spam filtering, such as Bayes, decision trees, k-nearest neighbor and support vector machines (Wang Bin, Pan Wenfeng. 2005), and SVM has been among the more successful in spam filtering.

2. Evaluation Standards of Spam Filtering Systems
Performance evaluation of spam filtering often makes use of measures from text classification. The criteria that determine whether a text classifier is mature are its mapping accuracy and mapping speed. The mapping speed is determined by the complexity of the mapping algorithm; the mapping accuracy is evaluated with information retrieval measures. The following are the definitions of two common information retrieval measures, recall ratio and precision ratio, in the spam filtering field (C.J. van Rijsbergen. 1979).

Def 1: Recall ratio is the ratio of the amount of spam that has been filtered to the amount of E-mails that should be filtered. The formula for the recall ratio is:

$$\text{Recall} = \frac{\text{amount of filtered spam}}{\text{amount of E-mails that should be filtered}} \qquad (1)$$

Def 2: Precision ratio is the ratio of the amount of spam that has been filtered to the amount of E-mails that have been filtered. The formula for the precision ratio is:

$$\text{Precision} = \frac{\text{amount of filtered spam}}{\text{amount of E-mails that have been filtered}} \qquad (2)$$

Both the recall ratio and the precision ratio reflect the quality of E-mail classification. They should be considered together rather than individually, so the F1 test value is often used to evaluate the classification result as a whole. The formula for the F1 test value is:

$$F_1 = \frac{\text{Precision Ratio} \times \text{Recall Ratio} \times 2}{\text{Precision Ratio} + \text{Recall Ratio}} \qquad (3)$$

Micro average and macro average can both be used to calculate the recall ratio, precision ratio and F1 test value.
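Formulas (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration only: the function name `filter_metrics` and the example counts are hypothetical and are not taken from the experiment.

```python
def filter_metrics(filtered_spam, should_be_filtered, have_been_filtered):
    """Return (recall, precision, f1) following formulas (1)-(3).

    filtered_spam:       spam E-mails that the filter caught
    should_be_filtered:  all spam E-mails in the test set
    have_been_filtered:  all E-mails the filter flagged, spam or not
    """
    # Guard against empty denominators so the sketch never divides by zero.
    recall = filtered_spam / should_be_filtered if should_be_filtered else 0.0
    precision = filtered_spam / have_been_filtered if have_been_filtered else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1


# Hypothetical counts: 100 spam E-mails exist, 95 E-mails were flagged,
# and 90 of the flagged E-mails really were spam.
recall, precision, f1 = filter_metrics(90, 100, 95)
print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of the two ratios, it stays low whenever either one is low, which is why it is preferred over reporting a single ratio.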
Macro average computes the recall ratio, precision ratio and F1 test value of each class separately and then averages them, whereas micro average pools the decisions for all classes and computes the recall ratio, precision ratio and F1 test value once. Evidently, every E-mail filtering algorithm ultimately aims to meet the recall and precision requirements of E-mail classification.

3. Support Vector Machine Based on Boolean Kernel Function
The support vector machine (SVM) (Zhang Yang, Li Zhanhuai, Tang Yan, Cui Kebin. 2004) is a learning method based on statistical learning theory, proposed by Vapnik and the research group he led at Bell Laboratories. SVM is developed from the optimal separating plane in linear classification. Its basic idea is the maximum margin: "optimal" means that the separating plane is required not only to separate the two classes correctly, but also to maximize the margin between them. In effect, maximizing the margin controls the generalization ability.

A linear support vector machine separates the positive and negative examples by constructing an optimal hyperplane $\langle W, X \rangle + b = 0$ in the input space. Here $\langle \cdot , \cdot \rangle$ denotes the inner product, $W \in R^n$ and $b \in R$, such that:

$$y_i(\langle W, X_i \rangle + b) - 1 \ge 0, \quad i = 1, 2, \ldots, d \qquad (4)$$

It can be proved that the optimal separating plane is the one that minimizes $\frac{1}{2}\|W\|^2$ in the input space. To solve this problem, we transform it into its dual form by Lagrangian optimization. The dual problem is subject to the constraints:

$$\sum_{i=1}^{d} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \quad i = 1, 2, \ldots, d \qquad (5)$$

The dual objective to be maximized is:

$$\alpha^{*} = \arg\max_{\alpha} Q(\alpha), \qquad Q(\alpha) = \sum_{i=1}^{d} \alpha_i - \frac{1}{2} \sum_{i=1}^{d} \sum_{j=1}^{d} \alpha_i \alpha_j y_i y_j \langle X_i, X_j \rangle \qquad (6)$$

Here $\alpha_i$ is the Lagrange multiplier corresponding to constraint (4) of the primal problem. This is the optimization of a quadratic function under inequality constraints, and it has a unique solution. It can easily be shown that only a part (usually a small part) of the $\alpha_i$ are non-zero, and the corresponding examples are the support vectors. Solving the problem above gives the optimal separating function:

$$f(X) = \operatorname{sgn}\left( \sum_{i=1}^{d} \alpha_i y_i \langle X_i, X \rangle + b \right) \qquad (7)$$

In this function, the summation in fact only runs over the support vectors. $b$ is the classification threshold. It can be computed from any support vector (which satisfies (4) with equality), or as the median obtained from any pair of support vectors in the two classes:

$$b = y_s - \sum_{i=1}^{d} \alpha_i y_i \langle X_i, X_s \rangle, \quad \alpha_s > 0 \qquad (8)$$

Here $\operatorname{sgn}()$ is the sign function. With a non-linear mapping $\varphi$, vectors of the input space can be transformed into vectors of a higher-dimensional space, which is called the feature space. A non-linear SVM uses $\varphi$ to map input vectors into the feature space, so $X_i$ and $X$ in the equations above are replaced by $\varphi(X_i)$ and $\varphi(X)$ respectively. This gives:

$$f(X) = \operatorname{sgn}\left( \sum_{i=1}^{d} \alpha_i y_i \langle \varphi(X_i), \varphi(X) \rangle + b \right) \qquad (9)$$

$$b = y_s - \sum_{i=1}^{d} \alpha_i y_i \langle \varphi(X_i), \varphi(X_s) \rangle, \quad \alpha_s > 0 \qquad (10)$$

A function of the form $K(x, y) = \langle \varphi(x), \varphi(y) \rangle$ is called a kernel function. Some common kernel functions include:

1) Gaussian radial basis functions: $K(x, x_i) = \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right)$
2) Polynomial: $K(x, x_i) = (\langle x, x_i \rangle + 1)^d$, for $d = 1, \ldots, N$
3) Hyperbolic tangent: $K(x, x_i) = \tanh(\beta \langle x, x_i \rangle + b)$
4) Spline kernel functions: $K(x, x_i) = B_{2n+1}(x - x_i)$

Choosing different kernel functions yields different non-linear support vector machines. If $x$ and $y$ in the kernel functions above are Boolean, we can suppose that $U \in \{0,1\}^n$, $V \in \{0,1\}^n$, $\sigma > 0$, $p \in N$, with $I$ representing the unit vector. Then:

$$K_{MDNF}(U, V) = -1 + \prod_{i=1}^{n} (\sigma u_i v_i + 1) \qquad (11)$$

We call $K_{MDNF}$ the Monotone Disjunctive Normal Form (MDNF) kernel function. The MDNF kernel function is the kernel used by the SVM algorithm in this paper.

4. SVM Spam Filtering Based on Boolean Kernel Function and the Experiment Results
4.1 The Strategy of SVM Spam Filtering Based on Boolean Kernel
This experiment uses the Enron-Spam E-mail dataset. The dataset has two parts: "preprocessed" is the set of E-mails that have already been preprocessed, and "raw" must be processed as needed to obtain the preprocessed form. Our experiment draws some of the preprocessed E-mails as the training set and some as the testing set. We select 2000 E-mails, of which 1100 are spam and 900 are normal E-mails.

The specific procedure of SVM spam filtering based on the Boolean kernel is as follows:

1) First, we standardize the dataset. Remove the noise words (such as spelling mistakes), and filter words by text frequency, with bounds of 2 and 8000. Set different weights for the subject and the text content of every E-mail, giving the subject a higher weight to emphasize the words appearing in the E-mail subject. Taking the subject, the text content and other features of the E-mail into consideration, we obtain the feature vector of every E-mail.
2) Binarize the features in the feature vector, that is, give every feature the value 0 or 1. Since we use the Boolean kernel MDNF here, the feature vector must be transformed into a Boolean feature vector.
3) Filter spam with the SVM based on the MDNF Boolean kernel.

To verify whether the algorithm is valid, we use k-fold cross-validation in our experiment. The E-mails are separated into k parts; k-1 parts are used for training and the remaining part for testing. The procedure loops k times, so that every part is tested once. Finally, the average of the test values is used as the evaluation result. Here we set k equal to 10.

4.2 Experiment Results and Analysis
In this experiment, we compare the classification accuracy of the spam filtering algorithm based on the Boolean kernel SVM with that of Naïve Bayes, linear SVM and non-linear SVM based on radial basis functions. The results are shown in Table 1.

In terms of classification accuracy, the highest is the SVM based on the MDNF Boolean kernel, the second highest is the non-linear SVM based on radial basis functions, and the lowest is Naïve Bayes. Classification accuracy alone cannot evaluate an E-mail classification algorithm completely, so we evaluate the algorithms further using the precision ratio, recall ratio and F1 given in Section 2. Table 2 compares the recall ratio, precision ratio and F1. From these measures we can evaluate the validity of the algorithms in a more comprehensive way. The experimental results show that the SVM based on the MDNF Boolean kernel has the best spam filtering performance of the four algorithms.

5. Conclusion
After analyzing the characteristics of spam, we propose a spam filtering algorithm using an SVM based on the MDNF Boolean kernel, with feature vectors built from the E-mail subject, text content, etc. The experiments show that this algorithm has higher classification accuracy, and a better recall ratio and precision ratio, than Naïve Bayes, linear SVM and SVM based on radial basis functions. In subsequent experiments, we will apply SVMs with more Boolean kernels to spam filtering and look forward to better results.

References
C.J. van Rijsbergen. (1979). Information Retrieval (2nd edition). Butterworths, London, 1979.
http://www.cs.cmu.edu/~enron/
Shawe-Taylor J, Cristianini N. (2005). Kernel Methods for Pattern Analysis. Beijing: China Machine Press, 2005:60-74.
Wang Bin, Pan Wenfeng. (2005). Content-based spam filtering technology. Journal of Chinese Information Processing. Beijing: 2005, 19(5):1-10.
Zhang Yang, Li Zhanhuai, Tang Yan, Cui Kebin. (2004). DRC-BK: Mining Classification Rules with Help of SVM. In the Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'04), Lecture Notes in Artificial Intelligence, Volume 3056, Springer-Verlag Press, 2004.

Table 1. Comparison of classification accuracy

Classification algorithm    Classification accuracy
NB                          92.5%
Linear SVM                  93.8%
RBF kernel SVM              94.7%
MDNF-SVM                    97.8%

Table 2. Comparison of recall, precision and F1

Classification algorithm    Recall    Precision    F1
NB                          90.4%     88.7%        89.5%
Linear SVM                  91.2%     90.5%        90.8%
RBF kernel SVM              92.2%     92.5%        92.3%
MDNF-SVM                    94.2%     95.5%        94.8%
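
The MDNF kernel of Section 3 can be sketched in a few lines of Python. This is an illustration only, under the assumption that the kernel has the product form $K_{MDNF}(U, V) = -1 + \prod_{i} (\sigma u_i v_i + 1)$; the function names `mdnf_kernel` and `gram_matrix`, the default `sigma` and the example vectors are hypothetical. For Boolean inputs each factor is either $1 + \sigma$ (both bits set) or 1, so the product equals $(1 + \sigma)^{\langle U, V \rangle}$, which the sketch uses as a cross-check.

```python
def mdnf_kernel(u, v, sigma=1.0):
    """MDNF kernel on two equal-length 0/1 vectors (assumed product form)."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    product = 1.0
    for ui, vi in zip(u, v):
        # The factor is (1 + sigma) when both bits are 1, otherwise 1.
        product *= sigma * ui * vi + 1.0
    return product - 1.0


def gram_matrix(vectors, sigma=1.0):
    """Kernel matrix over a list of Boolean feature vectors."""
    return [[mdnf_kernel(a, b, sigma) for b in vectors] for a in vectors]


# Hypothetical binarized E-mail feature vectors.
u = [1, 0, 1, 1, 0]
v = [1, 1, 1, 0, 0]
shared = sum(a * b for a, b in zip(u, v))    # inner product <U, V>
k_uv = mdnf_kernel(u, v, sigma=0.5)
closed_form = (1 + 0.5) ** shared - 1        # (1 + sigma)^<U, V> - 1
print(k_uv, closed_form)
```

A kernel matrix built this way could be handed to any SVM solver that accepts a precomputed kernel; which solver the experiment used is not stated, so none is assumed here.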