Fuzzy Logic for E-Mail Spam Deduction



Similar documents
Spam Filtering using Signed and Trust Reputation Management

Weighted Graph Approach for Trust Reputation Management

A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM FILTERING 1 2

Bayesian Spam Filtering

How To Filter Spam Image From A Picture By Color Or Color

Spam Filtering Methods for Filtering

Savita Teli 1, Santoshkumar Biradar 2

A Proposed Algorithm for Spam Filtering s by Hash Table Approach

escan Anti-Spam White Paper

IMPROVING SPAM FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT

A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters

Spam Detection Using Customized SimHash Function

Sender and Receiver Addresses as Cues for Anti-Spam Filtering Chih-Chien Wang

Representation of Electronic Mail Filtering Profiles: A User Study

Immunity from spam: an analysis of an artificial immune system for junk detection

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

Increasing the Accuracy of a Spam-Detecting Artificial Immune System

A Survey on Spam Filtering for Online Social Networks

Spam detection with data mining method:

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING

An Efficient Two-phase Spam Filtering Method Based on s Categorization

SPAM-What To Do SUMMERSET COMPUTER CLUB

Classification Using Data Reduction Method

On Attacking Statistical Spam Filters

Spam Filtering and Removing Spam Content from Massage by Using Naive Bayesian

T : Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari :

International Journal of Research in Advent Technology Available Online at:

DESIGN OF AN ONLINE EXPERT SYSTEM FOR CAREER GUIDANCE

Filtering Junk Mail with A Maximum Entropy Model

Analysis of Spam Filter Methods on SMTP Servers Category: Trends in Anti-Spam Development

A Monitor Tool for Anti-spam Mechanisms and Spammers Behavior

6367(Print), ISSN (Online) & TECHNOLOGY Volume 4, Issue 1, (IJCET) January- February (2013), IAEME

Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information

Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools

Filtering Noisy Contents in Online Social Network by using Rule Based Filtering System

Detecting Spam Using Spam Word Associations

Who will win the battle - Spammers or Service Providers?

Facilitating Business Process Discovery using Analysis

A LVQ-based neural network anti-spam approach

Image Spam Filtering Using Visual Information

Spam filtering. Peter Likarish Based on slides by EJ Jung 11/03/10

AN ENHANCED APPROACH FOR CONTENT FILTERING IN SPAM DETECTION

A Case-Based Approach to Spam Filtering that Can Track Concept Drift

An Anti-Spam Engine using Fuzzy Logic with Enhanced Performance Tuning

Simple Language Models for Spam Detection

Filters that use Spammy Words Only

DON T BE FOOLED BY SPAM FREE GUIDE. Provided by: Don t Be Fooled by Spam FREE GUIDE. December 2014 Oliver James Enterprise

Abstract. Find out if your mortgage rate is too high, NOW. Free Search

Image Spam Filtering by Content Obscuring Detection

WE DEFINE spam as an message that is unwanted basically

Spam or Not Spam That is the question

Marketing Glossary of Terms

Spam Filter: VSM based Intelligent Fuzzy Decision Maker

FRACTAL RECOGNITION AND PATTERN CLASSIFIER BASED SPAM FILTERING IN SERVICE

SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2

PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering

An Approach to Detect Spam s by Using Majority Voting

An Overview of Spam Blocking Techniques

Track-able Bulk Management System

Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages

Single-Class Learning for Spam Filtering: An Ensemble Approach

Three-Way Decisions Solution to Filter Spam An Empirical Study

An Efficient Spam Filtering Techniques for Account

A Content based Spam Filtering Using Optical Back Propagation Technique

PARTIAL IMAGE SPAM DETECTION USING OCR

An Imbalanced Spam Mail Filtering Method

ACCURATE KEYWORD BASED SPAMS FILTERING IN SOCIAL NETWORKS.

Is Spam Bad For Your Mailbox?

Spam Mail Detection through Data Mining A Comparative Performance Analysis

Controlling Spam s at the Routers

Spam Filters: Bayes vs. Chi-squared; Letters vs. Words

Spam Filtering with Naive Bayesian Classification

Hoodwinking Spam Filters

Antispam Security Best Practices

Prevention of Spam over IP Telephony (SPIT)

Markovian Process and Novel Secure Algorithm for Big Data in Two-Hop Wireless Networks

RULE-BASE DATA MINING SYSTEMS FOR

eprism Security Appliance 6.0 Intercept Anti-Spam Quick Start Guide

Groundbreaking Technology Redefines Spam Prevention. Analysis of a New High-Accuracy Method for Catching Spam

Real Time Traffic Monitoring With Bayesian Belief Networks

Controlling Spam at the Routers

Intercept Anti-Spam Quick Start Guide

SVM-Based Spam Filter with Active and Online Learning

Journal of Information Technology Impact

Anti Spamming Techniques

An Efficient Three-phase Spam Filtering Technique

About this documentation

GFI Product Comparison. GFI MailEssentials vs Barracuda Spam Firewall

Spam Filters: Bayes vs. Chi-squared; Letters vs. Words

Anti-Spam Methodologies: A Comparative Study

How To Use Neural Networks In Data Mining

Spam Filtering using Spam Mail Communities

Impact of Feature Selection Technique on Classification

MXSweep Hosted Protection

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Enrique Puertas Sánz Francisco Carrero García Universidad Europea de Madrid Villaviciosa de Odón Madrid, SPAIN

Accelerating Techniques for Rapid Mitigation of Phishing and Spam s

Combining Global and Personal Anti-Spam Filtering

Transcription:

Fuzzy Logic for E-Mail Spam Deduction P.SUDHAKAR 1, G.POONKUZHALI 2, K.THIAGARAJAN 3,R.KRIPA KESHAV 4, K.SARUKESI 5 1 Vernalis systems Pvt Ltd, Chennai- 600116 2,4 Department of Computer Science and Engineering, Rajalakshmi Engineering College, Affiliated to Anna University- Chennai, Tamil Nadu 3 Department of Science and Humanities, KCG College of Technology Affiliated to Anna University-Chennai, Tamil Nadu 5 Hindustan Institute of Technology and Science-Chennai,Tamil Nadu INDIA 1 sudhakar.asp@gmail.com, 2 poonkuzhali.s@rajalakshmi.edu.in, 3 vidhyamannan@yahoo.com, 4 kripa_keshav@yahoo.co.in, 5 profsaru@gmail.com Abstract: - In this Information era, most of the communication is happening through e-mails. Many e-mails contain web spam, as the transaction through this internet is affected by Passive attacks and Active attacks. Recently, several algorithms are developed partially for detecting spam emails. Therefore there is a need for improving the performance of the spam-detecting algorithm. In this proposed work fuzzy rules are defined and applied to all incoming emails. Based on the result of the various rules against user attitude, input email is classified as spam or ham. This method outperforms the existing spam-detecting algorithms in terms of accuracy and user friendly. Key-Words: - Attitude, Email, Email spam, Fuzzy, Fuzzy logic, Spam, Spam deduction 1 Introduction E-mail spam, also known as junk e-mail or unsolicited bulk e-mail (UBE), is a subset of spam that involves nearly identical messages sent to numerous recipients by e-mail. Definitions of spam usually include the aspects that e-mail is unsolicited and sent in bulk. One subset of UBE is UCE (unsolicited commercial e-mail).e-mail spam has steadily grown since the early 1990s. Botnets, networks of virus-infected computers, are used to send about 80% of spam. Since the cost of the spam is borne mostly by the recipient, it is effectively postage due advertising. The legal status of spam varies from one jurisdiction to another. Spammers collect e-mail addresses from chatrooms, websites, customer lists, newsgroups, and viruses which harvest users' address books, and are sold to other spammers. They also use a practice known as "e-mail appending" or "epending" in which they use known information about their target (such as a postal address) to search for the target's e- mail address. Much of spam is sent to invalid e-mail addresses. Spam averages 78% of all e-mail sent. According to the Message Anti-Abuse Working Group, the amount of spam email was between 88-92% of email messages sent in the first half of 2010. Most of the inbox is flooded with these Spams which occupies lot of memory space. There are several algorithms available for detecting and filtering spam emails. Among the existing algorithms, Bayesian filtering produces best result, still it does not detect all the spam emails. Most of the existing algorithms considers content alone for filtering the spam emails. To detect all the spam emails, existing spam filtering methods has to be enhanced. In this proposed work, a new algorithm is devised with various fuzzy rules and fuzzy variables. Each fuzzy rule will produce Attack Factor value which are consider for arriving result. Each rule Attack Factor value was arrived by comparing input parameter against Black list and White List. Black list contains predetermined spam content. White list contains acceptable contents. This final result from above calculated Attack Factor will decide the input email content to be spam or ham or to be sent hold state. The final result of the algorithm was obtained by summing up each rule result value and decision was taken based on the result of the individual rules. ISBN: 978-960-474-281-3 83

1.1 Related works Xavier Carreras et al.[1] proposed a Boosting algorithm for Anti Spam filtering. Even though Boosting algorithm delivers good result, possibility of misclassification costs persist inside the AdaBoost learning algorithm. William W. Cohen et al.[2] suggested Speech act theory for email filtering. The outcome of Speech act theory highly depend on the learning and this approach shows new projection for classifying email spam content. Harris Drucker et al.[6] developed support vector Machines for Spam Categorization. Even though support vector approach outperforms well, switching from training model need user intervention. Addition to that, reply emails are considered as no spam. Joes M.Gomez Hidalgo et al.[8] presents a new dimension for spam email classification. 1.2 Outline of the document Section 2 contains Generation of Fuzzy rules. Section 3 projects Fuzzy rule implementation. Section 4 delivers results and discussion on results. Section 5 propose future work and concludes. 2. Generation of Fuzzy s Input variable : {Sender saddress, SenderIP, SubjectWords, ContentWords, Attachment} Fuzzy set : {positive, Zero, Negative} Linguistic set : (highpositive, lowpostive,positive, highnegative,lownegative,negative, Zero} 1: a: IF SenderAddress spammer list AttackFactor=-0.25; b: IF SenderAddress to Ham list AttackFactor=0.25; c : IF Sender Address Spammerlist & Sender address Ham addresslist AttackFactor=0; 1.b : If there exist a sender address belongs to Ham list then, Attack Factor of this rule should be set to 0.25; 1.c : If there exist a sender address that doesn t belongs to spammer list and Ham list then, Attack Factor of this rule should be set to 0; 2 : a: IF SenderIP SpammerIPlist AttackFactor= -0.25; b: IF SenderIP HamIPlist AttackFactor=0.25; c: IF SenderIP SpammerIPlist & HamIPlist AttackFactot=0; 2.a : If there exists a sender IP address belongs to Spammer list, then Attack Factor of this rule should be set to -0.25; 2.b : if there exists a sender IP address belongs to Ham list, then Attack Factor of this rule was set to 0.25; 2.c : If there exists a sender IP address doesn t belongs to Spammer list and Ham List then Attack Factor of this rule was set to 0; 3: a: IF Subject words Spam words AttackFactor= -0.50; b: IF Subjectword Spamwords - 0.50<AttackFactor< 0.50 3.a: If all Subject words belongs to Spam words then, Attack Factor of this rule should be set to -0.50; 3.b : If there exists a subject word that belongs to spam word then Attack Factor of this rule is varies from -0.50 to +0.50; 1.a : If there exist a sender address belongs to spammer list, then Attack Factor of this rule should be set to -0.25; 4: a: IF Content words Spamwordlist AttackFactor= -0.50; ISBN: 978-960-474-281-3 84

b: IF Content words Spamwordslist - 0.50<AttackFactor< 0.50; 4.a : If all email content words belongs to Spam words then, Attack Factor of this rule should be set to -0.50; 4.b : If there exists an email content word that belongs to spam word then Attack Factor of this rule is varies from -0.50 to +0.50; 5 : a: IF Attachment VirusList AttackFactor=1.0; b: IF Attachment Visuslist AttackFactor= -1.0; 5.a : If all attachment doesn t belong to virus list then, Attack Factor of this rule is set to 1.0; 5.b : If there exist an attachment belongs to virus list, then Attack Factor of this rule is set to -1.0; 3. Fuzzy implementation Figure 1. Architecture of proposed system parameter- Sender address. Based on 1, Sender address was extracted from email and compared against the Black list which has spammer email address list. If any match was found then, Attack Factor for this rule was set to -0.25. If sender address was not found in the black list, then it was compared against the White list which contains all good and acceptable email addresses. If match was found, then attack factor for this rule was set to 0.25. If sender address was not found in both Black and White list, then attack factor for this rule was set to 0. Set this rule result in R1. 2 was applied on Fuzzy Input parameter Sender IP. IP Address of the sender was compared against the IP Address Black List. If match was found, then 2 Attack Factor was set to -0.25. If not found, then Sender IP Address was compared against White List IP Address. If match found then attack factor of 2 was set to 0.25. If not found then Attack Factor of the 2 was set to 0. Assign resultant value in R2. 3 was applied on Fuzzy input parameter Subject words. An Email may contain one or more words in subject line. All words are taken and compared against the Black list words. Every words impact (Impact Factor) on this subject line was calculated. Following are Algorithm to compute Attack Factor of 3 Algorithm for Email Subject Attack Factor Calculation: Step 1 :Split the Subject content into words say W i where n i 1 Step 2 : assign to T w = n Step 3 :Calculate word Impact Factor W f where W f = 0.5 /T w Step 4 :Perform comparison for each word W i in Black list Step 5 :If match found then update the update W fi = - W f else W fi = W f ; where i <= T w ; Step 6 : Calculate Attach Factor = Step 7 : Calculate R3 = ; 4 was applied on Fuzzy Input variable- ContentWords. Every email body may contain one or more words. Every word are taken and compared against the Block list words. Following are the Algorithm to compute Attack Factor of 4. When an email is arrived, fuzzy input parameters are extracted. 1 was applied on Fuzzy input ISBN: 978-960-474-281-3 85

Algorithm for Email Content Attack Factor Calculation: Step 1 :Split the email bodycontent to words say W i where i 1 Step 2: Count the total number of words in email Bodyand assign to Tw Step 3 : If T w > 0 then continue Step 4. Step 4 :Calculate word impact factor W f where W f = 0.5 /T w Step 5 :Perform comparison for each word Wi in Black list Step 6 :If match found then update the update W fi = - W f else W fi = W f ; where i Tw; Step 7 :Calculate Attach Factor = Step 8 : Calculate R4 = ; 5 was applied to calculate Attack Factor for email containing attachment. If email does not contain Attachment, then Attack Factor was set to zero. If any one of the attachment content was identified in virus list then Attack Factor was set to - 1. If none of the content was identified in virus list, then Attack Factor was set to 1. 5 result was assigned to R5. Result value of each was arrived by sum up previous rule results. R1 = R1; R2 = R1 + R2; R3 = R3 + R2 + R1; R4 = R4 + R3 + R2 + R1; R5 = R5 + R4 + R3 + R2 + R1; 4. Results and Discussion All fuzzy rules are applied over fuzzy input variables: Sender saddress, SenderIP, SubjectWords, ContentWords, Attachment for sample emails namely A,B,C,D and E. Results of the each fuzzy rules are distributed in the table. The results of the email can be obtained by plotting each rule result. Table 1: Results based on Fuzzy s Email The following graph was plotted for tabulated result where the X axis represent the s and Y axis predicts the Results. When line move towards the Positive value without any axis interception, then the identified email is Ham. When line starts with negative and lies in the negative region without axis interception, then the email is definitely spam. Whenever the result enters in the negative region, then the email is set to spam. If it lies in the axis, then the email is set to hold state. Result value 3 2 1 0-1 -2-3 1 2 3 Spam Deduction A-Ham C-Ham D-Hold 0 2 4 6 4 A 0.25 0.5 1 1.5 2.5 B 0.25 0-0.5 0-1 C 0.25 0 0.5 1 2 D 0 0 0 0 0 E -0.25-0.5-1 -1.5-2.5 5 B-Spam E-Spam Figure 2. Graphical Representation ISBN: 978-960-474-281-3 86

5. Conclusion and Future Work In this proposed work, Fuzzy rules are constructed for 5 input parameters namely Sender s Address, SenderIP, SubjectWords, ContentWords and Attachment for common user to deduct the spam emails. The proposed simplistic approach out performs in terms of accuracy and faster in deduction spam emails than the existing approaches. In future, this approach can be extended based on user attitude. User can tighten or relax their spam levels by moving x axis towards positive / negative region. Acknowledgment The authors would like to thank Dr. Ponnammal Natarajan worked as Former Director Research, Anna University- Chennai,India and currently an Advisor, (Research and Development), Rajalakshmi Engineering College and Dr. K..Ravi, Associate Professor, Department of Mathematics, Sacred Heart College-Tirupattur, India for their intuitive ideas and fruitful discussions with respect to the paper s contribution. References: [1] Carreras, X. and Mdrquez, L., Boosting trees for anti-spare email filtering, In Proc. of RANLP, 2001. [2] Cohen, W.W., Learning s that Classify E-Mail., Proceedings. of the AAAI Spring Symposium on Machine Learning in Information Access, Stanford, California,1996. [3] Cournane, A. and Hunt, R., An Analysis of the Tools Used For the Generation and Prevention of Spam, Computer and Security, Vol. 23, pp 154-166, 2004. [4] Cox, E., The Fuzzy System Handbook, Academic Press, Second Edition, 1999. [5] Daelemans, W., Z. Jakub, K. van der Sloot and A. van den Bosch, TiMBL: Tilburg Memory Based Learner, version 2.0, Reference Guide. ILK,Computational Linguistics, Tilburg University. http:/ilk.kub.nl/~ilk/papers/ilk9901.ps.gz, 1999. [6] Drucker, H., Wu, D., & Vapnik, V., Support vector machines for Spam categorization. IEEE-NN, Vol. 10, No.5, pp. 1048 1054,1999. [7] Graham, P., Better Baysian Filtering. In Proceedings of Spam Conference http://spamconference.org/proceedings2003.ht ml, 2003. [8] Hidalgo, J. G., Spez, M, and Sanz, E, Combining text and heuristicz for costsensitive spam filtering. In Proc. of CONL, 2000. [9] Lee, J., Spam: An escalating attack of the clones, The New York Times, 2002. [10] Mayer, C., and Eunjung-Cha, A., Making spam go splat: Sick of unsolicited e-mail, [11] businesses are fighting back, The Washington Post, 2002. [12] Norvig P. and Russell S., Artificial Intelligence A Modern Approach, Prentice Hall, New Jersey, 2003. [13] Nozaki, K., Ishibushi, H. and Tanaka, H., Trainable Fuzzy classification systems based on Fuzzy If-Then-s, Proc. IEEE, vol. 1, pp. 498-502, 1994. [14] RFC 822: Standard for the Format of Arpa Internet Text Messages, www.w3.org/protocols/rfc822/, 1994. [15] Sahami, M., Dumais, S., Heckerman, D. and Horvitz, E., A Bayesian Approach to Filtering Junk E-Mail. In Learning for Text Categorization, AAA1 Workshop, pp. 55-62, Madison Wisconsin, 1998. [16] SpamAssassin, www.spamassassin.org, 2004. P.Sudhakar received Bachelor of Engineering degree in Computer science from Anna University Chennai-India in 2006 and Master of Engineering degree in Computer Science from Anna University Chennai- India in 2008. He started his carrier as a Junior software programmer in Vernalis systems Pvt Ltd, Chennai India at 2008 and elevated to Associate software. He also presented various papers in National level conferences and published his research work in International Journals. G.Poonkuzhali received B.E degree in Computer Science and Engineering from University of Madras, Chennai, India, in 1998, and the M.E degree in Computer Science and Engineering from Sathyabama University, Chennai, India, in 2005. Currently she is pursuing Ph.D programme in the Department of Information and Communication ISBN: 978-960-474-281-3 87

Engineering at Anna University Chennai, India. She has presented and published 10 research papers in international conferences & journals and authored 5 books. She is a life member of ISTE (Indian Society for Technical Education),IAENG (International Association of Engineers), and CSI (Computer Society of India). K.Thiagarajan working as Senior Lecturer in the Department of Mathematics in KCG College of Technology - Chennai-India. He has totally 14 years of experience in teaching. He has attended and presented research articles in 33 National and International Conferences and published one national journal and 26 international journals. Currently he is working on web mining through automata and set theory. His area of specialization is coloring of graphs and DNA Computing. Dr. K. Sarukesi has a very distinguished career spanning of nearly 40 years. He has a vast teaching experience in various universities in India and abroad. He was awarded a commonwealth scholarship by the association of common wealth universities, London for doing Ph.D in UK. He completed his Ph.D from the University of Warwick U.K in the year 1982. His area of specializations is Technological Information System. He worked as expert in various foreign universities. He has executed number of consultancy projects. he has been honored and awarded commendations for his work in the field of information technology by the government of TamilNadu. He has published over 40 research papers in international conferences/journals and 40 National Conferences/journals. R.Kripa Keshav currently undergraduate student of Rajalakshmi Engineering College. He is the Member of computer society of india. He has presented one paper in national conference and won the best paper award and one paper published in international journal. ISBN: 978-960-474-281-3 88