A Score Point based Email Spam Filtering Genetic Algorithm

Preeti Trivedi
Kanpur Institute of Technology, Kanpur, India
Email: preetitrivedi705@gmail.com

Mr. Sudhir Singh
Kanpur Institute of Technology, Kanpur, India
Email: ss@kit.ac.in

Abstract

E-mail is one of the most essential means of communication over the Internet today. However, each day we spend several minutes deleting spam related to product advertisements, low-interest loan offers, drugs, etc. Although spam filters are capable of identifying spam mails, spammers are constantly evolving newer methods to send spam messages to more and more people. With the advance of technology, mobile and other portable electronic devices are now Wi-Fi enabled, and Internet telephony (VoIP, voice over Internet protocol) has made communicating across the world easier and inexpensive. Social networks like Twitter, Facebook, MySpace and Orkut are very common means of connecting with friends across the world. However, this has opened a newer audience for spammers to exploit. Spam is no longer limited to e-mail: it is on VoIP in the form of unsolicited marketing or advertising phone calls, and on social networks in the form of marketing, advertising and pornography links. Spam is everywhere. This paper presents a genetic algorithm based spam filtering technique whose fitness function is based on the score point. We show that the considered algorithm provides a good recognition rate of 84% at an FPR of 0.001.

1. Introduction

The Internet today is fast, cheap and efficient. These three features have made it a reliable means of communication for personal, business and academic purposes. E-mail is therefore a dynamic tool for business and personal correspondence [1]. The advent of e-mail has reduced the use of greeting cards and letters because of the ease and speed of e-mail delivery.
Nowadays we receive not only solicited mails, called HAM mails, but also unwanted mails, called SPAM mails. Generally, spam is unsolicited commercial e-mail: mail that e-mail users do not want but that is nevertheless directed to the e-mail owner or end user. Spam is thus unwanted e-mail sent out in bulk, used mainly by companies or unethical marketers to advertise their products, to market goods, or to offer useless services. Some spam e-mails take the form of jokes or scams sent out in bulk. Today, most spam mails advertise cheap software, jewellery, medical supplies, money-making schemes and pornography, to mention but a few. Many of these spam mails carry a single attractive image that may tempt the user to click on it. Some messages contain links that lead directly to a server where innocent users get infected with malware [2].

Despite the beneficial features of the Internet, it is also gradually becoming a vehicle for malicious activities. Over the years, worms, viruses and security issues have been seen as the major problems facing information technology. However, spam is taking on a new dimension daily and is becoming as problematic as worms and viruses. Distributing bulk messages costs unknown senders little or nothing [4-6]. Hence, e-mail users still spend a lot of time and effort recognizing legitimate mails and deleting spam from their mailboxes. It is irritating when users check their mailbox and find thousands of mails from unknown senders with a reasonably good heading but irrelevant content that does not concern them. Such mail exists only for spammers to promote their unethical markets, and may be called spam. Spam has now emerged as one of the largest and most irritating problems of the Internet.
Every e-mail user receives numerous spam mails daily, and there is still no proper solution to protect an active e-mail address from becoming a spam target. In spite of all defenses there is the risk that the wrong people may obtain the address through a link on a website, a subscription to a newsletter, participation in a contest, or a virus or worm on the system of a friend. There are many possible ways in which an address can reach a bulk mailer, and once it happens there is no way to stop those advertisers from sending a flood of "great deals". It is also very difficult for home users to stop spam.

Companies suffer from spam in numerous ways. Employees waste their time scanning the inbox to extract the good mails. In this procedure there is a probability of missing an important mail, which increases with the number of spam mails present. The volume of spam also increases the workload of the mail servers. This can lead to system slowdowns and thus to reduced effectiveness of the company's workflow.

Many researchers all over the world are involved in extensive research to fight spam, yet a solution with very good efficiency is still not available. As spam filtering is a complex problem, it is not possible to stop spam e-mails with one single solution. Because the structure of spam e-mails changes continuously, we need a solution that is adaptive in nature. In this paper, a genetic algorithm based spam filtering method is proposed in which the fitness function is designed using the experimental results of the genetic algorithm.

2. Challenges in Spam E-mail Filtering

The growth of social media websites such as Facebook and MySpace has helped spammers use tricks to retrieve information such as the e-mail addresses and passwords of users [5]. According to a 2008 market survey of social network users, 83% of users had received at least one unwanted message (spam) or friend request [6-7]. Another method spammers use is spyware downloaded onto a user's computer without the knowledge of the owner. The spyware roots through the computer until it locates the user's address book, which is then used to promote the spammers' unethical market.
A lot needs to be done to improve computer security, because spammers make use of innocent users' computers or impersonate legitimate businesses in order to send large volumes of spam. Spammers also use such tools to steal bank details, social security numbers and credit card information in order to promote their market and commit identity theft. When spammers operate through compromised machines, it becomes very difficult for ISPs to locate the source of the spam. Computer users should therefore protect their machines with a firewall and with anti-virus and anti-spyware scans, as recommended by different spam-fighting companies.

Spam filtering addresses not only technical problems, such as overuse of network bandwidth and e-mail storage, but also social issues such as child safety and phishing e-mail. Many of these methods have served ISPs well, saving them a great deal of money and filtering spam into junk folders while allowing legitimate mails to be delivered, but they still cannot be fully relied upon because they are not sufficiently effective. Most spam filters are liable to false positives, when legitimate mails are classified as spam, and to false negatives, when junk mails arrive in the user's inbox. It is therefore essential to consider the occurrence of false positives and false negatives when evaluating a filter. Of the four possible outcomes (spam classified as spam, ham classified as ham, ham classified as spam, and spam classified as ham), the first two are the ones we want to happen and the last two are the outcomes we do not want. To measure the latter we define the false positive ratio (FPR) and the false negative ratio (FNR) as follows:

$$\mathrm{FPR} = \frac{\text{e-mails wrongly identified as spam}}{\text{total e-mails}}$$

$$\mathrm{FNR} = \frac{\text{e-mails wrongly identified as ham}}{\text{total e-mails}}$$

3. Genetic Algorithms

The Genetic Algorithm (GA) technique has many advantages over traditional non-linear solution techniques. However, neither of these techniques always achieves an optimal solution.
The GA nevertheless reaches a near-optimal solution more easily than other methods.

Genetic Algorithm Steps

The details of how genetic algorithms work are explained below. The general layout of the genetic operations is shown in Fig. 1.

[Figure: Genetic Algorithm Flow Chart]

The basic idea of the GA is to generate offspring that are better than the previous generation. As detailed in the figure, we generate a solution by selecting parents; if the solution is good enough we stop, otherwise we select new parents to reproduce and crossover is performed again to generate new offspring. This process continues until offspring with the desired results are generated.

A. Initialization

In a genetic algorithm based solution, the initial population is in general generated randomly. However, in some applications where the range of the solution is known, the initial population is not generated randomly. With this approach we are able to give the GA a good starting point and speed up the evolutionary process.

B. Reproduction

In reproduction, the complete population is replaced in each generation following the survival-of-the-fittest principle. In this approach, two mates from the old generation are coupled together to produce two offspring. For an initial population of N, this procedure is repeated N/2 times, thus producing N newly generated chromosomes.
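The generational loop outlined above (random initialization, then N/2 matings that each yield two offspring and replace the whole population) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the population size, the number of generations and the bit-counting fitness used in the usage note are all hypothetical, and the selection, crossover and mutation operators are simple versions of the ones this section describes. The default crossover and mutation rates are the 12% and 3% reported later in the paper.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total == 0:
        # All fitnesses are zero: fall back to a uniform choice.
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for chromosome, fit in zip(population, fitnesses):
        running += fit
        if pick <= running:
            return chromosome
    return population[-1]


def one_point_crossover(a, b, rng):
    """Keep the bits before a random position, swap the bits after it."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chromosome]


def evolve(fitness, n_bits=70, pop_size=20, generations=50,
           crossover_rate=0.12, mutation_rate=0.03, seed=0):
    """Run a generational GA and return the fittest final chromosome."""
    rng = random.Random(seed)
    # A. Initialization: random bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        fitnesses = [fitness(c) for c in population]
        next_generation = []
        # B. Reproduction: N/2 matings, two offspring each.
        for _ in range(pop_size // 2):
            p1 = roulette_select(population, fitnesses, rng)
            p2 = roulette_select(population, fitnesses, rng)
            if rng.random() < crossover_rate:
                c1, c2 = one_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            next_generation.append(mutate(c1, mutation_rate, rng))
            next_generation.append(mutate(c2, mutation_rate, rng))
        population = next_generation
    return max(population, key=fitness)
```

For example, `evolve(sum)` evolves 70-bit chromosomes toward strings with many 1 bits; `sum` is only a stand-in fitness chosen to keep the sketch self-contained.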

Fig. 1 Genetic Operation and Evolution Process

C. Parent Selection Mechanism

In general, the chance of each parent being selected is related to its fitness, and a probabilistic method is used for parent selection.

Fitness-based selection

In this work, parent selection is done using Roulette Wheel selection, also called fitness-based selection. In this method, the probability that a chromosome is selected is directly proportional to its fitness. The effect of this depends on the range of fitness values in the current population.

D. Crossover Operator

Crossover is the most important operation in the GA because it produces offspring. As the name suggests, crossover is a process of recombining bit strings via an exchange of segments between pairs of chromosomes. In this paper, one-point crossover is considered.

One-point Crossover

In one-point crossover, a bit position is selected at random. The bits before this position are kept unchanged, and the bits after the crossover position are swapped between the two parents.

Fig. 2 Schematic of one point crossover

E. Mutation

Mutation helps ensure that all possible chromosomes remain reachable and that good genes are maintained in the newly generated chromosomes. The mutation operator randomly selects a bit position in a string and flips it if required. This is useful because crossover cannot produce new alleles that do not appear in the initial generation; with mutation, new types of chromosomes combining old and new characters can be generated.

4. Genetic Algorithm in E-mail Filtering Process

A genetic algorithm can be used as a spam classifier. The collection of e-mails is called a corpus [2-4]. Spam mails from the corpus are encoded into a class of chromosomes, and these chromosomes undergo the genetic operations, i.e., crossover, mutation and evaluation by the fitness function. The rule set for spam mails is developed using the genetic algorithm.

Rules for classifying the emails: The weights of the words of each gene in the testing mail and the weights of the words of each gene in the spam mail prototypes are compared, and the matched genes are found. If the number of matched genes is greater than some number, say x, the mail is considered spam.

Fitness Function:

$$F = \begin{cases} 1 & \text{SPAM mail} \\ 0 & \text{HAM mail} \end{cases}$$

The basic idea is to separate SPAM and HAM mails among the mails arriving in the mailbox. The fitness function is itself problem dependent and cannot be fixed in advance for SPAM e-mail filtering. To derive the fitness function we carried out experiments and found that the minimum score point over the available 1346 SPAM mails was 1. Hence, we defined our fitness function as

$$F = \begin{cases} 1 & \text{score point} \ge 1 \\ 0 & \text{score point} < 1 \end{cases}$$

In the earlier work [5], a minimum score point of 3 is considered for the classification of SPAM and HAM mails. GA based methods only look for the words available in the data dictionary when classifying spam. In general, looking for a single word in arriving e-mails and deciding between SPAM and HAM on the basis of that word alone will produce a lot of errors. Hence, it is necessary to consider the words, their frequency and the total number of words in an e-mail for more accurate classification of mails. In the GA, words in the data dictionary can be added or deleted, so it is adaptive in nature.

In our experiment we considered a database of 2448 e-mails, of which 1346 are SPAM mails and the remaining 1102 are HAM mails [7]. The data dictionary contains 421 words, subdivided into seven categories C1-C7. The data dictionary can be found in Appendix A.

The procedure for calculating the weight of a word of a particular group is detailed below. Suppose, for example, that an e-mail contains the four dictionary words hotel, luxury, tax and transaction. Of these, tax and transaction belong to category C2, and hotel and luxury belong to category C5 (see Appendix A). Let us consider an e-mail with 567 words, of which 237 are the words hotel, luxury, tax and transaction, with frequencies 84, 23, 97 and 33 respectively. Such large frequencies are chosen to make sure that the considered mail is a spam mail, since the spam database is very small and contains only 421 words.

The words extracted from the e-mail are first checked to see whether they belong to any category of the spam database. If a word in the e-mail matches a word in the spam data dictionary, the probability of getting that word from the spam database is calculated using the simple formula

$$W_w = \frac{S_{WM}}{T_{WM}}$$

where
S_WM : total occurrences of the spam word in the e-mail
T_WM : total words in the e-mail

The weight for the word hotel is

$$W_w = \frac{84}{567} \approx 0.1481$$

Similarly, the weight of the word luxury is 0.04056. The weight of a category is calculated by taking the average over the words of that category; for example, the weight of category C5 is (0.1481 + 0.04056)/2 = 0.0944. The weights obtained for each word are tabulated in Table 1.

Table 1: Calculation of weights under the average weightage method

Group   Word          Frequency   Weight of word   Weight of group
C2      tax           97          0.1711           0.1146
C2      transaction   33          0.0582
C5      hotel         84          0.1481           0.0944
C5      luxury        23          0.04056

After normalization, the weights are converted to the range 0.000 to 1.000.
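The weight calculation above can be reproduced in a few lines. The following is a sketch under stated assumptions: `CATEGORY_OF` is a hypothetical four-word excerpt of the data dictionary, and the function names are invented for illustration. It computes each word weight as frequency divided by the total number of words in the e-mail, then averages the word weights within each category.

```python
from collections import defaultdict

# Hypothetical excerpt of the data dictionary: word -> category.
CATEGORY_OF = {
    "tax": "C2",
    "transaction": "C2",
    "hotel": "C5",
    "luxury": "C5",
}


def word_weights(frequencies, total_words):
    """Weight of each dictionary word: occurrences / total words in the mail."""
    return {word: count / total_words
            for word, count in frequencies.items()
            if word in CATEGORY_OF}


def category_weights(weights):
    """Average the word weights within each category."""
    grouped = defaultdict(list)
    for word, weight in weights.items():
        grouped[CATEGORY_OF[word]].append(weight)
    return {cat: sum(ws) / len(ws) for cat, ws in grouped.items()}


# The worked example from the text: a 567-word e-mail.
frequencies = {"hotel": 84, "luxury": 23, "tax": 97, "transaction": 33}
weights = word_weights(frequencies, total_words=567)
groups = category_weights(weights)

for word, weight in sorted(weights.items()):
    print(f"{word:12s} {weight:.4f}")
for cat, weight in sorted(groups.items()):
    print(f"{cat:12s} {weight:.4f}")
```

Words that are not in the dictionary are ignored when computing weights, but they still count toward `total_words`, which matches the 237-of-567 setup of the example.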
Each normalized weight is then encoded as a 10-bit gene:

Binary 0000000000 represents weight 0.000
Binary 0000000001 represents weight 0.001
Binary 0000000010 represents weight 0.002
...
Binary 1111100111 represents weight 0.999
Binary 1111101000 represents weight 1.000

Figure 2 SPAM chromosome prototype

As shown in Figure 2, each mail is encoded into a chromosome of 70 bits, 10 bits for each category. In the figure the weight value is represented in hexadecimal format. Once chromosomes have been constructed for all the mails, the genetic algorithm starts and crossover takes place. In each generation only 12% of the chromosomes are crossed, and the mutation rate is only 3%.

The weights of the words of each gene in the testing mail and in the spam mail prototype are compared to find the matched genes. If the number of matched genes is greater than or equal to one, the spam mail prototype receives one score point. If the score points reach some threshold, the mail is considered a spam mail. The threshold can be manually adjusted to get appropriate results. It must be remembered that we have chosen the fitness function on the basis of our experimental results.
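The encoding and score-point rule described above can be sketched as follows, together with the FPR and FNR from Section 2 for checking a chosen threshold. Several details are assumptions, because the paper does not spell them out: a gene is treated as "matched" when both weights are non-zero and their encoded levels differ by at most `tolerance`, the default threshold of 1 follows the fitness function, and all function names are hypothetical. As in Section 2, both error ratios are divided by the total number of e-mails.

```python
GENE_BITS = 10
CATEGORIES = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]


def encode_gene(weight):
    """Map a normalized weight in [0, 1] to a 10-bit string, 0.001 per step."""
    level = round(weight * 1000)
    if not 0 <= level <= 1000:
        raise ValueError("weight must lie in [0, 1]")
    return format(level, "010b")


def encode_mail(category_weights):
    """Concatenate one gene per category into a 70-bit chromosome."""
    return "".join(encode_gene(category_weights.get(cat, 0.0))
                   for cat in CATEGORIES)


def decode_genes(chromosome):
    """Split a chromosome into its integer gene levels (0..1000)."""
    return [int(chromosome[i:i + GENE_BITS], 2)
            for i in range(0, len(chromosome), GENE_BITS)]


def matched_genes(mail, prototype, tolerance=0):
    """Count matching genes (assumed rule: both non-zero and close enough)."""
    return sum(1 for m, p in zip(decode_genes(mail), decode_genes(prototype))
               if m > 0 and p > 0 and abs(m - p) <= tolerance)


def score_points(mail, prototypes, tolerance=0):
    """One score point per spam prototype with at least one matched gene."""
    return sum(1 for p in prototypes
               if matched_genes(mail, p, tolerance) >= 1)


def is_spam(mail, prototypes, threshold=1, tolerance=0):
    """Classify as spam when the score points reach the threshold."""
    return score_points(mail, prototypes, tolerance) >= threshold


def error_rates(predicted, actual):
    """Return (FPR, FNR) over all e-mails; True means spam."""
    total = len(actual)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    false_neg = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return false_pos / total, false_neg / total
```

Raising `threshold` or lowering `tolerance` makes the filter stricter about what it calls spam, which trades false positives for false negatives; `error_rates` gives a quick way to see that trade-off on a labelled set.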

5. Results

Figure 3 Recognition Rate Vs. FPR

In Figure 3, the recognition rate is plotted against the false positive ratio (FPR). At an FPR level of 0.001 the recognition rate is nearly 84%, which increases to 97.5% at an FPR level of 1. In our early results we found that the larger the number of words in a mail, the more likely it is to be classified correctly. We have checked our algorithm on a large corpus of 2448 mails, of which 1346 were SPAM mails and the rest were HAM mails. Results on such a large e-mail corpus give a more accurate picture of the classification of mails and of the effectiveness of the GA. In earlier experiments only 140 mails were considered, and it was stated that the running time of the GA is very large, so that a larger mail corpus would take much more time. We therefore ran this experiment on a high-end machine to get a clearer and more accurate picture of the GA.

6. Conclusion

This paper discusses a genetic algorithm based e-mail spam classification method. The major advantage of the GA based approach is that it is adaptive in nature. The technique can be applied on the user side; the recognition rate is 84% at an FPR of 0.001. As the recognition rate is not very high, this technique can be used along with some other technique such as a neural network or a Bayesian classifier. The results of the two techniques can be combined using fuzzy logic to get a higher recognition rate.

7. References

[1] Enrico Blanzieri and Anton Bryl, "A Survey of Learning Based Techniques of Email Spam Filtering," Conference on Email and Anti-Spam, 2008.
[2] K.S. Tang et al., "Genetic Algorithms and Their Applications," IEEE Signal Processing Magazine, pp. 22-37, Nov. 1996.
[3] Usarat Sanpakdee et al., "Adaptive Spam Mail Filtering Using Genetic Algorithm," ICACT 2006.
[4] Yang J. and Honavar V., "Feature subset selection using a genetic algorithm," Intelligent Systems and their Applications, IEEE, Vol. 13, No. 2, pp. 44-49, 1998.
[5] Shrivastava, J. N. and Bindu, M. H., "E-mail Spam Filtering Using Adaptive Genetic Algorithm," International Journal of Intelligent Systems & Applications, Vol. 6, No. 2, pp. 54-60, 2014.
[6] Karimpour, J., A.A. Noroozi and A. Abadi, "The Impact of Feature Selection on Web Spam detection," International Journal of Intelligent Systems and Applications (IJISA), Vol. 4, No. 9, p. 61, 2012.
[7] Spam Assassin, http://spamassassin.org.

Appendix A

Group, content, and example of keywords in each group

C1 (Adult): adult, aphrodisiac, big, cam, climax, company, cum, desire, erotic, fantasy, fuck, gay, girl, great, guy, hard, hardcore, heaven, hot, huge, long, man, max, max length, nude, orgasm, penis, performance, pheromone, pill, porn, powerful, pussy, satisfy, sex, stamina, sweet, teen, Viagra, webcam, x, xxx, xxx-porn, young, love, teen, anus

C2 (Financial): Account, accountant, alert, analyst, attorney, bank, bankruptcy, benefit, bill, billing, broker, budget, building, cash, cheque, commission, consolidate, court, credit, creditor, currency, customer, debt, deposit, discover, economy, entrepreneur, estate, exchange, fee, finance, freedom, fund, help, highrisk, insurance, invest, investor, judgment, legal, legitimate, lender, loan, mastercard, mortgage, obligate, pay, payable, payable, paycheck, promote, purchase, rate, refinance, refund, rent, revenue, risk, service, statement, stock, support, tax, transaction, vat, visa, wealth, worth, service

C3 (Commercial): college, commerce, computer, cost, deliver, discount, especial, expensive, express, fantastic, free, furnishing, furniture, game, get, gif, gift, great, guarantee, inexpensive, invite, item, just, keyboard, license, lifetime, magazine, maintenance, mall, market, material, materials, mobile, motherboard, mouse, offer, online, only, order, palm, pamphlet, percent, premium, price, produce, product, program, recommend, refill, release, resell, reseller, retail, sale, save, save, sell, ship, shipping, shop, shopping, special, subscribe, supply, surprise, trade, trademark, upgrade, voucher, whole, wholesale, within

C4 (Beauty and Diet): after, age, amaze, anti-aging, appetite, beauty, become, before, believe, blood, body, botanic, breast, build, burn, calorie, capsule, card, cell, change, chemical, cholesterol, confirm, course, diet, difference, dose, drug, effect, effective, eliminate, energy, enhance, exercise, eye, face, fast, fat, firm, fit, fitness, flexible, gary, grow, grown, growth, hair, health, healthcare, heart, height, herb, herbal, hormone, improve, inche, incredible, kidney, large, laser, life-changing, light, lose, loss, low, magic, medicine, metabolism, micro-cap, miracle, modem, move, muscle, nature, nutrient, old, over, overweight, permanent, plain, potential, pound, power, protect, reduce, remanufacture, repair, restore, retain, reverse, safe, satisfaction, secret, size, step, strength, strong, tablet, therapy, thin, toxin, treatment, under, virginia, vitamin, weight, woman, wonderful, wrinkle

C5 (Traveling): book, deluxe, excite, guide, holiday, honest, hotel, luxury, meal, package, plan, problem, relax, relief, reserve, resort, summer, temple, ticket, tour, train, travel, traveler, trip, vacation

C6 (Home-Based Business): address, astonishment, base, broadcast, bulk, business, comfort, connect, demo, domain, downline, download, earn, email, emailing, ethernet, facemail, fresh, home, homebased, homeworker, host, income, interest, international, internet, investigate, job, list, lucrative, mail, mailbox, mailer, mailing, make, marketing, message, million, money-making, opportunity, part-time, people, private, profit, reach, receive, recipient, require, re-register, return, server, software, subscriber, success, teach, unsubscribe, user, visit, website, work, work-at-home, worker, working

C7 (Gambling): action, award, bet, bonus, casino, challenge, extra, gambling, gold, hunt, lass, lucky, millionaire, player, poker, prize, reward, rich, vegas, win, lottery