GRADUATION PROJECT REPORT




SPAM Filter
School of Public Administration, Computer Studies Program
GRADUATION PROJECT REPORT 2007-I-A02
MCCS390 Graduation Project I
Project group leader: Marco, Lou Cha Wa (P-05-0463-7)
Project group member: Macro, Leog Weg Hog (P0406485)
Supervisor: Philip Le
Assessor: Andrew Su
Academic year (semester): 2007/2008 (2nd semester)

Content table:
Abstract
Objective
Features
System Architecture
Learning Process
Classification Process
Naive Bayesian
Problems of Naive Bayesian and improvements
Our variation of the Naive Bayesian algorithm
Experiment
Prototype
Future work
Development Environment
System Requirements
Conclusion
Job distribution

Abstract:
Spam has been a problem for many years, but only recently have people started to become disgusted with it. Performance could also be a concern for many people. The problem does not appear only in e-mail; spam SMS on mobile phones is also common. Currently there are many famous spam filter services, such as those of Yahoo and Google.

Why do we develop this project? The definition of spam or non-spam is very subjective, so our method follows the user's habits and does not rely on any standard to define spam or non-spam. For this reason, everyone can define his own spam and non-spam training sets.

Studying Bayesian technology and proposing some improvements to the Bayesian formula are the main challenges of this project. In that part, we make some changes to Naive Bayesian and perform some experiments to validate our filter. We program on the Java platform, so the filter can easily be applied in various versions, such as a server version, a client version and a mobile version. We will extend our system to handle SMS spam in the next semester.

Introduction:
Several common anti-spam e-mail technologies:
1. E-mail header analysis: The mail headers are scanned for small inconsistencies that can give away forgeries: a mail date in the past or the future, forged message IDs, and the like.
2. Keyword checking and text analysis: The mail body is scanned for typical spam mail content, such as spam keywords, capitalized letters, or invitations to buy or click something.
3. White and blacklist checking: White and black lists can be used to configure which e-mail addresses are permitted or denied.
4. Bayesian spam filtering: Bayesian spam filtering statistically calculates the probability that a message is spam. The filter can be trained on spam mails, so that similar mails are more likely to be identified as spam in the future.

Objective & Goal:
The goal of our project is the investigation of SMS spam filtering using Bayesian methods. In this semester, our objective is to investigate spam filtering for English e-mails. We study Bayesian spam filters and propose some improvements, and then we perform some experiments to validate our spam filter. Finally, we develop a prototype to show that our research results can be applied easily.

Features:
Our spam filter has the following features:
Customizable: Everyone can define his own spam and non-spam training sets.
Adaptive: The classification process can filter new spam once the learning process has updated the training set.
Accurate: The rate of correct classification is high (over 95%).
Platform independent: The filter is programmed on the Java platform, so it can be used in various versions (e.g. server version, client version, mobile version).

System Architecture:
[Figure: system architecture diagram]

Task A: Learning Process:
The Learning Process counts the word frequencies in the training set and then outputs three maps. The three maps (Spam, Non-spam, All) contain the following:
Spam Map: word frequency in spam messages.
Non-spam Map: word frequency in non-spam messages.
All Map: word frequency summed over the spam and non-spam messages.
(e.g.) All Map(Viagra, 40) = Spam(Viagra, 37) + Non-spam(Viagra, 3)
All Map(the, 40) = Spam(the, 20) + Non-spam(the, 20)

When to run the Learning Process:
The user may perform the learning process again when newly collected spam and non-spam messages are available. The user can define how often the learning process runs, e.g. daily, weekly or monthly. The learning process must run periodically to keep the data set up to date.

Task B: Classification Process:
Based on the three maps from the learning process, the Classification Process applies the Bayesian formula to estimate the probability that a new message is SPAM. If the estimate is greater than 50%, the classification result is SPAM; otherwise it is non-SPAM. Our classification algorithm is a variation of naive Bayesian filtering.
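The three maps described above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class and method names (`LearningSketch`, `countWords`, `mergeMaps`) are hypothetical, and the tokenization rule (lower-casing and splitting on non-letters) is an assumption, since the report does not say how words are extracted.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; names are illustrative and not taken from the project.
class LearningSketch {

    // Count how often each word occurs across a list of messages.
    // Assumption: words are lower-cased and split on any non-letter character.
    static Map<String, Integer> countWords(List<String> messages) {
        Map<String, Integer> counts = new HashMap<>();
        for (String message : messages) {
            for (String word : message.toLowerCase().split("[^a-z]+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // The All map is the per-word sum of the Spam map and the Non-spam map.
    static Map<String, Integer> mergeMaps(Map<String, Integer> spam,
                                          Map<String, Integer> nonSpam) {
        Map<String, Integer> all = new HashMap<>(spam);
        nonSpam.forEach((word, n) -> all.merge(word, n, Integer::sum));
        return all;
    }

    public static void main(String[] args) {
        Map<String, Integer> spam =
                countWords(Arrays.asList("Buy Viagra now", "Viagra for the win"));
        Map<String, Integer> nonSpam =
                countWords(Arrays.asList("See you at the meeting", "The report is ready"));
        Map<String, Integer> all = mergeMaps(spam, nonSpam);
        System.out.println("Spam(viagra) = " + spam.get("viagra"));
        System.out.println("All(the) = " + all.get("the"));
    }
}
```

Re-running `countWords` on freshly collected messages and merging again corresponds to repeating the learning process on a daily, weekly or monthly schedule.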

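Naive Bayesian:

$$P(\mathrm{SPAM} \mid D) = \frac{P(S \mid \mathrm{SPAM})}{P(S \mid \mathrm{SPAM}) + \bigl(1 - P(S \mid \mathrm{SPAM})\bigr)}$$

$D = \{S_1, S_2, \ldots, S_n\}$: message $D$ is a set of words.
$P(S_i \mid \mathrm{SPAM})$: the probability that $S_i$ appears in spam, i.e. the count of $S_i$ divided by the number of spam messages in the training set.

$$P(S \mid \mathrm{SPAM}) = P(S_1 \mid \mathrm{SPAM}) \cdot P(S_2 \mid \mathrm{SPAM}) \cdots P(S_n \mid \mathrm{SPAM})$$

$P(\mathrm{SPAM} \mid D)$: the probability that the message is SPAM.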
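The two definitions above (the per-word probability and the product over all words of a message) can be illustrated with a short Java sketch. The names `NaiveBayesSketch`, `wordProbability` and `messageProbability` are hypothetical, and the counts in `main` are made-up illustration values, not data from the project.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and sample counts are illustrative only.
class NaiveBayesSketch {

    // P(S_i | SPAM) = count of S_i in the spam map / number of spam messages.
    static double wordProbability(Map<String, Integer> spamMap, String word, int spamMessages) {
        return spamMap.getOrDefault(word, 0) / (double) spamMessages;
    }

    // P(S | SPAM) = P(S_1 | SPAM) * P(S_2 | SPAM) * ... * P(S_n | SPAM).
    static double messageProbability(Map<String, Integer> spamMap,
                                     List<String> words, int spamMessages) {
        double product = 1.0;
        for (String word : words) {
            product *= wordProbability(spamMap, word, spamMessages);
        }
        return product;
    }

    public static void main(String[] args) {
        Map<String, Integer> spamMap = new HashMap<>();
        spamMap.put("viagra", 40); // made-up counts
        spamMap.put("buy", 20);
        int spamMessages = 80;

        System.out.println(messageProbability(spamMap, Arrays.asList("viagra", "buy"), spamMessages));
        // A word that never occurs in the spam map contributes a factor of 0.
        System.out.println(messageProbability(spamMap, Arrays.asList("viagra", "meeting"), spamMessages));
    }
}
```

Because the per-word probabilities are multiplied, a single word that is absent from the spam map forces the whole product to zero.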

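Problems of Naive Bayesian and improvements:

$$P(\mathrm{SPAM} \mid D) = \frac{P(S \mid \mathrm{SPAM})}{P(S \mid \mathrm{SPAM}) + \bigl(1 - P(S \mid \mathrm{SPAM})\bigr)}$$

Problem: If a message contains a word $S_i$ with $P(S_i \mid \mathrm{SPAM}) = 0$, then $P(\mathrm{SPAM} \mid D) = 0$. In other words, if $S_i$ does not appear in the spam map, the message is classified as non-spam.
Improvement: Use $P(S \mid \mathrm{non\_SPAM})$ instead of $(1 - P(S \mid \mathrm{SPAM}))$.
Experiment result: After this improvement, a word that does not appear in the spam map is looked up in the non-spam map, and the formula becomes the following.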

$$P(\mathrm{SPAM} \mid D) = \frac{P(S \mid \mathrm{SPAM})}{P(S \mid \mathrm{SPAM}) + P(S \mid \mathrm{non\_SPAM})}$$

Problems:
1. If $S_i$ appears in neither the spam map nor the non-spam map, the denominator becomes 0.
2. Our experiments showed that when the number of spam messages is much larger than the number of non-spam messages, the accuracy of classification drops.
Improvements:
1. Ignore words that do not appear in the training set.
2. Introduce the percentages of spam and non-spam messages into the formula.
Experiment result: After these improvements, the classification no longer returns only 0 and 1; it returns an actual percentage for the probability that the message is spam, such as 89.63. The formula becomes the following.

$$P(\mathrm{SPAM} \mid D) = \frac{P(S \mid \mathrm{SPAM}) \cdot P(\mathrm{non\_SPAM})}{P(S \mid \mathrm{SPAM}) \cdot P(\mathrm{non\_SPAM}) + P(S \mid \mathrm{non\_SPAM}) \cdot P(\mathrm{SPAM})}$$

$P(\mathrm{SPAM})$: percentage of spam messages in the training set.
$P(\mathrm{non\_SPAM})$: percentage of non-spam messages in the training set.

Problem: Polarization. If some $P(S_i \mid \mathrm{SPAM})$ is very small (e.g. 0.001), the product $P(S \mid \mathrm{SPAM})$ quickly approaches 0. The result is $P(\mathrm{SPAM} \mid D) = 0$ regardless of the other words in the message.

Example: Two messages are classified with the formula above. The first is a spam mail. The second is the same spam mail with a non-spam message appended to it. The spam mail reads:

Hey have you heard? Finally, the 2008 Collections are in, enjoy 70% OFF Brand Name Shoes & Boots for Men & Women from TOP Fashion Designers. Choose from a variety of the season's hottest models from Gucci, Prada, Chanel, Dior, Ugg Boots, Burberry, D&G, Dsquared &

A spam sender can exploit this bug by altering words that always appear in spam, e.g. writing VIAGRA as V_I_A_G_R_A. The message will then be classified as non-spam. So we make the following improvement.

Improvement: Replace the products of word probabilities with sums:

$$\prod_{i} P(S_i \mid \mathrm{SPAM}) \;\Rightarrow\; \sum_{i} P(S_i \mid \mathrm{SPAM}), \qquad \prod_{i} P(S_i \mid \mathrm{non\_SPAM}) \;\Rightarrow\; \sum_{i} P(S_i \mid \mathrm{non\_SPAM})$$

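Our variation of the Naive Bayesian algorithm:

$$P(\mathrm{SPAM} \mid D) = \frac{P(S \mid \mathrm{SPAM}) \cdot P(\mathrm{non\_SPAM})}{P(S \mid \mathrm{SPAM}) \cdot P(\mathrm{non\_SPAM}) + P(S \mid \mathrm{non\_SPAM}) \cdot P(\mathrm{SPAM})}$$

$$P(S \mid \mathrm{SPAM}) = P(S_1 \mid \mathrm{SPAM}) + P(S_2 \mid \mathrm{SPAM}) + \cdots + P(S_n \mid \mathrm{SPAM})$$

$P(S \mid \mathrm{non\_SPAM})$ is formed in the same way from the non-spam map.

Although this improvement does not have a rigorous mathematical foundation, the following experiments show that the algorithm has satisfactory accuracy.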
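A minimal Java sketch of this variation follows. It assumes the formula reads as above: summed per-word probabilities, with the spam sum weighted by the non-spam percentage and the non-spam sum weighted by the spam percentage. The names `VariantBayesSketch` and `spamProbability` are hypothetical, and returning the neutral value 0.5 when no word of the message is known is an assumption, since the report does not cover that case.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the sum-based variation; not the project's actual code.
class VariantBayesSketch {

    static double spamProbability(List<String> words,
                                  Map<String, Integer> spamMap,
                                  Map<String, Integer> nonSpamMap,
                                  int spamMessages, int nonSpamMessages) {
        double total = spamMessages + nonSpamMessages;
        double pSpam = spamMessages / total;       // P(SPAM)
        double pNonSpam = nonSpamMessages / total; // P(non_SPAM)

        double sumSpam = 0.0;    // P(S | SPAM) as a sum over the words
        double sumNonSpam = 0.0; // P(S | non_SPAM) as a sum over the words
        for (String word : words) {
            int inSpam = spamMap.getOrDefault(word, 0);
            int inNonSpam = nonSpamMap.getOrDefault(word, 0);
            if (inSpam + inNonSpam == 0) {
                continue; // ignore words that do not appear in the training set
            }
            sumSpam += inSpam / (double) spamMessages;
            sumNonSpam += inNonSpam / (double) nonSpamMessages;
        }

        double spamTerm = sumSpam * pNonSpam;
        double nonSpamTerm = sumNonSpam * pSpam;
        double denominator = spamTerm + nonSpamTerm;
        // Assumption: with no known words there is no evidence, so return 0.5.
        return denominator == 0.0 ? 0.5 : spamTerm / denominator;
    }

    public static void main(String[] args) {
        Map<String, Integer> spamMap = new HashMap<>();
        spamMap.put("viagra", 37);
        spamMap.put("the", 20);
        Map<String, Integer> nonSpamMap = new HashMap<>();
        nonSpamMap.put("viagra", 3);
        nonSpamMap.put("the", 20);

        double p = spamProbability(Arrays.asList("viagra", "the", "xyzzy"),
                spamMap, nonSpamMap, 80, 80);
        System.out.println("P(SPAM|D) = " + p + ", spam = " + (p > 0.5));
    }
}
```

Because the word probabilities are added rather than multiplied, one word with a very small spam probability lowers the score slightly instead of driving it to 0.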

Experiment:
Dataset:
Training dataset: around 80 spam messages and 80 non-spam messages.
Testing dataset: 20 spam messages and 20 non-spam messages.
Balancing: We attempt to use spam and non-spam data sets of the same size.

[Figure: scatter diagram of the classification of the testing dataset (20 spam & 20 non-spam) with an up-to-date training set]
With an up-to-date training set the accuracy is high: 100%.

An up-to-date training set is important:
The training set and the testing set must be collected at the same time, or the training set must be kept up to date. If we use an old training set to classify new messages, the accuracy becomes lower.

[Figure: scatter diagram of the classification of the testing dataset (20 spam & 20 non-spam) with a training set collected 2 months earlier]
With the older training set the accuracy is lower: some spam messages are classified as non-spam, and some non-spam messages are classified as spam.
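The accuracy figures in these experiments are the fraction of test messages whose classification matches their true label. A small Java sketch of that calculation, using the 50% threshold from the classification process, is given here. The names `AccuracySketch`, `isSpam` and `accuracy` are hypothetical, and the probabilities in `main` are made-up illustration values, not results from the project.

```java
// Hypothetical sketch; the sample probabilities are illustrative, not experimental data.
class AccuracySketch {

    // A message is classified as SPAM when its estimated probability exceeds 50%.
    static boolean isSpam(double probability) {
        return probability > 0.5;
    }

    // Fraction of messages whose classification matches the true label
    // (true = spam, false = non-spam).
    static double accuracy(double[] probabilities, boolean[] labels) {
        if (probabilities.length != labels.length || probabilities.length == 0) {
            throw new IllegalArgumentException("need equally sized, non-empty arrays");
        }
        int correct = 0;
        for (int i = 0; i < probabilities.length; i++) {
            if (isSpam(probabilities[i]) == labels[i]) {
                correct++;
            }
        }
        return correct / (double) probabilities.length;
    }

    public static void main(String[] args) {
        double[] probabilities = {0.9, 0.8, 0.4, 0.1, 0.6};
        boolean[] labels = {true, true, true, false, false};
        System.out.println("accuracy = " + accuracy(probabilities, labels));
    }
}
```

In this made-up sample, one spam message falls below the threshold and one non-spam message rises above it, which are the two kinds of misclassification seen with the older training set.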

Prototype:
To show that our research results can be applied easily, we built a small e-mail reader and plugged in our spam filter algorithm to show how the filter classifies mails.

The e-mail reader can log in to our IPM e-mail account, show the e-mails, and use the classification process to measure the spam probability of each one. The learning process is used to update the three maps.

Future work:
Chinese segmentation: The difficulty of Chinese segmentation is how to determine the best way to split Chinese text, because Chinese articles do not use spaces to separate words.
(e.g.) 青川縣災區再發生強烈地震 ("A strong earthquake strikes the Qingchuan County disaster area again") has to be separated into 青川縣 災區 再 發生 強烈 地震.
Implementation for SMS: The difficulty of implementing the filter for the SMS format is the shorter text length (60 words), because the classification technology of the Bayesian theorem needs a great quantity of sample data.

Development Environment:
Hardware: Pentium 4 (3 GHz), 1 GB RAM, 80 GB hard disk
Operating system: Microsoft Windows XP Professional with Service Pack 2
Software:
J2SE 1.6 Development Kit (JDK)
J2EE JavaMail Resource
TextPad 5.0.3 32-bit Edition
JFormDesigner
JSmooth 0.9.9.7 Edition
Programming language: J2SE 6.0 (Java 2 Standard Edition)

System Requirements:
Java 2 SDK, Standard Edition, installed.
Access to the MPI mail server via the Internet.

Conclusion:
The main challenges of this project are the algorithm analysis, the improvement of Naive Bayesian, and the experimentation. These need a lot of mathematical knowledge and good analysis capabilities. After this project, our analysis capabilities have improved very much.

Another challenge of this project is the programming. We program in Java, and during the development time we gained more Java knowledge and skills. This is a good experience for our careers in IT.

This project has not only improved our analysis capabilities and programming skills; project management and time management were also especially important to it. Because this project is group teamwork, team spirit and team cooperation are also important. In fact, we benefited greatly from each discussion.

Finally, we must thank our Supervisor, Philip Le; our Assessor, Andrew Su; and our Advisor, Jacky Tag, very much. Although our project is not perfect as presented, we will put in more effort.

Job distribution:
Marco, Lou Cha Wa: Learning Process; Prototype
Macro, Leog Weg Hog: Classification Process; Experiment
Work together: Algorithm analysis; Improvement of Naive Bayesian; Presentation; Report

