Web Spam Detection Using Machine Learning in Specific Domain Features

Journal of Information Assurance and Security 3 (2008)

Hassan Najadat 1, Ismail Hmeidi 2
Department of Computer Information Systems
Faculty of Computer and Information Technology
Jordan University of Science and Technology
Irbid 22110, Jordan
najadat@just.edu.jo 1, hmeidi@just.edu.jo 2

Abstract: In the last few years, as Internet usage has become the main artery of daily life, the problem of spam has become very serious for the Internet community. Spam pages form a real threat to all types of users. This threat has evolved continuously, with no sign of abating. Different forms of spam have witnessed a dramatic increase in both size and negative impact. A large number of E-mails and web pages are considered spam, either in the Simple Mail Transfer Protocol (SMTP) or in search engines. Many technical methods have been proposed to approach the problem of spam. In E-mail spam detection, Bayesian filters are widely and successfully applied to detect and eliminate spam. The assumption that each term in the document contributes to the filtering task equally to other terms, and the neglect of users' feedback, are major shortcomings that we attempt to overcome in this work. We propose an improved Naïve Bayes classifier that gives weight to the information fed by users and takes into consideration the existence of some domain specific features. Our results show that the improved Naïve Bayes classifier outperforms the traditional one in terms of reducing the false positives and the false negatives and increasing the overall accuracy.

Keywords: Web Spam, Naïve Bayes, Term Frequency Matrix (TFM), Confusion Matrix (CM).

1. Introduction

With the increased advancements in Internet applications and the proliferation of information available to the public, the need for efficient search engines that are able to retrieve the most relevant documents that satisfy users' needs becomes evident.

From an Information Retrieval (IR) perspective, search engines are responsible for retrieving a set of documents ranked in descending order of relevance [2]. A common problem in this context is that some documents are ranked highly and retrieved as the first (or among the top) results even though they do not deserve that rank [5]. Several reasons explain this problem. One is the extent to which a user knows exactly what he or she is searching for; that knowledge is reflected in the retrieved results. Another important reason is the existence of so-called spam web pages: pages that seem relevant from the search engines' point of view but in reality contain no useful information for users [5]. In their discussion of web spam, Castillo et al. [4] defined web spam as any attempt to deceive a search engine's relevancy algorithm, or an action performed with the purpose of influencing the ranking of a page.

Detecting web spam is considered one of the most challenging issues facing search engines and web users [11]. Since search engines are the gates to the World Wide Web, it is important that they provide the best possible results in answer to users' queries. Some people, known as spammers, try to mislead search engines by boosting the rank of their web pages and thereby capture users' attention. These pages contain little or none of the useful information that the user expects to find. Search engines need to detect or filter spam pages to provide high-quality results to users (i.e. truly relevant pages). For a search engine to be judged efficient, it should not only return as many documents as possible, but should also return relevant documents that are spam-free. Currently, search engines apply many techniques to fight spam, such as detecting spam web pages through content analysis [11].

Received August 20,

This technique is currently the most popular spam detection technique used by search engines such as Google; nevertheless, it still fails to find all spam web pages. A separate section is devoted to detailing this technique further.

Spam can be very annoying in the context of search engines for several reasons. First, where search engines bring financial advantages, the existence of spam pages may lower the chance for legitimate web pages to get the revenue that they might earn in the absence of spam. Second, the search engine may return irrelevant results that users do not expect, so a non-trivial portion of time might be spent online wading through such unwanted pages. Finally, the search engine may waste important resources on spam pages: network bandwidth (crawling), CPU cycles (processing), and storage space (indexing) [11].

Microsoft researchers [11] show that some particular top-level domains are more likely to contain spam than others. For example, .biz (Business) has the greatest percentage of spam, with 70% of all pages being spam, and the .us domain comes in second place with 35% spam pages. Moreover, pages written in some particular languages are more likely to be spam than those written in other languages; for instance, pages written in French are the most likely to be spam, with 25% being spam.

$03.50 Dynamic Publishers, Inc

Spammers have proved adept at adapting to the different formats available for web pages, and they use several spamming techniques to influence the page ranking algorithms of search engines. All these techniques are challenging for web spam detection algorithms, especially for the content-based approach. The two main categories of spamming techniques are term spamming and link spamming [11, 3].

In term spamming, many techniques that modify the content of the page are applied. The content includes the document body, the title, meta tags in the HTML header, anchor texts associated with URLs, and page URLs. Spammers can attach their unsolicited content (i.e. spam) to one or more of these fields, resulting in a new page that can pass the spam filter without raising any doubt about its legitimacy. Among all term spamming techniques, the most popular is body spamming [11], in which terms are included in the document body; examples are specific terms such as "Free grant money", "free installation", "Promise you...!", and "free preview". Another way of grouping term spamming techniques is by the type of terms added to the text fields: repeating one or a few specific terms, including a large number of unrelated terms, or phrase stitching, wherein sentences or phrases, possibly from different sources, are glued together [11].

In link spamming, spammers insert links between pages for reasons other than merit [13]. Link spam takes advantage of link-based ranking algorithms, such as Google's PageRank algorithm, which gives a higher ranking to a website that is cited by other highly ranked websites.

In response to the aforementioned spamming techniques, many content-based spam filtering techniques were proposed. Analyzing the content of a particular web page is important because spammers tend to boost their pages' rank by applying spamming techniques to this content [11]. During content analysis, the number of words in the page's body and title, the average length of words, the amount of anchor text, and the keywords in meta tags are analyzed to detect abnormalities that are interpreted as spamming attempts.

The rest of this paper is organized as follows. Section 2 gives an overview of work related to spam detection. The Naïve Bayes classifier is illustrated in Section 3. Section 4 proposes our approach, and our experimental results are shown in Section 5. Section 6 concludes the paper and provides future directions.

2. Related Work

Due to the important role that web pages play in supporting electronic commerce (E-Commerce), web pages have become an enticing target for all kinds of traders and marketers to advertise their products for sale, to promote get-rich-on-the-fly schemes [13, 17], and to get information about pornographic web sites. Moreover, they have become an enticing target for spammers to embed their spam content. SpamCon Inc. [1] estimated the cost of resource loss and spam filtering associated with a single unsolicited message at $1 to $2; multiplied by the number of spam messages sent and received every day, the one dollar becomes millions.

Because of the serious problems associated with unsolicited spam content, whether in a single E-mail or a large web page, a number of automated filtering approaches were proposed in the literature to overcome such problems [16, 10]. These filters were used mainly for E-mail spam and were then adapted for use in the context of web spam. Early approaches to spam filtering relied mostly on manually constructed pattern-matching rules that need to be tuned to each user's messages [9].
That is, they allow users to hand-build a rule set that consists of logical rules to detect spam E-mails and web pages. However, these approaches are tedious and problematic, since users need to pay full attention just to build the desired set of rules, and not all users can build such a set. In addition, the process is time-consuming, since the generated rules must be changed or refined periodically as the nature of spam changes.

Because of the problems associated with the manual construction of rules, another approach was proposed in [7] to adapt automatically to the changing nature of spam over time and to provide a system that can learn directly from data already stored in web server databases. These approaches proved successful when applied to general classification tasks, that is, the classification of E-mail as either spam or non-spam based on its text, with no regard to the existence of domain specific features.

Several machine learning algorithms have been proposed for text categorization (classification) [14, 15]. These approaches were investigated for spam filtering, since it can be viewed as a text categorization problem. In [13], a machine learning algorithm was applied for the purpose of spam filtering. In this algorithm, the filter learns to classify documents into fixed classes (i.e. spam and nonspam) based on their content, after being trained on manually classified documents.

As a variation of the rule-based approaches discussed above, a great deal of work in the literature performs content-based classification automatically. Naïve Bayes classifiers [16] were proposed as a good example of such approaches and showed satisfactory results in the context of E-mail spam filtering. The authors of [13] trained a Naïve Bayes classifier on manually classified spam and nonspam messages, reporting surprisingly good results in terms of precision and recall. Our work utilizes Naïve Bayes classifiers based on the context of web pages to detect spam pages automatically.

3. The Naïve Bayes Classifier

A Naïve Bayesian classifier is a simple probabilistic classifier based on applying Bayes' theorem with a strong (naive) independence assumption: all variables $A_1, A_2, \ldots, A_n$ in a given category $C$ are conditionally independent of each other given $C$ [16]. Depending on the precise nature of the probability model, Naïve Bayes classifiers can be trained very efficiently in a supervised learning setting [6, 12]. Besides Naïve Bayes classifiers, a variety of supervised machine learning algorithms such as Support Vector Machines (SVM) and memory-based learning [4, 8] have been successfully applied and showed satisfactory results in the context of spam filtering.

Although the techniques discussed above perform well in some cases, they still have problems in others. For instance, all types of content-based spam filters have false positives; generally, it is more severe to misclassify a legitimate message as spam than to let a spam message pass the filter [4]. In addition, what is classified as spam by these filters may not truly be so, because spam is a relative concept: what one person considers spam may not be spam for another. These limitations drive us to develop a novel technique for spam filtering that uses a user-oriented feedback mechanism in combination with a Naïve Bayes classifier to reduce the false positives and false negatives encountered by traditional classifiers. In addition, special attention is given to some domain specific features (terms and patterns) that may be considered spam discriminators.

Classifiers can be applied to spam filtering, viewed as a text classification problem, as follows. Given a set of training documents $D = \{t_1, t_2, \ldots, t_n\}$ of tuples and a set of classes $C = \{C_1, C_2, \ldots, C_n\}$, the classification problem is to define a mapping $f: D \rightarrow C$ in which each $t_i$ is assigned to one class together with its class label. Each document $t_i$ is represented by a vector of words $\{w_1, w_2, \ldots, w_n\}$. The probability of a word $w_i$ of a given document associated with class $C$ can be written, as in [6], as $p(w_i \mid C)$.

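To make the independence assumption concrete, the following is a minimal sketch of how per-word probabilities $p(w_i \mid C)$ and a class prior combine into a class score. It is not the authors' implementation: the probability tables, the equal priors, and the `unseen` fallback value are hypothetical, and logarithms are used only to avoid numerical underflow on long documents.

```python
import math

# Hypothetical per-class word probabilities p(w | C); not taken from the paper.
P_WORD = {
    "spam": {"free": 0.30, "money": 0.25, "course": 0.02},
    "nonspam": {"free": 0.05, "money": 0.03, "course": 0.20},
}
# Hypothetical class priors P(C).
PRIOR = {"spam": 0.5, "nonspam": 0.5}


def log_score(words, label, unseen=1e-6):
    """Unnormalised log P(C | words) = log P(C) + sum of log p(w_i | C).

    `unseen` is an assumed fallback probability for words missing from the table.
    """
    probs = P_WORD[label]
    return math.log(PRIOR[label]) + sum(
        math.log(probs.get(w, unseen)) for w in words
    )


def classify(words):
    """Return the class with the highest score under the naive assumption."""
    return max(PRIOR, key=lambda label: log_score(words, label))


print(classify(["free", "money"]))  # spam
print(classify(["course"]))         # nonspam
```

Because the denominator of Bayes' theorem is the same for every class, it can be dropped when only the most probable class is needed.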
Since each document consists of a large number of words, the Naïve Bayes classifier makes the simplifying assumption that $w_1, w_2, \ldots, w_n$ are conditionally independent given the category $C$:

$$P(D \mid C) = \prod_{i=1}^{n} p(w_i \mid C) \tag{1}$$

The probability that a given document $D$ belongs to a given class $C$ is represented as $P(C \mid D)$, which can be computed as

$$P(C \mid D) = \frac{p(D \mid C)\, p(C)}{P(D)} \tag{2}$$

To estimate the probability that a particular document is spam, given that it contains certain words, Bayes' theorem states that this probability equals the probability of finding those words in spam documents, times the probability that any document is spam, divided by the probability of finding those words in any document [13]:

$$P(\text{spam} \mid \text{words}) = \frac{p(\text{words} \mid \text{spam})\, p(\text{spam})}{P(\text{words})} \tag{3}$$

4. Web Spam Detection Classifier

Finding a spam web page is viewed as a supervised text classification problem. In the supervised classification setting, the web spam classifier needs to be trained with a set of web pages that were previously classified into two categories, spam and non-spam. Spam is a relative concept: what is considered spam by one user may not be spam for other users, and what is spam for a specific user at a particular time might not be so for the same user at a different time. Depending only on the capabilities of the trained classifier, as many traditional Naïve Bayes classifiers do, therefore seems to be of limited benefit.

In the training phase, a user-oriented preparation is performed, wherein the web pages residing on the web server are classified as spam or nonspam based on users' feedback and the automated classification by the filter. Attention is given to the general spam that the majority of users agree upon; all the terms contained in the pages classified as general spam are then extracted to form the General Spam Dictionary. This phase is illustrated in Figure 1.

[Figure 1. User Oriented Training. Diagram labels: general spam, specific spam, and nonspam pages; filter; feedback.]

Therefore, a novel user-oriented training mechanism is needed to support the classifier with periodic user feedback to determine whether a page is considered spam or nonspam. In this training scenario there are two outcomes: web pages that are judged to be spam by the majority of users, which we call General Spam, and web pages that are viewed as spam by one or a few users, which we call Specific Spam. Our attention is focused on general spam, because it can contribute efficiently to the process of classification. This training mechanism is proposed mainly to function on the server side rather than on the client side. It is an effective and promising mechanism for overcoming the previously mentioned problems of web servers affected by spam, especially the problem of wasting resources on spam pages (the server's bandwidth, CPU cycles, and storage space).

The spam detection system consists of three phases: a training phase, a preprocessing phase, and a classification phase. As shown in Figure 2, the preprocessing phase is the cleaning process, which is applied to each web document to extract its body. In stemming and stop-word removal, the frequent words that do not contribute efficiently to the classification process are removed from the body of the page. The stemming operation reduces distinct words to their common stem by removing prefixes and suffixes from words. We choose the affix stemmer algorithm for stemming. This helps reduce the time required for classification, since word length is lessened, which in turn tends to reduce the accuracy of classification.

The vital step in this phase is generating the dictionaries, wherein two different lists of words are generated: the frequent terms dictionary, extracted from those web pages that are classified as non-spam, and the special features dictionary. Each entry in the frequent terms dictionary consists of <term, probability of being spam>.

In the classification phase, the preprocessed documents are represented by a Term Frequency Matrix (TFM) structure [5] to perform the statistical analysis (i.e. Bayesian rule).

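The preprocessing and Term Frequency Matrix steps described above can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list and the affix lists are tiny hypothetical samples, the `strip_affixes` rule is a simplified stand-in for the affix stemmer (whose exact rules the paper does not give), and the function names are invented for this example.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny lists; a real system would use fuller ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "you"}
PREFIXES = ("un", "re")
SUFFIXES = ("ing", "ed", "es", "s")


def strip_affixes(word):
    """Remove at most one prefix and one suffix, keeping a stem of 3+ letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def preprocess(text):
    """Lower-case, tokenise, drop stop words, then stem each remaining token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [strip_affixes(t) for t in tokens if t not in STOP_WORDS]


def build_tfm(labelled_pages):
    """Build a term frequency matrix: tfm[term][label] = occurrence count."""
    tfm = defaultdict(lambda: defaultdict(int))
    for text, label in labelled_pages:
        for term in preprocess(text):
            tfm[term][label] += 1
    return tfm


# Invented training pages, for illustration only.
pages = [
    ("Winning free money", "spam"),
    ("Free previews of the course", "nonspam"),
]
tfm = build_tfm(pages)
print(dict(tfm["free"]))  # {'spam': 1, 'nonspam': 1}
```

The matrix is filled in a single pass over the labelled pages, which is the property the TFM is meant to provide.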
The TFM simplifies the calculation of the probability that a word belongs to class j, $P(\text{class}_j \mid \text{word})$, and also improves the efficiency of the Naïve Bayes classifier, which requires only one scan through the entire training dataset. As shown in Figure 3, each intersection of a word and a class row in the Term Frequency Matrix holds the number of times (the frequency) the word appears in class j.

Assuming that each feature contributes to spam filtering equally to every other feature may lead to inadequate results. In other words, under this assumption no particular feature in the text of the web page provides special evidence as to whether the page is spam or nonspam. However, this assumption does not hold in many real situations. For example, experience (and users' feedback) shows that there are many specific features whose existence strongly indicates a suspicious message (spam), such as "free money", "congratulation you are winner number ", "$$$$ you win xxx$$$$", and overused punctuation such as "??????". In addition to these discriminating textual features (patterns), web pages contain many non-textual features that indicate whether a page is spam, such as the domain type: .edu, .org, .com, .biz, etc. [4]. Microsoft researchers [11] showed that 70% of the pages in the .biz domain are spam, and that .edu pages rarely (or never) contain spam. To this end, we consider such (textual and non-textual) features to be good discriminators of spam that assist in correct classification. To achieve this, we maintain a table (or dictionary), called the specific feature dictionary, consisting of all these specific features; each entry in this table corresponds to <feature, probability>. It then becomes straightforward to incorporate such additional features into our Naïve Bayes model. When a new web page needs to be classified, a list of all its words is generated and checked against the pre-established features dictionary to determine whether it contains one or more words that are spam discriminators.
In addition to the comparison with the specific features dictionary, the new web page is compared against the other dictionaries (i.e. the general spam and the frequent terms dictionaries). After each comparison, the probability of spam is computed. As a result, we arrive at a probability value that takes into account the existence of domain specific features, the words that are judged (by users) to be spam, and the ordinary terms that might be spam in some cases (according to their probability). Figure 4 depicts the Naïve Bayes classifier with user feedback and domain specific features.
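The three dictionary comparisons described above, combined as in the procedure of Figure 4, can be sketched roughly as follows. Several details are assumptions rather than the authors' stated method: each partial probability is taken as the product of $P(w \mid \text{spam})$ over the words matched in one dictionary, a dictionary with no match contributes 0, and `unseen` is a hypothetical fallback for words absent from the nonspam table. All dictionary contents are invented for illustration.

```python
import math


def matched_product(words, dictionary):
    """Product of P(w | spam) over words found in `dictionary`.

    Returns 0.0 when no word matches (an assumption; the paper does not say).
    """
    matched = [dictionary[w] for w in words if w in dictionary]
    return math.prod(matched) if matched else 0.0


def classify_page(words, feature_dict, spam_dict, term_dict,
                  nonspam_probs, p_spam, p_nonspam, unseen=1e-3):
    """Follow the Figure 4 steps: P_total = P1 + P2 + P3, then compare scores."""
    p1 = matched_product(words, feature_dict)  # domain specific features
    p2 = matched_product(words, spam_dict)     # general spam (user feedback)
    p3 = matched_product(words, term_dict)     # frequent terms
    score_spam = (p1 + p2 + p3) * p_spam
    score_nonspam = p_nonspam * math.prod(
        nonspam_probs.get(w, unseen) for w in words
    )
    return "spam" if score_spam > score_nonspam else "nonspam"


# Hypothetical dictionaries of <term, probability> entries.
FEATURES = {"free": 0.9, "$$$$": 0.95}
GENERAL_SPAM = {"winner": 0.8}
FREQUENT_TERMS = {"money": 0.6, "course": 0.1}
NONSPAM = {"free": 0.1, "money": 0.1, "winner": 0.05, "course": 0.5}


def label(words):
    return classify_page(words, FEATURES, GENERAL_SPAM, FREQUENT_TERMS,
                         NONSPAM, p_spam=0.5, p_nonspam=0.5)


print(label(["free", "money", "winner"]))  # spam
print(label(["course"]))                   # nonspam
```

Note that, because the three partial values are summed, the spam score is not a normalised probability; it is only meaningful in comparison with the nonspam score.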

Figure 2. Web spam detection system structure with Naïve Bayes classifier

Figure 3. Term Frequency Matrix associated with individual file terms

5. Experimental Results

Our experiments were all performed on the WEBSPAM-UK2006 data set [4]. The training dataset consists of 8,1415 web pages. A detailed description of the data set and the criteria for labelling a web page as spam or nonspam can be found in [4]. In our work, a sample of these web pages is taken to evaluate the classification accuracy of our spam detector.

To estimate the accuracy of our proposed algorithm, we use a popular accuracy measure in the context of Information Retrieval, namely the Confusion Matrix (CM). The CM contains information about the actual and predicted classifications made by a classification system [6]. Figure 5 shows the confusion matrix for the two classes, spam and non-spam. As in [6], TP represents documents that are actually positive and classified as positive, FN represents documents that are actually positive and classified as negative, FP represents documents that are actually negative and classified as positive, and TN represents documents that are actually negative and classified as negative.

Figure 4. Spam Detection Procedure

1) Build the vocabulary table for the entire set of documents.
2) Calculate P(Spam) and P(Non-spam).
3) For each word in test document j do
   If (word exists in Feature dictionary) then calculate the probability of document D being spam: $P_1(D \mid \text{spam}) = P(w \mid \text{spam})$
   If (word exists in Spam dictionary) then calculate the probability of document D being spam: $P_2(D \mid \text{spam}) = P(w \mid \text{spam})$
   If (word exists in Term dictionary) then calculate the probability of document D being spam: $P_3(D \mid \text{spam}) = P(w \mid \text{spam})$
4) End for loop
5) $P_{total}(D \mid \text{spam}) = P_1 + P_2 + P_3$
6) Calculate the posterior probability of document D: $P(\text{spam} \mid D) = P_{total}(D \mid \text{spam}) \cdot P(\text{spam})$, $P(\text{nonspam} \mid D) = P(w \mid \text{nonspam}) \cdot P(\text{nonspam})$
7) If $P(\text{spam} \mid D) > P(\text{nonspam} \mid D)$ then classify document D as spam, else classify document D as not spam.

$$\text{Sensitivity} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}} \tag{4}$$

$$\text{Specificity} = \frac{\text{TrueNegative}}{\text{TrueNegative} + \text{FalsePositive}} \tag{5}$$

We use the above measurements to compute the accuracy, which is defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{6}$$

To evaluate the accuracy, the holdout technique is used to produce a true estimate of the classifier. The data are partitioned into two separate datasets, a training set and a testing set. The training set is used to train the classifier, and the testing set is used to evaluate its accuracy. We run our classifier on five different samples and calculate the classification accuracy for each run. For instance, we calculate the accuracy on a sample consisting of 238 testing documents (69 documents belonging to the Nonspam class and 169 to the Spam class). For this test, the sensitivity and specificity are 97% and 66% respectively, and the total accuracy is 88%, where total accuracy equals (TP + TN) / (TP + FP + TN + FN). Other experiments are made on different samples consisting of 324, 400, 519 and 618 documents. The accuracy results are shown in Figures 6 and 7; these figures also show the effect on accuracy of stemming the documents before classification.

Figure 7 indicates that using stemmed documents yields 80% accuracy on average, while Figure 6 shows an accuracy of 78%.

                        Classified as
                        Spam    Nonspam
Actual    Spam          TP      FN
Class     Nonspam       FP      TN

Figure 5. Confusion Matrix

We use the confusion matrix to calculate the sensitivity and specificity measures. Sensitivity refers to the true positive ratio, that is, the proportion of positive documents that are correctly identified. Specificity is the true negative ratio, that is, the proportion of negative documents that are correctly identified. Here, TruePositive is the number of positive documents that are correctly classified, TrueNegative is the number of negative documents that are correctly classified, and FalsePositive is the number of negative documents that are incorrectly classified as positive [6].

[Figure 6. Accuracy versus Dataset size (Stemming text). Axes: accuracy (%) against data size.]
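As a worked illustration of the sensitivity, specificity, and accuracy measures defined in this section, the sketch below computes them from the four confusion-matrix cells. The cell counts are illustrative only: they were chosen so that the class sizes match the 238-document sample (169 spam, 69 nonspam), but the paper does not report the individual cells, so these values should not be read as the authors' results.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix cells."""
    sensitivity = tp / (tp + fn)   # share of actual spam classified as spam
    specificity = tn / (tn + fp)   # share of actual nonspam classified as nonspam
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy


# Illustrative counts (assumed, not reported in the paper):
# 169 spam pages = 164 TP + 5 FN, 69 nonspam pages = 46 TN + 23 FP.
sens, spec, acc = confusion_metrics(tp=164, fn=5, fp=23, tn=46)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

With these assumed counts the three values come out close to the reported 97%, 66%, and 88%, which shows how a high sensitivity can coexist with a much lower specificity when the spam class dominates the sample.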

[Figure 7. Accuracy versus Dataset size (Unstemming text)]

We also evaluate the classifier's performance by concentrating on the number of nonspam pages that are wrongly classified as spam and the number of spam pages that are wrongly classified as nonspam. The first parameter is called the Nonspam Misclassification Rate (HMR), while the second is called the Spam Misclassification Rate (SMR). In the context of spam filtering, the effect of misclassifying a legitimate message as spam is more severe than that of misclassifying a spam message as legitimate. The accuracy is compared for the Naïve Bayes Classifier with Domain Specific Features (NBCDSF) and the Naïve Bayes Classifier without User Feedback (NBCUF); the results show that the improved Naïve Bayes classifier with user feedback outperforms the NBCUF in terms of increased accuracy and decreased SMR and HMR rates. The results are shown in Figure 8.

[Figure 8. Spam Misclassification Rate results. Series: NBCUF, NBCDSF. Axes: SMR and accuracy (%) against data size.]

6. Conclusion and Future Work

Web spam pages are an annoying problem that many techniques have tried to prevent; among them are Naïve Bayes classifiers, which have proved to be efficient mechanisms for spam filtering. Detecting spam web pages is one of the major challenges that search engines face in their query results. Search engines should return high-quality results in response to users' queries. Many search engines require the integration of robust spam detection to eliminate all web pages that affect the page ranking algorithm. Several content-based and machine learning techniques were proposed to detect spam pages. This paper proposed a Naïve Bayes approach that gives weight to users' feedback to improve the training process of the classifier and that considers the existence of some domain specific features that contribute strongly to spam discrimination; that is, their existence provides evidence that the web page is spam.
This approach is proposed mainly to function on the server side, to reduce the overhead associated with spam pages on the web server. Our work applied the Naïve Bayes classifier to discovering unwanted pages. The experimental results showed that the Naïve Bayes classifier provides an average accuracy of 80.2%. In future work we will develop an application as a plug-in for one of the open source browsers to work as a detector on online web pages, notifying users of spam pages they are currently visiting. We will also optimize the performance of our Naïve Bayes classifier by taking into consideration the word position of the domain specific features, which should contribute to better accuracy. In addition, the improvement will include the user-oriented feedback model, to satisfy the needs of many more users.

References

[1] S. Atkins, "Size and Cost of the Problem", In the Proceedings of the Fifty-sixth Internet Engineering Task Force (IETF) Meeting, San Francisco, CA, USA.
[2] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Edinburgh Gate, England.
[3] L. Becchetti, C. Castillo, D. Donato, R. Baeza-Yates, S. Leonardi, "Link Analysis for Web Spam Detection", ACM Trans. Web, 2 (1), pp. 1-42.
[4] C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini and S. Vigna, "A Reference Collection for Web Spam", SIGIR Forum, (40).
[5] Z. Gyöngyi and H. Garcia-Molina, "Web Spam Taxonomy", In the Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Stanford University, May.
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, New York.
[7] I. Koprinska, J. Poon, J. Clark, and J. Chan, "Learning to Classify E-mail", Information Science, 10(177).
[8] C. Lai, "An Empirical Study of Three Machine Learning Methods for Spam Filtering", Knowledge-Based Systems, 3 (20).
[9] C. Lee, Y. Kim, and P. Rhee, "Web Personalization Expert with Combining Collaborative Filtering and Association Rule Mining Technique", Expert Systems with Applications, 3(21).
[10] J. María, G. Cajigas and E. Puertas, "Content Based SMS Spam Filtering", In the Proceedings of the 2006 ACM Symposium on Document Engineering, ACM.
[11] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly, "Detecting Spam Web Pages Through Content Analysis", In the Proceedings of the 15th International Conference on World Wide Web, ACM.
[12] G. Paul, "Better Bayesian Filtering", In the Proceedings of the 2003 Spam Conference, Jan.
[13] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-mail", AAAI Workshop on Learning for Text Categorization, July 1998, Madison, Wisconsin, AAAI Technical Report WS.
[14] J. Su and H. Zhang, "Full Bayesian Network Classifiers", In the Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA.
[15] B. Yu and Z. Xu, "A Comparative Study for Content-Based Dynamic Spam Classification Using Four Machine Learning Algorithms", Knowledge-Based Systems, 4(21).
[16] H. Zhang and D. Li, "Naïve Bayes Text Classifier", In the Proceedings of the IEEE International Conference on Granular Computing.
[17] L. Zhang, J. Zhu, and T. Yao, "An Evaluation of Statistical Spam Filtering Techniques", ACM Transactions on Asian Language Information Processing, 4(l.3).

Authors Biographies

Hassan M. Najadat is an Assistant Professor in the Department of Computer Information Systems at Jordan University of Science and Technology in Irbid, Jordan. His research interests are centered on data mining, machine learning, and database systems, and he has authored over nine refereed publications in the areas of clustering, classification, association rules, and text mining. He received his PhD in Computer Science from North Dakota State University in Fargo, USA, his MS in Computer Science from the University of Jordan in Amman, Jordan, and his BS in Computer Science from Mut'ah University in Alkarak, Jordan.

Ismail Hmeidi is an Assistant Professor in the Department of Computer Information Systems at Jordan University of Science and Technology in Irbid, Jordan.
His research interests are centered on information retrieval, natural language processing, e-learning, and database systems. He has authored over 13 refereed publications in the areas of query expansion, automatic indexing, and text categorization. He received his PhD in Computer Science from Illinois Institute of Technology, USA, and his MS and BS in Computer Science from Eastern Michigan University, USA.