An Anti-spam Filter Combination Framework for Text-and-Image Emails through Incremental Learning




Byungki Byun (1), Chin-Hui Lee (1), Steve Webb (2), Danesh Irani (2), and Calton Pu (2)

(1) School of Electrical & Computer Engr., Georgia Institute of Technology, Atlanta, GA 30332-0250
{yorke3, chl}@ece.gatech.edu

(2) College of Computing, Georgia Institute of Technology, Atlanta, GA 30332-0280
{webb, danesh, calton}@cc.gatech.edu

Abstract

We present an anti-spam filtering framework that combines text-based and image-based anti-spam filters. First, an incremental learning approach to reducing mismatches between training and test datasets is proposed to resolve the problem of a lack of training data for legitimate emails that contain both text and images. Then, the outputs of text-based and image-based filters are combined with the weights determined by a Bayesian framework. Our experimental results on the TREC 2005 and 2007 spam corpora using two state-of-the-art text-based filters show that the combined system significantly reduces the false positive errors for the misclassified emails containing images.

1 Introduction

In the past few years, text-based anti-spam filters have been extremely effective at detecting email spam [4, 6, 8, 16]. To combat these filters, spammers have recently adopted a number of countermeasures, which are aimed at confusing these filters and degrading their performance. One of the most popular countermeasures is the use of images to transmit spam messages while camouflaging such messages with legitimate-looking text. In response to this new trend in spamming behavior, called image spam, many researchers have proposed spam filtering techniques that identify these messages using distinctive properties of spam images [1, 3, 6, 16], and some progress in identifying spam images has been observed.

Despite recent advances in spam image filtering research, the sole use of such techniques for image spam is not appropriate because the performance of current techniques is still below the level desired for use in realistic situations. Most spam image filtering techniques have produced quite a few misclassification errors for legitimate emails [1, 3, 7, 18].
So, it is necessary to leverage text-based spam filters, which have been performing extremely well at identifying legitimate emails, as well as image-based spam filters. However, the method to exploit these two techniques together systematically is relatively unexplored because there is only a small set of publicly available image-and-text legitimate emails.

One viable approach for exploring the integrated nature of emails containing both text and images is to combine the outputs of individual text-based and image-based classifiers. This topic, combining classifiers, has been explored actively in many classification scenarios [2, 13, 17]. One example would be integrating speaker verification results with fingerprint recognition results for a security system [13]. Often, combining classifiers provides a systematic framework for integrating multiple heterogeneous sources of information and also enhances classification results [2, 13, 17]. Applying a similar approach to anti-spam filtering tasks makes it possible to classify emails containing images in a unified fashion.

It is, however, hard to implement such an integrated system that takes advantage of distinctive features in individual text and image parts and their interdependences because of a lack of publicly available legitimate emails that contain images, which has been preventing researchers from benchmarking and improving their algorithms as well. Previously, there was one attempt [18] to generate a synthetic set of legitimate emails by combining legitimate emails and random images, which were selected from a pool of legitimate images. However, this approach may not be consistent with realistic situations since [18] relied on a small set of private data, such as images obtained from personal inboxes or separate image sets (e.g. Corel) from other non-mail sources.

In this paper, we present an anti-spam filter combination framework that combines outputs of text-based and image-based filters to make unified decisions. To deal with the aforementioned issue, we adopt an incremental learning algorithm where a classifier adapts its parameters to new data while it is being used, to minimize mismatches between training and test images over time. Time variability, i.e. the fact that the characteristics of spam messages in the past are quite different from the characteristics of spam messages we are receiving now, also makes an incremental learning algorithm more attractive. For this reason, most state-of-the-art text-based spam filters are implemented using an incremental learning algorithm [6]. Here, we refer to the method that requires processing the available data as a whole as batch learning and to the method in which adaptation is performed online based on a small set of available data as incremental learning [12].

The incremental learning algorithm is implemented by modifying the discriminative spam image detection technique [3] proposed in our previous work. It is then combined with two state-of-the-art incremental learning text-based anti-spam filters, Bogofilter [10] and OSBF [11]. To fully exploit the aforementioned advantages of incremental learning, combination of text and image filters is performed within a Bayesian framework in which parameters are also updated incrementally, depending upon each individual spam filter's performance.

Based on our experimental evaluation, we found that our proposed system clearly demonstrated improvements in terms of spam classification errors over text-based anti-spam filters for email messages containing images, which will be referred to as text-and-image emails hereafter. Specifically, the proposed framework with Bogofilter and OSBF as text-based filters reduces the number of misclassified text-and-image emails significantly over the case when only Bogofilter or OSBF was used.

The remainder of the paper is organized as follows. In Section 2, an overview of the integrated system is described. In Section 3, we propose a discriminative incremental learning approach to spam image filtering, which is followed by a discussion of issues related to integrating image and text classifiers in Section 4. Experimental results using the TREC 2005 and 2007 spam image data are given in Section 5. Finally, we conclude our findings in Section 6.

2 System Overview

A traditional anti-spam filtering system can be divided into three components.
The first is a front-end module in which email messages are parsed and tokenized. The second is a classifier module in which a probability of being spam or legitimate is computed using the tokenized messages. Given this probability, the last component finally makes a decision based on the cost that one should pay for misclassification errors. In some cases, this last component involves getting users' feedback regarding the decision. Therefore, there might be some feedback paths back to the classifier module.

Our proposed framework can also be categorized into similar components. The main differences are threefold. First, in addition to an email message parser and a tokenizer at the front-end, we have an image analysis module in which distinctive features of spam images are extracted if the email contains images. Second, an additional classifier that computes the probability for each attached image to be spam or legitimate is trained given such features. Lastly, the back-end module fuses the probabilities computed so that the final decision can be made.

In Figure 1, we illustrate the block diagram of our system. Note that the feedback paths are given to the text-based and image-based filters from the back-end module. Through these paths, we ensure that the anti-spam filters in the middle will be trained incrementally as they see new samples. Here, we can also add more anti-spam filters that exploit other features, such as structural information of emails, and so on.

Figure 1: Block diagram of our integrated system.

3 Incremental Learning

In this section, we describe our proposed incremental learning approach for image-based spam filters. Typically, incremental learning is implemented within a parametric modeling framework with maximum a posteriori probability (MAP) parameter estimation [14], which is supported by its theoretical soundness. The basic procedure starts with assuming that information contained in all samples collected so far is embedded in a prior density. When new samples become available, classifier parameters are estimated by a MAP criterion. Finally, the prior density's parameters, known as hyperparameters, are updated to create an updated prior density in a process known as prior evolution [12].
In addition to its theoretical soundness, this approach has another benefit in terms of storage and computational efficiency in general because sufficient statistics of the posterior distributions are preserved through the usage of conjugate priors [14]. One potential drawback of this approach is, however, a need to know the true form of the distributions. If the assumed probabilistic distribution is far from the true one, the system performance will suffer.

Therefore, in our proposed system, we adopt a discriminative learning approach in which an explicit assumption on the model distributions is not necessary. In particular, a discriminative spam image identification technique running in batch-mode [3] is adopted and modified to be operated in an adaptive mode. So, the parameters of discriminant functions are tracked and updated incrementally as we encounter new samples.

To improve incremental adaptation effectiveness, we only consider an informative subset of the training samples. By informative we mean the samples that have more to do with determining decision boundaries. Not surprisingly for discriminative training, such samples are located around decision boundaries in general. Based on this understanding, we pick samples that fall into a region such that its distance from decision boundaries is less than a certain threshold.

More formally, let x be a new sample in class i and f_i(x) be the value of the discriminant function for class i of such a sample. Also, let I_i be a set of informative samples and U_i be a set of uninformative samples for class i, respectively. Suppose the threshold τ is twice the size of a margin, where the margin is defined as the distance between the decision boundary for class i and the closest positive sample for class i that is correctly classified. Then, if f_i(x) < τ, we augment I_i with x. Otherwise, U_i is augmented with x. Later, when the parameters of the discriminant function are re-estimated, only the samples in I_i are used.

We should also consider the case in which the size of I_i becomes so large that the cost of re-estimation exceeds what we desire. In this case, we can drop some samples in a timely manner from I_i based on an assumption that the characteristics of spam and legitimate images are changing in time. We summarize the proposed incremental learning algorithm in Figure 2.

    J(D, Λ) : objective function
    D_t     : total training samples at time t
    Λ_t     : parameters to be estimated at time t
    I_t     : informative training samples at time t
    U_t     : uninformative training samples at time t
    T       : total number of samples available
    d(x, Λ) : misclassification measure computed from the discriminant functions f(x, Λ)

    Initialize Λ_0, I_0, U_0
    for t = 1 : T
        compute d(x_t, Λ_{t-1})
        if d(x_t, Λ_{t-1}) ∈ [interval for informative samples]
            I_t ← I_{t-1} ∪ {x_t}
        else
            U_t ← U_{t-1} ∪ {x_t}
        Λ_t ← arg min_Λ J(I_t, Λ)
    end

Figure 2. A procedure for incremental learning
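The loop of Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the discriminant is a plain linear function, a few logistic-loss gradient steps stand in for the MFoM/GPD re-estimation, labels are +1 (spam) and -1 (legitimate), and the names `score`, `refit`, `incremental_learning` and the cap `max_informative` are hypothetical. The sketch also refits only when the informative set changes, which is a shortcut relative to the figure.

```python
import math


def score(w, x):
    """Linear discriminant f(x) = w . x + b, with the bias b stored as w[-1]."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]


def refit(w, samples, lr=0.1, epochs=20):
    """Stand-in for re-estimation: logistic-loss gradient descent on `samples`."""
    w = list(w)
    for _ in range(epochs):
        for x, y in samples:  # y is +1 (spam) or -1 (legitimate)
            z = max(-50.0, min(50.0, y * score(w, x)))  # clamp to avoid overflow
            g = -y / (1.0 + math.exp(z))  # derivative of log(1 + exp(-y * f)) w.r.t. f
            for j, xj in enumerate(x):
                w[j] -= lr * g * xj
            w[-1] -= lr * g
    return w


def incremental_learning(stream, w0, tau, max_informative=500):
    """Process (x, y) pairs one at a time, keeping only samples near the boundary."""
    w, informative, uninformative = list(w0), [], []
    for x, y in stream:
        if y * score(w, x) < tau:  # within tau of the boundary, or misclassified
            informative.append((x, y))
            informative = informative[-max_informative:]  # forget the oldest samples
            w = refit(w, informative)
        else:
            uninformative.append((x, y))
    return w, informative, uninformative


# Toy one-dimensional stream: two spam-like points and one legitimate-like point.
stream = [([2.0], 1), ([2.0], 1), ([-2.0], -1)]
w, informative, uninformative = incremental_learning(stream, w0=[0.0, 0.0], tau=1.0)
```

Dropping the oldest entries once the informative set exceeds a cap is one simple way to realize the "drop some samples in a timely manner" idea; other forgetting rules would fit the description equally well.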
An MFoM-based classifier learning approach [9], which aims at optimizing the parameters in terms of a figure-of-merit (FoM) such as recall, precision, or average detection error rate, is used. Average detection error rate, which is defined as the mean of the false positive error rate and the false negative error rate, is selected to define the form of the objective function to be optimized.

In selecting whether a new sample x is informative or not, we first consider d(x) as a generalized likelihood ratio function. We then use a sigmoid fitting to convert the range of d(x) into the [0, 1] interval, as this value can be considered as a simulated probability of how likely a given image will belong to the spam class. Given the probability, we select samples that have a value less than, say, 0.8 for spam images. Likewise, we select samples that have a value greater than, for example, 0.2 for legitimate images. Our experimental results show that this scheme reduces the required number of training samples significantly while the performance of the system remains about the same.

4 Integrating Text and Image Classifiers

In this section, we describe our proposed method of integrating image and text-based anti-spam filters. There are three issues to be addressed to integrate these two filters: dealing with multiple images, combining classifiers, and determining operating points. We discuss such issues in the following subsections in detail.

4.1 Issues with Multiple Images

In most cases a single email message contains multiple images. Typically, these images are similar in terms of their properties, thus it is possible to process those images as a whole to generate a single decision. However, spammers might even camouflage spam images with legitimate images, so it is more appropriate to compute a probability of each image being spam individually and to merge the probabilities afterwards using some techniques. In the proposed framework, we unify classification results from multiple images by using a generalized power mean over individual classification results, defined as:

\[ \tilde{P}(\omega \mid x) = \left( \frac{1}{N} \sum_{i=1}^{N} P(\omega \mid x_i)^{\eta} \right)^{1/\eta} \qquad (1) \]

Here, \tilde{P}(\omega \mid x) denotes a non-normalized version of the overall probability of an email x either being spam or legitimate considering all N images in x, whereas P(\omega \mid x_i) is an individual probability for the i-th image.
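Eq. (1) is short enough to state directly in code. The sketch below assumes the usual form of a generalized power mean (average the η-th powers, then take the η-th root); the function name `power_mean` and the example scores are illustrative only, and `eta` plays the role of the exponent η.

```python
def power_mean(probs, eta):
    """Generalized power mean of per-image probabilities with exponent eta > 0."""
    if not probs:
        raise ValueError("need at least one probability")
    if eta <= 0:
        raise ValueError("eta must be positive")
    return (sum(p ** eta for p in probs) / len(probs)) ** (1.0 / eta)


# Hypothetical scores: one spam-like image hidden among three benign-looking ones.
per_image = [0.9, 0.1, 0.1, 0.1]

arithmetic = power_mean(per_image, 1)    # plain average
emphasised = power_mean(per_image, 4)    # the high score starts to dominate
near_max = power_mean(per_image, 200)    # approaches max(per_image)
```

With these made-up scores the plain average is 0.3, the value for an exponent of 4 is about 0.64, and a very large exponent pushes the result towards the largest individual score, 0.9.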
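η is a positive constant that controls the effects of different images. One can easily see that for a large η, \tilde{P}(\omega \mid x) is approximated with \max_i P(\omega \mid x_i). This implies that for a carefully chosen η, Eq. (1) can handle the case in which x contains only one spam image and the rest of the images are legitimate, unlike other approaches. It is easy to see that a simple arithmetic average of P(\omega \mid x_i) might not be able to handle such cases.

4.2 Classifier Combination

In this section, we develop an anti-spam filter combination technique within a Bayesian framework. Here, a subscript i is used to distinguish different spam filters and t is used to denote incremental learning cycles at time t. Suppose x is a received email and ω is the class associated with x, either being spam or legitimate. Then, assuming a hidden variable Z for an event to select a text or image spam filter, a probability for a class ω given x, P(\omega \mid x), can be expressed as a marginal probability of a joint probability of Z and ω:

\[ P(\omega \mid x) = \sum_{i} P(\omega, Z_i \mid x) = \sum_{i} P(\omega \mid Z_i, x)\, P(Z_i \mid x) \qquad (2) \]

Here, P(Z_i \mid x) is external knowledge expressing each classifier's confidence given x. For example, in the case where a certain feature becomes unavailable, the corresponding P(Z_i \mid x) will be set to zero. Also, if the image feature dominates over the text feature, one could assign a large probability to the corresponding P(Z_i \mid x). The first term in the summation, P(\omega \mid Z_i, x), is a posterior probability of ω given a single classifier and x. Note that since only a part of x is available to each classifier, a posterior probability obtained from an individual classifier (i.e. an image spam filter or a text-based spam filter) might differ from P(\omega \mid Z_i, x). To take this fact into account, consider \tilde{\omega} as another random variable, where \tilde{\omega} is the output of a single classifier. Then, P(\omega \mid Z_i, x) can be computed as a marginal distribution of P(\omega, \tilde{\omega} \mid Z_i, x) and can be expanded as:

\[ P(\omega \mid Z_i, x) = \sum_{\tilde{\omega} \in C} P(\omega \mid \tilde{\omega}, Z_i, x)\, P(\tilde{\omega} \mid Z_i, x) \qquad (3) \]

where C is the total set of classes and P(\tilde{\omega} \mid Z_i, x) is the i-th classifier's output. Since it is hard to compute P(\omega \mid \tilde{\omega}, Z_i, x), we approximate it with P(\omega \mid \tilde{\omega}, Z_i) by assuming ω and x are conditionally independent given \tilde{\omega} and Z_i. Substituting Eq. (3) into Eq. (2), the overall equation for classifier combination is as follows:

\[ P(\omega \mid x) \approx \sum_{i} \Big[ \sum_{\tilde{\omega} \in C} P(\omega \mid \tilde{\omega}, Z_i)\, P(\tilde{\omega} \mid Z_i, x) \Big] P(Z_i \mid x) \qquad (4) \]

To compute P(\omega \mid \tilde{\omega}, Z_i), consider a binary random variable Y_i for the i-th classifier where Y_i = 1 if \omega = \tilde{\omega} and Y_i = -1 if \omega \neq \tilde{\omega}. Then, we can assume that Y_i follows a Bernoulli distribution with a probability of success p_i^t at time t, where we encode a success with the event Y_i = 1 and a failure with the event Y_i = -1. As \tilde{\omega} takes two values, spam or legitimate, we will have two Bernoulli distributions. The dependency of the parameter p_i^t on time is due to the nature of incremental learning.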
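To make Eq. (4) concrete under the Bernoulli model just introduced, the following sketch combines the filters' outputs for the spam class. It is a simplified illustration rather than the paper's implementation: it assumes a single reliability value per filter (the probability that the filter's decision agrees with the true class, regardless of which class the filter reports), takes those reliabilities as given numbers, and uses the hypothetical function name `combine`.

```python
def combine(outputs, reliabilities, weights):
    """Approximate P(spam | x) as a weighted sum over filters, following Eq. (4).

    outputs[i]       -- filter i's own probability that the email is spam
    reliabilities[i] -- probability that filter i's decision matches the true class
    weights[i]       -- confidence placed in filter i for this email (sums to 1)
    """
    if not (len(outputs) == len(reliabilities) == len(weights)):
        raise ValueError("all three lists must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    total = 0.0
    for q, p, w in zip(outputs, reliabilities, weights):
        # Filter says spam (prob. q) and is right (p), or says legitimate and is wrong.
        total += w * (p * q + (1.0 - p) * (1.0 - q))
    return total


# Hypothetical email: the text filter leans legitimate, the image filter leans spam.
both = combine(outputs=[0.2, 0.95], reliabilities=[0.9, 0.8], weights=[0.5, 0.5])

# Same email with the image filter switched off by giving it zero weight.
text_only = combine(outputs=[0.2, 0.95], reliabilities=[0.9, 0.8], weights=[1.0, 0.0])
```

Setting a weight to zero corresponds to the case described above in which a feature is unavailable, and a reliability of 1 reduces each term to the filter's own output.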
We estimate p_i^t with a MAP criterion and update p_i^t over time within a Bayesian framework. We perform this by updating a prior distribution for p_i^t over time. It is well known [15] that a beta distribution with hyper-parameters α and β is a conjugate prior of a Bernoulli distribution. With this we can compute the a posteriori distribution using Bayes' formula and estimate the p_i^t that maximizes the a posteriori distribution. The resulting formula for p_i^t is:

\[ p_i^t = \frac{I(\text{success}) + \alpha_i^t - 1}{\alpha_i^t + \beta_i^t - 1} \qquad (5) \]

where I() is an indicator function. In fact, we expect p_i^t to become smaller when the corresponding classifier makes an error while the other classifiers make correct decisions. This property can be achieved by updating the hyper-parameters α and β in the beta distribution as new data become available. Specifically, α and β are functions of the summed numbers of successes and failures until time t-1, which can be computed as follows:

\[ \alpha_i^t = \alpha_i^0 + \sum_{\tau=1}^{t-1} I(\text{success}), \qquad \beta_i^t = \beta_i^0 + \sum_{\tau=1}^{t-1} I(\text{failure}) \qquad (6) \]

where \alpha_i^0 and \beta_i^0 are the initial values of the hyperparameters. It is clear, from the above equations, that as the number of failures stays close to zero, the parameter p_i^t approaches unity. Rewriting Eq. (3) using the quantity p_i^t as in Eq. (5), we can compute the probability that a new email x is a spam message, adaptively, as follows:

\[ P(\text{spam} \mid x) = \sum_{i} \big[ p_i^t\, P(\text{spam} \mid Z_i, x) + (1 - p_i^t)\, P(\text{legitimate} \mid Z_i, x) \big] P(Z_i \mid x) \qquad (7) \]

The initial values of the hyper-parameters are critical to the performance of the proposed framework. A good starting point is either assuming both classifiers to be perfect at time zero or using the values obtained from some validation sets. Our preliminary experiments show that both methods work reasonably well, so for simplicity, we adopt the first initialization scheme.

In fact, our proposed method is a major extension of [13], where maximum likelihood (ML) parameter estimation is used. Instead, we use MAP estimation with an evolving prior distribution so that P(\omega \mid \tilde{\omega}, Z_i) is adapted over time depending on each classifier's performance, which in turn ensures that the proposed framework works in a fully incremental manner.

4.3 Combination of the operating points

The last issue is to set an operating point. The proposed framework is targeting image-and-text emails, text-only emails, and image-only emails. However, an image-based filter and a text-based filter may have different operating points. For this reason, applying a single operating point to the individual components is inappropriate. Instead, we set multiple operating points according to the content of the emails. In particular, for text-only and image-only emails, the operating points of the image-based and text-based filters are preserved. As for the text-and-image emails, the operating point can be obtained by using a cross-validation set or from some external knowledge. In the proposed technique, we determine the value by cross-validation on a subset of the TREC 2005 spam corpus. The operating value determined here is used throughout all experiments to be discussed next.

5 Experimental Results

We prepared two public datasets, the TREC 2005 and 2007 spam corpora. We deliberately left out the TREC 2006 spam corpus from our training and testing data sets because there were only 3 images in that corpus. In the TREC 2005 spam corpus, we extracted 1530 images, including 1256 spam images and 274 legitimate images. These images were contained in 132 emails out of 92189 (52790 for spam and 39399 for legitimate) in which 1197 were spam and the rest of them were legitimate. Similarly, we extracted 9676 images from the TREC 2007 spam corpus, consisting of 8414 spam images and 1262 legitimate images. The number of emails containing images in the TREC 2007 spam corpus was 8223 for spam and 326 for legitimate, respectively. The total number of emails in the TREC 2007 spam corpus was 50199 for spam and 25220 for legitimate, respectively.

To evaluate the proposed framework, we adopted the evaluation procedures from the official TREC Spam Track guideline [5] with three modes of operation: immediate feedback (the correct classification of each message is given to the filter immediately after it makes its prediction), delayed feedback (the same as immediate feedback except that there is a delay in providing the correct classification; in the worst case, the correct classification might not be provided), and online active learning (the filter is given a feedback quota; the filter asks for the correct classification of some messages, and as long as the quota is not exceeded, the correct classification is provided).
For simplicity, we selected the immediate feedback scheme.

As for text-based spam filters, Bogofilter and OSBF were used with their default settings. Bogofilter is a mail filter that takes advantage of a Bayesian-like statistical approach to updating the indicating power of spam of the words in the corpora. It also makes use of the inverse of a chi-square distribution to compute discrimination between the spam and legitimate classes. The default setting is a two-state classification with a threshold value at 0.99. On the other hand, OSBF is an implementation of the OSBF (Orthogonal Sparse Bigram with confidence Factor) algorithm that enhances a Bayesian classifier. Since OSBF has an output range of (-∞, +∞), sigmoid fitting was performed to convert its output into the range (0, 1).

As for the image classifier, we used the same features as in [3], including color moments, color heterogeneity, color conspicuousness, and self-similarity. In addition to these features, we extracted some metadata from images such as dimension, file type, etc. because it has been seen that such metadata were useful in some cases. As a result, we obtained 74-dimension feature vectors. The same class-discriminant functions as in [3] (i.e. linear discriminant functions) were used. We did not use multi-class characterization in the current framework since it was almost unrealistic for the users to know which sub-class an image belonged to, in contrast to giving feedback only on whether the image was spam or legitimate. Parameter optimization was solved with a generalized probabilistic descent (GPD) algorithm [3]. To deal with multiple images, we used the simulated probability described in Section 3 to compute the overall probability.

In the following, the effectiveness of incremental learning against batch learning is given first. Then, comparisons of incremental learning among different informative sets are presented to decide an optimal criterion for constructing an informative set. Finally, the performances of the proposed framework are compared with those of text-based spam filters.

5.1 Effectiveness of incremental learning

To see the effectiveness of incremental learning in image-based spam filtering, we first implemented an initial discriminative image spam filter trained with the mixed data set used in [3] and [7].
The spam image class in this data set consisted of spam images from the SpamArchive corpus and spam images obtained from a personal inbox. The legitimate image class consisted of images from a personal inbox and some regular images from the Corel dataset and Google Image Search. While collecting this data set, duplicated or nearly-duplicated images were eliminated, which created the final data set with 1814 spam images and 1939 legitimate images. Here we expect to see larger mismatches between the training set used in batch-mode and the TREC 2007 corpus than the TREC 2005 corpus because, for the spam class, most of the training samples were collected quite a long time ago and, for the legitimate class, many images were not actually extracted from real emails. This tendency is reflected in the results shown in Table 1.

In Table 1 we compare the performance of spam image classification of the classifiers trained with this data set with classifiers trained incrementally. In the A1 and A2 experiments, the TREC 2005 spam corpus was used, while in B1 and B2 we used the TREC 2007 dataset. Moreover, A1 and B1 are the results obtained with classifiers trained with batch learning, while A2 and B2 are results obtained from classifiers trained with incremental learning.

Table 1. Comparison of incremental and batch learning

                      False Negative (%)   False Positive (%)
    A1 (batch/2005)   5.18                 25.55
    A2 (increm/2005)  2.23                 29.56
    B1 (batch/2007)   2.94                 6.7
    B2 (increm/2007)  1.73                 19.89

For the TREC 2007 corpus, both false positive errors (misclassifying legitimate images as spam images) and false negative errors (misclassifying spam images as legitimate images) were reduced significantly as compared to batch learning. Although false positive errors were slightly increased in the TREC 2005 spam corpus, false negative errors experienced a more dramatic improvement in terms of the relative error reduction rate. This result shows that incremental learning for image spam filtering effectively increases the spam filter's performance by reducing mismatches between the training and test conditions.

5.2 Comparison among different informative sets

Designing an optimal criterion for constructing an informative adaptation set I is an important issue for our proposed incremental learning framework, since it is desired to have good performance while maintaining speed on re-training. Following the algorithm illustrated in Figure 2, the thresholds for the informative set I are changed to determine the number of training samples needed.

The computational complexity of the proposed incremental learning algorithm is heavily dependent upon the number of iterations, the number of training samples used and the number of classes. The number of iterations and the number of classes are fixed here, but the number of training samples is controllable through different informative sets. The question is when the most efficient informative sets are achieved. To address this issue, in Figure 3, we plot the average detection error rate against the thresholds used for I.
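The quantity plotted in Figure 3 is the average detection error rate defined in Section 3, i.e. the mean of the false negative and false positive rates. As a small worked check, the sketch below applies that definition to the two incremental-learning rows of Table 1; the function name `average_detection_error` is an illustrative choice.

```python
def average_detection_error(false_negative_rate, false_positive_rate):
    """Mean of the false negative and false positive error rates (both in percent)."""
    return (false_negative_rate + false_positive_rate) / 2.0


# (false negative %, false positive %) taken from rows A2 and B2 of Table 1.
table1 = {
    "A2 (increm/2005)": (2.23, 29.56),
    "B2 (increm/2007)": (1.73, 19.89),
}

der = {name: average_detection_error(fn, fp) for name, (fn, fp) in table1.items()}
```

These evaluate to about 15.9 for TREC 2005 and about 10.8 for TREC 2007, which are the error rates obtained when the full set of training samples is used.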
From Figure 3, it can be seen that the TREC 2005 spam corpus is less sensitive to the variation of the threshold values than the TREC 2007 spam corpus. For the first few pairs of threshold values, the behaviors are somewhat erratic, particularly for the TREC 2007 spam corpus. However, both TREC corpora showed a tendency for the average detection error rates to remain comparable to the case where we use the full set of training samples once the threshold values passed some values. Besides, the number of training samples needed to achieve good performance was shown to reduce dramatically. For example, for the TREC 2005 spam corpus, applying a pair of (0.75, 0.25) threshold values increased the average detection error rate only from 15.9% to 18.6%, while reducing the number of training samples included in the set I by one-tenth. Similarly, the same threshold pair provided a large reduction of training samples (by one-tenth of the total training samples) for the TREC 2007 spam corpus, but with an increase in detection error rate (from 10.8% to 15.8%). We therefore applied this set of thresholds to obtain the results for the rest of the experiments.

Figure 3. Average detection error rate versus threshold used. The Y-axis represents the average detection error rates and the X-axis represents the threshold values. The first number is the threshold value for spam (S) and the second one is the threshold value for legitimate (L).

5.3 Comparison with the text-only results

Finally, we compare the proposed framework with the system that uses only text-based anti-spam filters. The two text-based anti-spam filters, Bogofilter and OSBF, were used to create text-only results as well as integrated results, resulting in three different systems (System I, II, III), which used Bogofilter, OSBF with a decision threshold of 0.5, and OSBF with a decision threshold of 0.99, respectively. For the integrated systems, we set the parameters as follows: the parameter η that controls the weighting factor of multiple images was set to 4, and the initial hyperparameters of the beta distribution, α and β, were set to 1 and 2. The operating points we used are listed in Table 2. In Table 3, the compared results are given for text spam filters and the proposed framework.
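To show how the beta hyperparameters enter the combination step, the MAP reliability estimate of Section 4.2 can be sketched as a small counter. This is a minimal illustration under assumptions: the class name `Reliability` and its methods are hypothetical, the estimate is taken as the mode of the beta posterior after the latest outcome has been counted, and the starting values in the example are illustrative rather than the settings reported above.

```python
class Reliability:
    """Tracks how often one filter agrees with the feedback label, via a beta prior."""

    def __init__(self, alpha0, beta0):
        if alpha0 < 1.0 or beta0 < 1.0 or alpha0 + beta0 <= 2.0:
            raise ValueError("need alpha0 >= 1, beta0 >= 1 and alpha0 + beta0 > 2")
        self.alpha = float(alpha0)
        self.beta = float(beta0)

    def update(self, success):
        """Fold one outcome into the hyperparameters (prior evolution)."""
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def mode(self):
        """MAP estimate of the success probability: mode of Beta(alpha, beta)."""
        return (self.alpha - 1.0) / (self.alpha + self.beta - 2.0)


# Illustrative start that treats the filter as perfect before any feedback arrives.
image_filter = Reliability(alpha0=2.0, beta0=1.0)
initial = image_filter.mode()

# Three correct decisions followed by one mistake.
for outcome in (True, True, True, False):
    image_filter.update(outcome)
after_feedback = image_filter.mode()
```

In this toy run the estimate starts at 1.0 and drops to 0.8 after the single mistake, so a filter that keeps erring receives less and less trust in the combined decision.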
In Table 3, false negative (misclassification of spam emails) and false positive (misclassification of legitimate emails) errors are computed over all emails in the corpora and listed for the different configurations. To emphasize the effectiveness of the proposed framework, the numbers of text-and-image emails that were misclassified are also presented in parentheses. In our corpora, there were no image-only emails.

Table 2. The operating points for three different setups

                      Text    Image   Integrated
    Bogofilter (I)    .8      .5      .75
    OSBF (II)         .5      .5      .52
    OSBF (III)        .99     .5      .52

As seen in Table 3, Bogofilter alone produced 6.463% and 8.212% false negative errors on the TREC 2005 and TREC 2007 spam corpora, respectively. Out of 1197 text-and-image spam emails in the TREC 2005 corpus, it misclassified 67. On the other hand, no legitimate text-and-image emails were misclassified (i.e. no false positive errors) in either the TREC 2005 or the TREC 2007 spam corpus using Bogofilter only.

The performance improvement for the case of Bogofilter is rather promising. Combining Bogofilter and the image spam filter (System I), the proposed framework was able to reduce false negative errors from 6.463% to 6.424% while confining false positive errors to the same level. In fact, all text-and-image emails in the TREC 2005 spam corpus were now correctly classified. Similarly, it decreased false negative errors from 8.212% to 7.152% for the TREC 2007 spam corpus. Now the number of misclassified text-and-image emails was cut in half.

With a decision threshold at 0.5 (System II), OSBF performed extremely well on the TREC 2005 and TREC 2007 corpora in that OSBF did not misclassify any text-and-image spam emails in either corpus. However, even in this case, the proposed system with OSBF demonstrates comparable performance, as seen in Table 3, because the proposed system makes use of a Bayesian framework that considers the performance of each of the filters. To further see the effectiveness of the proposed system, we considered a case where the decision threshold was set to 0.99 for OSBF (System III). In this case, OSBF misclassified 7 text-and-image emails for the TREC 2005 corpus and 17 such emails for the TREC 2007 corpus. As seen in Table 3, the proposed system effectively enhanced the performance of OSBF, as now only one misclassification error and 13 misclassification errors are observed for the TREC 2005 and TREC 2007 corpora, respectively.
Table 3. Comparisons of the proposed framework with text-only spam filters. False negative (FN) and false positive (FP) errors were computed over all emails. The numbers of misclassified text-and-image emails are shown in parentheses.

                      Bogofilter (I)                OSBF (II)                   OSBF (III)
                      Text only     Text + Image    Text only    Text + Image   Text only    Text + Image
  TREC 2005   FN(%)   6.463 (67)    6.424 (0)       0.52 (0)     0.56 (2)       4.644 (7)    4.635 (0)
              FP(%)   0.28 (0)      0.28 (0)        0.424 (2)    0.424 (2)      0.51 (0)     0.53 (1)
  TREC 2007   FN(%)   8.212 (1128)  7.152 (594)     0.98 (0)     0.98 (0)       0.721 (13)   0.699 (2)
              FP(%)   0.24 (0)      0.24 (0)        0.42 (7)     0.436 (11)     0.222 (4)    0.237 (11)

Figure 4. Examples of text-and-image spam emails extracted from the TREC 2005 and TREC 2007 corpora: (a) TREC 2005-47/97, (b) TREC 2007-inmail.36222. Note that both emails contain legitimate text messages with spam images. Bogofilter (I) and OSBF (III) misclassified both messages, whereas the proposed framework correctly classified them as spam.

6 Conclusion and Future Work

In this paper, we present an anti-spam filter combination framework for text-and-image emails that combines a text-based anti-spam filter with an image-based filter. Based on a previously proposed discriminative learning technique for image spam, we developed an image-based anti-spam filter whose parameters are learned incrementally, so that the lack of proper training data for legitimate images can be addressed. The experimental results on the TREC 2005 and 2007 spam corpora show that our proposed incremental learning approach performed well and effectively reduced mismatches between training and test data.

More importantly, our proposed framework is shown to improve anti-spam filtering performance, especially for text-and-image emails. The number of misclassified text-and-image spam emails was reduced significantly in both the TREC 2005 and TREC 2007 spam corpora. For both Bogofilter and OSBF, no text-and-image spam emails from the TREC 2005 spam corpus were misclassified using the proposed framework. For the TREC 2007 spam corpus, the proposed framework also showed performance improvements. This is very encouraging in that it shows that integrating image-based anti-spam filters with text-based filters enhances spam filtering in realistic situations.

Our proposed framework can also be used in various fields such as web-spam or blog-spam filtering, where images and text are mixed together. Adopting an incremental learning approach makes our technique more attractive, since there is no need to collect a large number of training samples. We intend to apply our technique to those relatively unexplored areas in the future. We will also refine the techniques used in each component to achieve better performance.

Acknowledgements

This work has been supported by grants from the Air Force Office of Scientific Research.