Spam Detection in Voice-over-IP Calls through Semi-Supervised Clustering

Yu-Sung Wu (1), Saurabh Bagchi (1), Navjot Singh (2), Ratsameetip Wita (3)
(1) School of Electrical & Computer Eng., Purdue University, West Lafayette, IN 47907
(2) Avaya Labs, 233 Mt. Airy Rd., Basking Ridge, NJ 07920
(3) Chulalongkorn University, Thailand
{yswu,sbagchi}@purdue.edu, singh@avaya.com, Ratsameetip.W@Student.chula.ac.th

Abstract
In this paper, we present an approach for detection of spam calls over IP telephony, called SPIT, in VoIP systems. SPIT detection differs from spam detection in email in three ways: the process has to be soft real-time, fewer features are available for examination due to the difficulty of mining voice traffic at runtime, and the signaling traffic of legitimate and malicious callers is similar. Our approach differs from existing work in its adaptability to new environments without the need for laborious and error-prone manual parameter configuration. We use clustering based on the call parameters, using optional user feedback for some calls, which users mark as SPIT or non-SPIT. We improve on a popular algorithm for semi-supervised learning, called MPCK-Means, to make it scalable to a large number of calls and operate at runtime. Our evaluation on captured call traces shows a fifteen-fold reduction in computation time, with improvement in detection accuracy.

Keywords
Voice-over-IP systems, spam detection, SPIT detection, semi-supervised learning, clustering.

1. Introduction
As the popularity of VoIP systems increases, they are being subjected to different kinds of security threats [1]. A large class of the threats, such as call rerouting, toll fraud, and conversation hijacking, incur deviations in the protocol state machines and can be detected through monitoring the protocol state transitions [2],[3]. Additionally, cryptographically secure versions of the common VoIP protocols, such as Secure SIP and Secure RTP, address many of the attacks presented in the literature. However, spam calls in VoIP [4], commonly called SPIT, are becoming an increasing nuisance. The ease with which automated SPIT calls can be launched can derail the adoption of VoIP as a critical infrastructure element.
Existing monitoring and cryptographic solutions are not immediately applicable to SPIT detection. In this paper, we address the problem of detection of SPIT calls. Detection of spam emails is a mature field and there are some similarities to our problem. In both domains, users can provide feedback about an individual email or call, for the latter through a built-in button in some commercially available VoIP phones. However, there exist significant differences: VoIP traffic is real-time and the detection should ideally be real-time as well; some features are expensive to extract in real-time, especially those in voice traffic; the signaling patterns are likely similar in legitimate and malicious calls, rendering content-based filtering on signaling traffic ineffective; and features from multiple protocols used in VoIP may be relevant.

In this paper, we present the design of a system that uses semi-supervised machine learning for detection of SPIT calls. It builds on the notion of clustering whereby calls with similar features are placed in a cluster for SPIT or legitimate calls. Call features include those extracted directly from signaling traffic, those extracted from media traffic, such as the proportion of silence in the call, and those derived from calls. However, previous approaches that use thresholds [5] on the call features are difficult to use in practice since the nature of SPIT calls varies widely. Therefore, we learn the features to use and their relative importance in clustering through runtime observations, which include user feedback. The popular semi-supervised clustering algorithm called MPCK-Means [6] scales as $O(N^3)$, where N is the number of calls. This would generally be too expensive for real-time operation. We modify it to create our algorithm called eMPCK-Means, using VoIP-specific features to reduce the cost to $O(N)$. Such specialization includes the early use of user feedback and prior knowledge of the number of clusters. Additionally, we create an incremental protocol called pMPCK-Means that can perform the detection as soon as the call is established.
We evaluate the protocols using four call traces with different characteristics of SPIT and non-SPIT calls, over different proportions of user feedback and accuracy of the user feedback. With a batch of 4 calls, eMPCK-Means is 15 times faster than MPCK-Means, while achieving better detection coverage in terms of true and false positives. Since pMPCK-Means can examine only a limited set of call features, it works well only with a large fraction of calls with accurate user feedback.

2. Related work
Rosenberg [4] details the problem of VoIP SPIT and gives various high-level conceptual solutions. The solutions can be placed in three categories [7]: (1) Non-intrusive methods based on the exchange and analysis of signaling messages; (2) Interaction methods that create inconveniences for the caller by requesting them to pass a checking procedure before the call is established; (3) Callee interaction methods that exchange information with the callee on each call. An example work in category 1 is [8], where the authors look at the SIP signaling traffic pattern to detect SPIT. However, they do not provide quantitative data on the detection accuracy. Our experimental results indicate that relying solely on SIP message patterns will give low detection coverage. The work by Quittek [7] generates a greeting sound or faked ring tone to the caller right after the call is established and monitors the response voice patterns from the caller to differentiate between a human caller and a SPIT generator. This falls in category 2. In comparison, our work encompasses categories 1 and 3.

Kolan [9] presents an approach which maintains the trust information for each caller. The information can be automatically built up through user feedback, or through a propagation of reputation via social networks. The approach can be used in our system, where we can embed the caller's trust as one of the call features. However, the reputation database may grow large and a reputation system can be gamed by false praise or false blame.

Clustering is a way to learn a classification from the data [10], especially with unlabeled data. Clustering techniques have been used for detecting e-mail spam in [11],[12]. On the other hand, classification techniques such as SVM [13] are popular for data classification. However, they typically require labeled data and do not take unlabeled data into consideration. Recent developments in semi-supervised classification techniques [14], such as semi-supervised SVM [15], incorporate both labeled and unlabeled data.

3. Design

3.1 Structure of VoIP calls
There are typically three phases involved in a VoIP phone call [16]. The first phase is call establishment through a three-way handshake, which involves (1) the caller sending a SIP INVITE message to the proxy server and the server forwarding the INVITE message to the callee, (2) the callee replying with a SIP OK message, and (3) the caller sending a SIP ACK message to complete the call establishment phase. The second phase is the conversation, which contains the media stream (voice) transmitted between the caller and the callee, typically using RTP/RTCP [17]. The last phase is the call tear down phase, which can be initiated by either the caller or the callee sending a SIP BYE message followed by SIP OK and SIP ACK messages.

3.2 Characteristics of VoIP SPIT calls
A blacklist-based approach can be used at the call establishment phase, based on source IP or From URI, to drop calls from known SPIT sources. In the media stream phase, a typical pattern one can imagine for SPIT calls is that the caller speaks more than the callee. Another pattern is that the length of the media stream phase, i.e., the call duration, is shorter in the case of calls answered by a live person since SPIT calls are generally undesirable. Also, one can assume that for a SPIT call it is more likely that the call termination will be initiated by the callee, i.e., the callee sends the SIP BYE message.

Since SPIT calls are usually large volume calls made by some spitter within a period of time, we found that it is also useful to look for patterns in a batch of calls. Certain features are available when looking at the collective set of calls, such as the inter-arrival time between calls. Also, statistical learning can only occur with a batch of calls.

3.3 Detection scheme
A VoIP environment typically consists of multiple domains, with each domain composed of a few proxy servers and phones belonging to end users. Figure 1 shows an example VoIP environment consisting of two domains. In a VoIP environment, a proxy server's main function is to route the signaling messages.
For the specific example we show here, Proxy #1 is used to route the signaling messages among phones {A,B,C}. Similarly, Proxy #2 is used to route the signaling messages among phones {E,F}. Cross-domain phone calls between {A,B,C} and {E,F} are collaboratively handled by Proxy #1 and Proxy #2. Once a phone call is established, subsequent messages (signaling and voice) can travel directly between phones without involving the proxies. However, an ISP can mandate that all traffic pass through the proxies, which is often the case for billing and security purposes.

[Figure 1. Detecting SPIT Calls in a VoIP Environment. Legend: SIP-based VoIP proxy servers #1 and #2, server-side detectors, client-side detectors, SPIT detector, normal users, spitters.]

Our approach in detecting SPIT calls involves placing local detectors at the SIP proxies and the phones in the managed domain. The domains that have our detection mechanism are called managed domains and the others are called unmanaged domains. Essentially, the detectors require observability of the signaling and the media streams within the managed domain. A spitter can exist as any phone in a VoIP environment, whether within a managed domain (phone B) or an unmanaged domain (phone E). The embedded detectors collect the information of the phone calls and send it to the SPITDetector, where the logic for differentiating SPIT calls from non-SPIT calls executes. The decoding of the traffic and calculation of the call features are handled by the respective server-side/client-side detectors, and only a digest of the necessary information is forwarded up to the detector, thus minimizing network traffic. SPITDetector supports two modes of detection:

Mode A: Look at each phone call with early detection. In this mode, the SPITDetector has to determine whether a call is SPIT or not before the media stream of the call is established. This means that the detection has to be completed before the callee picks up the phone. This mode is useful from an end user's point of view since SPIT calls can potentially be blocked without further annoyance.

Mode B: Look at the whole batch of phone calls. With Mode B, we assume received calls are kept in a collection which is then presented in a batch to our semi-supervised clustering algorithm. This mode provides higher detection accuracy than Mode A due to the availability of complete call feature information. Mode B is attractive to a service provider, rather than to an end user.

4. SPIT Detection using Semi-Supervised Clustering

4.1 Background
In our problem context, each VoIP call is regarded as one data point. We are interested in clustering call data points into two clusters, one containing the SPIT calls and the other containing the non-SPIT calls. In general, there may be multiple sub-clusters within each cluster, corresponding to radically different kinds of SPIT or non-SPIT calls. We explore this approach of multiple sub-clusters further in Sec. 4.7.

Semi-supervised clustering [18], [19], [6] is a recent development in the data clustering research community that aims to address the issue of selecting the proper criteria for clustering. Semi-supervised clustering allows the use of optional labeled data for a subset of the runtime observations to progressively modify the clustering criteria. This means that one does not need to determine a priori which features of the data points should be used for clustering. The clustering criteria will be trained into generating clusters that obey the user-labeled data as faithfully as possible [6]. The implicit assumption is that user feedback is perfectly accurate. In our work here, we evaluate the impact of noise in the user feedback.

4.2 VoIP call features for clustering
We construct a data point from each VoIP call based on 17 features: 1-2. From/To URI, 3. Start time, 4. Duration, 5. # of SIP INVITE messages, 6. # of ACK messages, 7-8. # of BYE messages from caller/callee, 9. Time since the last call from the originator of the current call, 10-15. # of 1xx, 2xx, 3xx, 4xx, 5xx, and 6xx SIP Response messages, 16. Call frequency of the originator of the current call, 17. Ratio of non-silence duration of the callee to the caller media streams. For Mode A early detection, only features 1, 2, 3, and 9 are available. Feature 17 is derived from the RTP media stream by client-side detectors if the media streams are configured to flow directly between clients [20], or it can be provided by the server-side detector if the media streams are configured to flow through the SIP Proxy.
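As an illustration of the feature list above, the following minimal sketch names the 17 features and masks out those that are unavailable in Mode A. The identifiers (`FEATURE_NAMES`, `EARLY_FEATURES`, `early_view`) are hypothetical and chosen for this example; the paper does not prescribe a data layout or an encoding for the URI features.

```python
# Hypothetical names for the 17 call features, in the order of Sec. 4.2.
FEATURE_NAMES = [
    "from_uri", "to_uri", "start_time", "duration", "n_invite", "n_ack",
    "n_bye_caller", "n_bye_callee", "time_since_last_call",
    "n_1xx", "n_2xx", "n_3xx", "n_4xx", "n_5xx", "n_6xx",
    "call_frequency", "nonsilence_ratio",
]

# 1-based indices of the features available before the media stream starts.
EARLY_FEATURES = (1, 2, 3, 9)


def early_view(point):
    """Return a copy of a 17-feature point with unavailable features set to None."""
    if len(point) != len(FEATURE_NAMES):
        raise ValueError("expected one value per feature")
    return [v if (i + 1) in EARLY_FEATURES else None
            for i, v in enumerate(point)]
```

A consumer of `early_view` still has to decide how to fill the `None` entries; Sec. 4.6 describes one option, using the mean of the cluster being compared against.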
We select the universe of features using our domain knowledge, to cover different facets of a VoIP call and to limit the number of features so that online clustering is feasible.

4.3 Labeled data via user feedback
Phone calls received in the managed domain can have optional user feedback information indicating whether a call is a SPIT call or a non-SPIT call. The corresponding data point will be labeled with a SPIT or a non-SPIT tag and fed into the semi-supervised clustering process. Such a data point will be used for adjusting the clustering criteria.

4.4 Extended K-Means for semi-supervised clustering: MPCK-Means
For this work, we select the semi-supervised clustering algorithm called MPCK-Means [6].

\[
\tau_{mpckm} = \sum_{x_i \in X} \left( \|x_i - \mu_{l_i}\|^2_{A_{l_i}} - \log(\det(A_{l_i})) \right)
 + \sum_{(x_i, x_j) \in M} w_{ij}\, f_M(x_i, x_j)\, \mathbb{1}[l_i \neq l_j]
 + \sum_{(x_i, x_j) \in C} \bar{w}_{ij}\, f_C(x_i, x_j)\, \mathbb{1}[l_i = l_j] \tag{1}
\]
\[
\|x_i - \mu_{l_i}\|^2_{A_{l_i}} = (x_i - \mu_{l_i})^T A_{l_i} (x_i - \mu_{l_i}) \tag{2}
\]
\[
f_M(x_i, x_j) = \tfrac{1}{2} \|x_i - x_j\|^2_{A_{l_i}} + \tfrac{1}{2} \|x_i - x_j\|^2_{A_{l_j}} \tag{3}
\]
\[
f_C(x_i, x_j) = \|x'_{l_i} - x''_{l_i}\|^2_{A_{l_i}} - \|x_i - x_j\|^2_{A_{l_i}} \tag{4}
\]
\[
A_l = |X_l| \Bigg( \sum_{x_i \in X_l} (x_i - \mu_l)(x_i - \mu_l)^T
 + \sum_{(x_i, x_j) \in M_l} \tfrac{1}{2} w_{ij} (x_i - x_j)(x_i - x_j)^T\, \mathbb{1}[l_i \neq l_j]
 + \sum_{(x_i, x_j) \in C_l} \bar{w}_{ij} \Big( (x'_l - x''_l)(x'_l - x''_l)^T - (x_i - x_j)(x_i - x_j)^T \Big) \mathbb{1}[l_i = l_j] \Bigg)^{-1} \tag{5}
\]

Eq. (1) is the objective function that MPCK-Means minimizes. $l_i$ is the cluster that point $x_i$ is associated with. The main idea is the same as K-Means, where intra-cluster distance is being minimized. However, the Euclidean distance metric in MPCK-Means is weighted by a cluster-specific matrix $A_l$ (one can also use the same matrix across all clusters) [6]. $A_l$ is modified based on user feedback and the points in cluster $l$ following Eq. (5), where $M_l$ and $C_l$ are the constraints that involve points in $X_l$. The user-labeled data in MPCK-Means is supplied in the form of clustering constraints M (must-link set) and C (cannot-link set). Here the M set specifies pairs of data points that should be put in the same cluster, while the C set specifies those pairs of data points that should not be put in the same cluster. In Eq. (1), the last two terms add a penalty to the objective function for the violation of these constraints. The function $f_M$ returns a value proportional to the distance between two points that are in different clusters. The function $f_C$ returns a value that is inversely proportional to the distance between two points that are in the same cluster. The points $x'_l$ and $x''_l$ represent the two farthest data points in $X_l$ with respect to their distance computed using $A_l$. The pseudo code for MPCK-Means is listed as Algorithm 1 below.

Input: Set of data points $X = \{x_i\}_{i=1}^{N}$, set of must-link constraints $M = \{(x_i, x_j)\}$, set of cannot-link constraints $C = \{(x_i, x_j)\}$, number of clusters K, sets of constraint costs $W$ and $\bar{W}$, $t = 0$.
Output: Disjoint K-partitioning $\{X_h\}_{h=1}^{K}$ of X such that the objective function $\tau_{mpckm}$ is locally minimized.
Method:
1. Initialize clusters:
   1.1. Create the $\lambda$ neighborhoods $\{N_p\}_{p=1}^{\lambda}$ from M and C.
   1.2. If $\lambda \geq K$:
          Initialize $\{\mu_h^{(0)}\}_{h=1}^{K}$ using weighted farthest-first traversal starting from the largest $N_p$.
        Else ($\lambda < K$):
          Initialize $\{\mu_h^{(0)}\}_{h=1}^{\lambda}$ with the centroids of $\{N_p\}_{p=1}^{\lambda}$.
          Initialize the remaining clusters at random.
2. Repeat until convergence:
   2.1. For each data point $x_i \in X$:
          $h^* = \arg\min_h \big( \|x_i - \mu_h^{(t)}\|^2_{A_h} - \log(\det(A_h)) + \sum_{(x_i, x_j) \in M} w_{ij} f_M(x_i, x_j) \mathbb{1}[h \neq l_j] + \sum_{(x_i, x_j) \in C} \bar{w}_{ij} f_C(x_i, x_j) \mathbb{1}[h = l_j] \big)$
          Assign $x_i$ to $X_{h^*}^{(t+1)}$.
   2.2. For each cluster $X_h$: $\mu_h^{(t+1)} \leftarrow \frac{1}{|X_h^{(t+1)}|} \sum_{x \in X_h^{(t+1)}} x$
   2.3. Update_metrics for all clusters $\{X_h\}_{h=1}^{K}$ (Eq. (5)).
   2.4. $t \leftarrow t + 1$
Algorithm 1. MPCK-Means (adapted from [6])

4.4.1 Mapping user feedback to pair-wise constraints in MPCK-Means
The system keeps two sets: $F_S$ (data points of SPIT calls from feedback) and $F_N$ (data points of non-SPIT calls from feedback). For a data point $x_i$ which has user feedback, the user indicates $x_i \in F_S$ or $x_i \in F_N$. With respect to the MPCK-Means algorithm, must-link constraints M are derived online from pairs of points $(x_i, x_j)$ with both points in $F_S$ or both points in $F_N$. Similarly, cannot-link constraints C are created online from $(x_i, x_j)$, where $x_i \in F_S$ and $x_j \in F_N$. For ease of exposition, we initially discuss the case with 2 clusters, one each for SPIT and non-SPIT calls. We discuss the extension to multiple clusters in Sec. 4.7.

4.4.2 Building detection predicate
Given a cluster $X_h$ from the clustering algorithm, we use the number of data points with different user feedback in the cluster to determine the association of the cluster. If $|X_h \cap F_S| > |X_h \cap F_N|$, the calls in $X_h$ will be considered SPIT calls; else, they will be considered non-SPIT calls.

4.5 Efficient MPCK-Means
In the cluster assignment step of MPCK-Means (Step 2.1), iterating through the must-link/cannot-link peers of point $x_i$ is an $O(N)$ operation. X is the whole set of data points supplied to the clustering algorithm and $N = |X|$ is the number of data points. The determination of the maximally separated points $x'$ and $x''$ used in $f_C(\cdot)$ (Step 2.1 of Algorithm 1) and update_metrics (Step 2.3) has time complexity $O(N^2)$. This implies MPCK-Means is $O(N^3)$ since the operation has to be done for each data point (actually $O(cN^3)$, where c is a small fixed number of iterations till convergence). Thus, MPCK-Means does not scale well with large data sets. For our application, where N can be hundreds for a small-sized domain or thousands for a mid-sized domain, it turns out to be prohibitive time-wise to apply the original MPCK-Means directly.

Therefore, we adapt MPCK-Means into the eMPCK-Means (efficient MPCK-Means) algorithm (Algorithm 2). In it, the maximally separated points are estimated through an $O(1)$ approximation algorithm. We use an $O(N)$ implementation for the neighborhood creation process in the cluster initialization step of MPCK-Means. Additionally, the general practical experience with a K-Means based algorithm is that it converges within a small number of iterations of the main loop (Step 2 in MPCK-Means). Combined, these make eMPCK-Means $O(N)$, and the constant is small for a range of VoIP call traces.

4.5.1 eMPCK-Means: Initialize clusters
The eMPCK-Means algorithm creates the initial neighborhoods directly from the user feedback sets $F_S$ and $F_N$. Specifically, it creates w neighborhoods $\{F_S, F_N, n_3, n_4, \ldots, n_w\}$, where $\{n_3, n_4, \ldots, n_w\} = X - F_S - F_N$ is the set of data points not covered by the user feedback. The complexity of this step is $O(N)$. We use the same weighted farthest-first traversal as in MPCK-Means, which is $O(N)$ when the number of clusters is a constant. Overall, the initialize clusters step in eMPCK-Means has $O(N)$ complexity.

4.5.2 eMPCK-Means: efficient estimation of maximally separated points $(x', x'')$
In MPCK-Means, finding the exact maximally separated points $(x', x'')$ used in Eq. (4) and in the matrix update [6] requires evaluating the distance $\|x_i - x_j\|_A$ for every pair of points $(x_i, x_j) \in X$, which is an $O(N^2)$ operation. Since the matrix A is updated in each iteration of the loop of Step 2 in Algorithm 1, this evaluation has to be repeated as well. In eMPCK-Means, we estimate the maximally separated points by first putting the data points from X into an array R[1..N] in a random ordering. We then iterate through consecutive elements R[i] and R[i+1] in the array. We set $(x', x'')$ to the pair (R[i'], R[i'+1]) that gives the maximal value of $\|R[i'] - R[i'+1]\|_A$. This operation (Step 2 in Algorithm 2) is performed once right after the cluster initialization step and is done K times, once for each cluster. The time complexity of this step is $O(N)$. However, since the matrix A is updated in each iteration of MPCK-Means (Step 2.3, Algorithm 1), the estimate $(x', x'')$ has to be updated accordingly as well. We embed the updating process into the calculation of the parameterized Euclidean distance $\|x_i - x_j\|_A$ (Eq. (2)). The parameterized Euclidean distance is calculated in Eq. (3) and Eq. (4) as well.
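The random-ordering estimate described above can be sketched as follows. This is an illustrative sketch only: it assumes a diagonal metric (one weight per feature) for brevity, whereas MPCK-Means allows a full matrix A, and the function names are hypothetical.

```python
import random


def sq_dist(x, y, weights):
    """Squared distance under a diagonal metric (assumption: A is diagonal)."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights))


def estimate_farthest_pair(points, weights, rng=random):
    """O(N) estimate of the maximally separated pair.

    Shuffles the points, then keeps the consecutive pair (R[i], R[i+1])
    with the largest distance. The result is a lower bound on the true
    maximum, not the exact farthest pair.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    order = list(points)
    rng.shuffle(order)
    best = (order[0], order[1])
    best_d = sq_dist(order[0], order[1], weights)
    for a, b in zip(order[1:], order[2:]):
        d = sq_dist(a, b, weights)
        if d > best_d:
            best, best_d = (a, b), d
    return best, best_d
```

Because only N-1 of the N(N-1)/2 pairs are inspected, the estimate can undershoot; the running update embedded in later distance computations is what lets it improve over the iterations.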
The idea here is that when a pair of points $(x_i, x_j)$ is found to have a greater distance than the current estimate $(x', x'')$ at the time of evaluating the parameterized Euclidean distance, we set the maximally separated points estimate to $(x_i, x_j)$. The advantage of this approach is that it is an $O(1)$ operation and does not increase the order of complexity of eMPCK-Means. However, this is an approximation. Suppose that, in the loop that iterates through all the points, we are at point A and are calculating $\|A - B\|$. The point C is to be considered in a later iteration and (A, C) happens to be the farthest pair of points. Then, the computation for point A will not have the accurate distance for the farthest pair of points. Hereafter, when we refer to Euclidean distance computation, we mean that it has maximally separated point estimation embedded within it. To ensure that the $f_C(\cdot)$ function (Eq. (4)) does not evaluate to negative values with our approximated estimation of $(x', x'')$, we enforce that the second term is always evaluated before the first term so that there is an opportunity to update $(x', x'')$.

4.5.3 Use only a fixed number of constraints in cluster assignment step
In the cluster assignment step of MPCK-Means (Step 2.1, Algorithm 1), rather than iterating through the complete set of must-link/cannot-link peers of $x_i$, which makes Step 2.1 $O(N^2)$, we choose a fixed-sized subset of them. This corresponds to Step 3.1 in eMPCK-Means. This optimization is hinted at by the fact that the must-link/cannot-link information in our domain has significant redundancy. A set of $k_1$ and $k_2$ calls placed, through user feedback, in the SPIT and non-SPIT categories generates $\binom{k_1}{2} + \binom{k_2}{2}$ must-link and $k_1 k_2$ cannot-link constraints. On the other hand, we see from the experimental results in [6] that MPCK-Means can work reasonably well even with a limited number of constraints. The cluster assignment step thus becomes $O(N)$. In general, this can negatively affect the clustering quality. However, we believe it is a trade-off that is necessary in an effort to make the detection scheme scalable.
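To make the redundancy argument concrete, the sketch below derives pair-wise constraints from two feedback sets as in Sec. 4.4.1 and then draws a fixed-size subset of a point's constraint peers. The function names and the `ctssize` argument follow the pseudo code loosely and are assumptions of this example; the linear scan in `sample_peers` is for clarity only, and a per-point index would be needed to keep the step constant-time per point.

```python
import random
from itertools import combinations, product
from math import comb


def build_constraints(spit_ids, legit_ids):
    """Derive must-link and cannot-link pairs from the feedback sets F_S and F_N."""
    must = list(combinations(spit_ids, 2)) + list(combinations(legit_ids, 2))
    cannot = list(product(spit_ids, legit_ids))
    return must, cannot


def sample_peers(point_id, constraints, ctssize, rng=random):
    """Pick at most `ctssize` constraints that involve `point_id`."""
    peers = [pair for pair in constraints if point_id in pair]
    if len(peers) <= ctssize:
        return peers
    return rng.sample(peers, ctssize)


# With k1 = 3 SPIT and k2 = 2 non-SPIT feedback points:
must, cannot = build_constraints([1, 2, 3], [4, 5])
assert len(must) == comb(3, 2) + comb(2, 2)   # 3 + 1 must-link pairs
assert len(cannot) == 3 * 2                   # k1 * k2 cannot-link pairs
```

The constraint count grows quadratically with the number of feedback points while the information content (two label sets) grows only linearly, which is why sampling a fixed number of peers loses little.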
4.5.4 Pre-metrics update on the starting cluster(s)
In MPCK-Means, the first update metrics step (Step 2.3) occurs only after the first iteration of the cluster assignment step (Step 2.1). In the first iteration of the cluster assignment, a default identity matrix is assigned to A, which directly affects the quality of the clusters generated in the first iteration and has a long-term effect on the quality of the eventual clusters, as we see empirically. Therefore, in eMPCK-Means we conduct a metrics update (Step 1.2, eMPCK-Means, Algorithm 2) early on, right after the initial clusters are generated by the cluster initialization step. Intuitively, the user feedback is available at the outset and this optimization allows the matrix A to immediately adapt to the user feedback, which results in more accurate clustering. Additionally, it improves the convergence speed, as we see later (Table 1).

Input: Set of data points $X = \{x_i\}_{i=1}^{N}$, set of must-link constraints $M = \{(x_i, x_j)\}$, set of cannot-link constraints $C = \{(x_i, x_j)\}$, number of clusters K, sets of constraint costs $W$ and $\bar{W}$, optional initial cluster centroids $\{\mu_h^{(0)}\}_{h=1}^{K}$, $t = 0$.
Output: Disjoint K-partitioning $\{X_h\}_{h=1}^{K}$ of X such that the objective function $\tau_{mpckm}$ is locally minimized.
Method:
1. Initialize clusters:
   1.1. If the initial cluster centroids $\{\mu_h^{(0)}\}_{h=1}^{K}$ are not given in the input:
          Create the $\lambda$ neighborhoods $\{N_p\}_{p=1}^{\lambda}$ with the steps from Sec. 4.5.1.
          If $\lambda \geq K$:
            Use weighted farthest-first traversal to select K neighborhoods $\{N_{p(h)}\}_{h=1}^{K}$.
            Assign the data points: $X_h^{(0)} \leftarrow N_{p(h)}$.
            Initialize $\{\mu_h^{(0)}\}_{h=1}^{K}$ with the centroids of $\{X_h^{(0)}\}_{h=1}^{K}$.
          Else:
            Assign the data points: $X_h^{(0)} \leftarrow N_h$ for $h = 1, \ldots, \lambda$.
            Initialize $\{\mu_h^{(0)}\}_{h=1}^{\lambda}$ with the centroids of $\{N_p\}_{p=1}^{\lambda}$.
            Initialize the remaining clusters at random.
   1.2. Update metrics for all clusters $\{X_h\}_{h=1}^{K}$ ([6]).
2. Initialize the maximally separated points $(x'_h, x''_h)$ with respect to each $A_h$.
3. Repeat until convergence:
   3.1. For each $x_i \in X$:
          Randomly select $M' \subseteq \{(x_i, x_j) \in M\}$ with $|M'| = ctssize$ and $C' \subseteq \{(x_i, x_j) \in C\}$ with $|C'| = ctssize$.
          $h^* = \arg\min_h \big( \|x_i - \mu_h^{(t)}\|^2_{A_h} - \log(\det(A_h)) + \sum_{(x_i, x_j) \in M'} w_{ij} f_M(x_i, x_j) \mathbb{1}[h \neq l_j] + \sum_{(x_i, x_j) \in C'} \bar{w}_{ij} f_C(x_i, x_j) \mathbb{1}[h = l_j] \big)$
          Assign $x_i$ to $X_{h^*}^{(t+1)}$.
   3.2. For each cluster $X_h$: $\mu_h^{(t+1)} \leftarrow \frac{1}{|X_h^{(t+1)}|} \sum_{x \in X_h^{(t+1)}} x$
   3.3. Update_metrics for all clusters $\{X_h\}_{h=1}^{K}$ ([6]).
   3.4. $t \leftarrow t + 1$
Algorithm 2. eMPCK-Means

Algorithm 2 shows the proposed eMPCK-Means with the above modifications to MPCK-Means. Step 1 decides the starting centroids (means) for the clusters through the use of initial user feedback. For the specific case of the user flagging calls as SPIT or non-SPIT, K = 2. Step 2 initializes the maximally separated points estimation. Step 3.1 performs the cluster assignment. Step 3.2 updates the mean. Note that the mean can be updated in constant time by keeping the sum of the data points and performing an addition/subtraction when a data point is associated with/unassociated from a cluster. Step 3.3 updates the matrix $A_h$ for each cluster. The goal of this process is to pick the $A_h$'s such that the objective function (Eq. (1)) is minimized for the cluster assignment done in the current iteration of Step 3. Conceptually, this process will result in $A_h$'s that put higher weights on those features which are consistent among data points in the same cluster and lower weights on those that are less consistent.

4.6 Progressive MPCK-Means
The eMPCK-Means algorithm assumes that the data points are available in a batch, and is thus suited for Mode B (batch mode) detection (Sec. 3.3).
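The constant-time mean update noted for Step 3.2 of eMPCK-Means can be sketched as follows. The class name `RunningMean` is hypothetical; "constant time" here means independent of the number of points in the cluster (the cost is still proportional to the number of features).

```python
class RunningMean:
    """Cluster mean maintained from a running sum of its member points."""

    def __init__(self, dim):
        self.total = [0.0] * dim   # per-feature sum of the member points
        self.count = 0             # number of member points

    def add(self, x):
        """Associate point x with the cluster."""
        self.total = [s + v for s, v in zip(self.total, x)]
        self.count += 1

    def remove(self, x):
        """Unassociate point x from the cluster."""
        if self.count == 0:
            raise ValueError("cluster is empty")
        self.total = [s - v for s, v in zip(self.total, x)]
        self.count -= 1

    def mean(self):
        if self.count == 0:
            raise ValueError("mean of an empty cluster is undefined")
        return [s / self.count for s in self.total]


m = RunningMean(2)
m.add((1.0, 2.0))
m.add((3.0, 4.0))
assert m.mean() == [2.0, 3.0]
m.remove((1.0, 2.0))
assert m.mean() == [3.0, 4.0]
```

The same bookkeeping applies when a single new call is added to a cluster in the incremental setting, since only one addition per feature is needed.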
To support Mode A per-call early detection, we create a variant called progressive MPCK-Means (pMPCK-Means). The pseudo code is given as Algorithm 3. The idea here is that when a new call comes in, pMPCK-Means performs only the cluster assignment step and only for the new data point. The features From URI, To URI, Start time, and Time from the last call by the same caller are available at the beginning of the phone call and are used in pMPCK-Means. For the features that are not available, pMPCK-Means fills the data point with the mean values from the cluster to which this point's distance is being computed. This is implicitly carried out in Step 4 of Algorithm 3. In pMPCK-Means, the update metrics operation only occurs occasionally, when the cluster means have changed significantly (exceeding a given threshold $d_{threshold}$). Estimating the mean is an $O(1)$ operation for each new data point. This amortizes over many calls the cost of computing A and the cost of re-clustering all existing data points. However, a cost has to be paid in advance, which is that we require reasonably sized cluster(s) to be grown on the initial data points ($|X| > t_{threshold}$) through eMPCK-Means. The reason is that we want the initial matrix A to be as accurate as possible.

Algorithm: pMPCK-Means
Input: New data point $x_t$, disjoint K-partitioning $\{X_h^{(t-1)}\}_{h=1}^{K}$ of $X^{(t-1)} = \{x_1, x_2, \ldots, x_{t-1}\}$.
Output: The cluster association $l_t$ for the point $x_t$. Disjoint K-partitioning $\{X_h^{(t)}\}_{h=1}^{K}$ of $X^{(t)} = \{x_1, x_2, \ldots, x_t\}$.
Internal variables: current cluster means $\{\mu_h\}_{h=1}^{K}$ and the means $\{\bar{\mu}_h\}_{h=1}^{K}$ recorded at the last call to eMPCK-Means.
Method:
1. If $t < t_{threshold}$: $X^{(t)} \leftarrow X^{(t-1)} \cup \{x_t\}$; Return.
2. If $\{X_h^{(t-1)}\}_{h=1}^{K}$ are all empty: call eMPCK-Means on $X^{(t)}$ to generate $\{X_h^{(t)}\}_{h=1}^{K}$ and $\{\mu_h\}_{h=1}^{K}$; $\bar{\mu}_h \leftarrow \mu_h$; Return.
3. Randomly select $M' \subseteq \{(x_t, x_j) \in M\}$ with $|M'| = ctssize$ and $C' \subseteq \{(x_t, x_j) \in C\}$ with $|C'| = ctssize$.
4. $h^* = \arg\min_h \big( \|x_t - \mu_h\|^2_{A_h} - \log(\det(A_h)) + \sum_{(x_t, x_j) \in M'} w_{tj} f_M(x_t, x_j) \mathbb{1}[h \neq l_j] + \sum_{(x_t, x_j) \in C'} \bar{w}_{tj} f_C(x_t, x_j) \mathbb{1}[h = l_j] \big)$
5. $X_{h^*}^{(t)} \leftarrow X_{h^*}^{(t-1)} \cup \{x_t\}$; update $\mu_{h^*}$.
6. If $\|\mu_h - \bar{\mu}_h\|_{A_h} / \|x'_h - x''_h\|_{A_h} > d_{threshold}$, where $x'_h$, $x''_h$ are the maximally separated points with respect to $A_h$: call eMPCK-Means with initial centroids $\{\mu_h\}_{h=1}^{K}$ on $X^{(t)}$ to generate $\{X_h^{(t)}\}_{h=1}^{K}$; $\bar{\mu}_h \leftarrow \mu_h$.
Algorithm 3. pMPCK-Means

4.7 Multi-Class eMPCK clustering
We create a variant of eMPCK in which the initial clusters are split into sub-clusters based on the call types: calls going to voice mail, calls terminated immediately after the call is established, and the remaining calls. These three types exhibit different patterns in the non-silence call duration ratio (feature 17, Sec. 4.2). The sub-clusters are formed for both SPIT and non-SPIT calls. This is an attempt to guide the clustering process through expert knowledge. The user feedback, however, is only able to differentiate between SPIT and non-SPIT calls, and cannot place a call into a sub-cluster.

5. Experiments and Results

5.1 Testbed
We set up a two-domain testbed with a topology similar to Figure 1, one of the domains being protected by our detection technique. We use Asterisk as the VoIP proxy servers and MjSip for the phone clients. Each domain has 9 phones acting as non-spitters and 6 phones acting as spitters. We use the Poisson distribution to model call arrival times and the Exponential distribution to model call durations. The generation of call traces was done by only one of the co-authors without providing any information about the nature of non-SPIT and SPIT calls to the rest of the team. This was done by design so that the team working on the detection system does not have any prior knowledge of the call mix. Ideally we would have liked to perform the evaluation on third-party call traces. However, at the time of writing, no such call trace is publicly available.

5.2 Summary of call trace dataset
We collected four call traces from our testbed with varying call characteristics as follows (call trace name, Non-SPIT call length average, Non-SPIT call inter-arrival time average, SPIT call length average, SPIT call inter-arrival time average, Number of SPIT calls in trace, Number of non-SPIT calls in trace): (v4, 5, 3,,,, 7), (v5, 5,,,, 45, 338), (v6, 5, 3,,, 94, 89), (v7, 5, 3, 5,, 8, 3). The time unit is minute.
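The trace model of Sec. 5.1 (Poisson call arrivals, exponentially distributed call durations) can be sketched with the standard library as below. The function name, the dictionary layout, and the parameter values in the usage lines are illustrative assumptions and are not the settings of traces v4 to v7.

```python
import random


def generate_trace(n_calls, mean_interarrival, mean_length, is_spit, rng=random):
    """Generate calls with exponential inter-arrival times and durations.

    Exponential inter-arrival times correspond to a Poisson arrival process.
    Times are in minutes.
    """
    t = 0.0
    calls = []
    for _ in range(n_calls):
        t += rng.expovariate(1.0 / mean_interarrival)  # next arrival time
        calls.append({
            "start": t,
            "duration": rng.expovariate(1.0 / mean_length),
            "spit": is_spit,
        })
    return calls


# Hypothetical parameters: one legitimate stream and one SPIT stream,
# merged into a single trace ordered by start time.
rng = random.Random(42)
legit = generate_trace(100, mean_interarrival=3.0, mean_length=5.0, is_spit=False, rng=rng)
spit = generate_trace(50, mean_interarrival=1.0, mean_length=1.0, is_spit=True, rng=rng)
trace = sorted(legit + spit, key=lambda c: c["start"])
```

Generating the two populations separately and merging them keeps the per-population averages easy to control, which mirrors how the trace tuples above are parameterized.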
In terms of similarity between SPIT and non-SPIT calls, in decreasing order, the call traces are v5, v7, v6, and v4. There are other characteristics which are shared by the four call traces. Examples include a 6% chance of a call being hung up by the caller for a non-SPIT call and a % chance of being hung up by the caller (spitter) for a SPIT call. The media streams for a SPIT call are dominated by the spitter, while for a non-SPIT call the non-silence durations on the caller and the callee media streams are about the same on average.

Other experimental parameter settings are: at most 5 must-link and 5 cannot-link constraints are used. The pMPCK-Means algorithm uses data points initially with eMPCK-Means before commencing incremental operation. Each data point in the experiment is based on the average from 5 runs with the same parameter settings.

5.3 Effect of proportion of user feedback
We evaluate the effect of the proportion of calls that come with user feedback. We assume the same ratio for both SPIT and non-SPIT calls. We assume the feedback is perfectly accurate. Figure 2 shows the clustering quality of the four different algorithms on call trace v4 in terms of the F-Measure [6]. A larger F-Measure value means better quality clustering. From Sec. 5.2, we know that call trace v4 exhibits a very clear distinction between SPIT and non-SPIT calls in terms of call duration and call inter-arrival time. This makes eMPCK perform well with a user feedback ratio as low as . The original MPCK-Means achieves the same level but with a higher user feedback ratio of . The improved result of eMPCK is due to the pre-metrics update (Sec. 4.5.4), which creates a more accurate weight matrix based on user feedback, prior to iterating over the data points. The F-Measure from eMPCK Multi-Class drops with increasing user feedback ratio because we break the cluster into sub-clusters based on the call types. As a result, eMPCK Multi-Class will put different types of SPIT and non-SPIT calls into different sub-clusters. Both will hurt the F-Measure since, by definition of the F-Measure, these calls should be clustered into the same cluster.
This negative effect grows stronger as the user feedback ratio increases.

Figure 3 and Figure 4 show the true positive (TP) and false positive (FP) rates of SPIT detection on call trace v4. What we can see here is that eMPCK Multi-Class actually performs well despite the poor F-Measure. eMPCK Multi-Class performs worse than eMPCK at low user feedback ratio because breaking the initial cluster into sub-clusters reduces the number of call data points with feedback in each sub-cluster. This results in poor clustering and hence low detection accuracy. Compared to eMPCK, the detection accuracy of MPCK lags behind due to the lack of pre-metrics updating. pMPCK performs rather poorly even with call trace v4. However, it is still in the usable range (e.g. .63 True Positive with a user feedback ratio of .). The poor performance of pMPCK is due to the limited features available before the media stream is established.
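The detection predicate of Sec. 4.4.2 and the resulting rates can be sketched as below. The definitions used here are assumptions of this example, since the text does not spell them out: the TP rate is taken as the fraction of actual SPIT calls that land in a cluster labeled SPIT, and the FP rate as the fraction of actual non-SPIT calls that do. All function and variable names are hypothetical.

```python
def label_clusters(clusters, feedback_spit, feedback_legit):
    """Map cluster id -> True if the cluster is considered SPIT.

    A cluster is SPIT when it holds more calls with SPIT feedback than
    calls with non-SPIT feedback (Sec. 4.4.2).
    """
    return {cid: len(members & feedback_spit) > len(members & feedback_legit)
            for cid, members in clusters.items()}


def tp_fp_rates(clusters, is_spit_cluster, true_spit):
    """Return (TP rate, FP rate) of the flagged calls against ground truth."""
    all_calls = set().union(*clusters.values())
    flagged = set()
    for cid, members in clusters.items():
        if is_spit_cluster[cid]:
            flagged |= members
    true_legit = all_calls - true_spit
    tp = len(flagged & true_spit) / len(true_spit) if true_spit else 0.0
    fp = len(flagged & true_legit) / len(true_legit) if true_legit else 0.0
    return tp, fp


clusters = {0: {1, 2, 3, 4}, 1: {5, 6, 7, 8}}
labels = label_clusters(clusters, feedback_spit={1, 2}, feedback_legit={5})
tp, fp = tp_fp_rates(clusters, labels, true_spit={1, 2, 3, 5})
assert labels == {0: True, 1: False}
assert (tp, fp) == (0.75, 0.25)
```

Under these definitions a clustering can score poorly on the F-Measure and still detect well, because splitting SPIT calls across several sub-clusters does not hurt the rates as long as each sub-cluster is labeled SPIT.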

Due to space constraints, we show only the True Positive curves for call traces v5, v6, and v7 in Figure 5, Figure 6, and Figure 7 respectively. All the algorithms perform worse with call trace v5 because SPIT and non-SPIT calls have the same inter-arrival time. This makes the time since the last call from the same caller and the call frequency (features 9 and 6 in Sec. 4.[...]) much less useful. Another factor is that the number of SPIT calls in the call trace is decreased to 45 (compared to [...] in v4), which further lowers the clustering quality and detection accuracy. Figure 8 summarizes the True Positive rates from eMPCK across the four call traces. This basically corresponds to how salient the differences between SPIT calls and non-SPIT calls in the call traces are. In order, the easiest one is v4, followed closely by v6, and then v7. The hardest is v5. In v5, SPIT calls are almost indistinguishable from short-duration non-SPIT calls.

We show error bars (± s.t.d.) for eMPCK in Figure [...]. They are omitted in the rest of the figures for presentation clarity. The general trend is that the errors diminish with increasing ratio of user feedback. We observe less than ± 5% error across the experiments on call traces 4, 6, and 7 when the user feedback ratio is set beyond 0.[...]. For call trace 5, the error is higher (up to ± [...]5% at 0.[...] ratio).

Figure 2. Call trace v4 / F-Measure (x-axis: ratio of calls with feedback; y-axis: F-Measure)
Figure 3. Call trace v4 / TP (x-axis: ratio of calls with feedback; y-axis: True Positive Rate)
Figure 4. Call trace v4 / FP (x-axis: ratio of calls with feedback; y-axis: False Positive Rate)
Figure 5. Call trace v5 / TP (x-axis: ratio of calls with feedback; y-axis: True Positive Rate)
Figure 6. Call trace v6 / TP (x-axis: ratio of calls with feedback; y-axis: True Positive Rate)
Figure 7. Call trace v7 / TP (x-axis: ratio of calls with feedback; y-axis: True Positive Rate)
Figure 8. Compare eMPCK True Positive Rate across call traces (curves: v4, v5, v6, v7; x-axis: ratio of calls with feedback; y-axis: True Positive)
Figure 9. TP vs. Noise in User Feedback (x-axis: noise level; y-axis: True Positive Rate)
Figure 10. FP vs. Noise in User Feedback (x-axis: noise level; y-axis: False Positive Rate)

5.4 Scalability of execution time

In this experiment we compare the running times of MPCK and eMPCK by varying the number of call data points. Call trace v7 is used for this experiment. For MPCK, we apply exact optimizations which do not cause loss of accuracy. For example, the maximally separated points evaluation is re-executed only when the matrix gets changed. The results are based on code compiled with MS VC++ 8.0 with the default optimization level, running on Windows XP on an Intel E64[...] 2.13 GHz CPU.

Figure 11. Running Time (x-axis: number of data points; left y-axis: MPCK time (ms); right y-axis: eMPCK time (ms))

As Figure 11 shows, MPCK exhibits non-linear growth in the running time as the number of call data points increases (error bars are ± std.). eMPCK, on the other hand, exhibits linear growth in the running time. Also, MPCK takes significantly longer to run than eMPCK: 15 times longer for a batch of [...]4[...] calls. Looking at the number of iterations that each algorithm takes to converge (Table 1), eMPCK fares better. The running time advantage of eMPCK comes from the lower number of iterations as well as the lower running time of each iteration. The lower number of iterations is explained by eMPCK's update of the metric matrices A on the initialized clusters. For call trace v5, the similarity in SPIT and non-SPIT calls renders the initialization ineffective and the number of iterations is roughly equal for both algorithms.

Table 1. Number of iterations to convergence

           MPCK    eMPCK
v4         6.94    3.98
v5         7.8     7.83
v6         7.8     5.38
v7         6.94    4.7
Average    7.37    5.47

5.5 Effect of noise in user feedback

We evaluated the different algorithms with various noise levels in the user feedback. When we say the noise level is c, it means that a fraction c of the user feedback is false, i.e., a SPIT call is reported as non-SPIT and vice versa. We show the result with call trace 6 for this experiment. The user feedback ratio is fixed at 0.3. Figure 9 shows that the true positive rate decreases as the noise level increases. Observing the false positive rates in Figure 10, we conclude that pMPCK is completely unusable through the whole noise level range while the other algorithms are usable at low noise levels. We conclude that pMPCK is usable only for a high proportion of accurate user feedback. Beyond noise level 0.5, eMPCK performance drops below that of MPCK due to our design of the detection predicate (Sec. 4.4.[...]), namely, considering the cluster that contains more calls marked by the user as SPIT than non-SPIT to be the SPIT cluster.
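The noise model and the detection predicate just described can be sketched as follows. This is only an illustration under stated assumptions: the function names `add_feedback_noise` and `spit_clusters`, the dictionary representation of user feedback, and the toy cluster assignments are all hypothetical, and the clustering step itself is taken as given.

```python
import random
from collections import Counter


def add_feedback_noise(feedback, noise_level, rng):
    """Return a copy of feedback with a fraction noise_level of labels flipped.

    feedback maps a call index to True (marked SPIT) or False (marked non-SPIT).
    """
    calls = sorted(feedback)
    n_flip = round(noise_level * len(calls))
    flipped = set(rng.sample(calls, n_flip))
    return {i: (not label) if i in flipped else label
            for i, label in feedback.items()}


def spit_clusters(assignments, feedback):
    """Majority predicate: a cluster is regarded as SPIT when more of its
    calls with feedback are marked SPIT than non-SPIT.

    assignments[i] is the cluster id of call i.
    """
    spit_votes, other_votes = Counter(), Counter()
    for i, label in feedback.items():
        if label:
            spit_votes[assignments[i]] += 1
        else:
            other_votes[assignments[i]] += 1
    return {c for c in set(assignments) if spit_votes[c] > other_votes[c]}


if __name__ == "__main__":
    # Toy data: calls 0-3 fall in cluster 0, calls 4-7 in cluster 1.
    assignments = [0, 0, 0, 0, 1, 1, 1, 1]
    feedback = {0: True, 1: True, 2: True, 4: False, 5: False, 6: False}
    rng = random.Random(0)
    for level in (0.0, 0.5, 1.0):
        noisy = add_feedback_noise(feedback, level, rng)
        print(level, spit_clusters(assignments, noisy))
```

With clean feedback the predicate selects cluster 0; with every label flipped it selects cluster 1 instead. A clustering that follows the feedback more faithfully therefore inherits more of the feedback's errors once most labels are wrong.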
With noise level above 0.5, the user feedback is wrong more often than right, and the negative effect is more pronounced in eMPCK than in MPCK, since eMPCK did a better job of clustering on the user feedback than MPCK. As an example of a usable operating point, consider that at noise levels of 0.[...] or below, eMPCK has both true positive and true negative rates above 0.8.

5.6 Evaluation with noise and feedback ratio

Here we perform an evaluation of all four proposed algorithms with respect to the four call traces. Our evaluation methodology considers the combined effect of the proportion of user feedback and the noise level, and the results are shown in Figure 12, Figure 13, and Figure 14. In the 3D plot, the Z-axis corresponds to TP-FP, the difference between the True Positive rate and the False Positive rate, with respect to each pair of feedback ratio and noise level. Intuitively, if TP-FP is greater than zero, it means the detection gives more correct results than incorrect results and can be regarded as a valid operating point where the detection is useful. Due to the page length limitation, we show the 3D plots only for call trace 6.

Figure 12. MPCK (TP-FP) for call trace v6 (axes: ratio of feedback, noise level, TP-FP)
Figure 13. eMPCK (TP-FP) for call trace v6 (axes: ratio of feedback, noise level, TP-FP)
Figure 14. pMPCK (TP-FP) for call trace v6 (axes: ratio of feedback, noise level, TP-FP)

A general trend we can see in the 3D plots is that, for a fixed noise level, the TP-FP value climbs to a peak and then goes down as the feedback ratio is increased across its range. There is no sharp breakdown of performance for any of the algorithms. If the user feedback is accurate, then even with a low ratio of user feedback, the performance is good for MPCK and eMPCK. The performance of pMPCK, on the other hand, is acceptable only close to the extreme region of almost perfect user feedback for almost all calls.

To give an overall quantification of the detection quality, we define the volume metric based on the integral (Eq. (6)). In the ideal case where TP-FP is maintained at 1 through the entire range of noise levels and feedback ratio values, the volume will be 0.9. Table 2 shows the volume for each combination of algorithm and call trace. Call trace v5 gives the lowest volume, corresponding to the worst performance for all algorithms. Averaged over the entire range, we see that eMPCK performs best, followed by eMPCK (Multi Class), MPCK, and pMPCK.

$$\mathrm{Volume}_{TP-FP} = \int_{n=0}^{1} \int_{f=0.1}^{1} (TP - FP)\, df\, dn \qquad (6)$$

n: noise level, f: feedback ratio

Table 2. Summary of TP-FP volume comparisons

                       v4      v5      v6      v7      avg.
MPCK                   .48     -.595   -.39    -.388   -.34
eMPCK (Multi Class)    .68     -.59    -.33    -.4     -.34
eMPCK                  .4      -.577   -.7     -.34    -.87
pMPCK                  .5      -.596   -.37    -.4     -.34

6. CONCLUSION

In this paper, we proposed a new approach to detect SPIT calls in a VoIP environment. We map each phone call into a data point based on an extendable set of call features, derived from the signaling as well as the media protocols. This converts the problem of SPIT detection into a data classification problem, where a classic solution is the use of clustering. We apply semi-supervised clustering, which allows for the optional use of user feedback for more accurate classification. This corresponds to users flagging some calls as SPIT and others as legitimate. We create a new algorithm called eMPCK-Means, based on a previous algorithm called MPCK-Means, which provides linear time performance with the number of calls. eMPCK-Means includes a premetrics-update step, which contributes to high (> 90%) detection true positive rates with less than [...]% user feedback data points for three of the four call traces used here. We found that it is difficult to attain high detection accuracy based only on features available in the call establishment phase, which would enable a SPIT call to be dropped without the user needing to answer the call. This algorithm, pMPCK, performs well only with accurate user feedback for a majority of calls.

7. REFERENCES

[1] VOIPSA, "VoIP Threat Taxonomy," 2008.
[2] Y. S. Wu, S. Bagchi, S. Garg, and N. Singh, "SCIDIVE: a stateful and cross protocol intrusion detection architecture for voice-over-IP environments," in DSN, 2004, pp. 433-442.
[3] H. Sengar, D. Wijesekera, H. Wang, and S. Jajodia, "VoIP Intrusion Detection Through Interacting Protocol State Machines," in DSN, 2006, pp. 393-402.
[4] J. Rosenberg and C. Jennings, "RFC 5039: The Session Initiation Protocol (SIP) and Spam," 2008.
[5] D. Shin, J. Ahn, and C. Shim, "Progressive Multi Gray-Leveling: A Voice Spam Protection Algorithm," IEEE Network, vol. 20, pp. 18-24, 2006.
[6] M. Bilenko, S. Basu, and R. J. Mooney, "Integrating constraints and metric learning in semi-supervised clustering," in ICML, 2004, pp. 81-88.
[7] J. Quittek, S. Niccolini, S. Tartarelli, M. Stiemerling, M. Brunner, and T. Ewald, "Detecting SPIT Calls by Checking Human Communication Patterns," in ICC, 2007, pp. 1979-1984.
[8] R. MacIntosh and D. Vinokurov, "Detection and mitigation of spam in IP telephony networks using signaling protocol analysis," in IEEE/Sarnoff Symposium on Advances in Wired and Wireless Communication, 2005, pp. 49-52.
[9] P. Kolan and R. Dantu, "Socio-technical defense against voice spamming," ACM Transactions on Autonomous and Adaptive Systems (TAAS), vol. 2, 2007.
[10] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.
[11] P. Haider, U. Brefeld, and T. Scheffer, "Supervised clustering of streaming data for email batch detection," in ICML, 2007, pp. 345-352.
[12] M. Sasaki and H. Shinnou, "Spam Detection Using Text Clustering," in International Conference on Cyberworlds, 2005.
[13] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, pp. 121-167, 1998.
[14] G. Druck, C. Pal, A. McCallum, and X. Zhu, "Semi-supervised classification with hybrid generative/discriminative methods," in KDD, 2007, pp. 280-289.
[15] K. Bennett and A. Demiriz, "Semi-supervised support vector machines," Advances in Neural Information Processing Systems, pp. 368-374, 1999.
[16] J. Rosenberg, "RFC 3261 - SIP: Session Initiation Protocol," 2002.
[17] H. Schulzrinne, "RFC 1889 - RTP: A Transport Protocol for Real-Time Applications," 1996.
[18] N. Grira, M. Crucianu, and N. Boujemaa, "Unsupervised and Semi-supervised Clustering: a Brief Survey," A Review of Machine Learning Techniques for Processing Multimedia Content, Report of the MUSCLE European Network of Excellence (FP6), 2004.
[19] T. Finley and T. Joachims, "Supervised clustering with support vector machines," in ICML, 2005, pp. 217-224.
[20] voip-info.org, "Asterisk SIP Media Path."