Detecting Algorithmically Generated Malicious Domain Names

Sandeep Yadav, Ashwath K.K. Reddy, and A.L. Narasimha Reddy
Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
{sandeepy@, kashwathkumar@, reddy@ece.}tamu.edu

ABSTRACT
Recent Botnets such as Conficker, Kraken and Torpig have used DNS based domain fluxing for command-and-control, where each Bot queries for existence of a series of domain names and the owner has to register only one such domain name. In this paper, we develop a methodology to detect such domain fluxes in DNS traffic by looking for patterns inherent to domain names that are generated algorithmically, in contrast to those generated by humans. In particular, we look at distribution of alphanumeric characters as well as bigrams in all domains that are mapped to the same set of IP-addresses. We present and compare the performance of several distance metrics, including KL-distance, Edit distance and Jaccard measure. We train by using a good data set of domains obtained via a crawl of domains mapped to all IPv4 address space and modeling bad data sets based on behaviors seen so far and expected. We also apply our methodology to packet traces collected at a Tier-1 ISP and show we can automatically detect domain fluxing as used by Conficker botnet with minimal false positives.

Categories and Subject Descriptors
C.2.0 [Computer-Communication Networks]: Security and protection; K.6.5 [Management of Computing and Information Systems]: Security and Protection

General Terms
Measurement, Security, Verification

Keywords
Components, Domain flux, Domain names, Edit distance, Entropy, IP Fast Flux, Jaccard Index, Malicious

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IMC'10, November 1-3, 2010, Melbourne, Australia. Copyright 2010 ACM 978-1-4503-0057-5/10/11...$10.00.

Supranamaya Ranjan
Narus Inc., Sunnyvale, CA 94085, USA
soups@narus.com

1. INTRODUCTION
Recent botnets such as Conficker, Kraken and Torpig have brought into vogue a new method for botnet operators to control their bots: DNS domain fluxing. In this method, each bot algorithmically generates a large set of domain names and queries each of them until one of them is resolved. The bot then contacts the IP-address obtained, which is typically used to host the command-and-control (C&C) server. Besides command-and-control, spammers also routinely generate random domain names in order to avoid detection. For instance, spammers advertise randomly generated domain names in their spam emails to avoid detection by regular expression based domain blacklists that maintain signatures for recently spamvertised domain names.

The botnets that have used random domain name generation vary widely in the random word generation algorithm as well as the way it is seeded. For instance, Conficker-A [27] bots generate 250 domains every three hours while using the current date and time at UTC (in seconds) as the seed, which in turn is obtained by sending empty HTTP GET queries to a few legitimate sites such as google.com, baidu.com, answers.com etc. This way, all bots would generate the same domain names every day. In order to make it harder for a security vendor to pre-register the domain names, the next version, Conficker-C [28], increased the number of randomly generated domain names per bot to 50K. Torpig [30, 6] bots employ an interesting trick where the seed for the random string generator is based on one of the most popular trending topics in Twitter. Kraken employs a much more sophisticated random word generator and constructs English-like words with properly matched vowels and consonants.
Moreover, the randomly generated word is combined with a suffix chosen randomly from a pool of common English noun, verb, adjective and adverb suffixes, such as -able, -dom, -hood, -ment, -ship, or -ly.

From the point of view of the botnet owner, the economics work out quite well. They only have to register one or a few domains out of the several domains that each bot would query every day, whereas security vendors would have to pre-register all the domains that a bot queries every day, even before the botnet owner registers them. In all the cases above, the security vendors had to reverse engineer the bot executable to derive the exact algorithm being used for generating domain names. In some cases, their algorithm would predict domains successfully until the botnet owner would
patch all his bots with a repurposed executable with a different domain generation algorithm [30].

We argue that reverse engineering of botnet executables is resource- and time-intensive, and precious time may be lost before the domain generation algorithm is cracked and consequently before such domain name queries generated by bots are detected. In this regard, we raise the following question: can we detect algorithmically generated domain names while monitoring DNS traffic even when a reverse engineered domain generation algorithm may not be available?

Hence, we propose a methodology that analyzes DNS traffic to detect if and when domain names are being generated algorithmically, as a first line of defense. In this regard, our proposed methodology can point to the presence of bots within a network, and the network administrator can disconnect bots from their C&C server by filtering out DNS queries to such algorithmically generated domain names.

Our proposed methodology is based on the following observation: current botnets do not use well formed and pronounceable language words since the likelihood that such a word is already registered at a domain registrar is very high, which could be self-defeating as the botnet owner would then not be able to control his bots. In turn this means that such algorithmically generated domain names can be expected to exhibit characteristics vastly different from legitimate domain names. Hence, we develop metrics using techniques from signal detection theory and statistical learning which can detect algorithmically generated domain names that may be generated via a myriad of techniques: (i) those generated via pseudo-random string generation algorithms as well as (ii) dictionary-based generators, for instance the one used by Kraken [5, 3, 4] as well as a publicly available tool, Kwyjibo [2], which can generate words that are pronounceable yet not in the English dictionary.

Our method of detection comprises two parts.
First, we propose several ways to group together DNS queries: (i) by the Top Level Domain (TLD) they all correspond to; (ii) by the IP-address that they are mapped to; or (iii) by the connected component that they belong to, as determined via connected component analysis of the IP-domain bipartite graph. Second, for each such group, we compute metrics that characterize the distribution of the alphanumeric characters or bigrams (two consecutive alphanumeric characters) within the set of domain names. Specifically, we propose the following metrics to quickly differentiate a set of legitimate domain names from malicious ones: (i) Information entropy of the distribution of alphanumerics (unigrams and bigrams) within a group of domains; (ii) Jaccard index to compare the set of bigrams of a malicious domain name with those of good domains; and (iii) Edit-distance, which measures the number of character changes needed to convert one domain name to another.

We apply our methodology to a variety of data sets. First, we obtain a set of legitimate domain names via a reverse DNS crawl of the entire IPv4 address space. Next, we obtain a set of malicious domain names as generated by Conficker, Kraken and Torpig, as well as model a much more sophisticated domain name generation algorithm: Kwyjibo [2]. Finally, we apply our methodology to one day of network traffic from one of the largest Tier-1 ISPs in Asia and South America and show how we can detect Conficker as well as a botnet hitherto unknown, which we call Mjuyh (details in Section 5).

Our extensive experiments allow us to characterize the effectiveness of each metric in detecting algorithmically generated domain names in different attack scenarios. We model different attack intensities as the number of domain names that an algorithm generates. For instance, in the extreme scenario that a botnet generates 50 domains mapped to the same TLD, we show that KL-divergence over unigrams achieves 100% detection accuracy albeit at 5% false positive rate (legitimate domain groups classified as algorithmic).
We show how our detection improves significantly with much lower false positives as the number of words generated per TLD increases, e.g., when 200 domains are generated per TLD, then Edit distance achieves 100% detection accuracy with 8% false positives and when 500 domains are generated per TLD, Jaccard Index achieves 100% detection with 0% false positives.

Finally, our methodology of grouping together domains via connected components allows us to detect not only domain fluxing but also whether it was used in combination with IP fluxing. Moreover, computing the metrics over components yields better and faster detection than other grouping methods. Intuitively, even if botnets were to generate random words and combine them with multiple TLDs in order to spread the domain names thus generated (potentially to evade detection), as long as they map these domains such that at least one IP-address is shared in common, they reveal a group structure that can be exploited by our methodology for quick detection. We show that per-component analysis detects 26.32% more IP addresses than using per-IP analysis and 6.3% more hostnames than using per-domain analysis when we applied our methodology to detect Conficker in a Tier-1 ISP trace.

The rest of this paper is organized as follows. In Section 2, we compare our work against related literature. In Section 3, we present our detection methodology and introduce the metrics we have developed. In Section 4, we present the various ways by which domains can be grouped in order to compute the different metrics over them. Next, in Section 5, we present results to compare each metric as applied to different data sets and trace data. Further, in Section 6, we present the detection of malicious domains in a supervised learning framework, in particular, L1-regularized linear regression. We present a discussion of the relative computational complexity of each metric and the usefulness of component analysis in Section 7. Finally, in Section 8 we conclude.

2. RELATED WORK
Characteristics such as IP addresses, whois records and lexical features of phishing and non-phishing URLs have been analyzed by McGrath and Gupta [22]. They observed that the different URLs exhibited different alphabet distributions. Our work builds on this earlier work and develops techniques for identifying domains employing algorithmically generated names, potentially for domain fluxing. Ma, et al [7], employ statistical learning techniques based on lexical features (length of domain names, host names, number of dots in the URL etc.) and other features of URLs to automatically determine if a URL is malicious, i.e., used for phishing or advertising spam. While they classify each URL independently, our work is focused on classifying a group of URLs as algorithmically generated or not, solely by making use of the set of alphanumeric characters used.
In addition, we experimentally compare against their lexical features in Section 5 and show that our alphanumeric distribution based features can detect algorithmically generated domain names with lower false positives than lexical features. Overall, we consider our work as complementary and synergistic to the approach in [7].

With reference to the practice of IP fast fluxing, e.g., where the botnet owner constantly keeps changing the IP-addresses mapped to a C&C server, [24] implements a detection mechanism based on passive DNS traffic analysis. In our work, we present a methodology to detect cases where botnet owners may use a combination of both domain fluxing and IP fluxing, by having bots query a series of domain names and at the same time map a few of those domain names to an evolving set of IP-addresses. Earlier papers [23, 20] have also analyzed the inner workings of IP fast flux networks for hiding spam and scam infrastructure. With regard to botnet detection, [4, 5] perform correlation of network activity in time and space at campus network edges, and Xie et al in [33] focus on detecting spamming botnets by developing regular expression based signatures from a dataset of spam URLs.

We find that graph analysis of IP addresses and domain names embedded in DNS queries and replies reveals interesting macro relationships between different entities and enables identification of bot networks (Conficker) that seemed to span many domains and TLDs. With reference to graph based analysis, [34] utilizes rapid changes in user-bot graph structure to detect botnet accounts.

Statistical and learning techniques have been employed by various studies for prediction [0, 25, 3]. We employed results from detection theory in designing our strategies for classification [3, ].

Several studies have looked at understanding and reverse-engineering the inner workings of botnets [5, 3, 4, 6, 30, 26, 29]. Botlab has carried out an extensive analysis of several bot networks through active participation [9] and provided us with many example datasets for malicious domains.

3. DETECTION METRICS
In this section, we present our detection methodology, which is based on computing the distribution of alphanumeric characters for groups of domains. First, we motivate our metrics by showing how algorithmically generated domain names differ from legitimate ones in terms of distribution of alphanumeric characters. Next, we present our three metrics, namely Kullback-Leibler (KL) distance, Jaccard Index (JI) measure and Edit distance. Finally, in Section 4 we present the methodology to group domain names.

3.1 Data Sets
We first describe the data sets and how we obtained them:

(i) Non-malicious ISP Dataset: We use a network traffic trace collected from across 100+ router links at a Tier-1 ISP in Asia. The trace is one day long and provides details of DNS requests and corresponding replies. There are about 270,000 DNS name server replies.

(ii) Non-malicious DNS Dataset: We performed a reverse DNS crawl of the entire IPv4 address space to obtain a list of domain names and their corresponding IP-addresses. We further divided this data set into several parts, each comprising domains which had 500, 200, 100 and 50 domain labels. The DNS Dataset is considered as non-malicious for the following reasons. Botnets may own only a limited number of IP addresses. Based on our study, we find that a DNS PTR request maps an IP address to only one domain name. The dataset thus obtained will contain very few malicious domain names per analyzed group. In the event that the bots exhibit IP fluxing, it is noteworthy that the botnet owners cannot change the PTR DNS mapping for IP addresses they do not own, although the malicious name servers may point to any IP address.

(iii) Malicious datasets: We obtained the list of domain names that were known to have been generated by recent Botnets: Conficker [27, 28], Torpig [30] and Kraken [5, 3]. As described earlier in the Introduction, Kraken exhibits the most sophisticated domain generator by carefully matching the frequency of occurrence of vowels and consonants as well as concatenating the resulting word with common suffixes at the end such as -able, -dom, etc.
(iv) Kwyjibo: We model a much more sophisticated algorithmic domain name generation algorithm by using a publicly available tool, Kwyjibo [2], which generates domain names that are pronounceable yet not in the English language dictionary and hence much more likely to be available for registration at a domain registrar. The algorithm uses a syllable generator, where they first learn the frequency of one syllable following another in words in the English dictionary and then automatically generate pronounceable words by modeling it as a Markov process.

3.2 Motivation
Our detection methodology is based on the observation that algorithmically generated domains differ significantly from legitimate (human) generated ones in terms of the distribution of alphanumeric characters. Figure 1(a) shows the distribution of alphanumeric characters, defined as the set of English alphabets (a-z) and digits (0-9), for both legitimate as well as malicious domains. We derive the following points: (i) First, note that both the non-malicious data sets exhibit a non-uniform frequency distribution, e.g., letters m and o appear most frequently in the non-malicious ISP data set whereas the letter s appears most frequently in the non-malicious DNS data set. (ii) Even the most sophisticated algorithmic domain generator seen in the wild, for the Kraken botnet, has a fairly uniform distribution, albeit with higher frequencies at the vowels: a, e and i. (iii) If botnets of the future were to evolve and construct words that are pronounceable yet not in the dictionary, then they would not exhibit a uniform distribution as expected. For instance, Kwyjibo exhibits higher frequencies at alphabets e, g, i, l, n, etc. In this regard, techniques that are based on only the distribution of unigrams (single alphanumeric characters) may not be sufficient, as we will show through the rest of this section.

The terminology used in this and the following sections is as follows. For a hostname such as physics.university.edu, we refer to university as the second-level domain label, edu as the first-level domain, and university.edu as the second-level domain.
Similarly, physics.university.edu is referred to as the third-level domain and physics is the third-level domain label. The ccTLDs such as co.uk are effectively considered as first-level domains. Even though domain names may contain characters such as -, we currently limit our study to alphanumeric characters only.
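As an illustration of the character distributions discussed in this section, the following is a minimal sketch that computes the unigram distribution over the 36 alphanumeric characters for a group of domain labels. The function name and the choice to silently drop non-alphanumeric characters such as - are assumptions made for this example, not details taken from the measurement code.

```python
import string
from collections import Counter

# 36 symbols considered in the study: digits 0-9 and letters a-z.
ALPHABET = string.digits + string.ascii_lowercase


def unigram_distribution(labels):
    """Return {char: probability} over ALPHABET for a group of domain labels.

    Characters outside ALPHABET (for example '-') are ignored, which is an
    assumption of this sketch.
    """
    counts = Counter(
        ch for label in labels for ch in label.lower() if ch in ALPHABET
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("group contains no alphanumeric characters")
    return {ch: counts[ch] / total for ch in ALPHABET}


if __name__ == "__main__":
    dist = unigram_distribution(["physics", "cse", "humanities"])
    # Print the five most frequent characters in this toy group.
    for ch, p in sorted(dist.items(), key=lambda kv: -kv[1])[:5]:
        print(ch, round(p, 3))
```

A group drawn from a uniform random generator would give values close to 1/36 for every character, whereas human-chosen labels concentrate mass on a few letters.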
[Figure 1: Probability distributions of malicious and non-malicious domains. Panel (a): non-malicious and malicious domains (non-malicious ISP dataset, non-malicious DNS dataset, malicious randomly generated). Panel (b): only malicious entities (Kraken, Kwyjibo, randomly generated). x-axis: alphanumeric characters (0-9, a-z); y-axis: probability of occurrence.]

3.3 Metrics for anomaly detection
The K-L (Kullback-Leibler) divergence metric is a non-symmetric measure of distance between two probability distributions. The divergence (or distance) between two discretized distributions P and Q is given by:

$$D_{KL}(P \| Q) = \sum_{i=1}^{n} P(i) \log \frac{P(i)}{Q(i)}$$

where n is the number of possible values for a discrete random variable. The probability distribution P represents the test distribution and the distribution Q represents the base distribution from which the metric is computed.

Since the K-L measure is asymmetric, we use a symmetric form of the metric, which helps us deal with the possibility of singular probabilities in either distribution. The modified K-L metric is computed using the formula:

$$D_{sym}(PQ) = \frac{1}{2}\left(D_{KL}(P \| Q) + D_{KL}(Q \| P)\right)$$

Given a test distribution q computed for the domain to be tested, and non-malicious and malicious probability distributions over the alphanumerics as g and b respectively, we characterize the distribution as malicious or not via the following optimal classifier (for proof see appendix):

$$D_{sym}(qb) - D_{sym}(qg) \underset{b}{\overset{g}{\gtrless}} 0 \qquad (1)$$

For the test distribution q to be classified as non-malicious, we expect $D_{sym}(qg)$ to be less than $D_{sym}(qb)$. However, if $D_{sym}(qg)$ is greater than $D_{sym}(qb)$, the distribution is classified as malicious.

3.3.1 Measuring K-L divergence with unigrams
The first metric we design measures the KL-divergence of unigrams by considering all domain names that belong to the same group, e.g., all domains that map to the same IP-address or those that belong to the same top-level domain.
We postpone discussion of groups to Section 4. Given a group of domains for which we want to establish whether they were generated algorithmically or not, we first compute the distribution of alphanumeric characters to obtain the test distribution. Next, we compute the KL-divergence with a good distribution obtained from the non-malicious data sets (ISP or DNS crawl) and a malicious distribution obtained by modeling a botnet that generates alphanumerics uniformly. As expected, a simple unigram based technique may not suffice, especially to detect Kraken or Kwyjibo generated domains. Hence, we consider bigrams in our next metric.

3.3.2 Measuring K-L divergence with bigrams
A simple obfuscation technique that can be employed by algorithmically generated malicious domain names could be to generate domain names by using the same distribution of alphanumerics as commonly seen for legitimate domains. Hence, in our next metric, we consider the distribution of bigrams, i.e., two consecutive characters. We argue that it would be harder for an algorithm to generate domain names that exactly preserve a bigram distribution similar to legitimate domains, since the algorithm would need to consider the previous character already generated while generating the current character. The choices for the current character will hence be more restrictive than when choosing characters based on unigram distributions. Thus, the probability of test bigrams matching a non-malicious bigram distribution becomes smaller.

Analogous to the case above, given a group of domains, we extract the set of bigrams present in it to form a bigram distribution. Note that for the set of alphanumeric characters that we consider [a-z, 0-9], the total number of possible bigrams is 36x36, i.e., 1,296. Our improved hypothesis now involves validating a given test bigram distribution against the bigram distribution of non-malicious and malicious domain labels. We use the database of non-malicious words to determine a non-malicious probability distribution. For a sample malicious distribution, we generate bigrams randomly.
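The classification rule of Section 3.3 can be sketched as follows for either unigrams (n=1) or bigrams (n=2). This is a minimal illustration under stated assumptions: the function names are hypothetical, the additive smoothing constant is a choice made here so that the logarithms stay finite when an n-gram is unseen, and non-alphanumeric characters are simply dropped before n-grams are formed.

```python
import math
import string
from collections import Counter
from itertools import product

ALPHABET = string.digits + string.ascii_lowercase


def all_ngrams(n):
    """Every possible n-gram over ALPHABET (36 unigrams, 1,296 bigrams)."""
    return ["".join(t) for t in product(ALPHABET, repeat=n)]


def ngram_distribution(labels, n=1, smoothing=1e-6):
    """Smoothed n-gram distribution of a group of labels (smoothing is assumed)."""
    counts = Counter()
    for label in labels:
        chars = [c for c in label.lower() if c in ALPHABET]
        for i in range(len(chars) - n + 1):
            counts["".join(chars[i:i + n])] += 1
    keys = all_ngrams(n)
    total = sum(counts.values()) + smoothing * len(keys)
    return {k: (counts[k] + smoothing) / total for k in keys}


def uniform_distribution(n=1):
    """Model of a generator that picks every n-gram with equal probability."""
    keys = all_ngrams(n)
    return {k: 1.0 / len(keys) for k in keys}


def kl(p, q):
    """D_KL(P || Q) for two distributions over the same keys."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)


def symmetric_kl(p, q):
    """D_sym(PQ) = 0.5 * (D_KL(P || Q) + D_KL(Q || P))."""
    return 0.5 * (kl(p, q) + kl(q, p))


def classify(test_labels, good, bad, n=1):
    """Apply the rule D_sym(qb) - D_sym(qg) > 0 => non-malicious."""
    q = ngram_distribution(test_labels, n)
    score = symmetric_kl(q, bad) - symmetric_kl(q, good)
    return "non-malicious" if score > 0 else "malicious"
```

With very small groups the smoothing term dominates the result, which is consistent with the observation that more labels per group give more reliable decisions.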
Here as well, we use KL-divergence over the bigram distribution to determine if a test distribution is malicious or legitimate.

3.3.3 Using Jaccard Index between bigrams
We present the second metric to measure the similarity between a known set of components and a test distribution, namely the Jaccard index measure. The metric is defined as

$$JI = \frac{|A \cap B|}{|A \cup B|}$$

where A and B each represent a set of random variables. For our particular case, the set comprises bigrams that compose a domain label or a hostname. Note that the Jaccard index (JI) measure based on bigrams is a commonly used technique for web search engine spell-checking [2].

The core motivation behind using the JI measure is the same as that for KL-divergence. We expect the bigrams occurring in randomized (or malicious) hostnames to be mostly different when compared with the set of non-malicious bigrams. To elaborate, we construct a database of bigrams which point to lists of non-malicious words, domain labels or hostnames, as the case may be. Now for each sub-domain present in a test set, we determine all non-malicious words that contain at least 75% of the bigrams present in the test word. Such a threshold helps us discard words with less similarity. However, longer test words may implicitly satisfy this criterion and may yield an ambiguous JI value. As observed in Section 5, the word sizes for 95% of non-malicious words do not exceed 24 characters, and hence we divide all test words into units of 24 character strings. Figure 2 presents the CDF of domain label sizes as observed in our DNS PTR dataset (described in Section 5).

Calculating the JI measure is best explained with an example. Considering a randomized hostname such as ickoxjsov.botnet.com, we determine the JI value of the domain label ickoxjsov by first computing all bigrams (eight, in this case). Next, we examine each bigram's queue of non-malicious domain labels, and shortlist words with at least 75% of bigrams, i.e., six of the eight bigrams. Words satisfying this criterion may include thequickbrownfoxjumpsoverthelazydog (35 bigrams). However, such a word still has a low JI value owing to the large number of bigrams in it. The JI value is thus computed as 6/(8 + 35 - 6) = 0.16.
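The shortlist-then-score procedure above can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, bigrams are treated as sets of distinct two-character substrings, the 24-character splitting step is omitted, and averaging over the shortlisted words is an assumption made here because the text does not spell out how multiple candidate words are combined.

```python
def bigrams(word):
    """Set of distinct two-character substrings of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard_index(a, b):
    """|A intersect B| / |A union B| for two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ji_for_label(label, good_words, min_overlap=0.75):
    """Average JI between a test label and the shortlisted good words.

    A good word is shortlisted when it contains at least `min_overlap` of
    the label's bigrams. Returns 0.0 when nothing is shortlisted.
    Averaging the shortlisted scores is an assumption of this sketch.
    """
    test = bigrams(label)
    if not test:
        return 0.0
    scores = []
    for word in good_words:
        candidate = bigrams(word)
        if len(test & candidate) / len(test) >= min_overlap:
            scores.append(jaccard_index(test, candidate))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    good = ["thequickbrownfoxjumpsoverthelazydog"]
    print(ji_for_label("ickoxjsov", good))
```

For the pair in the worked example, six of the eight test bigrams are shared. Counting only distinct bigrams gives the long word 32 rather than 35, so this sketch yields 6/34, roughly 0.18, which is close to the value computed in the text and equally low.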
The low value indicates that the randomized test word does not match too well with the word from the non-malicious bigram database. The JI measure is thus computed for the remaining words. The test words might comprise a large number of bigrams and therefore do not always ensure a high JI value. We compute the JI measure using the equation described above and average it for all test words belonging to a particular group being analyzed. The averaged JI value for a non-malicious domain is expected to be higher than those for malicious groups.

As observed via our experiments in Section 5, the JI measure is better at determining domain based anomalies. However, it is also computationally expensive as the database of non-malicious bigrams needs to be maintained in memory. Also, classifying a non-malicious host will take more CPU cycles as we would obtain and compare a large set of words consisting of the test word's bigrams. Section 7 examines the computational complexity of the various metrics that we use.

3.3.4 Edit distance
Note that the two metrics described earlier rely on the definition of a good distribution (KL-divergence) or database (JI measure). Hence, we define a third metric, Edit distance, which classifies a group of domains as malicious or legitimate by only looking at the domains within the group, and is hence not reliant on the definition of a good database or distribution. The Edit distance between two strings represents an integral value identifying the number of transformations required to transform one string to another. It is a symmetric measure and provides a measure of intra-domain entropy. The types of eligible transformations are addition, deletion, and modification. For instance, to convert the word cat to dog, the edit distance is three as it requires all three characters to be replaced. With reference to determining anomalous domains, we expect that all domain labels (or hostnames) which are randomized will, on average, have a higher edit distance value. We use the Levenshtein edit distance dynamic programming algorithm for determining anomalies. The algorithm for computing the Levenshtein edit distance is shown in Algorithm 1 [2].

Algorithm 1 Dynamic programming algorithm for finding the edit distance
EditDist(s1, s2)
  int m[i, j] = 0
  for i <- 1 to |s1|
    do m[i, 0] = i
  for j <- 1 to |s2|
    do m[0, j] = j
  for i <- 1 to |s1|
    do for j <- 1 to |s2|
      do m[i, j] = min{ m[i-1, j-1] + (if s1[i] = s2[j] then 0 else 1),
                        m[i-1, j] + 1,
                        m[i, j-1] + 1 }
  return m[|s1|, |s2|]

[Figure 2: CDF of domain label sizes for DNS PTR dataset. x-axis: number of characters in second-level sub-domain; y-axis: cumulative fraction of total sub-domains.]

4. GROUPING DOMAIN NAMES
In this section, we present ways by which we group together domain names in order to compute the metrics that were defined in Section 3.

4.1 Per-domain analysis
Note that several botnets use several second-level domain names to generate algorithmic sub-domains. Hence, one way by which we group together domain names is via the second-level domain name. The intention is that if we begin seeing several algorithmically generated domain names being queried such that all of them correspond to the same second-level domain, then this may be reflective of a few favorite domains being exploited. Hence for all sub-domains, e.g., abc.examplesite.org, def.examplesite.org, etc., that have the same second-level domain name examplesite.org, we compute all the metrics over the alphanumeric characters and bigrams of the corresponding domain labels. Since domain fluxing involves a botnet generating a large number of domain names, we consider only domains which contain a sufficient number of third-level domain labels, e.g., 50, 100, 200 and 500 sub-domains.

4.2 Per-IP analysis
As a second method of grouping, we consider all domains that are mapped to the same IP-address. This would be reflective of a scenario where a botnet has registered several of the algorithmic domain names to the same IP-address of a command-and-control server. Determining if an IP address is mapped to several such malicious domains is useful as such an IP-address or its corresponding prefix can be quickly blacklisted in order to sever the traffic between a command-and-control server and its bots. We use the dataset from a Tier-1 ISP to determine all IP-addresses which have multiple hostnames mapped to them. For a large number of hostnames representing one IP address, we explore the above described metrics, and thus identify whether the IP address is malicious or not.

4.3 Component analysis
A few botnets have taken the idea of domain fluxing further and generate names that span multiple TLDs, e.g., Conficker-C generates domain names in 110 TLDs.
At the same time domain fluxing can be combined with another technique, namely IP fluxing [24], where each domain name is mapped to an ever changing set of IP-addresses in an attempt to evade IP blacklists. Indeed, a combination of the two is even harder to detect. Hence, we propose the third method for grouping domain names into connected components.

We first construct a bipartite graph G with IP-addresses on one side and domain names on the other. An edge is constructed between a domain name and an IP-address if that IP-address was ever returned as one of the responses in a DNS query. When multiple IP addresses are returned, we draw edges between all the returned IP addresses and the queried host name.

First, we determine the connected components of the bipartite graph G, where a connected component is defined as one which does not have any edges with any other components. Next, we compute the various metrics (KL-divergence for unigrams and bigrams, JI measure for bigrams, Edit distance) for each component by considering all the domain names within a component.

Component extraction separates the IP-domain graph into components which can be classified into the following classes: (i) IP fan: these have one IP-address which is mapped to several domain names. Besides the case where one IP-address is mapped to several algorithmic domains, there are several legitimate scenarios possible. First, this class could include domain hosting services where one IP-address is used to provide hosting to several domains, e.g., Google Sites, etc. Other examples could be a mail relay service where one mail server is used to provide mail relay for several MX domains. Another example could be when domain registrars provide domain parking services, i.e., someone can purchase a domain name while asking the registrar to host it temporarily. (ii) Domain fan: these consist of one domain name connected to multiple IPs. This class will contain components belonging to the legitimate content providers such as Google, Yahoo!, etc.
(iii) Many-to-many component: these are components that have multiple IP addresses and multiple domain names, e.g., Content Distribution Networks (CDNs) such as Akamai.

In Section 6, we briefly explain the classification algorithm that we use to classify test components as malicious or not.

5. RESULTS
In this section, we present results of employing various metrics across different groups, as described in Sections 3 and 4. We briefly describe the data set used for each experiment. With all our experiments, we present the results based on the consideration of an increasing number of domain labels. In general, we observe that using a larger test data set yields better results.

5.1 Per-domain analysis

5.1.1 Data set
The analysis in this sub-section is based only on the domain labels belonging to a domain. The non-malicious distribution g may be obtained from various sources. For our analysis, we use a database of DNS PTR records corresponding to all IPv4 addresses. The database contains 659 second-level domains with at least 50 third-level sub-domains, while there are 103 second-level domains with at least 500 third-level sub-domains. From the database, we extract all second-level domains which have at least 50 third-level sub-domains. All third-level domain labels corresponding to such domains are used to generate the distribution g. For instance, a second-level domain such as university.edu may have many third-level domain labels such as physics, cse, humanities etc. We use all such labels that belong to trusted domains for determining g.

To generate a malicious base distribution b, we randomly generate as many characters as present in the non-malicious distribution. We use domain labels belonging to well-known malware based domains identified by Botlab, and also a publicly available webspam database, as malicious domains [, 9] for verification using our metrics. Botlab provides us with various domains used by Kraken, Pushdo, Storm, MegaD, and Srizbi []. For per-domain analysis, the test words used are the third-level domain labels.
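The extraction step described above (keep only second-level domains with enough third-level labels) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, hostnames are split naively on dots so the last two labels are taken as the second-level domain, and multi-label suffixes such as co.uk, which the paper treats as first-level domains, would need a suffix list that is not modeled here.

```python
from collections import defaultdict


def group_third_level_labels(hostnames, min_labels=50):
    """Map each second-level domain to its set of third-level labels.

    Only domains with at least `min_labels` distinct third-level labels are
    kept. Assumes a single-label TLD (e.g. 'edu'); this is a simplification.
    """
    groups = defaultdict(set)
    for host in hostnames:
        parts = host.lower().rstrip(".").split(".")
        if len(parts) < 3:
            continue  # hostname has no third-level label
        second_level = ".".join(parts[-2:])
        groups[second_level].add(parts[-3])
    return {
        domain: labels
        for domain, labels in groups.items()
        if len(labels) >= min_labels
    }


if __name__ == "__main__":
    hosts = [
        "physics.university.edu",
        "cse.university.edu",
        "humanities.university.edu",
        "www.example.org",
    ]
    # A low threshold is used here only so the toy input produces a group.
    print(group_third_level_labels(hosts, min_labels=3))
```

The labels collected for trusted domains would feed the non-malicious distribution g, while each individual group serves as a test set for the metrics.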
Figure 1(a) shows how malicious/non-malicious distributions appear for the DNS PTR dataset as well as the ISP dataset described in the following sections.

We will present the results for all the four measures described earlier, for domain-based analysis. In later sections, we will only present data from one of the measures for brevity.

[Figure 3: ROC curves for Per-domain analysis. Panel (a): K-L metric with unigram distribution (Per-domain). Panel (b): K-L metric with bigram distribution (Per-domain). Curves: using 50, 100, 200 and 500 test words. x-axis: false positive rate (FPR); y-axis: true positive rate (TPR).]

[Figure 4: ROC curves for Per-domain analysis. Panel (a): Jaccard measure for bigrams (Per-domain). Panel (b): Edit distance (Per-domain). Curves: using 50, 100, 200 and 500 test words. x-axis: false positive rate (FPR); y-axis: true positive rate (TPR).]

5.1.2 K-L divergence with unigram distribution
We measure the symmetrized K-L distance metric from the test domain to the malicious/non-malicious alphabet distributions. We classify the test domain as malicious or non-malicious based on equation (10) in Appendix A. Figure 3(a) shows the results from our experiment presented as an ROC curve.

The figure shows that the different sizes of test data sets produce relatively different results. The area under the ROC is a measure of the goodness of the metric. We observe that with 200 or 500 domain labels, we cover a relatively greater area, implying that using many domain labels helps obtain accurate results. For example, using 500 labels, we obtain 100% detection rate with only 2.5% false positive rate. Note that with a larger data set, we indeed expect higher true positive rates for small false positive rates, as larger samples will stabilize the evaluated metrics.

The number of domain labels required for accurate detection corresponds to the latency of accurately classifying a previously unseen domain.
The results suggest that a domain-fluxing domain can be accurately characterized by the time it generates around 500 names.

5.1.3 K-L divergence with bigram distribution

Figure 3(b) presents the results of employing the K-L distance metric over bigram distributions. We observe again that using 200 or 500 domain labels does better than using a smaller number of labels, with 500 labels doing the best. Experiments with 50/100 domain labels yield similar results.

We note that the performance with unigram distributions is slightly better than using bigram distributions. However, when botnets employ countermeasures to our techniques, the bigram distributions may provide better defense compared to unigram distributions as they require more effort to match the good distribution (g).

5.1.4 Jaccard measure of bigrams

The Jaccard Index measure does significantly better in
comparison to the previous metrics. From Figure 4(a), it is evident that using 500 domain labels gives us a clear separation for classification of test domains (and hence an area of 1). Using 50 or 100 labels is fairly equivalent, with 200 labels doing comparatively better. The JI measure produces higher false positives for smaller numbers of domains (50/100/200) than K-L distance measures.

5.1.5 Edit distance of domain labels

Figure 4(b) shows the performance using edit distance as the evaluation metric. The detection rate for 50/100 test words reaches 1 only for high false positive rates, indicating that a larger test word set should be used. For 200/500 domain labels, 100% detection rate is achieved at false positive rates of 5-7%.

5.1.6 Kwyjibo domain label analysis

Kwyjibo is a tool to generate random words which can be used as domain labels [12]. The generated words are seemingly closer to pronounceable words of the English language, in addition to being random. Thus many such words can be created in a short time. We anticipate that such a tool can be used by attackers to generate domain labels or domain names quickly with the aim of defeating our scheme. Therefore, we analyze Kwyjibo based words, considering them as domain labels belonging to a particular domain.

The names generated by the Kwyjibo tool could be accurately characterized by our measures given sufficient names. Example results are presented in Fig. 5 with K-L distances over unigram distributions. From Figure 5, we observe that verification with unigram frequency can lead to a high detection rate with very low false positive rate. Again, the performance using 500 labels is the best. We also observe a very steep rise in detection rates for all the cases. The Kwyjibo domains could be accurately characterized with false positive rates of 6% or less.

[Figure 5: ROC curve: K-L metric with unigram distribution (Kwyjibo), plotting true positive rate (TPR) against false positive rate (FPR) using 50, 100, 200, and 500 test words.]
The initial detection rate for Kwyjibo is low as compared to the per-domain analysis. This is because the presence of highly probable non-malicious unigrams in Kwyjibo based domains makes detection difficult at lower false positive rates. The results with other measures (K-L distance over bigram distributions, JI and edit distances) were similar: Kwyjibo domains could be accurately characterized at false positive rates in the range of 10-12%, but detection rates were nearly zero at false positive rates of 10% or less.

The scatter plot presented in Fig. 6 indicates the clear separation obtained between non-malicious and malicious domains. The plot represents the Jaccard measure using 500 test words. We highlight the detection of botnet based malicious domains such as Kraken, MegaD, Pushdo, Srizbi, and Storm. A few well-known non-malicious domains such as apple.com, cisco.com, stanford.edu, mit.edu, and yahoo.com have also been indicated for comparison purposes.

[Figure 6: Scatter plot with Jaccard Index for bigrams (500 test words), plotting Jaccard measure against domain index. Labeled points: apple.com (0.587), cisco.com (0.788), mit.edu (0.78), stanford.edu (0.864), yahoo.com (0.665), Srizbi (0.02), Pushdo (0.05), Kraken (0.08), MegaD (0.006), Storm (0.005).]

5.1.7 Progressive demarcation

The earlier results have shown that very good detection rates can be obtained at low false positive rates once we have 500 or more hostnames of a test domain. As discussed earlier, the number of hostnames required for our analysis corresponds to the latency of accurately characterizing a previously unseen domain. During our experiments, not all the test domains required 500 hostnames for accurate characterization, since the distributions were either very close to the good distribution g or the bad distribution b. These test domains could be characterized with a smaller latency (or smaller number of hostnames).

In order to reduce the latency for such domains, we tried an experiment at progressive demarcation or characterization of the test domains.
Intuitively, the idea is to draw two thresholds: above the first there are clearly good domains, below the second there are clearly bad domains, and the domains between the two thresholds require more data (or hostnames) for accurate characterization. These thresholds are progressively brought closer (or made tighter) as more hostnames become available, allowing more domains to be accurately characterized until we get 500 or more hostnames for each domain.

The results of such an experiment using the JI measure are shown in Fig. 7. We establish the lower bound using the formula $\mu_b + \sigma_b$, where $\mu_b$ is the mean of JI values observed for bad or malicious domains and $\sigma_b$ is the standard deviation. Similarly, the upper bound is obtained using the expression $\mu_g - \sigma_g$, where the subscript g implies good domains. Figure 7 shows the detection rate for the considered domains.
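The two-threshold rule above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names (`demarcation_bounds`, `demarcate`, `progressive_label`) are hypothetical, the population standard deviation is used because the text does not say which estimator was applied, and a higher JI score is taken to mean "closer to the good database", as the scatter plot in Fig. 6 suggests.

```python
from statistics import mean, pstdev


def demarcation_bounds(bad_scores, good_scores):
    """Return (lower, upper) = (mu_b + sigma_b, mu_g - sigma_g).

    Assumption: population standard deviation; the paper does not specify.
    """
    lower = mean(bad_scores) + pstdev(bad_scores)
    upper = mean(good_scores) - pstdev(good_scores)
    return lower, upper


def demarcate(score, lower, upper):
    """Label one JI score, or defer when it falls between the thresholds."""
    if score <= lower:
        return "malicious"
    if score >= upper:
        return "non-malicious"
    return "undecided"


def progressive_label(scores_by_k, bounds_by_k):
    """Try increasing numbers of hostnames k and stop at the first decision.

    scores_by_k: {k: JI score of the test domain computed from k hostnames}
    bounds_by_k: {k: (lower, upper)} learned from labeled domains at the same k
    """
    for k in sorted(scores_by_k):
        lower, upper = bounds_by_k[k]
        label = demarcate(scores_by_k[k], lower, upper)
        if label != "undecided":
            return k, label
    return None, "undecided"
```

Recomputing the bounds separately for each k is one way to realize thresholds that tighten as more hostnames become available; domains that stay undecided simply wait for the next batch of hostnames.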
We see a monotonically increasing detection rate for both good and bad domains. It is observed that 85% of bad domains could be so characterized accurately with only 100 hostnames, while only about 23% of good domains can be so characterized with 100 hostnames. In addition, our experiments indicate that only a small percentage of domains require 200 or more hostnames for their characterization.

[Figure 7: Illustrating benefits of progressive demarcation with JI measure, plotting the percentage of domains correctly identified against the number of test words used (0 to 500). Curves: malicious domains detection rate, non-malicious domains detection rate.]

5.2 Per-IP analysis

5.2.1 Data set

Here, we present the evaluation of domain names that map to an IP address. For analyzing the per-IP group, for all hostnames mapping to an IP address, we use the domain labels except the top-level domain (TLD) as the test word. For instance, for hostnames physics.university.edu and cse.university.edu mapping to an IP address, say 6.6.6.6, we use physicsuniversity and cseuniversity as test words. However, we only consider IP addresses with at least 50 hostnames mapping to them. We found 341 such IP addresses, of which 53 were found to be malicious, and 288 were considered non-malicious. The data is obtained from DNS traces of a Tier-1 ISP in Asia.

Many hostnames may map to the same IP address. Such a mapping holds for botnets or other malicious entities utilizing a large set of hostnames mapping to fewer C&C (Command and Control) servers. It may also be valid for legitimate internet services such as Content Delivery Networks (CDNs). We first classify the IPs obtained into two classes of malicious and non-malicious IPs. The classification is done based on manual checking, using blacklists, or publicly available Web of Trust information [7]. We manually confirm the presence of Conficker based IP addresses and domain names [28]. The ground truth thus obtained may be used to verify the accuracy of classification.
Figure 1 shows the distribution of non-malicious test words and the randomized distribution is generated as described previously. We discuss the results of per-IP analysis below. For the sake of brevity, we present results based on K-L distances of bigram distributions only. A summary of results from other metrics is also provided.

The ROC curve for the K-L metric shows that bigram distribution can be effective in accurately characterizing the domain names belonging to different IP addresses. We observe a very clear separation between malicious and non-malicious IPs with 500, and even with 200 test words. With a low false positive rate of 1%, high detection rates of 90% or more are obtained with 100 or greater number of test words.

The bigram analysis was found to perform better than unigram distributions. The per-IP bigram analysis performed better than per-domain bigram analysis. We believe that the bigrams obtained from the ISP dataset provide a comprehensive non-malicious distribution. The second-level domain labels also assist in discarding false anomalies, and therefore provide better accuracy.

The JI measure performed very well, even for a small set of test words. The area covered under the ROC curve was 1 for 200/500 test words. For the experiment with 100 test words, we achieved a detection rate of 100% with a false positive rate of only 2%.

Edit distance with domains mapping to an IP results in good performance in general. The experiments with 100 test words result in a low false positive rate of about 10% for a 100% detection rate. However, using only 50 test words, the detection rate reaches about 80% for a high false positive rate of 20%.

Thus, we conclude that for per-IP based analysis, the JI measure performs relatively better than previous measures applied to this group. However, as highlighted in Section 7, the time complexity for computing the Jaccard index is higher.

Table 2: Summary of interesting networks discovered through component analysis

Comp. type          #Comps.   #domains   #IPs
Conficker botnet    1         1.9K       19
Helldark botnet     1         28         5
Mjuyh botnet        1         121        1.2K
Misspelt Domains    5         215        7
Domain Parking      15        630        5
Adult content       4         349        3

Table 3: Domain names used by bots

Type of group       Domain names
Conficker botnet    vddxnvzqjks.ws
                    gcvwknnxz.bz
                    joftvvtvmx.org
Mjuyh bot           935c4fe[0-9a-z]+.6.mjuyh.com
                    c2d026e[0-9a-z]+.6.mjuyh.com
Helldark Trojan     may.helldark.bz
                    X0R.rcdevls.net
                    www.baldmanpower.org

5.3 Summary

For a larger set of test words, the relative order of efficacy of different measures decreases from JI, to edit distance to
K-L distances over bigrams and unigrams. However, interestingly, we observe the exact opposite order when using a small set of test words. For instance, with 50 test words used for the per-domain analysis, the false positive rates at which we obtain 100% detection rates are approximately 50% (JI), 20% (ED), 25% (K-L with bigram distribution), and 5% (K-L with unigram distribution). Even though the proof in the Appendix indicates that K-L divergence is an optimal metric for classification, in practice it does not hold, as the proof is based on the assumption that it is equally likely to draw a test distribution from a good or a bad distribution.

Table 1: Different types of classes

Type of class   # of components   # of IP addresses   # of domain names   Types of components found
Many-to-many    440               K                   35K                 Legitimate services (Google, Yahoo), CDNs, Cookie tracking, Mail service, Conficker botnet
IP fans         1.6K              1.6K                44K                 Domain Parking, Adult content, Blogs, small websites
Domain fans     930               8.9K                9.3K                CDNs (Akamai), Ebay, Yahoo, Mjuyh botnet

6. DETECTION VIA SUPERVISED LEARNING

As discussed in Section 5.3 immediately above, the relative merits of each measure vary depending, for instance, on the number of subdomains present in a domain being tested. In this section, we formulate detection of malicious domains (algorithmically generated) as a supervised learning problem such that we can combine the benefits afforded by each measure while learning the relative weights of each measure during a training phase. We divide the one-day long trace from the South Asian Tier-1 ISP into two halves such that the first one of 10 hours duration is used for training. We test the learnt model on the remainder of the trace from the South Asian ISP as well as over a different trace from a Tier-1 ISP in South America. In this section, we use the grouping methodology of connected components, where all (domain name, response IP address) pairs present during a time window (either during training or test phases) are grouped into connected components.

6.1 L1-regularized Linear Regression

We formulate the problem of classifying a component as malicious (algorithmically generated) or legitimate in a supervised learning setting as a linear regression or classification problem. We first label all domains within the components found in the training data set by querying against domain reputation sites such as McAfee Site Advisor [2] and Web of Trust [7] as well as by searching for the URLs on search engines [32]. Next, we label a component as good or bad depending on a simple majority count, i.e., if more than 50% of domains in a component are classified as malicious (adware, malware, spyware, etc.) by any of the reputation engines, then we label that component as malicious.

Define the set of features as $F$, which includes the following metrics computed for each component: KL-distance on unigrams, JI measure on bigrams and Edit distance. Also define the set of training examples as $T$ and its size in terms of number of components as $|T|$. Further, define the output value for each component as $y = 1$ if it was labeled malicious or $y = 0$ if legitimate. We model the output value $y$ for any component in $T$ as a linear weighted sum of the values attained by each feature, where the weights are given by $\beta_j$ for each feature $j \in F$:

\[ y = \sum_{j \in F} \beta_j x_j + \beta_0 \qquad (1) \]

In particular, we use the LASSO, also known as L1-regularized Linear Regression [18], where an additional constraint on each feature allows us to obtain a model with lower test prediction errors than the non-regularized linear regression, since some variables can be adaptively shrunk towards lower values. We use 10-fold cross validation to choose the value of the regularization parameter $\lambda \in [0, 1]$ that provides the minimum training error (equation below) and then use that $\lambda$ value in our tests:

\[ \arg\min_{\beta} \sum_{i=1}^{|T|} \Big( y_i - \beta_0 - \sum_{j \in F} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j \in F} |\beta_j| \qquad (2) \]

6.2 Results

First, note the various connected components present in the South Asian trace as classified into three classes: IP fans, Domain fans and Many-to-many components in Table 1.
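The grouping into connected components and the three classes named here can be sketched as follows. This is a minimal sketch, assuming a union-find over the bipartite graph of domain names and IP addresses; the function names are hypothetical, and the class rules (a single IP means "IP fan", a single domain means "domain fan", anything else is "many-to-many") are inferred from the class names, since their formal definition lies outside this excerpt.

```python
from collections import defaultdict


def connected_components(pairs):
    """Group (domain, ip) pairs into connected components via union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for domain, ip in pairs:
        # Tag nodes so a domain and an IP with the same text never collide.
        root_d, root_i = find(("domain", domain)), find(("ip", ip))
        if root_d != root_i:
            parent[root_d] = root_i

    groups = defaultdict(lambda: {"domains": set(), "ips": set()})
    for kind, name in list(parent):
        key = "domains" if kind == "domain" else "ips"
        groups[find((kind, name))][key].add(name)
    return list(groups.values())


def classify_component(component):
    """Assumed class rules; ties (one domain, one IP) fall under 'IP fan'."""
    if len(component["ips"]) == 1:
        return "IP fan"
    if len(component["domains"]) == 1:
        return "domain fan"
    return "many-to-many"
```

Each returned component carries its set of domain names and IP addresses, so per-component features (K-L distance, JI, edit distance) can then be computed over all domain names in the component at once.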
During the training phase, while learning the LASSO model, we mark 28 components as good (these consist of CDNs, mail service providers, and large networks such as Google) and one component belonging to the Conficker botnet as malicious. For each component, we compute the features of KL-divergence, Jaccard Index measure and Edit distance. We train the regression model using the glmnet tool [18] in the statistical package R, and obtain the value for the regularization parameter $\lambda$ as 1e-4, which minimizes training error during the training phase.

We then test the model on the remaining portion of the one-day long trace. In this regard, our goal is to check whether our regression model can detect not only the Conficker botnet but also other malicious domain groups during the testing phase over the trace. During the testing stage, if a particular component is flagged as suspicious, then we check against Web of Trust [7] and McAfee Site Advisor [2], as well as via Whois queries and search engines, to ascertain the exact behavior of the component. Next, we explain the results of each of the classes individually.

On applying our model to the rest of the trace, 29 components (out of a total of 3K components) are classified as malicious, and we find 27 of them to be malicious after cross checking with external sources (Web of Trust, McAfee, etc.), while two components (99 domains) are false positives and comprise Google and domains belonging to news blogs.
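For readers without R at hand, the objective in equation (2) can be minimized with a small coordinate-descent routine. The sketch below is not glmnet and is not the authors' code: it is a pure-Python illustration with hypothetical function names, it minimizes the unscaled objective exactly as written in equation (2), and it leaves the intercept unpenalized. glmnet scales the squared-error term differently, so its $\lambda$ values are not directly comparable to the `lam` used here.

```python
def soft_threshold(rho, t):
    """Shrink rho towards zero by t (the proximal step for the L1 penalty)."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_fit(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - b0 - sum_j b_j x_ij)^2 + lam * sum_j |b_j|.

    X is a list of feature rows (e.g. [KL, JI, edit distance] per component),
    y a list of 0/1 labels. Returns (b0, [b_1, ..., b_p]).
    """
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Unpenalized intercept: mean of the current residuals.
        beta0 = sum(
            y[i] - sum(beta[j] * X[i][j] for j in range(p)) for i in range(n)
        ) / n
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                # Residual with feature j left out of the prediction.
                partial = y[i] - beta0 - sum(
                    beta[k] * X[i][k] for k in range(p) if k != j
                )
                rho += X[i][j] * partial
                z += X[i][j] ** 2
            # Factor 1/2 on lam because the squared loss carries no 1/2.
            beta[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return beta0, beta


def label_component(domain_is_malicious):
    """Majority rule from Section 6.1: 1 if more than 50% are malicious."""
    flags = list(domain_is_malicious)
    return 1 if sum(flags) * 2 > len(flags) else 0
```

A component would then be scored with `beta0 + sum(b * x for b, x in zip(beta, features))` and flagged when the score exceeds a chosen cutoff; the cutoff itself is a design choice that the text does not specify.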
Note that here we use a broad definition of malicious domains as those that could be used for any nefarious purposes on the web, i.e., we do not necessarily restrict the definition to only include botnet domain generation algorithms. Out of the 27 components that were classified as malicious, one corresponds to the Conficker botnet, which is as expected since our training incorporated features learnt from Conficker. We next provide details on the remaining 26 components that were determined as malicious (see Table 2).

Mjuyh Botnet: The most interesting discovery from our component analysis is that of another botnet, which we call Mjuyh, since it uses the domain name mjuyh.com (see Table 3). The fourth-level domain label is generated randomly and is 57 characters long. Each of the 121 domain names belonging to this bot network returns 10 different IP addresses on a DNS query, for a total of 1.2K IP addresses. Also, in some replies, there are invalid IP addresses like 0.6.57.48. All the 10 IP addresses returned for a given domain name belong to different network prefixes. Furthermore, there is no intersection in the network prefixes between the different domain names of the Mjuyh bot. We strongly suspect that this is a case of domain fluxing along with IP fast fluxing, where each bot generated a different randomized query which was resolved to a different set of IP addresses.

Helldark Trojan: We discovered a component containing five different third-level domains (a few sample domain names are shown in Table 3). The component comprises 28 different domain names which were all found to be spreading multiple Trojans. One such Trojan spread by these domains is Win32/Hamweq.CW, which spreads via removable drives, such as USB memory sticks. They also have an IRC-based backdoor, which may be used by a remote attacker directing the affected machine to participate in Distributed Denial of Service attacks, or to download and execute arbitrary files [8].
Mis-spelt component: There are about five components (comprising 220 domain names) which used tricked (mis-spelt or slightly different spelling) names of reputed domain names. For example, these components use domain names such as uahoo.co.uk to trick users trying to visit yahoo.co.uk (since the letter u is next to the letter y on the keyboard, they expect users to enter this domain name by mistake). Dizneyland.com is used to misdirect users trying to visit Disneyland.com (it replaces the letter s with the letter z). We still consider these components as malicious since they comprise domains that exhibit unusual alphanumeric features.

Domain Parking: We found 15 components (630 domain names) that were being used for domain parking, i.e., a practice where users register a domain name without actually using it, in which case the registrar's IP address is returned as the DNS response. Of these 15 components, one belongs to GoDaddy (66 domain names), 13 belong to Sedo domain parking (50 domain names) and one belongs to OpenDNS (57 domain names). Clearly these components represent something abnormal, as there are many domains with widely disparate algorithmic features clustered together on account of the same IP address they are mapped to.

Adult Content: We find four components that comprise 349 domains primarily used for hosting adult content sites. Clearly this matches the well-known fact that in the world of adult site hosting, the same set of IP addresses is used to host a vast number of domains, each of which in turn may use very different words in an attempt to drive traffic.

In addition, for comparison purposes, we used the lexical features of the domain names, such as the length of the domain names, the number of dots and the length of the second-level domain name (for example, xyz.com), for training on the same ISP trace, instead of using the KL-divergence, JI measure and Edit distance measures used in our study. These lexical features were found to be useful in an earlier study in identifying malicious URLs [17].
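The three lexical baseline features mentioned above are simple to compute. The sketch below is only an illustration with a hypothetical function name; it follows the example in the text in treating the "second-level domain name" as the last two labels (xyz.com), which is an assumption that ignores multi-label suffixes such as co.uk.

```python
def lexical_features(domain_name):
    """Return the three lexical features used for the baseline comparison.

    Assumption: the second-level domain name is the last two labels,
    as in the example 'xyz.com'.
    """
    name = domain_name.lower().rstrip(".")
    labels = name.split(".")
    second_level = ".".join(labels[-2:]) if len(labels) >= 2 else name
    return {
        "length": len(name),
        "num_dots": name.count("."),
        "sld_length": len(second_level),
    }
```

Unlike the K-L, JI and edit-distance measures, these features describe each name in isolation and do not compare its character statistics against a reference distribution, which may explain part of the difference in results reported next.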
The model trained on these lexical features correctly labeled four components as malicious (Conficker bot network, three adult content components and one component containing mis-spelt domain names) during the testing phase, but it also resulted in 30 legitimate components being labeled incorrectly; compare this against 27 components that were correctly classified as malicious and two that were false positives on using our alphanumeric features.

We also test our model on a trace obtained from a South America based Tier-1 ISP. This trace is about 20 hours long and is collected on a smaller scale as compared to the ISP trace from Asia. The time lag between the capture of the South American Tier-1 ISP trace and the previously used ISP trace from Asia is about 5 days. We use the same training set for the prediction model as we use for the ISP trace from Asia. In the prediction stage, we successfully detect the Conficker component with no false positives. The Conficker component has 85 domain names and 10 IP addresses. Of the 10 IP addresses determined for the Conficker component of the South American trace, nine are common with the Asia ISP trace's Conficker component. We conclude that Conficker based C&C servers have relatively large TTLs. However, out of the 85 domain names, only five domains are common between this component and the component from the ISP trace from Asia. Clearly, the Conficker botnet exhibits rapid domain fluxing. Overall, this experiment shows that a training model learnt in one network can be applied to a completely different network and still successfully detect malicious domain groups.

7. DISCUSSION

7.1 Usefulness of component analysis

The Conficker botnet, present in our ISP trace, employs domain fluxing across TLDs, which became directly visible after IP-domain components were extracted and analyzed from the trace. The component analysis allowed the application of our detection methods across several different domains, which otherwise would have been separated from each other.
In addition, component analysis allowed us to detect Conficker domains that would not have been detectable with our approach when applied to domain names alone, since some of these domains contained fewer than the 50 names needed for accurate analysis. Similarly, some of the IP addresses in the component hosted fewer than 50 names and would not have been detected with the IP address based analysis either. However, these domains will be included in the component analysis as long as the component has altogether more than 50 names.

Let $D_c$ be the number of hostnames and $I_c$ be the number of IP addresses in the component. If $D_d$, $I_d$ are the number of hostnames and corresponding IP addresses detected
through domain level analysis, we define domain level completeness ratios as $D_d/D_c$ and $I_d/I_c$. Similarly, we can define the completeness ratios for IP-based analysis as $D_i/D_c$ and $I_i/I_c$, where $D_i$ and $I_i$ correspond to the total number of hostnames and IP addresses of the Conficker botnet detected by the IP-based analysis.

For the Conficker botnet, these completeness ratios for IP-based analysis were 73.68% for IP addresses and 98.56% for hostnames. This implies that component analysis detects an additional 26.32% of IP addresses and a relatively small fraction of 1.44% of hostnames for those IP addresses. The completeness ratios for domain based analysis were found to be 100% for IP addresses and 83.87% for the hostnames. Therefore, component analysis recovers 16.13% more hostnames than the per-domain analysis. This shows that the component level analysis provides additional value in analyzing the trace for malicious domains.

7.2 Complexity of various measures

Table 4 identifies the computational complexity for every metric, and for all groups that we use. We observe that K-L metrics analyzing unigram and bigram distributions can be computed fairly efficiently. However, for the JI measure, the size of the non-malicious database largely influences the time taken to compute the measure. A good database size results in a higher accuracy, at the cost of increased time taken for analysis. Similarly, edit distance takes longer for large word lengths and larger numbers of test words. However, it is independent of any database, hence the space requirements are smaller.

Notation
A      Alphabet size
W      Maximum word size
K      Number of test words
K'     Number of test words in a component
S_g    Number of words in non-malicious database

Table 4: Computational complexity

Grp.   K-L unigram    K-L bigram       JI              ED
dom.   O(KW + A)      O(KW + A^2)      O(KW^2 S_g)     O(K^2 W^2)
IP     O(KW + A)      O(KW + A^2)      O(KW^2 S_g)     O(K^2 W^2)
Com.   O(K'W + A)     O(K'W + A^2)     O(K'W^2 S_g)    O(K'^2 W^2)

We briefly describe how we determine the bounds as expressed in Table 4 for the per-domain group.
For the K-L unigram analysis, since we examine every character of every test word, the complexity is bounded by KW. We then compute, for every character in the alphabet A, the divergence values. Therefore, we obtain the complexity as O(KW + A). Bigram distribution based K-L divergence is calculated similarly, except that the new alphabet size is A^2. While calculating the Jaccard index, note that the number of bigrams obtained is O(W). For each bigram, we examine the queues pointing to words from the non-malicious database. Thus, for each bigram, we examine O(W S_g) bigrams. Since we do it for K test words, we obtain O(KW^2 S_g). For every test word used while obtaining the edit distance, we examine it against the K-1 other test words. Therefore, the total complexity is simply O(K^2 W^2). The expressions for per-IP and per-component groups are obtained analogously.

It is interesting to note that A is of the size 36 (0-9, a-z characters). K used in our analysis varies as 50/100/200/500. However, the average value for K' is higher in comparison. The DNS PTR dataset considered for per-domain analysis has approximately 469,000 words used for training purposes. This helps us estimate S_g. For the ISP dataset, S_g is of the order of 522 words. An estimate of W for the DNS PTR dataset is obtained from Figure 2.

8. CONCLUSIONS

In this paper, we propose a methodology for detecting algorithmically generated domain names as used for domain fluxing by several recent botnets. We propose statistical measures such as Kullback-Leibler divergence, Jaccard index, and Levenshtein edit distance for classifying a group of domains as malicious (algorithmically generated) or not. We perform a comprehensive analysis on several data sets, including a set of legitimate domain names obtained via a crawl of the IPv4 address space as well as DNS traffic from a Tier-1 ISP in Asia. One of our key contributions is the relative performance characterization of each metric in different scenarios. In general, the Jaccard measure performs the best, followed by the Edit distance measure, and finally the KL divergence.
Furthermore, we show how our methodology, when applied to the Tier-1 ISP's trace, was able to detect Conficker as well as a botnet yet unknown and unclassified, which we call Mjuyh. In this regard, our methodology can be used as a first alarm to indicate the presence of domain fluxing in a network, and thereafter a network security analyst can perform additional forensics to infer the exact algorithm being used to generate the domain names. As future work, we plan to generalize our metrics to work on n-grams for values of n > 2.

9. ACKNOWLEDGMENTS

This work is supported in part by a Qatar National Research Foundation grant, Qatar Telecom, and NSF grants 070202 and 06240. We thank Prof. K. R. Narayanan, Department of Electrical and Computer Engineering, Texas A&M University, for helping us with the proof as provided in the appendix. Part of this work was done while Narasimha Reddy was on a sabbatical at the University of Carlos III and Imdea Networks in Madrid, Spain.

10. REFERENCES

[1] Botlab. http://botlab.cs.washington.edu/.
[2] McAfee Site Advisor. http://www.siteadvisor.com.
[3] On kraken and bobax botnets. http://www.damballa.com/downloads/r_pubs/kraken_response.pdf.
[4] On the kraken and bobax botnets. http://www.damballa.com/downloads/r_pubs/Kraken_Response.pdf.
[5] PC Tools experts crack new Kraken. http://www.pctools.com/news/view/id/202/.
[6] Twitter API still attracts hackers. http://blog.unmaskparasites.com/2009/12/09/twitter-api-still-attracts-hackers/.
[7] Web of Trust. http://mywot.com.
[8] Win32/Hamweq. http://www.microsoft.com/security/portal/threat/Encyclopedia/Entry.aspx?Name=Win32/Hamweq.
[9] Yahoo webspam database. http://barcelona.research.yahoo.net/webspam/datasets/uk2007/.
[10] A. Bratko, G. V. Cormack, B. Filipic, T. R. Lynam, and B. Zupan. Spam filtering using statistical data compression models. Journal of Machine Learning Research 7, 2006.
[11] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 2006.
[12] H. Crawford and J. Aycock. Kwyjibo: Automatic Domain Name Generation. In Software Practice and Experience, John Wiley & Sons, Ltd., 2008.
[13] S. Gianvecchio, M. Xie, Z. Wu, and H. Wang. Measurement and Classification of Humans and Bots in Internet Chat. In Proceedings of the 17th USENIX Security Symposium (Security 08), 2008.
[14] G. Gu, R. Perdisci, J. Zhang, and W. Lee. BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-independent Botnet Detection. Proceedings of the 17th USENIX Security Symposium (Security 08), 2008.
[15] G. Gu, J. Zhang, and W. Lee. BotSniffer: Detecting Botnet Command and Control Channels in Network Traffic. Proc. of the 15th Annual Network and Distributed System Security Symposium (NDSS 08), Feb. 2008.
[16] T. Holz, M. Steiner, F. Dahl, E. W. Biersack, and F. Freiling. Measurements and Mitigation of Peer-to-peer-based Botnets: A Case Study on Storm Worm. In First Usenix Workshop on Large-scale Exploits and Emergent Threats (LEET), April 2008.
[17] S. S. J. Ma, L. K. Saul and G. Voelker. Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs. Proc. of ACM KDD, July 2009.
[18] R. T. Jerome Friedman, Trevor Hastie. glmnet: Lasso and Elastic-net Regularized Generalized Linear Models. Technical report.
[19] J. P. John, A. MoshChuck, S. D. Gribble, and A. Krishnamurthy. Studying Spamming Botnets Using Botlab. Proc. of NSDI, 2009.
[20] M. Konte, N. Feamster, and J. Jung. Dynamics of Online Scam Hosting Infrastructure. Passive and Active Measurement Conference, 2009.
[21] C. D. Manning, P. Raghavan, and H. Schutze. An Introduction to Information Retrieval. Cambridge University Press, 2009.
[22] D. K. McGrath and M. Gupta. Behind Phishing: An Examination of Phisher Modi Operandi. Proc. of USENIX Workshop on Large-scale Exploits and Emergent Threats (LEET), Apr. 2008.
[23] E. Passerini, R. Paleari, L. Martignoni, and D. Bruschi. Fluxor: Detecting and Monitoring Fast-flux Service Networks. Detection of Intrusions and Malware, and Vulnerability Assessment, 2008.
[24] R. Perdisci, I. Corona, D. Dagon, and W. Lee. Detecting Malicious Flux Service Networks Through Passive Analysis of Recursive DNS Traces. In Annual Computer Society Security Applications Conference (ACSAC), Dec. 2009.
[25] R. Perdisci, G. Gu, and W. Lee. Using an Ensemble of One-class SVM Classifiers to Harden Payload-based Anomaly Detection Systems. In Proceedings of the IEEE International Conference on Data Mining (ICDM 06), 2006.
[26] P. Porras, H. Saidi, and V. Yegneswaran. Conficker C P2P Protocol and Implementation. SRI International Tech. Report, Sep. 2009.
[27] P. Porras, H. Saidi, and V. Yegneswaran. An Analysis of Conficker's Logic and Rendezvous Points. Technical report, Mar. 2009.
[28] P. Porras, H. Saidi, and V. Yegneswaran. Conficker C Analysis. Technical report, Apr. 2009.
[29] J. Stewart. Inside the Storm: Protocols and Encryption of the Storm Botnet. Black Hat Technical Security Conference, USA, 2008.
[30] B. Stone-Gross, M. Cova, L. Cavallaro, B. Gilbert, M. Szydlowski, R. Kemmerer, C. Kruegel, and G. Vigna. Your Botnet is my Botnet: Analysis of a Botnet Takeover. In ACM Conference on Computer and Communications Security (CCS), Nov. 2009.
[31] H. L. V. Trees. Detection, Estimation and Modulation Theory. Wiley, 2001.
[32] I. Trestian, S. Ranjan, A. Kuzmanovic, and A. Nucci. Unconstrained Endpoint Profiling: Googling the Internet. In ACM SIGCOMM, Aug. 2008.
[33] Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten, and I. Osipkov. Spamming Botnets: Signatures and Characteristics. ACM SIGCOMM Computer Communication Review, 2008.
[34] Y. Zhao, Y. Xie, F. Yu, Q. Ke, Y. Yu, Y. Chen, and E. Gillum.
BotGraph: Large Scale Spamming Botnet Detection. USENIX Symposium on Networked Systems Design and Implementation (NSDI 09), 2009.

APPENDIX

Let $A = \{a_1, a_2, \ldots, a_M\}$ denote the $M$ letters of the alphabet from which the domain names are chosen (in our case, this is the English alphabet with spaces and special characters). Let $g = [g_1, g_2, \ldots, g_M]$ and $b = [b_1, b_2, \ldots, b_M]$ be the distribution of the letters in the good and bad domains, respectively. Let $x$ be the actual domain name of length $N$ that has to be classified as being good or bad. Let the letter $a_i$ appear $n_i$ times in $x$ such that $\sum_i n_i = N$. Let $q = [q_1, q_2, \ldots, q_M]$ be the distribution of the different letters in $x$, i.e., $q_i = n_i / N$.

Under the assumption that, a priori, $x$ can belong to a good or bad domain with equal probability, the classifier that minimizes the probability of error (wrong classification) is given by the maximum-likelihood classifier, which classifies $x$ according to

\[ P(x \mid g) \underset{b}{\overset{g}{\gtrless}} P(x \mid b) \qquad (3) \]

Intuitively, $x$ is classified as good if it is more likely to have resulted from the good distribution than from the bad distribution. The above classifier can be specified in terms of
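the likelihood ratio given by

\[ \lambda(x) = \frac{P(x \mid g)}{P(x \mid b)} \underset{b}{\overset{g}{\gtrless}} 1 \qquad (4) \]

As we will see later, it is easier to work with an equivalent quantity $\frac{1}{N} \log \lambda(x)$. The classifier is then given according to

\[ \frac{1}{N} \log \lambda(x) = \frac{1}{N} \log \frac{P(x \mid g)}{P(x \mid b)} \underset{b}{\overset{g}{\gtrless}} 0 \qquad (5) \]

Under the assumption that the letters in $x$ have been generated independently from the same distribution, $P(x \mid g)$ is given by

\[ P(x \mid g) = \prod_{k=1}^{N} P(x_k \mid g) = \prod_{i=1}^{M} P(a_i \mid g)^{n_i} = \prod_{i=1}^{M} g_i^{n_i} = \prod_{i=1}^{M} g_i^{q_i N} \qquad (6) \]

The second equality follows by grouping all the occurrences of the letter $a_i$ together and recalling that there are $n_i$ such occurrences. Similarly,

\[ P(x \mid b) = \prod_{k=1}^{N} P(x_k \mid b) = \prod_{i=1}^{M} P(a_i \mid b)^{n_i} = \prod_{i=1}^{M} b_i^{n_i} = \prod_{i=1}^{M} b_i^{q_i N} \qquad (7) \]

Using (6) and (7) in (5), the log-likelihood ratio can be seen to be

\[ \frac{1}{N} \log \lambda(x) = \frac{1}{N} \log \frac{P(x \mid g)}{P(x \mid b)} = \log \frac{\prod_{i=1}^{M} g_i^{q_i}}{\prod_{i=1}^{M} b_i^{q_i}} \qquad (8) \]

Dividing the numerator and the denominator by $\prod_{i=1}^{M} q_i^{q_i}$, we get

\[ \frac{1}{N} \log \lambda(x) = \log \frac{\prod_{i=1}^{M} \left( \frac{g_i}{q_i} \right)^{q_i}}{\prod_{i=1}^{M} \left( \frac{b_i}{q_i} \right)^{q_i}} \qquad (9) \]

\[ = \sum_{i=1}^{M} q_i \log \frac{g_i}{q_i} - \sum_{i=1}^{M} q_i \log \frac{b_i}{q_i} \qquad (10) \]

\[ = D(q \mid b) - D(q \mid g) \qquad (11) \]

where $D(q \mid b) = \sum_{i=1}^{M} q_i \log \frac{q_i}{b_i}$ is the Kullback-Leibler (KL) distance between the two distributions. Thus, the optimal classifier given in (5) is equivalent to

\[ D(q \mid b) - D(q \mid g) \underset{b}{\overset{g}{\gtrless}} 0 \qquad (12) \]

This result is intuitively pleasing since the classifier essentially computes the KL distance between $q$ and the two distributions and chooses the one that is closer.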
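The decision rule in (12) translates directly into code. The following Python sketch is illustrative only: the function names are hypothetical, and the small additive smoothing constant `eps` is an assumption introduced here so that letters absent from the test words do not produce zero probabilities; it is not part of the derivation above. The sketch implements the unsymmetrized rule of (12), not the symmetrized variant used in Section 5.

```python
import math
from collections import Counter


def distribution(words, alphabet, eps=1e-6):
    """Estimate the letter distribution of `words` over `alphabet`.

    Assumption: additive smoothing with `eps` keeps every probability > 0.
    """
    counts = Counter(ch for word in words for ch in word if ch in alphabet)
    total = sum(counts.values())
    denom = total + eps * len(alphabet)
    return {a: (counts[a] + eps) / denom for a in alphabet}


def kl_distance(q, p):
    """D(q | p) = sum_i q_i * log(q_i / p_i)."""
    return sum(q[a] * math.log(q[a] / p[a]) for a in q if q[a] > 0)


def classify(test_words, g, b, alphabet):
    """Apply (12): 'good' if D(q|b) - D(q|g) >= 0, otherwise 'bad'."""
    q = distribution(test_words, alphabet)
    return "good" if kl_distance(q, b) - kl_distance(q, g) >= 0 else "bad"
```

Here `g` and `b` are dictionaries mapping each letter of the alphabet to its probability under the good and bad models; they could be produced with `distribution` from labeled training names. The same functions work for bigrams if the words are first converted to sequences of two-character symbols and the alphabet is the set of all such symbols.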