Finding Domain-Generation Algorithms by Looking at Length Distributions



Similar documents
LASTLINE WHITEPAPER. Using Passive DNS Analysis to Automatically Detect Malicious Domains

WE KNOW IT BEFORE YOU DO: PREDICTING MALICIOUS DOMAINS Wei Xu, Kyle Sanders & Yanxin Zhang Palo Alto Networks, Inc., USA

We Know It Before You Do: Predicting Malicious Domains

VIRUS TRACKER CHALLENGES OF RUNNING A LARGE SCALE SINKHOLE OPERATION

Internet Monitoring via DNS Traffic Analysis. Wenke Lee Georgia Institute of Technology

Tracking and Characterizing Botnets Using Automatically Generated Domains

Botnet Detection with DNS Monitoring

Detecting Algorithmically Generated Malicious Domain Names

TECHNICAL REPORT. An Analysis of Domain Silver, Inc..pl Domains

RECENT botnets such as Conficker, Kraken and Torpig have

Winning with DNS Failures: Strategies for Faster Botnet Detection

Automatic Extraction of Domain Name Generation Algorithms from Current Malware

Tracking and Characterizing Botnets Using Automatically Generated Domains

EVILSEED: A Guided Approach to Finding Malicious Web Pages

Korea s experience of massive DDoS attacks from Botnet

A General-purpose Laboratory for Large-scale Botnet Experiments

Preetham Mohan Pawar ( )

DNS Firewall Overview Speaker Name. Date

Adventures in Cybercrime. Piotr Kijewski CERT Polska/NASK

Detecting Algorithimically Generated Malicious Domain Names

New Trends in FastFlux Networks

Analyzing HTTP/HTTPS Traffic Logs

Section 1 Overview Section 2 Home... 5

Website usage trends among top-level domains

Malicious Automatically Generated Domain Name Detection Using Stateful-SBB

ThreatSTOP Technology Overview

When Reputation is Not Enough: Barracuda Spam & Virus Firewall Predictive Sender Profiling

WHEN THE HUNTER BECOMES THE HUNTED HUNTING DOWN BOTNETS USING NETWORK TRAFFIC ANALYSIS

ENEE 757 CMSC 818V. Prof. Tudor Dumitraș Assistant Professor, ECE University of Maryland, College Park

LASTLINE WHITEPAPER. Large-Scale Detection of Malicious Web Pages

Indicator Expansion Techniques Tracking Cyber Threats via DNS and Netflow Analysis

Security Business Review

Botnet Analysis Leveraging Domain Ratio Analysis Uncovering malicious activity through statistical analysis of web log traffic

When Reputation is Not Enough: Barracuda Spam Firewall Predictive Sender Profiling. White Paper

A COMPREHENSIVE STUDY ON BOTNET AND ITS DETECTION TECHNIQUES

Next Generation IPS and Reputation Services

V-ISA Reputation Mechanism, Enabling Precise Defense against New DDoS Attacks

QUARTERLY REPORT 2015 INFOBLOX DNS THREAT INDEX POWERED BY

Implementation of Botcatch for Identifying Bot Infected Hosts

LASTLINE WHITEPAPER. The Holy Grail: Automatically Identifying Command and Control Connections from Bot Traffic

WHITE PAPER Big Data Analytics. How Big Data Fights Back Against APTs and Malware

Kaspersky Lab. Contents

Operation Liberpy : Keyloggers and information theft in Latin America

NSEC3 Hash Breaking. GPU-based. Matthäus Wander, Lorenz Schwittmann, Christopher Boelmann, Torben Weis IEEE NCA

Phone Fax

Extending Black Domain Name List by Using Co-occurrence Relation between DNS queries

Protecting DNS Query Communication against DDoS Attacks

Botnets: The Advanced Malware Threat in Kenya's Cyberspace

Security A to Z the most important terms

Botnet Detection Based on Degree Distributions of Node Using Data Mining Scheme

Index Terms Denial-of-Service Attack, Intrusion Prevention System, Internet Service Provider. Fig.1.Single IPS System

Kindred Domains: Detecting and Clustering Botnet Domains Using DNS Traffic "" " Matt Thomas" Data Architect, Verisign Labs"

Automated Protection Against Advanced Attacks

Next-Generation DNS Monitoring Tools

EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis

Botnet Detection by Abnormal IRC Traffic Analysis

A Novel Distributed Denial of Service (DDoS) Attacks Discriminating Detection in Flash Crowds

Agenda. Taxonomy of Botnet Threats. Background. Summary. Background. Taxonomy. Trend Micro Inc. Presented by Tushar Ranka

EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis

Search Query and Matching Approach of Information Retrieval in Cloud Computing

Looking Behind the Attacks - Top 3 Attack Vectors to Understand in 2015

Malware Detection in Android by Network Traffic Analysis

.eu Insights. Website usage trends among top-level domains 2014

HIDING THE NETWORK BEHIND THE NETWORK. BOTNET PROXY BUSINESS MODEL Alexandru Maximciuc, Cristina Vatamanu & Razvan Benchea Bitdefender, Romania

DNS Traffic Monitoring. Dave Piscitello VP Security and ICT Coordina;on, ICANN

Domain Hygiene as a Predictor of Badness

When Reputation is Not Enough. Barracuda Security Gateway s Predictive Sender Profiling. White Paper

CSE 3482 Introduction to Computer Security. Denial of Service (DoS) Attacks

JUNIPER NETWORKS SPOTLIGHT SECURE THREAT INTELLIGENCE PLATFORM

BioCatch Fraud Detection CHECKLIST. 6 Use Cases Solved with Behavioral Biometrics Technology

First Line of Defense

The Impact of DNSSEC. Matthäus Wander. on the Internet Landscape. Duisburg, June 19, 2015

WHITE PAPER Cloud-Based, Automated Breach Detection. The Seculert Platform

Detection of NS Resource Record DNS Resolution Traffic, Host Search, and SSH Dictionary Attack Activities

Prevention, Detection and Mitigation of DDoS Attacks. Randall Lewis MS Cybersecurity

Domain Name Abuse Detection. Liming Wang

How To Protect A Dns Authority Server From A Flood Attack

The Environment Surrounding DNS. 3.1 The Latest DNS Trends. 3. Technology Trends

Network attack and defense

SPAM FILTER Service Data Sheet

A Review of Anomaly Detection Techniques in Network Intrusion Detection System

How To Detect An Advanced Persistent Threat Through Big Data And Network Analysis

Threat Intelligence for Dummies. Karen Scarfone Scarfone Cybersecurity

Basheer Al-Duwairi Jordan University of Science & Technology

Securing Your Business with DNS Servers That Protect Themselves

A Study on the Live Forensic Techniques for Anomaly Detection in User Terminals

An Efficient Methodology for Detecting Spam Using Spot System

INTERNET DOMAIN NAME SYSTEM

The Impact of Cybercrime on Business

Introduction Header Body content Todo List. Analysis. Joe Huang Anti-Spam Team Cellopoint. February 21, Joe Huang Analysis 1 / 62

Fuzzy Network Profiling for Intrusion Detection

Flexible Deterministic Packet Marking: An IP Traceback Scheme Against DDOS Attacks

Conficker by the numbers

Report. Takeover of Virut domains

Guidance for Preparing Domain Name Orders, Seizures & Takedowns

Efficient Prevention of Credit Card Leakage from Enterprise Networks

Intelligent Worms: Searching for Preys

SPEAR PHISHING AN ENTRY POINT FOR APTS


OS KERNEL MALWARE DETECTION USING KERNEL CRIME DATA MINING

Transcription:

Finding Domain-Generation Algorithms by Looking at Length Distributions Miranda Mowbray Security and Cloud Lab, HP Labs HP Bristol, UK miranda.mowbray @ hp.com Josiah Hagen Tipping Point, HP Software HP Austin, TX, USA josiah.hagen @ hp.com Abstract In order to detect malware that uses domain fluxing to circumvent blacklisting, it is useful to be able to discover new domain-generation algorithms (DGAs) that are being used to generate algorithmically-generated domains (AGDs). This paper presents a procedure for discovering DGAs from Domain Name Service (DNS) query data. It works by identifying client IP addresses with an unusual distribution of second-level string lengths in the domain names that they query. Running this fairly simple procedure on 5 days data from a large enterprise network uncovered 19 different DGAs, nine of which have not been identified as previously-known. Samples and statistical information about the DGA domains are given. Keywords-domain generation algorithms; DGA; AGD; botnet; Domain Name Service; Big Data I. INTRODUCTION It is common for botnets to use domain fluxing in order to circumvent domain blacklisting. In domain fluxing, an infected machine periodically uses a seed and a domaingeneration algorithm (DGA) to automatically generate a large batch of domain names, and does a Domain Name Service (DNS) query to discover the domains IP addresses [1]. The seed may be time-independent, or may depend on the current time or other time-varying public information such as top Twitter trends [2,3]. The command-and-control server contacts an infected machine by registering and using one of the generated domain names. If this domain is blacklisted, it can move to another of the generated domains. If a machine can be detected querying malware DGA domains, it may be possible to disinfect or quarantine it before the infection produces any symptoms. The usual approach to detecting clients querying DGA domains is to use machine learning to build a classifier based on features of DGA domain names or of queries for them [4,5]. This approach only detects DGAs for which samples are available. There is therefore a need to identify previously-unknown DGAs, and obtain samples for them. This is the problem we address in this paper. In Section II we describe a relatively simple mostlyautomated procedure that can be used on DNS query data to detect sets of domains produced by previously-unknown DGAs. The procedure identifies only a small number of clients querying a DGA s domains often only one but once samples from a new DGA have been found, other clients querying domains from the DGA can be identified. The main contributions of the paper are the descriptions of the unidentified DGAs, and the description of the procedure used to find them. The identified DGAs that were found might also be of interest. To explain the main idea behind the procedure, we need to introduce some terminology. The term second-level domain refers to the substring of a domain name consisting of all the characters after the second-last dot before the public suffix. The DGAs used for second-level domain fluxing generate domains with many different second-level domains. The substring between the last and second-last dots before the public suffix will be called the 2 nd -level string, and its length the 2-length. For instance, both s1.s2.example.com and example.co.uk have 2 nd -level string example and 2- length 7. The top-level domain is the substring of a domain name after the last dot, and the suffix is just the public suffix. The main idea is to look for clients querying domains with an unusual distribution of 2-lengths. Thus, the procedure we describe can only detect DGAs that are used for second-level domain fluxing, and which generate batches of domains for which the 2-length distribution is unusual. However, when the procedure was run on 5 days DNS query data from a large enterprise network, it uncovered 21 different DGAs, nine of which have not been identified and so may be previously-unknown. We describe these in Section III. Some of these may be produced by legitimate applications rather than malware, and some or even all of them may be known, but we have not found previous reports of them. Once domains from a new malware DGA have been detected, they can be used to train a classifier so that this DGA can be detected in the future. We are currently using a classifier built using multi-feature logistic regression with L2 regularization [6], using syntactic features of second-level domain names. The features we use include counts of individual characters and character n-grams, dots, hyphens, digits, uppercase, and lowercase, and the length. The classifier performs well, with F1 scores > 95% generally, over several classes: we currently have 12 DGA classes, but aim to add new classes for the DGAs found. A. Other Approaches There are other approaches to detecting new DGAs. In some cases DGAs can be determined from captured

malware code, or by capturing the DNS traffic of a machine infected with malware that uses a new DGA. However this requires the prior identification of the infected machine. One approach [7] looks for domain names that are not human-readable or human-understandable. Similarly, another more recent approach uses linguistic features from the domain names [8]. However, some DGAs are designed to produce human-readable domain names. Most bigrams in Pushdo and Expiro domains consist of a vowel and a consonant, and Dorkbot domains are derived from English words. Moreover, some benign domain names are not obviously human-understandable. A different idea is to look for clusters of similar domains being queried by multiple clients [9,10,11]. This only catches infections in their later stages, when they have spread to several clients. The approach described in this paper potentially allows the detection of a DGA at the end of the first time slot during it is used by the first infected client in the network. However it cannot detect all DGAs. It makes sense to combine these two approaches, so that if a DGA is not detected early it may be detected later on. B. Privacy Considerations The reason for identifying clients based on very limited information about the domain names is that this allows for greater privacy protection. For each time slot, the automated part of the procedure identifies a few client IP addresses of interest, and the un-automated part examines the unusual domains queried by these IP addresses during the time slot. The procedure is designed so that most of it can be run on pseudonymized data, where both the IP addresses and the domains are pseudonymized. The pseudonymization of the domains preserves the public suffix and the 2-length, but may not preserve any other features. Non-pseudonymized domains need only be examined for the candidates that have been identified as having unusual 2-length distributions, and the non-pseudonymized version of a candidate IP address need not be revealed until it has been confirmed that it is infected and action is needed to remedy this. C. The Data Set In this paper we report results for five days DNS queries collected from a large enterprise network on April 9, May 12, June 11, July 18, and August 4 2014. The dates were selected at random from the days for which data was available, with the constraint that they are all from different months. The data set indicates, for each 3-hour time slot, active client IP address, and second-level domain, how many DNS queries were made by the client IP address of a domain with that second-level domain. There were 224,818 active client IP addresses, querying domains with over a million different second-level domains a day. II. THE PROCEDURE This section describes the procedure used to find DGAs in the data set, and the reasons for the choices made in the design of this procedure. The first step was to convert upper-case letters in the second-level domain names to lower case. Doing this conversion is standard practice when searching for malware domains in DNS data, as the DNS service is case-invariant, and also it ensured that client IP addresses making many queries of the same domain with different capitalization as a defence against DNS spoofing [12] would not appear to query many different domains with the same 2-length. The data entries for second-level domains with invalid public suffixes were deleted. Malware DGAs need to generate domains with valid public suffixes, so the secondlevel domains considered in the search for DGAs can be restricted to these. The data entries for domains with second-level domains that were queried more than 50 times in total in the appropriate time slot were also deleted. These are unlikely to be generated by a DGA used for second-level domain fluxing, unless many machines in the network have been compromised by the same malware and are querying the same domain. 98% of all queries of valid domains in the data set are for domains with a second-level domain that was queried at least 50 times in the time slot containing the query, even though they account for less than 10% of the different second-level domains queried. This parameter could in fact have been set to a much lower value than 50, say 10, as the DGA domains found by the procedure were typically queried by only a few end-user machines (often, just one) and a few servers or aggregators. Next, the candidates for the time slot were identified. These were the client IP addresses that during the time slot queried 50 or more of the remaining 2 nd -level strings. An experiment on one day s data reducing this parameter from 50 to 25 did not result in the detection of additional DGAs. In both cases, 7 DGA families were found. We found it was useful to treat candidates that queried Chinese-language web sites separately, because the 2-lengths for Chinese-language web sites tend to be shorter than average, as a result of practices for representing Chineselanguage domain names in Latin alphabet letters and numbers. The candidates were classified as Chinese-speakers or not using a day s data, by determining whether what fraction of the domains with a two-letter top-level domain that a candidate queried had the top-level domain cn, tw, mo, sg or hk. If the fraction was over 50%, the candidate was classified as a Chinese-speaker. The average 2-length distributions of (non-deleted) domain names queried by candidates in the two classes were calculated, with each candidate in the class being weighted equally in the average. For each candidate, the distance was calculated between the 2-length distribution during the time slot for domains queried by that candidate and the average distribution for either non-chinese-speaking or Chinese-speaking candidates, candidates, depending on how the candidate was

classified. The distance function used was the standard statistical distance function over a finite alphabet (1) D(X,Y) = ½ Σ X(i)-Y(i) (1) where the sum is over all 2-length values i, X(i) is the probability of i in distribution X and Y(i) the probability of i in distribution Y. The next step was to find candidates for which the distance from the average was anomalously large. Fig. 1 shows the distribution of the values for the distance from the average, measured to the nearest 0.1, for each day. For all five days the relatively flat part of the graph begins at a distance value of around 0.47, so values greater than 0.465 were treated as anomalous. (The blip at a distance value close to 1 turned out to be caused by candidates that queried Zbot domains.) Next, for each of the candidates with anomalous 2-length distributions, DNS queries were made ten of the second-level domains that they queried in the time slot. If 6 or more of these queries resulted in the answer that the domain did not exist, the candidate was counted as an identified candidate. The reason for this check was that about 28% of the candidates with anomalous 2-length distributions did not appear upon manual inspection to be querying AGDs in the time slot identified. For these candidates, most of the domains they queried resolved to an IP address, whereas by design, most malware DGA domains have no DNS registration. This check will be discussed further in Section IV, which gives reasons why some candidates had anomalous length distributions without querying DGA domains. For the reported candidates, further investigations were carried out to determine the properties of the underlying DGA, and to try to identify it. This was made easier by the fact that usually, the domains queried by the candidate during the time slot (and queried less than 50 times in total during that time slot) were almost all from the DGA. In some cases they were all from the DGA. Figure 1. Distribution of values of the distance from the aveerage for candidates on each day. I. RESULTS The procedure detected 19 different DGAs in the data set, with between 5 and 10 different DGAs detected in each day s data. The numbers of candidates reported on the five days were 28, 34, 13, 10 and 6 respectively. (These numbers count a candidate only once if it was identified in multiple time slots). All of them were querying AGDs. A. Identified malware DGAs The procedure detected the DGAs used by Conficker.D, Dorkbot, Expiro, Pushdo (the variant using suffix kz [13]), Ramdo, and Zbot. For more information on these, see their entries in [14]. Note that the names refer to the DGA, and not necessarily the underlying malware. For example the Zbot domains might have been queried by GameoverZeuS or by Cryptolocker, as these use the same DGA. Dorkbot domains are hard-coded rather than being generated by infected machines, but they look as though they were originally machine-generated. As these DGAs are already known, we do not describe the domains that they generate. B. Fixed-string DGAs.For two of the sets of AGDs, the domains are just a fixed string apart from a varying index number. These have domains of the formats firehuntern.com and testsupporturln.com respectively, where for the firehunter domains the index N is a string of 12 decimal digits, and for the testsupporturl domains the index N is of the form hhhhhhhh-hhhh-hhhh-hhhhhhhhhhhh where each h a hex digit,. The firehunter domains appear not to be associated with malware, but rather with an Internet Service measurement tool by Agilent Technologies [14], queried by two clients. The testsupporturl domains appear to be produced by a test carried out by a single client. C. Properties of numbered DGAs Table I gives ten sample second-level domains from each of the eleven remaining DGAs. Larger samples are available, for research purposes, on request from the authors. Table II indicates the suffixes used by the DGAs, the minimum and maximum 2-lengths, and the set of characters used in the 2 nd - level strings in the examples found in the data. There were brief reports in 2013 of a DGA using 26 different suffixes and characters a-y in the 2 nd -level string, without more details being given. DGA1 is probably that DGA, so it is not included in the total of nine unidentified DGAs, however a search for a reference was unsuccessful. DGA6 is also not counted in the nine, because it may be the DGA reported in [9] as New-DGA-v1; however it is hard to be sure, as [9] gives no general information about New- DGA-v1, just ten sample domains. Batches of DGA6 domains often include the same 2 nd -level string three times, once with each of the three suffixes. The client that queried DGA8 domains also queried, in the same timeslot, three longer domains used in the past as command-and-control servers for the Nitol botnet [17].

DGA TABLE I. SAMPLES OF DGA DOMAINS Sample Domains bvoornoepdyolmray.so, eiqnrwvaqex.ms, vbrmbmxxhplg.in, dtaoaoclxcuuluroglkvhj.eu, DGA1 ltyhilfadckiah.cx, fxxalaepncyycb.eu, quucxbsbgqjj.so, ivuuvimrwdw.tj, chgaeucrwujhsp.sc, lvcdwxsadrw.bz pp7zfsdm126bu4y8bs.cn, ux5ngqjlkmwvrmnb51.ru, kythnpfcmpajgttqn2.biz, llwqh5b46zyywzmww7.net, DGA2 4y7wboxtvv7v5tqyik.ru, wqdkcxrevxcd71kzs8.cn, 5vdaka3xdn3wbtzftg.biz, xli1uzkgz5psknild8.net, 16qyln3wufdtp5xeru.ru, e2o7a7on7zgun7l3bf.com tefdxtv.com, xffxgus.com, jvfjvkhb.com, hseggstb.com, DGA3 iewhggkj.com, gtjjwkfg.com, jjhedxtk.com, sxktjxss.com, hbckdhbe.com, xthvdsvj.com zvamlajjsuqu.com, xmfrtusizoq.com, olwlnxlnwiyohue.com, eagxozfzozbrmi.com,rhjwxkahcrij.com, lvplnjnutiuuckp.com, DGA4 civzelsvkcfspc.com, sfhdcrbhlp.com, qtcjoizgqctdb.com, gyowliphjh.com YurAIvr.com, MYiEcWp.com, CqQCYIr.com, DGA5 ORxaqPQ.com, xznxwkr.com, PQJAiZU.com, lzvdncg.com, XJlJzUg.com, edjjtbq.com, HvDrXtY.com 67add90c.net, 3142481d.com, 3142481d.net, DGA6 3142481d.info, 8293654d.net, 3293654d.com, 3293645d.com, ef955eb6.com, ef955eb6.com, bb25331f.info jgelvgde.net, zllzdpb.se, dtgjgg.com, kwsxbkzb.eu, DGA7 azjuoy.in, etpsoprc.ru, zderywwwd.in, lywaglvez.net, rgkzlt.ru, ogvpqdkbo.ru uzetfk.com, dynwjr.com, adgpaf.com, vtphfh.com, DGA8 opljqb.com, peycwg.com, kbgzag.com, uotkso.com, azfkfv.com, rhkihl.com ozylyjctih.com, fyxonepqpc.org, fmrqjqbovm.org, bovixcbgzk.tv, djylalqhc.com, ncjcpolkfe.org, DGA9 toxshsfypg.com, pcbaledgxc.tv, jujurunsxy.net, gtuzyjqday.com kdrlowylf.com, dclhmfkcb.ru, iclcakajd.ru, oysaqcxbi.ru, DGA10 fxazudqiv.ru, kntrejzkq.ru, heiylmruc.ru, uczcgpuxv.com, nxoyntdzt.com, srxkwklks.ru maxeplqzvcnxpfkovloz.net,sioyqajagaqfvdsudrvf.net, odensidfvfytihusgthm.net,uaovdstulcuzsxyrgsfh.net, DGA11 gtulxzaszyfmnxxzffkf.net,tktutqntzqmwvumvuozv.net, hpjzambgpgbtpubklgvp.net,nakumrwuyycpphowdqfx.net, ehfrwbderocmdbenvczm.net,vhamxcuypagehpdqgevx.net D. Length and Suffix Distributions In the set of DGA7 domains found, the 2-lengths 6, 7, 8, 9, 10 and 11 occur with frequencies approximately 0.15, 0.22, 0.3, 0.22, 0.09 and 0.02 respectively. For the other numbered DGAs with more than one possible 2-length, all the 2-lengths between the minimum and maximum values occur with approximately equal frequency. Some machine-learning techniques use the frequencies of different suffixes as one of their indicators to identify domains from known DGAs [9]. These frequencies can be different in different batches, including batches queried by different clients on the same day. This is illustrated in Fig. 2, which shows the distribution of suffixes in seven DGA1 samples, each of which contains at least 590 domains. However, there is much less variation between batches of the distribution of suffixes sorted by frequency. That is, the probability of the suffix that was the i th most frequently-used suffix in the batch is approximately the same for the different batches, but the identity of the i th most frequently-used suffix might be different for different batches. Fig. 3 shows that the suffix distributions for the seven DGA1 samples used for Fig. 2 are similar if the probability for the i th most frequentlyused suffix for the batch is plotted rather than the probability of a specific suffix. TABLE II. DGA DGA1 SUFFIXES, LENGTHS AND CHARACTERS FOR DGA DOMAINS Suffixes ac, bz, cc, cm, co, cx, de, eu, im, in, jp, ki, la, me, mn, ms, mu, nf, nu, ru, sc, sh, so, tj, tv, tw 2- lengths Characters in 2 nd -level string 10-25 a-y DGA2 biz, cn, com, net, ru 18 1-8, a-z DGA3 com 8 b-k, s-z DGA4 com 10-15 a-z DGA5 com 7 A-Z, a-z DGA6 com, info, net 8 0-9, a-f DGA7 biz, com, eu, in, info, name, net, org, ru, se 6-11 a-z DGA8 com 6 a-z DGA9 com, net,org, ru, tv 10 a-z DGA10 com, ru 9 a-z DGA11 net 20 a-z This is partly because the ordering by frequency reduces the statistical variation between different samples from the same underlying distributions. This may be the complete explanation in the case of DGA1. But in general, it is also because of the design of malware DGAs: if the underlying suffix distribution differs between batches, it seems that the DGA effectively uses the seed to determine an ordering of the suffixes, and chooses the suffix for a generated domain with a probability determined by this ordering. Hence, it may be useful to report the suffix distributions for the domains found from the numbered DGAs, even though different batches may have different distributions. The distribution for DGA1 is shown in Figs 2 and 3. The suffix frequencies for the domains found are approximately 0.3 cn, 0.2 ru, 0.2 biz, 0.15 com and 0.15 net for DGA2, 0.5 info, 0.3 net, 0.2 com for DGA6, and 0.6 ru, 0.4 com for DGA10. For DGA7 and DGA9, the suffixes appear with approximately equal frequency. Figure 2. Suffix distributions of DGA1 domains queried by seven different client IP addresses.

Fig. 6 shows the bigram distributions, sorted in reverse order of frequency for the relevant DGA, for the 2 nd -level strings in DGA3, DGA6, DGA9 and DGA10 batches. DGA9 uses only half the of the 676 possible bigrams of characters a-z in its 2 nd -level strings, because two adjacent letters in a 2nd-level string for a DGA9 domain are always an odd distance away from each other in the English alphabet. In 2nd-level strings of DGA1, DGA2, DGA4 and DGA11 domains found, all bigrams from the character set have approximately equal frequencies. As yet we have found insufficiently many sample domains from the three other numbered DGAs to report the bigram distributions of their 2 nd -level strings with confidence. Figure 3. The same data as Fig. 2, but where points with the same x- coordinate represent frequencies of the xth most frequent suffix in the distribution for the particular client E. Character and Bigram Distributions Two other features used to detect DGA domains using machine learning are character and bigram frequencies. Just as the suffix distributions may vary between different batches of the same DGA, so may the character and bigram distributions. For example, all the domains in the same Expiro batch begin with the same letter, so this letter has high frequency, but the letter can be different in different batches. However, as for the suffix distributions, the character and bigram distributions appear to be stable between batches when sorted by frequency. The batches of DGAs found were not large enough to perform n-gram analysis with confidence when n > 2. Fig. 4 shows the character distributions for the 2 nd -level strings in the DGA8 domains found. The six vowels (including y) are more frequent than the consonants, and z is the least frequent character. Figure 5. Character distributions for various numbered DGAs. Points with the same x-coordinate represent frequencies of the xth most frequent character in the 2 nd -level strings of the particular DGA. Figure 6. Bigram distributions for DGA3, DGA6, DGA9 and DGA10. Points with the same x-coordinate represent frequencies of the xth most frequent bigram in the 2 nd -level strings of the particular DGA. Figure 4. Charater distributions for 2 nd -level strings of DGA8 domains found The characters in 2 nd -level strings for DGA1, DGA2, DGA9 and DGA11 have approximately equal frequencies. Fig. 5 shows the character distribution for the remaining DGAs, sorted by reverse order of frequency for the particular DGA. The most frequent character in DGA7 2 nd -level strings varies between batches. II. LIMITATIONS AND FUTURE WORK The procedure cannot detect DGAs whose distribution of 2 nd -level string lengths is not sufficiently unusual. However, such DGAs may be detectable using other methods once the infection has spread to more machines. The procedure is only designed to detect second-level domain fluxing, but it might be possible to extend it to

detect fluxing at a higher level by considering looking for unusual distributions of the lengths of higher-level strings. The candidates in the data set that had unusual 2-length distributions but were not querying DGA domains tended to have distances from the average 2-length distribution close to the cutoff point of 0.465, and typically had unusual length distributions for one of two reasons. Either they queried short Chinese-language domains, but were classified as not Chinese-speaking; or, they queried set of related long but not AGDs. For example, one candidate queried over 60 web sites belonging to a large online shop called CoolBlue [17] in the same time slot. Ten of the CoolBlue domains are below. schuurmachineshop.nl, snijmachineshop.nl, broodbackmachinestore.nl, audiorecordershop.nl, naaimachinestore.nl, inktcartidcestore.nl. gourmetstelstore.nl, All skibrilcenter.nl, manuscripts videocamerashop.nl, must be in English. mengpanellstore.nl These guidelines The check that 6 or more of the 10 sample domains were reported as non-existent prevented all these candidates from being reported. In addition, the identification of Chinesespeaking candidates could be made more accurate by considering only country-code top-level domains rather than all 2-letter top-level domains - some misclassified candidates queried many domains with top-level domain cc or vn. The check may eliminate some candidates that are in fact querying DGA domains. In particular, it will do so if most of these domains have been sinkholed, that is, registered by others to prevent their use by the malware owner. This was the case for a candidate with an anomalous 2-length distribution in the May 12 data that was querying SillyFDC domains (see the entry in [13] for information on SillyFDC), and for several candidates querying Zbot domains after a large number of these were sinkholed by international law enforcement [18]. It is sensible for the check to count domains that resolve to a known sinkhole IP address as though they had no DNS registration. Our team is investigating how best to identify sinkhole IP addresses from DNS data. Our current method (details of which are out of scope of this paper) identified the sinkhole IP addresses involved in both of these cases. A non-automated alternative to the check is to examine the 10 sample domains by eye, and decide whether these appear to be AGDs. This typically takes only a few seconds per candidate with anomalous 2- length distribution. It would be useful to automatically compare the features of domains queried by a reported candidate to known DGAs, to see whether the DGA is already known. Finally, this paper addresses only a small part of a large problem. In order to be used in practice, the procedure needs to be integrated with network security tools and systems. Once DGAs have been identified, there also need to be processes to develop recognizers for domains produced by the DGAs, to identify infected clients in the network, and to remediate the associated problems. As a first step toward this integration, we aim to use domains found from the newly-discovered malware DGAs to retrain our classifier, so that domains from these DGAs can be recognized quickly. ACKNOWLEDGMENT Thanks to all the Big Data for Security team at HP Labs. REFERENCES [1] B. Stone-Gross et al., Your botnet is my botnet: analysis of a botnet takeover. Proc. Computer and Commuication Security (CCS 09), ACM, pp.635-647, New York, USA, 2009. [2] T. Barabosch, A. Wichmann, F. Leder and E. Gerhards-Padilla, Automatic Extraction of Domain Name Generation Algorithms from Current Malware, Proc. NATO Symposium IST-111 on Information Assurance and Cyber Defense, Koblenz, Germany, 2012 [3] SophosLabs, Surge in Sinowal Distribution, nakedsecurity blog post, 12 July 2009, http://nakedsecurity.sophos.com/2 009/07/12/surge-sinowal-distribution/ [4] S. Yadav, A. K. K. Reddy, A. N. Reddy, and S. Ranjan. Detecting algorithmically generated malicious domain names, Proc. Internet Measurement (IMC 10), ACM, pp.48-61, New York, USA, 2010. [5] L. Blige, E. Kirda, C. Kruegl, and M. Balduzzi, EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis, Proc. 18 th Network & Distributed System Security Symposium (NDSS 11), The Internet Society, pp.18-35, 2011 [6] C. M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics series, publ. Springer, 2007, p.209 [7] R. Mouchouz, DNS Resolution Traffic Analysis Applied to Bot Detection, Botconf 2013, Nancy, France, 5-6 Dec 2013 [8] S. Schiavoni et al., Phoenix: DGA-based Botnet Tracking and Intelligence, DIMVA 2014 [9] M. Antonakaikis et al., From Throw-Away Traffic to Bots: Detecting the rise of DGA-Based Malware, Proc. USENIX Security 2012 [10] M. Thomas and A. Mohaisen, Kindred Domains: Detecting and Clustering Botnet Domains Using DNS Traffic, 23 rd World Wide Web (WWW 14) companion publication, IWWWCSC, Geneva, Switzerland, 2014, pp. 707-712. DOI: 10.1134/2567948.2579359 [11] H Guerid, K. Mittig and A. Serhruouchni, Privacy-preserving domain-flux bot detection in a large scale network, Proc. 5 th Coomunication Systems and Netorks conference (COMSNETS 13), Bangalore, India, 7-10 2013, IEEE. DOI: 10.1109/COMSNETS.2013.6465572, pp.1-9 [12] D.Dagon, M.Antonakais, P. Vixie, T.Jinme and W. Lee, Increased DNS Forgery Resistance Through 0x20-Bit Encoding, Proc. CCS 08, ACM, New York, USA, 2008 pp.211-222. DOI: 10.1145/1455770.1455798 [13] A.Raff, PushDo Malware Domain Genration Algorithm, Seculert blog post, 16 May 2013, http://www.seculert.com/blog/2013/05/pushdo-malware-domaingeneration-adaptation.html [14] Microsoft, Malware Protection Center, online security portal, http://www.microsoft.com/security/portal/mmpc/default.aspx [15] Agilent Technologies, Firehunter white paper, 1998, online at http://literature.agilent.com/litweb/pdf/5988-6139en.pdf [16] I.Liba, Digging into the Nitol DDOS Botnet, McAffee Labs blog ost, 19 April 2012, http://blogs.mcafee.com/mcafee-labs/digging-intothe-nitol-ddos-botnet [17] CoolBlue B.V., Careers at CoolBlue, online document, 2014 http://www.careersatcoolblue.com/#about [18] Federal Bureau of Investigation, GameOver Zeus Botnet Disrupted, online announcemtn, 6 June 2014, http://www.fbi.gov/news/stories/2014/june/gameover-zeus-botnetdisrupted