Detecting Algorithmically Generated Malicious Domain Names

Sandeep Yadav, Texas A&M University
IT Security for the Next Generation, American Cup, New York, 9-11 November, 2011

Introduction
A botnet is a network of bots (compromised hosts) under the control of a Command & Control (C&C) server. Botnets are responsible for:
  Spamming
  Phishing
  DDoS
[Figure: recent (in-)famous botnets]

Modus Operandi
  bot a -> C&C Location #1
  bot b -> C&C Location #2
  bot c -> C&C Location #3

  Location      Domain                 IP
  Location #1   fakjdfhak.botnet.com   A1.B1.C1.D1
  Location #2   xxcvsdf.botnet.com     A2.B2.C2.D2
  Location #3   lkdsjlllh.botnet.com   A3.B3.C3.D3

Objective
Example domain names: fakjdfhak.botnet.com, lkdsjlllh.botnet.com, xxcvsdf.botnet.com
Goal: Detect domain-fluxing botnets by exploiting the randomness in their domain names.

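Groups for Analysis
Per-domain: names that share a domain, e.g. fakjdfhak.botnet.com, lkdsjlllh.botnet.com, and xxcvsdf.botnet.com under botnet.com.
Per-IP: names that map to the same IP address, e.g. kdjfhk.org, weiyrilskjd.com, and hlkhdfds.info all mapping to 1.1.1.1.
Per-component: domains and IP addresses linked to each other through mappings (Domain1, Domain2, IP1, IP2), analyzed as the whole component.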
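
The three groupings above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the 192.0.2.x addresses, and the shortcut of treating the last two labels as the shared domain are all assumptions made for the example. A component is taken here to be a connected component of the graph that links each name to the IP address it resolves to.

```python
from collections import defaultdict

def registered_domain(fqdn):
    # Assumption: the last two labels identify the shared domain.
    # This is a simplification; suffixes such as co.uk would need a suffix list.
    labels = fqdn.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])

def group_per_domain(records):
    groups = defaultdict(set)
    for fqdn, _ip in records:
        groups[registered_domain(fqdn)].add(fqdn)
    return dict(groups)

def group_per_ip(records):
    groups = defaultdict(set)
    for fqdn, ip in records:
        groups[ip].add(fqdn)
    return dict(groups)

def group_per_component(records):
    # Union-find over the nodes of the bipartite name/IP graph.
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for fqdn, ip in records:
        root_name, root_ip = find(("domain", fqdn)), find(("ip", ip))
        parent[root_name] = root_ip

    components = defaultdict(set)
    for node in list(parent):
        components[find(node)].add(node)
    return list(components.values())

# Names are taken from the slide; the 192.0.2.x addresses are hypothetical.
records = [
    ("fakjdfhak.botnet.com", "192.0.2.1"),
    ("lkdsjlllh.botnet.com", "192.0.2.2"),
    ("xxcvsdf.botnet.com", "192.0.2.2"),
    ("kdjfhk.org", "1.1.1.1"),
    ("weiyrilskjd.com", "1.1.1.1"),
    ("hlkhdfds.info", "1.1.1.1"),
]
```

Each metric in the following slides can then be evaluated on the set of names in one group instead of on a single name.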

Metrics for analysis
  Kullback-Leibler divergence
  Jaccard index
  Edit distance

Metrics for analysis (Kullback-Leibler Divergence)
A measure of how much one probability distribution differs from another: the more similar the two distributions, the smaller the divergence.
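
For two discrete distributions P and Q over the same symbols, the divergence is D(P||Q) = sum over x of P(x) * log(P(x) / Q(x)). The sketch below applies it to character (unigram) distributions of domain labels. The alphabet, the add-one smoothing, and the tiny benign word list are assumptions chosen for illustration; the slide does not specify how the reference distributions are built.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase + string.digits

def unigram_distribution(labels, alphabet=ALPHABET, smoothing=1.0):
    # Smoothed character frequencies, so no symbol has zero probability.
    counts = Counter(ch for label in labels for ch in label.lower() if ch in alphabet)
    total = sum(counts.values()) + smoothing * len(alphabet)
    return {ch: (counts[ch] + smoothing) / total for ch in alphabet}

def kl_divergence(p, q):
    # D(P||Q); note that it is not symmetric in P and Q.
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

# Hypothetical benign reference words; labels under test are from the slides.
benign = unigram_distribution(["google", "facebook", "wikipedia", "amazon"])
test_group = unigram_distribution(["fakjdfhak", "lkdsjlllh", "xxcvsdf"])
divergence_from_benign = kl_divergence(test_group, benign)
```

A group whose distribution diverges strongly from the benign reference is a candidate for closer inspection. Where a threshold should sit depends on the data and is not something this sketch decides.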

Metrics for analysis (Jaccard Index)
Evaluates similarity with a benign database of bigrams.
E.g. xjisov.botnet.com
  Break the name into bigrams: xj ji is so ov
  Compute the fraction of bigrams present in the benign database
Benign groups will have a higher value of this metric.
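
The bigram steps above can be sketched as follows. The code computes the fraction described on the slide (bigrams of the test label that also occur in the benign database). The helper names and the three benign words are hypothetical and serve only to make the example runnable.

```python
def bigrams(label):
    # "xjisov" -> ["xj", "ji", "is", "so", "ov"]
    return [label[i:i + 2] for i in range(len(label) - 1)]

def build_bigram_database(words):
    return {gram for word in words for gram in bigrams(word)}

def benign_bigram_fraction(label, benign_bigrams):
    grams = set(bigrams(label))
    if not grams:
        return 0.0
    return len(grams & benign_bigrams) / len(grams)

# Hypothetical benign words; a real database would be far larger.
benign_db = build_bigram_database(["google", "wisdom", "sofa"])
score = benign_bigram_fraction("xjisov", benign_db)  # "is" and "so" match: 2 of 5
```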

Metrics for analysis (Edit Distance)
The number of character modifications (addition, deletion, or substitution) required to convert one string into another. E.g. converting dog to cat takes three substitutions, so the edit distance is three.
Intuitively, the edit distance between two randomized (possibly malicious) domains is higher. For instance:
  ns1.google.com -> ns2.google.com (Benign) [low ED]
  sljslasdkja.com -> rjhbgjhr.org (Malicious) [high ED]
No reference database or distribution is required.
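
A minimal sketch of this metric, using the standard dynamic-programming (Levenshtein) formulation with unit cost for each addition, deletion, or substitution. The function name is an assumption; the example strings are the ones from the slide.

```python
def edit_distance(a, b):
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

benign_pair = edit_distance("ns1.google.com", "ns2.google.com")
malicious_pair = edit_distance("sljslasdkja.com", "rjhbgjhr.org")
```

For a group of names, one option is to average this value over all pairs in the group. That aggregation is an assumption here, since the slide does not state how pairs are combined.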

Results
Evaluation data set: Tier-1 ISP dataset containing a host of malicious domains (including Conficker).
Benign data set: DNS-PTR dataset containing PTR records for addresses in the IPv4 space.
Malicious data set: Domains for Storm, Kraken, Pushdo, etc., obtained from Botlab.

Results
Per-domain analysis
[Figures: K-L divergence; Jaccard index]
Similarly, edit distance also performs reasonably well: with 500 test words, the true positive rate (TPR) is 100% at an 8% false positive rate (FPR).

Results
Per-component analysis
Applied a supervised learning approach.
Used an L1-regularization algorithm for classification.
Used three features: K-L divergence for unigrams, Jaccard index, and edit distance.
Trained on one malicious (Conficker) component and the remaining benign components.
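
The slide does not say which L1-regularized model was used, so the sketch below assumes L1-regularized logistic regression trained by proximal gradient descent, as one possible instance of the idea. The feature values are invented toy numbers scaled to [0, 1], not measurements from the talk. They mirror the setup of one malicious component and several benign ones, with one extra malicious-like row so the toy problem is not degenerate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(value, amount):
    # Proximal step for the L1 penalty: shrink toward zero, clip at zero.
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def train_l1_logistic(X, y, lam=0.01, lr=0.5, epochs=500):
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # Gradient step on the logistic loss, then shrink the weights (not the bias).
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Hypothetical features per component: [K-L divergence, Jaccard index, edit distance].
X = [[0.9, 0.1, 0.9], [0.8, 0.2, 0.8],                    # malicious-like (label 1)
     [0.1, 0.9, 0.2], [0.2, 0.8, 0.1], [0.15, 0.85, 0.3]]  # benign-like (label 0)
y = [1, 1, 0, 0, 0]
weights, bias = train_l1_logistic(X, y)
```

The L1 penalty pushes uninformative feature weights to exactly zero, which is a common reason for choosing it when only a few features are expected to matter.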

Results
Per-component analysis
Outcome:
  Detected Conficker.
  Discovered the presence of the Helldark botnet.
  Discovered a new botnet, which we call Mjuyh:
    57-character-long fourth-level domain composed of random characters.
    One domain maps to 10 IP addresses.

Take Away
Domain-fluxing botnets utilize high-entropy domain names.
We deploy metrics such as Kullback-Leibler divergence, Jaccard index, and edit distance for detecting such botnets.
In addition to detecting known botnets, we discover new botnets.

Thank You