Domain Hygiene as a Predictor of Badness
Tim Helming
Director, Product Management
DomainTools

Your Presenter
- Director of Product Management (aka "the roadmap guy")
- Over 13 years in cybersecurity
- Passionate about fighting the good fight!
Contents
- Meet your Presenter
- The Hypothesis
- Examples of good and bad hygiene
- Algorithms for predictive reputation scoring
- Test methodology and findings
- Future directions
- Q&A

Your Presenter's Employer
- Started by domainers as a(nother) Whois lookup site
- Added a few unique twists
- Maintained historical DB of information
- Went beyond canonical Whois data
- Built research tools atop all of this
- Today: it's not just for domainers any more. Lots of cybersleuths use it.
The Hypothesis
- Malicious domains are not set up the same way as legitimate ones, much like physical dens of crime are not (usually) very similar to legit businesses
- Many of the characteristics that distinguish the legit from the illegit are visible in the public record
- These characteristics can be used to characterize domains and registrants
- This can help predict badness of domains before they strike, or even before they're registered

How Might We Use Hygiene Information?
- Add a dimension to existing reputation scoring
- Assign a higher risk profile to unhygienic domains
- Help prioritize attribution/investigation targets when potential targets abound
Bolstering Reputation Feeds
Traditional Reputation Feeds
- Tend to be malware focused (not always)
- Often require the domain to show badness, i.e. hurt victims, before listing it
- Ignore registrant as a vector of risk, and registration details as a marker of risk

Bolstering Reputation Feeds
A Hygiene-Aware Reputation Feed
- Could raise risk profile of domains that have not otherwise implicated themselves
- Could predictively raise risk profile of domains as they come online (conceptually, before they come online) via registrant reputation
Does Your Domain Have Good Hygiene?
What does the public record (OSINT) tell us?
Markers of Legitimate Commercial Domains
- Linguistic Coherence
- MX Record (most businesses use email)
- Valid Physical Address
- Valid Phone Number
- Age (yes, we're practicing ageism, but unlike the other markers, this one is self-healing!)

Does That Domain Have Bad Hygiene?
What does the public record (OSINT) tell us?
Markers of Illegitimate Domains
- Linguistic incoherence of domain name, registrant name/email
- Lack of mail servers
- Lack of web presence
- Irrational/invalid Physical Address
- Irrational/invalid Phone Numbers
And now... To the Science!

Methodology 1.0
We started with the easy stuff
- Built three corpuses of domains:
  - Legitimate domains (a random selection of businesses and nonprofits, and Alexa top-ranked sites)
  - Unknown (a truly random set with no qualifications)
  - Bad (a random selection from reputable malware/spam/phishing classification providers)
- And we compared their hygiene characteristics
Scoring 1.0
Criterion (range): Low Risk | Medium Risk | High Risk
- Linguistic Coherence (0-2): 0 = domain name makes sense linguistically (including acronyms/abbreviations) | 1 = questionable; high entropy but can pass the "squint test" | 2 = incoherent
- Age (0-2): 0 = >45 days | 1 = 7-45 days | 2 = <7 days
- MX Record (0-1): 0 = has MX record | 1 = no MX record
- Web Server (0-1): 0 = has Web server | 1 = no server
- Physical Address Coherence (0-2): 0 = valid address | 1 = indeterminate or partial | 2 = invalid address
- Phone # Coherence (0-2): 0 = valid phone number | 1 = indeterminate or partial | 2 = not a valid phone number
Scores were composites of these

Results: Test Corpuses 1.0
- Alexa Top Domains: 0.047
- Unknown: 0.1585
- Malware: 0.3615
- Spam: 0.573
- Phishing: 0.6905
Why is malware in the middle? Likely reflects the combination of pwned and evil sites (pwned ones would in many cases have good hygiene; evil ones, not so much)
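The Scoring 1.0 rubric can be sketched in a few lines of Python. This is a minimal sketch, not the presenters' implementation: the criterion keys, the `age_score` helper, and the normalization (points earned divided by the 10 points possible) are assumptions. The slides only say that scores were composites of the criteria.

```python
# Maximum points per criterion, as listed in the Scoring 1.0 rubric.
# The dictionary keys are hypothetical names chosen for this sketch.
MAX_POINTS = {
    "linguistic": 2,
    "age": 2,
    "mx": 1,
    "web": 1,
    "address": 2,
    "phone": 2,
}


def age_score(age_days):
    """Map domain age in days to the rubric's 0-2 age score."""
    if age_days < 7:
        return 2  # high risk: younger than 7 days
    if age_days <= 45:
        return 1  # medium risk: 7 to 45 days
    return 0      # low risk: older than 45 days


def composite_score(points):
    """Sum the per-criterion points and scale the total to 0..1.

    Dividing by the maximum total is an assumption; the slides do not
    say how the composite was normalized.
    """
    if set(points) != set(MAX_POINTS):
        raise ValueError("expected exactly these criteria: %s" % sorted(MAX_POINTS))
    for name, value in points.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError("%s out of range: %r" % (name, value))
    return sum(points.values()) / sum(MAX_POINTS.values())


# Hypothetical example: a 3-day-old, incoherent name with no MX record
# and no web server, but a plausible address and phone number.
example = {
    "linguistic": 2,
    "age": age_score(3),
    "mx": 1,
    "web": 1,
    "address": 0,
    "phone": 0,
}
print(composite_score(example))  # 6 of 10 possible points
```

Averaging such per-domain scores over a corpus would give one number per corpus, which is one plausible reading of the per-corpus results.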
Methodology 2.0
Okay, that was the easy stuff. Now for more rigor.

Methodology 2.0
- Using AI to assess entropy in Whois record fields
- All random sampling rather than hand-picked corpuses
- Spot checks comparing high score (high badness) vs low score domains
- Do new bad domains show up on malware lists?
Methodology 2.0
Using AI to classify threats based on Whois
Step 1: Measurements on each record
- Linguistic analysis of domain names, and other text-based Whois data fields
  - Frequency of linguistically rational bigrams
  - Letter, number, symbol, vowel ratios

Methodology 2.0
More AI to classify threats based on Whois
Step 1 (cont'd): Measurements on each record
- Analysis of contact information
  - Impossible names (entropy)
  - Improbable names ("Donald Duck")
  - Impossible phone numbers
  - Improbable region words (e.g. Alabama in France)
  - Bad postal codes
  - Throwaway email domains in contact emails
  - Generic email domains in contact emails
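The linguistic measurements in Step 1 (character-class ratios and the frequency of rational bigrams) might look roughly like the sketch below. Everything here is an assumption made for illustration: the function names, the tiny `COMMON_BIGRAMS` set (a real list would come from a language corpus), and the choice to score bigrams as a simple fraction.

```python
# Hypothetical, deliberately tiny set of common English bigrams. A real
# list would be derived from a language corpus.
COMMON_BIGRAMS = {
    "th", "he", "in", "er", "an", "re", "on", "at", "en", "nd",
    "st", "es", "or", "te", "ti", "to", "ma", "ai", "do", "ol",
}
VOWELS = set("aeiou")


def char_ratios(label):
    """Return letter, digit, symbol, and vowel ratios for a domain label."""
    label = label.lower()
    n = len(label)
    if n == 0:
        return {"letter": 0.0, "digit": 0.0, "symbol": 0.0, "vowel": 0.0}
    letters = sum(ch.isalpha() for ch in label)
    digits = sum(ch.isdigit() for ch in label)
    vowels = sum(ch in VOWELS for ch in label)
    return {
        "letter": letters / n,
        "digit": digits / n,
        "symbol": (n - letters - digits) / n,
        "vowel": vowels / n,
    }


def bigram_score(label):
    """Fraction of adjacent character pairs found in COMMON_BIGRAMS."""
    label = label.lower()
    bigrams = [label[i:i + 2] for i in range(len(label) - 1)]
    if not bigrams:
        return 0.0
    return sum(b in COMMON_BIGRAMS for b in bigrams) / len(bigrams)


# Compare a readable label with a high-entropy one.
for name in ("domaintools", "829fh92-s8s"):
    print(name, char_ratios(name), bigram_score(name))
```

A label with no vowels and no recognizable bigrams scores zero on both measures, which is the kind of signal the slides describe feeding into the classifiers.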
Methodology 2.0
More AI to classify threats based on Whois
Step 1 (cont'd): Measurements on each record
- Domain names matching hash patterns (md5/sha)
- Privacy protection (and similar)
- Registration duration and age
- Indicators of data completeness

Methodology 2.0
Various domain / Whois hygiene indicators into AI
- Unsupervised Classification (e.g. k-means)
  - Find ~15 classes of domains with similar scores in the hygiene space
  - Look at counts of threats (spam, botnet, etc) in each class
- Supervised Classification (e.g. random forests)
  - Fit a model to a blacklist and predict future blacklists
- Calibration Step: Exploratory Stats
  - Summarize domain hygiene data to better design classifiers
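The unsupervised step (cluster domains in hygiene space, then count known threats per cluster) can be illustrated with a small pure-Python k-means. This is a sketch under stated assumptions, not the presenters' pipeline: the feature vectors and labels are invented toy data, the helper names are hypothetical, and it uses k=2 where the slides mention roughly 15 classes.

```python
import random
from collections import Counter


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Move each centroid to the mean of its members; skip empty clusters.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def threat_counts_by_cluster(assignment, labels):
    """Tally known threat labels inside each cluster."""
    counts = {}
    for cluster, label in zip(assignment, labels):
        counts.setdefault(cluster, Counter())[label] += 1
    return counts


# Invented toy vectors: (scaled name entropy, age score, no-MX flag).
points = [
    [0.10, 0, 0], [0.20, 0, 0], [0.15, 0, 0],
    [0.90, 2, 1], [0.80, 2, 1], [0.95, 2, 1],
]
labels = ["none", "none", "none", "botnet", "spam", "botnet"]

assignment = kmeans(points, k=2)
print(threat_counts_by_cluster(assignment, labels))
```

A cluster whose tally is dominated by threat labels marks a region of hygiene space where unlabeled domains deserve a higher risk profile.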
Results 2.0: Ranking of Type by Score
Some Results from AI Scoring
Ranking by Type, based on length and entropy
1. Botnet (longest, highest entropy)
2. Malware
3. Phishing
4. Spam
5. Not a Known Threat (shortest, lowest entropy)
Each parameter (length, entropy) gave the same ranking

Close-Up: Relative Entropy by Type
How do different types of domains compare by domain name entropy?
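A ranking of this kind can be reproduced in miniature by averaging name length and name entropy per threat type. The sketch below assumes that "entropy" means Shannon entropy over the characters of the domain label, which the slides do not specify. The function names and the sample strings are invented for illustration and are not data from the study.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def rank_types(samples):
    """Rank threat types by mean name length and by mean name entropy.

    samples is an iterable of (domain_label, threat_type) pairs.
    Returns two lists of types, highest mean first.
    """
    lengths = defaultdict(list)
    entropies = defaultdict(list)
    for label, kind in samples:
        lengths[kind].append(len(label))
        entropies[kind].append(shannon_entropy(label))

    def mean(values):
        return sum(values) / len(values)

    by_length = sorted(lengths, key=lambda k: mean(lengths[k]), reverse=True)
    by_entropy = sorted(entropies, key=lambda k: mean(entropies[k]), reverse=True)
    return by_length, by_entropy


# Invented toy labels, one per type, purely to exercise the functions.
samples = [
    ("x9k2qv7zt4mw", "botnet"),
    ("freeprize", "spam"),
    ("mail", "not a known threat"),
]
by_length, by_entropy = rank_types(samples)
print(by_length)
print(by_entropy)
```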
Close-Up: Relative Length by Type
How do different types of domains compare by domain name length?

Zooming Back Out
Some Practical Considerations
A word about false positives
- Reputation scoring is not blacklisting
- More like actuarial risk scoring
- Hygiene is correlated with, but not deterministically attached to, legitimacy
- One (or two) higher-risk attributes do not sink a domain
- However...

A word about false positives
- Hygiene-based false positives behave differently
- They are "low-risk FPs"
- Some such domains will be assigned high risk scores when they aren't hosting malware, but...
- Does anyone really need to visit 829fh92-s8s.com?
- IOW, they are very unlikely to be legit
- Ergo, FPs based on hygiene scoring aren't so likely to be an IT headache
Using Hygiene Analysis Today
You can use this approach in investigations today (modulo scale)
- As a filtering method: given a pool of suspicious domains, examine Whois records: do they pass the bogon sniff test? (Parsed records make this easier.) Focus on the bogon ones first.
- As a connection signal: given a pool of domains which may or may not be linked, do they have common registrant info? Or, does a given bogon registrant hold more than the one domain you started looking at?

Using Hygiene Analysis Today
...and you can use it as a defense mechanism, albeit manually
- Identify registrants of evil domains (it doesn't matter if the info is bogus, as long as you've got the same registrant)
- Do reverse lookups of these registrants to see if they own other domains you've not seen
- Add the other domains to blacklists, sinkholes, or other configuration inputs
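The connection-signal and reverse-lookup ideas both come down to grouping domains by a shared registrant value. Here is a minimal sketch over already-parsed Whois records. The record layout (`domain`, `registrant_email`), the helper names, and the sample records are all hypothetical, and a real reverse lookup would query a Whois data source rather than a local list.

```python
from collections import defaultdict


def normalize(value):
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(value.lower().split())


def domains_by_registrant(records, field="registrant_email"):
    """Group domain names by a registrant field from parsed Whois records."""
    groups = defaultdict(set)
    for record in records:
        value = record.get(field)
        if value:
            groups[normalize(value)].add(record["domain"])
    return groups


def pivot(records, seed_domain, field="registrant_email"):
    """Return other domains that share the seed domain's registrant value."""
    groups = domains_by_registrant(records, field)
    for domains in groups.values():
        if seed_domain in domains:
            return sorted(domains - {seed_domain})
    return []


# Hypothetical parsed records using reserved .example names.
records = [
    {"domain": "bad-one.example", "registrant_email": "Qx7@Mail.example"},
    {"domain": "bad-two.example", "registrant_email": "qx7@mail.example "},
    {"domain": "shop.example", "registrant_email": "owner@shop.example"},
]
print(pivot(records, "bad-one.example"))
```

The matching works even when the registrant details are bogus, since only equality between records matters, which is the point the slide makes.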
Future Directions
- Iterate, refine, improve
- Watch for registration trends that could affect the AI algs or suggest new ones
- Integrate with our other branch of reputation tech ("proximity to badness")
- Keep exploring

Your Turn
Q & A
DomainTools says Thank You!