1 + EVILSEED: A Guided Approach to Finding Malicious Web Pages Presented by: Alaa Hassan Supervised by: Dr. Tom Chothia
2 + Outline Introduction Introducing EVILSEED. EVILSEED Architecture. Effectiveness of EVILSEED. Discussion and Limitations. Conclusion.
3 + Searching the Web How would you identify a page as malicious? In your opinion, are the current techniques for identifying malicious pages effective?
4 + Identifying Malicious Web Pages Is a Challenging Task The web is a very large place. Every day, new pages, both legitimate and malicious, are added to the web at a daunting pace. Attackers regularly scan for vulnerable hosts that they can exploit to store malicious pages. Infected hosts are organized in complex malicious meshes to increase the chances of users landing on them.
5 + Searching the Web: A Three-Step Process Crawlers: collect URLs in bulk. Fast prefiltering: quickly discards pages that are very likely to be legitimate. Oracles: slowly and carefully analyze the remaining pages and detect malicious content using special tools, such as honeyclients. An effective approach, but not an efficient one: resource-intensive, time-consuming, and costly.
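The three steps above can be sketched as a simple pipeline. This is a minimal illustration, not the implementation of any real crawler: `find_malicious`, `toy_prefilter`, and `toy_oracle` are hypothetical names, and the toy checks only look at the URL string, whereas a real prefilter and oracle would inspect page content.

```python
from typing import Callable, Iterable, List


def find_malicious(
    urls: Iterable[str],
    prefilter: Callable[[str], bool],
    oracle: Callable[[str], bool],
) -> List[str]:
    """Return the URLs the oracle flags, analysing only those the prefilter keeps."""
    # Step 2: cheap, fast check that discards pages very likely to be legitimate.
    suspicious = [u for u in urls if prefilter(u)]
    # Step 3: slow, costly analysis, run only on what survived the prefilter.
    return [u for u in suspicious if oracle(u)]


# Hypothetical stand-ins for illustration only.
def toy_prefilter(url: str) -> bool:
    return "login" in url or ".php" in url


def toy_oracle(url: str) -> bool:
    return url.endswith("evil.php")


# Step 1: in practice this list would come from a crawler.
crawled = [
    "http://a.example/index.html",
    "http://b.example/evil.php",
    "http://c.example/login",
]
flagged = find_malicious(crawled, toy_prefilter, toy_oracle)
print(flagged)
```

The cost of the approach is dominated by how many URLs reach the oracle, which is why improving the quality of the URLs collected in step 1 matters.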
6 + A Much More Efficient Approach EVILSEED is a guided approach to finding malicious web pages much more efficiently: Improves the efficiency of the web crawling phase. Starts from a set of known malicious pages. o Legitimate web pages that have been compromised. o Pages set up by cybercriminals. Generates search engine queries to find pages that share certain similarities with the known malicious pages: a guided search rather than a random search. Allows gathering URLs with high toxicity.
7 + Advantages of EVILSEED The URLs found are much more likely to be malicious than pages found by random crawling. Finds more malicious pages with a fixed amount of resources. Much faster. Could be beneficial to search engines.
8 + Why Does EVILSEED Work? Malicious pages usually share similarities o Attackers usually search the web for patterns associated with vulnerable web applications that can be exploited by injecting malicious code into their pages. o Attackers use exploit toolkits to create their attack pages. o Many compromised pages often link to the same malicious page. It makes use of available, up-to-date tools and datasets in the guided search process o Passive DNS feeds. o The Google and Bing crawler infrastructure (which has indexed a large portion of the web and is always up to date).
9 + EVILSEED Components Seed: The (evil) seed is a set of pages that have been previously found to be malicious. Gadgets: The core of EVILSEED. They o extract information from the seed pages, o build search engine queries based on that information (expansion), o gather the URLs returned by the guided search process and pass them to the oracle for further analysis. Oracle: Performs the in-depth analysis. o Google's Safe Browsing blacklist. o Wepawet: a service for detecting and analyzing web-based threats. o A custom-built tool to detect fake AV sites.
10 + EVILSEED Architecture
11 + Gadgets EVILSEED implements five gadgets: Links Gadget: uses the web topology (web graph) to find pages that link to malicious resources. Content Dorks Gadget: identifies vulnerable and exploited web applications. Search Engine Optimization (SEO) Gadget: analyzes seed pages that belong to blackhat Search Engine Optimization campaigns. Domain Registrations Gadget: identifies suspicious sequences of domain registrations. DNS Queries Gadget: analyzes traces of DNS requests to locate pages that lead to a malicious domain.
12 + Links Gadget Locates malware hubs (pages that contain links to several malicious URLs). Seed: All URLs known to be malicious. Expansion: o Searches for malware hubs that link to the seed pages. o Forms search queries that are sent to Google, Bing and YaCy to distribute the load. o Retrieves the pages returned by the queries and extracts all outgoing links from each one.
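The last expansion step (extracting outgoing links from a suspected hub) can be sketched with the Python standard library. This is an illustrative sketch only: the names `LinkExtractor`, `outgoing_links`, and `expand_from_hub` are hypothetical, and the `min_seed_links` threshold for calling a page a hub is an assumption, not a value from the slides. The search engine queries that locate the hub in the first place are not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute targets of all <a href=...> tags on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def outgoing_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def expand_from_hub(hub_url, hub_html, seed_urls, min_seed_links=2):
    """If the page links to enough known seeds, return its other links as candidates.

    min_seed_links is an assumed threshold, chosen here for illustration.
    """
    links = outgoing_links(hub_url, hub_html)
    seed_hits = {link for link in links if link in seed_urls}
    if len(seed_hits) < min_seed_links:
        return []
    return [link for link in links if link not in seed_urls]


html = (
    '<a href="http://bad1.example/x">1</a>'
    '<a href="http://bad2.example/y">2</a>'
    '<a href="/local">3</a>'
)
seeds = {"http://bad1.example/x", "http://bad2.example/y"}
candidates = expand_from_hub("http://hub.example/list", html, seeds)
print(candidates)
```

The candidates returned are the hub's links that are not yet known to be malicious; these are what would be handed to the oracle.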
13 + Content Dorks Gadget Automates the generation of relevant Google dorks and can automatically identify suitable ones. Google dorks are at the center of the Google Hacking Database. Many attackers use Google to find vulnerable web pages and later exploit these vulnerabilities.
14 + Content Dorks Gadget Seed: Legitimate web pages that have been compromised by attackers (landing pages). o Contain indexable content. o Remain online longer. o Share characteristics that can be identified. Expansion: queries are based on n-grams of words extracted from the indexable content. An n-gram is a contiguous sequence of n words; n-gram language models use the preceding n-1 items to predict the next item in a sequence. o Term extraction: extracts the terms that best summarize the content of the page. o n-gram selection: extracts all sequences of n words from a landing page and ranks them according to their likelihood of occurring in a malicious page vs. a benign page.
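The n-gram selection step can be illustrated with a small sketch. The scoring used here (a smoothed ratio of document frequencies in a malicious vs. a benign corpus) is an assumption made for illustration; the slides only say that n-grams are ranked by their likelihood of occurring in malicious vs. benign pages. The function names and the toy corpora are hypothetical.

```python
from collections import Counter


def ngrams(text, n):
    """All contiguous sequences of n whitespace-separated words, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def rank_ngrams(landing_page, malicious_corpus, benign_corpus, n=3):
    """Rank the page's n-grams, most malicious-looking first.

    Score = smoothed fraction of malicious documents containing the n-gram,
    divided by the smoothed fraction of benign documents containing it.
    """
    def doc_freq(corpus):
        counts = Counter()
        for doc in corpus:
            counts.update(set(ngrams(doc, n)))  # count each document once
        return counts

    mal, ben = doc_freq(malicious_corpus), doc_freq(benign_corpus)

    def score(gram):
        p_mal = (mal[gram] + 1) / (len(malicious_corpus) + 1)
        p_ben = (ben[gram] + 1) / (len(benign_corpus) + 1)
        return p_mal / p_ben

    candidates = set(ngrams(landing_page, n))
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(candidates, key=lambda g: (-score(g), g))


malicious = ["powered by oldcms version 1", "site powered by oldcms version 1"]
benign = ["welcome to my home page", "powered by coffee and code"]
ranked = rank_ngrams("blog powered by oldcms version 1", malicious, benign, n=3)
print(ranked)
```

The top-ranked n-grams would then be submitted to a search engine as quoted queries, in the same way a hand-written dork would be.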
15 + Search Engine Optimization Gadget Cybercriminals use a variety of techniques to drive traffic to the malicious pages under their control. Blackhat Search Engine Optimization (SEO) techniques: o Attackers host many different web pages, optimized for different search terms, on each web site in a campaign. o Attackers host pages optimized for the same search terms on different web sites in a campaign. o Pages in a campaign often link to each other. SEO kits use semantic cloaking: o Exploited web sites respond with completely different content depending on the source of a request.
16 + Search Engine Optimization Gadget Seed: at least one malicious URL that is part of a live SEO campaign. Seeds are identified by detecting redirection-based cloaking, which is widely used in blackhat SEO campaigns. o Visit the URL three times, each time with different request header values (such as User-Agent and Referer). If two or more different landing pages appear, cloaking is detected. Expansion: one cloaked URL leads to other malicious pages from the same campaign.
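The cloaking check can be sketched as follows. This is a hedged illustration: the three header profiles (a visitor arriving from a search result, a direct visitor, and a search engine crawler) and the header strings are assumptions, and `fetch_landing` is a hypothetical callable, supplied by the caller, that follows redirects and returns the final landing URL. No real network request is made here.

```python
from typing import Callable, Dict

# Assumed request profiles; the exact values are illustrative.
PROFILES = [
    # A user who clicked through from a search result.
    {"User-Agent": "Mozilla/5.0", "Referer": "https://www.google.com/search?q=example"},
    # A user who typed the URL directly (no Referer).
    {"User-Agent": "Mozilla/5.0"},
    # A search engine crawler.
    {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
]


def is_cloaked(url: str, fetch_landing: Callable[[str, Dict[str, str]], str]) -> bool:
    """Visit the URL once per profile; flag cloaking if the landing pages differ."""
    landings = {fetch_landing(url, headers) for headers in PROFILES}
    return len(landings) >= 2


# Fake fetchers standing in for real HTTP requests.
def fake_cloaker(url, headers):
    # Redirects only visitors who arrive from a search engine.
    return "http://fakeav.example/scan" if "Referer" in headers else url


def honest_site(url, headers):
    return url


print(is_cloaked("http://x.example/", fake_cloaker))
print(is_cloaked("http://x.example/", honest_site))
```

Passing the fetcher in as a parameter keeps the heuristic itself independent of any particular HTTP client.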
17 + Domain Registrations Gadget Blacklists are one of the best-known techniques to protect against web malware. Domain-based blacklists contain domains that have been discovered to host malicious content. Seed: all the domains that are known to host malicious pages, together with domain registration records, which are freely available online. Expansion: flags domains registered close in time to known malicious domains, then creates candidate URLs by taking the URL of the closest known malicious domain and replacing its domain with the flagged one. This gadget does not use search engines, but it still follows the guided search approach when creating the URLs.
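The URL-creation step can be sketched as below. This is a simplified illustration under stated assumptions: registrations are given as a list already sorted by registration time, only the immediate neighbours of a seed are flagged, and the function name `candidate_urls` and the example domains are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_urls(registrations, malicious_urls):
    """Build candidate URLs on domains registered right before/after a seed.

    registrations: domain names in registration order (assumed input format).
    malicious_urls: known malicious URLs; their hostnames act as seed domains.
    """
    seed_by_domain = {urlsplit(u).hostname: u for u in malicious_urls}
    candidates = []
    for i, domain in enumerate(registrations):
        if domain not in seed_by_domain:
            continue
        parts = urlsplit(seed_by_domain[domain])
        for j in (i - 1, i + 1):  # the registrations adjacent to the seed
            if 0 <= j < len(registrations) and registrations[j] not in seed_by_domain:
                # Keep the seed URL's path and query, swap in the flagged domain.
                url = urlunsplit(parts._replace(netloc=registrations[j]))
                if url not in candidates:
                    candidates.append(url)
    return candidates


regs = ["aaa.example", "bad1.example", "bbb.example", "ccc.example"]
found = candidate_urls(regs, ["http://bad1.example/in.php?id=7"])
print(found)
```

Reusing the path of the known malicious URL reflects the idea that domains registered together often serve the same kit at the same location.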
18 + DNS Queries Gadget Analyzes recursive DNS traces to identify the domain names of compromised landing pages that are likely to lead to malicious pages. Seed: all domains known to host malicious pages. Expansion: relies on the observation that a large number of infected pages contain links to a single malicious page, and that DNS traces (partially) expose these connections.
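One way to picture how a DNS trace exposes these connections: a client that is redirected from a compromised landing page to a malicious page resolves the landing domain shortly before the malicious one. The sketch below counts such co-occurrences. The trace format, the 5-second window, and the name `landing_candidates` are assumptions for illustration, not details from the slides.

```python
from collections import Counter


def landing_candidates(trace, malicious_domains, window=5.0):
    """Count domains resolved by a client shortly before a known malicious one.

    trace: iterable of (timestamp_seconds, client_id, domain), sorted by time.
    window: assumed look-back period in seconds.
    """
    recent = {}        # client_id -> list of (timestamp, domain) still in the window
    counts = Counter()
    for ts, client, domain in trace:
        history = recent.setdefault(client, [])
        # Forget this client's queries that are older than the window.
        history[:] = [(t, d) for t, d in history if ts - t <= window]
        if domain in malicious_domains:
            preceding = {d for _, d in history if d not in malicious_domains}
            for d in preceding:
                counts[d] += 1
        history.append((ts, domain))
    return counts


trace = [
    (0.0, "c1", "news.example"),
    (1.0, "c1", "blog.example"),
    (2.0, "c1", "evil.example"),
    (3.0, "c2", "blog.example"),
    (4.0, "c2", "evil.example"),
    (100.0, "c1", "shop.example"),
    (200.0, "c1", "evil.example"),
]
counts = landing_candidates(trace, {"evil.example"})
print(counts.most_common())
```

Domains that precede the malicious domain for several independent clients are the stronger candidates for compromised landing pages.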
19 + Effectiveness of EVILSEED Two key metrics measure the effectiveness of EVILSEED: Toxicity: the fraction of the URLs submitted to the oracles that are malicious. Higher toxicity implies that the resources needed to analyze a page are used more efficiently. Expansion: the average number of new malicious URLs that EVILSEED finds for each seed. A higher seed expansion indicates that a larger number of malicious URLs is found for each malicious seed URL. There is a trade-off between toxicity and seed expansion.
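Written as formulas, the two metrics above look as follows. The symbols are introduced here for illustration and do not come from the slides: S is the set of URLs submitted to the oracles, M is the subset of S found to be malicious, E is the set of seeds, and M_new is the set of malicious URLs found that were not already seeds.

```latex
\[
\text{toxicity} = \frac{|M|}{|S|},
\qquad
\text{expansion} = \frac{|M_{\text{new}}|}{|E|}
\]
```

One way to read the trade-off: submitting more candidate URLs per seed can raise the number of new malicious URLs found, but it also enlarges S, which tends to lower toxicity if the additional candidates are less likely to be malicious.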
20 + A Test Run EVILSEED ran in parallel with a traditional crawler for 25 days. Malicious URLs found by the crawler were added to the EVILSEED seeds. Oracles used: Wepawet, Google Safe Browsing, and a custom fake AV detector. All gadgets were used except the DNS queries gadget (no access to DNS trace datasets) and the domain registrations gadget (not fully developed).
21 + A Test Run EVILSEED was assessed against two approaches to finding malicious web pages: o Random search (sending queries to search engines). o Traditional crawler with fast prefilter. The web queries for the random search were generated from: o Random alphabetic phrases, composed of 1 to 5 words, each 3 to 10 characters long (e.g., "asdf qwerou"); o Random phrases of 1 to 5 words taken from the English dictionary (e.g., "happy cat"); o Trending topics taken from Twitter and Google Hot Trends (e.g., "black friday 2011"); o Manually generated Google dorks, taken from an online repository (e.g., allinurl:forcedownload.php?file=, which locates vulnerable WordPress sites).
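The first two query types in the random-search baseline can be sketched directly from their description. The function names and the tiny word list are hypothetical; a real run would draw from a full English dictionary.

```python
import random
import string


def random_alpha_phrase(rng):
    """1 to 5 random lowercase words, each 3 to 10 characters long."""
    n_words = rng.randint(1, 5)  # randint is inclusive on both ends
    return " ".join(
        "".join(rng.choice(string.ascii_lowercase) for _ in range(rng.randint(3, 10)))
        for _ in range(n_words)
    )


def random_dictionary_phrase(rng, dictionary):
    """1 to 5 words drawn at random from the given word list."""
    return " ".join(rng.choice(dictionary) for _ in range(rng.randint(1, 5)))


rng = random.Random(0)  # seeded so runs are repeatable
words = ["happy", "cat", "river", "stone", "window"]  # toy word list
print(random_alpha_phrase(rng))
print(random_dictionary_phrase(rng, words))
```

Queries like these have no connection to known malicious pages, which is what makes them a useful baseline for the guided search.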
22 + Results EVILSEED: o submitted 226,140 URLs to the oracles. o 3,036 URLs were found malicious. o toxicity of 1.34%. The crawler & prefilter: o submitted 437,251 URLs to the oracles. o 604 URLs were found malicious (these are the URLs used as seeds for EVILSEED). o toxicity of 0.14%, which is an order of magnitude less than EVILSEED. The web search: o submitted 63,936 URLs to the oracles. o 219 URLs were found malicious. o toxicity of 0.34%.
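The toxicity figures follow directly from the counts on this slide. The short check below recomputes them; only the variable and function names are introduced here.

```python
# (URLs submitted to the oracles, URLs found malicious), as reported above.
results = {
    "EVILSEED": (226_140, 3_036),
    "Crawler & prefilter": (437_251, 604),
    "Web search": (63_936, 219),
}


def toxicity(submitted, malicious):
    """Fraction of the submitted URLs that the oracles found malicious."""
    return malicious / submitted


for name, (submitted, malicious) in results.items():
    print(f"{name}: {toxicity(submitted, malicious):.2%}")

# How many times more toxic EVILSEED's URL stream is than each baseline.
evilseed = toxicity(*results["EVILSEED"])
vs_crawler = evilseed / toxicity(*results["Crawler & prefilter"])
vs_search = evilseed / toxicity(*results["Web search"])
print(f"vs crawler: {vs_crawler:.1f}x, vs web search: {vs_search:.1f}x")
```

The ratio against the crawler comes out just under ten, which is the "order of magnitude" difference stated on the slide.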
23 + Results EVILSEED clearly outperforms both crawling (1.34% vs. 0.14%) and web searching (1.34% vs. 0.34%) in toxicity. Adding even relatively few new pages to the set of evil seeds enabled EVILSEED to locate significant numbers of additional malicious pages.
24 + Does EVILSEED find malicious URLs on different domains? EVILSEED: 6.14 malicious pages per domain. Crawler & fast prefilter: 6.16 malicious pages per domain. The results show that EVILSEED maintains the same domain coverage as the crawler.
25 + Links Gadget evaluation The Links gadget located malicious content through three main categories of sites: o Unmaintained websites: the gadget found malicious content linked from such websites. o Domains that publish blacklists of malicious domains: the gadget was able to automatically discover and parse these sources. o Domains that list additional information about a domain: for a given domain, such a site lists all domains on the same IP, domains hosted in the same subnet, and domains with similar spelling.
26 + Content Dorks Gadget evaluation The most important factor in the success of this gadget was found to be n, the length of the n-grams. Shorter n-grams are usually found in more pages. Toxicity for the results of queries ranged from 1.21% for 2-grams to 5.83% for 5-grams. Shorter n-grams mean that more pages compete for the top spots in the search engine rankings. The ten most successful dorks in terms of toxicity were five 2-grams and five 3-grams.
27 + SEO Gadget evaluation During the test run, this gadget performed poorly because its seeds, at the time they were found, did not belong to a live SEO campaign. To evaluate it separately, the top trends from Twitter and Google Hot Trends were fetched hourly, searched for on Google, and the results analyzed with the cloaking detection heuristic. The detected URLs were then fed as seeds to the SEO gadget. The ratio of malicious pages found to pages visited is 0.93%, nearly two orders of magnitude (about 49 times) higher than the crawler (0.019%).
28 + Domain Registrations Gadget evaluation Domain registrations for the top-level domains .com, .net, .org, .info, and .us were collected over a year's time. The gadget identified malicious URLs on 10,435 domains using 1,002 domains as seeds. Hypothesis: malicious domains are registered close in time to each other. o Given one malicious domain, at least one of the registrations that come immediately before or after it is also malicious. The data collected over the year showed that these two events are correlated. Conclusion: domains that have been registered immediately before or after a known malicious domain are more than 35 times more likely to also serve malicious content.
29 + DNS Queries Gadget evaluation Testing: an Internet Service Provider (ISP) provided access to a DNS trace collected from its network during 43 days in February and March ,472,280 queries sent by 30,000 clients. The trace was made available only towards the end of the collection period, which caused a delay between the collection of the data and the time when the gadget was run. Seed: 115 known malicious domains from the trace. Expansion: the gadget generated 4,820 URLs on 2,473 domains. Result: o 171 URLs on 62 domains were identified as malicious. o Only 25 of the 115 seed domains led to finding malicious URLs. o The most effective seed domain guided the gadget to 46 malicious URLs on 16 different servers. o 21 seed domains led to multiple malicious URLs. o The delay explains why no malicious URLs were found for the remaining 90 seed domains.
30 + Discussion and Limitations Security analysis: EVILSEED works by searching for and finding malicious URLs. o An attacker with full control of an exploited website can hide pages so that they won't be indexed by search engines. o Attackers could also try to perform evasion attacks against the detection techniques employed by the oracles (Wepawet, the custom fake AV page detector, and the Safe Browsing system). Would attackers choose to hide their pages from search engines? What if we connect EVILSEED to another oracle?
31 + Discussion and Limitations Seed quality: The effectiveness of the gadgets depends on the quality and diversity of the malicious seed that they use as input. Results over time: For EVILSEED to be useful, it needs to provide a constant stream of high-quality URLs rather than exhausting its effect after one or a few runs.
32 + Discussion and Limitations Performance and Scalability: The bottleneck of EVILSEED is the cost of performing in-depth analysis with an oracle. EVILSEED runs on two servers: o Crawler: gathers millions of URLs. o Gadget: 100k URLs per search engine. Deployment: Search engines could deploy EVILSEED. This might diminish its effectiveness, but it would also mean that the vectors EVILSEED targets have been mitigated.
33 + Conclusion An important component of defense is the ability to identify as many malicious web pages on the Internet as possible in an efficient manner. The goal of EVILSEED was to improve the effectiveness of the search process for malicious web pages by leveraging a seed of known, malicious web pages and extracting characterizing similarities that these pages share.
34 + Thank you..