
Application of Machine Learning and Crowdsourcing to Detection of Cybersecurity Threats

February 2011

Eugene Fink, Mehrbod Sharifi, and Jaime G. Carbonell
eugenefink@cmu.edu, mehrbod@cs.cmu.edu, jgc@cs.cmu.edu
Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
phone: (412) 268-6593

Research sponsor: Department of Homeland Security

Abstract

We are applying machine learning and crowdsourcing to cybersecurity, with the purpose of developing a toolkit for detecting complex cyber threats that often evade traditional tools. It will serve as an extra layer of armor that supplements the standard defenses. The initial results include (1) an architecture for sharing security warnings among users and (2) machine learning techniques for identifying malicious websites. The public release of the developed system is available at http://cyberpsa.com. This project is part of the work on advanced data analysis at the CCICADA Center of Excellence.

Keywords: Cybersecurity, web scam, machine learning, crowdsourcing.

Introduction

We can broadly divide cybersecurity threats into two categories. The first is vulnerabilities caused by factors outside the end user's control, such as security flaws in applications and protocols. The traditional remedies include using firewalls and antivirus software, distributing patches that fix newly discovered problems, and amending protocols. While the defense against such threats is still an ongoing battle, software engineers have been effective in countering most of them and reducing the risk to an acceptable level in most cases.

The second category, which has historically received less attention, includes problems caused by careless user actions. For example, an attacker may convince inexperienced users to install a fake antivirus, which in reality corrupts their computers. As another example, an attacker may use deceptive email and web advertisements, as well as phishing [Kumaraguru et al., 2009], to trick users into falling victim to scams that go beyond traditional software attacks, such as disclosing sensitive information or paying for fake product offers. The number of such threats has grown in recent years, as more and more people conduct their daily activities through the Internet, giving attackers opportunities to exploit user naïveté. While web browsers and operating systems now include some defenses against such threats, they are often insufficient. Attackers have been effective in finding ways to trick users into bypassing the security barriers. The detection of such threats is difficult for both humans and automated systems because malicious websites tend to look legitimate and use effective deception techniques.

To improve defenses against these threats, we have taken a crowdsourcing approach, combined with machine learning and natural language processing. We are working on a distributed system that enables users to report threats spotted on the web, and applies machine learning to integrate their reports. This idea is analogous to user-review mechanisms, where people share their experiences with specific products. The novel characteristics of the developed system are as follows.

• Integration with crowdsourced question answering, similar to Yahoo Answers, which helps to encourage user participation.
• Application of machine learning and language processing to analyze user feedback.
• Synergy of user feedback with automated threat detection.

From the user's point of view, the developed system acts as a personal security assistant. It gathers relevant information, learns from the user's feedback, and helps the user to identify websites that may pose a threat. The initial work has led to the development of a crowdsourcing architecture, as well as machine learning algorithms for detecting two specific security threats: scam websites and cross-site request forgery.

Figure 1. The main screen of the SmartNotes architecture.

Crowdsourcing architecture

We have developed an architecture, called SmartNotes, that helps users to share their experience related to web threats, and integrates the wisdom gathered from all its users. It enables users to rate websites, post comments, and ask and answer related questions. Furthermore, it combines human opinions with automated threat detection.

User interface: The system's main screen (Figure 1) allows making comments and asking questions about a specific website. The user can select a rating (positive, neutral, or negative), add comments, and post questions to be answered by other users. By default, the comments are for the currently open web page, but the user can also post comments for the entire web domain. For instance, when she is looking at a specific product on Amazon, she may enter notes about that product page or about the entire amazon.com service. The user can specify whether her notes are private, visible to her friends, or public. When the user visits a webpage, she can read notes by others about it.

She can also search the entire database of notes about all webpages. In addition, the user can invoke automated scam detection, which calculates the chances that a given webpage poses a threat.

Figure 2. The distributed crowdsourcing architecture. The SmartNotes service collects comments of multiple users. The Host Analyzer service gathers data about websites from trusted online sources and uses them to calculate the chances that a given website poses a threat.

Main components: The distributed system consists of three components (solid boxes in Figure 2), which communicate through HTTP requests (dashed lines in Figure 2). The SmartNotes browser extension provides the graphical user interface; it is written in JavaScript and uses the Chrome extension API to interact with the browser. The SmartNotes web service is written in C#.NET and includes a SQL Server database. It exposes methods for reading and writing notes, and supports other actions available to the users, such as login and account administration. The Host Analyzer web service is also written in C#.NET. It includes all data-analysis algorithms, such as scam detection, parsing of user comments, and integration of user opinions with the automated threat detection.
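The paper does not give the services' API details; the Python sketch below only illustrates the kind of HTTP exchange between a client playing the role of the browser extension and the two web services. The base URLs, endpoint paths, and payload fields are assumptions made for illustration, not the actual interface (the real extension is JavaScript and the services are C#.NET).

```python
# Hypothetical sketch of the extension-to-service HTTP exchange described above.
# Base URLs, endpoint paths, and payload fields are illustrative assumptions only.
import requests

SMARTNOTES_URL = "https://smartnotes.example/api"        # assumed service location
HOST_ANALYZER_URL = "https://hostanalyzer.example/api"   # assumed service location

def post_note(token, url, rating, comment, visibility="public"):
    """Send a rating and a free-text note about a page to the SmartNotes service."""
    payload = {
        "token": token,            # user credential (assumed field)
        "url": url,                # page or domain the note refers to
        "rating": rating,          # "positive", "neutral", or "negative"
        "comment": comment,
        "visibility": visibility,  # "private", "friends", or "public"
    }
    return requests.post(f"{SMARTNOTES_URL}/notes", json=payload, timeout=10).json()

def threat_estimate(url):
    """Ask the Host Analyzer for the estimated chance that a site poses a threat."""
    response = requests.get(f"{HOST_ANALYZER_URL}/score", params={"url": url}, timeout=10)
    return response.json()  # e.g. {"url": ..., "p_malicious": 0.07}
```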

Detection of scam websites

Web scam is fraudulent or intentionally misleading information posted on the web, such as false promises to help find work at home or cure various diseases, usually with the purpose of tricking people into sending money or disclosing sensitive information. The challenge of detecting such scams is largely unaddressed. For legal reasons, search engines are reluctant to block scammers unless they have specific, strong proof of fraudulent activity, such as confirmed instances of malware distribution. The initial research on scam detection includes the work of Anderson et al. [2007], who analyzed spam email to extract addresses of scam websites, and that of Cormack et al. [2010], who addressed the problem of preventing scammers from tricking search engines into giving them undeservedly high rankings.

Currently, the most common approach to fighting web scam is blacklisting. Several online services maintain lists of suspicious websites, usually compiled through user reports. For example, Web of Trust (mywot.com) allows users to rate webpages on vendor reliability, trustworthiness, privacy, and child safety, and displays the average ratings. As another example, hosts-file.net and spamcop.net provide databases of malicious sites. Blacklisting, however, has several limitations. In particular, a list may not include recently created scam websites, as well as old sites moved to new domain names. Also, it may mistakenly include legitimate sites because of inaccurate or intentionally biased reports.

We are developing a system that reduces the omissions and biases in blacklists by integrating information from various heterogeneous sources, particularly focusing on quantitative measurements that are hard to manipulate. We have created a web service, called Host Analyzer (Figure 2), for gathering information about websites from various trusted online sources that provide such data.

It currently collects forty-three features describing websites, drawn from eleven sources. Examples of these features include ratings and traffic ranks for a given website; the geographic location of the website's server; and the number of positive and negative comments provided through Web of Trust and other similar services. We have applied logistic regression with L1 regularization [Schmidt et al., 2007] to evaluate the chances that a specific website poses a security threat. The learning module constructs a classifier based on a database of known legitimate and malicious websites, and the system then uses it to estimate the probability that previously unseen websites are malicious. We have tested it using ten-fold cross-validation on a database of 837 manually labeled websites. The precision of this technique is 98.0%; the recall is 98.1%; and the AUC measure, defined as the area under the ROC curve, is 98.6%. Intuitively, these results mean that the system correctly determines whether a website is malicious in 49 out of 50 cases.
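The classifier is described only at a high level; as a rough illustration, the sketch below reproduces the same evaluation setup, L1-regularized logistic regression scored by ten-fold cross-validation on precision, recall, and AUC, using scikit-learn on synthetic stand-in data. The generated features and labels are placeholders, not the 837 labeled sites or the 43 Host Analyzer features used in the study.

```python
# Rough sketch of the scam classifier and its evaluation: logistic regression with
# L1 regularization, scored by ten-fold cross-validation on precision, recall, and
# AUC. The data below is a synthetic stand-in for the 837 labeled websites and the
# 43 features collected by Host Analyzer (ratings, traffic ranks, server location,
# Web of Trust comment counts, and so on).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=837, n_features=43, n_informative=12,
                           random_state=0)  # placeholder data; label 1 = malicious

# liblinear supports the L1 penalty; C controls the regularization strength.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

scores = cross_validate(clf, X, y, cv=10,
                        scoring=("precision", "recall", "roc_auc"))
print("precision: %.3f" % scores["test_precision"].mean())
print("recall:    %.3f" % scores["test_recall"].mean())
print("AUC:       %.3f" % scores["test_roc_auc"].mean())
```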

Detection of cross-site request forgery

A cross-site request forgery (CSRF) is an attack through a web browser, in which a malicious website uses a trusted browser session to send unauthorized requests to a target site [Barth et al., 2008]. For example, Zeller and Felten [2008] described CSRF attacks that stole the user's email address and performed unauthorized money transfers.

When a user visits a website, the browser creates a session cookie that accompanies all subsequent requests from all browser windows while the session is active, thus enabling web applications to maintain the state of their interaction with the user. The browser provides the session information even if the request is generated by a different website. If the user has an active session with site1.com, all requests sent to site1.com include that information. If the user opens a (possibly malicious) site2.com, which generates a (possibly unauthorized) request to site1.com, that request will also include the site1.com session information. This functionality is essential because some sites, such as advertising and payment-processing servers, maintain the transaction state of requests from multiple domains; however, it creates the vulnerability exploited by CSRF. A web application cannot determine whether a request comes from the user or from a malicious site, since it contains the same session information in both cases. The existing defenses require the developers of web applications to adopt certain protocols. While these defenses are effective, developers occasionally fail to implement them properly.
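The paper does not include code for this mechanism; the deliberately vulnerable Flask sketch below (a hypothetical example, not the authors' code) shows the request pattern being exploited: a state-changing endpoint authorized only by the session cookie, which the browser may attach to requests triggered by other sites.

```python
# Deliberately vulnerable sketch (assumed example, not from the paper) of why CSRF
# works: the endpoint authorizes a state-changing action using only the session
# cookie, with no anti-CSRF token and no origin check, so any page that makes the
# logged-in user's browser fetch the URL can trigger the action. Modern SameSite
# cookie defaults reduce, but do not eliminate, this class of risk.
from flask import Flask, request, session

app = Flask(__name__)
app.secret_key = "demo-only-secret"  # placeholder

@app.route("/login", methods=["POST"])
def login():
    # Establishes the session cookie that later requests will carry automatically.
    session["user"] = request.form["user"]
    return "logged in"

@app.route("/transfer")
def transfer():
    # Vulnerable: the only check is the session cookie; a state-changing GET makes
    # the forgery as easy as embedding a link or image pointing at this URL.
    if "user" not in session:
        return "not logged in", 403
    return f"sent {request.args['amount']} to {request.args['to']}"

if __name__ == "__main__":
    app.run(debug=True)
```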

Figure 3. Example graph of cross-site requests, where the nodes are domains and the edges are requests. The solid nodes are the domains visited by the user, whereas the unfilled nodes are accessed indirectly through cross-site requests. The dashed lines are CSRF attacks.

We are working on a machine learning technique for enhancing the standard defenses, which prevents attacks against unprotected sites by spotting malicious HTTP requests. It learns patterns of legitimate requests, detects deviations from these patterns, and warns the user about potentially malicious sites and requests. We represent patterns of requests by a directed graph, where the nodes are web domains and the edges are HTTP requests. We show an example in Figure 3, where the solid nodes are domains visited by the user, and the unfilled nodes are domains accessed indirectly, through requests from the visited domains. In the example of Figure 3, all sites except Bank show advertising materials from the Ads server. Furthermore, both Email and Bank show a news bar, which requires cross-site requests to News. A CSRF attack occurs when the Malicious site sends forged requests, shown by dashed lines, to Email and Bank.

If there are no active browser sessions when the system starts building the graph, a CSRF attack cannot occur on the first visit to a website. Therefore, when the system adds a new node, its first incoming edge is a legitimate request. In the naïve version, we allow no incoming requests for the directly accessed (solid) nodes and only one incoming edge for every indirectly accessed (unfilled) node. If the system detects requests that do not match this pattern, it considers them suspicious. In the example of Figure 3, the system would only allow requests from the solid nodes to their nearby unfilled nodes within the same corner of the graph. It would give warnings for requests between different corners, such as a request from Bank to News. The justification for this approach comes from the observation that most legitimate requests are due to the web application design, in which the contents are distributed across servers.

While the naïve approach is effective for spotting attacks, it produces numerous false positives, that is, warnings for legitimate requests. In the example of Figure 3, it would produce warnings when multiple sites generate requests to Ads and News. To prevent such false positives, we use the observation that, when a site receives legitimate requests from multiple domains, it usually receives requests from a large number of domains. Thus, the most suspicious case is when a domain receives requests from two or three sites, whereas the situation when it receives requests from tens of sites is usually normal. The system thus identifies domains with a large number of incoming edges and does not give warnings for HTTP requests sent to them. We also use two heuristics to improve identification of legitimate requests.

Trusted websites: The system automatically estimates domain trustworthiness, as described in the previous section, and does not warn about any requests from trustworthy domains.

Sensitive data: The system identifies sessions that are likely to involve sensitive data, and uses stricter thresholds for spotting potentially malicious requests that affect these sessions. It views a session as sensitive if either (1) the user has entered a password when starting the session or (2) the related website uses the HTTPS protocol rather than HTTP.
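As a rough sketch of the request-graph bookkeeping just described (not the authors' implementation, which is part of the C#.NET services), the class below records incoming edges per domain, permits the first edge to an indirectly reached domain, exempts trusted sources and high in-degree "hub" domains, and flags everything else. The hub threshold and the class and method names are assumptions for illustration.

```python
# Simplified sketch of the naive request-graph monitor with the hub and
# trusted-domain exemptions. The threshold value and all names are illustrative.
from collections import defaultdict

HUB_THRESHOLD = 10  # assumed: a target with this many distinct sources is a "hub" (ads, news)

class RequestGraph:
    def __init__(self, trusted=None):
        self.sources = defaultdict(set)    # target domain -> source domains seen so far
        self.visited = set()               # domains the user opened directly (solid nodes)
        self.trusted = set(trusted or ())  # trustworthy domains (see the previous section)

    def visit(self, domain):
        """The user navigates directly to a domain (a solid node in Figure 3)."""
        self.visited.add(domain)

    def cross_site_request(self, source, target):
        """Record a request from source to target; return True if it should trigger a warning."""
        seen = self.sources[target]
        if source in self.trusted or len(seen) >= HUB_THRESHOLD:
            suspicious = False             # trusted sources and hub targets never warn
        elif target in self.visited:
            suspicious = True              # directly visited nodes accept no cross-site requests
        else:
            suspicious = bool(seen) and source not in seen  # only the first edge is legitimate
        seen.add(source)
        return suspicious

# Example corresponding to Figure 3:
g = RequestGraph()
g.visit("email.example"); g.visit("bank.example")
print(g.cross_site_request("email.example", "news.example"))      # False: first edge to News
print(g.cross_site_request("malicious.example", "bank.example"))  # True: forged request to Bank
```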

System release

We have implemented the initial crowdsourcing system as a Chrome browser extension, available at http://cyberpsa.com. This public release includes mechanisms for the manual rating of websites and for sharing free-text comments about potential threats, as well as the initial automated mechanism for evaluating the chances that a website poses a threat.

Future work

We will continue the work on applying machine learning and crowdsourcing to automated and semi-automated detection of various threats. The specific goals are as follows.

• Detection of newly evolving threats, which are not yet addressed by the standard defenses.
• Detection of cyber attacks by their observed symptoms, in addition to the traditional approach of directly analyzing the attacking code, which will help to identify new reimplementations of known malware.
• Detection of nontraditional threats that go beyond malware attacks, such as posting misleading claims with the purpose of defrauding users rather than corrupting their computers.

References

[Anderson et al., 2007] David S. Anderson, Chris Fleizach, Stefan Savage, and Geoffrey M. Voelker. Spamscatter: Characterizing Internet scam hosting infrastructure. In Proceedings of the Sixteenth USENIX Security Symposium, 2007.

[Barth et al., 2008] Adam Barth, Collin Jackson, and John C. Mitchell. Robust defenses for cross-site request forgery. In Proceedings of the Fifteenth ACM Conference on Computer and Communications Security, pages 75–88, 2008.

[Cormack et al., 2010] Gordon V. Cormack, Mark D. Smucker, and Charles L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Department of Computer Science, University of Waterloo, 2010. Unpublished manuscript.

[Kumaraguru et al., 2009] Ponnurangam Kumaraguru, Justin Cranshaw, Alessandro Acquisti, Lorrie Cranor, Jason Hong, Mary Ann Blair, and Theodore Pham. School of phish: A real-world evaluation of anti-phishing training. In Proceedings of the Fifth Symposium on Usable Privacy and Security, pages 1–12, 2009.

[Schmidt et al., 2007] Mark Schmidt, Glenn Fung, and Rómer Rosales. Fast optimization methods for L1 regularization: A comparative study and two new approaches. In Proceedings of the European Conference on Machine Learning, pages 286–297, 2007.

[Sharifi et al., 2010] Mehrbod Sharifi, Eugene Fink, and Jaime G. Carbonell. Learning of personalized security settings. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pages 3428–3432, 2010.

[Zeller and Felten, 2008] William Zeller and Edward W. Felten. Cross-site request forgeries: Exploitation and prevention. Computer Science Department, Princeton University, 2008. Unpublished manuscript.