
Application of Machine Learning and Crowdsourcing to Detection of Cybersecurity Threats

February 2011

Eugene Fink, Mehrbod Sharifi, and Jaime G. Carbonell
eugenefink@cmu.edu, mehrbod@cs.cmu.edu, jgc@cs.cmu.edu
Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
phone: (412) 268-6593

Research sponsor: Department of Homeland Security

Abstract

We are applying machine learning and crowdsourcing to cybersecurity, with the purpose of developing a toolkit for detecting complex cyber threats that often evade traditional tools. It will serve as an extra layer of armor that supplements the standard defenses. The initial results include (1) an architecture for sharing security warnings among users and (2) machine learning techniques for identifying malicious websites. The public release of the developed system is available at http://cyberpsa.com. This project is part of the work on advanced data analysis at the CCICADA Center of Excellence.

Keywords: Cybersecurity, web scam, machine learning, crowdsourcing.

Introduction

We can broadly divide cybersecurity threats into two categories. The first is vulnerabilities caused by factors outside the end user's control, such as security flaws in applications and protocols. The traditional remedies include using firewalls and antivirus software, distributing patches that fix newly discovered problems, and amending protocols. While the defense against such threats is still an ongoing battle, software engineers have been effective in countering most of them and reducing the risk to an acceptable level in most cases.

The second category, which has historically received less attention, includes problems caused by careless user actions. For example, an attacker may convince inexperienced users to install a fake antivirus, which in reality corrupts their computers. As another example, an attacker may use deceptive email and web advertisements, as well as phishing [Kumaraguru et al., 2009], to trick users into falling victim to scams that go beyond traditional software attacks, such as disclosing sensitive information or paying for fake product offers. The number of such threats has grown in recent years, as more and more people conduct their daily activities through the Internet, giving attackers opportunities to exploit user naïveté. While web browsers and operating systems now include some defenses against such threats, they are often insufficient. Attackers have been effective in finding ways to trick users into bypassing the security barriers. The detection of such threats is difficult for both humans and automated systems because malicious websites tend to look legitimate and use effective deception techniques.

To improve defenses against these threats, we have taken a crowdsourcing approach, combined with machine learning and natural language processing. We are working on a distributed system that enables users to report threats spotted on the web, and applies machine learning to integrate their reports. This idea is analogous to user-review mechanisms, where people share their experiences with specific products. The novel characteristics of the developed system are as follows.

• Integration with crowdsourced question answering, similar to Yahoo Answers, which helps to encourage user participation.
• Application of machine learning and language processing to analyze user feedback.
• Synergy of user feedback with automated threat detection.

From the user's point of view, the developed system acts as a personal security assistant. It gathers relevant information, learns from the user's feedback, and helps the user to identify websites that may pose a threat. The initial work has led to the development of a crowdsourcing architecture, as well as machine learning algorithms for detecting two specific security threats: scam websites and cross-site request forgery.

Figure 1. The main screen of the SmartNotes architecture.

Crowdsourcing architecture

We have developed an architecture, called SmartNotes, that helps users to share their experience related to web threats, and integrates the wisdom gathered from all its users. It enables users to rate websites, post comments, and ask and answer related questions. Furthermore, it combines human opinions with automated threat detection.

User interface: The system's main screen (Figure 1) allows making comments and asking questions about a specific website. The user can select a rating (positive, neutral, or negative), add comments, and post questions to be answered by other users. By default, the comments are for the currently open web page, but the user can also post comments for the entire web domain. For instance, when she is looking at a specific product on Amazon, she may enter notes about that product page or about the entire amazon.com service. The user can specify whether her notes are private, visible to her friends, or public. When the user visits a webpage, she can read notes by others about it.

She can also search the entire database of notes about all webpages. In addition, the user can invoke automated scam detection, which calculates the chances that a given webpage poses a threat.

Figure 2. The distributed crowdsourcing architecture. The SmartNotes service collects comments of multiple users. The Host Analyzer service gathers data about websites from trusted online sources and uses them to calculate the chances that a given website poses a threat.

Main components: The distributed system consists of three components (solid boxes in Figure 2), which communicate through HTTP requests (dashed lines in Figure 2). The SmartNotes browser extension provides the graphical user interface; it is written in JavaScript and uses the Chrome extension API to interact with the browser. The SmartNotes web service is written in C#.NET and includes a SQL Server database. It exposes methods for reading and writing notes, and supports other actions available to the users, such as login and account administration. The Host Analyzer web service is also written in C#.NET. It includes all data-analysis algorithms, such as scam detection, parsing of user comments, and integration of user opinions with the automated threat detection.
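The paper does not give the services' API details; the Python sketch below only illustrates the kind of HTTP exchange between a client playing the role of the browser extension and the two web services. The base URLs, endpoint paths, and payload fields are assumptions made for illustration, not the actual interface (the real extension is JavaScript and the services are C#.NET).

```python
# Hypothetical sketch of the extension-to-service HTTP exchange described above.
# Base URLs, endpoint paths, and payload fields are illustrative assumptions only.
import requests

SMARTNOTES_URL = "https://smartnotes.example/api"        # assumed service location
HOST_ANALYZER_URL = "https://hostanalyzer.example/api"   # assumed service location

def post_note(token, url, rating, comment, visibility="public"):
    """Send a rating and a free-text note about a page to the SmartNotes service."""
    payload = {
        "token": token,            # user credential (assumed field)
        "url": url,                # page or domain the note refers to
        "rating": rating,          # "positive", "neutral", or "negative"
        "comment": comment,
        "visibility": visibility,  # "private", "friends", or "public"
    }
    return requests.post(f"{SMARTNOTES_URL}/notes", json=payload, timeout=10).json()

def threat_estimate(url):
    """Ask the Host Analyzer for the estimated chance that a site poses a threat."""
    response = requests.get(f"{HOST_ANALYZER_URL}/score", params={"url": url}, timeout=10)
    return response.json()  # e.g. {"url": ..., "p_malicious": 0.07}
```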

Detection of scam websites

Web scam is fraudulent or intentionally misleading information posted on the web, such as false promises to help find work at home or cure various diseases, usually with the purpose of tricking people into sending money or disclosing sensitive information. The challenge of detecting such scams is largely unaddressed. For legal reasons, search engines are reluctant to block scammers unless they have specific, strong proof of fraudulent activity, such as confirmed instances of malware distribution. The initial research on scam detection includes the work of Anderson et al. [2007], who analyzed spam email to extract addresses of scam websites, and that of Cormack et al. [2010], who addressed the problem of preventing scammers from tricking search engines into giving them undeservedly high rankings.

Currently, the most common approach to fighting web scam is blacklisting. Several online services maintain lists of suspicious websites, usually compiled through user reports. For example, Web of Trust (mywot.com) allows users to rate webpages on vendor reliability, trustworthiness, privacy, and child safety, and displays the average ratings. As another example, hosts-file.net and spamcop.net provide databases of malicious sites. Blacklisting, however, has several limitations. In particular, a list may not include recently created scam websites, as well as old sites moved to new domain names. Also, it may mistakenly include legitimate sites because of inaccurate or intentionally biased reports.

We are developing a system that reduces the omissions and biases in blacklists by integrating information from various heterogeneous sources, particularly focusing on quantitative measurements that are hard to manipulate. We have created a web service, called Host Analyzer (Figure 2), for gathering information about websites from various trusted online sources that provide such data.

It currently collects forty-three features describing websites, drawn from eleven sources. Examples of these features include ratings and traffic ranks for a given website; the geographic location of the website's server; and the number of positive and negative comments provided through Web of Trust and other similar services. We have applied logistic regression with L1 regularization [Schmidt et al., 2007] to evaluate the chances that a specific website poses a security threat. The learning module constructs a classifier based on a database of known legitimate and malicious websites, and the system then uses it to estimate the probability that previously unseen websites are malicious. We have tested it using ten-fold cross-validation on a database of 837 manually labeled websites. The precision of this technique is 98.0%; the recall is 98.1%; and the AUC measure, defined as the area under the ROC curve, is 98.6%. Intuitively, these results mean that the system correctly determines whether a website is malicious in 49 out of 50 cases.
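The classifier is described only at a high level; as a rough illustration, the sketch below reproduces the same evaluation setup, L1-regularized logistic regression scored by ten-fold cross-validation on precision, recall, and AUC, using scikit-learn on synthetic stand-in data. The generated features and labels are placeholders, not the 837 labeled sites or the 43 Host Analyzer features used in the study.

```python
# Rough sketch of the scam classifier and its evaluation: logistic regression with
# L1 regularization, scored by ten-fold cross-validation on precision, recall, and
# AUC. The data below is a synthetic stand-in for the 837 labeled websites and the
# 43 features collected by Host Analyzer (ratings, traffic ranks, server location,
# Web of Trust comment counts, and so on).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=837, n_features=43, n_informative=12,
                           random_state=0)  # placeholder data; label 1 = malicious

# liblinear supports the L1 penalty; C controls the regularization strength.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

scores = cross_validate(clf, X, y, cv=10,
                        scoring=("precision", "recall", "roc_auc"))
print("precision: %.3f" % scores["test_precision"].mean())
print("recall:    %.3f" % scores["test_recall"].mean())
print("AUC:       %.3f" % scores["test_roc_auc"].mean())
```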

Detection of cross-site request forgery

A cross-site request forgery (CSRF) is an attack through a web browser, in which a malicious website uses a trusted browser session to send unauthorized requests to a target site [Barth et al., 2008]. For example, Zeller and Felten [2008] described CSRF attacks that stole the user's email address and performed unauthorized money transfers.

When a user visits a website, the browser creates a session cookie that accompanies all subsequent requests from all browser windows while the session is active, thus enabling web applications to maintain the state of their interaction with the user. The browser provides the session information even if the request is generated by a different website. If the user has an active session with site1.com, all requests sent to site1.com include that information. If the user opens a (possibly malicious) site2.com, which generates a (possibly unauthorized) request to site1.com, that request will also include the site1.com session information. This functionality is essential because some sites, such as advertising and payment-processing servers, maintain the transaction state of requests from multiple domains; however, it creates the vulnerability exploited by CSRF. A web application cannot determine whether a request comes from the user or from a malicious site, since it contains the same session information in both cases. The existing defenses require the developers of web applications to adopt certain protocols. While these defenses are effective, developers occasionally fail to implement them properly.
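The paper does not include code for this mechanism; the deliberately vulnerable Flask sketch below (a hypothetical example, not the authors' code) shows the request pattern being exploited: a state-changing endpoint authorized only by the session cookie, which the browser may attach to requests triggered by other sites.

```python
# Deliberately vulnerable sketch (assumed example, not from the paper) of why CSRF
# works: the endpoint authorizes a state-changing action using only the session
# cookie, with no anti-CSRF token and no origin check, so any page that makes the
# logged-in user's browser fetch the URL can trigger the action. Modern SameSite
# cookie defaults reduce, but do not eliminate, this class of risk.
from flask import Flask, request, session

app = Flask(__name__)
app.secret_key = "demo-only-secret"  # placeholder

@app.route("/login", methods=["POST"])
def login():
    # Establishes the session cookie that later requests will carry automatically.
    session["user"] = request.form["user"]
    return "logged in"

@app.route("/transfer")
def transfer():
    # Vulnerable: the only check is the session cookie; a state-changing GET makes
    # the forgery as easy as embedding a link or image pointing at this URL.
    if "user" not in session:
        return "not logged in", 403
    return f"sent {request.args['amount']} to {request.args['to']}"

if __name__ == "__main__":
    app.run(debug=True)
```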

Figure 3. Example graph of cross-site requests, where the nodes are domains and the edges are requests. The solid nodes are the domains visited by the user, whereas the unfilled nodes are accessed indirectly through cross-site requests. The dashed lines are CSRF attacks.

We are working on a machine learning technique for enhancing the standard defenses, which prevents attacks against unprotected sites by spotting malicious HTTP requests. It learns patterns of legitimate requests, detects deviations from these patterns, and warns the user about potentially malicious sites and requests. We represent patterns of requests by a directed graph, where the nodes are web domains and the edges are HTTP requests. We show an example in Figure 3, where the solid nodes are domains visited by the user, and the unfilled nodes are domains accessed indirectly, through requests from the visited domains. In the example of Figure 3, all sites except Bank show advertising materials from the Ads server. Furthermore, both Email and Bank show a news bar, which requires cross-site requests to News. A CSRF attack occurs when the Malicious site sends forged requests, shown by dashed lines, to Email and Bank.

If there are no active browser sessions when the system starts building the graph, a CSRF attack cannot occur on the first visit to a website. Therefore, when the system adds a new node, its first incoming edge is a legitimate request. In the naïve version, we allow no incoming requests for the directly accessed (solid) nodes and only one incoming edge for every indirectly accessed (unfilled) node. If the system detects requests that do not match this pattern, it considers them suspicious. In the example of Figure 3, the system would only allow requests from the solid nodes to their nearby unfilled nodes within the same corner of the graph. It would give warnings for requests between different corners, such as a request from Bank to News. The justification for this approach comes from the observation that most legitimate requests are due to the web application design, in which the contents are distributed across servers.

While the naïve approach is effective for spotting attacks, it produces numerous false positives, that is, warnings for legitimate requests. In the example of Figure 3, it would produce warnings when multiple sites generate requests to Ads and News. To prevent such false positives, we use the observation that, when a site receives legitimate requests from multiple domains, it usually receives requests from a large number of domains. Thus, the most suspicious case is when a domain receives requests from two or three sites, whereas the situation when it receives requests from tens of sites is usually normal. The system thus identifies domains with a large number of incoming edges and does not give warnings for HTTP requests sent to them. We also use two heuristics to improve identification of legitimate requests.

Trusted websites: The system automatically estimates domain trustworthiness, as described in the previous section, and does not warn about any requests from trustworthy domains.

Sensitive data: The system identifies sessions that are likely to involve sensitive data, and uses stricter thresholds for spotting potentially malicious requests that affect these sessions. It views a session as sensitive if either (1) the user has entered a password when starting the session or (2) the related website uses the HTTPS protocol rather than HTTP.
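As a rough sketch of the request-graph bookkeeping just described (not the authors' implementation, which is part of the C#.NET services), the class below records incoming edges per domain, permits the first edge to an indirectly reached domain, exempts trusted sources and high in-degree "hub" domains, and flags everything else. The hub threshold and the class and method names are assumptions for illustration.

```python
# Simplified sketch of the naive request-graph monitor with the hub and
# trusted-domain exemptions. The threshold value and all names are illustrative.
from collections import defaultdict

HUB_THRESHOLD = 10  # assumed: a target with this many distinct sources is a "hub" (ads, news)

class RequestGraph:
    def __init__(self, trusted=None):
        self.sources = defaultdict(set)    # target domain -> source domains seen so far
        self.visited = set()               # domains the user opened directly (solid nodes)
        self.trusted = set(trusted or ())  # trustworthy domains (see the previous section)

    def visit(self, domain):
        """The user navigates directly to a domain (a solid node in Figure 3)."""
        self.visited.add(domain)

    def cross_site_request(self, source, target):
        """Record a request from source to target; return True if it should trigger a warning."""
        seen = self.sources[target]
        if source in self.trusted or len(seen) >= HUB_THRESHOLD:
            suspicious = False             # trusted sources and hub targets never warn
        elif target in self.visited:
            suspicious = True              # directly visited nodes accept no cross-site requests
        else:
            suspicious = bool(seen) and source not in seen  # only the first edge is legitimate
        seen.add(source)
        return suspicious

# Example corresponding to Figure 3:
g = RequestGraph()
g.visit("email.example"); g.visit("bank.example")
print(g.cross_site_request("email.example", "news.example"))      # False: first edge to News
print(g.cross_site_request("malicious.example", "bank.example"))  # True: forged request to Bank
```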

System release

We have implemented the initial crowdsourcing system as a Chrome browser extension, available at http://cyberpsa.com. This public release includes mechanisms for the manual rating of websites and for sharing free-text comments about potential threats, as well as the initial automated mechanism for evaluating the chances that a website poses a threat.

Future work

We will continue the work on applying machine learning and crowdsourcing to automated and semi-automated detection of various threats. The specific goals are as follows.

• Detection of newly evolving threats, which are not yet addressed by the standard defenses.
• Detection of cyber attacks by their observed symptoms, in addition to the traditional approach of directly analyzing the attacking code, which will help to identify new reimplementations of known malware.
• Detection of nontraditional threats that go beyond malware attacks, such as posting misleading claims with the purpose of defrauding users rather than corrupting their computers.

References

[Anderson et al., 2007] David S. Anderson, Chris Fleizach, Stefan Savage, and Geoffrey M. Voelker. Spamscatter: Characterizing Internet scam hosting infrastructure. In Proceedings of the Sixteenth USENIX Security Symposium, 2007.

[Barth et al., 2008] Adam Barth, Collin Jackson, and John C. Mitchell. Robust defenses for cross-site request forgery. In Proceedings of the Fifteenth ACM Conference on Computer and Communications Security, pages 75–88, 2008.

[Cormack et al., 2010] Gordon V. Cormack, Mark D. Smucker, and Charles L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Department of Computer Science, University of Waterloo, 2010. Unpublished manuscript.

[Kumaraguru et al., 2009] Ponnurangam Kumaraguru, Justin Cranshaw, Alessandro Acquisti, Lorrie Cranor, Jason Hong, Mary Ann Blair, and Theodore Pham. School of phish: A real-world evaluation of anti-phishing training. In Proceedings of the Fifth Symposium on Usable Privacy and Security, pages 1–12, 2009.

[Schmidt et al., 2007] Mark Schmidt, Glenn Fung, and Rómer Rosales. Fast optimization methods for L1 regularization: A comparative study and two new approaches. In Proceedings of the European Conference on Machine Learning, pages 286–297, 2007.

[Sharifi et al., 2010] Mehrbod Sharifi, Eugene Fink, and Jaime G. Carbonell. Learning of personalized security settings. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pages 3428–3432, 2010.

[Zeller and Felten, 2008] William Zeller and Edward W. Felten. Cross-site request forgeries: Exploitation and prevention. Computer Science Department, Princeton University, 2008. Unpublished manuscript.