White Paper: Applying machine learning techniques to achieve resilient, accurate, high-speed malware detection

Prepared by: Northrop Grumman Corporation, Information Systems Sector, Cyber Solutions Division
05 February 2014
Contact us: 1-855-672-4258, bluvector@ngc.com
WHAT YOUR EXISTING NETWORK SECURITY SYSTEMS ARE NOT DETECTING

In 2012, a sample of 19 organizations worldwide reported 620 confirmed data breaches among more than 47,000 reported network security incidents [1]. These breaches represent successful cyber attacks that circumvented, penetrated, or otherwise went unnoticed by organizations with significant existing investments in network security hardware and software. In 2013, the Ponemon Institute estimated that 60 large US corporations spent on average $11.6 million each to mitigate successful cyber attacks [2]. These corporations on average experienced 2 unmitigated security breaches per week. It is impossible to know how many more security breaches went undiscovered or unreported.

Verizon's analysis of network security breaches found that most are conducted by parties external to the victim organization using malware or hacking techniques, take months to discover, and are most often reported to the victim by a third party [1]. Undetected network breaches represent a real, present, and growing danger to any organization that values its intellectual property, its reputation, and its customers' proprietary data.

The companies in Verizon's survey were not sitting ducks or soft targets. They fully employed common network security devices and best governance practices. Their existing security posture included a combination of advanced perimeter controls and firewalls, encryption technologies, data loss prevention tools, access governance tools, automated policy management tools, employment of certified/expert security personnel, substantial training and awareness activities, and formation of senior-level security leaders and councils.

Figure 1. 82% of network data breaches are discovered outside the victim organization. [1]

[1] Verizon Inc., 2013 Data Breach Investigation Report, http://www.verizonenterprise.com/dbir/2013/, April 2013.
Unrelated parties include all external parties with whom the victim has no business relationship specific to detection services (such as ISPs, intelligence agencies, etc.) and that are not part of any other category.

[2] Ponemon Institute, 2013 Cost of Cyber Crime Study: United States, http://media.scmagazine.com/documents/54/2013_us_ccc_report_final_6-1_13455.pdf, October 2013.
The current suite of network security products, such as firewalls, intrusion prevention systems (IPS), host-based anti-virus software, policy enforcement tools, configuration management tools, and data loss prevention systems, plays a vital role in implementing a defense-in-depth approach to network security. When properly used, these products can block much of the lower-end malicious activity in the wild. They also play an important role in breach remediation by providing an existing framework to remove malware from affected systems once it has been discovered and fingerprinted. Yet these tools are limited to detecting previously observed malware and threat actors and are thus ineffective against more modern threats.

Whether signature-based or based on dynamic behavior analysis, today's malware detection engines (anti-virus software, intrusion detection/prevention systems, etc.) are tuned for certainty. Signatures are written specifically to hit on the malicious software for which they are designed. Detection systems based on dynamic behavior analysis are driven by rules or heuristics learned through careful examination of known malware operations. While behavioral heuristics are more robust to malware changes than signatures, new threat vectors can easily evade them. Both approaches attempt to mitigate their fragile design through broad-based reporting and sharing of malicious signatures. These techniques leave defenders chasing the threat and offer no path to proactive and predictive defense.

A paradigm shift is needed. Proactive search and discovery of threats using real-time, non-signature-based techniques must replace antiquated signature- and behavior-based techniques. Automated identification of threats must reduce the discovery timeline from months to minutes. Defenders must head off an attack before it occurs rather than waiting for someone outside their organization to alert them to a breach.
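The fragility of signatures described above can be illustrated with a minimal sketch. It assumes the simplest possible kind of signature, a cryptographic hash of the whole file; real products often use byte patterns or rules instead, and the sample bytes and the names `signature_db` and `matches_signature` are hypothetical placeholders, not taken from any product.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-in for a previously observed malicious file.
known_sample = b"MZ\x90\x00" + b"example payload bytes" * 4

# The "signature database": hashes of files that were already seen and verified.
signature_db = {sha256_hex(known_sample)}


def matches_signature(data: bytes) -> bool:
    """A file is flagged only if its exact hash is already in the database."""
    return sha256_hex(data) in signature_db


# A trivially modified variant: one extra byte appended to the same file.
variant = known_sample + b"\x00"

print(matches_signature(known_sample))  # True: this exact file was seen before
print(matches_signature(variant))       # False: a one-byte change evades the match
```

Under these assumptions the detector is certain about what it has seen and blind to everything else, which is the trade-off the preceding paragraphs describe.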
FINDING WHAT THE OTHERS LEAVE BEHIND

As Figure 1 illustrates, existing solutions are ineffective against the modern threat. Furthermore, manual inspection of the hundreds of thousands of daily network transactions is infeasible. A new high-speed automated platform for discovering malware and threats is needed.

Signatures: Nothing New Here

Pure signature-based detection mechanisms are fast and can keep up with high-speed network traffic up to 10 Gbps. This is the most common detection technique in use today, and it has been around the longest. The challenge with signature-based techniques, however, is well known: they are fragile, with no ability to handle previously unseen malware. Even for recently discovered malware, signature-based systems may lag behind the discovery for days, weeks, or even months while vendors verify the malware and write, test, and distribute updates, and while system administrators install those updates. Ultimately, signatures find no new malware, only new-to-you malware.

Sandboxes: Too Slow to Keep Up

To address the shortcomings of signature-based systems, defenders are now using specialized, carefully controlled virtual environments, called sandboxes, to execute and examine the behavior of suspect files. While more resilient to changes in malware than signatures, sandboxes are simply too slow to keep up with the tremendous volume of enterprise network traffic. Since each file can take several minutes to
examine, these systems must use prefilters to decide which files to look at while allowing the rest to pass unexamined. It is typically too costly to build a sandbox large enough to examine all traffic. Prefilters are little more than signature-based detection mechanisms in which the signatures relate to file metadata or behavioral patterns in the network environment. For example, a prefilter might look for an emailed file from someone outside the enterprise with a URL linking to the file, or a file with multiple extensions. These behavioral patterns are just as fragile as signatures and require previous, repeated observation of the behavior. This means that only a small subset of potential malware is inspected. Many potential threats continue into the victim network untouched.

Even if new malware were selected for analysis, most sandbox environments only look for previously observed malicious activity, such as changes to the Windows registry. Unfortunately, authors of new malware understand sandbox detectors and have simple methods for evading detection, including having the malware sleep for several minutes before executing, detecting network settings typical of sandboxes, mimicking benign program execution, and designing for highly specific computing configurations that sandboxes don't replicate.

Machine Learning: Automated Search and Discovery

Machine learning has been successfully applied in many fields such as facial recognition, voice recognition, and image processing. Machine learning is a process used to train computers to distinguish between objects of different types, or classes, through exposure to examples of objects of various types. A key step in the process is the selection of which features of the objects will be used in learning the differences between the classes. The result of the learning process is called a classifier.
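The feature-selection and learning steps just described can be sketched in a few lines. This is a toy illustration only, not the method used by any product: the two features (byte entropy and fraction of printable bytes), the nearest-centroid learner, the class labels, and all training data are assumptions chosen so that the example stays self-contained.

```python
import math
from collections import Counter


def features(data: bytes):
    """Toy features: Shannon entropy in bits per byte, and fraction of printable ASCII."""
    if not data:
        return (0.0, 0.0)
    n = len(data)
    counts = Counter(data)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    printable = sum(1 for b in data if 32 <= b < 127) / n
    return (entropy, printable)


def train(examples):
    """Learn one centroid (mean feature vector) per class from labeled examples."""
    sums = {}
    for data, label in examples:
        f = features(data)
        total, count = sums.get(label, ((0.0, 0.0), 0))
        sums[label] = ((total[0] + f[0], total[1] + f[1]), count + 1)
    return {label: (t[0] / c, t[1] / c) for label, (t, c) in sums.items()}


def classify(centroids, data: bytes) -> str:
    """Predict the class whose centroid is nearest to the object's features."""
    f = features(data)
    return min(centroids, key=lambda label: math.dist(f, centroids[label]))


# Synthetic, hypothetical training set: text-like data vs. high-entropy data.
training = [
    (b"The quick brown fox jumps over the lazy dog. " * 5, "benign"),
    (b"Quarterly report: revenue, costs, and forecast. " * 5, "benign"),
    (bytes(range(256)), "suspicious"),
    (bytes((i * 7 + 3) % 256 for i in range(512)), "suspicious"),
]
centroids = train(training)  # the "classifier" produced by the learning process

# Objects the classifier has never seen before.
unseen_text = b"Meeting notes for Tuesday: agenda and action items. " * 3
unseen_packed = bytes((i * 13 + 5) % 256 for i in range(256))

print(classify(centroids, unseen_text))
print(classify(centroids, unseen_packed))
```

The point of the sketch is the workflow (choose features, learn from labeled examples, predict on new objects); a production classifier would rely on far more features and a far larger, real training set.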
Classifiers allow computers to predict the class of an object they have never seen before and can therefore be used to search for objects of similar type. In many fields, this automated method of search and discovery has replaced slower, manual methods.

Consider the application of machine learning to finding the picture of a particular person among many images in an electronic album, a task called facial recognition. If the album is relatively small, a human could look at every picture and determine whether the person of interest is present. If, however, the album consists of vast numbers of images, an automated search and discovery process helps reduce the number of images that a human needs to examine. The images in the reduced set are those that the search and discovery process determined, with a high degree of certainty, to contain the person of interest. Not all of these images will actually contain the person of interest, but due to the high accuracy of the machine learning algorithms there is a tremendous reduction in the amount of time needed to investigate the entire album. This technique is successfully used today by law enforcement and public safety agencies for finding suspected terrorists, criminals, and missing persons. It has also found application in enhancing border security, transportation security, and social media.

A machine learning based search and discovery approach is well suited to addressing threats in cyberspace. Machine learning can be used to build software classifiers that distinguish malware from benign software and search massive volumes of traffic. As in other fields, the classifiers are based on a complex combination of features. A high-speed appliance can inspect network traffic, and human analysts can then be tipped to inspect high-probability events of interest. This approach has distinct advantages over signature- and sandbox-based approaches. First, it scales to very high volumes of traffic such that EVERYTHING can be inspected.
No prefilter or other form of upstream thinning is required.
This increases our field of view and closes the open doors. Second, the classifiers are resilient to changing malware and tactics. Unlike signatures or behavioral rules, the classifiers can discover threats even after they have changed. This increases both probability of detection and accuracy. By getting ahead of the threat, defenders can reduce the number of successful attacks and reduce the millions of dollars spent chasing and remediating a breach.

Northrop Grumman, leveraging decades of experience with government intelligence missions, is applying machine learning to malware detection in its BluVector cyber intelligence platform. The machine learning based malware detection engine at the heart of BluVector allows it to operate at line rates up to 10 Gbps while catching 95% or more of malicious software, whether the malware is well known or never seen before.

GET AHEAD OF THE THREAT WITH THE BLUVECTOR CYBER INTELLIGENCE PLATFORM

BluVector is a cyber intelligence platform designed to enable analysts to manage advanced threats. The 2U sensor appliance deploys to points of network aggregation and passively ingests tapped network traffic. It extends the existing enterprise cyber security infrastructure with an advanced, patented, machine learning-based malware early warning system that dramatically reduces the time it takes to discover previously unseen malicious threats.

BluVector inspects all files and generates a score for each one indicating the probability that the file is malicious. The high-probability files are then promoted to an analyst for final verification. BluVector can search massive volumes of traffic and alert the defender within seconds. The maliciousness scores produced by the machine learning classifier help analysts prioritize their search for advanced malware entering their network. This rapid discovery prevents further progress of the breach.
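The score-then-promote workflow described above can be sketched as a simple triage step. This is an illustrative assumption about how such a queue might be built, not BluVector's implementation: the `ScoredFile` type, the `promote` function, the 0.9 threshold, and the file names and scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScoredFile:
    name: str
    score: float  # classifier's estimated probability that the file is malicious


def promote(files, threshold=0.9):
    """Keep files scoring at or above the threshold, highest score first."""
    flagged = [f for f in files if f.score >= threshold]
    return sorted(flagged, key=lambda f: f.score, reverse=True)


# Hypothetical scores for files observed on the network.
observed = [
    ScoredFile("invoice.pdf", 0.12),
    ScoredFile("update.exe", 0.97),
    ScoredFile("notes.docx", 0.03),
    ScoredFile("setup.dll", 0.91),
]

analyst_queue = promote(observed)
print([f.name for f in analyst_queue])  # ['update.exe', 'setup.dll']
```

Every file receives a score, but only the small high-probability subset reaches the analyst, ordered so that the most suspicious item is examined first.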
To further assist analysts, BluVector provides Yara rule engines, traditional signature-based anti-virus detection engines, and interfaces to threat intelligence streams. This flexible combination of analytic engines allows analysts to add analytics over time to generate additional indications and warnings of suspicious events.
Figure 2. BluVector is simple to install and easy to use

BluVector can be installed at a network gateway or internal aggregation point within 30 minutes. It can then immediately begin alerting analysts to malicious events. While the machine learning sounds complicated, it's transparent to the defenders. A simple and intuitive user interface allows you to be up and running quickly. Events can also flow up to other event management tools so that BluVector fits within the normal workflows of a typical security operations center.

Northrop Grumman's BluVector cyber intelligence platform helps your organization get ahead of advanced cyber threats through the application of machine learning technology to achieve resilient, accurate, high-speed malware detection. BluVector addresses the challenges that signature- and sandbox-based detection mechanisms face in detecting previously unseen malware. Integrating BluVector into existing enterprise security deployments provides unique detection and alerting capability and can reduce the risk and cost of potential data breaches.

© 2014 Northrop Grumman Systems Corporation. All rights reserved. Approved for public release: 14-0306