Harnessing Intelligence from Malware Repositories



Similar documents
Software Fingerprinting for Automated Malicious Code Analysis

Inside a killer IMBot. Wei Ming Khoo University of Cambridge 19 Nov 2010

esrever gnireenigne tfosorcim seiranib

Abysssec Research. 1) Advisory information. 2) Vulnerable version

Introduction to Reverse Engineering

Fighting malware on your own

Automating Linux Malware Analysis Using Limon Sandbox Monnappa K A monnappa22@gmail.com

CS412/CS413. Introduction to Compilers Tim Teitelbaum. Lecture 20: Stack Frames 7 March 08

Computer Organization and Architecture

Lecture 7: Machine-Level Programming I: Basics Mohamed Zahran (aka Z)

TitanMist: Your First Step to Reversing Nirvana TitanMist. mist.reversinglabs.com

Introduction. Figure 1 Schema of DarunGrim2

Introduction. Application Security. Reasons For Reverse Engineering. This lecture. Java Byte Code

From Georgia, with Love Win32/Georbot. Is someone trying to spy on Georgians?

Parasitics: The Next Generation. Vitaly Zaytsev Abhishek Karnik Joshua Phillips

64-Bit NASM Notes. Invoking 64-Bit NASM

Binary Code Extraction and Interface Identification for Security Applications

What Happens In Windows 7 Stays In Windows 7

Reverse Engineering and Computer Security

Detecting the One Percent: Advanced Targeted Malware Detection

Computer Virus Strategies and Detection Methods

Identification and Removal of

Stitching the Gadgets On the Ineffectiveness of Coarse-Grained Control-Flow Integrity Protection

Attacking Obfuscated Code with IDA Pro. Chris Eagle

Return-oriented programming without returns

Fine-grained covert debugging using hypervisors and analysis via visualization

Storm Worm & Botnet Analysis

The Beast is Resting in Your Memory On Return-Oriented Programming Attacks and Mitigation Techniques To appear at USENIX Security & BlackHat USA, 2014

1. General function and functionality of the malware

Using In-Memory Computing to Simplify Big Data Analytics

Fast Matching of Binary Features

For a 64-bit system. I - Presentation Of The Shellcode

Unpacked BCD Arithmetic. BCD (ASCII) Arithmetic. Where and Why is BCD used? From the SQL Server Manual. Packed BCD, ASCII, Unpacked BCD

Hacking Techniques & Intrusion Detection. Ali Al-Shemery arabnix [at] gmail

Mike Melanson

High-speed image processing algorithms using MMX hardware

Symantec Cyber Threat Analysis Program Program Overview. Symantec Cyber Threat Analysis Program Team

VIRUS TRACKER CHALLENGES OF RUNNING A LARGE SCALE SINKHOLE OPERATION

Dongwoo Kim : Hyeon-jeong Lee s Husband

Evading Android Emulator

Towards Revealing Attackers Intent by Automatically Decrypting Network Traffic

for Malware Analysis Daniel Quist Lorie Liebrock New Mexico Tech Los Alamos National Laboratory

Compilers. Introduction to Compilers. Lecture 1. Spring term. Mick O Donnell: michael.odonnell@uam.es Alfonso Ortega: alfonso.ortega@uam.

Obfuscation: know your enemy

Heap-based Buffer Overflow Vulnerability in Adobe Flash Player

Buffer Overflows. Security 2011

Fight fire with fire when protecting sensitive data

Adi Hayon Tomer Teller

Automata Designs for Data Encryption with AES using the Micron Automata Processor

Self Protection Techniques in Malware

MEng, BSc Applied Computer Science

Software Vulnerabilities

Oracle Advanced Analytics 12c & SQLDEV/Oracle Data Miner 4.0 New Features

A Tiny Guide to Programming in 32-bit x86 Assembly Language

NSA/DHS CAE in IA/CD 2014 Mandatory Knowledge Unit Checklist 4 Year + Programs

Off-by-One exploitation tutorial

Networks and Security Lab. Network Forensics

PSG College of Technology, Coimbatore Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

Hydra. Advanced x86 polymorphic engine. Incorporates existing techniques and introduces new ones in one package. All but one feature OS-independent

Inciting Cloud Virtual Machine Reallocation With Supervised Machine Learning and Time Series Forecasts. Eli M. Dow IBM Research, Yorktown NY

Compiling CAO: from Cryptographic Specifications to C Implementations

Efficient Similarity Search over Encrypted Data

CAS : A FRAMEWORK OF ONLINE DETECTING ADVANCE MALWARE FAMILIES FOR CLOUD-BASED SECURITY

MEng, BSc Computer Science with Artificial Intelligence

Week Overview. Installing Linux Linux on your Desktop Virtualization Basic Linux system administration

Polymorphic Worm Detection Using Structural Information of Executables

End to End Defense against Rootkits in Cloud Environment Sachin Shetty

CERIAS Tech Report Basic Dynamic Processes Analysis of Malware in Hypervisors Type I & II by Ibrahim Waziri Jr, Sam Liles Center for Education

Reverse Engineering by Crayon: Game Changing Hypervisor and Visualization Analysis

Full Potential of Dynamic Binary Translation for AV Emulation Engine

Phoenix Technologies Ltd.

Peeking into Pandora s Bochs. Instrumenting a Full System Emulator to Analyse Malicious Software

One Minute in Cyber Security

Under the Hood: How Actaeon Unveils Your Hypervisor

Theory and Practice of Cyber Threat-Intelligence Management Using STIX and CybOX Siemens AG All rights reserved

Indexing Full Packet Capture Data With Flow

The Big Data Paradigm Shift. Insight Through Automation

Introduction to Data Mining and Business Intelligence Lecture 1/DMBI/IKI83403T/MTI/UI

Chapter 7D The Java Virtual Machine

Sandy. The Malicious Exploit Analysis. Static Analysis and Dynamic exploit analysis. Garage4Hackers

Reversing C++ Paul Vincent Sabanal. Mark Vincent Yason

Defending Behind The Device Mobile Application Risks

Spam Detection Using Customized SimHash Function

SeChat: An AES Encrypted Chat

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Pattern Insight Clone Detection

Using MMX Instructions to Convert RGB To YUV Color Conversion

Penetration Testing 2014

Cymon.io. Open Threat Intelligence. 29 October 2015 Copyright 2015 esentire, Inc. 1

Detecting Malware With Memory Forensics. Hal Pomeranz SANS Institute

CORPORATE AV / EPP COMPARATIVE ANALYSIS

02 B The Java Virtual Machine

TODAY, FEW PROGRAMMERS USE ASSEMBLY LANGUAGE. Higher-level languages such

Outline. Introduction. State-of-the-art Forensic Methods. Hardware-based Workload Forensics. Experimental Results. Summary. OS level Hypervisor level

Faculty of Engineering Student Number:

Computer Organization and Components

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

File Disinfection Framework (FDF) Striking back at polymorphic viruses

Transcription:

Harnessing Intelligence from Malware Repositories Arun Lakhotia and Vivek Notani Software Research Lab University of Louisiana at Lafayette arun@louisiana.edu, vxn4849@louisiana.edu 7/22/2015 (C) 2015 U. Louisiana at Lafayette 1

Self Introduction Software Research Lab 10 years research on Malware Graduate course on malware analysis Active interaction with industry Funded by AFOSR, ARO, DARPA, ONR, and State of Louisiana Research Focus How does malware evade detection? How to detect stealthy malware? Malware analysis in the large Results Papers: 50+ peer-reviewed Patents: one granted Degrees: 6 Ph.D., 8 M.Sc. Research Funding: $5MM+ 7/22/2015 (C) 2015 U. Louisiana at Lafayette 2

Targeted Attacks 7/22/2015 (C) 2015 U. Louisiana at Lafayette 3

Machine Generation of Malware 4

CyberSecurity Paradox Physical world: FAILED attempts are INVESTIGATED Cyber World: FAILED attempts are CELEBRATED

Extract Intelligence from Malware Connect actors Discover trends Malware evolution Connect families 7/22/2015 (C) 2015 U. Louisiana at Lafayette 6

Requirement: Google for Malware Malware Collection Nearest Match 0.90 Query Malware Google 0.82 0.76 0.30 8

Challenges ❶ ❷ Remove obfuscation Map to document ❶ ❷ ❸ ❹ ❸ ❹ Create index Search 9

Key Innovation: VM Introspection in the Cloud VM VM HYPERVISOR 7/22/2015 (C) 2015 U. Louisiana at Lafayette 10

Key Innovation: Semantic Fingerprints STATE OF PRACTICE VIRUSBATTLE Bit shreds (380091df) (0091df96) (91df96f6) (df96f633) Semantics eax = def(ebp) ebp = -4+def(esp) esp = -8+def(esp) memdw(-8+def(esp))= def(ebp) memdw(-4+def(esp))= def(ebp) memdw(4+def(esp)) = def(memdw(def(esp))) Fingerprint Fingerprint Susceptible to obfuscation 7/22/2015 (C) 2015 U. Louisiana at Lafayette 11 Obfuscation resistent

Semantics Enabled: Connecting Malware through Code Smokeloader Aldibot Semantically similar binaries between malware families Ponyloader Darkcomet 7/22/2015 (C) 2015 U. Louisiana at Lafayette 12

Unpacking Malware 7/22/2015 (C) 2015 U. Louisiana at Lafayette 13

Challenge 1: Packing 7/22/2015 (C) 2015 U. Louisiana at Lafayette 14

Classes of Packers (Protectors) Classification parameter Based on execution behavior When and how much of the original code is decrypted Traditional Packer Entire original code is decrypted at one time Entire original code is in clear text before it is executed Paged Packer Just-In-Time decryption of a page when it is executed Only a page of the original code is in clear text at any time Virtual Machine Protectors Decrypt a single instruction at a time None of the original code is ever in clear text 7/22/2015 (C) 2015 U. Louisiana at Lafayette 15

Unpacking: State-of-Practice Malware UNPACKER CODE Windows Guest OS Virtualization Stack 7/22/2015 (C) 2015 U. Louisiana at Lafayette 16

Innovation: Unpacking using VM Introspection Malware Windows Guest OS QEMU Hypervisor VB UNPACKER CODE Linux Guest OS Virtualization Stack Observe malware below ring 0 7/22/2015 (C) 2015 U. Louisiana at Lafayette 17

Unpacking Traditionally Packed Malware Execute Malware Detect When to Stop Extract Revealed Code 7/22/2015 (C) 2015 U. Louisiana at Lafayette 18

When to Stop: Hump and Dump Traditional Packer Decryption in a loop High instruction execution frequency Spike in frequency graph Hump & Dump Algorithm Detect spike hump Detect end flat PE Compact2.5 (calc.exe), Linear Scale Addresses Ordered by last execution time 7/22/2015 (C) 2015 U. Louisiana at Lafayette 19

When to Stop: TimeOut What if Hump is never detected? TimeOut Limits execution time 7/22/2015 (C) 2015 U. Louisiana at Lafayette 20

Constructing PE Modify OEP using last PC value Fix Section Headers Copy Memory Contents to new PE 7/22/2015 (C) 2015 U. Louisiana at Lafayette 21

Extracting Memory Contents: Challenge Extracting memory through hypervisor Memory contents may be paged out by GuestOS Solution: Determine memory is paged out Analyze execution profile Re-run unpacker with new parameters Catch before memory is paged out 7/22/2015 (C) 2015 U. Louisiana at Lafayette 22

Case Study Dataset Description: File Type: PE-32 Source: FBI Availability: Upon request Collection period: 1 year Bot Family # Executables Aldibot 19 Armageddon 1 Blackenergy 65 Darkcomet 339 Darkshell 379 Ddoser 5 Illusion 17 Nitol 11 Optima 160 Ponyloader 1,312 Smokeloader 31 Umbraloader 25 Yzf 4 Zeus 41 Total 2,409 7/22/2015 (C) 2015 U. Louisiana at Lafayette 23

Case Study: Results Input : 2,409 Unpacked : 2,354 Output : 2,185 Original Unique Unpacked Poor Unpacking Poor Unpacking HEURISTIC (%) Hump and Dump 1,671 1,523 205 12.27 TimeOut 515 500 46 8.93 Self-tuning 168 163 23 13.69 TOTAL 2,354 2,186 274 11.64 Unpacked Binary very similar to Original => Poor Unpacking 7/22/2015 (C) 2015 U. Louisiana at Lafayette 24

Unpacker s Impact: Analysis Cost Reduction CNC IP Unpacked Malware Multiple malware produce the same unpacked result Malware Binaries # % Original binaries 2,409 Unpacked successfully 2,354 97.7% Unique unpack result 2,185 92.8% Reduction in analyst work 169 7.2% 7/22/2015 (C) 2015 U. Louisiana at Lafayette 25

Matching Code 7/22/2015 (C) 2015 U. Louisiana at Lafayette 26

Challenge 2: Code Obfuscation push ecx mov ecx,ebp add ecx,33 mov [ecx-36],eax pop ecx push ecx mov ecx,ebp add ecx,33 push esi mov esi,ecx sub esi,34 mov [esi-2],eax pop esi pop ecx push ecx mov ecx, ebp push eax mov eax, 33 add ecx, eax pop eax push esi mov esi, ecx push edx mov edx, 34 sub esi, edx pop edx mov [esi - 2], eax pop esi pop ecx push ecx mov ecx, [ebp + 10] mov ecx, ebp push eax add eax, 2342 mov eax, 33 add ecx, eax pop eax mov eax, esi push eax mov esi, ecx push edx xor edx, 778f mov edx, 34 sub esi, edx pop edx mov [esi-2], eax pop esi pop ecx 7/22/2015 (C) 2015 U. Louisiana at Lafayette 27

Requirements Scale requirement Search in collection of thousands to millions of malware Performance requirement Provide results in seconds, or less Quality requirement Error rates should be comparable to pairwise matching 7/22/2015 (C) 2015 U. Louisiana at Lafayette 28

Representations for Matching Binaries Map binary to document Disassemble (380091df) (0091df96) (91df96f6) (df96f633) Word = N-Bytes Abstracted Bytecode (je push) (push mov) (mov pop) (pop xor) Word = N- mnemonic 7/22/2015 (C) 2015 U. Louisiana at Lafayette 29

VirusBattle Strategy Map binary to CFG to Document Binary Disassembly CFG Abstracted Bytecode Abstracted Disassembly Semantics Word = Block Juice 7/22/2015 (C) 2015 U. Louisiana at Lafayette 30

Code to Semantics Code Sequential Focus on operations Semantics Parallel Captures affect push ebp mov ebp,esp sub esp,4 mov eax, DWORD ebp+4 mov DWORD ebp+8,eax mov eax, DWORD ebp mov DWORD ebp-4,eax eax = def(ebp) ebp = -4+def(esp) esp = -8+def(esp) memdw(-8+def(esp))= def(ebp) memdw(-4+def(esp))= def(ebp) memdw(4+def(esp)) = def(memdw(def(esp))) Interpret: seq(instruction) -> State -> State State = LValue -> RValue LValue = Register + Mem Value in previous state Unsimplified RValue = Int + def(rvalue) + RValue op Rvalue + op RValue 7/22/2015 (C) 2015 U. Louisiana at Lafayette 31

Limitations of (Block) Semantics Does not capture: Register renaming Memory address reassignment Code motion between blocks Evolutionary changes Hashes good for strict equality Solution: Generalize semantics Juice Use n-block semantics Use fuzzy hashes 7/22/2015 (C) 2015 U. Louisiana at Lafayette 32

Semantics to words Challenge: How to map equal semantics to the same `word? Solution: Define canonical ordering RValue structures are ground Use ordering over symbols Account for commutativity Sum-of-product form Simplify Word = Hash (md5, SHA1) of linearized semantics RValue = Int + def(rvalue) + RValue op Rvalue + op RValue 7/22/2015 (C) 2015 U. Louisiana at Lafayette 33

Computing Juice Problem: Establish constraints between induced variables? Solution 1. Track simplification steps 2. Generalize simplification steps 7/22/2015 (C) 2015 U. Louisiana at Lafayette 34

Semantics and Juice Code push ebp mov ebp,esp sub esp,4 mov eax, DWORD ebp+4 mov DWORD ebp+8,eax mov eax, DWORD ebp mov DWORD ebp-4,eax RValue = Int + def(rvalue) + RValue op Rvalue + op RValue + Variable Semantics eax = def(ebp) ebp = -4+def(esp) esp = -8+def(esp) memdw(-8+def(esp))= def(ebp) memdw(-4+def(esp))= def(ebp) memdw(4+def(esp)) = def(memdw(def(esp))) Juice Inductive Generalization Replace registers and constants by variables A = def(b), B = N1+def(C), C = N2+def(C), memdw(n2+def(c)) = def(b) memdw(n1+def(c)) = def(b) memdw(n3+def(c)) = def(memdw(def(c))) where A, B, C are registers N1, N2, N3 are Int 7/22/2015 (C) 2015 U. Louisiana at Lafayette 35

Challenge 3: Scalable Search 7/22/2015 (C) 2015 U. Louisiana at Lafayette 36

Featurization Process Hash Procedure Hash MinHash Binary Unpack Disassembly Procedure Hash Procedure Compiler Attributes MinHash Binary Feature Vector 7/22/2015 (C) 2015 U. Louisiana at Lafayette 37

MinHash: A form of LSH A Feature B Light Brown Hair Color Dark Brown Long Hair Length Long Brown Eye Color Brown Feature = MinHash Function Set of Features = MinHash Signature MinHash Signature-1 MinHash Signature-2 Compose for Deterministic manipulations 7/22/2015 (C) 2015 U. Louisiana at Lafayette 38

Architecture 7/22/2015 (C) 2015 U. Louisiana at Lafayette 39

VirusBattle Webservice Architecture 7/22/2015 (C) 2015 U. Louisiana at Lafayette 40

Empirical Results 7/22/2015 (C) 2015 U. Louisiana at Lafayette 41

Dataset 3000 Bots harvested in 2013 2500 2000 1500 1000 unpacked original 500 0 7/22/2015 (C) 2015 U. Louisiana at Lafayette 42

Interesting Procedures 7/22/2015 (C) 2015 U. Louisiana at Lafayette 43

Libraries ID ed by IDA 7/22/2015 (C) 2015 U. Louisiana at Lafayette 44

Transitive Library via Semantics 7/22/2015 (C) 2015 U. Louisiana at Lafayette 45

RE Cost Reduction 1.7 Million+ procedures # of procedures in binaries 105K+ procedures # of IDA unique procedures # of semantically unique procedures 32K+ semantically unique procedures Procedures All IDA Unique Juice Unique Lib Procs 65,113 11,482 4,382 Non Lib Procs 1,644,355 96.2% 93,916 89.1% 7/22/2015 (C) 2015 U. Louisiana at Lafayette 46 27,859 86.4% Total 1,709,468 105,398 32,241

Intelligence: Connecting Families Smokeloader Aldibot Semantically similar binaries between malware families Ponyloader Darkcomet 7/22/2015 (C) 2015 U. Louisiana at Lafayette 47

Intelligence: Code Sharing Non-Lib Unpacked Procedure Optima DarkComet Shared in 13/159 samples Shared in 65/319 samples 7/22/2015 (C) 2015 U. Louisiana at Lafayette 48

# Procedures Shared Intelligence: Code Evolution ddoser: 10 samples 200 in 100% 100 in 10% Percent binaries 7/22/2015 (C) 2015 U. Louisiana at Lafayette 49

# Procedures Shared Intelligence: Needle in Haystack darkcomet: 649 samples 930 in 0-5% 20 in 60-65% Percent binaries 7/22/2015 (C) 2015 U. Louisiana at Lafayette 50

Performance Distribution of analysis time 95% percentile < 15s Semantic analysis: Unpacking: Malware Search: < 30s < 7s Procedure Search: < 100ms 7/22/2015 (C) 2015 U. Louisiana at Lafayette 51

VirusBattle: In a nutshell Key Innovations Automated unpacking using VM introspection Semantic fingerprints, as against bits-based fingerprints Innovative 2-tier search algorithm for fast searches Search at various granularity: Whole binary, procedures, blocks, strings Interfaces with Palantir s Forensic Investigation platform Component Time(*) Accuracy Unpacker 30 sec 97% Semantic Juice 15 sec Binary search 7 sec 95% Procedure search Performance 100ms * Based on analysis of 2,500 botnets binaries; ** Max time to process 95% of files Application Rapidly extract intelligence from malware Impact Order of magnitude improvement in malware analysts capability Unpacking time: Reduced from days/weeks to minutes Analysis work: Reduce efforts from weeks/months to minutes New capability: Build knowledge base of analysis indexed on similar code Share analysts experience across malware families 7/22/2015 (C) 2015 U. Louisiana at Lafayette 52

Blackhat Sound Bytes Malware repositories are great source of intelligence Semantic juice peers through code obfuscation Semantic hashing enables fast search over large repositories VM Introspection gives you X-Ray vision over malware VirusBattle.com: Malware Intelligence Mining in the Cloud 7/22/2015 (C) 2015 U. Louisiana at Lafayette 53

Contact Prof. Arun Lakhotia Vivek Notani University of Louisiana at Lafayette, Lafayette, LA, USA arun@louisiana.edu vxn4849@louisiana.edu 7/22/2015 (C) 2015 U. Louisiana at Lafayette 54

Extras 7/22/2015 (C) 2015 U. Louisiana at Lafayette 55

MinHash: A form of LSH Consider Set A and Set B Let h(x)->int be a function that takes a member of A or B and gives an integer Let h min (s) represent minimum member of set s w.r.t. h. Then, Pr(h min (A)=h min (B)) = Problem: High Variance! 7/22/2015 (C) 2015 U. Louisiana at Lafayette 56

MinHash Signatures: Compose d minhash functions: Signature Match then implies each of the d functions agree on match Pr (sig(a)=sig(b)) = J(A,B) d Problem: Too many False Negatives! Check r minhash signatures: A Match then implies atleast one of the r signatures agree on match Pr (match(a,b)) = 1 - (1 - J(A,B) d ) r 7/22/2015 (C) 2015 U. Louisiana at Lafayette 57