Software Fingerprinting for Automated Malicious Code Analysis



Philippe Charland
Mission Critical Cyber Security Section
October 25, 2012
Terms of Release: This document is approved for release to Defence Departments, Defence Contractors, and Governments. Any further distribution requires the prior approval of the Defence R&D Canada Valcartier Document Review Panel.

Outline
- Software Reverse Engineering
- Motivation
- Research Objective
- Prototypes
- Future Work

Why Software Reverse Engineering?
- To develop a solid understanding of software for which there is no
  - Documentation
  - Source code
- Malicious software (malware) falls into this category

Malware Figures
- SophosLabs: analyzed 95,000 malware pieces every day in 2010
- Panda Security: 26 million new malware samples were identified in 2011 (73,000 strains per day)

Targeted Attacks

Software Reverse Engineering Process
- Increase in complexity
- Stages: Deobfuscation / Software Dearmoring; Disassembly / Code-level Analysis; Relevant and Interesting Feature Identification; Unpacking
- Tools: Debugger, IDA Pro, OllyDbg
- Experience-based: newbies have trouble with this

Assembly Code Analysis
- Most nebulous portion of the process
- Largely depends on intuition and experience
- Looking at assembly is tedious
- Not seeing the forest for the trees
- Analyst fatigue
- High level of attention required

Assembly Code Analysis
Question:
    lea     eax, DWORD PTR [edx+edx]
    add     eax, eax
    add     eax, eax
    add     eax, eax
    add     eax, eax
Answer: y = x * 32
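The answer can be checked by stepping through the five instructions. The sketch below is a hand-written emulation in Python, not output from any tool mentioned in the talk; the function name `run_snippet` and the 32-bit mask are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide, so results wrap around


def run_snippet(edx: int) -> int:
    """Emulate the lea/add sequence from the slide for a given x in edx."""
    eax = (edx + edx) & MASK32      # lea eax, [edx+edx]  -> 2 * x
    for _ in range(4):              # add eax, eax, executed four times
        eax = (eax + eax) & MASK32  # each add doubles eax: 4x, 8x, 16x, 32x
    return eax


for x in (0, 1, 7, 1000):
    assert run_snippet(x) == (x * 32) & MASK32
```

The `lea` doubles x once and each `add eax, eax` doubles it again, giving 2^5 = 32 times the input, modulo 2^32.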

Assembly Code Analysis
- Doing everything manually is unsustainable...
- Throwing more reverse engineers at the problem is not possible...

Assembly Code Analysis
- Automate some of the assembly code analysis process!

Motivation
- Malware authors
  - Develop huge numbers of variants to bypass antivirus
  - Exchange source code among themselves
  - Reuse open source code
- Reverse engineers
  - Leverage the code reuse in malware
  - Reduce redundant analysis efforts
  - Accelerate the reverse engineering process

Research Objective
Automatically identify code fragments that reuse
1. Open source code
2. Previously analyzed assembly code

Assembly and Source Code Matching
RE Google (regoogle.carnivore.it)
- IDA Pro plug-in (Python script)
- Enumerate all functions and extract
  - Strings
  - Constants
  - Imported function names
- Perform a Google Code Search
- Add top results as function comments
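The extract-then-search idea above can be sketched without IDA Pro by working on the text of a disassembly listing. This is not the RE Google script and does not use the IDA API; the function names, the regular expressions, and the sample lines (including the `CryptAcquireContextA` call and the constant) are assumptions made for illustration only.

```python
import re


def extract_features(asm_lines):
    """Collect strings, 32-bit constants and imported call targets from
    IDA-style text lines (hypothetical helper, text-based only)."""
    strings, constants, imports = [], [], []
    for line in asm_lines:
        code, _, comment = line.partition(";")
        # String literals are assumed to appear quoted in the comment column.
        strings += re.findall(r'"([^"]+)"', comment)
        # Imported functions are assumed to be called through "ds:".
        match = re.search(r"\bcall\s+ds:(\w+)", code)
        if match:
            imports.append(match.group(1))
        # Eight-hex-digit immediates, with an optional leading 0 (MASM style).
        constants += re.findall(r"\b0?([0-9A-Fa-f]{8})h\b", code)
    return {"strings": strings, "constants": constants, "imports": imports}


def build_query(features):
    """Join the extracted features into one search string."""
    terms = [f'"{s}"' for s in features["strings"]]
    terms += ["0x" + c for c in features["constants"]]
    terms += features["imports"]
    return " ".join(terms)


sample = [
    'push offset aSha512 ; "SHA-512"',
    "mov eax, 6A09E667h",
    "call ds:CryptAcquireContextA",
]
print(build_query(extract_features(sample)))
```

Distinctive constants and strings tend to survive compilation unchanged, which is why they make useful search terms for locating the original source.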

Results: Constants

RE Google
- Relies on the Google Code Search API
- Shut down on January 15, 2012...
- Look for alternatives...

Google Code Search Alternatives
As suggested on the Google Code Search Group:
- www.antepedia.com
- www.grepcode.com
- www.koders.com
- www.krugle.org
- ...

Koders
- Merging with Ohloh (code.ohloh.net)
- Index and search 10+ BLOCs (3x the amount of Koders)
- Support 43 programming languages

Koders

Koders: Search for SHA-512

RE Source
- IDA Pro plug-in
- Based on the original RE Google Python script
- Diagram elements: Assembly File; RE Source (Extract Features, Comment Functions, Build Query, Parse Results); HTML Page

RE Source: Precise Calculator Case Study

Precise Calculator
- Open source programmable scientific calculator
- Has more than 150 mathematical and statistical functions
- Written in C++ and assembly
  - 9 .h files
  - 7 .cpp files
  - 2 assembly files

RE Source: Precise Calculator Case Study
- Disassembled executable contains 533 functions
- Features extracted for 67 functions
- Identified 5 of the 7 .cpp files with 100% accuracy
  - 70% of the original source code
- Detected functions
  - Mathematical, geometrical and statistical
  - Parsing, editing, GUI

Clone Detection
- Technique to identify duplicate code fragments in a code base
- Most algorithms operate on source code
  - Decrease code size by consolidating it
  - Facilitate program comprehension and software maintenance
- Commercial off-the-shelf software
  - Copyright infringements
  - Plagiarism detection

Clone Detection vs. Clone Search
- Clone Detection
  - Identify all the similar code fragments within a code base
  - Compare every code fragment pair
- Clone Search
  - Identify only the code fragments similar to a target fragment
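The difference between the two modes is mainly in how many comparisons are made. The following sketch assumes a caller-supplied predicate `is_clone` (here plain equality, purely as a stand-in); the function names are hypothetical and do not come from the prototype.

```python
from itertools import combinations


def detect_clones(fragments, is_clone):
    """Clone detection: compare every pair, n * (n - 1) / 2 comparisons."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(fragments), 2)
        if is_clone(a, b)
    ]


def search_clones(target, fragments, is_clone):
    """Clone search: compare one target against each fragment, n comparisons."""
    return [i for i, frag in enumerate(fragments) if is_clone(target, frag)]


fragments = ["a", "b", "a", "c", "b"]
same = lambda x, y: x == y
print(detect_clones(fragments, same))       # [(0, 2), (1, 4)]
print(search_clones("a", fragments, same))  # [0, 2]
```

Search scales linearly with the size of the code base, which matters when an analyst only wants matches for the one function currently under study.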

Clone Types
- Syntactic Clones
  - Textual similarity
  - Type I, II, III clones
- Semantic Clones
  - Functional similarity
  - Type IV clones

Syntactic Clones: Type I
Identical code fragments except for variations in whitespace, layout and comments
Fragment A:
    push    eax                     ; Memory
    call    ds:_aligned_free
    and     dword ptr [esi], 0
    pop     ecx
Fragment B:
    push    eax
    call    ds:_aligned_free
    and     dword ptr [esi], 0
    pop     ecx
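A Type I comparison can be approximated by discarding comments and collapsing whitespace before comparing line by line. The helper names below are hypothetical, and treating `;` as the only comment marker is an assumption that holds for the listings shown in this talk.

```python
def canonical(fragment: str):
    """Strip ';' comments and normalize whitespace, dropping empty lines."""
    lines = []
    for line in fragment.splitlines():
        code = line.split(";", 1)[0]   # remove the comment column
        code = " ".join(code.split())  # collapse runs of spaces and tabs
        if code:
            lines.append(code)
    return lines


def is_type1_clone(a: str, b: str) -> bool:
    """True when two fragments differ only in whitespace, layout and comments."""
    return canonical(a) == canonical(b)


frag_a = "push eax ; Memory\ncall ds:_aligned_free\nand dword ptr [esi], 0\npop ecx"
frag_b = "push    eax\n  call ds:_aligned_free\n\nand dword ptr [esi], 0\npop ecx"
assert is_type1_clone(frag_a, frag_b)
```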

Syntactic Clones: Type II
Structurally/syntactically identical fragments except for variations in identifiers, literals, types, layout and comments
Fragment A:
    push    edi                     ; Size
    call    _malloc
    mov     edx, eax
    mov     ecx, edi
    mov     [esp+24h+var_c], edx
    mov     edi, edx
    mov     edx, ecx
    xor     eax, eax
    shr     ecx, 2
    rep stosd
    mov     ecx, edx
    add     esp, 4
    and     ecx, 3
    rep stosb
    mov     eax, [esp+20h+var_c]
    test    eax, eax
    jnz     loc_10001a97
    mov     eax, [ebx]
    push    eax
Fragment B:
    push    edi                     ; Size
    call    _malloc
    mov     edx, eax
    mov     ecx, edi
    mov     [esp+20h+inbuffer], edx
    mov     edi, edx
    mov     edx, ecx
    xor     eax, eax
    shr     ecx, 2
    rep stosd
    mov     ecx, edx
    add     esp, 4
    and     ecx, 3
    rep stosb
    mov     eax, [esp+1ch+inbuffer]
    test    eax, eax
    jnz     loc_10001493
    mov     eax, [ebx]
    push    eax

Syntactic Clones: Type III
Copied fragments with further modifications. Statements can be changed, added or removed in addition to variations in identifiers, literals, types, layout and comments
Fragment A:
    mov     esi, [ebp+arg_0]
    mov     edx, [esi+214h]
    mov     edi, [esi+220h]
    mov     [ebp+var_4], edx
    cmp     [esi+21ch], edi
    jl      short loc_76641044
    lea     ebx, [edx+edi*8]
Fragment B:
    mov     esi, [ebp+arg_0]
    mov     edx, [esi+214h]
    mov     [ebp+var_4], edx
    mov     edi, [esi+220h]
    cmp     [esi+21ch], edi
    jl      short loc_76641044
    lea     ebx, [edx+edi*8]

Semantic Clones: Type IV
Two or more code fragments that perform the same computation implemented through different syntactic variants
Fragment A:
    strlen1 proc near
    arg_0 = dword ptr 4
        mov     eax, [esp+arg_0]
    loc_401004:
        cmp     byte ptr [eax], 0
        jz      short done
        inc     eax
        jmp     short loc_401004
    done:
        sub     eax, [esp+arg_0]
        retn
    strlen1 endp
Fragment B:
    strlen3 proc near
    arg_0 = dword ptr 4
        push    edi
        mov     edi, [esp+4+arg_0]
        xor     ecx, ecx
        not     ecx
        xor     al, al
        cld
        repne scasb
        not     ecx
        lea     eax, [ecx-1]
        pop     edi
        retn
    strlen3 endp

Clone Detector: Overview
Extended from A. Saebjornsen, et al. (2009), University of California, Davis
Diagram components:
- Disassembler
- Assembly Files
- Regionizer
- Normalizer
- Binary Files
- Token Indexer
- Exact Clone Detector
- Inexact Clone Detector
- Visualizer
- XML File
- Maximal Clone Merger
- Duplicate Clone Merger

Clone Detector: Regionizer
    sub_402d5f proc near            ; CODE XREF: sub_402fc1+12p
        mov     edi, edi
        push    esi
        push    edi
        mov     edi, ecx
        lea     esi, [edi+0d0h]
        mov     eax, [esi]
        test    eax, eax
        jz      short loc_402d7c
        push    eax                 ; Memory
        call    ds:_aligned_free
        and     dword ptr [esi], 0
        pop     ecx
    loc_402d7c:                     ; CODE XREF: sub_402d5f+10j
        and     dword ptr [edi+0d4h], 0
        push    90h                 ; Size
        push    0                   ; Val
        add     edi, 40h
        push    edi                 ; Dst
        call    memset
        add     esp, 0Ch
        pop     edi
        pop     esi
        retn
    sub_402d5f endp

Clone Detector: Regionizer
    sub_402d5f proc near
        mov     edi, edi
        push    esi
        push    edi
        mov     edi, ecx
        lea     esi, [edi+0d0h]
        mov     eax, [esi]
        test    eax, eax
        jz      short loc_402d7c
        push    eax
        call    ds:_aligned_free
        and     dword ptr [esi], 0
        pop     ecx
        and     dword ptr [edi+0d4h], 0
        push    90h
        push    0
        add     edi, 40h
        push    edi
        call    memset
        add     esp, 0Ch
        pop     edi
        pop     esi
        retn
    sub_402d5f endp
Window size = 10 instructions
Step size = 4 instructions
Regions shown: Region 0, Region 1, Region 2, Region 3, Region 4, Region 5
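The regionizer amounts to a sliding window over a function's instruction list. In the sketch below, the name `regionize` is hypothetical, and keeping the shorter windows at the end of the function is an assumption: it is the choice that yields six regions for the 22-instruction function above, matching Region 0 to Region 5, but the talk does not say how the tool treats the tail.

```python
def regionize(instructions, window=10, step=4):
    """Split a function into overlapping regions of up to `window`
    instructions, starting a new region every `step` instructions."""
    return [
        instructions[start:start + window]
        for start in range(0, len(instructions), step)
    ]


# Placeholder names stand in for the 22 instructions of sub_402d5f.
instructions = [f"insn_{i}" for i in range(22)]
regions = regionize(instructions)
for index, region in enumerate(regions):
    print(f"Region {index}: {region[0]} .. {region[-1]} ({len(region)} instructions)")
```

A smaller step produces more overlapping regions and therefore finer-grained matches, at the cost of more regions to index and compare.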

Clone Detector: Normalization
- Registers, constants and memory addresses are normalized
- Constants: VAL or VALx, where x is an index number
- Memory addresses: MEM or MEMx, where x is an index number
- Registers: different normalization levels are available

Clone Detector: Normalization (register levels)
Level                          Examples
REG                            EAX -> REG, CS -> REG, EDI -> REG
REGSeg, REGGen, REGIdxPtr      EAX -> REGGen, CS -> REGSeg, EDI -> REGIdxPtr
REGGen8, REGGen16, REGGen32    EAX -> REGGen32, AX -> REGGen16, AH -> REGGen8
REGx                           EAX -> REG#0, AX -> REG#1, AH -> REG#2
Level hierarchy: REG; REGSeg, REGGen, REGIdxPtr; REGGen8, REGGen16, REGGen32; REGx

Clone Detector: Normalization
Assembly code:
    mov     edi, edi
    push    ebp
    mov     ebp, esp
    mov     eax, dword ptr [ebp+8]
Normalized assembly code:
    mov     REG, REG
    push    REG
    mov     REG, REG
    mov     REG, MEM
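The coarsest normalization level (every register becomes REG, every memory reference MEM, every constant VAL) can be sketched with a few string rules. This is an illustrative approximation, not the prototype's normalizer: the register set is limited to common 32-, 16- and 8-bit names plus segment registers, and operands that match none of the rules (labels, symbols) are passed through unchanged.

```python
import re

# Partial register list, sufficient for the examples in this talk (assumption).
REGISTERS = {
    "eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp",
    "ax", "bx", "cx", "dx", "si", "di", "bp", "sp",
    "al", "ah", "bl", "bh", "cl", "ch", "dl", "dh",
    "cs", "ds", "es", "fs", "gs", "ss",
}


def normalize_operand(operand: str) -> str:
    op = operand.strip().lower()
    if "[" in op:                               # any bracketed expression
        return "MEM"
    if op in REGISTERS:
        return "REG"
    if re.fullmatch(r"[0-9][0-9a-f]*h?", op):   # decimal or MASM-style hex
        return "VAL"
    return operand.strip()                      # labels and symbols stay as-is


def normalize(instruction: str) -> str:
    mnemonic, _, rest = instruction.strip().partition(" ")
    if not rest.strip():
        return mnemonic
    operands = [normalize_operand(o) for o in rest.split(",")]
    return mnemonic + " " + ", ".join(operands)


for insn in ("mov edi, edi", "push ebp", "mov ebp, esp", "mov eax, dword ptr [ebp+8]"):
    print(normalize(insn))
```

Applied to the four instructions on the slide, this prints the same normalized form shown there.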

Clone Detector: Exact Clones
- Compare statements between regions
- Two regions are considered an exact clone if all their normalized statements are identical (i.e. same hash value)

Clone Detector: Exact Clones
Fragment 1:
    sub_402d5f proc near
        mov     REGGen32, REGGen32
        push    REGIdxPtr
        push    REGIdxPtr
        mov     REGIdxPtr, REGGen32
        lea     REGIdxPtr, VAL
        mov     REGGen32, MEM
        test    REGGen32, REGGen32
        jz      short VAL
        push    REGGen16
        ...
        retn
    sub_402d5f endp
Fragment 2:
    sub_579aeg proc near
        ...
        mov     REGIdxPtr, REGGen32
        lea     REGIdxPtr, VAL
        mov     REGGen32, MEM
        test    REGGen32, REGGen32
        jz      short VAL
        push    REGGen16
        ...
        retn
    sub_579aeg endp
Fragment 3:
    sub_579aeg proc near
        mov     REGIdxPtr, REGGen32
        lea     REGIdxPtr, VAL
        mov     REGGen32, MEM
        test    REGGen32, REGGen32
        push    REGGen16
        mov     REGIdxPtr, REGGen32
        lea     REGIdxPtr, VAL
        ...
        retn
    sub_579aeg endp
Hash table (Hash Key -> Clone Cluster), keys shown: 4464394, 6486468, 1561898
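Grouping regions by the hash of their normalized statements can be sketched as below. The talk only says that exact clones share a hash value; the use of SHA-1 from `hashlib`, the region identifiers, and the function names are assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def region_key(normalized_region):
    """Hash the normalized statements of one region into a lookup key."""
    text = "\n".join(normalized_region)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def exact_clone_clusters(regions):
    """regions: dict mapping a region id to its list of normalized statements.
    Returns the groups of region ids that share the same hash key."""
    buckets = defaultdict(list)
    for region_id, statements in regions.items():
        buckets[region_key(statements)].append(region_id)
    return [ids for ids in buckets.values() if len(ids) > 1]


regions = {
    "sub_A:0": ["mov REG, REG", "push REG", "mov REG, MEM"],
    "sub_B:4": ["mov REG, REG", "push REG", "mov REG, MEM"],
    "sub_C:0": ["push REG", "call VAL", "retn"],
}
print(exact_clone_clusters(regions))  # [['sub_A:0', 'sub_B:4']]
```

Because each region is hashed once and looked up in a table, exact clones are found without comparing every pair of regions.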

Clone Detector: Inexact Clones
- Compute a feature vector for each region
- Feature vectors are constructed based on
  - Mnemonics of instructions
  - Types of operands in instructions
  - Combination of mnemonic and operands
- Two regions are considered an inexact clone if the similarity between their feature vectors is within a minimum similarity threshold
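A feature-vector comparison along these lines can be sketched with simple counts. The three feature families follow the list above, but the similarity measure is not named in the talk: cosine similarity, the threshold of 0.8, and all function names here are assumptions.

```python
import math
from collections import Counter


def feature_vector(normalized_region):
    """Count mnemonics, operand types, and mnemonic+operand combinations."""
    features = Counter()
    for statement in normalized_region:
        mnemonic, _, rest = statement.partition(" ")
        operands = [o.strip() for o in rest.split(",") if o.strip()]
        features["mnem:" + mnemonic] += 1
        for operand in operands:
            features["op:" + operand] += 1
        features["combo:" + mnemonic + ":" + ",".join(operands)] += 1
    return features


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def is_inexact_clone(region_1, region_2, threshold=0.8):
    similarity = cosine(feature_vector(region_1), feature_vector(region_2))
    return similarity >= threshold


a = ["mov REG, MEM", "mov REG, MEM", "mov MEM, REG", "cmp MEM, REG"]
b = ["mov REG, MEM", "mov MEM, REG", "mov REG, MEM", "cmp MEM, REG"]
print(is_inexact_clone(a, b))  # True: same statements in a different order
```

Counting features ignores statement order, so reordered fragments such as the Type III example score as highly similar even though their hashes differ.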

Clone Detector: Duplicate Clone Merger
Remove clones that are highly overlapping regions in the same function
    push    edi
    call    ds:GetTickCount
    sub     eax, dword_1000d22c
    push    eax
    lea     eax, [esp+0ch]
    push    offset a9lu
    push    eax
    call    _sprintf
    lea     ecx, [esp+14h]
    lea     edx, [esp+24h]
    push    ecx
    push    offset Dest
    push    offset a8ss
    push    edx
    call    _sprintf
    mov     eax, dword_1000d218
    mov     ecx, dword_1000a044
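One reading of this step is that a reported clone pair is discarded when both of its regions lie in the same function and cover largely the same instructions, as can happen with repetitive sequences like the push/call runs above. The sketch below follows that reading; the region tuple layout, the 0.5 overlap limit, and the function names are assumptions, and regions are assumed to be non-empty.

```python
def overlap_ratio(a, b):
    """Fraction of the shorter region covered by the overlap.
    Regions are (start, end) instruction indices with end exclusive."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    shorter = min(a[1] - a[0], b[1] - b[0])
    return max(overlap, 0) / shorter


def is_spurious_pair(region_a, region_b, max_overlap=0.5):
    """region_a, region_b: (function, start, end) tuples."""
    same_function = region_a[0] == region_b[0]
    return same_function and overlap_ratio(region_a[1:], region_b[1:]) > max_overlap


def drop_duplicate_clones(pairs, max_overlap=0.5):
    """Keep only clone pairs that are not self-overlapping."""
    return [(a, b) for a, b in pairs if not is_spurious_pair(a, b, max_overlap)]


pairs = [
    (("f", 0, 10), ("f", 4, 14)),   # same function, 60% overlap: dropped
    (("f", 0, 10), ("g", 4, 14)),   # different functions: kept
    (("f", 0, 10), ("f", 12, 22)),  # same function, no overlap: kept
]
print(drop_duplicate_clones(pairs))
```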

Clone Detector: Maximal Clone Merger
Merge consecutive cloned regions
    push    edi
    call    ds:GetTickCount
    sub     eax, dword_1000d22c
    push    eax
    lea     eax, [esp+0ch]
    push    offset a9lu
    push    eax
    call    _sprintf
    lea     ecx, [esp+14h]
    lea     edx, [esp+24h]
    push    ecx
    push    offset Dest
    push    offset a8ss
    push    edx
    call    _sprintf
    mov     eax, dword_1000d218
    mov     ecx, dword_1000a044
    lea     edi, [esp+34h]
    add     esp, 1Ch
    lea     edx, [ecx+eax+0ah]
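Merging consecutive cloned regions is essentially interval merging applied to both sides of a clone pair at once. The sketch below assumes clone pairs between two fixed functions, given as `(a_start, a_end, b_start, b_end)` with exclusive ends; this representation and the name `merge_maximal` are illustrative choices, not taken from the prototype.

```python
def merge_maximal(pairs):
    """Merge clone pairs whose regions touch or overlap on both sides and
    keep the same relative shift, producing the largest continuous clones."""
    merged = []
    for a0, a1, b0, b1 in sorted(pairs):
        if merged:
            m = merged[-1]
            same_shift = (b0 - a0) == (m[2] - m[0])
            touches = a0 <= m[1] and b0 <= m[3]
            if same_shift and touches:
                merged[-1] = (m[0], max(m[1], a1), m[2], max(m[3], b1))
                continue
        merged.append((a0, a1, b0, b1))
    return merged


pairs = [
    (0, 10, 100, 110),   # three overlapping windows produced with step 4 ...
    (4, 14, 104, 114),
    (8, 18, 108, 118),
    (40, 50, 300, 310),  # ... and one unrelated clone pair
]
print(merge_maximal(pairs))  # [(0, 18, 100, 118), (40, 50, 300, 310)]
```

Without this step, a single long copied sequence would be reported as many overlapping window-sized clones.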

Clone Detector: Clone Search

Case Study
- 18 open source dynamic link libraries (DLLs)
  - Recall and precision consistently above 80%
- Zeus and Blaster malware
  - Precision over 96%
- Efficiency is not sensitive to the window size
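For reference, precision and recall in this setting are usually computed from the set of clone pairs a tool reports and the set of true clone pairs. The sketch below uses the standard definitions; the function name and the example sets are made up for illustration and are not data from the case study.

```python
def precision_recall(reported, truth):
    """precision = correct reports / all reports
    recall    = correct reports / all true clones"""
    reported, truth = set(reported), set(truth)
    true_positives = len(reported & truth)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical ids: 5 reported clones, 4 of them correct, 8 true clones overall.
print(precision_recall({1, 2, 3, 4, 5}, {1, 2, 3, 4, 6, 7, 8, 9}))  # (0.8, 0.5)
```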

Future Work
- Proof-of-concept prototypes
- RE Source
  - Automatically identify a larger proportion of libraries
- Clone Detector
  - Improve the precision and recall of inexact clone detection
  - Support semantic clones
- Conduct additional case studies

Questions
philippe.charland@drdc-rddc.gc.ca