Software Fingerprinting for Automated Malicious Code Analysis
Philippe Charland
Mission Critical Cyber Security Section
October 25, 2012
Terms of Release: This document is approved for release to Defence Departments, Defence Contractors, and Governments. Any further distribution requires the prior approval of the Defence R&D Canada Valcartier Document Review Panel.
Outline
- Software Reverse Engineering
- Motivation
- Research Objective
- Prototypes
- Future Work
Why Software Reverse Engineering?
- To develop a solid understanding of software for which there is no
  - Documentation
  - Source code
- Malicious software (malware) falls into this category
Malware Figures
- SophosLabs
  - Analyzed 95,000 malware pieces every day in 2010
- Panda Security
  - 26 million new malware samples were identified in 2011
  - 73,000 strains per day
Targeted Attacks
Software Reverse Engineering Process
- Increase in complexity
  - Unpacking
  - Deobfuscation / Software Dearmoring
  - Disassembly / Code-level Analysis
  - Relevant and Interesting Feature Identification
- Tools: Debugger, IDA Pro, OllyDbg
- Experience-based
  - Newbies have trouble with this
Assembly Code Analysis
- Most nebulous portion of the process
- Largely depends on intuition and experience
- Looking at assembly is tedious
- Not seeing the forest for the trees
- Analyst fatigue
- High level of attention required
Assembly Code Analysis
Question:
    lea eax, DWORD PTR [edx+edx]
    add eax, eax
    add eax, eax
    add eax, eax
    add eax, eax
Answer: y = x * 32 (with x in edx and y in eax: the lea doubles x, and each of the four add instructions doubles the result again, so the factor is 2^5 = 32)
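The answer can be checked by mimicking the five instructions on 32-bit values. The following is a minimal sketch in Python; the function name `run_snippet` and the explicit 32-bit mask are illustrative choices, not part of the slides.

```python
MASK32 = 0xFFFFFFFF  # eax is a 32-bit register, so results wrap modulo 2**32


def run_snippet(edx: int) -> int:
    """Mimic the lea/add sequence from the slide and return the final eax."""
    eax = (edx + edx) & MASK32       # lea eax, [edx+edx]  -> eax = 2 * edx
    for _ in range(4):               # four times: add eax, eax
        eax = (eax + eax) & MASK32   # each add doubles eax
    return eax


# The sequence agrees with a multiplication by 32 (modulo 2**32).
for x in (0, 1, 7, 1000):
    assert run_snippet(x) == (x * 32) & MASK32

print(run_snippet(7))  # 224
```

Recognizing this kind of idiom by eye is exactly the sort of step that depends on analyst experience.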
Assembly Code Analysis
- Doing everything manually is unsustainable...
- Throwing more reverse engineers at the problem is not possible...
Assembly Code Analysis
- Automate some of the assembly code analysis process!
Motivation
- Malware authors
  - Develop huge numbers of variants to bypass antivirus
  - Exchange source code among themselves
  - Reuse open source code
- Reverse engineers
  - Leverage the code reuse in malware
  - Reduce redundant analysis efforts
  - Accelerate the reverse engineering process
Research Objective
- Automatically identify code fragments that reuse
  1. Open source code
  2. Previously analyzed assembly code
Assembly and Source Code Matching
- RE-Google (regoogle.carnivore.it)
  - IDA Pro plug-in
  - Python script
- Enumerate all functions and extract
  - Strings
  - Constants
  - Imported function names
- Perform a Google Code Search
- Add top results as function comments
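The extract-then-search idea can be sketched outside IDA Pro on a plain text listing. This is an illustration only, not the plug-in's code: the function names, the regular expressions, the eight-digit cut-off for "interesting" constants, and the sample listing are all assumptions made for the example.

```python
import re


def extract_features(listing: str) -> dict:
    """Collect string literals, large hex constants and called names from a listing."""
    strings = re.findall(r'"([^"]*)"', listing)
    # IDA-style hexadecimal immediates such as 6A09E667h or 0BB67AE85h
    raw_constants = re.findall(r'\b([0-9][0-9A-Fa-f]*)h\b', listing)
    # Assumption: only constants with at least 8 hex digits are distinctive
    constants = [c.lstrip("0") for c in raw_constants if len(c.lstrip("0")) >= 8]
    calls = re.findall(r'\bcall\s+(?:ds:)?([A-Za-z_][\w@]*)', listing)
    return {"strings": strings, "constants": constants, "imports": calls}


def build_query(features: dict, max_terms: int = 8) -> str:
    """Turn the extracted features into a single search query string."""
    terms = [f'"{s}"' for s in features["strings"]]
    terms += [f"0x{c}" for c in features["constants"]]
    terms += features["imports"]
    return " ".join(terms[:max_terms])


# Hypothetical function body, written for this example.
sample = '''
mov     dword ptr [esi], 6A09E667h
mov     dword ptr [esi+4], 0BB67AE85h
push    offset aSelfTest ; "self test failed"
call    ds:printf
add     esp, 0Ch
'''

features = extract_features(sample)
print(build_query(features))
```

The resulting query string would then be submitted to a code search engine and the top hits attached to the function as comments.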
Results: Constants
RE-Google
- Relies on the Google Code Search API
  - Shut down on January 15, 2012...
- Look for alternatives...
Google Code Search Alternatives
- As suggested on the Google Code Search Group:
  - www.antepedia.com
  - www.grepcode.com
  - www.koders.com
  - www.krugle.org
  - ...
Koders
- Merging with Ohloh (code.ohloh.net)
- Index and search 10+ billion lines of code (BLOCs), 3x the amount of Koders
- Support 43 programming languages
Koders: Search for SHA-512
RE-Source
- IDA Pro plug-in
- Based on the original RE-Google Python script
- Diagram components: Assembly File; RE-Source (Extract Features, Comment Functions, Build Query, Parse Results); HTML Page
RE-Source: Precise Calculator Case Study
Precise Calculator
- Open source programmable scientific calculator
- Has more than 150 mathematical and statistical functions
- Written in C++ and assembly
  - 9 .h files
  - 7 .cpp files
  - 2 assembly files
RE-Source: Precise Calculator Case Study
- Disassembled executable contains 533 functions
- Features extracted for 67 functions
- Identified 5 of the 7 .cpp files with 100% accuracy
  - 70% of the original source code
- Detected functions
  - Mathematical, geometrical and statistical
  - Parsing, editing, GUI
Clone Detection
- Technique to identify duplicate code fragments in a code base
- Most algorithms operate on source code
  - Decrease code size by consolidating it
  - Facilitate program comprehension and software maintenance
- Commercial off-the-shelf software
  - Copyright infringements
  - Plagiarism detection
Clone Detection vs. Clone Search
- Clone Detection
  - Identify all the similar code fragments within a code base
  - Compare every code fragment pair
- Clone Search
  - Identify only the code fragments similar to a target fragment
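The difference in workload can be made concrete with a small sketch. The `similar` predicate below is a stand-in (plain equality of normalized statement lists) chosen only for illustration; all names are hypothetical.

```python
from itertools import combinations


def similar(a, b) -> bool:
    """Placeholder similarity test: fragments match if their statements are equal."""
    return a == b


def detect_clones(fragments):
    """Clone detection: compare every pair of fragments in the code base."""
    indexed = enumerate(fragments)
    return [(i, j) for (i, a), (j, b) in combinations(indexed, 2) if similar(a, b)]


def search_clones(target, fragments):
    """Clone search: compare only the target fragment against the code base."""
    return [i for i, fragment in enumerate(fragments) if similar(target, fragment)]


code_base = [
    ["push REG", "call VAL"],
    ["mov REG, REG"],
    ["push REG", "call VAL"],
]

print(detect_clones(code_base))                   # [(0, 2)]
print(search_clones(["mov REG, REG"], code_base))  # [1]
```

For n fragments, detection performs n(n-1)/2 comparisons, whereas a search for one target performs n.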
Clone Types
- Syntactic Clones
  - Textual similarity
  - Type I, II, III clones
- Semantic Clones
  - Functional similarity
  - Type IV clones
Syntactic Clones: Type I
- Identical code fragments except for variations in whitespace, layout and comments
Fragment 1:
    push eax ; Memory
    call ds:_aligned_free
    and dword ptr [esi], 0
    pop ecx
Fragment 2:
    push eax
    call ds:_aligned_free
    and dword ptr [esi], 0
    pop ecx
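A Type I comparison amounts to discarding comments and whitespace differences before comparing. A minimal sketch, assuming IDA-style `;` comments; the helper names are hypothetical.

```python
def canonical(fragment: str) -> list:
    """Drop comments, blank lines and redundant whitespace from a listing."""
    lines = []
    for line in fragment.splitlines():
        code = line.split(";", 1)[0]    # IDA-style comments start with ';'
        code = " ".join(code.split())   # collapse runs of spaces and tabs
        if code:
            lines.append(code)
    return lines


def is_type1_clone(a: str, b: str) -> bool:
    """Two fragments are Type I clones if they are equal after canonicalization."""
    return canonical(a) == canonical(b)


fragment_1 = """
push    eax             ; Memory
call    ds:_aligned_free
and     dword ptr [esi], 0
pop     ecx
"""

fragment_2 = """
push eax
call ds:_aligned_free
and dword ptr [esi], 0
pop ecx
"""

print(is_type1_clone(fragment_1, fragment_2))  # True
```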
Syntactic Clones: Type II
- Structurally/syntactically identical fragments except for variations in identifiers, literals, types, layout and comments
Fragment 1:
    push edi ; Size
    call _malloc
    mov edx, eax
    mov ecx, edi
    mov [esp+24h+var_c], edx
    mov edi, edx
    mov edx, ecx
    xor eax, eax
    shr ecx, 2
    rep stosd
    mov ecx, edx
    add esp, 4
    and ecx, 3
    rep stosb
    mov eax, [esp+20h+var_c]
    test eax, eax
    jnz loc_10001a97
    mov eax, [ebx]
    push eax
Fragment 2:
    push edi ; Size
    call _malloc
    mov edx, eax
    mov ecx, edi
    mov [esp+20h+inbuffer], edx
    mov edi, edx
    mov edx, ecx
    xor eax, eax
    shr ecx, 2
    rep stosd
    mov ecx, edx
    add esp, 4
    and ecx, 3
    rep stosb
    mov eax, [esp+1ch+inbuffer]
    test eax, eax
    jnz loc_10001493
    mov eax, [ebx]
    push eax
Syntactic Clones: Type III
- Copied fragments with further modifications
- Statements can be changed, added or removed in addition to variations in identifiers, literals, types, layout and comments
Fragment 1:
    mov esi, [ebp+arg_0]
    mov edx, [esi+214h]
    mov edi, [esi+220h]
    mov [ebp+var_4], edx
    cmp [esi+21ch], edi
    jl short loc_76641044
    lea ebx, [edx+edi*8]
Fragment 2:
    mov esi, [ebp+arg_0]
    mov edx, [esi+214h]
    mov [ebp+var_4], edx
    mov edi, [esi+220h]
    cmp [esi+21ch], edi
    jl short loc_76641044
    lea ebx, [edx+edi*8]
Semantic Clones: Type IV
- Two or more code fragments that perform the same computation implemented through different syntactic variants
Fragment 1:
    strlen1 proc near
    arg_0 = dword ptr 4
    mov eax, [esp+arg_0]
    loc_401004:
    cmp byte ptr [eax], 0
    jz short done
    inc eax
    jmp short loc_401004
    done:
    sub eax, [esp+arg_0]
    retn
    strlen1 endp
Fragment 2:
    strlen3 proc near
    arg_0 = dword ptr 4
    push edi
    mov edi, [esp+4+arg_0]
    xor ecx, ecx
    not ecx
    xor al, al
    cld
    repne scasb
    not ecx
    lea eax, [ecx-1]
    pop edi
    retn
    strlen3 endp
Clone Detector Overview
- Extended from A. Saebjornsen, et al. (2009), University of California, Davis
- Diagram components: Disassembler, Assembly Files, Regionizer, Normalizer, Binary Files, Token Indexer, Exact Clone Detector, Inexact Clone Detector, Visualizer, XML File, Maximal Clone Merger, Duplicate Clone Merger
Clone Detector: Regionizer
    sub_402d5f proc near ; CODE XREF: sub_402fc1+12p
    mov edi, edi
    push esi
    push edi
    mov edi, ecx
    lea esi, [edi+0d0h]
    mov eax, [esi]
    test eax, eax
    jz short loc_402d7c
    push eax ; Memory
    call ds:_aligned_free
    and dword ptr [esi], 0
    pop ecx
    loc_402d7c: ; CODE XREF: sub_402d5f+10j
    and dword ptr [edi+0d4h], 0
    push 90h ; Size
    push 0 ; Val
    add edi, 40h
    push edi ; Dst
    call memset
    add esp, 0Ch
    pop edi
    pop esi
    retn
    sub_402d5f endp
Clone Detector: Regionizer
- Window Size = 10 instructions
- Step Size = 4 instructions
- Overlapping regions marked on the listing: Region 0, Region 1, Region 2, Region 3, Region 4, Region 5
    sub_402d5f proc near
    mov edi, edi
    push esi
    push edi
    mov edi, ecx
    lea esi, [edi+0d0h]
    mov eax, [esi]
    test eax, eax
    jz short loc_402d7c
    push eax
    call ds:_aligned_free
    and dword ptr [esi], 0
    pop ecx
    and dword ptr [edi+0d4h], 0
    push 90h
    push 0
    add edi, 40h
    push edi
    call memset
    add esp, 0Ch
    pop edi
    pop esi
    retn
    sub_402d5f endp
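The sliding-window split can be sketched as follows. One assumption is made here that the slides do not state: windows running past the end of the function are kept in truncated form. Under that assumption the 22 instructions of sub_402d5f yield six regions, the same count as Region 0 to Region 5 on the slide; the function name `regionize` is hypothetical.

```python
def regionize(instructions, window_size=10, step_size=4):
    """Split a function's instruction list into overlapping regions.

    Assumption: trailing windows shorter than window_size are kept,
    not dropped.
    """
    return [
        instructions[start:start + window_size]
        for start in range(0, len(instructions), step_size)
    ]


# The 22 instructions of sub_402d5f, without labels or comments.
sub_402d5f = [
    "mov edi, edi", "push esi", "push edi", "mov edi, ecx",
    "lea esi, [edi+0d0h]", "mov eax, [esi]", "test eax, eax",
    "jz short loc_402d7c", "push eax", "call ds:_aligned_free",
    "and dword ptr [esi], 0", "pop ecx", "and dword ptr [edi+0d4h], 0",
    "push 90h", "push 0", "add edi, 40h", "push edi", "call memset",
    "add esp, 0Ch", "pop edi", "pop esi", "retn",
]

regions = regionize(sub_402d5f)
for number, region in enumerate(regions):
    print(f"Region {number}: {len(region)} instructions, starts with '{region[0]}'")
```

Because the step is smaller than the window, consecutive regions share six instructions, which is why later stages need to merge overlapping and consecutive clones.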
Clone Detector: Normalization
- Registers, constants and memory addresses are normalized
- Constants
  - VAL or VALx, where x is an index number
- Memory addresses
  - MEM or MEMx, where x is an index number
- Registers
  - Different normalization levels are available
Clone Detector: Normalization
Register normalization levels:
    Level                          Examples
    REG                            EAX -> REG, CS -> REG, EDI -> REG
    REGSeg, REGGen, REGIdxPtr      EAX -> REGGen, CS -> REGSeg, EDI -> REGIdxPtr
    REGGen8, REGGen16, REGGen32    EAX -> REGGen32, AX -> REGGen16, AH -> REGGen8
    REGx                           EAX -> REG#0, AX -> REG#1, AH -> REG#2
Hierarchy diagram labels: REG, REGSeg, REGGen, REGIdxPtr, REGGen8, REGGen16, REGGen32, REGx
Clone Detector: Normalization
Assembly code:
    mov edi, edi
    push ebp
    mov ebp, esp
    mov eax, dword ptr [ebp+8]
Normalized assembly code:
    mov REG, REG
    push REG
    mov REG, REG
    mov REG, MEM
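A rough sketch of operand normalization at the two coarsest register levels is given below. It is not the prototype's normalizer: the register class sets are assumptions extrapolated from the three examples on the slides (EAX, CS, EDI), and only registers, bracketed memory operands and numeric constants are handled.

```python
import re

# Assumed register classes; the slides only give EAX, CS and EDI as examples.
SEGMENT = {"cs", "ds", "es", "fs", "gs", "ss"}
INDEX_POINTER = {"esi", "edi", "ebp", "esp"}
GENERAL = {"eax", "ebx", "ecx", "edx"}


def normalize_operand(operand: str, level: str = "REG") -> str:
    """Map one operand to MEM, VAL or a register class name."""
    op = operand.strip().lower()
    if "[" in op:                                   # bracketed memory reference
        return "MEM"
    if op in SEGMENT | INDEX_POINTER | GENERAL:
        if level == "REG":                          # coarsest level
            return "REG"
        if op in SEGMENT:
            return "REGSeg"
        if op in INDEX_POINTER:
            return "REGIdxPtr"
        return "REGGen"
    if re.fullmatch(r"[0-9][0-9a-f]*h?", op):       # decimal or IDA-style hex
        return "VAL"
    return op                                       # anything else is left alone


def normalize(instruction: str, level: str = "REG") -> str:
    """Normalize every operand of a single instruction."""
    mnemonic, _, rest = instruction.strip().partition(" ")
    if not rest.strip():
        return mnemonic
    operands = [normalize_operand(o, level) for o in rest.split(",")]
    return f"{mnemonic} {', '.join(operands)}"


prologue = ["mov edi, edi", "push ebp", "mov ebp, esp", "mov eax, dword ptr [ebp+8]"]
for line in prologue:
    print(normalize(line))
```

Passing `level="TYPE"` (any value other than `"REG"`) selects the finer REGSeg / REGGen / REGIdxPtr classes in this sketch.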
Clone Detector: Exact Clones
- Compare statements between regions
- Two regions are considered an exact clone if all their normalized statements are identical (i.e. same hash value)
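Hash-based exact matching can be sketched by bucketing regions on a digest of their normalized statements. SHA-1 from `hashlib` is an arbitrary choice for this illustration; the slides do not say which hash function the prototype uses, and the region identifiers are made up.

```python
import hashlib
from collections import defaultdict


def region_hash(normalized_region) -> str:
    """Hash the normalized statements of one region."""
    joined = "\n".join(normalized_region)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def exact_clone_clusters(regions: dict) -> list:
    """Group region ids whose normalized statements hash to the same key."""
    buckets = defaultdict(list)
    for region_id, statements in regions.items():
        buckets[region_hash(statements)].append(region_id)
    # Only buckets holding more than one region form a clone cluster.
    return [ids for ids in buckets.values() if len(ids) > 1]


# Hypothetical normalized regions, keyed by "function:region number".
regions = {
    "sub_A:0": ["mov REG, REG", "lea REG, VAL", "mov REG, MEM"],
    "sub_A:1": ["test REG, REG", "jz short VAL", "push REG"],
    "sub_B:3": ["mov REG, REG", "lea REG, VAL", "mov REG, MEM"],
}

print(exact_clone_clusters(regions))  # [['sub_A:0', 'sub_B:3']]
```

A single pass over the regions suffices, since identical regions land in the same bucket without any pairwise comparison.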
Clone Detector: Exact Clones
Listing 1:
    sub_402d5f proc near
    mov REGGen32, REGGen32
    push REGIdxPtr
    push REGIdxPtr
    mov REGIdxPtr, REGGen32
    lea REGIdxPtr, VAL
    mov REGGen32, MEM
    test REGGen32, REGGen32
    jz short VAL
    push REGGen16
    ...
    retn
    sub_402d5f endp
Listing 2:
    sub_579aeg proc near
    ...
    mov REGIdxPtr, REGGen32
    lea REGIdxPtr, VAL
    mov REGGen32, MEM
    test REGGen32, REGGen32
    jz short VAL
    push REGGen16
    ...
    retn
    sub_579aeg endp
Listing 3:
    sub_579aeg proc near
    mov REGIdxPtr, REGGen32
    lea REGIdxPtr, VAL
    mov REGGen32, MEM
    test REGGen32, REGGen32
    push REGGen16
    mov REGIdxPtr, REGGen32
    lea REGIdxPtr, VAL
    ...
    retn
    sub_579aeg endp
Hash table (columns: Hash Key, Clone Cluster), with hash keys 4464394, 6486468, 1561898
Clone Detector: Inexact Clones
- Compute a feature vector for each region
- Feature vectors are constructed based on
  - Mnemonics of instructions
  - Types of operands in instructions
  - Combination of mnemonic and operands
- Two regions are considered an inexact clone if the similarity between their feature vectors is at least a minimum similarity threshold
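The feature-vector comparison might look like the sketch below. The three feature families follow the slide, but the similarity measure (cosine similarity) and the 0.8 threshold are assumptions made for illustration; the slides name neither.

```python
import math
from collections import Counter


def feature_vector(region) -> Counter:
    """Count mnemonics, operand types and mnemonic/operand combinations.

    region: list of (mnemonic, [operand types]) tuples.
    """
    features = Counter()
    for mnemonic, operand_types in region:
        features[f"mnem:{mnemonic}"] += 1
        for operand_type in operand_types:
            features[f"op:{operand_type}"] += 1
        features[f"combo:{mnemonic}:{','.join(operand_types)}"] += 1
    return features


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_inexact_clone(r1, r2, threshold: float = 0.8) -> bool:
    return cosine_similarity(feature_vector(r1), feature_vector(r2)) >= threshold


region_a = [("mov", ["REG", "MEM"]), ("mov", ["REG", "MEM"]),
            ("cmp", ["MEM", "REG"]), ("jl", ["VAL"])]
# Same statements in a different order, as in a Type III clone.
region_b = [("mov", ["REG", "MEM"]), ("cmp", ["MEM", "REG"]),
            ("mov", ["REG", "MEM"]), ("jl", ["VAL"])]
region_c = [("push", ["REG"]), ("call", ["VAL"])]

print(is_inexact_clone(region_a, region_b))  # True
print(is_inexact_clone(region_a, region_c))  # False
```

Because the vectors only count features, reordered statements do not lower the score, which is what lets this stage catch clones the exact detector misses.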
Clone Detector: Duplicate Clone Merger
- Remove clones that are highly overlapping regions in the same function
    push edi
    call ds:gettickcount
    sub eax, dword_1000d22c
    push eax
    lea eax, [esp+0ch]
    push offset a9lu
    push eax
    call _sprintf
    lea ecx, [esp+14h]
    lea edx, [esp+24h]
    push ecx
    push offset Dest
    push offset a8ss
    push edx
    call _sprintf
    mov eax, dword_1000d218
    mov ecx, dword_1000a044
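One plausible reading of this step is a filter over reported clone pairs: a pair is dropped when both regions sit in the same function and share most of their instructions. The overlap measure and the 0.5 cut-off below are assumptions for the sake of a runnable sketch, as are all names.

```python
def overlap_ratio(a, b) -> float:
    """Fraction of the shorter span covered by both spans.

    a, b: (start, end) instruction index ranges, end exclusive.
    """
    shared = min(a[1], b[1]) - max(a[0], b[0])
    shortest = min(a[1] - a[0], b[1] - b[0])
    return max(shared, 0) / shortest if shortest > 0 else 0.0


def drop_overlapping_clones(clone_pairs, max_overlap: float = 0.5):
    """Keep a clone pair unless both regions overlap heavily in one function."""
    kept = []
    for (func_a, span_a), (func_b, span_b) in clone_pairs:
        if func_a == func_b and overlap_ratio(span_a, span_b) > max_overlap:
            continue  # a region "cloning" its own shifted window
        kept.append(((func_a, span_a), (func_b, span_b)))
    return kept


# Hypothetical clone pairs: (function name, (start, end)).
pairs = [
    (("sub_A", (0, 10)), ("sub_A", (4, 14))),   # same function, 60% overlap
    (("sub_A", (0, 10)), ("sub_B", (4, 14))),   # different functions
    (("sub_A", (0, 10)), ("sub_A", (12, 22))),  # same function, no overlap
]

for pair in drop_overlapping_clones(pairs):
    print(pair)
```

Such self-matches arise naturally from the sliding window, since neighbouring regions of one function share most of their instructions.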
Clone Detector: Maximal Clone Merger
- Merge consecutive cloned regions
Listing (shown twice on the slide):
    push edi
    call ds:gettickcount
    sub eax, dword_1000d22c
    push eax
    lea eax, [esp+0ch]
    push offset a9lu
    push eax
    call _sprintf
    lea ecx, [esp+14h]
    lea edx, [esp+24h]
    push ecx
    push offset Dest
    push offset a8ss
    push edx
    call _sprintf
    mov eax, dword_1000d218
    mov ecx, dword_1000a044
    lea edi, [esp+34h]
    add esp, 1Ch
    lea edx, [ecx+eax+0ah]
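Merging consecutive cloned regions can be sketched as interval merging applied to both sides of a clone pair at once. This is a simplified illustration under stated assumptions: each clone is given by the start offsets of two fixed-size regions, and a region extends the previous clone only if it continues it by the same offset on both sides. The function name is hypothetical.

```python
def merge_consecutive(clone_pairs, window_size: int = 10):
    """Merge cloned regions that continue each other in both functions.

    clone_pairs: list of (start_a, start_b) region start offsets.
    Returns a list of ((a_start, a_end), (b_start, b_end)), ends exclusive.
    """
    merged = []
    for start_a, start_b in sorted(clone_pairs):
        if merged:
            (a0, a1), (b0, b1) = merged[-1]
            same_shift = (start_a - a0) == (start_b - b0)
            touches_previous = start_a <= a1
            if same_shift and touches_previous:
                # Extend the previous clone instead of reporting a new one.
                merged[-1] = (
                    (a0, max(a1, start_a + window_size)),
                    (b0, max(b1, start_b + window_size)),
                )
                continue
        merged.append((
            (start_a, start_a + window_size),
            (start_b, start_b + window_size),
        ))
    return merged


# Three consecutive windows (step 4) plus one unrelated clone.
pairs = [(0, 20), (4, 24), (8, 28), (30, 5)]
print(merge_consecutive(pairs))
# [((0, 18), (20, 38)), ((30, 40), (5, 15))]
```

The sketch only compares each region with the most recently merged clone, so interleaved clone sequences would need a more careful grouping.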
Clone Detector: Clone Search
Case Study
- 18 open source code dynamic link libraries (DLLs)
  - Recall and precision consistently above 80%
- Zeus and Blaster malware
  - Precision over 96%
- Efficiency is not sensitive to the window size
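For reference, precision and recall over a set of reported clone pairs can be computed as below. The pairs are made-up placeholders used only to exercise the function; they are not data from the case study.

```python
def precision_recall(reported, expected):
    """Return (precision, recall) for reported clones against a ground truth."""
    reported, expected = set(reported), set(expected)
    true_positives = len(reported & expected)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical clone pairs (region in library 1, region in library 2).
reported = {("f1", "g1"), ("f2", "g2"), ("f3", "g9"), ("f4", "g4")}
expected = {("f1", "g1"), ("f2", "g2"), ("f4", "g4"), ("f5", "g5"), ("f6", "g6")}

precision, recall = precision_recall(reported, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.60
```

Precision is the share of reported clones that are real; recall is the share of real clones that were reported.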
Future Work
- Proof-of-concept prototypes
- RE-Source
  - Automatically identify a larger proportion of libraries
- Clone Detector
  - Improve the precision and recall of inexact clone detection
  - Support semantic clones
- Conduct additional case studies
Questions
philippe.charland@drdc-rddc.gc.ca