Process Virtualization and Symbiotic Optimization Kim Hazelwood ACACES Summer School July 2009 About Your Instructor Currently Assistant Professor at University of Virginia Faculty Consultant at Intel Previously PostDoc at Intel (2004-2005) PhD from Harvard (2004) Four summer internships (HP & IBM) Worked with Dynamo, Jikes RVM, Other Interests Marathons (Boston, NYC, Disney) Reality TV Shows Family (8 month old at home!) 1 1
About the Course Day 1 What is Process Virtualization? Day 2 Building Process Virtualization Systems Day 3 Using Process Virtualization Systems Day 4 Symbiotic Optimization We ll use Pin as a case study www.pintool.org You ll have homework! 2 What is Process Virtualization? System virtualization allows multiple OSes to share the same hardware Process virtualization runs as a normal application (on top of an OS) and supports a single process App 1 App 2 App 1 OS 1 OS 2 DBT VMM OS HW HW System Virtualization App 2 DBI Process Virtualization 3 2
Classifying Virtualization Dynamic binary optimization (x86 x86--) Complement the static compiler User inputs, phases, DLLs, hardware features Examples: DynamoRIO, Mojo, Strata Dynamic translation (x86 PPC) Convert applications to run on a new architecture Examples: Rosetta, Transmeta CMS, DAISY Dynamic instrumentation (x86 x86++) Inspect/add features to existing applications Examples: Pin, Valgrind 4 A Simple Example of Instrumentation Inserting extra code into a program to collect runtime information counter++; sub $0xff, %edx counter++; cmp %esi, %edx counter++; jle <L1> counter++; mov $0x1, %edi counter++; add $0x10, %eax 5 3
Instruction Count Output $ /bin/ls Makefile imageload.out itrace proccount imageload inscount atrace itrace.out $ pin -t inscount.so -- /bin/ls Makefile imageload.out itrace proccount imageload inscount atrace itrace.out Count 422838 6 A Simple Example of Optimization On Pentium 3, inc is faster than add On Pentium 4, add is faster than inc sub cmp jle mov inc $0xff, %edx %esi, %edx <L1> $0x1, %edi %eax sub cmp jle mov add $0xff, %edx %esi, %edx <L1> $0x1, %edi $0x1, %eax 7 4
Research Applications Computer Architecture Trace Generation Fault Tolerance Studies Emulating New Instructions Program Analysis Code coverage Call-graph generation Memory-leak detection Instruction profiling Multicore Thread analysis Thread profiling Race detection Cache simulations Compilers Compare programs from competing compilers Security Add security checks and features 8 Approaches Source modification: Modify source programs Binary modification: Modify executables directly Advantages for binary modification Language independent Machine-level view Modify legacy/proprietary software 9 5
Static vs Dynamic Approaches Dynamic approaches are more robust No need to recompile or relink Discover code at runtime Handle dynamically-generated code Attach to running processes The Code Discovery Problem on x86 Instr 1 Instr 2 Indirect jump to?? Instr 3 Jump Reg DATA Data interspersed Instr 5 Instr 6 with code Uncond Branch PADDING Instr 8 Pad for alignment 10 Dynamic Modification: Approaches JIT Mode Create a modified copy of the application on-the-fly Original code never executes More flexible, more common approach Probe Mode Modifies the original application instructions Inserts jumps to modified code (trampolines) Lower overhead (less flexible) approach 11 6
JIT-Mode Binary Modification Generate and cache modified copies of instructions EXE Transform Profile Code Cache Execute Modified (cached) instructions are executed in lieu of original instructions 12 JIT-Mode Instrumentation Original code 1 Code cache 1 Exits point back to VMM 2 3 7 5 6 4 2 7 Fetch trace starting block 1 and start instrumentation Pin 13 7
JIT-Mode Instrumentation Original code 1 Code cache 1 2 3 5 6 7 4 2 7 Transfer control into code cache (block 1) Pin 14 JIT-Mode Instrumentation Original code 1 Code cache trace linking 1 3 2 3 5 6 7 4 2 7 Fetch and instrument a new trace 5 6 Pin 15 8
Instrumentation Approaches JIT Mode Create a modified copy of the application on-the-fly Original code never executes More flexible, more common approach Probe Mode Modify the original application instructions Insert jumps to instrumentation code (trampolines) Lower overhead (less flexible) approach 16 A Sample Probe A probe is a jump instruction that overwrites original instruction(s) in the application Copy/translate original bytes so probed functions can be called Original function entry point: 0x400113d4: push %ebp 0x400113d5: mov %esp,%ebp 0x400113d7: push %edi 0x400113d8: push %esi 0x400113d9: push %ebx Entry point overwritten with probe: 0x400113d4: jmp 0x41481064 0x400113d9: push %ebx Copy of entry point w/ original bytes: 0x50000004: push %ebp 0x50000005: mov %esp,%ebp 0x50000007: push %edi 0x50000008: push %esi 0x50000009: jmp 0x400113d9 17 9
Probe Instrumentation Advantages: Low overhead few percent Less intrusive execute original code Disadvantages: More tool writer responsibility Restrictions on where to modify (routine-level) 18 Probe Tool Writer Responsibilities No control flow into the instruction space where probe is placed 6 bytes on IA32, 7 bytes on Intel64, bundle on IA64 Branch into replaced instructions will fail Probes at function entry point only Thread safety for insertion/deletion of probes During image load callback is safe Only loading thread has a handle to the image Replacement function has same behavior as original 19 10
Probe vs. JIT Summary Probes JIT Overhead Few percent 50% or higher Intrusive Low High Granularity Function boundary Instruction Safety & Isolation More responsibility for tool writer High 20 Process Virtualization Systems Readily Available DynamoRIO Valgrind Pin Available By Request Strata Adore Unavailable Transmeta CMS Dynamo 21 11
DynamoRIO 22 Valgrind 23 12
Pin 24 Intel Pin Dynamic Instrumentation: Do not need source code, recompilation, post-linking Programmable Instrumentation: Provides rich APIs to write in C/C++ your own instrumentation tools (called Pintools) Multiplatform: Supports x86, x86-64, Itanium, Xscale Supports Linux, Windows, MacOS Robust: Instruments real-life applications: Database, web browsers, Instruments multithreaded d applications Supports signals Efficient: Applies compiler optimizations on instrumentation code 25 13
Using Pin Launch and instrument an application $ pin t pintool.so - application Instrumentation engine (provided in the kit) Instrumentation tool (write your own, or use one provided in the kit) Attach to and instrument an application $ pin t pintool.so pid 1234 26 Pin Instrumentation APIs Basic APIs are architecture independent: Provide common functionalities like determining: Control-flow changes Memory accesses Architecture-specific APIs e.g., Info about opcodes and operands Call-based APIs: Instrumentation routines Analysis routines 27 14
Instrumentation vs. Analysis Concepts borrowed from the ATOM tool: Instrumentation routines define where instrumentation is inserted e.g., before instruction Occurs first time an instruction is executed Analysis routines define what to do when instrumentation is activated e.g., increment counter Occurs every time an instruction is executed 28 Pintool 1: Instruction Count counter++; sub $0xff, %edx counter++; cmp %esi, %edx counter++; jle <L1> counter++; mov $0x1, %edi counter++; add $0x10, %eax 29 15
Pintool 1: Instruction Count Output $ /bin/ls Makefile imageload.out itrace proccount imageload inscount0 atrace itrace.out $ pin -t inscount0.so -- /bin/ls Makefile imageload.out itrace proccount imageload inscount0 atrace itrace.out Count 422838 30 #include <iostream> #include "pin.h" ManualExamples/inscount0.cpp UINT64 icount = 0; void docount() { icount++; } analysis routine instrumentation routine void Instruction(INS ins, void *v) { INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END); } void Fini(INT32 code, void *v) { std::cerr << "Count " << icount << endl; } int main(int argc, char * argv[]) { PIN_Init(argc, argv); INS_AddInstrumentFunction(Instruction, 0); PIN_AddFiniFunction(Fini, 0); PIN_StartProgram(); return 0; } 31 16
Pintool 2: Instruction Trace Print(ip); sub $0xff, %edx Print(ip); cmp %esi, %edx Print(ip); jle <L1> Print(ip); mov $0x1, %edi Pi Print(ip); add $0x10, %eax Need to pass ip argument to the analysis routine (Printip()) 32 Pintool 2: Instruction Trace Output $ pin -t itrace.so -- /bin/ls Makefile imageload.out itrace proccount imageload inscount0 atrace itrace.out $ head -4 itrace.out 0x40001e90 0x40001e91 0x40001ee4 0x40001ee5 33 17
ManualExamples/itrace.cpp #include <stdio.h> #include "pin.h" argument to analysis routine FILE * trace; void printip(void *ip) { fprintf(trace, "%p\n", ip); } analysis routine ( ) instrumentation i routine void Instruction(INS ins, void *v) { INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)printip, IARG_INST_PTR, IARG_END); } void Fini(INT32 code, void *v) { fclose(trace); } int main(int argc, char * argv[]) { trace = fopen("itrace.out", "w"); PIN_Init(argc, argv); INS_AddInstrumentFunction(Instruction, 0); } PIN_AddFiniFunction(Fini, 0); PIN_StartProgram(); return 0; 34 Examples of Arguments to Analysis Routine IARG_INST_PTR Instruction pointer (program counter) value IARG_UINT32 <value> An integer value IARG_REG_VALUE <register name> Value of the register specified IARG_BRANCH_TARGET_ADDR Target address of the branch instrumented IARG_MEMORY_READ_EA Effective address of a memory read And many more (refer to the manual for details) 35 18
Instrumentation Points Instrument points relative to an instruction: Before: IPOINT_BEFORE After: Fall-through edge: IPOINT_AFTER Taken edge: IPOINT_TAKEN_BRANCH count() count() cmp jle mov %esi, %edx <L1> $0x1, %edi count() <L1>: mov $0x8,%edi 36 Instrumentation Granularity Instrumentation can be done at three different granularities: Instruction Basic block A sequence of instructions sub $0xff, %edx terminated at a control-flow cmp %esi, %edx changing instruction jle <L1> Single entry, single exit Trace mov $0x1, %edi A sequence of basic blocks terminated at an unconditional control-flow changing instruction Single entry, multiple exits add jmp $0x10, %eax <L2> 1 Trace, 2 BBs, 6 insts 37 19
Pintool 3: Faster Instruction Count counter += 3 sub $0xff, %edx cmp %esi, %edx jle <L1> counter += 2 mov $0x1, %edi basic blocks (bbl) add $0x10, %eax 38 ManualExamples/inscount1.cpp #include <stdio.h> #include "pin.h UINT64 icount = 0; void docount(int32 c) { icount += c; } analysis routine void Trace(TRACE trace, void *v) { instrumentation routine for (BBL bbl = TRACE_BblHead(trace); BBL_ Valid(bbl); bbl = BBL_ Next(bbl)) { BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)docount, IARG_UINT32, BBL_NumIns(bbl), IARG_END); } } void Fini(INT32 code, void *v) { fprintf(stderr, "Count %lld\n", icount); } int main(int argc, char * argv[]) { PIN_Init(argc, I argv); TRACE_AddInstrumentFunction(Trace, 0); PIN_AddFiniFunction(Fini, 0); PIN_StartProgram(); return 0; } 39 20
What Did We Learn Today? Overview of Process Virtualization Approaches Source vs. Binary Static vs. Dynamic JIT vs. Probes Three Available Systems Three Simple Examples 40 Want More Info? Read Jim Smith s book: Virtual Machines Download one (or more) of them! Pin DynamoRIO Valgrind www.pintool.org org code.google.com/p/dynamorio www.valgrind.org Day 1 What is Process Virtualization? Day 2 Building Process Virtualization ti Systems Day 3 Using Process Virtualization Systems Day 4 Symbiotic Optimization 41 21