Software Security: Memory Virtualization
Jan Nordholz, Prof. Jean-Pierre Seifert
Security in Telecommunications, TU Berlin, SoSe 2016
Virtualization (Recap)

- assume basic virtualization support (Intel VT-x, AMD SVM, ARM VE)
- all sensitive instructions are either handled internally (i.e. they modify virtual state instead of physical state) or cause a trap into the HV

But what about memory management? There's still only one MMU!

Hint: print this document 2-up...
MMU: native case

- recall: hardware register denoting the pagetable base address (x86: CR3)
- CR3 points to a pagetable managed by the OS
- the pagetable contains virtual-to-physical mappings with permission bits, notation: V → P
- the OS maintains one pagetable for each address space
- scheduling to a different process means loading another value into CR3
MMU: native case

- machine with 256 MB RAM, physical address range [P_base, P_base + 256M]
- mapping by OS pagetable: V → P, P ∈ [P_base, P_base + 256M] (obviously)
- accessing V ends up at physical address P
MMU: virtualized case

- machine with 1 GB RAM: physical address range [P_base, P_base + 1G]
- three virtual machines, each assigned 256 MB RAM
- remaining memory reserved for the HV itself
- pseudo-physical ("guest-physical") range for each guest: [GP_base, GP_base + 256M] (the hypervisor mimics a smaller native machine)
- translation between addresses in GP and P: an offset! (different for each guest!)
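The per-guest offset translation above can be sketched in a few lines of C. This is a toy model, assuming the 1 GB / three-guest layout from this slide; the names (`gp_to_phys`, `guest_offset`) are illustrative, not taken from any real hypervisor:

```c
#include <stdint.h>
#include <stdbool.h>

#define MB         (1024ULL * 1024ULL)
#define GUEST_SIZE (256 * MB)

/* Each guest's pseudo-physical range [0, 256M) is backed by a distinct
 * region of host-physical memory; the translation is a constant
 * per-guest offset. */
static uint64_t guest_offset(int guest_id) {
    return (uint64_t)guest_id * GUEST_SIZE;  /* guest 0 at 0, guest 1 at 256M, ... */
}

/* Translate a guest-physical address to host-physical.
 * Returns false if GP lies outside the guest's assigned 256 MB. */
bool gp_to_phys(int guest_id, uint64_t gp, uint64_t *p) {
    if (gp >= GUEST_SIZE)
        return false;                 /* bounds check: guest must stay in its range */
    *p = gp + guest_offset(guest_id);
    return true;
}
```

The bounds check is the part that matters for security: without it, a guest-supplied GP could reach the HV's reserved memory or another guest's region.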
MMU: virtualized case

- mapping by guest OS pagetable: V → GP, GP ∈ [GP_base, GP_base + 256M]
- accessing V ends up at pseudo-physical address GP

That's not what we want! Recall Popek & Goldberg: the HV must have full control!

- guest OS in control of paging: guests can touch arbitrary locations
- guests might overwrite the HV: violation!
- guests would also stomp on each other's memory (each guest thinks it is running alone on native hardware)
MMU: virtualized case

The HV must control paging to protect itself and the guests from each other: the effective value in CR3 must point to an HV shadow pagetable. The HV has to:

- look up intended V → GP tuples in the guest pagetable
- apply a bounds check on GP, apply the guest offset to get P
- create corresponding V → P entries in the shadow pagetable
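The copy-and-translate step can be sketched as follows, assuming a toy single-level pagetable indexed by virtual page number (the `vm_t` struct and `shadow_fill` name are illustrative, not a real HV interface):

```c
#include <stdint.h>
#include <stdbool.h>

#define NPAGES     16
#define ENTRY_NONE UINT64_MAX
#define GUEST_SIZE (256ULL * 1024 * 1024)

typedef struct {
    uint64_t guest_pt[NPAGES];   /* V -> GP, maintained by the guest OS   */
    uint64_t shadow_pt[NPAGES];  /* V -> P,  maintained by the hypervisor */
    uint64_t offset;             /* this guest's GP -> P offset           */
} vm_t;

/* Copy-and-translate one guest entry into the shadow pagetable:
 * look up V -> GP, bounds-check GP, apply the offset, install V -> P. */
bool shadow_fill(vm_t *vm, unsigned vpn) {
    uint64_t gp = vm->guest_pt[vpn];
    if (gp == ENTRY_NONE || gp >= GUEST_SIZE)
        return false;                      /* missing or out-of-bounds entry */
    vm->shadow_pt[vpn] = gp + vm->offset;  /* V -> P, used by the real CR3   */
    return true;
}
```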
Lazy Mapping, native

When are mappings created? Linux (and similar OSes) are lazy: load file contents, reserve memory etc. only when actually necessary.

- ex.: when loading /bin/bash, only the page containing the entry point is actually loaded
- when execution continues to a (yet unmapped) location V... trap into the OS: Page Fault
- OS loads the desired additional page of the executable into a free physical page P, creates V → P with RX permissions
- OS resumes the process
- same for handling stack growth: if the stack exceeds the amount of mapped pages, a new physical page is allocated and mapped
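The native fault path above can be sketched as a tiny handler, assuming a toy single-level pagetable where 0 means "no entry" and a trivial bump allocator stands in for real physical memory management (`handle_page_fault` is an illustrative name):

```c
#include <stdint.h>

#define NPAGES    16
#define PAGE_SIZE 4096ULL

static uint64_t pagetable[NPAGES];           /* V -> P; 0 means unmapped  */
static uint64_t next_free_page = 0x100000;   /* trivial bump allocator    */

/* Page-fault handler: map the faulting virtual page lazily, then the
 * OS resumes the process. */
uint64_t handle_page_fault(unsigned vpn) {
    if (pagetable[vpn] == 0) {
        uint64_t p = next_free_page;
        next_free_page += PAGE_SIZE;         /* grab a free physical page */
        /* ... load the missing page of the executable into p ... */
        pagetable[vpn] = p;                  /* create V -> P (RX)        */
    }
    return pagetable[vpn];                   /* second fault on same V: no-op */
}
```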
Lazy Mapping, virtualized

- process tries to execute a (yet unmapped) location V
- Page Fault must be handled by the HV!
- HV searches the guest pagetable for V

Case 1: no entry V → GP

- HV cannot know what to do with the fault
- forward ("inject") the fault into the guest
- OS handles the fault (just like in the native scenario), but...
- OS resumes the process: it faults again! Why?
- after handling by the OS, V → GP (RX) is in the guest pagetable, but there is no V → P (RX) in the shadow pagetable yet!
Lazy Mapping, virtualized

- process tries to execute a (yet unmapped) location V
- Page Fault must be handled by the HV!
- HV searches the guest pagetable for V

Case 2: now there is an entry V → GP!

- HV checks that GP ∈ [GP_base, GP_base + 256M]
- HV calculates P = GP + offset
- HV inserts V → P into the shadow pagetable
- HV resumes the guest
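Both cases of the HV's fault dispatch can be combined into one sketch, under the same toy single-level model as before (the enum and function names are illustrative):

```c
#include <stdint.h>

#define NPAGES     16
#define ENTRY_NONE UINT64_MAX
#define GUEST_SIZE (256ULL * 1024 * 1024)

typedef enum { INJECT_INTO_GUEST, SHADOW_FILLED, TERMINATE_GUEST } fault_result;

fault_result hv_handle_fault(const uint64_t guest_pt[NPAGES],
                             uint64_t shadow_pt[NPAGES],
                             uint64_t offset, unsigned vpn) {
    uint64_t gp = guest_pt[vpn];
    if (gp == ENTRY_NONE)
        return INJECT_INTO_GUEST;   /* case 1: guest OS must map it first   */
    if (gp >= GUEST_SIZE)
        return TERMINATE_GUEST;     /* guest tried to escape its range      */
    shadow_pt[vpn] = gp + offset;   /* case 2: copy-and-translate           */
    return SHADOW_FILLED;           /* resume guest; no further fault for V */
}
```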
MMU: first conclusions

The HV has to copy and translate guest pagetable entries.

- cost of lazily adding a mapping (native case): 1x Page Fault = 2x transitions between user and system mode
- cost of lazily adding a mapping (virtualized case): 2x Page Fault = 4x transitions between guest and hypervisor, plus 1x injected Page Fault = 2x additional transitions between user and system mode

Much more expensive with virtualization! Process creation is especially bad: each new process = a new pagetable = lots of new entries!
MMU: unanswered details

How does the HV know which guest pagetable to consult? Easy: the HV has to intercept guest access to CR3 anyway. When the guest wants to write a value to CR3:

- HV checks for an existing shadow pagetable for this value
- if one exists: use it
- if not: allocate a new spt, tag it with the guest CR3 value
- program the real CR3 with the address of the recovered/allocated spt
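The CR3 intercept can be sketched as a small cache of shadow pagetables tagged by guest CR3 value. The fixed-size linear-search cache and all names here are illustrative (a real HV would hash, evict, and actually allocate memory):

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_SPT 8   /* toy: assume the cache never fills */

typedef struct {
    uint64_t guest_cr3;  /* tag: value the guest wrote to CR3         */
    uint64_t spt_base;   /* host address programmed into the real CR3 */
} spt_entry;

static spt_entry spt_cache[MAX_SPT];
static size_t    spt_count = 0;
static uint64_t  next_spt_base = 0x200000;   /* trivial allocator */

/* Called on an intercepted guest CR3 write: reuse or allocate a
 * shadow pagetable and return the value for the real CR3. */
uint64_t on_guest_cr3_write(uint64_t guest_cr3) {
    for (size_t i = 0; i < spt_count; i++)
        if (spt_cache[i].guest_cr3 == guest_cr3)
            return spt_cache[i].spt_base;    /* existing spt: reuse it */
    /* no spt yet: allocate an (empty) one and tag it */
    spt_cache[spt_count].guest_cr3 = guest_cr3;
    spt_cache[spt_count].spt_base  = next_spt_base;
    next_spt_base += 4096;
    return spt_cache[spt_count++].spt_base;
}
```

Reusing an existing spt is what makes context switches back to an already-seen process cheap: its lazily built mappings survive.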
MMU: unanswered details

Ok, lazy addition of mappings is slow, but it can be emulated faithfully, i.e. the guest does not notice it is running virtualized. What does the HV do when the guest wants to change or delete a mapping?

- recall: the processor contains a Translation Lookaside Buffer (TLB)
- the TLB caches V → P tuples
- only positive caching: lookup failures are not cached
- changing or deleting a mapping requires a TLB flush for the change to become effective, even in the native case
- the HV can exploit that by intercepting TLB flushes and applying them to the spt as well!

You can think of an spt simply as a (large) software TLB.
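Treating the spt as a software TLB makes the flush interception trivial to sketch. Assuming the same toy single-level spt as before (the function names are illustrative; on x86 the single-page flush would correspond to intercepting INVLPG, the full flush to a CR3 reload):

```c
#include <stdint.h>

#define NPAGES     16
#define ENTRY_NONE UINT64_MAX

/* Intercepted single-page flush: drop the entry from the shadow
 * pagetable just as the hardware drops it from the TLB. */
void on_guest_invlpg(uint64_t spt[NPAGES], unsigned vpn) {
    spt[vpn] = ENTRY_NONE;   /* next access faults; HV re-translates lazily */
}

/* Intercepted full TLB flush: drop everything. */
void on_guest_full_flush(uint64_t spt[NPAGES]) {
    for (unsigned i = 0; i < NPAGES; i++)
        spt[i] = ENTRY_NONE;
}
```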
Hardware updates to page tables

Some architectures support a dirty bit in the pagetable:

- extra bit per pagetable entry, set to 0 by the OS when creating the mapping
- set to 1 by the CPU when a write to the page happens
- dirty bit checked by the OS to determine pages with changed content
- if the memory contents correspond to a file loaded from disk, the OS can optimize disk write-back by writing only changed ("dirty") pages

The HV has to emulate the dirty bit behaviour of the CPU as well!
Lazy mapping w/ dirty bit, virtualized

Changes to the established process: when the HV translates V → GP to V → P, it has to set the dirty bit in the guest pagetable.

What if we have to handle a read fault, but the mapping we find is V → GP with RW permissions?

- a) we translate to V → P (RW), but then we have to immediately set the dirty bit
- b) we translate to V → P (R) and don't set dirty

In b), we will get a write fault later if the page is really written to. In a), we won't, so we have to assume a write might happen.
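Option b) can be sketched as two handlers, assuming a toy pagetable-entry struct with illustrative flag names (`PERM_R`, `PERM_W`, `BIT_DIRTY` are not real architectural bit positions):

```c
#include <stdint.h>

#define PERM_R    0x1
#define PERM_W    0x2
#define BIT_DIRTY 0x4

typedef struct { uint64_t gp; uint8_t flags; } pte;

/* Read fault: install the shadow entry read-only even if the guest
 * entry is RW, so that a future write still traps into the HV. */
void on_read_fault(const pte *guest, pte *shadow, uint64_t offset) {
    shadow->gp    = guest->gp + offset;
    shadow->flags = PERM_R;              /* drop W even if the guest allows it */
}

/* Write fault: now we know the page is really written, so set the
 * dirty bit in the guest entry and upgrade the shadow entry. */
void on_write_fault(pte *guest, pte *shadow, uint64_t offset) {
    guest->flags |= BIT_DIRTY;           /* emulate the CPU's dirty-bit behaviour */
    shadow->gp    = guest->gp + offset;
    shadow->flags = guest->flags & (PERM_R | PERM_W);
}
```

The price of b) is one extra write fault per page; the price of a) is over-approximating the dirty set, which costs unnecessary disk write-back.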
Dirty bit vs. Flush to Disk

- if the OS flushes data to disk, it may clear the dirty bit again
- no TLB flush necessary: the dirty bit is not TLB-relevant
- but the HV needs to know: it has to emulate the dirty bit behaviour!
- imagine the spt contains V → P (RW): the HV will never again get a fault for V, so it has no chance to set the dirty bit for V again

Clearing of the dirty bit must be communicated to the HV:

- A) paravirtualize the guest, i.e. make it aware of the HV and insert explicit hypercalls
- B) observe: the guest clearing a dirty bit is a memory write operation, and the HV can intercept those!
Guest Pagetable Tracking

New idea: track all changes to guest pagetables. When the HV intercepts a guest write to CR3, in addition to creating/reactivating the shadow pagetable:

- walk the guest pagetable at the new CR3 value
- enumerate all pieces in memory that comprise the pagetable (pagetables are not necessarily a contiguous structure)
- revoke write permission to those memory pages in the spt

The HV is now notified of all modifications to the pagetable! This changes the sequence diagram of page faults!
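The tracking step can be sketched as follows, assuming a toy model where guest-physical frames are identified by frame number (all names and the `spte` struct are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define NFRAMES 16
#define PERM_W  0x2

typedef struct { uint64_t p; uint8_t flags; } spte;

static bool tracked[NFRAMES];   /* guest frames holding pagetable pieces */

/* During the walk at CR3-load time: mark a frame as part of the guest
 * pagetable and revoke write permission on the spt entry mapping it. */
void track_pagetable_frame(unsigned gpn, spte *entry_mapping_gpn) {
    tracked[gpn] = true;
    entry_mapping_gpn->flags &= ~PERM_W;   /* guest writes now trap into the HV */
}

/* Write-fault filter: a write to a tracked frame is a pagetable edit,
 * so the HV can validate and mirror it into the spt immediately. */
bool is_pagetable_edit(unsigned gpn) {
    return tracked[gpn];
}
```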
Lazy Mapping, virtualized: take II

- process tries to execute a (yet unmapped) location V
- Page Fault must be handled by the HV!
- HV searches the guest pagetable for V

Case 1: no entry V → GP

- HV cannot know what to do with the fault
- forward ("inject") the fault into the guest
- OS handles the fault, tries to create V → GP: (nested) Page Fault!
- HV notices the change to the pagetable
- HV validates and translates it to V → P
- nested fault resolved, HV resumes the OS
- OS resumes the process

Still 2x Page Fault, but now nested, not sequential.
Hardware Solution

New hardware extension: Nested Paging

- ARM VE: already part of the Virtualization Extensions
- Intel: EPT (Extended Page Tables), 2008
- AMD: NPT (Nested Page Tables), 2008

Key feature: two CR3 registers, one controlled by the guest, one by the HV.

- CPU uses the guest CR3 to translate V to GP...
- ... and the HV CR3 to translate GP to P
- faults can now be attributed properly:
- no entry in the guest table? let the guest OS handle it
- no entry in the HV table? probably an illegal result of the guest table, terminate the guest (but see next week)
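The two-stage walk can be sketched with two toy single-level tables, one per CR3. All names are illustrative; real EPT/NPT walks are multi-level and done in hardware on every TLB miss:

```c
#include <stdint.h>

#define NPAGES     16
#define ENTRY_NONE UINT64_MAX

typedef enum { NP_OK, NP_GUEST_FAULT, NP_HV_FAULT } nested_result;

/* Walk both tables: guest CR3 gives V -> GP, HV CR3 gives GP -> P.
 * The fault can now be attributed to the table that missed. */
nested_result translate_nested(const uint64_t guest_pt[NPAGES],  /* guest CR3 */
                               const uint64_t hv_pt[NPAGES],     /* HV CR3    */
                               unsigned vpn, uint64_t *p) {
    uint64_t gpn = guest_pt[vpn];
    if (gpn == ENTRY_NONE)
        return NP_GUEST_FAULT;     /* no guest entry: guest OS handles it     */
    if (gpn >= NPAGES || hv_pt[gpn] == ENTRY_NONE)
        return NP_HV_FAULT;        /* no HV entry: likely illegal, kill guest */
    *p = hv_pt[gpn];
    return NP_OK;
}
```

Note what disappears compared to shadow paging: there is no copy-and-translate, no pagetable tracking, and no injected faults; each side simply manages its own table.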