Applications of SMT to Test Generation

Applications of SMT to Test Generation
Patrice Godefroid, Microsoft Research
Page 1 June 2012

Test Generation is Big Business
- #1 application for SMT solvers today (CPU usage)
- SAGE @ Microsoft: 1st whitebox fuzzer for security testing
  - 400+ machine-years (since 2008)
  - 3.4+ billion constraints
  - 100s of apps, 100s of security bugs
- Example: Win7 file fuzzing
  - ~1/3 of all fuzzing bugs found by SAGE (missed by everything else)
  - Bug fixes shipped (quietly) to 1 billion+ PCs
  - Millions of dollars saved for Microsoft + time/energy for the world
- [Chart: how fuzzing bugs were found (Win7, 2006-2009); segments: Blackbox Fuzzing + Regression, All Others, SAGE]

Agenda
1. Why Test Generation?
2. What kind of SMT constraints?
3. New: test generation from validity proofs
Disclaimer: the focus here is on test generation for software using SMT (not hardware using SAT).

Part 1: Why Test Generation?
Whitebox Fuzzing for Security Testing (The Killer App)

Security is Critical (to Microsoft)
- Software security bugs can be very expensive:
  - Cost of each Microsoft Security Bulletin: $Millions
  - Cost due to worms (Slammer, CodeRed, Blaster, etc.): $Billions
- Many security exploits are initiated via files or packets
  - Ex: MS Windows includes parsers for hundreds of file formats
- Security testing: "hunting for million-dollar bugs"
  - Write A/V (always exploitable), Read A/V (sometimes exploitable), NULL-pointer dereference, division-by-zero (harder to exploit but still DOS attacks), etc.

Hunting for Security Bugs
- (Slide aside: "I am from Belgium too!")
- Main techniques used by "black hats": code inspection (of binaries) and blackbox fuzz testing
- Blackbox fuzz testing:
  - A form of blackbox random testing [Miller+90]
  - Randomly fuzz (= modify) a well-formed input
  - Grammar-based fuzzing: rules that encode "well-formedness" + heuristics about how to fuzz (e.g., using probabilistic weights)
- Heavily used in security testing
  - Simple yet effective: many bugs found this way
  - At Microsoft, fuzzing is mandated by the SDL

Introducing Whitebox Fuzzing
- Idea: mix fuzz testing with dynamic test generation
  - Dynamic symbolic execution
  - Collect constraints on inputs
  - Negate those, solve with a constraint solver, generate new inputs
  - i.e., do "systematic dynamic test generation" (= DART)
- Whitebox Fuzzing = "DART meets Fuzz"
- Two parts:
  1. Foundation: DART (Directed Automated Random Testing)
  2. Key extensions ("Whitebox Fuzzing"), implemented in SAGE

Automatic Code-Driven Test Generation
- Problem: given a sequential program with a set of input parameters, generate a set of inputs that maximizes code coverage
  = "automate test generation using program analysis"
- This is not "model-based testing" (= generate tests from an FSM spec)

How? (1) Static Test Generation
- Static analysis to partition the program's input space [King76, ...]
- Ineffective whenever symbolic reasoning is not possible, which is frequent in practice (pointer manipulations, complex arithmetic, calls to complex OS or library functions, etc.)
- Example:

    int obscure(int x, int y) {
        if (x == hash(y)) error();
        return 0;
    }

- Can't statically generate values for x and y that satisfy "x==hash(y)"!

How? (2) Dynamic Test Generation
- Run the program (starting with some random inputs), gather constraints on inputs at conditional statements, use a constraint solver to generate new test inputs
- Repeat until a specific program statement is reached [Korel90, ...]
- Or repeat to try to cover ALL feasible program paths: DART = Directed Automated Random Testing = systematic dynamic test generation [PLDI 05, ...]
  - detect crashes, assertion violations, use runtime checkers (Purify, ...)

DART = Directed Automated Random Testing
- Example:

    int obscure(int x, int y) {
        if (x == hash(y)) error();
        return 0;
    }

- Run 1:
  - start with (random) x=33, y=42
  - execute concretely and symbolically: the concrete test is if (33 != 567), the symbolic test is if (x != hash(y))
  - constraint too complex, so simplify it: x != 567
  - solve: x==567, solution: x=567
  - new test input: x=567, y=42
- Run 2: the other branch is executed. All program paths are now covered!
- Observations:
  - Dynamic test generation extends static test generation with additional runtime information: it is more powerful; see [DART in PLDI 05], [PLDI 11]
  - The number of program paths can be infinite: may not terminate!
  - Still, DART works well for small programs (1,000s LOC)
  - Significantly improves code coverage vs. random testing
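The two runs above can be sketched in C. This is only an illustration under stated assumptions: `toy_hash` is a hypothetical stand-in for the slide's opaque `hash()` (chosen so that `toy_hash(42) == 567`), and the "solver" is trivial because the simplified constraint is a single equality on x.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the opaque hash(); toy_hash(42) == 567. */
static int toy_hash(int y) {
    return (int)(((long long)y * y + 1129) % 1163);
}

/* Program under test; returns true when error() would be reached. */
bool obscure_hits_error(int x, int y) {
    return x == toy_hash(y);
}

typedef struct { int x, y; } Input;

/* One DART step for this program: the previous run took the
 * "x != hash(y)" branch. Use the concrete value of hash(y) observed
 * at run time to simplify the negated constraint to "x == c". */
Input dart_next_input(Input prev) {
    assert(!obscure_hits_error(prev.x, prev.y)); /* precondition */
    int observed = toy_hash(prev.y);   /* concrete value seen in the run */
    Input next = { observed, prev.y }; /* solve x == observed, keep y   */
    return next;
}
```

Starting from x=33, y=42, a single call to `dart_next_input` yields x=567, y=42, which drives the second run into the error branch.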

DART Implementations
- Defined by symbolic execution, constraint generation and solving
  - Languages: C, Java, x86, .NET, ...
  - Theories: linear arith., bit-vectors, arrays, uninterpreted functions, ...
  - Solvers: lp_solve, CVCLite, STP, Disolver, Z3, ...
- Examples of tools/systems implementing DART:
  - EXE/EGT (Stanford): independent [05-06] closely related work
  - CUTE = same as first DART implementation done at Bell Labs
  - SAGE (CSE/MSR) for x86 binaries; merges it with "fuzz" testing for finding security bugs (more later)
  - PEX (MSR) for .NET binaries, in conjunction with "parameterized unit tests" for unit testing of .NET programs
  - YOGI (MSR) for checking the feasibility of program paths generated statically using a SLAM-like tool
  - Vigilante (MSR) for generating worm filters
  - BitScope (CMU/Berkeley) for malware analysis
  - CatchConv (Berkeley), focus on integer overflows
  - Splat (UCLA), focus on fast detection of buffer overflows
  - Apollo (MIT/IBM) for testing web applications
  - and more!

Whitebox Fuzzing [NDSS 08]
- Whitebox Fuzzing = "DART meets Fuzz"
- Apply DART to large applications (not unit)
- Start with a well-formed input (not random)
- Combine with a generational search (not DFS)
  - Negate 1-by-1 each constraint in a path constraint
  - Generate many children for each parent run
  - Challenge all the layers of the application sooner
  - Leverage expensive symbolic execution
- Search spaces are huge and the search is partial, yet effective at finding bugs!
- [Figure: a parent run and its Gen 1 children]

Example

    void top(char input[4]) {
        int cnt = 0;
        if (input[0] == 'b') cnt++;
        if (input[1] == 'a') cnt++;
        if (input[2] == 'd') cnt++;
        if (input[3] == '!') cnt++;
        if (cnt >= 4) crash();
    }

- input = "good"
- Negate each constraint in the path constraint, solve the new constraint with the SMT solver (SAT), and get a new input:

    Path constraint   Negated      New input (Gen 1)
    I0 != 'b'         I0 = 'b'     bood
    I1 != 'a'         I1 = 'a'     gaod
    I2 != 'd'         I2 = 'd'     godd
    I3 != '!'         I3 = '!'     goo!
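One generation of this search can be sketched as follows. It is a simplified illustration, not SAGE's implementation: the constant `TARGET` and the function name are assumptions, and the SMT solver is replaced by direct assignment because each negated constraint is an equality on a single input byte.

```c
#include <assert.h>
#include <string.h>

#define INPUT_LEN 4

/* The constants the program compares each input byte against. */
static const char TARGET[INPUT_LEN + 1] = "bad!";

/* Expand one parent run: for every branch i where the parent took the
 * "input[i] != TARGET[i]" side, negate only that constraint and "solve"
 * it by setting byte i to the compared constant.
 * Returns the number of children written. */
int expand_generation(const char *parent, char children[][INPUT_LEN + 1]) {
    assert(strlen(parent) == INPUT_LEN);
    int n = 0;
    for (int i = 0; i < INPUT_LEN; i++) {
        if (parent[i] != TARGET[i]) {
            memcpy(children[n], parent, INPUT_LEN + 1);
            children[n][i] = TARGET[i];
            n++;
        }
    }
    return n;
}
```

Called on "good", this produces the four Gen 1 children of the slide, in order: bood, gaod, godd, goo!.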

The Search Space

    void top(char input[4]) {
        int cnt = 0;
        if (input[0] == 'b') cnt++;
        if (input[1] == 'a') cnt++;
        if (input[2] == 'd') cnt++;
        if (input[3] == '!') cnt++;
        if (cnt >= 4) crash();
    }

- If symbolic execution is perfect and the search space is small, this is verification!
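For this tiny program the whole search space (2^4 = 16 paths) can be explored. The sketch below assumes a simple worklist and a per-task `bound` that stops a child from re-negating constraints its ancestors already handled; names such as `Task` and `generational_search` are illustrative, and flipping a byte that already matches back to a non-matching value is omitted.

```c
#include <assert.h>
#include <string.h>

#define LEN 4
#define MAX_RUNS 64

static const char KEY[LEN + 1] = "bad!";

typedef struct { char input[LEN + 1]; int bound; } Task;

/* Program under test: returns 1 if crash() would be called. */
int top_crashes(const char *input) {
    int cnt = 0;
    for (int i = 0; i < LEN; i++)
        if (input[i] == KEY[i]) cnt++;
    return cnt >= 4;
}

/* Generational search from a seed. Copies a crashing input (if found)
 * into crash_out and returns the number of program executions. */
int generational_search(const char *seed, char *crash_out) {
    Task queue[MAX_RUNS];
    int head = 0, tail = 0, runs = 0;
    assert(strlen(seed) == LEN);
    memcpy(queue[tail].input, seed, LEN + 1);
    queue[tail++].bound = 0;
    crash_out[0] = '\0';

    while (head < tail) {
        Task t = queue[head++];
        runs++;                                   /* one execution of top() */
        if (top_crashes(t.input))
            memcpy(crash_out, t.input, LEN + 1);
        /* Negate only constraints at positions >= bound, so that no
         * path is generated twice. */
        for (int i = t.bound; i < LEN && tail < MAX_RUNS; i++) {
            if (t.input[i] == KEY[i]) continue;   /* already on the == side */
            Task child = t;
            child.input[i] = KEY[i];
            child.bound = i + 1;
            queue[tail++] = child;
        }
    }
    return runs;
}
```

From the seed "good" the search executes the program 16 times, once per path, and reaches the crashing input "bad!" in the fourth generation.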

SAGE (Scalable Automated Guided Execution)
- Generational search introduced in SAGE
- Performs symbolic execution of x86 execution traces
  - Builds on Nirvana, iDNA and TruScan for x86 analysis
  - Don't care about language or build process
  - Easy to test new applications, no interference possible
- Can analyse any file-reading Windows application
- Several optimizations to handle huge execution traces
  - Constraint caching and common subexpression elimination
  - Unrelated constraint optimization
  - Constraint subsumption for constraints from input-bound loops
  - "Flip-count" limit (to prevent endless loop expansions)

SAGE Architecture
- [Diagram: Input0 -> Check for Crashes (AppVerifier) -> Code Coverage (Nirvana) -> Coverage Data -> Generate Constraints (TruScan) -> Constraints -> Solve Constraints (Z3) -> Input1, Input2, ..., InputN]
- SAGE was mostly developed by CSE (2006-2008)
- MSR algorithms & code inside (2006-2012)

Some Experiments
- Most much (100x) bigger than ever tried before!
- Seven applications, 10 hours search each

    App Tested               #Tests   Mean Depth   Mean #Instr.   Mean Input Size
    ANI                       11468          178      2,066,087             5,400
    Media1                     6890           73      3,409,376            65,536
    Media2                     1045         1100    271,432,489            27,335
    Media3                     2266          608     54,644,652            30,833
    Media4                      909          883    133,685,240            22,209
    Compressed File Format     1527           65        480,435               634
    OfficeApp                  3008         6502    923,731,248            45,064

Generational Search Leverages Symbolic Execution
- Each symbolic execution is expensive
  [Chart: SymbolicExecutor (25m30s) vs. TestTask, axis 0-30, label "X1,000"]
- Yet, symbolic execution does not dominate search time
  [Chart: 10-hour search split between SymbolicExecutor and Testing/Tracing/Coverage]

SAGE Results
- Since the 1st release in April 07: many new security bugs found (missed by blackbox fuzzers, static analysis)
  - Apps: image processors, media players, file decoders, ...
  - Bugs: Write A/Vs, Read A/Vs, crashes, ...
  - Many triaged as "security critical, severity 1, priority 1" (would trigger a Microsoft security bulletin if known outside MS)
- Example: WEX Security team for Win7
  - Dedicated fuzzing lab with 100s of machines
  - 100s of apps (deployed on 1 billion+ computers)
  - ~1/3 of all fuzzing bugs found by SAGE!
- SAGE = gold medal at Fuzzing Olympics organized by SWI at BlueHat 08 (Oct 08)
- Credit due to entire SAGE team + users!

WEX Fuzzing Lab Bug Yield for Win7
- How fuzzing bugs were found (2006-2009): 100s of apps, total number of fuzzing bugs is confidential
  [Chart segments: Default Blackbox Fuzzer + Regression, All Others, SAGE]
- "SAGE is running 24/7 on 100s of machines: the largest usage ever of any SMT solver" (N. Bjorner + L. de Moura, MSR, Z3 authors)
- But SAGE didn't exist in 2006
  - Since 2007 (SAGE 1st release), ~1/3 of bugs found by SAGE
  - But SAGE was then deployed on only ~2/3 of those apps
  - Normalizing the data by 2/3, SAGE found ~1/2 of the bugs
- SAGE was run last in the lab, so all SAGE bugs were missed by everything else!

SAGE Summary
- SAGE is so effective at finding bugs that, for the first time, we face "bug triage" issues with dynamic test generation
- What makes it so effective?
  - Works on large applications (not unit test, like DART, EXE, etc.)
  - Can detect bugs due to problems across components
  - Fully automated (focus on file fuzzing)
  - Easy to deploy (x86 analysis: any language or build process!)
  - 1st tool for whole-program dynamic symbolic execution at x86 level
- Now used daily in various groups at Microsoft

Part 2: What kind of SMT constraints?
(Tests from Satisfiability Models)

Types of Constraints
- Roughly speaking:
  - Long conjunctions!
  - Bit-vector constraints
  - Bounded arrays (for pointer reasoning)
  - No unbounded quantifiers
  - See the (anonymized) SAGE benchmarks that are part of SMT-COMP (for more, contact Nikolaj Bjorner or Leonardo de Moura)
- Sometimes:
  - Axiomatization for type systems (e.g., PEX)
  - String constraints, etc. (e.g., HAMPI)
  - Uninterpreted functions (for summaries)

The Art of Constraint Generation
More on the research behind SAGE:
- How to recover from imprecision in symbolic exec.? (PLDI 05, PLDI 11)
  - Must under-approximations
- How to scale symbolic exec. to billions of instructions? (NDSS 08)
  - Techniques to deal with large path constraints
- How to check efficiently many properties together? (EMSOFT 08)
  - Active property checking
- How to leverage grammars for complex input formats? (PLDI 08)
  - Lift input constraints to the level of symbolic terminals in an input grammar
- How to deal with path explosion? (POPL 07, TACAS 08, POPL 10, SAS 11)
  - Symbolic test summaries (more later)
- How to reason precisely about pointers? (ISSTA 09)
  - New memory models leveraging concrete memory addresses and regions
- How to deal with floating-point instructions? (ISSTA 10)
  - Prove "non-interference" with memory accesses
- How to deal with input-dependent loops? (ISSTA 11)
  - Automatic dynamic loop-invariant generation and summarization

SAGAN: Fuzzing in the (Virtual) Cloud
- Since June 2010, a new centralized server collects stats from all SAGE runs!
  - 200+ machine-years of SAGE data (since June 2010)
- Track results (bugs, concrete & symbolic test coverage) and incompleteness (unhandled tainted x86 instructions, Z3 timeouts, divergences, etc.)
- Help troubleshooting (SAGE has 100+ options)
- Tell us what works and what does not

Picking and Choosing
- A typical SAGE run can fill up 300 GB in 1 week
- Problem: 100s of machines * 300+ GB = lots of data
- Solution: pick and choose what to ship up
  - Configuration files
  - Counters from each SAGE execution
  - Run "heartbeats" at random intervals
  - Crashing test information
- Key principles
  - Enough information to repro SAGE run results
  - Support key analyses for improving SAGE

Are we hitting known problems in symbolic execution?

Incompleteness
- 100s of instructions in x86
  - Some of them we handle perfectly
  - Some we do not handle perfectly, and we know it
- Which instructions should we prioritize? SAGAN tells us!

How to Automate These Steps?
- Automated Synthesis of Symbolic Instruction Encodings from I/O Samples (to appear at [PLDI 2012]):
  - 6 abstract instruction templates; building blocks are bit-vector constraints (SMT-LIB format)
  - for 534 x86 ALU instructions (8/16/32 bits, outputs, EFLAGS)
  - new "smart sampling" synthesis algorithm takes <2 hours with Z3
  - synthesis against a specific x86 processor as I/O oracle

How long do we spend in constraint generation and constraint solving?

Most solver queries are fast...

...but solving time often dominates

We can cut off outlying tasks! [Chart axis: solver time]

Most Constraints are UNSAT [Chart legend as extracted: UNSAT / timeout SAT]

Why are these constraints (usually) fast to solve?
- 90.18% of Z3 queries solved in 0.1 seconds or less!

Key Optimizations
- Sound
  - Common subexpression elimination on every new constraint (crucial for memory usage)
  - Related Constraint Optimization
- Unsound
  - Constraint subsumption: syntactic check for implication, take the strongest constraint
  - Drop constraints at same instruction pointer after threshold
- This is the art of constraint generation! So that the constraint solver is not the bottleneck
- Predictability is key to good business! Nobody wants an NP-hard problem in the way!
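Common subexpression elimination on every new constraint can be approximated by interning expression nodes, so that structurally identical subterms share one id. The sketch below is an assumption-laden illustration, not SAGE's data structure: the node layout and `intern` are hypothetical, and a linear scan stands in for the hash table a real tool would use.

```c
#include <assert.h>

typedef enum { OP_INPUT, OP_CONST, OP_ADD, OP_EQ } Op;

/* a and b are child node ids for OP_ADD/OP_EQ; for OP_INPUT, a is the
 * input byte index; for OP_CONST, a is the constant value. */
typedef struct { Op op; int a, b; } Node;

#define MAX_NODES 256
static Node nodes[MAX_NODES];
static int node_count = 0;

/* Return the id of an existing structurally identical node, or create
 * a new one. Linear scan for brevity; a hash table would be used at scale. */
int intern(Op op, int a, int b) {
    for (int i = 0; i < node_count; i++)
        if (nodes[i].op == op && nodes[i].a == a && nodes[i].b == b)
            return i;
    assert(node_count < MAX_NODES);
    nodes[node_count] = (Node){ op, a, b };
    return node_count++;
}
```

Building the term `input[0] + 5` twice returns the same id both times and allocates only three nodes in total, which is the memory saving the slide refers to.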

What Next? Towards Verification
- When can we safely stop testing? When we know that there are no more bugs! = "Verification"
  - "Testing can only prove the existence of bugs, not their absence." [Dijkstra]
  - Unless it is exhaustive! This is the "model checking thesis"
  - "Model Checking" = exhaustive testing (state-space exploration)
- Two main approaches to software model checking:
  - Modeling languages -> state-space exploration -> model checking
    - reached from programs by abstraction (SLAM, Bandera, FeaVer, BLAST, ...)
  - Programming languages -> state-space exploration -> systematic testing
    - reached from model checking by adaptation
    - Concurrency: VeriSoft, JPF, CMC, Bogor, CHESS, ...
    - Data inputs: DART, EXE, SAGE, ...

Exhaustive Testing?
- Model checking is always "up to some bound"
  - Limited (often finite) input domain, for specific properties, under some environment assumptions
  - Ex: exhaustive testing of the Win7 JPEG parser up to 1,000 input bytes
    - 8000 bits -> 2^8000 possibilities -> if 1 test per sec, 2^8000 secs
    - FYI, 15 billion years = 473040000000000000 secs, which is about 2^59 secs (less than 2^60)!
    - MUST be "symbolic"! How far can we go?
- Practical goals: (easier?)
  - Eradicate all remaining buffer overflows in all Windows parsers
  - Reduce costs & risks for Microsoft: when to stop fuzzing?
  - Increase costs & risks for Black Hats!
    - Many have probably moved to greener pastures already (Ex: Adobe)
    - Ex: <5 security bulletins in all the SAGE-cleaned Win7 parsers
  - If no one can find bugs in P, P is observationally equivalent to "verified"!
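As a rough check of the arithmetic on this slide (assuming 365-day years):

```latex
\[
15\times 10^{9}\ \text{years} \times 31{,}536{,}000\ \tfrac{\text{s}}{\text{year}}
  = 4.7304\times 10^{17}\ \text{s} \approx 2^{58.7}\ \text{s} < 2^{60}\ \text{s},
\qquad
\frac{2^{8000}\ \text{s}}{2^{60}\ \text{s}} = 2^{7940}.
\]
```

So even at one test per second, enumerating all 1,000-byte inputs would take more than 2^7940 times that time span, which is why the exploration has to be symbolic.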

How to Get There?
1. Identify and patch holes in symbolic execution + constraint solving
2. Tackle "path explosion" with compositional testing and symbolic test summaries [POPL 07, TACAS 08, POPL 10]

The Art of Constraint Generation
- Static analysis: abstract away "irrelevant" details
  - Good for focused search, can be combined with DART (Ex: [POPL 10])
  - But for bit-precise analysis of low-level code (function pointers, in-lined assembly, ...)? In a non-property-guided setting? Open problem
- Bit-precise VC-gen: statically generate 1 formula from a program
  - Good to prove complex properties of small programs (units)
  - Does not scale (huge formula encodings), asks too much of the user
- SAT/SMT-based "Bounded Model Checking": stripped-down VC-gen
  - Emphasis on automation
  - Unrolling all loops is naïve, does not scale
- "DART": the only option today for large programs (Ex: Excel)
  - Path-by-path exploration is slow, but whitebox fuzzing can scale it to large executions + zero false alarms!
  - But suffers from "path explosion"

DART is Beautiful
- Generates formulas where the only "free" symbolic variables are whole-program inputs
  - When generating tests, one can only control inputs!
- Strength: scalability to large programs
  - Only tracks "direct" input dependencies (i.e., tests on inputs); the rest of the execution is handled with the best constant-propagation engine ever: running the code on the computer!
  - (The size of) path constraints only depends on (the number of) program tests on inputs, not on the size of the program = the right metric: complexity only depends on nondeterminism!
- Price to pay: "path explosion" [POPL 07]
  - Solution = symbolic test summaries

Example

    void top(char input[4]) {
        int cnt = 0;
        if (input[0] == 'b') cnt++;
        if (input[1] == 'a') cnt++;
        if (input[2] == 'd') cnt++;
        if (input[3] == '!') cnt++;
        if (cnt >= 3) crash();
    }

- input = "good"
- Path constraint: I0 != 'b', I1 != 'a', I2 != 'd', I3 != '!'

Compositionality = Key to Scalability
- Idea: compositional dynamic test generation [POPL 07]
  - use summaries of individual functions (or program blocks, etc.), like in interprocedural static analysis, but here with "must" formulas generated dynamically
  - If f calls g, test g, summarize the results, and use g's summary when testing f
  - A summary $\phi(g)$ is a disjunction of path constraints expressed in terms of g's input preconditions and g's output postconditions: $\phi(g) = \bigvee_{w} \phi(w)$ with $\phi(w) = pre(w) \wedge post(w)$
  - g's outputs are treated as fresh symbolic inputs to f, all bound to prior inputs, and can be "eliminated" (for test generation)
- Can provide the same path coverage exponentially faster!
- See details and refinements in [POPL 07, TACAS 08, POPL 10]
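A function summary as a disjunction of pre/post pairs can be sketched as a small table. Everything here is hypothetical: the callee `g`, the interval-shaped preconditions, and `solve_with_summary` are illustrative assumptions, and the summary is written by hand where a real tool would collect the disjuncts while testing g.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* One disjunct pre(w) /\ post(w): "lo <= x <= hi" implies g returns ret. */
typedef struct { int lo, hi; int ret; } PathSummary;

/* Hypothetical callee with two paths. */
int g(int x) { return x > 0 ? 1 : 0; }

/* Hand-written summary of g, one entry per path. */
static const PathSummary G_SUMMARY[] = {
    { 1, INT_MAX, 1 },   /* path w1: x > 0  and ret == 1 */
    { INT_MIN, 0, 0 },   /* path w2: x <= 0 and ret == 0 */
};

/* When testing a caller f that needs g(x) == want, pick x from the
 * summary instead of re-exploring g's paths. */
bool solve_with_summary(int want, int *x_out) {
    assert(x_out != 0);
    int n = (int)(sizeof G_SUMMARY / sizeof G_SUMMARY[0]);
    for (int i = 0; i < n; i++) {
        if (G_SUMMARY[i].ret == want) {
            *x_out = G_SUMMARY[i].lo;   /* any value in [lo, hi] works */
            return true;
        }
    }
    return false;                        /* no summarized path returns want */
}
```

The design point is that the caller's search consults the table once per call site, so the number of paths of g does not multiply with the number of paths of f.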

SAGAN tracks branches flipped
- [Chart: flip frequency per branch; x-axis ticks 1-100, y-axis 0-20000; annotation: "Summarize this!"]
- Sampled runs on Windows, many different file-reading applications
- Max frequency 17761, min frequency 592
- Total of 290430 branches flipped, 3360 distinct branches

Summaries Cure Search Redundancy
- Across different program paths [Figure: IF / THEN / ELSE]
- Across different program versions: "Incremental Compositional Dynamic Test Generation" [SAS 11]
- Across different applications
- Summaries avoid unnecessary work
- What if central server of summaries for all code?...

The Engineering of Test Summaries
- Systematically summarizing everywhere is foolish
  - Very expensive and not necessary (costs outweigh benefits)
  - Not scalable without user help (see work on VC-gen and BMC)
- Summarization on-demand: (100% algorithmic)
  - When? At search bottlenecks (with dynamic feedback loop)
  - Where? At simple interfaces (with simple data types)
  - How? With limited side-effects (to be manageable and "sound")
- Goal: use summaries intelligently
  - THE KEY to scalable bit-precise whole-program analysis?
  - Necessary, but sufficient? In what form(s)? Computed statically? [POPL 10, ISSTA 10]
  - Stay tuned

Part 3: Tests from Validity Proofs
(Higher-Order Test Generation) [PLDI 2011]

Why Dynamic Test Gen.? Most Precise!
- Example:

    int obscure(int x, int y) {
        if (x == hash(y)) error();
        return 0;
    }

- Run 1:
  - start with (random) x=33, y=42
  - execute concretely and symbolically: the concrete test is if (33 != 567), the symbolic test is if (x != hash(y))
  - constraint too complex, so simplify it: x != 567
  - solve: x==567, solution: x=567
  - new test input: x=567, y=42
- Run 2: the other branch is executed. All program paths are now covered!
- Observations:
  - "Unknown/complex symbolic expressions can be simplified using concrete runtime values" [DART, PLDI 05]
  - Let's call this step "concretization" (ex: hash(y) -> 567)
  - Dynamic test generation extends static test generation with additional runtime information: it is more powerful
  - How often? When exactly? Why? -> this work!

Unsound and Sound Concretization
- Concretization is not always sound

    int foo(int x, int y) {
        if (x == hash(y)) {
            if (y == 10) error();
        }
        return 0;
    }

  - Run: x=567, y=42
  - pc: x==567 and y!=10
  - New pc: x==567 and y==10
  - New inputs: x=567, y=10
  - Divergence! pc and new pc are unsound!
- Definition: a path constraint pc for a path w is sound if every input satisfying pc defines an execution following w
- Sound concretization: add concretization constraints
  - pc: y==42 and x==567 and y!=10 (sound)
  - New pc: y==42 and x==567 and y==10 (sound, but unsatisfiable)
- Theorem: the path constraint is now always sound. Is this better? No
  - Forces us to detect all sources of imprecision (expensive/impossible)
  - Can prevent test generation and "good" divergences
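The divergence can be reproduced with a small instrumented version of `foo`. This is a sketch under assumptions: `toy_hash` is a hypothetical hash chosen so that `toy_hash(42) == 567` and `toy_hash(10) == 66`, and `Path` and `negate_last_unsound` are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical hash: toy_hash(42) == 567, toy_hash(10) == 66. */
static int toy_hash(int y) {
    return (int)(((long long)y * y + 1129) % 1163);
}

typedef enum { PATH_SKIP, PATH_INNER_NO_ERROR, PATH_ERROR } Path;

/* foo() instrumented to report which path it followed. */
Path foo_path(int x, int y) {
    if (x == toy_hash(y)) {
        if (y == 10) return PATH_ERROR;
        return PATH_INNER_NO_ERROR;
    }
    return PATH_SKIP;
}

typedef struct { int x, y; } Input;

/* Unsound concretization: the recorded pc is "x == c and y != 10",
 * where c is the hash value observed in this run. Negating the last
 * constraint keeps x == c and sets y = 10, ignoring that c depended on y. */
Input negate_last_unsound(Input run) {
    assert(foo_path(run.x, run.y) == PATH_INNER_NO_ERROR);
    Input next = { toy_hash(run.y), 10 };
    return next;
}
```

With the run x=567, y=42, the generated input is x=567, y=10, and executing it follows `PATH_SKIP` instead of reaching `error()`: the predicted path and the actual path diverge.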

Idea: Using Uninterpreted Functions
- Modeling imprecision with uninterpreted functions

    int obscure(int x, int y) {
        if (x == hash(y)) error();
        return 0;
    }

  - Run: x=33, y=42
  - pc: x != h(y)
  - New pc: x == h(y)
- How to generate tests?
  - Is $(\exists x,y,h:)\ x = h(y)$ SAT? Yes, but so what? (ex: x=y=0, h(0)=0)
  - Need universal quantification! $(\forall h:)\ \exists x,y:\ x = h(y)$: is this first-order logic formula valid? Yes.
  - Solution (strategy): "fix y, set x to the value of h(y)"
- Test generation from validity proofs! (not SAT models)
- Necessary but not sufficient: what "value of h(y)"?

Need for Uninterpreted Function Samples
- Record I/O UF samples

    int obscure(int x, int y) {
        if (x == hash(y)) error();
        return 0;
    }

  - Run: x=33, y=42
  - Record: 567 == h(42)
  - pc: x != h(y)
- Use UF samples to interpret a validity proof/strategy
  - "fix y, set x to the value of h(y)" -> set y=42, x=567
  - Or new pc: is $(\forall h:)\ \exists x,y:\ (567 = h(42)) \Rightarrow (x = h(y))$ valid?
- Higher-order test generation =
  - models imprecision using uninterpreted functions
  - records UF samples as concrete input/output value pairs
  - generates tests from validity proofs of FOL formulas
- Key: a higher-order logic representation of path constraints

Higher-Order Test Generation is Powerful
- Theorem: HOTG is as powerful as sound concretization
  - Can simulate it (both UFs and UF samples are needed for this)
- Higher-Order Test Generation is more powerful
  - Ex 1: $(\forall h:)\ \exists x,y:\ h(x) = h(y)$ is valid (solution: set x=y)
  - Ex 2: $(\forall h:)\ \exists x,y:\ h(x) = h(y) + 1$ is invalid
    - But $(\forall h:)\ \exists x,y:\ (h(0) = 0 \wedge h(1) = 1) \Rightarrow h(x) = h(y) + 1$ is valid (solution: set x=1, y=0)
  - Ex 3:

    int foo(int x, int y) {
        if (x == hash(y)) {
            if (y == 10) error();
        }
        return 0;
    }

    - Run: x=567, y=42
    - pc: x==h(y) and y!=10
    - New pc: $(\forall h:)\ \exists x,y:\ (h(42) = 567) \Rightarrow (x = h(y) \wedge y = 10)$ is valid
    - Solution: set y=10, set x=h(10)
    - 2-step test generation:
      - run 1 with y=10, x=567 to learn h(10)=66
      - run 2 with y=10, x=66!
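The 2-step test generation of Ex 3 can be sketched as follows. The code is an illustration only: `toy_hash` is a hypothetical hash picked so that h(42)=567 and h(10)=66 as on the slide, and the strategy "set y=10, set x=h(10)" is hard-coded rather than extracted from a validity proof.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical hash: toy_hash(42) == 567, toy_hash(10) == 66. */
static int toy_hash(int y) {
    return (int)(((long long)y * y + 1129) % 1163);
}

/* A recorded uninterpreted-function sample: val == h(arg). */
typedef struct { int arg, val; bool valid; } UFSample;
static UFSample last_sample;

/* foo() instrumented to record the I/O pair of its hash call.
 * Returns true when error() would be reached. */
bool foo_traced(int x, int y) {
    int h = toy_hash(y);
    last_sample = (UFSample){ y, h, true };
    return x == h && y == 10;
}

/* Strategy "set y=10, set x=h(10)" executed in two runs. */
bool hotg_two_step(int seed_x, int *x_out, int *y_out) {
    const int y = 10;                  /* fixed by the constraint y == 10 */
    (void)foo_traced(seed_x, y);       /* run 1: only to learn h(10)      */
    assert(last_sample.valid && last_sample.arg == y);
    *x_out = last_sample.val;          /* interpret the strategy: x = h(10) */
    *y_out = y;
    return foo_traced(*x_out, *y_out); /* run 2: should reach error()     */
}
```

Run 1 uses the stale x=567 and misses the error, but it yields the sample h(10)=66; run 2 with x=66, y=10 then reaches `error()`.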

Implementability Issues
- Tracking all sources of imprecision is problematic
  - Excel on a 45K-byte input executes 1 billion x86 instructions
- Imprecision cannot always be represented by UFs
  - Unknown input/output signatures, nondeterminism, ...
- Capturing all input/output pairs can be very expensive
- Limited support from current SMT solvers
  - $\forall X:\ \Phi(F,X)$ is valid iff $\exists X:\ \neg\Phi(F,X)$ is UNSAT
  - little support for generating + parsing UNSAT "saturation proofs"
- In practice, HOTG can be used for targeted reasoning about specific user-defined complex/unknown functions

Application: Lexers with Hash Functions
- Parsers with input lexers using hash functions for fast keyword recognition
  - Initially, for all language keywords: addsym(keyword, hashtable)
  - When parsing the input:

    x = findsym(inputChunk, hashtable); // is inputChunk in hashtable?
    if (x == 52) // how to get here?

- With higher-order test generation:
  - Represent hashfunct by one UF h
  - Capture all pairs (hashvalue, h(keyword))
  - If h(inputChunk)==52 and (52, h("while")) -> inputChunk="while"
  - This effectively inverts hashfunct, only for all keywords
- Sufficient to drive executions through the lexer!
- See [PLDI 2011] for details
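Inverting the lexer's hash function over recorded keyword samples can be sketched like this. All names are assumptions for illustration: `hashfunct` here is a generic string hash, not the one in any real lexer, and `record_sample` stands for the instrumentation that would observe each `addsym` call.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 64
#define MAX_SAMPLES 16

/* Hypothetical keyword hash standing in for the lexer's hashfunct. */
unsigned hashfunct(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31u + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* A UF sample: hashvalue == h(keyword). */
typedef struct { unsigned hashvalue; const char *keyword; } Sample;
static Sample samples[MAX_SAMPLES];
static int sample_count = 0;

/* Called once per keyword while the lexer initializes its table. */
void record_sample(const char *keyword) {
    assert(sample_count < MAX_SAMPLES);
    samples[sample_count].hashvalue = hashfunct(keyword);
    samples[sample_count].keyword = keyword;
    sample_count++;
}

/* Solve h(inputChunk) == target using only the recorded samples.
 * Returns NULL when no recorded keyword has that hash value. */
const char *invert_hash(unsigned target) {
    for (int i = 0; i < sample_count; i++)
        if (samples[i].hashvalue == target)
            return samples[i].keyword;
    return NULL;
}
```

After recording the keywords, asking for an input chunk whose hash equals that of "while" returns a recorded keyword with the same hash value, which is enough to steer execution past the lexer's lookup.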

Other Related Work
- Modeling imprecision with UFs is well-known in program verification of universal properties
  - "may" abstractions, universal quantifiers only, validity checks
- Here, the novelty is for existential properties
  - Test generation is only one way to verify existential properties of programs
  - More generally, one can build "must" abstractions
  - Alternation is also used then
- "Test generation as a game" is not new
  - in model-based testing, testing for reactive systems, etc.
  - but from validity proofs of FOL formulas with UFs is new

Summary (see [PLDI 2011])
- Higher-order test generation = UFs + UF samples + test generation from validity checks of FOL formulas
- A new powerful form of test generation
  - Tracking all sources of incompleteness is unrealistic; targeted use of UFs is more practical (ex: lexer with h)
- A formal tool to define the limits of test generation
  - Explains in what sense dynamic test generation is more powerful than static test generation: only in its ability to record concrete values in path constraints
  - concrete values are ultimately needed for test generation

General Summary
1. Why Test Generation?
2. What kind of SMT constraints?
3. New: test generation from validity proofs

Conclusion: Impact of SAGE (In Numbers)
- 400+ machine-years
  - Runs in the largest dedicated fuzzing lab in the world
- 3.4 billion+ constraints
  - Largest computational usage ever for any SMT solver
- 100s of apps, 100s of bugs (missed by everything else)
- Bug fixes shipped quietly (no MSRCs) to 1 billion+ PCs
- Millions of dollars saved for Microsoft + time/energy savings for the world
- DART and whitebox fuzzing now adopted by (many) others (10s of tools, 100s of citations)

Conclusion: Blackbox vs. Whitebox Fuzzing
- Different cost/precision tradeoffs
  - Blackbox is lightweight, easy and fast, but poor coverage
  - Whitebox is smarter, but complex and slower
  - Note: other recent "semi-whitebox" approaches
    - Less smart (no symbolic exec, constr. solving) but more lightweight: Bunny-the-fuzzer (taint-flow, source-based, fuzz heuristics from input usage), Flayer (fault injection, not necessarily feasible), etc.
- Which is more effective at finding bugs? It depends
  - Many apps are so buggy that any form of fuzzing finds bugs in those!
  - Once low-hanging bugs are gone, fuzzing must become smarter: use whitebox and/or user-provided guidance (grammars, etc.)
- Bottom line: in practice, use both! (We do at Microsoft)

What Next? Towards "Verification"
- Tracking all(?) sources of incompleteness
  - Possibly using UFs and Higher-Order Test Generation
- Summaries (on-demand) against path explosion
- How far can we go?
  - Reduce costs & risks for Microsoft: when to stop fuzzing?
  - Increase costs & risks for Black Hats (goal already achieved?)
- For history books!?
  [Timeline 2000, 2005, 2010, 2015: Blackbox Fuzzing -> Whitebox Fuzzing -> Verification]

Acknowledgments
- SAGE is joint work with:
  - MSR: Ella Bounimova, David Molnar, ...
  - CSE: Michael Levin, Chris Marsh, Lei Fang, Stuart de Jong, ...
  - Interns: Dennis Jeffries (06), David Molnar (07), Adam Kiezun (07), Bassem Elkarablieh (08), Marius Nita (08), Cindy Rubio-Gonzalez (08, 09), Johannes Kinder (09), Daniel Luchaup (10), Nathan Rittenhouse (10), Mehdi Bouaziz (11), Ankur Taly (11)
- Thanks to the entire SAGE team and users!
  - Z3 (MSR): Nikolaj Bjorner, Leonardo de Moura, ...
  - Windows: Nick Bartmon, Eric Douglas, Dustin Duran, Elmar Langholz, Isaac Sheldon, Dave Weston, ...
  - TruScan support: Evan Tice, David Grant, ...
  - Office: Tom Gallagher, Eric Jarvi, Octavian Timofte, ...
  - SAGE users all across Microsoft!
- References: see http://research.microsoft.com/users/pg