Program Analysis: Week 3 - 23rd to 27th January

Ganesh K. (CS16M006), Honey Goyal (CS16M018)
February 5, 2017

Contents

1 Data Flow Analysis (Contd.)
1.1 Concrete Vs Abstract Interpretation
1.2 Storing Abstractions
1.3 Conservative Analysis
1.4 Safety vs Liveness
1.5 Analysis - Soundness and Precision
1.6 Optimization Soundness
2 Pointer Analysis
2.1 Why Pointer Analysis?
2.2 Definitions
2.3 Types of pointer manipulating statements
2.4 Algebraic Properties
2.5 Cyclic Dependence
2.6 Modelling Pointer analysis as a Data Flow Analysis Problem

1 Data Flow Analysis (Contd.)

1.1 Concrete Vs Abstract Interpretation:

The interpretation of a program is a construct-by-construct execution of the program.

Concrete Interpretation: interprets the program with the actual values it takes at runtime. It is exact, and therefore more precise than anything that can be computed statically.

Abstract Interpretation: is performed at compile time and serves as an approximation of the concrete interpretation.

Example of concrete interpretation (concrete execution trace):

    // x is undefined
    x = 0;   // x is zero
    ++x;     // x is one
    --x;     // x is zero

The above is a straightforward program with a branchless control flow from start to end. We now approximate the values x takes in the program. The value of x is undefined until the instruction x=0; is reached. After that program point, the value of x lies in the set {0,1}. If we choose not to use a full integer variable (of multiple bits) to keep track of the value of x, and use a single bit instead, we can distinguish at most 2 states of the variable. We choose to track whether the value of x is zero or not.

Why such a tracking mechanism? The choice of which property of a variable to track depends on the requirement of a transformation (in this case, tracking x==0), so that an optimization can use it at a later stage.

Example of abstract interpretation:

Figure 1: Abstract Implementation : execution trace

The idea here is to look through a peephole (window) of one construct/instruction at a time. At any point, we have access only to the current instruction and the bit(s) used to track the variable(s). We set the bit B that tracks the variable x to True only when the value of x is known to be 0. Whenever the value of x is not 0, or there is uncertainty about its value, we set B to False.

Refer to Figure 1:
Initially: the value of x is undefined, hence B = False.
Line 1: x holds the value 0 as a result of a direct assignment, hence B = True.
Line 2: x is incremented and no longer holds the value 0, hence B = False.
Line 3: x is decremented. B was False before this instruction, and a single-bit tracker cannot tell us what the previous value was, so we cannot claim for certain that the value is now 0. We conservatively keep B = False.

Figure 2 shows scenarios where we use 2 bits to track the value of x. Two bits give a total of four possible states:
00 - value 0
01 - value 1
10 - value 2
11 - value 3
We choose the state 00 to additionally denote imprecision.

Figure 2: Abstract Implementation : Using 2 bit tracking variable

Figure 2(a): Initially the bits are assigned the value 00, denoting imprecision. Line 1 assigns the value 0 to the variable x, which also sets the bits to 00. However, the modifications in lines 2 and 3 fail to update the bits, because the state 00 also indicates that the value of x could be anything. Unless this loss in precision is recovered by a direct assignment to x in the program, it continues to propagate.

Figure 2(c): Similarly, if 11 had been the state chosen to model the uncertainty in the value of x, an assignment of the form x=3 would cause imprecision. Here the bits are set to 11 every time we are uncertain about the value of x. Note that 11 may also denote the state where the value of x is 3.
Line 1: assigns 1 to x, therefore the bits are set to 01.
Line 2: directly assigns x the value 3, therefore the bits are set to 11.
Line 3: decrements x. Since 11 also stands for an unknown value, the new value cannot be determined for certain, and the bits continue to hold 11.

Figure 2(b) shows another form of abstraction. We maintain two bits (B0 B1): B0 indicates whether x==0, and B1 indicates whether x<2. (NOTE: 00 models the imprecise state.) Additional information: x is a non-negative integer.
Initially: the bits are assigned 00.
Line 1: directly assigns the value 0 to x, hence the bits are assigned 11.
Line 2: increments x. Since the value of x is known to be 0 at this point, the new value is 1 and the bits are assigned 01.
Line 3: decrements x. The state 01 says x is non-zero and less than 2, so x must be 1, and its value after the operation is 0. We assign 11 to the bits.

NOTE: we could track the value of x accurately after the increment/decrement operations only because we knew x was a non-negative integer. In the absence of this type information, the state 01 would not pin x down to 1, and we would have conservatively assigned 00 to the bits.
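The one-bit tracker of Figure 1 and the two-predicate tracker of Figure 2(b) can be sketched as small abstract interpreters. This is only an illustrative sketch: the statement strings, the function names, and the choice to fall back to 00 when a decrement would make x negative are assumptions made here, not part of the notes.

```python
# Hypothetical mini-language: a program is a list of the strings
# "x = <int>", "++x" and "--x". This mirrors the three-line example above.
PROGRAM = ["x = 0", "++x", "--x"]


def one_bit_trace(program):
    """Figure 1 style: B is True only when x is known to be 0."""
    b = False                          # x is undefined initially
    trace = [b]
    for stmt in program:
        if stmt.startswith("x = "):
            b = int(stmt[4:]) == 0     # direct assignment: value is known
        else:
            # ++x or --x: with one bit we can never conclude x == 0
            # afterwards, so we conservatively answer False.
            b = False
        trace.append(b)
    return trace


def two_bit_trace(program):
    """Figure 2(b) style: bits B0 B1 with B0 = (x == 0), B1 = (x < 2).

    "00" is the imprecise state. Assumes x is a non-negative integer,
    so "01" (x != 0 and x < 2) means x == 1.
    """
    state = "00"
    trace = [state]
    for stmt in program:
        if stmt.startswith("x = "):
            value = int(stmt[4:])
            state = {0: "11", 1: "01"}.get(value, "00")
        elif stmt == "++x":
            # 0 -> 1 is still representable; 1 -> 2 and unknown are not.
            state = "01" if state == "11" else "00"
        elif stmt == "--x":
            # 1 -> 0 is representable; anything else becomes imprecise.
            state = "11" if state == "01" else "00"
        trace.append(state)
    return trace
```

Running both on PROGRAM shows the difference in precision: the one-bit tracker loses the fact that x returns to 0, while the two-predicate tracker recovers it.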
1.2 Storing Abstractions:

We typically use saturating counters to track values: the first and last states are used to represent imprecision, and the states between them track precise values. The states at the two ends do not wrap around, and each of them represents more than one concrete value. Assertions / conditions may also be tracked. As observed from the example in Figure 2(b), tracking additional information such as the type or the sign may improve precision.

1.3 Conservative Analysis:

Being safe in static analysis means moving towards the Bottom of the lattice, even though the result may be imprecise. For reaching definitions, we initialise the information with the empty set and not with Bottom (all definitions), because the analysis monotonically grows the set, which leads us downwards in the lattice. This is shown in the following figure.

Figure 3: Reaching Definitions Lattice

The choice of initialisation depends on the confluence operator used in the analysis (Union, in the case of reaching definitions).

1.4 Safety vs Liveness:

Transformations query information computed by an analysis (example: data flow facts at a given program point). The resulting information is used by optimization passes. The queries can be classified by two properties:
1. Safety property
2. Liveness property

Safety: a property / condition that holds across all execution paths at runtime.
Liveness: a property / condition for which there exists at least one execution path where it holds.

Example: A program can be claimed bug free only if there are no bugs across all of its execution paths (absence of a bug is a safety property). However, if there exists a path with a bug in the program, it conclusively indicates the presence of a bug in the program (presence of a bug is a liveness property).

Example: To prove the presence of a data race in a program, it is sufficient to find one racing pair on any execution path (liveness property). However, proving the absence of a data race requires us to confirm its absence across all program paths and all pairs of data (safety property).

Question: Are Must / May / No alias for pointers safety or liveness properties?
No alias and Must alias are safety properties, since they require us to establish correctness across all possible execution paths in the program. May alias is a liveness property, as we require only one path where two pointers point to the same location to prove correctness.
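The classification of alias queries above can be illustrated over an explicitly enumerated set of paths. This is a sketch under a strong assumption: a real static analysis cannot enumerate execution paths, and the PATHS data and function names below are hypothetical.

```python
# Hypothetical data: for each execution path, the location each pointer
# holds at the program point being queried.
PATHS = [
    {"p": "x", "q": "x", "r": "z"},
    {"p": "x", "q": "y", "r": "z"},
]


def must_alias(paths, a, b):
    """Safety-style query: a and b coincide on every path."""
    return all(path[a] == path[b] for path in paths)


def may_alias(paths, a, b):
    """Liveness-style query: a and b coincide on at least one path."""
    return any(path[a] == path[b] for path in paths)


def no_alias(paths, a, b):
    """Safety-style query: a and b differ on every path."""
    return all(path[a] != path[b] for path in paths)
```

Here p and q may alias (first path) but do not must alias (second path), while p and r never alias; a single witness path settles the may query, whereas the other two need every path.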

NOTE: A must alias analysis gives us a guarantee that two pointers point to the same memory location. A may alias analysis, however, does not assure aliasing.

The following scenario shows how information from a may analysis can be used to infer properties of other pointers. Let p, q and r be three pointers. Alias analysis tells us that only p and q may alias. Since the analysis reports nothing about the third pointer r, it is guaranteed that r aliases neither with p nor with q. This information about r may be of use to some transformation.

Figure 4: Relation between Must, May and Must-Not alias

NOTE: The complement of May here is Must-not. By increasing the amount of may information, we reduce the amount of must-not information, thereby making the analysis more imprecise.

1.5 Analysis - Soundness and Precision :

Analyses facilitate optimizations in certain scenarios. Scenarios here are potential points of optimization. An analysis is sound if the optimization it leads to maintains the functionality of the original code and does not inadvertently trigger optimizations where there is no scope for them. The analysis is sound as long as the optimization points are within the set of scenarios. If optimization is carried out for all such scenarios, the analysis is termed complete. An analysis is precise as long as it does not disable optimizations for any possible scenario.

1.6 Optimization Soundness :

It is possible that an analysis is useful for more than one optimization, i.e. the information provided by an analysis may be used by 2 or more optimizations in a similar manner.

Example: Increasing the amount of information computed by the analysis may affect multiple optimizations in exactly the same manner. This lets us argue about the soundness of the analysis independent of the optimizations.

However, this is not always the case. Two optimizations may use the information from the same analysis in opposing ways, i.e. increasing the amount of information from the analysis may enable optimization O1 but disable optimization O2. In such cases, we cannot argue about the soundness of the analysis independent of the optimizations.

Example 1 (uses pointer analysis):
Consider optimization O1 that changes *p to x if p points to only x (we can enable caching the value).
Consider optimization O2 that makes p volatile if p points to multiple variables at different program points (we can disable caching the value).
Let the program exhibit the points-to information $p \to \{x, y\}$.
If the analysis A computes more information, $p \to \{x, y, z\}$, O1 is suppressed but O2 is enabled.
If A computes less information, $p \to \{x\}$, O1 is enabled and O2 is suppressed.

Example 2 (uses numerical analysis):
Consider optimization O1 that converts multiplication by 2 to a left bit-shift operation (x * 2 to x << 1).
Consider optimization O2 that uses a special circuit (fast operation) when there is a sum of reciprocals of powers of 2 ($1 + \frac{1}{2} + \frac{1}{4} + \dots$).
Let the program use 1.98 as the value in a computation. Analysis A is used to compute values of arithmetic expressions.
Converting 1.98 to 2 enables O1 and disables O2.
Converting 1.98 to 1.96875 ($= 1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \frac{1}{32}$) enables O2 and disables O1.

In both examples shown above, the analysis feeds different optimizations in different ways. What is precise for one is imprecise for the other, and what is sound for one is unsound for the other.
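Example 1 can be replayed with a few lines of code. This is a minimal sketch: the predicates o1_applies and o2_applies are hypothetical stand-ins for the enabling conditions of O1 and O2 as described above, not real compiler passes.

```python
def o1_applies(points_to_p):
    """O1: replace *p by the variable when p points to exactly one variable."""
    return len(points_to_p) == 1


def o2_applies(points_to_p):
    """O2: mark p volatile when p may point to more than one variable."""
    return len(points_to_p) > 1


ACTUAL = {"x", "y"}            # what the program really exhibits
OVER = {"x", "y", "z"}         # analysis computes more information
UNDER = {"x"}                  # analysis computes less information

# Which optimizations fire under each analysis result, as (O1, O2).
decisions = {
    "actual": (o1_applies(ACTUAL), o2_applies(ACTUAL)),
    "over": (o1_applies(OVER), o2_applies(OVER)),
    "under": (o1_applies(UNDER), o2_applies(UNDER)),
}

# Example 2: the value chosen to enable O2 is a finite sum of
# reciprocals of powers of 2.
RECIPROCAL_SUM = sum(2.0 ** -k for k in range(6))   # 1 + 1/2 + ... + 1/32
```

With the under-approximation, O1 fires even though p really has two pointees, which is unsound for O1; the same result merely loses an opportunity for O2. The roles of the two optimizations flip for the over-approximation.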

2 Pointer Analysis

Pointer analysis is a static code analysis technique that establishes which pointers can point to which variables or storage locations. It is often used as a component of, or a prelude to, more complex analyses. It is also used as the collective name for both points-to analysis and alias analysis.

2.1 Why Pointer Analysis?

    1. a = &x;
    2. b = a;
    3. if (b == *p) { /* ... IF BLOCK ... */
    4. } else { /* ... ELSE BLOCK ... */
    5. }

For the program above, suppose a transformation queries whether b and *p are aliases (Line 3).

If the analysis performed is a Must alias analysis and the response to the query is Yes, the transformation can safely exclude the ELSE block. However, if the response is No, it is not of significant use to the transformation (since the complement of MUST is MAY NOT).

If the analysis performed is a May alias analysis and the response is Yes, it does not help the transformation to exclude either of the blocks. However, with a No response to the may alias query, the transformation can safely exclude the IF block (since the complement of MAY is MUST NOT).

Parallelization:
Example: fun(p) fun(q)
If the executions of fun(p) and fun(q) do not interfere, they can be executed in parallel.

Optimization:
a = *p + 2; b = *q + 2;
In this case, if p and q are inferred to be must aliases, we can apply common sub-expression elimination across the two statements.

Table 1: Placement of Points-to Analysis
Goal: Improved Runtime; Secure code; Better debugging; Better Compile Time
Application of points-to analysis: Lock synchronizer; Parallelizing compiler; String vulnerability finder; Memory leak detector; Program slicer; Type analyzer; Data flow analyzer; Affine expression analyzer

2.2 Definitions :

Points-to analysis: computes points-to information for each pointer.
Alias analysis: computes aliasing information across pointers.
Pointee: the element a pointer points to.
Points-to edge: the directed edge from the pointer to the pointee in a points-to graph.
Aliasing edge: the undirected edge between two pointers that may be aliases.

Figure 5: points-to graph

From the figure shown above:
p's points-to set is {m, x}.
q's points-to set is {x, y}.
Since there exists a common pointee (x, here), p and q are aliases.

Note that we require the points-to information to compute the aliasing information, and not the other way round. Frameworks choose to store points-to information instead of aliasing information, since storing the latter is expensive.

2.3 Types of pointer manipulating statements:

A C program can have arbitrarily typed valid statements. We choose to model pointer manipulating statements using the following 4 types:

p = &x (address-of)
A statement of this kind assigns the address of x to the pointer p. x is added to the points-to set of p.

p = q (copy)
If q points to multiple pointees, p begins to point to each one of them.

q's points-to set is copied into p's set.

p = *q (load)
The points-to sets of the pointees of q are copied into p's points-to set, as shown in the following figure.

Figure 6: load instruction

*p = q (store)
p's pointees start pointing to the elements in q's points-to set, as shown in the figure.

Figure 7: store instruction

NOTE: p's points-to set is not modified here.
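The four statement kinds can be sketched as updates to points-to sets, iterated until nothing changes. This is a minimal flow-insensitive sketch, not the notes' implementation: the tuple encoding of statements, the names analyze and may_alias, and the sample program are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical encoding: (kind, lhs, rhs) with
#   "addr"  : lhs = &rhs      "copy"  : lhs = rhs
#   "load"  : lhs = *rhs      "store" : *lhs = rhs
STATEMENTS = [
    ("addr", "p", "m"), ("addr", "p", "x"),   # p -> {m, x} as in Figure 5
    ("addr", "q", "x"), ("addr", "q", "y"),   # q -> {x, y} as in Figure 5
    ("addr", "r", "p"),                       # r = &p
    ("load", "s", "r"),                       # s = *r
    ("addr", "a", "z"),                       # a = &z
    ("store", "r", "a"),                      # *r = a
]


def analyze(statements):
    """Flow-insensitive points-to sets: only ever add pointees."""
    pts = defaultdict(set)
    changed = True
    while changed:                            # repeat until a fixed point
        changed = False
        for kind, lhs, rhs in statements:
            if kind == "addr":
                updates = {lhs: {rhs}}
            elif kind == "copy":
                updates = {lhs: set(pts[rhs])}
            elif kind == "load":
                loaded = set()
                for pointee in list(pts[rhs]):
                    loaded |= pts[pointee]
                updates = {lhs: loaded}
            elif kind == "store":
                # Pointees of lhs receive rhs's pointees; lhs is untouched.
                updates = {t: set(pts[rhs]) for t in list(pts[lhs])}
            else:
                raise ValueError(f"unknown statement kind: {kind}")
            for var, targets in updates.items():
                if not targets <= pts[var]:
                    pts[var] |= targets
                    changed = True
    return pts


def may_alias(pts, a, b):
    """Two pointers may alias when they share at least one pointee."""
    return bool(pts[a] & pts[b])
```

For STATEMENTS, the store through r adds z to p's set while leaving r's own set as {p}, and the aliasing of p and q falls out of the common pointee x, which is why storing points-to sets is enough to answer alias queries.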

Question: How are pointer statements that do not directly fit the aforementioned types modelled?

Pointer arithmetic: for t = *(p+i); we choose to model (p+i) as another pointer, say q.

Pointer casting: We ignore type information and model a statement like q=(int *)p; as q=p;. It may be the case that p and q are of different types and refer to blocks of different sizes, but we accept the loss in precision/information.

malloc instructions: an instruction of the form p=malloc(10); is modelled as p=&x, where x is a mnemonic that represents the malloc call site.

free(p): A call to free deallocates the memory associated with the pointer p. However, it does not remove the reference, i.e. free does not change the points-to set of the pointer p, giving rise to a dangling pointer. As far as pointer analysis is concerned, we ignore the free statement.

Question: How is the NULL pointer modelled?
The NULL pointer can be modelled in one of two ways:
1. It can be modelled as another pointer in the program.
2. It need not be modelled at all.
The choice of how to model the NULL pointer depends on the goal of the analysis (optimization or security).

Question: In case of a load statement (p = *q), what happens to p's original points-to set? Is it overwritten or appended to?
If the analysis is flow sensitive, the original points-to set is killed. If the analysis is flow insensitive, the new pointees are added to the points-to set.

Question: In case of a load statement (p = *q), can p and *q be termed aliases?
Yes. Two expressions are aliases as long as they refer to the same memory location.
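The difference between killing and appending on a load can be sketched as follows. The function name apply_load, the dictionary representation, and the sample sets are hypothetical and chosen only to illustrate the answer above.

```python
def apply_load(pts, p, q, flow_sensitive):
    """Return a new points-to map after the statement p = *q.

    pts maps each pointer name to its set of pointees. With
    flow_sensitive=True the old set of p is killed (overwritten);
    otherwise the loaded pointees are appended to it.
    """
    loaded = set()
    for pointee in pts.get(q, set()):
        loaded |= pts.get(pointee, set())

    result = {var: set(targets) for var, targets in pts.items()}
    if flow_sensitive:
        result[p] = loaded                           # kill, then generate
    else:
        result[p] = result.get(p, set()) | loaded    # only generate
    return result


# Hypothetical state before the load: p -> {a}, q -> {r}, r -> {b}.
BEFORE = {"p": {"a"}, "q": {"r"}, "r": {"b"}}
```

Applying the load to BEFORE leaves p pointing to {b} in the flow-sensitive case and to {a, b} in the flow-insensitive case; the input map is not modified in either case.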

2.4 Algebraic Properties :

Points-to relation:
Not reflexive. One may argue that the points-to relation can be reflexive in weakly typed languages, where a pointer may point to itself. However, reflexivity is a general property that must hold for every element and cannot be based upon a few instances in a language; the relation is therefore not reflexive.
Not symmetric.
Not transitive.
Directed in nature.

Alias relation:
Reflexive.
Symmetric.
Not transitive. The following figure depicts a case where transitivity does not hold: a and b are aliases, b and c are aliases, but a and c are not.
Undirected in nature.

Figure 8: transitivity - counter example
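A small counter-example makes the failure of transitivity concrete. The points-to sets below are hypothetical (the contents of Figure 8 are not reproduced here), and alias is defined, as in Section 2.2, by the existence of a common pointee.

```python
# Hypothetical points-to sets: b shares one pointee with a and a
# different pointee with c.
PTS = {"a": {"x"}, "b": {"x", "y"}, "c": {"y"}}


def alias(pts, p, q):
    """p and q are aliases when their points-to sets intersect."""
    return bool(pts[p] & pts[q])


ab = alias(PTS, "a", "b")   # share x
bc = alias(PTS, "b", "c")   # share y
ac = alias(PTS, "a", "c")   # nothing in common
```

Here ab and bc hold while ac does not, so the relation is not transitive; swapping the arguments never changes the answer, which reflects symmetry.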

Points-to graph:
The points-to graphs for programs written in strictly typed languages are essentially DAGs (Directed Acyclic Graphs). Further, they fall under a subset of the possible DAGs, in which edges to pointees in farther levels are not permitted. However, points-to graphs for programs written in weakly typed languages may have such edges. The following figure shows this scenario, where (a) represents a points-to graph for a strictly typed program instance, and (b) the graph for a weakly typed instance. Every group of nodes in the figure represents a dereferencing level.

Figure 9: Points-to graphs - DAGs

2.5 Cyclic Dependence :

Optimization and points-to information:
The output of optimization passes is fed to the pointer analysis pass. This may in turn expose more opportunities to perform optimizations. Therefore we need this cyclic dependence between optimization and analysis.

Example: Let p, q, r and s be pointers in a program.
p=q+5; r=s+5;
If we had prior knowledge that q and s were aliases, we could have performed common sub-expression elimination. Performing pointer analysis therefore lets us enable this optimization. Performing further pointer analysis after this optimization lets us discover that p and r may alias.

Call graph and function pointers:
To perform inter-procedural or context-sensitive analysis, we have to know which function is invoked. However, a function pointer may point to different functions depending on its runtime value. We have to perform pointer analysis to determine the targets of the function pointers. In order to proceed with inter-procedural pointer analysis, we require the call graph, which in turn expects us to know the function call information. There exists a circular dependence in this case.

For example: Let p=&x be a statement in the main method, and let a function f (to which main passes p) have the statement q=p. We need to know that f is being called. A function pointer may point to different functions depending on its runtime value. To figure out the function pointer's targets, and therefore to construct the call graph, we need pointer analysis. In this case, to perform context-sensitive pointer analysis, we need p's points-to set (which is modified in the caller, the main function). This requires the call graph. In such cases, the call graph is constructed with some pointer information, pointer analysis is then invoked, and function targets are resolved using its results. This information is then fed back to the call graph to make it more precise.

2.6 Modelling Pointer analysis as a Data Flow Analysis Problem :

Performing pointer analysis requires us to compute the points-to information across the various basic blocks in the Control Flow Graph. We therefore model this as a data flow analysis problem and define the GEN and KILL sets.

GEN - set of points-to facts that are generated in the basic block.
KILL - set of points-to facts that are killed in the basic block.

The points-to facts generated by the four statements used to model pointer operations are as follows:

a = &x : generate $\{a \to x\}$
p = q : generate $\{p \to x \mid x \in \text{points-to}(q)\}$
m = *n : generate $\{m \to x \mid x \in \text{points-to}(y),\ y \in \text{points-to}(n)\}$
*p = a : generate $\{b \to x \mid p \to b \wedge a \to x\}$, kill $\{b \to x \mid p \to b \wedge b \to x\}$
(In this case the points-to information of p is not modified. Pointees of a are assigned to pointees of p.)

NOTE: The IN set of a basic block is defined in terms of the OUT sets of its predecessors (the analysis is in the forward direction). The confluence operator used is Union.

$IN(B) = \bigcup_{P \in pred(B)} OUT(P)$ (B - basic block, P - predecessors)
$OUT(B) = GEN(B) \cup (IN(B) - KILL(B))$

The points-to facts are defined across the entire program, and are not local to the respective basic blocks.
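The equations above can be sketched as a small forward solver over points-to facts. This is an illustrative sketch only. The statement encoding, the names transfer and solve, and the sample CFG are hypothetical. Killing the old facts of the left-hand side for address-of, copy and load is an assumption taken from the flow-sensitive answer in Section 2.3. The kill rule for stores follows the rule as written above; a production analysis would usually restrict it to the case where p has a single pointee. The sample CFG is acyclic, and termination on cyclic CFGs is not argued here.

```python
def transfer(stmt, in_facts):
    """OUT = GEN ∪ (IN − KILL) for one statement; facts are (ptr, pointee)."""
    kind, lhs, rhs = stmt

    def pointees(var):
        return {t for (s, t) in in_facts if s == var}

    if kind == "addr":                      # lhs = &rhs
        gen = {(lhs, rhs)}
        kill = {(s, t) for (s, t) in in_facts if s == lhs}
    elif kind == "copy":                    # lhs = rhs
        gen = {(lhs, x) for x in pointees(rhs)}
        kill = {(s, t) for (s, t) in in_facts if s == lhs}
    elif kind == "load":                    # lhs = *rhs
        gen = {(lhs, x) for y in pointees(rhs) for x in pointees(y)}
        kill = {(s, t) for (s, t) in in_facts if s == lhs}
    elif kind == "store":                   # *lhs = rhs
        targets = pointees(lhs)
        gen = {(b, x) for b in targets for x in pointees(rhs)}
        kill = {(s, t) for (s, t) in in_facts if s in targets}
    else:
        raise ValueError(f"unknown statement kind: {kind}")
    return gen | (in_facts - kill)


def solve(blocks, preds):
    """Iterate IN(B) = union of OUT(P) and the transfer until stable."""
    out = {name: set() for name in blocks}  # start from the empty set
    changed = True
    while changed:
        changed = False
        for name, stmts in blocks.items():
            facts = set()
            for pred in preds.get(name, []):
                facts |= out[pred]          # confluence: union
            for stmt in stmts:
                facts = transfer(stmt, facts)
            if facts != out[name]:
                out[name] = facts
                changed = True
    return out


# Hypothetical diamond-shaped CFG: B1 -> {B2, B3} -> B4.
BLOCKS = {
    "B1": [("addr", "a", "x"), ("addr", "p", "a")],   # a = &x; p = &a
    "B2": [("addr", "a", "y")],                       # a = &y
    "B3": [("addr", "c", "z"), ("store", "p", "c")],  # c = &z; *p = c
    "B4": [("load", "m", "p")],                       # m = *p
}
PREDS = {"B2": ["B1"], "B3": ["B1"], "B4": ["B2", "B3"]}
```

In this CFG, the fact a → x generated in B1 is killed on both branches, and the union at B4 merges a → y and a → z, so the load m = *p generates m → y and m → z.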