Program Analysis
Week 3 - 23rd to 27th January
Ganesh K. (CS16M006), Honey Goyal (CS16M018)
February 5, 2017
Contents

1 Data Flow Analysis (Contd.)
  1.1 Concrete Vs Abstract Interpretation
  1.2 Storing Abstractions
  1.3 Conservative Analysis
  1.4 Safety vs Liveness
  1.5 Analysis - Soundness and Precision
  1.6 Optimization Soundness
2 Pointer Analysis
  2.1 Why Pointer Analysis?
  2.2 Definitions
  2.3 Types of pointer manipulating statements
  2.4 Algebraic Properties
  2.5 Cyclic Dependence
  2.6 Modelling Pointer analysis as a Data Flow Analysis Problem
1 Data Flow Analysis (Contd.)

1.1 Concrete Vs Abstract Interpretation:

The interpretation of a program is a construct-by-construct execution of the program; for the purposes of analysis, it is performed at compile time.

Concrete Interpretation: The program is interpreted with the actual values it takes at runtime. It is highly precise and offers a better interpretation than what is obtained by static computation.

Abstract Interpretation: This interpretation is obtained at compile time and serves as an approximation.

Example of concrete interpretation (concrete execution trace):

    // x is undefined
    x = 0;   // x is zero
    ++x;     // x is one
    --x;     // x is zero

The above is a straight-line program with no branches; control flows from start to end. We now proceed with approximating the values x takes in the program. The value of x is undefined until the instruction x=0; is reached. After that program point, the value of x lies in the set {0, 1}. If we choose not to use a full (multi-bit) integer variable to keep track of the value of x, and use a single bit instead, we limit ourselves to tracking at most 2 values of the variable. We choose to track whether the value of x is zero or not.

Why such a tracking mechanism?: The choice of which value of a variable to track may depend on the requirement of a transformation (in this case, tracking x==0) that facilitates optimization at a later stage.
Example of abstract interpretation:

Figure 1: Abstract Implementation: execution trace

The idea here is to look through a peephole/window of one construct/instruction at a time. At any point, we have access to the current instruction and the bit(s) that we use to track the variable(s). We set the bit B tracking the variable x to True only when the value of x is known to be 0. Whenever the value of x is not 0, or there is uncertainty about its value, we set B to False.

Refer to Figure 1:
Initially the value of x is undefined, hence we assign B = False.
Line 1: The variable x has the value 0 as a result of an absolute assignment. Hence, B = True.
Line 2: x's value is incremented and it no longer holds the value 0, therefore B = False.
Line 3: x's value is decremented. B was False earlier, and with a single-bit tracker we cannot claim for certain that the value is now 0, so we conservatively assign B = False.

Figure 2 shows scenarios where we use 2 bits to track x's value. There are a total of four possible states with 2 bits:
00 - value 0
01 - value 1
10 - value 2
11 - value 3
We choose the state 00 to additionally denote imprecision.

Figure 2: Abstract Implementation: Using 2 bit tracking variable
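The single-bit tracking walked through for Figure 1 can be sketched in Python as follows. This is only an illustration under the assumption that the program consists of the three statement forms used in the example; the names step and abstract_trace are hypothetical and do not come from the lecture.

```python
def step(bit, stmt):
    """Transfer function for the one-bit abstraction.

    bit is True only when x is known to be 0 (the bit B of Figure 1).
    """
    if stmt == "x=0":
        return True   # absolute assignment: x is certainly 0
    if stmt in ("++x", "--x"):
        # From B=True the result is non-zero; from B=False the result is
        # unknown. In both cases one bit cannot certify x == 0.
        return False
    raise ValueError("unsupported statement: " + stmt)


def abstract_trace(program):
    """Return the value of B before the program and after each statement."""
    bit = False       # x is undefined initially
    trace = [bit]
    for stmt in program:
        bit = step(bit, stmt)
        trace.append(bit)
    return trace


trace = abstract_trace(["x=0", "++x", "--x"])
```

The resulting trace holds False, True, False, False, matching the walkthrough: the final False is the conservative answer even though the concrete value of x is 0 again.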
Initially the bits are assigned the value 00, denoting imprecision (refer to Figure 2(a)). Line 1 assigns the value 0 to the variable x. This assignment also sets the bits to 00. However, the modifications in lines 2 and 3 fail to update the bits, because the state 00 also indicates that x's value could be anything. Unless this loss in precision is recovered via a direct assignment to the variable x in the program, it continues to propagate.

Similarly, if 11 had been the state chosen to model the uncertainty in x's value, an assignment of the form x=3 would have caused imprecision. This is shown in Figure 2(c): the bits are set to 11 every time we are uncertain about x's value. Note that 11 may also denote the state where x's value is 3.
Line 1: assigns 1 to x, therefore the bits are assigned 01.
Line 2: absolutely assigns x the value 3, therefore the bits are assigned 11.
Line 3: decrements x, whose value cannot be determined for certain. Therefore the bits continue to hold 11.

Figure 2(b) shows another form of abstraction. We maintain two bits here (B0 B1): one bit (B0) to indicate whether x==0, and another bit (B1) to indicate whether x<2. (NOTE: 00 models the imprecise state.) Additional information: x is a non-negative integer.
Initially, the bits are assigned 00.
Line 1: absolutely assigns the value 0 to x, hence the bits are assigned 11.
Line 2: increments x. Since x's value is known to be 0 at this point, the bits are assigned 01.
Line 3: decrements x. Since we can again determine x's value to be 0 after the operation, we assign 11 to the bits.

NOTE: We could track x's value accurately after the increment/decrement operations only because we knew x was a non-negative integer. In the absence of this type information, we would have conservatively assigned 00 to the bits.
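The two-predicate abstraction of Figure 2(b) can be sketched as below. This is a minimal sketch assuming x is a non-negative integer, as stated above; the function names and the tuple encoding (B0, B1) are illustrative choices, not part of the lecture.

```python
IMPRECISE = (0, 0)   # state 00: nothing is known about x


def assign(c):
    """x = c for a constant c >= 0: B0 records x == 0, B1 records x < 2."""
    return (int(c == 0), int(c < 2))


def increment(state):
    """++x: only the state 11 (x is exactly 0) yields a known result."""
    if state == (1, 1):
        return (0, 1)        # x is now 1: not zero, still below 2
    return IMPRECISE


def decrement(state):
    """--x: state 01 means x != 0 and x < 2, so x == 1 for a non-negative x."""
    if state == (0, 1):
        return (1, 1)        # x is now 0
    return IMPRECISE


s0 = IMPRECISE
s1 = assign(0)
s2 = increment(s1)
s3 = decrement(s2)
```

The states evolve as 00, 11, 01, 11, as in the walkthrough. Note that assign(3) also returns 00 in this encoding, which illustrates how a precise fact can collapse into the state reserved for imprecision.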
1.2 Storing Abstractions:

We typically use saturating counters to track values: the first and last states are used to track imprecision, and the range of states between them tracks precise values. The states at the ends do not wrap around and are used to represent more than one state. Assertions/conditions may also be tracked. As observed from the example in Figure 2(b), additional information such as the type or sign, if tracked, may improve precision.

1.3 Conservative Analysis:

Being safe in static analysis means moving towards the Bottom of the lattice, although the result may be imprecise. We initialise the information with the empty set (in the case of reaching definitions) and not with Bottom (all definitions), because the analysis monotonically grows the set, which leads us downwards in the lattice. This is shown in the following figure.
Figure 3: Reaching Definitions Lattice

The choice of initialisation depends on the confluence operator used in the analysis (union, in the case of reaching definitions).

1.4 Safety vs Liveness:

Transformations query information computed by an analysis (example: data flow facts at a given program point). The resulting information is used by optimization passes. The queries can be classified across two properties:
1. Safety property
2. Liveness property

Safety: A property/condition that holds across all execution paths at runtime.
Liveness: There exists an execution path where the property holds.

Example: A program can be claimed bug-free only if there are no bugs across all of its execution paths (absence of a bug is a safety property). However, if there exists a path with a bug in the program, it conclusively indicates the presence of a bug in the program (presence of a bug is a liveness property).

Example: To prove the presence of a data race in a program, it is sufficient to find one racing pair on any execution path (liveness property). However, proving the absence of a data race requires us to confirm its absence across all program paths and all pairs of data (safety property).

Question: Are Must/May/No alias for pointers safety or liveness properties?
No alias and Must alias are safety properties, since they require us to establish correctness across all possible execution paths in the program. May alias, however, is a liveness property, as a single path on which the two pointers point to the same location is enough to establish it.
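The "all paths" versus "some path" distinction can be sketched as follows. This is a toy illustration that assumes the execution paths can be enumerated and that each path is summarised by the location every pointer ends up pointing to; a real analysis does not enumerate paths, and all names here are hypothetical.

```python
def must_alias(paths, p, q):
    """Safety-style query: p and q point to the same location on every path."""
    return all(path[p] == path[q] for path in paths)


def may_alias(paths, p, q):
    """Liveness-style query: some path makes p and q point to the same location."""
    return any(path[p] == path[q] for path in paths)


def no_alias(paths, p, q):
    """Safety-style query: on no path do p and q point to the same location."""
    return not may_alias(paths, p, q)


# Two hypothetical execution paths of one program.
paths = [
    {"p": "x", "q": "x", "r": "z"},
    {"p": "x", "q": "y", "r": "z"},
]
```

Here p and q may alias (the first path is a witness) but do not must-alias (the second path is a counterexample), while r is a no-alias of both p and q.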
NOTE: A must alias analysis gives us a guarantee that two pointers point to the same memory location. A may alias analysis, however, does not assure aliasing.

The following scenario shows how information from a may analysis can be used to infer properties of other pointers. Let p, q and r be three pointers. Alias analysis tells us that only p and q may alias. Since the analysis reports nothing about the third pointer r, it is guaranteed that r aliases neither with p nor with q. This information about r may be of use to some transformation.

Figure 4: Relation between Must, May and Must-Not alias

NOTE: The complement of May here is Must-not. By increasing the amount of may information, we reduce the amount of must-not information, thereby making the analysis more imprecise.

1.5 Analysis - Soundness and Precision:

Analyses facilitate optimizations in certain scenarios. Scenarios, here, are potential points of optimization. An analysis is sound if the optimization it leads to maintains the functionality of the original code and does not inadvertently trigger optimizations where there is no scope for them. In other words, the analysis is sound as long as the optimization points it enables lie within the set of scenarios. If optimization is carried out for all such scenarios, the analysis is termed complete. An analysis is precise as long as it does not disable optimizations for any possible scenario.
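One way to read these definitions is in terms of sets of optimization points. The sketch below assumes that both the valid scenarios and the points enabled by an analysis can be written down as sets; the function names and the scenario labels s1 to s4 are made up for illustration.

```python
def is_sound(enabled, scenarios):
    """Sound: every optimization point that is enabled is a valid scenario."""
    return enabled <= scenarios


def is_complete(enabled, scenarios):
    """Complete: every valid scenario is enabled."""
    return scenarios <= enabled


def missed(enabled, scenarios):
    """Valid scenarios that the analysis failed to enable (lost precision)."""
    return scenarios - enabled


scenarios = {"s1", "s2", "s3"}            # hypothetical valid optimization points
conservative = {"s1", "s2"}               # enables fewer points than possible
aggressive = {"s1", "s2", "s3", "s4"}     # enables a point with no scope (s4)
```

Under this reading the conservative result is sound but misses s3, while the aggressive result covers every scenario but is unsound because of s4.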
1.6 Optimization Soundness:

An analysis may be useful for more than one optimization, i.e. the information provided by an analysis may be used by two or more optimizations in a similar manner. Example: Increasing the amount of information computed by the analysis may affect multiple optimizations in exactly the same manner. This lets us argue about the soundness of the analysis independent of the optimizations.

However, this is not always the case: two opposing optimizations may view the information from the same analysis in opposing ways, i.e. increasing the amount of information from the analysis may enable optimization O1 but disable optimization O2. In such cases, we cannot argue about the soundness of the analysis independent of the optimizations.

Example 1 (uses pointer analysis):
Consider optimization O1 that changes *p to x if p points to only x (we can enable caching the value).
Consider optimization O2 that makes p volatile if p points to multiple variables at different program points (we can disable caching the value).
Let the program exhibit the points-to information $p \to \{x, y\}$.
If A computes more information, $p \to \{x, y, z\}$, O1 is suppressed but O2 is enabled.
If A computes less information, $p \to \{x\}$, O1 is enabled and O2 is suppressed.

Example 2 (uses numerical analysis):
Consider optimization O1 that converts multiplication by 2 to a left bit-shift operation (x * 2 to x << 1).
Consider optimization O2 that uses a special circuit (fast operation) when there is a sum of reciprocals of powers of 2 ($1 + \frac{1}{2} + \frac{1}{4} + \dots$).
Let the program use 1.98 as the value in a computation. Analysis A is used to compute values of arithmetic expressions.
Converting 1.98 to 2 enables O1 and disables O2.
Converting 1.98 to 1.96875 enables O2 and disables O1.

In both examples shown above, the analysis feeds different optimizations in different ways. What is precise for one is imprecise for the other, and what is sound for one is unsound for the other.
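Example 1 can be made concrete with a small sketch. It assumes that O1 fires exactly when the computed points-to set of p has one element and O2 fires exactly when it has more than one; the predicates o1_enabled and o2_enabled are illustrative names, not part of any real compiler.

```python
def o1_enabled(points_to_p):
    """O1 (replace *p by x) needs p to point to exactly one variable."""
    return len(points_to_p) == 1


def o2_enabled(points_to_p):
    """O2 (mark p volatile) is triggered when p points to several variables."""
    return len(points_to_p) > 1


actual = {"x", "y"}        # what the program really exhibits
more = {"x", "y", "z"}     # analysis computes extra information
less = {"x"}               # analysis computes too little information

decisions = {
    name: (o1_enabled(pts), o2_enabled(pts))
    for name, pts in (("actual", actual), ("more", more), ("less", less))
}
```

With the larger set O1 stays suppressed and O2 fires; with the smaller set O1 fires even though p can also point to y, which is why the same change in information is safe for one optimization and unsafe for the other.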
2 Pointer Analysis

Pointer analysis is a static code analysis technique that establishes which pointers can point to which variables or storage locations. It is often used as a component of, or a prelude to, more complex analyses. It is also used as the collective name for both points-to analysis and alias analysis.

2.1 Why Pointer Analysis?

    1. a = &x;
    2. b = a;
    3. if (b == *p) { /* ... IF BLOCK ... */
    4. } else { /* ... ELSE BLOCK ... */
    5. }

For the program above, suppose a transformation queries whether b and *p are aliases (Line 3).

If the analysis performed is a Must alias analysis and the response to the query is Yes, then the transformation can safely exclude the ELSE block. However, if the response is No, it is not of significant use to the transformation (since the complement of MUST is MAY NOT).

If the analysis performed is a May alias analysis and the response is Yes, it does not help the transformation exclude either of the blocks. However, with a No response to the may alias query, the transformation can safely exclude the IF block (since the complement of MAY is MUST NOT).

Parallelization: Example: fun(p) fun(q)
If the executions involving p and q do not interfere, they can be executed in parallel.

Optimization: a = p+2; b = q+2;
In this case, if p and q are inferred to be must aliases, we can apply common sub-expression elimination across the statements.
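The four query outcomes discussed for Line 3 can be summarised in a small decision function. This is only a sketch of the reasoning above; the function name and the string labels for the analyses and blocks are hypothetical.

```python
def removable_block(analysis, answer):
    """Return which block of the if/else can be excluded, or None.

    analysis: "must" or "may" (the kind of alias analysis queried)
    answer:   True for a Yes response, False for a No response
    """
    if analysis == "must" and answer:
        return "ELSE"    # the aliasing is guaranteed, so the ELSE block is dead
    if analysis == "may" and not answer:
        return "IF"      # the aliasing is impossible, so the IF block is dead
    # must/No means "may not" and may/Yes means "possibly": neither is conclusive
    return None


outcomes = {
    (kind, ans): removable_block(kind, ans)
    for kind in ("must", "may")
    for ans in (True, False)
}
```

Only two of the four combinations allow a block to be dropped, which reflects that the complement of a must answer is a weak "may not" and the complement of a may answer is a strong "must not".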
Table 1: Placement of Points-to Analysis
Goal: Improved runtime, Secure code, Better debugging, Better compile time
Application of points-to analysis: Lock synchronizer, Parallelizing compiler, String vulnerability finder, Memory leak detector, Program slicer, Type analyzer, Data flow analyzer, Affine expression analyzer

2.2 Definitions:

Points-to analysis: computes points-to information for each pointer.
Alias analysis: computes aliasing information across pointers.
Pointee: the element a pointer points to.
Points-to edge: the directed edge from the pointer to the pointee in a points-to graph.
Aliasing edge: the undirected edge between two pointers that may be aliases.

Figure 5: points-to graph

From the figure shown above:
p's points-to set is {m, x}.
q's points-to set is {x, y}.
Since there exists a common pointee (x, here), p and q are aliases.

Note that we require the points-to information to compute the aliasing information, and not the other way round. Frameworks choose to store points-to information instead of aliasing information, since storing the latter is expensive.

2.3 Types of pointer manipulating statements:

A C program can have arbitrarily typed valid statements. We choose to model pointer manipulating statements using the following 4 types:

p = &q (address-of)
A statement of the kind p=&x assigns the address of x to the pointer p. x is added to the points-to set of p.

p = q (copy)
If q points to multiple pointees, p begins to point to each one of them.
q's points-to set is copied into p's set.

p = *q (load)
The points-to sets of the pointees of q are copied to p's points-to set, as shown in the following figure.

Figure 6: load instruction

*p = q (store)
p's pointees start pointing to the elements in q's points-to set, as shown in the figure.

Figure 7: store instruction

NOTE: p's points-to set is not modified here.
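The effect of the four statement types on points-to sets can be sketched as follows. This is a minimal sketch in which every statement only adds pointees (no old facts are removed), and the points-to graph is assumed to be a dictionary from names to sets; the function names are illustrative.

```python
from collections import defaultdict


def address_of(pts, p, x):
    """p = &x: x is added to the points-to set of p."""
    pts[p].add(x)


def copy_stmt(pts, p, q):
    """p = q: q's points-to set is copied into p's set."""
    pts[p] |= pts[q]


def load(pts, p, q):
    """p = *q: the points-to sets of q's pointees are copied into p's set."""
    for pointee in list(pts[q]):
        pts[p] |= pts[pointee]


def store(pts, p, q):
    """*p = q: every pointee of p starts pointing to q's pointees."""
    for pointee in list(pts[p]):
        pts[pointee] |= pts[q]


pts = defaultdict(set)
address_of(pts, "a", "x")   # a = &x
address_of(pts, "q", "a")   # q = &a
load(pts, "p", "q")         # p = *q   -> p points to x
address_of(pts, "r", "y")   # r = &y
store(pts, "q", "r")        # *q = r   -> a also points to y
```

After the store, a points to both x and y while q still points only to a, in line with the note that the store does not modify the points-to set of the dereferenced pointer.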
Question: What kind of pointer statements cannot be modelled using the aforementioned types?

Pointer arithmetic: For t = (p+i); we choose to model (p+i) as another pointer, say q.

Pointer casting: We ignore type information and model a statement like q=(int *)p; as q=p;. It may be the case that p and q are of different types and refer to blocks of different sizes, but we accept the loss in precision/information.

malloc instructions: An instruction of the form p=malloc(10); is modelled as p=&x, where x is a mnemonic representing the malloc call site.

free(p): A call to free deallocates the memory associated with the pointer p. However, it does not remove the reference, i.e. free does not change the points-to set of the pointer p, giving rise to a dangling pointer. As far as pointer analysis is concerned, we ignore the free statement.

Question: How is the NULL pointer modelled?
The NULL pointer can be modelled in one of two ways:
1. It can be modelled as another pointer in the program.
2. It need not be modelled at all.
The choice of how to model the NULL pointer depends on the goal of the analysis (whether the goal is optimization or security).

Question: In the case of a load statement (p = *q), what happens to p's original points-to set? Is it overwritten or appended to?
If the analysis is flow-sensitive, the original points-to set is killed. If the analysis is flow-insensitive, the new pointees are added to the points-to set.

Question: In the case of a load statement (p = *q), can p and *q be termed aliases?
Yes. Two expressions can be aliases as long as they refer to the same memory location.
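The difference between killing and appending on a load can be sketched as below. The sketch assumes points-to information is a plain dictionary of sets and returns a new dictionary rather than updating in place; both function names are hypothetical.

```python
def loaded_pointees(pts, q):
    """Pointees reached through *q: the union of the sets of q's pointees."""
    result = set()
    for pointee in pts.get(q, set()):
        result |= pts.get(pointee, set())
    return result


def load_flow_sensitive(pts, p, q):
    """p = *q, flow-sensitive: p's old points-to set is killed."""
    updated = {name: set(targets) for name, targets in pts.items()}
    updated[p] = loaded_pointees(pts, q)
    return updated


def load_flow_insensitive(pts, p, q):
    """p = *q, flow-insensitive: the new pointees are added to p's old set."""
    updated = {name: set(targets) for name, targets in pts.items()}
    updated[p] = pts.get(p, set()) | loaded_pointees(pts, q)
    return updated


before = {"p": {"a"}, "q": {"y"}, "y": {"b"}}
sensitive = load_flow_sensitive(before, "p", "q")
insensitive = load_flow_insensitive(before, "p", "q")
```

Starting from p pointing to a, the flow-sensitive version leaves p pointing only to b, whereas the flow-insensitive version leaves p pointing to both a and b.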
2.4 Algebraic Properties:

Points-to relation:
Not reflexive. One may argue that the points-to relation can be reflexive in weakly typed languages. However, reflexivity is a general property of a relation and cannot be claimed on the basis of a few instances in some languages; therefore, the relation is not reflexive.
Not symmetric.
Not transitive.
Directed in nature.

Alias relation:
Reflexive.
Symmetric.
Not transitive. The following figure depicts a case where transitivity does not hold: a and b are aliases, b and c are aliases, but a and c are not.

Figure 8: transitivity - counter example

Undirected in nature.
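The properties of the alias relation can be checked on a small instance. The sketch below assumes that two pointers alias when their points-to sets share a pointee, as in the definition used for Figure 5; the concrete sets for a, b and c are made up to reproduce the counterexample.

```python
def aliases(pts, p, q):
    """p and q are aliases if their points-to sets share at least one pointee."""
    return bool(pts[p] & pts[q])


pts = {
    "a": {"x"},
    "b": {"x", "y"},
    "c": {"y"},
}

reflexive = all(aliases(pts, p, p) for p in pts)
symmetric = all(aliases(pts, p, q) == aliases(pts, q, p) for p in pts for q in pts)
a_b = aliases(pts, "a", "b")
b_c = aliases(pts, "b", "c")
a_c = aliases(pts, "a", "c")
```

Here a and b alias through x, b and c alias through y, yet a and c share no pointee, so transitivity fails. Reflexivity in this sketch relies on every pointer having a non-empty points-to set.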
Points-to graph:
The points-to graphs for programs written in strictly typed languages are essentially DAGs (Directed Acyclic Graphs). Further, they fall under a subset of the possible DAGs, in which edges to pointees in farther levels are not permitted. However, points-to graphs for programs written in weakly typed languages may have such edges. The following figure shows this scenario, where (a) represents a points-to graph for a strictly typed program instance, and (b) the graph for a weakly typed instance. Every group of nodes in the figure represents a dereferencing level.

Figure 9: Points-to graphs - DAGs

2.5 Cyclic Dependence:

Optimization and points-to information:
The output of optimization passes is fed to the pointer analysis pass. There is a possibility that this leads to more opportunities to perform optimizations. Therefore we need this cyclic dependence between optimization and analysis.

Example: Let p, q, r and s be pointers in a program.
p=q+5; r=s+5;
If we had prior knowledge that q and s were aliases, we could have performed common sub-expression elimination. Performing pointer analysis will therefore let us enable this optimization. Performing further pointer analysis after this optimization will let us discover that p and r may alias.

Call graph and function pointers:
To perform inter-procedural or context-sensitive analysis, we have to know which function is invoked. However, a function pointer may point to different functions depending on its runtime value. We have to perform pointer analysis to determine the targets of the function pointers. In order to proceed with inter-procedural pointer analysis, we require the call graph, which in turn expects us to know the function call information. There exists a circular dependence in this case.
For example: Let p=&x be a statement in the main method, and let a function f (to which main passes p) have the statement q=p.
We need to know that f is being called. A function pointer may point to different functions depending on its runtime value. To figure out the function pointer's targets, and therefore to construct the call graph, we need pointer analysis.
In this case, to perform context-sensitive pointer analysis, we need p's points-to set (which is modified in the caller, the main function). This requires the call graph.
In such cases, the call graph is constructed with some pointer information; pointer analysis is then invoked, using which function targets are resolved. This information is then fed back to the call graph to make it more precise.

2.6 Modelling Pointer analysis as a Data Flow Analysis Problem:

Performing pointer analysis requires us to compute the points-to information across the various basic blocks in the Control Flow Graph. We therefore model this as a data flow analysis problem and define the GEN and KILL sets.
GEN - set of points-to facts that are generated in the basic block.
KILL - set of points-to facts that are killed in the basic block.

Points-to generation for the four statements used to model pointer operations is as follows:
a = &x: generate $\{a \to x\}$
p = q: generate $\{p \to x \mid x \in \text{points-to}(q)\}$
m = *n: generate $\{m \to x \mid x \in \text{points-to}(y) \wedge y \in \text{points-to}(n)\}$
*p = a: generate $\{b \to x \mid p \to b \wedge a \to x\}$, kill $\{b \to x \mid p \to b \wedge b \to x\}$
(In this case the points-to information of p is not modified. Pointees of a are assigned to pointees of p.)

NOTE: The IN set of a basic block is defined in terms of the OUT sets of its predecessors (the analysis is in the forward direction). The confluence operator used is union.
$IN(B) = \bigcup_{P \in pred(B)} OUT(P)$ (B - basic block, P - predecessors)
$OUT(B) = GEN(B) \cup (IN(B) - KILL(B))$
The points-to facts are defined across the entire program, and are not local to the respective basic blocks.
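The formulation above can be sketched as an iterative forward analysis. The sketch rests on several assumptions that are not in the notes: facts are (pointer, pointee) pairs, statements are tuples (kind, lhs, rhs), the left-hand side of address-of, copy and load kills its old facts, and the store kills only when the dereferenced pointer has exactly one pointee. All names and the example CFG are hypothetical.

```python
def pointees(facts, v):
    """Points-to set of v under the given set of (pointer, pointee) facts."""
    return {t for (s, t) in facts if s == v}


def transfer(stmt, in_facts):
    """OUT = GEN union (IN - KILL) for a single statement."""
    kind, lhs, rhs = stmt
    if kind == "addr":        # lhs = &rhs
        gen = {(lhs, rhs)}
    elif kind == "copy":      # lhs = rhs
        gen = {(lhs, t) for t in pointees(in_facts, rhs)}
    elif kind == "load":      # lhs = *rhs
        gen = {(lhs, t) for y in pointees(in_facts, rhs)
               for t in pointees(in_facts, y)}
    elif kind == "store":     # *lhs = rhs
        targets = pointees(in_facts, lhs)
        gen = {(b, t) for b in targets for t in pointees(in_facts, rhs)}
        # Assumption: kill only when lhs has a single pointee (strong update).
        kill = ({(b, t) for b in targets for t in pointees(in_facts, b)}
                if len(targets) == 1 else set())
        return gen | (in_facts - kill)
    else:
        raise ValueError("unknown statement kind: " + kind)
    kill = {(lhs, t) for t in pointees(in_facts, lhs)}
    return gen | (in_facts - kill)


def analyze(blocks, preds):
    """Iterate IN/OUT over the CFG until the OUT sets stop changing."""
    out = {b: set() for b in blocks}          # initialise with the empty set
    changed = True
    while changed:
        changed = False
        for b in blocks:
            facts = set().union(*(out[p] for p in preds[b]))   # IN(B)
            for stmt in blocks[b]:
                facts = transfer(stmt, facts)
            if facts != out[b]:
                out[b] = facts
                changed = True
    return out


# Diamond-shaped CFG: B1 -> B2, B1 -> B3, B2 -> B4, B3 -> B4.
blocks = {
    "B1": [("addr", "p", "x")],   # p = &x
    "B2": [("addr", "p", "y")],   # p = &y  (kills p -> x)
    "B3": [("addr", "q", "z")],   # q = &z
    "B4": [("copy", "r", "p")],   # r = p
}
preds = {"B1": [], "B2": ["B1"], "B3": ["B1"], "B4": ["B2", "B3"]}
out = analyze(blocks, preds)
r_targets = pointees(out["B4"], "r")
```

At the join B4 the union of the two incoming OUT sets makes p point to both x and y, so r ends up with the points-to set {x, y}.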