Non-Parametric Statistical Techniques for Computational Forensic Engineering


Jennifer Lee Wong

Abstract

Computational forensic engineering is the process of identifying the tool or algorithm that was used to produce a particular output or solution by examining the structural properties of the output. We introduce a new Relative Generic Forensic Engineering (RGFE) technique that has several advantages over previously proposed approaches. From the quantitative point of view, the new RGFE technique not only performs more accurate identification of the tool used but also provides the identification with a level of confidence. A higher degree of classification is achieved by our technique with the ability to identify the output as produced by an unknown tool. We introduce a generic formulation which enables rapid application of the RGFE approach to a variety of problems that can be formulated as 0-1 integer linear programs. Additionally, we present forensic engineering scenarios which enable a natural classification of the forensic engineering task with respect to the types and amount of information available to conduct the classification. From the technical point of view, the key innovations of the RGFE technique include the development of a simulated annealing-based CART classification and clustering technique and a generic property formulation technique which provides a systematic way to develop properties for a given problem or facilitates their reuse. In addition to solution properties, we introduce instance properties which enable an enhanced classification of problem instances, leading to a higher accuracy of algorithm identification. Finally, the single most important innovation, property calibration, interprets the value of a given property for a given algorithm relative to the values for other algorithms. We demonstrated the RGFE technique on two canonical optimization problems, boolean satisfiability (SAT) and graph coloring (GC), and used statistical techniques to establish the effectiveness of the approach.

1 Introduction

1.1 Motivation

The emergence and rapid growth of the Internet, and in particular Peer-to-Peer networking, has enabled more convenient, economical, and faster methods of intellectual property distribution, which therefore also indirectly enables copyright infringement. Software piracy resulted in a loss of over $59 billion globally between 1995 and 2000, and continues to cause an average of $12 billion in losses each year in the United States alone [49]. Intellectual Property Protection (IPP) techniques, such as watermarking and fingerprinting, have been proposed to aid in protecting software. These techniques have shown significant potential. However, they introduce additional overhead on each application of the tool and IP, and they cannot be applied to tools which already exist. Computational forensic techniques remove these limitations and enable identification of the tool or algorithm which was used to produce a particular output or solution. Therefore, forensic engineering techniques provide an inexpensive approach that can be applied to any existing or future design or software tool. Through the examination of structural properties of a particular output, or solution, generated by a tool, the RGFE technique aims to identify the tool used to produce the output with a high degree of confidence. It is important to note that forensic engineering has a number of other applications that in many situations have an even higher economic impact.
For instance, computational forensic engineering can be used for optimization algorithm development, as a basis for developing or improving other IPP techniques, for the development of more powerful benchmarking tools, for enabling security, and for facilitating runtime prediction. More specifically, computational forensic engineering can be used to perform optimization algorithm tuning, instance partitioning for optimization, algorithm development, and analysis

of algorithm scaling. For example, forensic engineering can analyze the performance of an algorithm on various test cases and pinpoint the types of instances on which the algorithm does not perform well. The technique can also be used to partition instances into components, each of which can be processed using the algorithm that performs best with respect to the structure of that part. Furthermore, one can tune the parameters of heuristics with respect to the properties of the targeted instance. Additionally, computational forensic engineering can assist in the development of IPP techniques such as watermarking, obfuscation, and reverse engineering. Watermarking techniques often add a signature in the form of additional constraints to the instance structure. Forensic techniques can be used to determine the proper type of additional constraints to embed in the instance in order to ensure the uniqueness of the watermark without reducing the quality of the obtained solution. Reverse engineering of algorithms can be facilitated by forensic analysis in several ways. For example, the forensic technique can be used to determine which specific instances of the problem to analyze in order to identify the key optimization mechanisms of the algorithm. Benchmark sets can be built to accurately identify the benefits and limitations of an algorithm with respect to a particular type of instance. Forensic analysis can also be used to generate instances with specific structures on which an algorithm performs poorly or exceptionally well, or to build a compact, yet diverse, benchmark set which fairly tests all competing algorithms/tools. Security applications include the generation of secure mobile code and runtime checks of code. For example, one can check using computational forensic techniques whether or not a delivered piece of code was indeed generated using a particular compiler by looking at the register assignment (graph coloring) and scheduling. Lastly, forensic engineering can be used for the runtime prediction of a particular algorithm. Proper resource allocation, such as memory, can be determined using forensic techniques by examining the memory usage and runtime of an algorithm on instances with similar properties and of similar size. The main motivation for our work is provided by the analysis of the limitations of the initially proposed computational forensic engineering technique [47]. The initial technique performed well on a number of specific instances and on a specific small set of algorithms for two optimization problems: graph coloring and boolean satisfiability. This technique showed remarkable effectiveness, however, only under rather limiting conditions. The RGFE technique eliminates these conditions and provides not only quantitative but, more importantly, qualitative and conceptual advantages. RGFE can be applied to an arbitrary problem which can be formulated as a 0-1 linear programming problem. In this generic formulation, properties of the problem are extracted and used to analyze the structure of both instances of the problem and the outputs or solutions of a representative set of tools. Using the information gathered, the RGFE technique builds and verifies a Classification and Regression Tree (CART) model to represent the classification of the observed tools. Once built, the CART model can be used to identify the tool used to generate a particular instance output. The RGFE approach consists of three phases: Property Collection, Modeling, and Validation.
The key enabling factors in the property collection phase are the ability to extract properties of a given problem systematically and the ability to calibrate these properties to reflect the differences between solutions generated by the tools.

1.2 Motivational Example

In order to demonstrate the benefits of the RGFE technique, we present an example based on the boolean satisfiability problem. For the sake of brevity, we focus our attention on only one of the novelties of the RGFE approach: the use of properties of the problem instance to facilitate the classification process. Specifically, we establish the need for instance properties using two basic boolean satisfiability algorithms and two instances. The key observation is that one has to consider the properties of a specific instance of the problem in order to properly interpret and identify a particular solution. The boolean satisfiability problem (SAT) is an NP-complete combinatorial optimization problem. A SAT problem consists of a set of variables and a set of clauses, each containing a subset of the complemented and uncomplemented forms of the variables. If each variable can be assigned a truth value in such a way that every clause is satisfied (contains at least one variable that evaluates to True), the instance is satisfiable. A solution to the problem is a satisfying truth assignment for the variables. Examples of SAT instances are shown in Figure 2. A formal definition and additional information on the boolean satisfiability problem are presented in Chapter 3.3. We first introduce two simple algorithms for solving the SAT problem. Each of the algorithms uses a different optimization mechanism to solve the problem. The first algorithm focuses on the number of occurrences of a particular literal in the SAT instance. This Literal Approach counts the number of

occurrences in the instance for both the uncomplemented and complemented forms of each variable. The approach assigns a truth value to each of the variables according to the count. For example, if a variable v_i appears complemented ten times and uncomplemented twenty times, the algorithm assigns v_i to True. The second algorithm, which we call Difficult Clauses, attempts to satisfy all short clauses first. Clauses which contain few literals (variables in either complemented or uncomplemented form) are considered more difficult to satisfy due to the more limited number of assignment options. The algorithm selects the literal which occurs most frequently in the short clauses, and assigns the variable according to the literal's appearance. Pseudo-code for each of the algorithms is presented in Figure 1.

Literal Approach {
  Index all Variables;
  Assign all Variables in Clauses of size 1;
  while (instance not satisfied) {
    Select Literal(s) with highest appearance count;
    if (multiple literals selected) {
      Select literal with lowest complemented appearance;
      if (multiple literals still selected)
        Select literal with lowest index;
    }
    Assign Variable according to selected literal form;
    Simplify Instance;
    Assign all Variables in Clauses of size 1;
  }
}

Difficult Clauses {
  Index all Variables;
  Assign all Variables in Clauses of size 1;
  while (instance not satisfied) {
    Select all clauses of smallest size;
    Select literal which occurs most in selected clauses (in case of tie, select one with lower index);
    Assign Variable according to selected literal form;
    Simplify Instance;
    Assign all Variables in Clauses of size 1;
    CurrentSize++;
  }
}

Figure 1: Motivational example: Pseudo-code for the boolean satisfiability algorithms (Literal Approach, Difficult Clauses).

Instance 1: (x_1 + x_2)(x_1 + x_3)(x_1 + x_4)(x_1 + x_5)(x_1 + x_4 + x_5)(x_1 + x_3 + x_5)(x_1 + x_3 + x_5)(x_1 + x_2 + x_3)(x_1 + x_2 + x_4 + x_5)

Instance 2: (x_6 + x_2)(x_6 + x_5)(x_1 + x_3)(x_1 + x_2)(x_1 + x_6 + x_7)(x_1 + x_2 + x_6)(x_1 + x_2 + x_7 + x_9)(x_1 + x_2 + x_9 + x_10)(x_1 + x_4 + x_7)(x_6 + x_8 + x_9)(x_2 + x_3 + x_4 + x_6)(x_3 + x_5 + x_6 + x_7 + x_9)(x_2 + x_3 + x_6 + x_8)(x_6 + x_4 + x_10)(x_4 + x_5 + x_6 + x_7 + x_8 + x_9)

Figure 2: Motivational example: Two boolean satisfiability instances.

Consider the two instances shown in Figure 2. Each of these instances is solved using the Literal Approach and Difficult Clauses algorithms, resulting in the SAT solutions shown in Figure 3. One way to compare the solutions produced by the two algorithms is to compare the number of non-important variables. We denote the non-important variables, or variables which can be assigned either true or false without affecting the satisfiability of the instance, with a "-". The number of non-important variables is a solution property that can be used to distinguish algorithms. However, for these instances, this solution property does not distinguish between the two algorithms. On Instance 1, the Difficult Clauses algorithm has more non-important variables than the Literal Approach. However, on Instance 2, the Literal Approach has a higher number of non-important variables. This situation is by no means unique to these two algorithms and these two instances. Algorithms often perform differently on instances of different structure. The Difficult Clauses approach considers the most constrained portion of the problem, the short clauses, and tries to address the part of the problem which it considers to be the most difficult.
As a result, this approach performs better on instances that have the most constraining components in the short clauses.
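To make the two heuristics concrete, here is a minimal Python sketch of the pseudo-code in Figure 1 (our own illustrative rendering, not an implementation from this work). Clauses are written DIMACS-style, with a positive integer for an uncomplemented literal and a negative integer for a complemented one; tie-breaking follows Figure 1, and conflict handling and backtracking are deliberately omitted.

# Sketch of the two greedy SAT heuristics from Figure 1 (no backtracking).
from collections import Counter

def simplify(clauses, lit):
    """Make literal `lit` true: drop satisfied clauses, remove the
    complementary literal from the remaining ones."""
    return [[l for l in c if l != -lit] for c in clauses if lit not in c]

def propagate_units(clauses, assignment):
    """Assign all variables appearing in clauses of size 1."""
    while True:
        units = [c[0] for c in clauses if len(c) == 1]
        if not units:
            return clauses
        assignment[abs(units[0])] = units[0] > 0
        clauses = simplify(clauses, units[0])

def literal_approach(clauses):
    """Always satisfy the literal with the highest appearance count."""
    assignment = {}
    clauses = propagate_units(clauses, assignment)
    while clauses:
        counts = Counter(l for c in clauses for l in c)
        # Tie-breaks: fewer complemented appearances, then lower variable index.
        lit = max(counts, key=lambda l: (counts[l], -counts[-l], -abs(l)))
        assignment[abs(lit)] = lit > 0
        clauses = propagate_units(simplify(clauses, lit), assignment)
    return assignment

def difficult_clauses(clauses):
    """Satisfy the shortest (most constrained) clauses first."""
    assignment = {}
    clauses = propagate_units(clauses, assignment)
    while clauses:
        shortest = min(len(c) for c in clauses)
        pool = [c for c in clauses if len(c) == shortest]
        counts = Counter(l for c in pool for l in c)
        lit = max(counts, key=lambda l: (counts[l], -abs(l)))   # tie: lower index
        assignment[abs(lit)] = lit > 0
        clauses = propagate_units(simplify(clauses, lit), assignment)
    return assignment

# Hypothetical demo instance (not Instance 1 or 2 from Figure 2).
demo = [[1, -2], [2, 3], [-1, 3], [-3, 4]]
print(literal_approach([c[:] for c in demo]))
print(difficult_clauses([c[:] for c in demo]))

Both heuristics return a variable-to-truth-value dictionary; variables that never get assigned are the "non-important" variables discussed below.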

By identifying the difference in the structures of the two instances, a proper classification of the algorithms can be made. Instance properties identify the structures of instances. In this example, the ratio of short clauses in the instance can aid in the classification. We define this property in the following way:

$$\frac{1}{N_c} \sum_{i=1}^{N_c} \frac{1}{2^{l_i}} \qquad (1)$$

where N_c is the number of clauses in the instance and l_i is the number of literals in clause i. Instances with a larger number of short clauses will have a higher value for this property, implying that these instances have their most constraining components in short clauses. Specifically, for Instance 1 this property value is 0.225, while for Instance 2 it is lower. By considering both the solution property (non-important variables) and the instance property (ratio of short clauses) on this simple example, it is possible to classify these two algorithms. When the number of non-important variables is high on instances with a low ratio of short clauses, the algorithm used is most likely the Literal Approach. In the case of a high number of non-important variables and a high ratio of short clauses, the classification would be the Difficult Clauses approach. This simple example demonstrates the importance of instance properties. Note that for many complex algorithms and instances, a sufficient number of uncorrelated instance and solution properties is needed to perform this type of classification. Our prime goal is to identify these properties and find the most effective ways to combine them.

Figure 3: Motivational example: Solutions to the two SAT instances produced by the Literal Approach and Difficult Clauses algorithms.

1.3 Key Technical Features

In this section, we briefly outline the key technical novelties of the RGFE technique.

Relative Generic Forensic Engineering. We introduce a generic flow for the RGFE technique that allows it to be applied to a variety of optimization problems with minimal retargeting. The new approach compares the properties of an instance in relative terms. Relative comparison is done by using benchmark testing instances to calibrate each of the algorithms/tools to each other.

Generic Property Formulation. A systematic way to develop instance and solution properties for different problems allows the generic RGFE technique to be applied to a variety of optimization problems. The generic property formulation is applied to a problem which has been formulated in terms of an objective function and constraints. Special emphasis is placed on the widely used 0-1 ILP formulation.

Instance Properties. Problem instances have varying complexity which is often dependent upon particular structural aspects of the instance. Additionally, different algorithms perform differently depending on the complexity or structure of the problem instance they are presented with. Therefore, it is necessary to provide ways to measure and identify the structure of an instance with respect to other instances. We introduce instance properties which provide a measure for comparing instances and therefore facilitate more accurate analysis and classification of the algorithms.

Calibration. Calibration is performed on both instance and solution properties in order to place the data into the proper perspective.
For instance properties, calibration provides a way to scale and classify (order) the instances, while the solution properties for each algorithm are calibrated per instance so that the data can differentiate the algorithms.

Forensic Scenarios. The amount of available information is a key aspect of forensic engineering. A technique can only be as accurate as the amount of information available for analysis permits. We present five forensic scenarios which classify the amount of information which is available for analysis.

One-out-of-any Algorithm Classification. The number of algorithms available for solving a particular problem can be unlimited, and many may not be known when applying the RGFE technique. As a result, the technique must be able to classify algorithm output not only in terms of the algorithms which have been previously analyzed but also as coming from an unknown algorithm.

CART Model and Simulated Annealing. We have developed a new CART model for classification and clustering. The key novelty is that the new CART model not only partitions the solution space so that classification can be conducted but also maximizes the volume of space that indicates solutions created by none of the observed algorithms. The CART model is created using a simulated annealing algorithm.

1.4 Thesis Organization

In the next chapter, we present the work related to the forensic engineering technique. In Chapter 3, preliminary background, in order to make the work self-contained, is presented for the generic problem formulation used by the RGFE technique, the boolean satisfiability problem, and the graph coloring problem. The details of the construction and use of the Relative Generic Forensic Engineering technique are presented in Chapter 4. In the next chapter, the new enabling features of the RGFE technique are presented in detail. Chapter 6 presents the generic formulation for properties and introduces instance and solution properties in their generic forms. Before presenting the experimental conclusions in Chapter 9, the details of the modeling and validation phases of the RGFE technique are discussed.

2 Related Work

The related work can be traced along a number of directions. We summarize research in the areas which are most directly related: intellectual property protection, forensic analysis, statistical methods, and algorithms for the two selected NP-complete problems to which we apply the RGFE technique: boolean satisfiability and graph coloring. Due to the rapidly increasing reuse of intellectual property (IP) such as IC cores and software libraries, intellectual property protection (IPP) has become a mandatory step in the modern design process. Recently, a variety of IPP techniques, such as watermarking, fingerprinting [5, 14, 71], metering [5], obfuscation [13], and reverse engineering [11], have attracted a great deal of attention in this area. The most widely studied technique, watermarking, can be applied to two different types of artifacts: static and functional. Static artifacts [3, 82] are ones which have only syntactic components that are not altered during use; they include images [81, 83], video [84], audio [48], and textual objects [8, 6]. The common denominator for all of these techniques is that they use postprocessing methods to embed a particular message into the artifact. Techniques have also been proposed for the watermarking of computer-generated graphical objects, both modeled [69] and animated [26, 44]. Watermarking techniques have also been proposed for functional artifacts at many different levels of abstraction. Some of the targeted levels include system and behavioral synthesis, physical design, and logic synthesis [38, 46, 51].
These techniques mainly leverage the fact that there are often numerous solutions of similar quality for a given optimization problem, and the solution which is generated by the technique has certain characteristics that correspond to the designer's signature. More complex watermarking protocols, such as multiple watermarks [51], fragile watermarks [27], publicly detectable watermarks [7], and software watermarking [55, 66], have also been proposed. Forensic analysis has been widely used in areas from anthropology to visual art. Most commonly, forensic analysis is used for DNA identification [72]. The work most closely related to forensic engineering in the computer science and engineering community is copy detection [12]. While the focus of forensic engineering is to identify the tool used to produce a particular result, copy detection focuses on determining if an illegal copy of the tool exists. There are two categories of copy detection schemes, mainly used for textual data: signature-based and registration-based. Signature-based schemes [5, 8, 7] add extra information (a signature) into the object in order to identify its origin. However, these signatures are often easily removed and do not assist in detecting partial copies. Registration-based schemes [12, 57, 67, 88] focus on detecting the duplication of registered objects. These approaches have demonstrated strong

performance but also have drawbacks. Large modifications can render these techniques useless, and the registration database is often large and cannot detect partial copies. Other techniques which have a similar flavor are the detection of authentic Java byte codes [1] and code obfuscation. Code obfuscation techniques [13] attempt to transform programs into illegible equivalent programs which are difficult to reverse engineer. This technique is most often applied to Java byte codes [56] and mobile code [37, 74]. Principal component analysis is a standard statistical procedure that reduces a number of possibly correlated variables into a smaller set of uncorrelated variables [42]. We use non-parametric statistical techniques for classification because they can be applied to data with arbitrary distributions and without any assumptions on the densities of the data [8]. The Classification and Regression Trees (CART) model is a tree-building non-parametric technique widely used for the generation of decision rules for classification. An excellent reference for the CART model is [9]. Probabilistic optimization techniques were first introduced by Metropolis et al. [61] for the numerical calculation of integrals. The simulated annealing optimization technique originates from statistical mechanics and is often used to generate approximate solutions to very large combinatorial problems [45]. Bootstrapping is a classification validation technique that assesses the statistical accuracy of a model. In the case of non-parametric techniques, bootstrapping is used to provide standard errors and confidence intervals. The standard references for bootstrapping include [25, 18, 35]. Since the mid-1940s, integer and linear programming have been widely studied topics. Linear programming (LP) and the simplex algorithm were introduced by Dantzig [23]. While the simplex technique often works quite well in practice, its worst-case runtime is exponential. Consequently, both pseudopolynomial and polynomial algorithms have been developed for LP. Although integer linear programming (ILP) is an NP-complete problem, many specific instances of moderate size can be solved well in practice. There are a number of excellent references on linear programming [64, 17, 23, 24, 28] and ILP [39, 73, 79, 85]. In the CAD domain, ILP is applied to a number of optimization problems, such as behavioral synthesis [3] and system synthesis [68]. Boolean satisfiability (SAT) was the first problem shown to be NP-complete [29]. The problem has a variety of applications in many areas such as artificial intelligence, VLSI design and CAD, operations research, and combinatorial optimization [6]. Probably the most well-known applications of SAT in CAD are Automatic Test Pattern Generation (ATPG) [52, 58] and Deterministic Test Pattern Generation [34]. Other applications include logic verification, timing analysis, delay fault testing [16], FPGA routing [65], covering problems [2], and combinational equivalence checking [32]. Many different techniques have been developed for solving the boolean satisfiability problem. Techniques such as backtrack search [89], local search [78], algebraic manipulation [31], continuous formulation, and recursive learning [59] are among the most popular. Additionally, several public-domain software packages are available, such as GRASP [59], GSAT [76], Sato [89], and others [2, 22, 77]. Graph coloring is a popular NP-complete optimization problem with many applications in CAD as well as many other application domains.
The applications include operations scheduling [21], register assignment, multi-layer planar routing [19], and wireless spectrum estimation [43]. A number of different algorithms have been developed for solving the graph coloring problem. For example, there are several iterative improvement approaches [4, 62], such as tabu search [36] and simulated annealing [15, 63, 41]. Also, a number of other techniques have been developed, including heuristics [1, 33, 54], constructive algorithms [86], and exact implicit enumeration algorithms [1, 21]. Additionally, a number of hardware platforms have been developed for efficient graph coloring, including reconfigurable FPGAs [53] and platforms based on coupled oscillators [87]. Excellent surveys on graph coloring are [4, 41].

3 Preliminaries

In order to make the presentation self-contained, in this chapter we introduce the background information in a number of areas. We begin by identifying the differences between the new RGFE technique and the Computational Forensic Engineering (CFE) technique [47]. In order to accomplish this task, we provide a brief summary of CFE. Then, we present a brief summary of the generic problem format, which consists of an objective function and constraints. Finally, we present a brief review of each of the demonstration problems: boolean satisfiability and graph coloring.

3.1 Computational Forensic Engineering Technique

The Computational Forensic Engineering technique identifies the algorithm/tool, from a known set of algorithms/tools, which has been used to generate a particular previously unclassified output. This technique is composed of four phases: feature and statistics collection, feature extraction, algorithm clustering, and validation. In the feature and statistics collection phase, solution properties of the problem are identified, quantified, analyzed for relevance, and selected. Furthermore, preprocessing of the problem instances is done by perturbing the instances, removing any dependencies the algorithms have on the instance format. In the next phase, each of the perturbed instances is processed by each of the algorithms. From each output/solution for each instance and algorithm, the solution properties are extracted. The algorithm clustering phase then clusters the solution properties in n-dimensional space, where n is the number of properties. The n-dimensional space is then partitioned into subspaces for each algorithm. The final step validates the accuracy of the partitioned space. This approach performed well on both the graph coloring and boolean satisfiability problems. However, that was the case only under a number of limiting assumptions. The computational forensic engineering technique performed algorithm classification as one-out-of-k known algorithms and was tested on a variety of different instances; however, many of these instances had similar structures. This forensic engineering technique is problem specific and is not easily generalizable to other problems. Lastly, the CFE technique performed analysis of the algorithms/tools as blackboxes. In this forensic scenario, which we discuss in more detail in Chapter 5.1, the technique has access to each of the considered algorithms and unlimited analysis can be performed. However, in many cases, this type of information or access may not be available. The RGFE technique eliminates several major limitations of the CFE technique. The technique performs one-out-of-any classification instead of one-out-of-k. In this case, the output of an algorithm that was never previously analyzed can be classified as unknown. The key enabler for the effectiveness of the Relative Generic Forensic Engineering technique is the calibration of problem instances. By identifying, analyzing, and classifying instances by their properties, the quality of the RGFE classification is expanded to another dimension, enabling a more statistically sound classification. Additionally, we present a generic formulation and a generic property formulation that enable the application of the technique to numerous optimization problems. Lastly, we identify five scenarios which define the complexity of analysis for the RGFE technique based on the amount of information available.

3.2 Generic Problem Formulation

Numerous important optimization problems can be readily specified in the form defined by an objective function and constraints. Typical problems include set cover, the travelling salesman problem, vertex cover, template matching, k-way partitioning, graph coloring, maximum independent set, maximum satisfiability, satisfiability, and scheduling [39, 73, 79, 85]. The form defined by an objective function and constraints is the canonical form which enables the generic formulation of instance and solution properties.
Both the original instance as well as the corresponding solution can be analyzed starting from the following form:

$$\mathrm{MAX}(Y) = cx \qquad (2)$$

$$Ax \geq b \qquad (3)$$

We have selected this form because many popular optimization problems can be formulated in this way, the format is mathematically tractable, and it is simple to analyze in terms of the needs of the forensic engineering technique. We restrict the objective function, Eq. (2), and the constraints, in the form presented in Eq. (3), to be linear. Most of the optimization problems are formulated with x = {x_1, ..., x_n} as 0-1 variables. This problem formulation is most commonly known due to its use in integer linear programming [39, 73, 79, 85].

3.3 Boolean Satisfiability

The boolean satisfiability problem consists of a set of variables and clauses which are built over the complemented and uncomplemented forms of the variables. The goal is to find a truth assignment for each variable in such a way that at least one variable in every clause evaluates to true. The problem can

be formally stated in the following way.

Problem: Boolean Satisfiability
Instance: A set U of variables and a collection C of clauses over U.
Question: Is there a satisfying truth assignment for C?

The boolean satisfiability problem can be specified using the canonical form in many different ways; we present two. The first approach is simple and straightforward, allowing a direct mapping of the clauses to constraints, while the second approach is somewhat more complex. The first approach maps each variable of the SAT problem directly. We therefore define the variable x_i in the following way: x_i = 1 if X_i is assigned to true, and x_i = 0 if X_i is assigned to false. We define the objective function as presented in Eq. (2), where c is a negative identity vector; note that with this choice the objective minimizes the number of variables that evaluate to true. Another alternative is to have an empty objective function. Or, if the problem to be solved is MAXSAT, where we want to satisfy the problem with as many true variables as possible, we would assign c to be a positive identity vector. To generate the constraints, we translate each clause of the satisfiability problem into a single constraint. If a variable appears uncomplemented in the clause, it appears as a positive term in the constraint; if the variable is complemented in the clause, it appears as a negative term in the constraint. The bound term of each constraint is calculated using the following formula:

b = 1 - (number of variables in the clause which are complemented)

For example, the clauses shown below on the left yield the constraints shown on the right.

(X_2, X_3, X_4)       x_2 + x_3 + x_4 ≥ 1
(X_1, X_3, X̄_4)       x_1 + x_3 - x_4 ≥ 0
(X̄_1, X̄_2, X̄_4)      -x_1 - x_2 - x_4 ≥ -2
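As a concrete illustration of this first mapping, the following minimal Python sketch (our own, not an implementation from this work) stores the generic form of Eqs. (2)-(3) in a small container and converts a clause list into it. The class and function names are hypothetical, and clauses are written DIMACS-style, with a positive integer for an uncomplemented literal and a negative integer for a complemented one.

# Hypothetical container for the generic 0-1 ILP form and the first SAT mapping.
from dataclasses import dataclass
from typing import List

@dataclass
class ZeroOneILP:
    """Generic instance: maximize c.x subject to A x >= b, with x in {0,1}^n."""
    c: List[int]            # objective coefficients, Eq. (2)
    A: List[List[int]]      # one row per constraint, Eq. (3)
    b: List[int]            # right-hand sides, Eq. (3)

    def num_vars(self) -> int:
        return len(self.c)

    def satisfied_by(self, x: List[int]) -> bool:
        """Check a 0-1 assignment against every constraint."""
        return all(sum(a * xj for a, xj in zip(row, x)) >= bi
                   for row, bi in zip(self.A, self.b))

def sat_to_ilp(clauses: List[List[int]], num_vars: int) -> ZeroOneILP:
    """First SAT mapping: one constraint per clause,
    b_j = 1 - (number of complemented literals in clause j)."""
    A, b = [], []
    for clause in clauses:
        row = [0] * num_vars
        complemented = 0
        for lit in clause:
            if lit > 0:
                row[lit - 1] = 1       # uncomplemented literal -> +1
            else:
                row[-lit - 1] = -1     # complemented literal   -> -1
                complemented += 1
        A.append(row)
        b.append(1 - complemented)
    return ZeroOneILP(c=[-1] * num_vars, A=A, b=b)   # negative identity vector

# The three-clause example above: (X2, X3, X4), (X1, X3, ~X4), (~X1, ~X2, ~X4).
ilp = sat_to_ilp([[2, 3, 4], [1, 3, -4], [-1, -2, -4]], num_vars=4)
assert ilp.b == [1, 0, -2]
assert ilp.satisfied_by([0, 0, 1, 0])   # X3 = true satisfies all three clauses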

For the second approach, we begin by mapping the variables. For each variable in the SAT problem we create two new variables, one to represent the complemented form and one to represent the uncomplemented form. Therefore, when we have n variables, we create 2n variables, where x_1, ..., x_n represent the uncomplemented versions and x_{n+1}, ..., x_{2n} represent the complemented versions. Formally, we define x_i as: x_i = 1 if variable x_i is selected to evaluate to true, and 0 otherwise. The remainder of the problem is specified using two sets, S and C. We define each element of the finite set S as a single clause of the SAT problem; therefore, S_j contains all x_i which appear in clause j. Set C contains 2n elements, one for each x_i. Each subset C_i contains as elements all S_j in which variable i appears. Now, the objective function is defined as in Eq. (2), where c is a negative identity vector. For the constraints, the form presented in Eq. (3) is used; we define b as a positive identity vector and A_{ji} as follows: A_{ji} = 1 if element S_j is in subset C_i, and 0 otherwise. Finally, we include constraints that enforce that no variable can be simultaneously assigned both its complemented and its uncomplemented value:

$$\forall i: \; x_i + x_{i+n} \leq 1 \qquad (4)$$

Consider the following example defined on the set of variables {X_1, ..., X_4}. We define x_i, i ∈ {1, ..., 8}, where x_1, ..., x_4 represent the uncomplemented forms and x_5, ..., x_8 represent the complemented forms of X_1, ..., X_4. The matrix A_{ji} for these constraints is also shown.

(X_2, X_3, X_4)     S_1 = {x_2, x_3, x_4}
(X_1, X_3, X̄_4)     S_2 = {x_1, x_3, x_8}
(X̄_1, X̄_2, X̄_4)    S_3 = {x_5, x_6, x_8}

C_1 = {S_2}, C_2 = {S_1}, C_3 = {S_1, S_2}, C_4 = {S_1}, C_5 = {S_3}, C_6 = {S_3}, C_7 = { }, C_8 = {S_2, S_3}

$$A = \begin{bmatrix} 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 \end{bmatrix}$$

In the remainder of this manuscript, we will only use the first formulation of the SAT problem.

3.4 Graph Coloring

The graph coloring problem asks for an assignment of a color to each node in the graph such that no two nodes with an edge between them are colored with the same color, using the minimum number of colors. Formally, the problem is defined in the following way [29].

Problem: Graph K-Colorability
Instance: Graph G = (V, E), positive integer K ≤ |V|.
Question: Is G K-colorable, i.e., does there exist a function f: V → {1, 2, 3, ..., K} such that f(u) ≠ f(v) whenever (u, v) ∈ E?

Figure 4: Graph coloring: (a) a graph instance with 5 vertices and 6 edges; (b) a graph coloring solution for the graph instance.

An example of a graph coloring instance and a colored solution is presented in Figure 4. A graph instance can be transformed into the generic formulation in the following way. First, we define variables x_i and x_ij: x_i = 1 if color i is used, and 0 otherwise; x_ij = 1 if node j is colored with color i, and 0 otherwise. The goal is to minimize the number of colors used to color the graph. Therefore, the objective function can be written in the form of Eq. (2), where c is a negative identity vector over the color-usage variables x_i, so that maximizing cx minimizes the number of colors used to color the graph instance. There are three types of constraints for the graph coloring problem. The first constraint is on the number of colors used. If any node j is colored with color i, then the color has been used, and the value of x_i should be 1. Therefore, we have the following constraint:

$$\forall i, \; \forall j \in V: \; x_i \geq x_{ij} \qquad (5)$$

The second constraint is related to constraints induced by the connectivity of the graph. No two nodes with an edge between them can be colored using the same color. We define a matrix E which contains elements e_mn, where e_mn = 1 when nodes m and n are connected, and 0 otherwise. Now, the second constraint can be stated as:

$$\forall m, n \text{ such that } e_{mn} = 1, \; \forall i: \; x_{im} + x_{in} \leq 1 \qquad (6)$$

The last type of constraint states that all nodes in the graph must be colored:

$$\forall j \in V: \; \sum_i x_{ij} \geq 1 \qquad (7)$$

The graph presented in Figure 4(a) can be specified in the following generic form. We assume that only three colors are needed to color the instance; therefore i = {1, 2, 3}. The instance contains five vertices that we denote using the following notation: V = {A, B, C, D, E}.

OF: MAX(Y) = cx = -(x_1 + x_2 + x_3)

Constraints of type (5): x_1 ≥ x_1A, x_1 ≥ x_1B, ..., x_1 ≥ x_1E; x_2 ≥ x_2A, x_2 ≥ x_2B, ..., x_2 ≥ x_2E; x_3 ≥ x_3A, x_3 ≥ x_3B, ..., x_3 ≥ x_3E

Constraints of type (6): x_1A + x_1D ≤ 1, x_2A + x_2D ≤ 1, x_3A + x_3D ≤ 1; x_1A + x_1E ≤ 1, x_2A + x_2E ≤ 1, x_3A + x_3E ≤ 1; x_1B + x_1C ≤ 1, x_2B + x_2C ≤ 1, x_3B + x_3C ≤ 1; x_1B + x_1E ≤ 1, x_2B + x_2E ≤ 1, x_3B + x_3E ≤ 1; x_1C + x_1D ≤ 1, x_2C + x_2D ≤ 1, x_3C + x_3D ≤ 1

Constraints of type (7): x_1A + x_2A + x_3A ≥ 1; x_1B + x_2B + x_3B ≥ 1; x_1C + x_2C + x_3C ≥ 1; x_1D + x_2D + x_3D ≥ 1; x_1E + x_2E + x_3E ≥ 1

4 Relative Generic Forensic Engineering

In this chapter, we introduce the RGFE technique. The technique operates on input in the generic formulation; our implementation is restricted to instances that are formulated as 0-1 ILPs. The technique consists of two stages: analysis and evaluation. In the analysis stage, the goal is to classify, with high confidence, the behavior of algorithms for a specific problem specified using the 0-1 ILP format. The classification is achieved using a CART model that is built using instance and solution properties from a set of known algorithms. The CART model is then used to evaluate the behavior of unclassified algorithm output. The flow of the analysis stage is presented graphically in Figure 5 and in pseudo-code format in Figure 6. The steps of the evaluation stage are shown in Figure 7.

4.1 Analysis Stage

The analysis stage of the RGFE technique, shown in Figure 5, is composed of three phases: property collection, model building, and validation. The property collection phase defines, extracts, and calibrates both instance and solution properties for the given optimization problem. In the modeling phase, the relevant calibrated properties are used to develop a CART classification scheme. The last phase tests and validates the quality of the CART model to quantify and ensure high confidence in the developed classification.

Property Collection Phase. In this phase, the main steps are property selection and property calibration. We begin by selecting generic instance properties that will assist in classifying the characteristics of the given problem. The solution properties are selected to characterize the decisions or the optimization mechanisms used by an algorithm on a particular type of instance. We discuss a general approach for identifying and defining properties in more detail in Chapter 6.1. Once a set of properties has been selected for the targeted optimization problem, we proceed to extract each of the properties from our representative sets of instances and algorithms. The instance properties are extracted directly from each of the instances, and the solution properties are extracted from the solution/output of each algorithm run on each of the instances. In this step, it is important to develop fast techniques and software for extracting the properties. While some properties can be deterministic, others may have statistical components and therefore require large computational overhead for each instance or solution. Note that all property extraction methods are implemented to analyze the instances

Figure 5: Overall flow of the RGFE technique: Analysis Stage.

and solutions in the generic form. Therefore, all implemented property extraction methods can be reused for any new optimization problem. We discuss this process in greater detail in Chapter 6.1. For each property, it is necessary to calibrate the raw property values in order to interpret their meanings properly. We calibrate each instance and solution property for all algorithms using two conceptually different methods: scaling and ranking. We discuss the details of the calibration process in Chapter 5.3. In the final steps, relevant properties are identified and principal component analysis is performed. Relevant properties are properties which aid in distinguishing the algorithms from each other. If a property yields the same value for all (or a great percentage of) instances or all algorithms, then it is not useful in the classification process and is excluded from further consideration. Lastly, principal component analysis is performed in order to eliminate properties which provide information similar or identical to that provided by other properties. All of the steps in the property collection phase are encapsulated in the pseudo-code of Figure 6.

Modeling Phase. In the modeling phase, the calibrated properties are used to model the behavior of each of the algorithms. This is accomplished by representing each solution from each of the available algorithms as a point in n-dimensional space. Each dimension represents either a solution or an instance property, and the number of dimensions, n, is the total number of properties. Once the space is populated with the extracted data, we apply a generalized CART approach. The generalized CART approach, presented in Chapter 7, partitions the space into subspaces not only for each algorithm but also reserves empty regions for unobserved algorithms. Simulated annealing is used to develop a low-complexity generalized CART model.

Validation Phase. In order to verify the quality of the CART model, we refine and test the model built in the previous phase using two techniques: bootstrapping and learn-and-test. The validation phase of the technique is an iterative process. We begin by performing bootstrapping of the model. Bootstrapping is based on statistical sampling with replacement, presented in line 17 of the pseudo-code. We evaluate the overall quality of the CART model after bootstrapping, and if the quality is not above a specified threshold, the model is refined by introducing new instances into the representative set and beginning

the process over at the property extraction step of the property collection phase. The quality is measured as the percentage of new instances incorrectly classified. Furthermore, as an alternative to the inclusion of new instances, we attempt to apply alternative calibration schemes to some of the properties in the property calibration step in order to refine the model. In the case that the quality of the model is not improved to at least the threshold level after m iterations of adding instances, where m is a user-defined number of attempts, new properties are introduced and the analysis stage is restarted from the beginning. When the model passes the bootstrapping validation step, we perform learn-and-test validation. This statistical validation process introduces k new instances which have been run on a set of observed and unobserved algorithms. The model is tested to evaluate whether it correctly classifies the output for each new test instance. The quality is measured by the number of correct classifications. If the quality is below a set threshold, the same iterative process used in the bootstrapping step is applied. Once the CART model passes the learn-and-test step, the confidence interval of the model is determined. The confidence of the model is the accuracy of classification, which is specified for each of the considered algorithms separately.

Input: Representative Set of Instances, I_i; Representative Set of Algorithms, A_j.
Algorithm:
1.  P_k^S = Define Solution Properties;
2.  P_l^I = Define Instance Properties;
3.  while (L_T < threshold && < user-specified number of tries) {
4.    if (exceed user-specified number of tries)
5.      Add instances to I_i and restart;
6.    while (E < threshold && < user-specified number of tries) {
7.      if (exceed user-specified number of tries)
8.        Add instances to I_i and restart;
9.      Sol_ij = Run each instance, I_i, on every algorithm, A_j;
10.     V_{I,A,P} = Extract Properties P from I and Sol;
11.     Calibrate Properties(V_{I,A,P});
12.     R_p^S = Identify Relevant Solution Properties(P_k^S);
13.     R_q^I = Identify Relevant Instance Properties(P_l^I);
14.     Principal Component Analysis(V_{I,A,R});
15.     N = Build n-dimensional Space(R, I, V_{I,A,R});
16.     M = Use Simulated Annealing to develop CART model(N);
17.     E = Evaluate CART using Bootstrapping(M); }
18.   L_T = Evaluate CART using Learn and Test(M); }
19. C = Build Confidence Interval(M);

Figure 6: Pseudo-code for the RGFE technique.

4.2 Evaluation Stage

The goal of RGFE is to be able to correctly classify the output of an unknown algorithm. In this stage, we identify the process for evaluating unknown outputs. The evaluation process, shown in Figure 7, begins with the extraction of both instance properties and solution properties from the unknown instance and algorithm output. The properties that are extracted are the set of properties that were used to build the final CART model in the analysis stage. Next, the properties are calibrated according to the calibration scheme selected for each property. The calibrated properties of the unknown instance and solution are then evaluated by the CART model. The algorithm into which the CART model classifies the output is reported as the algorithm which produced the solution, with the confidence level determined for that algorithm at the end of the analysis stage. Note that the analysis stage of the approach must be performed only once for a set of observed algorithms. However, once the analysis stage is done, the evaluation stage can be applied repeatedly.
Only when new algorithms are observed must the analysis process be repeated. In order to correctly classify the newly observed algorithm(s), the properties must be recalibrated to take the new algorithm(s) into account. In some cases, it may be necessary to define new properties and to process additional instances on each of the observed algorithms in order to achieve a high confidence level.
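As a minimal illustration of the evaluation stage, the following sketch (ours, not the authors' implementation) walks one unknown instance/solution pair through property extraction, calibration, and a toy CART-style tree stored as nested dictionaries. The extractor and calibrator callables, the tree layout, and all thresholds and confidence values are hypothetical.

# Hypothetical sketch of the evaluation stage: extract the model's properties,
# calibrate them with the stored per-property schemes, then walk a CART-style
# decision tree that may also answer "unknown".

def evaluate_unknown(instance, solution, extractors, calibrators, tree):
    raw = [extract(instance, solution) for extract in extractors]   # property extraction
    point = [cal(value) for cal, value in zip(calibrators, raw)]    # property calibration
    node = tree
    while "split" in node:                     # internal node: (dimension, threshold)
        dim, threshold = node["split"]
        node = node["low"] if point[dim] <= threshold else node["high"]
    return node["label"], node.get("confidence")                    # leaf: classification

# Toy two-property tree echoing the motivational example of Section 1.2
# (thresholds and confidences are made up for illustration only).
toy_tree = {
    "split": (0, 0.5),                                   # dim 0: non-important variables
    "low":  {"label": "unknown", "confidence": None},
    "high": {"split": (1, 0.5),                          # dim 1: ratio of short clauses
             "low":  {"label": "Literal Approach",  "confidence": 0.9},
             "high": {"label": "Difficult Clauses", "confidence": 0.9}},
}

label, conf = evaluate_unknown(
    instance=None, solution=None,
    extractors=[lambda i, s: 0.8, lambda i, s: 0.2],     # stand-in extractors
    calibrators=[lambda v: v, lambda v: v],              # identity calibration
    tree=toy_tree)
# -> ("Literal Approach", 0.9): many non-important variables, few short clauses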

5 Enabling Concepts

There are four major enabling concepts for the RGFE technique: forensic scenarios, instance properties, solution properties, and calibration. The ability of the RGFE technique to correctly link unknown output to the algorithm which produced it is highly dependent on the amount of information available to the technique. We introduce forensic scenarios that classify the amount and type of the available information. More importantly, we introduce the notion of instance and solution properties as required by the RGFE technique. Finally, the concept of calibration is introduced and the developed calibration techniques are presented.

5.1 Forensic Scenarios

The ultimate goal of computational forensic engineering is to demonstrate that a specific tool was used to perform a task. Most commonly, forensic engineering is applied in situations where the owner of a tool wants to prove that the tool has been used by an unauthorized user. Forensic engineering is the science and engineering of using information from both the original instance and the generated output for the instance to classify the tool that was used to produce a particular output. Furthermore, a complete scheme for forensic engineering should be able to identify that a particular output (solution) was produced by a tool never encountered before. Correct classification is strongly dependent on the amount of information available to the forensic engineering technique. The common assumption for all of the scenarios is that for each observed instance, both the instance and the tool output are available. We therefore define five different scenarios which classify the amount of available information, and present them in order of increasing information. The most difficult scenario for a forensic engineering technique is the first scenario, where none of the tools are available.

1. Limited instance information. Instances and output from several tools are available. However, access to the tools themselves is not provided. Note that no specific instances can be used to uncover particular features or properties of the tools. However, it is assumed that a significant number of instances are available to allow statistically significant conclusions.

2. Information on a pay-per-use basis. The different tools are available on a pay-per-use basis. In this case, selected instances can be run on a tool for a fee. It is important in this case that the number of instances used in the representative set and the number of validation iterations of the forensic engineering technique are kept to a minimum. On the other hand, it is important that the submitted representative set of instances be diverse enough that the technique can still accurately classify the tools.

3. Blackbox information. Blackboxes of the tools are available to do essentially unlimited instance testing. This scenario provides the forensic engineering technique with the opportunity to analyze any number of instances with a variety of different instance properties; however, it provides no information on the algorithmic constructs used in the tools.

4. Whitebox information. Tools are available as whiteboxes, meaning the algorithmic details are available; however, modification of the tools is not possible. In this scenario, we assume that unlimited instance testing can be done using the tool, or that the tool can be reproduced from the whitebox information, enabling essentially unlimited testing.

5. Unlimited information access.
In this scenario, complete access to the algorithms/tools is available to the forensic engineering technique. Modifications of the algorithms/tools are possible and unlimited instance testing can be done.

Figure 7: Flow of the RGFE technique: Evaluation Stage (property extraction, property calibration, and classification of the unknown with k% confidence).

Of course, in all but one of these scenarios, the tools are available for testing instances. Note that some algorithms contain randomized components or can be obfuscated. As a result, the output of such algorithms will vary with each run, which greatly reduces the possibility of simply comparing the outputs to each other. Finally, note that there are three underlying assumptions that are required to make any forensic technique feasible. The first assumption is that each of the algorithms is stable. This implies that the algorithms produce predictable outputs that are created with the intention of optimizing the quality of the solution. Obviously, if there are two algorithms that intentionally produce completely random solutions, one cannot distinguish their outputs. The second assumption is that the algorithms are sufficiently different among themselves. Algorithms that are just minor modifications of each other often produce identical or similar solutions on the majority of instances. The last assumption is that each of the problem instances has a large solution space with many solutions of similar high quality. If a problem only contains a single solution, obviously, no conclusive forensic analysis can be done: either the algorithm/tool will find the single solution or it will not. Note that active watermarking research clearly indicates that for many problems a large solution space is essentially always the case [14, 71].

5.2 Properties

In this section, we present instance properties. Afterwards, we identify the relevance of solution properties and discuss the role of instance and solution properties in the RGFE technique. A property can be defined as a quality or trait belonging to an individual or object, or as an attribute that is common to all members of a class. Properties aid in the classification of objects. In the case of RGFE, the properties are the distinguishing factors between different instances of a problem and the solutions or outputs generated by different algorithms. Instance properties provide insight into the structure of a problem and its solutions. Furthermore, they facilitate instance classification. One important observation is that algorithms often perform differently on instances with different properties. For example, local search algorithms for the SAT problem often perform very poorly on random instances. However, they perform very well on structured instances. While instance properties assist in the classification of instances, solution properties facilitate identification of the behaviors of algorithms. Solutions are illustrations of how an algorithm handled the instance and provide insight into its optimization mechanisms. Properties of solutions help to identify the algorithm's mechanism(s) and to classify the behavior of the algorithm with respect to the instance. Properties are the basis of the RGFE technique. The development of properties and their calibration and relevance is the focus of the property collection phase. Additional properties are identified in the case that extensive validation of the models, in the validation phase, does not lead to a model of sufficient quality. It is important to note that calibrated properties are the defining components of the n-dimensional space, and therefore of the classification model.

5.3 Calibration

Calibration is one of the key steps of the property collection phase. It is vital for properly understanding property values for a particular algorithm/instance pair.
In this section, we introduce the concept of calibration for the RGFE technique, discuss the benefits of calibration, and define two possible approaches for calibrating properties: rank-ordered and scale-based. Calibration is the mapping of raw data values to values which contain the maximum amount of information to facilitate a particular task, which in this case is algorithm classification. The goal of calibration for the RGFE technique is to provide a relative and relevant perspective on particular features of a solution generated by a particular algorithm on a particular instance, or on instances suitable to produce solutions of a particular type. Calibration aids the proper interpretation of data. The best way to introduce calibration and establish its importance is to use a specific example. Consider two different SAT instances, par8-2-c and par8-1-c, solved using the GRASP and Walksat SAT algorithms. These algorithms are discussed in more detail in Chapter 9.1. We evaluate the instances and solutions with the solution property of non-important variables and present the results in Table 1. Non-important variables are variables that may switch their assignment in such a way that the correctness of the obtained solution is preserved. In a sense, the number of non-important variables indicates the robustness of the obtained SAT solution. We discuss this property in greater detail in Chapter 6.3. Without calibration, considering only these two instances, we would associate a range of 0.39 to 0.59 with the GRASP solutions and a range of 0.53 to 0.71 with the Walksat solutions. These two ranges overlap and

therefore classification is difficult. The reason is obvious; the two instances have different structure. There are intrinsically many more non-important variables in the instance par8-2-c than in par8-1-c. Calibration can compensate for this difference between instances. For example, we see that in both cases GRASP has a property value approximately 20% lower than that of Walksat for this property. Calibration of the values with respect to the other algorithms enables proper capturing of the relationships between the algorithms, which is not visible from the raw values.

          GRASP   Walksat
par8-2-c   0.59     0.71
par8-1-c   0.39     0.53

Table 1: Property values for the solution property non-important variables on two SAT instances, par8-2-c and par8-1-c, solved by two SAT solvers, GRASP and Walksat.

Two technical difficulties arise when calibrating data. The first difficulty is determining the goal of the calibration scheme, that is, the operational definition of the quality of a proposed calibration scheme. The second difficulty is how to derive a high-quality calibration scheme. We present two calibration schemes and discuss how each of them addresses these difficulties. The first calibration approach is a rank-ordered scheme. For each property value on a particular instance, we rank each of the algorithms. Using these rankings, a collaborative ranking for the property is built by examining the rankings of each of the algorithms on all instances. Additional consideration must be given to how to resolve ties in ranking and how to combine rankings over individual instances. One can use the average ranking, the median ranking, or some other function of the rankings on the individual instances. In our experiments we used modal ranking, where the ranking of each algorithm is defined as the rank that was observed on the largest number of instances. Note that in this situation, two or more algorithms can have the same ranking. Consider the following example that illustrates the key ideas and trade-offs. For property P on instance X, consider algorithms a, b, c, and d which have property values 23, 31, 18, and 13, respectively. A forward rank-order scheme would rank the algorithms (b, c, d, a) on instance X. In addition to instance X, consider instances Y and Z with algorithm rankings of (b, c, d, a) and (b, d, c, a), respectively. By examining the rankings of each of the algorithms, we find that algorithm b always has the first ranking and algorithm a always has the last. In the case of algorithms c and d, we find that algorithm c more often appears second and d third. Note that the classifications of algorithms c and d are not exact. While for property P we are able to distinguish algorithms a and b, additional properties, and therefore additional dimensions of analysis, need to be considered. Rank-order calibration schemes are simple to implement and are robust against data outliers. However, rank-order schemes eliminate the information about the relationship between the numerical values of a given property across the algorithms. Additionally, a property after rank-order calibration does not provide a mechanism for indicating an unpopulated region. Unpopulated regions are necessary for the RGFE technique to classify output from an algorithm which has not been observed or studied in the model. Rank-based property calibration can only identify unobserved areas when multiple properties are considered together. When analyzing multiple properties together, rank-order calibrated properties may create empty regions, where no data is present.
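The following minimal Python sketch (ours, not the implementation used in this work) captures the modal-rank calibration just described. The raw values are hypothetical numbers chosen only to reproduce the rankings (b, c, d, a), (b, c, d, a), and (b, d, c, a) discussed above.

# Sketch of rank-order calibration: rank the algorithms on every instance,
# then keep each algorithm's modal (most frequent) rank.
from collections import Counter

def modal_ranks(values_per_instance):
    """values_per_instance: {instance: {algorithm: raw property value}}."""
    ranks = {}
    for vals in values_per_instance.values():
        # Rank 1 = highest property value (a "forward" ordering).
        ordered = sorted(vals, key=vals.get, reverse=True)
        for pos, alg in enumerate(ordered, start=1):
            ranks.setdefault(alg, []).append(pos)
    return {alg: Counter(r).most_common(1)[0][0] for alg, r in ranks.items()}

# Hypothetical raw values reproducing the rankings of instances X, Y, and Z.
data = {"X": {"a": 5,  "b": 31, "c": 18, "d": 13},   # -> (b, c, d, a)
        "Y": {"a": 10, "b": 40, "c": 30, "d": 20},   # -> (b, c, d, a)
        "Z": {"a": 10, "b": 40, "c": 20, "d": 30}}   # -> (b, d, c, a)
print(modal_ranks(data))   # {'b': 1, 'c': 2, 'd': 3, 'a': 4}

Note that the calibrated result keeps only each algorithm's most common rank, which is exactly why the numerical spacing between property values is lost, as discussed above.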
The second type of calibration mechanism is a scale-based scheme. In these techniques, calibration is done by mapping the data values from the initial scale onto a new scale. Possible types of data mapping are normalization against the highest or lowest value, against the median, or against the average value. We use a scheme where the smallest value over all instances is mapped to 0, the largest value is mapped to 1, and all other values are mapped according to the formula x_new = (x_init - x_s) / (x_l - x_s), where x_init is the initial value of the property before calibration, x_s is the smallest value, and x_l is the largest value before calibration. The advantage of a scale-based scheme is, in principle, higher resolution and more expressive power than a rank-order scheme. However, these approaches can be very sensitive to data outliers, i.e., a few exceptionally large or small values. For a scale-based scheme, each of the property values can be plotted on a segment after the mapping of the absolute values has been applied. Regions of the segment which are populated by a particular algorithm are defined as classification regions for that algorithm. Regions of the segment which are not populated by any algorithm are specified as unclassified. The quality of a calibration scheme greatly impacts the effectiveness of the overall relative forensic technique.
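The min-max mapping and the notion of populated versus unclassified regions can be sketched as follows; this is an illustrative simplification (the `margin` padding and the one-interval-per-algorithm regions are our own assumptions, not the scheme's exact definition).

```python
def scale_calibrate(raw_values):
    """Min-max scale calibration: map the smallest observed value to 0, the
    largest to 1, and everything else linearly in between."""
    x_s, x_l = min(raw_values), max(raw_values)
    span = x_l - x_s
    return [(x - x_s) / span if span else 0.0 for x in raw_values]

def classification_regions(calibrated_by_algorithm, margin=0.02):
    """Very rough 1-D regions: each algorithm claims the interval spanned by its
    calibrated values (padded by a small margin); whatever is left over on [0, 1]
    is unclassified and can flag the output of an unobserved algorithm."""
    return {alg: (max(0.0, min(vals) - margin), min(1.0, max(vals) + margin))
            for alg, vals in calibrated_by_algorithm.items()}

# Hypothetical calibrated non-important-variable values for two solvers.
regions = classification_regions({"GRASP": [0.0, 0.35], "Walksat": [0.6, 1.0]})
print(regions)  # the gaps between the intervals are the unclassified regions
```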

A number of metrics can be defined to quantitatively compare calibration schemes. We mainly used two alternatives: the percentage of incorrect classifications and the unclassified regions. In the case of both the rank-based and scale-based schemes, the final calibration scheme may misclassify a number of values. The number of incorrectly classified property values is an indicator of the quality of the scheme in the sense that it identifies the accuracy of the calibration. Unclassified regions are regions, after the scale-based calibration of a property, that identify property value areas which are not associated with any of the observed algorithms. These regions are the key to identifying when a given output is not generated by one of the previously observed algorithms. One way to evaluate the effectiveness of the calibration scheme is to consider the sum of the sizes of the unclassified regions; larger unclassified regions indicate a sharper separation of the algorithms.

Calibration and the results of calibration are used in three phases of the RGFE technique. First, calibration is performed in the property collection phase, where calibration techniques are applied to each of the instance and solution properties in such a way that the relevance of the property can be determined. Second, during the modeling phase, the calibration scheme of a particular property can be modified or changed in order to increase the quality of the proposed model. Lastly, in the evaluation phase, the calibrated value of each property of the unknown instance and solution is used to fit the unknown into the model.

6 Properties

In this section, we discuss the purpose of the generic problem formulation and how the properties for different optimization problems are extracted in a systematic, generic way. In the second part of this section, we present generic forms for both instance properties and solution properties and demonstrate their meaning on the SAT and GC problems.

6.1 Generic Property Formulation and Extraction

Our generic problem formulation is the standard formulation for 0-1 integer linear programming. It consists of a linear objective function (OF) and linear constraints. All variables are integers and can be assigned a value of either 0 or 1. We have adopted this format because it naturally represents many optimization problems well, and for many problems this formulation is readily available [39, 73, 79, 85]. There are two key benefits to developing properties in generic form. The first is conceptual, while the second is software reuse. The conceptual benefit is that treating properties in a generic form greatly facilitates the process of identifying new properties for newly targeted optimization problems, because key insights are developed from the combination of constraints that makes a particular optimization problem difficult. The software reuse benefit lies in the fact that many properties developed for one optimization problem can be easily reused for forensic analysis of other optimization problems; therefore, software for the extraction of these properties need only be written once. Note that although many problems can be specified using the generic format, a specific optimization problem often has specific features. For example, depending on the problem, the generic formulation may or may not have an objective function.
Specifically, the SAT problem has an empty objective function; as long as all the constraints are satisfied, the value of the objective function is irrelevant. However, the GC problem formulation relies on the objective function to enforce minimization of the number of colors used to color the graph. Other features to consider include the types of variables that appear in the constraints (positive only, negative only, or both), the weights of the variables in the constraints (are they all the same or not), whether the objective function contains all of the variables in the problem or only a subset, and so forth. The key is to identify the intrinsic, essential properties of the problem and to develop a quantitative way to measure them. An important observation is that some problems can be represented in the generic form in different ways. For example, in Chapter 3.3 we introduced two different representations of the SAT problem in the generic format. These two formulations are equivalent to each other and to the original problem. However, depending on the formulation of the problem in the generic form, it may be easier and cleaner to identify and extract certain properties. This by no means implies that these properties do not exist in the other formulation; they are just more difficult to identify. Note that the solution properties can also be extracted in the generic form by mapping the solution output of the algorithms to the generic solution form and then computing the property values. We illustrate the steps in Figure 8.
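As a concrete illustration of the generic form itself, the following minimal sketch (our own encoding, not necessarily the one introduced in Chapter 3.3) represents a small SAT instance as a 0-1 program with an empty objective function and one linear constraint per clause; the example clauses are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Generic01Program:
    """A 0-1 ILP: minimize c.x subject to A.x >= b with binary x.
    For SAT the objective c is all zeros (empty objective function)."""
    objective: list          # c, one coefficient per variable
    constraints: list        # list of (row, b) pairs, row = coefficient vector

def sat_to_generic(num_vars, clauses):
    """Encode each CNF clause as a linear constraint.  A clause such as
    (x1 or not x2 or x3) becomes x1 - x2 + x3 >= 1 - (#negative literals),
    since a negated literal contributes (1 - x)."""
    constraints = []
    for clause in clauses:             # clause = list of signed variable indices
        row = [0] * num_vars
        negatives = 0
        for lit in clause:
            var = abs(lit) - 1
            if lit > 0:
                row[var] += 1
            else:
                row[var] -= 1
                negatives += 1
        constraints.append((row, 1 - negatives))
    return Generic01Program(objective=[0] * num_vars, constraints=constraints)

# (x1 or not x2) and (x2 or x3): a hypothetical two-clause instance.
print(sat_to_generic(3, [[1, -2], [2, 3]]))
```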

Figure 8: Procedure for generic property extraction.

The representative set of instances for the problem is used both in its standard representation and in generic form. On the right side of the Property Extraction phase in Figure 8, the instances are converted into generic form, and the instance property extraction methods are then applied to this form. The instance property values are collected and passed on to the calibration step. On the left side of Figure 8, each algorithm in the representative set is run on each instance in the representative set. The solutions of each algorithm on each instance are converted to the generic solution formulation, which is then given to the solution property extraction method along with the original instance in its generic formulation. The solution property values for the given algorithm and instance are collected, and the data passes to the calibration phase.

6.2 Instance Properties

We have developed a number of generic instance properties. We present them in their generic form and illustrate their meaning on the SAT and GC problems.

[I 1] Constraint Difficulty. Each constraint in the problem formulation contains coefficients for each variable appearing in the constraint and the value (b-value) on the right-hand side of the constraint. The goal of constraint difficulty is to provide a measure of how much effort and attention an algorithm must place on a given constraint. For example, in the SAT formulation, each constraint represents a single clause, and therefore all variables have unit weight. The b-value of the constraint depends on the number of positive and negative literals in the constraint. Therefore, in this case the generic property summarizes information about the size of the clauses in the instance. The aggregate information about the constraints can be expressed using statistical measures such as the average and variance, which we actually used in our system. Also, note that we used a weighted average of clause size in the motivational example.

[I 2] Ratio of Signs of Variables. The key observation is that some variables tend to appear in all constraints in a single form, while other variables appear in multiple forms and have more balanced appearance counts. For this property, we assume, without loss of generality, that all coefficients b are positive. In the problem formulation, the positive, negative, and weighted occurrences of a variable can be examined with respect to the total number of occurrences of the variable in the instance. In the GC problem formulation there are only positive variables; however, we can consider the number of times a variable appears compared to the average number of appearances of all variables.

This property measures the relative degree of each node in the graph. In the SAT problem, we can use this property to identify the tendency of a variable in the instance to be assigned true or false. Again, various statistical measures can be used to aggregate this information; we use the average and variance.

[I 3] Ratio of Variables vs. Constraints. This property can be applied to all or a subset of the variables in all or a subset of the constraints in the instance. It provides insight into the difficulty of these constraints. A low number of variables in a large number of constraints can imply that the constraints are difficult to satisfy, because numerous constraints are highly dependent on the same variables. This property has a direct interpretation for the SAT problem. For the GC problem, when considering the subset of the constraints which represent the edges in the graph, the property provides a measure of the subgraph's density, or the density of the entire graph when considering all edge constraints.

[I 4] Bias of a Variable. We measure the bias of a variable to be assigned either zero or one, based on the number of constraints which would benefit from the variable being assigned each way. Note that this property has no interpretation for the GC formulation, due to the exclusively positive occurrences of each variable.

[I 5] Probability of Satisfying Constraints. This property considers the difficulty of satisfying each constraint based on its variables, the weights of the variables, and its b-value. We define the probability of the constraint being satisfied as

P(constraint satisfied) = 1 - Π_{v_i} P(variable v_i is assigned opposite to the constraint's benefit).

For the SAT problem, this is a measure of the difficulty of a clause. For our formulation of the GC problem, this property has no relevance, since most of the constraints contain the same number of variables.

6.3 Solution Properties

Solution properties analyze the relationship between the solution and the structure of the problem instance. The key to identifying relevant solution properties is to find structures of the solution that occur when algorithms based on different optimization paradigms are applied, i.e., greedy, constructive, iterative improvement, deterministic, or probabilistic.

[S 1] Non-important Variables. This property identifies variables that received an assignment which has no effect on the objective function or on the satisfiability of the constraints. Therefore, the goal is to identify variables which are not crucial to the solution of the problem. If a variable in the SAT problem can alternate its assignment without causing the instance to become unsatisfiable, it is not important to the solution. Constructive and greedy algorithms tend to find solutions which have a high number of variables that are not crucial to the solution. In the GC domain, if this property is considered on only a subset of the variables, the node color variables, it represents the number of nodes that can change color assignments in the graph.

[S 2] Variable(s) Flip with k% of Constraints Still Satisfied. While the non-important variables property aims at identifying variables that have no bearing on the solution of the instance, this property attempts to measure the importance of the variables which do impact the objective function and/or the constraints.
If a variable can flip its assignment and keep a large percentage of the constraints satisfied, the variable has less importance to the solution. Note that this is not a completely accurate measure of a variable's importance, since most constraints have multiple variable dependencies; therefore, we expanded this property to consider pairs or triplets of variables.

[S 3] Constraint Satisfaction. This property aims at identifying the extent to which a constraint was satisfied. For example, for constraints of the type Ax ≥ b, we can define the property as the value of (Ax_sol - b) / (Ax_max - b). This ratio evaluates how much more the constraint was satisfied than the required level, by considering the solution's Ax value less the required value against the maximum possible value less the requirement. This formulation is applied to each constraint in the SAT problem and to the GC constraint which requires every node to be colored with at least a single color. In the case of SAT, this property translates to the level to which each clause is satisfied by the solution, and for GC it indicates the amount of flexibility in the color assignment for each node.
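To illustrate how a solution property such as [S 3] can be computed directly from the generic form, here is a small sketch (our own toy code, not the property-extraction implementation); the clause encoding follows the earlier sketch and the example assignment is hypothetical.

```python
def constraint_satisfaction(row, b, solution):
    """[S 3] for one constraint of the form row . x >= b:
    (Ax_sol - b) / (Ax_max - b), i.e. how far beyond the required level the
    solution pushes the constraint, relative to the best achievable value."""
    ax_sol = sum(coeff * x for coeff, x in zip(row, solution))
    # Maximum of row . x over binary x: set x = 1 exactly for positive coefficients.
    ax_max = sum(coeff for coeff in row if coeff > 0)
    if ax_max == b:                      # constraint can only be met exactly
        return 0.0
    return (ax_sol - b) / (ax_max - b)

# Clause (x1 or not x2 or x3) encoded as x1 - x2 + x3 >= 0 (see the generic-form
# sketch above), evaluated on the hypothetical assignment x = (1, 0, 1).
print(constraint_satisfaction([1, -1, 1], 0, [1, 0, 1]))  # -> 1.0
```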

[S 4] Variable Tendency. In many cases, constructive algorithms follow the natural tendencies presented by the instance. More specifically, variables that have all positive coefficients have an intrinsic inclination to be assigned a value of 1, and vice versa. This property tries to quantify to what extent this tendency was followed by a particular type of algorithm. One measure of variable tendency is the number of variables in the SAT problem which were assigned according to the ratio of their positive and negative appearances. Variations of this property include a weighted tendency; we used both variations for the SAT problem.

7 Modeling Phase

In this section we discuss the modeling phase of the RGFE technique. We begin with a discussion of how clustering and classification of the property data for the tools is achieved. The Classification and Regression Trees (CART) model is adopted and generalized for classification, and we analyze the benefits of the model and its application. Finally, we present details on the use of simulated annealing to generate the CART model.

The starting point for the development of the classification model is that each solution of each algorithm for a given problem is represented as a point in n-dimensional space, where n is the number of solution and instance properties. The goal of classification is to partition the space into subspaces that classify regions populated mainly, or in the ideal case only, by solutions produced by a particular algorithm. We define A + 1 classification classes, where A is the number of algorithms observed. The additional classification class, a unique addition to the standard classification problem, is reserved for subspaces that are not populated by solutions of any of the observed algorithms. If a given output falls into these subspaces, the output is classified as produced by an unknown or unobserved tool. Our goal is to perform classification of the data with a model of low Kolmogorov complexity, yet high accuracy. The low Kolmogorov complexity indicates that there is no danger that we have overtuned our model to the given data. Specifically, we follow the principle that for every parameter in the model, we must have at least five points in the property space.

We adopt the CART model as a starting point for classification for a number of reasons. The model is intuitive, has been extensively proven to provide accurate classification in many situations, and provides an easy mechanism to enforce the principle of low complexity. The essence of the CART model can be conveniently summarized in two different ways: as a decision tree and geometrically. When viewed from the decision tree point of view, the CART model consists of a binary tree where each branching is done by answering the question: does a particular point have a value for a particular property higher or lower than a specified value? Each leaf of the tree indicates that a particular set of points corresponds to a particular algorithm. The goal is to design a tree with as few leaves as possible for a given misclassification rate. From the geometric point of view, the CART model can be interpreted as a partitioning of the n-dimensional space using (n-1)-dimensional hyperplanes that are orthogonal to the axes of the space. For example, in 2D space, the area is partitioned using vertical and horizontal lines. In order to develop the CART model of the n-dimensional data as required by the RGFE technique, we first define a standard grid.
The resolution of the grid is determined by a user-specified threshold for misclassification. Essentially, we keep altering the resolution using binary search until the misclassification threshold is met. In the next step, we examine each of the hypercubes which the grid defines for misclassifications, i.e., we identify regions which contain data from multiple algorithms. Once the n-dimensional space has been sufficiently divided into hypercubes, adjacent hypercubes are merged to represent regions of classification for each algorithm. For this task, we use a simulated annealing approach. The approach begins with a random classification of the space, which implies a random merging of hypercubes into classification regions. Each merged classification solution is evaluated according to the following objective function:

OF = αN_i + βP_m    (8)

where N_i is the number of parameters needed to represent the defined classification in the CART model and P_m is the percentage of misclassification which can occur. The intuition behind the objective function is that the CART model will have a smaller representation and complexity if fewer parameters are used; however, a smaller representation implies higher misclassification levels. As a result, we try to balance both components in the objective function.
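A minimal sketch of how Equation (8) can drive the annealing search is given below; the weights alpha and beta, the Metropolis acceptance rule, and the example parameter counts are illustrative assumptions, not the settings used in our experiments.

```python
import math
import random

def objective(num_parameters, misclassification_pct, alpha=1.0, beta=100.0):
    """OF = alpha * N_i + beta * P_m from Equation (8); alpha and beta here are
    placeholder weights, not the values used in the chapter."""
    return alpha * num_parameters + beta * misclassification_pct

def anneal_step(current_of, candidate_of, temperature):
    """Standard Metropolis acceptance rule: always accept an improving merge of
    hypercubes, and accept a worsening one with probability exp(-delta / T)."""
    delta = candidate_of - current_of
    return delta <= 0 or random.random() < math.exp(-delta / temperature)

# Hypothetical comparison: a 12-parameter model with 3% misclassification versus
# a 9-parameter model with 8% misclassification, early in the cooling schedule.
current, candidate = objective(12, 0.03), objective(9, 0.08)
print(current, candidate, anneal_step(current, candidate, temperature=5.0))
```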

The simulated annealing algorithm requires the specification of two additional key components: a basic move generation mechanism and a cooling schedule. The move generation mechanism specifies which changes can be made to the current solution in order to generate a new, so-called neighborhood, solution. In the case of the hypercubes, we define a change as either the movement of a single hyperplane of a hypercube by a single grid unit, or the replacement of a single hypercube side with a new hypercube side in a different dimension. For example, if we consider the 2-dimensional space shown on the right in Figure 9, the grey region is redefined on the left by a move of a single vertical boundary to a horizontal boundary.

Figure 9: Simulated annealing move for the CART model: a vertical boundary moved to a horizontal boundary.

The cooling schedule consists of a number of parameters which define the temperature decrement of the simulated annealing algorithm. These parameters are the initial temperature, the stop criterion, the temperature decrement between successive stages, and the number of transitions for each temperature value. We define the initial temperature as the temperature at which 50% of the moves result in an increased OF. We stop searching if we visit s temperatures (s was 3 in our experiments) without any improvement in the solution. We decrement the temperature using a geometric schedule, and at each temperature value we consider a fixed number of generations.

8 Validation Phase

We apply both bootstrapping and learn-and-test techniques for validation of the CART model. We apply bootstrapping by drawing resamples, each a fraction of the size N of the original data set, from the representative set of instances. The resamples are selected randomly with replacement; in our experimentation, we resampled two thirds of the initial data set. For each resampled set, we rebuild the CART model and classify each of the unselected instances. The number of incorrectly classified instances represents the quality of the current model, and the overall quality of the model is the lowest quality of the CART models built using the resampled sets. For our experimentation, we resample the instances 20 times, build 20 new CART models, and evaluate each to find the lowest quality model. Learn-and-test is applied by considering twenty instances which were not used in the representative set of instances and ten perturbations of each of these instances. Each of these new instances is run on the representative set of algorithms. The instances are then classified by the generalized CART model. The quality of the model is evaluated by the number of correctly classified solutions.

The confidence interval for the final CART model can be determined using a variety of techniques. We choose to validate the established model using the k-nearest-neighbors technique. We select random points in the model and, for each point, consider the data points closest to it. The selected point is classified according to the algorithm that generated the most data points near it. This classification is compared to the classification achieved using the built CART model, and the percentage of points correctly classified by the CART model is the confidence interval of the approach.
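The bootstrap portion of the validation can be sketched as follows; `build_model`, `classify`, and the `.label` attribute are placeholders for the CART construction and classification steps, and the loop is an illustration rather than the exact experimental harness.

```python
import random

def bootstrap_validate(instances, build_model, classify, rounds=20, fraction=2/3):
    """Sketch of the bootstrap validation loop: resample a fraction of the data
    with replacement, rebuild the classifier, and score it on the instances that
    were never drawn; the overall quality is the worst score observed."""
    worst_accuracy = 1.0
    for _ in range(rounds):
        sample = [random.choice(instances) for _ in range(int(len(instances) * fraction))]
        model = build_model(sample)
        held_out = [inst for inst in instances if inst not in sample]
        if not held_out:
            continue
        correct = sum(classify(model, inst) == inst.label for inst in held_out)
        worst_accuracy = min(worst_accuracy, correct / len(held_out))
    return worst_accuracy  # quality of the lowest-quality resampled model
```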
9 Experimental Results

In this chapter, we present the results of the experimental evaluation of the RGFE technique on the boolean satisfiability and graph coloring problems. For each problem, we discuss the instance and solution properties used in the developed generalized CART model and the optimization algorithms that were used to build the model. Furthermore, we provide experimental evidence of the importance of simultaneously considering both solution and instance properties, using the SAT problem. Finally, the performance of the overall technique for classifying both observed and unobserved algorithm output for both problems is presented.
