A Two-folded Impact Analysis of Schema Changes on Database Applications


International Journal of Automation and Computing 6(2), May 2009, DOI: /s

A Two-folded Impact Analysis of Schema Changes on Database Applications

Spyridon K. Gardikiotis, Nicos Malevris
Department of Informatics, Athens University of Economics and Business, GR10434, Greece

Abstract: Database applications are becoming increasingly popular, mainly due to the advanced data management facilities that the underlying database management system offers compared with traditional legacy software applications. The interaction of such applications with the database system, however, introduces a number of issues, among which this paper addresses the impact analysis of changes performed at the database schema level. Our motivation is to provide the software engineers of database applications with automated methods that facilitate major maintenance tasks, such as source code corrections and regression testing, which should be triggered by the occurrence of such changes. The presented impact analysis is thus two-folded: the impact is analysed in terms of both the affected source code statements and the affected test suites used for testing these applications. To achieve the former objective, a program slicing technique is employed, based on an extended version of the program dependency graph. The latter objective requires the analysis of test suites generated for database applications, which is accomplished by employing testing techniques tailored to this type of application. Utilising both the slicing and the testing techniques enhances program comprehension of database applications, while also supporting the development of a number of practical metrics regarding their maintainability against schema changes. To evaluate the feasibility and effectiveness of the presented techniques and metrics, a software tool, called DATA, has been implemented. The experimental results from its usage on the TPC-C case study are reported and analysed.
Keywords: Software engineering, database applications, impact analysis, program slicing, coupling metrics.

1 Introduction

Databases lie at the core of most modern software applications, which have evolved from the client-server model to complex multi-tier architectures, such as web-based applications. Regardless of the complexity of their architecture, the last application tier is usually known as the data tier and refers to data storage and manipulation. In the vast majority of these applications, also known as database applications (DAs), a database management system (DBMS) is employed to provide this functionality. A DA can therefore be defined [1] as any software application that interacts with one or more DBMSs. The interaction with the underlying DBMS is usually carried out by embedding database statements, such as SQL [2] statements, into the DA's source code. The presence of such special statements imposes a number of limitations on the program analysis and comprehension of DAs, while also giving rise to new software engineering issues. Among them, one major issue concerns the impact of database schema changes on the DA. The database schema is a representation of the application data structures and may refer to different levels of abstraction. This paper refers to the physical level of abstraction, i.e., the actual implementation of the data model, and provides an impact analysis of such changes in terms of both the affected parts of the DA's source code and the affected test cases. In addition, a number of coupling metrics are defined to assess the maintainability of DAs against such changes. In summary, the presented work addresses the following research questions (RQ):

RQ1: Which schema changes are critical for a DA's maintenance?
RQ2: Which parts of a DA's source code are affected by a schema change?

(Manuscript received March 3, 2008; revised December 24, 2008. *Corresponding author. E-mail address: spgardi@aueb.gr)
RQ3: Which of a DA's test cases are affected by a schema change?
RQ4: How does a schema change affect the code percentage covered by the same set of test cases?
RQ5: Is it possible to define criteria for choosing between alternative sets of schema changes in order to minimise the impact on the code of the dependent DAs?

To evaluate the answers to the above questions, a software tool, called DATA, has been implemented and a number of experiments were conducted using it. In this paper, the results on a case study based on the TPC-C benchmark [3] are reported and analysed.

The rest of this paper is organized as follows. Section 2 presents an overview of the related work, whereas Section 3 briefly introduces concepts necessary for the readability of the paper. Section 4 provides a list of atomic schema changes. In Section 5, the main characteristics of the employed testing technique are outlined. Section 6 introduces a new approach for the dependency analysis of DAs, whereas Section 7 describes the suggested impact analysis and defines new metrics for assessing the coupling between the DA and the database schema. An overview of the implemented software tool is presented in Section 8 and its experimental results on the TPC-C case study are reported in Section 9. Finally, Section 10 concludes with the contributions of the presented work.

2 Related work

Although DAs have become increasingly popular during the last decades, they have gained relatively little attention

from researchers. Research on the effects of database schemas on dependent software applications includes approaches that treat this issue, also known as schema-change transparency [4], as part of database evolution. Proposals in this direction include schema versioning [5], schema mutation, and schema modification. Such approaches can be regarded as orthogonal to the approach presented in this paper. In the software engineering community, with the exception of our previous work presented in [6], most of the papers concerning DAs deal with the issues of generating test database instances [7, 8] and testing DAs. In the latter category, Chan and Cheung [9] applied white-box testing techniques by using relational algebra expressions to transform the database statements into a general-purpose programming language, whereas Chan et al. [10] presented an approach that employed a set of mutation operators to perform fault-based testing in DAs. Haraty et al. [11] suggested a test selection methodology for regression testing of database stored modules, while Haftmann et al. [12] proposed parallel black-box regression testing for DAs in order to control the database state during the tests. Compared to the former paper, and with regard to the impact analysis of test cases, our work uses an enhanced algorithm that is based on a weighted control flow graph (CFG) and which, contrary to that paper, depicts the database constraints. Furthermore, our analysis concerns general DAs rather than database stored modules and does not, therefore, require the availability of a database dependency mechanism. Compared to the latter paper, our work concerns white-box testing for DAs without requiring any control over the database state.
Regarding the usage of program slicing techniques in the DAs domain, the majority of the published research refers to the area of database reverse engineering (DBRE), where the main objective is to extract and conceptualise the data model of the application [13, 14]. Henrard and Hainaut [15] employed program slicing techniques to elicit data dependencies from legacy systems. With reference to program slicing in impact analysis, to the best of our knowledge, most of the proposed algorithms, such as [16], are targeted at the effects of software changes rather than schema changes. Similarly, Briand et al. [17] investigated the use of coupling measurement to support impact analysis of object-oriented systems. Clearly, all this work can also be considered as orthogonal to the work presented in this paper.

3 Background concepts

3.1 Control flow graph (CFG)

A CFG for a program unit is a directed graph that consists of a set N of nodes and a set E ⊆ N × N of directed edges between its nodes [18]. Each node can represent either one statement or a basic block [19] of program statements, i.e., a sequence of statements in which control enters at the first statement and exits at the last statement without halt or possibility of branching except at the end. A node that has more than one outbound edge is called a decision node. Each edge (n_i, n_j) ∈ E of the graph represents the transfer of control from node n_i to node n_j. In addition, in each CFG there is one entry and one exit node, where program execution starts and ends, respectively, and each edge can be associated with a predicate that describes the logical condition under which it will be executed. Finally, a complete path p in the CFG is a sequence of nodes ⟨entry, ..., n_i, n_j, n_k, ..., exit⟩ such that each consecutive pair (n_i, n_j), (n_j, n_k) belongs to E.
Throughout this paper, a complete path is simply referred to as a path and, consequently, the notation ⟨..., n_i, n_j, n_k, ...⟩ is considered equivalent to ⟨entry, ..., n_i, n_j, n_k, ..., exit⟩. A path p is feasible if there is at least one input datum that can actually cause the execution of p. Algorithm 1 and Fig. 1 present an example of a PL/SQL procedure with its corresponding CFG. In Fig. 1, the entry and the exit point of the program are depicted by two special unlabelled nodes. The entry node also includes the declarations and all the non-executable initialisation statements of the program. Simple statement nodes are shown as circles, whereas predicate nodes are shown as rectangles. All the nodes are labelled by the statement they depict. Finally, the edges leaving predicate nodes, known as branches, are labelled by the logical condition that needs to be evaluated for the flow to pass through them.

Algorithm 1. A PL/SQL program unit

PROCEDURE pi_estimation (n NUMBER) IS
  p NUMBER := 0;
BEGIN
  FOR k IN 1..n LOOP                                -- S1
    p := p + (((-1) ** (k + 1)) / ((2 * k) - 1));   -- S2
  END LOOP;
  p := 4 * p;                                       -- S3
  DBMS_OUTPUT.PUT_LINE('pi: ' || p);                -- S4
END;

Fig. 1 The CFG for Algorithm 1

3.2 Test cases

A test case is related to a path of the CFG. More precisely, a test case, or simply a test t, is a pair ⟨t, p⟩, where t denotes the identifier of the test case and p a path of the CFG. A test suite T is a set of test cases for a specific program unit. A test execution trace et is a triple ⟨et, i, t⟩, where et is the trace identifier and i is an input datum for which the path p of test case t is feasible. Table 1 presents three simple test cases and their execution traces for the CFG of Fig. 1. In this case, the program, and consequently the input datum, contains only one numerical parameter, namely the parameter n.
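Algorithm 1 computes the Leibniz-series approximation of π. As a quick aid to following the traces discussed next, the loop can be transcribed directly into Python (an illustrative sketch, not part of the paper's tooling; the function name simply mirrors the PL/SQL procedure):

```python
def pi_estimation(n):
    """Direct transcription of Algorithm 1 (Leibniz series for pi)."""
    p = 0.0
    for k in range(1, n + 1):                   # S1: loop predicate
        p += ((-1) ** (k + 1)) / ((2 * k) - 1)  # S2: add the k-th series term
    p = 4 * p                                   # S3: scale the partial sum
    return p                                    # S4 prints this value
```

With n = 0 the loop body never executes and the result is 0.0; with n = 3 it runs three times, giving 4(1 - 1/3 + 1/5) ≈ 3.4667, matching the loop counts of traces et1 and et3 in Table 1.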

Table 1 Simple test cases and execution traces for the program of Fig. 1

Test case                                      Execution trace      Description
⟨t1, ⟨S1, S3, S4⟩⟩                             ⟨et1, n = 0, t1⟩     0 loop executions
⟨t2, ⟨S1, S2, S1, S3, S4⟩⟩                     ⟨et2, n = 1, t2⟩     1 loop execution
⟨t3, ⟨S1, S2, S1, S2, S1, S2, S1, S3, S4⟩⟩     ⟨et3, n = 3, t3⟩     3 loop executions

3.3 Program slicing

Program slicing is a reverse engineering technique that was originally introduced and defined by Weiser [20] as the task of computing program slices. The basic idea is to isolate portions of the program that are relevant to a particular behaviour of interest. Thus, a program slice consists of those parts of the program that (potentially) affect the values computed at some point of interest, referred to as the slicing criterion [21]. The slicing criterion is typically defined as a pair ⟨v, po⟩, where v is a program variable and po is a specific program point. In fact, this type of slicing is also known as backward slicing, whereas forward slicing aims to generate slices that consist of all the program portions that may potentially be affected by the slicing criterion, i.e., those program statements and control predicates whose values or execution depend on the values computed at the slicing criterion. Apart from this type of slicing, which can be computed statically, there are also other types, such as dynamic slicing, amorphous slicing, chopping, etc. [22]

3.4 Program and system dependency graph

Among the existing static slicing approaches, the majority reduces the problem of program slicing to a graph reachability problem by employing the program dependency graph (PDG). The PDG, which was initially introduced in [23], is a directed graph consisting of a set of nodes that correspond to program statements and predicates, and a set of directed edges between these nodes, which represent dependencies.
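Since slicing over the PDG reduces to graph reachability, it can be sketched in a few lines of Python. The edge set below is our reading of the dependencies of Algorithm 1 (as described for Fig. 2) and is illustrative only; the direction and edge-kind filters correspond to backward/forward slicing and to restricting the dependency types considered:

```python
from collections import deque

# Our reconstruction of Algorithm 1's PDG edges (kind: control or data).
EDGES = [
    ("entry", "S1", "control"), ("entry", "S3", "control"), ("entry", "S4", "control"),
    ("S1", "S2", "control"),   # the loop body executes under the loop predicate S1
    ("S2", "S2", "data"),      # p feeds its own next value in the summation
    ("S2", "S3", "data"),      # p is used in p := 4 * p
    ("S3", "S4", "data"),      # p is used in the output statement
]

def slice_nodes(criterion, direction="backward", kinds=("control", "data")):
    """Slicing as reachability over dependency edges of the given kinds."""
    succ = {}
    for src, dst, kind in EDGES:
        if kind in kinds:
            a, b = (dst, src) if direction == "backward" else (src, dst)
            succ.setdefault(a, set()).add(b)
    seen, work = {criterion}, deque([criterion])
    while work:
        for nxt in succ.get(work.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return seen
```

A backward slice at S3 yields {entry, S1, S2, S3}, while a forward slice from S1 yields {S1, S2, S3, S4}; dropping "control" from kinds illustrates the kind of dependency filtering used later for the schema-change rules of Section 7.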
The direction of the graph traversal, i.e., backwards or forwards, covers backward and forward slicing, respectively. However, the PDG is suitable only for intra-procedural slicing, and therefore Horwitz et al. [24] proposed an extended version of the PDG, called the system dependency graph (SDG), for inter-procedural slicing. Fig. 2 depicts the PDG for the PL/SQL program of Algorithm 1. The entry and the exit points of the program are represented by rectangular nodes, whereas the program statements are shown as circular PDG nodes and are labelled by their statement numbers, as shown in Fig. 1. The presented PDG contains both control and data dependencies, where the former are shown as solid lines and the latter as dashed lines. Finally, edges that originate from predicate nodes, such as S1, are labelled by the required truth value of the corresponding logical condition.

Fig. 2 The PDG for Algorithm 1

4 Database schema changes

A database schema change can be considered as any modification to the objects/entities that the schema includes. Each change is performed through the language that the database management system (DBMS) provides, which is also referred to as the data definition language (DDL). Depending on the database vendor (e.g., Oracle, Microsoft, PostgreSQL, MySQL, etc.), there may be differences in the supported set of schema objects and DDL commands. To enhance the paper's comprehension and avoid referring to vendor-specific features, the list of changes presented in Table 2 includes the most commonly used schema objects and DDL commands in the Oracle9i database [25] that conform to the standard definition of the SQL database language [2]. Given the level of compliance that each vendor maintains with regard to this standard, the list of considered objects and commands may need to be modified or extended in order to fully cover the peculiarities of a specific database (e.g., a corresponding set of schema objects and changes that PostgreSQL supports is described in [1]).
Table 2 Atomic schema changes for Oracle PL/SQL

Schema object                  Change         DDL command                      Acronym
Stored procedures/functions    Deletion       drop procedure/function          SD
                               Creation       create procedure/function        SC
                               Modification   alter procedure/function         SM
Table/view                     Deletion       drop table/view                  TD
                               Creation       create table/view                TC
                               Renaming       rename to                        TR
Table fields                   Type change    alter table modify               TFTM
                               Size change    alter table modify               TFSM
                               Renaming       alter table rename column to     TFR
                               Addition       alter table add                  TFA
                               Deletion       alter table drop column          TFD
                               Allow null     alter table modify               TFN
Constraints                    Addition       alter table add constraint       TCA
                               Deletion       alter table drop constraint      TCD
                               Renaming       alter table rename constraint    TCR
Triggers                       Creation       create trigger                   TTrC
                               Deletion       drop trigger                     TTrD
                               Modification   create or replace trigger        TTrM
Indexes                        Creation       create index                     TIC
                               Deletion       drop index                       TID
                               Modification   alter index                      TIM

(Constraints, triggers, and indexes are grouped as table-related entities.)

The schema changes described in Table 2 are atomic in the sense that they cannot be broken up into a set of other changes. In other words, any non-atomic schema change can be mapped into a unique set of atomic changes; e.g., a renaming of a stored procedure is mapped into a stored procedure deletion (SD) and a stored procedure creation (SC). As far as the information presented in Table 2 is concerned, the first column contains the type of the schema object, the second column describes the change that is performed by the database command presented in the third column, and the last column contains an acronym for referencing this change. It should be noted that, within our impact analysis framework, a finer granularity level than the one that the schema itself provides is required. To justify this remark, let us consider the example of changing the type and the size of a table column. In a schema-level analysis, both these changes would be considered as the same atomic change, namely a table modification. From the perspective of DAs impact analysis, however, these changes should be regarded as two separate atomic changes, since they produce different impact results. This is depicted by rows 7 and 8 of Table 2, which contain the table field type modification (TFTM) and the table field size modification (TFSM), respectively.

5 DAs testing

To be applicable to the DAs domain, a testing technique should derive test cases that cover both the statements of the procedural programming language and the embedded database statements. However, the majority of the existing techniques do not consider the latter type of statements. To overcome this limitation, we have previously developed two methods [26], which rely on an extended CFG version and use the DBMS's execution plan to depict each embedded statement. Both methods could be used for the impact analysis of schema changes and would produce equivalent results. To avoid redundancy, we shall report our results with regard to the first method, called the semantically equivalent method. Algorithm 2 and Fig. 3 present an example of the CFG derived by this method for a sample DA implemented in Oracle PL/SQL. Algorithm 2 contains the statements of a DA's simple procedure, whereas Fig. 3 contains the corresponding CFG. For each database statement, such as S2, special CFG creation rules are applied. Initially, a decision pseudo-node (P1) is added to depict the successful or unsuccessful execution of the database manipulation language (DML) statement. Thus, if the DML statement is successfully executed, the control will be transferred to the next program statement; otherwise, an exception will be raised and the control will be transferred either to the statement that addresses the specific exception or to the end of the program if the exception cannot be handled. A plethora of reasons can result in the unsuccessful execution of a DML statement, such as the loss of connectivity to the underlying database, references to invalid database objects, etc.

Algorithm 2. A sample DA's procedure in PL/SQL

PROCEDURE PLSQL_proc IS
  products prod%ROWTYPE;
BEGIN
  dbms_output.put_line('Printing products made in Greece');       -- S1
  SELECT * INTO products FROM prod
    WHERE country_from = 'GR' ORDER BY date_in;                   -- S2
  IF SQL%ROWCOUNT > 1 THEN                                        -- S3
    show_recs(products);                                          -- S4
  ELSE
    dbms_output.put_line(products.id || ': ' || products.desc);   -- S5
  END IF;
  exit;                                                           -- S6
EXCEPTION
  WHEN OTHERS THEN
    BEGIN
      dbms_output.put_line(SQLERRM);                              -- S7
      exit;                                                       -- S8
    END;
END;

Fig. 3 The semantically equivalent CFG for Algorithm 2

Furthermore, a second decision pseudo-node (P2) is added to denote that either zero or n rows of the accessed database objects will be affected, depending on the DML statement type (i.e., retrieval, update, deletion, or insertion), where n > 0. This node also validates the data type compatibility between the host and the database variables.
In the presented example, where a retrieval DML statement is encountered, if no row is retrieved, or a row with incompatible data is retrieved, an exception will be raised and therefore the program execution will be transferred to the exception handling statement (S7). Otherwise, the fetched rows will be assigned to the host variable products, which is denoted by the control transfer to S2, where this assignment is performed. At this point, it should be noted that an alternative, lighter model would be produced if, instead of the pseudo-nodes, two pseudo-edges were added from the DML statement to the exception handling statement. In the implemented tool, however, the model containing the pseudo-nodes was preferred for three main reasons: 1) it separates database statements, allowing for the definition of new test coverage criteria (see Section 7.2); 2) it is applicable to cursor statements (where open should be distinguished from fetch); and 3) it is compatible with traditional test case generation algorithms, which impose a restriction of at most two outgoing edges for each CFG node. Finally, the produced CFG is enriched with weighted measures that are associated with each arc of a decision node, similarly to a proposal made by Siougle [27]. Each weight provides a means to estimate the feasibility of a path containing its associated arc. Our interest focuses on providing the weights for the arcs originating from the special pseudo-nodes of the database statements. Thus, as far as

the first pseudo-node is concerned, the weight should indicate the likelihood of either a successful or an unsuccessful execution. In the implemented tool, presented in Section 8, this weight reflects the validity of the database objects referenced by the database statement, and its calculation is based on the comparison between the object names of the underlying schema and their references in the DA's source code. As far as the second pseudo-node is concerned, the weight encompasses both the complexity of the associated database statement (based on code metrics and statistics gathered by the underlying DBMS [26]) and the type compatibility between the schema objects and their references in the DA's code. In the case of DML statements other than data retrieval (i.e., INSERT, UPDATE, and DELETE statements), the arcs originating from the second node are not weighted but are instead annotated with the logical condition that represents the relevant database constraints.

6 Program and dependency analysis for DAs

Program analysis is essential for acquiring the necessary knowledge to support a program's software engineering, including its maintenance activities. Program analysis for DAs includes control flow, data flow, and dependency analysis. Similar to test case generation, the first two types of analyses require the DA's parsing and the representation of the gained information in a CFG. In this case, however, the analysis is focused on revealing the dependency relationships between the DA's statements and the schema objects. The granularity level of the analysis is therefore the DA's statements instead of the basic blocks that are used in testing. The results that derive from the dependency analysis are described in terms of the DA's PDG.
More specifically, the CFG that is constructed during the control flow analysis depicts all types of the DA's statements and annotates the corresponding nodes with their types; thus, there are regular, declaration, predicate, loop, and database nodes, where the latter are further divided into insert, update, delete, select, and select* nodes. Data flow analysis includes, apart from the typical sets (definitions, computational-uses, and predicate-uses), three new sets tailored for the database statements, called sql_def, sql_use, and sql_type_use. These sets are defined for each database statement and concern the definition and use of the schema objects by the DA's statements. In particular, the last set, i.e., sql_type_use, regards the schema objects that are used for the declaration of host variables. In Oracle [25] terminology, these are called anchored declarations and allow for binding (anchoring) the data type of a scalar variable to a database table/view or a table/view field. Given this CFG, the control dependencies between its nodes are detected by constructing the post-dominator tree, where a CFG node i dominates a CFG node j if every path from the entry node to j passes through i [19] (for post-domination, every path from j to the exit node passes through i). The dominator tree represents this domination relation by having all the dominated nodes as descendants of the dominator node and the initial CFG node as the root. Data dependencies between host variables and statements are calculated by employing the reaching definitions algorithm, also presented in [19], whereas simple dependencies between declaration and definition nodes can be detected directly. In addition to these types of dependencies, a new type is introduced to describe the data type dependency between host variables and database objects. More specifically, the dependency between an anchored declaration of a host variable and its definitions can either be characterised as type transparent (TT) or not, according to the exclusive use of the schema object upon its definition.
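The dominance relation underlying this construction can be computed with the classic iterative data-flow algorithm; post-dominance is the same computation run on the reversed CFG from the exit node. The sketch below is illustrative, not the paper's implementation; the node names follow the CFG of Algorithm 1 (Fig. 1):

```python
def dominators(cfg, entry):
    """Iteratively compute dom(n): node i dominates j if every path
    from entry to j passes through i."""
    nodes = set(cfg) | {s for succs in cfg.values() for s in succs}
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if not preds[n]:
                continue  # unreachable from entry; leave unchanged
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def post_dominators(cfg, exit_node):
    """Post-dominators: dominators computed on the reversed CFG."""
    rev = {n: [] for n in set(cfg) | {s for ss in cfg.values() for s in ss}}
    for n, succs in cfg.items():
        for s in succs:
            rev[s].append(n)  # edge n -> s becomes s -> n
    return dominators(rev, exit_node)

# CFG of Algorithm 1: entry -> S1; S1 -> S2 (loop body) or S3; S2 -> S1.
CFG = {"entry": ["S1"], "S1": ["S2", "S3"], "S2": ["S1"],
       "S3": ["S4"], "S4": ["exit"], "exit": []}
```

For example, S4 is dominated by entry, S1, and S3, and S3 post-dominates S1, because every path from S1 to the exit passes through it.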
To depict these four types of dependencies, namely control, data, SQL, and type transparency, the PDG is constructed based on the CFG. Each PDG node is also characterised by the type of the statement it represents, i.e., there are regular, declaration, predicate, insert, update, delete, select, and select* nodes. The algorithm for the PDG construction has been modified from its original version [23] in order to consider the embedded database statements. More specifically, each database statement corresponds to one PDG node, whereas additional pseudo-nodes are defined for each database object referenced by the program unit. In analogy to the case of the host variables, each pseudo-node is considered to declare/define the relevant database object and is therefore related via a data dependency to each database node that refers to the represented database object. Given the PDGs for each of the DA's program units, the SDG can be constructed by applying the algorithm presented in [24]. As a comprehensive example, Algorithm 3 presents a sample PL/SQL program unit and Fig. 4 shows its corresponding PDG. P1, P2, P3, and P4 are the pseudo-nodes that correspond to the database objects products.price, products, products.id, and products.checked, respectively. All the other nodes are named after the statement number they depict; e.g., S2 represents the declaration statement of line 2: current_price products.price%TYPE;. Each node is characterised by its type; e.g., S2-S4 are declaration nodes, S6 is a regular node, S7 and S13 are database select nodes, S8 is a predicate node, S9 and S11 are database update nodes, and S14 is a database insert node. All the types of dependencies described above are depicted by the PDG.
Hence, the P2 node, which corresponds to the database table products, is data-related to the nodes P1, P3, and P4 that correspond to its fields, while it is also directly SQL-related to all the SQL statements that contain a reference to it, i.e., S7, S9, S11, S13, and S14. The depiction of the fourth dependency type, namely type transparency (TT), requires two distinct sub-types, i.e., TT and non-TT dependency. An example of a TT dependency is the one between S2 and S7, whereas a non-TT dependency is the one between S3 and S6.

Algorithm 3. A sample PL/SQL program unit

1.  PROCEDURE check_price IS
2.    current_price products.price%TYPE;
3.    default_price products.price%TYPE;
4.    max_id INTEGER;
5.  BEGIN
6.    default_price := 0.99;
7.    SELECT price INTO current_price FROM products WHERE id = 100;
8.    IF current_price > default_price THEN
9.      UPDATE products SET checked = FALSE WHERE id = 100;

10.   ELSE
11.     UPDATE products SET checked = TRUE WHERE id = 100;
12.   END IF;
13.   SELECT MAX(id) INTO max_id FROM products;
14.   INSERT INTO products (price, id) VALUES (current_price, max_id + 1);
15. END check_price;

Fig. 4 The extended PDG for Algorithm 3

7 Impact analysis of schema changes on DAs

In this section, we describe how the intended impact analysis is performed in terms of both the affected parts of the DA's source code and the affected test cases. In the former case, the PDG/SDG of the DA is used, based on a statement-level analysis, whereas in the latter case the CFG derived by the semantically equivalent testing method is used, based on a basic block-level analysis. In addition, we introduce coupling metrics as a proactive means towards assessing the maintainability of a DA against a schema change.

7.1 Affected statements

The impact analysis of schema changes on the DA's source code is based on program slicing. Based on the criticality level of each schema change, a rule is derived that describes the affected parts of the DA's code in terms of slicing criteria. In more detail, a classification scheme that includes three criticality levels, from 0 to 2, is employed. Level 0 refers to non-critical changes, level 1 to execution critical changes, and level 2 to compilation critical changes. A non-critical (NC) schema change is defined as a change that does not affect the validity of the underlying DA's source code. An execution critical (EC) schema change is defined as a change that causes an error during the actual execution of the DA's source code. Upon such a modification (e.g., null values are not permitted in a table field), the code referencing the altered schema object, although syntactically valid (i.e., it can successfully be compiled), is not executionally safe and therefore needs to be corrected.
Lastly, a compilation critical (CC) schema change is defined as a change that syntactically invalidates the dependent DA's source code units and therefore requires corrections. Thus, upon such a modification (e.g., a data type change of a table field), all the source code parts referencing the altered schema object are invalid and cannot be compiled successfully. Table 3 presents a synopsis of the slicing criteria for each criticality level. It should be noted that there are cases where schema changes of the same criticality level require different treatment. Such cases are listed under the acronym of the change they refer to, according to the list of Table 2. Slicing is performed by using the enhanced PDG/SDG, which is produced by the DA's program analysis.

Table 3 Slicing rules for the impact analysis of schema changes

Criticality level 0 (non-critical):
  No slicing.

Criticality level 1 (execution critical):
  TFSM: Forward slicing from the pseudo-node of the field, restricted to the database insert, update, and delete nodes, excluding TT dependencies.
  TFA: Forward slicing from the pseudo-node of the relevant table/view, including only database insert and select* nodes.
  TFN: Forward slicing from the pseudo-node of the relevant table/view including database insert and update nodes, and from the pseudo-node of the specific field including database select nodes.
  TCA: Forward slicing from the pseudo-node of the relevant table/view including database insert and delete nodes, and from the pseudo-node of the specific field including database update and select nodes.
  TCD: Forward slicing from the pseudo-node of the relevant table/view including database insert nodes, and from the pseudo-node of the specific field including database select nodes.

Criticality level 2 (compilation critical):
  General rule: Forward slicing from the pseudo-node of the changed schema object, including all types of dependencies.
  TFTM: Forward slicing from the pseudo-node of the field, excluding declaration nodes, nodes that have outbound TT dependencies, and paths starting from nodes that have only inbound TT dependencies.

As an illustrative example, let us consider the sample program unit of Algorithm 3 and apply two different schema changes. The first change is a compilation critical one, in particular the deletion of the table field products.price. Given the PDG of Fig. 4 and the general rule for criticality level 2 changes, the results of forward slicing with respect to the field products.price are shown in Fig. 5 and Algorithm 4. Fig. 5 presents the results in terms of the affected PDG nodes (the light gray labelled nodes are those that were eliminated), and Algorithm 4 shows the affected source code statements. Thus, according to the rule concerning the TFD change, the pseudo-node of the specific field, i.e., P1, is initially identified. Forward slicing, including all dependency types, from P1 to the end of the program results in the nodes S2, S3, S6, S7, S8, S9, S11, and S14. However, taking into consideration that our objective is to trace the unit statements that need to be corrected, the control dependencies can be excluded from the slicing, and hence the nodes S9 and S11 are removed from the result set.

Algorithm 4. The derived code slice after field deletion

1.  PROCEDURE check_price IS
2.    current_price products.price%TYPE;
3.    default_price products.price%TYPE;
5.  BEGIN

6.    default_price := 0.99;
7.    SELECT price INTO current_price FROM products WHERE id = 100;
8.    IF current_price > default_price
14.   INSERT INTO products (price, id) VALUES (current_price, max_id + 1);
15. END check_price;

Fig. 5 The derived PDG slice after field deletion

To demonstrate the usage of TT dependencies, let us also consider a second schema change that concerns the type modification (TFTM) of the table field products.price. Although this change is also compilation critical, it requires a specific slicing rule. Thus, P1 is initially identified as the PDG pseudo node of the field. Forward slicing from P1 to the end of the program results in nodes S2, S3, S6, S7, S8, S9, S11, and S14. Applying the exclusion rule described in Table 3, we obtain the final result set (S6, S7, S8, and S14) shown in Fig. 6, with the respective affected statements shown in Algorithm 5. Similarly to the first case, control dependencies are not considered when applying program slicing.

Algorithm 5. The derived code slice after field type change
1.  PROCEDURE check_price IS
5.  BEGIN
6.    default_price := 0.99;
7.    SELECT price INTO current_price FROM products WHERE id = 100;
8.    IF current_price > default_price
14.   INSERT INTO products (price, id) VALUES (current_price, max_id + 1);
15. END check_price;

Fig. 6 The derived PDG slice after field type change

7.2 Affected test cases

The objective in this case is to identify which test cases of an already derived test suite are affected by a schema change. To achieve this, prioritisation algorithms can be employed, based on the weighted CFG produced by the semantically equivalent method. Such algorithms aim at reducing the number of infeasible paths generated by assigning higher priority to the test paths that are more probably feasible.
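To make the idea concrete, the following sketch ranks candidate test paths by a product of per-arc feasibility weights. It is an illustrative sketch only: the node names, the weight values, and the simple product-of-probabilities scoring are assumptions made here for illustration, not the actual weighting scheme used by the tool.

```python
# Hypothetical sketch of weight-based path prioritisation: each CFG arc
# carries an estimated feasibility probability, a path's priority is the
# product of the probabilities of its arcs, and paths are attempted in
# descending order of that product.
from math import prod

def prioritise(paths, arc_probability):
    """Order test paths from most to least probably feasible."""
    def path_probability(path):
        # Arcs are consecutive node pairs; unknown arcs default to 1.0.
        return prod(arc_probability.get(arc, 1.0)
                    for arc in zip(path, path[1:]))
    return sorted(paths, key=path_probability, reverse=True)

# Toy CFG: after a schema change, the arc entering the database node S2
# is down-weighted, so the path bypassing S2 is tried first.
weights = {("P1", "S2"): 0.1, ("P1", "S7"): 0.9}
ordered = prioritise([["S1", "P1", "S2", "S3"],
                      ["S1", "P1", "S7", "S8"]], weights)
```

Down-weighting the arcs that enter a database node affected by a schema change pushes the paths that avoid the affected statement to the top of the ordering.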
In our case, this probability is estimated by using weights on the arcs of the derived CFG, in analogy to a proposal made by Yates and Malevris [28]. The total number of generated paths can vary according to the test data adequacy criterion that is used. In addition to the criteria definitions that apply to conventional software applications [18], we have defined and used a new adequacy criterion for the purposes of our experiments. The criterion is called the database statement or all-db-nodes coverage criterion and is defined as follows: a set P of CFG paths satisfies the database statement (all-db-nodes) coverage criterion if and only if for every database node n of the CFG there is at least one path p in P such that node n lies on path p. In this definition, the term database node refers to the specific node of the semantically equivalent CFG that represents the database statement (i.e., node S2 of the CFG presented in Fig. 3). Given a set of test cases for a DA and a schema change, we are interested in analysing the impact of this change on the test coverage obtained, i.e., in the test cases that were feasible and became infeasible, and vice versa. To illustrate this type of impact analysis, let us consider the sample program presented in Algorithm 2 and a TFD change concerning the table field country_from. Table 4 describes the impact of this change on the results produced by a test case generation algorithm that uses the semantically equivalent CFG of Fig. 3 and aims at branch testing.

Table 4 Impact analysis of a TFD change on test cases
(for each test case, the table reports its priority and the branch and all-db-nodes coverage percentages, before and after the change)

t1: S1, P1, P2, S2, S3, S4, S6, end (priority 1 before, 2 after)
t2: S1, P1, P2, S2, S3, S5, S6, end
t3: S1, P1, P2, S7, S8, end
t4: S1, P1, S7, S8, end

Initially, the weights of the CFG arcs originating from the first database-specific pseudo-predicate node will be modified according to the change.
This in turn will affect the prioritisation of the test cases, which, as mentioned above, is based on the feasibility probability of each test case. Thus, after the schema change, test case t4 will be given the highest priority, whereas before the change it was given the lowest. This priority change indeed reflects the fact that t4 becomes feasible after the schema change, whereas all the other test cases become infeasible. In terms of the branch coverage obtained, the schema change results in a reduction of the branch coverage percentage from 50% to 25%. In fact, the experimental results in our case study showed that the deviations of the branch coverage percentage do not accurately reflect the effects of schema changes. On the contrary, the new coverage criterion defined above does reflect this impact, which in the example presented above is depicted by a total drop of the all-db-nodes coverage percentage from 100% to 0%.
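The all-db-nodes criterion defined above can be computed directly from the set of feasible paths. The sketch below is illustrative only; the path and node names are hypothetical.

```python
# Sketch of the database statement (all-db-nodes) coverage criterion:
# the fraction of database nodes of the CFG that lie on at least one
# feasible path of the suite.
def all_db_nodes_coverage(paths, db_nodes):
    if not db_nodes:
        return 1.0
    covered = {n for path in paths for n in path if n in db_nodes}
    return len(covered) / len(db_nodes)

# Before the TFD change, the suite exercises the database node S2 ...
before = all_db_nodes_coverage([["S1", "P1", "S2", "S3"]], {"S2"})
# ... afterwards every path through S2 is infeasible and only the path
# bypassing the database statement remains, so coverage drops to zero.
after = all_db_nodes_coverage([["S1", "P1", "S7", "S8"]], {"S2"})
```

A drop from 1.0 to 0.0 here mirrors the 100% to 0% fall reported for the example above, an effect that branch coverage alone would understate.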

7.3 Coupling measurement

Based on the DA's program analysis, an assessment of the coupling between the DA's source code and the database schema is derived. The produced metrics can be used in a proactive method that indicates the impact size of a schema change on the DA's source code. In this sense, these metrics do not aim to evaluate the database or code design, but to guide both the database administrators and the code maintainers through the selection among alternative combinations of schema and code changes, towards the one that provides the safest solution in terms of the DA's operability. In general, coupling can be considered as an internal software quality measure that influences external software quality measures, such as maintainability. Software maintainability is informally defined in [29] as the ease with which the software can be maintained, enhanced, adapted, or corrected to satisfy specified requirements. In this sense, the minimisation of coupling metrics, which were originally suggested by Chidamber and Kemerer [30] in the context of object-oriented systems, promotes the maintainability, reusability, and testability of a class. In particular, coupling between objects (CBO) for a class is defined as the count of the number of other classes to which it is coupled. An object is coupled to another object if one of them acts upon the other, i.e., methods of one use methods or instance variables of the other. In analogy, and within the context of DAs, Table 5 formally defines and describes the coupling between a database application and its schema (CBDAS) and the coupling between a DA's code unit and specific schema objects (CBUSO). In addition, a special coupling metric called data type coupling between unit and schema objects (DTCBUSO) is also defined to cover the case of anchored declarations, i.e., the cases where host language variables are defined by using database schema objects.
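As a toy illustration of these metrics, the sketch below computes them from per-unit reference sets. The unit contents and schema objects are hypothetical; the computation follows the definitions given in Table 5.

```python
# Illustrative computation of the coupling metrics (not the DATA tool).
# A unit is modelled by REF: schema object -> set of referencing
# statements, and ANCHOR: schema object -> set of anchored declarations.
def cbuso(ref):
    """CBUSO(u): statements of u that reference schema objects."""
    return sum(len(stmts) for stmts in ref.values())

def dtcbuso(anchor):
    """DTCBUSO(u): declarations in u anchored to schema objects."""
    return sum(len(decls) for decls in anchor.values())

def cbdas(refs_per_unit):
    """CBDAS(U): the CBUSO total over all units of the DA."""
    return sum(cbuso(ref) for ref in refs_per_unit)

def schema_coverage(refs_per_unit, db_so):
    """Percentage of schema objects referenced by the DA."""
    referenced = set().union(*(ref.keys() for ref in refs_per_unit))
    return 100.0 * len(referenced) / len(db_so)

# Toy DA: two units over a four-object schema.
u1_ref = {"products.price": {2, 7, 14}, "products.id": {7, 14}}
u2_ref = {"orders": {3}}
u1_anchor = {"products.price": {2, 3}}  # anchored declarations of u1
schema = {"products.price", "products.id", "orders", "history"}
```

Here cbuso(u1_ref) is 5, cbdas over both units is 6, and the schema coverage is 75%, since three of the four schema objects are referenced.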
The definitions in Table 5 are made with reference to a DA's code unit u and the following terms:
U: the set of the DA's distinct code units.
DB_SO: the set of distinct database schema objects.
SO_U: the set of distinct schema objects referenced by u.
SO_Ui: the i-th element of SO_U.
REF(u, SO): the set of distinct statements in u that reference the schema object SO.
ANCHOR(u, SO): the set of distinct declaration statements in u where host language variables are declared using data types based on the types of database schema objects (anchored declarations).

It should be noted that the notation |X| in the definitions denotes the number of elements in X. Moreover, according to their definitions, these metrics account for direct coupling. Indirect coupling metrics can be derived by applying the transitive closure of the direct coupling relationship on the unit call graph, i.e., if code unit u1 directly references a database object and code unit u2 calls u1, then u2 indirectly references the same database object. Furthermore, the above mentioned metrics account for import coupling, in the sense that we consider only those cases where the database schema objects are referenced by the DA's statements, and not cases where the DA's source code is possibly referenced by the database schema.

Table 5 DA's coupling metrics

CBUSO(u) = Σ_{i=1}^{|SO_U|} |REF(u, SO_Ui)| : the number of distinct statements of unit u that reference database schema objects.
DTCBUSO(u) = Σ_{i=1}^{|SO_U|} |ANCHOR(u, SO_Ui)| : the number of declarations anchored to database schema objects in unit u.
CBDAS(U) = Σ_{i=1}^{|U|} CBUSO(u_i) : the total number of distinct statements in a DA that reference database schema objects.
SchemaCoverage(U, DB_SO) = |∪_{i=1}^{|U|} SO_{u_i}| / |DB_SO| : the percentage of database schema objects referenced by a DA.

8 DATA software tool
To assess the effectiveness of both the testing methods and the change impact analysis presented above, a software tool called DATA (database application testing and analysis) has been implemented. Its architecture is shown in Fig. 7. It is logically divided into three layers, namely the interface, the service, and the repository layer.

Fig. 7 The architecture of the DATA tool

The interface layer consists of four interfaces: a graphical user interface (GUI), an application programming interface (API), and two file interfaces for importing and exporting files, respectively. Among others, the GUI of DATA allows the users (testers and maintainers) to request change impact analysis for a schema change, test case generation according to specific coverage criteria, and coupling measurement. In terms of implementation, DATA is realised as a web-based application. Its GUI is implemented as a set of Java server pages (JSPs) hosted by an application server that uses Tomcat [31]. The API provides access to the DATA classes so that they can be (re)used by other software tools in an automated way. To provide the same functionality in the case of non-automatic communication, an export file utility is also provided, producing a number of text files that

contain the derived results of the DATA services. The import utility imports data into the system either by loading input files or by directly accessing a database. The former case refers to importing test data sets and the DA's source code files, whereas the latter refers to the retrieval of the database schema and the queries' execution plans. The DATA service layer consists of seven services and one service manager. The service manager acts as a coordinator between the interface and the service layer, translating each request into a series of service calls. Moreover, it can directly serve a request by accessing the repository where all the previous results of DATA are stored. The core services of DATA include the code parser, the graph builder, the test case generator, the metric generator, the impact analyser, the execution plan analyser, and the schema analyser. All the services are implemented as Java classes using the Java SDK 1.5 [32]. The code parser parses the DA's source code files and stores the derived information in the repository. Its grammar is written in JavaCC [33] and currently supports Oracle PL/SQL DAs. Based on the parser's information, the graph builder constructs a number of graphs, such as a simple version of the CFG with data-flow information, extended CFG versions for the implemented testing methods, and the PDG/SDG version required by the introduced impact analysis. The derived versions of the CFG are used by the test case generator to produce the test suites. In addition to these test suites, the user can import test suites from external software tools, either by providing their API to the service manager or by manually using the export/import file utilities to export CFGs and import test cases, respectively.
The metric generator computes the coupling metrics of the DA, and also calculates the code coverage achieved by the test cases and the test data imported by the user. The impact analyser provides an impact analysis for each database schema change input by the maintainer, in terms of both the affected statements/units and the affected test cases/code coverage. The execution plan analyser retrieves the execution plan of each database statement and also provides a translation of this plan into an equivalent part of source code (the current version of DATA supports translation to Oracle PL/SQL and Java). The execution plan is required by the implemented testing methods. Finally, the schema analyser stores a representation of the database schema in the DATA repository and validates it against the DA's source code. The schema can either be retrieved directly from the underlying DBMS or imported as a text file containing all the schema DDL statements, such as the files generated by the most commonly used database tools (e.g., TOAD [34], Oracle Enterprise Manager, etc.). As far as the repository layer is concerned, it consists of a MySQL 4.1 [35] relational database. It contains the results generated by most of the core services and can also be used for retrieving archived data.

9 Experimentation

In this section, the objectives of the conducted experiment are firstly described in terms of the addressed research questions. Then, the design of the experiment is explained, and finally the obtained results are reported and analysed.

9.1 Experiment definition and context

Our objective is to evaluate both the effectiveness and the efficiency of DATA with reference to the impact analysis of database schema changes. In particular, we study the effects of such changes on the source code of the DA and on the relevant test cases. Within this context, this study addresses the five research questions (RQs) introduced in Section 1.
The experiments were conducted with reference to the TPC-C DA. The TPC Benchmark C (TPC-C) is an online transaction processing (OLTP) workload [3]. It is a mixture of read-only and update-intensive transactions that simulate the activities found in complex OLTP application environments. The benchmark portrays a company that is a wholesale supplier with a number of geographically distributed sales districts and associated warehouses. As the company's business expands, new warehouses and associated sales districts are created. All warehouses maintain stocks for the items sold by the company. In summary, the TPC-C DA includes 5 code units, while its schema consists of 9 database tables, 92 table fields, and 1 stored procedure (i.e., 102 schema objects in total). The rationale behind choosing TPC-C for our case study has to do with the high volume of involved database transactions, as well as with the fact that it captures the characteristics of real, complex DAs. Table 6 summarises the size statistics for each of the five units of the TPC-C DA. All units were implemented in Oracle PL/SQL, and their total lines of code (LOC) include comments, which represent an average percentage of 2% per unit. To demonstrate the effect of anchored declarations, one code unit, namely ostat, was implemented using this type of declaration. Database statements include all types of database statements, i.e., even those statements that will not be further expanded by the semantically equivalent method, such as commit, rollback, etc. A final remark concerns the CFG-related rows of the table: bb-cfg is the CFG at the basic-block analysis level, sem-cfg is the CFG derived by the semantically equivalent method at the basic-block analysis level, and stmt-cfg is the CFG for statement-level analysis.
Table 6 Sizes of the TPC-C DA's units

For each unit (neword, payment, ostat, delivery, slev), the table reports: LOC; statements; DB statements; host variables; DB objects referenced; bb-cfg nodes and edges; sem-cfg nodes and edges; stmt-cfg nodes and edges; PDG nodes and edges.

9.2 Experiment design

Apart from RQ1, where the experiments simply included the execution of the TPC-C DA against a set of schema changes, the five code units of TPC-C were initially parsed by the DATA tool. For each of the units, two different analysis levels were selected, namely basic-block and statement-level analysis. For the former level, the normal and the semantically equivalent CFG were requested, whereas for the latter, the extended PDG was requested. For both derived CFGs, test cases were generated to obtain branch coverage. A set of 100 randomly produced schema changes was then input to DATA, and impact analysis was requested in terms of both the affected statements and the affected test cases. Initially, for each change, the tool checked the criticality level of the change according to its local knowledge base. This base was built after the experiments for RQ1, which did not involve DATA. If the change was critical, DATA checked the TPC-C DA's coupling metric with regard to the specific database object that the change referred to. For each code unit with CBUSO greater than zero (using database tables as the granularity level), the impact analysis was triggered; otherwise, a message reporting no impact was returned. The experiments were conducted on an HP Compaq D330 personal computer, serving as the application, web, and database server of the DATA tool. The CPU was an Intel Pentium 4 processor at 2.8 GHz, and the main memory was 1 GB.

9.3 Experimental results

This section presents the experimental results on the TPC-C case study with reference to each of the addressed research questions.

9.3.1 RQ1: Which schema changes are critical for the DA's maintenance?

To address RQ1, every TPC-C unit was executed for each atomic schema change presented in Table 2.
A hypothesis for answering the above question is that a schema change is considered critical for a DA's maintenance if it requires maintenance activities, such as corrections, to its source code [1]. Table 7 provides a detailed description of the types of errors that were produced during the experiments, whereas Table 8 summarises the criticality classification for each atomic schema change; it also presents the correlation between the schema changes and the derived error types.

Table 8 Classification of schema changes for Oracle PL/SQL

Atomic schema changes / Error types / Classification:
SC, TC, TCD, TCR, TTrD: e0 (NC)
TFSM, TFA, TFN, TCA, TTrC, TTrM, TIC, TID, TIM: e3, e4, e5, e6, e7 (EC)
SD, SM, TD, TR, TFTM, TFR, TFD: e1, e2 (CC)

It should be noted that this classification is database specific, and therefore variations between different vendors may occur. For example, the deletion of a stored procedure in PostgreSQL is classified as an execution critical change [1], whereas in the Oracle9i database it is classified as a compilation critical change. Furthermore, some changes can be classified as either compilation or execution critical depending on their particular form or on the specific statement that references them. An example of the former case is the modification of a stored procedure, which, if it concerns an erroneous change of the procedure's body, will cause a runtime error, whereas if it concerns the procedure's signature, it will invalidate its compilation. An example of the latter case is the deletion of a table field, which, if the field is referenced by a select * statement, will cause a runtime error during the assignment of the result to a host variable, whereas if the field is literally referenced by a database statement, it will invalidate the compilation of the relevant code unit. In such cases, the schema change is assigned the maximum criticality level, i.e., in the previous example, both changes are classified as compilation critical.
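The classification of Table 8 lends itself to a simple lookup, in the spirit of the knowledge base that DATA consults before triggering the impact analysis. The sketch below is hypothetical and not the tool's actual code.

```python
# Criticality lookup for Oracle PL/SQL, following Table 8; this mapping
# is database-specific and would differ for other vendors.
NC, EC, CC = 0, 1, 2  # non critical, execution critical, compilation critical

CRITICALITY = {
    **dict.fromkeys(["SC", "TC", "TCD", "TCR", "TTrD"], NC),
    **dict.fromkeys(["TFSM", "TFA", "TFN", "TCA", "TTrC",
                     "TTrM", "TIC", "TID", "TIM"], EC),
    **dict.fromkeys(["SD", "SM", "TD", "TR", "TFTM", "TFR", "TFD"], CC),
}

def needs_impact_analysis(change, cbuso_value):
    """Impact analysis is triggered only for critical changes on schema
    objects to which at least one code unit is coupled (CBUSO > 0)."""
    return CRITICALITY[change] > NC and cbuso_value > 0
```

Combined with the coupling check, this gate is what prevents non-critical changes, or changes to unreferenced schema objects, from triggering any slicing or test-case analysis.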
Table 9 summarises the types of atomic schema changes generated during the experimentation according to their classification, while Fig. 8 graphically depicts their proportional distribution with reference to each schema change category, i.e., NC, EC, and CC.

Table 9 Overview of the generated schema changes according to their classification

NC: SC 4, TC 3, TCD 8, TCR 6 (total 21)
EC: TCA 3, TTrC 2, TIC 4, TFA 9, TFSM 9, TFN 9 (total 36)
CC: SD 1, SM 1, TD 9, TR 8, TFTM 8, TFR 8, TFD 8 (total 43)

Table 7 Description of error types

e0: No error.
e1: SQL statements referring to nonexistent database objects.
e2: Host variables declared/associated to nonexistent database objects.
e3: Insert statement with wrong type/domain values.
e4: Update statement with wrong type/domain values.
e5: Incompatible comparison in the WHERE clause.
e6: Missing field reference in an INSERT statement.
e7: Failed assignment of the fetched results to the host variables.

Fig. 8 Classification of the schema changes for the experimentation

9.3.2 RQ2: Which parts of a DA's source code are affected by a schema change?

In accordance with the impact analysis described in Section 7.1, Fig. 9 graphically depicts (in LOC) the original size of each code unit and the maximum (max), arithmetic mean (µ), and standard deviation (σ) of the slice sizes derived for the EC schema changes.

Fig. 9 Slice sizes derived for EC schema changes

In analogy, Fig. 10 contains the corresponding graph for CC changes. Non critical (NC) changes are not depicted graphically, since they do not trigger the impact analysis and therefore no slice is generated for them. Similarly, the minimum size of the derived slice for critical changes is not depicted in the graphs, since it equalled zero in all cases. The relevant figures of both graphs are summarised in Table 10, which presents, for each type of critical change (i.e., EC and CC), the original size in LOC, the derived slice's statistics (maximum, arithmetic mean, and standard deviation) in LOC, the percentage of the standard deviation of the slice size with regard to the original size, and the result of the analysis, i.e., successful or not.

Fig. 10 Slice sizes derived for CC schema changes

Table 10 Affected source code (PDG)

For each unit (neword, payment, ostat, delivery, slev) and each type of critical change (EC, CC), the table reports the LOC, max, µ, σ, the percentage, and the success of the analysis; the analysis was successful (Y) in all ten cases.

The main conclusion is that for all tested changes the impact was successfully assessed, while at the same time the resulting code slices were significantly smaller than the original program. The impact estimation was regarded as successful if the code slice contained all the really affected lines. Thus, although in some cases the impact analysis overestimated, due to the rather conservative slicing rules, the final result was successful.
An example of such cases is TCAs, where, according to the slicing rule, a select node was included in the slice if it contained a reference to the table field of the TCA. An improvement would be to refine the rule by investigating whether the table field is involved in a logical condition of the where clause. It should also be noted that a working assumption was made, according to which data flow information for units called internally by the analysed unit was supposed to be available. Finally, another important conclusion is that the fact that only data, SQL, and TT dependency types were considered during slicing, in order to achieve smaller slices, did not result in any loss of accuracy.

9.3.3 RQ3 and RQ4: Which of a DA's test cases are affected by a schema change? How does a schema change affect the code coverage achieved by the same set of test cases?

To address both questions, which are complementary, the analysis described in Section 7.2 was applied. Table 11 summarises the experimental results for the test cases generated for the TPC-C units. For each code unit, the average percentage figures obtained for branch coverage and database-statement coverage before and after a schema change are reported, respectively. The results are reported against the same set of test cases and test data. The graphical representation of the percentage decrease in both branch and SQL coverage caused by critical schema changes is shown in Fig. 11. The conclusion is that, depending on their specific type, schema changes may affect those test cases that are relevant to the modified schema object. In terms of coverage metrics, the introduced database-statement coverage metric proved to represent the impact degree more accurately than the branch coverage metric. Although this conclusion is indicated, but cannot be proven, by Table 11 and Fig.
11, which present aggregated information referring to average figures, a number of experiments showed that the branch coverage percentage remained the same before and after a schema change, although different sets of test cases turned from feasible to infeasible, and vice versa. In those cases, the branch coverage metric failed to reflect the impact. On the contrary, the impact was successfully depicted by the database-statement coverage, which, in the same set of experiments, differed before and after the change, highlighting the impact. Compared to the impact analysis based on slicing, one difference that becomes evident concerns those changes referring to database constraints, such as TCAs, where the coverage of both metrics is correctly not affected. The impact analysis in these cases is rather a question of finding which test data will cause different execution traces before and after the change than a matter of test case feasibility.

Table 11 Affected test cases and coverage

For each unit (neword, payment, ostat, delivery, slev) and change type (EC, CC), the table reports the branch coverage and the SQL coverage percentages before and after the schema change.

Fig. 12 uses three-dimensional bars; the horizontal axis refers to the DA's code units, the depth axis to the schema objects, and the vertical axis to the schema coverage. Schema objects are divided into four groups: tables, fields, stored procedures, and the entire set of schema objects. Code units are divided into six groups: five groups for each code unit separately, and one group for the DA as a whole. The schema coverage metric provides a rough indication of whether or not it is possible for a change on each group of schema objects to affect the DA's source code. Thus, a change on a database table is certain to affect the DA, since all the database tables are referenced overall, i.e., there is 100% schema coverage of the schema tables by all the DA's units. In analogy, the fact that the tables coverage with reference to slev is 33% means that only changes to specific tables will affect this code unit. In general, the lower the schema coverage for a code unit, the safer a change becomes in terms of the unit's operability. Table 12 presents the CBUSO metrics derived for the TPC-C DA for two specific types of schema objects, namely tables and stored procedures, resulting in 50 (= 5 × 10) coupled pairs. Table 13 summarises some descriptive statistics, i.e., minimum, maximum, arithmetic mean (µ), and standard deviation (σ), for the coupling metrics referring to the unit level, and the values for the application level metrics.

Table 12 The CBUSO for TPC-C tables and stored procedures

Fig. 11 Decrease in branch and SQL coverage percentage due to schema changes

9.3.4 RQ5: Is it possible to define some criteria for choosing between alternative sets of schema changes in order to minimise the impact on the code of the dependent DAs?
To effectively answer this question, we shall use the metrics introduced in Section 7.3. Fig. 12 illustrates the results for the DA's schema coverage, i.e., what percentage of schema objects is referenced by the DA's source code. The columns of Table 12 are the code units (neword, payment, ostat, delivery, slev), and its rows are the TPC-C schema objects: the tables District, Warehouse, Orders, New order, Order line, History, Customer, Stock, and Item, and the stored procedure Gettimestamp.

Table 13 Values and statistics for the metrics

For each metric (CBUSO, DTCBUSO, SchemaCoverage %, CBDAS), the table reports the min, max, µ, and σ at the unit level, and the value at the application level.

Fig. 12 Schema coverage percentages for the TPC-C

Figs. 13 and 14 show how the CBUSO reflects the impact of a schema change in terms of code slices and test cases, respectively. In Fig. 13, the left vertical axis contains the standard deviation (σ) of the slice sizes (measured in LOC) of the affected unit code, the right vertical axis shows the average CBUSO value with respect to the modified schema table, and the horizontal axis shows the code units and the types of schema changes. Similarly, Fig. 14 depicts the database-statement coverage decrease on its left vertical axis. In both graphs, the bars represent the impact and the lines represent the average CBUSO values.

Fig. 13 Affected slices and coupling metrics for schema changes

Fig. 14 Affected SQL coverage and coupling metrics for schema changes

Following the experiments conducted, a first conclusion is that if there is no coupling (i.e., the CBUSO equals zero), there will be no effect. An additional conclusion for the slicing case is that higher coupling corresponds to greater impact within the same code unit. This does not hold at the level of different units, where the same coupling may result in different impacts. This, however, is reasonable, since the referenced source code in these cases is not the same and therefore cannot serve as a common comparison basis.

9.4 Cost

The cost of DATA's execution per code unit is shown in Table 14. The results are grouped based on the analysis level obtained, i.e., there are two groups, referring to the basic-block and statement analysis levels, respectively. Each group is further divided according to the execution phases of DATA.

Table 14 Cost of impact analysis (s)

For each unit (neword, payment, ostat, delivery, slev), the basic-block group reports the cost of the Initialisation, SemCFG, and TestCases phases, and the statement group reports the cost of the Initialisation, PDG, and Slicing phases.

The initialisation phase in both groups includes parsing, control, and data flow analysis. The SemCFG phase refers to the construction of the CFG by the semantically equivalent method. The TestCases phase includes test case generation for each schema change. The PDG phase refers to the PDG construction, whereas Slicing regards the execution of the slicing algorithm for each change. TestCases and Slicing are triggered by each critical schema change, provided that at least one code unit is coupled to the modified schema object. In these cases, the reported time is the average execution time over the set of changes tested. The same information is graphically demonstrated in Fig.
15, and confirms that the cost of estimating the impact of schema changes is relatively small in comparison to the DA's program analysis and graph construction. Given that the latter processes are executed only once, as part of program analysis, there is a significant time saving when performing repetitive tasks, such as the impact analysis of different schema changes. Furthermore, the information gained during the analysis phase can also be used for performing other software engineering activities, such as testing, with minimum overhead.

Fig. 15 DATA's execution time cost for the two levels of analysis: (a) basic blocks; (b) statements

9.5 Limitations

The main threats to the generality of the present study are the investigation of a DA with relatively small code units and the simplicity of its architecture. To evaluate the significance of the coupling and schema coverage metrics, larger-scale experiments involving a number of DAs should be conducted. As far as the complexity of the architecture is concerned, it is not obvious how the experimental results obtained on a one-tier DA can be generalised to the context of distributed multi-tier DAs. In that direction, we are currently working on extending our approach to complex web-based DAs by utilising agent technology, as suggested in our recent work [36]. Concerning the internal validity of the presented study, the main threat is the limitations imposed by the current implementation of DATA: the tool does not consider dynamically generated database statements, it supports one schema per DA, and it does not treat the case of database trigger modifications.

10 Conclusions

In this paper, we presented a two-folded impact analysis of schema changes on database applications; both the affected source code statements and the affected test suites are reported. The code analysis is founded on a static program slicing algorithm that employs a DA-specific version of the PDG. The impact analysis of test suites is based on the use of a structural test generation technique tailored for DAs. Furthermore, the additional information derived during the program analysis of the DAs allowed for the development of a number of practical metrics; coupling metrics can thus be computed towards the assessment of DA maintenance, and test adequacy criteria can be defined to measure database-specific code coverage.
The rationale behind developing this approach is to provide the software engineers of DAs with a systematic and automated solution that facilitates major maintenance tasks, such as source code corrections and regression testing, which are triggered by the occurrence of schema changes. A software tool called DATA that implements the aforementioned techniques and metrics was used in a case study that is based on the TPC-C benchmark [3]. The experimental results confirmed the effectiveness of the presented approach with regard to its intended scope and derived a number of salient conclusions: 1) There are different criticality levels of the impact of a schema change on the operability of the dependent DAs. 2) The employment of program slicing algorithms, which use the suggested extended PDG version, proved to successfully address DAs peculiarities resulting in a relatively precise impact analysis. 3) The deviation that schema changes cause in the values of a simple database-specific metric for code coverage is indeed representative of the impact size on the corresponding test suite. 4) The suggested coupling measurement provides a simple and practical way towards the assessment of the DAs code maintainability, which can guide the database administrator s decision among alternative sets of schema changes in order to minimise the impact on the code of the dependent DA. 5) The methods presented in the context of this paper can efficiently be implemented and provide a generic solution that is reusable, extensible, and applicable in a wide range of software engineering tasks, such as testing and maintenance. Acknowledgements The authors gratefully acknowledge the useful comments and suggestions made by the referees that helped us produce the final improved form of the paper. References [1] S. K. Gardikiotis, N. Malevris, T. Konstantinou. A Structural Approach towards the Maintenance of Database Applications. 
In Proceedings of the 8th International Database Engineering and Applications Symposium, IEEE Computer Society, Coimbra, Portugal, pp , [2] ISO/IEC 9075 part 1-14 Information technology Database languages - SQL, [3] TPC, [Online], Available: October 15, [4] X. Li. A Survey of Schema Evolution in Object-oriented Databases. In Proceedings of IEEE International Conference on Technology of Object-oriented Languages and Systems, Melbourne, Australia, pp , [5] J. F. Roddick. A Survey of Schema Versioning Issues for Database Systems. Information and Software Technology, vol. 37, no. 7, pp , [6] S. K. Gardikiotis, N. Malevris. DaSIAn: A Tool for Estimating the Impact of Database Schema Modifications on WEB Applications. In Proceedings of the 4th ACS/IEEE International Conference on Computer Systems and Applications, IEEE Computer Society, Sharjah/Dubai, UAE, pp , [7] D. Chays, Y. Deng, P. G. Frankl, S. Dan, F. I. Vokolos, E. J. Weyuker. An AGENDA for Testing Relational Database Applications. Software Testing, Verification and Reliability, vol. 14, no. 1, pp , [8] J. Zhang, C. Xu, S. C. Cheung. Automatic Generation of Database Instances for White-box Testing. In Proceedings of the 25th Annual International Computer Software and Applications Conference, IEEE Press, Chicago, IL, USA, pp , [9] M. Y. Chan, S. C. Cheung. Testing Database Applications with SQL Semantics. In Proceedings of the 2nd International Symposium on Cooperative Database Systems for Advanced Applications, Springer, Wollongong, Australia, pp , [10] W. K. Chan, S. C. Cheung, T. H. Tse. Fault-based Testing of Database Application Programs with Conceptual Data Model. In Proceedings of the 5th International Conference on Quality Software, IEEE Computer Society, Melbourne, Australia, pp , [11] R. A. Haraty, N. Mansour, B. Daou. Regression Testing of Database Applications. In Proceedings of the 16th Symposium on Applied Computing, ACM, Las Vegas, USA, pp , 2001.

15 S. K. Gardikiotis and N. Malevris / A Two-folded Impact Analysis of Schema Changes on Database Applications 123 [12] F. Haftmann, D. Kossmann, E. Lo. A Framework for Efficient Regression Tests on Database Applications. The VLDB Journal, vol. 16, no. 1, pp , [13] V. Englebert, J. L. Hainaut. DB-MAIN: A Next Generation Meta-CASE. Journal of Information Systems, vol. 24, no. 2, pp , [14] D. Yeh, Y. Li. Extracting Entity Relationship Diagram from a Table-based Legacy Database. In Proceedings of the 9th European Conference on Software Maintenance and Reengineering, IEEE Computer Society, Manchester, UK, pp , [15] J. Henrard, J. L. Hainaut. Data Dependency Elicitation in Database Reverse Engineering. In Proceedings of the 5th European Conference on Software Maintenance and Reengineering, IEEE Press, Lisbon, Portugal, pp , [16] A. Orso, T. Apiwattanapong, M. J. Harrold. Leveraging Field Data for Impact Analysis and Regression Testing. ACM SIGSOFT Software Engineering Notes, vol. 28, no. 5, pp , [17] L. C. Briand, J. Wuest, H. Lounis. Using Coupling Measurement for Impact Analysis in Object-oriented Systems. In Proceedings of the 15th IEEE International Conference on Software Maintenance, IEEE Computer Society, Oxford, UK, pp , [18] H. Zhu, P. A. V. Hall, J. H. R. May. Software Unit Test Coverage and Adequacy. ACM Computing Surveys, vol. 29, no. 4, pp , [19] A. V. Aho, R. Sethi, J. D. Ullman. Compilers: Principles, Techniques and Tools, Addison Wesley, [20] M. Weiser. Program Slicing. In Proceedings of the 5th International Conference on Software Engineering, IEEE Press, San Diego, California, USA, pp , [21] F. Tip. A Survey of Program Slicing Techniques. Journal of Programming Languages, vol. 3, no. 3, pp , [22] B. Xu, J. Qian, X. Zhang, Z. Wu, L. Chen. A Brief Survey of Program Slicing. SIGSOFT Software Engineering Notes, vol. 30, no. 2, pp. 1 36, [23] K. J. Ottenstein, L. M. Ottenstein. The Program Dependence Graph in a Software Development Environment. 
In Proceedings of the 1st Software Engineering Symposium on Practical Software Development Environments, ACM, Pittsburgh, Pennsylvania, pp , [24] S. Horwitz, T. Reps, D. Binkley. Interprocedural Slicing Using Dependence Graphs. ACM Transactions on Programming Languages and Systems, vol. 12, no. 1, pp , [25] Oracle, [Online], Available: October 15, [26] S. K. Gardikiotis, N. Malevris. Program Analysis and Testing of Database Applications. In Proceedings of the 5th IEEE/ACIS International Conference on Computer and Information Science, IEEE Computer Society, Honolulu, Hawaii, USA, pp , [27] E. Siougle. Constructing a Software System for Program Testing, Master dissertation, Athens University of Economics and Business, Athens, Greek, [28] D. Yates, N. Malevris. Reducing the Effects of Infeasible Paths in Branch Testing. ACM SIGSOFT Software Engineering Notes, vol. 14, no. 8, pp , [29] IEEE Std (R2002), IEEE Standard Glossary of Software Engineering Terminology, IEEE, [30] S. R. Chidamber, C. F. Kemerer. A Metrics Suite for Object Oriented Design. IEEE Transactions on Software Engineering, vol. 20, no. 6, pp , [31] Apache Software Foundation, [Online], Available: October 15, [32] The Source for Java Developers, [Online], Available: October 15, [33] The Source for Java Technology Collaboration, [Online], Available: October 15, [34] ToadSoft, [Online], Available: October 15, [35] MySQL, [Online], Available: October 15, [36] S. K. Gardikiotis, V. S. Lazarou, N. Malevris. An Agentbased Approach for the Maintenance of Database Applications. In Proceedings of the 5th International Conference on Software Engineering Research, Management and Applications, IEEE Computer Society, Busan, Korea, pp , Spyridon K. Gardikiotis received the B. Sc. degree in informatics from the Athens University of Economics and Business, Greece, in 1997 and the M. Sc. 
degree in advanced computing from the Imperial College of Science, Technology and Medicine, London, UK, in He currently works as an IT expert at the Central Bank of Greece and he is a Ph. D. candidate in informatics at the Athens University of Economics and Business. His research interests include software engineering of web/database applications, enterprise software architecture, and IT projects governance. Nicos Malevris received the B. Sc. degree in mathematics from the University of Athens, Greece in 1982, the M. Sc. degree in operational research from the University of Southampton, UK, in 1984 and the Ph. D. degree in computer science from the University of Liverpool, UK, in He has been with the Department of Informatics at the Athens University of Economics and Business since 1991 where he is an associate professor. His research interests include software quality assurance, software testing, and software reliability.

1 File Processing Systems

1 File Processing Systems COMP 378 Database Systems Notes for Chapter 1 of Database System Concepts Introduction A database management system (DBMS) is a collection of data and an integrated set of programs that access that data.

More information

Keywords: Regression testing, database applications, and impact analysis. Abstract. 1 Introduction

Keywords: Regression testing, database applications, and impact analysis. Abstract. 1 Introduction Regression Testing of Database Applications Bassel Daou, Ramzi A. Haraty, Nash at Mansour Lebanese American University P.O. Box 13-5053 Beirut, Lebanon Email: rharaty, nmansour@lau.edu.lb Keywords: Regression

More information

Chapter 2 Database System Concepts and Architecture

Chapter 2 Database System Concepts and Architecture Chapter 2 Database System Concepts and Architecture Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 2 Outline Data Models, Schemas, and Instances Three-Schema Architecture

More information

Oracle Database: SQL and PL/SQL Fundamentals

Oracle Database: SQL and PL/SQL Fundamentals Oracle University Contact Us: 1.800.529.0165 Oracle Database: SQL and PL/SQL Fundamentals Duration: 5 Days What you will learn This course is designed to deliver the fundamentals of SQL and PL/SQL along

More information

What is a database? COSC 304 Introduction to Database Systems. Database Introduction. Example Problem. Databases in the Real-World

What is a database? COSC 304 Introduction to Database Systems. Database Introduction. Example Problem. Databases in the Real-World COSC 304 Introduction to Systems Introduction Dr. Ramon Lawrence University of British Columbia Okanagan ramon.lawrence@ubc.ca What is a database? A database is a collection of logically related data for

More information

Oracle Database: SQL and PL/SQL Fundamentals NEW

Oracle Database: SQL and PL/SQL Fundamentals NEW Oracle University Contact Us: + 38516306373 Oracle Database: SQL and PL/SQL Fundamentals NEW Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals training delivers the

More information

Database Programming with PL/SQL: Learning Objectives

Database Programming with PL/SQL: Learning Objectives Database Programming with PL/SQL: Learning Objectives This course covers PL/SQL, a procedural language extension to SQL. Through an innovative project-based approach, students learn procedural logic constructs

More information

Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Active database systems. Triggers. Triggers. Active database systems.

Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Active database systems. Triggers. Triggers. Active database systems. Active database systems Database Management Systems Traditional DBMS operation is passive Queries and updates are explicitly requested by users The knowledge of processes operating on data is typically

More information

Oracle Database 10g: Introduction to SQL

Oracle Database 10g: Introduction to SQL Oracle University Contact Us: 1.800.529.0165 Oracle Database 10g: Introduction to SQL Duration: 5 Days What you will learn This course offers students an introduction to Oracle Database 10g database technology.

More information

Testing of the data access layer and the database itself

Testing of the data access layer and the database itself Testing of the data access layer and the database itself Vineta Arnicane and Guntis Arnicans University of Latvia TAPOST 2015, 08.10.2015 1 Prolog Vineta Arnicane, Guntis Arnicans, Girts Karnitis DigiBrowser

More information

Component visualization methods for large legacy software in C/C++

Component visualization methods for large legacy software in C/C++ Annales Mathematicae et Informaticae 44 (2015) pp. 23 33 http://ami.ektf.hu Component visualization methods for large legacy software in C/C++ Máté Cserép a, Dániel Krupp b a Eötvös Loránd University mcserep@caesar.elte.hu

More information

1. INTRODUCTION TO RDBMS

1. INTRODUCTION TO RDBMS Oracle For Beginners Page: 1 1. INTRODUCTION TO RDBMS What is DBMS? Data Models Relational database management system (RDBMS) Relational Algebra Structured query language (SQL) What Is DBMS? Data is one

More information

Oracle Database: SQL and PL/SQL Fundamentals

Oracle Database: SQL and PL/SQL Fundamentals Oracle University Contact Us: +966 12 739 894 Oracle Database: SQL and PL/SQL Fundamentals Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals training is designed to

More information

Maintaining Stored Procedures in Database Application

Maintaining Stored Procedures in Database Application Maintaining Stored Procedures in Database Application Santosh Kakade 1, Rohan Thakare 2, Bhushan Sapare 3, Dr. B.B. Meshram 4 Computer Department VJTI, Mumbai 1,2,3. Head of Computer Department VJTI, Mumbai

More information

DATABASE SYSTEM CONCEPTS AND ARCHITECTURE CHAPTER 2

DATABASE SYSTEM CONCEPTS AND ARCHITECTURE CHAPTER 2 1 DATABASE SYSTEM CONCEPTS AND ARCHITECTURE CHAPTER 2 2 LECTURE OUTLINE Data Models Three-Schema Architecture and Data Independence Database Languages and Interfaces The Database System Environment DBMS

More information

SQLMutation: A tool to generate mutants of SQL database queries

SQLMutation: A tool to generate mutants of SQL database queries SQLMutation: A tool to generate mutants of SQL database queries Javier Tuya, Mª José Suárez-Cabal, Claudio de la Riva University of Oviedo (SPAIN) {tuya cabal claudio} @ uniovi.es Abstract We present a

More information

Database Management System Choices. Introduction To Database Systems CSE 373 Spring 2013

Database Management System Choices. Introduction To Database Systems CSE 373 Spring 2013 Database Management System Choices Introduction To Database Systems CSE 373 Spring 2013 Outline Introduction PostgreSQL MySQL Microsoft SQL Server Choosing A DBMS NoSQL Introduction There a lot of options

More information

www.gr8ambitionz.com

www.gr8ambitionz.com Data Base Management Systems (DBMS) Study Material (Objective Type questions with Answers) Shared by Akhil Arora Powered by www. your A to Z competitive exam guide Database Objective type questions Q.1

More information

Demystified CONTENTS Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals CHAPTER 2 Exploring Relational Database Components

Demystified CONTENTS Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals CHAPTER 2 Exploring Relational Database Components Acknowledgments xvii Introduction xix CHAPTER 1 Database Fundamentals 1 Properties of a Database 1 The Database Management System (DBMS) 2 Layers of Data Abstraction 3 Physical Data Independence 5 Logical

More information

Oracle SQL. Course Summary. Duration. Objectives

Oracle SQL. Course Summary. Duration. Objectives Oracle SQL Course Summary Identify the major structural components of the Oracle Database 11g Create reports of aggregated data Write SELECT statements that include queries Retrieve row and column data

More information

ORACLE 9I / 10G / 11G / PL/SQL COURSE CONTENT

ORACLE 9I / 10G / 11G / PL/SQL COURSE CONTENT ORACLE 9I / 10G / 11G / PL/SQL COURSE CONTENT INTRODUCTION: Course Objectives I-2 About PL/SQL I-3 PL/SQL Environment I-4 Benefits of PL/SQL I-5 Benefits of Subprograms I-10 Invoking Stored Procedures

More information

Duration Vendor Audience 5 Days Oracle End Users, Developers, Technical Consultants and Support Staff

Duration Vendor Audience 5 Days Oracle End Users, Developers, Technical Consultants and Support Staff D80198GC10 Oracle Database 12c SQL and Fundamentals Summary Duration Vendor Audience 5 Days Oracle End Users, Developers, Technical Consultants and Support Staff Level Professional Delivery Method Instructor-led

More information

Chapter 5. Regression Testing of Web-Components

Chapter 5. Regression Testing of Web-Components Chapter 5 Regression Testing of Web-Components With emergence of services and information over the internet and intranet, Web sites have become complex. Web components and their underlying parts are evolving

More information

Instant SQL Programming

Instant SQL Programming Instant SQL Programming Joe Celko Wrox Press Ltd. INSTANT Table of Contents Introduction 1 What Can SQL Do for Me? 2 Who Should Use This Book? 2 How To Use This Book 3 What You Should Know 3 Conventions

More information

Oracle Database: SQL and PL/SQL Fundamentals NEW

Oracle Database: SQL and PL/SQL Fundamentals NEW Oracle University Contact Us: 001-855-844-3881 & 001-800-514-06-97 Oracle Database: SQL and PL/SQL Fundamentals NEW Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals

More information

Duration Vendor Audience 5 Days Oracle Developers, Technical Consultants, Database Administrators and System Analysts

Duration Vendor Audience 5 Days Oracle Developers, Technical Consultants, Database Administrators and System Analysts D80186GC10 Oracle Database: Program with Summary Duration Vendor Audience 5 Days Oracle Developers, Technical Consultants, Database Administrators and System Analysts Level Professional Technology Oracle

More information

Data Modeling Basics

Data Modeling Basics Information Technology Standard Commonwealth of Pennsylvania Governor's Office of Administration/Office for Information Technology STD Number: STD-INF003B STD Title: Data Modeling Basics Issued by: Deputy

More information

Chapter 1: Introduction

Chapter 1: Introduction Chapter 1: Introduction Database System Concepts, 5th Ed. See www.db book.com for conditions on re use Chapter 1: Introduction Purpose of Database Systems View of Data Database Languages Relational Databases

More information

Basic Unix/Linux 1. Software Testing Interview Prep

Basic Unix/Linux 1. Software Testing Interview Prep Basic Unix/Linux 1 Programming Fundamentals and Concepts 2 1. What is the difference between web application and client server application? Client server application is designed typically to work in a

More information

Introduction to Triggers using SQL

Introduction to Triggers using SQL Introduction to Triggers using SQL Kristian Torp Department of Computer Science Aalborg University www.cs.aau.dk/ torp torp@cs.aau.dk November 24, 2011 daisy.aau.dk Kristian Torp (Aalborg University) Introduction

More information

DBMS Questions. 3.) For which two constraints are indexes created when the constraint is added?

DBMS Questions. 3.) For which two constraints are indexes created when the constraint is added? DBMS Questions 1.) Which type of file is part of the Oracle database? A.) B.) C.) D.) Control file Password file Parameter files Archived log files 2.) Which statements are use to UNLOCK the user? A.)

More information

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc. Oracle BI EE Implementation on Netezza Prepared by SureShot Strategies, Inc. The goal of this paper is to give an insight to Netezza architecture and implementation experience to strategize Oracle BI EE

More information

A Framework for Developing the Web-based Data Integration Tool for Web-Oriented Data Warehousing

A Framework for Developing the Web-based Data Integration Tool for Web-Oriented Data Warehousing A Framework for Developing the Web-based Integration Tool for Web-Oriented Warehousing PATRAVADEE VONGSUMEDH School of Science and Technology Bangkok University Rama IV road, Klong-Toey, BKK, 10110, THAILAND

More information

DBGEN- Database (Test) GENerator - An Automated Framework for Database Application Testing

DBGEN- Database (Test) GENerator - An Automated Framework for Database Application Testing DBGEN- Database (Test) GENerator - An Automated Framework for Database Application Testing 1 Askarunisa A., 2 Prameela P, and 3 Dr. Ramraj N 1,2 Thiagarajar College of Engineering, Madurai, Tamilnadu,

More information

Postgres Plus xdb Replication Server with Multi-Master User s Guide

Postgres Plus xdb Replication Server with Multi-Master User s Guide Postgres Plus xdb Replication Server with Multi-Master User s Guide Postgres Plus xdb Replication Server with Multi-Master build 57 August 22, 2012 , Version 5.0 by EnterpriseDB Corporation Copyright 2012

More information

Chapter 4 Software Lifecycle and Performance Analysis

Chapter 4 Software Lifecycle and Performance Analysis Chapter 4 Software Lifecycle and Performance Analysis This chapter is aimed at illustrating performance modeling and analysis issues within the software lifecycle. After having introduced software and

More information

CHAPTER 1: CLIENT/SERVER INTEGRATED DEVELOPMENT ENVIRONMENT (C/SIDE)

CHAPTER 1: CLIENT/SERVER INTEGRATED DEVELOPMENT ENVIRONMENT (C/SIDE) Chapter 1: Client/Server Integrated Development Environment (C/SIDE) CHAPTER 1: CLIENT/SERVER INTEGRATED DEVELOPMENT ENVIRONMENT (C/SIDE) Objectives Introduction The objectives are: Discuss Basic Objects

More information

C2C: An Automated Deployment Framework for Distributed Applications on Multi-Clouds

C2C: An Automated Deployment Framework for Distributed Applications on Multi-Clouds C2C: An Automated Deployment Framework for Distributed Applications on Multi-Clouds Flora Karniavoura, Antonis Papaioannou, and Kostas Magoutis Institute of Computer Science (ICS) Foundation for Research

More information

D61830GC30. MySQL for Developers. Summary. Introduction. Prerequisites. At Course completion After completing this course, students will be able to:

D61830GC30. MySQL for Developers. Summary. Introduction. Prerequisites. At Course completion After completing this course, students will be able to: D61830GC30 for Developers Summary Duration Vendor Audience 5 Days Oracle Database Administrators, Developers, Web Administrators Level Technology Professional Oracle 5.6 Delivery Method Instructor-led

More information

Oracle Database 10g: Program with PL/SQL

Oracle Database 10g: Program with PL/SQL Oracle University Contact Us: Local: 1800 425 8877 Intl: +91 80 4108 4700 Oracle Database 10g: Program with PL/SQL Duration: 5 Days What you will learn This course introduces students to PL/SQL and helps

More information

The Import & Export of Data from a Database

The Import & Export of Data from a Database The Import & Export of Data from a Database Introduction The aim of these notes is to investigate a conceptually simple model for importing and exporting data into and out of an object-relational database,

More information

Chapter 1: Introduction. Database Management System (DBMS) University Database Example

Chapter 1: Introduction. Database Management System (DBMS) University Database Example This image cannot currently be displayed. Chapter 1: Introduction Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Database Management System (DBMS) DBMS contains information

More information

DBMS / Business Intelligence, SQL Server

DBMS / Business Intelligence, SQL Server DBMS / Business Intelligence, SQL Server Orsys, with 30 years of experience, is providing high quality, independant State of the Art seminars and hands-on courses corresponding to the needs of IT professionals.

More information

HP Quality Center. Upgrade Preparation Guide

HP Quality Center. Upgrade Preparation Guide HP Quality Center Upgrade Preparation Guide Document Release Date: November 2008 Software Release Date: November 2008 Legal Notices Warranty The only warranties for HP products and services are set forth

More information

Evaluation of Sub Query Performance in SQL Server

Evaluation of Sub Query Performance in SQL Server EPJ Web of Conferences 68, 033 (2014 DOI: 10.1051/ epjconf/ 201468033 C Owned by the authors, published by EDP Sciences, 2014 Evaluation of Sub Query Performance in SQL Server Tanty Oktavia 1, Surya Sujarwo

More information

Introduction: Database management system

Introduction: Database management system Introduction Databases vs. files Basic concepts Brief history of databases Architectures & languages Introduction: Database management system User / Programmer Database System Application program Software

More information

Programa de Actualización Profesional ACTI Oracle Database 11g: SQL Tuning Workshop

Programa de Actualización Profesional ACTI Oracle Database 11g: SQL Tuning Workshop Programa de Actualización Profesional ACTI Oracle Database 11g: SQL Tuning Workshop What you will learn This Oracle Database 11g SQL Tuning Workshop training is a DBA-centric course that teaches you how

More information

Data Discovery & Documentation PROCEDURE

Data Discovery & Documentation PROCEDURE Data Discovery & Documentation PROCEDURE Document Version: 1.0 Date of Issue: June 28, 2013 Table of Contents 1. Introduction... 3 1.1 Purpose... 3 1.2 Scope... 3 2. Option 1: Current Process No metadata

More information

IT2305 Database Systems I (Compulsory)

IT2305 Database Systems I (Compulsory) Database Systems I (Compulsory) INTRODUCTION This is one of the 4 modules designed for Semester 2 of Bachelor of Information Technology Degree program. CREDITS: 04 LEARNING OUTCOMES On completion of this

More information

2. Basic Relational Data Model

2. Basic Relational Data Model 2. Basic Relational Data Model 2.1 Introduction Basic concepts of information models, their realisation in databases comprising data objects and object relationships, and their management by DBMS s that

More information

What Is Specific in Load Testing?

What Is Specific in Load Testing? What Is Specific in Load Testing? Testing of multi-user applications under realistic and stress loads is really the only way to ensure appropriate performance and reliability in production. Load testing

More information

Programming Database lectures for mathema

Programming Database lectures for mathema Programming Database lectures for mathematics students April 25, 2015 Functions Functions are defined in Postgres with CREATE FUNCTION name(parameter type,...) RETURNS result-type AS $$ function-body $$

More information

Oracle Warehouse Builder 10g

Oracle Warehouse Builder 10g Oracle Warehouse Builder 10g Architectural White paper February 2004 Table of contents INTRODUCTION... 3 OVERVIEW... 4 THE DESIGN COMPONENT... 4 THE RUNTIME COMPONENT... 5 THE DESIGN ARCHITECTURE... 6

More information

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX ISSN: 2393-8528 Contents lists available at www.ijicse.in International Journal of Innovative Computer Science & Engineering Volume 3 Issue 2; March-April-2016; Page No. 09-13 A Comparison of Database

More information

Files. Files. Files. Files. Files. File Organisation. What s it all about? What s in a file?

Files. Files. Files. Files. Files. File Organisation. What s it all about? What s in a file? Files What s it all about? Information being stored about anything important to the business/individual keeping the files. The simple concepts used in the operation of manual files are often a good guide

More information

Geodatabase Programming with SQL

Geodatabase Programming with SQL DevSummit DC February 11, 2015 Washington, DC Geodatabase Programming with SQL Craig Gillgrass Assumptions Basic knowledge of SQL and relational databases Basic knowledge of the Geodatabase We ll hold

More information

Introduction. Introduction: Database management system. Introduction: DBS concepts & architecture. Introduction: DBS versus File system

Introduction. Introduction: Database management system. Introduction: DBS concepts & architecture. Introduction: DBS versus File system Introduction: management system Introduction s vs. files Basic concepts Brief history of databases Architectures & languages System User / Programmer Application program Software to process queries Software

More information

Revolutionized DB2 Test Data Management

Revolutionized DB2 Test Data Management Revolutionized DB2 Test Data Management TestBase's Patented Slice Feature Provides a Fresh Solution to an Old Set of DB2 Application Testing Problems The challenge in creating realistic representative

More information

Index Selection Techniques in Data Warehouse Systems

Index Selection Techniques in Data Warehouse Systems Index Selection Techniques in Data Warehouse Systems Aliaksei Holubeu as a part of a Seminar Databases and Data Warehouses. Implementation and usage. Konstanz, June 3, 2005 2 Contents 1 DATA WAREHOUSES

More information

estatistik.core: COLLECTING RAW DATA FROM ERP SYSTEMS

estatistik.core: COLLECTING RAW DATA FROM ERP SYSTEMS WP. 2 ENGLISH ONLY UNITED NATIONS STATISTICAL COMMISSION and ECONOMIC COMMISSION FOR EUROPE CONFERENCE OF EUROPEAN STATISTICIANS Work Session on Statistical Data Editing (Bonn, Germany, 25-27 September

More information

Empirical Studies on Test Effectiveness for Database Applications

Empirical Studies on Test Effectiveness for Database Applications Empirical Studies on Test Effectiveness for Database Applications Chixiang Zhou and Phyllis Frankl Department of Computer Science and Engineering Polytechnic Institute of New York University Database Applications

More information

Business Application Services Testing
