Requirements Traceability Recovery


MASTER'S THESIS
Requirements Traceability Recovery - A Study of Available Tools
Author: Lina Brodén
Supervisor: Markus Borg
Examiner: Prof. Per Runeson
April 2011


Abstract

This master's thesis focuses on tools that implement information retrieval based methods to recreate, or recover, requirements traceability in the software development process. Requirements traceability is an area that is often neglected or poorly performed and maintained in the development life-cycle, even though it is well known that these activities are associated with high costs when they are badly or inefficiently executed. A tool developed to perform these tasks could therefore help companies both to improve their requirements traceability management and to become more efficient in, for example, tracing requirements to test case descriptions. A number of research prototypes have been developed, and this thesis aims to compare them with each other and to investigate whether there are any major differences between them. A literature study of the field is presented, as well as the available tools and the methods they implement. Furthermore, a simplistic method, called the Naïve approach, is developed and implemented in addition to the prototypes. Testing and comparison between the different tools are performed with real industrial data sets, following a framework especially developed for the purpose of comparing requirements tracing experiments, which is also described in the report. The results indicate that the Naïve approach performs much worse than the other prototypes when there are multiple traceability links between the items of two artifacts, but fairly well when a one-to-one mapping exists. On the other hand, none of the tested tools shows exceptional performance. This implies that further work and more extensive research in the area of requirements traceability recovery is necessary before drawing the conclusion that a company is significantly helped by integrating one of the tools in its development processes.


Acknowledgement

I would like to thank my supervisor, Markus Borg, for all the invaluable help he has provided throughout this master's thesis. Thanks also to my examiner, Per Runeson, for the help, useful discussions and comments on my work along the way. Finally, I would like to thank the company, and the people there, who provided the data that made this master's thesis possible at all.


Table of Contents

1 Introduction

Part I: Theoretical Background
2 Software traceability
3 Information retrieval
  3.1 IR methods
  3.2 Weighting and Thesaurus
  3.3 Stemming and Stop words
  3.4 Measurements
4 IR-based traceability recovery
  4.1 Tools

Part II: Evaluation
5 Framework for Evaluation
6 Phase I: Definition
  6.1 Motivation
  6.2 Purpose
  6.3 Object
  6.4 Hypothesis
  6.5 Perspective
  6.6 Domain
  6.7 Scope
  6.8 Importance
7 Phase II: Planning
  7.1 Experimental design
  7.2 Measurement
  7.3 Product

8 Phase III: Realization
  8.1 Preparation
  8.2 Execution
9 Phase IV: Interpretation
  Interpretation context
  Results
  Extrapolation
  Impact

Part III: Conclusion
  Discussion
    Strengths and Limitations
    Alternative approaches
  Future work and final remarks

Part IV: Appendices
  A Glossary
  B Safety Integrity Level
  C Implementation of The Naïve approach

CHAPTER 1
Introduction

Today the documentation of large-scale software development taking place in many companies and organizations is a fairly ad hoc process, resulting in numerous textual documents in various forms. The documentation itself is also often incomplete and unsatisfactory, and managing and coordinating these documents is a complicated and time consuming, yet crucial, task for achieving and maintaining high efficiency. This is certainly also true for the alignment between requirements and test activities. Since it is well known that these processes are associated with high costs if they are inefficient or poorly executed, there is also a possibility to cut costs in these particular areas if they are performed in a correct and efficient way with the support of appropriate tools. The main idea of this master's thesis is to investigate and compare tools that are specialized in requirements traceability recovery, i.e. the process of identifying, for example, which requirements could be affected if a change is made in one of the test cases. The testing and evaluation of the currently available tools is performed with real industrial data provided by a large international company. One of the reasons for this is that there is a lack of this type of experiment, mainly because many companies are unwilling to share their data with people outside their organization, although this type is the most interesting from an industrial point of view, since this is where the application could generate the greatest benefits. The primary method can be characterized as a combination of a brief survey of the field of traceability recovery in general, and of information retrieval based tools in that area in particular. These tools are then used in a comparison experiment, following a framework for comparing requirements tracing experiments developed by Hayes et al. [1], that aims to establish whether there is any major difference in performance depending on which tool is used, and how much the data set influences the result of the traceability recovery.

Structure

The first part of this report is called Theoretical Background. It briefly describes the field of software traceability and requirements traceability recovery, introduces the key elements of information retrieval and presents the different tools. Part two, Evaluation, is the main part of the thesis report. It follows the structure of the framework used for the evaluation, which is presented at the beginning of that part. Finally, there is a third part with conclusions and further work, followed by a part containing the appendices.

Part I
Theoretical Background


CHAPTER 2
Software traceability

Software traceability, from now on only traceability, is defined as follows by Gotel and Finkelstein: "(...) the ability to describe and follow the life of a requirement, in both a forwards and backwards direction (i.e., from its origins to its subsequent deployment and use, and through all periods of on-going refinement and iteration in any of these phases)" [2]. This is probably the most well-known definition of traceability, even if it partly defines traceability as something only associated with requirements. This is not completely true, since traceability plays an important role in many parts of the software artifact life-cycle. The original intent of traceability was the ability to link product documentation requirements back to stakeholders, but today traceability is used to find relationships between various artifacts, such as code-to-code links, document-to-document links, code-to-document links, et cetera [1]. In this master's thesis the traceability concept involves all forms of traces or connections between software artifacts. One of the main fields in software engineering supported by improved traceability is Verification and Validation (V&V), where the ability to trace artifacts back to requirements can help verify that the system developed actually is the system requested by the stakeholders, that all features have been implemented and that the specifications have been met. Another example in which traceability is very useful is when artifacts are reused, for instance in a new project or in an upgraded version. Clearly, it could be both practical and lucrative for any software development company to maintain traceability, and several standards include traceability as a recommended activity. Sometimes it is even required: safety-critical systems are often legally required to demonstrate that all parts of the developed system and the code used trace back to valid requirements [3]. Despite the advantages mentioned above, many organizations and companies fail to achieve proper traceability documentation for their software projects and/or products. In many cases the requirements traceability matrix (RTM), which is the main component in traceability analysis, has to be constructed after-the-fact by non-developers [1]. That is due to the fact that documentation of traceability during

the progress of the development has been neglected or incomplete (or at least not detailed enough), if conducted at all. Performing this task manually is extremely tedious, error-prone and time consuming, since the requirement specifications often contain a large number of requirements. Because many requirements, as well as many other artifacts, are at least partly written in natural language, Information Retrieval techniques have been proposed, and applied, to automate the process of generating candidate links (links suggested to represent a connection between two given artifacts) [4].

CHAPTER 3
Information retrieval

The idea behind information retrieval (IR), to use computers to automatically search for relevant information in digitally stored documents, was born in the 1950s [5]. The purpose of IR is to reduce information overload, that is, to provide the user only with information related to what he or she is really looking for. A typical IR problem involves a collection of documents (information) and a user information need, which is expressed in the form of a text query. The task is to find the documents in the collection that are considered relevant to the query. The most visible applications of IR techniques today are Web search engines such as Google or Bing. Natural Language Processing (NLP) is a field of computer science and linguistics. It is closely related to IR and is concerned with the interaction between computers and human languages. The intention is to parse and process human languages into a format that can be understood by machines without losing aspects that a human could read between the lines, i.e. the process is not only about chopping the text up into words and comparing them against other words, which is pretty much what IR does, but rather about making the computer understand that two words can mean the same thing even if their grammatical forms differ or if they are synonyms. Actually, it is hard to make a definite distinction between information retrieval and natural language processing. To avoid using both IR and NLP notation, this report, for the sake of simplicity and to avoid misunderstanding, refers to all methods and techniques as IR, even if they might actually be classified as NLP activities.

3.1 IR methods

The techniques involved in IR are essentially divided into three categories of models. The first is the set-theoretical models, in which documents are represented as sets of words and similarity is calculated from set-theoretical operations and methods [6]. These models have been found to be a bit too simplistic in the context of traceability link generation and are thus left outside the further discussion. Basically, the main

focus of this short survey of the IR field will be on the techniques actually being used in the traceability recovery area today. As a result, things that are part of IR but not relevant to the application in the traceability field might be left out.

Algebraic models are the second category of IR models. The models in this category use vectors or matrices to represent documents as well as queries. When the similarity between a document and a query has been derived, it is represented as a scalar value. The Vector Space Model (VSM) [7], which is one of the most frequently used models in practice, belongs to this category. Documents are represented as vectors in a multi-dimensional space, where each term in the vectors corresponds to one dimension. A text document $d_j$ is represented by the vector $d_j = (t_{1j}, t_{2j}, \ldots, t_{nj})$, where $t_{ij}$ represents the number of times the term $t_i$ appears in $d_j$. An example with three vectors in three dimensions can be seen in figure 3.1. The query is represented in the same way, $q = (t_{1q}, t_{2q}, \ldots, t_{nq})$. When the vectors representing the documents are put together into one matrix it is called the term-document matrix, in which each term is represented by a row and each document by a column. This means that each cell in the matrix contains the number of times the associated term appears in the indicated document. The similarity is computed in terms of distances in this vector space.

Figure 3.1: Vector Space Model example.

Latent Semantic Indexing (LSI) [8] is a further development of the basics in VSM. One of the drawbacks of the VSM model is its inability to handle any synonyms present in the documents. LSI overcomes this drawback and is also able to handle polysemy, words that have more than one meaning. The LSI model is based on the same term-document matrix as the VSM method uses, which usually is very large and very sparse. Once it is built, Singular Value Decomposition (SVD) is performed on the matrix to reduce its rank, make it more manageable and determine patterns among the terms and concepts. While performing the SVD one has to define the right size of the new matrix. This is a difficult decision, since the space has to be big enough to include the right level of detail without including too much noise, at the same time as the size has to be heavily reduced without excluding valuable information. This is often done as a parameter setting in the tools implementing the LSI model [9].

The third category is the probabilistic models. This category treats the retrieval process as probabilistic inference, and the similarity is calculated as the probability that a document is relevant for a given query. The probabilistic relevance model estimates the probability that a document $d_j$ is relevant to a query $q$. The model assumes that this probability depends only on the query and document representations. It also assumes that there is a set of documents that is preferred as the answer set for each query $q$. The goal is to maximize the overall probability of relevance for this set, to this

specific user. Furthermore, the prediction is that all documents in this set are relevant, whilst documents not included in the set are non-relevant to the current query.

3.2 Weighting and Thesaurus

There are different ways to assign weights to the terms, representing the importance of their occurrence in the document now represented as a vector. The raw weighting is computed simply as the number of times the term occurs in the document, as in the VSM description above. When using binary weighting, each term is assigned a weight of 1 if the term exists in the document and 0 otherwise. Term frequency (tf) weighting is used when one wants to assign each term a weight proportional to the frequency of the term's occurrences in the document. It is calculated by dividing the raw value (the number of times the word occurs in the document) by the total number of words in the document in which the term is included. Another way to assign proportional weights to the terms is inverse document frequency (idf) weighting. It assigns a weight to each term obtained by dividing the total number of documents by the number of documents in which the term is included, and then taking the logarithm of this ratio. This weighting scheme is based on the idea that the more documents a term occurs in, the less information it provides in a semantic sense. If, for example, the word car appears in all of the documents in a collection, then car does not say much about the content of each of the documents. But if car is only present in one of the documents, then it provides the system with important information that can distinguish that particular document from the rest if the user wants information about cars. Applying both the term frequency weighting and the inverse document frequency weighting gives the tf-idf weighting. It is supposed to combine the best of the two methods to get a term weight that is high when the term occurs many times in a small number of documents, and low when the term occurs in many documents. This weighting method is by far the most frequently used. The use of a thesaurus is another common method applied in IR to reduce the number of different terms in the process. In its simplest form it is a set of triples (t, t', α), where t and t' are matching thesaurus terms (keywords or phrases) and α is the similarity coefficient between them. An example of a simple thesaurus triple could be (error, fault, 0.9). This technique is often used to extend the tf-idf weighting method.

3.3 Stemming and Stop words

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. It can be very effective and useful to apply this technique, but there are a lot of challenges in the implementation of these kinds of algorithms, and there have been ever since the first paper about the technique was published in 1968 [5]. While working with IR techniques on large data sets including a lot of text, it is sometimes useful to filter out words that do not have any specific meaning in a text, for example words such as the, are, at and so on. These kinds of words are called stop words, and a list of them is referred to as a stop word list. It is important to notice, though, that there is no universal, definite stop word list, and the method is not even implemented by all IR techniques.
A stop word function, on the other hand, is a rule that excludes words whose length is less than a certain number decided by the user.
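To make the vector space representation of Section 3.1 and the tf-idf weighting of Section 3.2 concrete, the following is a minimal Java sketch (Java is also the language used for the Naïve approach in Chapter 6). It is not taken from any of the evaluated tools; class, method and variable names are illustrative only, and no stemming, thesaurus or stop word handling is included.

```java
import java.util.*;

// Minimal sketch of VSM with tf-idf weighting and cosine similarity.
public class TfIdfExample {

    // Raw term frequencies of one document.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // tf-idf vector of one document, given document frequencies for the collection.
    static Map<String, Double> tfIdf(Map<String, Integer> counts,
                                     Map<String, Integer> docFreq, int nDocs) {
        int docLength = counts.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double tf = (double) e.getValue() / docLength;                   // term frequency
            double idf = Math.log((double) nDocs / docFreq.get(e.getKey())); // inverse document frequency
            vector.put(e.getKey(), tf * idf);
        }
        return vector;
    }

    // Cosine similarity between two tf-idf vectors, the similarity measure of VSM.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "the system shall log every error",
                "verify that the error log is written",
                "the user interface shall use the company colours");
        // Document frequencies: in how many documents each term occurs.
        Map<String, Integer> docFreq = new HashMap<>();
        List<Map<String, Integer>> counts = new ArrayList<>();
        for (String d : docs) {
            Map<String, Integer> c = termCounts(d);
            counts.add(c);
            for (String term : c.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> req = tfIdf(counts.get(0), docFreq, docs.size());
        Map<String, Double> tc1 = tfIdf(counts.get(1), docFreq, docs.size());
        Map<String, Double> tc2 = tfIdf(counts.get(2), docFreq, docs.size());
        System.out.println("similarity(req, tc1) = " + cosine(req, tc1));
        System.out.println("similarity(req, tc2) = " + cosine(req, tc2));
    }
}
```

Note how a term such as "the", which occurs in every document, receives an idf of zero and therefore contributes nothing to the similarity, which is exactly the intuition behind the idf weighting described above.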

3.4 Measurements

To evaluate the performance of different IR methods, there are some basic traditional measures to apply to the results generated by the IR process. The two most well established measures are precision, which is a measure of exactness, and recall, which measures the completeness of the retrieved results. These two metrics are used in all papers applying IR techniques to requirements tracing. When using precision and recall, the documents are classified into four categories, which can be seen in figure 3.2: retrieved & irrelevant, not retrieved & irrelevant, retrieved & relevant, and not retrieved but relevant.

Figure 3.2: Classification of documents.

Precision is defined as the number of relevant documents retrieved by a search, divided by the total number of documents retrieved by the same search, figure 3.3:

$$\mathrm{precision} = \frac{|\{\text{Retrieved \& relevant}\}|}{|\{\text{Retrieved \& relevant}\}| + |\{\text{Retrieved \& irrelevant}\}|}$$

Recall is defined as the number of relevant documents retrieved by a search, divided by the total number of documents that are relevant and hence should have been retrieved by the search, figure 3.3:

$$\mathrm{recall} = \frac{|\{\text{Retrieved \& relevant}\}|}{|\{\text{Retrieved \& relevant}\}| + |\{\text{Not retrieved but relevant}\}|}$$

Figure 3.3: Recall and Precision.

There is a dependency between recall and precision: as recall goes up, precision decreases, and as recall goes down, precision increases. As one can understand, it is easy to achieve 100% recall by simply returning all documents no matter what query is used in the search, but this would result in decreased precision. Recall on its own is therefore not enough to measure the performance of an IR method; it has to be combined with precision, or another measure accounting for the non-relevant documents. The results of these two

measures are often presented as precision versus recall graphs, see figure 3.4, in which it is easy to see the trade-off that has to be made between the number of retrieved documents and the number of relevant documents.

Figure 3.4: Recall vs precision example.

An alternative way of displaying these values is through the harmonic mean, or F-score as it is sometimes called. The formula for calculating the F-score is

$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

There are also a number of alternative, secondary measures used to evaluate the performance of different IR techniques, mainly from the analyst's point of view. Some of them are Lag, DiffAR and DiffMR [10], but these will not be further described since they are not used in this study.
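As a small worked illustration of the formulas above (not tied to any particular tool, with made-up numbers), precision, recall and F-score can be computed from the four document categories of figure 3.2 as follows:

```java
// Illustrative only: precision, recall and F-score from the four
// document categories of figure 3.2.
public class Metrics {

    static double precision(int retrievedRelevant, int retrievedIrrelevant) {
        return (double) retrievedRelevant / (retrievedRelevant + retrievedIrrelevant);
    }

    static double recall(int retrievedRelevant, int notRetrievedRelevant) {
        return (double) retrievedRelevant / (retrievedRelevant + notRetrievedRelevant);
    }

    static double fScore(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Example: a search returns 8 documents, 6 of them relevant,
        // while 4 relevant documents are missed.
        double p = precision(6, 2);   // 6 / 8  = 0.75
        double r = recall(6, 4);      // 6 / 10 = 0.60
        System.out.printf("precision = %.2f, recall = %.2f, F = %.2f%n",
                p, r, fScore(p, r));  // F = 2*0.75*0.60/1.35, roughly 0.67
    }
}
```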


CHAPTER 4
IR-based traceability recovery

The process of constructing or re-constructing traceability links after-the-fact is called traceability recovery. As mentioned earlier, it is a tedious and time-consuming procedure to perform manually, and because of that, and the fact that many software artifacts are written in natural language, various IR methods have been proposed and applied to automate at least parts of the process. Basically, the traceability recovery process can be described as follows [1]. The starting point is two different artifacts. In this example two documents will be used, one with high-level requirements and one with design-level requirements, but tracing can be applied to any artifacts of choice. The first step is to read through all of the requirements, compare them against each other and generate a list of candidate links for each high-level requirement, that is, a list of all design-level requirements that might be connected to the current high-level requirement, in the sense that they influence each other if one of them is changed. There are different ways of doing this. The most intuitive approach would be to start with the first high-level requirement, read through all the design requirements and write down which design requirements, if any, trace back to the current high-level requirement, then move on to the second high-level requirement and start the process all over again, continuing in the same way until all of the high-level requirements have been examined. If the process is carried out in such a way, it is easy to understand that the number of comparisons that has to be made rapidly becomes very high. This step in the traceability process is called the candidate link list generation. When the list has been generated, the candidate links have to be examined more closely and assigned the binary value link / no link, which is the second step in the traceability process. This step is always done by a human analyst, in order to ensure that an independent V&V inspection based on the results of the traceability analysis can be trusted. In the examination process the analyst may also choose to consult other

related artifacts in order to be able to decide whether a candidate link should be marked as link or no link [1]. As one can see, it would certainly facilitate the tracing process if the candidate link lists could be generated automatically, without requiring an analyst to go through all of the requirements over and over again manually. This is also where IR is mainly applied in the traceability recovery process. The functionality is often embedded in some kind of traceability recovery tool, a software program assigned to the task of reconstructing traceability. There are various types of recovery tools, with different purposes, implementing various techniques. In this master's thesis, however, only tools based on IR techniques and especially designed and used for traceability recovery will be discussed. The tools studied and evaluated are briefly presented below, together with some other tools/prototype tools currently being developed.

4.1 Tools

RETRO
Hayes et al. [11] presented REquirements TRacing On-target (RETRO), a special-purpose requirements tracing tool which automates the generation of RTMs. RETRO implements a number of IR methods in order to do so. The default tracing technique in the tool is the VSM method with tf-idf term weighting, but it also includes the LSI technique and the possibility to combine different methods and parameter settings to achieve the best results.

ReqSimile
Natt och Dag et al. have developed and released ReqSimile [12], a tool performing automated similarity analysis of textual requirements. ReqSimile is an open-source application and is freely available for anyone interested in using it. The method implemented in ReqSimile is VSM.

Poirot:TraceMaker
Lin et al. have applied a probabilistic approach to dynamically generate traceability links [13]. Their method is implemented in the Poirot:TraceMaker tool, a web-based tool developed to support traceability recovery between different artifacts.

ReqAnalyst
Lormans et al. [14] have developed a prototype tool called ReqAnalyst in order to test different traceability recovery approaches, and in particular LSI methods. The tool allows the user to change the parameter settings of the LSI reconstruction. Once it has executed, ReqAnalyst provides the user with the reconstructed traceability matrix, with the possibility of generating different requirements views, which make it possible to obtain continuous feedback on the progress of ongoing software development or maintenance projects.

TraceViz
A prototype tool, TraceViz, used to visualize traceability links, was presented by Marcus et al. [15] to support users during recovery and maintenance. TraceViz is an Eclipse plug-in and mainly uses the LSI method to recover traceability links between the artifacts.

The IR technique used is not unique; the key advantage of this tool is instead the visualization of the recovered links.

ADAMS Re-Trace
Another tool based on the LSI method is the ADAMS Re-Trace tool. Re-Trace is integrated in the artifact management system ADvanced Artifact Management System (ADAMS), and is also available in the Eclipse-based client of ADAMS. Re-Trace was developed and integrated into ADAMS by Oliveto [5] in order to provide the ADAMS users with support for automatic traceability recovery as a complement to the manual creation of traceability links that is already enabled in the system.

A summary of the different tools can be seen in Table 4.1.

Table 4.1: Summary of IR-based traceability recovery tools.

Tool               Method(s)      Type of program         Available
RETRO              VSM, LSI       Stand-alone             Yes (available version implements VSM only)
ReqSimile          VSM            Stand-alone             Yes
Poirot:TraceMaker  Probabilistic  Web-based               No
ReqAnalyst         LSI            Stand-alone             No
TraceViz           LSI            Eclipse plug-in         No
ADAMS Re-Trace     LSI            Integrated or plug-in   No


Part II
Evaluation


CHAPTER 5
Framework for Evaluation

The testing part of this master's thesis is based on A Framework for Comparing Requirements Tracing Experiments by Hayes and Dekhtyar [1]. In the paper the authors "(...) propose a framework for developing, conducting and analyzing experiments on requirements traceability". The framework and its different phases are described in more detail as they are used in each of the following four chapters. A summary of the experimental framework can be seen in figure 5.1.

Figure 5.1: Summary of Framework for Evaluation.

CHAPTER 6
Phase I: Definition

According to Hayes et al. [1], before starting the actual testing activities the researcher has to decide on the scope and object of the project. These activities are referred to as the definition phase and consist of eight parts: Motivation, Purpose, Object, Hypothesis, Perspective, Domain, Scope and Importance.

6.1 Motivation

The motivation for performing this experiment, including the examination, comparison and evaluation of the performance of different IR-based traceability recovery tools when using real industrial data, is to:

- Understand the differences in performance there might be between tools implementing different IR techniques.
- Learn about the behavior of the tools when they are provided with industrial data, since previous experiments have mainly been performed with specially compiled data sets.
- And partly, confirm the results derived in previous experiments.

6.2 Purpose

The purpose of performing this experiment is to:

- Examine the field of traceability recovery, and especially the tools publicly available.
- Test different tools on the same industrial data set.
- Evaluate the performance of the tools.
- Partly, validate the results from previous experiments.

6.3 Object

The objects of study are the available IR-based traceability tools: ReqSimile and REquirements TRacing On-target (RETRO). Unfortunately no other tools among those mentioned in the Tools section (Section 4.1) are available for testing. In order to get a very basic tool/method to include in the comparison, a simple, naïve approach was implemented in Java; more information about this implementation can be found in the next subsection. From the beginning, Google Desktop, a tool developed to make searching through a user's computer as easy as searching the web with the Google search function, was meant to be a reference point in the comparison and evaluation of recall/precision. But we failed to configure it properly: no matter what the settings were, Google Desktop managed to find about 280 local system files and indexed them. This would certainly influence the recall/precision values used in the evaluation, since the sets of requirements and test case descriptions only consisted of about 220 files each. As a result, we decided to exclude Google Desktop as a benchmark.

The Naïve approach - a Java implementation

As already mentioned, a naïve approach was implemented, figure 6.1, to be used as a benchmark when evaluating the results of the other tools. The idea was to be able to evaluate the simplest method possible, which is the direct comparison of, for example, which test case descriptions have the most words in common with a specific requirement. This approach does not take any weighting, stemming or stop words/stop word lists into consideration. The implementation in Java consists of three classes, N_app (the main program), Item and res_obj, which basically work as follows.

Required input format:

- One folder with requirements, each saved as a single file where the filename is used as a unique identifier.
- One folder with test case descriptions, each saved as a single file where the filename is used as a unique identifier.

The folders with requirements and test cases are given as sources of files; the main program goes through the two folders, creates vectors of File objects and lists the names of all files in each folder. Then vectors of Item objects are created; each Item represents a requirement or a test case description and holds the name and the contents of the requirement/test case, the latter saved as a vector of Strings, each representing a word or token. The String method replaceAll() is used here with the argument ("[^a-zA-Z,_]", "") in order to get Strings containing only letters or underscores (motivated by the fact that many names in programming code contain this special token) in the comparison vector. When these two Item vectors have been created, the comparison between each pair of Item objects is executed. This is performed through the String operation compareTo(), and to be considered equal the two Strings being examined must be exactly alike; for example, cat and cats are considered different. This is motivated by the fact that considering these two words as alike/partly alike is a typical case of stemming, and it has already been mentioned that stemming is not taken into consideration in this approach. The results are saved in res_obj objects, which contain the names of the two compared Items and the number of words that are exactly the same in both Items. The res_obj objects are then put in a vector, sorted in descending order according to the number of matching words, and finally all

these vectors (each representing one requirement and its comparisons with all the test cases) are put in one big vector, which is used to print the results the user wants. For example, the user could ask the program to write out the names of the five (or any arbitrary number of) test cases that match each requirement best. One could also choose to only print the names of test cases that have at least ten (or any arbitrary number of) matching words.

Figure 6.1: Flowchart for the Naïve approach.

6.4 Hypothesis

The hypotheses to be verified or rejected are the following:

- Null hypothesis - no difference in recall/precision between the tools exists.
- Alternative hypothesis 1 - RETRO shows better recall/precision than ReqSimile when evaluated on the special purpose data set, CM-1.
- Alternative hypothesis 2 - RETRO achieves better recall/precision than ReqSimile when evaluated on the industrial data set.
- Alternative hypothesis 3 - No tool performs better than the naïve approach.

The motivation for the null hypothesis is that, since this is a comparison, the most basic difference that could occur is that the tools show different values of recall/precision. For the first alternative hypothesis, RETRO is suggested to perform better than ReqSimile on the CM-1 data set because it has been tested and evaluated on that particular data set before [11]; the statement certainly could have been the opposite, ReqSimile performing better than RETRO. The same goes for the second alternative hypothesis, but here it is only for the sake of simplicity that RETRO is the one claimed to perform better, since neither of the tools, according to our survey of the existing literature, has been thoroughly tested with industrial software artifacts before. The third and last alternative hypothesis is stated to investigate whether the traceability tools actually are useful and perform better than a simple mapping between two sets; see Section 6.3 for more details about the naïve approach.
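To make the "simple mapping between two sets" concrete, a rough sketch of the core comparison step of the Naïve approach (Section 6.3) is shown below. The full implementation is given in Appendix C; this sketch simplifies it by counting distinct common words instead of keeping res_obj objects, and the folder names are made up for the example.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Simplified sketch of the Naïve approach's comparison step: count exactly
// matching words between a requirement and each test case description
// (no weighting, stemming or stop word handling). Folder names are hypothetical.
public class NaiveSketch {

    // Keep only letters and underscores, mirroring the replaceAll() filtering
    // described in Section 6.3.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String word : text.split("\\s+")) {
            String cleaned = word.replaceAll("[^a-zA-Z_]", "");
            if (!cleaned.isEmpty()) {
                result.add(cleaned);
            }
        }
        return result;
    }

    // Number of words occurring in both items; "cat" and "cats" do not match.
    static int matchingWords(Set<String> requirement, Set<String> testCase) {
        Set<String> common = new HashSet<>(requirement);
        common.retainAll(testCase);
        return common.size();
    }

    public static void main(String[] args) throws IOException {
        Map<String, Set<String>> testCases = new TreeMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get("testcases"))) {
            for (Path tc : stream) {
                testCases.put(tc.getFileName().toString(), words(Files.readString(tc)));
            }
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get("requirements"))) {
            for (Path req : stream) {
                Set<String> reqWords = words(Files.readString(req));
                // Rank the test cases for this requirement by descending match count.
                testCases.entrySet().stream()
                        .sorted((a, b) -> matchingWords(reqWords, b.getValue())
                                        - matchingWords(reqWords, a.getValue()))
                        .limit(5)   // e.g. the five best matching test cases
                        .forEach(e -> System.out.println(req.getFileName() + " -> "
                                + e.getKey() + " (" + matchingWords(reqWords, e.getValue())
                                + " matching words)"));
            }
        }
    }
}
```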

6.5 Perspective

The performance study of the traceability tools is carried out from the perspective of a student, which most likely falls into the researcher role mentioned in the description of the framework.

6.6 Domain

The domain of the experiments is that of an engineer who will use the tools in practice when tracing requirements, or any other traceable artifact, between software artifacts.

6.7 Scope

The scope of an experimental study is decided by looking at the size of the domains considered. Accordingly, the study of traceability tools is placed in the category of multi-project studies, which means that the experiments examine objects across a single team and a set of programs. Other categories are blocked subject-project, replicated project and single project.

6.8 Importance

There are two levels of importance according to the framework: domain importance and object of study importance. These are evaluated on the following scale: safety-critical (potential loss of human life), mission-critical, quality of life (at which level the performance is), or convenience. This study of traceability tools is considered safety-critical in terms of domain importance, since the studied module has a SIL (Safety Integrity Level) classification (see Appendix B for more information), while the object of study importance is graded as quality of life.

CHAPTER 7
Phase II: Planning

After the definition phase it is time to plan how the experiments are actually going to be carried out. The framework defines three parts in this planning phase: Experimental design, Measurement and Product.

7.1 Experimental design

The primary independent variable, the factor that the researcher hypothesizes will cause the results of the experiment, is in most requirements tracing studies the requirements tracing technique. This is also what determines the external validity, referring to the ability to generalize the results. In this case it means that the tool used to perform the tracing process is the independent variable and hence determines the level of generalization. But this is also affected by the representation used for the data set, and how the set is prepared and managed. In this experiment there are two different data sets, the industrial data set and the CM-1 set, both described further on. Other important independent variables influencing the external validity are the type and size of the project artifacts being traced, as well as their quality. The believability of the relation between the hypothesized causes and the results of the experiments is referred to as the internal validity.

Requirements tracing technique

Since the object of study is IR-based traceability tools, this is also what separates the different items in the experiment. RETRO implements both the Vector Space Model and the LSI technique according to its specification paper [11], while ReqSimile only adopts the VSM. However, when performing the tests the settings in RETRO were not modifiable, and the tests were executed with VSM, which is RETRO's default technique. In the Naïve approach, words in one set of traceability data are simply compared with words in another set, and the items containing most matching words are returned.

Type and size of project artifacts

A traceset in requirements tracing is typically a pair of software artifacts divided into lower level requirements, along with an answerset, which is a validated mapping between the two artifacts. In requirements traceability experiments, the ability to perform well, and achieve accuracy, for small as well as large tracesets is referred to as scalability. The artifacts in the industrial data set in this study, which was provided by a large company active in the power and automation sector, consist of a test case document at design level, figure 7.1, which should be traced to a requirements document at the same level. The traces have been determined and verified manually, thus the traceset is complete. The test case document consists of about 220 test case descriptions and the requirements document of about 225 requirements, so the total number of combinatorial links becomes 220 x 225 = 49,500. Since a traceset with more than 3,000 combinatorial links is considered large, the test-requirements traceset certainly qualifies for this category. The CM-1 data set should also be considered a large traceset, as it contains 235 high level requirements, 220 low level requirements and a manually verified answer set. All the test case descriptions and requirements in both data sets are of a textual nature. To characterize the data sets further, a word count was performed with MS Word; the results are presented in Table 7.1, which gives the number of words, the number of items and the average number of words per item for each of the four documents (industrial requirements, industrial test case descriptions, CM-1 low level requirements and CM-1 high level requirements).

Table 7.1: Number of words in each of the data set documents.

Quality of artifacts

In the industrial traceset used in this study not every test case description traces to a requirement, but every requirement traces to exactly one test case. Because there are more requirements than test case descriptions in the industrial data set, this means that some requirements trace to the same test case description, while some test case descriptions lack traces to requirements. Since the data set is from a SIL-classified component (more information about SIL is given in Appendix B), it is required to have full traceability between the stated requirements and the test case descriptions. This is what gives us the answer key, so that we can determine whether the tools have performed well or not. The relationship between requirements and test case descriptions in this situation means that the ability of the tested tools to detect both test cases without any requirement origin and untested requirements can be validated, because of the one-to-one mapping that exists. The CM-1 traceset contains high level requirements that trace to low level requirements, as well as requirements that do not have any trace to the low level requirements.

Dependent variables

As in many other requirements tracing experiments, recall and precision are the dependent variables in this study. In order to examine all possible combinations of tools and

data sets, which constitute the independent variables of this experiment, it is required to perform six (= number of tools x number of data sets) different tests to get a full factorial design [1].

7.2 Measurement

The measures used in this study are mainly recall and precision, which are well established, validated measurements from the information retrieval field. They are both described in Section 3.4, where the F-score, another measure used in this comparison, is also explained.

7.3 Product

In the definition of the framework [1], the authors describe the product as follows: "In traceability experiments the products are usually the items being traced while a model or process is being evaluated." In this case it means that the products are the two sets of data, the industrial data set and the CM-1 data set, that is, two levels of documentation. In the industrial data set both requirements and test case descriptions are at design level, while the CM-1 data set consists of higher level requirements and lower level requirements. The different levels are schematically shown in the V-model, figure 7.1. More information about the different data sets and their content can be found in Section 8.1.

Figure 7.1: The V-model with the different levels of documents.


CHAPTER 8
Phase III: Realization

When the experiment is defined and planned it is time to start the actual experimentation, the realization phase, which consists of three parts: Preparation, Execution and an optional Analysis.

8.1 Preparation

In many requirements tracing experiments a pilot study, a smaller experiment just to get initial results, is conducted. In this case, though, no pilot study was performed for RETRO and ReqSimile, since the tools were already available and it was just as easy to provide them directly with the real data, which in any case had to be prepared. The required input formats are listed in Table 8.1. In the case of the Naïve approach the situation was a little different. Since the program was going to be developed from scratch, a smaller test set was necessary in order to manually follow each step and verify that the program was doing the right thing and that the results were reasonable. Further comments about this test set can be found in the description of the conversion of industrial data into the input format required by the Naïve approach.

Table 8.1: Required input format by different tools.

Tool            Required Input Format
RETRO           Single files, without file extensions
ReqSimile       MS Access database document
Naïve approach  Single files, with or without file extensions

Industrial data converted into the ReqSimile required input format

Since the industrial data files were given in MS Word format, this preparation consisted of going through both the requirements document and the test case document in order to get each item and convert it into an MS Access database format, which is the input format required by ReqSimile. This was done simply by manually copying the selected item in the Word file and pasting it into the database document. At the same time all unwanted data, such as formatting, numbering etc., was deleted. Basic words that occurred in all test case and requirement items were also deleted, since they would not contribute anything to the IR processes in the tools. When all data had been transformed into the proper format, it had also been given new numbers, since the only numbering used in the original Word document was a built-in list numbering, which unfortunately could not be used in the database document. A new answer set was made manually, where the name of each test case, together with its new number and the matching requirement number, was put in a list. If there was no matching requirement, this was noted as well.

Industrial data converted into the RETRO required input format

When using RETRO it is required that all input data is in the form of single files, one for each requirement/test case, without any file extension, and since the industrial data was given in Word files it had to be transformed into this format. This process was performed in a similar way as the conversion into the ReqSimile format, through manual copy/paste. To make the process a little less laborious, the previously created database document was used as the main source, to avoid having to go through all the data one more time just to delete unwanted data that had already been deleted when converting the data into the ReqSimile format. The numbering created there was used to give each file a unique name. The result of the conversion was two folders containing 224 files with requirements and 218 files with test case descriptions.

Industrial data converted into the Naïve-approach required input format

Since the RETRO required input format also satisfies the input format required by the Naïve approach, there was no need to convert the industrial data a third time. On the other hand, a smaller test set had to be constructed in order to be able to develop the program in a proper way. This was done simply by taking a small subset of the industrial data set, in this case the first eleven requirements and the first eleven test case descriptions. In this way one could monitor the development process properly and make sure that the Naïve approach performed as expected.

CM-1 data converted into the ReqSimile required input format

The CM-1 data set, from the CM-1 project of the NASA Metrics Data Program, had to be converted from single file format into a database in order to be used with ReqSimile. Each file was opened, and the content was copied and pasted into the MS Access database document together with the name of the file, which is also the name of the requirement/test case.

CM-1 data converted into the RETRO required input format

There was no need to convert the CM-1 data before using it with RETRO, since this data set is already prepared and in the right format for RETRO use.

CM-1 data converted into the Naïve-approach required input format

Due to the fact that the filenames (the identifiers) are treated as Strings in the implementation of the Naïve approach, one can use the same input format as in the RETRO case, hence there is no need to convert the data.

8.2 Execution

The actual testing was divided into six different cases. Each case is further explained in the following subsections.

RETRO with industrial data

After the transformation of data into the RETRO required input format, the execution of this test was straightforward. The folders containing the requirements and test case descriptions were given as sources and the program created a new project. A project in RETRO is exactly what it sounds like, the project one is working with at the moment; it is given a name and saved, so that it is possible to keep the work that has been done and continue at another occasion. Then all traces between high level document elements (in this case, the requirements) and low level document elements (test case descriptions) were calculated and displayed. The filter, which gives the opportunity to see only traces with weights greater than a specified value, was kept at 0.0, which is the default value, in order to see all possible traces. An overview of the RETRO screen layout can be seen in figure 8.1. Since in this case each requirement only had one corresponding test case description, according to the verified manual traces belonging to the data set, the position of the correct test case description in the list of proposed links was noted manually. If the requirement did not have any corresponding test case, the position of its correct test case was noted as 300, with the sole intent of making the results suitable for further evaluation in MATLAB. In the MATLAB script for this case, a vector with the position of the right test case number in the list of proposed links is created. A matrix with three columns is then created: in the first column the value is set to 1 if the position of the right test case is lower than, or equal to, the cut point value, otherwise it is 0. In the second column the number of returned links that are not correct is noted; this is the cut point value if the value in the first column is 0, and the cut point value minus one if the correct link is in the returned list. The third column specifies the number of correct links that are NOT found, thus 1 or 0 depending on the number in the first column. Then recall, precision and F-score are computed by summing the different columns according to the formulas in Section 3.4.

RETRO with CM-1 data

The execution of this test was done in a similar way as the test with the industrial data set. After defining the right sources for the folders containing the high and low level requirements and creating a new project, RETRO traced all possible links. Then an RTM was generated with the RETRO built-in function Generate Report, and this constitutes the base for all further calculations in the case of the CM-1 data set tested with RETRO. Because the RETRO generated RTM was to be compared with the correct RTM belonging to the CM-1 data set, it had to be converted into a format suitable for the comparison, which was going to be performed in MATLAB. This was done by manually transforming the generated RTM into a cell matrix containing the name of each requirement in the first column and the corresponding traced test case descriptions in the following columns, see figure 8.2.

Figure 8.1: Screen shot of RETRO with anonymized element text.

Figure 8.2: Screen shot of the transformed RTM with CM-1 data.

Since the results derived from RETRO, and also the file with the correct answers, are sets of one requirement and its corresponding test case descriptions, it is necessary to check that the two files have their sets in the same order. This is done with the MATLAB operation strcmp(), and the test script breaks if there is any mismatch in the order of the sets between the two files. Then a three column matrix is created in a similar way as in the previous case: in the first column the number of correct links that are returned is noted, calculated by using the strcmp() operation to compare the returned results with the correct answer set. The second column once again holds the information about how many incorrect links are returned, and in the third column the total number of correct links that are NOT returned is noted. Recall, precision and F-score are then derived in the same way as in the previous case.

ReqSimile with industrial data

The requirements and test case descriptions, in the form of an MS Access database document, were set as a source, the requirements/test case descriptions were fetched (in ReqSimile this is what constitutes a project) and then preprocessed. The results were collected in the same way as when the data was used with RETRO, by manually noting the position of the correct test case description for each requirement. A screen shot of the test can be seen in figure 8.3. The MATLAB part of this execution is performed in the same way as in the RETRO with industrial data case.

ReqSimile with CM-1 data

Both high and low level requirements were fetched and preprocessed in the same way as when executing the test with the industrial data set. Then the headache began when the results were to be collected and saved in a proper MATLAB format. In ReqSimile one was only able to mark the name of one proposed link at a time. And since all

Figure 8.3: Screen shot of ReqSimile with anonymized descriptions.
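The MATLAB scripts themselves are not reproduced here. Purely as an illustration of the comparison logic described above for the RETRO cases, comparing a tool-generated candidate list against the verified answer set and deriving recall, precision and F-score, a sketch with made-up data (expressed in Java rather than MATLAB, for consistency with the Naïve approach implementation) might look like this:

```java
import java.util.*;

// Illustration of the evaluation logic in Section 8.2: compare the candidate
// links returned by a tool with the verified answer set and derive recall,
// precision and F-score. Requirement and test case names are made up.
public class RtmEvaluation {

    public static void main(String[] args) {
        // Verified answer set: requirement -> linked test case descriptions.
        Map<String, Set<String>> answerSet = Map.of(
                "REQ-1", Set.of("TC-4"),
                "REQ-2", Set.of("TC-2", "TC-7"),
                "REQ-3", Set.of());
        // Candidate links returned by a tool at some cut point.
        Map<String, Set<String>> candidates = Map.of(
                "REQ-1", Set.of("TC-4", "TC-9"),
                "REQ-2", Set.of("TC-2"),
                "REQ-3", Set.of("TC-1"));

        int correctReturned = 0;      // retrieved & relevant
        int incorrectReturned = 0;    // retrieved & irrelevant
        int correctMissed = 0;        // not retrieved but relevant

        for (Map.Entry<String, Set<String>> e : answerSet.entrySet()) {
            Set<String> correct = e.getValue();
            Set<String> returned = candidates.getOrDefault(e.getKey(), Set.of());
            for (String link : returned) {
                if (correct.contains(link)) {
                    correctReturned++;
                } else {
                    incorrectReturned++;
                }
            }
            for (String link : correct) {
                if (!returned.contains(link)) {
                    correctMissed++;
                }
            }
        }

        double precision = (double) correctReturned / (correctReturned + incorrectReturned);
        double recall = (double) correctReturned / (correctReturned + correctMissed);
        double f = 2 * precision * recall / (precision + recall);
        System.out.printf("recall = %.2f, precision = %.2f, F = %.2f%n", recall, precision, f);
    }
}
```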


More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

Key Benefits of Microsoft Visual Studio Team System

Key Benefits of Microsoft Visual Studio Team System of Microsoft Visual Studio Team System White Paper November 2007 For the latest information, please see www.microsoft.com/vstudio The information contained in this document represents the current view

More information

CHAPTER VII CONCLUSIONS

CHAPTER VII CONCLUSIONS CHAPTER VII CONCLUSIONS To do successful research, you don t need to know everything, you just need to know of one thing that isn t known. -Arthur Schawlow In this chapter, we provide the summery of the

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518 International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 INTELLIGENT MULTIDIMENSIONAL DATABASE INTERFACE Mona Gharib Mohamed Reda Zahraa E. Mohamed Faculty of Science,

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics

Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics INTERNATIONAL BLACK SEA UNIVERSITY COMPUTER TECHNOLOGIES AND ENGINEERING FACULTY ELABORATION OF AN ALGORITHM OF DETECTING TESTS DIMENSIONALITY Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree

More information

1 o Semestre 2007/2008

1 o Semestre 2007/2008 Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction

More information

Do you know? "7 Practices" for a Reliable Requirements Management. by Software Process Engineering Inc. translated by Sparx Systems Japan Co., Ltd.

Do you know? 7 Practices for a Reliable Requirements Management. by Software Process Engineering Inc. translated by Sparx Systems Japan Co., Ltd. Do you know? "7 Practices" for a Reliable Requirements Management by Software Process Engineering Inc. translated by Sparx Systems Japan Co., Ltd. In this white paper, we focus on the "Requirements Management,"

More information

Get the most value from your surveys with text analysis

Get the most value from your surveys with text analysis PASW Text Analytics for Surveys 3.0 Specifications Get the most value from your surveys with text analysis The words people use to answer a question tell you a lot about what they think and feel. That

More information

Applying Machine Learning to Stock Market Trading Bryce Taylor

Applying Machine Learning to Stock Market Trading Bryce Taylor Applying Machine Learning to Stock Market Trading Bryce Taylor Abstract: In an effort to emulate human investors who read publicly available materials in order to make decisions about their investments,

More information

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that

More information

TF-IDF. David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt

TF-IDF. David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt TF-IDF David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt Administrative Homework 3 available soon Assignment 2 available soon Popular media article

More information

Requirements Traceability. Mirka Palo

Requirements Traceability. Mirka Palo Requirements Traceability Mirka Palo Seminar Report Department of Computer Science University of Helsinki 30 th October 2003 Table of Contents 1 INTRODUCTION... 1 2 DEFINITION... 1 3 REASONS FOR REQUIREMENTS

More information

ElegantJ BI. White Paper. Considering the Alternatives Business Intelligence Solutions vs. Spreadsheets

ElegantJ BI. White Paper. Considering the Alternatives Business Intelligence Solutions vs. Spreadsheets ElegantJ BI White Paper Considering the Alternatives Integrated Business Intelligence and Reporting for Performance Management, Operational Business Intelligence and Data Management www.elegantjbi.com

More information

Collecting Polish German Parallel Corpora in the Internet

Collecting Polish German Parallel Corpora in the Internet Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska

More information

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1. MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

More information

System Development and Life-Cycle Management (SDLCM) Methodology. Approval CISSCO Program Director

System Development and Life-Cycle Management (SDLCM) Methodology. Approval CISSCO Program Director System Development and Life-Cycle Management (SDLCM) Methodology Subject Type Standard Approval CISSCO Program Director A. PURPOSE This standard specifies content and format requirements for a Physical

More information

Towards Collaborative Requirements Engineering Tool for ERP product customization

Towards Collaborative Requirements Engineering Tool for ERP product customization Towards Collaborative Requirements Engineering Tool for ERP product customization Boban Celebic, Ruth Breu, Michael Felderer, Florian Häser Institute of Computer Science, University of Innsbruck 6020 Innsbruck,

More information

Semester Thesis Traffic Monitoring in Sensor Networks

Semester Thesis Traffic Monitoring in Sensor Networks Semester Thesis Traffic Monitoring in Sensor Networks Raphael Schmid Departments of Computer Science and Information Technology and Electrical Engineering, ETH Zurich Summer Term 2006 Supervisors: Nicolas

More information

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Linear Algebra Methods for Data Mining

Linear Algebra Methods for Data Mining Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 Text mining & Information Retrieval Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki

More information

Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

More information

Verifying Business Processes Extracted from E-Commerce Systems Using Dynamic Analysis

Verifying Business Processes Extracted from E-Commerce Systems Using Dynamic Analysis Verifying Business Processes Extracted from E-Commerce Systems Using Dynamic Analysis Derek Foo 1, Jin Guo 2 and Ying Zou 1 Department of Electrical and Computer Engineering 1 School of Computing 2 Queen

More information

Data Warehouse and Business Intelligence Testing: Challenges, Best Practices & the Solution

Data Warehouse and Business Intelligence Testing: Challenges, Best Practices & the Solution Warehouse and Business Intelligence : Challenges, Best Practices & the Solution Prepared by datagaps http://www.datagaps.com http://www.youtube.com/datagaps http://www.twitter.com/datagaps Contact contact@datagaps.com

More information

Partnering for Project Success: Project Manager and Business Analyst Collaboration

Partnering for Project Success: Project Manager and Business Analyst Collaboration Partnering for Project Success: Project Manager and Business Analyst Collaboration By Barbara Carkenord, CBAP, Chris Cartwright, PMP, Robin Grace, CBAP, Larry Goldsmith, PMP, Elizabeth Larson, PMP, CBAP,

More information

Similarity and Diagonalization. Similar Matrices

Similarity and Diagonalization. Similar Matrices MATH022 Linear Algebra Brief lecture notes 48 Similarity and Diagonalization Similar Matrices Let A and B be n n matrices. We say that A is similar to B if there is an invertible n n matrix P such that

More information

Using Library Dependencies for Clustering

Using Library Dependencies for Clustering Using Library Dependencies for Clustering Jochen Quante Software Engineering Group, FB03 Informatik, Universität Bremen quante@informatik.uni-bremen.de Abstract: Software clustering is an established approach

More information

Model Based System Engineering (MBSE) For Accelerating Software Development Cycle

Model Based System Engineering (MBSE) For Accelerating Software Development Cycle Model Based System Engineering (MBSE) For Accelerating Software Development Cycle Manish Patil Sujith Annamaneni September 2015 1 Contents 1. Abstract... 3 2. MBSE Overview... 4 3. MBSE Development Cycle...

More information

Horizontal Traceability for Just-In-Time Requirements: The Case for Open Source Feature Requests

Horizontal Traceability for Just-In-Time Requirements: The Case for Open Source Feature Requests JOURNAL OF SOFTWARE: EVOLUTION AND PROCESS J. Softw. Evol. and Proc. 2014; xx:xx xx Published online in Wiley InterScience (www.interscience.wiley.com). Horizontal Traceability for Just-In-Time Requirements:

More information

Using Metadata Manager for System Impact Analysis in Healthcare

Using Metadata Manager for System Impact Analysis in Healthcare 1 Using Metadata Manager for System Impact Analysis in Healthcare David Bohmann & Suren Samudrala Sr. Data Integration Developers UT M.D. Anderson Cancer Center 2 About M.D. Anderson Established in 1941

More information

Populating a Data Quality Scorecard with Relevant Metrics WHITE PAPER

Populating a Data Quality Scorecard with Relevant Metrics WHITE PAPER Populating a Data Quality Scorecard with Relevant Metrics WHITE PAPER SAS White Paper Table of Contents Introduction.... 1 Useful vs. So-What Metrics... 2 The So-What Metric.... 2 Defining Relevant Metrics...

More information

Product Review: James F. Koopmann Pine Horse, Inc. Quest Software s Foglight Performance Analysis for Oracle

Product Review: James F. Koopmann Pine Horse, Inc. Quest Software s Foglight Performance Analysis for Oracle Product Review: James F. Koopmann Pine Horse, Inc. Quest Software s Foglight Performance Analysis for Oracle Introduction I ve always been interested and intrigued by the processes DBAs use to monitor

More information

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction Chapter-1 : Introduction 1 CHAPTER - 1 Introduction This thesis presents design of a new Model of the Meta-Search Engine for getting optimized search results. The focus is on new dimension of internet

More information

Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

More information

BUSINESS RULES AND GAP ANALYSIS

BUSINESS RULES AND GAP ANALYSIS Leading the Evolution WHITE PAPER BUSINESS RULES AND GAP ANALYSIS Discovery and management of business rules avoids business disruptions WHITE PAPER BUSINESS RULES AND GAP ANALYSIS Business Situation More

More information

A Business Process Services Portal

A Business Process Services Portal A Business Process Services Portal IBM Research Report RZ 3782 Cédric Favre 1, Zohar Feldman 3, Beat Gfeller 1, Thomas Gschwind 1, Jana Koehler 1, Jochen M. Küster 1, Oleksandr Maistrenko 1, Alexandru

More information

Cryptography and Network Security Department of Computer Science and Engineering Indian Institute of Technology Kharagpur

Cryptography and Network Security Department of Computer Science and Engineering Indian Institute of Technology Kharagpur Cryptography and Network Security Department of Computer Science and Engineering Indian Institute of Technology Kharagpur Module No. # 01 Lecture No. # 05 Classic Cryptosystems (Refer Slide Time: 00:42)

More information

Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class

Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class Problem 1. (10 Points) James 6.1 Problem 2. (10 Points) James 6.3 Problem 3. (10 Points) James 6.5 Problem 4. (15 Points) James 6.7 Problem 5. (15 Points) James 6.10 Homework 4 Statistics W4240: Data Mining

More information

Processing and data collection of program structures in open source repositories

Processing and data collection of program structures in open source repositories 1 Processing and data collection of program structures in open source repositories JEAN PETRIĆ, TIHANA GALINAC GRBAC AND MARIO DUBRAVAC, University of Rijeka Software structure analysis with help of network

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Predicting the Stock Market with News Articles

Predicting the Stock Market with News Articles Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is

More information

Natural Language to Relational Query by Using Parsing Compiler

Natural Language to Relational Query by Using Parsing Compiler Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

Co-Creation of Models and Metamodels for Enterprise. Architecture Projects.

Co-Creation of Models and Metamodels for Enterprise. Architecture Projects. Co-Creation of Models and Metamodels for Enterprise Architecture Projects Paola Gómez pa.gomez398@uniandes.edu.co Hector Florez ha.florez39@uniandes.edu.co ABSTRACT The linguistic conformance and the ontological

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

Introduction to Software Paradigms & Procedural Programming Paradigm

Introduction to Software Paradigms & Procedural Programming Paradigm Introduction & Procedural Programming Sample Courseware Introduction to Software Paradigms & Procedural Programming Paradigm This Lesson introduces main terminology to be used in the whole course. Thus,

More information

A Statistical Text Mining Method for Patent Analysis

A Statistical Text Mining Method for Patent Analysis A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, shjun@cju.ac.kr Abstract Most text data from diverse document databases are unsuitable for analytical

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

JRefleX: Towards Supporting Small Student Software Teams

JRefleX: Towards Supporting Small Student Software Teams JRefleX: Towards Supporting Small Student Software Teams Kenny Wong, Warren Blanchet, Ying Liu, Curtis Schofield, Eleni Stroulia, Zhenchang Xing Department of Computing Science University of Alberta {kenw,blanchet,yingl,schofiel,stroulia,xing}@cs.ualberta.ca

More information

Checking Access to Protected Members in the Java Virtual Machine

Checking Access to Protected Members in the Java Virtual Machine Checking Access to Protected Members in the Java Virtual Machine Alessandro Coglio Kestrel Institute 3260 Hillview Avenue, Palo Alto, CA 94304, USA Ph. +1-650-493-6871 Fax +1-650-424-1807 http://www.kestrel.edu/

More information

Requirements engineering

Requirements engineering Learning Unit 2 Requirements engineering Contents Introduction............................................... 21 2.1 Important concepts........................................ 21 2.1.1 Stakeholders and

More information

Comparison of Standard and Zipf-Based Document Retrieval Heuristics

Comparison of Standard and Zipf-Based Document Retrieval Heuristics Comparison of Standard and Zipf-Based Document Retrieval Heuristics Benjamin Hoffmann Universität Stuttgart, Institut für Formale Methoden der Informatik Universitätsstr. 38, D-70569 Stuttgart, Germany

More information

Traceability Patterns: An Approach to Requirement-Component Traceability in Agile Software Development

Traceability Patterns: An Approach to Requirement-Component Traceability in Agile Software Development Traceability Patterns: An Approach to Requirement-Component Traceability in Agile Software Development ARBI GHAZARIAN University of Toronto Department of Computer Science 10 King s College Road, Toronto,

More information

Text Analytics. A business guide

Text Analytics. A business guide Text Analytics A business guide February 2014 Contents 3 The Business Value of Text Analytics 4 What is Text Analytics? 6 Text Analytics Methods 8 Unstructured Meets Structured Data 9 Business Application

More information

Content-Based Recommendation

Content-Based Recommendation Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches

More information

Software-assisted document review: An ROI your GC can appreciate. kpmg.com

Software-assisted document review: An ROI your GC can appreciate. kpmg.com Software-assisted document review: An ROI your GC can appreciate kpmg.com b Section or Brochure name Contents Introduction 4 Approach 6 Metrics to compare quality and effectiveness 7 Results 8 Matter 1

More information

Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy

Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy The Deep Web: Surfacing Hidden Value Michael K. Bergman Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy Presented by Mat Kelly CS895 Web-based Information Retrieval

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Graphical Web based Tool for Generating Query from Star Schema

Graphical Web based Tool for Generating Query from Star Schema Graphical Web based Tool for Generating Query from Star Schema Mohammed Anbar a, Ku Ruhana Ku-Mahamud b a College of Arts and Sciences Universiti Utara Malaysia, 0600 Sintok, Kedah, Malaysia Tel: 604-2449604

More information

Multivariate Analysis of Variance (MANOVA): I. Theory

Multivariate Analysis of Variance (MANOVA): I. Theory Gregory Carey, 1998 MANOVA: I - 1 Multivariate Analysis of Variance (MANOVA): I. Theory Introduction The purpose of a t test is to assess the likelihood that the means for two groups are sampled from the

More information

Technical Report. The KNIME Text Processing Feature:

Technical Report. The KNIME Text Processing Feature: Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold Killian.Thiel@uni-konstanz.de Michael.Berthold@uni-konstanz.de Copyright 2012 by KNIME.com AG

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Linear Codes. Chapter 3. 3.1 Basics

Linear Codes. Chapter 3. 3.1 Basics Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length

More information

Recovering Traceability Links in Software Artifact Management Systems

Recovering Traceability Links in Software Artifact Management Systems 13 Recovering Traceability Links in Software Artifact Management Systems using Information Retrieval Methods ANDREA DE LUCIA, FAUSTO FASANO, ROCCO OLIVETO, and GENOVEFFA TORTORA University of Salerno The

More information

Process Models and Metrics

Process Models and Metrics Process Models and Metrics PROCESS MODELS AND METRICS These models and metrics capture information about the processes being performed We can model and measure the definition of the process process performers

More information

Project VIDE Challenges of Executable Modelling of Business Applications

Project VIDE Challenges of Executable Modelling of Business Applications Project VIDE Challenges of Executable Modelling of Business Applications Radoslaw Adamus *, Grzegorz Falda *, Piotr Habela *, Krzysztof Kaczmarski #*, Krzysztof Stencel *+, Kazimierz Subieta * * Polish-Japanese

More information

A Case Retrieval Method for Knowledge-Based Software Process Tailoring Using Structural Similarity

A Case Retrieval Method for Knowledge-Based Software Process Tailoring Using Structural Similarity A Case Retrieval Method for Knowledge-Based Software Process Tailoring Using Structural Similarity Dongwon Kang 1, In-Gwon Song 1, Seunghun Park 1, Doo-Hwan Bae 1, Hoon-Kyu Kim 2, and Nobok Lee 2 1 Department

More information

Reputation Network Analysis for Email Filtering

Reputation Network Analysis for Email Filtering Reputation Network Analysis for Email Filtering Jennifer Golbeck, James Hendler University of Maryland, College Park MINDSWAP 8400 Baltimore Avenue College Park, MD 20742 {golbeck, hendler}@cs.umd.edu

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

More information

Efficient Management of Tests and Defects in Variant-Rich Systems with pure::variants and IBM Rational ClearQuest

Efficient Management of Tests and Defects in Variant-Rich Systems with pure::variants and IBM Rational ClearQuest Efficient Management of Tests and Defects in Variant-Rich Systems with pure::variants and IBM Rational ClearQuest Publisher pure-systems GmbH Agnetenstrasse 14 39106 Magdeburg http://www.pure-systems.com

More information

Data Modeling Basics

Data Modeling Basics Information Technology Standard Commonwealth of Pennsylvania Governor's Office of Administration/Office for Information Technology STD Number: STD-INF003B STD Title: Data Modeling Basics Issued by: Deputy

More information

1 Solving LPs: The Simplex Algorithm of George Dantzig

1 Solving LPs: The Simplex Algorithm of George Dantzig Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

More information

Development of an Enhanced Web-based Automatic Customer Service System

Development of an Enhanced Web-based Automatic Customer Service System Development of an Enhanced Web-based Automatic Customer Service System Ji-Wei Wu, Chih-Chang Chang Wei and Judy C.R. Tseng Department of Computer Science and Information Engineering Chung Hua University

More information

Component visualization methods for large legacy software in C/C++

Component visualization methods for large legacy software in C/C++ Annales Mathematicae et Informaticae 44 (2015) pp. 23 33 http://ami.ektf.hu Component visualization methods for large legacy software in C/C++ Máté Cserép a, Dániel Krupp b a Eötvös Loránd University mcserep@caesar.elte.hu

More information

AN SQL EXTENSION FOR LATENT SEMANTIC ANALYSIS

AN SQL EXTENSION FOR LATENT SEMANTIC ANALYSIS Advances in Information Mining ISSN: 0975 3265 & E-ISSN: 0975 9093, Vol. 3, Issue 1, 2011, pp-19-25 Available online at http://www.bioinfo.in/contents.php?id=32 AN SQL EXTENSION FOR LATENT SEMANTIC ANALYSIS

More information

Requirements Specification and Testing Part 1

Requirements Specification and Testing Part 1 Institutt for datateknikk og informasjonsvitenskap Inah Omoronyia Requirements Specification and Testing Part 1 TDT 4242 TDT 4242 Lecture 3 Requirements traceability Outcome: 1. Understand the meaning

More information

Statistics Review PSY379

Statistics Review PSY379 Statistics Review PSY379 Basic concepts Measurement scales Populations vs. samples Continuous vs. discrete variable Independent vs. dependent variable Descriptive vs. inferential stats Common analyses

More information

Methodology of Building ALM Platform for Software Product Organizations

Methodology of Building ALM Platform for Software Product Organizations Methodology of Building ALM Platform for Software Product Organizations Ivo Pekšēns AS Itella Information Mūkusalas 41b, Rīga, LV-1004 Latvija ivo.peksens@itella.com Abstract. This work investigates Application

More information