Requirements Traceability Recovery


MASTER'S THESIS
Requirements Traceability Recovery - A Study of Available Tools
Author: Lina Brodén
Supervisor: Markus Borg
Examiner: Prof. Per Runeson
April 2011


Abstract

This master's thesis focuses on tools that implement information retrieval based methods to recreate, or recover, requirements traceability in the software development process. Requirements traceability is an area that is often neglected or poorly performed and maintained in the development life-cycle, even though it is well known that these activities are associated with high costs when they are badly or inefficiently executed. A tool developed to perform these tasks could therefore help companies both to improve their requirements traceability management and to become more efficient in, for example, tracing requirements to test case descriptions. A number of research prototypes have been developed, and this thesis aims to compare them with each other and to investigate whether there are any major differences between them. A literature study of the field is presented, as well as the available tools and the methods they implement. Furthermore, a simplistic method, called the Naïve approach, is developed and implemented in addition to the prototypes. Testing and comparison between the different tools are performed with real industrial data sets, following a framework especially developed for the purpose of comparing requirements tracing experiments, which is also described in the report. The results indicate that the Naïve approach performs much worse than the other prototypes when there are multiple traceability links between the items of two artifacts, but fairly well when a one-to-one mapping exists. On the other hand, none of the tested tools shows exceptional performance. This implies that further work and more extensive research in the area of requirements traceability recovery is necessary before drawing the conclusion that a company is significantly helped by integrating one of the tools in its development processes.


Acknowledgement

I would like to thank my supervisor, Markus Borg, for all the invaluable help he has provided throughout this master's thesis. Thanks also to my examiner, Per Runeson, for the help, useful discussions and comments on my work along the way. Finally, I would like to thank the company, and the people there, who provided the data that made this master's thesis possible at all.


Table of Contents

1 Introduction

Part I: Theoretical Background
2 Software traceability
3 Information retrieval
  3.1 IR methods
  3.2 Weighting and Thesaurus
  3.3 Stemming and Stop words
  3.4 Measurements
4 IR-based traceability recovery
  4.1 Tools

Part II: Evaluation
5 Framework for Evaluation
6 Phase I: Definition
  6.1 Motivation
  6.2 Purpose
  6.3 Object
  6.4 Hypothesis
  6.5 Perspective
  6.6 Domain
  6.7 Scope
  6.8 Importance
7 Phase II: Planning
  7.1 Experimental design
  7.2 Measurement
  7.3 Product

8 Phase III: Realization
  8.1 Preparation
  8.2 Execution
9 Phase IV: Interpretation
  Interpretation context
  Results
  Extrapolation
  Impact

Part III: Conclusion
  Discussion
    Strengths and Limitations
    Alternative approaches
  Future work and final remarks

Part IV: Appendices
  A Glossary
  B Safety Integrity Level
  C Implementation of The Naïve approach

CHAPTER 1
Introduction

Today the documentation of large-scale software development taking place in many companies and organizations is a fairly ad hoc process, resulting in numerous textual documents in various forms. The documentation itself is also often incomplete and unsatisfactory, and managing and coordinating these documents is a complicated and time consuming, yet crucial, task for achieving and maintaining high efficiency. This is certainly also true for the alignment between requirements and test activities. Since it is well known that these processes are associated with high costs if they are inefficient or poorly executed, there is also a possibility to cut costs in these particular areas if they are performed in a correct and efficient way with the support of appropriate tools. The main idea of this master's thesis is to investigate and compare tools that are specialized in requirements traceability recovery, i.e. the process of identifying, for example, which requirements could be affected if a change is made in one of the test cases. The testing and evaluation of the currently available tools is performed with real industrial data provided by a large international company. One of the reasons for this is that there is a lack of this type of experiment, mainly because many companies are unwilling to share their data with people outside their organization, although this type is the most interesting from an industrial point of view, since this is where the application could generate the greatest benefits. The primary method can be characterized as a combination of a brief survey of the field of traceability recovery in general, and of information retrieval based tools in that area in particular. These tools are then used in a comparison experiment, following a framework for comparing requirements tracing experiments developed by Hayes et al. [1], that aims to establish whether there is any major difference in performance depending on which tool is used, and how much the data set influences the result of the traceability recovery.

Structure

The first part of this report is called Theoretical Background. It briefly describes the field of software traceability and requirements traceability recovery, introduces the key elements of information retrieval and presents the different tools. Part two, Evaluation, is the main part of the thesis report. It follows the structure of the framework used for the evaluation, which is presented at the beginning of that part. Finally, there is a third part with conclusions and further work, followed by a part containing the appendices.

Part I
Theoretical Background


CHAPTER 2
Software traceability

Software traceability, from now on only traceability, is defined as follows by Gotel and Finkelstein: "(...) the ability to describe and follow the life of a requirement, in both a forwards and backwards direction (i.e., from its origins to its subsequent deployment and use, and through all periods of on-going refinement and iteration in any of these phases)" [2]. This is probably the most well-known definition of traceability, even if it partly defines traceability as something only associated with requirements. This is not completely true, since traceability plays an important role in many parts of the software artifact life-cycle. The original intent of traceability was the ability to link product documentation requirements back to stakeholders, but today traceability is used to find relationships between various artifacts, such as code-to-code links, document-to-document links, code-to-document links, et cetera [1]. In this master's thesis the traceability concept involves all forms of traces or connections between software artifacts. One of the main fields in software engineering supported by improved traceability is Verification and Validation (V&V), where the ability to trace artifacts back to requirements can help verify that the system developed actually is the system requested by the stakeholders, that all features have been implemented and that the specifications have been met. Another example in which traceability is very useful is when artifacts are reused, for instance in a new project or in an upgraded version. Clearly, it could be both practical and lucrative for any software development company to maintain traceability, and several standards include traceability as a recommended activity. Sometimes it is even required: safety-critical systems are often legally required to demonstrate that all parts of the developed system and the code used trace back to valid requirements [3]. Despite the advantages mentioned above, many organizations and companies fail to achieve proper traceability documentation for their software projects and/or products. In many cases the requirements traceability matrix (RTM), which is the main component in traceability analysis, has to be constructed after-the-fact by non-developers [1]. That is due to the fact that documentation of traceability during

the progress of the development has been neglected or incomplete (or at least not detailed enough), if conducted at all. Performing this task manually is extremely tedious, error-prone and time consuming, since the requirement specifications often contain a large number of requirements. Because many requirements, as well as many other artifacts, are at least partly written in natural language, Information Retrieval techniques have been proposed, and applied, to automate the process of generating candidate links (links suggested to represent a connection between two given artifacts) [4].

CHAPTER 3
Information retrieval

The idea behind information retrieval (IR), to use computers to automatically search for relevant information in digitally stored documents, was born in the 1950s [5]. The purpose of IR is to reduce information overload, that is, to provide the user only with information related to what he or she is really looking for. A typical IR problem involves a collection of documents (information) and a user information need, which is expressed in the form of a text query. The task is to find the documents in the collection that are considered relevant to the query. The most visible applications of IR techniques today are Web search engines such as Google or Bing. Natural Language Processing (NLP) is a field of computer science and linguistics. It is closely related to IR and is concerned with the interaction between computers and human languages. The intention is to parse and process human languages into a format that can be understood by machines without losing aspects that a human could read between the lines, i.e. the process is not only about chopping the text up into words and comparing them against other words, which is pretty much what IR does, but rather about making the computer understand that two words can mean the same thing even if their grammatical forms differ or if they are synonyms. Actually, it is hard to make a definite distinction between information retrieval and natural language processing. To avoid using both IR and NLP notation, this report, for the sake of simplicity and to avoid misunderstanding, refers to all methods and techniques as IR, even if they might actually be classified as NLP activities.

3.1 IR methods

The techniques involved in IR are essentially divided into three categories of models. The first is the set-theoretical models, in which documents are represented as sets of words and similarity is calculated from set-theoretical operations and methods [6]. These models have been found to be a bit too simplistic in the context of traceability link generation and are thus left outside the further discussion. Basically, the main

focus of this short survey of the IR field will be on the techniques actually being used in the traceability recovery area today. As a result, things that are part of IR but not relevant to the application in the traceability field might be left out.

Algebraic models are the second category of IR models. The models in this category use vectors or matrices to represent documents as well as queries. When the similarity between a document and a query has been derived, it is represented as a scalar value. The Vector Space Model (VSM) [7], which is one of the most frequently used models in practice, belongs to this category. Documents are represented as vectors in a multi-dimensional space, where each term in the vectors corresponds to one dimension. A text document $d_j$ is represented by the vector $d_j = (t_{1j}, t_{2j}, \ldots, t_{nj})$, where $t_{ij}$ represents the number of times the term $t_i$ appears in $d_j$. An example with three vectors in three dimensions can be seen in figure 3.1. The query is represented in the same way, $q = (t_{1q}, t_{2q}, \ldots, t_{nq})$. When the vectors representing the documents are put together into one matrix it is called the term-document matrix, in which each term is represented by a row and each document by a column. This means that each cell in the matrix contains the number of times the associated term appears in the indicated document. The similarity is computed in terms of distances in this vector space.

Figure 3.1: Vector Space Model example.

Latent Semantic Indexing (LSI) [8] is a further development of the basics in VSM. One of the drawbacks of the VSM model is its inability to handle any synonyms present in the documents. LSI overcomes this drawback and is also able to handle polysemy, words that have more than one meaning. The LSI model is based on the same term-document matrix as the VSM method uses, which usually is very large and very sparse. Once it is built, Singular Value Decomposition (SVD) is performed on the matrix to reduce its rank, make it more manageable and determine patterns among the terms and concepts. While performing the SVD one has to define the right size of the new matrix. This is a difficult decision, since the space has to be big enough to include the right level of detail without including too much noise, at the same time as the size has to be heavily reduced without excluding valuable information. This is often done as a parameter setting in the tools implementing the LSI model [9].

The third category is the probabilistic models. This category treats the retrieval process as probabilistic inference, and the similarity is calculated as the probability that a document is relevant for a given query. The probabilistic relevance model estimates the probability that a document $d_j$ is relevant to a query $q$. The model assumes that this probability depends only on the query and document representations. It also assumes that there is a set of documents that is preferred as the answer set for each query $q$. The goal is to maximize the overall probability of relevance for this set, to this

specific user. Furthermore, the prediction is that all documents in this set are relevant, whilst documents not included in the set are non-relevant to the current query.

3.2 Weighting and Thesaurus

There are different ways to assign weights to the terms, representing the importance of their occurrence in the document now represented as a vector. The raw weighting is computed simply as the number of times the term occurs in the document, as in the VSM description above. When using binary weighting, each term is assigned a weight of 1 if the term exists in the document and 0 otherwise. Term frequency (tf) weighting is used when one wants to assign each term a weight proportional to the frequency of the term's occurrences in the document. It is calculated by dividing the raw value (the number of times the word occurs in the document) by the total number of words in the document in which the term is included. Another way to assign proportional weights to the terms is inverse document frequency (idf) weighting. It assigns a weight to each term obtained by dividing the total number of documents by the number of documents in which the term is included, and then taking the logarithm of this ratio. This weighting scheme is based on the idea that the more documents a term occurs in, the less information it provides in a semantic sense. If, for example, the word car appears in all of the documents in a collection, then car does not say much about the content of each of the documents. But if car is only present in one of the documents, then it provides the system with important information that can distinguish that particular document from the rest if the user wants information about cars. Applying both the term frequency weighting and the inverse document frequency weighting gives the tf-idf weighting. It is supposed to combine the best of the two methods to get a term weight that is high when the term occurs many times in a small number of documents, and low when the term occurs in many documents. This weighting method is by far the most frequently used. The use of a thesaurus is another common method applied in IR to reduce the number of different terms in the process. In its simplest form it is a set of triples (t, t', α), where t and t' are matching thesaurus terms (keywords or phrases) and α is the similarity coefficient between them. An example of a simple thesaurus triple could be (error, fault, 0.9). This technique is often used to extend the tf-idf weighting method.

3.3 Stemming and Stop words

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. It can be very effective and useful to apply this technique, but there are a lot of challenges in the implementation of these kinds of algorithms, and there have been ever since the first paper about the technique was published in 1968 [5]. While working with IR techniques on large data sets including a lot of text, it is sometimes useful to filter out words that do not have any specific meaning in a text, for example words such as the, are, at and so on. These kinds of words are called stop words, and a list of them is referred to as a stop word list. It is important to notice, though, that there is no universal, definite stop word list, and the method is not even implemented by all IR techniques.
A stop word function, on the other hand, is a rule that excludes words whose length is less than a certain number decided by the user.
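To make the vector space representation of Section 3.1 and the tf-idf weighting of Section 3.2 concrete, the following is a minimal Java sketch (Java is also the language used for the Naïve approach in Chapter 6). It is not taken from any of the evaluated tools; class, method and variable names are illustrative only, and no stemming, thesaurus or stop word handling is included.

```java
import java.util.*;

// Minimal sketch of VSM with tf-idf weighting and cosine similarity.
public class TfIdfExample {

    // Raw term frequencies of one document.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // tf-idf vector of one document, given document frequencies for the collection.
    static Map<String, Double> tfIdf(Map<String, Integer> counts,
                                     Map<String, Integer> docFreq, int nDocs) {
        int docLength = counts.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double tf = (double) e.getValue() / docLength;                   // term frequency
            double idf = Math.log((double) nDocs / docFreq.get(e.getKey())); // inverse document frequency
            vector.put(e.getKey(), tf * idf);
        }
        return vector;
    }

    // Cosine similarity between two tf-idf vectors, the similarity measure of VSM.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "the system shall log every error",
                "verify that the error log is written",
                "the user interface shall use the company colours");
        // Document frequencies: in how many documents each term occurs.
        Map<String, Integer> docFreq = new HashMap<>();
        List<Map<String, Integer>> counts = new ArrayList<>();
        for (String d : docs) {
            Map<String, Integer> c = termCounts(d);
            counts.add(c);
            for (String term : c.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> req = tfIdf(counts.get(0), docFreq, docs.size());
        Map<String, Double> tc1 = tfIdf(counts.get(1), docFreq, docs.size());
        Map<String, Double> tc2 = tfIdf(counts.get(2), docFreq, docs.size());
        System.out.println("similarity(req, tc1) = " + cosine(req, tc1));
        System.out.println("similarity(req, tc2) = " + cosine(req, tc2));
    }
}
```

Note how a term such as "the", which occurs in every document, receives an idf of zero and therefore contributes nothing to the similarity, which is exactly the intuition behind the idf weighting described above.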

3.4 Measurements

To evaluate the performance of different IR methods, there are some basic traditional measures to apply to the results generated by the IR process. The two most well established measures are precision, which is a measure of exactness, and recall, which measures the completeness of the retrieved results. These two metrics are used in all papers applying IR techniques to requirements tracing. When using precision and recall, the documents are classified into four categories, which can be seen in figure 3.2: retrieved & irrelevant, not retrieved & irrelevant, retrieved & relevant, and not retrieved but relevant.

Figure 3.2: Classification of documents.

Precision is defined as the number of relevant documents retrieved by a search, divided by the total number of documents retrieved by the same search, figure 3.3:

$$\mathrm{precision} = \frac{|\{\text{Retrieved \& relevant}\}|}{|\{\text{Retrieved \& relevant}\}| + |\{\text{Retrieved \& irrelevant}\}|}$$

Recall is defined as the number of relevant documents retrieved by a search, divided by the total number of documents that are relevant and hence should have been retrieved by the search, figure 3.3:

$$\mathrm{recall} = \frac{|\{\text{Retrieved \& relevant}\}|}{|\{\text{Retrieved \& relevant}\}| + |\{\text{Not retrieved but relevant}\}|}$$

Figure 3.3: Recall and Precision.

There is a dependency between recall and precision: as recall goes up, precision decreases, and as recall goes down, precision increases. As one can understand, it is easy to achieve 100% recall by simply returning all documents no matter what query is used in the search, but this would result in decreased precision. Recall on its own is therefore not enough to measure the performance of an IR method; it has to be combined with precision, or another measure accounting for the non-relevant documents. The results of these two

measures are often presented as precision versus recall graphs, see figure 3.4, in which it is easy to see the trade-off that has to be made between the number of retrieved documents and the number of relevant documents.

Figure 3.4: Recall vs precision example.

An alternative way of displaying these values is through the harmonic mean, or F-score as it is sometimes called. The formula for calculating the F-score is

$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

There are also a number of alternative, secondary measures used to evaluate the performance of different IR techniques, mainly from the analyst's point of view. Some of them are Lag, DiffAR and DiffMR [10], but these will not be further described since they are not used in this study.
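As a small worked illustration of the formulas above (not tied to any particular tool, with made-up numbers), precision, recall and F-score can be computed from the four document categories of figure 3.2 as follows:

```java
// Illustrative only: precision, recall and F-score from the four
// document categories of figure 3.2.
public class Metrics {

    static double precision(int retrievedRelevant, int retrievedIrrelevant) {
        return (double) retrievedRelevant / (retrievedRelevant + retrievedIrrelevant);
    }

    static double recall(int retrievedRelevant, int notRetrievedRelevant) {
        return (double) retrievedRelevant / (retrievedRelevant + notRetrievedRelevant);
    }

    static double fScore(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Example: a search returns 8 documents, 6 of them relevant,
        // while 4 relevant documents are missed.
        double p = precision(6, 2);   // 6 / 8  = 0.75
        double r = recall(6, 4);      // 6 / 10 = 0.60
        System.out.printf("precision = %.2f, recall = %.2f, F = %.2f%n",
                p, r, fScore(p, r));  // F = 2*0.75*0.60/1.35, roughly 0.67
    }
}
```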


CHAPTER 4
IR-based traceability recovery

The process of constructing or re-constructing traceability links after-the-fact is called traceability recovery. As mentioned earlier, it is a tedious and time-consuming procedure to perform manually, and because of that, and the fact that many software artifacts are written in natural language, various IR methods have been proposed and applied to automate at least parts of the process. Basically, the traceability recovery process can be described as follows [1]. The starting point is two different artifacts. In this example two documents will be used, one with high-level requirements and one with design-level requirements, but tracing can be applied to any artifacts of choice. The first step is to read through all of the requirements, compare them against each other and generate a list of candidate links for each high-level requirement, that is, a list of all design-level requirements that might be connected to the current high-level requirement, in the sense that they influence each other if one of them is changed. There are different ways of doing this. The most intuitive approach would be to start with the first high-level requirement, read through all the design requirements and write down which design requirements, if any, trace back to the current high-level requirement, then move on to the second high-level requirement and start the process all over again, continuing in the same way until all of the high-level requirements have been examined. If the process is carried out in such a way, it is easy to understand that the number of comparisons that has to be made rapidly becomes very high. This step in the traceability process is called the candidate link list generation. When the list has been generated, the candidate links have to be examined more closely and assigned the binary value link / no link, which is the second step in the traceability process. This step is always done by a human analyst, in order to ensure that an independent V&V inspection based on the results of the traceability analysis can be trusted. In the examination process the analyst may also choose to consult other

related artifacts in order to be able to decide whether a candidate link should be marked as link or no link [1]. As one can see, it would certainly facilitate the tracing process if the candidate link lists could be generated automatically, without requiring an analyst to go through all of the requirements over and over again manually. This is also where IR is mainly applied in the traceability recovery process. The functionality is often embedded in some kind of traceability recovery tool, a software program assigned to the task of reconstructing traceability. There are various types of recovery tools, with different purposes, implementing various techniques. In this master's thesis, however, only tools based on IR techniques and especially designed and used for traceability recovery will be discussed. The tools studied and evaluated are briefly presented below, together with some other tools/prototype tools currently being developed.

4.1 Tools

RETRO
Hayes et al. [11] presented REquirements TRacing On-target (RETRO), a special-purpose requirements tracing tool which automates the generation of RTMs. RETRO implements a number of IR methods in order to do so. The default tracing technique in the tool is the VSM method with tf-idf term weighting, but it also includes the LSI technique and the possibility to combine different methods and parameter settings to achieve the best results.

ReqSimile
Natt och Dag et al. have developed and released ReqSimile [12], a tool performing automated similarity analysis of textual requirements. ReqSimile is an open-source application and is freely available for anyone interested in using it. The method implemented in ReqSimile is VSM.

Poirot:TraceMaker
Lin et al. have applied a probabilistic approach to dynamically generate traceability links [13]. Their method is implemented in the Poirot:TraceMaker tool, a web-based tool developed to support traceability recovery between different artifacts.

ReqAnalyst
Lormans et al. [14] have developed a prototype tool called ReqAnalyst in order to test different traceability recovery approaches, and in particular LSI methods. The tool allows the user to change the parameter settings of the LSI reconstruction. Once it has executed, ReqAnalyst provides the user with the reconstructed traceability matrix, with the possibility of generating different requirements views, which make it possible to obtain continuous feedback on the progress of ongoing software development or maintenance projects.

TraceViz
A prototype tool, TraceViz, used to visualize traceability links, was presented by Marcus et al. [15] to support users during recovery and maintenance. TraceViz is an Eclipse plug-in and mainly uses the LSI method to recover traceability links between the artifacts.

The IR technique used is not unique; the key advantage of this tool is instead the visualization of the recovered links.

ADAMS Re-Trace
Another tool based on the LSI method is the ADAMS Re-Trace tool. Re-Trace is integrated in the artifact management system ADvanced Artifact Management System (ADAMS), and is also available in the Eclipse-based client of ADAMS. Re-Trace was developed and integrated into ADAMS by Oliveto [5] in order to provide the ADAMS users with support for automatic traceability recovery as a complement to the manual creation of traceability links that is already enabled in the system.

A summary of the different tools can be seen in Table 4.1.

Table 4.1: Summary of IR-based traceability recovery tools.

Tool               Method(s)      Type of program         Available
RETRO              VSM, LSI       Stand-alone             Yes (available version implements VSM only)
ReqSimile          VSM            Stand-alone             Yes
Poirot:TraceMaker  Probabilistic  Web-based               No
ReqAnalyst         LSI            Stand-alone             No
TraceViz           LSI            Eclipse plug-in         No
ADAMS Re-Trace     LSI            Integrated or plug-in   No


Part II
Evaluation


CHAPTER 5
Framework for Evaluation

The testing part of this master's thesis is based on A Framework for Comparing Requirements Tracing Experiments by Hayes and Dekhtyar [1]. In the paper the authors "(...) propose a framework for developing, conducting and analyzing experiments on requirements traceability". The framework and its different phases are described in more detail as they are used in each of the following four chapters. A summary of the experimental framework can be seen in figure 5.1.

Figure 5.1: Summary of Framework for Evaluation.

CHAPTER 6
Phase I: Definition

According to Hayes et al. [1], before starting the actual testing activities the researcher has to decide on the scope and object of the project. These activities are referred to as the definition phase and consist of eight parts: Motivation, Purpose, Object, Hypothesis, Perspective, Domain, Scope and Importance.

6.1 Motivation

The motivation for performing this experiment, including the examination, comparison and evaluation of the performance of different IR-based traceability recovery tools when using real industrial data, is to:

- Understand the differences in performance there might be between tools implementing different IR techniques.
- Learn about the behavior of the tools when they are provided with industrial data, since previous experiments have mainly been performed with specially compiled data sets.
- And partly, confirm the results derived in previous experiments.

6.2 Purpose

The purpose of performing this experiment is to:

- Examine the field of traceability recovery, and especially the tools publicly available.
- Test different tools on the same industrial data set.
- Evaluate the performance of the tools.
- Partly, validate the results from previous experiments.

6.3 Object

The objects of study are the available IR-based traceability tools: ReqSimile and REquirements TRacing On-target (RETRO). Unfortunately no other tools among those mentioned in the Tools section (Section 4.1) are available for testing. In order to get a very basic tool/method to include in the comparison, a simple, naïve approach was implemented in Java; more information about this implementation can be found in the next subsection. From the beginning, Google Desktop, a tool developed to make searching through a user's computer as easy as searching the web with the Google search function, was meant to be a reference point in the comparison and evaluation of recall/precision. But we failed to configure it properly: no matter what the settings were, Google Desktop managed to find about 280 local system files and indexed them. This would certainly influence the recall/precision values used in the evaluation, since the sets of requirements and test case descriptions only consisted of about 220 files each. As a result, we decided to exclude Google Desktop as a benchmark.

The Naïve approach - a Java implementation

As already mentioned, a naïve approach was implemented, figure 6.1, to be used as a benchmark when evaluating the results of the other tools. The idea was to be able to evaluate the simplest method possible, which is the direct comparison of, for example, which test case descriptions have the most words in common with a specific requirement. This approach does not take any weighting, stemming or stop words/stop word lists into consideration. The implementation in Java consists of three classes, N_app (the main program), Item and res_obj, which basically work as follows.

Required input format:

- One folder with requirements, each saved as a single file where the filename is used as a unique identifier.
- One folder with test case descriptions, each saved as a single file where the filename is used as a unique identifier.

The folders with requirements and test cases are given as sources of files; the main program goes through the two folders, creates vectors of File objects and lists the names of all files in each folder. Then vectors of Item objects are created; each Item represents a requirement or a test case description and holds the name and the contents of the requirement/test case, the latter saved as a vector of Strings, each representing a word or token. The String method replaceAll() is used here with the argument ("[^a-zA-Z,_]", "") in order to get Strings containing only letters or underscores (motivated by the fact that many names in programming code contain this special token) in the comparison vector. When these two Item vectors have been created, the comparison between each pair of Item objects is executed. This is performed through the String operation compareTo(), and to be considered equal the two Strings being examined must be exactly alike; for example, cat and cats are considered different. This is motivated by the fact that considering these two words as alike/partly alike is a typical case of stemming, and it has already been mentioned that stemming is not taken into consideration in this approach. The results are saved in res_obj objects, which contain the names of the two compared Items and the number of words that are exactly the same in both Items. The res_obj objects are then put in a vector, sorted in descending order according to the number of matching words, and finally all

these vectors (each representing one requirement and its comparisons with all the test cases) are put in one big vector, which is used to print the results the user wants. For example, the user could ask the program to write out the names of the five (or any arbitrary number of) test cases that match each requirement best. One could also choose to only print the names of test cases that have at least ten (or any arbitrary number of) matching words.

Figure 6.1: Flowchart for the Naïve approach.

6.4 Hypothesis

The hypotheses to be verified or rejected are the following:

- Null hypothesis - no difference in recall/precision between the tools exists.
- Alternative hypothesis 1 - RETRO shows better recall/precision than ReqSimile when evaluated on the special purpose data set, CM-1.
- Alternative hypothesis 2 - RETRO achieves better recall/precision than ReqSimile when evaluated on the industrial data set.
- Alternative hypothesis 3 - No tool performs better than the naïve approach.

The motivation for the null hypothesis is that, since this is a comparison, the most basic difference that could occur is that the tools show different values of recall/precision. For the first alternative hypothesis, RETRO is suggested to perform better than ReqSimile on the CM-1 data set because it has been tested and evaluated on that particular data set before [11]; the statement certainly could have been the opposite, ReqSimile performing better than RETRO. The same goes for the second alternative hypothesis, but here it is only for the sake of simplicity that RETRO is the one claimed to perform better, since neither of the tools, according to our survey of the existing literature, has been thoroughly tested with industrial software artifacts before. The third and last alternative hypothesis is stated to investigate whether the traceability tools actually are useful and perform better than a simple mapping between two sets; see Section 6.3 for more details about the naïve approach.
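To make the "simple mapping between two sets" concrete, a rough sketch of the core comparison step of the Naïve approach (Section 6.3) is shown below. The full implementation is given in Appendix C; this sketch simplifies it by counting distinct common words instead of keeping res_obj objects, and the folder names are made up for the example.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Simplified sketch of the Naïve approach's comparison step: count exactly
// matching words between a requirement and each test case description
// (no weighting, stemming or stop word handling). Folder names are hypothetical.
public class NaiveSketch {

    // Keep only letters and underscores, mirroring the replaceAll() filtering
    // described in Section 6.3.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String word : text.split("\\s+")) {
            String cleaned = word.replaceAll("[^a-zA-Z_]", "");
            if (!cleaned.isEmpty()) {
                result.add(cleaned);
            }
        }
        return result;
    }

    // Number of words occurring in both items; "cat" and "cats" do not match.
    static int matchingWords(Set<String> requirement, Set<String> testCase) {
        Set<String> common = new HashSet<>(requirement);
        common.retainAll(testCase);
        return common.size();
    }

    public static void main(String[] args) throws IOException {
        Map<String, Set<String>> testCases = new TreeMap<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get("testcases"))) {
            for (Path tc : stream) {
                testCases.put(tc.getFileName().toString(), words(Files.readString(tc)));
            }
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get("requirements"))) {
            for (Path req : stream) {
                Set<String> reqWords = words(Files.readString(req));
                // Rank the test cases for this requirement by descending match count.
                testCases.entrySet().stream()
                        .sorted((a, b) -> matchingWords(reqWords, b.getValue())
                                        - matchingWords(reqWords, a.getValue()))
                        .limit(5)   // e.g. the five best matching test cases
                        .forEach(e -> System.out.println(req.getFileName() + " -> "
                                + e.getKey() + " (" + matchingWords(reqWords, e.getValue())
                                + " matching words)"));
            }
        }
    }
}
```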

6.5 Perspective

The performance study of the traceability tools is carried out from the perspective of a student, which most likely falls into the researcher role mentioned in the description of the framework.

6.6 Domain

The domain of the experiments is that of an engineer who will use the tools in practice when tracing requirements, or any other traceable artifact, between software artifacts.

6.7 Scope

The scope of an experimental study is decided by looking at the size of the domains considered. Accordingly, the study of traceability tools is placed in the category of multi-project studies, which means that the experiments examine objects across a single team and a set of programs. Other categories are blocked subject-project, replicated project and single project.

6.8 Importance

There are two levels of importance according to the framework: domain importance and object of study importance. These are evaluated on the following scale: safety-critical (potential loss of human life), mission-critical, quality of life (at which level the performance is), or convenience. This study of traceability tools is considered safety-critical in terms of domain importance, since the studied module has a SIL (Safety Integrity Level) classification (see Appendix B for more information), while the object of study importance is graded as quality of life.

CHAPTER 7
Phase II: Planning

After the definition phase it is time to plan how the experiments are actually going to be carried out. The framework defines three parts in this planning phase: Experimental design, Measurement and Product.

7.1 Experimental design

The primary independent variable, the factor that the researcher hypothesizes will cause the results of the experiment, is in most requirements tracing studies the requirements tracing technique. This is also what determines the external validity, referring to the ability to generalize the results. In this case it means that the tool used to perform the tracing process is the independent variable and hence determines the level of generalization. But this is also affected by the representation used for the data set, and how the set is prepared and managed. In this experiment there are two different data sets, the industrial data set and the CM-1 set, both described further on. Other important independent variables influencing the external validity are the type and size of the project artifacts being traced, as well as their quality. The believability of the relation between the hypothesized causes and the results of the experiments is referred to as the internal validity.

Requirements tracing technique

Since the object of study is IR-based traceability tools, this is also what separates the different items in the experiment. RETRO implements both the Vector Space Model and the LSI technique according to its specification paper [11], while ReqSimile only adopts the VSM. However, when performing the tests the settings in RETRO were not modifiable, and the tests were executed with VSM, which is RETRO's default technique. In the Naïve approach, words in one set of traceability data are simply compared with words in another set, and the items containing most matching words are returned.

Type and size of project artifacts

A traceset in requirements tracing is typically a pair of software artifacts divided into lower level requirements, along with an answerset, which is a validated mapping between the two artifacts. In requirements traceability experiments, the ability to perform well, and achieve accuracy, for small as well as large tracesets is referred to as scalability. The artifacts in the industrial data set in this study, which was provided by a large company active in the power and automation sector, consist of a test case document at design level, figure 7.1, which should be traced to a requirements document at the same level. The traces have been determined and verified manually, thus the traceset is complete. The test case document consists of about 220 test case descriptions and the requirements document of about 225 requirements, so the total number of combinatorial links becomes 220 x 225 = 49,500. Since a traceset with more than 3,000 combinatorial links is considered large, the test-requirements traceset certainly qualifies for this category. The CM-1 data set should also be considered a large traceset, as it contains 235 high level requirements, 220 low level requirements and a manually verified answer set. All the test case descriptions and requirements in both data sets are of a textual nature. To characterize the data sets further, a word count was performed with MS Word; the results are presented in Table 7.1, which gives the number of words, the number of items and the average number of words per item for each of the four documents (industrial requirements, industrial test case descriptions, CM-1 low level requirements and CM-1 high level requirements).

Table 7.1: Number of words in each of the data set documents.

Quality of artifacts

In the industrial traceset used in this study not every test case description traces to a requirement, but every requirement traces to exactly one test case. Because there are more requirements than test case descriptions in the industrial data set, this means that some requirements trace to the same test case description, while some test case descriptions lack traces to requirements. Since the data set is from a SIL-classified component (more information about SIL is given in Appendix B), it is required to have full traceability between the stated requirements and the test case descriptions. This is what gives us the answer key, so that we can determine whether the tools have performed well or not. The relationship between requirements and test case descriptions in this situation means that the ability of the tested tools to detect both test cases without any requirement origin and untested requirements can be validated, because of the one-to-one mapping that exists. The CM-1 traceset contains high level requirements that trace to low level requirements, as well as requirements that do not have any trace to the low level requirements.

Dependent variables

As in many other requirements tracing experiments, recall and precision are the dependent variables in this study. In order to examine all possible combinations of tools and

data sets, which constitute the independent variables of this experiment, it is required to perform six (= number of tools x number of data sets) different tests to get a full factorial design [1].

7.2 Measurement

The measures used in this study are mainly recall and precision, which are well established, validated measurements from the information retrieval field. They are both described in Section 3.4, where the F-score, another measure used in this comparison, is also explained.

7.3 Product

In the definition of the framework [1], the authors describe the product as follows: "In traceability experiments the products are usually the items being traced while a model or process is being evaluated." In this case it means that the products are the two sets of data, the industrial data set and the CM-1 data set, that is, two levels of documentation. In the industrial data set both requirements and test case descriptions are at design level, while the CM-1 data set consists of higher level requirements and lower level requirements. The different levels are schematically shown in the V-model, figure 7.1. More information about the different data sets and their content can be found in Section 8.1.

Figure 7.1: The V-model with the different levels of documents.


CHAPTER 8
Phase III: Realization

When the experiment is defined and planned it is time to start the actual experimentation, the realization phase, which consists of three parts: Preparation, Execution and an optional Analysis.

8.1 Preparation

In many requirements tracing experiments a pilot study, a smaller experiment just to get initial results, is conducted. In this case, though, no pilot study was performed for RETRO and ReqSimile, since the tools were already available and it was just as easy to provide them directly with the real data, which in any case had to be prepared. The required input formats are listed in Table 8.1. In the case of the Naïve approach the situation was a little different. Since the program was going to be developed from scratch, a smaller test set was necessary in order to manually follow each step and verify that the program was doing the right thing and that the results were reasonable. Further comments about this test set can be found in the description of the conversion of industrial data into the input format required by the Naïve approach.

Table 8.1: Required input format by different tools.

Tool            Required Input Format
RETRO           Single files, without file extensions
ReqSimile       MS Access database document
Naïve approach  Single files, with or without file extensions

Industrial data converted into the ReqSimile required input format

Since the industrial data files were given in MS Word format, this preparation consisted of going through both the requirements document and the test case document in order to get each item and convert it into an MS Access database format, which is the input format required by ReqSimile. This was done simply by manually copying the selected item in the Word file and pasting it into the database document. At the same time all unwanted data, such as formatting, numbering etc., was deleted. Basic words that occurred in all test case and requirement items were also deleted, since they would not contribute anything to the IR processes in the tools. When all data had been transformed into the proper format, it had also been given new numbers, since the only numbering used in the original Word document was a built-in list numbering, which unfortunately could not be used in the database document. A new answer set was made manually, where the name of each test case, together with its new number and the matching requirement number, was put in a list. If there was no matching requirement, this was noted as well.

Industrial data converted into the RETRO required input format

When using RETRO it is required that all input data is in the form of single files, one for each requirement/test case, without any file extension, and since the industrial data was given in Word files it had to be transformed into this format. This process was performed in a similar way as the conversion into the ReqSimile format, through manual copy/paste. To make the process a little less laborious, the previously created database document was used as the main source, to avoid having to go through all the data one more time just to delete unwanted data that had already been deleted when converting the data into the ReqSimile format. The numbering created there was used to give each file a unique name. The result of the conversion was two folders containing 224 files with requirements and 218 files with test case descriptions.

Industrial data converted into the Naïve-approach required input format

Since the RETRO required input format also satisfies the input format required by the Naïve approach, there was no need to convert the industrial data a third time. On the other hand, a smaller test set had to be constructed in order to be able to develop the program in a proper way. This was done simply by taking a small subset of the industrial data set, in this case the first eleven requirements and the first eleven test case descriptions. In this way one could monitor the development process properly and make sure that the Naïve approach performed as expected.

CM-1 data converted into the ReqSimile required input format

The CM-1 data set, from the CM-1 project of the NASA Metrics Data Program, had to be converted from single file format into a database in order to be used with ReqSimile. Each file was opened, and the content was copied and pasted into the MS Access database document together with the name of the file, which is also the name of the requirement/test case.

CM-1 data converted into the RETRO required input format

There was no need to convert the CM-1 data before using it with RETRO, since this data set is already prepared and in the right format for RETRO use.

CM-1 data converted into the Naïve-approach required input format

Due to the fact that the filenames (the identifiers) are treated as Strings in the implementation of the Naïve approach, one can use the same input format as in the RETRO case, hence there is no need to convert the data.

8.2 Execution

The actual testing was divided into six different cases. Each case is further explained in the following subsections.

RETRO with industrial data

After the transformation of data into the RETRO required input format, the execution of this test was straightforward. The folders containing the requirements and test case descriptions were given as sources and the program created a new project. A project in RETRO is exactly what it sounds like, the project one is working with at the moment; it is given a name and saved, so that it is possible to keep the work that has been done and continue at another occasion. Then all traces between high level document elements (in this case, the requirements) and low level document elements (test case descriptions) were calculated and displayed. The filter, which gives the opportunity to see only traces with weights greater than a specified value, was kept at 0.0, which is the default value, in order to see all possible traces. An overview of the RETRO screen layout can be seen in figure 8.1. Since in this case each requirement only had one corresponding test case description, according to the verified manual traces belonging to the data set, the position of the correct test case description in the list of proposed links was noted manually. If the requirement did not have any corresponding test case, the position of its correct test case was noted as 300, with the sole intent of making the results suitable for further evaluation in MATLAB. In the MATLAB script for this case, a vector with the position of the right test case number in the list of proposed links is created. A matrix with three columns is then created: in the first column the value is set to 1 if the position of the right test case is lower than, or equal to, the cut point value, otherwise it is 0. In the second column the number of returned links that are not correct is noted; this is the cut point value if the value in the first column is 0, and the cut point value minus one if the correct link is in the returned list. The third column specifies the number of correct links that are NOT found, thus 1 or 0 depending on the number in the first column. Then recall, precision and F-score are computed by summing the different columns according to the formulas in Section 3.4.

RETRO with CM-1 data

The execution of this test was done in a similar way as the test with the industrial data set. After defining the right sources for the folders containing the high and low level requirements and creating a new project, RETRO traced all possible links. Then an RTM was generated with the RETRO built-in function Generate Report, and this constitutes the base for all further calculations in the case of the CM-1 data set tested with RETRO. Because the RETRO generated RTM was to be compared with the correct RTM belonging to the CM-1 data set, it had to be converted into a format suitable for the comparison, which was going to be performed in MATLAB. This was done by manually transforming the generated RTM into a cell matrix containing the name of each requirement in the first column and the corresponding traced test case descriptions in the following columns, see figure 8.2.

Figure 8.1: Screen shot of RETRO with anonymized element text.

Figure 8.2: Screen shot of the transformed RTM with CM-1 data.

Since the results derived from RETRO, and also the file with the correct answers, are sets of one requirement and its corresponding test case descriptions, it is necessary to check that the two files have their sets in the same order. This is done with the MATLAB operation strcmp(), and the test script breaks if there is any mismatch in the order of the sets between the two files. Then a three column matrix is created in a similar way as in the previous case: in the first column the number of correct links that are returned is noted, calculated by using the strcmp() operation to compare the returned results with the correct answer set. The second column once again holds the information about how many incorrect links are returned, and in the third column the total number of correct links that are NOT returned is noted. Recall, precision and F-score are then derived in the same way as in the previous case.

ReqSimile with industrial data

The requirements and test case descriptions, in the form of an MS Access database document, were set as a source, the requirements/test case descriptions were fetched (in ReqSimile this is what constitutes a project) and then preprocessed. The results were collected in the same way as when the data was used with RETRO, by manually noting the position of the correct test case description for each requirement. A screen shot of the test can be seen in figure 8.3. The MATLAB part of this execution is performed in the same way as in the RETRO with industrial data case.

ReqSimile with CM-1 data

Both high and low level requirements were fetched and preprocessed in the same way as when executing the test with the industrial data set. Then the headache began when the results were to be collected and saved in a proper MATLAB format. In ReqSimile one was only able to mark the name of one proposed link at a time. And since all

Figure 8.3: Screen shot of ReqSimile with anonymized descriptions.
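The MATLAB scripts themselves are not reproduced here. Purely as an illustration of the comparison logic described above for the RETRO cases, comparing a tool-generated candidate list against the verified answer set and deriving recall, precision and F-score, a sketch with made-up data (expressed in Java rather than MATLAB, for consistency with the Naïve approach implementation) might look like this:

```java
import java.util.*;

// Illustration of the evaluation logic in Section 8.2: compare the candidate
// links returned by a tool with the verified answer set and derive recall,
// precision and F-score. Requirement and test case names are made up.
public class RtmEvaluation {

    public static void main(String[] args) {
        // Verified answer set: requirement -> linked test case descriptions.
        Map<String, Set<String>> answerSet = Map.of(
                "REQ-1", Set.of("TC-4"),
                "REQ-2", Set.of("TC-2", "TC-7"),
                "REQ-3", Set.of());
        // Candidate links returned by a tool at some cut point.
        Map<String, Set<String>> candidates = Map.of(
                "REQ-1", Set.of("TC-4", "TC-9"),
                "REQ-2", Set.of("TC-2"),
                "REQ-3", Set.of("TC-1"));

        int correctReturned = 0;      // retrieved & relevant
        int incorrectReturned = 0;    // retrieved & irrelevant
        int correctMissed = 0;        // not retrieved but relevant

        for (Map.Entry<String, Set<String>> e : answerSet.entrySet()) {
            Set<String> correct = e.getValue();
            Set<String> returned = candidates.getOrDefault(e.getKey(), Set.of());
            for (String link : returned) {
                if (correct.contains(link)) {
                    correctReturned++;
                } else {
                    incorrectReturned++;
                }
            }
            for (String link : correct) {
                if (!returned.contains(link)) {
                    correctMissed++;
                }
            }
        }

        double precision = (double) correctReturned / (correctReturned + incorrectReturned);
        double recall = (double) correctReturned / (correctReturned + correctMissed);
        double f = 2 * precision * recall / (precision + recall);
        System.out.printf("recall = %.2f, precision = %.2f, F = %.2f%n", recall, precision, f);
    }
}
```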


More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

Key Benefits of Microsoft Visual Studio Team System

Key Benefits of Microsoft Visual Studio Team System of Microsoft Visual Studio Team System White Paper November 2007 For the latest information, please see www.microsoft.com/vstudio The information contained in this document represents the current view

More information

CHAPTER VII CONCLUSIONS

CHAPTER VII CONCLUSIONS CHAPTER VII CONCLUSIONS To do successful research, you don t need to know everything, you just need to know of one thing that isn t known. -Arthur Schawlow In this chapter, we provide the summery of the

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518 International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 INTELLIGENT MULTIDIMENSIONAL DATABASE INTERFACE Mona Gharib Mohamed Reda Zahraa E. Mohamed Faculty of Science,

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics

Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics INTERNATIONAL BLACK SEA UNIVERSITY COMPUTER TECHNOLOGIES AND ENGINEERING FACULTY ELABORATION OF AN ALGORITHM OF DETECTING TESTS DIMENSIONALITY Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree

More information

1 o Semestre 2007/2008

1 o Semestre 2007/2008 Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction

More information

Do you know? "7 Practices" for a Reliable Requirements Management. by Software Process Engineering Inc. translated by Sparx Systems Japan Co., Ltd.

Do you know? 7 Practices for a Reliable Requirements Management. by Software Process Engineering Inc. translated by Sparx Systems Japan Co., Ltd. Do you know? "7 Practices" for a Reliable Requirements Management by Software Process Engineering Inc. translated by Sparx Systems Japan Co., Ltd. In this white paper, we focus on the "Requirements Management,"

More information

Get the most value from your surveys with text analysis

Get the most value from your surveys with text analysis PASW Text Analytics for Surveys 3.0 Specifications Get the most value from your surveys with text analysis The words people use to answer a question tell you a lot about what they think and feel. That

More information

Applying Machine Learning to Stock Market Trading Bryce Taylor

Applying Machine Learning to Stock Market Trading Bryce Taylor Applying Machine Learning to Stock Market Trading Bryce Taylor Abstract: In an effort to emulate human investors who read publicly available materials in order to make decisions about their investments,

More information

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that

More information

TF-IDF. David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt

TF-IDF. David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt TF-IDF David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt Administrative Homework 3 available soon Assignment 2 available soon Popular media article

More information

Requirements Traceability. Mirka Palo

Requirements Traceability. Mirka Palo Requirements Traceability Mirka Palo Seminar Report Department of Computer Science University of Helsinki 30 th October 2003 Table of Contents 1 INTRODUCTION... 1 2 DEFINITION... 1 3 REASONS FOR REQUIREMENTS

More information

ElegantJ BI. White Paper. Considering the Alternatives Business Intelligence Solutions vs. Spreadsheets

ElegantJ BI. White Paper. Considering the Alternatives Business Intelligence Solutions vs. Spreadsheets ElegantJ BI White Paper Considering the Alternatives Integrated Business Intelligence and Reporting for Performance Management, Operational Business Intelligence and Data Management www.elegantjbi.com

More information

Collecting Polish German Parallel Corpora in the Internet

Collecting Polish German Parallel Corpora in the Internet Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska

More information

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1. MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

More information

System Development and Life-Cycle Management (SDLCM) Methodology. Approval CISSCO Program Director

System Development and Life-Cycle Management (SDLCM) Methodology. Approval CISSCO Program Director System Development and Life-Cycle Management (SDLCM) Methodology Subject Type Standard Approval CISSCO Program Director A. PURPOSE This standard specifies content and format requirements for a Physical

More information

Towards Collaborative Requirements Engineering Tool for ERP product customization

Towards Collaborative Requirements Engineering Tool for ERP product customization Towards Collaborative Requirements Engineering Tool for ERP product customization Boban Celebic, Ruth Breu, Michael Felderer, Florian Häser Institute of Computer Science, University of Innsbruck 6020 Innsbruck,

More information

Semester Thesis Traffic Monitoring in Sensor Networks

Semester Thesis Traffic Monitoring in Sensor Networks Semester Thesis Traffic Monitoring in Sensor Networks Raphael Schmid Departments of Computer Science and Information Technology and Electrical Engineering, ETH Zurich Summer Term 2006 Supervisors: Nicolas

More information

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Linear Algebra Methods for Data Mining

Linear Algebra Methods for Data Mining Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 Text mining & Information Retrieval Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki

More information

Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

More information

Verifying Business Processes Extracted from E-Commerce Systems Using Dynamic Analysis

Verifying Business Processes Extracted from E-Commerce Systems Using Dynamic Analysis Verifying Business Processes Extracted from E-Commerce Systems Using Dynamic Analysis Derek Foo 1, Jin Guo 2 and Ying Zou 1 Department of Electrical and Computer Engineering 1 School of Computing 2 Queen

More information

Data Warehouse and Business Intelligence Testing: Challenges, Best Practices & the Solution

Data Warehouse and Business Intelligence Testing: Challenges, Best Practices & the Solution Warehouse and Business Intelligence : Challenges, Best Practices & the Solution Prepared by datagaps http://www.datagaps.com http://www.youtube.com/datagaps http://www.twitter.com/datagaps Contact contact@datagaps.com

More information

Partnering for Project Success: Project Manager and Business Analyst Collaboration

Partnering for Project Success: Project Manager and Business Analyst Collaboration Partnering for Project Success: Project Manager and Business Analyst Collaboration By Barbara Carkenord, CBAP, Chris Cartwright, PMP, Robin Grace, CBAP, Larry Goldsmith, PMP, Elizabeth Larson, PMP, CBAP,

More information

Similarity and Diagonalization. Similar Matrices

Similarity and Diagonalization. Similar Matrices MATH022 Linear Algebra Brief lecture notes 48 Similarity and Diagonalization Similar Matrices Let A and B be n n matrices. We say that A is similar to B if there is an invertible n n matrix P such that

More information

Using Library Dependencies for Clustering

Using Library Dependencies for Clustering Using Library Dependencies for Clustering Jochen Quante Software Engineering Group, FB03 Informatik, Universität Bremen quante@informatik.uni-bremen.de Abstract: Software clustering is an established approach

More information

Model Based System Engineering (MBSE) For Accelerating Software Development Cycle

Model Based System Engineering (MBSE) For Accelerating Software Development Cycle Model Based System Engineering (MBSE) For Accelerating Software Development Cycle Manish Patil Sujith Annamaneni September 2015 1 Contents 1. Abstract... 3 2. MBSE Overview... 4 3. MBSE Development Cycle...

More information

Horizontal Traceability for Just-In-Time Requirements: The Case for Open Source Feature Requests

Horizontal Traceability for Just-In-Time Requirements: The Case for Open Source Feature Requests JOURNAL OF SOFTWARE: EVOLUTION AND PROCESS J. Softw. Evol. and Proc. 2014; xx:xx xx Published online in Wiley InterScience (www.interscience.wiley.com). Horizontal Traceability for Just-In-Time Requirements:

More information

Using Metadata Manager for System Impact Analysis in Healthcare

Using Metadata Manager for System Impact Analysis in Healthcare 1 Using Metadata Manager for System Impact Analysis in Healthcare David Bohmann & Suren Samudrala Sr. Data Integration Developers UT M.D. Anderson Cancer Center 2 About M.D. Anderson Established in 1941

More information

Populating a Data Quality Scorecard with Relevant Metrics WHITE PAPER

Populating a Data Quality Scorecard with Relevant Metrics WHITE PAPER Populating a Data Quality Scorecard with Relevant Metrics WHITE PAPER SAS White Paper Table of Contents Introduction.... 1 Useful vs. So-What Metrics... 2 The So-What Metric.... 2 Defining Relevant Metrics...

More information

Product Review: James F. Koopmann Pine Horse, Inc. Quest Software s Foglight Performance Analysis for Oracle

Product Review: James F. Koopmann Pine Horse, Inc. Quest Software s Foglight Performance Analysis for Oracle Product Review: James F. Koopmann Pine Horse, Inc. Quest Software s Foglight Performance Analysis for Oracle Introduction I ve always been interested and intrigued by the processes DBAs use to monitor

More information

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction Chapter-1 : Introduction 1 CHAPTER - 1 Introduction This thesis presents design of a new Model of the Meta-Search Engine for getting optimized search results. The focus is on new dimension of internet

More information

Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

More information

BUSINESS RULES AND GAP ANALYSIS

BUSINESS RULES AND GAP ANALYSIS Leading the Evolution WHITE PAPER BUSINESS RULES AND GAP ANALYSIS Discovery and management of business rules avoids business disruptions WHITE PAPER BUSINESS RULES AND GAP ANALYSIS Business Situation More

More information

A Business Process Services Portal

A Business Process Services Portal A Business Process Services Portal IBM Research Report RZ 3782 Cédric Favre 1, Zohar Feldman 3, Beat Gfeller 1, Thomas Gschwind 1, Jana Koehler 1, Jochen M. Küster 1, Oleksandr Maistrenko 1, Alexandru

More information

Cryptography and Network Security Department of Computer Science and Engineering Indian Institute of Technology Kharagpur

Cryptography and Network Security Department of Computer Science and Engineering Indian Institute of Technology Kharagpur Cryptography and Network Security Department of Computer Science and Engineering Indian Institute of Technology Kharagpur Module No. # 01 Lecture No. # 05 Classic Cryptosystems (Refer Slide Time: 00:42)

More information

Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class

Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class Problem 1. (10 Points) James 6.1 Problem 2. (10 Points) James 6.3 Problem 3. (10 Points) James 6.5 Problem 4. (15 Points) James 6.7 Problem 5. (15 Points) James 6.10 Homework 4 Statistics W4240: Data Mining

More information

Processing and data collection of program structures in open source repositories

Processing and data collection of program structures in open source repositories 1 Processing and data collection of program structures in open source repositories JEAN PETRIĆ, TIHANA GALINAC GRBAC AND MARIO DUBRAVAC, University of Rijeka Software structure analysis with help of network

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Predicting the Stock Market with News Articles

Predicting the Stock Market with News Articles Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is

More information

Natural Language to Relational Query by Using Parsing Compiler

Natural Language to Relational Query by Using Parsing Compiler Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

Co-Creation of Models and Metamodels for Enterprise. Architecture Projects.

Co-Creation of Models and Metamodels for Enterprise. Architecture Projects. Co-Creation of Models and Metamodels for Enterprise Architecture Projects Paola Gómez pa.gomez398@uniandes.edu.co Hector Florez ha.florez39@uniandes.edu.co ABSTRACT The linguistic conformance and the ontological

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

Introduction to Software Paradigms & Procedural Programming Paradigm

Introduction to Software Paradigms & Procedural Programming Paradigm Introduction & Procedural Programming Sample Courseware Introduction to Software Paradigms & Procedural Programming Paradigm This Lesson introduces main terminology to be used in the whole course. Thus,

More information

A Statistical Text Mining Method for Patent Analysis

A Statistical Text Mining Method for Patent Analysis A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, shjun@cju.ac.kr Abstract Most text data from diverse document databases are unsuitable for analytical

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

JRefleX: Towards Supporting Small Student Software Teams

JRefleX: Towards Supporting Small Student Software Teams JRefleX: Towards Supporting Small Student Software Teams Kenny Wong, Warren Blanchet, Ying Liu, Curtis Schofield, Eleni Stroulia, Zhenchang Xing Department of Computing Science University of Alberta {kenw,blanchet,yingl,schofiel,stroulia,xing}@cs.ualberta.ca

More information

Checking Access to Protected Members in the Java Virtual Machine

Checking Access to Protected Members in the Java Virtual Machine Checking Access to Protected Members in the Java Virtual Machine Alessandro Coglio Kestrel Institute 3260 Hillview Avenue, Palo Alto, CA 94304, USA Ph. +1-650-493-6871 Fax +1-650-424-1807 http://www.kestrel.edu/

More information

Requirements engineering

Requirements engineering Learning Unit 2 Requirements engineering Contents Introduction............................................... 21 2.1 Important concepts........................................ 21 2.1.1 Stakeholders and

More information

Comparison of Standard and Zipf-Based Document Retrieval Heuristics

Comparison of Standard and Zipf-Based Document Retrieval Heuristics Comparison of Standard and Zipf-Based Document Retrieval Heuristics Benjamin Hoffmann Universität Stuttgart, Institut für Formale Methoden der Informatik Universitätsstr. 38, D-70569 Stuttgart, Germany

More information

Traceability Patterns: An Approach to Requirement-Component Traceability in Agile Software Development

Traceability Patterns: An Approach to Requirement-Component Traceability in Agile Software Development Traceability Patterns: An Approach to Requirement-Component Traceability in Agile Software Development ARBI GHAZARIAN University of Toronto Department of Computer Science 10 King s College Road, Toronto,

More information

Text Analytics. A business guide

Text Analytics. A business guide Text Analytics A business guide February 2014 Contents 3 The Business Value of Text Analytics 4 What is Text Analytics? 6 Text Analytics Methods 8 Unstructured Meets Structured Data 9 Business Application

More information

Content-Based Recommendation

Content-Based Recommendation Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches

More information

Software-assisted document review: An ROI your GC can appreciate. kpmg.com

Software-assisted document review: An ROI your GC can appreciate. kpmg.com Software-assisted document review: An ROI your GC can appreciate kpmg.com b Section or Brochure name Contents Introduction 4 Approach 6 Metrics to compare quality and effectiveness 7 Results 8 Matter 1

More information

Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy

Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy The Deep Web: Surfacing Hidden Value Michael K. Bergman Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy Presented by Mat Kelly CS895 Web-based Information Retrieval

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Graphical Web based Tool for Generating Query from Star Schema

Graphical Web based Tool for Generating Query from Star Schema Graphical Web based Tool for Generating Query from Star Schema Mohammed Anbar a, Ku Ruhana Ku-Mahamud b a College of Arts and Sciences Universiti Utara Malaysia, 0600 Sintok, Kedah, Malaysia Tel: 604-2449604

More information

Multivariate Analysis of Variance (MANOVA): I. Theory

Multivariate Analysis of Variance (MANOVA): I. Theory Gregory Carey, 1998 MANOVA: I - 1 Multivariate Analysis of Variance (MANOVA): I. Theory Introduction The purpose of a t test is to assess the likelihood that the means for two groups are sampled from the

More information

Technical Report. The KNIME Text Processing Feature:

Technical Report. The KNIME Text Processing Feature: Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold Killian.Thiel@uni-konstanz.de Michael.Berthold@uni-konstanz.de Copyright 2012 by KNIME.com AG

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Linear Codes. Chapter 3. 3.1 Basics

Linear Codes. Chapter 3. 3.1 Basics Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length

More information

Recovering Traceability Links in Software Artifact Management Systems

Recovering Traceability Links in Software Artifact Management Systems 13 Recovering Traceability Links in Software Artifact Management Systems using Information Retrieval Methods ANDREA DE LUCIA, FAUSTO FASANO, ROCCO OLIVETO, and GENOVEFFA TORTORA University of Salerno The

More information

Process Models and Metrics

Process Models and Metrics Process Models and Metrics PROCESS MODELS AND METRICS These models and metrics capture information about the processes being performed We can model and measure the definition of the process process performers

More information

Project VIDE Challenges of Executable Modelling of Business Applications

Project VIDE Challenges of Executable Modelling of Business Applications Project VIDE Challenges of Executable Modelling of Business Applications Radoslaw Adamus *, Grzegorz Falda *, Piotr Habela *, Krzysztof Kaczmarski #*, Krzysztof Stencel *+, Kazimierz Subieta * * Polish-Japanese

More information

A Case Retrieval Method for Knowledge-Based Software Process Tailoring Using Structural Similarity

A Case Retrieval Method for Knowledge-Based Software Process Tailoring Using Structural Similarity A Case Retrieval Method for Knowledge-Based Software Process Tailoring Using Structural Similarity Dongwon Kang 1, In-Gwon Song 1, Seunghun Park 1, Doo-Hwan Bae 1, Hoon-Kyu Kim 2, and Nobok Lee 2 1 Department

More information

Reputation Network Analysis for Email Filtering

Reputation Network Analysis for Email Filtering Reputation Network Analysis for Email Filtering Jennifer Golbeck, James Hendler University of Maryland, College Park MINDSWAP 8400 Baltimore Avenue College Park, MD 20742 {golbeck, hendler}@cs.umd.edu

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

More information

Efficient Management of Tests and Defects in Variant-Rich Systems with pure::variants and IBM Rational ClearQuest

Efficient Management of Tests and Defects in Variant-Rich Systems with pure::variants and IBM Rational ClearQuest Efficient Management of Tests and Defects in Variant-Rich Systems with pure::variants and IBM Rational ClearQuest Publisher pure-systems GmbH Agnetenstrasse 14 39106 Magdeburg http://www.pure-systems.com

More information

Data Modeling Basics

Data Modeling Basics Information Technology Standard Commonwealth of Pennsylvania Governor's Office of Administration/Office for Information Technology STD Number: STD-INF003B STD Title: Data Modeling Basics Issued by: Deputy

More information

1 Solving LPs: The Simplex Algorithm of George Dantzig

1 Solving LPs: The Simplex Algorithm of George Dantzig Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

More information

Development of an Enhanced Web-based Automatic Customer Service System

Development of an Enhanced Web-based Automatic Customer Service System Development of an Enhanced Web-based Automatic Customer Service System Ji-Wei Wu, Chih-Chang Chang Wei and Judy C.R. Tseng Department of Computer Science and Information Engineering Chung Hua University

More information

Component visualization methods for large legacy software in C/C++

Component visualization methods for large legacy software in C/C++ Annales Mathematicae et Informaticae 44 (2015) pp. 23 33 http://ami.ektf.hu Component visualization methods for large legacy software in C/C++ Máté Cserép a, Dániel Krupp b a Eötvös Loránd University mcserep@caesar.elte.hu

More information

AN SQL EXTENSION FOR LATENT SEMANTIC ANALYSIS

AN SQL EXTENSION FOR LATENT SEMANTIC ANALYSIS Advances in Information Mining ISSN: 0975 3265 & E-ISSN: 0975 9093, Vol. 3, Issue 1, 2011, pp-19-25 Available online at http://www.bioinfo.in/contents.php?id=32 AN SQL EXTENSION FOR LATENT SEMANTIC ANALYSIS

More information

Requirements Specification and Testing Part 1

Requirements Specification and Testing Part 1 Institutt for datateknikk og informasjonsvitenskap Inah Omoronyia Requirements Specification and Testing Part 1 TDT 4242 TDT 4242 Lecture 3 Requirements traceability Outcome: 1. Understand the meaning

More information

Statistics Review PSY379

Statistics Review PSY379 Statistics Review PSY379 Basic concepts Measurement scales Populations vs. samples Continuous vs. discrete variable Independent vs. dependent variable Descriptive vs. inferential stats Common analyses

More information

Methodology of Building ALM Platform for Software Product Organizations

Methodology of Building ALM Platform for Software Product Organizations Methodology of Building ALM Platform for Software Product Organizations Ivo Pekšēns AS Itella Information Mūkusalas 41b, Rīga, LV-1004 Latvija ivo.peksens@itella.com Abstract. This work investigates Application

More information