Reverse Engineering Methodology to Recover the Design Artifacts: A Case Study

Nadim Asif
School of Computing, IES, Leeds Metropolitan University
Beckett Park Campus, Leeds LS6 3QS, UK.
N.Asif@lmu.ac.uk

Abstract

As a software system evolves, new features are added and obsolete ones are removed, and the design artifacts gradually diverge from the original design. Many approaches to design recovery or reverse engineering have been suggested, most with some type of support tool. Since a project's time constraints may prohibit the use of sophisticated techniques and tools because of their learning curves, methods that can be applied in lieu of complex support tools may be required. Reverse engineering produces a high-level representation of a software system from a low-level one. This paper describes a case study that uses a reverse engineering methodology to recover the design artifacts of a software system from its source code and related documentation. The methodology consists of five phases, which can be attempted at different levels of abstraction according to the task at hand. The methodology also makes use of tools, approaches and representations typically found in the forward software development process.

Key Words: Reverse Engineering, Design Recovery, Software Understanding and Maintenance.

1. Introduction

Useful software systems continuously evolve [1]. As they evolve, so too do their designs. New modules and dependencies are added to support new features, while obsolete functionality is removed. Consequently, the design gradually diverges from the original design. Design artifacts become inconsistent with the current implementation, making maintenance tasks difficult and error prone. Software maintenance of large systems depends on several factors, including the existence of accurate documentation of the system design.
In some cases, software and documentation become inconsistent because the documentation, and subsequently the design, is rarely updated to reflect modifications made to the system. In other cases the original system design has no documentation at all and, as such, the rationale behind the design decisions made during implementation is lost. In either case, the lack of a consistent design reduces the effectiveness of any effort to maintain and modify the existing system.

Reverse engineering is a crucial part of software maintenance, because a maintainer needs to understand the code before attempting any modification. Reverse engineering is the process of creating higher-level abstractions from source code and available documentation [2]. It can be used for a variety of purposes: to reconstruct or improve documentation; to facilitate software maintenance or conversion activities; or to redesign and re-engineer an existing system. Unfortunately, source code does not contain much of the design information, so additional information sources are required. Because the software is usually large, the maintainer also needs automated support for understanding the system and recovering the design artifacts. Recovering the design artifacts requires design information from a combination of code, existing design documentation (if available), and general knowledge about the problem and application domain.

This paper first briefly summarizes the reverse engineering abstraction levels and the concepts necessary to understand the process. It then describes a methodology in which human and computer interact to recover the design artifacts, followed by a review of the experience of using this process in the case study.

2. Background

This section gives background information on reverse engineering abstraction, system artifacts and the Reverse Engineering Abstraction Methodology (REAM) [5] used in the case study to recover the design artifacts.

2.1 Abstraction Levels

An abstraction for a software artifact is a succinct description that suppresses the details that are unimportant to the software developer and emphasizes the information that is important. For example, the abstraction provided by a high-level programming language allows a programmer to construct algorithms without having to worry about the details of hardware register allocation. Software typically consists of several layers of abstraction built on top of raw hardware; the lowest-level software abstraction is object code, or machine code. Implementation is the common term for the lowest level of detail in an abstraction. When abstraction is applied to computer programming, program behavior is emphasized and implementation details are suppressed.

Knowledge of a software product at various levels of abstraction underlies the maintenance and reuse of existing software components. It is therefore natural that there is a steadily growing interest in reverse engineering as a means of extracting information and documents from a software product and presenting them at higher levels of abstraction than that of code. Abstraction is the process of ignoring certain details in order to simplify the problem, and so allows the specification, design and implementation of a system to proceed in a step-wise fashion.

In the context of software maintenance [3], four levels of reverse engineering abstraction are defined: implementation abstraction, structural abstraction, functional abstraction and domain abstraction. Implementation abstraction is the lowest level: it abstracts knowledge of the language in which the system is written, the syntax and semantics of that language, and the hierarchy of system components (program or module tree), rather than data structures and algorithms. Structural abstraction further abstracts the system components (programs or modules) to extract the program structure: how the components are related to, and control, each other. Functional abstraction is a higher level, usually achieved by further abstracting components or sub-components (programs, modules or classes) to reveal the relations and logic that perform certain tasks. Domain abstraction further abstracts the functions by replacing their algorithmic nature with concepts specific to the application domain.

2.2 System Artifacts

Five levels of abstraction scope the system artifacts: Requirements, Features, Architecture, Design and Implementation [4]. Since reverse engineering itself is a process requiring abstraction at different levels [2], the system artifacts should be constrained to these five levels of abstraction. A distinction between the problem and solution domains also has to be modelled, because there are two ways to view the functionality of a software system. From the perspective of the user, the requirements of the system are specified in the problem domain, which outlines what the system is supposed to do. From the perspective of a developer, the system can be viewed in the solution domain, which specifies how the system achieves the tasks specified in the problem domain.

The user requirements represent the highest level of abstraction at which the system can be represented. The functionality is expressed at a fine-grained level with no emphasis on implementation-dependent details. The software system is expected to satisfy the specified requirements. The requirements specification document is typically the product of a system analyst's interactions with potential users and system experts, resulting in a text document supported by figures and diagrams. The features bridge the gap between the artifacts that are being developed and the specified requirements.
The architecture of a system specifies how the artifacts of the system combine to implement the desired functionality. The internal design and implementation of the system artifacts are the elements of the design layer of abstraction. The design shows the functional decisions made while building the system, which usually reside in the minds of the developers and are rarely recorded in any form. Design entities such as classes, structures and user-defined data types are modeled in this layer of abstraction. Implementation is the lowest level of abstraction and comprises the artifacts that implement the functionality of the system. It is done using a programming language and is usually rich in detail. Typically source files, directories and file systems make up the implementation layer.

2.3 Methodology

The Reverse Engineering Abstraction Methodology (REAM) is aimed at assisting reverse engineering activities to recover the design of the software at different levels of abstraction. The methodology consists of five phases: the high-level model, functional model, architectural model, source code model and mapping model. Figure I gives a graphical depiction of REAM [5]. REAM helps engineers perform various software engineering tasks by exploiting the high-level, functional, architectural, source code and mapping models to recover the design artifacts. The goal of this iterative approach is to enable
a software engineer to produce, within the time constraints of the task being performed, high-level, functional, architectural, source code and mapping models that are suitable for recovering the design artifacts and reasoning about the task at hand. The engineer interprets the models and, as necessary, modifies the high-level, functional, architectural, source code or mapping model to iteratively recover and reason about the system's artifacts.

The approach used to perform the case study is based on a combined top-down and bottom-up approach to recover the design artifacts. Recent investigations have shown that this kind of approach is reasonable and appropriate given the time constraints and the task in hand [6]. To facilitate the recovery of design artifacts from the existing system, system analysis and design (SA/SD) and UML are used to communicate the understanding and recover the design artifacts at each (High-Level, Functional, Architectural, Source Code and Mapping) model.

Figure I: REAM (High Level Model, Functional Model, Mapping, Design Artifacts, Architectural Model, Source Code Model)

3. Process Overview

In the context of the Reverse Engineering Abstraction Methodology (REAM) [5], the diagram in Figure II depicts the process described in this paper. Specifically, this paper describes the approach and a case study that involved five distinct models of abstraction to recover the design artifacts. Each phase of the process is enclosed in a box, artifacts are shown in rectangles, and activities in squares.

3.1 Multi-level Abstraction Approach to Recover the Design Artifacts

Several techniques have been suggested for recovering design artifacts from existing systems. These techniques range from formal approaches [9] to semi-formal functional abstraction [10] and structural abstraction [6]. The representations constructed by these techniques are often biased by the implementations and, as such, do not always correspond to existing high-level models in the recovery process.

First, a High-Level model of the system is developed from the available documentation (documents, system knowledge) and experience, and refined through empirical investigation of the existing system. Second, a Source Code model (such as a call graph) is constructed using third-party tools. A prototype Design Recovery Tool (DRT) was developed during this research; it consists of several C++ programs, a user interface implemented in Visual Basic 6.0, and links to the AT&T Graphviz package (Dotty) to view particular artifacts. In the next step, the functional model is developed using the high-level model and the source code model. The mapping model between the two models is also defined to explore and build the functional model (the relationship between the high-level and source code models) to recover the design artifacts. At this stage an abstract understanding of the functions that the system performs is developed. It can consist of an analysis of the system's input/output behavior expressed in terms
of nested data flow diagrams, or it may be a UML Use Case diagram that documents the functional features of the system. This helps to explain some of the reasons driving the design decisions made by the developers of the software.

Figure II: Process overview (panels: High Level Model, Functional Model, Mapping Model, Design Artifacts, Source Code Model, Architectural Model)

The Architectural model is extracted from the understanding and the artifacts developed in the High-Level, Functional, Source Code and Mapping models. The architectural description is extracted throughout the length of the project and provides a detailed view of the system. UML component and package diagrams are used to convey information about the architecture of the system.

Once the models have been developed at the different levels of abstraction described above, it is important to correlate them in order to verify them and remove any discrepancies. Another useful exercise is to map the feature descriptions to the source and architectural models, which makes the abstractions fully connected to each other. Re-documentation of the models increases comprehension of the system and offers scope for improving the models before they are released. The result of this phase of the process is the reverse engineered documentation, which can then be utilized. Generally, the user iteratively computes and investigates successive mapping models until enough information has been acquired for the task being performed.

4. Case Study

In this section we demonstrate how our approach supports the recovery of design artifacts by applying it to Mozilla [7].
Early in 1998, Netscape announced that it would provide the source code for Netscape Communicator freely to the Internet community and that this free version of Communicator would be known as Mozilla.

In the first phase the high-level model was developed from the available
documents and experience. From these sources the functional description of the system was also developed, starting with a short summary of the overall system behavior. The Unified Modeling Language (UML) was selected to visualize and communicate the software system design. Due to space constraints, the recovered design artifacts are not included in the case study.

The core functionality of Mozilla revolves around XUL (XML-based user interface language). XUL is an XML-based language for describing the layout and components of user interfaces; Mozilla also uses C++, JavaScript and HTML. XUL is used to describe application windows, such as the Mozilla browser window, and their contents. In fact XUL is used to define every aspect of a window's user interface, from its menus to its toolbars to its status bars. The user interface is configurable through markup rather than hard coded in the source: it is loaded at runtime, which enables programmers to tweak the interface without having to recompile the source code. XUL thus makes the user interface dynamically configurable.

Interactions and events related to the user interface flow through JavaScript and are handled either in source code or in a script. More options normally specify command handlers, which flow through JavaScript to C++, and from C++ the handlers may drop through directly to C. HTML is used to describe the contents of a document, and XUL markup is used to describe the contents of an active window, which can include multiple HTML documents. HTML, XML and XUL achieve flexibility through an object model called the DOM (Document Object Model). Interfaces into the DOM are defined in Interface Definition Language (IDL). These interfaces serve as the glue between JavaScript and the C/C++ source code.

In the second phase the source code model was extracted; the process is depicted in Figure II.
A prototype of the Design Recovery Tool was used to extract the developer's documentation, functions, classes and flow of control from the source code. The developer's documentation provided knowledge about the components that implement the structure of the application. Several modules of the source were documented, and debugging the source was also an important method used to extract the model as an abstraction. Debugging was also found to be the best method of understanding the program flow and extracting the reference formats (reports, menus and interfaces). These documents were scanned thoroughly for clues about the critical modules in the application.

In the third phase a good understanding of the functional aspects of the application was developed. The Use Case description was built for the system from the available documents, by building a Use Case diagram at the system level and by providing fine-grained Use Case diagrams wherever necessary. Each Use Case was documented textually to provide more understanding of its functionality. It was revealed that the application core implements the core functionality for application components, and that the application services process XUL. C/C++ source code serves as the basis for an object class, which defines core functionality and services. Application Services are implemented by the Application Runner (nsapprunner) and the Application Shell (nsappshell). The Application Runner loads a XUL file and an application core and hooks them together through the application shell.

Figure III: Mapping of nsiapprunner to Source Code

The main function main( ) of
AppRunner sets up the application shell and handles the tasks of initializing the shell, running the shell, and shutting down the shell. AppShell provides key services for the application shell and XUL, as well as controllers for widgets and window callbacks. These features are implemented through nsappshellservices, nscommandlineservices, nswebshellwindow, and nsxulcommand. The application shell provides services and hooks; it does not provide the core functionality for user interfaces. The application cores for the browser and editor components are defined in nsbrowserappcore and nseditorappcore respectively. nsbrowsermain instantiates main and sets up the console and browser windows, nsbrowserwindow creates browser windows, and nsxpbasewindow handles core windowing tasks. Many application cores can be instantiated from nsappcore. These application cores provide the core functionality for the browser, mail and editor components.

In the next phase the mapping process was performed to map the high-level and functional models to the source code model and so consolidate all the models. All the models were reviewed again in the light of the goals specified at the start of the study. The mapping of the class nsiapprunner to the source code files is depicted in Figure III. The relationship of the recovered class CHTMLToken with the source files is depicted in Figure IV. During this phase many additional relationships and corrections were made to the constructed models.

Figure IV: Class CHTMLToken Relationship with Source Files

In the next phase the Architecture Model was abstracted; the process is depicted in Figure II. Abstracting the architectural description was an ongoing process throughout the project. The static architecture of the system artifacts was identified at the beginning, and incremental changes were made as more information was learnt.
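
As an illustration of the mapping phase described above, the following is a minimal sketch, written in the spirit of the reflexion-model comparison cited as [6], of how file-level relations from a source code model might be lifted to high-level entities and compared with what the engineer expects. It is not the Design Recovery Tool's implementation. The file names with a .cpp suffix, the patterns, the call pairs and the expected dependencies are all hypothetical and chosen only for illustration; they are not facts recovered from Mozilla.

```python
import re
from collections import defaultdict

# Hypothetical mapping model: patterns over source file names, each
# assigned to one entity of the high-level model.
MAPPING = [
    (r"^nsapp(runner|shell)", "Application Services"),
    (r"^ns(browser|editor)appcore", "Application Core"),
    (r"^chtmltoken", "HTML Parser"),
]

# Hypothetical source code model: (caller file, callee file) pairs of the
# kind a call graph extractor might report. Invented for illustration.
SOURCE_EDGES = [
    ("nsapprunner.cpp", "nsappshell.cpp"),
    ("nsapprunner.cpp", "nsbrowserappcore.cpp"),
    ("nsbrowserappcore.cpp", "chtmltoken.cpp"),
    ("chtmltoken.cpp", "nsappshell.cpp"),
]

# Hypothetical high-level model: dependencies the engineer expects.
EXPECTED = {
    ("Application Services", "Application Core"),
    ("Application Core", "HTML Parser"),
    ("Application Services", "HTML Parser"),
}


def entity_of(filename):
    """Return the high-level entity a file maps to, or None if unmapped."""
    for pattern, entity in MAPPING:
        if re.search(pattern, filename.lower()):
            return entity
    return None


def compare(source_edges, expected):
    """Lift file-level edges to entity level and compare with expectations."""
    lifted = defaultdict(list)
    unmapped = set()
    for caller, callee in source_edges:
        src, dst = entity_of(caller), entity_of(callee)
        if src is None:
            unmapped.add(caller)
        if dst is None:
            unmapped.add(callee)
        # Skip edges that cannot be lifted or that stay inside one entity.
        if src is None or dst is None or src == dst:
            continue
        lifted[(src, dst)].append((caller, callee))
    found = set(lifted)
    return {
        "confirmed": sorted(found & expected),   # expected and present
        "unexpected": sorted(found - expected),  # present but not expected
        "missing": sorted(expected - found),     # expected but not present
        "unmapped": sorted(unmapped),            # files with no entity yet
    }


if __name__ == "__main__":
    for kind, items in compare(SOURCE_EDGES, EXPECTED).items():
        print(kind, items)
```

Each "unexpected" or "missing" entry is a prompt to revise either the high-level model or the mapping and run the comparison again, which corresponds to the iterative correction of the models described in this phase.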
Component diagrams were built in UML, and the relationships among the components were represented visually by dependency relationships between them. This graphical view made it possible to understand the architectural layout of the software. The end result of this study was one coherent UML model that correlates all the knowledge gained at different levels of abstraction.

Figure V: Architecture (layers: HTML/CSS/JavaScript; Layout/XUL - Interface; JavaScript - Event / Command; Application Services; Application Core)

The abstraction levels were found very useful for hiding the real complexity of the details (source code, textual descriptions, reference formats). The study showed that a few graphical descriptions can greatly reduce the effort of comprehending the relationships and interactions among the different artifacts (source code, developers' documents, reference formats). They help to limit the scope of exploration and make it possible to work without getting lost in complex code.

5. Evaluation

Table I lists the application modules that were analyzed and their total size in lines of code (LOC), including program files and header files. The proposed approach and tool were evaluated using two well-known information retrieval metrics [8] related to retrieval effectiveness: recall and precision. Recall is the ratio of the number of relevant documents retrieved for a given query to the number of relevant documents for that query in the database. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. The recovery process can be seen as the retrieval, or even classification, of design artifacts present in the source code and other related documents.

Table I

Mozilla       CPP   C     H     Total   LOC
db            0     16    15    31      7528
xpfe          54    0     67    121     33121
editor        55          75    130     45351
HTML Parser   30    0     42    72      29850
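
The two metrics defined above can be written down as a short worked computation. The sketch below follows those definitions directly and uses the counts reported for the db functions in Tables II and III (115 relevant retrieved, 46 irrelevant retrieved, 0 relevant missed); the function name and its parameters are illustrative assumptions and not part of the tool.

```python
def retrieval_metrics(relevant_retrieved, irrelevant_retrieved, relevant_missed):
    """Compute recall, precision and misclassification from raw counts."""
    retrieved = relevant_retrieved + irrelevant_retrieved  # everything returned
    relevant = relevant_retrieved + relevant_missed        # everything that should be found
    return {
        # Share of the relevant artifacts that were actually retrieved.
        "recall": relevant_retrieved / relevant,
        # Share of the retrieved artifacts that are relevant.
        "precision": relevant_retrieved / retrieved,
        # Share of the retrieved artifacts that are irrelevant.
        "misclassification": irrelevant_retrieved / retrieved,
    }


if __name__ == "__main__":
    metrics = retrieval_metrics(
        relevant_retrieved=115, irrelevant_retrieved=46, relevant_missed=0
    )
    for name, value in metrics.items():
        print(f"{name}: {value:.2%}")
    # recall: 100.00%
    # precision: 71.43%
    # misclassification: 28.57%
```

Precision and misclassification share the same denominator, so they always sum to 100%. The sketch assumes at least one retrieved and one relevant artifact, since either ratio is undefined otherwise.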

The identification of the relevant artifacts and the validation of the relevant retrieved artifacts were done manually. Tables II and III summarize example data for db (functions).

Table II: Relevant and Irrelevant Retrieved functions

db (functions)   Relevant   Irrelevant   Total
Retrieved        115        46           161

Table III: Retrieved and Missed Relevant functions

db (functions)   Retrieved   Missed   Total
Relevant         115         0        115

Therefore, following the definitions given above, the Recall and Precision measures are:

Recall = #(Relevant ∩ Retrieved) / #Relevant = 115/115 = 100%
Precision = #(Relevant ∩ Retrieved) / #Retrieved = 115/161 = 71.43%

This means that the misclassification, that is, the ratio of the number of irrelevant retrieved artifacts to the number of retrieved artifacts, is 46/161 = 28.57%.

6. Conclusion

The methodology permits the user to develop high-level, functional, architectural, source code and mapping models to recover the design artifacts at different levels of abstraction by exploiting various types of information (such as available documents, experience and source code). The approach not only offers the choice of deriving the high-level model from the source code model, but also provides a way to develop and abstract the high-level, functional and architectural models from the source code model and other available sources (such as documents and domain knowledge), and to correlate them at different levels of abstraction. The methodology is lightweight and iterative, and can be used at different levels of abstraction according to the task in hand. The case study also demonstrates that high-level, functional, architectural and mapping models can be beneficial for planning, assessing and executing tasks on an existing system to recover and abstract the design artifacts. Future work consists of using the methodology to build new tools for process automation, and of refining the methodology on the basis of experiments.

7. Acknowledgements

The author would like to thank Pat Allen, Janet Finlay and Mark Dixon for their comments on this paper.

8. References

[1] M.M. Lehman and L.A. Belady, "Program Evolution - Processes of Software Change", Academic Press, London, 1985.
[2] E.J. Chikofsky and J.H. Cross II, "Reverse Engineering and Design Recovery: A Taxonomy", IEEE Software, vol. 7, no. 1, January 1990.
[3] M. Harandi and J. Ning, "Knowledge-Based Program Analysis", IEEE Software, vol. 7, no. 1, 1990.
[4] C. Riva, "Reverse Architecting: An Industrial Experience Report", Proceedings of the IEEE Working Conference on Reverse Engineering (WCRE 00), 2000.
[5] N. Asif, M. Dixon, J. Finlay and G. Coxhead, "Recover the Design Artifacts", Proceedings of the International Conference on Information and Knowledge Engineering (IKE 02), pp 656-662, Las Vegas, June 24-27, 2002.
[6] G.C. Murphy, D. Notkin and K.J. Sullivan, "Software Reflexion Models: Bridging the Gap between Design and Implementation", IEEE Transactions on Software Engineering, vol. 27, no. 4, April 2001.
[7] http://www.mozilla.org
[8] W.B. Frakes and R. Baeza-Yates, "Information Retrieval: Data Structures and Algorithms", Prentice-Hall, Englewood Cliffs, NJ, 1992.
[9] G.C. Gannod and B.H.C. Cheng, "Strongest Postcondition as the Formal Basis for Reverse Engineering", Journal of Automated Software Engineering, vol. 3, pp 139-164, June 1996.
[10] A. Quilici, "A Memory-Based Approach to Recognizing Programming Plans", Communications of the ACM, vol. 37, pp 84-93, May 1994.