Automated Modeling of Legacy Systems Using the UML

by Pan-Wei Ng
Software Engineering Specialist
Rational Software Singapore

Poor documentation is one of the major challenges of supporting legacy systems; in many cases, the only available documentation is the source code itself. This article describes steps you can take to model a legacy system using the Unified Modeling Language (UML), taking into account language constructs and application-specific usage of these constructs. The resulting model can be used to produce useful documentation that abstracts flows and structural relationships. We will illustrate the concepts behind the modeling process, using COBOL as the language of the legacy system, and automate the modeling process using Rational Rose and its Extensibility Interface, which are also described in this article.

Supporting legacy systems is not easy. When an enhancement request or defect is reported, the support team must be able to quickly evaluate the potential impact and magnitude of the change, estimate the effort required, and prioritize the work involved. None of this is simple, because often the only existing documentation is buried within the source code, and a typical application might have several hundred source files, each containing several thousand lines of code. Knowledge of the internal workings of these systems is often in the hands of only a few people who, through experience, understand only part of each legacy system they work with. It is not unusual to come across legacy systems that no single person can fully describe.

Many legacy systems were written using older programming languages, which lack support for good programming practices. Even though the limitations of the programming languages might have been overcome by good development principles, this seldom happened (or it happened inconsistently). When these systems were developed, the software industry had yet to evolve good software development principles such as encapsulation, abstraction, modularity, and architecture. Even for legacy systems that used newer, object-oriented languages, such principles might have been misapplied.

Consequently, the impact of a change is seldom restricted to a small set of files or routines. Support teams need to sift through many source files and available documentation to trace through the program flows.

In addition, serious problems can arise because important business rules are often embedded within legacy systems. Failure to account for such rules when making a change can be disastrous. The more critical the legacy system is to the business, the greater the risk. Consequently, supporting legacy systems is extremely demanding and risky.

Automated Modeling of Legacy Systems

For all these reasons, it is very valuable to have a structured approach for systematically deriving useful documentation based on existing legacy source code -- an approach that takes language specifics into account and produces documentation that is understandable by many people. This article describes such an approach.

This approach employs the UML notation system and a modeling process based on the Rational Unified Process or RUP product, both of which are well recognized by the object-oriented community. Using UML raises the abstraction level for the legacy system being modeled up from the language in which it was originally constructed. Furthermore, the use of object-oriented approaches for modeling legacy systems is highly beneficial from the design perspective. After all, shouldn't good software design principles be universal and not limited to object-oriented languages? Thus, the choice of the UML and the RUP implies a common representation, and a common approach to describing software systems, which are both understandable by the software community in general.
Automation is important, because it is not economically feasible to manually analyze and extract useful information from the extensive amount of source code found in legacy systems. In fact, lack of automation might be the reason for inadequate documentation in the first place.

Automation is not limited to code scanning, which typifies most attempts at understanding legacy code. Rather, automation refers to abstracting internal representation to make it useful for further manipulation, analysis, and visualization. The idea of modeling encompasses the concepts of internal representation with a variety of views. Moreover, automation demands explicit expression of source code analysis rules, which has the advantage of making code analysis consistent.

A Fundamental Modeling Principle: The UML Collaboration

The UML collaboration is a fundamental construct for relating design to requirements. It describes how participating elements interact with each other to produce some observable result of value to the client. It comprises a structural aspect that describes relationships between participating elements, and a dynamic aspect that describes the behavior of these elements.
A use-case realization is one kind of collaboration. Figure 1 shows an example of a use-case realization: an Actor initiating communication with a use case. The use case is a statement of responsibility that dictates what the system should do. The use-case realization describes how that responsibility is fulfilled. It does so by identifying participants -- in this case, classes -- and showing that they interact with one another to fulfill the stated responsibility.

The concept of collaborations is common to UML and the RUP, which use different stereotypical participants. In use-case analysis, the stereotyped classes include boundary, control, and entity classes; in business modeling, stereotyped classes include business workers and business entities. The same concept can be borrowed and applied to different domains. For example, when modeling SQL scripts, the participants are stored procedures and database tables. When attempting to document a collaboration, it is important to work at an appropriate level of abstraction to avoid unnecessary complexity. For example, when you do business modeling, you should not be dealing with Java classes.

Figure 1: UML Collaboration

In this article, we will describe abstract flows as collaborations based on existing source code. Without abstraction, the resulting documentation would simply mirror existing source code and fail to provide an accurate overview of the legacy system. Consequently, it is important to formulate abstraction rules that identify significant elements in the given source.

Steps for Modeling Legacy Systems Using the UML

This section describes steps for modeling legacy systems, which are made up of a number of programs, each fulfilling some system functionality. One or more source files implements each program through a programming language.
In modeling a legacy system, it is important to understand both programming language constructs, and the programming style (i.e., the way the language is used) because both language and program-specific constructs are necessary to abstract the program, as depicted in Figure 2.

Figure 2: Legacy System Structure and Abstraction

Because legacy systems may be developed using multiple languages, the specifics of each language must be taken into account. This implies that, for each language used in the legacy system, we must perform the following steps:

1. Identify significant language constructs (i.e., language constructs that are useful in abstracting flows in a source file and understanding relationships between elements).
2. Describe language grammar.
3. Map language grammar elements to UML elements.
4. Identify views of the UML elements that abstract program structure and flows.

Once the language specifics have been handled, each program in the legacy system can be analyzed individually. This implies that for each program, we must take the following additional steps:

5. Identify language usage in the legacy program, including abstraction rules that are useful in abstracting program flows.
6. Organize the model of the legacy program.
7. Document the model.

Our approach distinguishes general language aspects from program-specific aspects, and our goal is to:

- Automate as much as possible the general language parts that are reusable across programs implemented in that language.
- Identify guidelines to support the program-specific portions.

General Language Aspects

This section describes how the general language aspects of modeling legacy systems are taken into account. The steps in this section have to be repeated once per implementation language.

Step One: Identify Significant Language Constructs
This first step examines significant language constructs to identify those that are useful in abstracting flows within a program. The significant language constructs for COBOL, along with explanations of their significance, are listed in Table 1.

Table 1: COBOL Significant Language Constructs

| Significant Language Construct | Explanation |
|---|---|
| COBOL Program | This is the physical unit of the legacy system to be modeled. The legacy system is analyzed by analyzing each program individually. |
| COBOL Working Storage Sections | This represents the information that is manipulated by the program. |
| COBOL Paragraph | This represents the decomposition of the COBOL program. |
| COBOL Statements | This represents each step in the flow of the COBOL program and is the next level decomposition of the COBOL paragraph. |
| COBOL Perform Statements | This describes the flow within the COBOL program by relating different COBOL paragraphs to each other. |
| COBOL Conditional Statements | This represents decisions described in the program: they are often a physical implementation of some business or technical rules. |
| Embedded SQL Statements | These describe database transactions. SQL statements can be used to identify database structures (i.e., tables and columns) that are usually poorly documented. |

Step Two: Describe Language Grammar

A language grammar is a formal representation of language constructs. Grammars for legacy systems tend to be much more complicated than those of newer languages such as Java and C#. One challenge is to define a good grammar for the target language. Distinguishing which language constructs are significant, as we did in the previous step, permits us to ignore the unimportant aspects of the grammar and consequently simplifies this step.

COBOL files often have characters before the seventh column, and after the seventy-second column of each source line, which are mingled with the actual source. These characters are unimportant, so we can have a preprocessor discard them -- which is a lot simpler than dealing with them within the parser.

Step Three: Map Language Grammar Elements to UML Elements

In this step, we will map language grammar elements to stereotyped UML elements. When defining this mapping, it is important to ensure that there is a one-to-one correspondence to avoid confusion. As there are embedded SQL statements in the source code that can give clues to dependencies in SQL tables, we will map both COBOL and SQL language constructs to UML.

Mapping COBOL to UML. A possible mapping from COBOL to UML is summarized in Table 2. The first column identifies the grammar element; the second column provides the UML mapping and its corresponding stereotype where applicable; and the third column gives the reason for the mapping.

Table 2: COBOL Language Grammar Mapping to UML

| No | Grammar Element | UML Mapping | Reason |
|---|---|---|---|
| 1 | Program | component <<program>> | This is the physical implementation of the program design. |
| 2 | Program | class <<program>> | This is the design of the program. |
| 3 | Program | use case | Because each program is written to fulfill some system functionality, it can be mapped to a use case. |
| 4 | Program | use-case realization | The realization of the program purpose is in the program itself. |
| 5 | Record Description | nested class stereotyped <<record>> or attribute | If the record description has an associated PIC declaration, it is an attribute; otherwise, it is a nested class belonging to the program. See Figure 3. |
| 6 | Paragraph | private operation | Each paragraph is a collection of steps that is not callable by another program. |
| 7 | Statement | activity | Each statement is a step within a paragraph. |
| 8 | Perform Statement | activity <<perform>> | This is a type of statement. |
| 9 | Perform Statement | reflexive message | This is a perform statement that calls another paragraph in the program. |
| 10 | Conditional Statement | decision <<condition>> | This is a type of statement. |
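Row 5 of Table 2 is the one mapping rule that depends on the content of the declaration, so it is worth sketching. The Python fragment below is only an illustration under simplifying assumptions: the regular expression, the `UmlElement` type, and the function name are hypothetical and are not part of the Rose COBOL Add-In, and real record descriptions (multi-line clauses, REDEFINES, OCCURS, and so on) would need a proper grammar.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical, simplified pattern: level number, data name, optional PIC clause.
RECORD_RE = re.compile(
    r"^\s*(?P<level>\d{2})\s+(?P<name>[A-Z0-9-]+)"
    r"(?:\s+PIC(?:TURE)?\s+(?:IS\s+)?(?P<pic>\S+))?",
    re.IGNORECASE,
)


@dataclass
class UmlElement:
    kind: str            # "attribute" or "nested class"
    name: str
    stereotype: str = ""


def map_record_description(line: str) -> Optional[UmlElement]:
    """Apply row 5 of Table 2: PIC present -> attribute, otherwise nested class."""
    match = RECORD_RE.match(line)
    if match is None:
        return None  # not a record description under this simplified pattern
    if match.group("pic"):
        return UmlElement("attribute", match.group("name"))
    return UmlElement("nested class", match.group("name"), "record")
```

For example, `map_record_description("01 CUSTOMER-REC.")` yields a nested class stereotyped `record`, while `map_record_description("05 CUST-NAME PIC X(30).")` yields an attribute.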
Figure 3: UML Mapping of COBOL Record Descriptions

Mapping Embedded SQL to UML. SQL statements are mapped into dependencies in class diagrams and stereotyped according to the type of SQL operation, such as select, insert, delete, update, create, and so on. Although it is possible to stereotype the dependencies as read, write, or read/write, the SQL operations designations are more meaningful. Furthermore, the types of SQL statements are limited and consequently do not clutter the class diagrams. Our proposed mapping from embedded SQL to UML is summarized in Table 3.

Table 3: Embedded SQL Language Grammar Mapping to UML

| No | Grammar Element | UML Mapping | Reason |
|---|---|---|---|
| 1 | SQL Table Clause | class <<table>> | A "FROM" clause in an embedded SQL statement indicates the presence of a table in some database. |
| 2 | SQL Statement | message | A running COBOL program sends messages to records of some SQL tables; that is, it sends a message to a record. |
| 3 | SQL Select Statement | dependency <<select>> | The select statement implies that the COBOL program is referencing a table; it does not modify the contents of the table. |
| 4 | SQL Insert Statement | dependency <<insert>> | The insert statement implies that the COBOL program is referencing a table, and it modifies the contents of the table. Stereotypes are also defined for <<update>>, <<create>>, and similar statements. |
| 5 | SQL Field Clause | attribute | Table fields (attributes) can be extracted from SQL statements. |

Step Four: Identify Views of UML Elements

This step defines views that abstract structural relationships and program flows. Several possible views, along with justifications for them, are listed in Table 4. We will see examples of these views in the next section after we introduce program-specific aspects. The views are generated automatically by a Rose COBOL Add-In, which we will describe in greater detail later on.

Table 4: Views of UML Elements Mapped from Language Grammar

| View | Diagram Type | Justification |
|---|---|---|
| View of Participating Tables | class diagram | This view presents the SQL tables that are manipulated by the program (see Figure 4). |
| View of Participating Classes | class diagram | This view presents the structure of the COBOL program and its constituent record descriptions, which are stereotyped as <<program>> and <<record>> respectively (see Figure 5). |
| SQL Table Manipulation | sequence diagram | This view describes how the COBOL program manipulates the database (see Figure 6). |
| Program Flow | activity diagram | This view describes the flow within a paragraph in the COBOL program (see Figure 7). |

Note that program flows for COBOL are represented in an activity diagram rather than a sequence diagram. Because there are no classes and objects in COBOL, a sequence diagram would contain only a single column; it would be visually unappealing and not very useful. Activity diagrams, on the other hand, are visually similar to flow charts, which are familiar to COBOL developers. In addition, you can add swimlanes, which are useful in subsequent refactoring. Each swimlane represents a COBOL file, and can be used to evaluate the impact of delegating flows to COBOL files.

Step Five: Identify Language Usage

In addition to the language-related steps we described above, it is necessary to identify program-specific styles and guidelines when analyzing legacy systems.

Identify Abstraction Rules.
Each development organization has specific guidelines for developing software and for the use of language constructs that can be exploited to abstract program flows. For example, an organization standard might dictate that:

- Remarks for each paragraph are placed after the paragraph name.
- Remarks for each SQL statement are placed before the SQL statement.

Note that these standards are not always adhered to, so there is always some tidying up required.

Earlier, we discussed the significant effort that is sometimes required to derive a robust grammar. One possible way to make this easier is to introduce preprocessing that discards unnecessary elements in the grammar. The example we mentioned above was discarding characters in a COBOL program that appear before the seventh column and after the seventy-second column, which are skipped by the COBOL compiler.

Identifying Abstraction Rules for COBOL Programs. A typical COBOL program contains a large number of record descriptions, paragraphs, and embedded SQL statements. We need to identify which of these are crucial to the understanding of the program by employing the rules listed in Table 5.

Table 5: Abstraction Rules for COBOL Programs

| Significant Parts | Justification |
|---|---|
| Main Paragraph | This is the first paragraph called when the program executes. It calls other paragraphs and normally provides a good overview of the program. In our case, this paragraph is labeled as 0000-MAIN. |
| Main Subordinate Paragraphs | These are the immediate paragraphs called by the main paragraph. |
| Paragraphs with Embedded SQL Statements | These paragraphs are significant because they invoke database transactions. |
| Record Descriptions in an SQL Statement | Record descriptions (attributes) found in SQL statements are significant because of their relationship with the SQL tables. |

The rules listed in Table 5 are only a starting point and are at best guidelines; exceptions are frequent. Human intervention is required to decide if a specific paragraph or record description is truly significant.

Step Six: Organize the Model

Once we have mapped the constructs and parts of a COBOL program into UML, we need to decide where to place them in a model, and how the significant parts will be documented through appropriate views.
We can use the design model organization listed in Table 6.

Table 6: Design Model Organization

| Design Model | Justification |
|---|---|
| COBOL Package | This package holds the COBOL program as a class stereotyped <<program>>. |
| Table Package | This package holds all tables extracted from SQL tables, and is useful in understanding the schema of the relational database referenced by the COBOL program. |
| Use-Case Realization Package | This package holds the diagrams describing significant aspects of the COBOL program. These diagrams will be treated as important views below and are illustrated in Figures 4 through 7. |

Step Seven: Document the Model

Having defined the mapping from the given legacy language to UML, and the mapping from the given COBOL program to a design model organization, we can now start to document the modeling results. The document is expressed in terms of views to the model:

- View of participating tables
- View of participating classes
- SQL table interaction
- Program flow

These views are described in the following subsections.

View of Participating Tables. Figure 4 shows an example View of Participating Tables, which depicts the static relationships between the COBOL program and the SQL tables. Each type of access is stereotyped according to the SQL clause through which the COBOL program manipulates the table. Tables can be manipulated in many ways; consequently, a table can have multiple dependencies, each with a different stereotype.
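The stereotyped dependencies in this view can be derived from the embedded SQL alone. The following Python sketch shows one way to collect (stereotype, table) pairs from `EXEC SQL ... END-EXEC` blocks. It is a rough approximation under stated assumptions: the patterns and the function name are hypothetical, only the first table of each statement is considered, and joins, subqueries, and cursors are ignored. It does not describe how the Rose COBOL Add-In works internally.

```python
import re

# Embedded SQL in COBOL is delimited by EXEC SQL ... END-EXEC.
EXEC_SQL_RE = re.compile(
    r"EXEC\s+SQL\s+(.*?)\s+END-EXEC", re.IGNORECASE | re.DOTALL
)

# Hypothetical, simplified patterns: one table name per statement type.
TABLE_PATTERNS = {
    "select": re.compile(r"\bFROM\s+([A-Z0-9_]+)", re.IGNORECASE),
    "insert": re.compile(r"\bINSERT\s+INTO\s+([A-Z0-9_]+)", re.IGNORECASE),
    "update": re.compile(r"\bUPDATE\s+([A-Z0-9_]+)", re.IGNORECASE),
    "delete": re.compile(r"\bDELETE\s+FROM\s+([A-Z0-9_]+)", re.IGNORECASE),
}


def table_dependencies(source: str) -> list:
    """Return sorted, de-duplicated (stereotype, table) pairs found in the source."""
    dependencies = set()
    for block in EXEC_SQL_RE.finditer(source):
        statement = block.group(1)
        words = statement.split()
        if not words:
            continue
        operation = words[0].lower()
        pattern = TABLE_PATTERNS.get(operation)
        if pattern is None:
            continue  # e.g. INCLUDE or DECLARE; not mapped in this sketch
        table = pattern.search(statement)
        if table:
            dependencies.add((operation, table.group(1).upper()))
    return sorted(dependencies)
```

Because the result is a set of pairs, a table that is both selected from and updated appears twice, once per stereotype, which matches the observation that a table can have multiple dependencies.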
Figure 4: View of Participating Tables

View of Participating Classes. Figure 5 shows an example View of Participating Classes, depicting the static relationships between the COBOL program and its record descriptions. If we attempt to map this with the analysis classes in the RUP, then the COBOL program will play the role of both a boundary class and a control class in a use-case realization. The record description classes will play the role of entity classes. This analogy will be useful when attempting to refactor the COBOL program. In Figure 5, the stereotype <<significant>> is based on the rules identified in Table 5.

Figure 5: View of Participating Classes
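The paragraph-related rules of Table 5 that drive the <<significant>> stereotype lend themselves to a first-cut automation. The Python sketch below assumes that the program has already been split into a dictionary of paragraph names and bodies; that input shape, the function name, and the returned labels are all hypothetical. As the article stresses, the output is only a candidate list that still needs human review.

```python
import re

PERFORM_RE = re.compile(r"\bPERFORM\s+([A-Z0-9-]+)", re.IGNORECASE)
EXEC_SQL_RE = re.compile(r"\bEXEC\s+SQL\b", re.IGNORECASE)


def significant_paragraphs(paragraphs: dict, main: str = "0000-MAIN") -> dict:
    """Return {paragraph name: reason} for paragraphs matching the Table 5 rules.

    `paragraphs` maps paragraph names to their source text (an assumed input).
    """
    significant = {}
    if main in paragraphs:
        significant[main] = "main paragraph"
        # Paragraphs performed directly by the main paragraph.
        for target in PERFORM_RE.findall(paragraphs[main]):
            if target in paragraphs and target not in significant:
                significant[target] = "main subordinate paragraph"
    # Paragraphs that contain embedded SQL.
    for name, body in paragraphs.items():
        if name not in significant and EXEC_SQL_RE.search(body):
            significant[name] = "paragraph with embedded SQL"
    return significant
```

Words such as `UNTIL` or `VARYING` that follow an inline `PERFORM` are matched by the pattern but dropped, because they are not names of known paragraphs.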
SQL Table Interaction View. Figure 6 is an example of an SQL Table Interaction diagram. It describes sequentially how the COBOL program makes the transaction. If failure modes have to be documented, they will appear as separate SQL table interaction diagrams. Each message in the sequence diagram has a name equivalent to the SQL clause. The detail of each message is in its description and consequently not shown in Figure 6.

Figure 6: SQL Table Interaction Diagram

Program Flow View. Figure 7 shows an example Program Flow View. It is an activity diagram describing the conditional logic within a paragraph. Such diagrams are useful for describing both business logic and transactional logic.
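A first step toward such an activity diagram is to classify each statement of a paragraph according to rows 7, 8, and 10 of Table 2. The Python sketch below does only that. It assumes the paragraph has already been split into statements, and the set of conditional verbs and the function names are illustrative choices rather than anything taken from the Add-In; connecting the resulting nodes with transitions is left out.

```python
# Verbs treated as conditional statements in this sketch (an assumption).
CONDITIONAL_VERBS = {"IF", "EVALUATE"}


def classify_statement(statement: str) -> tuple:
    """Return (UML element kind, stereotype) for one COBOL statement."""
    words = statement.split()
    verb = words[0].upper() if words else ""
    if verb in CONDITIONAL_VERBS:
        return ("decision", "condition")
    if verb == "PERFORM":
        return ("activity", "perform")
    return ("activity", "")


def paragraph_flow(statements: list) -> list:
    """Return (text, kind, stereotype) nodes in source order, skipping blanks."""
    flow = []
    for statement in statements:
        text = statement.strip()
        if not text:
            continue
        kind, stereotype = classify_statement(text)
        flow.append((text, kind, stereotype))
    return flow
```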
Figure 7: Program Flow View

Note that conditions in Figure 7 are represented as states stereotyped as <<Condition>> rather than as decisions, which we saw in Table 2. Our Rose COBOL Add-In uses a UML activity instead.

Automated Modeling

Rather than manually abstracting program structure and flows from source code, it is better to have a tool that automatically generates the "first cut" and allows the generated model to be manipulated. This section discusses how to achieve automation as well as the additional benefits of automation for supporting legacy systems. It also discusses the requirements and design of such an automated tool.

Requirements

The requirements of an automated modeling tool are summarized as a use-case model in Figure 8, in which the Actors are members of a support team consisting of the developer, IT manager, maintenance team, and architect.

Figure 8: Actors and Use Cases For Automated Modeling Tool

Table 7 describes the use cases in this model.

Table 7: Use Cases for Automated Modeling Tool

| No | Use Case | Description |
|---|---|---|
| 1 | Manage Legacy Model | Create and update the legacy model. |
| 2 | Document Legacy System | Reverse engineer existing COBOL source and produce Word documents. |
| 3 | Analyze Impact of Change | Trace one part of the reverse-engineered model to another, including queries searching for impacted UML elements. |
| 4 | Make Source Code Changes | The maintenance team makes changes, which must be synchronized to the reverse-engineered model. |
| 5 | Review Changes | The manager reviews changes to the source code or model to ensure that only relevant portions are modified. |
| 6 | Refactor Architecture | The architect might reorganize the model to improve resiliency. Elements of the new model must trace to the existing one. Automatic refactoring rules might be defined to generate the new model. |

The use cases outlined in Table 7 have a much larger scope than merely modeling legacy code. This is because we do not model a legacy system for the sake of documenting it, but to maintain and upgrade it.

We can partially implement the use cases for the automated modeling tool through our Rose COBOL Add-In and related Rational tools, as follows.

- Manage Legacy Model Use Case: an inherent capability of Rose.
- Document Legacy System Use Case: fulfilled by our Rose COBOL Add-In, which abstracts the legacy system in a Rose model, and also by Rational SoDA, which walks through the Rose model and generates appropriate reports.
- Analyze Impact of Change Use Case: fulfilled by the Rose COBOL Add-In as it captures the relationships between elements within the Rose model and presents them through appropriate views (described above).
- Review Changes Use Case: fulfilled by allowing the user to hyperlink to the source code from the model.

Although a "Refactor Architecture" use case would be very beneficial, identifying and codifying refactoring rules is not simple, so we will leave this to future research.
However, having a UML model of the legacy system allows us to perform refactoring with UML rather than COBOL. By operating at a higher level of abstraction and in a language-independent environment (in UML), the refactoring rules can be made generic and not limited to COBOL.

Structure of the Automated Modeling Tool

An overview of the structure elements in our Automated Modeling Tool is depicted in Figure 9 as a class diagram. This figure can be viewed as a generic structure for automating the modeling steps discussed earlier and can be applied to systems implemented by any language.

Figure 9: Structure of the Automated Modeling Tool

The elements in Figure 9 are stereotyped as entities, which represent information, and controllers, which manipulate the information. Entities are denoted by the entity class stereotypes; controllers are denoted by control class stereotypes. Every entity in this figure contains a number of elements. For example, the legacy source code contains characters; the UML representation contains many UML elements mapped from the source code. The notes in this figure describe the mapping rules used to relate adjacent entities. The entities are described in Table 8. Note that the refactoring mechanism has not yet been implemented.

Table 8: Structural Elements

| No | Architectural Entities | Description |
|---|---|---|
| 1 | Original Legacy Source | This represents the existing source code. |
| 2 | Cleaned Legacy Source | This is generated by running the original source code through a preprocessor. Existing source code may be preformatted with additional characters outside the standard COBOL constructs. |
| 3 | UML Representation | This is the UML representation of the legacy source, which is generated by running the cleaned legacy code through a parser. |
| 4 | Rose Representation | Because of Rose's internal architecture, the UML representation is not directly consumable by Rational Rose and has to be mapped to a Rose representation. In addition, the Rose modeller insulates changes in the Rose code base. A separate instance of the Rose representation is generated by a refactoring mechanism. |
| 5 | Documentation of Legacy Code | This is the document extracted from the Rose representation by Rational SoDA. |
| 6 | Other Language Representation | The refactored Rose representation can be used to forward generate code in another language through other Rose language add-ins, such as Java and VB. |

Implementation

The UML diagrams in Figures 4-7 were generated with the Rose COBOL Add-In. Dragging the preprocessed COBOL source into any class diagram in Rose will perform the automated modeling. This will generate several packages and diagrams, as depicted in Figure 10.
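The preprocessing that turns the original legacy source into the cleaned legacy source (entries 1 and 2 of Table 8) can be very small. The Python sketch below keeps only columns 7 through 72 of each line, as the article describes. The function names are hypothetical, the sketch assumes fixed-format source without tab characters, and it is not the preprocessor shipped with the Add-In.

```python
def clean_cobol_line(line: str) -> str:
    """Keep columns 7 through 72 (1-based) of one fixed-format COBOL line."""
    line = line.rstrip("\r\n")
    # Index 6 is column 7; the slice end 72 is exclusive, so column 72 is kept.
    return line[6:72].rstrip()


def clean_cobol_source(text: str) -> str:
    """Apply the column clean-up to every line of a source file."""
    return "\n".join(clean_cobol_line(line) for line in text.splitlines())
```

Column 7 is kept because it still carries meaning for a parser (for example, an asterisk there marks a comment line); only the characters before it and after column 72 are dropped.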
Figure 10: Rose Screen Showing Generated COBOL Packages

In the logical view, three packages are created:

- A COBOL package containing COBOL source.
- A tables package containing the SQL tables accessed.
- A use-case realization package containing several views described earlier.

The Rose COBOL Add-In also supports a traceability capability. The user can right click on a COBOL operation or activity and select Browse. The Add-In will load the COBOL file and highlight the code from which the UML classifier is mapped. For example, Figure 11 is produced by the Add-In when you right click on one of the activities shown in Figure 7. This feature is useful to check whether the mapping is correct.
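Traceability of this kind presupposes that each generated element remembers where it came from. One simple way to model that, shown here as a Python sketch, is to store the source line span with every element at parse time. The `TracedElement` type and the `browse` function are hypothetical illustrations of the idea; they are not the Rose Extensibility Interface and say nothing about how the Add-In stores this information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TracedElement:
    name: str
    kind: str          # e.g. "operation" or "activity"
    first_line: int    # 1-based, inclusive
    last_line: int     # 1-based, inclusive


def browse(element: TracedElement, source_lines: list) -> list:
    """Return the source lines from which the element was mapped."""
    return source_lines[element.first_line - 1 : element.last_line]
```

A viewer could then highlight exactly the lines returned by `browse`, which is also a convenient way to spot-check that a mapping rule picked up the intended code.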
Figure 11: Tracing UML Mapping to Legacy Source Code

Evaluating the Legacy Model

Now that we have automatically generated a visual model, we want to establish how accurately it reflects the given legacy system. Our model was produced through the set of transformations shown in Figure 9; consequently, the quality of our model can be evaluated by examining these transformations and asking questions such as the following.

- Consistency: Is the modeling process repeatable and reversible? At every stage in the modeling process, can we reconstruct the previous stage? Consistency demands a bidirectional relationship between the entities shown in Figure 9. For example, if a COBOL program is mapped to a UML class stereotyped <<program>>, then the converse must be valid.
- Completeness: Have we considered every factor during the transformation? The modeling process is complete if no information is lost during the modeling process. Completeness demands that every element in one entity must be mapped to another element in the next stage.

Consistency and completeness are properties of the mapping rules between adjacent entities. Consequently, the goal of any modeling process is to define good mapping rules. Once these properties are achieved, it is possible to synchronize the contents of the source code and the UML model.

Figure 12 illustrates the concepts of consistency and completeness. It shows a mapping that is not fully complete and consistent because Source Element 4 is not mapped to anything (i.e., not complete), and the mapping from Source Element 3 to UML Element 3 is unidirectional (i.e., not consistent). However, the mapping between both entities is consistent and complete with respect to the significant elements.

Figure 12: Consistency and Completeness of Modeling Process: Example of Incomplete Mapping

It is very difficult -- and it requires too much detail -- to establish a fully complete and consistent modeling process. Instead, we focus on the significant elements and on achieving completeness with respect to those elements. The modeling process described in this article is founded on this principle, and it also distinguishes significant elements that are language specific and program/organization specific.

Further Development

In this article, we have discussed the steps necessary to construct a visual model of a legacy system and how to automate those steps. Ours is a UML model, physically stored using Rose's internal representation. We demonstrated the validity of this approach through the implementation of our Rose COBOL Add-In, and then evaluated the modeling process in terms of consistency and completeness.

Continuation of this work will depend very much on customer needs and requests. In fact, I developed the Rose COBOL Add-In for a customer who had numerous COBOL applications that were sparsely documented. The customer was a large organization that also intended to adopt the RUP and UML, so the notion of having existing COBOL applications documented in UML was very attractive. The support team would also be trained in UML, and the documentation would be aligned with that in the RUP, for example, as use-case realizations (Figures 4-7 show how this can be achieved).

There is strong interest from other potential clients who have seen demonstrations of the Rose COBOL Add-In, because they recognize the universality of the modeling approach and the advantages of UML. I have also received requests to abstract program flows developed in other languages, including some for poorly documented systems developed in Java.
Based on the same principles discussed in this article, my teammates and I developed a separate Rose add-in to automatically generate sequence diagrams from Java code (i.e., a Rose Java Add-In). To do this, we identified abstraction rules to simplify an otherwise very complicated sequence diagram. In addition, the Rose Java Add-In has a complexity indicator for each operation. Developers need to examine only the operations with high indicators.

In another case, a client had a large application comprising 200+ collaborating executables (EXEs) and dynamic link libraries (DLLs) developed in Visual Basic 5 and Delphi. In this case, the significant elements were EXEs and DLLs rather than source code.

Given the increasing interest in the UML and RUP within the software community, it is reasonable to expect that demands for automatic modeling of legacy systems will continue to grow. Although the automatic generation of a UML model solves only part of the challenges faced by support teams, it is an important part. In addition, an automated modeling tool can facilitate several other tasks, such as:

- Synchronizing the model against code changes.
- Reviewing code changes.
- Refactoring the legacy model and transforming it into another implementation language.

Clearly, more exploration and feedback are required, especially in the areas of formulating abstraction rules, identifying useful viewpoints into the model, and identifying refactoring rules. Identifying and structuring these rules will be the basis for subsequent enhancements.

Copyright Rational Software 2002