Automatic Structure Discovery for Large Source Code


Automatic Structure Discovery for Large Source Code
By Sarge Rogatch
Master Thesis, Universiteit van Amsterdam, Artificial Intelligence, 2010

Acknowledgements

I would like to acknowledge the researchers and developers who are not even aware of this project, but their findings have played a very significant role:
Soot developers: Raja Vallée-Rai, Phong Co, Etienne Gagnon, Laurie Hendren, Patrick Lam, and others.
TreeViz developer: Werner Randelshofer
H3 Layout author and H3Viewer developer: Tamara Munzner
Researchers of static call graph construction: Ondřej Lhoták, Vijay Sundaresan, David Bacon, Peter Sweeney
Researchers of Reverse Architecting: Heidar Pirzadeh, Abdelwahab Hamou-Lhadj, Timothy Lethbridge, Luay Alawneh
Researchers of Min Cut related problems: Dan Gusfield, Andrew Goldberg, Maxim Babenko, Boris Cherkassky, Kostas Tsioutsiouliklis, Gary Flake, Robert Tarjan

Contents

1 Abstract
2 Introduction
   Project Summary
   Global Context
   Relevance for Artificial Intelligence
   Problem Analysis
   Hypotheses
   Business Applications
   Thesis Outline
3 Literature and Tools Survey
   Source code analysis
      Soot
      Rascal
   Clustering
      Particularly Considered Methods
         Affinity Propagation
         Clique Percolation Method
         Based on Graph Cut
      Other Clustering Methods
         Network Structure Indices based
         Hierarchical clustering methods
4 Background
   Max Flow & Min Cut algorithm
      Goldberg's implementation
   Min Cut Tree algorithm
      Gusfield algorithm
   Community heuristic
   Flake-Tarjan clustering
      Alpha-clustering
      Hierarchical version
   Call Graph extraction
      The Problem of Utility Artifacts
      Various Algorithms
5 Theory
   Normalization
      Directed Graph to Undirected
      Leverage
      An argument against fan-out analysis
      Lifting the Granularity
      An Alternative
   Merging Heterogeneous Dependencies
   Alpha-search
      Search Tree Prioritization
   Hierarchizing the Partitions
   Distributed Computation
   Perfect Dependency Structures
      Maximum Spanning Tree
      5.6.2 Root Selection Heuristic
6 Implementation and Specification
   Key Choices
      Reducing Real- to Integer-Weighted Flow Graph
   Results Presentation
      File formats
      Visualization
   Processing Pipeline
7 Evaluation
   Experiments
      Analyzed Software and Dimensions
   Interpretation of the Results
      Architectural Insights
      Class purpose from library neighbors
         Obvious from class name
         Hardly obvious from class name
         Not obvious from class name
         Class name seems to contradict the purpose
      Classes that act together
         Coupled classes are in different packages
         Coupled classes are in the same package
      Suspicious overuse of a generic artifact
      Differentiation of coupling within a package
      A package of omnipresent classes
      Security attack & protection
      How implemented, what it does, where used
8 Further Work
   A Self-Improving Program
   Cut-based clustering
      Connection strengths
      Alpha-threshold
9 Conclusion
   Major challenges
      Worst-case complexity
      Data extraction
      Noise in the input
      Domain specifics
      Evaluation of the results
   Gain over the state of the art
      Practical contributions
      Scientific contributions
10 Appendices
   Evidence
      A package of omnipresent client artifacts
      Security mechanism: identify and circumscribe
      Subsystems
      Insight on the implementation
      An overcomplicated subsystem
      Divide & Conquer
      Utility Artifacts
   10.2 Statistical Measures
      Packages by ubiquity
      Architectural Fitness of SE Class couplings
   Analyzed Source Code Examples
      Dependency analysis
         A Java class that does not use calls or external field accesses
         Some classes whose dependencies are not specific at all
         Dependencies lost at Java compile time
         Problematic cases
      Call Graph Extraction
      Class name contradicts the purpose
   Visualizations
      State of the art tool STAN
      Cfinder (Clique Percolation Method)
      Cluster Tree in Text
         Indentation by Height
         Bracketed presentation
      Sunray Representation
      Sunburst
      Hyperbolic tree (plane)
      Circular TreeMap
         View on the whole program
         Parts of the architecture
      H3 Sphere Layout
         Near class CrmSOAP
         Classes that act together
References

1 Abstract

In this work we attempt to infer software architecture from source code automatically. We have studied and used unsupervised learning methods for this, namely clustering. The state of the art source code (structure) analysis methods and tools were explored, and the ongoing research in software reverse architecting was studied. Graph clustering based on minimum cut trees is a recent algorithm which satisfies strong theoretical criteria and performs well in practice, in terms of both speed and accuracy. Successful applications of it in the domain of Web and citation graphs have been reported. To our knowledge, however, there has been no application of this algorithm to the domain of reverse architecting. Moreover, most of the existing software artifact clustering research addresses legacy systems in procedural languages or C++, while we aim at modern object-oriented languages and the implied character of relations between software engineering artifacts. We consider this research direction important because this clustering method allows substantially larger tasks to be solved, which in particular means that we can cluster software engineering artifacts at class-level granularity, while earlier approaches were only able to cluster real-world software projects at package level. Given the target domain and the supposed way of usage, a number of aspects had to be researched, and these are the main contributions of our work:
- extraction of software engineering artifacts and relations among them (using state of the art tools), and presentation of this information as a graph suitable for clustering
- edge weight normalization: we have developed a directed-to-undirected graph normalization, which is specific to the domain and alleviates the widely known and essential problem of utility artifacts
- a parameter (alpha) search strategy for hierarchical clustering and an algorithm for merging the partitions into the hierarchy in arbitrary order
- a distributed version for cloud computing
- a solution for an important issue in the clustering results, namely, too many sibling clusters due to an almost acyclic graph of relations between them, which is usually the case in the source code domain
- an algorithm for computing a package/namespace ubiquity metric, which is based on the statistics of merge operations that occur in the cluster tree
A prototype incorporating the above points has been implemented within this work. Experiments were performed on real-world software projects. The computed clustering hierarchies were visualized using state of the art tools, and a number of statistical metrics over the results were calculated. We have also analyzed the remaining issues we encountered and provided promising directions for further work. It is not possible to infer similar architectural insights with any existing approach; an account of this is given in this paper. We conclude that our integrated approach is applicable to large software projects in object-oriented languages and produces meaningful information about source code structure.

2 Introduction

As the size of software systems increases, the algorithms and data structures of the computation no longer constitute the major design problems. When systems are constructed from many components, the organization of the overall system, the software architecture, presents a new set of design problems. This level of design has been addressed in a number of ways including informal diagrams and descriptive terms, module interconnection languages, templates and frameworks for systems that serve the needs of specific domains, and formal models of component integration mechanisms [Gar1993]. The software architecture of a program or computing system is the structure or structures of the system, which comprise software components, the externally visible properties of those components, and the relationships between them. The term also refers to documentation of a system's software architecture. Documenting software architecture facilitates communication between stakeholders, documents early decisions about high-level design, and allows reuse of design components and patterns between projects [Bass2003]. Software architecture determines the quality attributes exhibited by the system, such as fault-tolerance, backward compatibility, extensibility, flexibility, reliability, maintainability, availability, security, usability, and other -ilities. When performing software quality analysis, we can split the features under analysis into two principal categories:
- apparent: how the software behaves and looks
- latent: what the potential of the software is, what is in its source code and documentation
This is illustrated in Figure 2-1 below.
[Figure 2-1: Quality analysis options (apparent vs. latent features of software architecture and source code)]
We can analyze the apparent features directly. But in order to analyze the latent features, we need to analyze the source code. The latter is effort-intensive if performed manually. Software

architecture is a high-level view of the source code, describing the important facts and omitting the details. In the ideal case, the software architecture is available (in a document) and reflects the source code precisely. Then quality analysis performed only on the software architecture will give good coverage of the latent features (perhaps except minor unknown bugs). In the worst case, only the source code is available for the software, with no documentation at all, i.e. the architecture is not known. Then we can either descend to manual source code analysis, or try to infer the software architecture from the source code automatically! Usually, software is not well documented: the software architecture is either too loosely described in the documentation, or only available for some parts of the software, or has become out of sync with the actual source code. In this case, we can utilize the available fragments for semi-supervised inference of the software architecture from the source code (data) and the documentation (labels). The dashed bidirectional arrow in the picture above denotes that the actual software architecture (how the source code is written) can become inconsistent with the claimed software architecture (how it is designed in the documentation). Development in a rush, time pressure, quick wins, hacks and workarounds are some of the reasons why it usually happens so; even when there is no explicit software architecture (no documentation), there is some implicit software architecture which is in the source code (the actual architecture). Software maintenance and evolution is an essential part of the software life cycle. In an ideal situation, one relies on system documentation to make any change to the system that preserves the system's reliability and other quality attributes [Pir2009]. However, it has been shown in practice that the documentation associated with many existing systems is often incomplete, inconsistent, or even inexistent [Let2003], which makes software maintenance a tedious and human-intensive task. This is further complicated by the fact that key developers, knowledgeable of the system's design, commonly move to new projects or companies, taking with them valuable technical and domain knowledge about the system [ACDC2000]. The objective of design and architecture recovery techniques is to recover high-level design views of the system, such as its architecture or any other high-level design models, from low-level system artifacts such as the source code. Software engineers can use these models to gain an overall understanding of the system that would help them accomplish effectively the maintenance task assigned to them [Pir2009]. The most dominating research area in architecture reconstruction is the inference of the structural decomposition. At the lower level, one groups global declarations such as variables, routines, types, and classes into modules. At the higher level, modules are clustered into subsystems. The result is flat or hierarchical modules. Hierarchical modules are often called subsystems. While earlier research focused on flat modules for procedural systems, newer research addresses hierarchical modules [Kosc2009].

2.1 Project Summary

All but trivial changes in software systems require a global understanding of the system to be changed. Such non-trivial tasks include migrations, auditing, application integration, or impact analysis. A global understanding cannot be achieved by looking at every single statement. The source code provides a huge amount of detail in which we cannot see the forest for the trees. Instead, to understand large systems, we need a more coarse-grained map: software

architecture. Software architecture reconstruction is the form of reverse engineering in which architectural information is reconstructed for an existing system [Kosc2009]. Many companies have huge repositories of source code, often in different programming languages. Automatic source code analysis tools also produce a lot of data with issues, metrics, and dependencies in the source code. This data has to be processed and visualized in order to give insightful information to IT quality experts, developers, users and managers. There are a number of source code visualization methods and tools that address this problem with different levels of success. In this project we plan to apply Artificial Intelligence techniques to the problem of source code visualization. Applications of AI (Cluster Analysis) to collaboration, word association and protein interaction analysis [Pal2005], and to social network and WWW analysis [Fla2004], where lots of data must also be processed, are well known and produce fruitful results. We hope in this project to identify similar opportunities in the software visualization domain. We further realize that our task is best characterized as reverse architecting, a term appearing in the literature [Riv2000]: reverse architecting is a flavour of reverse engineering that is concerned with the extraction of software architecture models from the system implementation. The known Artificial Intelligence algorithms, such as clustering of graphs, either optimize specific statistical criteria, or exploit the underlying structure or other known characteristics of the data. In our case, the data is extracted from the source code of software. The vertices of the graph under analysis are software engineering artifacts, where the artifacts can be of different granularity: from instructions/operators to methods/fields, and then to classes, modules, packages and libraries. The edges of our graph are dependencies between the artifacts, which also have different granularities in their turn: from edges of the control flow graph, to edges of the method call and field access graphs, and then to edges of the class coupling graph, the package usage and library dependency graphs. Within the scope of this project we view the following stages:
1 Extract the SE artifacts and their relations, such as function/method call and field access graphs, inheritance/subtyping relations and metrics, which is a matter of prerequisite tools. Though some uncertain decision making is needed even at this stage (e.g. polymorphism handling within static call graph extraction), we take the state of the art methods and do not focus on their improvement; however, we try to use the best of the available prerequisites and use several of them in case they are non-dominated, i.e. none of them is better in all the aspects.
2 Devise an automatic grouping of the extracted artifacts in order to achieve meaningful visualizations of them. We focus on this.
3 Visualize the results and analyze source code quality taking into account the results of clustering. These tasks are also hard - the former involves automatic graph layout and the latter involves uncertain decision making - and are thus left to state of the art tools or human experts.
We implement a prototype called InSoAr, abbreviated from Infer Software Architecture.
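To make the three stages above concrete, the sketch below shows how such a pipeline could be wired together. It is a minimal illustration only: the type and method names (GraphExtractor, WeightNormalizer, Clusterer, InSoArPipeline, etc.) are hypothetical and are not the actual InSoAr API.

```java
// Hypothetical sketch of a three-stage InSoAr-style pipeline; all names are illustrative only.
class ArtifactGraph {}   // SE artifacts (vertices) and their relations (weighted edges)
class ClusterTree {}     // nested software decomposition

interface GraphExtractor {                 // stage 1: extract SE artifacts and relations
    ArtifactGraph extract(String classpath);
}
interface WeightNormalizer {               // stage 2a: directed-to-undirected edge weight normalization
    ArtifactGraph normalize(ArtifactGraph directed);
}
interface Clusterer {                      // stage 2b: e.g. minimum-cut-tree based clustering
    ClusterTree cluster(ArtifactGraph undirected);
}
interface Visualizer {                     // stage 3: left to state of the art tools
    void render(ClusterTree tree, String outputFile);
}

final class InSoArPipeline {
    private final GraphExtractor extractor;
    private final WeightNormalizer normalizer;
    private final Clusterer clusterer;
    private final Visualizer visualizer;

    InSoArPipeline(GraphExtractor e, WeightNormalizer n, Clusterer c, Visualizer v) {
        this.extractor = e; this.normalizer = n; this.clusterer = c; this.visualizer = v;
    }

    void run(String classpath, String outputFile) {
        ArtifactGraph raw = extractor.extract(classpath);      // call graph, field accesses, inheritance, ...
        ArtifactGraph weighted = normalizer.normalize(raw);    // alleviate utility artifacts, make undirected
        ClusterTree tree = clusterer.cluster(weighted);        // nested software decomposition
        visualizer.render(tree, outputFile);                   // e.g. input for TreeViz / H3Viewer
    }
}
```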
As different viewpoints specify what information should be reconstructed (in our particular case of automatic reconstruction, inferred by our program) and help to structure the reconstruction process [Kosc2009], we disambiguate the meaning in which we use (reversed) software architecture in the context of the goal we pursue and the major facts our program infers: nested software decomposition. This term is adopted in the existing works on reverse architecting ([END2004], [UpMJ2007], [Andre2007]). We decompose a set of software engineering artifacts (e.g. Java classes) according to the coupling between SE artifacts. We assume that a nested software decomposition, in which artifacts serving a similar purpose or acting in a composite mechanism are grouped together, is the most insightful and desirable for software engineers. This is confirmed in [Kosc2009] (see section 2 of the thesis), and we

discuss further in the thesis the works that attempt to create nested software decompositions ([Ser2008], [Maqb2007], [Rays2000], [Pate2009]).

2.2 Global Context

Existing Software Visualization tools first extract various metrics about the source code, like the number of lines, comments, complexity and object-oriented design metrics, as well as dependencies in the source code like call graphs, inheritance/subtyping and other relations. As the next step they visualize the extracted data and present it to the user in an interactive manner, by allowing zooming and drill-down or expand/collapse. Examples of these tools are STAN [Stan2009], SQuAVisiT [Rou2007], DA4Java [Pin2008] and Rascal [Kli2009, also personal communication with Paul Klint]. A common problem of such tools is that there are too many SE artifacts to look at everything at once. According to [Stan2009]: To get useful graphs, we have to carefully select the perspectives and scopes. Otherwise we end up with very big and clumsy graphs. For example, we may want to look at an artifact to see how its contained artifacts interact or how the artifact itself interacts with the rest of the application. The DA4Java tool [Pin2008] attempts to solve this problem by allowing the user to add or remove the artifacts the user wants to see in the visualization, and, also, to drill down/up from containing artifacts to the contained artifacts (e.g. from packages to classes). We want to solve the problem of the overwhelming number of artifacts by grouping them using AI techniques such as clustering, learning and classification, so that a reasonable number of groups is presented to the user. Among the available AI techniques, graph clustering is known to be applied in the software visualization domain. It seems that many clusterizers of software artifacts employ non-MinCut-based techniques; refer to [Maqb2007] for a broad review of the existing clusterizers and [Ser2008] for a recent particular clusterizer. There is some grounding for this, as according to [Ding2001] a MinCut-based clustering algorithm tends to produce skewed cuts. In other words, a very small subgraph is cut away in each cut. However, this might not be a problem for graphs from the domain of source code analysis. Another motivation for applying MinCut-based clustering algorithms in our domain arises from the fact that software normally has clearly defined entry points (sources, in terms of MaxFlow-like algorithms) and less clearly defined exit points (sinks, in terms of MaxFlow). A good choice of sink points is also a matter of research, while the current candidates in mind are: library functions, dead-end functions (which do not call any others), and the program runtime termination points. For extraction of Software Engineering artifacts we plan to use existing tools such as Soot (Sable group of McGill University) [Soot1999] and Rascal (CWI) [Kli2009]. Soot builds a complete model of the program under analysis, either from the source code or from the compiled Java bytecode. The main value of this tool for our project is that it implements some heuristics for static call graph extraction.

2.3 Relevance for Artificial Intelligence

Many AI methods require parameters, and the performance of the methods depends on the choice of parameter values. The best choice of methods or parameter values is almost always domain-specific, usually problem-specific and sometimes even data-specific. We want to investigate these peculiarities for the domain of source code visualization and reverse engineering.
Automatic clustering algorithms have a rich history in the artificial intelligence literature, and in not so recent years they have been applied to understanding programs written in procedural languages [Man1998]. The purpose of an automatic clustering algorithm in artificial intelligence is to group together similar entities. Automatic clustering algorithms are used

within the context of program understanding to discover the structure (architecture) of the program under study [Rays2000]. One example of the specifics of software clustering is that we want to cluster entities based on their unity of purpose rather than their unity of form. It is not useful to cluster all four-letter variables together, even though they are similar [Rays2000]. In this project we attempt to regard relations that expose the unity of purpose, and we use the notion of similarity in this meaning. One of the long-term goals of Artificial Intelligence is the creation of a self-improving program. Perhaps this can be approached by implementing a program that does reverse engineering and then forward engineering of its own source code. In between there must be a high-level understanding of the program, its architecture. It is not clear what understanding is, but it seems to have much in common with the ability to visualize, explain to others and predict behavior. This project is a small step towards automatic comprehension of software by software.

2.4 Problem Analysis

The particular problem of interest is inference of the architecture and different facts about software from its source code. It is desired that a high-level view on software source code is provided to human experts automatically, omitting the details that do not need human attention at this stage. Software products often lack documentation on the design of their source code, or software architecture. Although full-fledged documentation can only be created by human designers, an automatic inference tool can also provide some high-level overview of source code by means of grouping, generalization and abstraction. Such a tool could also point out the places where human experts should pay attention. Semi-automatic inference can be used when documentation is partially available. The key step that we make in this project is graph clustering, which splits the source code into parts, i.e. performs grouping. We suppose that this will help with the generalization over software artifacts and the detection of layers of abstraction within the source code. By generalization we mean the detection of the common purpose which the software artifacts in a group serve. One way to determine the purpose is by exploiting the linguistic knowledge found in the source code, such as identifier names and comments. This was done in [Kuhn2007]; however, they did not partition software engineering artifacts into structurally coupled groups prior to linguistic information retrieval. We believe that formal relations (e.g. function calls or variable accesses) should be taken into account first, and then the linguistic relations should be analyzed within the identified groups (e.g. identified by means of call graph clustering), rather than doing vocabulary analysis across the whole source code. By abstraction we mean the identification of abstraction layers within the source code. For example, if all indirect (i.e. mediated) calls from groups A and B to groups C and D go through group E, and there are no direct calls from {A, B} to {C, D}, then it is likely that group E serves as a layer of abstraction between groups {A, B} and {C, D}. The aforementioned decisions need uncertain inference and error/noise tolerance. Thus we think that the problem should be approached with AI techniques.
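As an illustration of the abstraction-layer idea above, the following sketch tests whether a candidate group E mediates all call chains from {A, B} to {C, D} in a group-level call graph. The representation (an adjacency map between group labels) and the method names are assumptions made for illustration; this is not part of the InSoAr implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch: group E is a plausible abstraction layer between the upper groups {A,B}
// and the lower groups {C,D} if the lower groups become unreachable from the upper groups
// once E is removed (this also excludes direct A/B -> C/D calls).
final class AbstractionLayerCheck {
    static boolean mediates(Map<String, Set<String>> callGraph,
                            Set<String> upper,   // e.g. {"A", "B"}
                            Set<String> lower,   // e.g. {"C", "D"}
                            String layer) {      // e.g. "E"
        Deque<String> stack = new ArrayDeque<>(upper);
        Set<String> visited = new HashSet<>(upper);
        while (!stack.isEmpty()) {
            for (String callee : callGraph.getOrDefault(stack.pop(), Set.of())) {
                if (callee.equals(layer) || !visited.add(callee)) continue; // skip the layer itself
                if (lower.contains(callee)) return false;                  // a call chain bypasses E
                stack.push(callee);
            }
        }
        return true; // every call chain from {A,B} to {C,D}, if any, passes through E
    }
}
```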
2.5 Hypotheses

At the beginning of this project we had the hypotheses listed below.
1) By applying Cluster Analysis to a graph of SE artifacts and their dependencies, we can save human effort on some common tasks within Software Structure Analysis, namely the identification of coupled artifacts and breaking down the system's complexity.

2) Partitional clustering will provide better nested software decompositions than the widely used hierarchical clustering (to our knowledge, the only method used so far for clustering a large number of SE artifacts), which is in fact a greedy algorithm.
3) Semi-supervised learning of software architecture from source code (unlabeled data) and architecture design (labeled data) can be used to improve results over usual clustering, which is unsupervised learning.
A few comments on this hypothesis: here we assume that the explicit architecture, i.e. the design documentation created by human experts, provides some partitioning of SE artifacts into groups. By means of Cluster Analysis we try to infer the implicit architecture, which is the architecture of how the source code is actually written, and we also get some partitioning of SE artifacts into groups. It is obvious that this task can also be viewed as classification: for each SE artifact the learning algorithm outputs the architectural component (or, in terms of Machine Learning, the class, not to be confused with SE classes) which the artifact belongs to, and perhaps also the certainty of this decision. When the explicit architecture is only partially available (which is always the case, except for completely documented trivial software, where completely stands for each SE artifact), we can think of several approaches to classifier training:
i. Train the classifier on the documented part of this software
ii. Train the classifier on other similar software which is better documented
iii. Train the classifier on library functions, which are usually documented best of all (e.g. Java & Sun libraries).
4) Improvements in call graph clustering results can be achieved through integration with some of the following: class inheritance hierarchy, shared data graph, control flow graph, package/namespace structure, identifier names (text), supervision (documented architecture pieces), etc.
Evidence for the first hypothesis is provided mostly in section 7.2. The second hypothesis is discussed theoretically, mostly in sections [...] and [...]. We have only discussed hypothesis 3, as implementation and experiments would take too much time. In the resulting software decompositions we can see that library SE artifacts indeed give insight about the purpose of client-code artifacts appearing nearby. In section 5.2 we provide evidence and argue theoretically in support of hypothesis 3: proper weighting of different kinds of relations can be learnt on training data (source code with a known nested decomposition) and then applied to novel software. For hypothesis 4, empirical results show that indeed the resulting hierarchical structure looks better when multiple kinds of relations are given as input to the clustering algorithm, and the reasons are discussed theoretically in section [...].

2.6 Business Applications

Consider a company that is proposed to do maintenance for a software product. With a visualization tool, the company can analyze the quality of the source code, so the company knows the risks associated with the software and can estimate the difficulty and expensiveness of its maintenance more accurately. To be able to do this we need to:
1 extract the architecture (even when the architecture was not designed from the very beginning, there is always the actual implicit architecture, which is how the source code is actually written);
2 and in some way derive the evaluation of the source code from the result of step 1.

Both steps are problematic in the sense that they require uncertain inference and decision making, which is a task for Artificial Intelligence in case we want to do this automatically. Within the scope of this project we focus on step 1 and leave step 2 to a human expert. In this section we consider the value of this project for potential target groups, and then for particular stakeholders of a project: administrators, managers, developers, testers and users. But first of all, below is the grounding of why reverse architecting software is valuable. According to [Riv2000]: the Software Development domain is characterized by fast changing requirements. Developers are forced to evolve the systems very quickly. For this reason, the documentation about the internal architecture becomes rapidly obsolete. To make fast changes to the software system, developers need a clear understanding of the underlying architecture of the products. To reduce costs, several products share software components developed in different projects. This generates many dependencies that are often unclear and hard to manage. Developers need to analyze such dependency information either when reusing the components or when testing the products. They often look for the big picture of the system, that is, a clear view of the major system components and their interactions. They also need a clear and updated description of the interfaces (and services) provided by the components. The developers of a particular component need to know its clients in order to analyze the impact of changes on the users. They also want to be able to inform their customers of the changes and to discuss with them the future evolution of the component. When defining a common architecture for a product family, architects have to identify the commonalities and variable parts of the products. This requires comparing the architectures of several products. The developers need a quick method for extracting the architecture models of the system and analyzing them. The teams often use an iterative development process that is characterized by frequent releases. The architecture models could be a valuable input for reviewing the system at the end of each iteration. This increases the need for an automatic approach to extract the architecture model. In our view, if a visualization tool is developed, one that can reverse-architect software from source code and provide concise, accurate, high-level information to its users, the effect of employing this tool will be comparable to the effect of moving from assembly language to C in the past. Below are the benefits that a tool allowing to reverse-engineer and visualize the software architecture from the source code provides to different stakeholders. It is likely that the list is far from complete, and that it will be extended, improved and detailed in the process of development of the tool, as new facts become apparent.
Administrators
- Reduced expenses: the development team is more productive.
- Reduced risks: consider a company that is proposed to do maintenance for a software product. This company has a tool to analyze the quality of the software more precisely, so the company knows the risks associated with the software and can estimate the difficulty of its maintenance more accurately.
- Increased product lifetime and safety factor: the curse of complexity is alleviated.
- Better decisions:

  the company can estimate the quality of software products more precisely, thus knows the situation better, and this leads to better decisions.
Managers
- Better control of task progress: Functionality + Quality = Effort + Duration. Now there is a better way to check whether a task was performed with less effort by means of reducing the quality.
- New team members are trained faster: usually, developers who are new to the project spend much time studying the existing system, especially when little documentation is available.
- Fewer obstacles to introduce the new resource action: when, while checking an ongoing project, a manager determines that the project is likely not to fit the deadline, and the deadline is strict, the possible actions to alleviate this are: either shrink the functionality, or reduce the quality, or add a new developer. However, the latter action is usually problematic due to the necessity to train the developer on this particular project. It is easier to recover after a reduce quality action.
- When producing a Work Breakdown Structure, it is easier to identify: reusage capabilities, task interdependencies, task workloads.
- Finally, it is easier to manage the changing requirements to a software product.
Developers
- The tool helps developers to: take architectural decisions; identify the proper usage of the existing code and the original intents of its implementers; search for possible side effects of executing some statement, or of introducing a fix or a new feature; identify the causes of a bug.
- The tool also partially relieves developers from maintaining documentation, given that the source code is good: logical, consistent and self-explanatory.
Testers
- Testers will get a way to better: identify the possible control paths; determine the weak places of the software.
- When a bug reported by a client is not obvious to reproduce, taking a look at the visualization of the software can help to figure out why.
Users
- Users are provided with better services. Support is more prompt:
  o Issues are fixed faster, as it is easier for developers to find the causes and to devise the fixes.
  o Requested new features are implemented faster, as it is easier for developers to understand the existing system, and the reusage recall is higher.
- More powerful software, because developers can build more complex systems.
- More stable software, because it is easier to

  o determine the weak parts of the software (by developers and testers)
  o determine the possible control paths and cover them with tests

2.7 Thesis Outline

In this section we have introduced the state of the art in the area of Reverse Architecting and placed Artificial Intelligence techniques in this context. Further, we discuss the candidate AI techniques in section 3, analyze the weaknesses of the existing approaches and give counterexamples. We also discuss the state of the art in source code analysis in that section, as we need some source code analysis in order to extract input data for our approach. Section 4 provides the background material for proper understanding of our contributions by the reader. The theory we developed in order to implement the project is given in section 5. As we often used a problem-solving approach, we are not always confident about the originality, optimality or superiority of the solutions we devised. Certainly, we admit this as a weakness of our paper in section 9. The solutions most likely to be original, optimal or superior are put in section 5. The solutions suspected to be non-original are discussed together with the background material in section 4. The solutions known or likely to be far from optimal are discussed together with our experiments (section 7) or the implementation and specification (section 6). We do not implement solutions that are to our knowledge inferior, if the effort-to-value tradeoff allows given the effort limit. The empirical evidence for the quality of clustering and the meaningfulness of the produced software decompositions is given in section 7.2. We give further visual examples, evidence and proofs in the appendix. Please note that the appendix is also an important part of the work, as we put some parts there in order not to overload the textual information of the thesis with huge visual examples and source code. We also discuss those visual examples partially in the appendix, though in a less formal way, necessary for their comprehension. We give a list of problems that we could not solve within the limits of this project in section 8. Finally, we summarize our contributions and discuss the approach in section 9.

3 Literature and Tools Survey

Architecture reconstruction typically involves three steps [Kosc2009]:
1 Extract raw data on the system.
2 Apply the appropriate abstraction technique.
3 Present or visualize the information.
Within this project we perform an integrative task over the steps listed above. We select suitable state of the art methods and tools, adhering to realistic estimations of the practical needs, port the methods and tools from other domains into ours and solve the arising issues.

3.1 Source code analysis

In the literature, two types of source code analysis are distinguished:
- static (the subject program is NOT run)
- dynamic (the subject program is executed)
Source code analysis is the process of extracting information about a program from its source code or from artifacts (e.g., Java byte code or execution traces) generated from the source code using automatic tools. Source code is any static, textual, human-readable, fully executable description of a computer program that can be compiled automatically into an executable form. To support dynamic analysis the description can include documents needed to execute or compile the program, such as program inputs [Bink2007]. According to [Kli2009], source code analysis is also a form of programming. Call graphs depict the static, caller-callee relation between functions in a program. With most source/target languages supporting functions as the primitive unit of composition, call graphs naturally form the fundamental control flow representation available to understand/develop software. They are also the substrate on which various interprocedural analyses are performed and are an integral part of program comprehension/testing [Nara2008]. In this project we consider the call graph as the most important source of relations between software engineering artifacts. Thus most of our interest in source code analysis falls into call graph extraction. This is also the most difficult, and the most computer time- and space-consuming, operation among all the extractions we perform as prerequisites. Extraction of other relations, such as inheritance, field access and type usage, is more straightforward and mostly reduces to parsing. Call graphs are commonly used as input for automatic clustering algorithms, the goal of which is to extract the high-level structure of the program under study. Determining the call graph for a procedural program is fairly simple. However, this is not the case for programs written in object-oriented languages, due to polymorphism. A number of algorithms for the static construction of an object-oriented program's call graph have been developed in the compiler optimization literature in recent years [Rays2000]. In the context of software clustering, we attempt to infer the unity of purpose of entities based on their relations, commonly represented in such abstractions as data dependency graphs and call graphs [Rays2000]. The latter paper experiments with the 3 most common algorithms for the static construction of the call graph of an object-oriented program available at that time:
- Naïve: this algorithm assumes that the actual and the implementing types are the same as the declared type. The benefits are: no extra analysis, sufficient for the purposes of a non-optimizing compiler, and very simple.
- Class Hierarchy Analysis (CHA) [Diw1996] is a whole-program analysis that determines the actual and implementing types for each method invocation based on the type structure of the program. The whole program is not always available for analysis, not only for trivial (but common) reasons such as the absence of a .jar file, but also

due to features such as reflection and remote method invocation. CHA is flow and context insensitive.
- Rapid Type Analysis (RTA) [Bac1996] uses the set of instantiated types to eliminate spurious invocation arcs from the graph produced by CHA. This analysis is particularly effective when a program is only using a small portion of a large library, which is often the case in Java [Rays2000]. This is also the case in our project: out of the 7.5K classes under analysis, 6.5K are library classes. Studies have shown that RTA is a significant improvement over CHA, often resolving 80% of the polymorphic invocations to a single target [Bac1996].
At the time of the [Rays2000] experiments, RTA was considered to be the best practical algorithm for call graph construction in object-oriented languages. The improved methods we are using in this project, Spark [Lho2003] and VTA [Sun1999] [Kwo2000], were under development. The authors of [Rays2000] were only able to conclude that the choice of call graph construction algorithm does indeed affect the automatic clustering process, but not whether clustering of more accurate call graphs will produce more accurate clustering results. Assessment of both call graph and clustering accuracy is a fundamental difficulty.

3.1.1 Soot

Soot [Soot1999] is a framework originally aimed at optimizing Java bytecode. However, we use it first of all for parsing and obtaining a structured in-memory representation of the source code under analysis. The framework is open-source software implemented in Java, and this is important as it gives an opportunity to modify the source code of the tool in order to tune it to our needs. Soot supports several intermediate representations for the Java bytecode analyzed with it: Baf, a streamlined representation of bytecode which is simple to manipulate; Jimple, a typed 3-address intermediate representation suitable for optimization; and Grimp, an aggregated version of Jimple suitable for decompilation [Soot1999]. Another intermediate representation implemented in recent years is Shimple, a Static Single Assignment-form version of the Jimple representation. SSA form guarantees that each local variable has a single static point of definition, which significantly simplifies a number of analyses [EiNi2008]. Our fact extraction and the prerequisites for our analyses are built on top of the Jimple intermediate representation. The prerequisites are call graph extractors, namely Variable Type Analysis (VTA) [Sun1999] and Spark [Lho2003], a flexible framework for experimenting with points-to analyses for Java. Soot can analyze isolated source/bytecode files, but for call graph extraction whole-program mode [EiNi2008, p.19] is required. In this mode Soot first reads all class files that are required by an application, by starting with the main root class or all the classes supplied in the directories to process, and recursively loading all classes used in each newly loaded class. The complete application means that all the entailed libraries, including the Java system libraries, are processed and represented in memory structurally. This causes crucial performance and scalability issues, as it was tricky to make Soot fit in 2GB RAM while processing the software projects we further used in our experiments on clustering. As each class is read, it is converted into the Jimple intermediate representation. After conversion, each class is stored in an instance of a SootClass, which in turn contains information like its name, signature (fully-qualified name), its superclass, a list of interfaces that it implements, and a collection of SootField and SootMethod objects.
Each SootMethod contains information like its (fully-qualified) name, modifier, parameters, locals, return type and a list of Jimple 3-address code instructions. All parameters and locals have declared types. Soot can produce the Jimple intermediate representation directly from the Java bytecode in class files, and not only from high-level Java programs, thus we can analyze Java bytecode that has been produced by any compiler, optimizer, or other tool [Sun1999].
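The sketch below illustrates the kind of driver that turns Soot's whole-program call graph into a class-level dependency edge list, with the Spark points-to framework enabled. It is a hedged example: the option strings and API calls follow the Soot API as recalled here and may need adjustment for a particular Soot version; it is not the exact extractor used in this project.

```java
// Hedged sketch: dump class-level dependency edges from Soot's call graph in
// whole-program mode with Spark enabled. Option and method names may differ
// slightly between Soot versions.
import java.util.Collections;
import java.util.Iterator;
import soot.PackManager;
import soot.Scene;
import soot.SootMethod;
import soot.jimple.toolkits.callgraph.CallGraph;
import soot.jimple.toolkits.callgraph.Edge;
import soot.options.Options;

public class CallGraphDump {
    public static void main(String[] args) {
        Options.v().set_whole_program(true);                              // required for call graph construction
        Options.v().set_process_dir(Collections.singletonList(args[0])); // directory with the .class files to analyze
        Options.v().set_allow_phantom_refs(true);                         // tolerate classes missing from the classpath
        Options.v().setPhaseOption("cg.spark", "enabled:true");           // use Spark instead of the default CHA
        Scene.v().loadNecessaryClasses();
        PackManager.v().runPacks();                                       // runs the call-graph phase, among others

        CallGraph cg = Scene.v().getCallGraph();
        for (Iterator<Edge> it = cg.iterator(); it.hasNext(); ) {
            Edge e = it.next();
            SootMethod src = e.src();
            SootMethod tgt = e.tgt();
            // Lift the method-level call edge to a class-level dependency edge.
            System.out.println(src.getDeclaringClass().getName()
                    + " -> " + tgt.getDeclaringClass().getName());
        }
    }
}
```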

3.1.2 Rascal

In this section we discuss recent state of the art developments in the source code analysis and manipulation (SCAM) domain, placing our project and research in this context. Most of the research addresses explicit facts, while we aim at the identification of implicit facts (the architecture), as there is an inference step (namely, clustering) between SE artifact relations and the presentation to a user. SCAM is a large and diverse area both conceptually and technologically. Many automated software engineering tools require tight integration of techniques for source code analysis and manipulation, but integrated facilities that combine both domains are scarce, because different computational paradigms fit each domain best. Both domains depend on a wide range of concepts such as grammars and parsing, abstract syntax trees, pattern matching, generalized tree traversal, constraint solving, type inference, high fidelity transformations, slicing, abstract interpretation, model checking, and abstract state machines. Rascal is a domain-specific language that integrates source code analysis and manipulation at the conceptual, syntactic, semantic and technical level [Kli2009]. The goals of Rascal are:
- To remove the cognitive and computational overhead of integrating analysis and transformation tools
- To provide a safe and interactive environment for constructing and experimenting with large and complicated source code analyses and transformations such as, for instance, those needed for refactorings
- To be easily understandable by a large group of computer programming experts
Visualization of software engineering artifacts is important. The CWI/SEN1 research group is developing Rascal within The Meta-Environment, a framework for language development, source code analysis and source code transformation. This framework could use the results of this project, providing input and taking output. There is no call graph clustering in the framework yet. The research group is currently developing a visualization framework, which could graphically illustrate the results of this project too.

3.2 Clustering

Clustering is a fundamental task in machine learning [Ra2007]. Given a set of data instances, the goal is to group them in a meaningful way, with the interpretation of the grouping dictated by the domain. In the context of relational data sets, that is, data whose instances are connected by a link structure representing domain-specific relationships or statistical dependency, the clustering task becomes a means for identifying communities within networks. For example, in the bibliographic domain explored by both [Ra2007] and [Fla2004], one finds networks of scientific papers. Interpreted as a graph, vertices (papers) are connected by an edge when one cites the other. Given a specific paper (or group of papers), one may try to find out more about the subject matter by poring over the works cited, and perhaps the works they cite as well. However, for a sufficiently large network, the number of papers to investigate quickly becomes overwhelming. By clustering the graph, we can identify the community of relevant works surrounding the paper in question. An example of the value that clustering can bring into graph comprehension is illustrated in Figure 3-1 below. Both pictures, on the left and on the right, are adjacency matrices of the same graph. However, the vertices (which are the row and column labels) in the right picture are ordered according to the cluster they belong to, so that vertices of the same cluster appear consecutively.
The matrix on the right is almost quasi-diagonal, thus we can look at the contracted graph of 17 vertices (one vertex per cluster) instead of the original graph of 210 vertices. The edges of the contracted graph will reflect the exceptions that prevent the adjacency matrix on the right from being strictly quasi-diagonal, and the weights of those edges reflect the cardinality of the exceptions.
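The reordering that makes the right-hand matrix of Figure 3-1 quasi-diagonal amounts to sorting the vertices by cluster label and permuting the rows and columns of the adjacency matrix accordingly. The sketch below is illustrative only; it assumes a dense boolean adjacency matrix and a precomputed cluster label per vertex.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch: permute an adjacency matrix so that vertices of the same cluster
// are adjacent, making the block (quasi-diagonal) structure visible.
final class MatrixReorder {
    static boolean[][] reorderByCluster(boolean[][] adj, int[] clusterOf) {
        int n = adj.length;
        List<Integer> order = new ArrayList<>();
        for (int v = 0; v < n; v++) order.add(v);
        order.sort(Comparator.comparingInt(v -> clusterOf[v]));   // group vertices by cluster label

        boolean[][] out = new boolean[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                out[i][j] = adj[order.get(i)][order.get(j)];       // same permutation for rows and columns
        return out;
    }
}
```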

Figure 3-1: Clustering facilitates comprehension of an adjacency matrix
No single definition of a cluster in graphs is universally accepted [Sch2007], thus there are some intuitively desirable cluster properties mentioned in the literature. In the setting of graphs, each cluster should be connected: there should be at least one, preferably several, paths connecting each pair of vertices within a cluster. If a vertex [u] cannot be reached from a vertex [v], they should not be grouped in the same cluster. Furthermore, the paths should be internal to the cluster: in addition to the vertex set [C] being connected in [G], the subgraph induced by [C] should be connected in itself, meaning that it is not sufficient for two vertices [v] and [u] in [C] to be connected by a path that passes through vertices in [V\C]; they also need to be connected by a path that only visits vertices included in [C]. As a consequence, when clustering a disconnected graph with known components, the clustering should usually be conducted on each component separately, unless some global restriction on the resulting clusters is imposed. In some applications, one may wish to obtain clusters of similar order and/or density, in which case the clusters computed in one component also influence the clusterings of other components. This also makes sense in the domain of software engineering artifact clustering when we are analyzing disjoint libraries with the intent to look at their architecture from the same level of abstraction. It is generally agreed that a subset of vertices forms a good cluster if the induced subgraph is dense but there are relatively few connections from the included vertices to the vertices in the rest of the graph [Sch2007]. Still, there are multiple possible ways of defining density. At this point there are two things worth noticing:
1 [Sch2007] uses the notion of cut size, c(C, V\C), to measure the sparsity of connections from cluster [C] to the rest of the graph, and this matches the central clustering approach we use in our work: Graph Clustering based on Minimum Cut Trees [Fla2004]. Minimum cuts play the central role there in both inter-cluster and intra-cluster connection density evaluation.
2 For the calculation of both inter- and intra-cluster densities, the formulas of [Sch2007, page 33] use the maximum number of edges possible as the denominator. However, they consider the number of edges in a complete graph as the maximum number of edges possible, which can be wrong due to the specifics of the underlying data (not all graph configurations are possible, i.e. the denominator must be far less than the number of edges in a complete graph), and this can cause density estimation problems and skew the results.
Considering the connectivity and density requirements given in [Sch2007], semantically useful clusters lie somewhere in between the two extremes: the loosest, a connected component, and the strictest, a maximal clique. Connected components are easily computed in O(|V| + |E|) time, while clique detection is NP-complete.
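For reference, the density notions discussed above can be written out explicitly. The formulas below follow the usual definitions used in [Sch2007]-style treatments (internal density of a cluster C and the cut-based sparsity of its boundary in an undirected graph); the notation is ours.

```latex
% Internal density of a cluster C: internal edges relative to the maximum possible
\delta_{\mathrm{int}}(C) = \frac{\bigl|\{\,\{u,v\} \in E : u, v \in C\,\}\bigr|}{\binom{|C|}{2}}
% External (cut-based) density: cut size against the maximum possible number of boundary edges
\delta_{\mathrm{ext}}(C) = \frac{c(C,\, V \setminus C)}{|C| \cdot |V \setminus C|}
```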

An example of a good (left), worse (middle) and bad (right) cluster is given in Figure 3-2 below. The cluster on the left is of good quality, dense and introvert. The one in the middle has the same number of internal edges, but many more edges to outside vertices, making it a worse cluster. The cluster on the right has very few connections to the outside, but lacks internal density and hence is not a good cluster.
Figure 3-2: Intuitively good (left), worse (middle) and bad (right) clusters
It is not always clear whether each vertex should be assigned fully to a cluster, or whether it could instead have different levels of membership in several clusters [Sch2007]. In Java class clustering, such a situation is easily imaginable: a class can be converting data from an XML document into a database, and hence could be clustered into XML with 0.3 membership, for example, and into database with a membership level of 0.4. The coefficients can be normalized either
- per cluster: the sum of all membership levels over all classes belonging to this cluster equals 1.0, or
- per class: the sum of all membership levels over all the clusters which this class belongs to equals 1.0.
One solution, hierarchical disjoint clustering, would sometimes create a supercluster (parent or indirect ancestor) to include all classes related to XML and database, but the downside is that there can be classes dealing with the database but having no relation to XML whatsoever. This is the solution adopted in our work; however, due to the aforementioned downside, an alternative seems interesting too: fuzzy graph clustering [Dong2006]. In a fuzzy graph, each edge is assigned a degree of presence in the graph. Different non-fuzzy graphs can be obtained by leaving only the edges with a presence level exceeding a certain threshold. The algorithm of [Dong2006] exploits a connectivity property of fuzzy graphs. It first preclusters the data into subclusters based on the distance measure, after which a fuzzy graph is constructed for each subcluster, and a thresholded non-fuzzy graph for the resulting graph is used to define what constitutes a cluster.
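The thresholding step mentioned above is simple to express: given edge presence degrees, keep only the edges at or above a chosen cut-off. The following sketch is a toy illustration of that single step (edge list and presence values are assumed inputs), not the algorithm of [Dong2006] itself.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration: derive a crisp (non-fuzzy) graph from a fuzzy one by keeping
// only edges whose degree of presence reaches the threshold.
final class FuzzyThreshold {
    record Edge(int u, int v, double presence) {}

    static List<Edge> threshold(List<Edge> fuzzyEdges, double cutoff) {
        List<Edge> crisp = new ArrayList<>();
        for (Edge e : fuzzyEdges) {
            if (e.presence() >= cutoff) crisp.add(e);   // edge survives at this presence level
        }
        return crisp;
    }
}
```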

Particularly Considered Methods

Flake-Tarjan clustering, also known as graph clustering based on minimum cut trees [Fla2004], was used as the core clustering approach in this work. However, a number of other clustering methods were considered within our research too. It was concluded that all of them are either inapplicable due to our problem size (in terms of algorithmic complexity), or inferior to Flake-Tarjan clustering in terms of clustering performance (the accuracy of the results and the usefulness of the measure which the methods aim to optimize).

Affinity Propagation

Affinity propagation [Fre2007] is a clustering algorithm that takes as input measures of similarity between pairs of data points and simultaneously considers all data points as potential exemplars. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. A derivation of the affinity propagation algorithm stemming from an alternative, yet equivalent, graphical model is proposed in [Giv2009]. The new model allows easier derivations of message updates for extensions and modifications of the standard affinity propagation algorithm. In the initial set of data points (in our case, software engineering artifacts, e.g. Java classes), affinity propagation (AP) pursues the goal of finding a subset of exemplar points that best describe the data. AP associates each data point with one exemplar, resulting in a partitioning of the whole data set into clusters. The measure which AP maximizes is the overall sum of similarities between data points and their exemplars, called the net similarity. It is important to note why a degenerate solution doesn't occur. The net similarity is not maximized when every data point is assigned to its own singleton exemplar, because it is usually the case that the gain in similarity a data point achieves by assigning itself to an existing exemplar is higher than the preference value. The preference of point i, called p(i) or s(i, i), is the a priori suitability of point i to serve as an exemplar. Preferences can be set to a global (shared) value, or customized for particular data points. High values of the preferences will cause affinity propagation to find many exemplars (clusters), while low values will lead to a small number of exemplars (clusters). A good initial choice for the preference is the minimum similarity or the median similarity [Fre2007]. Affinity propagation iteratively improves the clustering result (net similarity), and the time required for one iteration is asymptotically equal to the number of edges in the similarity graph. In their experiments the authors of [Fre2007] used some fixed number of iterations, but one can also run iterations until some pre-defined time limit is exceeded. Normally, the algorithm takes an NxN adjacency matrix as input, but we cannot allow this in our problem because such a solution is not scalable to large software projects. In a large software project there can be 0.1 million software engineering artifacts (e.g., Java classes), but only 1-2 million relations among them (method calls, field accesses, inheritance, etc.), i.e. the graph under analysis is very sparse. On medium-size software projects, consisting of no more than [...] artifacts at the selected granularity (e.g. classes), it is however feasible to compute an adjacency matrix using some transitive formula for the similarity of artifacts which do not have a direct edge in the initial graph of relations, thus it is worth mentioning the practical constraints of affinity propagation. The number of scalar computations per iteration is equal to a constant times the number of input similarities, where in practice the constant is approximately [...]. The number of real numbers that need to be stored is equal to 8 times the number of input similarities. So, the number of data points is usually limited by memory, because we need N² similarities for N data points. Though affinity propagation has a sparse version, the variety of the resulting clustering configurations becomes very limited in this case. In the sparse version, the similarity between any two points not connected in the input graph is viewed as negative infinity by the algorithm. Below are the two main consequences of this:
- If there is no edge between point A and point B in the input graph, then point A will never be selected as an exemplar for point B.
- If there is no path of length at most 2 between points C and D, then affinity propagation will never assign points C and D to the same cluster.

This is illustrated in Figure 3-3 below.

Figure 3-3: Issues of the sparse version of affinity propagation (points A, B, C, D)

When clustering software engineering artifacts, e.g. Java/C# classes, it seems reasonable that we sometimes want two classes to end up in the same cluster even though there is no path of length at most 2 between them. We thus conclude that affinity propagation is not applicable to our problem.

Clique Percolation Method

Cfinder is a tool for finding and visualizing overlapping groups of nodes in networks, based on the Clique Percolation Method (CPM) [Pal2005]. Within this project, we used it as-is in an attempt to cluster software engineering artifacts using state-of-the-art tools from a different domain, namely social network analysis. In contrast to Cfinder/CPM, other existing community finders for large networks, including the core method used in our project, find disjoint communities. According to [Pal2005], most actual networks are made of highly overlapping cohesive groups of nodes.

Though Cfinder is claimed to be a fast and efficient method for clustering data represented by large graphs, such as genetic or social networks and microarray data, and to be very efficient for locating the cliques of large sparse graphs, our experiments showed that it is de facto not applicable to our domain, for both scalability and result-usefulness reasons. When our original graph, containing 7K vertices (Java classes) and 1M edges (various relations), was given as input to Cfinder, it did not produce any results in reasonable time. When we reduced the graph to client-code artifacts only, resulting in 1K classes and 10K edges, Cfinder still had not finished its computations after 16 hours; at least it produced some results which could be visualized with Cfinder. It produced one community with several cliques in it, see Figure 10-8 in the appendix. The selected nodes belong to the same clique. Unfortunately, hardly any architectural insight can be captured from this picture even when zoomed, see Figure 10-9 in the appendix.

We suppose that the reason for such poor behavior of the Clique Percolation Method in our domain, as opposed to collaboration, word association, protein interaction and social networks [Pal2005], resides in the specifics of our data, namely software engineering artifacts and the relations between them. Our cliques are often huge and nested, so the computational complexity of CPM approaches its worst case. Certainly, we have studied the Cfinder tool only superficially, and perhaps there is a way to reduce our problem to one feasible to solve with CPM, but after spending a reasonable amount of effort on this grouping approach we conclude that either it is inapplicable, or much more effort must be spent in order to get useful results with it.

Based on Graph Cut

A group of clustering approaches is based on graph cuts. The problem of finding a minimum cut in a graph is well studied in computer science, and an exact solution can be computed in reasonable polynomial time. In a bipartition clustering problem, i.e. when only two clusters are needed, a minimum cut algorithm can be applied to find them. The vertices of the input graph represent the data points and the edges between them are weighted with the affinities between the data points. Intuitively, the fewer high-affinity edges are cut, the better the division into two coherent and mutually different parts will be [Bie2006]. In the simplest min-cut algorithm, a connected graph is partitioned into two subgraphs with the cut size minimized. However, this often results in a skewed cut, i.e. a very small subgraph is cut away [Ding2001].
This problem can largely be solved by using cut cost functions proposed in the clustering literature, among which are the average cut cost (Acut) and the normalized cut cost (Ncut). The Acut cost seems to be more vulnerable to outliers (atypical data points, i.e. points that have low affinity to the rest of the sample) [Bie2006]. However, skewed cuts still occur when the overlaps between clusters are large [Ding2001] and, finally, optimizing either the Acut or the Ncut cost is an NP-complete problem [Shi2000].

In the fully unsupervised learning scenario, no prior information is given as to which group data points belong to. In the machine learning literature these target groups are called classes, which should not be confused with Java classes. Besides this clustering scenario, there is the transduction scenario, in which the group labels are specified for some data points. Transduction, or semi-supervised learning, has received much attention in recent years as a promising middle ground between supervised and unsupervised learning, but major computational obstacles were inhibiting its usage, despite the fact that many natural learning situations directly translate into a transduction problem. In graph cut approaches, the problem of transduction can naturally be approached by restricting the search for a low-cost graph cut to cuts that do not violate the label information [Bie2006]. The fast semi-definite programming relaxations of [Bie2006] made it possible to find a better cut than the one found using the spectral relaxations of [Shi2000], and the authors in their experiments were able to process graphs of up to 7K vertices and 41K edges within reasonable time and memory. However, this is still far from enough for our problem, as even a medium-size project has about 1M relations between software engineering artifacts.

Paper [Ding2001] proposes another cut-based graph partition method, based on a min-max clustering principle: the similarity or association between two subgraphs (the cut set) is minimized, while the similarity or association within each subgraph (the summation of similarities between all pairs of nodes within a subgraph) is maximized. The authors present a simple min-max cut function together with a number of theoretical analyses, and show that min-max cut always leads to more balanced cuts than the ratio cut [Hag1992] and the normalized cut [Shi2000]. As finding the optimal solution for their min-max cut function is NP-complete, the authors use a relaxed version which leads to a generalized eigenvalue problem. The second lowest eigenvector, also called the Fiedler vector, provides a linear search order (Fiedler order). Thus the min-max cut algorithm (Mcut) provides both a well-defined objective and a clear procedure to search for the optimal solution [Ding2001]. The authors report that Mcut outperformed the other methods on a number of newsgroup text datasets. Unfortunately, the computational complexity of the algorithm is not obvious from the article, except that the computation of the Fiedler vector can be done in O(|E| |V|); moreover, the number of data points used in their experiments did not exceed 400, which is far too little for our problem.

One important detail about bipartition-based graph clustering approaches is the transition from a 2-cluster to a multiple-cluster solution. If this is done by means of recursive bipartition of the formerly identified clusters, then the solution is hierarchical in nature, which is good for source code structure discovery. However, such an algorithm is also greedy, so the clustering quality can be unacceptable. The optimal partition of a system into 3 components can differ much from the solution obtained by first bipartitioning the system and then bipartitioning one of the components. This is illustrated in Figure 3-4 below.

Figure 3-4: Optimal 2-clustering (dashed) vs. 3-clustering (dotted)

Finally, the main clustering method selected for the implementation of our project is also cut-based [Fla2004]. Though it produces a hierarchical clustering, it does not suffer from the issue of greedy approaches demonstrated above. This is because the hierarchy arises from the clustering criterion used: the vertices of the example graph in Figure 3-4 are sent into three sibling clusters as soon as the key parameter (alpha) is small enough, and only when alpha gets even smaller are all the vertices sent into one parent cluster. Depending on the input graph, there might be no value of the key parameter at which a certain number of clusters is produced (e.g. 2 in our example); thus there can be a jump from 3 clusters directly to 1 cluster incorporating all the vertices of those child clusters.

Other Clustering Methods

Other clustering methods were studied without experiments within this project. Most of the methods did not pass the early cut because they do not scale to large graphs. First of all, methods that require a complete adjacency matrix on input were discarded, as this implies at least N^2 operations while our graph is sparse. Then, greedy hierarchical clustering approaches, either agglomerative or divisive, were left out.

We find it important, however, to discuss a confusion observed in the literature on software architecture recovery, e.g. the work reported in [Czi2007] and a number of hierarchical-clustering-based architecture recovery approaches discussed in [Maqb2007]. While the titles say "hierarchical clustering", for the quality of the results it is crucial to distinguish how the hierarchy emerges: whether it happens due to the greedy order in which clusters are identified, or whether it is data-driven. The clustering algorithm we use in our project falls into the latter category. We discuss those from the former category in the subsection on hierarchical clustering methods below.

The superiority of partitional clustering methods in the domain of software source code is confirmed in [Czi2008], where the authors improve their own earlier results of [Czi2007] by using partitional clustering instead of the hierarchical agglomerative clustering used in their earlier work pursuing the same goal of automatic source code refactoring. The partitional clustering method used in [Czi2008] is k-medoids [Kau1990] with some heuristics for choosing the number of medoids (clusters) and the initial medoids, whereby the heuristics are domain-specific.

Another point that we consider worth discussing in a subsection is not a clustering method itself, but rather a technique that allows a series of clustering methods to work at a drastically reduced computational complexity without major precision losses, as reported in [Ra2007]. We do not use any of the clustering methods, e.g. k-medoids [Kau1990] or Girvan-Newman [Gine2002], accelerated with this network structure indices technique in our project, for the following reasons:

- [Ra2007] admits the superiority of the Flake-Tarjan [Fla2004] clustering method in terms of clustering result quality. The only argument of [Ra2007] against minimum-cut-tree-based clustering methods is that they are not scalable to large graphs. However, it seems that the authors of [Ra2007] were not aware of the actual computational complexity of Flake-Tarjan clustering, in terms of both the worst case and the usual case. There is some rationale behind this, as Flake-Tarjan clustering relies on the computation of maximum flow in a graph, which is believed to be a very hard polynomial algorithm.
The widely known Dinic's algorithm for max flow in a real-valued network works in O(|V|^2 |E|) time, and the push-relabel algorithm referenced by [Fla2004] works in O(|V| |E| log(|V|^2 / |E|)) time. This could give the authors of [Ra2007] a wrong idea about the scalability of minimum-cut-tree-based clustering.

However, recent developments in max flow algorithms allow computing the minimum cut in as little as O(|E| min(|V|^{2/3}, |E|^{1/2}) log(|V|^2 / |E|) log U), where U is the maximum capacity of the network [Gol1998]. Our practical studies have shown that the actual running time of Goldberg's implementation (see section 4.1.1 of the thesis) of the push-relabel-based max flow algorithm is nearly linear in |E| in the usual case. As this algorithm requires integral arc capacities, we developed within our work a method to approximate the real-valued max flow problem with an integral one that satisfied the needs of our project; namely, the property that enables minimum-cut-tree-based hierarchical clustering [Fla2004] is not lost in the conversion from the real-valued to the integral max flow problem.

- One clustering algorithm improved in [Ra2007], namely Girvan-Newman [Gir2002], is a greedy hierarchical (divisive) clustering. It is not said whether any non-greedy hierarchical clustering algorithms can be improved with the network structure indices technique.

- In [Ra2007] the authors only worked with non-weighted graphs. It is not clear whether the technique can still handle weighted graphs, and if so, whether real-valued weights are possible. Maximum flow algorithms for non-weighted graphs have smaller algorithmic complexity too: e.g. Dinic's blocking flow algorithm for a network with unit capacities terminates in O(|V| |E|) [Dini1970]. Thus Flake-Tarjan clustering with Dinic's algorithm in the backend would work much faster, but this is only possible in networks with unit capacities.

Thus we conclude that the Flake-Tarjan clustering algorithm [Fla2004] is fast enough, produces better clustering quality than the rival approaches, and yields a data-driven hierarchical clustering that is very desirable for the software architecture domain.

Network Structure Indices based

Simple clustering methods, like a new graphical adaptation of the k-medoids algorithm [Kau1990] and the Girvan-Newman [Gir2002] method based on edge betweenness centrality, can be effective at discovering the latent groups or communities that are defined by the link structure of a graph. However, many approaches rely on prohibitively expensive computations, given the size of relational data sets in the domain of source code analysis. Network structure indices (NSIs) are a proven technique for indexing network structure and efficiently finding short paths [Ra2007]. In the latter paper they show how incorporating NSIs into these graph clustering algorithms can overcome these complexity limitations.

The k-medoids algorithm [Kau1990] can be thought of as a discrete adaptation of the k-means data clustering method [MaQu1967]. The inputs to the algorithm are k, the number of clusters to form, and a distance measure that maps pairs of data points to a real value. The procedure is as follows: (1) randomly designate k instances to serve as seeds for the k clusters; (2) assign the remaining data points to the cluster of the nearest seed using the distance measure; (3) calculate the medoids of each cluster; and (4) repeat steps 2 and 3 using the medoids as seeds until the clusters stabilize. In a graph, medoids are chosen by computing the local closeness centrality [Ra2007] among the nodes in each cluster and selecting the node with the greatest centrality score. One issue with the k-medoids approach is similar to the problem of the sparse version of affinity propagation discussed above: in contrast to its data-clustering counterpart, graph-based k-medoids relies on graph distance, which is highly sensitive to the edges that exist in the graph. Adding a single short-cut link to a graph can reduce the graph diameter, altering the graph distance between many pairs of nodes.

A second issue arises when graph distances are integers. In this case nodes are often equidistant to several cluster medoids. [Ra2007] resolves the latter conflicts by randomly selecting a cluster; however, this can result in clusterings that do not converge. This is further resolved by a threshold on the fraction of non-converged clusters.

The Girvan-Newman algorithm [Gir2002] is a divisive clustering technique based on the concept of edge betweenness centrality. Betweenness centrality is the measure of the proportion of shortest paths between nodes that pass through a particular link. Formally, betweenness is defined for each edge e \in E as:

B(e) = \sum_{u,v \in V} \frac{g_e(u,v)}{g(u,v)}

where g(u, v) is the total number of geodesic paths between nodes u and v, and g_e(u, v) is the number of geodesic paths between u and v that pass through e. A geodesic path in [Ra2007] is simply a shortest path in the graph. Note that there can be multiple shortest paths, i.e. they all have the same length but pass through different chains of edges. Also, the methods of [Ra2007] work with non-weighted graphs, i.e. each edge has length 1. The algorithm ranks the edges in the graph by their betweenness and removes the edge with the highest score. Betweenness is then recalculated on the modified graph, and the process is repeated. At each step, the set of connected components of the graph is considered a clustering. If the desired number of clusters is known a priori (as with k-medoids), we halt when the desired number of components (clusters) is obtained.

The main problem with the two clustering algorithms described above is algorithmic complexity, and this also applies to many other approaches, but the above two were studied in [Ra2007] and accelerated dramatically with network structure indices. The baseline clustering algorithms are intractable for large graphs:

- For k-medoids clustering, calculation and storage of pairwise node distances can be done in O(|V|^3) time and O(|V|^2) space with the Floyd-Warshall algorithm (which can be found in e.g. [CLR2003]).
- For Girvan-Newman clustering, calculation of edge betweenness for the links in a graph is an O(|V| |E|) operation.

A network structure index (NSI) is a scalable technique for capturing graph structure [Ra2007]. The index consists of a set of node annotations combined with a distance measure. NSIs enable fast approximation of graph distances and can be paired with a search algorithm to efficiently discover short (note: not the shortest!) paths between nodes in the graph. A distance-to-zone (DTZ) index was employed in [Ra2007]. The DTZ indexing process creates d independent sets of random partitions (called dimensions) by stochastically flooding the graph. Each dimension consists of z random partitions (called zones). DTZ annotations store the distances between each node and all zones across each dimension. The approximate distance between two nodes u and v is defined as:

D_{DTZ}(u, v) = \sum_{d} \left( dist_d(u, zone_d(v)) + dist_d(v, zone_d(u)) \right)

where dist_d(u, zone_d(v)) is the length of the shortest path between u and the closest node in the same zone as v, in dimension d. Creating the DTZ index requires O(|E| z d) time and O(|V| z d) space. Typically z and d are selected much smaller than |V|, thus the DTZ index can be created and stored in a fraction of the time and space it takes to calculate the exact graph distances for all pairs of nodes in the graph.
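To make the DTZ distance formula above concrete, the following is a minimal Java sketch (ours, not code from [Ra2007]) of how the approximate distance could be looked up once the zone assignments and node-to-zone distances have been built; the class and field names are hypothetical.

import java.util.Objects;

/** Minimal sketch of a Distance-To-Zone (DTZ) lookup; names are hypothetical. */
final class DtzIndex {
    private final int dims;              // number of independent dimensions d
    private final int[][] zoneOf;        // zoneOf[d][v] = zone of vertex v in dimension d
    private final int[][][] distToZone;  // distToZone[d][v][z] = shortest-path length from v
                                         // to the closest vertex of zone z in dimension d

    DtzIndex(int dims, int[][] zoneOf, int[][][] distToZone) {
        this.dims = dims;
        this.zoneOf = Objects.requireNonNull(zoneOf);
        this.distToZone = Objects.requireNonNull(distToZone);
    }

    /** D_DTZ(u, v) = sum over dimensions of dist_d(u, zone_d(v)) + dist_d(v, zone_d(u)). */
    int approximateDistance(int u, int v) {
        int sum = 0;
        for (int d = 0; d < dims; d++) {
            sum += distToZone[d][u][zoneOf[d][v]];
            sum += distToZone[d][v][zoneOf[d][u]];
        }
        return sum;
    }
}

Such a lookup costs O(d) per node pair, which is what makes the NSI-accelerated k-medoids and Girvan-Newman variants tractable on large graphs.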

The results of an empirical study of the speed improvement achieved with NSIs are illustrated in Figure 3-5 below [Ra2007]. The top line shows bidirectional breadth-first search, which can become intractable for even moderate-size graphs. The middle line shows an optimal best-first search, which represents a lower bound on the run time for any search-based method. The lower line shows an NSI-based method, DTZ with 10 dimensions and 20 zones.

Figure 3-5: k-medoids speed using 3 different methods of distance calculation

Hierarchical clustering methods

Most clustering algorithms can be classified into two popular techniques: partitional and hierarchical clustering. Hierarchical clustering methods represent a major class of clustering techniques [Czi2007]. There are two types of hierarchical clustering algorithms: agglomerative and divisive. Given a set of n objects, the agglomerative (bottom-up) methods begin with n singletons (sets with one element each), merging them until a single cluster is obtained. At each step, the two most similar clusters are chosen for merging. The divisive (top-down) methods start from one cluster containing all n objects and split it until n clusters are obtained.

The agglomerative clustering algorithms differ in the way the two most similar clusters are determined and in the linkage metric used: single, complete or average. Single link algorithms merge the clusters whose distance between their closest objects is the smallest. Complete link algorithms merge the clusters whose distance between their most distant objects is the smallest. Average link algorithms merge the clusters for which the average of the distances between the objects from the two clusters is the smallest. In general, complete link algorithms generate compact clusters while single link algorithms generate elongated clusters; thus, complete link algorithms are generally more useful than single link algorithms [Czi2007]. Average link clustering is a compromise between the sensitivity of complete-link clustering to outliers and the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects [Man1999].

In addition to the above-mentioned issues of agglomerative clustering approaches, and the questionable averaging of distances, the issue we discussed near Figure 3-4 still remains too, namely the greedy nature of such algorithms.

On the other hand, partitional clustering algorithms look at all the data at once and produce a partition of the data points into some number of clusters. According to [Jain1999], the partitional techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of the patterns) or globally (over all of the patterns). At this point it is worth noting that the Flake-Tarjan clustering algorithm optimizes its criterion globally, see section 4.3.

In [Jain1999] a taxonomy of clustering algorithms is provided, see Figure 3-6 below. The main clustering algorithm we use in our project had not yet been invented at the time the review [Jain1999] was written; it falls into the Graph Theoretic category under partitional clustering approaches. We stress this to prevent confusing it with the hierarchical clustering approaches present in the literature merely on the basis that the clustering algorithm of [Fla2004] produces a clustering hierarchy too. It is still a partitional clustering method, well grounded theoretically and free of the disadvantages of greedy algorithms.

Figure 3-6: A taxonomy of clustering approaches

A recent review of multiple hierarchical clustering approaches applied to the domain of software architecture recovery, [Maqb2007], concluded that the performance of the state-of-the-art algorithms is poor. The authors mention arbitrary decisions taken inside the clustering algorithms as a core source of problems. We have demonstrated another source of problems near Figure 3-4, namely the greedy nature of the algorithms. Furthermore, all the algorithms described there have at least O(n^2) algorithmic complexity, as they take an n-by-n similarity matrix on input, where n is the number of software engineering artifacts. In our work, we propose an approach that scales to large sparse graphs of relations between software engineering artifacts, and whose clustering decisions are strongly grounded in the theory of [Fla2004].

4 Background

The prerequisite methods and tools used in our project are described and formalized in this section. We also find it worthwhile to discuss the known challenges arising in the domain of clustering software engineering artifacts. In section 5 we provide the theory we devised on top of the background material given here.

4.1 Max Flow & Min Cut algorithm

The maximum flow problem and its dual, the minimum cut problem, are classical combinatorial problems with a wide variety of scientific and engineering applications. In a graph denoting a flow network, edge weights denote the capacities, i.e. the amount of substance that can flow through a connection between points (vertices). The task is to assign a certain amount of flow to each connection (pipe), so that the total flow from a source vertex to a sink vertex is maximized. More background about this classical problem can be found in [CLR2003]. Here we just mention the differences with the shortest-path problem, which is a prevalent prerequisite for other clustering algorithms:

- High weight is good for flow, but bad for a short path.
- A path is a chain of edges, while flow can go in multiple parallel directions.
- There is no (polynomial) solution for the longest path problem, but there are solutions for maximum flow.

The most efficient algorithms for maximum flow are based on the blocking flow and the push-relabel methods. The shortest augmenting path algorithm, the blocking flow method, and the push-relabel method use a concept of distance, taking the length of every residual arc to be unit (one). Using a more general, binary length function [Gol1998] substantially improved the previous time bounds. As a potential further improvement direction, the authors mention considering length functions that depend on the distance labels of the endpoints of an arc in addition to the arc's residual capacity. Within our project we do not seek to improve the speed of the max flow algorithm and take it as-is, with algorithmic complexity:

O(|E| min(|V|^{2/3}, |E|^{1/2}) log(|V|^2 / |E|) log U)

In a typical graph for our experiments, 10K vertices and 1M edges, this amounts to 1M * 464 * 8 * 32 = 119G trivial operations in the worst case. However, the lossless heuristics [Che1997] of the push-relabel-based implementation of max flow (see subsection 4.1.1) kept the actual number of scalar operations to about 100M. So we use the binary blocking flow bound for theoretical estimations of algorithmic complexity, and Goldberg's implementation of the push-relabel-based algorithm with heuristics, which works fascinatingly well in practice.

4.1.1 Goldberg's implementation

In this project we used Goldberg's implementation of the push-relabel algorithm solving the max flow problem, HIPR, which is an improvement over the H_PRF version described in [Che1997]. This implementation performs far fewer scalar operations than the worst-case estimation, thanks to lossless (in terms of optimality) speedup heuristics. In [Che1997] the authors point out a problem family on which all the known max flow methods have quadratic (in the number of vertices) time growth rate. However, even for this problem family their best implementation (H_PRF) processed a graph of 65K vertices and 98K edges in reasonable time.

As maximum flow computation is the bottleneck in our project, it is important that the implementation is highly optimized, including low-level optimizations. This implementation is written in C and includes heuristics that allow computing the maximum flow much faster than the worst-case complexity estimation. For our practical needs we observe that the algorithm computes max flow in a graph of 10K vertices and 1M edges in 0.02 seconds on a 1.7GHz computer with 2MB cache memory; processor cache size matters because most of the time is spent in cache misses.

4.2 Min Cut Tree algorithm

Cut trees, introduced in [GoHu1961] and also known as Gomory-Hu trees, represent the structure of all s-t cuts of undirected graphs in a compact way. Cut trees have many applications, but in our project we use them for clustering as described in [Fla2004]. All known algorithms for building cut trees use a minimum s-t cut computation (see 4.1 above) as a subroutine [GoTs2001]. [GoHu1961] showed how to solve the minimum cut tree problem using n-1 minimum cut computations and graph contractions, where n is the number of vertices in the graph. An efficient implementation of this algorithm is non-trivial due to the subgraph contraction operations used [GoTs2001]. Gusfield [Gus1990] proposed an algorithm that does not use graph contraction; all n-1 minimum s-t cut computations are performed on the input graph. We use this algorithm in our project for one more reason in addition to the above: it allows applying the community heuristic, which is specific to the purpose for which we need the minimum cut tree computation [Fla2004]. Both the Gusfield algorithm and the community heuristic are described in the subsections below.

The input to the cut tree problem is an undirected graph G = (V, E), in which edges have capacities, each denoting the maximum possible amount of flow through an edge. We say that an edge crosses a cut if its two endpoints are on different sides of the cut. The capacity of a cut is the sum of the capacities of the edges crossing the cut. For s, t in V, an s-t cut is a cut such that s and t are on different sides of it. A minimum s-t cut is an s-t cut of minimum capacity. A (global) minimum cut is a minimum s-t cut over all s, t pairs. A cut tree is a weighted tree T on V with the following property: for every pair of distinct vertices s and t, let e be a minimum weight edge on the unique path from s to t in T. Deleting e from T separates T into two connected components, X and Y. Then (X, Y) is a minimum s-t cut. Note that T is not a subgraph of G, i.e. the edges of T need not be in G.

4.2.1 Gusfield algorithm

[Gus1990] provides a simple method for implementing the minimum cut tree algorithm which does not involve graph contraction [GoHu1961] and works with the same algorithmic complexity. Below is the pseudocode of Gusfield's algorithm:

(1)  For all vertices i = 2..n, set prev[i] = 1;
(2)  For all vertices s = 2..n do
(3)    t = prev[s];
(4)    Calculate a minimal cut (S, T) and the value w of a maximal flow in the graph,
       using s as the source and t as the sink;
(5)    Add edge (s, t) with weight w to the resulting tree;
(6)    For all vertices i from S   /* the source-side vertices after the cut */
(7)      If i > s and prev[i] == t then
(8)        prev[i] = s;
(9)      End;
(10)   End;
(11) End;
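For illustration, below is a compact Java rendering of the same procedure. This is our sketch of the pseudocode above, not the project's actual implementation; FlowGraph, MinCutResult and the minCut(graph, s, t) helper (which is assumed to return the max-flow value together with the set of source-side vertices) are hypothetical.

/** Sketch of Gusfield's cut-tree construction; helper types and minCut() are assumed. */
static int[][] gusfieldCutTree(FlowGraph graph, int n) {
    int[] prev = new int[n + 1];            // prev[i] = current tree neighbor of vertex i (vertices 1..n)
    for (int i = 2; i <= n; i++) prev[i] = 1;
    int[][] treeEdges = new int[n - 1][];   // rows of (s, t, weight) forming the cut tree
    for (int s = 2; s <= n; s++) {
        int t = prev[s];
        MinCutResult cut = minCut(graph, s, t);          // min s-t cut on the *original* graph
        treeEdges[s - 2] = new int[] { s, t, cut.flowValue };
        for (int i = s + 1; i <= n; i++) {               // re-root later vertices that fell on the source side
            if (cut.sourceSide.contains(i) && prev[i] == t) prev[i] = s;
        }
    }
    return treeEdges;
}

The key point, as in the pseudocode, is that every min cut is computed on the unmodified input graph, so no contraction bookkeeping is needed.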

Apart from the simplicity of the algorithm, we also use it because the community heuristic of [Fla2004] (also described in section 4.2.2 of the thesis) can be applied, thus reducing the required number of max flow computations substantially.

4.2.2 Community heuristic

The running time of the basic cut clustering algorithm [Fla2004] is equal to the time to calculate the minimum cut tree, plus a small overhead for extracting the subtrees under the artificial sink t. But calculating the min-cut tree can be equivalent to computing n-1 maximum flows in the worst case for [GoHu1961], and always is for [Gus1990], which we described in section 4.2.1 and use in our project. Fortunately, [Fla2004] proves a property that allows finding clusters much faster: in practice, usually in time equal to the total number of clusters times the time to compute a max flow. The gist of the community heuristic follows. If the cut between some node v and t yields the community S (the vertices on the source side of the cut), then we do not use any of the nodes in S as subsequent sources to find minimum cuts with t, since according to a lemma proved in [Fla2004] their communities would be subsets of S. Instead, we mark the vertices of S as being in community S, and later, if S becomes part of a larger community S', we mark all nodes of S as being part of S'. The heuristic relies on the order in which we iterate over the vertices of the graph, as opposed to the baseline Gusfield algorithm (section 4.2.1), which visits the vertices in arbitrary order. It is desirable that the largest clusters are identified first. As proposed in [Fla2004], we sort all nodes according to the sum of the weights of their adjacent edges, in decreasing order.

4.3 Flake-Tarjan clustering

[Fla2004] introduces simple graph clustering methods based on minimum cuts within the graph. The cut clustering methods are general enough to apply to any kind of graph but, according to the authors, are well-suited for graphs where the link structure implies a notion of reference, similarity or endorsement. The authors experiment with Web and citation graphs in their work. Given an undirected graph G = (V, E) and a value of the parameter alpha, the basic clustering algorithm of [Fla2004], which we call alpha-clustering (see section 4.3.1) due to the presence of the parameter alpha, finds a community for each vertex with respect to an artificial sink t added to the graph G. The artificial sink is connected to each node of G via an undirected edge of capacity alpha. The community of vertex s with respect to vertex t is the set of vertices on the source side of the minimum cut between vertex s as the source and vertex t as the sink.

In the hierarchical version of the Flake-Tarjan clustering algorithm, we can observe that the algorithm does not depend on the parameter alpha once all the breakpoints of the parametric max flow [Bab2006] have been considered, where the parametric edges are those connecting the artificial sink to the rest of the graph. Thus, given the input graph, a hierarchical clustering is produced on output. It is very important to stress that there are no parameters to tune, unlike the parameter k in k-medoids clustering [Kau1990] or the exemplar preferences in affinity propagation [Fre2007]. The resulting clustering is completely data-driven.

4.3.1 Alpha-clustering

The parameter alpha serves as an upper bound on the inter-cluster edge capacity and a lower bound on the intra-cluster edge capacity, according to a theorem proven in [Fla2004]:

Let G = (V, E) be an undirected graph, s in V a source, and connect an artificial sink t with edges of capacity \alpha to all nodes. Let S be the community of s with respect to t. For any non-empty P and Q, such that P \cup Q = S and P \cap Q = \emptyset, the following bounds always hold:

\frac{c(S, V - S)}{|V - S|} \;\le\; \alpha \;\le\; \frac{c(P, Q)}{\min(|P|, |Q|)}

The left side of the inequality bounds the inter-community edge capacity, thus guaranteeing that communities will be relatively disconnected. Here c(S, V - S) is the cut size (the sum of the capacities of the edges going from the left set of vertices to the right set) between the vertices in S and the rest of the graph. The right side of the inequality means that for any cut inside the community S, even the minimum one, its value (i.e. the sum of the capacities of the edges crossing the cut) will be at least alpha times the minimum of the cardinalities of the two sides of the cut. In other words:

- If we want to separate 1 vertex from a cluster (containing at least 2 vertices), we have to cut away edges with total weight at least alpha.
- If we want to separate 2 vertices from a cluster (containing at least 4 vertices), we have to cut away edges with total weight at least 2*alpha.
- If we want to separate 3 vertices from a cluster (containing at least 6 vertices), we have to cut away edges with total weight at least 3*alpha.
- And so on.

As alpha goes to 0, the cut clustering algorithm will produce only one cluster, namely the entire graph G, as long as G is connected. At the other extreme, as alpha goes to infinity, there will be n trivial clusters, all singletons. When a particular number of clusters is needed, say k, we can apply binary search in order to determine the value of alpha that produces the number of clusters closest to k. When a hierarchy of clusters is needed (see section 4.3.2), the results of clustering using multiple values of alpha must be merged. The basic clustering algorithm, as in [Fla2004], is shown in Figure 4-1 below.

Figure 4-1 Cut clustering algorithm
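To make the procedure of Figure 4-1 concrete, here is a simplified Java sketch of one cut-clustering pass at a fixed alpha. This is our illustration of the algorithm from [Fla2004], not the project's implementation; UndirectedGraph, its methods, and the minCut() helper assumed in the Gusfield sketch above are hypothetical, and real-valued capacities would additionally need the conversion to integral capacities discussed later in the thesis.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of one cut-clustering pass at a fixed alpha, following Figure 4-1; helpers are assumed. */
static List<Set<Integer>> cutClusters(UndirectedGraph g, double alpha) {
    UndirectedGraph ga = g.copy();
    int t = ga.addVertex();                                  // artificial sink
    for (int v : g.vertices()) ga.addEdge(v, t, alpha);      // connect sink with capacity alpha
    List<Set<Integer>> clusters = new ArrayList<>();
    Set<Integer> assigned = new HashSet<>();
    // Community heuristic: visit vertices by decreasing adjacent weight, skip already-assigned ones.
    for (int s : g.verticesByDecreasingAdjacentWeight()) {
        if (assigned.contains(s)) continue;
        Set<Integer> community = minCut(ga, s, t).sourceSide; // source side of the min s-t cut
        // A later community may turn out to be a superset of earlier ones; it then absorbs them.
        clusters.removeIf(community::containsAll);
        clusters.add(community);
        assigned.addAll(community);
    }
    return clusters;
}

Vertices left unassigned after the loop would form singleton clusters, matching the behavior of the basic algorithm for large alpha.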

4.3.2 Hierarchical version

The hierarchical cut clustering algorithm provides a means to look at graph G in a more structured, multi-level way [Fla2004]. In contrast to the greedy hierarchical clustering algorithms, discussed in the subsection on hierarchical clustering methods and used in the state-of-the-art reverse architecting approaches ([Maqb2007], [Czi2007]), the hierarchality of the clusters produced by the Flake-Tarjan clustering algorithm follows from the nesting property proven in [Fla2004]: clusters produced using lower values of alpha are always supersets of clusters produced at higher values of alpha in the basic cut clustering algorithm.

The hierarchical cut clustering algorithm of [Fla2004] is given in Figure 4-2 below. The authors propose to contract the clusters produced with higher values of alpha before running the algorithm on smaller values of alpha. However, this puts a constraint on the order in which we can try different values of alpha. If we want to try smaller values of alpha first, e.g. because we are limited in time and want to obtain the more high-level views on the software system first, then instead of contracting the input graph we should rather be able to merge a clustering obtained at an arbitrary alpha into a globally maintained clustering hierarchy, as devised within our project and described in section 5.4 of the thesis.

Figure 4-2 Hierarchical cut clustering algorithm

4.4 Call Graph extraction

A dynamic call graph is a record of an execution of the program, e.g. as output by a profiler. Thus, a dynamic call graph can be exact, but it only describes one run of the program. A static call graph is a call graph intended to represent every possible run of the program. The exact static call graph is undecidable, so static call graph algorithms are generally over-approximations: every call relationship that occurs is represented in the graph, and possibly also some call relationships that would never occur in actual runs of the program. Below are some examples of the difficulties encountered when generating a call graph from source code (statically):

- polymorphism: depending on the class of the object assigned to a variable of the base class, different methods are called (see the example after this list)
- invariants: if in the code below x >= 0 always holds, then the call to func1() actually never occurs:
    if(x < 0) { func1(x); } else { func2(x); }
- contextuality: in the example above, we can consider the reasons for x to be negative or non-negative, and mark in the call graph the fact that either func1() or func2() can be called from the current function, depending on the context.

So, both dynamic and static call graph generation have drawbacks:
- static: the call graph is imprecise
- dynamic: we need many runs to ensure that the source code is covered well enough
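To make the polymorphism difficulty concrete, consider the following small Java fragment (a hypothetical example of ours, not taken from the analyzed systems). A static call graph extractor that cannot prove which subtype reaches the call site must conservatively add call edges to both implementations, even if only one of them ever occurs at run time.

interface Storage {
    void save(String data);
}

class DatabaseStorage implements Storage {
    public void save(String data) { /* write to the database */ }
}

class XmlStorage implements Storage {
    public void save(String data) { /* write to an XML file */ }
}

class Exporter {
    void export(Storage storage, String data) {
        // Static analysis sees a call to Storage.save() and, unless it can resolve the
        // receiver type, records edges to both DatabaseStorage.save() and XmlStorage.save().
        // A dynamic call graph would record only the implementation actually invoked.
        storage.save(data);
    }
}

Virtual call resolution techniques such as those of [Sun1999] and [Lho2003] narrow the set of possible targets, but cannot always reduce it to the single target that occurs in practice.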

In our prototype there is a point at which the program does not care whether a static or a dynamic call graph is supplied. In principle, we accept any graph of relations on input, without binding to a programming language or to static/dynamic kinds of analysis. In the experiments within this project, however, we used a static call graph extracted from Java source code using Soot [Soot1999] and the approaches for virtual method call resolution available within that framework: [Sun1999], [Lho2003].

4.5 The Problem of Utility Artifacts

Not all component dependencies have the same level of importance. This applies particularly to utility components, which tend to be called by many other components of the system, and as such they encumber the structure of the system without adding much value to its understandability [Pir2009]. A study of the properties of utility artifacts [HaLe2004] concluded that:

- Utilities can have different scope, i.e. not only at the system level.
- Utilities are often packaged together, but not necessarily.
- Utilities implement general design concepts at a lower level of abstraction than those design concepts.

The common practice for detecting utilities is to use heuristics based on computing a component's fan-in and fan-out. The rationale behind this is that [HaLe2004]: something that is called from many places is likely to be a utility, whereas something that itself makes many calls to other components is likely to be too complex and too highly coupled to be considered a utility. An exhaustive review of the existing reverse architecting approaches based on clustering, and of the ways they detect and remove utilities, is given in [Pir2009]. Among these approaches are [Man1998] / [Bunch1999], [Mu1990], [Wen2005], [Pate2009]. In [ACDC2000] a somewhat different approach is used: since the first phase of the ACDC algorithm simulates the way software engineers group entities into subsystems, the authors observed and used the fact that software engineers tend to group components with large fan-in into one cluster, a support library cluster containing the set of utilities of the system.

To our knowledge, all the existing reverse architecting approaches that address the problem of utility artifacts at all detect utility artifacts and remove them from further analysis. In our project we devise and implement a weighting of relations according to their chance of being utility calls/dependencies; the theory is given in section 5.1. In [Roha2008] weighting according to utility measures developed within that work is indeed used; however, that weighting applies to components (vertices of the graph), in contrast to edges (relations) in our project. Furthermore, they do not run clustering after weight assignment.

The major technique used for detection of utility artifacts is fan-in analysis, where the variations are based on the exploration of the component dependency graph built from static analysis of the system [Pir2009]. Dependencies include method calls, generalization, realization, type usage, field access, and others. Some approaches represent the cardinality of dependencies with weights on the edges of the dependency graph, e.g. [Stan2009]. The rationale behind using fan-in analysis as an indication of the extent to which an artifact can be considered a utility is as follows: the more calls a component receives from different places (i.e. the more incoming edges it has in the static component graph), the more purposes it likely serves, and hence the more likely it is to be a utility; the researchers currently converge on this rationale [Mu1990], [Pate2009], [HaLe2006], [Roha2008]. The weak points of the approaches that attempt to solve the problem of utility detection are listed in [Pir2009]. Those approaches use evidently more complicated fan-in analysis than we do in this project; sometimes they even combine fan-in with fan-out analysis.
However, the strong point of our approach is that the solution of the utility artifacts problem is shared between the pre-clustering phase and the clustering itself, see section 5.1. In the pre-clustering phase we estimate the utilityhood of software engineering artifacts and relations. Then the clustering phase smoothes the likely utility connections, because those connections are assigned low weight in the pre-clustering (normalization) phase.

Though [HaLe2006] experimented with a combination of fan-in and fan-out analysis in order to determine the extent to which a component can be considered a utility, their metrics were only able to detect system-scope utilities. We argue that this issue was encountered because the authors were trying to solve the problem of detection, and thus had to introduce a threshold in order to make a decision. But thresholds differ between the whole system and utilities in local subsystems; furthermore, local utilities do not necessarily have the same decision threshold across different subsystems.

A counterexample against fan-in analysis alone was given in [Roha2008]; we show it too, in Figure 4-3 below. It is arguable whether C2 is indeed a utility, but C3 apparently is, according to the utility rationale discussed above. However, functions usually call only functions at lower levels of abstraction; thus a utility function either does not call any others or calls mostly utility functions [HaLe2006], [HaLe2004]. Thus C2 is likely to be a utility too. Fan-in analysis alone, however, would not detect it as such.

Figure 4-3 Likely utility C2, but with low fan-in

The above example is not a problem for our approach, as the connection weight between C3 and C2 stays strong (see section 5.1), so C3 will be clustered with C2 first (at the bottom of the clustering hierarchy or, in other words, at a high value of the parameter alpha, see [Fla2004] and section 4.3.2). Thus, if some magic oracle (which we do not have explicitly in our approach) deems C3 to be a utility, then C2 will be the closest to it in terms of unity of purpose, according to the clustering results. Afterwards it is not hard to infer that C2 is a utility too.

In our work we also give a counterargument against fan-out analysis, which arises in practice due to the impreciseness of call graph extraction (discussed in sections 3.1 and 4.4) and the kind of errors the state-of-the-art call graph analyses make: due to polymorphism, excessive calls to multiple derived classes (subtypes) appear in the call graph which never occur in practice. We observed this in our experiments (see the appendix) and provide our argument in section 5.1.3 below.

Finally, it makes sense to mention that [HaLe2004] identify (in their reasoning, not automatically) different kinds of utilities:

- Utilities derived from the usage of a particular programming language. An example is a class that implements the Enumeration interface in Java.
- Utilities derived from the usage of a particular programming paradigm. For example, accessor methods or initializing functions.
- Utilities that implement data structures (inserting, removing, sorting).
- Mathematical functions.
- Input/Output operations.

Our results show (see the appendix) that not only has the utility artifact problem been alleviated, but utility artifacts are also categorized according to their purpose, like the other artifacts.

4.6 Various Algorithms

A number of classical algorithms in computer science were used in order to implement this project, and thus appear in this paper. The most important of them were discussed earlier in this section. Below we give a short list of, and remarks about, the rest. A reader who needs more background can refer to [CLR2003], [AHU1983] and [Knu1998].

Algorithm                                              Remarks
Breadth & Depth First Searches
Priority Queue
Priority Blocking Queue (Java)
Minimum Spanning Tree
Tree Traversals, Metrics & Manipulations               Height, depth, cardinality, etc.
Lowest Common Ancestor
Disjoint-set data structure / union-find algorithm
Reindexing Techniques
Graph contraction
Reusable full-indexing map                             Insertion/Removal: O(1); Creation: O(nIndices); Listing: O(nStoredItems)
Dynamic Programming
Suffix Tree                                            For the statistics
Subgraph/subset processing                             For the statistics
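As an illustration of the stated complexities of the reusable full-indexing map, below is a minimal sketch of one way such a structure could be organized; this is our guess at the idea, and the project's actual data structure may differ. Objects are keyed by dense integer indices, presence is tracked in a slot array, and only occupied slots are listed, so the map can be cleared and reused without reallocation.

import java.util.Arrays;

/** Sketch of a reusable full-indexing map over keys 0..nIndices-1 (hypothetical design). */
final class FullIndexMap<V> {
    private final int[] dense;      // dense[0..size-1] = stored keys, in arbitrary order
    private final int[] slotOf;     // slotOf[key] = position of key in dense[], or -1 if absent
    private final Object[] values;  // values[key] = value stored for key, if present
    private int size;

    FullIndexMap(int nIndices) {                 // Creation: O(nIndices)
        dense = new int[nIndices];
        slotOf = new int[nIndices];
        values = new Object[nIndices];
        Arrays.fill(slotOf, -1);
    }

    void put(int key, V value) {                 // Insertion: O(1)
        if (slotOf[key] < 0) {
            slotOf[key] = size;
            dense[size++] = key;
        }
        values[key] = value;
    }

    @SuppressWarnings("unchecked")
    V remove(int key) {                          // Removal: O(1), swap-with-last
        int slot = slotOf[key];
        if (slot < 0) return null;
        int lastKey = dense[--size];
        dense[slot] = lastKey;
        slotOf[lastKey] = slot;
        slotOf[key] = -1;
        V old = (V) values[key];
        values[key] = null;
        return old;
    }

    int[] storedKeys() {                         // Listing: O(nStoredItems)
        return Arrays.copyOfRange(dense, 0, size);
    }

    void clear() {                               // Reuse without reallocation: O(nStoredItems)
        for (int i = 0; i < size; i++) { slotOf[dense[i]] = -1; values[dense[i]] = null; }
        size = 0;
    }
}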

5 Theory

Within this work we have devised and used some theory needed in order to:

- Apply the clustering method of [Fla2004] to the source code analysis domain.
- Solve the issues encountered during that application, namely the excessive number of sibling nodes (a.k.a. the alpha-threshold issue). This happens due to the specifics of the domain: software systems are usually nearly hierarchical, i.e. there are few (ideally, no) cycles.
- Optimize the search direction in order to get the most important solutions as early as possible during the iterative runtime of the hierarchical clustering algorithm.
- Allow parallel computation, as the clustering process still takes considerable time.

The following subsections provide this theory in the amount necessary to implement our system. Some proofs and empirical evaluations require considerable effort and are thus left out of our scope.

We represent the source code of software as a directed graph G = (V, E), with |V| = n vertices and |E| = m edges. Each vertex corresponds to a software engineering artifact (usually a class of an object-oriented language, e.g. a Java class) and each directed edge to a relation between software engineering artifacts, e.g. a method call or class inheritance. Also, we usually assume that G is connected, as otherwise each component can be analyzed separately, unless some global restrictions on clustering granularity are posed.

5.1 Normalization

In section 4.5 we discussed the problem of utility artifacts. Our practical experiments have confirmed that the Flake-Tarjan clustering algorithm also produces degenerate results in case we cluster the graph of relations as-is, i.e. when each relation corresponds to an edge of weight 1 in the input graph for clustering. Moreover, the graph of relations between software engineering artifacts is directed, while Flake-Tarjan clustering is restricted to undirected graphs due to the underlying minimum cut tree algorithm [GoHu1961], which is only known for undirected graphs, even though its own underlying algorithm, max flow [Gol1998], is available for both directed and undirected graphs. Extending the Flake-Tarjan clustering algorithm to work with directed graphs is both a hard theoretical and a risky task (see section 8). Thus within this project we decided to convert the directed graph into an undirected one by means of normalization.

The authors of [Fla2004] in their experiments used a normalization similar to the first iteration of HITS [HITS1999] and to PageRank [Brin1998]. Each node distributes a constant amount of weight over its outbound edges. The fewer pages a node points to, the more influence it can pass to the neighbors it points to. In their experiment with CiteSeer citations between documents [Fla2004], the authors normalize over all outbound edges for each node (so that the total sums to unity), remove edge directions, and combine parallel edges. Parallel edges resulting from two directed edges are resolved by summing the combined weight. However, in the domain of software source code, it seems more reasonable to normalize over the incoming arcs rather than the outgoing ones, so that each node receives a constant (or logarithmic) amount of weight from its incoming edges. This is grounded in [Pir2009], which reviews many works on utility artifact detection and points out that being called by many other components is the main property of utility functions. In the literature on reverse architecting the exploitation of this property is called fan-in analysis, discussed in section 4.5 of the thesis too.
The crucial difference in how our approach addresses the problem of utility artifacts, compared to the other existing approaches (see section 4.5), is the following. The existing approaches focus on detection of utility artifacts with the goal of their subsequent removal prior to clustering.

Our solution for the utility artifacts issue is split between the pre-clustering and clustering phases; thus we are not concerned with the problem of detection, which would entail a further binary decision on whether to remove an artifact before clustering.

5.1.1 Directed Graph to Undirected

Our conversion from a directed to an undirected graph works as follows. We discard arcs that loop a vertex to itself. Apparently, we lose some information in this step, namely the fact that the corresponding SE artifact references (calls, uses, etc.) itself. However, we do not see a way to make use of this information without damaging clustering quality; the latter was observed in our experiments.

For each vertex j, its fan-in S_j is calculated as the sum of the weights of all the incoming arcs in the initial graph:

S_j = \sum_i w_{i,j}

Each arc in the graph is then replaced with a normalized arc having weight \tilde{w}_{i,j}, which without leverage (section 5.1.2 below) amounts to:

\tilde{w}_{i,j} = \frac{w_{i,j}}{S_j}

In the target undirected graph, an edge between vertices i and j receives weight u_{i,j} equal to the sum of the weights of the opposite-directed arcs:

u_{i,j} = u_{j,i} = \tilde{w}_{i,j} + \tilde{w}_{j,i}

Let us define U_j as the total weight of the edges adjacent to vertex j in the target undirected graph (the adjacent weight):

U_j = \sum_i u_{i,j} = \sum_i \frac{w_{i,j}}{S_j} + \sum_i \frac{w_{j,i}}{S_i}

The following properties can be observed:

- Each vertex receives a constant (C = 1) amount of weight via the normalized arcs \tilde{w}_{i,j}.
- The total weight of the edges adjacent to a vertex in the undirected graph can be both more and less than C.
- If a vertex has at least one incoming arc in the directed graph, it will have adjacent weight of at least C in the undirected graph. In practice, SE artifacts that only use others but are not used from any place, and thus have no incoming arc, are rare; thus in most cases U_j >= 1. Exclusions: artifacts that are called externally (e.g. thread entry points or contexts launched from the Spring framework) may not have any incoming arcs when the input graph does not contain the relations of the whole program (e.g. the code of system libraries or of the Spring framework is not available). For such artifacts it can happen that U_j < 1.
- It seems that the deviation of U_j from 1 (or of its equivalent for the leveraged counterpart from section 5.1.2) can be a (partial) cause of the alpha-threshold issue observed, see section 5.6.
- If vertices are SE methods and edges are method calls, then a vertex has high adjacent weight when the corresponding method calls many methods that are infrequently called from other places. Frequency is calculated by the number of occurrences in the source code.

Obviously, such a conversion can be performed in O(|E| log |V|) scalar operations by means of two passes through all the arcs of the graph: first, calculate the values of S_j; second, calculate the weight \tilde{w}_{i,j} for each arc and combine it with \tilde{w}_{j,i} (if this opposite arc is present), using balanced trees of incident vertices for each vertex. In practice, we use hash maps here.
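A minimal Java sketch of this conversion (our illustration, not the project's code; the hash-map-based input representation and the edge-key helper are assumptions) could look as follows:

import java.util.HashMap;
import java.util.Map;

/** Sketch of fan-in normalization: directed arcs (i -> j, weight w) to undirected edges. */
final class Normalizer {
    /** arcs.get(i).get(j) = total weight of the arc i -> j; self-loops are assumed already removed. */
    static Map<Long, Double> toUndirected(Map<Integer, Map<Integer, Double>> arcs) {
        // First pass: fan-in S_j = sum of weights of arcs entering j.
        Map<Integer, Double> fanIn = new HashMap<>();
        for (Map<Integer, Double> out : arcs.values())
            out.forEach((j, w) -> fanIn.merge(j, w, Double::sum));
        // Second pass: w~_ij = w_ij / S_j, then u_ij = w~_ij + w~_ji (opposite arcs combined).
        Map<Long, Double> undirected = new HashMap<>();
        arcs.forEach((i, out) -> out.forEach((j, w) ->
            undirected.merge(edgeKey(i, j), w / fanIn.get(j), Double::sum)));
        return undirected;
    }

    /** Order-independent key so that (i, j) and (j, i) map to the same undirected edge. */
    private static long edgeKey(int a, int b) {
        int lo = Math.min(a, b), hi = Math.max(a, b);
        return ((long) lo << 32) | (hi & 0xFFFFFFFFL);
    }
}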

5.1.2 Leverage

In the previous section we gave formulas that force vertices to receive a constant amount of weight, i.e. for each vertex j:

\sum_i \tilde{w}_{i,j} = 1 = C

However, it seems unreasonable to discard the cardinality of references to an SE artifact completely. Thus we use a logarithmic leverage of the bound on the weight that a vertex can receive from its incoming arcs. In this case, the values of the discounted arc weight \tilde{w}_{i,j} and of the adjacent weight U_j for a vertex j are instead calculated as follows:

\tilde{w}_{i,j} = w_{i,j} \cdot \frac{\log S_j}{S_j}, \qquad u_{i,j} = u_{j,i} = \tilde{w}_{i,j} + \tilde{w}_{j,i},

U_j = \sum_i u_{i,j} = \log S_j + \sum_i w_{j,i} \cdot \frac{\log S_i}{S_i}

In our view, the usage of this leveraged estimation of connection strength pursues (and, empirically, achieves) the following objectives:

- Alleviate the problem of utility artifacts (also characteristic of the normalization described in section 5.1.1).
- Regard the scale of connectedness rather than its magnitude. E.g., the scale of the difference between 2 and 4 connections is the same as that between 100 and 200 connections. The former distinguishes between more and less coupled high-level (in another interpretation, specific) SE artifacts. The latter distinguishes between more and less omnipresent utility (general-purpose) artifacts.

We compared the clusterings produced with and without leverage using some indicators inherent to the clustering method, namely:

- the range of the parameter alpha between the single-cluster and the all-singleton-clusters results of partitional clustering [Fla2004]
- the number of excessive sibling clusters in the clustering hierarchy due to the alpha-threshold (also, see section 5.6)

Though we are not able to quantify the comparison as a percentage, since evaluation of clustering quality is not a straightforward task in itself, from the experiments, the indicators described above and a subjective evaluation of the resulting clustering hierarchy we conclude that leverage improves the clustering quality of this algorithm [Fla2004] in the software source code domain.

Comparing with the literature, we can observe that some kind of logarithmic leverage is used in other reverse architecting approaches [HaLe2006], [Roha2008]. The utilityhood metric of [HaLe2006] consists of two factors multiplied together: one is a fan-in based ratio (a straight division of the fan-in cardinality by the number of artifacts), the other is based on fan-out with logarithmic leverage. Their rationale is that fan-in is much more important than fan-out; however, fan-out should also play a role in the utilityhood of an artifact (we discussed this too in section 4.5).

[Roha2008] approves this rationale and adopts a derived approach for estimating the impact of component modification in their TWI (Two Way Impact) metric. However, the logarithmic multiplier still stays in the part responsible for fan-out (in the terms of [Roha2008], the Class Efferent Impact) and fan-in is still represented by a direct ratio of cardinalities, in contrast to our approach.

5.1.3 An argument against fan-out analysis

The existing reverse architecting approaches, e.g. [Roha2008] and [HaLe2006], use fan-out analysis in addition to fan-in. By doing this they attempt to make use of the second part of the rationale for utility detection (we discussed it in section 4.5), namely: something that itself makes many calls to other components is likely to be too complex and too highly coupled to be considered a utility [HaLe2004], [Pir2009]. However, when the underlying data for fan-in/fan-out analysis is a call graph extracted from the source code of a program in an object-oriented language, we can observe that such a relation graph has excessive outgoing arcs, which are noise (illustrated in the appendix). This happens due to the impreciseness of call graph extraction (section 4.4). Though the existing heuristics for call graph extraction ([Sun1999], [Bac1996], [Lho2003]) can alleviate this problem, they cannot eliminate it, and we still get vertices with excessive fan-out. Thus, in practice, we argue against fan-out analysis. Though we agree that a component that makes many calls is likely to be complex and highly coupled, so that the utilityhood of such a component should be discounted with respect to a metric inferred from pure fan-in analysis, we can only do this when our graph of relations has no excessive outgoing arcs, i.e. in theory or in dynamic analysis. In the practice of static analysis, for each polymorphic call site there is usually only a single call, or a few calls, to some most specific subtypes that actually occur and were intended by the software engineers; the rest is noise. Thus, by discounting the utilityhood of the components containing such call sites due to high fan-out, we would propagate the mistake. We expect to achieve a more noise-tolerant solution by not using fan-out analysis (at least, not in the form of a ratio or a logarithmic multiplier).

5.1.4 Lifting the Granularity

The input graph contains relations between SE artifacts of different granularity. There are method-to-method, method-to-field, method-to-class and class-to-class relations. Analyzing Java programs, we generalize all artifacts below class level as members (nested classes do not fall into this category), see section 5.2. In this project we experimented with class-level artifact clustering only. Thus we have to lift the relations involving artifacts below class level to the class level, i.e. lift the granularity to the class level. An alternative is given in section 5.1.5. For approaches to and a discussion of lifting component dependencies in general, one can refer to [Kri1999]. In combination with the normalization that we discuss in section 5.1, we see two principal options for lifting the granularity:

1. Before normalization
2. After normalization

Adopting the first option, we would first aggregate all the arcs in the initial directed graph G~ that connect members of the same pair of classes, and connect the vertices (which represent SE classes) of a derived graph G with arcs having the aggregated weights. This corresponds to the lifting of [Kri1999].
In the next step we would normalize the directed graph $G$ as described in the previous sections. Adopting the second option, we attempt to tolerate the noise in the input graph $\tilde{G}$ and improve the quality of our heuristic addressing the utility artifact problem. An alternative solution pursuing the same goal is proposed in section 5.1.5. Below is the rationale for the heuristic that we show in this section and chose to implement in our project.

An error in the utilityhood estimation of a single member artifact is smoothed by the utilityhood estimations of the rest of the artifacts which are members of the same SE class, in case lifting to class level occurs after normalization. We empirically observed that, indeed, option 2 leads to better clustering results than option 1. This is further confirmed by the indicators intrinsic to the clustering method used: the alpha-threshold and the excessive number of sibling clusters (see section 5.6).

Formally, consider that:
- $\tilde{V}$ is the initial set of vertices, where a vertex can correspond to either a member-level or a class-level SE artifact;
- the heterogeneous relations between SE artifacts in the initial directed graph $\tilde{G}$ constitute its set of arcs $\tilde{E}$, where the arc from vertex $k$ to vertex $l$ has weight $a_{k,l}$;
- $V$ is the subset of $\tilde{V}$ consisting of the class-level SE artifacts, and a class-level artifact is never also a member-level artifact;
- the membership relations are defined by a mapping $M: \tilde{V} \to V$, where membership relations are only defined from a member-level artifact to a class-level artifact and, for convenience, each class-level artifact maps to itself.

If we lift the granularity prior to normalization, we get an undirected graph $G^{(1)}$ with edge weights $u^{(1)}_{i,j}$ and properties (section 5.1.1) as follows:

$$w^{(1)}_{i,j} = \sum_{k \in M^{-1}(i)} \sum_{l \in M^{-1}(j)} a_{k,l}\,; \qquad S^{(1)}_j = \sum_{i \in V} w^{(1)}_{i,j}\,; \qquad \tilde{w}^{(1)}_{i,j} = \frac{w^{(1)}_{i,j}}{S^{(1)}_j}\,; \qquad u^{(1)}_{i,j} = \tilde{w}^{(1)}_{i,j} + \tilde{w}^{(1)}_{j,i}$$

Merging the formulas in order to demonstrate the intuition about the resulting weights in the undirected graph, we get:

$$u^{(1)}_{i,j} = \frac{\sum_{k \in M^{-1}(i)} \sum_{l \in M^{-1}(j)} a_{k,l}}{\sum_{m \in V} \sum_{k \in M^{-1}(m)} \sum_{l \in M^{-1}(j)} a_{k,l}} + \frac{\sum_{k \in M^{-1}(j)} \sum_{l \in M^{-1}(i)} a_{k,l}}{\sum_{m \in V} \sum_{k \in M^{-1}(m)} \sum_{l \in M^{-1}(i)} a_{k,l}}$$

Figure 5-1 Lift, then normalize

In the above formula, $M^{-1}(i)$ is the inverse of the mapping $M$; namely, $M^{-1}(i)$ is the set of members of SE class $i$. Formally: $M^{-1}(i) = \{k \mid M(k) = i\}$. It is easy to notice that the iteration in both denominators of Figure 5-1 occurs over all the vertices of graph $\tilde{G}$, thus:

$$u^{(1)}_{i,j} = \frac{\sum_{k \in M^{-1}(i)} \sum_{l \in M^{-1}(j)} a_{k,l}}{\sum_{k \in \tilde{V}} \sum_{l \in M^{-1}(j)} a_{k,l}} + \frac{\sum_{k \in M^{-1}(j)} \sum_{l \in M^{-1}(i)} a_{k,l}}{\sum_{k \in \tilde{V}} \sum_{l \in M^{-1}(i)} a_{k,l}}$$

Figure 5-2

Now let us regard the formulas for the second option, where we first normalize and then lift the granularity. We get an undirected graph $G^{(2)}$ with edge weights $u^{(2)}_{i,j}$ and properties as follows:

$$S_l = \sum_{k \in \tilde{V}} a_{k,l}\,; \qquad \tilde{a}_{k,l} = \frac{a_{k,l}}{S_l}\,; \qquad b_{k,l} = \tilde{a}_{k,l} + \tilde{a}_{l,k}\,; \qquad u^{(2)}_{i,j} = \sum_{k \in M^{-1}(i)} \sum_{l \in M^{-1}(j)} b_{k,l}$$

Figure 5-3 Normalize, then lift

To compare the outcome of both options, $u^{(1)}_{i,j}$ and $u^{(2)}_{i,j}$, let us bring the formula of Figure 5-3 into a presentation similar to Figure 5-2:

$$u^{(2)}_{i,j} = \sum_{k \in M^{-1}(i)} \sum_{l \in M^{-1}(j)} \frac{a_{k,l}}{\sum_{m \in \tilde{V}} a_{m,l}} \;+\; \sum_{k \in M^{-1}(j)} \sum_{l \in M^{-1}(i)} \frac{a_{k,l}}{\sum_{m \in \tilde{V}} a_{m,l}}$$

Figure 5-4

whereas the formula for the first option, $u^{(1)}_{i,j}$, from Figure 5-2 can be rewritten as:

$$u^{(1)}_{i,j} = \frac{\sum_{l \in M^{-1}(j)} \sum_{k \in M^{-1}(i)} a_{k,l}}{\sum_{l \in M^{-1}(j)} \sum_{k \in \tilde{V}} a_{k,l}} + \frac{\sum_{l \in M^{-1}(i)} \sum_{k \in M^{-1}(j)} a_{k,l}}{\sum_{l \in M^{-1}(i)} \sum_{k \in \tilde{V}} a_{k,l}}$$

Figure 5-5

If we fix $i$ and $j$ and let $c_n = \sum_{k \in M^{-1}(i)} a_{k,n}$ and $C_n = \sum_{k \in \tilde{V}} a_{k,n}$, where $n$ ranges over $M^{-1}(j) = \{n_1, \dots, n_N\}$, we can observe that the left summands of the formulas in Figure 5-4 and Figure 5-5 (and by analogy, the right summands too) relate to each other as:

$$\tilde{w}^{(1)}_{i,j} \sim \frac{c_1 + c_2 + \dots + c_N}{C_1 + C_2 + \dots + C_N}\,, \qquad \tilde{w}^{(2)}_{i,j} \sim \frac{c_1}{C_1} + \frac{c_2}{C_2} + \dots + \frac{c_N}{C_N}$$

Figure 5-6 Comparison of the undirected weights
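To make the difference tangible, consider a small worked example with numbers of our own (not taken from the thesis experiments). Let class $j$ have two members, $n_1$ and $n_2$, with $c = (2, 1)$ and $C = (4, 100)$, i.e. member $n_2$ is referenced heavily from everywhere while $n_1$ is referenced mostly from class $i$. Then

$$\tilde{w}^{(1)}_{i,j} \sim \frac{2 + 1}{4 + 100} \approx 0.029, \qquad \tilde{w}^{(2)}_{i,j} \sim \frac{2}{4} + \frac{1}{100} = 0.51.$$

Under option 1 the omnipresent member $n_2$ dominates the denominator and drowns the specific coupling through $n_1$; under option 2 that specific coupling stays visible, in line with the noise-smoothing rationale given above.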

We hope this can further be used for a formal study of the effect of noise, but we leave this out of the scope of this paper. Call graph extraction methods put some noise ([Bac1996], [Sun1999], [Lho2003], also section 3.1) into the resulting graph, either by adding calls which never occur, or by dropping some calls which may in fact occur (section 4.4). In general, we can designate this noise as $\epsilon_{k,l}$, $0 \le \epsilon_{k,l} \le 1$, for each arc $a_{k,l}$ in the input graph of relations, meaning that there are $(1 - \epsilon_{k,l})\,a_{k,l}$ true calls and $\epsilon_{k,l}\,a_{k,l}$ false calls.

5.1.5 An Alternative

In principle we could, without lifting the granularity, normalize and then run the Flake-Tarjan hierarchical clustering algorithm over the heterogeneous graph consisting of both member- and class-level SE artifacts as vertices, and the heterogeneous relations between SE artifacts as edges. We could then try to lift the granularity from member level to class level after the clustering of this graph has been performed. We argue that this solution can produce a better clustering hierarchy, in terms of how well it reflects the actual decomposition of the software system, because, compared to the solution of section 5.1.4, less information is lost prior to clustering. Namely, the information loss occurs in the following: by aggregating edge weights over all the members of an SE class, we get a single (if any) edge (relation) between any two SE classes. An SE class becomes connected to other SE classes with edges, where each edge weight represents the connection strength between the two classes. However, some members of an SE class, vertex $v$, in the initial graph might be more connected with members of one SE class, vertex $v_1$, while the other members of class $v$ might be more connected with members of another SE class $v_2$. An example of two cases which clustering will not be able to distinguish due to this information loss is illustrated in Figure 5-7 below.

Figure 5-7 A counterexample to lifting the granularity prior to clustering (two member-level relation patterns, case A and case B, over five classes C1 to C5, and the two resulting decompositions into subsystems)

Consider a system of 5 classes (rectangles) containing 3 members (adjacent squares) each. The member-level relations (edges), which should be considered present:

- in both cases are drawn in blue;
- only in case A, in black;
- only in case B, in red.
It is obvious that in case A the graph contains 2 separate cycles, drawn in yellow; however, in case B there is a single cycle traversing all five SE classes, drawn in green. The above is drawn on the left side of Figure 5-7. It is reasonable that case A and case B determine different decompositions of the system, illustrated at the top and at the bottom of the right side of Figure 5-7 correspondingly. The difference is that in case B (single cycle, bottom diagram) the classes {C2, C3, C4, C5} do not constitute a subsystem without class C1, even though {C2, C3} and {C4, C5} constitute subsystems of which the whole system can be composed by adding {C1}. On the other hand, in case A (two cycles, upper diagram) the classes {C2, C3, C4, C5} constitute a subsystem, which is a combination of two disjoint subsystems. In practice, the word "disjoint" will be replaced with "loosely connected" (in terms of connection density, i.e. not to be confused with weakly connected components in a directed graph), and instead of the criterion of a connected component, the criterion of a cluster will be used.

An apparent disadvantage of member-level clustering is the computational complexity. In our experiments we usually observed 14.5 times more members than classes. In addition to this disadvantage, however, it is not clear how to lift the granularity to class level after clustering at member level. Each class contains several members, and each of its members will appear somewhere in the member-wise clustering hierarchy. How should the classes be arranged into a hierarchy then, given the data on where their members appear in the member-wise hierarchy? One approach is (weighted) voting. However, a problem arises: member-wise partitional clusterings will have the nesting property ([Fla2004], also section 4.3.2), but after voting it is most likely to be lost at class level. At this point many options arise for solving this problem, e.g.:
1 For each pair of classes, count how many times their members form pairs by appearing together at each level of the member-wise hierarchy. We get a sparse matrix of counts, perhaps also weighted by the depth/height of the node at which the pairs were encountered. We can now run a clustering algorithm with this new matrix as edge weights, perhaps with some normalization. This solution is also vulnerable to the issue displayed in Figure 5-7, though less so than a solution that loses the member-wise relation information at the very beginning.
2 Start building a new tree. Let each class node appear at the position where the (weighted) majority of its members appeared in a subtree of the member-wise hierarchy. This solution is prone to non-deep hierarchies with an excessive number of sibling nodes, and the latter would hinder comprehensibility.
Due to the practical difficulties (increased computational complexity) and the many reasonable options without a single good theoretical option, we did not develop this alternative further within the current project.

5.2 Merging Heterogeneous Dependencies

The phase of extraction of relations between software engineering artifacts can produce various kinds of relations. In this project we used:
1 Method-to-method calls
2 Class-to-class inheritance
3 Method-to-class field access
4 Method-to-class type usage: a method has statements with operands of that class
5 Method-to-class parameter & return values: a method takes parameters, or returns values, which are instances of a certain class
Note that kind 5 is not exhausted by kind 4, as e.g. methods in an interface do not have bodies, and thus do not have statements.

Now the question is how to consider the various kinds of relations for the inference of software structure. We see the following principal options.

Option 1: Clusterize the graphs of homogeneous relations separately, i.e. one graph per kind of relation, then combine the resulting multiple hierarchies into a single one. This solution has the same root disadvantage as the one discussed in 5.1.5, namely, it is not clear how to merge the hierarchies. The challenge of nearly-hierarchical input data for clustering in the software engineering domain is discussed in section 5.6. Thus, another disadvantage arises from the fact that a graph representing a single kind of relation is even more nearly-hierarchical than a graph combining multiple relations (dependencies) between SE artifacts. For example, inheritance always forms a directed acyclic graph (DAG) of relations. In the Java programming language, if we consider only classes (not interfaces, i.e. only the "extends" but not the "implements" kind of inheritance), it is always a tree. In C++ it can still be a DAG.

Option 2: Combine the multiple graphs into a single one prior to clustering. This has the same disadvantage relative to option 1 as discussed in section 5.1.4 (obviously, a similar counterexample can be given by analogy), namely, the loss of information about the kind of relation which an edge in the input graph for clustering represents. On the other hand, a strong side of this option is the fact that a graph combining multiple kinds of relations is less likely to be nearly a tree, thus this solution alleviates the issue discussed in section 5.6.

In this project we implement option 2, and point out option 1, together with the similar alternative discussed in section 5.1.5, as a direction for further research. We use equal weight for one relation of each type. An improved approach could try to learn the optimal weights by means of training on systems for which authoritative decompositions are available, comparing its performance using an appropriate metric for nested software decompositions (see [UpMJ2007], [END2004]), and then use the same weights for merging the relations of novel software.

5.3 Alpha-search

The basic cut clustering algorithm (section 4.3), given some value of the parameter alpha, produces a partition of the vertices into groups, i.e. a flat decomposition of the system upon analysis. For smaller values of the parameter alpha, there are fewer groups; for higher values of alpha, there are more groups. The groups have the nesting property [Fla2004], i.e. they naturally form a hierarchy. The exact hierarchy can be computed by the hierarchical clustering algorithm (section 4.3.2), but this requires running the basic cut clustering algorithm (section 4.3.1) over all the values of parameter alpha that produce different numbers of clusters. There can be many flow breakpoint alphas that can be found fast [Bab2006]; of them, no more than $|V| - 2$ produce a different number of clusters. In our experiments, calculation of the clustering for a single alpha was taking 4.5 minutes for 7K vertices, thus it is not feasible to perform this operation $|V| - 2$ times. In order to produce as much result as possible within limited time, we perform the most important probes first. We used a binary search tree approach, described in the following subsections.

5.3.1 Search Tree

An initial interval $[\alpha_{\min}; \alpha_{\max}]$ is chosen as the root of the tree, such that $\alpha_{\min}$ yields a single cluster and $\alpha_{\max}$ yields many singleton clusters. These bounds can be found with binary search, as proposed in [Fla2004], but in practice we just use values that produce a small enough and a large enough number of clusters correspondingly.
Each child node in the search tree corresponds to a half of the interval of its parent, so that node $[\alpha_l; \alpha_r]$ will have children $[\alpha_l; \frac{\alpha_l + \alpha_r}{2}]$ and $[\frac{\alpha_l + \alpha_r}{2}; \alpha_r]$.

It is convenient to view the space of alpha values as a tree, because in the search algorithm we can then maintain the following invariant:
- At each iteration, there is a tree of alpha values for which probes (runs of the basic cut clustering algorithm) have already been performed.
- We can use any leaf node as the base for the next probe.
Thus the search tree does not have to be balanced. We can do more probes in a more interesting interval (where a more fine-grained decomposition of the system will say more to a software engineer), and fewer probes in another. An illustration of the alpha space and the search tree is given in Figure 5-8 below. The alpha-interval of each node is denoted with a block arrow.

Figure 5-8 Search Tree and Alpha Space (the root interval $[\alpha_{\min}; \alpha_{\max}]$ split at $(\alpha_{\min} + \alpha_{\max})/2$ into recursively halved child intervals)

Each iteration is an attempt to improve the clustering hierarchy, consisting of the following steps:
- Select a leaf node; without loss of generality consider its interval to be $[\alpha_l; \alpha_r]$.
- Calculate the alpha value for the next probe: $\alpha_m = \frac{\alpha_l + \alpha_r}{2}$.
- Run the basic cut clustering algorithm using $\alpha_m$.
- Add two child nodes into the search tree, corresponding to the intervals $[\alpha_l; \alpha_m]$ and $[\alpha_m; \alpha_r]$.
A node does not add a left child (or a right child, by analogy) into the search tree in case $k(\alpha_m) = k(\alpha_l)$, where $k(\alpha)$ is the number of clusters produced by the basic cut clustering algorithm using this value of the parameter. All of the above gives a base for the prioritization described in the following section.

5.3.2 Prioritization

In the beginning we put the root node, corresponding to the whole interval $[\alpha_{\min}; \alpha_{\max}]$, into a priority queue. It is also a leaf node at this moment, as no child nodes have been added yet. In the previous section we showed that any leaf node can be taken at each iteration for the next probe. Thus we can maintain the invariant that there are only leaf nodes in the priority queue, and choose a reasonable priority function.

Let $k(\alpha)$ be the number of clusters that the basic cut clustering algorithm produces using parameter $\alpha$. Consider that we have performed a probe for $\alpha_m$ and are now going to push the children of node $[\alpha_l; \alpha_r]$, which are $[\alpha_l; \alpha_m]$ and $[\alpha_m; \alpha_r]$, into the queue. Then for child $[\alpha_l; \alpha_m]$ (and by analogy, for child $[\alpha_m; \alpha_r]$) we set the following priority:

$$P_{\alpha_l,\alpha_m} = \log\bigl(k(\alpha_m) - k(\alpha_l)\bigr) + \log(\alpha_m - \alpha_l) + \min\bigl\{k(\alpha_m) - k(\alpha_l),\; k(\alpha_r) - k(\alpha_m)\bigr\} + \frac{1}{\alpha_m}$$

Below is the motivation for each of the summands constituting $P_{\alpha_l,\alpha_m}$:
- $\log(k(\alpha_m) - k(\alpha_l))$ forces the intervals spanning a large number of clusters (from $k(\alpha_l)$ to $k(\alpha_m)$) to be considered earlier.
- $\log(\alpha_m - \alpha_l)$ forces large intervals to be considered earlier. Here "large" refers to the difference of $\alpha$, in contrast to the previous point where the difference in the number of clusters is regarded.
- $\min\{k(\alpha_m) - k(\alpha_l),\; k(\alpha_r) - k(\alpha_m)\}$ forces the more balanced intervals to be considered earlier. This summand contributes most of all to the priority, as it is a big improvement when we e.g. split an interval of 2000 clusters into parts of 1000 and 1000, rather than 1998 and 2.
- $1/\alpha_m$ forces probes for small values of alpha to be made earlier. The probes at small values of alpha yield decisions about the upper (closer to the root) levels of the clustering hierarchy.

Of the multiple priority functions considered, the one given in this section demonstrated the best value-for-time in our experiments.
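For concreteness, a minimal Java sketch of the prioritized alpha search is given below (an illustration under our own names such as AlphaSearch; it is not the InSoAr code). It assumes a clusterCount callback that runs the basic cut clustering for a given alpha, merges the result into the global tree, and returns k(alpha).

import java.util.PriorityQueue;
import java.util.function.DoubleToIntFunction;

/** Minimal sketch (assumed names) of the alpha-interval search with the priority described above. */
final class AlphaSearch {

    record AlphaInterval(double left, double right, int kLeft, int kRight, double priority) {}

    /** Priority of a child interval [a; b]; balance is min{k(a_m)-k(a_l), k(a_r)-k(a_m)} of the parent split. */
    static double priority(double a, double b, int ka, int kb, int balance) {
        return Math.log(Math.max(1, kb - ka)) + Math.log(b - a) + balance + 1.0 / b;
    }

    /** Performs up to maxProbes probes, always expanding the most promising leaf interval. */
    static void search(double aMin, double aMax, int kMin, int kMax,
                       DoubleToIntFunction clusterCount, int maxProbes) {
        PriorityQueue<AlphaInterval> leaves =
            new PriorityQueue<>((x, y) -> Double.compare(y.priority(), x.priority())); // max-heap
        leaves.add(new AlphaInterval(aMin, aMax, kMin, kMax, Double.MAX_VALUE));
        for (int probe = 0; probe < maxProbes && !leaves.isEmpty(); probe++) {
            AlphaInterval node = leaves.poll();
            double m = (node.left() + node.right()) / 2;
            int km = clusterCount.applyAsInt(m);        // run basic cut clustering for alpha = m
            int balance = Math.min(km - node.kLeft(), node.kRight() - km);
            if (km > node.kLeft()) {                    // add a child only if it spans new cluster counts
                leaves.add(new AlphaInterval(node.left(), m, node.kLeft(), km,
                        priority(node.left(), m, node.kLeft(), km, balance)));
            }
            if (node.kRight() > km) {
                leaves.add(new AlphaInterval(m, node.right(), km, node.kRight(),
                        priority(m, node.right(), km, node.kRight(), balance)));
            }
        }
    }
}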

5.4 Hierarchizing the Partitions

The way we merge the partitions produced by the basic cut clustering algorithm differs from the simple hierarchization method described and employed in [Fla2004], because we do not pass all the alphas from the highest down to the smallest as determined by the parametric max flow algorithm as flow breakpoints (see [Bab2006]); instead we run the basic cut clustering using the most-desired alpha as determined by our prioritization heuristic (section 5.3). Thus we must be able to merge the outcome of the basic cut clustering algorithm (a partition of the vertices into clusters) into the globally maintained clustering hierarchy for an arbitrary alpha. In this way we allow an arbitrary order of passing through the values of parameter alpha. The need for this ability is further motivated by the intent to compute in parallel (section 5.5). Different processors may compute a single-alpha clustering (one run of the basic cut clustering algorithm) at different speeds, not only due to differences in computational power, but also because the running time of the basic cut clustering algorithm is, in practice, proportional to the number of clusters in the resulting partition. As we discussed in section 4.2.2, this happens due to the community heuristic described in [Fla2004].

We solve the following problem: given the global clustering tree, and a result of basic cut clustering for an $\alpha$ which is not yet in the tree, transform the tree so that it reflects the result of clustering for this new $\alpha$. Formally:
- Let $T$ be the global clustering tree, in which leaf nodes denote SE artifacts and inner nodes denote clusters at different levels of the hierarchy, and the height of the tree is $height(T)$.
- Let $par(v)$ be the parent node of node $v$ in $T$.
- For an inner node $v$, let $child(v)$ be the set of its children nodes in $T$, and let $alpha(v)$ be the value of parameter alpha at which the basic cut clustering algorithm united all the descendants of node $v$ into a single cluster, thus introducing node $v$ into $T$.
- In order for each node to have a parent, introduce a fake root $fr$ in $T$ having $alpha(fr) = \alpha_{\min}$, where $\alpha_{\min}$ is defined as in section 5.3.1.
- Let $C(\alpha) = \{C_i(\alpha)\}$ be the partition of SE artifacts into clusters produced by the basic cut clustering algorithm using the novel value $\alpha$.

We remind that basic cut clustering is a partitional clustering algorithm and produces clusters that have the nesting property, i.e. for $\alpha_1 \le \alpha_2$ each cluster produced with $\alpha_2$ contains a set of SE artifacts which is a subset of some cluster produced with $\alpha_1$. Formally:

$$\alpha_1 \le \alpha_2 \;\Rightarrow\; \forall C_i(\alpha_2)\;\exists C_j(\alpha_1):\; C_i(\alpha_2) \subseteq C_j(\alpha_1)$$

Figure 5-9 Cluster nesting property

Note that the above formula forbids a case when there are two vertices $u, v \in C_i(\alpha_2)$ while $u \in C_j(\alpha_1)$ and $v \notin C_j(\alpha_1)$.

Our task is to integrate the clustering result $C(\alpha)$ into the global clustering tree $T$. Now we can define it formally. For each $C_i(\alpha) \in C(\alpha)$, find a vertex $p$ in $T$ such that there exists $C_j(alpha(p))$ having $C_i(\alpha)$ as its subset, i.e. $C_i(\alpha) \subseteq C_j(alpha(p))$, but none of the nodes in the subtree of $p$ satisfies this requirement. Taking into account the formula in Figure 5-9, this task amounts to finding a node $p$ such that:

$$alpha(p) \le \alpha < alpha(v), \qquad \forall v \in child(p)$$

Figure 5-10 The place to insert a new cluster

The latter can be done in $O(height(T))$ operations by simply scanning the nodes of $T$, starting from any node $u \in C_i(\alpha)$ and testing for a match to the criterion of Figure 5-10 above. By noticing that the nodes on the path from any node $u$ to the root of $T$ are sorted by alpha, i.e. $alpha(p) < alpha(v),\ \forall v \in child(p)$, we can apply a binary-search-like approach used in some algorithms for the Lowest Common Ancestor (section 4.6). Thus we can reduce this subtask to $O(\log(height(T)))$ operations. It is now obvious that the algorithmic complexity of merging a novel partition (clustering) into the global cluster tree is

$$O\bigl(|V| + |C(\alpha)| \cdot \log(height(T))\bigr)$$

where $|C(\alpha)|$ is the number of clusters produced by the basic cut clustering algorithm using this value of parameter $\alpha$.
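A minimal Java sketch of locating that insertion point is shown below (our own names such as ClusterTreeMerge; the actual re-parenting of subsumed children and the binary-search variant are left out). It relies only on the property that alphas increase along any root-to-leaf path of the tree.

/** Minimal sketch (assumed names): find the parent under which a cluster produced at
 *  parameter alpha has to be inserted, following the criterion of Figure 5-10. */
final class ClusterTreeMerge {

    static final class Node {
        Node parent;     // null only above the fake root
        double alpha;    // alpha introducing this cluster; the fake root has alpha_min, leaves +infinity
    }

    /** Walks from any leaf (SE artifact) of the new cluster towards the root and returns the
     *  deepest ancestor whose alpha does not exceed the new alpha; O(height(T)) per cluster. */
    static Node insertionParent(Node leafOfNewCluster, double alpha) {
        Node v = leafOfNewCluster;
        while (v.parent != null && v.parent.alpha > alpha) {
            v = v.parent;            // the parent is still finer (larger alpha) than the new cluster
        }
        return v.parent;             // the fake root guarantees this is non-null for alpha >= alpha_min
    }
}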

5.5 Distributed Computation

In the previous sections, 5.3 and 5.4, we have devised the ground for distributed computation of the hierarchical clustering tree. This is an improvement over the hierarchical clustering algorithm of [Fla2004] (also discussed in section 4.3.2), which is limited to sequential processing due to contraction while passing from the larger alphas down to the smaller ones. The idea is to run multiple basic cut clusterings in parallel, processing one per processor. We can notice that basic cut clusterings (partitions) can be computed independently for different alphas. After the result for some $\alpha$ has been computed, we must merge it into the clustering hierarchy. In order for the cluster tree to remain consistent, we need synchronization during this merge operation. Then the released processor can take the next most interesting alpha from the priority queue (section 5.3), and synchronization is required here again in order for the queue and the search tree to remain consistent. We can do distributed computation on as many processors as the number of leaves in the search tree. The number of leaves in the search tree grows fast, as processing a node usually adds two new leaf nodes for further search. This was implemented within our project, see section 6.

Note that we are splitting each alpha-interval in the search tree into 2 child intervals, each half of the parent. We could, however, split the parent interval into 3 or more child intervals, thus producing 3 or more child leaf nodes in the search tree. This makes sense when there are very many processors (e.g. a network of computers), so that we want the search tree to grow fast in order for as many processors as possible to get their tasks earlier.

5.6 Perfect Dependency Structures

A specific property of the data that arises in the domain of software engineering is the nearly-acyclic structure of dependencies among software engineering artifacts. In case this structure stays (locally) acyclic even after the conversion from a directed to an undirected graph (see section 5.1), the clustering algorithm receives on input a tree, which is a degenerate case for graph clustering. According to [Sch2007], "There should be at least one, preferably several paths connecting each pair of vertices within a cluster." But in a tree there is exactly one path between each pair of vertices. In the case of Flake-Tarjan clustering [Fla2004], a phenomenon undermining clustering quality was observed. We call it the alpha-threshold, and it consists in the following: often there is no way to get a certain number of clusters, say more or less close to $K$. Using the notation of section 5.3.1, we formalize this as:

$$k(\alpha_t - \epsilon) \ll K \ll k(\alpha_t + \epsilon), \qquad \forall \epsilon > 0$$

In other words, any alpha less than $\alpha_t$ yields a significantly smaller number of clusters than $K$, while any alpha greater than $\alpha_t$ yields a significantly larger number of clusters than $K$. Let $\alpha_l$ be the greatest alpha yielding a number of clusters smaller than $K$, and $\alpha_r$ be the smallest alpha yielding a number of clusters larger than $K$. Then we can rewrite the phenomenon as:

$$k_l = k(\alpha_l) \ll K \ll k(\alpha_r) = k_r, \qquad \alpha_r - \alpha_l \to 0$$

Figure 5-11 Alpha-threshold

It is now easy to notice that in the cluster tree (section 5.4) an alpha-threshold can imply a parent node, i.e. a cluster produced by the basic cut clustering at $\alpha_l$, having an excessive number of children, while every child corresponds to a cluster produced at $\alpha_r$. All the $k_r - k_l$ child clusters do not need to have the same parent, however, as demonstrated in a counterexample, see Figure 5-12 below:

Figure 5-12 Excessive clusters, but over different leaves (the clusters produced at $\alpha_r$ split between two different $\alpha_l$-parents)

In practice, some nodes in the cluster tree do indeed have an excessive number of children. In our source code domain experiments we observed that there is always one alpha-threshold entailing a single node with many children: for example, an alpha-threshold from $k_l = 433$ to $k_r = 3839$ clusters, while all 3406 child clusters appear under the same parent in the cluster tree. We observed a similar effect using any of the options for: normalization (sections 5.1.1 and 5.1.2), including the case of no normalization (just summing up the weights of the opposite directed arcs); granularity lifting (section 5.1.4); production of the input graph from the various dependencies between SE artifacts (section 5.2); and the software project upon analysis together with the set of libraries included (section 7.1). Thus we conclude that the phenomenon is intrinsic to the domain of software source code, to the best of our knowledge and empirical evidence. Apparently, this phenomenon hinders comprehension of nested software decompositions produced with the hierarchical Flake-Tarjan clustering algorithm. While further study of this phenomenon is a hard theoretical task (but see section 8.4), we make a reasonable assumption that the phenomenon occurs due to a specific property of the underlying data, namely an almost perfectly hierarchical structure of dependencies: such a structure is common practice, as software engineers do their best to achieve it.

5.6.1 Maximum Spanning Tree

Consider the issue of an excessive number of children (due to the alpha-threshold, section 5.6) occurring for some node in the cluster tree: its cluster has many nested clusters at the immediately next level, i.e. the decomposition is flat. A flat decomposition containing many items is not nearly as comprehensible as when we hierarchize the items so that a kind of divide-and-conquer approach becomes applicable for comprehension of the subsystem. Thus let us hierarchize the flat decomposition. Let $C_p$ be the parent cluster containing $n_p$ (an excessive number of) child clusters $C_{p,1}, C_{p,2}, \dots, C_{p,n_p}$, thus $C_p = C_{p,1} \cup C_{p,2} \cup \dots \cup C_{p,n_p}$. For a cluster $C$ let $V(C)$ be the set of vertices of the input graph (they are also the leaf nodes of the cluster tree, and they are also SE artifacts like Java classes) which constitute the cluster $C$. First of all, we create a graph $G_p = (V_p, E_p)$, where $V_p$ contains $n_p$ vertices, the $i$-th vertex stands for the $i$-th cluster $C_{p,i}$, and each edge in $E_p$ has weight $e_{i,j}$ equal to the aggregated weight over all the edges of the input (SE artifact relation) graph connecting a vertex from cluster $C_{p,i}$ to a vertex from cluster $C_{p,j}$.
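As an illustration of this aggregation, the following minimal Java sketch (our own names; clusterOf maps an input-graph vertex to the index of the child cluster containing it, or -1 if it lies outside C_p) builds the edge weights e_{i,j} of G_p from the edges of the input relation graph.

import java.util.HashMap;
import java.util.Map;

/** Minimal sketch (assumed names): aggregate input-graph edge weights into the
 *  inter-cluster graph G_p over the children of an overloaded parent cluster. */
final class InterClusterGraph {

    /** One undirected input-graph edge between SE artifacts u and v with weight w. */
    record Edge(int u, int v, double w) {}

    static Map<Long, Double> buildGp(Iterable<Edge> inputEdges, int[] clusterOf) {
        Map<Long, Double> e = new HashMap<>();                // key encodes the unordered pair {i, j}
        for (Edge edge : inputEdges) {
            int ci = clusterOf[edge.u()], cj = clusterOf[edge.v()];
            if (ci < 0 || cj < 0 || ci == cj) continue;       // keep only edges between different children
            long key = ((long) Math.min(ci, cj) << 32) | Math.max(ci, cj);
            e.merge(key, edge.w(), Double::sum);              // e_{i,j} accumulates all crossing weights
        }
        return e;
    }
}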

Second, we assume that there is an almost perfect hierarchy in $G_p$, and the rest is noise. Thus our task is to filter the signal from the noise. The hierarchy is the signal, and the cycles in $G_p$ are the noise. The hierarchy can be formalized as a subset $T_p \subseteq E_p$ that forms a tree, in which an edge from a parent to a child denotes the decomposition intended by the software engineers (e.g. reduction from a task to a subtask, or from general to specific, etc.). The noise is the rest of the edges, namely $E_p \setminus T_p$, and each of them is either a violation of the architecture (e.g. a hack written by a software engineer), or noise propagated from call graph construction (section 4.4), or a minor relation between SE artifacts. It is now obvious that a reasonable solution for our task of filtering signal from noise is the Maximum Spanning Tree, which frees a graph from cycles so that the sum of edge weights in the resulting tree is the maximum over all possible trees spanning graph $G_p$. At this point it is important to notice that graph $G_p$ is connected, as otherwise the clusters $C_{p,1}, C_{p,2}, \dots, C_{p,n_p}$ would not have become children of the same parent cluster $C_p$. Thus, there is always a tree spanning the whole graph $G_p$.

Usually, the problem of the minimum spanning tree appears in the literature. For the convenience of the reader, we show here how the problem of the maximum spanning tree can be reduced to the problem of the minimum spanning tree. In the graph $G_p$, let $B = 1 + \max_{i,j} e_{i,j}$. Then we replace each edge of weight $e_{i,j}$ with an edge of weight $B - e_{i,j}$, solve the problem of the minimum spanning tree with any of the efficient algorithms (section 4.6), and return the edge weights back, in both $T_p$ and $G_p$.
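A minimal Java sketch of this reduction follows (our own names such as MaxSpanningTree, solved here with Kruskal's algorithm and union-find; the thesis does not prescribe a particular minimum spanning tree algorithm).

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Minimal sketch (assumed names): maximum spanning tree of G_p via the B - e reduction. */
final class MaxSpanningTree {

    record Edge(int i, int j, double w) {}

    static List<Edge> maximumSpanningTree(int n, List<Edge> edges) {
        double b = 1 + edges.stream().mapToDouble(Edge::w).max().orElse(0);
        List<Edge> sorted = new ArrayList<>(edges);
        // A minimum spanning tree on the weights (B - e) equals a maximum spanning tree on e,
        // i.e. we simply take the heaviest original edges first.
        sorted.sort(Comparator.comparingDouble((Edge e) -> b - e.w()));
        int[] parent = new int[n];
        for (int v = 0; v < n; v++) parent[v] = v;
        List<Edge> tree = new ArrayList<>();
        for (Edge e : sorted) {
            int ri = find(parent, e.i()), rj = find(parent, e.j());
            if (ri != rj) {          // the edge connects two components: keep it in T_p
                parent[ri] = rj;
                tree.add(e);         // the original weight is "returned back" untouched
            }
        }
        return tree;                 // n - 1 edges, since G_p is connected
    }

    private static int find(int[] parent, int v) {
        while (parent[v] != v) { parent[v] = parent[parent[v]]; v = parent[v]; }
        return v;
    }
}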

At this point, we have filtered the signal from the noise in the graph induced by the excessive children, and constructed a tree (hierarchy) that spans them. However, an unexpected question arises: what should be selected as the root of the tree?

5.6.2 Root Selection Heuristic

Proper selection of the root of the maximum spanning tree is crucial for understanding the hierarchy. We illustrate this in Figure 5-13 below:

Figure 5-13 Node C1 vs. node C4 as the root (the same spanning tree over clusters C1 to C8, drawn once rooted at C1 and once rooted at C4)

Obviously, the same maximum spanning tree is illustrated both on the left and on the right of the above picture. However, the understanding about which SE artifact is more high-level/general or low-level/specific differs completely.

Of the considered options for selection of the root, two seemed reasonable and we experimented with both.

Option 1. The intention is to select the root in such a way that heavy cycles (the noise removed from $G_p$) appear as far as possible from the root, i.e. closer to the leaves. The algorithm for this option is given below:
(1) Sort all the edges of graph $G_p$ in the order of their weights, heavier first.
(2) Let $U$ be the set of disjoint subsets of the vertices of $G_p$.
(3) Passing the edges of $G_p$ from the heaviest to the lightest, do:
(4)   (1) If the edge is in the tree $T_p$, unite in $U$ its incident vertices.
(5)   (2) Else, find the path between its incident vertices through only the edges in $T_p$, and unite in $U$ all the vertices encountered on the path.
(6)   (3) If $U$ has become a single subset, stop the passing of edges.
(7) End.
(8) The last united vertex (or the weighted middle of the path, if there are multiple) becomes the root of $T_p$.
A disjoint-set data structure and the union-find algorithm were used for $U$, see section 4.6. The algorithmic complexity of this root selection is $O(|E_p| \log|E_p| + |V_p|\,k(|V_p|))$, where $k(n)$ is the inverse Ackermann function.

Option 2. The intention is to select the central node, while the selection is prioritized by the weights of the edges in the tree only (i.e. not in $G_p$). The algorithm in this case is a prioritized breadth-first search starting from the leaves. Initially, all the leaves are put into the priority queue. When a vertex is removed from the queue, we decrease the "to go" counter of its single unregarded adjacent vertex. If the "to go" counter becomes 1, this adjacent vertex is put into the queue with priority equal to the weight of the incident edge (the more weight, the earlier it will be removed). The "to go" counter of a vertex denotes the number of adjacent vertices which have not yet been regarded, and is initially equal to its degree. The last vertex pushed into the queue becomes the root of our maximum spanning tree $T_p$. The algorithmic complexity of this root selection option is $O(|V_p| \log|V_p|)$.

In practice, the second root selection option produces empirically much better hierarchies. The results we show throughout the paper are processed with this heuristic after hierarchical clustering. We can see (sections 7 and 10) that indeed, the problem of excessive child clusters has been alleviated, and SE artifacts are still grouped according to their unity of purpose.
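A minimal Java sketch of this second option, which we use in practice, is given below (our own names such as RootSelection; the spanning tree T_p is passed as an adjacency list of weighted edges).

import java.util.List;
import java.util.PriorityQueue;

/** Minimal sketch (assumed names): root selection option 2, a prioritized BFS from the leaves of T_p. */
final class RootSelection {

    record Adj(int to, double w) {}
    record QEntry(double w, int v) {}

    /** tree.get(v) = weighted neighbours of v in T_p (undirected). Returns the selected root. */
    static int selectRoot(List<List<Adj>> tree) {
        int n = tree.size();
        int[] toGo = new int[n];                       // adjacent vertices not yet regarded
        boolean[] queued = new boolean[n];
        PriorityQueue<QEntry> pq =
            new PriorityQueue<>((a, b) -> Double.compare(b.w(), a.w()));  // heavier edge leaves earlier
        int lastQueued = 0;
        for (int v = 0; v < n; v++) {
            toGo[v] = tree.get(v).size();
            if (toGo[v] <= 1) { pq.add(new QEntry(0, v)); queued[v] = true; lastQueued = v; }
        }
        while (!pq.isEmpty()) {
            int v = pq.poll().v();
            for (Adj a : tree.get(v)) {
                if (queued[a.to()]) continue;
                if (--toGo[a.to()] == 1) {             // a.to() became a leaf of the shrinking tree
                    pq.add(new QEntry(a.w(), a.to()));
                    queued[a.to()] = true;
                    lastQueued = a.to();
                }
            }
        }
        return lastQueued;                             // the last vertex pushed into the queue
    }
}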

6 Implementation and Specification

We implemented parallel computation of hierarchical Flake-Tarjan clustering within this project as multiple OS processes on our double-core processor, each working in a separate directory. Changing the prototype to work on multiple computers amounts to sharing the parent directory over the network and launching remote processes rather than local ones. The choice of programming language was driven by whether we needed speed of implementation or runtime speed of the program. Most of InSoAr, 14K lines of code, is implemented in Java: the source code is 434KB in size, and it was all written by one programmer, the author of the thesis, within the short time period of this project. Some state-of-the-art source code metrics over InSoAr are produced with STAN ([Stan2009]) and demonstrated in Figure 6-1 to the right. The bottleneck part, the minimum cut tree algorithm (section 4.2.1) using the community heuristic (section 4.2.2), is implemented in C, and uses Goldberg's implementation of the maximum flow algorithm (section 4.1.1) modified for our needs. We used all possible, including low-level, optimizations for the bottleneck part. A visualization of InSoAr at package level (not the class level at which InSoAr operates) with, to our knowledge, the best state-of-the-art structure analysis tool, STAN [Stan2009], is given in the appendix, and a zoomed-out version in Figure 6-2 below. The shadow is the sliding window visible at full size.

Figure 6-1 Metrics over InSoAr

Figure 6-2 InSoAr at package level visualized with STAN

6.1 Key Choices

Most of the key choices are theoretical, and are thus described under sections 3, 4 and 5. We do not provide a blow-by-blow description of InSoAr, due to the nature of the paper, the limit in pages and the size of the system. Below are the most important, though applied, aspects.

6.1.1 Reducing Real- to Integer-Weighted Flow Graph

After normalization (section 5.1) we get an undirected graph with real-valued edge weights. The Flake-Tarjan clustering algorithm (section 4.3) also uses the real-valued parameter alpha in order to prepare a minimum cut tree task (section 4.2). The algorithm solving the min cut tree problem relies on computations of maximum flow in a graph (section 4.1). Though there are algorithms solving the maximum flow problem for real-valued edge capacities, they are much slower. Both the fastest known max flow algorithm (we use it for theoretical bounds on the worst-case complexity, section 4.1) and the best known implementation of another max flow algorithm (section 4.1.1), which we used in practice, require integer arc or edge capacities. Thus we must convert from a real- to an integer-weighted graph. For each vertex in the graph we calculate the sum of the weights of its adjacent edges. Then we adjust the weights proportionally, so that they take the largest possible integer values, taking into account the limitations of 32-bit and 64-bit integers. The latter two are used as edge capacities and excess flow in Goldberg's implementation of push-relabel max flow (section 4.1.1). Our experiments have shown that the max flow solution never became suboptimal due to this conversion.
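A minimal sketch of this scaling might look as follows (our own code under assumed names, not the InSoAr source); it chooses a common factor from the largest per-vertex sum of adjacent weights, so that capacities and excess flow stay within a 64-bit signed integer with some headroom.

/** Minimal sketch (assumed names): proportional scaling of real edge weights to integer capacities. */
final class IntegerCapacityScaler {

    record Edge(int u, int v, double w) {}

    /** Returns integer capacities aligned with the given edge array. */
    static long[] scale(int vertexCount, Edge[] edges) {
        // Largest sum of adjacent weights over all vertices: the excess flow must not overflow.
        double[] adjacentSum = new double[vertexCount];
        for (Edge e : edges) {
            adjacentSum[e.u()] += e.w();
            adjacentSum[e.v()] += e.w();
        }
        double maxSum = 0;
        for (double s : adjacentSum) maxSum = Math.max(maxSum, s);
        long[] capacities = new long[edges.length];
        if (maxSum == 0) return capacities;
        // Leave a few bits of headroom below Long.MAX_VALUE for the sums computed by the algorithm.
        double factor = (Long.MAX_VALUE >> 8) / maxSum;
        for (int i = 0; i < edges.length; i++) {
            capacities[i] = Math.max(1, Math.round(edges[i].w() * factor));
        }
        return capacities;
    }
}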

6.2 Results Presentation

The result of hierarchical clustering is a tree (more precisely, a forest, when there are multiple disjoint components in the software artifact dependency graph), where:
- Leaves are classes of the software upon analysis and of its libraries.
- Inner nodes are clusters at different levels.
- There is at least one root per disjoint component.
- Multiple roots per disjoint component appear in case the selected lower bound of alpha was not low enough to unite all the nodes of that component into a single cluster.
As it is not trivial to present the results in a comprehensible form, some aspects of the presentation approaches used are described further. Our main representation of the results is in text format. Not going into the details of each value, we see 3 principal ways to represent a tree:
1 Indented by Depth
2 Indented by Height
3 Bracketed
The first is more convenient to view, as nested clusters (or SE artifacts, if leaves) appear under their parents. An example of this presentation is in Figure 7-3. However, this presentation takes a lot of space on the hard drive. The second presentation has the advantage that SE artifacts (the nodes that have labels) always appear at the beginning of a line, as they are leaves and thus have height 0. However, effective comprehension of this presentation needs some training, see the appendix. This presentation also takes less space, as nodes are often at large depth, but rarely at large height. The third, bracketed presentation aims to show many more labels (leaf nodes) in a limited space. Inner nodes do not take a line each, but are grouped into one line and represented as brackets. An example is in the appendix.

6.2.1 File formats

A number of file formats is used at the various stages of the software engineering artifact extraction and clustering pipeline. It does not make sense to describe them in detail at this stage, thus we give a short list of the formats:

Text (identifiers like "call graph" etc. appear for historical reasons; indeed, the files contain heterogeneous relations):
- Literal graph of relations (user-friendly form): litcallgraph.txt
- Computer-friendly form of the graph of relations: callgraph.txt, ccgclasses.txt, ccgmembership.txt, ccgmethods.txt
- Cluster tree: cthier.txt, perftree.txt, treexxxx.txt, hitperfclu.txt (height-indented), ctbracketed.txt (bracketed presentation)
- Inputs for Cfinder (list of arcs)
- Inputs for H3Viewer: h3reduced2.vlist, h3sphere.vlist
- Inputs/outputs for a process performing basic cut clustering: passorder.txt, intgraph.txt (DIMACS format), ver2node.txt, ftclusters.txt, ftcconout.txt

XML:
- Cluster Tree XML
- Per-package statistics in XML
- Input for TreeViz: perftv*.xml

For example, below is a short description of the cluster tree XML format. Several XML representations were considered; e.g., an XML element corresponding to a cluster tree node could contain properties like alpha and heads as nested XML elements, along with an XML element "children" which would list all the child nodes and their subtrees. However, we attempted to choose a representation that is easier to view by a human, and this should be the one that contains only child nodes as the child XML elements of a node, i.e. a homogeneous one. The root node looks like below:

<clustertree vertexcount="7474" nodecount="7721" rootcount="6476" disjointcount="2">

Below is an example of an inner node (cluster); "ab" is the alpha at which the cluster was produced, "dcomp" is the number of its disjoint component:

<node id="7477" childcount="2" ab= heads="5287, 7710" dcomp="1">

Below is an example of a leaf node, i.e. an SE artifact (a Java class in this case):

<node id="4578" label="net.sf.freecol.client.control.InGameInputHandler" dcomp="1" />

6.3 Visualization

Pure XML or HTML formats, and the GraphViz and FreeMind tools were considered. However, we chose the following visualization tools because they perform well on large trees:
- H3Viewer: this tool can draw large trees in 3D hyperbolic space
- TreeViz: this tool supports 7 different presentations of large trees

6.4 Processing Pipeline

The runtime of the analyzer is divided into stages, where the outputs of a preceding stage are the inputs of a succeeding stage. Outputs are flushed into files. This allows reusing the results of a stage without re-running it, as well as substituting different implementations of a stage, e.g. Java or C#, static or dynamic call graph extractors. Below is a diagram of the present stages:

Figure 6-3 InSoAr Processing Pipeline. Processing stages: Build; SE Relations Extraction; SE Relations Compactor; Granularity Selector; Undirected Real Normalizer; Hierarchizer; Partitional Clusterizer; Perfectizer; XML Converter; Evaluator; Representer; Post-inference. Inputs and results: source code; built binaries with libraries resolved; the graph of relations in user-friendly form (a directory with a list of classes, a list of members, membership relations, and an indexed method-wise graph of relations); method-wise or class-wise graph; clustering task definition; cluster partition; incremental cluster hierarchy / cluster tree; reversed architecture in XML; performance diagrams; various visualizations of the tree; relations by architectural fitness; package ubiquity metric.

In Figure 6-3, processing stages are drawn as rectangles, while inputs and results are drawn as parallelograms. The pipeline takes source code on input. The source code should be built, in order to resolve library dependencies. Then the call graph and other relations between software engineering artifacts must be extracted. In the current implementation, using Soot to process Java programs, we produce the graph of relations in user-friendly form. Java classes are on the outer level, inside are methods and fields, and for each method there is a list of relations with other member- or class-level SE artifacts. If something can produce such a graph of relations from another programming language, e.g. C# or C++, we do not depend on the programming language from this point on. A graph of relations produced by means of dynamic analysis is also an option here.

Then we run a stage called SE Relations Compactor, which converts the graph of relations into a computer-friendly form. Though it is also a text format, it occupies substantially less space and is easier to read into memory. SE Relations Compactor also performs some reindexing, so that the further stages are all released from these operations after input. The implementation of the further stages follows the theory we gave in section 5. After loading the graph of relations, a stage called Granularity Selector allows choosing whether we are going to clusterize at class or at method level, and can be used to lift the granularity prior to clustering. Its output is a directed graph of relations between SE artifacts. Undirected Real Normalizer converts the directed graph to an undirected one, normalizes it and lifts the granularity to class level, if necessary. It holds the graph from which the initial Cluster Tree can be built. The initial cluster tree contains all the SE artifacts as leaf nodes, which are children of one fake root, even if they belong to different disjoint components. The Cluster Tree is updated incrementally by the Hierarchizer. The latter maintains the alpha search tree, prepares a new task for a basic cut clustering processor, receives the result and merges it into the global cluster tree. Partitional Clusterizer is a separate, possibly remote, process that performs a single flat clustering using the Flake-Tarjan algorithm, taking its input from a file and producing its output into a file. There is always an option to use named pipes instead of files here, so that a slow hard drive is not needed. The current cluster tree is flushed to disk every certain number of minutes; this is the Incremental Cluster Hierarchy. We can force the pipeline to stop by creating the file shutdown.sig; then the latest hierarchy is also saved to disk. Perfectizer addresses the issue, and computes the solution, that we discussed in section 5.6. It takes the result of hierarchical clustering on input, and produces the perfected result in the same format. This step is, of course, optional. The rest of the pipeline after Perfectizer addresses the various presentations, evaluation and post-inference.

7 Evaluation

The main premise for the high quality of a produced clustering hierarchy is the theoretical grounding of the clustering algorithm we used: the quality of the produced clusters is bounded by strong minimum cut and expansion criteria [Fla2004]. We consider cut size a rational criterion in the domain of software engineering because the sum of edge weights reflects the amount of interaction (relations) between SE artifacts (e.g. Java classes), which a software engineer needs to study in the source code in order to understand the coupling between either two SE artifacts, or two groups (communities, clusters) of SE artifacts. This matches the main idea behind the max-flow/min-cut clustering technique, according to [Fla2004]: to create clusters that have small inter-cluster cuts (i.e. between clusters) and relatively large intra-cluster cuts (i.e. within clusters). This guarantees strong connectedness within the clusters and is also a strong criterion for a good clustering in general. Assuming from the above that the clustering algorithm performs well, we should study whether this quality has not been lost due to the adaptations we used to make the clustering work in the domain of software source code, see section 5. These adaptations also include the extraction of the call graph and other relations between SE artifacts, which is data specific to the domain. We stress that not only the quality of the clustering method is important; it is also important that its input data is adequate and of high quality, see section 4.4. Another theoretical premise for the high quality of the reconstructed architecture is that we have incorporated a solution (section 5.1, which follows the state-of-the-art best practices discussed in section 4.5) to the main problem for clustering in the source code domain according to the literature [Pir2009]: utility artifacts.

7.1 Experiments

In the largest of our experiments we processed software containing 2.07M graph edges (relations) over a heterogeneous set of vertices (SE artifacts) comprising 11.2K (11 199) Java classes and 163K (163 183) class-member-level artifacts (methods and fields). This is a real-world medium-size project provided by KPMG for our experiments during the internship. The client code contains about 500 classes, thus the remaining 10.7K classes are in libraries, which include the Java system libraries, the Spring framework, Hibernate, Apache Commons, JBPM, Mule, Jaxen, Log4j, Dom4j, and others. Together with its libraries the project becomes 22 times bigger and falls into the category of large software. Note that our conception of a medium-sized project differs significantly from that claimed in other scientific works. In [Pate2009] they analyze (mostly, clusterize) a project containing 147 classes in 10 packages. In practical software engineering this project must be classified as small or even above-tiny. In contrast, just the client code of our medium-sized projects contains 500 to 1000 Java classes. The total number of Java classes we clustered hierarchically is 11 199 in the largest experiment, and about 7.5K in the usual experiments. Furthermore, we suspect that the input data of related works analyzing only a part of the program (e.g. only client code) was far less precise than ours, because advanced call graph extraction techniques (VTA, Spark in Soot) require analysis of the whole program with its libraries, and even simulate the native calls of the Java Virtual Machine [Lho2003].

7.1.1 Analyzed Software and Dimensions

FreeCol is an open source game similar to Civilization or Colonization. Its source code is in Java and available online. The project is medium-size, containing about 1000 client-code classes.
Together with its libraries it becomes about 7.5K classes. Thus we used it in our experiments. The extracted graph of relations contains 1M edges for this project.

The dem0 project is a web application that also provides web services, uses the Spring framework, and works with a database through Hibernate. It is not open source, thus we only show the parts for which we received permission from KPMG. This project contains about 500 classes of client code and many classes in libraries. In order for Soot to fit into the 2GB memory limit during VTA (variable type analysis) call graph construction, we had to limit the number of library classes to 6.5K. In the largest experiment we used RTA (rapid type analysis) for call graph construction, thus it was possible to process all 10.7K library classes with Soot. In the former case, the graph of relations contained 0.5M edges, while in the latter there were 2M edges.

1 InSoAr processing
  1.1 Clustering hierarchy we demonstrate in this paper: 72 hours, 0.6GB RAM
  1.2 Acceptable results (differences are visible empirically; conclusions need statistical studies): 1 to 2 hours
  1.3 The largest experiment (11.2K classes, 2M relations): 120 hours, 1.3GB RAM
2 Call graph construction (and other relations, with Soot)
  2.1 VTA in the usual experiments, 7.5K classes: 0.5 hour, 2GB RAM
  2.2 RTA in the largest experiment, 11.2K classes (VTA runs out of memory in this case): 2 hours, 2GB RAM
3 Basic cut clustering (one alpha, in a separate process)
  3.1 In the usual experiments: 4.5 minutes
  3.2 In the largest experiment: 20 minutes
  3.3 Memory requirement: no more than 35MB

Figure 7-1 Actual time and space requirements

The actual significant time and space requirements are given in Figure 7-1 above. Note that we are using a far from optimal implementation of the prototype (Java); e.g., the basic cut clustering implemented in C/C++ requires only 35MB in the largest experiment. We use a double-core machine, 1.7GHz per core, with 2GB RAM and a 2MB L2 cache. When 2 basic cut clusterizers are run in parallel, the duration is 6.5 minutes instead of the expected 4.5 minutes, due to cache misses (the cache memory is shared between the 2 cores).

7.2 Interpretation of the Results

Altogether, our nested software decomposition (hierarchical tree) shows SE artifacts from general (at the top, closer to the root) to specific (at the bottom). SE artifacts are grouped according to their unity of purpose, so that a group of artifacts serving a similar purpose, or collaborating for a composite purpose ("acting together"), constitutes a subtree. More precisely, the hierarchical tree reflects the strength of coupling. There is some noise in the input data (the extracted call graph) and some uncertainty on how to combine (coefficients, etc.) the different kinds of relations between SE artifacts prior to clustering. However, the clustering algorithm further decomposes the vertices (SE artifacts) into hierarchical communities strictly, using a bound between inter-cluster and intra-cluster connection strengths (SE artifact coupling). One interpretation of the latter paragraph is that, in a second approximation:
- At the top (i.e. near the root) of the decomposition appear artifacts which are less coupled to the rest of the program, or more general (general-purpose).
- At the bottom (far from the root) appear artifacts which are closer to the core of the program (more coupled with the rest of the program), or more specific (complex).

For a cluster node (which is a non-leaf, inner node), not only the depth (distance from the root), but also the height (distance to the remotest leaf in its subtree) should be considered.

7.2.1 Architectural Insights

In the subsequent sections we provide an account of particular facts which become apparent to a sufficiently experienced software engineer by browsing the (various presentations of) the results produced with our prototype. Mining these facts with state-of-the-art tools is either not possible, or requires immense effort, e.g. browsing and interpreting manually many lines of source code. In general, we call the inferred facts architectural insights, as they help the viewer to, at least, get a first impression of the source code, and mostly to comprehend the decomposition of the software system into subsystems. Taking 10M lines of source code on input, InSoAr produces only about 10K cluster tree nodes on output. The gain in comprehension is 1000 times, which is, roughly, calculated from the number of items necessary to scan in order to get a global understanding of the system, see section 2.1. Having a nested software decomposition provided by InSoAr, a software engineer can effectively apply a divide-and-conquer approach to software comprehension (see the appendix), or detect cross-package subsystems implementing complicated logic (see the appendix). One can also observe some metrics calculated after architecture reconstruction; we give some examples in the appendix. These metrics can give an idea of how ubiquitous a package is (i.e. how broadly across the architecture the classes of this package are spread), and of how well the couplings between SE artifacts fit the implicit architecture. Often, insights not only about the architecture, but also about the implementation can be captured; we give such examples in the appendix. Certainly, the list cannot be exhaustive, as these are only the example architectural insights we could think of and describe within the limited time and pages. We invite the reader to browse the hierarchy on his/her own by downloading the clustering hierarchy of a demo project and the H3 sphere visualizer from the internet. Below are the links:

Data files. Leaf nodes of the trees correspond to Java classes of libraries and client code. Inner nodes correspond to clusters at different levels in the hierarchy. Client source code (the application, i.e. non-libraries) is in package com.dem0.*
- In XML:
- In H3Viewer format:
- In TreeViz format:
H3Viewer: please download it from the website of its developer:
TreeViz website:
Note: we have tuned TreeViz within our project, so that it shows client-code artifacts in green (and the rest in orange), and lists the descendants of a subtree upon mouse hover, when there are no more than 100 of them. Download the archive and unpack the two files into the same directory before running.
Tuned TreeViz:

7.2.2 Class purpose from library neighbors

Library neighbors can tell an experienced software engineer a lot about the purpose of client code classes, see Figure 7-2 below. This follows directly from the criteria for clustering: dense interaction (many calls, field accesses, type usages) between SE classes within a cluster, and relatively loose interaction between classes from different clusters. The crucial advantage that software engineers acquire by having the software structure inferred with InSoAr is the following. In order to figure out the purpose of library classes, as well as other facts like requirements, constraints and limitations, one can usually read the documentation. Application classes, on the other hand, are not well documented (section 2), thus software engineers would have to scan and interpret the source code of the class manually. However, having our clustering hierarchy, a software engineer can simply read the documentation of the library classes which are coupled with the application classes upon analysis. It is obvious what is meant by "purpose". Below are examples of other facts that can be read from the documentation of a library class (instead of the source code of an application class):
- Requirement: an open database connection
- Limitation: usage of 128-bit encryption, which is not strong enough for certain purposes
Based on these facts, violations of the constraints can be identified more easily.

Figure 7-2 Library classes are in pink, application classes are in orange, clusters are in light blue

In the above figure, we see a subtree with classes serving the same purpose, as can be understood from their names. One fact that we easily infer is that the application's subsystem for time and scheduling relies on the JodaTime library rather than on the inferior Java system library for time.

Obvious from class name

One can argue that the clustering hierarchy does not bring any value regarding the purpose of an SE class when the class appears near similarly named, sometimes library, classes, because the class purpose was already obvious from the class name, as in Figure 7-3 below. In this figure we see that the application class com.kpmg.kpo.web.security.EmployeeUserDetailsService and others appear coupled (descendants of cluster #8604) with the library classes org.springframework.security.userdetails.UserDetailsService and org.springframework.security.userdetails.UserDetails. However, the point is:
- The fact that these similarly named classes got into the same cluster tells us about good architectural style: classes with similar names serve a similar purpose.
- The purpose of the library classes is known from the documentation.
- The application class EmployeeUserDetailsService is most coupled with the library classes which are supposed to serve this purpose, and not with something else, which would be an architecture violation.
- Good quality of our clustering hierarchy is confirmed by such an occurrence!

Figure 7-3 Class EmployeeUserDetailsService and neighbors

In addition to the above points, nearby we also see classes with very different names and from very different packages, e.g.:
- GrantedAuthority from the library package org.springframework.security,
- AssignRolesCommand from com.kpmg.kpo.web.binding,
- ApplicationManager from com.kpmg.kpo.domain,
- anonymous nested classes of EmployeeRole from com.kpmg.kpo.domain.
As a result, a human software engineer is provided with an insight about the subsystem, which manages user details, where the users are most likely employees, and there is a dedicated service for this, which is based on the standard service of the Spring framework addressing this purpose.

When a user becomes authorized by the subsystem, a corresponding security token is issued (class GrantedAuthority), which is a string (look at the nested class StringAuthority under EmployeeUserDetails). When authorization fails (perhaps only for the reason that there is no such user/employee), UsernameNotFoundException is thrown. The latter is a standard exception from the Spring framework, thus it is likely that the client code (application classes) does not handle this exception at all, or not in full, but rather relies on the standard facilities of the Spring framework; otherwise a more specific exception inheriting UsernameNotFoundException would be implemented in the application and would appear nearby in the clustering hierarchy. The set of business entities which an employee can access is determined through the assignment of roles, application class EmployeeRole, and roles are assigned using com.kpmg.kpo.web.binding.AssignRolesCommand, which is likely to occur when a privileged user takes the corresponding action from the web UI. We wrote the above paragraph without looking at a single line of source code of any of the mentioned classes, and moreover, having almost no experience with the Spring framework, just a principal understanding of programming concepts.

Hardly obvious from class name

In contrast to the previous example, it is not that easy to realize the purpose of a class called com.kpmg.kpo.generated.jaxws.crm.CrmSOAP. CRM is likely to stand for Customer Relations Management, and SOAP is the well-known (otherwise, it is as easy as a search in Google) Simple Object Access Protocol for exchanging structured information for web services. These two potential concepts are pretty distant from one another. Its situation in the clustering hierarchy makes things much clearer; namely, the following facts become apparent to a human software engineer:
- CrmSOAP is much more about SOAP than CRM, because it is clustered together with SOAP-related library classes.
- If the software engineer was not familiar with SOAP, after seeing the clustering hierarchy he/she can realize that XML underlies SOAP, because the neighboring library classes are in the javax.xml package.
See Figure 7-4 below and a 3D view on the same part of the cluster tree in the appendix.

Figure 7-4 CrmSOAP and neighbors

Not obvious from class name

In this example a class is called AuditEntryDTO, which says nothing about its purpose, unless we know that the software project is heavily related to the auditing business and look up DTO in Wikipedia. After the above two steps we still do not know why it is Entry, i.e. an entry of what? However, a glance at the clustering hierarchy makes things clear, perhaps even replacing the need for the two aforementioned steps, see Figure 7-5 below. Apparently, there is some logging (classes containing Trail, which is a synonym for logging here, and ConsoleAuditLogImpl). That is why it is Entry: it is an entry of some log (namely, the audit log). And the logging is implemented as a service, transferring AuditEntryDTO objects between software application subsystems.

Figure 7-5 Class AuditEntryDTO

Class name seems to contradict the purpose

While browsing through the clustering hierarchy we encountered an example where the class name seems to contradict the purpose. Though a class is named com.kpmg.kpo.action.GenericHandler, it appears in the cluster that addresses Java regular expressions and expression evaluation in JBPM. This is a strong claim, involving also doubts about clustering quality, thus we looked into the source code of the class (provided in section 10.3), which clearly confirms the result of the clustering.

Let us look at this case closer in terms of software quality. The fact that the class name contradicts the purpose does not just mean that the class is named incorrectly, which could seem a minor defect. Rather, it means that the designers of the architecture saved efforts (i.e. took a reduce-quality action, as we discussed in section 2.6) at some point during the software development cycle. What can be the reason for not naming a class properly? Most likely, it happened because the purpose was not identified properly, and identification of purpose constitutes a significant amount of design effort. We identify two consequences of this fact, for programmers and for business:
- Programmers (developers working directly with the source code) get a wrong idea about the purpose of the class when considering its reuse, e.g. through inheritance, modification (adding/removing/changing methods) or simply usage from another place in the software.
- Companies that buy, or take in for outsourcing services, source code containing such architectural violations get less value than they may think they get, as at some point the earlier saved design efforts will pay off with unexpected expenses.

Figure 7-6 GenericHandler contradicting the purpose

We studied the software under analysis further in order to provide evidence for the claim that this architectural violation propagates into the rest of the source code if not fixed timely. Indeed, class GenericHandler is inherited by 4 loosely related classes, namely com.kpmg.kpo.action.{AbstractDocumentHandler, PrintOutAction, SendFile and SendNotification}, while the other action classes in the package do not inherit it. In general, the package com.kpmg.kpo.action is suspected to have low quality.

7.2.3 Classes that act together

In our view, the most valuable inference InSoAr makes is the detection of sets of classes that act together. This follows directly from the property of the clustering and the data we analyze: coupling of SE classes within a cluster is higher than coupling between clusters. To our knowledge, there is no means to identify such groups efficiently (in terms of human effort) with any existing static or dynamic code analysis tools. As was discussed in section 2.2, state of the art tools either allow the user to select a set of SE artifacts for which the user wants to see their couplings, or to drill down from packages to subpackages, classes and methods ([Stan2009], [Rou2007], [Pin2008]). In contrast, we do this globally, for all the SE classes at once. Less coupled classes get into a group only after more coupled classes have been placed into that group, where the former corresponds to higher levels of the clustering hierarchy (closer to the root), and the latter to lower levels (closer to the leaves). Below is an example, a piece of XML output, that demonstrates the claim.

<node id="8837" childCount="2" label="" heads="7179, 8897">
  <node id="8897" childCount="2" label="" heads="7179">
    <node id="7179" childCount="2" label="" heads="241">
      <node id="9104" childCount="2" label="" heads="241">
        <node id="241" label="com.kpmg.kpo.domain.TaskInstance"/>
        <node id="654" label="com.kpmg.kpo.service.impl.AbstractTaskInstanceService"/>
      </node>
      <node id="790" label="com.kpmg.kpo.web.view.TaskInstanceView"/>
    </node>
    <node id="550" label="com.kpmg.kpo.bpm.AssignToEmployee"/>
  </node>
  <node id="8898" childCount="2" label="" heads="219">
    <node id="219" label="com.kpmg.kpo.domain.PeerStatusType"/>
    <node id="712" label="com.kpmg.kpo.usertypes.PeerStatusTypeUserType"/>
  </node>
</node>

Figure 7-7 Classes that act together (XML)

In the figure above we see that 6 classes from 5 different packages under com.kpmg.kpo are indeed a single subsystem according to the implicit architecture, while the package structure can be viewed as a kind of explicit architecture. Modern integrated development environments (IDEs), e.g. Eclipse or Microsoft Visual Studio, can easily show all the classes/files in a package/namespace, telling a software engineer about the explicit architecture. However, there is no way in these leading IDEs to show what we have shown in Figure 7-7. At present, software engineers can only get such diagrams from the explicit software architecture, e.g. a subsystem or coupling documented in a Software Design Document.

As this is the central inference in which InSoAr specializes, we provide further evidence for the quality of the hierarchical clustering and the meaningfulness of the results as a number of images showing different parts of the system, see the appendix and across the paper. Though it is hard to prove that this property also holds at the global level, due to the large visualizations required, we claim that this result is not local and not random, i.e. the parts which are not shown in our pictures look fine and reflect the implicit architecture too. We kindly ask an unconvinced reader to download the samples from the internet (section 7.2) and try them himself/herself.
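The XML format in Figure 7-7 is simple enough that a software engineer can query it directly. Below is a minimal sketch in Java that prints all SE classes grouped under a given cluster node; it assumes only the element and attribute names visible in the figure (node, id, label), while the file name clusterTree.xml and the class and method names of the sketch are hypothetical.

import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Minimal sketch: list the SE classes grouped under one cluster node. */
public class ClusterTreeQuery {

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("clusterTree.xml");          // hypothetical output file name
        Element root = doc.getDocumentElement();
        Element cluster = findById(root, "8837");   // cluster id taken from Figure 7-7
        if (cluster != null) {
            for (String label : collectLeafLabels(cluster, new ArrayList<>())) {
                System.out.println(label);
            }
        }
    }

    /** Depth-first search for the <node> element with the given id attribute. */
    private static Element findById(Element node, String id) {
        if (id.equals(node.getAttribute("id"))) {
            return node;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                Element found = findById((Element) children.item(i), id);
                if (found != null) {
                    return found;
                }
            }
        }
        return null;
    }

    /** Collect the label attribute of every leaf node in the subtree. */
    private static List<String> collectLeafLabels(Element node, List<String> out) {
        NodeList children = node.getChildNodes();
        boolean hasChildElement = false;
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                hasChildElement = true;
                collectLeafLabels((Element) children.item(i), out);
            }
        }
        if (!hasChildElement && node.hasAttribute("label")) {
            out.add(node.getAttribute("label"));
        }
        return out;
    }
}

Run on the fragment of Figure 7-7, such a query would print the 6 classes of the subsystem, from TaskInstance to PeerStatusTypeUserType.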

Coupled classes are in different packages

Detection of class coupling across packages/namespaces is important for the reasons discussed throughout the paper (obtaining the implicit architecture without scanning millions of lines of source code manually); we just give a few examples below as evidence that InSoAr does group classes together according to their unity of purpose, which can be validated from the names of the classes.

Coupled classes are in the same package

As software engineers often put the most coupled classes in the same package/namespace, in addition to naming them similarly, the fact that classes from the same package appear nearby in the clustering hierarchy can serve as validation of the clustering results. We can observe a match of the explicit and implicit architecture in this case. A useful fact that becomes apparent after looking at a cluster of classes from the same package, differentiation of coupling, is discussed in section 7.2.5.

In Figure 7-8 below we can see a number of classes that act together. Class PortfolioManagerComponent is from package com.kpmg.esb.mule.component, while the rest of the classes are from package com.kpmg.service.portfoliomanager. We can infer that, most likely, PortfolioManagerComponent is a high-level class that operates the simpler classes in its cluster. Classes PortfolioManagerComponent, PortfolioManagerType and ServiceType (cluster #8964) are the most coupled among the group displayed in the figure. The second highly coupled group consists of classes RetrieveFault and RetrieveFault_Exception, cluster #9408. These two groups, together with two more classes, PortfolioManagersType and RetrieveResult, form a larger group, #9409. Only afterwards do the rest of the classes displayed in the picture (except RetrievePortType) attach to this group, and thus to all the classes in it. This happens in cluster #7445, which is at a higher level of the clustering hierarchy than clusters #8964, #9408 or #9409, and the interaction density (coupling) is lower among the classes of group #7445.

Figure 7-8 Except PortfolioManagerComponent, the coupled classes are from the same package

7.2.4 Suspicious overuse of a generic artifact

In this example we see a class called GenericComponentException which is, however, coupled with class DocumentComponent from a different package, see Figure 7-9. We mention the fact that the classes are from different packages mainly for the convenience of the reader, in order not to forget that state of the art tools cannot help here. However, the observation that helps to discover an issue is that the class representing the exception is called Generic, while in the cluster tree we can see that it is coupled with, and thus serves error handling for, a specific class, DocumentComponent. We can guess (without looking at the source code, thus saving efforts 1000 times) that this happened either:
- because the purpose of GenericComponentException was not well identified while designing the architecture, and it should rather be called DocumentComponentException (or something even more specific; a study of the source code is needed), or
- because, even though the purpose of GenericComponentException was well identified and at some places in the source code it indeed serves as a generic artifact (e.g. as the base for inheritance to more specific exceptions), during the evolution of the software this generic artifact became too heavily used in class DocumentComponent. In the second case, a suggested improvement of the architecture is to create another exception class specific to DocumentComponent, e.g. DocumentComponentException, and refactor the source code of DocumentComponent to make it use this dedicated specific artifact (a sketch is given below).
With the two points above we have exhausted the possible cases, i.e. there is no good reason to call an exception class GenericComponentException while it mostly serves (and is mostly coupled with) a class called DocumentComponent. Thus the source code is not optimal, while detecting such a defect in novel source code (i.e. when there is no programmer who knows about it) is not possible with state of the art tools, except by scanning all the source code line by line. The benefit of clustering is obvious: 10M lines of source code vs. 10K nodes in the cluster tree. At any rate, the detected architecture violation indicates saved efforts during the design of the software, and will result in unexpected expenses later, by analogy to what we discussed in the section on class names contradicting the purpose.
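To make the suggested remedy concrete, the dedicated exception class could look like the sketch below. This is only an illustration under our assumptions: we did not inspect the constructors or the package of GenericComponentException, and in the second case above the new class would extend GenericComponentException rather than Exception.

// Hypothetical refactoring sketch: a dedicated exception for DocumentComponent,
// so that GenericComponentException is reserved for truly generic error handling.
public class DocumentComponentException extends Exception {

    public DocumentComponentException(String message) {
        super(message);
    }

    public DocumentComponentException(String message, Throwable cause) {
        super(message, cause);
    }
}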

Figure 7-9 GenericComponentException serving mostly DocumentComponent

7.2.5 Differentiation of coupling within a package

Figure 7-10 Many classes in the same package in the IDE

Often there are too many classes in one package, which hinders comprehension for a software engineer looking at the package/namespace explorer in an IDE (Integrated Development Environment). In Figure 7-10 above we demonstrate such an example: how it looks in a popular Java IDE (Eclipse) and how it looks after computing the clustering hierarchy with our approach. A package containing even more

classes with very different purposes (in contrast to what we observe here) is demonstrated in section 7.2.6. In principle, InSoAr differentiated the coupling within that package too, but there is a separate fact to be discussed there, because those classes appeared scattered across the system.

Describing it more extensively, Figure 7-10 shows two representations of a set of classes, while not all the classes from the left side have to be present on the right side: the rest can appear somewhere else in the cluster tree. An alphabetical list of classes in a package is on the left, and this is what a software engineer sees with state of the art tools (IDEs). A subtree containing many of the classes from that list is displayed on the right, and this is what we can see in the cluster tree produced by InSoAr. The task here is that a user needs to infer the purpose of these classes, or how they are related to each other, including a global understanding, i.e. not just pairwise relations. Our argument is that this is much (in this context, our "much" usually means 1000 times across the paper) easier to do having the cluster tree. When the class names in the package are not very meaningful, accomplishing this task for a human expert amounts to scanning the source code of the classes, which is usually 1000 lines per class. Even after scanning the source code, there is a comprehensional difficulty in taking into account thousands of the observed facts at once (for humans). To alleviate this, human experts need some diagrams to be drawn, which is a mechanical difficulty. In the remaining case, when the class names are very meaningful, the user can pick out the groups from the list, which is debatably O(N log N) operations in the mind of the user (if the user follows sorting based on pairwise similarity comparison and disjoint-set unification algorithms), where N is the number of classes in the package: the classes are sorted alphabetically, but the first token is not necessarily the one that gives the user an idea about proper grouping; think of GetClientsResponse and SendClientsResponse in Figure 7-10 above (see the small sketch below). Even in this rare case of very meaningful names of classes in a package containing many classes, obviously, a software engineer benefits from having the cluster tree.

From Figure 7-10, as well as the corresponding figure in the appendix and Figure 7-11, demonstrating more or less the same fragment of the clustering hierarchy, we can see that the coupling of classes differs even though they are in the same package and appear as a plain list in the IDE. We claim that this differentiation is an important feature that facilitates program comprehension by a software engineer. For example, we immediately see that class ObjectFactory has a different nature than GetClientsRequest, or GetClientsResponse, or the others from that upper group in Figure 7-11. The bottom group is the most strongly coupled within itself, and only then to the rest of the classes, more so than any other class shown in the figure. Without looking at a single line of source code, we guess that ObjectFactory is some manager class, while PermissionsType (note the "s" after "Permission") and SimpleClientType are the ones it manages most closely. On the other hand, we also see in Figure 7-11 evidence for the correctness of the grouping. Classes PermissionType (note the absence of "s" after "Permission") and AccessRightType got clustered together, and we guess from the names that this is semantically true.
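To illustrate why the alphabetical ordering misleads here, the toy sketch below groups the class names mentioned around Figure 7-10 and Figure 7-11 by their trailing camel-case token instead of the leading one. The heuristic is ours and is not part of InSoAr; it only makes the point that a grouping-relevant token is often not the first one.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: group class names by their trailing camel-case token,
 *  a naive stand-in for the mental grouping a reader applies to a package listing. */
public class NameGrouping {

    public static void main(String[] args) {
        List<String> classNames = List.of(          // names mentioned in the text above
                "GetClientsRequest", "GetClientsResponse", "SendClientsResponse",
                "ObjectFactory", "PermissionsType", "SimpleClientType",
                "PermissionType", "AccessRightType");

        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String name : classNames) {
            String[] tokens = name.split("(?=[A-Z])"); // split on camel-case boundaries
            String key = tokens[tokens.length - 1];    // group by the last token
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
        }
        groups.forEach((token, names) -> System.out.println(token + " -> " + names));
    }
}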

Figure 7-11 Cluster tree indented by node depth

Apparently, representational power differs across our three textual representations of the cluster tree, in terms of the number of labels (only leaf nodes have them) that can be shown to a user within limited space and the ease of interpretation of the presented information by the user (software engineer). The bracketed approach, shown in a figure in the appendix, is the most powerful in terms of the number of labels (classes, leaf nodes) that can be displayed within the same space. However, efficient comprehension of this representation needs some training and familiarity with nested structures, i.e. trees where only leaf nodes have labels.

7.2.6 A package of omnipresent classes

Another example is essential for understanding our endeavor and the advance over state of the art tools. In section 7.2.5 we have shown that InSoAr can differentiate coupling within a package and thus facilitate comprehension of the package and the classes in it. However, the classes from that package were devoted to more or less the same purpose. In this section we show a principally distinct case, where even though classes are in the same package, they serve different purposes. Figure 7-12 shows on the left how such a package looks in an Integrated Development Environment (IDE); splitting classes into packages is an instance of the explicit architecture being declared and implemented by software engineers. However, according to how the source code was written (the implicit architecture), each (well, almost each) of these classes is coupled with a distinct group of classes from other packages (serves a distinct purpose), and this is shown on the right of Figure 7-12. Further evidence is provided in appendix 10.1. Thus we conclude that developers grouped classes into the com.kpmg.kpo.domain package according to some more high-level purpose, e.g. because the

classes are omnipresent (and our decomposition of the software system says that they are indeed omnipresent). On the other hand, we can also see that package com.kpmg.kpo.domain is 28th (out of 98) in ubiquity among client-code container artifacts (packages, or classes that have nested classes) and is ranked 338th (out of 1235) among all the containers, including libraries, see Figure 10-3. Its average merge height (the second column) is 20, which is not high, relatively. This means that the classes of this package become united into a single subsystem (containing classes from other packages too) more or less soon, not too far from the bottom of the hierarchy. Thus we conclude that there is still another high-level purpose, besides the omnipresence discussed in the previous paragraph. Without looking at a single line of the source code of any of these classes, we will not be surprised if their mission is to support Object-to-Relational Mapping (ORM), where a class is also a table in the database, and instances of this class are also rows of that table. We conclude this from the following facts (and of course, InSoAr gave us those facts):
- class com.kpmg.kpo.domain.DomainEntity got clustered with com.kpmg.kpo.dao.Dao (DAO stands for Data Access Object) and com.kpmg.kpo.dao.DaoFactory (see the top-right part of Figure 7-12)
- class com.kpmg.kpo.domain.AuditLevel got clustered with com.kpmg.kpo.dao.AuditLevelDao (see the bottom-right part of Figure 7-12)
Furthermore, we see that this ORM is supplied by the Hibernate technology (an ORM library for Java), as in Figure 10-1 (appendix 10.1): class com.kpmg.kpo.domain.ArchiveEntry got clustered with the com.kpmg.kpo.dao.hibernate.HibernateArchiveDao class.

To recap, just by looking at the hierarchy produced with InSoAr, we realized:
- the high-level purpose a group of SE artifacts (Java classes here) serves
- the lower-level purpose of each SE artifact, by looking at the classes with which it is coupled
No other tool can facilitate comprehension of a software system by humans to this extent.
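For readers less familiar with the Data Access Object pattern, the sketch below shows the kind of formal relations (type usages in method signatures) that make such classes cluster together. These interfaces are purely illustrative assumptions about the shape of the coupled artifacts; we did not read the actual com.kpmg.kpo sources.

// Illustrative sketch only: plausible shapes of the coupled artifacts.

/** Marker for persistent domain classes such as ArchiveEntry or AuditLevel. */
interface DomainEntity {
    long getId();
}

/** A generic data-access object: every method references DomainEntity,
 *  which is exactly the kind of type-usage relation InSoAr extracts. */
interface Dao<T extends DomainEntity> {
    T findById(long id);
    void save(T entity);
    void delete(T entity);
}

/** Creates DAOs; coupling DaoFactory to both Dao and DomainEntity explains
 *  why the three artifacts end up in one cluster. */
interface DaoFactory {
    <T extends DomainEntity> Dao<T> createDao(Class<T> entityType);
}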

Figure 7-12 A single package in the IDE, but multiple different logical subsystems

7.2.7 Security attack & protection

Security attack and protection is usually a dual task, like cryptography and cryptanalysis; thus, discussing one, we usually mean both. In terms of software protection, many schemes rely on the incomprehensibility of the protection mechanism for an attacker. An example is the injection of serial key/license checking code (instructions, subroutines, classes) all across the program being protected. Apparently, this leads to coupling of SE artifacts in the program onto the security mechanism, and the participants of the security mechanism itself will be clustered together, as we discussed in general for classes that act together (section 7.2.3). In the above scenario (injection all across the program), the security mechanism becomes a group of utility artifacts, and we will observe the same effect as in the appendix (Utility Artifacts) or section 7.2.6 for utility or omnipresent artifacts in general. Thus, an attacker is able to identify and circumscribe the security mechanism and study its couplings efficiently using the general techniques we discuss in this paper, the subsections of 7.2 and appendix 10.1. This approach works even when only binaries are available (Soot extracts relations from binary code too) and when the binary code is obfuscated (obfuscated source or binary code is code that has been made difficult to understand, e.g. by replacing Java class names with meaningless strings). The security mechanism will get clustered together anyway, and non-obfuscated neighboring artifacts (library classes or at least lower-level OS subroutines) will reveal its purpose. In the rare case when everything is obfuscated down to the machine code level, i.e. I/O ports and interrupts, clustering of a dynamically extracted graph of relations can be used for efficient discovery of the security mechanism.

Pursuing the goal of protection, one can do the same: study how the security mechanism looks after clustering, whether it is easy to identify and circumscribe, and whether that information provides an attacker with sufficient means for breaking the security. There are some nuances, however: even though the security mechanism may seem strong to a defender using one parameter set when clustering, another parameter set (e.g. obtained by selecting other options from those discussed in sections 5.1, 5.2, and 5.6.2) may still exhibit the weaknesses of the security mechanism.

Evidence of (a part of) a security mechanism identified in the dem0 project is provided in the appendix (Security mechanism: identify and circumscribe):
- Even if we did not know the purpose of the client-code classes EmployeeUserDetailsService, EmployeeUserDetails and StringAuthority (nested in EmployeeUserDetails), e.g. due to obfuscation, we could determine it from the library neighbors from package org.springframework.security.
- If we know weaknesses of classes coupled with the security mechanism (and thus clustered together with it), we can attempt to exploit them to compromise the security mechanism, even though the latter is strong in itself. Examples of such potential targets visible in the picture in that appendix are:
  WebservableObjectInToAuditableObjectTransformer: how does it handle boundary cases?
  org.apache.log4j.Logger: can we inject our code into this class, or substitute it entirely by changing the CLASSPATH on the server upon attack?
InSoAr is not an ultimate tool for compromising security. Structural security is vulnerable; algorithmic (e.g. Petri nets) or mathematical (e.g. factorization-based) methods will withstand it.

How implemented, what does, where used

In the two figures below we see the classes serving time and scheduling clustered together in a subtree.
We see that the time & scheduling subsystem has recurrence rules, rule factories and a rule service, Figure 7-13. There is a base class com.kpmg.reccurrence.RecurrenceRule, while weekly and monthly recurrence rules inherit from it. A more specific part of this subsystem is shown in the appendix (Figure 10-2). We see that there are also a quarterly recurrence rule and factory. The

more specific part, Figure 10-2, also contains classes for representing the days of the week and the week of the month. On the other hand, the more general part further gets joined with classes TaskInstanceCommand, CreateWorkflowInstanceAction, CreateInstanceCommand and CreateInstanceValidator, see Figure 7-13. The latter group of classes obviously uses the time and scheduling subsystem, e.g. com.kpmg.web.action.CreateWorkflowInstanceAction allows the user to define some scheduled workflow action. Note that the classes are from different packages. We cannot figure out this configuration using any state of the art tools.

Figure 7-13 Time & Scheduling subsystem

8 Further Work

8.1 A Self-Improving Program

In order to make a program that improves itself, we first need to make a program that improves programs and point it at itself. In turn, prior to making a program that improves programs, we need a program that understands programs, at least in the way humans do. Obviously, the ability to understand requires the ability to analyze as a prerequisite. State of the art source code analysis tools exist, but they do processing without understanding. In this project we have implemented a program that infers the structure of source code to facilitate its further comprehension by humans. A further work direction is implementing a program that attempts to comprehend the structure without humans and then does some forward engineering of improved source code.

8.2 Cut-based clustering

As we have discussed in section 5.1, Flake-Tarjan clustering [Fla2004] is only available for undirected graphs. This restriction is posed by the minimum cut tree algorithm [GoHu1961], while maximum flow algorithms are available for both directed and undirected graphs. To satisfy this restriction, in our project we converted the directed graph of software engineering artifact relations into an undirected one, using a normalization akin to the one described in [Fla2004] and PageRank [Brin1998]. Though the clustering demonstrated good results, it is obvious that important information is lost during the conversion from a directed to an undirected graph. Within our project we also tried to eliminate the constraint posed by the minimum cut tree, as we do not need a full-fledged cut tree, but only the edges separating the artificial sink from the rest of the tree [Fla2004]. However, this is a hard theoretical task, too risky given the nature of our project (master thesis). Thus, as a direction for further work we propose eliminating the requirement of undirectedness from Flake-Tarjan clustering, thus devising a version that takes a directed graph as input, uses a directed max flow algorithm in the backend, and somehow works around the minimum cut tree, exploiting the fact that we need a clustering of a directed graph, but not the entire correct minimum cut tree of an undirected graph.

8.3 Connection strengths

Our normalization, motivated in section 4.5 and specified in section 5.1, lets the clustering produce good results (section 7), alleviating the problem of utility artifacts. It is interesting to investigate whether edge weighting that considers more properties (fan-out, graph-wide facts) can result in an even better clustering hierarchy.

8.4 Alpha-threshold

We observed this phenomenon during the adaptation of Flake-Tarjan clustering to the domain of source code, and discussed it in section 5.6, proposing an ad-hoc solution that alleviates the issue. However, it is still interesting to analyze and formalize the cases in which this phenomenon occurs, and its extent, in terms of the properties of the input graph. In our intuition, the following two theoretical facts should lead to a sound theoretical conclusion:

1. The central theorem of [Fla2004], discussed in section 4.3.1:

   c(S, V \ S) / |V \ S|  <=  alpha  <=  c(P, Q) / min(|P|, |Q|)

2. The formalization of the alpha-threshold phenomenon we provided in section 5.6:

   k(alpha) >= K * k(alpha_t),  for alpha > alpha_t >= 0

In further work one should investigate why there exists an alpha-threshold alpha_t such that, for any small epsilon, there is a community S for which there is no partition S = S_1 ∪ ... ∪ S_q in which each S_i can satisfy the bicriterion (fact 1) using some alpha from the epsilon-range [alpha_t - epsilon; alpha_t + epsilon].

9 Conclusion

On the whole, we conclude that the solution we used in this project is practical. The software processed by our prototype is large, real-world and typical (see section 7.1). The clustering hierarchy produced is meaningful for software engineers (section 7.2, also 10.4), correlates with the known explicit architecture (appendix 10.2) and reflects the existing implicit architecture (sections 7.2 and 10.1), providing valuable facts for software engineers which cannot be observed using state of the art tools except by scanning millions of lines of source code manually.

We conclude that the method we devised (section 5.1) for alleviating the problem of utility artifacts (section 4.5) and for the directed-to-undirected conversion of the relation graph in the domain of software source code works well in practice. The empirical proof is provided in section 7.2. Not only did utility artifacts not confuse the clustering results, but they were also clustered together, reflecting their unity of purpose. This can be viewed as (perhaps a prerequisite for) the categorization concluded to be desirable in [HaLe2004].

In section 5.1.5, a disadvantage of lifting SE artifact granularity prior to clustering was investigated, namely, information loss. Possible solutions for lifting the granularity to class level after member-level clustering were given. We concluded not to adopt any of the alternative solutions, due to practical reasons (computational complexity) and the lack of a reasonably grounded solution. The alternative solutions discussed in sections 5.1.5 and 5.2 (option 1) have the following fact in common: they both rely on a merge operation for two nested decompositions inferred using different features, either different members of an SE class or different kinds of relations between SE artifacts. The strong point of both solutions is reduced information loss, compared to the solution we implemented in this project. Thus we conclude that by researching a suitable merge operation, one has the possibility to make two improvements at once.

One crucial contribution is the scale at which our reverse-architecting approach can operate. While the existing approaches are only able to process small or tiny software (e.g. [Pate2009] takes 147 Java classes as input), we process thousands of classes in our experiments and bump into the limits of the tool that provides the input relations for our prototype; namely, call graph extraction ([Soot1999], [Bac1996], [Sun1999], [Lho2003]) exhausts 2GB of memory on a 32-bit machine. This is not a problem for a 64-bit machine, and companies that have huge projects also have appropriate hardware (we do not mean supercomputers by the latter). We speculate that a 64-bit machine with 100GB of RAM and 32 processors should be sufficient for analyzing any real-world software project with our tool and, less surely, the prerequisite tools, in reasonable time. We stress the importance of operational scale in reverse architecting, as reverse architecting mostly addresses the issue of overwhelming complexity, which arises in large software projects and causes incomprehensibility.

Certainly, speed of processing does not bring any value without quality of the results. The hierarchical clustering technique [Fla2004] underlying our reverse-architecting approach is well grounded theoretically, giving the premises for claims about the resulting quality, even though there is no unbiased indicator of software clustering quality available (except the unlikely case of an exact match), because such an estimation is a subjective task for human software engineers.
Even though there are some metrics for software clustering quality proposed in the literature ([UpMJ2007], [END2004], [MoJo1999]), not only their adequacy but also their scale is in question: e.g. in [UpMJ2007] the experiments with the devised metric UpMoJo are limited to hierarchical decompositions of no more than 346 artifacts and an average height of 4. Research and experiments with automatic

metrics of clustering quality would require large efforts which we cannot afford within this project, thus we leave this as a direction for future work.

Apart from the strongly grounded clustering algorithm of [Fla2004], we have studied the literature on reverse engineering topics, incorporated the best practices and ideas from there (see sections 3 and 4), and developed our own theory necessary for the adaptation of the clustering algorithm to the domain of software source code, containing our ideas for improvement as well (section 8). In addition to the theoretical premises for the high quality of the resulting clustering hierarchies, our experiments confirmed that software artifacts in the results are indeed grouped according to their unity of purpose (as motivated in section 2) and, apparently, the visualized hierarchies (section 10) are comprehensible and meaningful for software engineers.

[Nara2008] makes the following note on how graph topology could influence software processes: "Understanding call graph structure helps one to construct tools that assist the developers in comprehending software better. For instance, consider a tool that extracts higher-level structures from a program call graph by grouping related, lower-level functions. Such a tool, for example, when run on a kernel code base, would automatically decipher different logical subsystems, say, networking, file system, memory management or scheduling. Devising such a tool amounts to finding appropriate similarity metric(s) that partition the graph so that nodes within a partition are more similar compared to nodes outside. Understandably, different notions of similarities entail different groupings." Such a tool has been implemented in our project. In the results section above we show that different logical subsystems are indeed identified. The similarity metric we use is the amount of interaction between software engineering artifacts. However, a desirable ability which we do not yet have in our tool is the inference of cluster labels. This direction appears in [Kuhn2007]; however, in their turn the authors propose to combine linguistic topics with formal application concepts as future work.

Finally, we give an account of the weak sides of our project. We are limited in available effort, and the nature of the project (master thesis) constrains us to certain decisions and strategy, such as avoiding risky research directions (in an attempt to invent, e.g. see section 8) and a preference for breadth (multiple approaches; extraction, format conversion, clustering, presentation, import/export, visualization, statistics; clustering and software engineering literature review, comparison to rival approaches, implementation, specification, experiments) rather than depth (single best approach; devising new theory, implementation according to best SE practices). Our claims about the resulting clustering quality, also in comparison to the other approaches, are mostly theoretical and empirical. Our statistical proof (section 10.2) exploits the assumption that more high-level Java classes have shorter package prefixes, in terms of token count (name depth). This is not always true, as there can be higher- and lower-level classes within one package (e.g. see section 7.2.5), and there are low-level classes having short prefixes (think of java.lang.String). A statistical comparison to other reverse-architecting approaches is desirable, using the same experimental setting (the same software under analysis). However, this is very effort-consuming without bringing much value to the result of the project in terms of clustering quality and speed.
More illustrative statistical proof of the theory and claims is desired.

Our theory should be checked for originality. Though we tried our best to review the existing approaches, it is not possible to prove that something does not exist, perhaps in different terms or in a different domain. We often use a problem-solving approach: given an intention, define the problem, solve it and implement the solution. Thus we do not always know whether someone else has already solved the same problem, and if so, how our solution relates to theirs in terms of precision, speed, advantages and disadvantages. This saves a huge amount of effort, up to 99% in our view, thus letting us implement more solutions, although having less evidence for their originality and optimality or superiority.

9.1 Major challenges

The applicability of the clustering algorithm and the success of this whole endeavor of integration were not obvious from the beginning. The following subsections list the major challenges that were identified prior to the start of, or during, the project.

9.1.1 Worst-case complexity

The theoretical estimations of the worst-case complexity seemed prohibitive. Flake-Tarjan clustering uses the minimum cut tree algorithm, which uses a maximum flow algorithm up to |V| times in the worst case. The hierarchical version of the clustering algorithm adds a further factor of |V|. Thus, the total algorithmic complexity is:

   O( |V|^2 * |E| * min(|V|^(2/3), |E|^(1/2)) * log(|V|^2 / |E|) * log U )

For the source code of a typical medium-size software project of (for example) 10K classes and 1M relations, the number of operations is on the order of 10^18 (already |V|^2 * |E| is about 10^14, and the remaining factors contribute several more orders of magnitude). This could take a thousand years on the usual computer on which the results shown in this thesis were nevertheless produced in 72 hours within our project. This was achieved due to a careful choice of the implementations and the underlying heuristics. The software projects we used in our experiments are typical, as most of their classes are Java library classes. Thus we conclude that the clustering method is applicable in general.

9.1.2 Data extraction

Analyzing real-world software projects is a challenge even for the tools that extract the data we use as input to our tool. In order to make Soot extract the call graph, we had to study its design and implementation, tune the parameters carefully and even change the source code of Soot in order to allow call graph extraction (and effectively, whole-program analysis) without providing and analyzing all the libraries on which the software under analysis depends.

9.1.3 Noise in the input

Call graph extraction is far from precise. By manually analyzing the extracted call graph and the original source code, we observe that many call relations are absent from the call graph though the source code clearly states their presence, and vice versa, there are many calls in the call graph which never occur during program execution and are not designed to occur by the developers. Another cause of noise in the input to the architecture reconstruction algorithm is the mistakes made by software engineers due to a lack of global understanding of the system. These mistakes violate the implicit architecture, or sometimes even the explicit one. The fact of

violation of the latter can be proved when the documentation or any other form of explicit architecture, e.g. the packaging structure, is available. Thus the formal relation graph's comprehension of the source code differs a lot from the comprehension of the developers who wrote the source code. As can be seen from the resulting clustering, this noise has been successfully tolerated.

9.1.4 Domain specifics

The data configurations specific to the domain are omnipresent utility artifacts and the almost perfectly hierarchical structure of software. We have discussed these issues within the thesis, and devised and implemented solutions, which also constitute the major theoretical contributions of our work.

9.1.5 Evaluation of the results

It was challenging to evaluate the reversed architecture due to problems common in Artificial Intelligence and other relevant fields:
- Human-to-machine intelligence gap: while a machine can only calculate some measure over the results, humans can find them meaningful, useful, easy to comprehend, etc.
- Lack of objective criteria: even when software engineers discuss some architecture (either the currently documented one, or prospective architectural decisions), arguments often bump into the philosophy of software engineering. Different experts adopt different approaches, or they just like some decisions more than others.
- Lack of labeled data: real-world (at least, non-trivial) software projects never have documentation of the hierarchical architecture down to the SE class level, i.e. a target nested software decomposition which we could train on or compare with. Furthermore, architectural documentation is usually not a hierarchy.
- Lack of adequate measures: a counterexample to [END2004] is provided in [UpMJ2007], while the latter is not scalable to large nested software decompositions.
What we did manually is a brute-force evaluation of the reversed architecture, by looking at subtrees of the cluster tree and arguing for the useful and adequate (reflecting the actual architecture) facts that a software engineer can see in those subtrees (section 7). There is an advantage in this kind of evidence too: we provide realistic evaluations as usually concluded by humans, rather than abstract measures that might not reflect what humans want to see.

9.2 Gain over the state of the art

9.2.1 Practical contributions

The output of our prototype needs human analysis in the end. However, we stress the gain in comprehensibility: instead of scanning and manually interpreting millions of lines of source code, human software engineers need to look at a few thousand nodes in the clustering hierarchy to get architectural insights, e.g. those described in 7.2. The latter section contains the typical actions the humans should take for this, though mainly it is a matter of experience and natural intelligence. To summarize, leaf node labels (i.e. the names of SE classes) are heavily used both for validation of the architecture (e.g. a class name must not confuse a software engineer about its purpose) and for validation of the quality of the clustering. The latter is possible because our approach does not use textual information at any stage of inference, neither identifier (type, variable) names, keywords, comments nor anything else. InSoAr's inference is purely based on formal relations between software engineering artifacts. The fact that SE classes having similar labels (the same textual features, e.g. the words composing package names or class names) appear

nearby in the clustering hierarchy (under the same parent, in the same subtree) speaks both for the quality of the architecture (decoupled, class purposes are well-defined) and for the quality of our result.

In Figure 9-1 below we give a trace of the software comprehensibility gain in numbers, from an experiment with the FreeCol open-source project. See section 5.2 for the details on what relations were extracted and how they were merged into a single input graph for clustering. We conclude that there is a nearly 1000-fold gain in comprehensibility of the software: from 7.5M lines of source code to 11K nodes in the cluster tree (or, debatably, 3.9K subsystems, i.e. inner nodes). Although there are 7.5K classes in the software, they are not comprehensible if presented as a plain list (section 7). The same holds for the non-perfectized cluster tree: though there are 9.6K nodes, the hierarchy is less comprehensible (section 5.6.1) than the 11.4K nodes of the perfectized counterpart, due to the issue of an excessive number of children (section 5.6).

  Estimated number of lines of source code: about 7.5M
  Number of formal relations extracted: about 822K
    of them, edges in the call graph (method calls): about 2/3
    field accesses and other relations: about 1/3
  Number of SE artifacts: classes (about 7.5K), fields, methods
  Number of items in the cluster tree:
    before perfectization: about 9.6K, of them inner nodes (subsystems)
    after perfectization (section 5.6): about 11.4K, of them inner nodes (subsystems): about 3.9K
    labeled leaf nodes (SE classes): about 7.5K

Figure 9-1 Software comprehensibility gain, in numbers

From the above table we also see that the gain in comprehensibility over the non-clustered graph of extracted relations, calculated as the ratio of the item numbers, is nearly 100-fold: 822K edges in the input graph vs. 11.4K nodes in the cluster tree. Note that the graph of relations is not just the call graph:
- 2/3 are indeed method call relations (the call graph)
- 1/3 are other relations (field access, inheritance, type usage, parameter and return types)
The examples in section 7 illustrate that the determined clusters make it possible for a software engineer to infer the purpose of SE classes from the names of these classes and the neighbouring classes in the cluster tree. This is useful even in case the purpose of a class is obvious from its name, as its position in the cluster tree validates its proper naming, assuming that the quality of the clustering is high, which was also concluded. The central inference our tool performs is the identification of hierarchical groups of classes that act together (section 7.2.3). In combination with the identified purposes (the paragraph above), this can be used for obtaining overviews of systems that lack documentation, or for documenting subsystems (including the special case of a single-class subsystem). Along the way it simplifies the detection of anomalies in the software system by a software engineer, e.g. overcomplicated coupled groups as in the appendix (An overcomplicated subsystem).

Differentiation of coupling within a package (section 7.2.5) presents a software engineer with a structure while the explicit architecture shows a plain list, which can be much harder to comprehend in case there are many SE classes in the list. On the other hand, when classes belonging to different subsystems appear in the same package due to some more high-level

property (section 7.2.6), a software engineer can observe the actual implicit subsystem for each class in the cluster tree. These facts particularly help a software engineer in refactoring and in the identification of subsystems affected by a change: as we see in section 7.2.6, the subsystem affected by a change in a class from that package is not the package, but the implicit subsystem with which the class is coupled. And that is the one inferred by our tool.

Apparently, a security mechanism is a special case of a subsystem, section 7.2.7. Thus with our tool software engineers can inspect the structural vulnerabilities of a software system. As shown above and in the appendix (Insight on the implementation), sometimes insight into the implementation can also be captured.

The ultimate goal is to allow a Divide & Conquer approach to software comprehension; however, it is hard to support such a claim, as visualizational problems are encountered in the upper levels (near the root) of the cluster tree, namely: there become too many leaves (labeled nodes) in a subtree, thus some inference of cluster labels is needed. We provide the evidence we currently have in the appendix (Divide & Conquer). Still, software engineers can start labeling subsystems from the bottom level up. From Figure 9-1 above we see that for 7.5 million lines of source code, there are only 3.9K inner nodes (i.e. subsystems) in the cluster tree. Labeling these nodes manually for the sake of the Divide & Conquer opportunity can be a reasonable task, given that our tool provides a means to identify the purpose of subsystems near the leaf nodes cheaply.

To recapitulate, in section 7.2 we discussed the practical facts that can be inferred automatically using the approach we devised. These facts cannot be inferred with any other state of the art software engineering tool. This list is not exhaustive, as it contains only the facts we could think of and discuss illustratively. Altogether, we characterize these facts as architectural insights with practical applications in reverse engineering, software quality and security analysis.

9.2.2 Scientific contributions

An efficient algorithm for high-quality clustering of large software has been invented. The algorithm is based on Flake-Tarjan clustering (section 4.3), thus inheriting its intrinsic hierarchical property, its optimization of graph cut criteria, and a premise for high quality as reported in [Fla2004] in general. Our contributions on top of the Flake-Tarjan clustering algorithm are provided in section 5. Of them, the following are specific to the domain of source code:
- Edge weight normalization (section 5.1): incorporates the recent conclusions in the literature on Reverse Engineering (section 4.5) about the main domain-specific problem, utility artifacts, and proposes a directed-to-undirected graph conversion (as the Flake-Tarjan algorithm requires an undirected graph as input) based on the utilityhood rationale from the literature (fan-in analysis).
- Perfectization (section 5.6): makes adjustments to the hierarchical results of Flake-Tarjan clustering, so that a specific property of the data (namely, the nearly perfect hierarchical structure of software as a good practice) does not confuse the clustering result.
The following contributions of ours are improvements over the hierarchical Flake-Tarjan clustering algorithm in general:
- Distributed version (section 5.5, using the contributions of sections 5.3 and 5.4): motivated by the need for hierarchical clustering of large software, the distributed version allows running multiple basic (section 4.3.1) Flake-Tarjan clustering probes in parallel, one processor per value of the parameter alpha. The results are then merged into a single hierarchy, as described in section 5.4.
- Prioritized alpha-search (section 5.3): in the absence of time to compute the result of basic Flake-Tarjan clustering for each necessary (i.e. potentially producing a different

number of clusters) value of the parameter alpha, it allows taking the most important probes first, so that the more important decisions about the clustering hierarchy are taken earlier.

A purely theoretical contribution, within our work, is given in section 5.2: it discusses the potential solutions for considering multiple kinds of formal relations between SE artifacts during clustering; however, we did not have time to implement the semi-supervised learning proposed there. It currently serves as evidence for our hypothesis 3. In our current clustering algorithm we use the same merge weight (equal to 1) for each kind of SE relation.

Minor contributions include:
- Reduction of a real- to an integer-weighted flow graph (section 6.1.1): this allows substituting integer-capacity max flow algorithms into Flake-Tarjan clustering, instead of the more computationally expensive real-capacity max flow algorithms.
- A review of state of the art clustering and source code analysis methods and tools in section 3, and prerequisites from clustering, source code analysis and reverse engineering in section 4.

10 Appendices

10.1 Evidence

This section contains evidence for the claims about the properties and quality of the resulting clustering hierarchy. This does not include evidence involving source code demonstration: such evidence is listed in section 10.3 below.

A package of omnipresent client artifacts

Here we continue the evidence for the claim discussed in section 7.2.6.

Figure 10-1 HibernateArchiveDao clustered with ArchiveEntry

Security mechanism: identify and circumscribe

See section 7.2.7 for the discussion.

Subsystems

In Figure 10-2 below we continue illustrating the time & scheduling subsystem, as it appeared in the clustering hierarchy and as discussed in the section "How implemented, what does, where used". This part is closer to the bottom of the cluster tree, as can be seen from the non-branching nodes on the right.

Figure 10-2 Time & Scheduling subsystem (continued)

Insight on the implementation

Below we show obvious examples of insights that the reversed architecture gives to software engineers. For humans it is now easy to note that, for example, client code class HibernateWorkflowTemplateDao works with Criterion, Restrictions and LogicalExpression of the Hibernate library; in its turn, it is likely to be used by SimpleExpression.


The part of the cluster tree demonstrated below tells software engineers multiple architectural facts:
  Coupling structure: JMSUtil and AbstractDocumentHandler
  Purpose from library neighbors: how documents are handled via JMS Messages, Sessions, Connections

An overcomplicated subsystem

In the picture below we see a number of classes with, concluding from the names, different purposes acting as a single mechanism. The classes are also from different packages, thus we cannot detect this coupling efficiently using state of the art tools. However, when changing the software, it is important to identify the extent of the subsystem to be changed. By circumscribing a subsystem subject to change, we narrow down the search space of side effects. This is a fragment of the cluster tree for project FreeCol, which is an open-source game. The nature of the task the classes are performing, an AI player, validates the intuition about the complexity of the subsystem and the clustering result.

Divide & Conquer



Utility Artifacts

In the figure below we can see that utility artifacts have indeed been identified and clustered together, even fairly exhibiting their unity of purpose. StringBuilder and StringBuffer are unarguably utility classes in Java. The descendants of cluster id9792 are these 2 classes together with others serving a similar purpose, except for the subtree of id9791. The latter subtree contains the rest of the program, and the artifacts there can usually be viewed as more high-level or specific (in the meaning opposed to general-purpose artifacts).

10.2 Statistical Measures

Packages by ubiquity

In these statistical measures we investigated how class name length (in tokens; for example, java.lang.String has 3 tokens: java, lang and String) correlates with the position in the cluster tree. For each node in the cluster tree we calculated how many token-wise suffixes of each name are matched (i.e. have the same token-wise prefix) in the subtree of the node. A Suffix Tree data structure and Dynamic Programming (section 4.6) allowed computing this fast (subquadratic complexity). Then we computed multiple averages (per token-wise prefix):
- average match depth: the average depth of the cluster tree node at which the suffixes of this prefix matched
- average match height: the same, but the height is averaged
- average number of nodes in the subtree: the same, but the number of nodes in the subtree of a cluster tree node is averaged
Then we applied a ranking approach: a comparison across each dimension adds +1/-1 to the sum. Afterwards, we sorted the token-wise name prefixes according to this rank; the figure below demonstrates the result. We call this rank package ubiquity.

Figure 10-3 Packages by ubiquity
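As a small illustration of the token-wise prefixes over which the above averages are computed, the sketch below enumerates them for a fully qualified class name. The class and method names in the sketch are ours; the averaging and ranking machinery of the measure itself is not shown.

import java.util.ArrayList;
import java.util.List;

/** Sketch: enumerate the token-wise prefixes of a fully qualified class name.
 *  For "java.lang.String" (3 tokens) this yields "java", "java.lang" and
 *  "java.lang.String"; the statistics above are averaged per such prefix over
 *  the cluster-tree nodes that match it. */
public class TokenwisePrefixes {

    static List<String> prefixes(String qualifiedName) {
        List<String> result = new ArrayList<>();
        String[] tokens = qualifiedName.split("\\.");
        StringBuilder prefix = new StringBuilder();
        for (String token : tokens) {
            if (prefix.length() > 0) {
                prefix.append('.');
            }
            prefix.append(token);
            result.add(prefix.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(prefixes("java.lang.String"));
        // prints: [java, java.lang, java.lang.String]
    }
}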

Median Matched Name Length from CT Node Height, in tokens

Figure 10-4 Shares of package name depth over cluster tree node height

Architectural Fitness of SE Class Couplings

In Figure 10-5 below we see that 10% of the relations between SE artifacts violate the implicit architecture, while 90% fit it very well. In the next diagram, Figure 10-6, we see that 80% of the weighted misfit is constituted by 16% of the relations (couplings).

Figure 10-5 Misfitness of Client-Code Class Couplings: Architectural Violation Extent over Sorted Ordinal

Figure 10-6 Cumulative weighted misfitness fraction: Sum of misfitness times weight, from the most misfitting on

Misfitness over Weight

Figure 10-7 Architectural Violation over Coupling Strength: note that the scales are logarithmic

To draw the diagram below, the relations between SE artifacts were sorted by weight, from the most strongly coupled to the least coupled.

Cumulative Misfitness over Weight

10.3 Analyzed Source Code Examples

Dependency analysis

A Java class that does not use calls or external field accesses

package com.kpmg.esb.mule.component;

import com.kpmg.kpo.service.ServiceMessageRuleService;

/**
 * Common abstract super class that provides an injection mechanism for services.
 */
public abstract class AbstractPersistableComponent {

    /**
     * serviceMessageRuleService injected by Spring
     */
    private ServiceMessageRuleService serviceMessageRuleService;

    public ServiceMessageRuleService getServiceMessageRuleService() {
        return serviceMessageRuleService;
    }

    /**
     * Accessor method.
     * @param serviceMessageRule
     */
    public void setServiceMessageRule(ServiceMessageRuleService serviceMessageRule) {
        this.serviceMessageRuleService = serviceMessageRule;
    }
}

Some classes whose dependencies are not specific at all

package com.kpmg.kpo.action;

import org.jbpm.graph.def.ActionHandler;
import org.jbpm.graph.exe.ExecutionContext;

public class SendFile extends GenericHandler implements ActionHandler {

    // the original literal value was not preserved in the printed listing; 1L is a placeholder
    private static final long serialVersionUID = 1L;

    private String messageType;

    public void execute(ExecutionContext executionContext) throws Exception {
        System.out.println(messageType);
    }

    public void setMessageType(String messageType) {
        this.messageType = messageType;
    }
}

Perhaps objects of this class are used as items in arrays or linked data structures.

Firstly, the type of an item (CompleteTaskCommand in this case) plays a considerable role in decision making regarding the further handling of the item and the further program flow: conditional flow branches are taken based on the result of the instanceof operator. Secondly, specific features of an instance of this class are stored in String fields, <comment> and <transition> in our example.

package com.kpmg.kpo.web.binding;

import java.io.Serializable;

import com.kpmg.kpo.domain.TaskInstance;

public class CompleteTaskCommand implements Serializable {

    /**
     * SerialVersionUID, required by Serializable.
     */
    private static final long serialVersionUID = 1L;

    private TaskInstance task;
    private String comment;
    private String transition;

    /** @return the task */
    public TaskInstance getTask() {
        return task;
    }

    /** @param task the task to set */
    public void setTask(TaskInstance task) {
        this.task = task;
        // Reset other fields if we choose a new Task.
        this.comment = null;
        this.transition = null;
    }

    /** @return the comment */
    public String getComment() {
        return comment;
    }

    /** @param comment the comment to set */
    public void setComment(String comment) {
        this.comment = comment;
    }

    /** @return the transition */
    public String getTransition() {
        return transition;
    }

    /** @param transition the transition to set */
    public void setTransition(String transition) {
        this.transition = transition;
    }
}
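To illustrate the first point, a consumer of such a command bean typically branches on the runtime type of the item; the class and method names in this sketch are hypothetical and not taken from the analyzed source.

import com.kpmg.kpo.domain.TaskInstance;
import com.kpmg.kpo.web.binding.CompleteTaskCommand;

/** Hypothetical consumer showing instanceof-based dispatch of command beans. */
class CommandDispatcher {

    void handle(Object command) {
        // The further handling is decided by the runtime type of the item...
        if (command instanceof CompleteTaskCommand) {
            CompleteTaskCommand c = (CompleteTaskCommand) command;
            // ...and by its String fields (<comment> and <transition>).
            completeTask(c.getTask(), c.getComment(), c.getTransition());
        }
        // further branches would test for other command types
    }

    private void completeTask(TaskInstance task, String comment, String transition) {
        // stub: the actual completion logic is irrelevant to the dependency argument
    }
}

Because all such decisions live in the consumers, the binary code of CompleteTaskCommand itself exposes almost no outgoing dependencies.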

Dependencies lost at Java compile time

Certain relations are lost while compiling .java files into .class files. This occurs for the example class below:

package com.kpmg.kpo;

/**
 * Global variables for the workflow.
 */
public final class WorkFlowVariables {

    /**
     * Class cannot be instantiated.
     */
    private WorkFlowVariables() {
    }

    /**
     * Constant name used to store the transition map.
     */
    public static final String TRANSITIONS = "transitions";

    /**
     * Constant name used to identify the domain peer in jBPM.
     */
    public static final String PEER = "peer";

    /**
     * The comma-separated list of undo-actions.
     */
    public static final String UNDO_ACTIONS = "undo_actions";

    /**
     * Due-date for a specific Task.
     */
    public static final String DUE_DATE = "duedate";

    /**
     * Warning start date for a specific task.
     */
    public static final String WARNING_START_DATE = "warningstartdate";

    /**
     * Key to store the default transition (if any) under.
     */
    public static final String DEFAULT_TRANSITION = "default_transition";
}

Whenever a string constant from the WorkFlowVariables class is used in Java source code, e.g. WorkFlowVariables.PEER, its value is substituted into the binary code ("peer" in our example) rather than a reference to the field PEER of type WorkFlowVariables. See the example that uses WorkFlowVariables.PEER below. As a consequence, the vertex WorkFlowVariables gets no adjacent edges in the relations graph and thus becomes an orphan, i.e. the single vertex in a disjoint component.

package com.kpmg.kpo.action;

import java.io.Serializable;

import org.jbpm.context.exe.ContextInstance;
import org.jbpm.graph.def.ActionHandler;
import org.jbpm.graph.exe.ExecutionContext;

import com.kpmg.kpo.WorkFlowVariables;
import com.kpmg.kpo.domain.MessageType;
import com.kpmg.kpo.domain.WorkflowInstance;

/**
 * Generic action handler for actions that need a {@link MessageType} and the id
 * of a {@link WorkflowInstance}.
 * <p>
 * The taskName may be provided. It can be derived from the task that is
 * executed, but a designer may need to choose a different taskName, as a
 * different task in essence is responsible for firing the event. For example a
 * reset task does execute this handler, but the task that executed the reset
 * task itself is the taskName we want to provide.
 */
public abstract class AbstractDocumentHandler extends GenericHandler implements ActionHandler {

    private static final long serialVersionUID = L;

    private String messageType;
    private String taskName;

    public void execute(ExecutionContext executionContext) throws Exception {
        ContextInstance jbpmContext = executionContext.getContextInstance();
        final WorkflowInstance peer = (WorkflowInstance) jbpmContext.getVariable(WorkFlowVariables.PEER);
        if (taskName == null || taskName.equals("")) {
            taskName = executionContext.getEventSource().getName();
        }
        Serializable command = getCommandObject(peer.getId(), taskName, messageType, executionContext);
        JMSUtil.getInstance().sendByJMS(getQueueName(), command);
    }

    public void setMessageType(String messageType) {
        this.messageType = messageType;
    }

    abstract String getQueueName();

    abstract Serializable getCommandObject(Long workflowInstanceId, String taskName, String messageType,
            ExecutionContext context);
}
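The effect of compile-time constant inlining can also be shown with a minimal hypothetical client; the class and package names below are illustrative and not from the analyzed source, while the behavior follows the Java Language Specification rules for constant expressions.

package com.kpmg.kpo.example;           // hypothetical package, for illustration only

import com.kpmg.kpo.WorkFlowVariables;  // this import leaves no trace in the compiled class

public class PeerLookup {
    // WorkFlowVariables.PEER is a compile-time constant (a static final String initialized with a
    // literal), so javac copies the value "peer" into PeerLookup's own constant pool. The compiled
    // PeerLookup.class therefore contains no reference to WorkFlowVariables, and bytecode-based
    // dependency extraction sees no edge between the two classes.
    public String peerKey() {
        return WorkFlowVariables.PEER;   // compiled as if it were: return "peer";
    }
}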

Problematic cases

package com.kpmg.kpo.audittrail.impl;

import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

import com.kpmg.kpo.audittrail.AuditLog;
import com.kpmg.kpo.dto.AuditEntryDTO;

/**
 * AuditLog implementation that delegates to another AuditLog running in a
 * separate thread. This fire-and-forget approach helps in keeping audit trail
 * logging fast, yet the code invoking these log statements does not know about
 * success or failure of storing the log entry in the database (nor about
 * validation results).
 * <p/>
 * As a side-effect, this delegating AuditLog generates a GUID returned by the
 * log method (and sets that GUID on the AuditEntryData instance passed to the
 * delegate).
 * <p/>
 * The assumption is that using unmanaged threads in JBoss is OK (within
 * certain bounds, of course).
 */
public final class DelegatingAuditLogImpl implements AuditLog {

    private AuditLog delegate;
    private ExecutorService executorService;

    public DelegatingAuditLogImpl(AuditLog delegate, ExecutorService executorService) {
        this.delegate = delegate;
        this.executorService = executorService;
    }

    public DelegatingAuditLogImpl(AuditLog delegate) {
        // The following ExecutorService is guaranteed to execute the log calls sequentially
        this(delegate, Executors.newSingleThreadExecutor(new ThreadFactory() {
            public Thread newThread(Runnable r) {
                Thread t = Executors.defaultThreadFactory().newThread(r);
                t.setName("audittrail-" + t.getName());
                return t;
            }
        }));
    }

    public String log(AuditEntryDTO data) {
        String guid = UUID.randomUUID().toString();
        final AuditEntryDTO dataWithGuid = data.withGuid(guid);
        // This is fire-and-forget. If the delegate's log call in the separate thread throws an exception,
        // it will not affect this thread.
        this.executorService.execute(new Runnable() {
            public void run() {
                DelegatingAuditLogImpl.this.delegate.log(dataWithGuid);
            }
        });
        return guid;
    }
}

Call Graph Extraction

The class StatusType is a Java Enum. Obviously, it does not actually call any methods from the com.sun.imageio package, and the source code confirms that. However, call graph extraction adds noise, which we can see in the figure below. The figure demonstrates relations between SE artifacts lifted to class level. The number 238 in the top-left corner stands for the number of other classes with which class StatusType has relations.

The aforementioned class-level noise results from the underlying method-level noise. CHA call graph extraction encountered a call to Object.clone() from StatusType.values() and concluded that many calls to different classes derived from Object are possible; however, those calls never actually occur and were by no means intended by the software engineers who designed and implemented the source code. The data containing the noise calls is illustrated in the figure below. The number 145 stands for the number of destinations which the method StatusType.values() may call according to this call graph extraction approach. This number is also the number of outgoing arcs from the corresponding vertex of the input graph that we would get if we used this call graph extraction approach.
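The over-approximation is inherent to CHA: a virtual call site is resolved to every subtype of the declared receiver type that provides the method, regardless of what the receiver can actually point to at run time. A minimal sketch of this resolution rule follows; the helper and field names are illustrative and this is not Soot's API.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of Class Hierarchy Analysis (CHA) call-site resolution. */
class ChaCallResolver {

    /** Declared type -> all of its (transitive) subtypes, precomputed from the class hierarchy. */
    private final Map<String, List<String>> subtypesOf;

    ChaCallResolver(Map<String, List<String>> subtypesOf) {
        this.subtypesOf = subtypesOf;
    }

    /**
     * For a call such as Object.clone() inside StatusType.values(), the declared receiver type is
     * java.lang.Object, so every class that overrides clone() is reported as a possible target,
     * which produces the 145 spurious destinations seen above.
     */
    List<String> resolveVirtualCall(String declaredReceiverType, String methodSubSignature) {
        List<String> targets = new ArrayList<String>();
        for (String subtype : subtypesOf.getOrDefault(declaredReceiverType, Collections.<String>emptyList())) {
            if (declaresOrInherits(subtype, methodSubSignature)) {
                targets.add(subtype + "." + methodSubSignature);
            }
        }
        return targets;
    }

    private boolean declaresOrInherits(String type, String methodSubSignature) {
        // stub: a real implementation would consult the bytecode of the class hierarchy
        return true;
    }
}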

Points-to analysis techniques (RTA, VTA, Spark) help to alleviate this problem substantially, though at the cost of some mistakenly dropped calls. See the picture below.

Class name contradicts the purpose

Here we support, with source code, the strong claim of a proper clustering result made in the section on a class whose name seems to contradict its purpose. We can see that the source code indeed works mostly with regular expressions and JBPM expression evaluation. Thus, the clustering was correct, while the name of the class is deceptive.

package com.kpmg.kpo.action;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jbpm.graph.exe.ExecutionContext;
import org.jbpm.jpdl.el.impl.JbpmExpressionEvaluator;

public abstract class GenericHandler {

    // Prepare to identify any EL expressions, #{ }, regex: #\{.*?\}.
    private static final String EL_PATTERN_STRING = "#\\{.*?\\}";

    // Turn the pattern string into a regex Pattern class.
    private static final Pattern EL_PATTERN = Pattern.compile(EL_PATTERN_STRING);

    // Evaluate the input as a possible EL expression.
    protected Object evaluateEL(String inputStr, ExecutionContext ec) {
        if (inputStr == null) {
            return null;
        }
        Matcher matcher = EL_PATTERN.matcher(inputStr);
        if (matcher.matches()) {
            // input is one big EL expression
            return JbpmExpressionEvaluator.evaluate(inputStr, ec);
        } else {
            return inputStr;
        }
    }

    /* Treats input as a possible series of EL expressions and concatenates what is found. */
    protected String concatenateEL(String inputStr, ExecutionContext ec) {
        if (inputStr == null) {
            return null;
        }
        Matcher matcher = EL_PATTERN.matcher(inputStr);
        StringBuffer buf = new StringBuffer();
        while (matcher.find()) {
            // Get the match result
            String elExpr = matcher.group();
            // Evaluate EL expression
            Object o = JbpmExpressionEvaluator.evaluate(elExpr, ec);
            String elValue = "";
            if (o != null) {
                elValue = String.valueOf(JbpmExpressionEvaluator.evaluate(elExpr, ec));
            }
            // Insert the calculated value in place of the EL expression
            matcher.appendReplacement(buf, elValue);
        }
        matcher.appendTail(buf);
        // Deliver result
        if (buf.length() > 0) {
            return buf.toString();
        } else {
            return null;
        }
    }

    /* Returns true if the value is a String which contains the pattern delineating an EL expression. */
    protected boolean hasEL(Object value) {
        if (value instanceof String) {
            Matcher matcher = EL_PATTERN.matcher((String) value);
            return matcher.find();
        }
        return false;
    }

    /* Returns true if the value is a String which in its entirety composes one EL expression. */
    protected boolean isEL(Object value) {
        if (value instanceof String) {
            Matcher matcher = EL_PATTERN.matcher((String) value);
            return matcher.matches();
        }
        return false;
    }
}

10.4 Visualizations

State of the art tool STAN

Below is a visualization of a part (one that fits a sheet) of InSoAr at package level with the state of the art tool STAN:


Cfinder (Clique Percolation Method)

Figure 10-8 Cliques found by Cfinder

Figure 10-9 Zoomed-in visualization of Cfinder results

Cluster Tree in Text

Indentation by Height

Bracketed presentation

Figure: Bracketed presentation of the clustering hierarchy

Sunray Representation

Sunburst

Hyperbolic tree (plane)


Circular TreeMap

View on the whole program

Parts of the architecture



H3 Sphere Layout

Near class CrmSOAP

Classes that act together

