Frappé: Querying the Linux Kernel Dependency Graph

Nathan Hawes, Oracle Labs, nathan.hawes@oracle.com
Ben Barham, Oracle Labs, ben.barham@oracle.com
Cristina Cifuentes, Oracle Labs, cristina.cifuentes@oracle.com

ABSTRACT
Frappé is a developer tool for querying and visualizing the dependencies of large C/C++ software systems, up to tens of millions of lines of code in size. It supports developers with a range of code comprehension queries such as "Does function X, or something it calls, write to global variable Y?" and "How much code could be affected if I change this macro?" Results are overlaid on a visualization of the dependency graph based on a cartographic map metaphor. In this paper, we give a brief overview of Frappé and describe our experiences implementing it on top of the Neo4j graph database. We detail the graph model used by Frappé and outline its key use cases using representative queries and their runtimes over the dependency graph of the Unbreakable Enterprise Kernel. Finally, we discuss some of the open challenges in supporting source code queries across single and multiple versions of an evolving codebase with current property graph database technologies: performance, efficient storage, and the expressivity of the graph query language given a graph model.

Categories and Subject Descriptors
D.2.3 [Software Engineering]: Coding Tools and Techniques - program editors. D.2.5 [Software Engineering]: Testing and Debugging - debugging aids, testing tools. D.2.6 [Software Engineering]: Programming Environments - integrated environments.

General Terms
Algorithms, Experimentation, Languages.

Keywords
Source code querying tools, Graph databases, C/C++.

1. INTRODUCTION
Whether identifying architectural issues or simply locating the underlying definition of a symbol, source code querying tools are becoming increasingly important as code bases grow into the tens of millions of lines of code [1]. This is particularly true for C/C++ source code of this magnitude (typically systems code), where the preprocessor, complex language features, custom build systems, and large quantities of legacy code further complicate such tasks and reduce the effectiveness of tooling designed to support them.

Figure 1. Components of a source code querying system

Source code querying systems typically comprise four elements, as seen in Figure 1: an extractor to pull the desired system model from the source code, a repository to store that data, an interface that allows the user to specify queries and view their results, and a query processor that handles queries by examining the repository [2].
The simplest code querying systems are text-search based: the source code files themselves serve as the data repository, regular expressions and lists of source locations as the interface, and tools such as grep as the query processor. More sophisticated and precise solutions typically hook into, or partially re-implement, the compilation and linking process to extract structured data into a separate repository. End users typically query via predefined commands in an Integrated Development Environment (IDE), e.g. go to definition, find references, or remove unused imports. Results are most often displayed as a list of textual matches with hyperlinks to the relevant source code.

While the solutions employed in many IDEs have greater precision due to their more complete understanding of the underlying language (i.e. types, scoping, and linking information), text editors like vim and emacs and simpler code querying tools like grep are still widely used in the C/C++ community. This is largely due to the impracticality of integrating IDEs into large, complex build systems in some cases, as well as spurious errors owing to the custom parser used by most IDEs (Visual Studio and Xcode being exceptions). The purely textual presentation of query results used in both simple text-based tools and IDEs also places a large cognitive load on the user: in large-scale systems a code query can produce a great many results, and each result is often structured (e.g. when searching for dependency cycles or for paths from an entry point to a function of interest).

Frappé is a C/C++ source code querying tool that aims to support large-scale code bases. Our main focus has been on the extractor, to obtain precise dependency information (including cross-linking of information), and the interface, to present large, structured result sets effectively [3]. This paper focuses on our experiences using a graph database as the repository and query processor: Neo4j [4] and its Cypher query language [5]. The paper describes use cases drawn from the source code querying community, and discusses which of them are served well by relational or graph databases and their query languages. It also poses some challenges to the graph database community in order to support a richer, more interesting set of use cases: efficiently storing data for multiple versions of a large C/C++ codebase as it evolves, and specifying succinct, performant queries within and across these versions.
2. FRAPPÉ OVERVIEW
With Frappé, we aim to address the above issues by providing an easy-to-integrate, precise C/C++ source code querying system that scales both in terms of performance and presentation. Of the four components of a source code querying system, only the extractor and the interface need to be tailored specifically to source code querying; hence, these have been our major focus to date.

For the extractor, integration with custom builds is made easy using the same approach as the Oracle Parfait bug-checking tool [6]: wrapper scripts serve as drop-in replacements for the most common compilers (e.g. gcc, icc, cc, Clang) and compilation tools. These scripts still execute the native compiler they wrap, but also run a modified version of the complete Clang compiler (rather than a custom parser) to capture precise information on the various source entities and dependencies in each compilation unit.

For the interface, Frappé uses the captured dependency information to generate a zoomable 2D spatial visualization of the code that employs a cartographic map metaphor: the continent/country/state/city hierarchy of the map corresponds to the equivalent hierarchy in source code, from high-level architectural components down to individual files and functions [3,7]. Overlaying query results on this map, be they individual source entities, paths through the code, or transitive closures, gives an immediate general impression of the location, locality, structure, and quantity of results, which helps in perceptually filtering out irrelevant results.

We looked to existing database management systems to fill the role of the repository, the query processor, and the query language for specifying custom queries in the interface. Relational DBMSs coupled with SQL would work well for some of the simpler use cases Frappé targets (covered in Section 4), but many common source code queries involve transitive closure or reachability computations. Specifying these in SQL is difficult and results in verbose recursive queries that, when backed by a relational DBMS and a large data set, often suffer performance issues due to repeated join operations. For the implementation of Frappé, we instead chose to investigate a graph database, Neo4j, avoiding joins altogether. Its property graph model is a good conceptual fit for querying source code: many core program concepts are graph-based or graph-like (e.g. call graphs, type hierarchies, data-flow graphs, or control-flow graphs), and the most widely employed visual representation for software is the node-link diagram (e.g. to display module dependencies, call graphs, flow charts or UML). As a result, graph-based query languages, like Neo4j's Cypher, seem natural in this domain. Urma and Mycroft also recently showed, through their tool Wiggle [8], that Neo4j and Cypher can succinctly express source code queries for Java and scale to codebases of ~3 million lines of code.
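As a taste of this succinctness, the reachability question "which functions can sr_media_change transitively call?" is a three-line pattern in Cypher, whereas the SQL equivalent requires a recursive common table expression over a self-joined edge table. This is a sketch using the calls edge type introduced in Section 3 and an arbitrary seed function; Section 4.4 develops this query in full.

// all functions transitively reachable from the seed via call edges
START n=node:node_auto_index('short_name: sr_media_change')
MATCH n -[:calls*]-> m
RETURN distinct m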
3. GRAPH MODEL
The graph model used by Frappé synthesizes information from the preprocessor, abstract syntax tree, directory structure and linker in order to serve a variety of use cases, outlined in Section 4. Nodes in the graph represent a range of entities, from symbol definitions and declarations to macro definitions, source files, directories, and modules. Edges represent the directed associations between these entities.

foo.h:
    int bar(int);

foo.c:
    #include "foo.h"
    int bar(int input) { return input; }

main.c:
    #include "foo.h"
    int main(int argc, char **argv) { return bar(argc); }

compile:
    gcc foo.c -c -o foo.o
    gcc main.c foo.o -o prog

Figure 2. Example dependency graph

Figure 2 shows a small example program (top) and its corresponding dependency graph (bottom) for illustration purposes. The main function in file main.c makes use of function bar, defined in file foo.c and declared in header file foo.h. File foo.c is compiled into the object file foo.o. File main.c is compiled and linked with object file foo.o to produce the executable program prog. The nodes of the resulting graph are the executable program prog, object file foo.o, source files main.c, foo.h and foo.c, functions main and bar, formal parameters argv, argc and input, and their types char and int. Edges show the relationships between these nodes, e.g. compiled_from, includes, or file_contains. Of interest, note that the isa_type edge from argv to char carries the QUALIFIERS value ** to denote the correct type of argv.

The complete set of node and edge types is shown in Table 1. As well as being typed or labeled, nodes and edges are further characterized by a number of properties, outlined in Table 2. These properties are useful both in displaying results to users and in specifying and refining queries; examples of this are shown in Section 4. In general, the properties relate to naming, locations in source code (where applicable), and other statically determinable information such as qualifiers and positional information.
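To make the model concrete, the following Cypher sketch builds a fragment of Figure 2's graph by hand. It is illustrative only: Frappé's extractor writes the store directly rather than issuing Cypher, and the line number on the calls edge is invented for the example.

// a fragment of Figure 2's graph, populated by hand for illustration
CREATE (mainc:file {short_name: 'main.c'}),
       (main:function {short_name: 'main'}),
       (bar:function {short_name: 'bar'}),
       (mainc) -[:file_contains]-> (main),
       (main) -[:calls {use_start_line: 3}]-> (bar)  // line number invented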
4. USE CASES
Using the graph model outlined in Section 3, Frappé is able to support a variety of use cases. In terms of query processing, these range from simple index lookups to reachability queries, transitive closures, and complex pattern matching.

4.1 Code Search
Whether finding a macro mentioned in a bug report or a vaguely remembered utility function, code search provides a quick entry point to exploring the source code related to a particular task. In large codebases, querying on name alone can produce an unmanageable number of results, so it is important to take advantage of the fact that users typically know the type of the entity they are looking for (e.g. a struct as opposed to a macro) and have some idea of where, at a high level, the entity is defined (e.g. in a particular directory or module). Searching by name, type and location with the graph model requires an index of symbol names with wildcard or fuzzy matching support, and the ability to filter not only on TYPE and node properties in general, but also on the surrounding graph structure. For example, to return only fields named 'id' present in the module wakeup.elf, any fields that are not reachable from the node representing that module via a sequence of file_contains, compiled_from and linked_from edges can be eliminated.

START m=node:node_auto_index('short_name: wakeup.elf')
MATCH m -[:compiled_from|linked_from*]-> f
WITH distinct f
MATCH f -[:file_contains]-> (n:field {short_name: 'id'})
RETURN n

Figure 3. Symbol search constrained by module

Figure 3 shows how this is achieved in Neo4j's Cypher language. The first MATCH statement matches the module node m to all files f in the transitive closure (*) of outgoing edges (-[]->) of type compiled_from or linked_from. The WITH statement carries forward only the set of distinct files in that closure. The second MATCH statement returns all fields with a SHORT_NAME property of 'id' that have an incoming file_contains edge from a file in that set.

4.2 Cross Referencing and Code Navigation
Once a developer has a given source file open in their editor, moving quickly within and between files becomes crucial to their productivity. This facility is achieved through two core actions. The first, go-to-definition, automatically navigates the user from the symbol reference currently under their cursor in a given source file directly to its correct underlying definition. This is useful for quickly understanding and verifying the type of a variable or the behavior of a particular function. The second core action, find-references, finds all locations where the symbol under the cursor is referenced, and lets users inspect each in turn. This helps in understanding how a given function, type or macro is used when writing new code, and in ensuring all references to a symbol are consistent after it has been refactored.
Table 1. Node and edge types

Node types: directory, enum_def, enumerator, field, file, function, function_decl, function_type, global, global_decl, local, macro, module, parameter, primitive, static_local, struct, struct_decl, typedef, union, union_decl

Edge types: calls, casts_to, compiled_from, contains, declares, dereferences, dereferences_member, dir_contains, expands_macro, file_contains, gets_align_of, gets_size_of, has_local, has_param, has_param_type, has_ret_type, includes, interrogates_macro, isa_type, link_declares, link_matches, linked_from, linked_from_lib, reads, reads_member, takes_address_of, takes_address_of_member, uses_enumerator, writes, writes_member

Table 2. Node and edge properties

Node property: Description
TYPE: The node's type. See Table 1.
SHORT_NAME: The symbol name (e.g. main); for file nodes, the file name.
NAME: The symbol name including its parent (e.g. message::id); for file nodes, the file path.
LONG_NAME: The fully qualified symbol name (e.g. message::get_id(int)).
VALUE: For nodes with TYPE enumerator only. The enumerator's integer value.
VARIADIC, VIRTUAL: For nodes with TYPE function only. Present if the function is variadic or virtual, respectively.
IN_MACRO: Present if the node results from a macro expansion.

Edge property: Description
USE_FILE_ID, USE_START_LINE, USE_START_COL, USE_END_LINE, USE_END_COL: The source range of the expression corresponding to the edge (e.g. the complete call site for a calls edge), or of the macro expansion that produces it.
NAME_FILE_ID, NAME_START_LINE, NAME_START_COL, NAME_END_LINE, NAME_END_COL: The source range of the representative token corresponding to the edge (e.g. the function name token for a calls edge).
ARRAY_LENGTHS, BIT_WIDTH, QUALIFIERS: For isa_type edges only. These properties further describe the nature of the type use: the constant dimension sizes of declared arrays, the bit widths of fields, and the coded string of type qualifiers in spoken order: ] for array, * for pointer, c for const, v for volatile, and r for restrict.
INDEX: For edges of type has_param and has_param_type only. The parameter position.
LINK_ORDER: For edges of type linked_from only. The link order.
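As an example of filtering on these properties, the following sketch lists the variadic functions contained in a file, using Cypher 2.x's has() to test the presence-style VARIADIC property from Table 2. The file name is purely illustrative, and property-name casing here follows the paper's figures and tables, which mix cases.

// variadic functions defined in a given file (file name illustrative)
START f=node:node_auto_index('short_name: printk.c')
MATCH f -[:file_contains]-> (fn:function)
WHERE has(fn.VARIADIC)
RETURN fn.short_name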
START n=node:node_auto_index('short_name: id')
WHERE (n) <-[{NAME_FILE_ID: 222525, NAME_START_LINE: 104, NAME_START_COL: 16}]- ()
RETURN n

Figure 4. Go to definition

The go-to-definition action can be thought of as a code search on the symbol name under the cursor, where the results are constrained not by the location of their definitions, but by the location of their references. In terms of the graph model, this amounts to eliminating results that do not have an incoming edge (<-[]-) whose NAME* file and source range properties match the position in the file of the start of the symbol under the cursor, as in Figure 4. The find-references action can then be thought of as simply listing the incoming edges of the result of the go-to-definition query.
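For a field such as the 'id' of Figure 4, a minimal find-references sketch lists each member read, write, and address-taking edge together with its use location. In practice the definition node would first be disambiguated exactly as in Figure 4, and other symbol kinds would list their own reference edge types.

// references to a field, with the source position of each use
START n=node:node_auto_index('short_name: id')
MATCH n <-[ref:reads_member|writes_member|takes_address_of_member]- user
RETURN user, ref.USE_FILE_ID, ref.USE_START_LINE, ref.USE_START_COL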
4.3 Debugging
While being able to quickly jump to the definition or references of a symbol is useful in itself, these navigation tasks are often used to manually explore code paths as part of a broader goal that can be addressed more directly. Debugging is one such use case. If a global variable, for example, is known to be in an invalid state at a certain point in the program, the ultimate goal of the user is to find where that invalid value is coming from. Inspecting all the places where the global variable is written via find-references can be time-consuming and unproductive in large code bases, as a large number of results are irrelevant. Instead, only the writes that happen before the point in the program that is known to have an invalid state need to be considered. If, at an earlier point in the program execution, the state is known to be valid, that fact can be used to further bound the writes that need to be considered.

START from=node:node_auto_index('short_name: sr_media_change'),
      to=node:node_auto_index('short_name: get_sectorsize'),
      b=node:node_auto_index('short_name: packet_command')
MATCH writer -[write:writes_member]-> ({SHORT_NAME: 'cmd'}) <-[:contains]- b
WITH to, from, writer, write
MATCH direct <-[s:calls]- from -[r:calls {use_start_line: 236}]-> to
WHERE r.use_start_line >= s.use_start_line AND direct -[:calls*]-> writer
RETURN distinct writer, write.use_start_line

Figure 5. Paths where field cmd is written

While the relative control flow ordering of edges (i.e. the fact that one call, read, write, etc. happens before or after another) is not currently captured in the graph model, Figure 5 shows an example query using a comparison of the USE_START_LINE property as an approximation. In this example, the value stored in the field 'cmd' is known to be correct at the beginning of the function 'sr_media_change' and invalid on entering the function 'get_sectorsize'.

4.4 Code Comprehension
One of the more difficult aspects of working on a large codebase is understanding how the various parts of the system fit together and affect one another in a general sense. Program slicing is a well-known technique used for this purpose [12]. Given a seed statement or region in the code that is of interest, a forward program slice is the subset of statements in the whole program that are transitively impacted by the seed region. A backward slice, on the other hand, is the subset of statements that the seed region depends on. One of the simplest approximations of a program slice is the transitive closure of the call graph, i.e. the calls edges of the graph model. A backward slice is then the transitive closure of outgoing calls edges from a seed function, and represents all functions that, if modified, could alter the behavior of that function. Similarly, a forward slice is the transitive closure of incoming calls edges and represents all code that may be affected if the seed function is changed. An example backward slice query is shown in Figure 6. The same idea can be applied to other edge types too, such as file includes, or to macro expansions to see all code potentially affected by the seed macro.

START n=node:node_auto_index('short_name: pci_read_bases')
MATCH n -[:calls*]-> m
RETURN distinct m

Figure 6. Transitive closure of outgoing calls

Beyond transitive closures, shortest path queries are also useful in understanding how the parts of a codebase fit together. When commencing a new task in an unfamiliar region of the codebase, it is useful to understand how execution might reach it from a common entry point, or from a more familiar part of the code.
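A hedged sketch of such a query follows, using the kernel's start_kernel entry point and the pci_read_bases function of Figure 6 purely as an illustration; the hop bound keeps the search tractable.

// one shortest call path from an entry point to a function of interest
START from=node:node_auto_index('short_name: start_kernel'),
      to=node:node_auto_index('short_name: pci_read_bases')
MATCH p = shortestPath(from -[:calls*..20]-> to)
RETURN p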
5. QUERYING THE LINUX KERNEL
To illustrate the performance and memory requirements of the above use cases on an open codebase, we ran Frappé over Oracle's Unbreakable Enterprise Kernel version 2.6.32, comprising 11.4 million lines of code.

5.1 Graph characteristics
We extracted all nodes and edges described in Table 1, resulting in just over half a million nodes and close to four million edges, for a ratio of roughly 1:8 (see Table 3). All nodes, properties, relationships and indexes were stored in a Neo4j database of close to 800MB; see Table 4 for the complete breakdown.

Table 3. Graph metrics
Node count: 508 032
Edge count: 3 991 063
Graph density: 0.0000155

Table 4. Database size (MB)
Properties: 465
Nodes: 8
Relationships: 149
Indexes: 129
Total: 787

Figure 7 shows the node degree (in plus out) distribution, using a log scale for the count of nodes at each degree. A large majority of nodes have a small degree, whereas a few nodes have a huge degree. These latter nodes are normally primitives and other commonly used types (e.g. int, with degree 79K) as well as common constants (e.g. NULL, with degree 19K) that are referenced in many places throughout the codebase.

Figure 7. Linux kernel node degree distribution (node count on a log scale against node degree, in + out)
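The distribution in Figure 7 can be recomputed directly from the store; a minimal Cypher sketch for it, counting in- and out-edges together, is:

// degree distribution: how many nodes have each total degree
MATCH (n)
OPTIONAL MATCH (n) -[r]- ()
WITH n, count(r) AS degree
RETURN degree, count(*) AS node_count
ORDER BY degree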
5.2 Use Case Benchmarks with Neo4j
Each of the example queries from Section 4 was run over the Linux kernel graph with Neo4j 2.2.1 Community Edition. Each query was run ten times with a cold cache and ten times with a warm cache on a server with 8 Intel Xeon E5-2690 v2 CPUs @ 3.00GHz and 128GB RAM. The Java virtual machine's maximum heap was set to 2GB.

Table 5 reports the performance of the queries with both a cold and a warm cache. Cold runs of the simpler queries (code search and cross-referencing) take ~3 seconds, and warm runs bring that down to ~100ms, more than adequate for those types of queries. The debugging example takes a little longer, at 3 to 4 seconds cold and ~300ms warm, while the code comprehension example does not terminate within 15 minutes.

Table 5. Query performance (time cold / warm, in ms)

Query                    Min          Avg          Max          Result count
Code search (Fig. 3)     2567 / 97    2816 / 114   3205 / 145   1
X-referencing (Fig. 4)   2641 / 96    3079 / 121   3681 / 144   1
Debugging (Fig. 5)       3749 / 258   4020 / 308   4352 / 410   26
Comprehension (Fig. 6)   > 15 mins, aborted                     1664 (*)

(*) Computed via Neo4j's Java API in ~20ms.

6. EXPERIENCES AND CHALLENGES
In this section we outline the challenging aspects of implementing the prototype version of Frappé on top of the Neo4j graph database, as well as some of the unresolved issues in making Frappé practical for real-world deployment on codebases in the order of 10s of millions of lines of code. From the succinct example queries in Section 4 and the benchmark results in Section 5.2, it is clear that Neo4j and Cypher are a promising solution for a subset of source code querying use cases. To be a suitable all-round solution, however, a number of issues remain.

6.1 Query Performance
While Neo4j was able to process the queries for code search and code navigation with sub-second response times in large graphs (sufficiently fast for those tasks, and also achievable with relational models), performance was mixed in the remaining use cases, precisely those a graph model is better suited to. This was largely due to suboptimal graph explorations being chosen by the Cypher execution engine. For example, while transitive closure is expressible in Cypher, its associated runtime is unreasonable. We instead implemented transitive closure ourselves by traversing the graph directly via Neo4j's embedded Java API (bypassing Cypher) to achieve sub-second performance. For the debugging use case, however, a specialized implementation is not a viable workaround: a general pattern matching query language is needed, as the shape of the pattern to be matched varies from one bug to the next.

6.2 Graph Model and Query Language
Modeling the disparate information from the preprocessor, abstract syntax tree, directory structure and linker as a connected property graph was relatively straightforward. The current model supports a range of use cases that are reasonably succinct to specify in Cypher. Further improvements can be made by taking advantage of new features in Neo4j 2, specifically node labels. Labels would allow nodes to carry both their underlying type (e.g. function, struct, union) and a grouped type (e.g. symbol, type, container). For example, the first Cypher query in Table 6, which finds all nodes that are both containers and symbols named "foo", becomes the second query in the table.

Table 6. Newer Cypher syntax

Cypher 1.x:
START n=node:node_auto_index("(TYPE: struct TYPE: union TYPE: enum ... <and so on>) AND NAME: foo")

Cypher 2.x:
MATCH (n:container:symbol {name: "foo"})

Edges may be grouped in a similar manner (e.g. link, preprocessor, containment, etc.), but unfortunately Neo4j does not extend its label support to edges.

The main issue with the current model is its representation of symbol references as edges. Due to the C preprocessor, the source file where a reference occurs in the code is not necessarily the same as that of either end node, so the edge and the source file need to be associated directly. As Neo4j does not support hyperedges, however, the NAME_FILE_ID and USE_FILE_ID properties are used to create the association instead. This makes matching all the references (e.g. calls, writes, reads, etc.) within a file much clumsier than it could be. One workaround for the lack of hyperedge support is to model references as nodes instead. For example, foo -[:calls]-> bar, where an edge property associates the containing file, would become foo -[:calls]-> callsite -[:calls]-> bar and file -[:contains]-> callsite. With this option, specifying a match for the references associated with a particular file improves, but specifying matches in general becomes at best less succinct and at worst impossible: while Cypher supports matching repeated edges (via *), repeating patterns of edges and nodes cannot be expressed. A possible mitigation is to also keep the original edge as a shortcut, i.e. retain foo -[:calls]-> bar, but this would still not allow repeated edges to be filtered by their location.
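Under such a reference-node model, the per-file matching that motivates it does read naturally. A sketch, with a hypothetical callsite label and illustrative file name, contrasted with filtering on USE_FILE_ID edge properties:

// all calls whose call site lies in a given file (callsite label hypothetical)
START f=node:node_auto_index('short_name: foo.c')
MATCH f -[:contains]-> (site:callsite),
      caller -[:calls]-> site -[:calls]-> callee
RETURN caller, callee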
6.3 Evolving Codebases as Temporal Graphs
In many circumstances, supporting the above use cases for only the latest snapshot of a codebase is already an improvement on what is currently possible for large C/C++ systems. In reality, however, developers are rarely making changes based on the latest snapshot. Instead, they work on top of versions that are days, months or years old, depending on the scope of the bug fix, feature addition, or backport being implemented. As such, Frappé would ideally support queries against all these versions of the code.

One of the simplest ways to achieve this is to include the graph store Frappé generates in the version control system alongside the source code it was derived from. This ensures that whenever the source code is checked out, the correct version of the graph store is locally available. The large size of the graph store (~1GB for the Unbreakable Enterprise Kernel, and an order of magnitude bigger for other large systems) reduces the appeal of this approach: it blows out the size of the version control repository and, with the 100s to 1000s of developers that work on these larger systems, creates significant network traffic.

Another option is to maintain the graph data for all versions centrally. The simplest approach is to store and query each version in isolation, routing traffic based on a specified version. This has two main drawbacks, however. Firstly, as large codebases evolve slowly, most of the graph data extracted remains the same from one version to the next, so increasing numbers of duplicate nodes, edges and properties are needlessly stored over time. Secondly, it fails to take advantage of the potential to query across versions. This is particularly useful in the software engineering domain, where understanding what has changed between versions, and the wider effects of those changes, is a common and difficult task in large codebases, known as software change impact analysis [9]. While it is possible to incorporate this temporal aspect into the graph model itself, doing so makes querying much clumsier and at times impossible, as noted in Section 6.2. A more comprehensive solution is needed to efficiently store and query the delta of the program data as a large codebase evolves.
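To illustrate the clumsiness noted above, one straightforward encoding adds validity ranges to every node and edge; even the simple transitive closure of Figure 6 must then thread a version predicate through every path element, and nodes would need the same treatment. A sketch with hypothetical from_version/to_version properties and an arbitrary version number:

// Figure 6's closure, restricted to edges valid at version 42
// (from_version/to_version are hypothetical validity-range properties)
START n=node:node_auto_index('short_name: pci_read_bases')
MATCH p = n -[:calls*]-> m
WHERE ALL(r IN relationships(p)
          WHERE r.from_version <= 42 AND 42 < r.to_version)
RETURN distinct m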
7. RELATED WORK
For each of the challenges covered in the previous section (query performance, the property graph model and query language, and support for evolving graphs), there is a wide range of related work focused on each issue individually, but no single solution that addresses them all. As discussed, while Neo4j [4] provides an increasingly user-friendly graph query language [5] for pattern matching, its performance, although adequate for many use cases, is far from ideal on more complex queries, and it lacks support for efficiently storing and querying evolving graphs. PGX [10] and a number of other approaches are making significant improvements in pattern matching performance. LLAMA [11] and others focus on efficient storage of evolving graphs with minimal performance impact.

8. SUMMARY
Frappé is a source code querying system and, as such, comprises four elements: an extractor to pull the desired system model from the source code, a repository to store that data, an interface that allows the user to specify queries and view their results, and a query processor that handles queries by examining the repository. In this paper we described our experiences using the Neo4j graph database and its Cypher query language as the repository and query processor in our source code querying system. We gave an overview of the graph model used by Frappé for C source code and presented developer use cases for source code querying systems. The experimental data over the Unbreakable Enterprise Kernel show that complex queries lack performance when expressed in Cypher, whereas hand-written implementations of the same queries can return results in a sub-second timeframe. We outlined the challenges for graph databases, their graph model, and their query languages as they relate to source code querying over large, evolving codebases of millions of lines of code. Our experience shows that several open challenges remain for the graph database community in this area.

9. ACKNOWLEDGMENTS
We would like to thank Matthew Johnson and Edward Evans for their help in collecting data for this paper.

10. REFERENCES
[1] G. Robles, J.J. Amor, J.M. Gonzalez-Barahona, and I. Herraiz. 2005. Evolution and growth in large libre software projects. In Principles of Software Evolution, Eighth International Workshop on, 165-174. DOI: http://dx.doi.org/10.1109/IWPSE.2005.17
[2] Timothy C. Lethbridge and Nicolas Anquetil. 1997. Architecture of a source code exploration tool: A software engineering case study. TR-97-07, School of Information Technology and Engineering, University of Ottawa.
[3] Nathan Hawes and Ben Barham. 2014. Frappé: Using Clang to query and visualize large codebases. (October 2014). http://llvm.org/devmtg/2014-10/#talk23
[4] Neo Technology. 2015. Get Started. (March 2015). http://neo4j.com/developer/get-started/
[5] Neo Technology. 2015. Intro to Cypher. (March 2015). http://neo4j.com/developer/cypher-query-language/
[6] C. Cifuentes, N. Keynes, L. Li, N. Hawes, and M. Valdiviezo. 2012. Transitioning Parfait into a Development Tool. IEEE Security & Privacy 10, 3 (May 2012), 16-23. DOI: http://dx.doi.org/10.1109/MSP.2012.30
[7] Nathan Hawes. 2013. Code Maps: A scalable visualisation technique for large codebases. In 22nd Australasian Software Engineering Conference (ASWEC 2013). Engineers Australia, 33-34. http://search.informit.com.au/documentSummary;dn=443094612534872;res=IELENG
[8] Raoul-Gabriel Urma and Alan Mycroft. 2015. Source-code queries with graph databases, with application to programming language usage and evolution. Science of Computer Programming 97 (2015), 127-134.
[9] Robert S. Arnold and Shawn A. Bohner. 1996. Software Change Impact Analysis. IEEE Computer Society Press, Los Alamitos, CA, USA.
[10] Raghavan Raman, Oskar van Rest, Sungpack Hong, Zhe Wu, Hassan Chafi, and Jay Banerjee. 2014. PGX.ISO: Parallel and Efficient In-Memory Engine for Subgraph Isomorphism. In Proceedings of the Workshop on GRAph Data management Experiences and Systems (GRADES'14). ACM, New York, NY, USA, Article 5, 6 pages. DOI: http://dx.doi.org/10.1145/2621934.2621939
[11] D. Margo, P. Macko, V. Marathe, and M. Seltzer. 2014. LLAMA: Efficient Graph Analytics Using Large Multiversioned Arrays. Ph.D. Dissertation, Harvard University.
[12] M. Weiser. 1984. Program Slicing. IEEE Transactions on Software Engineering SE-10, 4 (July 1984), 352-357.