Cost-based Optimization of Graph Queries in Relational Database Management Systems


DISSERTATION
for the attainment of the academic degree Dr. rer. nat. in Computer Science
submitted to the Mathematisch-Naturwissenschaftliche Fakultät II, Humboldt-Universität zu Berlin
by Dipl.-Ing. (FH) Silke Trißl, M.Sc.

President of Humboldt-Universität zu Berlin: Prof. Dr. Jan-Hendrik Olbertz
Dean of the Mathematisch-Naturwissenschaftliche Fakultät II: Prof. Dr. Elmar Kulke
Reviewers:
1. Prof. Dr. Ulf Leser
2. Prof. Johann-Christoph Freytag, Ph.D.
3. Prof. Dr. Thorsten Grust
Submitted on:
Date of the oral examination:
Everything has an end, only the sausage has two. (Stephan Remmler)

Acknowledgement

This thesis would not have been possible without the help, support, and encouragement of many people. First of all, I would like to thank my supervisor Prof. Ulf Leser. He gave me the opportunity to start my PhD and provided a welcoming and pleasant working environment at Humboldt-Universität zu Berlin. I am greatly indebted to him for his patience, encouragement, and guidance during all these years with ups and downs. I could not have imagined a more motivated or dedicated advisor for my PhD study.

I am grateful to all who gave me the opportunity to partly finance my PhD by teaching. I met committed and inquiring students in the courses and exercises I taught for Prof. Ulf Leser at HU Berlin and Prof. Felix Naumann at HPI Potsdam. Dr. Márta Gutsche and the Frauenförderung at HU Berlin gave me the opportunity to spark interest in girls to study computer science. I thank Prof. Louiqa Raschid at the University of Maryland, who invited me for a research exchange to the US. I am also grateful to the BMBF, which supported my research.

I would not have finished this PhD thesis without the help and support of many colleagues and friends. Thanks to Jörg, Timo, and Philippe, who shared an office with me. Thanks to Jens, Melanie, Jana, Long, Roger, and Samira, who also accompanied me for a long time during my thesis. I want to acknowledge all researchers and students from the groups WBI and DBIS at HU, Informationssysteme at HPI, and Genetik und Biometrie at FBN. Many thanks for constructive criticism and helpful suggestions. I am greatly indebted to all colleagues who tried to cheer me up during common lunch and coffee breaks.

I acknowledge some students whom I met during my time in Berlin. Raphael and Philipp did a lot of programming in my first project, Columba. Johannes, Christoph, Florian, and André used some ideas of GRIPP in their Studien- or Diplomarbeiten and gave feedback on the algorithm.
Last but not least, I would like to thank my family, who shared joy and sorrow with me throughout this time. My parents have always lent a sympathetic ear to my worries and troubles; my heartfelt thanks for that. Also, many thanks to my sister. Whenever I needed to discuss a problem, she listened patiently and gave me good advice.
Abstract

Graphs occur in many areas of life. We are interested in graphs in biology, where nodes are chemical compounds, enzymes, reactions, or interactions, which are connected by either directed or undirected edges. Efficiently querying these graphs is a challenging task. In this thesis we present GRIcano, a system that efficiently executes graph queries. For GRIcano we assume that graphs are stored and queried using relational database management systems (RDBMS). We use an extended version of the Pathway Query Language PQL to express graph queries, for which we describe the syntax and semantics in this work. We employ ideas from RDBMS to improve the performance of query execution. Thus, the core of GRIcano is a cost-based query optimizer, which is created using the Volcano optimizer generator. This thesis makes contributions to all three required components of the optimizer: the relational algebra, the implementations, and the cost model.

Relational algebra operators alone are not sufficient to express graph queries. Thus, we first present new operators to rewrite PQL queries to algebra expressions. We propose the reachability operator φ, the distance operator Φ, the path length operator ψ, and the path operator Ψ. In addition, we provide rewrite rules for the newly proposed operators in combination with standard relational algebra operators.

Secondly, we present implementations for each proposed operator. The main contribution is GRIPP, an index structure that allows us to execute reachability queries on very large graphs containing directed edges. GRIPP has advantages over other existing index structures, which we review in this work. In addition, we show how to employ GRIPP and the recursive query strategy as implementations for all four proposed operators.

The third component of GRIcano is the cost model, which requires cardinality estimates for the proposed operators and cost functions for the implementations. Based on extensive experimental evaluation of the proposed implementations we present functions to estimate the cardinality of the φ, Φ, ψ, and Ψ operators and the cost of executing a query. The novelty of our approach is that these functions only use key figures of the graph. We finally demonstrate the effectiveness of GRIcano using exemplary graph queries on real biological networks.
Summary

Graphs are found in many areas of life; we are particularly interested in graphs from biology. Nodes in such graphs are chemical compounds, enzymes, reactions, or interactions, which are connected by directed or undirected edges. Executing graph queries efficiently is a challenge. In this thesis we present GRIcano, a system that allows graph queries to be executed efficiently. We assume that the graphs are stored in relational database management systems (RDBMS) and are also queried there. As graph query language we propose an extended version of the Pathway Query Language (PQL). The main component of GRIcano is a cost-based query optimizer, which is generated with the Volcano optimizer generator. This thesis contributes to all three required components of the optimizer: the relational algebra, the implementations, and the cost models.

The operators of relational algebra alone are not sufficient to express PQL queries. We therefore first introduce the new operators: the reachability operator φ, the distance operator Φ, the path length operator ψ, and the path operator Ψ. In addition, we give rules for transforming expressions that contain the new operators together with the standard operators of relational algebra.

Furthermore, we present implementations for each proposed operator. The main contribution here is GRIPP, an index structure that permits the efficient execution of reachability queries on very large graphs with directed edges. We show how GRIPP and the recursive query strategy can be used to provide implementations for all proposed operators.

The third component of GRIcano is the cost model, which requires cardinality estimates for the proposed operators and cost models for the implementations. Based on extensive experiments we propose functions for estimating the cardinalities of the operators φ, Φ, ψ, and Ψ. In addition, we derive functions for estimating the cost of executing graph queries. The novel aspect of the cost models is that the functions use only key figures of the graphs. Finally, we demonstrate the effectiveness of GRIcano with example queries on real biological networks.
Contents

1. Introduction: Queries on Graphs; Motivation; Contribution; Structure of this Work
2. Definitions and Terminology: Graphs; Definitions; Storage and Traversal; Relational Algebra; Algebra and Relations; Operators; Equivalence Rules; Cost-Based Query Optimization; Query Processing; Implementation of Operators; Cost Function and Query Optimization; Volcano
3. Graph Queries: Data Model; Graph Queries; Query Graph; Evaluation of Graph Queries; Pathway Query Language; Graphs in PQL; Syntax; PQL and Non-graph Relations; PQL Semantics; Semantics of Node Conditions; Semantics of Path Conditions; Semantics of HAVING Conditions; Semantics of the Subgraph Specification; Conversion to Relational Algebra; Related Work
4. Operators for Graph Queries: Operators for Nodes; Operators for Paths; Path Operator, Ψ; Reachability Operator, φ; Path Length Operator, ψ; Distance Operator, Φ; Summary; Related Work
5. Implementations for Operators: GRIPP; Index Structure; Reachability Queries; Distance Queries; Path Length and Path Queries; Other Index Structures; Transitive Closure; Dual Labeling; Label + SSPI; RDBMS Capabilities; Recursive Strategies; Summary; Related Work
6. Performance of GRIPP: Experimental Setup; Generated Graphs; Real-world Graphs; Implementation Details; Index Creation; Query Performance; Reachability Queries; Distance Queries; Path Length Queries; Path Queries; Comparison of Query Types; Summary
7. GRIcano: Cardinality Estimates; Reachability Operator; Distance Operator; Path Length Operator; Path Operator; Validation on Real World Graphs; Cost Functions; Reachability Queries; Distance Queries; Path Length Queries; Path Queries; Validation on Real World Graphs; GRIcano; Experimental Evaluation; Related Work; Cardinality and Cost Estimates; Rule-based Query Optimization; Cost-based Query Optimization
8. Conclusion and Outlook: Summary; Future Work

A. Strongly Connected Component
A.1. Kosaraju's Algorithm
B. Rewrite Rules for Operators
B.1. Path Operator
B.1.1. Restriction on Start and End Node
B.1.2. Path Operator and Other Operators
B.2. Path Length Operator
B.2.1. Restriction on Start and End Node
B.2.2. From Path Operator Ψ to Path Length Operator ψ
B.2.3. Path Length Operator and Other Operators
B.3. Distance Operator
B.3.1. Restriction on Start and End Node
B.3.2. From Path Operator Ψ to Distance Operator Φ
B.3.3. Distance Operator and Other Operators
B.4. Reachability Operator
B.4.1. Restriction on Start and End Node
B.4.2. From Path Operator Ψ to Reachability Operator φ
B.4.3. Reachability Operator and Other Operators
C. Additional Algorithms for GRIPP
C.1. Relational Schema for Storing GRIPP
C.2. Stop Node List for GRIPP
C.3. Reachability for Sets of Nodes
D. Graph Properties
E. Model Specification for Volcano
F. Cost and Cardinality Functions for Volcano
G. Exemplary Queries for GRIcano
1. Introduction

The topic of this work is cost-based optimization of graph queries in relational database management systems. In Section 1.1 we first introduce the kind of graphs that led us to this topic, before we proceed in Section 1.2 with the motivation for our approach. In Section 1.3 we summarize our contribution in the area of cost-based optimization of graph queries. Finally, in Section 1.4 we give an overview of this work.

1.1. Queries on Graphs

Graphs occur in many areas of life. Examples are public transport plans, road maps, the World Wide Web (WWW), or social networks. Common to all these graphs is that they consist of nodes and edges. Nodes are stations, junctions, web pages, or people. Edges in such networks are tracks, roads, links, or personal relationships. All these graphs have interesting features, but we are interested in graphs in biology. To understand the content of these graphs we first make a short digression to cell biology. For a more comprehensive introduction we refer the reader to Alberts et al. [AJW+08].

All biological cells are built in similar fashion, though there exist differences in the structure of cells between the three major groups: prokaryotes, eukaryotes, and archaea. All have in common that they contain a cell membrane as boundary to the outside and a genome, which holds information for building and maintaining the cell. In eukaryotes the genome is contained inside the nucleus, while in prokaryotes and archaea the genome is free in the cytoplasm. The genome consists of long stretches of DNA, the chromosomes. Genes are short regions of the genome that code for a functional product in the cell. During the transcription process genes are read and transcribed into RNA. Either the RNA itself is the functional product or the RNA, possibly with some modifications, is translated to proteins.

Proteins are the workhorses of a cell, as they catalyze reactions, process signals, or transport molecules. One class of proteins, the enzymes, catalyze chemical reactions, such as the degradation of sugar or the production of essential amino acids. Another class, the membrane proteins, reside inside the cell membrane and react to outer stimuli or facilitate the transport of substances in and out of a cell. When an outer stimulus occurs, membrane proteins may activate or inactivate proteins inside the cell to enhance or suppress reactions. There exist other protein groups such as histones, which are concerned with packing the DNA in the nucleus of eukaryotes, collagens, which occur mainly in muscle cells, or antibodies, which are required in higher organisms for the immune response.
To give an impression of the complexity of the problem, every human has about 250,000 different proteins in his or her body, according to current estimates. Each protein may interact with numerous other proteins or some of the hundreds of thousands of organic and inorganic substances. Biologists have studied these complex interactions involving proteins and other substances. Their knowledge is stored as graphs in publicly available data sources. Biological graphs may roughly be divided into three categories: metabolic networks, signaling pathways, and protein-protein interaction networks¹. For a review on different biological networks see [BN05].

¹ See Pathguide: the pathway resource list for a list of data sources.

Metabolic networks are graphs that represent the conversion of substances in a cell. Nodes in these networks are proteins, other molecules such as sugars or fatty acids, or reactions. Edges in such graphs are usually directed and indicate that a molecule participates in a reaction. The most familiar conversion is glycolysis, in which glucose is converted to pyruvate, producing energy during the conversion. Proteins and reactions participating in this conversion are said to be in the glycolysis pathway. In general, pathways in metabolic networks are subgraphs that stand for specific conversions defined by researchers. The pathways may overlap, i.e., they may share proteins or reactions. Data sources for metabolic networks are, for instance, KEGG [KGK+04], BioCyc [KOMK+05], and Reactome [JTGV+05]. Figure 1.1 shows the glycolysis given by KEGG. Circles are molecules that are converted, rectangular boxes on edges stand for reactions catalyzed by enzymes that are identified by their EC number, and the boxes with rounded corners represent other pathways.

Figure 1.1.: The glycolysis as given by KEGG. The circles are molecules that are converted, rectangular boxes on edges stand for reactions catalyzed by enzymes that are identified by their EC number, and the boxes with rounded corners stand for other pathways.

Signaling pathways are graphs that capture the information flow in a cell. Nodes in these graphs are usually proteins or reactions, while edges represent the flow of information. For example, Figure 1.2 shows the activation of protein kinase A (PKA) by an outer stimulus as given by BioCarta [htt11b]. The activated form of PKA regulates several reactions, including one reaction of the glycolysis presented in Figure 1.1. Depending on the outer stimulus glucose, PKA phosphorylates or dephosphorylates the complex of the two enzymes phosphofructokinase 2 and fructose-2,6-bisphosphatase. The phosphorylation status influences the reaction rate of the glycolysis.

Figure 1.2.: The activation of PKA through an outer stimulus from BioCarta.

The third group of biological graphs are protein-protein interaction networks. In these graphs nodes are proteins, while edges represent interactions between proteins and are usually undirected. Figure 1.3 shows known interactions for the protein complex phosphofructokinase 2 and fructose-2,6-bisphosphatase (PFKFB1) as given by String [vmjs+05], a data source for protein-protein interactions. The red node in the center is PFKFB1. It interacts with protein kinase A (PKACA) and several other proteins. The different colors of the edges encode different kinds of evidence, e.g., interactions found in other data sources are represented by blue edges, while interactions derived using text mining methods are shown by light green edges. Other data sources that contain data about protein-protein interactions are, for instance, DIP [XSD+02], BIND [BBH03], Intact [XSD+02], and PubGene [JLKH01].

Figure 1.3.: Known protein-protein interactions for the protein complex PFKFB1 in humans. The different colors of edges stand for different kinds of evidence, e.g., interactions found in other data sources are represented by blue edges, while interactions derived using text mining methods are shown by light green edges.

1.2. Motivation

The examples in the last section show only small parts of different biological graphs. Table 1.1 shows the number of nodes and edges of selected data sources. For example, KEGG contains 42,002 nodes and 51,450 edges in its reference pathway as of March 2011. The reference pathway is a summarization of the pathways of all species. In contrast, BioCyc stores an individual metabolic network for each of the roughly 400 species. In addition, unlike KEGG, BioCyc also represents relationships between genes and proteins.

Biologists use specialized graph viewing tools to display those graphs. For a review on the tools see Suderman & Hallett [SH07]. The tools usually display parts of the entire graph, e.g., a single pathway of a metabolic network, possibly with links to other pathways as shown in Figure 1.1. With such tools a biologist is only able to navigate through graphs.

Consider the question "How many steps does a cell require to produce the amino acid lysine given the substrate glucose?" A biologist may use the metabolic network of KEGG, where she has to start at glucose in the glycolysis pathway, follow the link to the pathway of the citrate cycle, and then follow the link to the pathway of the lysine biosynthesis. This way, she will count that 25 steps are required to produce lysine from the substrate glucose. Clearly, when manually navigating through the images of pathways a biologist might not find the shortest path, or occasionally might find no path at all although one exists. Thus, tools are required that allow users to pose queries such as the one presented above and return an answer to the user. In [HNM+00] van Helden and colleagues identified several other questions that are interesting for biologists:

• Get all reactions catalyzed by a given gene product.
• Find all metabolic pathways that convert compound A into compound B in less than X steps.
• Retrieve all genes whose expression is directly or indirectly affected by a given compound.
• Find all compounds that can be synthesized from a given precursor in less than X steps.

Currently, researchers have to write specialized programs to traverse the graphs to
answer such queries. Whenever they want to pose a new query, these programs need to be adjusted. In this work we present GRIcano to overcome this problem.

Table 1.1.: Sizes of biological graphs (in March 2011).

Biological graph                       Number of nodes    Number of edges
Metabolic networks
  KEGG [KGK+04]                        42,002             51,450
  BioCyc A. thaliana [KOMK+05]         10,951             23,649
  Reactome [JTGV+05]                   11,795             23,649
Signaling pathways
  BioCarta [htt11b]                    only images
  NetPath TGFβ [KMR+10]
  TransPath [KPV+06]                   > 100,000          > 240,000
Protein-protein interaction networks
  String [vmjs+05]                     > 2,500,000        > 50,000,000
  DIP [XSD+02]                         23,201             71,276
  Intact                               50,…               …,044

1.3. Contribution

In this work we present GRIcano, a novel tool that efficiently retrieves answers to graph queries. In GRIcano we employ ideas from query optimization in relational database management systems (RDBMS) and carry these ideas over to graph query optimization. In the following chapters we target several aspects of graph queries. We specifically make the following contributions:

Extend the existing query language PQL. We present and extend the Pathway Query Language (PQL) [Les05a], which was developed to express graph queries. Using PQL a user may express conditions of a graph query as predicates. In Chapter 3 we describe the syntax as well as the semantics of PQL.

Define relational operators to express PQL queries. In order to optimize a graph query we want to be able to alter the order in which predicates of the query are evaluated. We may achieve this by rewriting the PQL query to an algebraic expression and applying rewrite rules for transformation. As standard operators from relational algebra are not sufficient for expressing PQL queries, which we discuss in Chapter 4, we develop new operators in this thesis. We define the path operator Ψ, the path length operator ψ, the distance operator Φ, and the reachability operator φ to express predicates of graph queries and provide rewrite rules for the exchange of operators.

Propose and experimentally evaluate implementations for operators. For each proposed operator we have to provide implementations to compute the result. Thus, in Chapter 5 we discuss implementations to answer reachability, distance, path length, and path queries. We may use GRIPP, our newly developed index structure, for answering all four types of graph queries. Chapter 6 shows that we are able to compute the GRIPP index even for very large graphs, for which the transitive closure cannot be created. In addition, using GRIPP we are able to answer reachability queries on average in almost constant time regardless of the size and shape of the graph.

Develop functions to estimate cardinality of operators and cost of implementations. For cost-based query optimization we require cardinality estimates for the different operators and cost functions for each implementation. In Chapter 7 we develop equations that are based on key figures of the graph, which is to our knowledge a novel approach. Using our cost functions we correctly predict the result sizes and fastest implementations on generated as well as on real-world graphs.

Present and evaluate a prototypical implementation of GRIcano. In Chapter 7 we present GRIcano, the first system that performs cost-based query optimization for graph queries. The underlying cost-based query optimizer is generated using the Volcano framework [GM93]. Volcano requires as input the available operators and rewrite rules of the algebra, the available implementations for the different operators, and the equations for the cardinality and cost estimates. We show the effect of GRIcano using exemplary queries.

1.4. Structure of this Work

In Chapter 2 we introduce basic notation on graphs, relational algebra, and cost-based query optimization. Chapter 3 is devoted to a data model for storing graphs, graph queries, and PQL, a language to express graph queries. In Chapter 4 we first argue that PQL queries should be executed like standard SQL queries, i.e., by first transforming them to an algebraic expression. We motivate the need for new operators for the algebra and introduce the path operator Ψ, the path length operator ψ, the distance operator Φ, and the reachability operator φ. We also provide rewrite rules for exchanging operators. In Chapter 5 we provide implementations for the operators proposed in Chapter 4. We present GRIPP, an index structure to efficiently answer reachability queries even on large graphs. In Chapter 6 we experimentally evaluate the presented implementations. In Chapter 7 we devise functions to estimate cardinality for the four newly defined operators and cost functions for the different implementations. In that chapter we also introduce GRIcano, our graph query optimizer. We show the capabilities of GRIcano using selected queries. Chapter 8 concludes the work.
2. Definitions and Terminology

This chapter introduces basic notation on graphs, relational algebra, and query optimization. In Section 2.1 we formally define graphs and properties of graphs. Section 2.2 introduces fundamental concepts behind relational algebra. In Section 2.3 we present an introduction to cost-based query optimization in relational database management systems.

2.1. Graphs

This work mostly deals with graph-structured data. We therefore formally introduce graphs. For this purpose we adopt notation from Cormen et al. [CLR01].

2.1.1. Definitions

Definition 2.1 (Graph) A graph $G = (V(G), E(G))$ is a tuple consisting of a set of nodes $V(G)$ and a set of edges $E(G)$, with $E(G) \subseteq V(G) \times V(G)$.

Whenever the context of the graph is clear we may write $G = (V, E)$. There exist two types of graphs, directed and undirected graphs. Directed graphs have ordered pairs of nodes in $E$. In contrast, in undirected graphs the set $E$ contains unordered pairs of nodes. Consider $(u, v) \in E$ with $u, v \in V$ and $u \neq v$. In a directed graph only $v$ is adjacent to $u$, while in an undirected graph the relation is symmetric, i.e., $(u, v)$ is the same as $(v, u)$. If $(u, v) \in E$ in a directed graph we say node $u$ has the outgoing edge $(u, v)$ and therefore $u$ is the start node of $(u, v)$. In analogy, $(u, v)$ is an incoming edge of node $v$ and therefore $v$ is the target node of $(u, v)$. We call $u$ the parent of $v$ and $v$ the child of $u$.

Definition 2.2 (Size of a graph) Let $G = (V, E)$. The size of $G$ is the number of nodes $|V|$ plus the number of edges $|E|$ in $G$, i.e., $|G| = |V| + |E|$.

Based on the ratio between edges and nodes, which is called the density of a graph, we are able to divide graphs into two groups, sparse and dense graphs. The literature does not provide a clear distinction between the two types. As a rule of thumb, if the number of edges $|E|$ is close to $|V|^2$ the graphs are called dense; otherwise, if $|E| \ll |V|^2$, they are sparse.
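Definitions 2.1 and 2.2 can be illustrated with a minimal Python sketch. It is not code from this thesis: the class name `Graph`, its methods, and the concrete edge list are assumptions made for illustration. The edges are invented so that the counts match the 6 nodes and 8 edges stated for Figure 2.1; they are not copied from the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """Directed graph G = (V, E) with E a subset of V x V (Definition 2.1)."""

    nodes: frozenset
    edges: frozenset  # ordered pairs (u, v): u is the start node, v the target node

    def __post_init__(self):
        # Enforce E ⊆ V × V: every edge endpoint must be a known node.
        for u, v in self.edges:
            if u not in self.nodes or v not in self.nodes:
                raise ValueError(f"edge ({u}, {v}) uses a node outside V")

    def size(self) -> int:
        # Definition 2.2: |G| = |V| + |E|.
        return len(self.nodes) + len(self.edges)

    def density(self) -> float:
        # Ratio between edges and nodes, as described in the text.
        if not self.nodes:
            return 0.0
        return len(self.edges) / len(self.nodes)


# Hypothetical edge list (assumption): 6 nodes and 8 edges.
EXAMPLE = Graph(
    nodes=frozenset("abcdef"),
    edges=frozenset(
        [("a", "b"), ("a", "d"), ("a", "f"), ("b", "c"),
         ("b", "d"), ("c", "d"), ("d", "e"), ("e", "f")]
    ),
)

print(EXAMPLE.size())  # 14 = 6 nodes + 8 edges
# Rule of thumb from the text: |E| = 8 is well below |V|^2 = 36, so the example is sparse.
print(len(EXAMPLE.edges), len(EXAMPLE.nodes) ** 2)
```

Storing the edges as a set of ordered pairs mirrors the set-based definition directly; an adjacency list would be the more usual choice when the graph has to be traversed.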
Figure 2.1.: A directed graph. Circles represent nodes; arrows between nodes represent edges. Nodes in this example are uniquely labeled. The size of the graph is 14 (6 nodes plus 8 edges). For example, the degree of node b is deg(b) = 3.

To describe the shape of a graph we look at the distribution of node degrees. To do so, we first define the degree of a node.

Definition 2.3 (Degree of a node) Given a graph $G = (V, E)$. The degree $\deg(v)$ of node $v \in V$ is the number of edges in which $v$ participates.

If $G$ is directed we may distinguish between an indegree $\deg_{in}(v)$ and an outdegree $\deg_{out}(v)$ of a node $v$. The indegree is the number of edges with $v$ as target node and, in analogy, the outdegree is the number of edges with $v$ as start node.

Based on the distribution of the node degree we distinguish between different graph topologies. The distribution of the node degrees of random graphs follows a binomial distribution. Graphs where the distribution of the node degrees follows a power law are called scale-free. Barabási and Oltvai describe these topologies in [BO04].

Nodes and edges are often labeled. Therefore we define a label function for nodes and edges of a graph.

Definition 2.4 (Label function, φ) Let $L$ be a set of labels. A label function $\phi$ assigns labels to nodes and edges, $\phi(v, L) : V \to L$ and $\phi(e, L) : E \to L$.

In this work we assume each label $l \in L$ consists of a type and a value. Graphs also contain paths.

Definition 2.5 (Path and path length) Let $G = (V, E)$. A path $p$ is a sequence of nodes $\langle v_0, v_1, v_2, \ldots, v_k \rangle$, $v_i \in V$, such that $(v_{i-1}, v_i) \in E$ for $i = 1, 2, \ldots, k$. The length of the path is the number of edges in the path. If there exists a path $p$ from $u$ to $w$ we say $w$ is reachable from $u$, written as $u \leadsto w$.
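Definitions 2.3 and 2.5 can be sketched in Python as follows. This is an illustration under stated assumptions, not an implementation from this thesis: the function names and the edge set `EDGES` are hypothetical (the edges are invented so that deg(b) = 3, as in the caption of Figure 2.1), and reachability is checked with a plain breadth-first search rather than with GRIPP.

```python
from collections import deque

# Hypothetical directed edges (assumption): 8 edges over the nodes a..f.
EDGES = {
    ("a", "b"), ("a", "d"), ("a", "f"), ("b", "c"),
    ("b", "d"), ("c", "d"), ("d", "e"), ("e", "f"),
}


def in_degree(edges, v):
    """Number of edges with v as target node."""
    return sum(1 for (_, target) in edges if target == v)


def out_degree(edges, v):
    """Number of edges with v as start node."""
    return sum(1 for (start, _) in edges if start == v)


def degree(edges, v):
    """Number of edges in which v participates (a self-loop counts once)."""
    return sum(1 for edge in edges if v in edge)


def is_path(edges, seq):
    """True if every consecutive pair (v_{i-1}, v_i) of seq is an edge."""
    if not seq:
        return False
    return all((u, v) in edges for u, v in zip(seq, seq[1:]))


def path_length(seq):
    """Length of a path = number of edges = number of nodes minus one."""
    return len(seq) - 1


def reachable(edges, u, w):
    """Breadth-first search: is there a path from u to w?"""
    successors = {}
    for start, target in edges:
        successors.setdefault(start, []).append(target)
    seen = {u}
    queue = deque([u])
    while queue:
        current = queue.popleft()
        if current == w:
            return True
        for nxt in successors.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


print(degree(EDGES, "b"))                      # 3: edges (a,b), (b,c), (b,d)
print(is_path(EDGES, ["a", "b", "c", "d"]))    # True, path of length 3
print(reachable(EDGES, "a", "f"), reachable(EDGES, "f", "a"))  # True False
```

Note that `reachable(edges, u, u)` returns True, because Definition 2.5 admits a path of length 0 consisting of a single node. The search visits each node and edge at most once, so a single reachability check costs time linear in the size of the graph.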