Mining Correlated Subgraphs in Graph Databases




Tomonobu Ozaki (1) and Takenao Ohkawa (2)

(1) Organization of Advanced Science and Technology, Kobe University
(2) Graduate School of Engineering, Kobe University
1-1 Rokkodai-cho, Nada, Kobe, 657-8501, Japan
{tozaki@cs., ohkawa@}kobe-u.ac.jp
http://www25.cs.kobe-u.ac.jp/

Abstract. In this paper, we bring the concept of hyperclique patterns in transaction databases into graph mining and consider the discovery of sets of highly correlated subgraphs in graph-structured databases. To discover frequent hyperclique patterns in graph databases efficiently, a novel algorithm named HSG is proposed. By considering the generality ordering of subgraphs, HSG employs a depth-first/breadth-first search strategy with powerful pruning techniques based on the upper bound of the h-confidence measure. The effectiveness of HSG is assessed through experiments with real-world datasets.

1 Introduction

Recently, correlation mining, which extracts the underlying dependencies among objects, has attracted much attention, and extensive studies have been reported [25,23,7,15,12]. Among these studies, we focus on hyperclique pattern discovery [26,27] in this paper. While most research aims at finding mutually dependent pairs of objects efficiently, a hyperclique pattern is a set of highly correlated items that has a high value of an objective measure named h-confidence [26,27]. The h-confidence measure of an itemset $P = \{i_1, \dots, i_m\}$ is designed to capture strong affinity relationships and is defined as follows.

$$hconf(P) = \min_{l=1,\dots,m} \{conf(\{i_l\} \rightarrow P \setminus \{i_l\})\} = sup(P) / \max_{l=1,\dots,m} \{sup(\{i_l\})\}$$

where $sup$ and $conf$ are the conventional definitions of support and confidence in association rules [1], respectively. A hyperclique pattern $P$ states that the occurrence of an item $i_l \in P$ in a transaction implies the occurrence of all other items $P \setminus \{i_l\}$ in the same transaction with probability at least $hconf(P)$.
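As a minimal sketch of this definition (not taken from the paper), the following Python snippet computes support and h-confidence over a toy list of transactions and checks that the two forms of the definition agree. The function names `support`, `h_confidence` and `min_confidence`, as well as the example data, are assumptions introduced here purely for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def h_confidence(itemset, transactions):
    """hconf(P) = sup(P) / max_l sup({i_l}); returns 0.0 if no item ever occurs."""
    max_single = max(support({i}, transactions) for i in itemset)
    return support(itemset, transactions) / max_single if max_single > 0 else 0.0


def min_confidence(itemset, transactions):
    """Equivalent form: the minimum over l of conf({i_l} -> P minus {i_l}).

    Assumes every single item occurs in at least one transaction.
    """
    sup_p = support(itemset, transactions)
    return min(sup_p / support({i}, transactions) for i in itemset)


# Toy data (hypothetical): four transactions over the items a, b, c.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]

print(h_confidence({"a", "b"}, transactions))
print(min_confidence({"a", "b"}, transactions))
```

In this toy database, $sup(\{a, b\}) = 2/4$ and both single items have support $3/4$, so both forms give $2/3$.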
In addition, the cosine similarity between any pair of items in $P$ is greater than or equal to $hconf(P)$ [27]. Owing to these features, hyperclique pattern discovery has been applied successfully to several real-world problems [9,18,24].

While hyperclique pattern discovery aims at finding valuable patterns in transaction databases, structured data has recently become increasingly abundant in many application domains. Although we can easily expect to obtain a more powerful tool for structured data by introducing correlation mining, most current research on correlation mining is designed for transaction databases, and little attention has been paid to mining correlations from structured data. Motivated by this background, in this paper we tackle the problem of hyperclique pattern discovery in the context of graph mining [21,22] and discuss the effectiveness of correlation mining in structured domains.

T. Washio et al. (Eds.): PAKDD 2008, LNAI 5012, pp. 272-283, 2008. © Springer-Verlag Berlin Heidelberg 2008

The basic idea of hyperclique patterns in graph databases is simple: instead of items, we employ subgraphs (i.e., patterns) as the building blocks of hyperclique patterns. While this simple replacement might seem trivial, it brings both new expectations and new difficulties. On the one hand, the proposed framework extracts sets of mutually dependent or affinitive patterns in graph databases. Because each pattern gives another view of the other patterns in the same set, we can expect to obtain new findings and precise insights. On the other hand, as is easily imagined, hyperclique pattern discovery in graph databases is much harder than the traditional task, because there are exponentially many subgraphs in graph databases and any combination of those subgraphs is a potential candidate. In order to alleviate this combinatorial explosion and to discover hyperclique patterns efficiently, we propose a novel algorithm named HSG. HSG reduces the search space effectively by taking into account the generality ordering of hyperclique patterns.

The main contributions of this paper are briefly summarized as follows. First, we formulate the new problem of hyperclique pattern discovery in graph databases. Second, we propose a novel algorithm named HSG for solving this problem efficiently. Third, through experiments with real-world datasets, we assess the effectiveness of our proposal.

This paper is organized as follows.
In Section 2, after introducing basic notation, we formulate the problem of hyperclique pattern discovery in graph databases. In Section 3, the proposed algorithm HSG is explained in detail. After discussing related work in Section 4, we show the results of the experiments in Section 5. Finally, we conclude the paper and describe future work in Section 6.

2 Preliminaries

Let $L$ be a finite set of labels. A labeled graph $g = (V_g, E_g, l_g)$ on $L$ consists of a vertex set $V_g$, an edge set $E_g$ and a labeling function $l_g : V_g \cup E_g \rightarrow L$ that maps each vertex or edge to a label in $L$. Hereafter, we refer to a labeled graph simply as a graph. Each graph can be represented by a so-called code word [3,28], which is a unique string consisting of a series of edges associated with connection information. In particular, we employ the canonical code word [3,28], which is the minimal code word among isomorphic graphs, to represent each graph. The lexicographic order on code words gives a total order on graphs. Given two graphs $g$ and $g'$, $g <_{lex} g'$ denotes that the code word of $g$ is lexicographically earlier than that of $g'$. If the code word of $g$ is a prefix of that of $g'$, we denote it as $g <_{pfx} g'$. Examples of graphs and their code words are shown in Fig. 1.
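The two orders on code words can be illustrated with a small sketch. It assumes, purely for illustration, that a code word is a Python tuple of edge tuples `(from, to, from_label, edge_label, to_label)` and that plain tuple comparison stands in for the lexicographic order; the exact order used for canonical code words in [3,28] may differ in detail. The names `lex_less` and `pfx_less` and the three code words are hypothetical.

```python
def lex_less(code_a, code_b):
    """g <_lex g': the code word of g is lexicographically earlier."""
    return code_a < code_b


def pfx_less(code_a, code_b):
    """g <_pfx g': the code word of g is a proper prefix of that of g'."""
    return len(code_a) < len(code_b) and code_b[:len(code_a)] == code_a


# Hypothetical code words; every edge carries the same edge label "x".
g_small = ((0, 1, "A", "x", "B"),)
g_large = ((0, 1, "A", "x", "B"), (1, 2, "B", "x", "C"))
g_other = ((0, 1, "B", "x", "C"),)

print(pfx_less(g_small, g_large))  # g_small is a prefix of g_large
print(lex_less(g_small, g_other))  # "A" sorts before "B"
print(pfx_less(g_small, g_other))  # neither is a prefix of the other
```

Under this representation a proper prefix also sorts first, so `pfx_less(a, b)` implies `lex_less(a, b)`.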

[Fig. 1. Examples of Labeled Graphs and their Code Words. The figure shows six graphs $g_0, \dots, g_5$ together with their code words; all edge labels are assumed to be the same. For example, the relations below hold: $g_0 <_{lex} g_4$, $g_1 <_{lex} g_2$, $g_3 <_{lex} g_5$, $g_0 <_{pfx} g_1$, $g_2 <_{pfx} g_3$, $g_4 <_{pfx} g_5$.]

A graph $g = (V_g, E_g, l_g)$ is called a subgraph of another graph $g' = (V_{g'}, E_{g'}, l_{g'})$, denoted as $g \sqsubseteq g'$, if there exists an injective function $f : V_g \rightarrow V_{g'}$ such that $\forall u \in V_g\ [l_g(u) = l_{g'}(f(u))]$ and $\forall (u, v) \in E_g\ [(f(u), f(v)) \in E_{g'} \wedge l_g(u, v) = l_{g'}(f(u), f(v))]$. If $g \sqsubseteq g'$, then we say that $g$ is more general than $g'$. Note that if $g <_{pfx} g'$ holds, then $g \sqsubseteq g'$ also holds [3,28].

Based on the subgraph relationship, we consider the joint occurrence of a set of subgraphs in a graph. The most intuitive definition is as follows: given a set of subgraphs $G$ and a graph $g'$, if $\forall g_i \in G\ [g_i \sqsubseteq g']$ holds, then $G$ is considered to occur in $g'$. However, this simple definition might not be suitable for hyperclique patterns of subgraphs, because a large number of uninteresting combinations of subgraphs having large overlaps in a graph would be obtained. Therefore, we introduce another definition that takes edge-disjointness into account to suppress the redundancy. Given a set of $m$ subgraphs $G = \{g_1, \dots, g_m\}$ and a graph $g'$, $G$ is called a set of k-edge disjoint subgraphs of $g'$, denoted as $G \sqsubseteq_k g'$, if there exists a set of injective functions $\{f_i : V_{g_i} \rightarrow V_{g'} \mid i = 1, \dots, m\}$ satisfying the following two conditions:

(1) $\forall g_i \in G\ [g_i \sqsubseteq g']$, where each $f_i$ is the function that embeds $g_i$ into $g'$
(2) $\sum_{i=1}^{m} |E_{g_i}| - \left|\bigcup_{i=1,\dots,m} \{(f_i(u), f_i(v)) \mid (u, v) \in E_{g_i}\}\right| \le k$

The second condition constrains the edge overlaps: the total number of pattern edges may exceed the number of distinct edges they cover in $g'$ by at most $k$. By this constraint, redundant combinations can be expected to be kept under control. For example, in Fig. 1, while both $g_1 \sqsubseteq g_3$ and $g_2 \sqsubseteq g_3$ hold, if $k$ is set to 0, then $\{g_1, g_2\} \sqsubseteq_0 g_3$ does not hold, because the two subgraphs overlap on an edge of $g_3$ incident to the vertex labeled A.

We introduce the definitions of support and h-confidence for a set of subgraphs.
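Before turning to those definitions, the k-edge disjointness test of conditions (1) and (2) can be made concrete with a brute-force sketch. It is not the authors' implementation: it enumerates all injective vertex maps with `itertools.permutations`, which is only feasible for tiny graphs. The dictionary layout (`"v"` for vertex labels, `"e"` for edge labels), the helper names and the toy graphs are all assumptions made for this example.

```python
from itertools import permutations, product


def edge(u, v):
    """Undirected edge between two distinct vertices."""
    return frozenset((u, v))


def embeddings(pattern, graph):
    """Yield the set of image edges for every injective, label-preserving map."""
    p_vertices = list(pattern["v"])
    for image in permutations(graph["v"], len(p_vertices)):
        f = dict(zip(p_vertices, image))
        if any(pattern["v"][u] != graph["v"][f[u]] for u in p_vertices):
            continue  # some vertex label does not match
        mapped = set()
        for e, label in pattern["e"].items():
            u, v = tuple(e)
            target = edge(f[u], f[v])
            if graph["e"].get(target) != label:
                break  # edge missing in the graph or edge label mismatch
            mapped.add(target)
        else:
            yield frozenset(mapped)


def occurs_k_disjoint(patterns, graph, k):
    """True if some choice of embeddings overlaps on at most k edges."""
    choices = [set(embeddings(p, graph)) for p in patterns]
    total_edges = sum(len(p["e"]) for p in patterns)
    return any(total_edges - len(frozenset().union(*combo)) <= k
               for combo in product(*choices))


# Hypothetical toy data: a path A-B-C and two one-edge patterns.
path_abc = {"v": {0: "A", 1: "B", 2: "C"}, "e": {edge(0, 1): "x", edge(1, 2): "x"}}
p_ab = {"v": {0: "A", 1: "B"}, "e": {edge(0, 1): "x"}}
p_bc = {"v": {0: "B", 1: "C"}, "e": {edge(0, 1): "x"}}

print(occurs_k_disjoint([p_ab, p_bc], path_abc, 0))      # edge-disjoint
print(occurs_k_disjoint([p_ab, path_abc], path_abc, 0))  # one shared edge
print(occurs_k_disjoint([p_ab, path_abc], path_abc, 1))  # one overlap allowed
```

The pair `p_ab`, `p_bc` covers two distinct edges, so it passes with $k = 0$; the pair `p_ab`, `path_abc` has three pattern edges but covers only two distinct edges, so it needs $k \ge 1$.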
Let $D = \{d_1, \dots, d_N\}$ be a database of $N$ graphs. The support and h-confidence of a set of subgraphs $G = \{g_1, \dots, g_m\}$ in $D$ are defined as follows:

$$sup_D(G) = \sum_{d \in D} \sigma(G, d) / N \quad \text{where} \quad \sigma(G, d) = \begin{cases} 1 & (G \sqsubseteq_k d) \\ 0 & (\text{otherwise}) \end{cases}$$

$$hconf_D(G) = sup_D(G) / \max_{i=1,\dots,m} \{sup_D(\{g_i\})\}$$

Based on the above preparation, we formulate the problem of mining frequent hyperclique patterns in graph databases (HSG mining for short) below. Given a database $D$ of labeled graphs, a positive number called the minimum support $\sigma$ ($0 < \sigma \le 1$) and a positive number called the minimum h-confidence $h_c$ ($0 \le h_c \le 1$), the problem of HSG mining is to find all frequent hyperclique patterns of subgraphs $G$ in $D$ such that $sup_D(G) \ge \sigma$, $hconf_D(G) \ge h_c$ and the cardinality of $G$ is more than one. Note that, because we are interested in sets of mutually dependent subgraphs, the hyperclique patterns of cardinality one are excluded.

3 Mining Hyperclique Patterns of Subgraphs

In this section, we propose an algorithm named HSG for mining frequent hyperclique patterns in graph databases. Before describing the concrete algorithm, we show some properties of hyperclique patterns and a data structure called a conditional prefix tree of hyperclique patterns, both of which are used for effective pruning based on the generality ordering of hyperclique patterns.

3.1 Properties of Hyperclique Patterns

Given two sets $G_1$ and $G_2$ of subgraphs, if there exists an injective function $\phi : G_1 \rightarrow G_2$ which satisfies $\forall g \in G_1\ [g \sqsubseteq \phi(g) \in G_2]$, then we say that $G_1$ is more general than $G_2$ and denote it as $G_1 \preceq G_2$. As shown formally below, given a set of subgraphs $G_1$, there are two kinds of specialization for obtaining a more specific set of subgraphs $G_2$ from $G_1$. Note that, while only the first kind of specialization is considered in itemset mining, the second one also plays a key role in HSG mining.

(1) Specialization by addition: $G_2$ is obtained by adding a new subgraph $g'$ to $G_1$, i.e. $G_2 = G_1 \cup \{g'\}$.
(2) Specialization by replacement: $G_2$ is obtained by replacing a subgraph $g \in G_1$ with a more specific subgraph $g'$ ($\sqsupset g$), i.e. $G_2 = (G_1 \setminus \{g\}) \cup \{g'\}$.

The following two lemmas hold for hyperclique patterns of subgraphs under the generality ordering introduced above.

Lemma 1 (Anti-monotone property of support value). Given two sets $G_1$ and $G_2$ of subgraphs, if $G_1 \preceq G_2$, then $sup_D(G_1) \ge sup_D(G_2)$ holds.

Proof. Obvious from the definition of support value.
By this lemma, if a set of subgraphs $G_1$ does not satisfy the minimum support, then all sets of subgraphs $G_2$ s.t. $G_1 \preceq G_2$ can be safely eliminated from the candidates for frequent hyperclique patterns.

Lemma 2 (Upper bound of h-confidence). Given two sets of subgraphs $G_1 = G_A \cup G_B$ s.t. $G_A \neq \emptyset$, $G_A \cap G_B = \emptyset$ and $G_2 = G_A \cup G_C$ s.t. $G_A \cap G_C = \emptyset$, if $G_B \preceq G_C$, then the following inequality holds.

$$up(G_1, G_A) = sup_D(G_1) / \max_{g \in G_A} \{sup_D(\{g\})\} \ge hconf_D(G_2)$$

Proof. Since $G_A \subseteq G_2$, $\max_{g \in G_A} \{sup_D(\{g\})\} \le \max_{g' \in G_2} \{sup_D(\{g'\})\}$ holds. Because $G_B \preceq G_C$ implies $G_1 \preceq G_2$, $sup_D(G_1) \ge sup_D(G_2)$ also holds by Lemma 1. Therefore, $sup_D(G_1) / \max_{g \in G_A} \{sup_D(\{g\})\} \ge sup_D(G_2) / \max_{g' \in G_2} \{sup_D(\{g'\})\} = hconf_D(G_2)$ holds.

This lemma gives an upper bound of h-confidence. If $up(G_1, G_A)$ does not satisfy the minimum h-confidence $h_c$, then no set of subgraphs $G_2 = G_A \cup G_C$ s.t. $G_B \preceq G_C$ can satisfy $h_c$. Furthermore, this lemma also shows the anti-monotone property of h-confidence with respect to specialization by addition. By definition, $hconf_D(G_1) = up(G_1, G_1)$ holds. Thus, if $hconf_D(G_1) < h_c$, then no set of subgraphs obtained by adding some subgraphs to $G_1$ can satisfy $h_c$.

3.2 A Conditional Prefix Tree of Hyperclique Patterns

Here, we consider the enumeration of hyperclique patterns in graph databases. According to reverse search [2], repeated enumeration of the same pattern can be avoided by generating each pattern from its unique parent. In the case of hyperclique patterns of subgraphs, the parent can be uniquely defined by using the total order on graphs induced by code words. The parent of a set of subgraphs $G$, denoted as $p(G)$, is the set obtained by removing the smallest element with respect to $<_{lex}$ from $G$, i.e. $p(G) = G \setminus \{g \in G \mid \forall g' (\neq g) \in G\ [g <_{lex} g']\}$. Because of the anti-monotone property of hyperclique patterns with respect to specialization by addition shown in Lemmas 1 and 2, all subsets of a frequent hyperclique pattern must also be frequent hyperclique patterns. Furthermore, a hyperclique pattern should be enumerated via its parent to avoid repeated enumeration. Therefore, in our strategy, a new hyperclique pattern $G'$ is generated by joining two hyperclique patterns $G_1 = G \cup \{g_1\}$ and $G_2 = G \cup \{g_2\}$ as $G' = G \cup \{g_1\} \cup \{g_2\} = G_1 \cup \{g_2\}$. Note that enumeration via the parent is naturally realized through the join operation.
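The parent-based join can be sketched in a few lines. This is a simplified illustration rather than the HSG procedure itself: subgraphs are replaced by integers whose numeric order stands in for $<_{lex}$, a pattern is a sorted tuple of such integers, and `is_hyperclique` is any predicate that is anti-monotone with respect to addition (here a hypothetical sum threshold instead of real support and h-confidence tests). All names are assumptions.

```python
def parent(pattern):
    """p(G): drop the smallest element; patterns are sorted tuples of graph ids."""
    return pattern[1:]


def join_siblings(siblings, is_hyperclique, found):
    """Join patterns sharing one parent; each new pattern is generated exactly once.

    `siblings` must be ordered by their smallest element.
    """
    for idx, g1 in enumerate(siblings):
        children = []
        for g2 in siblings[:idx]:       # smallest element of g2 precedes that of g1
            candidate = (g2[0],) + g1   # union of g1 and g2; its parent is g1
            if is_hyperclique(candidate):
                found.append(candidate)
                children.append(candidate)
        join_siblings(children, is_hyperclique, found)
    return found


# Hypothetical anti-monotone predicate: the ids of a pattern sum to at most 6.
singletons = [(i,) for i in (1, 2, 3, 4)]
found = join_siblings(singletons, lambda p: sum(p) <= 6, [])
print(found)
```

Because every accepted candidate is stored under its own parent, the pattern `(1, 2, 3)` is produced only from `(2, 3)` joined with `(1, 3)`, never from another pair.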
Since a hyperclique pattern is generated by joining two hyperclique patterns having the same parent, it is convenient to treat all hyperclique patterns that have the same parent as a unit. Furthermore, in order to utilize the pruning based on the generality ordering effectively, the hyperclique patterns in this unit should be organized with the generality ordering in mind. Motivated by this, we propose a tree-shaped data structure called a conditional prefix tree of hyperclique patterns, on which our algorithm HSG works, for storing the hyperclique patterns that have the same parent.

A conditional prefix tree of hyperclique patterns $PT_G = (V_G, E_G, \prec_G, root)$ is an ordered tree that stores the hyperclique patterns having a hyperclique pattern $G$ as their parent. The root node $root$ is a dummy node. Each node $v$ in $V_G$, except for $root$, corresponds to a hyperclique pattern $G \cup \{g(v)\}$ and has a graph $g(v)$. $E_G \subseteq V_G \times V_G$ and $\prec_G \subseteq V_G \times V_G$ represent the sets of parent-child and sibling relationships, respectively. These are formally defined as follows.

$$E_G = \{(v_1, v_2) \mid g(v_1) <_{pfx} g(v_2),\ \nexists v' \in V_G\ [g(v_1) <_{pfx} g(v') \wedge g(v') <_{pfx} g(v_2)]\} \cup \{(root, v_3) \mid \nexists v' \in V_G\ [g(v') <_{pfx} g(v_3)]\}$$

$$\prec_G = \{(v_1, v_2) \mid g(v_1) <_{lex} g(v_2),\ \exists v' \in V_G\ [(v', v_1) \in E_G \wedge (v', v_2) \in E_G]\}$$
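The parent-child relation $E_G$ and the sibling order defined above can be sketched as follows. The sketch assumes that code words are Python tuples, that tuple comparison stands in for $<_{lex}$, and that `None` plays the role of the dummy root; the function names and the six code words are hypothetical and do not reproduce the graphs of the paper's figures.

```python
ROOT = None  # dummy root node


def is_proper_prefix(a, b):
    """True if code word a is a proper prefix of code word b."""
    return len(a) < len(b) and b[:len(a)] == a


def build_prefix_tree(code_words):
    """Map every node (and ROOT) to its list of children, ordered by <_lex."""
    children = {ROOT: []}
    for w in code_words:
        children[w] = []
    for w in code_words:
        prefixes = [p for p in code_words if is_proper_prefix(p, w)]
        # The parent is the node with the longest proper prefix, else the root.
        node = max(prefixes, key=len) if prefixes else ROOT
        children[node].append(w)
    for kids in children.values():
        kids.sort()
    return children


def preorder(children, node=ROOT):
    """Visit the nodes below `node` in preorder."""
    for child in children[node]:
        yield child
        yield from preorder(children, child)


# Hypothetical code words, each a tuple of (label, label) edges.
words = [
    (("A", "B"),),
    (("A", "B"), ("B", "C")),
    (("A", "B"), ("B", "D")),
    (("A", "B"), ("B", "D"), ("D", "E")),
    (("B", "C"),),
    (("B", "C"), ("C", "D")),
]
tree = build_prefix_tree(words)
print(tree[ROOT])
print(list(preorder(tree)) == sorted(words))
```

With children sorted lexicographically, a preorder traversal visits the stored code words in $<_{lex}$ order, which is the property the traversal-based enumeration relies on.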

[Fig. 2. An Example of Conditional Prefix Tree. The figure shows six hyperclique patterns $G \cup \{g_0\}, \dots, G \cup \{g_5\}$ with the common parent (condition) $G$, arranged as a tree over the graphs $g_0, \dots, g_5$.]

Intuitively speaking, $v_1$ is the parent of $v_2$ if the code word of $g(v_1)$ is the longest prefix of that of $g(v_2)$. If $v_3$ has no such parent, then $root$ is assigned as the parent of $v_3$. Note that, for nodes other than $root$, $(v_1, v_2) \in E_G$ implies $g(v_1) \sqsubseteq g(v_2)$. The children of a node are ordered in the lexicographic order $<_{lex}$. An example of a conditional prefix tree is shown in Fig. 2. This tree is constructed from six hyperclique patterns that have $G$ as their common parent.

3.3 HSG: A Hyperclique Pattern Miner in Graph Databases

In this subsection, we propose the algorithm HSG and explain it in detail. The algorithm HSG for mining frequent hyperclique patterns in graph databases is shown in Fig. 3. In the following explanation, we use the notations below for the sake of simplicity: $G_x = G \cup \{g(g_x)\}$, $G_{x'} = G \cup \{g(g_{x'})\}$ and $G_{x,y} = G \cup \{g(g_x), g(g_y)\}$, where we assume $g(g_x) \sqsubseteq g(g_{x'})$.

As input, HSG takes an unconditional prefix tree $PT_\emptyset$ of hyperclique patterns that stores the frequent hyperclique patterns of cardinality one, i.e. frequent subgraphs, which can be obtained by conventional graph miners [28,11,10,16]. Then, HSG calls the procedure LoopV with $T_a = T_b = PT_\emptyset$ (line 1 in HSG). HSG consists of two main procedures, LoopV and LoopH, which realize the mutual join of elements in a conditional prefix tree while considering the generality ordering. LoopV traverses a tree $T_a$ in preorder by using a recursive call (line 5 in LoopV). With the preorder traversal, the elements in $T_a$ are considered in the order of $<_{lex}$. During the traversal, LoopV calls LoopH with $G$, $g_a$ and $T_b$ (line 3 in LoopV). LoopH also traverses the tree $T_b$ in preorder (line 16 in LoopH). Since $T_a$ and $T_b$ refer to the same tree at the beginning, if no pruning is applied, all pairs of elements in a conditional prefix tree will be considered.
Note that no repeated enumeration occurs, owing to the check of $g(g_a) \le_{lex} g(g_b)$ (line 2 in LoopH). During the recursive calls, LoopH constructs two new conditional prefix trees $NT_a$ and $NT_b$, which form the search spaces afterwards. $NT_a$ is a prefix tree under the condition $G_a$, and it is used as an input for discovering hyperclique patterns whose parent is $G_{a,b}$ (line 4 in LoopV). $NT_a$ is constructed by adding a new hyperclique pattern $G_{a,b}$ whenever one is obtained (line 10 in LoopH). $NT_b$ is a prefix tree under the condition $G$, on which hyperclique patterns having $G_{a'}$ as parents will be mined (line 5 in LoopV). Conceptually, $NT_b$ is obtained by pruning some branches in $T_b$.

Algorithm HSG($PT_\emptyset$)
 1: LoopV($\emptyset$, $PT_\emptyset$, $PT_\emptyset$)

Procedure LoopV($G$, $T_a$, $T_b$)
 1: for each $g_a \in T_a$'s children    // $G \cup \{g(g_a)\}$ is a frequent hyperclique pattern
 2:   $NT_a$ := new root node, $NT_b$ := new root node
 3:   LoopH($G$, $g_a$, $T_b$, $NT_a$, $NT_b$)
 4:   LoopV($G \cup \{g(g_a)\}$, $NT_a$, $NT_a$)    // specialize $G \cup \{g(g_a)\}$ by addition: search on the new conditional prefix tree
 5:   LoopV($G$, $g_a$, $NT_b$)    // preorder traversal in $T_a$: specialize $G \cup \{g(g_a)\}$ by replacement

Procedure LoopH($G$, $g_a$, $T_b$, $NT_a$, $NT_b$)
 1: for each $g_b \in T_b$'s children    // check $G \cup \{g(g_a), g(g_b)\}$ and prune by it
 2:   if ($g(g_a) \le_{lex} g(g_b)$) then
 3:     add $g_b$ to the last of $NT_b$'s children
 4:     continue
 5:   if ($sup_D(G \cup \{g(g_a), g(g_b)\}) < \sigma$) then continue    // pruning (1)
 6:   if ($G \neq \emptyset \wedge up(G \cup \{g(g_a), g(g_b)\}, G) < h_c$) then continue    // pruning (2)
 7:   $N_a$ := $NT_a$
 8:   if ($hconf_D(G \cup \{g(g_a), g(g_b)\}) \ge h_c$) then    // pruning (3)
 9:     output($G \cup \{g(g_a), g(g_b)\}$)    // output of a frequent hyperclique pattern
10:     $C$ := new node, $g(C)$ := $g(g_b)$, add $C$ to the last of $N_a$'s children
11:     $N_a$ := $C$    // replacement of $N_a$
12:   $N_b$ := new node, $g(N_b)$ := $g(g_b)$, add $N_b$ to the last of $NT_b$'s children
13:   if ($up(G \cup \{g(g_a), g(g_b)\}, G \cup \{g(g_a)\}) < h_c$) then    // pruning (4)
14:     for each $g_c \in g_b$'s children add $g_c$ to the last of $N_b$'s children
15:   else
16:     LoopH($G$, $g_a$, $g_b$, $N_a$, $N_b$)    // preorder traversal in $T_b$: specialize $G \cup \{g(g_a), g(g_b)\}$ by replacement

Fig. 3. An algorithm HSG for mining hyperclique patterns in graph databases

Four prunings are applied in LoopH. They are achieved partially by not adding new nodes to $NT_a$ and $NT_b$. The first pruning is based on the anti-monotone property of the support value in Lemma 1 (line 5 in LoopH).
If the support of $G_{a,b}$ is less than the minimum support, then no pattern that is more specific than $G_{a,b}$ can satisfy the minimum support. Thus, we ignore the following specializations of $G_{a,b}$ by skipping the rest of the loop body of line 1 in LoopH: (1) $G_{a,b'}$, by not calling LoopH (line 16 in LoopH), (2) patterns obtained by specialization of $G_{a,b}$ by addition, by not updating $NT_a$, and (3) $G_{a',b}$ and $G_{a',b'}$, by not updating $NT_b$. The second pruning is derived from the upper bound of h-confidence in Lemma 2 (line 6 in LoopH). As with the first pruning, all specializations of $G_{a,b}$ are ignored in the same way. The third pruning uses the anti-monotone property of h-confidence with respect to specialization by addition in Lemma 2 (line 8 in LoopH). If $G_{a,b}$ does not satisfy the minimum h-confidence, the search for patterns having $G_{a,b}$ as parent is avoided by not adding $G_{a,b}$ to $NT_a$. The fourth pruning is based on the upper bound of h-confidence in Lemma 2 (line 13 in LoopH). The search for $G_{a,b'}$ can be avoided by not calling LoopH. Note that $G_{a',b}$ as well as $G_{a',b'}$ must still be considered. Therefore, $NT_b$ has to be updated. This is achieved through the update of $N_b$. As shown above, HSG makes the best use of pruning based on the specializations by using the conditional prefix trees.

For HSG, the following theorem holds.

Theorem 1. Given an unconditional prefix tree having all frequent subgraphs, HSG discovers all frequent hyperclique patterns without any duplication.

Proof. Derived from the complete enumeration procedure by the double preorder traversals and the safe prunings guaranteed by Lemmas 1 and 2.

Although HSG can discover all frequent hyperclique patterns, the obtained set of hyperclique patterns may contain some redundancy. Since each frequent subgraph in the unconditional prefix tree is treated as an item, if subgraphs that are equivalent in some sense are contained in the tree, they cause redundancy. To eliminate obviously redundant patterns, we believe that the frequent subgraphs included in the unconditional prefix tree should be limited to representatives such as closed subgraphs (a graph $g_c$ s.t. $\nexists g' \sqsupset g_c\ [sup_D(g_c) = sup_D(g')]$) or minimal subgraphs (a graph $g_m$ s.t. $\nexists g' \sqsubset g_m\ [sup_D(g_m) = sup_D(g')]$). In particular, minimal subgraphs might be more suitable if edge-disjointness is considered in the joint occurrence. Although, to the best of our knowledge, no method that finds minimal subgraphs directly has been proposed yet, those subgraphs can be obtained by post-processing the output of conventional graph miners [28,11,10,16].
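The post-processing mentioned above can be sketched as a simple filter over a table of pattern supports. This is only an illustration under stated assumptions: patterns are modeled by frozensets (so that the subset relation stands in for the subgraph relation $\sqsubseteq$), supports are given as counts, and candidates are compared only against the other patterns in the table. The function names and the toy data are hypothetical.

```python
def minimal_patterns(support, more_general):
    """Patterns with no strictly more general pattern of equal support."""
    return [g for g in support
            if not any(h != g and more_general(h, g) and support[h] == support[g]
                       for h in support)]


def closed_patterns(support, more_general):
    """Patterns with no strictly more specific pattern of equal support."""
    return [g for g in support
            if not any(h != g and more_general(g, h) and support[h] == support[g]
                       for h in support)]


# Stand-in data: frozensets of edge names with hypothetical support counts.
support = {frozenset("a"): 5, frozenset("ab"): 5, frozenset("abc"): 3}


def subset(general, specific):
    """Stand-in for the subgraph relation between two patterns."""
    return general <= specific


print(minimal_patterns(support, subset))
print(closed_patterns(support, subset))
```

In this toy table, `{a}` and `{a, b}` have the same support, so `{a}` is kept as the minimal representative and `{a, b}` as the closed one; `{a, b, c}` is both minimal and closed.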
4 Related Work

The concept of HSG mining is inspired by hyperclique pattern discovery in transaction databases [26,27]. Methods for mining correlated pairs of items have been proposed [25,23,7]. Furthermore, correlated pattern mining based on a pattern-growth methodology in transaction databases has been proposed [15]. Compared with these methods, HSG differs in that it finds sets of affinitive structured patterns.

On correlation mining in graph databases, a new problem named Correlated Graph Search has been proposed recently [12]. In this problem, Pearson's correlation coefficient [20] is employed as the correlation measure, and all subgraphs correlated with a query graph are discovered. This framework is greatly different from our proposal because a different measure is employed and only subgraphs correlated with a given query are considered.

A pattern team, proposed in [13], is a set of patterns that optimizes some quality measure. The discovery of pattern teams may look similar to HSG mining because both find a set or combination of patterns. However, pattern team discovery is done by selecting patterns from a given set. In addition, a pattern team usually consists of a set of mutually dissimilar and independent patterns for optimizing the quality measure. Similar to pattern teams in some sense, the concept of α-orthogonal patterns in graph databases has been proposed recently [6]. In this framework, a set of frequent maximal subgraphs that are mutually dissimilar is obtained by employing a randomized search. While it also treats sets of subgraphs, this framework is different from HSG mining because HSG discovers the complete set of affinitive subgraphs.

From the aspect of finding similar patterns, redescription mining [19,17,29] is closely related to HSG mining. In redescription mining, patterns consist of arbitrary combinations of conjunction, disjunction and negation of items, and pairs of patterns that occur in almost the same transactions are discovered. While this framework is very general, neither its application to structured data nor precise algorithms that use the generality ordering have been proposed yet.

5 Experimental Evaluation

Table 1. Statistics of Datasets

Name     |D|    |V|a   |E|a   |V|   |E|   Description
1        1000   11.6   20.5   20    20    A synthetic dataset generated by graph generator [5]
PTE      340    27.0   27.4   66    4     The Predictive Toxicology Evaluation Challenge [8]
DTP CM   877    29.1   31.5   12    4     The DTP AIDS Antiviral Screen dataset [4]

|D|: # of graphs in datasets. |V|a, |E|a: average number of vertices and edges per graph. |V|, |E|: # of distinct labels of vertices and edges.

To assess the effectiveness of the proposed algorithm, we implement HSG in Java and conduct experiments with the datasets shown in Table 1 on a PC (CPU: Intel(R) Core2Quad 2.4GHz) with 4 GB of main memory running Windows XP.
Furthermore, another miner, phsg, which is HSG without prunings (2) and (4), is also prepared to demonstrate the effects of the pruning related to specialization by replacement. In the experiments, we construct the unconditional prefix trees $PT_\emptyset$ by using minimal subgraphs only.

Table 2. Experimental Results

Results for 1
             σ = 0.025 (|PT_∅| = 1208)            σ = 0.01 (|PT_∅| = 8946)
h_c   k      |P|    Time           Cand.          |P|    Time            Cand.
0.8   0      0      0.3 (0.6)      17.4 (32.6)    0      1.3 (4.4)       102.5 (337.9)
0.8   1      2      0.4 (0.6)      22.5 (38.2)    4      2.0 (6.1)       155.9 (470.6)
0.8   ∞      7      0.3 (0.5)      22.7 (38.3)    756    2.1 (5.7)       191.3 (513.1)
0.7   0      0      0.3 (0.6)      18.0 (32.6)    0      1.4 (4.4)       109.6 (337.9)
0.7   1      45     0.4 (0.6)      23.2 (38.2)    64     2.2 (6.1)       174.5 (470.6)
0.7   ∞      81     0.3 (0.5)      23.4 (38.4)    3066   2.3 (5.7)       213.7 (514.3)

Results for PTE
             σ = 0.1 (|PT_∅| = 561)               σ = 0.05 (|PT_∅| = 1441)
h_c   k      |P|    Time           Cand.          |P|    Time            Cand.
0.9   0      16     0.3 (1.1)      2.3 (8.3)      154    1.7 (8.0)       9.6 (48.0)
0.9   1      93     0.6 (2.0)      4.2 (13.9)     565    3.0 (13.2)      16.2 (67.0)
0.8   0      85     0.9 (2.1)      2.9 (8.4)      821    3.2 (9.4)       13.8 (49.9)
0.8   1      524    3.4 (7.0)      5.9 (14.8)     3815   16.5 (28.1)     29.5 (77.2)
0.7   0      165    1.3 (4.0)      3.5 (9.1)      1165   5.5 (11.8)      17.0 (51.4)
0.7   1      1228   9.0 (35.5)     7.9 (17.1)     6000   45.4 (77.2)     38.5 (85.4)

Results for DTP CM
             σ = 0.1 (|PT_∅| = 417)               σ = 0.05 (|PT_∅| = 1592)
h_c   k      |P|    Time           Cand.          |P|    Time            Cand.
0.9   0      9      0.5 (2.4)      2.2 (11.3)     10     1.2 (7.7)       9.9 (62.6)
0.9   1      48     0.7 (2.8)      3.1 (13.9)     70     1.6 (9.0)       15.8 (79.0)
0.8   0      32     1.0 (2.7)      3.3 (11.3)     40     2.2 (8.0)       14.7 (62.7)
0.8   1      109    1.4 (3.4)      4.4 (14.0)     242    2.9 (9.7)       22.7 (79.3)
0.7   0      110    2.6 (8.8)      4.3 (11.6)     129    4.1 (14.1)      19.0 (63.0)
0.7   1      371    48.3 (116.6)   5.7 (14.7)     628    50.3 (123.1)    28.5 (80.2)

k: # of edge overlaps permitted in the joint occurrence (∞ means no restriction). |P|: # of obtained hyperclique patterns. Time: execution time after PT_∅ is given (in seconds). Cand.: # of candidates enumerated during the search (in thousands). Numbers in parentheses in Time and Cand. are for phsg.

Experimental results are shown in Table 2. The number of hyperclique patterns obtained decreases when the value of $k$ is reduced. Furthermore, though not shown in Table 2, about 231 million and 17 thousand hyperclique patterns were obtained when we set $\sigma = 0.1$, $h_c = 0.9$ and $k = \infty$ in PTE and DTP CM, respectively. This means that the consideration of edge-disjointness succeeds in suppressing the generation of redundant patterns. In all cases, phsg discovers all frequent hyperclique patterns in a reasonable time, although at least $O(|PT_\emptyset|^2)$ candidates would be generated if no pruning were applied. Thus, the pruning by minimum support is understood to be effective enough. Note that this pruning eliminates patterns obtained by specialization by addition as well as by specialization by replacement. Compared with phsg, the execution time of HSG for the real-world problems decreases to 16.0% in the best case and to 33.9% on average. The number of candidate patterns is also reduced to 15.9% in the best case and to 30.8% on average. It is also observed that HSG runs about two times faster than phsg on the synthetic dataset on average.
These reductions are strong evidence of the effectiveness of the pruning based on the generality ordering, especially for the specialization by replacement.

6 Conclusion

In this paper, we formulated the problem of hyperclique pattern discovery in graph databases. To solve this problem efficiently, we proposed a novel algorithm named HSG, which uses depth-first/breadth-first search with effective pruning based on the generality ordering. We believe that HSG can mine hyperclique patterns efficiently not only in other types of structured data but also in transaction databases with a conceptual hierarchy, because the conditional prefix trees on which HSG works can be constructed naturally from these kinds of datasets.

For future work, a theoretical analysis of the proposed algorithm and further experiments with large-scale datasets are necessary. In addition, a more efficient mechanism is required for computing the support value of a set of edge-disjoint subgraphs. For this purpose, we plan to employ the idea of support value computation for edge-disjoint subgraphs in a large graph [14]. We also plan to apply the proposed algorithm to top-k correlated pattern discovery as well as to redescription mining in structured databases.

References

1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proc. of 20th International Conference on Very Large Data Bases (VLDB 1994), pp. 487–499 (1994)
2. Avis, D., Fukuda, K.: Reverse search for enumeration. Discrete Applied Mathematics 65(1-3), 21–46 (1996)
3. Borgelt, C.: On canonical forms for frequent graph mining. In: Working Notes of the 3rd International ECML/PKDD Workshop on Mining Graphs, Trees and Sequences (MGTS 2005), pp. 1–12 (2005)
4. Borgelt, C., Berthold, M.R.: Mining molecular fragments: Finding relevant substructures of molecules. In: Proc. of the 2002 IEEE International Conference on Data Mining (ICDM 2002), pp. 51–58 (2002)
5. Cheng, J., Ke, Y., Ng, W.: Graphgen: A graph synthetic generator (2006), http://www.cse.ust.hk/graphgen/
6. Hasan, M., Chaoji, V., Salem, S., Besson, J., Zaki, M.: ORIGAMI: Mining representative orthogonal graph patterns. In: Proc. of the 7th IEEE International Conference on Data Mining (2007)
7. He, Z., Xu, X., Deng, S.: Mining top-k strongly correlated item pairs without minimum correlation threshold. International Journal of Knowledge-based and Intelligent Engineering Systems 10(2), 105–112 (2006)
8. Helma, C., King, R.D., Kramer, S., Srinivasan, A.: The predictive toxicology challenge 2000-2001. Bioinformatics 17(1), 107–108 (2001)
9. Hu, T., Xiong, H., Sung, S.Y.: Co-preserving patterns in bipartite partitioning for topic identification. In: Proc. of the 7th SIAM International Conference on Data Mining, pp. 509–514 (2007)
10. Huan, J., Wang, W., Prins, J.: Efficient mining of frequent subgraphs in the presence of isomorphism. In: Proc. of the 3rd IEEE International Conference on Data Mining, pp. 549–552 (2003)
11. Inokuchi, A., Washio, T., Motoda, H.: Complete mining of frequent patterns from graphs: Mining graph data. Machine Learning 50, 321–354 (2003)
12. Ke, Y., Cheng, J., Ng, W.: Correlation search in graph databases. In: Proc. of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2007), pp. 390–399 (2007)
13. Knobbe, A.J., Ho, E.K.Y.: Pattern teams. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 577–584. Springer, Heidelberg (2006)

14. Kuramochi, M., Karypis, G.: Finding Frequent Patterns in a Large Sparse Graph. Data Mining and Knowledge Discovery 11(3), 213–321 (2005)
15. Lee, Y.-K., Kim, W.-Y., Cai, Y.D., Han, J.: Comine: Efficient mining of correlated patterns. In: Proc. of the 3rd IEEE International Conference on Data Mining, pp. 581–584 (2003)
16. Nijssen, S., Kok, J.: A quickstart in frequent structure mining can make a difference. In: Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004), pp. 647–652 (2004)
17. Parida, L., Ramakrishnan, N.: Redescription mining: Structure theory and algorithms. In: Proc. of the 20th National Conference on Artificial Intelligence and the 17th Innovative Applications of Artificial Intelligence Conference, pp. 837–844 (2005)
18. Qian, T., Xiong, H., Wang, Y., Chen, E.: On the strength of hyperclique patterns for text categorization. Information Sciences 177(19), 4040–4058 (2007)
19. Ramakrishnan, N., Kumar, D., Mishra, B., Potts, M., Helm, R.F.: Turning cartwheels: an alternating algorithm for mining redescriptions. In: Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 266–275 (2004)
20. Tan, P.-N., Kumar, V., Srivastava, J.: Selecting the right interestingness measure for association patterns. In: Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 32–41. ACM Press, New York (2002)
21. Washio, T., Motoda, H.: State of the art of graph-based data mining. SIGKDD Explorations 5(1), 59–68 (2003)
22. Washio, T., Kok, J.N., De Raedt, L. (eds.): Advances in Mining Graphs, Trees and Sequences. IOS Press, Amsterdam (2005)
23. Xiong, H., Brodie, M., Ma, S.: Top-cop: Mining top-k strongly correlated pairs in large databases. In: Proc. of the 6th International Conference on Data Mining, pp. 1162–1166 (2006)
24. Xiong, H., He, X., Ding, C., Zhang, Y., Kumar, V., Holbrook, S.R.: Identification of functional modules in protein complexes via hyperclique pattern discovery. In: Proc. of the Pacific Symposium on Biocomputing, pp. 221–232 (2005)
25. Xiong, H., Shekhar, S., Tan, P.-N., Kumar, V.: Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs. In: Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 334–343. ACM Press, New York (2004)
26. Xiong, H., Tan, P.-N., Kumar, V.: Mining strong affinity association patterns in data sets with skewed support distribution. In: Proc. of the 3rd IEEE International Conference on Data Mining (ICDM 2003), pp. 387–394 (2003)
27. Xiong, H., Tan, P.-N., Kumar, V.: Hyperclique pattern discovery. Data Mining and Knowledge Discovery 13(2), 219–242 (2006)
28. Yan, X., Han, J.: gspan: Graph-based substructure pattern mining. In: Proc. of the 2002 IEEE International Conference on Data Mining (ICDM 2002), pp. 721–724 (2002)
29. Zaki, M.J., Ramakrishnan, N.: Reasoning about sets using redescription mining. In: Proceeding of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 364–373 (2005)