Neighborhood Based Fast Graph Search in Large Networks

Transcription

Arijit Khan, Dept. of Computer Science, University of California, Santa Barbara, CA 93106
Ziyu Guan, Dept. of Computer Science, University of California, Santa Barbara, CA 93106
Nan Li, Dept. of Computer Science, University of California, Santa Barbara, CA 93106
Supriyo Chakraborty, Dept. of Electrical Engineering, University of California, Los Angeles, CA
Xifeng Yan, Dept. of Computer Science, University of California, Santa Barbara, CA 93106
Shu Tao, IBM T. J. Watson, 19 Skyline Drive, Hawthorne, NY 10532, shuto@us.ibm.com

ABSTRACT

Complex social and information network search becomes important with a variety of applications. At the core of these applications lies a common and critical problem: given a labeled network and a query graph, how to efficiently search the query graph in the target network. The presence of noise and the incomplete knowledge about the structure and content of the target network make it unrealistic to find an exact match. Rather, it is more appealing to find the top-k approximate matches.

In this paper, we propose a neighborhood-based similarity measure that could avoid costly graph isomorphism and edit distance computation. Under this new measure, we prove that subgraph similarity search is NP-hard, while graph similarity match is polynomial. By studying the principles behind this measure, we found an information propagation model that is able to convert a large network into a set of multidimensional vectors, where sophisticated indexing and similarity search algorithms are available. The proposed method, called Ness (Neighborhood Based Similarity Search), is appropriate for graphs with low automorphism and high noise, which are common in many social and information networks. Ness is not only efficient, but also robust against structural noise and information loss. Empirical results show that it can quickly and accurately find high-quality matches in large networks, with negligible cost.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Search process; I.2.8 [Problem Solving, Control Methods, and Search]: Graph and tree search strategies

General Terms
Algorithms, Performance

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD'11, June 12-16, 2011, Athens, Greece. Copyright 2011 ACM ...$10.00.

Keywords
Graph Query, Graph Search, Graph Alignment, RDF

1. INTRODUCTION

Recent advances in social and information science have shown that linked data pervade our society and the natural world around us [36]. Graphs become increasingly important to represent complicated structures and schema-less data such as Wikipedia, Freebase [5] and various social networks. Given an attributed network and a small query graph, how to efficiently search the query graph in the target network is a critical task for many graph applications. It has been extensively studied in chemi-informatics, bioinformatics, XML and the Semantic Web.

SPARQL [27] is the state-of-the-art RDF query language for the Semantic Web. SPARQL requires accurate knowledge about the graph structure to write a query, and it performs exact graph pattern matching. However, due to the noise and the incomplete information (structure and content) in many networks, it is not realistic to find exact matches for a given query. It is more appealing to find the top-k approximate matches. Unfortunately, graph similarity measures such as subgraph isomorphism, maximum common subgraphs, graph edit distance and missing edges, which are appropriate for chemical structures and biological networks, are not suitable for entity-relationship graphs and social networks. There are two challenging issues for these graph theoretic measures.
First, entity-relationship graphs and social networks have quite different characteristics from physical networks. They are not governed by physical laws and are often full of noise, thus making strict topological similarity examination nearly impossible. How the entities are connected in these networks is not as important as how closely they are connected. Second, these graphs are very large and complex, with a lot of attributes associated. If accuracy is to be ensured, the algorithms developed for edit distance and missing edges are not scalable. These two issues motivate us to invent new graph similarity measures that are less sensitive to structure changes and have scalable indexing and search solutions.

Figure 1(a) shows a graph query to "Find the athlete who is from Romania and won gold in 3000m and bronze in 1500m, both in the 1984 Olympics". Comparing this query against a possible match in Freebase (Olympics), shown in Figure 1(b), we observe that these two graphs are by no means similar under traditional graph similarity definitions. The graph edit distance between these two graphs is 7. The size of their maximum common graph is 3. The number of maximum missing edges for the query graph is 4. However, Maricica Puica in Figure 1(b) is a good match for the query shown in Figure 1(a), because she has all these attributes quite close to her in Figure 1(b).

[Figure 1: Top-1 Match for Query (a) in Freebase. Panels: (a) Query; (b) Match in Freebase.]

In practice, it is hard to come up with a query that exactly conforms with the graph structures in the target network, due to the lack of schemas in linked data. However, it is easy to write a query like Figure 1(a), where a user connects entities with possible links. As long as the proximity between these entities is approximately maintained in the query graph, the system shall be able to deliver matches like Figure 1(b).

The above approximate query form can serve as a primitive for many advanced graph operators such as RDF query answering, network alignment, subgraph similarity search, name disambiguation and database schema matching. For example, based on partial information related to one person, e.g. his friends, one can align his physical social circle with his cyber social network on Facebook. In many cases, nodes in social or information networks have incomplete or even anonymized information. Nevertheless, the partial neighborhood information available from a query graph will be helpful to identify entities in the target network. Clearly, there is a need to adopt approximate similarity search techniques to solve the above problem.

In bioinformatics, approximate graph alignment has been extensively studied, e.g. PathBlast [2], Saga [33]. These studies resort to a strict approximation definition such as graph edit distance, whose optimal solution is expensive to compute. Since they target relatively small biological networks with less than 10k nodes, it is difficult to apply them to social and information networks with thousands or even millions of nodes.
As illustrated in NetAlign [23], in order to handle large graphs with 10k nodes, one has to sacrifice accuracy to achieve better query response time. Recently there have been other studies on approximate matching with large graphs, i.e., TALE [34], SIGMA [24] and G-Ray [35]. However, both TALE and SIGMA consider the number of missing edges as the qualitative measure of approximate matching and hence cannot capture the notion of proximity among labels, as shown in Figure 1. G-Ray, on the other hand, tries to maintain the shape of the query by allowing some approximation in the match. Unfortunately, shape is not an important factor in entity-relationship graphs.

In this paper, we introduce a novel neighborhood-based similarity measure by vectorizing nodes according to the label distribution of their neighbors. We further extend the similarity notion to graphs by finding the embeddings in the target graph that maximize the sum of node matches. This graph matching technique avoids complicated subgraph isomorphism and graph edit distance calculation, which become infeasible for large graphs. It is observed that social and information networks usually have more diversified node labels and therefore less auto-isomorphic structure, but may contain more noise. Our objective function can provide better similarity semantics for graphs with various random noise. It simplifies the procedure of graph matching, leading to the development of an efficient graph search framework, called Ness (Neighborhood Based Similarity Search). With the introduction of scalable indices built on vectorized nodes and an intelligent query optimization technique, Ness can quickly and accurately find high-quality matches in large networks, with negligible time cost.

Our contributions. We propose a novel similarity search problem in graphs, neighborhood-based similarity search, which combines the topological structure and content information together during the search process. The similarity definition proposed in this work is able to avoid expensive isomorphism testing as much as possible.
The principles to derive appropriate functions to fit this definition are carefully examined. We found that the information propagation model satisfies these principles: each node propagates a certain fraction of its labels to its neighbors, and thereby we can convert each node into a multidimensional vector, where sophisticated indexing and similarity search algorithms are available. That is, we successfully turn a graph search problem into a high-dimensional index problem.

We first identify a set of rules to define approximate matches of nodes based on their neighborhood structure and labels. These rules are important since the query may not always have complete information about the exact neighborhood structure in the target graph. The approximate node match concept is further extended to subgraph similarity search, i.e. multiple node alignment for a given query graph. We prove that under this measure, subgraph similarity search is NP-hard. However, in comparison with graph isomorphism, which is neither known to be solvable in polynomial time nor known to be NP-complete, graph similarity match is proved to be polynomial.

We demonstrate that, without performing subgraph isomorphism testing, it is possible to prune unpromising nodes by iteratively propagating node information among a shrinking candidate set, which significantly reduces query execution time.

We further analyze how to index the vector structure as well as optimize query processing to speed up similarity search. The information propagation model and the neighborhood vectorization approach keep the index structure much simpler than the graph itself, thus making it easy to update dynamically for graph changes arising from node/edge insertion and deletion.

In summary, we propose a completely new graph similarity search framework, Ness, to define and determine approximate matches in massive graphs. As tested on real and synthetic networks, Ness is able to find high-quality matches efficiently in large-scale networks.

2. PRELIMINARIES

A labeled graph $G = (V_G, E_G, L_G)$ has a label set $L_G$, and each node $u \in V_G$ is attached with a set of labels.
The label set of node $u$ in $G$ is denoted by $L(u) \subseteq L_G$. For the sake of simplicity, we assume there are no labels and weights on the edges. Nevertheless, the proposed techniques could be extended to graphs with labeled or weighted edges.

Given two labeled graphs $G$ and $G'$, $G$ is called subgraph isomorphic to $G'$ if there exists a subgraph $H$ of $G'$ such that $G$ is isomorphic to $H$. Formally, we define subgraph isomorphism as follows.

DEFINITION 1 (SUBGRAPH ISOMORPHISM). A subgraph isomorphism is an injective function $f : V_G \to V_{G'}$, s.t., (1) $\forall u \in V_G$, $L(u) \subseteq L(f(u))$, and (2) $\forall (u, v) \in E_G$, $(f(u), f(v)) \in E_{G'}$.

DEFINITION 2 (EMBEDDING). Given a graph $G$ and a query graph $Q$, an embedding of $Q$ is an (injective) function $f : V_Q \to V_G$ such that $\forall v \in V_Q$, $L(v) \subseteq L(f(v))$, where $f(v) \in V(G)$.

In this work, we only study one-to-one node matching for a query graph $Q$, and the node labels are preserved in the embedding. However, our cost function and algorithms can be extended to include other matching and node label similarity scenarios.

Given two graphs $G$ and $Q$, there might be many possible embeddings. Certainly, the quality of an embedding depends on whether it preserves the connections and labels in the query graph or not. Subgraph isomorphism actually defines an exact embedding, written as $f_e$. The quality of an embedding can be defined in various ways; e.g., for a given label-preserved embedding $f$, we can count the number of edge mismatches, $C_e = |\{(u, v) \in E_Q : (f(u), f(v)) \notin E_G\}|$, as the embedding's quality. In general, for a cost function $C : f \to \mathbb{R}$, we define the top-k graph similarity search problem as below.

PROBLEM STATEMENT 1. Given a graph $G$ and a query graph $Q$, find the top-k embeddings with respect to a cost function $C$.

The edge mismatch cost function $C_e$ has been studied in [38, 34, 24]. Unfortunately, it cannot differentiate the case where two nodes are close to each other but there is no direct edge between them.

[Figure 2: Problem with Edge Mismatch Cost Function. Target graph $G$ with embeddings $f_1$ (nodes $u_1, u_2, u_3$) and $f_2$ (nodes $u'_1, u'_2, u'_3$); query graph $Q$ with nodes $v_1, v_2, v_3$.]

[Figure 3: Information Propagation Model.]

Figure 2 shows one example. There are two label-preserved embeddings $f_1$ and $f_2$ of the query graph $Q$ in the target graph $G$. In both $f_1$ and $f_2$, there is no edge connecting $a$ and $b$. Thus, $C_e$ will assign equal cost to both embeddings. Moreover, the graph edit distance between $f_1$ and $Q$ is 2, whereas it is only 1 between $f_2$ and $Q$. Intuitively, however, $f_1$ is a better match than $f_2$, because the nodes with labels $a$ and $b$ are only 2 hops away in $f_1$, whereas they are disconnected in $f_2$.
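The tie under the edge mismatch cost can be checked with a few lines of code. The sketch below is only an illustration: the function name `edge_mismatch_cost` and the toy edge lists are assumptions made here (the exact edges of Figure 2 are not recoverable from the text), chosen so that the stated edit distances of 2 and 1 hold.

```python
def edge_mismatch_cost(query_edges, target_edges, f):
    """C_e: number of query edges whose image under f is not an edge of G."""
    target = {frozenset(e) for e in target_edges}
    return sum(1 for (u, v) in query_edges
               if frozenset((f[u], f[v])) not in target)

# Hypothetical toy instance (an assumption, not the paper's exact figure).
# Query Q: a - b - c, with v1 = a, v2 = b, v3 = c.
query_edges = [("v1", "v2"), ("v2", "v3")]
# Target G: the region of f1 is a - c - b (a and b are 2 hops apart);
# the region of f2 has only b - c, and its a node is disconnected.
target_edges = [("u1", "u3"), ("u3", "u2"), ("u2p", "u3p")]

f1 = {"v1": "u1", "v2": "u2", "v3": "u3"}
f2 = {"v1": "u1p", "v2": "u2p", "v3": "u3p"}

cost_f1 = edge_mismatch_cost(query_edges, target_edges, f1)
cost_f2 = edge_mismatch_cost(query_edges, target_edges, f2)
```

Both embeddings miss exactly the edge between the nodes labeled $a$ and $b$, so $C_e$ cannot tell them apart even though the two labels are much closer in $f_1$.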
This observation inspires us to develop a neighborhood-based similarity measure that discounts how nodes are exactly connected, but focuses on the proximity among the labels carried by these nodes. It needs to achieve the following two objectives: (1) the cost function should identify approximate embeddings, and (2) it must be easy to compute. In the next section, we will define the neighborhood-based similarity cost function and give the complexity analysis of that function.

3. NEIGHBORHOOD-BASED GRAPH SIMILARITY

In order to solve the problem posed by the edge mismatch cost function, we define a novel neighborhood-based similarity measure by comparing the h-hop neighbors of a node, defined as follows.

DEFINITION 3 (h-hop NEIGHBORS). Given a graph $G$ and a node $u \in V(G)$, the h-hop neighborhood of $u$ is the set of nodes $v$ whose distance from $u$ is less than or equal to $h$.

To compare the neighborhoods of two nodes, we resort to an information propagation model [22] that is able to transform neighborhoods into vectors in a multidimensional space, where sophisticated indexing and fast similarity search algorithms are available.

3.1 Information Propagation Model

Figure 3 shows the information propagation model used to characterize the neighborhood information around a node $u$. The label information encoded in $u$'s neighbors is propagated to $u$ through different paths and accumulated at $u$. One could use the accumulated information and its strength as a vector to describe the neighborhood of $u$. The neighborhood vector of $u$ is denoted by $R(u)$, which consists of a set of tuples, $R(u) = \{\langle l, A(u, l) \rangle\}$, where $l$ is a label present in the neighborhood of $u$ and $A(u, l)$ represents the strength of label $l$ at node $u$ in a graph.

There are many different mechanisms to propagate information. However, not every one is valid for graph similarity search. Any valid one must comply with the following principle.

PROPERTY 1 (COST FUNCTION). For a graph similarity cost function $C$, given an exact embedding $f_e$, $C(f_e)$ must be equal to 0.
Here, we consider a simple but effective information propagation model so that the derived neighborhood-based similarity measure satisfies the above principle. It propagates information along the shortest paths between two nodes, with exponential decay in the path length. Eq. 1 gives the formula of $A(u, l)$ in $R(u) = \{\langle l, A(u, l) \rangle\}$, which represents the h-hop neighborhood of node $u$ in a graph.

$$A(u, l) = \sum_{i=1}^{h} \alpha^{i} \sum_{d(u,v)=i} I(l \in L(v)) \qquad (1)$$

where $I(l \in L(v))$ is an indicator function which takes value one when $l$ is in the label set of $v$ and zero otherwise, $d(u, v)$ is the distance between $u$ and $v$, and $\alpha$ is a constant called the propagation factor. It is between 0 and 1; its optimum value will be discussed later. Eq. 2 confines Eq. 1 to an embedding $f$ in $G$ by only considering the vertices and the shortest paths in $f$.

$$A_f(u, l) = \sum_{i=1}^{h} \alpha^{i} \sum_{v \in V_f,\, d(u,v)=i} I(l \in L(v)) \qquad (2)$$

Using this information propagation model, we shall formulate the neighborhood-based cost function.

3.2 Neighborhood-based Cost Function

Given a query graph $Q$ and its embedding $f$ in the target graph $G$, we can apply the information propagation model to propagate labels in $Q$ and $f$. Since vertices in $f$ might not be directly connected, we will consider all of the shortest paths connecting these vertices during propagation. To derive the neighborhood-based cost function $C_N(f)$, we first compute the difference between the neighborhood vectors $R_f(u)$ and $R_Q(v)$, representing the neighborhoods of $u$ and $v$ in the embedding and the query graph, respectively.

$$C_N(v, u) = \sum_{l \in R_Q(v)} M(A_Q(v, l), A_f(u, l)) \qquad (3)$$

where $M(x, y)$ is a positive difference function, as given below.

$$M(x, y) = \begin{cases} x - y, & \text{if } x > y; \\ 0, & \text{otherwise.} \end{cases}$$

The reason to adopt a positive difference function is that if the embedding $f$ in $G$ carries more labels than $Q$, we shall not penalize it. Only when there are labels and edges missing in $f$ will $C_N(v, u)$ return a positive value. Note that the summation in Equation 3 is taken over all labels $l$ present in $R_Q(v)$, i.e. $\{l : A_Q(v, l) > 0\}$. For brevity, we simply denote this by $l \in R_Q(v)$ in Equation 3, and the same notation will be used in the remainder of the paper.

Given an embedding $f$, we aggregate the differences for all pairs $(v, u)$, where $u = f(v)$. The neighborhood-based graph similarity cost $C_N(f)$ is given as follows.

$$C_N(f) = \sum_{v \in V_Q} C_N(v, f(v)) \qquad (4)$$

[Figure 4: Neighborhood Based Similarity Cost. Target graph $G$ with nodes $u_1, u_2, u_3, u'_2$ and embeddings $f_1, f_2$; query graph $Q$ with nodes $v_1, v_2$.]

[Figure 5: Example of False Positive. Query graph $Q$ and target graph $G$ with embedding $f$.]

Figure 4 provides an example of the neighborhood-based graph matching cost. In graph $G$, label $b$ is propagated to node $u_1$ from nodes $u_2$ and $u'_2$ via the corresponding shortest paths. Assuming $\alpha = 0.5$ and $h = 2$, we have $A_G(u_1, b) = 0.5 + 0.25 = 0.75$. We can derive the neighborhood vectors for the other nodes in $G$: $R_G(u_1) = \{\langle b, 0.75 \rangle, \langle c, 0.5 \rangle\}$, $R_G(u_2) = \{\langle a, 0.5 \rangle, \langle c, 0.25 \rangle\}$, $R_G(u_3) = \{\langle a, 0.5 \rangle, \langle b, 0.75 \rangle\}$ and $R_G(u'_2) = \{\langle c, 0.5 \rangle, \langle a, 0.25 \rangle\}$. Similarly, $R_Q(v_1) = \{\langle b, 0.5 \rangle\}$ and $R_Q(v_2) = \{\langle a, 0.5 \rangle\}$.

In Figure 4, we have two possible embeddings $f_1$ and $f_2$. For $f_1$, $R_{f_1}(u_1) = \{\langle b, 0.5 \rangle\}$ and $R_{f_1}(u_2) = \{\langle a, 0.5 \rangle\}$. Hence, $C_N(f_1) = (0.5 - 0.5) + (0.5 - 0.5) = 0$. For $f_2$, we match $v_1$ to $u_1$ and $v_2$ to $u'_2$. We have $R_{f_2}(u_1) = \{\langle b, 0.25 \rangle\}$ and $R_{f_2}(u'_2) = \{\langle a, 0.25 \rangle\}$. Therefore, $C_N(f_2) = (0.5 - 0.25) + (0.5 - 0.25) = 0.5$. Note that, for the embedding $f_2$, node $u_3$ does not contribute any labels to $R_{f_2}$ since it does not participate in the matching. However, it is on the shortest path from $u'_2$ to $u_1$, thus propagating labels between $u'_2$ and $u_1$.

We must mention that the vectorization of the neighborhoods and the comparison among these vectors can be done in various ways.
However, the final cost function must satisfy the basic property of $C$ (Property 1) to avoid false negatives for exact embeddings. The following theorem shows that $C_N$ satisfies this property.

THEOREM 1. For an exact embedding $f_e$, $C_N(f_e) = 0$.

PROOF. For an exact embedding $f_e$, if $(v_1, v_2) \in E_Q$, then $(f_e(v_1), f_e(v_2)) \in E_G$. Thus, the shortest distance between the node pair $f_e(v_1), f_e(v_2)$ in $f_e$ cannot be higher than the shortest distance between the node pair $v_1, v_2$ in $Q$. Hence, it follows from Eq. 1 that $\forall l, v$, $A_f(f_e(v), l) \geq A_Q(v, l)$. Therefore, based on Eq. 3 and Eq. 4, $C_N(f_e) = 0$.

Theorem 1 ensures that there are no false negatives for exact embeddings. However, there might be some false positives, as shown in Figure 5. In this example, if $h = 1$, $C_N(f) = 0$, although $f$ is not an exact embedding of $Q$. Fortunately, if we increase $h$ to 2, $C_N(f) > 0$. In real-life graphs that have low automorphism and more distinct labels on nodes, false positives can mostly be avoided, as shown in our experiments and in the following lemma.

LEMMA 1. Given a graph $G$ and a query graph $Q$, if each of their nodes has a distinct label, then for any inexact embedding $f$, $h > 0$, $\alpha > 0$, $C_N(f) > 0$.

PROOF. Omitted.

Our definition of the neighborhood-based cost function is robust against structural differences and other forms of noise. As long as two close labels in a query graph are close enough in the target graph, we consider it a potential match. We can also rank the embeddings based on the proximity of their labels in the target graph compared to that in the query graph. Thus, even if there exists no exact embedding of the query graph, the cost function can identify the closely approximate matches and rank them based on their structural differences. We formally define our problem statement as follows.

PROBLEM STATEMENT 2. [Neighborhood-Based Top-k Similarity Search] Given a target graph $G$ and a query graph $Q$, find the top-k embeddings with respect to the cost function $C_N$.

In the following discussion, we show that the above problem is NP-hard by reducing the clique problem to it.

LEMMA 2. Given a graph $G$ and a query graph $Q$ with $|L(u)| = 1$ and $|L(v)| = 1$ for all $u \in V_G$, $v \in V_Q$, if $Q$ is a complete graph, then for all inexact embeddings $f$, $C_N(f) > 0$.

PROOF. Since $|L(u)| = 1$ and $|L(v)| = 1$ for all $u \in V_G$, $v \in V_Q$, for any inexact embedding $f$, each node $u = f(v)$ has only one label, which is the same as the label of node $v$ in $Q$. Since $Q$ is a complete graph, there exists at least one node $f(v)$ in $f$ and a label $l$ such that the number of 1-hop neighbors of $v$ in $Q$ that have label $l$ is more than the number of 1-hop neighbors of $f(v)$ in $f$ with label $l$. Hence, $A_Q(v, l) > A_f(f(v), l)$. Therefore, it follows from the definition of $C_N$ that $C_N(f) > 0$.

THEOREM 2. Neighborhood-Based Top-k Similarity Search is NP-hard.

PROOF. Let us consider the case where $|L(u)| = 1$, $|L(v)| = 1$ for all $u \in V_G$, $v \in V_Q$, and $Q$ is a complete graph. Suppose the top-1 match $f$ can be identified in polynomial time. Given $f$, it can also be verified in polynomial time whether $C_N(f) = 0$. Now, if $C_N(f) = 0$, then by Lemma 2 there exists a clique of the size of $Q$ in the target graph $G$. So, it would be possible to solve the clique problem in polynomial time. However, we know that the clique decision problem is NP-hard [0], therefore we have a contradiction. Hence, the similarity search problem is NP-hard.

The graph isomorphism problem is neither known to be solvable in polynomial time nor known to be NP-complete. However, given two graphs $Q$ and $G$ of the same size, it is possible to determine in polynomial time whether $G$ itself is an embedding of $Q$ with cost $C_N(f) = 0$. We call this problem the Graph Similarity Match problem. Thus, we suspect that neighborhood-based similarity search might have lower time complexity than graph theoretic measures such as graph isomorphism and edit distance.
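As a concrete check of Eqs. 1 to 4, the following sketch recomputes the Figure 4 example with $\alpha = 0.5$ and $h = 2$. It is a minimal illustration under stated assumptions: the function names are invented here, and the adjacency of $G$ (edges $u_1$-$u_2$, $u_1$-$u_3$, $u_3$-$u'_2$) is inferred from the neighborhood vectors quoted in the text rather than read off the original figure.

```python
from collections import deque

def hop_distances(adj, source, h):
    """Shortest-path distances from source by BFS, truncated at h hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        if dist[x] == h:
            continue  # do not expand beyond h hops
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def neighborhood_vector(adj, labels, u, alpha, h, members=None):
    """R(u) as in Eq. 1; if `members` is given, only those vertices
    contribute labels (Eq. 2), while distances still use all of the graph."""
    vec = {}
    for v, d in hop_distances(adj, u, h).items():
        if d == 0 or (members is not None and v not in members):
            continue
        for l in labels[v]:
            vec[l] = vec.get(l, 0.0) + alpha ** d
    return vec

def embedding_cost(q_adj, q_labels, g_adj, g_labels, f, alpha, h):
    """C_N(f) as in Eq. 3 and Eq. 4, with positive difference M(x, y)."""
    members = set(f.values())
    total = 0.0
    for v, u in f.items():
        r_q = neighborhood_vector(q_adj, q_labels, v, alpha, h)
        r_f = neighborhood_vector(g_adj, g_labels, u, alpha, h, members)
        total += sum(max(a - r_f.get(l, 0.0), 0.0) for l, a in r_q.items())
    return total

# Assumed reconstruction of Figure 4 ("u2p" stands for u'_2).
g_adj = {"u1": ["u2", "u3"], "u2": ["u1"], "u3": ["u1", "u2p"], "u2p": ["u3"]}
g_labels = {"u1": {"a"}, "u2": {"b"}, "u3": {"c"}, "u2p": {"b"}}
q_adj = {"v1": ["v2"], "v2": ["v1"]}
q_labels = {"v1": {"a"}, "v2": {"b"}}

r_u1 = neighborhood_vector(g_adj, g_labels, "u1", 0.5, 2)
cost_f1 = embedding_cost(q_adj, q_labels, g_adj, g_labels,
                         {"v1": "u1", "v2": "u2"}, 0.5, 2)
cost_f2 = embedding_cost(q_adj, q_labels, g_adj, g_labels,
                         {"v1": "u1", "v2": "u2p"}, 0.5, 2)
```

Under these assumptions the sketch gives $R_G(u_1) = \{\langle b, 0.75 \rangle, \langle c, 0.5 \rangle\}$, $C_N(f_1) = 0$ and $C_N(f_2) = 0.5$, in agreement with the worked example.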

THEOREM 3. Graph Similarity Match is polynomial in $n$, where $n = |V_Q|$.

PROOF. Since $G$ itself is an embedding $f$ of $Q$, we can determine the individual node matching costs $C_N(v, u)$ in polynomial time, for all $v \in V_Q$, $u \in V_G$. Next, we construct a flow network and determine the minimum cost of the maximum flow in that network (see Figure 6). From the source node $s$, add a directed edge to each node $v$ in $Q$. The capacity of each of these edges is 1 and the cost is 0. Similarly, from each node $u$ in $G$, add a directed edge to the sink node $t$. The capacity and cost of each of these edges are 1 and 0 respectively. From each node $v$ in $Q$, add a directed edge to each node $u$ in $G$ if $L(v) \subseteq L(u)$. The capacity and cost of this edge are 1 and $C_N(v, u)$ respectively. Due to the capacity constraints, each node in $Q$ can be matched with at most one node in $G$, and at most one node of $Q$ can be matched with any given node in $G$. Clearly, if the maximum flow in this network is $n$ and the minimum cost of the maximum flow is 0, then $G$ is an embedding of $Q$ with cost $C_N(f) = 0$. This flow problem can be solved using the Ford and Fulkerson algorithm [] in $O(n^3)$ time. Therefore, given two graphs $Q$ and $G$ of the same size, it is possible to determine in polynomial time whether $G$ itself is an embedding $f$ of $Q$ with cost $C_N(f) = 0$.

[Figure 6: Flow Network to Solve Graph Similarity Match. Source $s$, query nodes $v_1, \ldots, v_n$, target nodes $u_1, \ldots, u_n$, sink $t$; edges are annotated with (capacity, cost), e.g. $(1, 0)$ and $(1, C_N(v_1, u_1))$.]

3.3 Propagation Factor: α

In the information propagation model described in Eq. 1, the propagation factor $\alpha$ should be less than 1 in order to reflect the relation that the strength $A(u, l)$ of label $l$ at node $u$ decreases as the distance increases. However, we find the top-k embeddings by repeatedly matching the individual nodes from $G$ and $Q$ that satisfy a cost threshold $\epsilon$ (the detailed procedure will be discussed in the next section). Now, if $\alpha$ is large, each node will propagate a high fraction of its labels to its neighbors, and this can increase the number of false positives at the initial node matching stage, thus slowing down the overall search process. In Figure 7, for $\alpha = 0.5$ and $h = 2$, we get $R_G(u) = \{\langle a, 2 \cdot 0.25 \rangle\} = \{\langle a, 0.5 \rangle\}$ and $R_Q(v) = \{\langle a, 0.5 \rangle\}$. Thus, node $u \in G$ will be reported as a match of node $v \in Q$ even for cost threshold $\epsilon = 0$. Clearly, this is a false positive.

[Figure 7: High α. Node $u$ in $G$ with $A_G(u, a) = 0.5$; node $v$ in $Q$ with $A_Q(v, a) = 0.5$; $u$ is a false positive for $v$.]

To solve this problem, we do not employ a uniform propagation factor for different labels. Instead, for each label $l$, we select an optimum $\alpha(l)$. For a given label $l$, let us assume that the maximum number of one-hop neighbors with label $l$ of any node in $G$ is $n_1(l)$. To consider the worst case, let us assume that some node $u$ in $G$ has no one-hop neighbor with label $l$, but it has $n_2(l)$ two-hop neighbors with label $l$, $n_3(l)$ three-hop neighbors with label $l$, and so on. Therefore, the strength of label $l$ at node $u$ in $G$ will be as follows,

$$A_G(u, l) = \sum_{i=2}^{h} n_i(l)\, \alpha^{i}(l) < \frac{n_2(l)\, \alpha^{2}(l)}{1 - n_1(l)\, \alpha(l)} \qquad (5)$$

To avoid false positives, we want $A_G(u, l) < A_Q(v, l) = \alpha(l)$, as shown in Figure 7. Requiring the right-hand side of Eq. 5 to be at most $\alpha(l)$ gives $n_2(l)\alpha(l) < 1 - n_1(l)\alpha(l)$. Hence, $\alpha(l) < \frac{1}{n_1(l) + n_2(l)}$.

In the next section, we will introduce an iterative method to find the top-k embeddings in a large graph.

4. SEARCH ALGORITHM

In this section, we introduce a scalable iterative approach to find the top-k graph embeddings. Our goal is not to enumerate all the possible embeddings $f$ in $G$ for a given query graph, whose cost is prohibitive. Instead of enumerating $f$, we directly use $A_G(u, l)$ to bound $A_f(u, l)$, since $A_G(u, l) \geq A_f(u, l)$.

LEMMA 3. Given a query graph $Q$ and its embedding $f$ in $G$, $\forall l$, $u \in V_f$, $A_G(u, l) \geq A_f(u, l)$.

PROOF. Omitted.

Lemma 3 shows that $A_G(u, l)$ in the neighborhood vector $R_G(u)$ cannot be lower than $A_f(u, l)$ of the same label $l$ in the neighborhood vector $R_f(u)$, where $f$ is a subgraph of $G$.

THEOREM 4. Given a query graph $Q$ and its embedding $f$ in $G$,

$$\sum_{v \in V_Q} \sum_{l \in R_Q(v)} M(A_Q(v, l), A_G(f(v), l)) \leq C_N(f)$$

PROOF. It follows from Lemma 3 that $M(A_Q(v, l), A_f(u, l)) \geq M(A_Q(v, l), A_G(u, l))$.

Theorem 4 shows that, without enumerating embeddings of $Q$ in the target graph $G$, we can derive the lower bound $M(A_Q(v, l), A_G(u, l))$, where $u$ is a possible match of $v$ in $G$.

[Figure 8: Node Matching Example. Target graph $G$ with candidate nodes $u$ and $u'$; query graph $Q$ with node $v$.]

Our algorithm works by iteratively pruning unpromising nodes in the target graph.

1. Match the individual nodes of the query graph with nodes in the target graph that satisfy a predefined cost threshold $\epsilon$ (see Eq. 7).
2. Discard the labels of the unmatched nodes in the target graph.
3. Propagate the labels only among the matched nodes from the previous step. Recompute the neighborhood vectors $R_G(u)$ only for the matched nodes. Repeat Step 1 until convergence.

During each iteration, we remove the labels of the unmatched nodes in the target graph $G$ and then recompute the neighborhood vectors only for the matched nodes. Since the modified target graph has more unlabeled nodes compared to the previous iteration, $A_G(u, l)$ will decrease. With this new and reduced set of neighborhood vectors, and using the same cost threshold $\epsilon$, we determine the individual node matches with the nodes of the query graph. Therefore, some additional nodes in $G$ will become unmatched at each iteration. The iteration continues until no new unmatched nodes are found. For real-life graphs, with less automorphism and more distinct labels, we can unlabel most of the unpromising nodes using this technique. Thus, finding the top-k embeddings from the set of remaining matched nodes of $G$ becomes almost trivial.

To determine the runtime complexity of our iterative search algorithm, let us denote the number of promising nodes present before the $i$-th iteration by $n_i$ and the number of unpromising nodes discovered at the $i$-th iteration by $k_i$, where $i \geq 1$. Clearly, $n_1 = n$ and $n_{i+1} = n_i - k_i$. If there are $r$ iterations in total, $\sum_{i=1}^{r} k_i = O(n)$. Let the complexity of iteration $i$ be $T_i$. In the first iteration, each node needs to propagate its labels up to $h$ hops. Thus, $T_1 = O(n l d^h)$, where $l$ is the average number of labels and $d^h$ is the average number of h-hop neighbors for each node in $G$. However, for each of the subsequent iterations, it is not necessary to perform such propagation for all the nodes in the graph. Rather, the number of unpromising nodes at iteration $i + 1$, for $i \geq 1$, can be determined either by propagating the labels of the remaining $n_{i+1}$ nodes, or by subtracting the effect of the $k_i$ unpromising nodes from the previous iteration. Hence, $T_{i+1} = O(\min\{n_{i+1}, k_i\}\, l d^h)$, for $i \geq 1$. Therefore, the overall runtime complexity of our search algorithm is given as follows.

$$T_1 + \sum_{i=2}^{r} T_i = O(n l d^h) + \sum_{i=1}^{r-1} O(\min\{n_{i+1}, k_i\}\, l d^h) = O(n l d^h) + \sum_{i=1}^{r-1} O(k_i\, l d^h) = O(n l d^h) \qquad (6)$$

In practice, it converges much faster.
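The three pruning steps can be sketched as a fixed-point loop. The code below is a simplified illustration, not the authors' implementation: it recomputes every vector from scratch in each round (rather than subtracting the effect of the newly unlabeled nodes), uses a single propagation factor, and assumes every query node carries at least one label. All names and the toy graphs are assumptions introduced here.

```python
from collections import deque

def vectors(adj, labels, alpha, h):
    """Neighborhood vector R(u) of every node (Eq. 1) via truncated BFS."""
    result = {}
    for s in adj:
        dist, queue, vec = {s: 0}, deque([s]), {}
        while queue:
            x = queue.popleft()
            if dist[x] == h:
                continue
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
                    for l in labels[y]:
                        vec[l] = vec.get(l, 0.0) + alpha ** dist[y]
        result[s] = vec
    return result

def node_cost(r_q, r_g):
    """cost(u, v): sum of positive differences over labels in R_Q(v)."""
    return sum(max(a - r_g.get(l, 0.0), 0.0) for l, a in r_q.items())

def iterative_unlabel(g_adj, g_labels, q_adj, q_labels, alpha, h, eps):
    """Repeat match / unlabel / re-propagate until the lists stop shrinking."""
    r_q = vectors(q_adj, q_labels, alpha, h)
    current = {u: set(ls) for u, ls in g_labels.items()}  # working labels
    prev_sizes = None
    while True:
        r_g = vectors(g_adj, current, alpha, h)
        lists = {v: {u for u in g_adj
                     if q_labels[v] <= current[u]
                     and node_cost(r_q[v], r_g[u]) <= eps}
                 for v in q_adj}
        sizes = {v: len(us) for v, us in lists.items()}
        if sizes == prev_sizes:
            return lists
        prev_sizes = sizes
        matched = set().union(*lists.values())
        for u in g_adj:
            if u not in matched:
                current[u] = set()  # step 2: discard labels of unmatched nodes

# Toy data (assumed): query path a - b - c; target holds an exact copy
# x1 - x2 - x3 plus a decoy pair y1(a) - y2(b) that lacks the c node.
q_adj = {"v1": ["v2"], "v2": ["v1", "v3"], "v3": ["v2"]}
q_labels = {"v1": {"a"}, "v2": {"b"}, "v3": {"c"}}
g_adj = {"x1": ["x2"], "x2": ["x1", "x3"], "x3": ["x2"],
         "y1": ["y2"], "y2": ["y1"]}
g_labels = {"x1": {"a"}, "x2": {"b"}, "x3": {"c"}, "y1": {"a"}, "y2": {"b"}}

result = iterative_unlabel(g_adj, g_labels, q_adj, q_labels, 0.5, 1, 0.0)
```

In this toy run, `y2` fails in the first round because it has no neighbor labeled $c$; once its label is discarded, `y1` loses its only source of label $b$ and is pruned in the second round, leaving only the exact copy.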
Next, we shall discuss the details of the iterative algorithm and the algorithm to find the top-k embeddings from the nodes filtered by the iterative algorithm.

4.1 Node Match

Given the target graph $G$ and the query graph $Q$, we compute the vectors $R_G(u)$ and $R_Q(v)$ for all nodes $u \in V_G$, $v \in V_Q$, considering their h-hop neighborhoods. For each node pair $u \in V_G$, $v \in V_Q$, s.t. $L(v) \subseteq L(u)$, we calculate the node matching cost $cost(u, v)$ as the difference of their neighborhood vectors,

$$cost(u, v) = \sum_{l \in R_Q(v)} M(A_Q(v, l), A_G(u, l)) \qquad (7)$$

Figure 8 shows an example. Assume $\alpha = 0.5$ and $h = 2$. We get $R_G(u) = \{\langle b, 0.5 \rangle, \langle c, 0.25 + 0.25 \rangle\} = \{\langle b, 0.5 \rangle, \langle c, 0.5 \rangle\}$, and similarly, $R_G(u') = \{\langle b, 1 \rangle, \langle c, 0.25 \rangle\}$. Meanwhile, for the query graph $Q$, we have $R_Q(v) = \{\langle b, 0.5 \rangle, \langle c, 0.25 \rangle\}$. Hence, $cost(u, v) = 0$ and also $cost(u', v) = 0$ following the above equation.

Now, for each node $v \in V_Q$, we maintain a list of nodes $u \in V_G$ such that $L(v) \subseteq L(u)$ and $cost(u, v) \leq \epsilon$. Here, $\epsilon$ is a predefined cost threshold. The value of $\epsilon$ will be discussed shortly.

4.2 Top-k Search

In order to find the top-k graph embeddings, we initialize the cost threshold $\epsilon$ to a small value $\epsilon_0 > 0$ and perform the above-mentioned iterative procedure until it terminates. Given the matched nodes, if we cannot find at least $k$ embeddings from them with cost $C_N(f) \leq \epsilon |V_Q|$ each, then the threshold cost $\epsilon$ is doubled and we repeat the above procedure until $k$ embeddings are found. Otherwise, we find the top-k embeddings among the matched nodes. Note that, at this point, any embedding formed entirely by unmatched nodes will have cost $C_N(f) > \epsilon |V_Q|$. However, it is possible to have some embedding with a few matched and a few unmatched nodes, and the cost of such embeddings might also be $C_N(f) \leq \epsilon |V_Q|$. The problem is eliminated as follows. We set $\epsilon$ equal to the highest cost of the discovered top-k embeddings and then run the algorithm again (this step will find top-k embeddings whose node cost might be higher than the previous $\epsilon$). In this case, any embedding formed by at least one of the unmatched nodes will have cost more than that of any of the top-k embeddings found earlier.
Hence, the top-k embeddings identified only using the matched nodes will be the best top-k embeddings. The complete algorithm is given below.

Algorithm 1 Top-k Search
Input: Target graph $G$, query graph $Q$, positive integer $k$.
Output: Top-k matches $f$ based on the cost metric $C_N$.
procedure
1: $\epsilon \leftarrow \epsilon_0$, compute $R_Q(v)$, $\forall v \in V_Q$
2: $list_0(v) = \{u : u \in V_G \wedge L(v) \subseteq L(u)\}$
3: $i \leftarrow 1$, start with the original graph $G$ and compute $R_G(u)$, $\forall u \in V_G$
4: for all $v \in V_Q$ do
5:     $list_i(v) = \{u : u \in V_G \wedge L(v) \subseteq L(u) \wedge cost(u, v) \leq \epsilon\}$
6: end for
7: $(list, i)$ = Iterative Unlabel$(list, i, G, Q)$
8: if $k$ matches of cost $C_N(f) \leq \epsilon |V_Q|$ can be found in $\{u : u \in list_i(v), v \in V_Q\}$ then
9:     report top-k matches and stop
10: else
11:     $\epsilon \leftarrow 2\epsilon$
12:     go back to step 2
13: end if

Algorithm 2 Iterative Unlabel $(list, i, G, Q)$
procedure
1: if $|list_i(v)| < |list_{i-1}(v)|$ for some $v \in V_Q$ then
2:     for all $u \in V_G$ do
3:         if $u \notin list_i(v)$ $\forall v \in V_Q$ then
4:             unlabel $u$
5:         end if
6:     end for
7:     recompute $R(u)$ $\forall u \in V_G$
8:     $(list, i)$ = Iterative Unlabel$(list, i + 1, G, Q)$
9: else
10:     return $(list, i)$
11: end if

From the final list of matched nodes for each node in $V_Q$, how can we find embeddings with cost $C_N(f) \leq \epsilon |V_Q|$ each (line 8 of Algorithm 1)? One simple technique is to consider all possible combinations from the lists and verify their costs. When the number of matched nodes in each of the final lists is small, this is not time consuming to check. However, when the lists are long, we can do better than brute-force enumeration using dynamic programming.

After a final list of matched nodes $list(v)$ for each $v \in V_Q$ is generated, we perform the propagation once more among the matched nodes; however, this time we propagate the node ids instead of labels. After this propagation, each matched node $u$ in $G$ will know its neighboring nodes within $h$ hops (denoted as $neighbor(u)$) that have influence on the cost (Eq. 1). The final embeddings can be formed as follows. We select a node $u \in list(v)$ for some $v \in V_Q$ and initialize a set $Possible\_match = neighbor(u)$. We have two situations: (1) within $h$ hops of $u$, there is no $f(v')$, $\forall v' \neq v$ in $Q$; (2) $\exists v' \neq v$ in $Q$, in which case we try to identify a match $u'$ inside $Possible\_match$ and extend this set by adding $neighbor(u')$ and eliminating the node $u'$ from $Possible\_match$. For the first situation, we could derive the cost for node $u$ as $\sum_{l \in L(v)} A_Q(v, l)$. We can recurse among these two situations to find the embeddings. In this way, we can find the low-cost embeddings without enumerating all possible combinations among the nodes in the final lists.

5. INDEXING

The most expensive parts of Ness are the computation of $R_G(u)$ for all $u$ in $G$ (Line 3 of Algorithm 1) and the determination of $list_1(v)$ for all $v$ in $V_Q$ (Line 5 of Algorithm 1). However, the computation of $R_G(u)$ can be done off-line by performing a breadth-first search up to $h$ hops from each node in $G$. Its time complexity is $O(|V_G|\, d^h)$, where $d$ is the average degree of each node. To speed up the computation of $list_1(v)$ for all $v \in V_Q$, we use two types of simple index structures.

In the first type of indexing, we build a hash table corresponding to each label. The nodes in $G$ are hashed based on their labels. Given a query node $v$, we use this hash structure to quickly identify the set of possible matches $u$ such that $L(v) \subseteq L(u)$. If the labels of $v$ are very selective, there will be a limited number of possible matches $u$, and we can quickly determine the nodes $u$ among these matches for which $cost(u, v) \leq \epsilon$.
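A label hash index of this kind can be sketched with a dictionary from each label to the set of nodes carrying it. The names `build_label_index` and `candidates` are hypothetical; the paper does not specify the data structure beyond "a hash table corresponding to each label", and intersecting the smallest posting set first is a design choice made here.

```python
from collections import defaultdict

def build_label_index(labels):
    """Map each label to the set of target nodes that carry it."""
    index = defaultdict(set)
    for u, node_labels in labels.items():
        for l in node_labels:
            index[l].add(u)
    return index

def candidates(index, query_labels):
    """Nodes u with L(v) a subset of L(u): intersect the posting sets,
    starting from the most selective (smallest) one."""
    postings = sorted((index.get(l, set()) for l in query_labels), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result

# Toy data (assumed for illustration).
g_labels = {"u1": {"a", "x"}, "u2": {"a"}, "u3": {"b"}}
index = build_label_index(g_labels)
only_a = candidates(index, {"a"})
a_and_x = candidates(index, {"a", "x"})
```

The surviving candidates would then be checked against the threshold with $cost(u, v) \leq \epsilon$ as in Eq. 7.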
Algorithm 3 Neighborhood Based Indexing
Off-line Procedure
1: pre-compute R_G(u) = {⟨l, A_G(u, l)⟩} for all u ∈ V_G
2: for all labels l do
3:   create a sorted list S(l) of nodes in descending order of A_G(u, l), such that u_i(l) is the i-th node in S(l)
4: end for
On-line Procedure
1: i ← 1
2: sum(i) ← Σ_{l ∈ R_Q(v)} M(A_Q(v, l), A_G(u_i(l), l))
3: if sum(i) ≤ ϵ then
4:   i ← i + 1
5:   go to step 2
6: else
7:   verify whether cost(u_j(l), v) ≤ ϵ for all u_j(l) with j < i, l ∈ R_Q(v)
8: end if

However, if the labels of v are not very selective and there are many possible matches using the hashing technique discussed above, we use the second index structure, which is built on the neighborhood vector R_G(u) following the principle of the Threshold Algorithm [12]. The neighborhood vector R_G(u) = {⟨l, A_G(u, l)⟩} for each node u ∈ V_G is pre-computed. Next, for each label l, we generate a sorted list S(l) of nodes u in descending order of their A_G(u, l) values. Let us denote the node at position i from the top of S(l) as u_i(l). In the online phase, we start from the top of each sorted list S(l), l ∈ R_Q(v), in parallel and move to the next position in each subsequent iteration. For some position i from the top, we compute sum(i) = Σ_{l ∈ R_Q(v)} M[A_Q(v, l), A_G(u_i(l), l)]. Assume that at iteration i = i′, sum(i′) becomes greater than the cost threshold ϵ. Then we terminate this iterative procedure and verify, for all nodes u_j(l) with j < i′ and l ∈ R_Q(v), whether cost(u_j(l), v) ≤ ϵ. For each v ∈ V_Q, we need to verify only O((i′ − 1)·|l|) nodes for their cost, where |l| denotes the number of labels in R_Q(v). This can reduce the complexity of the online algorithm significantly. The complete procedure for neighborhood based indexing is given in Algorithm 3.

Proof of Correctness. Let us denote by S_i(l) all the nodes up to position i from the top of the sorted list S(l), i.e., S_i(l) = {u_j(l), j ≤ i}. The following lemma will be useful to prove the correctness of our indexing algorithm.

LEMMA 4. If sum(i) > ϵ, then for all u ∉ ∪{S_i(l) : l ∈ R_Q(v)}, cost(u, v) > ϵ.

PROOF. It follows directly from the fact that each S(l) is a sorted list of nodes u in descending order of A_G(u, l) values: any node u outside these prefixes has A_G(u, l) ≤ A_G(u_i(l), l) for every l ∈ R_Q(v), and hence cost(u, v) ≥ sum(i).

Therefore, in Algorithm 3, we start from i = 1 and find the smallest i for which sum(i) > ϵ. Following the previous lemma, any node u ∉ ∪{S_i(l) : l ∈ R_Q(v)} can be eliminated without actually computing cost(u, v). We note that our indexing can be easily implemented in a disk-based manner for very large graphs. Also, we can apply external memory breadth first search algorithms, e.g., Ulrich Meyer [1] and Lars Arge [2], to compute the neighborhood vectors R_G(u) for all the nodes.

Dynamic Update. Our indexing structure can efficiently accommodate dynamic updates in G, i.e., insertion/deletion of nodes, edges and labels. If a node u is added or deleted in G, it will only change the vectors of u's h-hop neighbors. We only need to propagate the labels of these nodes and modify their neighborhood vectors. They also need to be updated in the sorted lists of label l for all l ∈ L(u). The addition/deletion of a label can be handled similarly. If an edge (u_1, u_2) is added/deleted in G, we need to update the vectors for the h-hop neighbors of both u_1 and u_2.

6. QUERY OPTIMIZATION

In this section, we eliminate the non-discriminative labels from both the target and the query graphs at the initial stage of our matching algorithm to make the technique more efficient. The efficiency of the Iterative Unlabel algorithm is related to the number of individual node matches for each node in the query graph. If some node is not very selective in terms of its own labels or the labels present in its neighborhood, there will be many matches corresponding to that node at the initial stage of our algorithm. To eliminate the problem posed by these nodes, we first eliminate all the non-discriminative labels from both the target graph and the query graph, and then we also ignore the nodes in the query graph that do not contain a sufficient number of discriminative labels in themselves and in their neighborhoods.
These non-discriminative labels are considered at the last stage of our matching algorithm, i.e., when we search for the final matches. In the following discussion, we shall clarify the notion of discriminative and non-discriminative labels from the perspective of node and graph matches.
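Returning to the on-line procedure of Algorithm 3 in Section 5, the sorted-list scan with early termination can be sketched as follows. The sketch assumes that M(x, y) is the positive difference x − y when x > y and 0 otherwise (the definition of M lies outside this section), treats a node missing from S(l) as having strength 0 for l, and ignores the label containment test L(v) ⊆ L(u). The function names are illustrative only.

```python
from collections import defaultdict


def positive_difference(x, y):
    """Assumed form of M(x, y): penalize only the missing strength."""
    return x - y if x > y else 0.0


def cost(query_vec, target_vec):
    return sum(positive_difference(a, target_vec.get(l, 0.0))
               for l, a in query_vec.items())


def build_sorted_lists(vectors):
    """Off-line: for each label, nodes in descending order of A_G(u, l)."""
    lists = defaultdict(list)
    for u, vec in vectors.items():
        for l, a in vec.items():
            lists[l].append((a, u))
    for l in lists:
        lists[l].sort(key=lambda pair: -pair[0])
    return dict(lists)


def threshold_candidates(query_vec, vectors, sorted_lists, eps):
    """On-line: scan the lists in parallel, stop once the bound exceeds eps."""
    seen = set()
    max_len = max((len(sorted_lists.get(l, [])) for l in query_vec), default=0)
    for i in range(max_len + 1):
        bound = 0.0
        for l, a_q in query_vec.items():
            lst = sorted_lists.get(l, [])
            a_g = lst[i][0] if i < len(lst) else 0.0  # exhausted list: strength 0
            bound += positive_difference(a_q, a_g)
        if bound > eps:
            # Every unseen node costs at least `bound`, so only seen nodes are verified.
            return {u for u in seen if cost(query_vec, vectors[u]) <= eps}
        for l in query_vec:
            lst = sorted_lists.get(l, [])
            if i < len(lst):
                seen.add(lst[i][1])
    # The bound never exceeded eps: no node can be pruned, so verify all of them.
    return {u for u in vectors if cost(query_vec, vectors[u]) <= eps}
```

The final fallback covers the corner case in which even a node with an empty neighborhood vector stays within the threshold, so nothing can be pruned.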

[Figure 9: Discriminative (Heavy-Head) vs. Non-Discriminative (Heavy-Tail) Distribution. Panels: (a) heavy-head, (b) heavy-tail. X-axis: A_G(u, l); Y-axis: # of nodes; regions marked Pruned and Not Pruned relative to A_Q(v, l).]

[Figure 10: Top-2 Matches (Query 1). Panels: (a) Query, (b) Match_1, (c) Match_2.]

[Figure 11: Top-2 Matches (Query 2). Panels: (a) Query, (b) Match_1, (c) Match_2.]

Let us consider the distribution of A_G(u, l) values of some label l, ⟨l, A_G(u, l)⟩ ∈ R_G(u), for different nodes u ∈ V_G. Figure 9 shows one example. For label l, we plot the different A_G(u, l) values along the X-axis. The Y-axis shows the number of nodes u having that particular A_G(u, l) value in their neighborhood vector R_G(u). The distribution in Figure 9(a) is skewed towards the smaller values of A_G(u, l), whereas Figure 9(b) is skewed towards the higher values of A_G(u, l). We call them heavy-head and heavy-tail distributions respectively. Given a query node v, since we prune all the nodes u in G for which Σ_{l ∈ R_Q(v)} M[A_Q(v, l), A_G(u, l)] > ϵ, the labels with a heavy-head distribution have more pruning power than those with a heavy-tail distribution. Therefore, we should retain labels with a heavy-head distribution for node match, as those labels are more discriminative.

7. EXPERIMENTAL RESULTS

In this section, we present the experimental results to demonstrate the effectiveness and the efficiency of the neighborhood based similarity search technique on a number of real-life and synthetic graph datasets including DBLP, Intrusion, Freebase and WebGraph. In order to evaluate the effectiveness, we show two possible applications: RDF query answering and network alignment.
We test the robustness of our approach by providing the accuracy of the best matches for queries of different sizes and under the presence of random noise. The efficiency and scalability of our approach are also investigated. All experiments are performed using a single core in a 40GB, 2.50GHz Xeon server.

7.1 Graph Data Sets

DBLP Collaboration Graph. The DBLP collaboration graph is downloaded from ley /db. There are 684K distinct authors and 7M co-author edges among them. We consider the name of each author as the label of that node. There are 683,927 distinct labels in DBLP. We use the DBLP dataset for the efficiency test.

Freebase Entity Relationship Graph. Freebase is a large collaborative knowledge base of structured data harvested from many sources including Wikipedia. We downloaded the film entity relationship graph data from /. This graph has 72K nodes, each representing an entity, i.e., actor, movie, director, producer and so on. An edge represents the relationship between two entities. Names of entities are treated as labels. There are a total of 579K edges and 59, 54 distinct labels in this graph. The Freebase graph is used for effectiveness, robustness and efficiency analysis.

Intrusion Alert Network. This network contains the anonymous log data of intrusion alerts in a computer network. It has 200K nodes and 703K edges, where each node is a computer and an edge means a possible attack such as Denial-of-Service and TCP Service Sweep. Each node has 25 labels (computer generated alerts in this case) on average. There are around 1,000 types of alerts. We use this graph for robustness and efficiency experiments.

WebGraph with Synthetic Labels. We downloaded the uk web graph data from [4]. This web graph is a collection of UK web pages. For our experiments, we use a subset that contains 10M pages (i.e., nodes) and 23M hyperlinks (i.e., edges). We uniformly assign 10,000 synthetically generated labels across various nodes, such that each node gets one label. We test the scalability of our approach on this graph.
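The synthetic labeling of the WebGraph subset can be mimicked with a few lines. The sketch below assumes that "uniformly assign" means each node independently draws one label id uniformly at random; the paper does not say whether a random or a round-robin assignment was used, and the function name and seed parameter are hypothetical.

```python
import random


def assign_uniform_labels(nodes, num_labels, seed=0):
    """Give every node exactly one label id drawn uniformly from range(num_labels)."""
    rng = random.Random(seed)  # fixed seed so that the assignment is reproducible
    return {u: rng.randrange(num_labels) for u in nodes}
```

With num_labels = 10000 this yields one synthetic label per node, as in the setup described above.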
7.2 RDF Query Answering

In addition to the query shown in Figure 1, we show two more examples using the Freebase graph dataset.

Query 1: Who did cinematography for at least two Sheila McCarthy movies, one of them being Andre? The person was also the cinematographer of the movie Magic in the Water.

Here, we would like to emphasize that Sheila McCarthy did not act in the movie Andre. However, as discussed earlier, this type of inaccuracy is common, since the user may not have the accurate information, or there can be some noise in the target graph. Using our approach, we get the top-2 answers for this query shown in Figure 10.

Query 2: Which actors have appeared in both a "John Waters" movie and a "Steven Spielberg" movie?

The query and the corresponding top-2 matches are shown in Figure 11. Here, we would like to emphasize that actors in the Freebase dataset are not directly connected with the directors and cinematographers, but rather via some movies. To write a SPARQL query, we need to maintain this structural property. However, given the query graph shown in Figure 11, which does not maintain this structural property, we still obtain results where the embeddings are very close to the query graph.

[Figure 12: Robustness of Network Alignment. Panels: (a) Accuracy (Intrusion), (b) Error Ratio (Freebase), (c) Error Ratio (Intrusion). Y-axes: ACCURACY, ERROR RATIO.]

[Figure 13: Convergence of Online Search Algorithm (DBLP). Panels: (a) Top-k Search (Algorithm 1), (b) Iterative Unlabel (Algorithm 2), (c) Online Search Time. Y-axes: AVG # OF ITERATIONS, SEARCH TIME (SEC).]

7.3 Network Alignment

We perform network alignment for query graphs of different sizes and in the presence of various amounts of noise. For these experiments, three different sets of query graphs are used, with diameters 2, 3, 4 and the number of nodes 100, 150, 200 respectively. These query sets simulate the situation when we align a small social network to a large one. In each query set, we randomly select 100 subgraphs with the specified diameters and nodes from the original graph datasets. Then we introduce noise by adding edges to the query graphs that are not present in the original graph. The noise ratio is defined as the number of edges added divided by the original number of edges present in the query graph. We use propagation depth 2, and α is selected as described earlier in Section 3.3. The robustness of our approach in the presence of random noise is measured using two metrics.
The accuracy is defined as the number of correctly identified nodes of the target graph in all the top-1 matches divided by the total number of nodes in all query graphs in the corresponding query set. The accuracy is 1 for both DBLP and Freebase datasets with different amounts of noise, since these graphs have a larger number of distinct labels. The accuracy vs. noise ratio plot for the Intrusion dataset is shown in Figure 12(a). The accuracy remains at a relatively high level when the noise ratio increases up to 0.2. We also measure the error ratio, which is defined as the number of incorrectly identified nodes of the target graph in all the top-1 matches divided by the total number of nodes in all query graphs in the corresponding query set. The lower the error ratio is, the more distinguishable the nodes are in terms of their neighborhood structure and contents. The error ratio remains close to 0 for the DBLP graph at different amounts of noise. The error ratio vs. noise ratio plots for Freebase and Intrusion are shown in Figure 12(b) and 12(c) respectively. It can be observed that the error ratio remains at a relatively low level for the Freebase graph when the noise ratio increases up to 0.2. Hence, these experiments indicate that DBLP and Freebase are less automorphic compared to the Intrusion network.

7.4 Efficiency Results

We provide the running time of our algorithm for different datasets in Table 1. For these experiments, we randomly select query graphs with 50 nodes and diameter 2 from the original graph datasets. The vectorization and indexing are performed with propagation depth 2, and the search algorithm is used to identify the top-1 matches. It can be observed that our algorithm is very efficient for large graph datasets. The on-line phase for the Intrusion graph requires more time because the average number of labels per node is much higher than that in the other graphs. This leads to more time spent on cost computation (Eq. (7)). We also verify the convergence rate of our Top-k Search and Iterative Unlabel algorithms for the various network alignment experiments discussed earlier.
The convergence rate of these algorithms is measured as the average number of iterations required before they terminate. When the noise ratio is increased, our algorithm requires more iterations to satisfy the cost threshold. Thus, the corresponding running time also increases, as shown in Figure 13 for the DBLP dataset. Moreover, it requires more time to identify the

matches of a larger query graph. The convergence plots for the Freebase and Intrusion networks are given in Figure 14.

[Figure 14: Convergence of Online Search Algorithm (Freebase & Intrusion). Panels: (a) Convergence (Freebase), (b) Search Time (Freebase), (c) Convergence (Intrusion), (d) Search Time (Intrusion). Y-axes: AVG # OF ITERATIONS, SEARCH TIME (SEC).]

Table 1: Efficiency: Off-line Indexing and Online Search

Dataset                          2-hop Indexing (Off-line)   Top-1 Search (Online)
DBLP (0.7M, 7M, 0.7M)            1,733 sec                   0.06 sec
Freebase (0.2M, 0.6M, 0.2M)      280 sec                     0.22 sec
Intrusion (0.2M, 0.5M, 1K)       227 sec                     1.6 sec
WebGraph (10M, 23M, 10K)         5, 25 sec                   0.26 sec

7.5 Neighborhood-based Cost Function Properties

Recall that we proved in Theorem 1 that our neighborhood-based cost function ensures there are no false negatives when the cost threshold is set to 0. In this subsection, we investigate the false positive rate of our neighborhood-based cost function with the threshold set to 0. This experiment is performed on the DBLP, Freebase and Intrusion datasets. In particular, for each dataset, we select 100 small query subgraphs with 10 nodes each from the original graph. For each of the query graphs, by using 2-hop propagation, we identify all matches with cost = 0. Among these matches, we manually verify whether there are any false positives, i.e., matches that are not graph isomorphic with the query graph. The percentage of false positives is calculated as the number of false positives divided by the total number of matches obtained. We show the results in Table 2. It can be seen that, using our cost function with the cost threshold set to 0, the percentage of false positives on real-life social/information networks is very small.

Table 2: False Positive Ratio

Dataset      False Positive
DBLP         0%
Freebase     0%
Intrusion    0.3%

Table 3: Benefits of Index and Optimization

Dataset      Search with Index & Optimization   Search w/o Index & Optimization
DBLP         0.06 sec                           9.63 sec
Freebase     0.22 sec                           1.75 sec

As we have discussed earlier, the higher the value of h is, the lower the number of false positives will be.
Therefore, for a target graph, we can employ the error ratio as a cost function and learn a satisfactory value of h from training queries generated from the target graph. The DBLP graph is used in this experiment. We use a training set of 100 small query graphs (with 10 nodes each) generated from the DBLP graph. The queries are generated in such a way that the labels in the query nodes are mostly not unique. Some noise is also added to these query graphs as explained earlier. Next, we start with h = 0 and gradually increase h until the error ratio becomes less than a small value. We show the results for the DBLP graph in Figure 15. It can be observed that, by setting h = 2, we can reduce the error ratio to an acceptable level when the noise ratio is below 0.1. This indicates that for real-life social/information networks with few automorphisms and many distinct labels, we only need a small propagation depth to make the error ratio close to zero.

7.6 Pruning Capacity of Search Algorithm

We verify the pruning capacity of our Top-k search algorithm with respect to the number of distinct labels present in the target graph. For this experiment, we use a subgraph extracted from the WebGraph dataset, which contains 1,000 nodes and 4,067 edges. We vary the number of distinct labels from 1 to 800. Given a randomly extracted query graph with the number of nodes |V_Q| = 8, 10 and 12 respectively, we check how many subgraphs need to be verified during the final match phase of our approach. The smaller this number is, the more powerful the pruning of our algorithm is. We plot the number of subgraphs that need to be verified in the final match phase vs. the number of distinct labels in Figure 16. Note that the Y-axis is in log scale. It can be observed that, when there is only 1 distinct label in the entire graph, we need to verify about 10^25 subgraphs for a query graph with 8 nodes during the final match phase. However, as the number of distinct labels increases, the number of subgraphs that we need to verify decreases rapidly. For 800 distinct labels, we only need to verify a very small number of subgraphs (e.g.
2 subgraphs when |V_Q| = 8) in the final match phase of our approach. Thus, our algorithm can be very efficient on graphs with few automorphisms and many distinct labels.

7.7 Indexing and Query Optimization

In Table 3, we compare the running time of our online search algorithm with that of a linear scan with no indexing and query optimization. Each of the query graphs has 50 nodes and diameter 2 for this experiment. It can be observed that our indexing and query optimization techniques can significantly speed up online search. We also compare the index construction time of dynamic update with the cost of rebuilding the whole index when the target graph is modified. The propagation depth is 2 for these experiments. The results for the DBLP dataset are shown in Figure 17. As we can see, for a wide range of updates in the target graph, it is more efficient to update the index structure rather than re-indexing the graph. The

results also indicate that our index structure is very efficient against dynamic updates in the target graph.

[Figure 15: Satisfactory h Value (DBLP). X-axis: PROPAGATION DEPTH; Y-axis: ERROR RATIO; one curve per noise ratio.]

[Figure 16: Pruning Capacity (WebGraph). X-axis: # OF DISTINCT LABELS; Y-axis: # OF SUBGRAPHS (log scale); curves for |V_Q| = 8, 10, 12.]

[Figure 17: Dynamic Update Index (DBLP). X-axis: % NODE UPDATE; Y-axis: TIME (SEC); curves: Dynamic Update, Re-Index.]

7.8 Scalability

We show the scalability of our approach on the WebGraph dataset. The vectorization time as a function of the number of nodes in the graph is shown in Figure 18(a). Figure 18(b) shows the trend of the online search time with respect to the number of nodes. The propagation depth is 2 for indexing, and we identify the top-1 matches using our search algorithm. Each of the query graphs has 10 nodes and diameter 3 for this experiment. As can be observed, for a graph with 10 million nodes, our approach can return the top-1 match in 0.1 second. The corresponding index building time is also tolerable. Both the index building time and the online search time are roughly linear in the number of nodes. These results show that our technique is highly scalable for large scale information/social networks.

[Figure 18: Scalability Results (WebGraph). Panels: (a) Vectorization Time, (b) Search Time. X-axis: # OF NODES (M); Y-axis: TIME (SEC).]

8. RELATED WORK

Graph search has been studied in different contexts such as graph isomorphism, graph indexing, structure matching, etc. In XML, where the structures encountered are often trees and lattices, queries built on path expressions became popular [28] and their corresponding indices have been developed [9]. In bioinformatics, exact and approximate graph alignment has been extensively studied, e.g., PathBlast [21], Saga [33], NetAlign [23], IsoRank [32]. They target relatively small biological networks with fewer than 10k nodes. It is difficult to apply them to social and information networks with thousands or even millions of nodes.

Kernel based graph matching techniques have also been proposed, e.g., common walks [16, 18], shortest paths [15], limited-size subgraphs [19] and subtree patterns [20]. Recently, Shervashidze et al. [25] proposed a fast subtree pattern kernel based on the Weisfeiler-Lehman method. Kernel methods do not support subgraph search well.

For subgraph search, Shasha et al. [31] extend the path-based technique for full-scale graph retrieval; Yan et al. propose gIndex [37] using frequent subgraphs. These studies inspired new graph index structures such as δ-tolerance Closed Frequent Subgraphs [8], Tree+Δ [40], and GCoding [41]. He et al. [17] develop a closure tree index to perform approximate graph search. Tian et al. [33] design a fragment based index to assemble an approximate match. Shang et al. introduce an efficient algorithm for testing subgraph isomorphism [29]. Ferro et al. propose a novel indexing scheme, SING [26], based on locality information. All these methods are built strictly on graph structures and are not well suited to the approximate search shown in Figure 1.

There have been significant studies on inexact graph matching on attributed graphs [30, 7]. Tong et al. [35] propose best-effort pattern matching in large attributed graphs. It finds the best match based not on the proximity among the labels, but rather on the shape of the query graph. Tian et al. [34] proposed an approximate subgraph matching tool, called TALE, with efficient indexing and high pruning capabilities. Mongiovì et al. introduce a set-cover-based inexact graph matching technique, called SIGMA [24]. Both techniques only use edge misses to measure the quality of graph matching. Therefore, they are not appropriate for the proximity based search scenario studied in this work. There has been some recent work on inexact graph matching, i.e., simulation based cubic time graph pattern matching [13], homomorphism based subgraph matching [14], belief propagation based net alignment [3], an edge-edit-distance based subgraph indexing technique [39] and a graph partition based subgraph identification scheme [6].

9. CONCLUSIONS

In this paper, we defined a new graph similarity measure, neighborhood based graph similarity, and proposed an information propagation model to convert a large network into a set of multidimensional vectors, where sophisticated indexing and similarity search algorithms are available. We proved, under this measure, that subgraph similarity search is NP hard, while graph similarity match is polynomial. We introduced a criterion to select the best propagation rate with respect to different node labels in a graph. We further investigated techniques to index the neighborhood vectors and to compress them by deleting non-discriminative labels, thus optimizing the query processing time. The proposed method, called Ness, is not only efficient, but also robust against structure changes and information loss. Empirical results show that it can quickly and accurately find high-quality matches in large networks, with negligible time cost. In future work, it will be interesting to consider the graph alignment problem when the node labels in two graphs are not exactly identical, e.g., the same user can have slightly different usernames in Facebook and Twitter.

10. ACKNOWLEDGMENTS

This research was sponsored in part by the U.S. National Science Foundation under grant IIS and by the Army Research Laboratory under cooperative agreement W9NF (NS-CTA). X. Yan was supported in part by the Open Project Program of the State Key Lab of CAD&CG (Grant No. A00), Zhejiang University. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice herein.

11. REFERENCES

[1] D. Ajwani, U. Meyer, and V. Osipov. Improved external memory bfs implementation. In ALENEX.
[2] L. Arge, G. S. Brodal, and L. Toma. On external-memory mst, sssp and multi-way planar graph separation. In Workshop on Algorithmic Theory, Vol. 85 of LNCS. Springer.
[3] M. Bayati, M. Gerritsen, D. F. Gleich, A. Saberi, and Y. Wang. Algorithms for large, sparse network alignment problems. ICDM, 0:705 70.
[4] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In WWW.
[5] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In ICDM, pages 74 8.
[6] M. Brocheler, A. Pugliese, and V. S. Subrahmanian. Cosi: Cloud oriented subgraph identification in massive social networks. ASONAM, 0: , 200.
[7] S. Chaudhury, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD.
[8] J. Cheng, Y. Ke, W. Ng, and A. Lu. FG-Index: Towards verification-free query processing on graph databases. In SIGMOD.
[9] C. Chung, J. Min, and K. Shim. APEX: An adaptive path index for xml data. In SIGMOD, pages 2 32.
[10] S. Cook. The complexity of theorem-proving procedures. In STOC, pages 5 58, 97.
[11] J. Edmonds and R. M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 9(2): , 972.
[12] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS, pages 02 3, 200.
[13] W. Fan, J. Li, S. Ma, N. Tang, Y. Wu, and Y. Wu. Graph pattern matching: From intractable to polynomial time. PVLDB, 3(): , 200.
[14] W. Fan, J. Li, S. Ma, H. Wang, and Y. Wu. Graph homomorphism revisited for graph matching. PVLDB, 3():6 72, 200.
[15] Freebase.
[16] T. Gärtner, P. A. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In COLT and the 7th Kernel Workshop.
[17] H. He and A. Singh. Closure-tree: An index structure for graph queries. In ICDE, page 38.
[18] H. Kashima and A. Inokuchi. Kernels for graph classification. ICDM Workshop on Active Mining.
[19] T. Horváth, T. Gärtner, and S. Wrobel. Cyclic pattern kernels for predictive graph mining. In KDD, pages 58 67.
[20] J. J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. In First Int. Workshop on Mining Graphs, Trees and Sequences, pages 65 74.
[21] B. P. Kelley, B. Yuan, F. Lewitter, R. Sharan, B. R. Stockwell, and T. Ideker. Pathblast: a tool for alignment of protein interaction networks. Nucleic Acids Res, 32:83 88.
[22] A. Khan, X. Yan, and K.-L. Wu. Towards proximity pattern mining in large graphs. In SIGMOD, 200.
[23] Z. Liang, M. Xu, M. Teng, and L. Niu. Netalign: a web-based tool for comparison of protein interaction networks. Bioinformatics, 22(7).
[24] M. Mongiovì, R. D. Natale, R. Giugno, A. Pulvirenti, A. Ferro, and R. Sharan. Sigma: a set-cover-based inexact graph matching algorithm. J. Bioinformatics and Computational Biology, 8(2):99 28, 200.
[25] N. N. Shervashidze and K. M. Borgwardt. Fast subtree kernels on graphs. Curran, 200.
[26] R. D. Natale, A. Ferro, R. Giugno, M. Mongiovì, A. Pulvirenti, and D. Shasha. Sing: Subgraph search in non-homogeneous graphs. BMC Bioinformatics, :96, 200.
[27] E. Prudhommeaux and A. Seaborne. Sparql query language for rdf. Technical report, W3C.
[28] C. Qun, A. Lim, and K. Ong. D(k)-index: An adaptive structural summary for graph-structured data. In SIGMOD, pages 34 44.
[29] H. Shang, Y. Zhang, X. Lin, and J. Yu. Taming verification hardness: An efficient algorithm for testing subgraph isomorphism. In VLDB.
[30] L. Shapiro and R. Haralick. Structural descriptions and inexact matching. IEEE Trans. on Pattern Analysis and Machine Intelligence, 3:504 59, 98.
[31] D. Shasha, J. T.-L. Wang, and R. Giugno. Algorithmics and applications of tree and graph searching. In PODS, pages 39 52.
[32] R. Singh, J. Xu, and B. Berger. Global alignment of multiple protein interaction networks with application to functional orthology detection. PNAS, 05(35).
[33] Y. Tian, R. McEachin, C. Santos, D. States, and J. Patel. SAGA: a subgraph matching tool for biological graphs. Bioinformatics, 23(2).
[34] Y. Tian and J. M. Patel. Tale: A tool for approximate large graph matching. In ICDE.
[35] H. Tong, C. Faloutsos, B. Gallagher, and T. Eliassi-Rad. Fast best-effort pattern matching in large attributed graphs. In KDD.
[36] D. J. Watts, P. S. Dodds, and M. E. J. Newman. Identity and search in social networks. Science, 296.
[37] X. Yan, P. S. Yu, and J. Han. Graph indexing: A frequent structure-based approach. In SIGMOD.
[38] X. Yan, P. S. Yu, and J. Han. Substructure similarity search in graph databases. In SIGMOD.
[39] S. Zhang, J. Yang, and W. Jin. Sapper: Subgraph indexing and approximate matching in large graphs. PVLDB, 3():85 94, 200.
[40] P. Zhao, J. Yu, and P. Yu. Graph indexing: tree + delta >= graph. In VLDB.
[41] L. Zou, L. Chen, J. Yu, and Y. Lu. A novel spectral coding in a large graph database. In EDBT, pages 8 92, 2008.