Measuring Similarity between Graphs Based on the Levenshtein Distance

Appl. Math. Inf. Sci. 7, No. 1L, 169-175 (2013)
Applied Mathematics & Information Sciences, An International Journal

Bin Cao, Ying Li and Jianwei Yin
College of Computer Science and Technology, Zhejiang University, Hangzhou, China 310027

Received: 0 Oct. 2012, Revised: 9 Nov. 2012, Accepted: 11 Dec. 2012, Published online: 1 Feb. 2013

Abstract: Graph data has been commonly used and widely researched in both academia and industry for many applications. Measuring similarity between graphs (i.e., graph matching) is the essential step for graph searching, pattern recognition and machine vision. At present, the most widely used approach to the graph matching problem is graph edit distance (GED). However, the computational complexity of GED is high, and it takes unacceptable time when the graphs become larger. Generally, a graph can be canonically labeled by some sort of string, and we use the depth-first search (DFS) code as our canonical labeling system. Based on DFS codes, combined with the Levenshtein distance (i.e., string edit distance, SED), we propose a novel method for similarity measurement of graphs. By processing and calculating the distance between two DFS codes, we turn the graph matching problem into string matching, which greatly improves the matching performance. The experimental results prove its usefulness.

Keywords: Graph matching, similarity, depth-first search (DFS), Levenshtein distance

1. Introduction

As one of the most powerful structures, graphs can contain richer information than other data structures, and they have been widely investigated and applied in a broad range of areas. In particular, graphs which are labeled and/or attributed can be used to abstract and model many complicated relations among data. When graphs are used for representation, vertices usually represent regions (or features) of the objects, and edges between them represent the relations between regions. For example, the World Wide Web (WWW) can be viewed as a graph in which vertices correspond to static pages and edges correspond to links between pages [1].
In business processes, labeled graphs are commonly used to model real business operations, and the business activities are represented by the vertices of the graphs. Since many problems can be solved more easily on graphs, people have collected vast amounts of graph data and established graph databases for different purposes. Meanwhile, the academic community has paid a lot of attention to graph-related research. Among these topics, measuring the similarity between graphs is one of the hottest, and it is the foundation for much other research and many applications. For example, to support scalable graph search over large graph databases in bioinformatics [2], chemical informatics [3], and even business process management [4], it is essential to match the graphs by measuring their similarities.

Up to now, the most widely accepted method for graph similarity measurement is graph edit distance (GED) [5]. The basic idea of GED is to sum the cost of elementary error-correcting operations: node substitution, node insertion/deletion and edge insertion/deletion. The minimal cost taken over all operation sequences is the edit distance between two graphs. Based on GED, a number of approaches have been proposed [6-9]. Unfortunately, the GED problem is NP-hard in general, and its main drawback is the exponential computational complexity in terms of the number of graph vertices [8]. Thus, Zeng et al. [8] introduce the notion of a so-called star representation for graph structures and propose three novel methods to obtain lower and upper bounds of GED in polynomial time. However, the computational complexity of their lower bound is $O(n^3)$, which is still expensive for computations involving a large number of graphs. Yan et al. [9] propose a feature-based method for similarity search in graph structures. They use indexed features in the graph database to filter graphs without performing pairwise similarity computation. But they still turn to GED for measuring similarity when graph matching is needed.

In order to improve the efficiency of graph matching, in this paper we propose a novel method for measuring the similarity between two graphs. The starting point of our method is the depth-first search code (DFS code) [10], and instead of GED measurements we use the Levenshtein distance [11] (i.e., string edit distance, SED) to measure the similarity between two graphs. The computation of SED takes $O(n^2)$ time, which makes our method applicable in practice.

The rest of this paper is organized as follows. Section 2 formally presents some basic definitions for an accurate description of graphs, DFS codes, SED, etc. The implementation details are presented in Section 3. The experimental evaluation is studied in Section 4. Section 5 concludes the paper and presents some future work.

Corresponding author e-mail: cnliying@zju.edu.cn

2. Basic Definitions

In our paper, we consider graphs with labeled nodes and edges, and we present some basic definitions as follows.

Definition 1 (Labeled Graph). A labeled graph is a tuple $G = (V, E, L_V, L_E, l)$, where $V$ is a finite set of vertices and $E \subseteq V \times V$ is a set of edges. $L_V$ and $L_E$ denote the finite sets of vertex and edge labels, and $l$ is the mapping function that assigns labels.

From Definition 1, since an edge is denoted by two nodes, the graph is directed if there is an order between these two nodes, and undirected otherwise. Besides, if $V(G_1) = V(G_2)$ and $E(G_1) = E(G_2)$, we consider the graphs $G_1$ and $G_2$ to be the same. And $G_1$ is isomorphic to $G_2$ (i.e., $G_1 \cong G_2$) if they share the same structure.

Definition 2 (Graph Isomorphism). Let $G$ and $G'$ be two graphs. A graph isomorphism between $G$ and $G'$ is a bijective mapping $f: V(G) \to V(G')$ such that:
(1) $\forall u \in V$: $l(u) = l'(f(u))$;
(2) $\forall u, v \in V$: $(u, v) \in E \Leftrightarrow (f(u), f(v)) \in E'$;
(3) $\forall (u, v) \in E$: $l(u, v) = l'(f(u), f(v))$.

As shown in Figure 1, $G_1$, $G_2$ and $G_3$ have the same number of nodes and edges. Besides, each edge in $G_1$, $G_2$ and $G_3$ is the same, since the corresponding start and end nodes are the same.
By replacing the nodes in $G_1$, $G_1$ can be redrawn as one of the other graphs. Clearly, the difference among $G_1$, $G_2$ and $G_3$ is the way of labeling and drawing the graphs. That is to say, they have the same structure and are merely different forms of one and the same graph. From Figure 1 we can read off correspondences between the nodes of different drawings, for example a node $x_i$ in one graph corresponding to the node $y_1$ in another. Since the mapping between the nodes of two of these graphs is not unique, other mappings exist as well, and other drawings of the graph can be found.

Figure 1: Three isomorphic graphs ($G_1$, $G_2$, $G_3$)

In other words, if the topology of two graphs is the same, then these two graphs are isomorphic. Isomorphism is very common in graph data. Under some circumstances, such as frequent subgraph mining, isomorphic graphs should be pruned for reasons of efficiency. As for our work, since isomorphic graphs can be viewed as the same graph, we need not match all the graphs one by one. Instead, choosing one of the identical graphs to measure is reasonable. Thus, before measuring the similarity, we have to examine graph isomorphism.

In order to solve the isomorphism problem, we need to calculate the canonical labels of two graphs. The canonical label of a graph $G$ (denoted as $cl(G)$) is a unique code, which is a sequence of bytes, characters or numbers. It is independent of the order of the vertices and edges of $G$ and depends only on the topology of $G$. If the canonical labels of two graphs are the same, then these graphs are isomorphic to each other. A few canonical labeling methods have been applied, for example concatenating the rows or columns of the adjacency matrix of a graph. In our work, we introduce the depth-first search code (DFS code), which was first mentioned in the gSpan [10] algorithm, as the foundation of our canonical labeling system. Next, we present the necessary information about DFS codes; for more details, refer to gSpan [10]. Depth-first search is well known and widely applied in graph algorithms, and it consumes less memory than breadth-first search (BFS).
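The adjacency-matrix labeling mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of the paper: the function name `canonical_label`, the graph encoding (a label dictionary plus a set of `frozenset` edges) and the three toy graphs are hypothetical, edge labels are ignored, and the brute-force search over all vertex orderings is only feasible for very small graphs.

```python
from itertools import permutations


def canonical_label(labels, edges):
    """Return the smallest (labels, adjacency rows) encoding over all vertex orders.

    labels: dict mapping vertex id -> label string (hypothetical encoding).
    edges:  set of frozenset({u, v}) pairs of an undirected graph.
    Brute force over n! orderings, so this is only meant for tiny graphs.
    """
    best = None
    for order in permutations(labels):
        label_part = tuple(labels[v] for v in order)
        # Concatenate the rows of the adjacency matrix for this vertex order.
        matrix_part = tuple(
            1 if frozenset((u, v)) in edges else 0
            for u in order
            for v in order
        )
        code = (label_part, matrix_part)
        if best is None or code < best:
            best = code
    return best


# Two drawings of the same path X - Y - Z, using different vertex ids.
g1 = ({1: "X", 2: "Y", 3: "Z"}, {frozenset((1, 2)), frozenset((2, 3))})
g2 = ({"p": "Z", "q": "Y", "r": "X"},
      {frozenset(("p", "q")), frozenset(("q", "r"))})
# A different graph: here X sits in the middle of the path.
g3 = ({1: "X", 2: "Y", 3: "Z"}, {frozenset((1, 2)), frozenset((1, 3))})

same = canonical_label(*g1) == canonical_label(*g2)
different = canonical_label(*g1) != canonical_label(*g3)
```

Because the minimum is taken over every ordering, the result does not depend on how the vertices happen to be named, which is exactly the property a canonical label needs. The factorial cost is the reason a DFS-based code is attractive in practice.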
When performing depth-first search in a graph, a DFS tree is constructed. One graph may have various DFS trees, determined by the visiting order of its vertices. Thus, we cannot examine the isomorphism of two graphs by their DFS sequences directly. Adopting the DFS lexicographic order and the minimum DFS code as the canonical label solves this problem. First of all, we present the definition of DFS subscripting.

Definition 3 (DFS Subscripting). When building a DFS tree $T$, the depth-first discovery of the vertices forms a linear order. Subscripts are used to record this order, where $i < j$ means that $v_i$ is visited before $v_j$ when the DFS is performed. $G_T$ represents graph $G$ subscripted with $T$, and $T$ is called a DFS subscripting of $G$.

We call $v_0$, the starting vertex in $T$, the root. The vertex $v_n$ which is visited last is called the rightmost vertex. The straight path from $v_0$ to $v_n$ is called the rightmost path.

As shown in Figure 2, the vertex labels are X, Y and Z, while the edge labels are a and b. The darkened edges in Figure 2(b) to Figure 2(d) represent three different DFS trees for the graph of Figure 2(a), and they generate three different subscriptings. The rightmost path for Figure 2(b) is $(v_0, v_1, v_3)$, and $(v_0, v_1, v_2, v_3)$ is the rightmost path for Figure 2(c) and Figure 2(d).

Figure 2: The sample of DFS subscripting, panels (a) to (d)

Definition 4 (Rightmost Extension). Given a graph $G$ and its DFS tree $T$, we have:
Backward extension: a new edge can be added between the rightmost vertex and another vertex on the rightmost path.
Forward extension: a new vertex can be introduced and connected to a vertex on the rightmost path.
Since both kinds of extension take place on the rightmost path, we call them rightmost extensions.

Taking Figure 2(c) as an example, since edges already exist between $v_1$, $v_2$ and $v_3$, the only backward extension candidate is $(v_3, v_0)$, and the forward extension candidates are edges extending from $v_0$, $v_1$, $v_2$ or $v_3$ with a new vertex introduced. All potential rightmost extensions of Figure 2(c) are shown in Figure 3, where the dashed lines represent the extensions. Among them, Figure 3(a) and (b) both extend from the rightmost vertex (i.e., $v_3$), while Figure 3(c) to (e) extend from other vertices on the rightmost path. In any case, a backward extension can only occur on the rightmost vertex, and a forward extension takes place on a vertex which belongs to the rightmost path.

Figure 3: The rightmost extensions for Figure 2(c), panels (a) to (e)

As mentioned before, one graph may have more than one DFS tree/subscripting. In order to avoid the extension of the same graphs (i.e., isomorphic graphs), we have to choose one base subscripting and conduct rightmost extension on it.

Definition 5 (DFS Code). Given a DFS tree $T$ for a graph $G$, based on rightmost extension, the subscripted graph $G_T$ can be transformed into an edge sequence $e_i$ ($i = 0, \ldots, |E| - 1$). This edge sequence is called a DFS code, denoted as $DFSCode(G, T)$.

Based on Definition 5, there is a bijective mapping between a subscripted graph and its DFS code. Besides, since there are various edge sequences for a given graph $G$, we can build an order between these sequences and select the subscripting which generates the minimum sequence as the base subscripting of $G$. This order can also be applied to DFS codes, and we present it as follows.

Definition 6 (DFS Lexicographic Order). Let an edge $e$ be a 5-tuple $(i, j, l_i, l_{(i,j)}, l_j)$, where $l_i$ and $l_j$ are the labels of $v_i$ and $v_j$, respectively, and $l_{(i,j)}$ is the label of the edge connecting them. Given a vertex $v$, the edge order is as follows: all of its backward edges should appear just before its forward edges. If $v$ does not have any forward edge, we put its backward edges after the forward edge in which $v$ is the second vertex. Let the edge order take the first priority, the vertex label $l_i$ the second priority, the edge label $l_{(i,j)}$ the third and the vertex label $l_j$ the fourth to determine the order of two edges. The ordering based on the above rules is called the DFS lexicographic order.

From Definition 6 it follows that the complete edge sequence for Figure 2(c) is (0,1), (1,2), (2,3), (3,1). The DFS codes for Figure 2(b) to (d) are shown in Table 1. We can see from Table 1 that the first edges $e_0$ of the three DFS codes all carry the subscript (0,1). Since they have the same subscript and no edge order exists between them, we cannot use it to tell them apart. However, using the remaining priorities, which are based on the label information, we obtain an order among the three first edges. Therefore, $c_b < c_c < c_d$ is the order of the DFS codes listed in Table 1.

Table 1: DFS codes for Figure 2(b) to (d)

edge    c_b         c_c         c_d
e_0     (0,1,,,)    (0,1,,,)    (0,1,,,)
e_1     (1,2,,,)    (1,2,,,)    (1,2,,,)
e_2     (1,3,,,)    (2,3,,,)    (2,0,,,)
e_3     (3,0,,,)    (3,1,,,)    (2,3,,,)
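To make Definitions 3 to 6 concrete, the following Python sketch enumerates every DFS code of a small connected, undirected labeled graph and takes the minimum. It rests on several assumptions that are not from the paper: the helper names `build` and `dfs_codes`, the adjacency-dictionary encoding and the example graphs are hypothetical, and, most importantly, codes are compared with Python's plain tuple order rather than the DFS lexicographic order of Definition 6. The minimum found this way is still the same for isomorphic graphs, but it need not coincide with the minimum DFS code used by gSpan.

```python
def build(labels, edge_list):
    """Return (labels, adjacency) for an undirected labeled graph."""
    adj = {v: {} for v in labels}
    for u, v, edge_label in edge_list:
        adj[u][v] = edge_label
        adj[v][u] = edge_label
    return labels, adj


def dfs_codes(labels, adj):
    """Yield every DFS code (tuple of 5-tuples) of a connected graph."""
    n = len(labels)

    def extend(stack, index, code):
        if len(index) == n:
            yield tuple(code)
            return
        # Backtrack along the rightmost path until a vertex still has
        # an undiscovered neighbour.
        while all(w in index for w in adj[stack[-1]]):
            stack = stack[:-1]
        u = stack[-1]
        for w in adj[u]:
            if w in index:
                continue
            new_index = {**index, w: len(index)}
            # Forward edge that discovers w.
            forward = (index[u], new_index[w], labels[u], adj[u][w], labels[w])
            # Backward edges from w to earlier vertices other than its parent,
            # emitted right after the forward edge, smallest subscript first.
            targets = sorted(
                (x for x in adj[w] if x in index and x != u), key=index.get
            )
            backward = [
                (new_index[w], index[x], labels[w], adj[w][x], labels[x])
                for x in targets
            ]
            yield from extend(stack + [w], new_index, code + [forward] + backward)

    for root in labels:
        yield from extend([root], {root: 0}, [])


# Hypothetical example: a triangle X-Y-X with a pendant Z vertex.
g1 = build({1: "X", 2: "Y", 3: "X", 4: "Z"},
           [(1, 2, "a"), (2, 3, "b"), (3, 1, "a"), (2, 4, "b")])
# The same graph with renamed vertices.
g2 = build({"a": "Z", "b": "X", "c": "Y", "d": "X"},
           [("d", "c", "a"), ("c", "b", "b"), ("b", "d", "a"), ("c", "a", "b")])
# One edge label changed, so not isomorphic to g1.
g3 = build({1: "X", 2: "Y", 3: "X", 4: "Z"},
           [(1, 2, "a"), (2, 3, "b"), (3, 1, "a"), (2, 4, "a")])

min1, min2, min3 = (min(dfs_codes(*g)) for g in (g1, g2, g3))
```

Each yielded code lists one 5-tuple per edge, with the backward edges of a vertex placed directly after the forward edge that discovers it. The enumeration is exponential in the worst case, so it serves only to illustrate the definitions on toy inputs.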

Definition 7 (Minimum DFS Code). Given a graph $G$, let $C(G) = \{ DFSCode(G, T) \mid T \text{ is a DFS tree for } G \}$. Based on the DFS lexicographic order, the minimum element of $C(G)$ is called the minimum DFS code, denoted as $minDFSCode(G)$.

According to Definition 7, the minimum DFS code of Figure 2(a) is $c_b$, shown in Table 1. What is more, we can infer the following important relationship between minimum DFS codes and isomorphic graphs.

Property 1. Given two graphs $G$ and $G'$, we have:
$G \cong G' \Leftrightarrow minDFSCode(G) = minDFSCode(G')$

Proof. ($\Rightarrow$) Since $G$ is isomorphic to $G'$, according to Definition 2 the vertices of $G$ and $G'$ are mapped one-to-one under some function $f: V(G) \to V(G')$. Thus, based on the induced mapping between $E(G)$ and $E(G')$ and on Definition 5, every $DFSCode(G, T)$ corresponds to an identical $DFSCode(G', T')$ and vice versa. Naturally, we have $minDFSCode(G) = minDFSCode(G')$. ($\Leftarrow$) The proof is similar to ($\Rightarrow$).

On the basis of the above discussion, we can use the minimum DFS code as the canonical label of a graph. At the end of this section, we present the definition of the Levenshtein distance (i.e., string edit distance, SED).

Definition 8 (String Edit Distance, SED). Given two strings $x$ and $y$, the string edit distance of $x$ and $y$, denoted as $SED(x, y)$, is the minimum number of insertions, deletions and substitutions needed to transform $x$ into $y$.

Since the canonical label of a graph is its minimum DFS code, which can be viewed as a string, we can measure the similarity between two graphs by conducting the string edit distance (SED) calculation on their minimum DFS codes. Thus, we turn the graph matching problem into a string matching problem, which is much easier to solve. What is more, the string edit distance guarantees the efficiency of our work. Based on the definitions introduced in this section, we present the implementation details in the following section.

3. Implementation

In this section, we discuss the implementation details of measuring similarity between graphs based on the DFS code introduced in Section 2. Note that we call the graph which is matched against the graph database the source graph.
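Definition 8 can be computed with the standard dynamic program for the Levenshtein distance. The sketch below is a generic textbook version in Python, not the authors' Java implementation; the function name `sed` is chosen here for illustration. It takes $O(|x| \cdot |y|)$ time and keeps only two rows of the table in memory.

```python
def sed(x, y):
    """String edit distance (Levenshtein distance) between sequences x and y."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


distance = sed("kitten", "sitting")  # classic example with distance 3
```

Because the function only compares elements for equality, it works on any sequence type, so it could also be applied to a list of edge tuples instead of a character string.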
According to the real requirements of different application scenarios, we divide our implementation into two main phases: preprocessing and matching. The preprocessing phase is performed offline, while matching is performed online. Usually, people pay more attention to the matching phase, since its performance has a direct influence on the user experience. Firstly, we present the pseudo code of the preprocessing phase in Algorithm 1.

Algorithm 1 The algorithm for the preprocessing phase
Input: Graph database (GD)
Output: The minimum DFS codes for GD: $M_1$; the orders of node labels and edge labels: $N$, $E$
 1: Initialize two map structures: $M_1$ and $M_2$
 2: $N \leftarrow$ get order of node labels in GD
 3: $E \leftarrow$ get order of edge labels in GD
 4: for each graph $G$ in GD do
 5:     ID $\leftarrow$ get the ID of $G$
 6:     code $\leftarrow$ get $minDFSCode(G)$ by $N$ and $E$
 7:     $M_1$.put(ID, code)
 8: end for
 9: for each record $r$ in $M_1$ do
10:     add ID of $r$ to the graph ID set: ID_Set
11:     for each record $r'$ in $M_1 - \{(ID, code)\}$ do
12:         if the code in $r'$ is the same as that in $r$ then
13:             add ID of $r'$ to the ID_Set
14:             $M_2$.put(code, ID_Set)
15:         end if
16:     end for
17: end for
18: for each record $r$ = (code, ID_Set) in $M_2$ do
19:     code $\leftarrow$ extract the two node labels of each edge, in their order of appearance, from the code
20: end for

As shown in Algorithm 1, the input of this phase is the graph database, which may contain a large number of graphs. The phase generates three outputs: the minimum DFS codes of all graphs in the graph database, the order of node labels and the order of edge labels. At first, we parse each graph in the graph database and produce the two orders of all node labels and edge labels existing in the graph database (line 2 and 3). These orders are used for constructing the DFS codes of the graphs in the following steps. By iterating over the graphs in the graph database (line 4-8), we get the ID and the minimum DFS code of each graph and put them into a map. Then we merge identical codes in the graph database and use an inverted key-value pair (i.e., the minimum DFS code is the key and the value is the set of graph IDs) to represent the DFS codes of the graph database (line 9-17).
Thus, we put identical or isomorphic graphs together and select only one of them for similarity measurement, which makes matching more efficient. In fact, before calculating the SED, we first simplify the minimum DFS code by extracting the labels of the two vertices of each edge in their order of appearance (line 18-20). For example, the minimum DFS code of Figure 2(a) is (0,1,,,)(1,2,,,)(1,3,,,)(3,0,,,), which is extracted to the string of its vertex labels.
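The merging and simplification steps of the preprocessing phase (lines 9-20 of Algorithm 1) might look as follows in Python. This is a sketch under assumptions rather than the authors' code: the names `group_by_code` and `simplify` and the sample codes are hypothetical, vertex labels are assumed to be single characters, and the grouping is done in one pass over a dictionary instead of the nested loops of the pseudo code.

```python
from collections import defaultdict


def group_by_code(codes_by_id):
    """Invert {graph id: code} into {code: [graph ids]}.

    Graphs that share a minimum DFS code are isomorphic, so they end up
    in the same group and only need to be matched once.
    """
    groups = defaultdict(list)
    for graph_id, code in codes_by_id.items():
        groups[code].append(graph_id)
    return dict(groups)


def simplify(code):
    """Keep only the two vertex labels of every edge, in order of appearance."""
    return "".join(li + lj for (_i, _j, li, _lij, lj) in code)


# Hypothetical codes; the label values are made up for illustration.
c1 = ((0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"))
c2 = ((0, 1, "X", "a", "Z"),)
inverted = group_by_code({1: c1, 2: c1, 3: c2})
simplified = {simplify(code): ids for code, ids in inverted.items()}
```

Using the code itself as the dictionary key relies on codes being hashable, which holds for tuples of tuples. The single pass is linear in the number of graphs, whereas the pairwise comparison in the pseudo code is quadratic.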

Besides, in order to correctly extract the DFS code of the source graph and to conduct the similarity measurement between graphs during online matching, we must guarantee the consistency of the orders of node labels and edge labels throughout the whole matching procedure. Therefore, we record these orders and output them for the matching phase.

Then, we present the online matching phase in Algorithm 2. Besides the results of the offline preprocessing phase, we add the source graph as another input. In this phase, we output the graph of the graph database which is most similar to the source graph.

Algorithm 2 The algorithm for the matching phase
Input: The minimum DFS codes for GD: $M_1$; the orders of node labels and edge labels: $N$, $E$; source graph ($G$)
Output: Graph $G'$ ($G' \in GD$) which is most similar to $G$
1: Initialize one map structure: $M$
2: code $\leftarrow$ get $minDFSCode(G)$ by $N$ and $E$
3: for each record (code', ID_Set) in $M_1$ do
4:     filter the minimum DFS codes code and code'
5:     distance $\leftarrow$ calculate SED(code, code')
6:     $M$.put(ID_Set, distance)
7: end for
8: Sort $M$ by distance values and return the graph IDs of the smallest record of $M$

From Algorithm 2 we can see that there are three steps in the matching phase. The first step is to get the minimum DFS code of the source graph with the help of the orders of node labels and edge labels generated in the previous phase (line 2). Since this step focuses on only one graph, its computation time is very short. Secondly, we conduct the SED calculation between the DFS code of the source graph and those of the graphs in the graph database, and put each set of graph IDs together with the calculated distance into a map (line 3-7). This step takes most of the time of the matching phase, since we have to match each graph in the graph database. The time complexity of this step is $O(m \cdot n^2)$, where $m$ is the number of graphs in the graph database and $n$ is the number of nodes of the largest graph. Generally, since $m$ is much larger than $n$, the computation time of this step is endurable and acceptable.
At last, we sort the map by the distance values and return the graph IDs with the smallest distance (line 8).

Notice that the SED of the minimum DFS codes of two graphs cannot directly determine their similarity or distance. Suppose that the business process shown in Figure 4(a) is the source graph, and that Figure 4(b) and (c) show two graphs in the graph database. Compared with the source graph in (a), the graphs in (b) and (c) each lack only one edge, and the remaining nodes and edges are the same. Thus, from the viewpoint of structure, they should have the same similarity to (a). But according to the minimum DFS codes shown in the figure, we have $SED(a, b) \neq SED(a, c)$.

Figure 4: The illustration for filtering the minimum DFS codes (graphs (a), (b) and (c) with their minimum DFS codes)

To solve this problem, we filter the minimum DFS codes (line 4) that were simplified in the preprocessing phase. Since each edge represented by the DFS code has been simplified to a 2-tuple $(l_i, l_j)$, the length of a simplified minimum DFS code is a multiple of 2. Based on these tuples, we compare the code of the source graph with the code of the graph in the database. Then we remove the edges (tuples) that are the same in both of them, and the SED calculation is conducted on the remaining codes. Note that if an edge becomes the same as an edge of the other graph after its two nodes are exchanged, these two 2-tuples are viewed as the same too and are also removed from the codes. As shown in Figure 4, to determine the similarity between (a) and (c), we therefore only need to calculate the SED of the two short strings that remain, because the other edges of (c) can be turned into edges of (a) by exchanging their nodes. Based on the above discussion, from Figure 4 we get $SED(a, b) = SED(a, c)$.

To conclude this section, we preprocess the graph database by extracting the minimum DFS codes of its graphs. Then we measure the similarity between graphs based on these DFS codes by means of the string edit distance technique. Furthermore, we implement a prototype based on the above details and evaluate its performance. The experimental results are presented in the following section.
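The filtering step and the matching loop of Algorithm 2 can be sketched in Python as follows. All names (`pairs`, `filter_codes`, `most_similar`) and the sample label strings are assumptions made for illustration; the sketch further assumes single-character vertex labels, treats the pair $(l_i, l_j)$ as equal to $(l_j, l_i)$ by sorting the two characters, and removes common pairs as a multiset, which is one plausible reading of the description above rather than the authors' confirmed procedure.

```python
from collections import Counter


def sed(x, y):
    """Levenshtein distance via the usual two-row dynamic program."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = cur
    return prev[-1]


def pairs(code):
    """Split a simplified code into its 2-character edge tuples."""
    return [code[k:k + 2] for k in range(0, len(code), 2)]


def normalise(pair):
    """Make (li, lj) and (lj, li) compare equal."""
    return "".join(sorted(pair))


def filter_codes(source, candidate):
    """Remove edge tuples that occur in both codes; return what is left."""
    a, b = pairs(source), pairs(candidate)
    common = Counter(map(normalise, a)) & Counter(map(normalise, b))

    def strip(edge_pairs):
        budget = Counter(common)  # how many copies of each pair may be dropped
        kept = []
        for p in edge_pairs:
            key = normalise(p)
            if budget[key] > 0:
                budget[key] -= 1
            else:
                kept.append(p)
        return "".join(kept)

    return strip(a), strip(b)


def most_similar(source, database):
    """database maps simplified code -> graph ids; return (ids, distance)."""
    scored = ((ids, sed(*filter_codes(source, code)))
              for code, ids in database.items())
    return min(scored, key=lambda item: item[1])


remaining = filter_codes("XYYZ", "YXZZ")
best_ids, best_distance = most_similar("XYYZ", {"XYYZ": [1, 2], "XYYZZX": [3]})
```

Taking the minimum directly avoids sorting the whole result map when only the closest graph is needed; a full sort, as in line 8 of Algorithm 2, would be the natural choice if the top few matches were required.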
4. Evaluation

As mentioned before, since end users are mainly concerned with online matching and the offline preprocessing contributes little to the efficiency of matching, we only study the performance of the matching phase in this section. In the following experiments, we compare our method (i.e., SED-based) with the traditional GED-based method; both are developed in Java (Jdk1.6). The GED-based method is implemented in a fast greedy way, and its time complexity is $O(n^3)$. The sorting algorithm for returning the smallest distance is bubble sort. All tests are done on a .6GHz Intel(R) Core(TM) Duo P800 PC with GB main memory, running Windows 7. The graph dataset we use is generated synthetically. There are 6 different vertex labels in total, and each graph has between 5 and 10 vertices. The model graph is randomly selected from this dataset. Then we match the model graph against all data graphs in the dataset.

First of all, we study the efficiency, which is measured by the time needed for matching. We fix the number of vertices in the model graph to 5 and observe the matching time of both methods under different sizes of the graph database, ranging from 2000 to 10000. For each graph database, we use different model graphs (with the same vertex number 5) as the test cases C1, C2 and C3. As shown in Figure 5, under different sizes of the graph database (i.e., Figure 5(a) and Figure 5(b)), the GED-based method costs much more time for matching than the SED-based method in all test cases. This is because the computation time of the GED-based method is $O(n^3)$, while it is $O(n^2)$ for the SED-based method. Apparently, the GED-based method becomes less applicable once the size of the graph database grows large.

Figure 5: Tests on database sizes of 2000 and 10000; panels (a) and (b) plot the matching time (s) of the GED-based and SED-based methods for the test cases C1, C2 and C3

Table 2 shows the average matching time for different graph database sizes. Clearly, as the database size increases, both methods need more time for matching.

Table 2: Matching time for different graph database sizes

Database size    GED-based (s)    SED-based (s)
2000             1.86             0.15
4000             .80              0.
6000             6.719            0.95
8000             8.18             1.8
10000            10.701           .61

In addition, we can see from Figure 6 that the ratio of the matching time of the GED-based method to that of the SED-based method decreases with increasing size of the graph database. Filtering the minimum DFS codes before matching is the cause of this trend.

Figure 6: Matching time ratio for the model graph with 5 vertices (matching time ratio against graph database size)

Figure 7: Matching time under different model graph sizes (matching time (s) against vertex number in the model graph, GED-based and SED-based)

Figure 8: Matching time ratio for different model graph sizes

In the tests of Figure 7 and Figure 8, we fix the size of the graph database to 6 and study the efficiency under different model graphs with vertex numbers ranging up to 10.
As shown in Figure 7, with our SED-based method the matching time is almost unchanged and stays at very small values of around 0.1 seconds. However, there is an apparent growth trend for the GED-based method, whose matching time grows fast as the vertex number increases. That is to say, the GED-based method is more sensitive to the size of the model graph than ours. The reason is that the GED-based implementation has to search for the best result in each step and grow from the current best result. This procedure involves many recursive searches and comparisons, which cost more than computing the distance between two strings. In contrast to the ratio trend shown in Figure 6, Figure 8 presents the opposite trend: the ratio of the matching time of the GED-based method to that of the SED-based method increases with the number of vertices. This is because in our SED-based method the number of graphs to be matched contributes more to the matching time than the number of vertices of the model graph.

As for the effectiveness study, based on observations of the matching results of the two methods, the first several results of both methods are almost the same. However, since our method and the GED-based method adopt different principles for similarity measurement (e.g., the costs of node/edge deletion, insertion and other operations), greater differences exist between the matching results of the two methods for larger result sets. Generally, the difference ratio could be around 0%. Nevertheless, since users are mainly concerned with the first several results, our method can satisfy their accuracy demands in general.

5. Conclusion

In this paper, we propose a novel approach for measuring similarity between graphs. Using a depth-first search (DFS) strategy, we traverse the graphs and label them canonically with the minimum DFS code. Then, after extracting and filtering these codes, we calculate the string edit distance (SED) between the source graph and the graphs in the database. Compared with the traditional GED-based method, our approach is more efficient, and the experimental evaluation proves its utility in real applications.

There is still some work that needs to be done in the future. For example, the accuracy of measuring similarity between graphs through calculating the SED of DFS codes should be studied both theoretically and experimentally. Besides, the relation between our proposed method and GED needs to be determined. At last, we are going to apply our method to real graph datasets and improve its performance to make it more applicable and practical.

Acknowledgement

This research was partially supported by the following foundations: National Science and Technology Supporting Program of China (No.01BAH06F0, No. 011BAD1B0), National Natural Science Foundation of China under Grant (No.61719), Research Fund for the Doctoral Program by Ministry of Education of China (No. 0110101110066).

References

[1] A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. L. Wiener, Graph structure in the Web. In Proceedings of Computer Networks, 309-320, (2000).
[2] Y. Tian, R. C. McEachin, C. Santos, D. J. States, and J. M. Patel. Saga: a subgraph matching tool for biological graphs. Bioinformatics, 23(2): 232-239, (2007).
[3] P. Willett, J. Barnard, and G. Downs. Chemical similarity searching. J. Chem. Inf. Comput. Sci, 38(6): 983-996, (1998).
[4] R. Dijkman, M. Dumas, and L. Garcia-Banuelos. Graph matching algorithms for business process model similarity search. In BPM, (2009).
[5] H. Bunke. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters, 18(8): 689-694, (1997).
[6] H. Bunke and K. Shearer. A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 19(3-4): 255-259, (1998).
[7] J. Raymond, E. Gardiner, and P. Willett. RASCAL: Calculation of Graph Similarity using Maximum Common Edge Subgraphs. The Computer Journal, 45(6): 631-644, (2002).
[8] Z. Zeng, A.K.H. Tung, J. Wang, J. Feng, and L. Zhou, Comparing Stars: On Approximating Graph Edit Distance. In Proceedings of PVLDB, 25-36, (2009).
[9] X. Yan, F. Zhu, P.S. Yu, and J. Han, Feature-based similarity search in graph structures. In Proceedings of ACM Trans. Database Syst, 1418-1453, (2006).
[10] X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of ICDM, 721-724, (2002).
[11] I. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals. Cybernetics and Control Theory, 10(8), 707-710, (1966).

Bin Cao is currently a Ph.D. candidate in the College of Computer Science, Zhejiang University (China). He received his B.S. from Zhejiang University of Technology, China in 2008 and has been in the successive postgraduate and doctoral program at Zhejiang University since 2009. His research interests include workflow management, event processing and spatial databases.

Ying Li is currently an associate professor in the College of Computer Science, Zhejiang University (China). He received his M.S. from Zhejiang University in 1997 and his Ph.D. in Computer Science from Zhejiang University in 2000. His research interests include software architecture, software automation, compiling technology and middleware technology.

Jianwei Yin is currently a professor in the College of Computer Science, Zhejiang University (China). He received his Ph.D. in Computer Science from Zhejiang University in 2001. He was a visiting scholar at the Georgia Institute of Technology, America in 2008. His research interests include distributed network middleware, software architecture and information integration.