Concept Formation Using Graph Grammars



Istvan Jonyer, Lawrence B. Holder and Diane J. Cook
Department of Computer Science and Engineering
University of Texas at Arlington
Box 19015 (416 Yates St.), Arlington, TX 76019-0015
E-mail: {jonyer, holder, cook}@cse.uta.edu
Phone: (817) 272-2596, Fax: (817) 272-3784

Abstract

Recognizing the expressive power of graph representations and the ability of certain graph grammars to generalize, we attempt to use graph grammar learning for concept formation. In this paper we describe our initial progress toward that goal, and focus on how certain graph grammars can be learned from examples. We also establish grounds for using graph grammars in machine learning tasks. Several examples are presented to highlight the validity of the approach.

Introduction

Graphs are important data structures because of their ability to represent any kind of data. Algorithms that generate theories of graphs are of great importance in data mining and machine learning. In this paper we describe an algorithm that learns graph grammars: sets of grammar production rules that describe a graph-based database. The goal of our research is to adapt graph grammar learning for concept formation, hoping that the expressive power of graphs and the ability of graph grammars to generalize will turn out to be a powerful learning paradigm. This paper presents initial progress toward that goal and sets the stage for subsequent work.

Only a few algorithms exist for the inference of graph grammars. An enumerative method for inferring a limited class of context-sensitive graph grammars is due to Bartsch-Spörl (1983). Other algorithms use a merging technique for hyperedge replacement grammars (Jeltsch and Kreowski 1991) and regular tree grammars (Carrasco et al. 1998). Our approach is based on a method for discovering frequent substructures in graphs (Cook and Holder 2000).

In the following section we discuss different types of graph grammars and argue how they can be useful in machine learning. We then describe the graph grammars we set out to learn and define some terminology. Next, we present a set of examples to provide some visual insight into graph grammars before we describe the algorithm. We then present a working example on an artificial domain to better illustrate the algorithm, discuss the types of grammars the algorithm can learn, and point out some of its limitations. We conclude with an overall assessment of the approach and give directions for future work.

Graph Grammars and Machine Learning

When learning a grammar, one first has to decide the intended use of the grammar. Grammars have two applications: to parse or to generate a language. Parser grammars are optimized for fast parsing, giving up a little accuracy, which results in grammars that over-accept; that is, they will accept sentences that are not in the language. Generator grammars trade accuracy for speed as well: they will not be able to generate the entire language. A grammar that can both generate and parse the same language exactly is very hard to design and is usually too big and slow to be practical.

In this paper we address the problem of inferring graph grammars from positive examples. Our purpose is to use grammar learning as an approach to data mining, but other uses can also be found. The generated graph grammar will be our theory of the input domain. In machine learning, algorithms generally attempt to learn theories that generalize to a certain degree, so that new, unseen data can be accurately categorized. Translated to grammar terms, we would like to learn a grammar that accepts more than just the training language. Therefore, we would like to learn parser grammars, which have the power to express more general concepts than the sum of the positive examples.

Grammars can be context-sensitive or context-free. Context-sensitive graph grammars are more expressive and allow the specification of graph transformations, since both sides of a production can be arbitrary graphs. To start with, however, we aimed at learning context-free grammars that have single-vertex non-terminals on the left side of production rules. This is not a serious limitation, especially since the vast majority of graph grammar parsers can only deal with exactly such grammars (Rekers and Schürr 1995).

So why learn graph grammars rather than textual ones? Textual grammars are also useful, but they are limited to databases that can be represented as a sequence. An example of such a database is a DNA sequence. Most databases, however, have a non-sequential structure, and many have significant structural components. Relational databases are good examples, but even more complex information, such as circuit diagrams and the world-wide web, can be represented using graphs. Graph grammars can still represent the simpler feature-vector-type databases as well as sequential databases (like the DNA sequences mentioned previously). Graphs are among the most expressive representations, so an algorithm that can learn a theory of a graph would be widely useful.

We have to emphasize that our purpose in learning graph grammars is not to provide an efficient graph parsing algorithm. Graph parsing will be necessary for classifying unseen example graphs, and while the parsing efficiency of the graph grammar is a concern, it is not a primary goal of the generalization step.

Graph Grammars

Before we get into the details of inferring graph grammars, we first give a general overview of the type of grammar we seek to learn. In this paper, and in our research in general, we are concerned with graph grammars of the set-theoretic approach, or expression approach (Nagl 1987). In this approach a graph is a pair of sets G = (V, E), where V is the set of vertices (nodes) and E ⊆ V × V is the set of edges. Production rules are of the form S → P, where S and P are graphs. When such a rule is applied to a graph, an isomorphic copy of S is removed from the graph along with all its incident edges, and is replaced with a copy of P, together with edges connecting it to the graph. The new edges are given new labels to reflect their connection to the substructure instance.

A special case of the set-theoretic approach is the node-label controlled grammar, in which S consists of a single labeled node (Engelfriet and Rozenberg 1991). This is the type of grammar we focus on. In our case, S is always a non-terminal, but P can be any graph and can contain both terminals and non-terminals. Since we are going to learn grammars to be used for parsing, the embedding function is irrelevant: external edges that are incident on a vertex in the subgraph being replaced (P) always get reconnected to the single vertex S.

Recursive productions are of the form S → P S. The non-terminal S is on both sides of the production, and P is linked to S via a single edge. The complexity of the algorithm is exponential in the number of edges considered between recursive instances, so we limit the algorithm to one edge for now. If the grammar is used for graph generation, this rule will generate an infinitely long sequence of the graph P. If the language is to be finite, a stopping alternative production is required. One such production is S → P S | ε, which reads "replace S with P S or with nothing." For our purposes, however, we use the production S → P S | P. The rule S → P S | ε, when used for parsing, would imply that nothing can be replaced with S, introducing an arbitrary number of S's. At the same time, it cannot parse a chain of P's of finite length, as it would have no starting point, since P S does not exist in the input graph. Remember that the stopping alternative of a graph generator rule is the starting point of a parser rule. When parsing a graph, we start from the complete graph and work towards a single non-terminal. This is done by removing subgraphs that match the right side of a production and inserting the non-terminal on the left side: in our example, replace P S with S, and finally P with S. An example of a recursive production is shown in Figure 1c (S1).
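To make the set-theoretic representation and the parsing direction of a node-label controlled production concrete, here is a minimal Python sketch. The Graph class, the parse_step function and the integer vertex ids are our own illustrative choices, not part of SubdueGL; the sketch simply removes one instance of the right-hand side P and reconnects its external edges to a single vertex labeled with the non-terminal S, as described above.

    # Minimal sketch (our representation, not SubdueGL's internal one).
    class Graph:
        def __init__(self):
            self.labels = {}    # vertex id (int) -> label
            self.edges = set()  # (src id, edge label, dst id)

        def add_vertex(self, vid, label):
            self.labels[vid] = label

        def add_edge(self, src, label, dst):
            self.edges.add((src, label, dst))

    def parse_step(graph, instance_vertices, nonterminal):
        """Replace the vertices of one instance of P with a single vertex
        labeled with the non-terminal S; internal edges of P disappear and
        external edges are reconnected to the new vertex."""
        new_vid = max(graph.labels) + 1
        graph.add_vertex(new_vid, nonterminal)
        rewired = set()
        for (src, lab, dst) in graph.edges:
            s_in, d_in = src in instance_vertices, dst in instance_vertices
            if s_in and d_in:
                continue                        # internal edge of P: removed
            if s_in:
                rewired.add((new_vid, lab, dst))  # external edge reconnected to S
            elif d_in:
                rewired.add((src, lab, new_vid))
            else:
                rewired.add((src, lab, dst))
        graph.edges = rewired
        for v in instance_vertices:
            del graph.labels[v]
        return new_vid

    g = Graph()
    for i, ch in enumerate("abcd"):        # chain a -> b -> c -> d
        g.add_vertex(i, ch)
    for i in range(3):
        g.add_edge(i, "next", i + 1)
    parse_step(g, {2, 3}, "S")             # replace the c-d subgraph with S
    print(g.labels, g.edges)               # {0: 'a', 1: 'b', 4: 'S'}, edge from b to S

Applying such parse steps repeatedly until only a single non-terminal remains is exactly the parsing direction discussed above.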

Alternative productions are of the form S → P1 | P2 | ... | Pn. The non-terminal graph S can be thought of as a variable having possible values P1, P2, ..., Pn. We will sometimes refer to such an S as a variable non-terminal, or simply a variable. If S is a single vertex and the Pi are also single vertices, then S is synonymous with a regular, non-graph variable: its values are the vertex labels, which can be alphanumeric values such as numbers (discrete or continuous) or string descriptions. An example of a variable is shown in Figure 1c, where S2 has the possible values c and d.

Examples

Before presenting the algorithm, a couple of examples are given here to further clarify what we are trying to accomplish. The first example is suggested by the authors of Sequitur (Nevill-Manning and Witten 1997). Sequitur infers compositional hierarchies from strings: it detects repetition and factors it out by forming rules in a grammar. The example string to be analyzed is abcabdabcabd. The grammar generated by Sequitur is shown in Figure 1a. (Non-terminals are in italic bold font.) Our algorithm, called SubdueGL, learns graph grammars; therefore, the input has to be in graph format. This sequential data was represented by a series of vertices, labeled according to the example string and connected by single edges, as shown in Figure 1b.
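As a rough sketch of this encoding, the snippet below turns the example string into a chain of labeled vertices joined by single directed edges. The edge label "next" and the dictionary and list layout are our assumptions; the paper does not spell out the labels used in Figure 1b.

    # Hedged sketch: encode the example string as a chain graph, the input
    # form SubdueGL expects. The edge label "next" is our placeholder.
    def string_to_chain_graph(text, edge_label="next"):
        vertices = {i: ch for i, ch in enumerate(text)}              # id -> label
        edges = [(i, edge_label, i + 1) for i in range(len(text) - 1)]
        return vertices, edges

    vertices, edges = string_to_chain_graph("abcabdabcabd")
    print(vertices)   # {0: 'a', 1: 'b', 2: 'c', 3: 'a', ...}
    print(edges)      # [(0, 'next', 1), (1, 'next', 2), ...]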

The graph grammar learned by SubdueGL is shown in Figure 1c, while its sequential interpretation is shown in Figure 1d. The first obvious difference is that SubdueGL is able to learn recursive grammars. SubdueGL's version of the grammar is also more general, since it would parse a string of any length, and the letters c and d do not have to follow in the same order. This example will be referenced in the next section, where we describe the algorithm.

Figure 1. First example: a) grammar learned by Sequitur; b) input graph to SubdueGL; c) graph grammar learned by SubdueGL; d) equivalent string grammar.

The next example is a variation of the previous one, with an x slightly breaking the regularity in the pattern: abcabdxabcabd. The grammar learned by Sequitur is shown in Figure 2a and is very similar to the previous one in Figure 1a. SubdueGL, however, added an extra production to its grammar, resulting in the grammar shown in Figure 2b.

Figure 2. Grammars learned by a) Sequitur and b) SubdueGL from abcabdxabcabd.
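The figure bodies did not survive the text extraction, but the surviving labels make the learned grammar easy to state. The following Python fragment records our reconstruction of the string form of SubdueGL's grammar from these two examples (an assumption, not the paper's exact figure) and uses it to generate members of the language, illustrating why the grammar accepts strings of any length with c and d in any order.

    import random

    # Our reading of the equivalent string grammar of Figure 1d:
    # S1 -> a b S2 S1 | a b S2, with the variable S2 -> c | d.
    PRODUCTIONS = {
        "S1": [["a", "b", "S2", "S1"], ["a", "b", "S2"]],
        "S2": [["c"], ["d"]],
    }

    def generate(symbol="S1"):
        # Randomly expand non-terminals until only terminals remain.
        if symbol not in PRODUCTIONS:
            return symbol
        return "".join(generate(s) for s in random.choice(PRODUCTIONS[symbol]))

    # Every generated string is a sequence of "ab" units each followed by c or d,
    # e.g. "abc", "abdabc", "abcabdabd"; the training string abcabdabcabd is one
    # of them. For the second example, an extra production S3 -> S1 x S1 (again
    # our reconstruction of Figure 2b) would account for the interrupting x.
    print(generate())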

A third example is given later, after the introduction of the algorithm. That example involves an artificial domain which was specifically designed to highlight SubdueGL's capabilities.

The Learning Algorithm

The SubdueGL algorithm is based on the Subdue knowledge discovery algorithm (Cook and Holder 2000), which extracts common substructures from graphs. SubdueGL takes data sets in graph format as input, hence a database needs to be represented as a graph before being passed to SubdueGL. The graph representation includes the standard features of graphs: labeled vertices and labeled edges. Edges can be directed or undirected. No restrictions are placed on the input graph. When converting a data set to a graph representation, objects and data typically map to vertices, while relationships and attributes map to edges.

Search

SubdueGL performs an iterative search on the input graph, each iteration resulting in a grammar production. When a production is found, the right side of the production is abstracted away from the input graph by replacing each occurrence of it with the non-terminal on the left side. SubdueGL keeps iterating until the entire input graph is abstracted away into a single non-terminal. User-specified limits can be placed on the number of productions to be found and on the maximum size of each production rule (in number of vertices). For other user-defined options please see earlier publications or the user manual, available at http://cygnus.uta.edu/subdue/.

In each iteration SubdueGL performs a beam search for the best substructure to be used in the next production rule. The search starts by finding each uniquely labeled vertex and all its instances in the input graph. In our first example, the input graph (shown in Figure 1b) has four uniquely labeled vertices, a, b, c and d, having 4, 4, 2 and 2 instances, respectively. A subgraph definition together with all its instances is referred to as a substructure.
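A minimal sketch of this first step of the search follows, using our own dictionary representation of the labeled vertices; the function name and data layout are illustrative, not SubdueGL's.

    from collections import defaultdict

    def initial_substructures(vertex_labels):
        # Group vertex ids by label: each label with its instances forms an
        # initial single-vertex substructure for the beam search.
        instances = defaultdict(list)
        for vid, label in vertex_labels.items():
            instances[label].append(vid)
        return dict(instances)

    chain = {i: ch for i, ch in enumerate("abcabdabcabd")}   # the Figure 1b chain
    print({lab: len(ids) for lab, ids in initial_substructures(chain).items()})
    # -> {'a': 4, 'b': 4, 'c': 2, 'd': 2}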

The ExtendSubstructure search operator is applied to each of these single-vertex substructures to produce two-vertex substructures. This operator extends a substructure in all possible directions and collects the instances that match, possibly resulting in several new substructures. In our example, extending the unique vertex b results in instances such as a b, b c and b d. These are different from each other, but each has several instances of its own. These instances are collected to form new substructures.

SubdueGL ( graph G, int Beam, int Limit )
  grammar = {}
  repeat
    queue Q = { v | v has a unique label in G }
    bestSub = first substructure in Q
    repeat
      newQ = {}
      for each substructure S in Q
        newSubs = ExtendSubstructure(S)
        recursiveSubs = RecursifySubstructure(S)
        newQ = newQ U newSubs U recursiveSubs
        Limit = Limit - 1
      evaluate substructures in newQ by MDL
      Q = substructures in newQ with top Beam compression scores
      if best substructure in Q better than bestSub
        then bestSub = best substructure in Q
    until Q is empty or Limit <= 0
    grammar = grammar U bestSub
    G = G compressed by bestSub
  until bestSub cannot compress the graph G
  return grammar

ExtendSubstructure ( substructure S )
  newSubs = S extended by an adjacent edge in all possible ways
  varSubs = S extended by an adjacent edge in all possible ways,
            replacing the added vertex with a non-terminal
  return newSubs U varSubs

RecursifySubstructure ( substructure S )
  recSubs = all possible chains of instances of S, linked by a single edge
  return recSubs

Figure 3. SubdueGL's main discovery algorithm.

The resulting substructures are evaluated according to the minimum description length (MDL) principle, which states that the best theory is the one that minimizes the description length of the entire data set. The MDL principle was introduced by Rissanen (1989) and applied to graph-based knowledge discovery by Cook and Holder (1994). The value of a substructure is computed by dividing the description length of the input graph by the sum of the description lengths of the substructure and of the input graph compressed by the substructure:

  Value(S) = DL(G) / (DL(S) + DL(G|S)),

where G is the input graph, S is the substructure, G|S is the input graph G compressed using S, and DL stands for description length. SubdueGL seeks to maximize this value.
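The shape of this evaluation can be sketched as follows. The description_length function below is only a crude placeholder for the actual bit-level graph encoding of Cook and Holder (1994), which we do not reproduce here, and the example numbers are purely illustrative.

    import math

    def description_length(num_vertices, num_edges, num_labels):
        # Crude stand-in for the graph encoding of Cook and Holder (1994): bits
        # to name each vertex label plus bits to name each edge's endpoints and
        # label. Not the real encoding, just enough to exercise the formula.
        label_bits = math.log2(max(num_labels, 2))
        vertex_bits = math.log2(max(num_vertices, 2))
        return num_vertices * label_bits + num_edges * (2 * vertex_bits + label_bits)

    def substructure_value(dl_graph, dl_sub, dl_compressed):
        # Value(S) = DL(G) / (DL(S) + DL(G|S)); SubdueGL keeps the
        # substructures that maximize this ratio.
        return dl_graph / (dl_sub + dl_compressed)

    dl_g = description_length(12, 11, 4)           # e.g. the chain graph of Figure 1b
    dl_s = description_length(3, 2, 3)             # a candidate substructure
    dl_gs = description_length(6, 5, 5)            # the graph compressed by it
    print(substructure_value(dl_g, dl_s, dl_gs))   # a value above 1 means the rule compresses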

Only substructures deemed the best by the MDL principle are kept for further extension. Heuristics are applied to stop the extension process, although exhaustive analysis is also possible. One heuristic involves tracking the search space for local minima: the search is abandoned if a new local minimum is not found after a certain number of applications of the ExtendSubstructure operator. This heuristic is used by default. Another heuristic requires the user to specify the maximum size of a substructure; once the size limit is reached, the search terminates. There is also an absolute limit on the number of substructures to consider during the search, which can be specified by the user. These heuristics, like others in SubdueGL, can be used in combination.

Recursion

Recursive productions are made possible by the RecursifySubstructure search operator, which is applied to each substructure after the ExtendSubstructure operator. RecursifySubstructure checks each instance of the substructure to see if it is connected to any other instance of the same substructure by an edge. If so, a recursive production is possible. The operator adds the connecting edge to the substructure and collects all possible chains of instances. If a recursive production is found to be the best in an iteration, each such chain of subgraphs is abstracted away by replacing it with a single vertex.
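A rough sketch of this chain-collection idea follows, using our own representation of instances as sets of vertex ids rather than SubdueGL's internal data structures.

    def collect_chains(instances, edges):
        """instances: list of sets of vertex ids, one set per instance of the
        substructure; edges: iterable of (src, label, dst). Greedily groups
        instances that are joined to one another by a single edge into chains."""
        def linked(i, j):
            return any((s in instances[i] and d in instances[j]) or
                       (s in instances[j] and d in instances[i])
                       for (s, _lab, d) in edges)

        chains, used = [], set()
        for start in range(len(instances)):
            if start in used:
                continue
            chain = [start]
            used.add(start)
            extended = True
            while extended:
                extended = False
                for j in range(len(instances)):
                    if j not in used and linked(chain[-1], j):
                        chain.append(j)
                        used.add(j)
                        extended = True
                        break
            chains.append(chain)
        return chains

    # Two instances joined by a single "next" edge form one chain, so a single
    # recursive production can abstract the whole list away at once.
    insts = [{0, 1, 2, 3}, {4, 5, 6, 7}]
    print(collect_chains(insts, [(1, "next", 4)]))   # -> [[0, 1]]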

Since SubdueGL discovers commonly occurring substructures first and only then attempts to make a recursive production, it can only discover recursive productions that parse lists of substructures. In other words, it can only make recursive productions out of lists of substructures that are connected by a single edge, and that edge has to have the same label between each member of the list. The algorithm is exponential in the number of edges considered in the recursion, so we limit SubdueGL to single-edge recursive productions. Therefore, the system does not yet learn productions such as S → a S b. The stopping condition of the recursion is generated by removing the recursive vertex along with the edge that connects it to the rest of the subgraph.

Variables

The major insight behind discovering variables is that the context they appear in has to be constant. In other words, the first step towards discovering variables is discovering commonly occurring structures. If these commonly occurring structures are connected to varying vertices, those varying vertices can be turned into variables. Variables have to be connected to each instance of the common substructure in the same way: by an edge of the same label and direction to the same vertex. We give an example of this in the next section, but looking at the input graph in Figure 4 and the grammar rule in Figure 7 at this point may be helpful.

SubdueGL discovers variables inside the ExtendSubstructure search operator. As mentioned before, SubdueGL extends each instance of a substructure in all possible ways and collects the instances that still match. After this step, it collects all instances that were extended by the same edge, regardless of which vertex they point to (as long as that vertex is not already in the substructure). This new vertex is replaced with a variable (non-terminal) vertex. The substructure is then evaluated and competes with the others for top placement. Generally speaking, if the variable has the same value for most of the instances, the substructure will rank worse than the equivalent structure without the variable, because of the extra bits needed to encode the variable. On the other hand, if the variable has many values, it helps to cover many more instances of the substructure than the equivalent structure without the variable would, and will compress the input graph better.
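A small sketch of this idea follows; the tuple format, the direction field and the name propose_variable are illustrative assumptions, not SubdueGL's implementation.

    from collections import defaultdict

    def propose_variable(extensions, nonterminal="S_var"):
        """extensions: one (edge_label, direction, new_vertex_label) tuple per
        extended instance. If every instance was extended by the same edge, the
        newly reached vertex becomes a non-terminal whose possible values are
        the labels observed there; otherwise no variable is proposed."""
        groups = defaultdict(set)
        for edge_label, direction, new_label in extensions:
            groups[(edge_label, direction)].add(new_label)
        if len(groups) != 1:
            return None
        (edge_key, values), = groups.items()
        return edge_key, nonterminal, values

    # In the artificial domain of the next section, the four triangle instances
    # are each extended by an identical edge to vertices labeled c, d, e and f,
    # so the proposed variable would take the values {'c', 'd', 'e', 'f'}.
    print(propose_variable([("t", "out", lab) for lab in "cdef"]))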

Let us work through an example to illustrate the above explanation.

An Illustrative Example

In this section we give a working example of SubdueGL's operation. Consider the input graph shown in Figure 4. It is the graph representation of an artificially generated domain. It features lists of a static structure (square shape), a list of a changing structure (triangle shape), and some additional random vertices and edges. For a cleaner appearance we omit edge labels in the figures: the edge labels within the triangle-shaped subgraphs are t, those in the square-shaped subgraphs are s, and the rest of the edges are labeled next.

Figure 4. Input graph.

SubdueGL starts out by collecting all the unique vertices in the graph and expanding them in all possible directions. Let us follow the extension of vertex x, keeping in mind that the others are expanded in parallel. Expanding vertex x in all possible directions results in two-vertex substructures with the edges (x, s, y), (x, s, z), (y, next, x), and (x, next, r). The first two substructures will rank higher, since they have four instances each and will compress the graph better than the latter two, which have only two and one instances, respectively.

Figure 5. First production generated by SubdueGL.

Figure 6. Input graph, parsed by the first production.

Applying the ExtendSubstructure operator three more times results in a substructure having vertices {x, y, z, q} and four edges connecting these four vertices. This substructure has four instances. Being the biggest and most common substructure, it ranks at the top. Executing the RecursifySubstructure operator results in the recursive grammar rule shown in Figure 5. The production covers two lists of two instances of the substructure. The recursive production was constructed by checking all outgoing edges of each instance to see whether they connect to any other instance. We can see in Figure 4 that the instance in the lower left is connected to the instance on its right, via vertex y being connected to vertex x. The same is true on the lower right side.

Abstracting out these four instances of the substructure using the above production results in the graph depicted in Figure 6. The next iteration of SubdueGL uses this graph as its input graph to learn the next grammar rule. Looking at the graph, one can easily see that the most common substructure is now the triangle-shaped subgraph. SubdueGL will in fact find a portion of it simply by looking for substructures that are exactly the same. This part is the substructure having vertices {a, b} and the edge (a, t, b); it has four instances. Extending this structure further by an edge and a vertex adds a different vertex for each instance: c, d, e, and f. The resulting single-instance substructures do not do well when evaluated by the MDL measure.

At this point SubdueGL generates another substructure with four instances, replacing the vertices c, d, e, and f with a non-terminal vertex (S3) in the substructure, thereby creating a variable. This substructure now has four instances and stands the best chance of being selected for the next production. After the ExtendSubstructure operation, however, SubdueGL hands the substructure to RecursifySubstructure to see if any of the instances are connected. Since all four of them are connected by an edge, a recursive substructure is created which covers even more of the input graph, having included three additional edges. It is also replaced by a single non-terminal in the input graph, versus four non-terminals when abstracting out the instances non-recursively, one by one. The new productions generated in this iteration are shown in Figure 7. Abstracting away these substructures produces the graph shown in Figure 8.

Figure 7. Second and third productions generated by SubdueGL.

Figure 8. Input graph parsed by the second and third productions.

In the next iteration, SubdueGL cannot find any recurring substructure that can be abstracted out to reduce the graph's description length. The graph in Figure 8 therefore becomes the right side of the last production. When this rule is executed, the graph is fully parsed.

Discussion

SubdueGL required only minor extensions to Subdue, mainly because of the robustness of the MDL heuristic. Grammars with large amounts of disjunction, either in the form of multiple productions or of large discrete ranges for variables, trade off against simpler grammars with less coverage. The MDL measure provides a reasonable tradeoff between the two.

The next step in our analysis of the SubdueGL approach is to empirically validate the algorithm and find its limitations. Empirical testing is difficult for graph grammar learning due to the shortage of competitors and real-world examples. However, automated testing may be possible by randomly generating graph grammars, generating examples from these grammars, and then analyzing SubdueGL's behavior while it attempts to learn the original grammars. Random graph grammar generation is not trivial given the huge space of possible grammars, but such a methodology would allow us some measure of SubdueGL's effectiveness in this domain.

We have also considered a complexity analysis of SubdueGL. Since we limit SubdueGL to recursive productions involving only one connecting edge, its complexity is no more than that of Subdue, which is constrained to be polynomial in the size of the input graph.

Conclusions and Future Work

In this paper we introduced an algorithm, SubdueGL, which is able to learn graph grammars from examples. The algorithm is based on the earlier system called Subdue, which has had success in structural data mining for years. SubdueGL focuses on context-free parser graph grammars. Although incomplete, its current capabilities include finding static structures, finding variables, and finding recursive structures.

As mentioned at the beginning, this paper reports on work in progress. Our future plans include extending graph grammar learning to concept learning and handling continuous values. As future results warrant, we may allow variables to take on values that are not restricted to single vertices. We also plan to investigate other ways to identify recursive structures, with a focus on allowing the recursive non-terminal to be embedded in a subgraph, connecting with more than a single edge.

We will evaluate our progress by comparing the system's performance to that of competing machine learning algorithms, such as inductive logic programming systems. We also plan to apply SubdueGL to real-world domains, such as circuit diagrams and protein sequences.

References

Bartsch-Spörl, B. 1983. Grammatical inference of graph grammars for syntactic pattern recognition. Lecture Notes in Computer Science, 153: 1-7.

Carrasco, R. C., J. Oncina, and J. Calera. 1998. Stochastic inference of regular tree languages. Lecture Notes in Artificial Intelligence, 1433: 187-198.

Cook, D. J., and L. B. Holder. 2000. Graph-Based Data Mining. IEEE Intelligent Systems, 15(2): 32-41.

Cook, D. J., and L. B. Holder. 1994. Substructure Discovery Using Minimum Description Length and Background Knowledge. Journal of Artificial Intelligence Research, 1: 231-255.

Engelfriet, J., and G. Rozenberg. 1991. Graph grammars based on node rewriting: an introduction to NLC grammars. Lecture Notes in Computer Science, 532: 12-23.

Jeltsch, E., and H. J. Kreowski. 1991. Grammatical inference based on hyperedge replacement. Lecture Notes in Computer Science, 532: 461-474.

Nagl, M. 1987. Set theoretic approaches to graph grammars. In H. Ehrig, M. Nagl, G. Rozenberg, and A. Rosenfeld, editors, Graph Grammars and Their Application to Computer Science, volume 291 of Lecture Notes in Computer Science, 41-54.

Nevill-Manning, C. G., and I. H. Witten. 1997. Identifying Hierarchical Structure in Sequences: A linear-time algorithm. Journal of Artificial Intelligence Research, 7: 67-82.

Rekers, J., and A. Schürr. 1995. A Parsing Algorithm for Context-Sensitive Graph Grammars. Technical Report 95-05, Leiden University.

Rissanen, J. 1989. Stochastic Complexity in Statistical Inquiry. World Scientific Company.