MINING RELAXED GRAPH PROPERTIES IN INTERNET



Similar documents
Lecture 2: Karger s Min Cut Algorithm

5 Boolean Decision Trees (February 11)

Department of Computer Science, University of Otago

Modified Line Search Method for Global Optimization

Project Deliverables. CS 361, Lecture 28. Outline. Project Deliverables. Administrative. Project Comments

In nite Sequences. Dr. Philippe B. Laval Kennesaw State University. October 9, 2008

THE HEIGHT OF q-binary SEARCH TREES

I. Chi-squared Distributions

A probabilistic proof of a binomial identity

Chapter XIV: Fundamentals of Probability and Statistics *

Your organization has a Class B IP address of Before you implement subnetting, the Network ID and Host ID are divided as follows:

.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth

Discrete Mathematics and Probability Theory Spring 2014 Anant Sahai Note 13

Sequences and Series

A Faster Clause-Shortening Algorithm for SAT with No Restriction on Clause Length

hp calculators HP 12C Statistics - average and standard deviation Average and standard deviation concepts HP12C average and standard deviation

Analyzing Longitudinal Data from Complex Surveys Using SUDAAN

Vladimir N. Burkov, Dmitri A. Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT

BENEFIT-COST ANALYSIS Financial and Economic Appraisal using Spreadsheets

Chapter 6: Variance, the law of large numbers and the Monte-Carlo method

A Constant-Factor Approximation Algorithm for the Link Building Problem

FOUNDATIONS OF MATHEMATICS AND PRE-CALCULUS GRADE 10

CHAPTER 3 THE TIME VALUE OF MONEY

Measures of Spread and Boxplots Discrete Math, Section 9.4

CAMERA AUTOCALIBRATION AND HOROPTER CURVES


Annuities Under Random Rates of Interest II By Abraham Zaks. Technion I.I.T. Haifa ISRAEL and Haifa University Haifa ISRAEL.

Soving Recurrence Relations

Notes on exponential generating functions and structures.

Hypothesis testing. Null and alternative hypotheses

Capacity of Wireless Networks with Heterogeneous Traffic

Convexity, Inequalities, and Norms

THE ARITHMETIC OF INTEGERS. - multiplication, exponentiation, division, addition, and subtraction

1 Computing the Standard Deviation of Sample Means

5: Introduction to Estimation

INVESTMENT PERFORMANCE COUNCIL (IPC) Guidance Statement on Calculation Methodology

GCSE STATISTICS. 4) How to calculate the range: The difference between the biggest number and the smallest number.

The analysis of the Cournot oligopoly model considering the subjective motive in the strategy selection

DAME - Microsoft Excel add-in for solving multicriteria decision problems with scenarios Radomir Perzina 1, Jaroslav Ramik 2

The Stable Marriage Problem

Chapter 7: Confidence Interval and Sample Size

Conceptualization with Incremental Bron- Kerbosch Algorithm in Big Data Architecture

Overview of some probability distributions.

1. C. The formula for the confidence interval for a population mean is: x t, which was

Output Analysis (2, Chapters 10 &11 Law)

Escola Federal de Engenharia de Itajubá

Clustering Algorithm Analysis of Web Users with Dissimilarity and SOM Neural Networks

WHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER?

SAMPLE QUESTIONS FOR FINAL EXAM. (1) (2) (3) (4) Find the following using the definition of the Riemann integral: (2x + 1)dx

Lecture 4: Cheeger s Inequality

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies ( 3.1.1) Limitations of Experiments. Pseudocode ( 3.1.2) Theoretical Analysis

*The most important feature of MRP as compared with ordinary inventory control analysis is its time phasing feature.

Asymptotic Growth of Functions

Evaluating Model for B2C E- commerce Enterprise Development Based on DEA

SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES

MARTINGALES AND A BASIC APPLICATION

1. Introduction. Scheduling Theory

PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM

Properties of MLE: consistency, asymptotic normality. Fisher information.

Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling

(VCP-310)

Descriptive Statistics

Entropy of bi-capacities

The Power of Free Branching in a General Model of Backtracking and Dynamic Programming Algorithms

A Mathematical Perspective on Gambling

Chair for Network Architectures and Services Institute of Informatics TU München Prof. Carle. Network Security. Chapter 2 Basics

Ekkehart Schlicht: Economic Surplus and Derived Demand

On the Capacity of Hybrid Wireless Networks

A Recursive Formula for Moments of a Binomial Distribution

where: T = number of years of cash flow in investment's life n = the year in which the cash flow X n i = IRR = the internal rate of return

Overview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals

MTO-MTS Production Systems in Supply Chains

Data Analysis and Statistical Behaviors of Stock Market Fluctuations

Universal coding for classes of sources

The following example will help us understand The Sampling Distribution of the Mean. C1 C2 C3 C4 C5 50 miles 84 miles 38 miles 120 miles 48 miles

Non-life insurance mathematics. Nils F. Haavardsson, University of Oslo and DNB Skadeforsikring

Case Study. Normal and t Distributions. Density Plot. Normal Distributions

Reliability Analysis in HPC clusters

Chatpun Khamyat Department of Industrial Engineering, Kasetsart University, Bangkok, Thailand

, a Wishart distribution with n -1 degrees of freedom and scale matrix.

Hypergeometric Distributions

A gentle introduction to Expectation Maximization

University of California, Los Angeles Department of Statistics. Distributions related to the normal distribution

Confidence Intervals for One Mean

Permutations, the Parity Theorem, and Determinants

Center, Spread, and Shape in Inference: Claims, Caveats, and Insights

3. Greatest Common Divisor - Least Common Multiple

ODBC. Getting Started With Sage Timberline Office ODBC

COMPARISON OF THE EFFICIENCY OF S-CONTROL CHART AND EWMA-S 2 CONTROL CHART FOR THE CHANGES IN A PROCESS

A Combined Continuous/Binary Genetic Algorithm for Microstrip Antenna Design

Here are a couple of warnings to my students who may be here to get a copy of what happened on a day that you missed.

CS103A Handout 23 Winter 2002 February 22, 2002 Solving Recurrence Relations

How To Understand The Theory Of Coectedess

Multi-server Optimal Bandwidth Monitoring for QoS based Multimedia Delivery Anup Basu, Irene Cheng and Yinzhe Yu


Transcription:

MINING RELAXED GRAPH PROPERTIES IN INTERNET Wilhelmiia Hämäläie Departmet o Computer Sciece, Uiversity o Joesuu, P.O. Box 111, FIN-80101 Joesuu, Filad whamalai@cs.joesuu.i Hau Toivoe Departmet o Computer Sciece, Uiversity o Helsii P.O. Box 26, FIN-00014 Uiversity o Helsii, Filad htoivoe@cs.helsii.i Vladimir Poroshi Departmet o Computer Sciece, Uiversity o Joesuu, P.O. Box 111, FIN-80101 Joesuu, Filad vporo@cs.joesuu.i ABSTRACT May real world datasets are represeted i the orm o graphs. The classical graph properties oud i the data, lie cliques or idepedet sets, ca reveal ew iterestig iormatio i the data. However, such properties ca be either too rare or too trivial i the give cotext. By relaxig the criteria o the classical properties, we ca id more ad totally ew patters i the data. I this paper, we deie relaxed graph properties ad study their use i aalyzig ad processig graph-based data. Especially, we cosider the problem o idig sel-reerrig groups i WWW, ad give a geeral algorithm or miig all such patters rom a collectio o WWW pages. We suggest that such sel-reerrig groups ca reveal web commuities or other clusterig i WWW ad also acilitate i compressio o graph-ormed data. KEYWORDS graph, data miig, WWW. 1. INTRODUCTION I this paper, we itroduce ew cocepts ad methods or idig ew iormatio i graph-structured data, lie World Wide Web (WWW), citatio idexes or cocept maps. Our aim is to study systematically, whether the relaxed versios o the classical graph properties (Table 2) lie early cliques or early idepedet sets could reveal ew iterestig iormatio. As a example, we will loose the deiitio o a clique ad cosider the resultig sel-reerrig groups i the WWW cotext. The sel-reerrig groups ca be iterpreted as clusters i WWW. Such clusters give compact descriptio o the WWW graph structure ad are especially useul or idig web commuities. I additio, we will demostrate how this ew cocept ca be used or compressig other graph-based data, lie co-operatioal cocept maps. I the ext sectios, we will irst itroduce the idea o sel-reerrig groups ad give the ormal deiitios. The we will give the algorithm or idig all sel-reerrig sets i a give graph ad aalyze its complexity. We will itroduce our irst experimets or idig groupigs (potetial web-commuities) amog WWW pages ad applyig our algorithm or compressig co-operatioal cocept maps. Fially, we will compare our ew method to previous wor ad draw the ial coclusios. The basic cocepts ad otatios used i this paper are listed i Table 1.

Table 1. Basic cocepts ad otatios or graphs. G = (V, E) A graph which cosists o a set o vertices V ad a set o edges E = {{u, v} u, v V}. For a directed graph E cosists o ordered pairs, E = {{u, v} u, v V} V x V. u 4 v A path (o legth ) rom vertex u to vertex v i a directed graph G, i.e. a sequece (v 0,, v ) such that u = v 0 ad v = v ad (v i-1, v i ) E or i = 1, 2,,. I a udirected graph, ordered pairs (v i-1, v i ) are replaced by uordered pairs {v i-1, v i }. degree(v) The degree o a vertex v. I a udirected graph, the degree o v is the umber o edges icidet o it. I a directed graph, the degree o a vertex v is degree(v) = outdegree(v) + idegree(v). outdegree(v) The outdegree o a vertex v i.e. the umber o edges leavig it. idegree(v) The idegree o a vertex v i.e. the umber o edges eterig it. G = V The size o a graph G = (V, E) is the umber o vertices i V. Table 2. Classical graph properties. First our properties are deied or udirected graphs ad the last property or both directed ad udirected graphs. Complete graph G is complete, i all o its vertices are coected to each other: u, v V {u, v} E. Clique A subset o vertices V V such that each pair i V is coected by a edge i E: u, v V {u, v} E. I.e. a clique deies a complete subgraph i G. Idepedet set A subset o vertices V V such that o pair i V is coected by a edge i E: u, v V {u, v} E. Observe that V is a idepedet set i G i V is a clique i the complemet graph G = ( V,( V V ) \ E). Vertex cover A subset o vertices V V such that all edges i E are covered by vertices i V. I.e. or each {u, v} E either v V or u V or both. Observe that V is a vertex cover o G i V \ V is a idepedet set i G. Strogly coected A maximal subset o vertices V V such that or each pair o vertices u, v V there compoet are paths u 4 v ad v 4 u. I.e. vertices u ad v are reachable rom each other. A graph is strogly coected i it has oly oe strogly coected compoet. 2. SELF-REFERRING GROUPS The idea o sel-reerrig groups is to id a group o pages, which mostly reer to each other. Let us deote this set o sel-reerrig pages as S. I all pages i S reer to all other pages i S, the S is a clique ad the correspodig udirected graph is complete. However, we ca loose this criterio ad require that each page i S reers to at least mi part o other pages i S, i which mi is a user-deied threshold or S s reerece requecy. This id o sel-reerece is called here as strog sel-reerece. (See example i Figure 1.) Formally, we deie: Deiitio: Let S be a set o vertices, S > 1, ad mi > 0 a user-deied requecy threshold. S is strogly sel-reerrig, i p S, degree( p) mi. The reerece requecy o S is mi{ degree ( p) p S} ( S ) =. S 1 S 1 Figure 1. Graph deied by li structure o web pages. The dar odes costruct a strogly sel-reerrig set with threshold mi = 2/3.

Observe that we are ow cosiderig the uderlyig udirected graph structure o the WWW, ad accordig to deiitio o graph the duplicate edges are removed. By cosiderig the directed graph structure we ca id more iormatio about sel-reerrig groups ad the relatios betwee them. For example, we ca adopt the idea o HITS-algorithm [7], ad search the best authorities (pages reerred by good hubs ) ad the best hubs (pages reerrig to good authorities) i the group. It is quite plausible that the authorities reveal the good sources o ew iormatio i the group, ad the good hubs give good overviews o the topics cosidered i the group. Further, i we idetiy the WWW addresses, we ca also determie the geographic distributio o a group (whether it is iteratioal or represet oly some atioalities). 3. ALGORITHM FOR FINDING SELF-REFERRING GROUPS 3.1 Basic idea I the search algorithm or sel-reerrig pages we ca restrict the search space to coected compoet by the ollowig observatio: whe the requecy threshold mi 0.5, the the resultig strogly selreerrig set has to be a coected compoet. I the practical applicatios this extra requiremet is quite sesible. Lemma 1: Let S V be a strogly sel-reerrig set i G = (V, E) with requecy threshold mi 0.5. The S is coected compoet i G. Proo: Atitheses: Let S be ucoected i.e. there is compositio to m 2 parts S = S 1 S 2 S m such that S i S j = or all i j, i, j = 1,, m. Let S =. Accordig to assumptio, or all p S, degree(p) mi ( S 1) = 0.5 0.5. I.e. each p S has at least 0.5 eighbors i S. The or all S i, i = 1,, m, S i 0.5 + 1, ad S m (0.5 + 1) 2 (0.5 + 1) = + 2. Cotradictio. The basic idea o the algorithm (Alg. 2) is ollowig: 1. Search coected compoets rom graph G by depth-irst search. This ca be completed i time Θ( V + E ) (See sectio Aalysis.) Let the resultig vertex set be V V, ad the correspodig udirected subgraph G = (V, E ). 2. For each coected compoet search sel-reerrig groups rom G by depth-irst search (Alg. 2). 3.2 Depth-irst search or sel-reerrig sets Searchig the strogly sel-reerrig or wealy idepedet sets is more complex tha searchig commo cliques or idepedet sets. For commo cliques ad idepedet sets we ca use simple depth-irst search, because or each vertex set S, such that S is clique/idepedet set, also T S is clique/idepedet set. This meas that we ca prue a brach i search space, whe we id a set which is ot a clique/idepedet set. But this does ot hold or sel-reerrig sets or wealy idepedet sets. Thus, we will prove aother pruig criterio or sel-reerrig sets. Lemma 2: Let (S) be the reerece requecy o vertex set T, T =, ad (T) the reerece requecy o T s superset S, T S, S = +. The 1 ( T) ( S) +. + 1 + 1 Proo: Let us deote the ested subsets by S, S +1,, S +, S = S S +1 S + = T, S i = i. Let 1 ( S ) =. The the maximum value o (S +1 ) is l + 1 1 1 = ( S ) +. Let us suppose that o each step (S i+1 ) 1 gets the maximum value, i 1 1 ( S ) +. The, by solvig the recursio equatio, we get that ater steps gets i i i maximum value 1 1 1 ( S + ) = ( S ) + ( T) = ( S) +. + 1 + 1 + 1 + 1 The ollowig lemma is a direct cosequece o the deiitio o sel-reerrig sets: Lemma 3: Let v be a vertex ad S a strogly sel-reerrig set, v S. The the maximum size o S is degree ( v ) S + 1 mi.

Theorem: Let v be a vertex ad S a vertex set, v S, S =, ad (S) be the reerece requecy o set S. I degree ( v)(1 mi ), the oe o S s superset ca be strogly sel-reerrig. ( S ) < 1 ( 1) mi Proo: Let us deote by degree( v) = + 1 the umber o items we ca add to S= S util maximum size mi is met. Let the resultig maximal superset be S +. I the best case, the requecy o S + is 1 ( S ) = ( S ) +. The set is sel-reerrig, i 1 ( S ) + mi + + 1 + 1 + 1 + 1 degree( v) mi degree ( v) (1 ) mi degree ( v)(1 mi ) degree( v )(1 mi ) ( S ) = 1. Thus, i ( S) < 1, the S ( - 1)mi ( - 1)mi ( 1) mi ca ever become strogly sel-reerrig. Now we ca give the depth-irst search Algorithm 2 or searchig all strogly sel-reerrig supersets o the curret vertex set X, which is give as a parameter. I the mai program, Alg. 2, the depth-irst search is called by a sigular vertex set {v}, its degree d = degree(v), threshold mi, ad a additioal parameter last, which is explaied below. I each recursive call o Alg. 2, we irst chec, i the vertex set is sel-reerrig, ad i the positive case we output it. Otherwise we chec, whether it ca be expaded to a sel-reerrig set. I the coditio does ot hold, we retur rom recursio. Otherwise we expad X with a ew item, ad call the algorithm recursively. The recursio iishes, whe the maximum size o the sel-reerrig sets cotaiig iitial vertex v is met, or we recogize that o more sel-reerrig sets ca be oud. Ater returig rom the recursio, we try aother ew item, util all o them are tried. Observe that we have to ix some order or vertices to avoid testig the same vertex sets agai. I the Algorithm 2, we deote u > v, whe v precedes u i the give order. Variable last tells the last vertex added to the cadidate group, ad thus all vertices precedig it are already tested. Alg. 1 SelReerrigSets(G, mi ). A algorithm or searchig all strogly sel-reerrig sets i graph G = (V, E) Iput: G = (V, E), mi Output: Y V 1 begi compute all coected compoets i G = (V, E) 2 or each coected compoet V i G = (V, E) do 3 or all v V ds({v}, degree(v), mi, v) 4 ed Alg. 2 ds(x, d, mi, last). A depth-irst search o the sel-reeret sets i subgraph G = (V, E ) Iput: X V, d, mi, last Output: Y V 1 begi 2 i (X) mi the 3 output X 4 else i d(1 mi ) ( ( X ) < 1 ) ( X 1) mi 5 the retur 6 or all vertices u V (u > last ad v X (v, u) E) do 7 ds(x {u}, d, mi, u) 8 ed 3.3 Aalysis Let G = (V, E) be a directed graph. The strogly coected compoets ca be oud i time Θ(3 V + 2 E ) by a well-ow algorithm, depth-irst search oce or the origial graph G, oce or its traspose graph G T, ad outputs the vertices o each which calls tree. [13]

Let us ow cosider the worst case complexity o idig all sel-reerrig sets. I the size o the vertex set is V =, we have to try i the worst case 2 sets ( P(V) = 2 ). However, we ow that the maximum size or each sel-reerrig set X, which is superset o a vertex v, is degree( v) mi re + 1. This gives a upper limit or the depth o algorithm. I sparse graphs we achieve a eormous savig: e.g. i the graph size is 100, average degree o vertices is 18, ad mi = 0.9, the height o the search tree is oly 18 + 1= 21 istead o 100. However, i the worst case whe the graph is complete, we save 09. othig. Table 3. Sel-reerrig groups o size 2, whe the iitial graph cosisted o 100 best search results by Google. The page umbers are Google raig umbers. Page umbers Cotets 22, 34, 86 MGTS-2003 ad MGTS-2004 worshop pages ad a homepage o a member o program committee 22, 34, 97 MGTS-2003 ad MGTS-2004 worshop pages ad a homepage o aother member o program Committee 22, 39, 56 MGTS-2003 worshop page ad homepages o a member o committee i Eglish ad Japaese Table 4. Sel-reerrig groups o size 2, whe the iitial graph cosisted o 100 best search results by AltaVista. The page umbers are AltaVista raig umbers. Page umbers Cotets 1 23 58 Articles about Gra-FX program 1 58 67 Articles about Gra-FX program 4. INITIAL EXPERIMENTS 4.1 Searchig sel-reerrig groups i Iteret I the irst experimet, we searched sel-reerrig groups i Iteret. The iitial graphs were costructed rom the best search results with eywords graph data miig. For compariso we used two dieret search egies: Google, which is based o PageRa algorithm, ad AltaVista, which also uses coectivity iormatio, but is more cotet-orieted. I the extractio phase, oly lis to HTML pages via HTTP protocol were icluded ito results. Ater extractig the lis, the relative hyperlis were trasormed ito absolute addresses, cyclic lis to page itsel were iltered out, ad may lis to oe destiatio were cosidered as oe li. Ater that the 100 best raed pages were selected to the iput graph. The graphs were quite loosely coected ad we oud oly a couple o sel-reerrig groups with requecy threshold mi = 0.75. The sel-reerrig groups o size 2 are listed i Tables 3 ad 4. I the irst graph based o Google results we oud three sel-reerrig groups, all o them ocusig aroud MGTS worshops (Iteratioal Worshop o Miig Graphs, Trees ad Sequeces) ad homepages o program committee members. I the secod graph based o AltaVista results we oud oly two sel-reerrig groups, both o them about Gra-FX program or graph-based data miig, ad locatig o the same cite. This small experimet supported our iitial guess that the sel-reerrig groups are cetered aroud oe commo theme. However, there are various reasos or sel-reerece: the occurrece o same people, istitutios, products or just the locatio (cite). It also demostrated the importace o the iitial graph. I our experimet the iitial graphs were too sparse to produce more ad larger sel-reerrig groups, ad the costructio o more tightly coected graphs should be urther researched. 4.2 Combiig cocept-maps

I the secod experimet, we tested our algorithm or compressig co-operatioal cocept maps. The idea was to combie several cocept maps ad compress the resultig graph by replacig the strogly selreerrig sets by ew super-odes, which were give ew labels. The origial cocept maps were draw by three uiversity studets o topic Tazaia culture ad educatio. The cocept maps were combied simply by matchig the odes, which cotaied same words or stems, lie techology techology expositio ad social society. All the loose odes which had at most I the secod experimet, we tested our algorithm or compressig oe relatio were dropped. The resultig graph cosisted o 48 cocept odes. Ater combiatio, we searched strogly sel-reerrig sets amog cocepts i the uderlyig udirected graph. We made several experimets with dieret threshold values mi. First we ru the program with mi = 1.0, i.e. we searched all cliques i the uderlyig udirected graph. We oud 29 cliques, which were either cliques i oe o the origial cocept maps, or cotaied cocept odes with same words. This did ot reveal aythig ew, as was expected. The threshold valued mi = 0.9 ad mi = 0.8 gave the same result. However, with threshold values mi = 0.7 we oud ew cocept groups: {Cotextualized owledge, Formal owledge, Hidde owledge, ICT, Idigeus owledge systems}, {Educatio, Poor iteret expositio, Tazaia culture, Techology, Techology explosio}, ad {Educatio, Tazaia culture, Techological educatio, Techology, Techology expositio}. The irst cocept group could be cocluded rom oe o the cocept maps aloe, ad the third oe reveals the cotributio to educatioal techology, which was commo or all studets. However, the secod oe is a ew discovery which reveals the core o all the cocept maps. With threshold value mi = 0.6 we oud 49 sel-reerrig groups, each o them composed o at least our cocepts. The largest groups were {Chage i educatioal system, Diusio ito local social systems, Laptop/Destop computers, Programmig tutorial, Simputer, Techological educatio},{chage i educatioal system, Educatio, HIV/AIDS educatio, Laptop/Destop computers, Programmig tutorial, Techological educatio}, ad {Educatio, Poor iteret expositio, Tazaia culture, Techological educatio, Techology, Techology explosio}. The third oe oly expads the group oud i the previous experimet, but the irst two cocept groups are totally ew. They cotai mostly the same cocepts: Chage i educatioal system, Laptop/Destop computers, Programmig tutorial, Techological educatio. All o these are iherited rom oly oe cocept map, which was very tightly coected ad thus domiated the combiatio. This experimet gave rise to two remars: Firstly, our iitial combiatio techique (addig edges betwee odes cotaiig same words or stems) had too strog iluece o the result. Secodly, oe tightly coected map ca totally domiate the result, i the others are very loose. However, we oud the results quite ecouragig: sel-reerrig groups could be applied or a tool, which assists i costructig cooperative cocept maps by suggestig closely related cocepts. 5. RELATED WORK The existig data miig techiques or graph-based data ca be coarsely divided ito two approaches: miig requet substructures, ad clusterig ad searchig the graph accordig to coectivity iormatio. Our ew method ca be icluded ito the secod approach, although it diers rom the existig methods. I the classical graph-based data miig (e.g. [10, 2]) the datasets are represeted as graphs, ad the tas o idig requetly occurrig substructures reduces to search o requet subgraphs. The graph-based data miig suits especially well or structured data, but i act ay relatioal database ca be represeted as graph. The vertices o graph correspod etities ad the edges betwee them correspod biary relatioships. The geeral problem o graph-based data miig is ollowig: We are give a set o labelled graphs S, i which each vertex ad edge has a label associated to it, ad some threshold σ (0, 1]. The tas is to id all subgraphs which occur at least σ S o the iput graphs. Two subgraphs are cosidered idetical, i they are isomorphic (topologically idetical) ad have the same labels o the vertices ad edges. Usually it is also required that the graphs are coected, which reduces complexity ad is uiorm with the goal o idig iterestig relatios. The problem is very complex because subgraph isomorphism is ow to be NPcomplete [4]. Thus, our approach diers largely rom the classic graph-based data miig tas. While we search all istaces o some (relaxed) graph property i oe huge graph, the typical graph-based data miig tas supposes a set o labeled graphs, i which they search requetly occurrig subgraphs (i.e. each subgraph has several istaces i the collectio). However, both methods ca be used or simpliyig graph-structured

data. Coo ad Holder [2] suggest that we describe a requetly occurrig substructure by some cocept, ad replace its istaces by this cocept. Our approach ca be used i the similar way as our secod experimet demostrated. The techiques based o coectivity iormatio are owadays popular i Iteret cotext. I the clusterig-based approaches, the idea is to iterpret the sub-graph as a cluster, ad the measure o similarity by legth o the shortest path betwee two vertices (or more details see e.g. [5, 6, 12]). For example, the clusterig techiques have bee applied or aalyzig ad visualizig the Iteret structure [3, 14], costructig taxoomies [9], ad extractig web commuities [8]. However, the most importat applicatios based o coectivity aalysis are the advaced search methods lie HITS [7] ad PageRa [11]. Figure 2. Comparig the results o the trawlig algorithm ad sel-reerrig groups. By the trawlig algorithm, G 1 ad ay three vertices o G 2 are iterpreted as web commuities, while our algorithm or sel-reerrig groups produces oly G 2 with requecy threshold mi = 0.75. Accordig to the HITS algorithm, vertices 3, 4, 5, 6, 7, ad 8 all have equal authority values. I HITS algorithm, we evaluate two weights or each page: authority value, which tells how relevat iormatio the page cotais, ad hub value, which describes, how good reereces (lis) the page cotais. Accordig to mai assumptio, a page is a good authority i it is poited by good hub pages, ad a good hub, i it poits to good authorities. The goal is to id good authorities rom the iitial subgraph, by updatig hub ad authority weights util the system coverges. I Page Ra algorithm, oly oe weight, the page ra (PR) is calculated or the curret page. The PR value o a page is determied rom the PR values o the predecessor pages (pages poitig to it). Thus, loosely speaig, a good authority is also iterpreted as a good hub. I additio, the processig order simulates a idealized radom surer o Iteret. The trawlig algorithm [8] or eumeratig web commuities ca also be described by hubs ad authorities. I trawlig, the idea is to id subgraphs cotaiig bipartite cliques o i + j vertices. I such a clique each o i vertices (hubs) poits to all j vertices (authorities). A subgraph o size i + j cotaiig at least oe bipartite clique o i + j vertices is called bipartite core ad is cosidered as a potetial web commuity. This method diers rom our approach i two ways: irst, we do ot require ull cliques, but oly relative clique structure i the uderlyig udirected graph. Secod, our approach does ot restrict the web commuities to bipartite groups, but also democratic groups are cosidered. I Figure 1 we demostrate this dierece. Batagelj et al. [1] have geeralized the idea o cores. A -core i graph G = (V, E) is a maximal subgraph G = (V, E ), V V, E E such that or all v V deg(v). The degree uctio deg(v) ca be simply the degree o vertice v or some other measure based o idegrees ad outdegrees or edge weights. By determiig cores o dieret orders we ca costruct hierarchies o subgraphs accordig to their coectivity. The mai dierece to our approach is that the -cores do ot tell aythig about relative tighess o the clusterig, i.e. it is idepedet o the size o group S. E.g. let S = ad deg(v) deied as commo degree o vertex v. The S is -core v S degree(v). This shows that a -core ca be very strogly ( S ) = 1 sel-reerrig (whe =-1) or ot sel-reerrig at all (whe <<-1), ad it does ot catch the idea o selreerece. 6. CONCLUSIONS

I this paper we have itroduced ew iterestig graph properties, which ca be applied or aalysis ad processig o graph-based data. Especially, we have itroduced a otio o sel-reerrig set, which correspods to relaxed versio o a clique. We have give a algorithm or searchig all sel-reerrig sets rom the give graph, ad aalyzed its time complexity. I the iitial experimets we have demostrated, how the sel-reerrig sets ca be used or idig web commuities ad compressig graph-structured data. I the uture, we are goig to study systematically, whether the relaxed graph properties reveal ew iterestig iormatio ad how they ca be applied. I additio, we are goig to aalyze the sel-reerrig sets urther by HITS algorithm. The algorithm itsel oers may iterestig research tass. For example, we ca adapt it or miig sel-reerrig groups i weighed graphs. Miig the relaxed graph properties has also a geeral iterest i owledge discovery. The most importat research problem or their wider use is how to optimize the algorithm. It should also be studied, whether the existig optimizatio algorithms o the NP-complete problems could be applied withi tolerable error rage. REFERENCES [1] V. Batagelj ad M. Zaversi. Geeralized cores. Techical Report 799, Departmet o Theoretical Computer Sciece, Uiversity o Ljubljaa, 2002. [2] D. Coo ad L. B. Holder. Graph-based data miig. IEEE Itelliget Systems, 15(2):32 41, 2000. [3] S. Dill, R. Kumar, K.S. Mccurley, S. Rajagopala, D. Sivaumar, ad A. Tomis. Sel-similarity i the web. ACM Tras. Iter. Tech., 2(3):205 223, 2002. [4] M. R. Garey ad D. S. Johso. Computers ad itractability: A guide to the theory o N P-completeess. W. H. Freema, 1979. [5] I. Joyer, D. Co o, ad L.B Holder. Graph-based hierarchical coceptual clusterig. The Joural o Machie Learig Research archive, 2:19 43, March 2002. [6] R. Kaa, S. Vempala, ad A. Veta. O clusterigs-good, bad ad spectral. I Proceedigs o the 41st Aual Symposium o Foudatios o Computer Sciece, Nov 2000. [7] J. Kleiberg. Authoritative sources i a hyperlied eviromet. I Proceedigs o the 9th ACM-SIAM Symposium o Discrete Algorithms, 1998. Exteded versio i Joural o the ACM 46(1999). [8] R. Kumar, P. Raghava, S. Rajagopala, D. Sivaumar, A. Tompis, ad E. Upal. The web as a graph. Proc. 19th ACM SIGMOD-SIGACT-SIGART symposium o Priciples o database systems, pp. 1 10. ACM Press, 2000. [9] R. Kumar, P. Raghava, S. Rajagopala, ad A. Tomis. O semi-automated taxoomy costructio. I Proceedigs o the 4th Web Data Base Coerece, 2000. [10] M. Kuramochi ad G. Karypis. A eiciet algorithm or discoverig requet subgraphs. Techical Report 02-026, Departmet o Computer Sciece, Uiversity o Miesota, 2002. [11] L. Page, S. Bri, R. Motwai, ad T. Wiograd. The PageRa citatio raig: Brigig order to the web. Techical report, Staord Digital Library Techologies Project, 1998. [12] C.R. Palmer, P.B. Gibbos, ad C. Faloutsos. ANF: a ast ad scalable tool or data miig i massive graphs. Proc. Eighth ACM SIGKDD iteratioal coerece o Kowledge discovery ad data miig, pp. 81 90. ACM Press, 2002. [13] R. E. Tarja. Depth irst search ad liear graph algorithms. SIAM Joural o Computig, 1(2):146 160, 1972. [14] L. Tauro, C. Palmer, G. Sigaos, ad M. Faloutsos. A simple coceptual model or the iteret topology. I Global Iteret, 2001.