Graph Mining and Social Network Analysis. Data Mining; EECS 4412 Darren Rolfe + Vince Chu

Transcription

1 Graph Mining and Social Network Analysis Data Mining; EECS 4412 Darren Rolfe + Vince Chu

2 Agenda Graph Mining: Methods for Mining Frequent Subgraphs; Apriori-based Approach: AGM, FSG; Pattern-Growth Approach: gSpan. Social Network Analysis: Properties and Features of Real Social Graphs; Models of Graphs we can use; Using those models to predict/other things.

3 Graph Mining Methods for Mining Frequent Subgraphs

4 Why Mine Graphs? A lot of data today can be represented in the form of a graph. Social: friendship networks, social media networks, instant messaging networks, document citation networks, blogs. Technological: power grid, the internet. Biological: spread of virus/disease, protein/gene regulatory networks.

5 What Do We Need To Do Identify various kinds of graph patterns Frequent substructures are the very basic patterns that can be discovered in a collection of graphs, useful for: characterizing graph sets, discriminating different groups of graphs, classifying and clustering graphs, building graph indices, and facilitating similarity search in graph databases

6 Mining Frequent Subgraphs Performed on a collection of graphs. Notation: the vertex set of a graph $g$ is denoted by $V(g)$ and the edge set by $E(g)$. A label function, $L$, maps a vertex or an edge to a label. A graph $g$ is a subgraph of another graph $g'$ if there exists a subgraph isomorphism from $g$ to $g'$. Given a labeled graph data set, $D = \{G_1, G_2, \ldots, G_n\}$, we define $support(g)$ (or $frequency(g)$) as the percentage (or number) of graphs in $D$ where $g$ is a subgraph. A frequent graph is a graph whose support is no less than a minimum support threshold, $min\_sup$.
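The definitions of frequency and support above can be sketched in a few lines of Python. This is only an illustration on toy data: the graph representation, the helper names (`make_graph`, `is_subgraph`, `frequency`, `support`) and the four small graphs are assumptions made for the example, and the subgraph test is a brute-force search that is only practical for very small graphs.

```python
from itertools import permutations

def make_graph(node_labels, labeled_edges):
    """Return (nodes, edges): nodes maps id -> label, edges maps {u, v} -> label."""
    edges = {frozenset((u, v)): lab for u, v, lab in labeled_edges}
    return dict(node_labels), edges

def is_subgraph(pattern, graph):
    """Brute-force labeled subgraph isomorphism test (exponential; toy sizes only)."""
    p_nodes, p_edges = pattern
    g_nodes, g_edges = graph
    p_ids = list(p_nodes)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_nodes, len(p_ids)):
        m = dict(zip(p_ids, image))
        if any(p_nodes[v] != g_nodes[m[v]] for v in p_ids):
            continue  # a vertex label does not match
        if all(g_edges.get(frozenset(m[v] for v in e)) == lab
               for e, lab in p_edges.items()):
            return True  # every pattern edge maps onto an equally labeled edge
    return False

def frequency(pattern, dataset):
    """Number of graphs in the data set that contain the pattern."""
    return sum(is_subgraph(pattern, g) for g in dataset)

def support(pattern, dataset):
    """Fraction of graphs in the data set that contain the pattern."""
    return frequency(pattern, dataset) / len(dataset)

# Hypothetical data set of four small labeled graphs.
dataset = [
    make_graph({1: "X", 2: "Y", 3: "Z"}, [(1, 2, "a"), (2, 3, "b")]),
    make_graph({1: "X", 2: "Y"}, [(1, 2, "a")]),
    make_graph({1: "X", 2: "Z"}, [(1, 2, "a")]),
    make_graph({1: "Y", 2: "X", 3: "X"}, [(1, 2, "a"), (2, 3, "b")]),
]
pattern = make_graph({0: "X", 1: "Y"}, [(0, 1, "a")])  # X -a- Y
min_sup = 0.5
is_frequent = support(pattern, dataset) >= min_sup
```

Here the pattern occurs in three of the four graphs, so its frequency is 3 and its support is 75%, which clears a 50% threshold.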

7 Discovering Frequent Substructures Usually consists of two steps: 1. Generate frequent substructure candidates. 2. Check the frequency of each candidate. Most studies on frequent substructure discovery focus on the optimization of the first step, because the second step involves a subgraph isomorphism test whose computational complexity is excessively high (i.e., NP-complete).

8 Graph Isomorphism An isomorphism of graphs $G$ and $H$ is a bijection between the vertex sets of $G$ and $H$, $f: V(G) \to V(H)$, such that any two vertices $u$ and $v$ of $G$ are adjacent in $G$ if and only if $f(u)$ and $f(v)$ are adjacent in $H$. [Figure: Graph G and Graph H]
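The bijection in this definition can be searched for directly on tiny graphs. The sketch below is a brute-force check, not an efficient algorithm: the function name `are_isomorphic` and the edge-list representation are assumptions for illustration, graphs are taken to be simple and undirected, and vertices are read off the edge lists (so isolated vertices are ignored).

```python
from itertools import permutations

def are_isomorphic(edges_g, edges_h):
    """Try every bijection f: V(G) -> V(H) and test adjacency preservation."""
    g = {frozenset(e) for e in edges_g}
    h = {frozenset(e) for e in edges_h}
    vg = list({v for e in g for v in e})
    vh = list({v for e in h for v in e})
    if len(vg) != len(vh) or len(g) != len(h):
        return False
    for image in permutations(vh):
        f = dict(zip(vg, image))
        # With equal edge counts, mapping E(G) exactly onto E(H) means
        # u, v are adjacent in G if and only if f(u), f(v) are adjacent in H.
        if {frozenset((f[u], f[v])) for u, v in map(tuple, g)} == h:
            return True
    return False

square = [(1, 2), (2, 3), (3, 4), (4, 1)]
relabeled_square = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
path = [(1, 2), (2, 3), (3, 4)]
star = [("a", "b"), ("a", "c"), ("a", "d")]

same_shape = are_isomorphic(square, relabeled_square)
different_shape = are_isomorphic(path, star)
```

The two 4-cycles are isomorphic, while the path and the star are not, even though both have four vertices and three edges.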

9 Frequent Subgraphs: An Example [Figure: Graph 1, Graph 2, Graph 3, Graph 4] 1. Start with a labelled graph data set. 2. Set a minimum support threshold for frequent graph. 3. Generate frequent substructure candidates. 4. Check the frequency of each candidate.

10 Frequent Subgraphs: An Example Let the support minimum for this example be 50%. 1. Start with a labelled graph data set. 2. Set a minimum support threshold for frequent graph. 3. Generate frequent substructure candidates. 4. Check the frequency of each candidate.

11 Frequent Subgraphs: An Example [Figure: subgraphs for k = 1, k = 2, k = 3, k = 4] 1. Start with a labelled graph data set. 2. Set a minimum support threshold for frequent graph. 3. Generate frequent substructure candidates. 4. Check the frequency of each candidate.

12 Frequent Subgraphs: An Example [Figure: subgraphs for k = 1, k = 2, k = 3, k = 4] 1. Start with a labelled graph data set. 2. Set a minimum support threshold for frequent graph. 3. Generate frequent substructure candidates. 4. Check the frequency of each candidate.

13 Frequent Subgraphs: An Example [Figure: Graph 1, Graph 2, Graph 3, Graph 4] k = 3, frequency: 3, support: 75%. 1. Start with a labelled graph data set. 2. Set a minimum support threshold for frequent graph. 3. Generate frequent substructure candidates. 4. Check the frequency of each candidate.

14 Frequent Subgraphs: An Example [Figure: Graph 1, Graph 2, Graph 3, Graph 4] k = 4, frequency: 2, support: 50%. 1. Start with a labelled graph data set. 2. Set a minimum support threshold for frequent graph. 3. Generate frequent substructure candidates. 4. Check the frequency of each candidate.

15 Apriori-based Approach Apriori-based frequent substructure mining algorithms share similar characteristics with Apriori-based frequent itemset mining algorithms. Search for frequent graphs: Starts with graphs of small size; the definition of graph size depends on the algorithm used. Proceeds in a bottom-up manner by generating candidates having an extra vertex, edge, or path. The main design complexity of Apriori-based substructure mining algorithms is the candidate generation step. The candidate generation problem in frequent substructure mining is harder than that in frequent itemset mining, because there are many ways to join two substructures.

16 Apriori-based Approach 1. Generate size-$k$ frequent subgraph candidates, by joining two similar but slightly different frequent subgraphs that were discovered in the previous call of the algorithm. 2. Check the frequency of each candidate. 3. Generate the size-$(k+1)$ frequent candidates. 4. Continue until the candidate set is empty.

17 Algorithm: AprioriGraph. Apriori-based Frequent Substructure Mining
Input: D, a graph data set; min_sup, the minimum support threshold.
Output: S_k, the frequent substructure set.
Method: S_1 ← frequent single-elements in D; call AprioriGraph(D, min_sup, S_1).
procedure AprioriGraph(D, min_sup, S_k)
    S_{k+1} ← ∅;
    for each frequent g_i ∈ S_k do
        for each frequent g_j ∈ S_k do
            for each size-(k+1) graph g formed by merge(g_i, g_j) do
                if g is frequent in D and g ∉ S_{k+1} then
                    insert g into S_{k+1};
    if S_{k+1} ≠ ∅ then
        AprioriGraph(D, min_sup, S_{k+1});
    return;

18 AGM - Apriori-based Graph Mining Vertex-based candidate generation method that increases the substructure size by one vertex at each iteration of AprioriGraph. $k$, the graph size, is the number of vertices in the graph. Two size-$k$ frequent graphs are joined only if they have the same size-$(k-1)$ subgraph. The newly formed candidate includes the size-$(k-1)$ subgraph in common and the additional two vertices from the two size-$k$ patterns. Because it is undetermined whether there is an edge connecting the additional two vertices, we can actually form two substructures.

19 AGM: An Example [Figure: two size k = 4 graphs joined into size k = 5 candidates] Two substructures joined by two chains. $k$, the graph size, is the number of vertices in the graph.

20 FSG - Frequent Subgraph Discovery Edge-based candidate generation strategy that increases the substructure size by one edge in each call of AprioriGraph. $k$, the graph size, is the number of edges in the graph. Two size-$k$ patterns are merged if and only if they share the same subgraph having $k-1$ edges, which is called the core. The newly formed candidate includes the core and the additional two edges from the size-$k$ patterns.

21 FSG: An Example [Figure: two size k = 4 patterns joined into size k = 5 candidates] Two substructure patterns and their potential candidates. $k$, the graph size, is the number of edges in the graph.

22 FSG: Another Example [Figure: two size k = 5 patterns joined into size k = 6 candidates] Two substructure patterns and their potential candidates. $k$, the graph size, is the number of edges in the graph.

23 Pitfall: Apriori-based Approach Generation of subgraph candidates is complicated and expensive. Level-wise candidate generation: breadth-first search. To determine whether a size-$(k+1)$ graph is frequent, must check all corresponding size-$k$ subgraphs to obtain the upper bound of frequency. Before mining any size-$(k+1)$ subgraph, requires complete mining of size-$k$ subgraphs. Subgraph isomorphism is an NP-complete problem, so pruning is expensive.

24 Pattern-Growth Approach 1. Initially, start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge such that the newly formed graphs are frequent graphs. A graph $g$ can be extended by adding a new edge $e$; the newly formed graph is denoted by $g \diamond_x e$. If $e$ introduces a new vertex, we denote the new graph by $g \diamond_{xf} e$, otherwise $g \diamond_{xb} e$, where $f$ or $b$ indicates that the extension is in a forward or backward direction. 3. For each discovered graph $g$, perform extensions recursively until all the frequent graphs with $g$ embedded are discovered. 4. The recursion stops once no frequent graph can be generated.

25 Algorithm: PatternGrowthGraph. Simplistic Pattern Growth-based Frequent Substructure Mining
Input: g, a frequent graph; D, a graph data set; min_sup, the minimum support threshold.
Output: S, the frequent graph set.
Method: S ← ∅; call PatternGrowthGraph(g, D, min_sup, S).
procedure PatternGrowthGraph(g, D, min_sup, S)
    if g ∈ S then return;
    else insert g into S;
    scan D once, find all edges e such that g can be extended to g ◇_x e;
    for each frequent g ◇_x e do
        PatternGrowthGraph(g ◇_x e, D, min_sup, S);
    return;

26 Pattern-Growth: An Example [Figure: Graph 1, Graph 2, Graph 3, Graph 4] 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated.

27 Pattern-Growth: An Example Let the support minimum for this example be 50%. 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated.

28 Pattern-Growth: An Example [Figure: Graph 1, Graph 2, Graph 3, Graph 4] Let's arbitrarily start with this frequent vertex. 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated.

29 Pattern-Growth: An Example [Figure: Graph 1, Graph 2, Graph 3, Graph 4] Extend graph (forward); add frequent edge. 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated.

30 Pattern-Growth: An Example [Figure: Graph 1, Graph 2, Graph 3, Graph 4] Extend frequent graph (forward) again. 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated.

31 Pattern-Growth: An Example [Figure: Graph 1, Graph 2, Graph 3, Graph 4] Extend graph (backward); previously seen node. 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated.

32 Pattern-Growth: An Example [Figure: Graph 1, Graph 2, Graph 3, Graph 4] Extend frequent graph (forward) again. 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated.

33 Pattern-Growth: An Example [Figure: Graph 1, Graph 2, Graph 3, Graph 4] Stop recursion, try a different start vertex. 1. Start with the frequent vertices as frequent graphs. 2. Extend these graphs by adding a new edge forming new frequent graphs. 3. For each discovered graph g, recursively extend. 4. Stop once no frequent graph can be generated.

34 Pitfall: PatternGrowthGraph Simple, but not efficient Same graph can be discovered many times; duplicate graph Generation and detection of duplicate graphs increases workload

35 gSpan (Graph-Based Substructure Pattern Mining) Designed to reduce the generation of duplicate graphs. Explores via depth-first search (DFS). DFS lexicographic order and minimum DFS code form a canonical labeling system to support DFS search. Discovers all the frequent subgraphs without candidate generation and false-positive pruning. It combines the growing and checking of frequent subgraphs into one procedure, thus accelerating the mining process.

36 gSpan (Graph-Based Substructure Pattern Mining) DFS Subscripting: When performing a DFS in a graph, we construct a DFS tree. One graph can have several different DFS trees. Depth-first discovery of the vertices forms a linear order. Use subscripts to label this order according to their discovery time: $i < j$ means $v_i$ is discovered before $v_j$. $v_0$ is the root and $v_n$ the right-most vertex. The straight path from $v_0$ to $v_n$ is the right-most path.

37 gSpan (Graph-Based Substructure Pattern Mining) DFS Code: We transform each subscripted graph to an edge sequence, called a DFS code, so that we can build an order among these sequences. The goal is to select the subscripting that generates the minimum sequence as its base subscripting. There are two kinds of orders in this transformation process: 1. Edge order, which maps edges in a subscripted graph into a sequence; and 2. Sequence order, which builds an order among edge sequences. An edge is represented by a 5-tuple, $(i, j, l_i, l_{(i,j)}, l_j)$; $l_i$ and $l_j$ are the labels of $v_i$ and $v_j$, respectively, and $l_{(i,j)}$ is the label of the edge connecting them.

38 gSpan (Graph-Based Substructure Pattern Mining) DFS Lexicographic Order: For each DFS tree, we sort the DFS code (tuples) to a set of orderings. Based on the DFS lexicographic ordering, the minimum DFS code of a given graph $G$, written as $dfs(G)$, is the minimal one among all the DFS codes. The subscripting that generates the minimum DFS code is called the base subscripting. Given two graphs $G$ and $G'$, $G$ is isomorphic to $G'$ if and only if $dfs(G) = dfs(G')$. Based on this property, what we need to do for mining frequent subgraphs is to perform only the right-most extensions on the minimum DFS codes, since such an extension will guarantee the completeness of mining results.

39 DFS Code: An Example DFS Subscripting: When performing a DFS in a graph, we construct a DFS tree. One graph can have several different DFS trees. [Figure: a graph with vertex labels X, X, Z, Y and edge labels a, a, b, b, shown with several DFS subscriptings $v_0, v_1, v_2, v_3$]

40 DFS Lexicographic Order: An Example
[Figure: three DFS subscriptings of the same labeled graph, giving the DFS codes γ0, γ1 and γ2]
edge   γ0                γ1                γ2
e0     (0, 1, X, a, X)   (0, 1, X, a, X)   (0, 1, Y, b, X)
e1     (1, 2, X, a, Z)   (1, 2, X, b, Y)   (1, 2, X, a, X)
e2     (2, 0, Z, b, X)   (1, 3, X, a, Z)   (2, 3, X, b, Z)
e3     (1, 3, X, b, Y)   (3, 0, Z, b, X)   (3, 1, Z, a, X)
For each DFS tree, we sort the DFS code (tuples) to a set of orderings. Based on the DFS lexicographic ordering, the minimum DFS code of a given graph $G$, written as $dfs(G)$, is the minimal one among all the DFS codes.
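The three DFS codes listed for this example can be written down as sequences of 5-tuples and compared. The sketch below uses Python's built-in tuple and list ordering, which is an assumed simplification: the full DFS lexicographic order in gSpan treats forward and backward edges with extra rules, so this comparison should be read as illustrative only, not as the definition.

```python
# Each DFS code is a list of 5-tuples (i, j, l_i, l_(i,j), l_j).
gamma0 = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Z"),
          (2, 0, "Z", "b", "X"), (1, 3, "X", "b", "Y")]
gamma1 = [(0, 1, "X", "a", "X"), (1, 2, "X", "b", "Y"),
          (1, 3, "X", "a", "Z"), (3, 0, "Z", "b", "X")]
gamma2 = [(0, 1, "Y", "b", "X"), (1, 2, "X", "a", "X"),
          (2, 3, "X", "b", "Z"), (3, 1, "Z", "a", "X")]

codes = {"gamma0": gamma0, "gamma1": gamma1, "gamma2": gamma2}

# Simplified ordering: compare the edge sequences element by element.
smallest = min(codes, key=codes.get)
ranking = sorted(codes, key=codes.get)
```

Under this simplified ordering, γ0 and γ1 agree on their first edge and differ on the second (edge label a before b), while γ2 already differs on the first edge (vertex label Y after X), so γ0 comes out smallest of the three.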

41 gSpan (Graph-Based Substructure Pattern Mining) 1. Initially, a starting vertex is randomly chosen. 2. Vertices in a graph are marked so that we can tell which vertices have been visited. 3. The visited vertex set is expanded repeatedly until a full DFS tree is built. 4. Given a graph $G$ and a DFS tree $T$ in $G$, a new edge $e$: can be added between the right-most vertex and another vertex on the right-most path (backward extension); or can introduce a new vertex and connect to a vertex on the right-most path (forward extension). Because both kinds of extensions take place on the right-most path, we call them right-most extension, denoted by $g \diamond_r e$.

42 Algorithm: gSpan. Pattern growth-based frequent substructure mining that reduces duplicate graph generation.
Input: s, a DFS code; D, a graph data set; min_sup, the minimum support threshold.
Output: S, the frequent graph set.
Method: S ← ∅; call gSpan(s, D, min_sup, S).
procedure gSpan(s, D, min_sup, S)
    if s ≠ dfs(s) then return;
    insert s into S;
    set C to ∅;
    scan D once, find all edges e such that s can be right-most extended to s ◇_r e;
    insert s ◇_r e into C and count its frequency;
    for each frequent s ◇_r e in C do
        gSpan(s ◇_r e, D, min_sup, S);
    return;

43 Other Graph Mining So far the techniques we have discussed: Can handle only one special kind of graph: labeled, undirected, connected simple graphs without any specific constraints. Assume that the database to be mined contains a set of graphs, each consisting of a set of labeled vertices and labeled but undirected edges, with no other constraints.

44 Other Graph Mining Mining Variant and Constrained Substructure Patterns. Closed frequent substructure: a frequent graph $G$ is closed if and only if there is no proper supergraph $G'$ that has the same support as $G$. Maximal frequent substructure: a frequent pattern $G$ is maximal if and only if there is no frequent super-pattern of $G$. Constraint-based substructure mining: element, set, or subgraph containment constraint; geometric constraint; value-sum constraint.

45 Application: Classification We mine frequent graph patterns in the training set. The features that are frequent in one class but rather infrequent in the other class(es) should be considered as highly discriminative features; these are used for model construction. To achieve high-quality classification, we can adjust: the thresholds on frequency, discriminativeness, and graph connectivity. Based on: the data, the number and quality of the features generated, and the classification accuracy.

46 Application: Cluster Analysis We mine frequent graph patterns in the training set. The set of graphs that share a large set of similar graph patterns should be considered as highly similar and should be grouped into similar clusters. The minimal support threshold can be used as a way to adjust the number of frequent clusters or generate hierarchical clusters.

47 Social Network Analysis

48 Examples of Social Networks Twitter network Network Air Transportation Network

49 Social Network Analysis Nodes often represent an object or entity such as a person, computer/server, power generator, airport, etc. Links represent relationships, e.g. likes, follows, flies to, etc.

50 Why are we interested? It turns out that the structure of real-world graphs often has special characteristics. This is important because structure always affects function, e.g. the structure of a social network affects how a rumour, or an infectious disease, spreads; e.g. the structure of a power grid determines how robust the network is to power failures. Goal: 1. Identify the characteristics / properties of graphs; structural and dynamic / behavioural. 2. Generate models of graphs that exhibit these characteristics. 3. Use these tools to make predictions about the behaviour of graphs.

51 Properties of Real-World Social Graphs 1. Degree Distribution: Plot the fraction of nodes with degree $k$ (denoted $p_k$) vs. $k$. Our intuition: Poisson/Normal distribution. WRONG! Correct: highly skewed. [Figure source: mathworld.wolfram.com]

52 Properties of Real-World Social Graphs 1. (continued) Real-world social networks tend to have a highly skewed distribution that follows the Power Law: $p_k \sim k^{-a}$. A small percentage of nodes have very high degree and are highly connected. Example: Spread of a virus. [Figure legend: black squares = infected, pink = infected but not contagious, green = exposed but not infected]
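The quantity $p_k$ plotted in a degree distribution can be computed directly from an edge list. The sketch below is a minimal illustration: the function name `degree_distribution` and the small hub-and-spoke edge list are assumptions for the example, and nodes that appear in no edge are not counted.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: p_k}, the fraction of nodes that have degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    counts = Counter(degree.values())
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy graph: node 0 is a hub, plus one extra edge between 1 and 2.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2)]
p = degree_distribution(edges)
```

In this toy graph half of the nodes have degree 1 and a single node has degree 5, a very small-scale picture of the skew described above. On a real network, plotting `p` on log-log axes is the usual way to look for the power-law shape.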

53 Properties of Real-World Social Graphs 2. Small World Effect: for most real graphs, the number of hops it takes to reach any node from any other node is about 6 (Six Degrees of Separation). Milgram did an experiment, asking people in Nebraska to send letters to people in Boston. Constraint: letters could only be delivered to people known on a first-name basis. Only 25% of letters made it to their target, but the ones that did made it in 6 hops.

54 Properties of Real-World Social Graphs 2. (continued) The distribution of the shortest path lengths. Example: MSN Messenger Network If we pick a random node in the network and then count how many hops it is from every other node, we get this graph Most nodes are at a distance of 7 hops away from any other node
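The hop-count distribution described here can be obtained with a breadth-first search from the chosen node. This is a small sketch on an assumed toy graph; the names `hop_counts` and `hop_histogram` and the adjacency-list format are illustrative choices, not something given in the slides.

```python
from collections import Counter, deque

def hop_counts(adj, source):
    """Breadth-first search: number of hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def hop_histogram(adj, source):
    """How many nodes sit at each hop distance from source (source excluded)."""
    dist = hop_counts(adj, source)
    return Counter(d for node, d in dist.items() if node != source)

# Hypothetical toy graph given as adjacency lists.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
histogram = hop_histogram(adj, 0)
```

From node 0, one node is 1 hop away, two nodes are 2 hops away and one node is 3 hops away. Repeating this from a random start node on a large network gives the kind of distribution the slide refers to.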

55 Properties of Real-World Social Graphs 3. Network Resilience: If a node is removed, how is the network affected? For real-world graphs, you must remove the highly connected nodes in order to reduce the connectivity of the graph. Removing a node that is sparsely connected does not have a significant effect on connectivity. Since the proportion of highly connected nodes in a real-world graph is small, the probability of choosing and removing such a node at random is small. Real-world graphs are resilient to random attacks! Conversely, targeted attacks on highly connected nodes are very effective!
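A small experiment makes the contrast between removing a sparsely connected node and removing a hub concrete. The sketch below measures connectivity as the size of the largest connected component, which is one possible choice among several; the helper names and the toy hub-and-spoke graph are assumptions for illustration.

```python
def build_adj(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def largest_component(adj, removed=frozenset()):
    """Size of the largest connected component after deleting the removed nodes."""
    seen, best = set(removed), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best

# Hypothetical toy network: node 0 is a hub for nodes 1..6, plus one edge 1-2.
adj = build_adj([(0, i) for i in range(1, 7)] + [(1, 2)])
intact = largest_component(adj)
without_leaf = largest_component(adj, removed={6})
without_hub = largest_component(adj, removed={0})
```

Deleting the leaf shrinks the largest component from 7 nodes to 6, while deleting the hub breaks the network into pieces of at most 2 nodes.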

56 Properties of Real-World Social Graphs 4. Densification: How does the number of edges in the graph grow as the number of nodes grows? Previous belief: # edges grows linearly with # nodes, i.e. $E(t) \sim N(t)$. Actually, # edges grows superlinearly with # nodes, i.e. the # of edges grows faster than the number of nodes: $E(t) \sim N(t)^a$ with $a > 1$. The graph gets denser over time.
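Given two snapshots of a growing network, the exponent $a$ in $E(t) \sim N(t)^a$ can be estimated as the slope between the two points on log-log axes. The function name `densification_exponent` and the snapshot numbers below are made up for illustration; a real estimate would fit a line through many snapshots.

```python
import math

def densification_exponent(first, last):
    """Slope of log E against log N between two (nodes, edges) snapshots."""
    (n0, e0), (n1, e1) = first, last
    return math.log(e1 / e0) / math.log(n1 / n0)

# Hypothetical snapshots (number of nodes, number of edges).
linear_growth = densification_exponent((100, 500), (1000, 5000))
superlinear_growth = densification_exponent((100, 1000), (10000, 1000000))
```

The first pair of snapshots grows edges in proportion to nodes, so the exponent is 1. In the second pair the edges grow by a factor of 1000 while the nodes grow by a factor of 100, giving an exponent of about 1.5, i.e. a densifying graph.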

57 Properties of Real-World Social Graphs 5. Shrinking Diameter: The diameter is taken to be the longest shortest path in the graph. As a network grows, the diameter actually gets smaller, i.e. the distance between nodes slowly decreases.
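The diameter as defined here can be computed by running a breadth-first search from every node and taking the largest distance found. The sketch assumes a connected, unweighted graph; the function names and the two toy graphs are illustrative assumptions.

```python
from collections import deque

def eccentricity(adj, source):
    """Largest hop distance from source to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

def diameter(adj):
    """Longest shortest path over all pairs (assumes a connected graph)."""
    return max(eccentricity(adj, s) for s in adj)

# Hypothetical growth step: a 5-node path gains node 5, linked to both ends.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
grown = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [0, 4]}
before, after = diameter(path), diameter(grown)
```

Adding one node and two edges turns the path into a cycle, and the diameter drops from 4 to 3, a tiny example of a network growing while its diameter shrinks.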

58 Features/Properties of Graphs Community structure, densification, shrinking diameter.

59 Generators: How do we model graphs Try: Generating a random graph Given n vertices connect each pair i.i.d. with Probability p Follows a Poisson distribution Follows from our intuition Not useful; no community structure Does not mirror real-world graphs
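The random-graph generator described on this slide, where each pair of the n vertices is connected independently with probability p, fits in a few lines. The function name `gnp_random_graph` and the `seed` parameter are assumptions added for reproducibility of the example.

```python
import random
from itertools import combinations

def gnp_random_graph(n, p, seed=None):
    """Connect each of the n*(n-1)/2 vertex pairs independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i, j in combinations(range(n), 2) if rng.random() < p]

empty = gnp_random_graph(10, 0.0, seed=1)
complete = gnp_random_graph(10, 1.0, seed=1)
sample = gnp_random_graph(200, 0.05, seed=1)
mean_degree = 2 * len(sample) / 200  # expected value is p * (n - 1) = 9.95
```

With p = 0 no edges are created and with p = 1 all 45 pairs are connected. For intermediate p the degrees cluster around p(n - 1), which is why this model does not reproduce the skewed degree distributions of real networks.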

60 Generators: How do we model graphs (Erdős-Rényi) Random graphs (1960s), Exponential random graphs, Small world model, Preferential attachment, Edge copying model, Community guided attachment, Forest Fire, Kronecker graphs (today)

61 Kronecker Graphs For Kronecker graphs, all the properties of real-world graphs can actually be proven. Best model we have today. Adjacency matrix, recursive generation.

62 Kronecker Graphs 1. Construct the adjacency matrix for a graph $G$: $A_G = (a_{ij})$, where $a_{ij} = 1$ if $i$ and $j$ are adjacent and $a_{ij} = 0$ otherwise. Side Note: An eigenvalue of a matrix $A$ is a scalar value $\lambda$ for which the following is true: $Av = \lambda v$ (where $v$ is an eigenvector of the matrix $A$).

63 Kronecker Graphs 2. Generate the 2nd Kronecker graph by taking the Kronecker product of the 1st graph with itself. The Kronecker product of 2 graphs is defined as:

64 Kronecker Graphs Visually, this is just taking the first matrix and replacing the entries that were equal to 1 with the second matrix. [Figure: a 3 x 3 matrix and the resulting 9 x 9 matrix]

65 Kronecker Graphs We define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices. Therefore, we can compute the $k$-th Kronecker graph by iteratively taking the Kronecker product of an initial graph $G_1$ with itself, so that $G_1$ appears $k$ times: $G_k = G_1 \otimes G_1 \otimes \cdots \otimes G_1$.
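The recursive construction can be sketched with plain nested lists. The code below uses the standard Kronecker product of matrices, in which block $(i, j)$ of the result is entry $a_{ij}$ times the second matrix; the function names and the 3 x 3 initiator matrix are assumptions chosen for illustration.

```python
def kronecker_product(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    n, m = len(b), len(b[0])
    return [[a[i // n][j // m] * b[i % n][j % m]
             for j in range(len(a[0]) * m)]
            for i in range(len(a) * n)]

def kronecker_power(a, k):
    """k-th Kronecker power: a appears k times in the product."""
    result = a
    for _ in range(k - 1):
        result = kronecker_product(result, a)
    return result

# Hypothetical 3 x 3 initiator adjacency matrix G_1 (with self-loops).
g1 = [[1, 1, 0],
      [1, 1, 1],
      [0, 1, 1]]
g2 = kronecker_power(g1, 2)
ones_in_g2 = sum(map(sum, g2))
```

The second Kronecker graph is a 9 x 9 matrix in which every 1 of `g1` has been replaced by a copy of `g1` and every 0 by a block of zeros, so it contains 7 * 7 = 49 ones.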

66 Applying Models to Real World Graphs Can then predict and understand the structure.

67 Virus Propagation A form of diffusion; a fundamental process in social networks. Can also refer to the spread of rumours, news, etc.

68 Virus Propagation SIS Model: Susceptible - Infected - Susceptible. Virus birth rate $\beta$ = the probability that an infected node attacks a neighbour. Virus death rate $\delta$ = the probability that an infected node becomes cured. [Figure: a healthy node, an at-risk node and infected nodes; an infected node heals with probability $\delta$ and infects a neighbour with probability $\beta$]

69 Virus Propagation The virus strength of a graph: $s = \beta/\delta$. So we can ask the question: Will the virus become epidemic? Will the rumours/news become viral? The epidemic threshold $\tau$ of a graph is a value such that if $s = \beta/\delta < \tau$, then an epidemic cannot happen. How to find the threshold $\tau$? Theorem: $\tau = 1/\lambda_1$, where $\lambda_1$ is the largest eigenvalue of the adjacency matrix $A$ of the graph. So if $s < 1/\lambda_1$, then there is no epidemic.
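The threshold test can be tried numerically on a small graph. The sketch below estimates $\lambda_1$ by power iteration on $A + I$ (the shift is an implementation choice made here so that the iteration also settles on bipartite graphs) and then compares $s = \beta/\delta$ with $1/\lambda_1$. The function names, the example matrices and the rates are assumptions for illustration; the graph is assumed undirected with at least one edge.

```python
def largest_eigenvalue(a, iterations=500):
    """Estimate the largest eigenvalue of a symmetric 0/1 adjacency matrix."""
    n = len(a)
    v = [1.0] * n
    shifted = 1.0
    for _ in range(iterations):
        # Multiply by (A + I); the shift avoids oscillation on bipartite graphs.
        w = [v[i] + sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        shifted = max(w)
        v = [x / shifted for x in w]
    return shifted - 1.0  # undo the shift

def epidemic_cannot_happen(beta, delta, a):
    """True when the virus strength s = beta / delta is below tau = 1 / lambda_1."""
    return beta / delta < 1.0 / largest_eigenvalue(a)

# Complete graph on 4 nodes: largest eigenvalue 3, so tau = 1/3.
k4 = [[0, 1, 1, 1],
      [1, 0, 1, 1],
      [1, 1, 0, 1],
      [1, 1, 1, 0]]
# Star with 4 leaves: largest eigenvalue 2, so tau = 1/2.
star = [[0, 1, 1, 1, 1],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0]]

weak_virus = epidemic_cannot_happen(0.1, 0.5, k4)    # s = 0.2
strong_virus = epidemic_cannot_happen(0.3, 0.5, k4)  # s = 0.6
```

On the complete graph the weak virus (s = 0.2) stays below the threshold of 1/3, so it dies out, while the stronger one (s = 0.6) exceeds it and the theorem no longer rules out an epidemic.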

70 Link Prediction Given a social network at time $t_1$, predict the edges that will be added at time $t_2$. Assign a connection score(x, y) to each pair of nodes. Usually taken to be the shortest path between the nodes x and y; other measures use the # of neighbours in common, and the Katz measure. Produce a list of scores in decreasing order. The pair at the top of the list is most likely to have a link created between them in the future. Can also use this measure for clustering.
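The common-neighbours variant of this scoring is easy to sketch. The code below scores every pair of nodes that is not yet linked and sorts the pairs in decreasing order of score; the function name `common_neighbour_scores`, the adjacency-set format and the toy graph with lowercase node names are assumptions for illustration only.

```python
from itertools import combinations

def common_neighbour_scores(adj):
    """Score each unlinked pair (x, y) by its number of shared neighbours."""
    scores = []
    for x, y in combinations(sorted(adj), 2):
        if y not in adj[x]:
            scores.append((len(adj[x] & adj[y]), x, y))
    # Highest score first; ties keep their alphabetical order.
    return sorted(scores, key=lambda t: -t[0])

# Hypothetical toy network given as adjacency sets.
adj = {
    "a": {"c", "d", "e"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"a", "b"},
}
ranked = common_neighbour_scores(adj)
top_score, top_x, top_y = ranked[0]
```

Nodes a and b share three neighbours and are not yet connected, so they head the list as the most likely new link.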

71 Score(x,y) = # of neighbours in common E F G H I J Link Prediction Top score = score(,) = 5 D Likely new link between and

72 Viral Marketing A customer may increase the sales of some product if they interact positively with their peers in the social network. Assign a network value to a customer.

73 Diffusion in Networks: Influential Nodes Some nodes in the network can be active: they can spread their influence to other nodes, e.g. news, opinions, etc. that propagate through a network of friends. 2 models: Threshold model, Independent Contagion model.

74 Thanks. Any questions?