UCSF QB3 Seminar 5/28/2015 Linear Algebraic Graph Mining, sanders29@llnl.gov 1/22 LLNL-PRES-671587
New Applications of Computer Analysis to Biomedical Data Sets
QB3 Seminar, UCSF Medical School, May 28th, 2015
LINEAR-ALGEBRAIC GRAPH MINING
Geoffrey Sanders, CASC/LLNL
Lawrence Livermore National Laboratory, P.O. Box 808, Livermore, CA 94551
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

LLNL and LDRD
- LLNL is a DOE FFRDC
- Center for Applied Scientific Computing (CASC)
  - Several of us work on Laboratory Directed Research and Development (LDRD) projects in HPC and data analysis
  - Graph analytics, machine learning, network analysis
  - We are ALWAYS looking for domain-scientist collaborators with interesting datasets or new data mining tasks
- DOE national labs have a history of building open HPC software for PDE-related applications (physics simulation)
  - PETSc [H2], Trilinos [H4], hypre [H3], SAMRAI [H1], etc.
  - Desire to do so for graph mining applications

Outline
1. Introduction
2. Analytics that Rank
3. Analytics that Cluster
4. Analytics that Approximate Expensive Calculations
5. Current Research Directions

Graph Model Definitions
- Graph G(V,E)
  - Vertices i, j in V
  - Edges (i,j) in E
  - Edge weights on each edge (i,j)
- Undirected vs. Directed?
  - (i,j) and (j,i)
- Hypergraphs
  - (i,j,k,l) and (p,q,r) in E
- Attributes?
  - Vertex Labels: Height, Gender, Profession
  - Edge Labels: Timestamp, Volume

Difficult Topologies
- Scale-Free
- Small-World
- Community Structure
  - Hierarchical
  - Overlapping
  - Heterogeneous in size, density, type, etc.
- Other Structure
  - Tree-Like
  - Periphery

Web Data Commons Hyperlink Graph
- Crawled in 2014, directed [D1]
- 1.7B webpages, 64B hyperlinks

Spy Plot
[Figure: spy plot of the Autonomous System Graph [D2]; both axes index vertices.]
- Graphs have natural sparse matrix representations
- Linear algebra applies
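The sparse-matrix view of a graph can be illustrated with a small, dependency-free Python sketch. It stores only the nonzero entries of the adjacency matrix A and applies the graph Laplacian L = D - A to a vector. The names `build_adjacency` and `laplacian_matvec`, the dict-of-dicts storage, and the toy edge list are assumptions made for illustration; they are not taken from the talk, and a production code would use a sparse-matrix library instead.

```python
from collections import defaultdict


def build_adjacency(edges, directed=False):
    """Return a sparse adjacency matrix as {row: {col: weight}}.

    Only nonzero entries are stored. Assumes no self-loops.
    """
    A = defaultdict(dict)
    for i, j, w in edges:
        A[i][j] = A[i].get(j, 0.0) + w
        if not directed:
            A[j][i] = A[j].get(i, 0.0) + w
    return A


def laplacian_matvec(A, x):
    """Compute y = L x with L = D - A, touching only stored nonzeros."""
    y = {}
    for i in x:
        row = A.get(i, {})
        degree = sum(row.values())
        y[i] = degree * x[i] - sum(w * x[j] for j, w in row.items())
    return y


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant vertex 3.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0), (2, 3, 1.0)]
A = build_adjacency(edges)
ones = {v: 1.0 for v in range(4)}
y = laplacian_matvec(A, ones)
print(y)
```

Because every row of L sums to zero, the all-ones vector is mapped to the zero vector, which gives a quick sanity check on the representation.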

Linear-Algebraic Kernels
- Linear Solve: $L x = b$
- Eigensolve: $L x = \lambda x$
- Matrix Factorization: $L \approx F G^t$
- Tensor Factorization [T1]: $\mathcal{A} \approx \sum_{k=1}^{r} u_k \circ v_k \circ w_k$

Ranking Calculations
- Global Ranking?
  - Ordered list
  - Often only care about the top few of the list
- PageRank [R1]
  - Centrality measure
  - Random walk
- Personalized PR [R2]
  - Supervised
- Connection Subgraph [R3]
  - Solve for direction
  - Rank vertices
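PageRank [R1] and its personalized variant [R2] can be sketched as a power iteration on the random-walk model. The code below is a minimal, dependency-free illustration rather than the implementation used in the talk. The function name `pagerank`, the damping value 0.85, the stopping tolerance, and the toy link graph `web` are all assumptions chosen for the example.

```python
def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=200, personalization=None):
    """Power iteration for (personalized) PageRank.

    out_links maps each vertex to the list of vertices it links to;
    every link target must also appear as a key.
    """
    nodes = sorted(out_links)
    n = len(nodes)
    if personalization is None:
        teleport = {v: 1.0 / n for v in nodes}  # uniform restart: global PageRank
    else:
        total = sum(personalization.values())
        teleport = {v: personalization.get(v, 0.0) / total for v in nodes}
    rank = dict(teleport)
    for _ in range(max_iter):
        new = {v: (1.0 - damping) * teleport[v] for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for u in targets:
                    new[u] += share
            else:
                # Dangling vertex: redistribute its mass via the teleport vector.
                for u in nodes:
                    new[u] += damping * rank[v] * teleport[u]
        change = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if change < tol:
            break
    return rank


# Hypothetical four-page web graph.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
top = sorted(ranks, key=ranks.get, reverse=True)
personalized = pagerank(web, personalization={"d": 1.0})
print(top[:2], round(sum(ranks.values()), 6))
```

Sorting by score and keeping only the head of the list matches the common case where only the top few vertices matter. Passing a `personalization` vector concentrates the restart distribution on chosen vertices, which is the supervised element of personalized PageRank.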

Exotic Ranking Calculations
- WalkScore [R4]: Meet a blend of complex constraints?
[Figure: Walk Score screenshots annotated "My brother gets", "My score", and "Worse than a buoy".]

Clustering
- Unsupervised?
- Hard or Soft?
- Spectral Clustering [O1, C4]
  - sign(v).*(log10(|v| + ε) + min{log10(|v| + ε)})
- Agglomerative [C1]
  - Start with n groups
  - Make local grouping decisions to maximize modularity:
    $\text{Modularity} = \sum_{\text{comms}} \left[ (\text{Internal edges}) - (\text{Expected internal edges}) \right]$
[Figure: the same matrix ordered randomly and ordered by Fiedler vector, with a recursive bipartite spectral clustering (SC) tree over the original vertex set. Legend: reason split (vector splitting, connected components); reason stopped (minimum cluster size, max clusters).]
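One level of bipartite spectral clustering, scored by modularity, can be sketched in plain Python. The sketch approximates the Fiedler vector by power iteration on the shifted matrix cI - L while projecting out the constant vector, then splits vertices by sign. This is one simple way to obtain the vector, assumed here for illustration; the talk does not say which eigensolver was used. The names `fiedler_vector`, `spectral_bisection`, and `modularity`, the shift c = 2 * (max degree), and the toy graph are likewise assumptions. The graph is assumed connected, unweighted, and undirected.

```python
import math
import random


def fiedler_vector(adj, iters=500, seed=0):
    """Approximate the Fiedler vector of L = D - A.

    Runs power iteration on (c*I - L) with c = 2 * max degree, which is at
    least the largest eigenvalue of L, and removes the constant component
    at every step so the iteration converges to the second eigenvector.
    """
    nodes = sorted(adj)
    n = len(nodes)
    degree = {v: len(adj[v]) for v in nodes}
    shift = 2.0 * max(degree.values())
    rng = random.Random(seed)
    x = {v: rng.uniform(-1.0, 1.0) for v in nodes}
    for _ in range(iters):
        mean = sum(x.values()) / n
        x = {v: x[v] - mean for v in nodes}  # project out the constant vector
        y = {v: (shift - degree[v]) * x[v] + sum(x[u] for u in adj[v]) for v in nodes}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: y[v] / norm for v in nodes}
    return x


def spectral_bisection(adj):
    """Split the vertex set by the sign of the Fiedler vector entries."""
    v = fiedler_vector(adj)
    left = {u for u in v if v[u] < 0}
    return left, set(v) - left


def modularity(adj, communities):
    """Sum over communities of (internal edge fraction) - (expected fraction)."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    q = 0.0
    for comm in communities:
        # Each internal edge is seen from both endpoints, so this is 2 * m_c.
        internal = sum(1 for u in comm for w in adj[u] if w in comm)
        deg = sum(len(adj[u]) for u in comm)
        q += internal / two_m - (deg / two_m) ** 2
    return q


# Hypothetical toy graph: two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
left, right = spectral_bisection(adj)
q = modularity(adj, [left, right])
print(sorted(left), sorted(right), round(q, 4))
```

Applying `spectral_bisection` again to each side, with stopping rules such as a minimum cluster size, would give a recursive scheme of the kind the slide describes.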

Overlapping (Co-)Clustering
- Non-Negative MF [C2]
  - Factors positive-valued
  - Interpreted as probabilities
- Latent Dirichlet Allocation [C3]
  - Model a document as a weighted group of topics
  - Each topic has an individual vocabulary
  - Document is a random combination of terms from its topics
  - More general than term-document data
- Coarsening [C5]
  - Multilevel
  - LinkedIn data
[Figure, from Xu, De Sterck, and Sanders [C5]: "Figure 1. Feature matrix F and induced weighted bipartite graph. Red dots correspond to row variables of F, and blue dots to column variables. The rows and columns may, e.g., represent LinkedIn users and skills, with the weights indicating how often a user's skill was endorsed by the user's connections."]
[Figure, from [C5]: "Figure 2. Bipartite graph hierarchy obtained by the FMCC algorithms. The input feature matrix is located at the bottom of the diagram."]
Excerpt from [C5] shown on the slide: "... research groups, etc. These hierarchical structures are overlapping; for example, some professors may be active in multiple departments or faculties, and many skills (often even the more specialized ones) are taught in multiple degree programs. In a similar way, it is to be expected that many of the currently emerging online social networks also contain inherent overlapping hierarchical organization, in particular when they focus on a specific dimension of the human condition, like, e.g., the professional dimension. Consider for example the LinkedIn social network, where users connect to their business relations and acquaintances, and list user-defined skills and expertise on their user profiles that can be endorsed by their connections. Similar to the case of universities, it is clear that in a social network like LinkedIn there must be hierarchical overlapping groups of users with similar skills and professions, and hierarchical overlapping groups of skill keywords that characterize professional groups. However, in contrast to the example of universities, in emerging social networks this hierarchy is not hard-coded into the structure of the network; if it were, it would seriously impede the growth and dynamical evolution of these networks. Since the hierarchy is not explicitly hard-coded into the structure of the network but is nevertheless present, it is at once a very interesting and a challenging problem to try to automatically generate a representation of this hierarchy from the ..."
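A non-negative factorization of a small feature matrix F can be sketched with multiplicative updates. Note the swap: this is the well-known Lee and Seung style update rule, used here because it is short, and it is not the positive matrix factorization algorithm of [C2] nor the multilevel method of [C5]. The helper names, the random initialization, and the toy user-by-skill matrix `F` are assumptions for illustration only.

```python
import random


def matmul(X, Y):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frobenius_error(F, W, H):
    """Frobenius norm of F - W H."""
    WH = matmul(W, H)
    return sum((f - a) ** 2 for fr, ar in zip(F, WH) for f, a in zip(fr, ar)) ** 0.5


def nmf(F, rank, iters=200, seed=0, eps=1e-12):
    """Factor F ~ W H with non-negative W (m x rank) and H (rank x n)."""
    rng = random.Random(seed)
    m, n = len(F), len(F[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T F) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, F), matmul(matmul(Wt, W), H)
        H = [[H[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n)] for k in range(rank)]
        # W <- W * (F H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(F, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)] for i in range(m)]
    return W, H


# Hypothetical 5 x 4 user-by-skill matrix; the last row belongs to both groups.
F = [[2, 1, 0, 0], [4, 2, 0, 0], [0, 0, 3, 1], [0, 0, 6, 2], [2, 1, 3, 1]]
W, H = nmf(F, rank=2)
print(round(frobenius_error(F, W, H), 4))
```

Because all factors stay non-negative, a row of W with weight in more than one column (such as the last user here) can be read as overlapping membership, and normalizing rows gives weights that can be interpreted as probabilities.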

Approximations to Expensive Calcs
- Triangle Counting [O3]
  - Sum of the diagonal of $A^3$ is 6 x (# triangles); each diagonal entry $(A^3)_{ii}$ is 2 x (# triangles through vertex i)
  - Estimate diagonal entries of $A^3$
  - $\operatorname{Trace}(A^3) = \sum_i \lambda_i(A)^3$, where $\lambda_i(A)$ are the eigenvalues of A
- Mincut [O1]
- Nearly-planar coloring [O2]
- Maxcut [O4]
- Near Bi-Partite Structure [O5]
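For a simple undirected graph, trace(A^3) counts every triangle six times, and each diagonal entry of A^3 counts the triangles through that vertex twice. The sketch below checks this identity exactly against a brute-force count on a small example. It uses a dense cube for clarity and does not implement the eigenvalue-based approximation of [O3]; the function names and the toy edge list are assumptions for illustration.

```python
from itertools import combinations


def matmul(X, Y):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def triangles_via_cube(A):
    """Return (total triangles, triangles through each vertex) from diag(A^3)."""
    A3 = matmul(matmul(A, A), A)
    n = len(A)
    per_vertex = [A3[i][i] // 2 for i in range(n)]   # (A^3)_ii = 2 * triangles at i
    total = sum(A3[i][i] for i in range(n)) // 6     # trace(A^3) = 6 * triangles
    return total, per_vertex


def triangles_brute_force(A):
    """Count triangles by testing every vertex triple."""
    n = len(A)
    return sum(1 for i, j, k in combinations(range(n), 3)
               if A[i][j] and A[j][k] and A[i][k])


# Hypothetical toy graph with triangles (0,1,2), (1,2,3), and (3,4,5).
edge_list = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
A = [[0] * n for _ in range(n)]
for i, j in edge_list:
    A[i][j] = A[j][i] = 1

total, per_vertex = triangles_via_cube(A)
print(total, per_vertex, triangles_brute_force(A))
```

Forming A^3 explicitly is only practical for small graphs; at scale the same quantity would be estimated, for example from a few dominant eigenvalues through trace(A^3) = sum of eigenvalues cubed.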

Important Extensions
- Bipartite Graphs
  - Co-cluster!
- Tensors
  - Time is a tensor dim.
  - Causality?
- Dynamic Graphs [T2]
  - Streaming
  - Gather stats
- Directed Graphs
  - Highly-cyclic structure
  - General c-cyclic structure
- Labeled Graphs
  - Factor in labels
  - Label anomalies?
[Figure 12: Matrix sparsity structure for example 3 under the reordering given by the Fiedler vector (left) and associated ordering of row and column variables (right). The purple line in the figure on the right shows the one-dimensional search space for row and column splittings considered by the out-of-box undirected method.]
[Figure 13: Original graph from example 4 (left), randomly reordered (middle), and bipartized (right).]
[Figure: plots titled "Spectrum of D^{-1} A" (axes Re, Im) and "Spectral Coordinates" for cyclic directed graphs; a spy plot with nz = 2613; tensor diagram with labels "Author" and "Conference".]

Summary
- Linear algebra addresses a diverse set of graph analytics
- Linear algebra kernels are somewhat scalable and implemented in many computing environments
- Often requires close interaction with a mathematician or computer scientist to tune to a new type of data or analytic

References I of III
HPC
[H1] Wissink et al. Large Scale Structured AMR Calculations Using the SAMRAI Framework, SC01 Proceedings, 2001. LLNL tech report UCRL-JC-144755.
[H2] Balay et al. Efficient Management of Parallelism in Object Oriented Numerical Software Libraries, Modern Software Tools in Scientific Computing, 1997.
[H3] Falgout et al. Design of the hypre Preconditioner Library, Proc. of the SIAM Workshop on Object Oriented Methods for Inter-operable Scientific and Engineering Computing, 1998.
[H4] Heroux et al. An Overview of Trilinos, SNL Tech Report SAND2003-2927, 2003.
Data
[D1] Meusel et al. Web Data Commons 2014 Hyperlink Graph, http://webdatacommons.org/hyperlinkgraph/2014-04/topology.htm
[D2] Leskovec et al. Stanford Large Network Dataset Collection, https://snap.stanford.edu/data/

References II of III
Ranking
[R1] Page. PageRank: Bringing order to the web. Stanford Digital Libraries Working Paper, 1997.
[R2] Haveliwala. Topic-sensitive PageRank. In WWW, pages 517-526, 2002.
[R3] Faloutsos et al. Fast Discovery of Connection Subgraphs, KDD 2004.
[R4] Walkscore, http://www.walkscore.com/
Clustering
[C1] Blondel et al. Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment (10), P10008, 2008.
[C2] Paatero et al. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 1994.
[C3] Blei et al. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003.
[C4] von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 2007.
[C5] Xu et al. Fast Multilevel Co-Clustering: Unraveling the Multilevel Overlapping Cluster Structure of Social Network Data, submitted to Numerical Linear Algebra with Applications, May 2015.

References III of III
Tensors
[T1] Kolda et al. Tensor Decompositions and Applications, SIAM Review, 2008.
[T2] Dunlavy et al. Clustering network data using graphs, hypergraphs, and tensors, lecture given at University of Montreal, May 2015.
Discrete Optimization
[O1] Fiedler. Algebraic connectivity of graphs, Czechoslovak Mathematical Journal 23 (98), 1973.
[O2] Hu et al. On Maximum Differential Graph Coloring, Lecture Notes in Computer Science, 2011.
[O3] Tsourakakis et al. Spectral Counting of Triangles in Power-Law Networks via Element-Wise Sparsification, Social Network Analysis and Mining, 2009.
[O4] Trevisan. Max Cut and the Smallest Eigenvalue, SIAM J. Comput., 2012. Earlier version in Proc. of 41st ACM STOC, 2009.
[O5] Kirkland et al. Bipartite subgraphs and the signless Laplacian matrix. Applicable Analysis and Discrete Mathematics, 2011.