LINEAR-ALGEBRAIC GRAPH MINING




UCSF QB3 Seminar 5/28/2015, Linear Algebraic Graph Mining, sanders29@llnl.gov, 1/22, LLNL-PRES-671587
New Applications of Computer Analysis to Biomedical Data Sets
QB3 Seminar, UCSF Medical School, May 28th, 2015
LINEAR-ALGEBRAIC GRAPH MINING
Geoffrey Sanders, CASC/LLNL
Lawrence Livermore National Laboratory, P.O. Box 808, Livermore, CA 94551
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344

LLNL and LDRD
LLNL is a DOE FFRDC
Center for Applied Scientific Computing (CASC)
Several of us work on Laboratory Directed Research and Development (LDRD) projects in HPC and data analysis
  Graph analytics, machine learning, network analysis
We are ALWAYS looking for domain-scientist collaborators with interesting datasets or new data mining tasks
DOE national labs have a history of building open HPC software for PDE-related applications (physics simulation)
  PETSc [H2], Trilinos [H4], Hypre [H3], SAMRAI [H1], etc.
Desire to do the same for graph mining applications

Outline
1. Introduction
2. Analytics that Rank
3. Analytics that Cluster
4. Analytics that Approximate Expensive Calculations
5. Current Research Directions

Graph Model Definitions
Graph G(V,E)
  Vertices i, j in V
  Edges (i,j) in E
  Edge weights on edges (i,j)
Undirected vs. directed? (i,j) and (j,i)
Hypergraphs: (i,j,k,l) and (p,q,r) in E
Attributes?
  Vertex labels: height, gender, profession
  Edge labels: timestamp, volume

Difficult Topologies
Scale-free
Small-world
Community structure
  Hierarchical
  Overlapping
  Heterogeneous in size, density, type, etc.
Other structure
  Tree-like periphery

Web Data Commons Hyperlink Graph
Crawled in 2014, directed [D1]
1.7B webpages, 64B hyperlinks

Spy Plot
[Figure: spy plot of the Autonomous System graph [D2]; both axes are indexed by vertices]
Graphs have natural sparse matrix representations
Linear algebra applies
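A minimal sketch of what "natural sparse matrix representation" can mean in practice, assuming a small hypothetical undirected graph (the edge list, `n`, and the helper `neighbors` are illustrative, not from the talk). It builds the adjacency matrix, a hand-rolled compressed sparse row (CSR) view that stores only the nonzeros a spy plot would show, and the Laplacian L = D - A, assuming L on the slides denotes a graph Laplacian. For real data a sparse library would replace the dense array.

```python
import numpy as np

# Hypothetical toy graph: undirected, unweighted edges on 5 vertices.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
n = 5

# Adjacency matrix A: A[i, j] = 1 when (i, j) is an edge.
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = 1.0
    A[j, i] = 1.0  # undirected: store both (i, j) and (j, i)

# CSR view: keep only the nonzero positions (what a spy plot displays).
rows, cols = np.nonzero(A)  # returned in row-major order
indptr = np.concatenate(([0], np.cumsum(np.bincount(rows, minlength=n))))
indices = cols              # column index of each stored nonzero


def neighbors(i):
    """Neighbors of vertex i, read directly from the CSR arrays."""
    return indices[indptr[i]:indptr[i + 1]]


# Combinatorial Laplacian L = D - A, with D the diagonal degree matrix.
degrees = A.sum(axis=1)
L = np.diag(degrees) - A

print("neighbors of vertex 2:", neighbors(2))
print("stored nonzeros:", indptr[-1], "of", n * n, "entries")
```

Each undirected edge contributes two stored nonzeros, so storage grows with the number of edges rather than with the square of the number of vertices.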

Linear-Algebraic Kernels
Linear Solve: $L x = b$
Eigensolve: $L x = \lambda x$
Matrix Factorization: $L \approx F G^t$
Tensor Factorization [T1]: $A \approx \sum_{k=1}^{r} u_k \circ v_k \circ w_k$
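The first three kernels can be sketched with NumPy on a tiny example. This is only an illustration under stated assumptions: the path graph, the shift `sigma`, the right-hand side `b`, and the rank `r` are all hypothetical, and L is taken to be the graph Laplacian D - A. Because a Laplacian is singular, the sketch solves a shifted system, and it uses a truncated SVD as one possible matrix factorization.

```python
import numpy as np

# Hypothetical example: path graph on 4 vertices, L = D - A.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# Linear solve. L is singular (constant vectors lie in its null space),
# so this sketch solves the shifted system (L + sigma*I) x = b instead.
sigma = 0.1  # assumed regularization shift
b = np.array([1.0, 0.0, 0.0, -1.0])
x = np.linalg.solve(L + sigma * np.eye(4), b)

# Eigensolve: L v = lambda v. eigh is for symmetric matrices and
# returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(L)

# Matrix factorization: rank-r approximation L ~ F G^t via truncated SVD.
r = 2
U, s, Vt = np.linalg.svd(L)
F = U[:, :r] * s[:r]   # scale the leading left singular vectors
G = Vt[:r, :].T

print("smallest eigenvalue:", eigvals[0])
print("rank-%d approximation error:" % r, np.linalg.norm(L - F @ G.T))
```

At scale, the dense calls would typically be replaced by iterative sparse solvers and eigensolvers, which only need matrix-vector products.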


Ranking Calculations
Global ranking?
  Ordered list
  Often only care about the top few of the list
PageRank [R1]
  Centrality measure
  Random walk
Personalized PR [R2]
Supervised
Connection Subgraph [R3]
  Solve for direction
  Rank vertices
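As a rough illustration of the random-walk view of PageRank, the sketch below uses plain power iteration. The function name `pagerank`, the damping value `alpha = 0.85`, the dangling-vertex handling, and the toy graph are assumptions made for this example, not details from the talk. Passing a `personalization` vector concentrates the teleport step on chosen seed vertices, which is the idea behind personalized PageRank.

```python
import numpy as np


def pagerank(A, alpha=0.85, personalization=None, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank; A[i, j] = 1 means an edge from i to j."""
    n = A.shape[0]
    out_deg = A.sum(axis=1)
    # Teleport distribution: uniform (global ranking) or seed-based.
    if personalization is None:
        v = np.full(n, 1.0 / n)
    else:
        v = personalization / personalization.sum()
    x = v.copy()
    for _ in range(max_iter):
        # Each vertex splits its score evenly over its out-links.
        spread = np.divide(x, out_deg, out=np.zeros(n), where=out_deg > 0)
        # Vertices with no out-links redistribute their score through v.
        dangling = x[out_deg == 0].sum()
        x_new = alpha * (A.T @ spread + dangling * v) + (1 - alpha) * v
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x


# Hypothetical directed graph on 4 vertices.
A = np.zeros((4, 4))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]:
    A[i, j] = 1.0

ranks = pagerank(A)
ranking = np.argsort(-ranks)  # ordered list, best first
seeded = pagerank(A, personalization=np.array([0.0, 0.0, 0.0, 1.0]))
print("global ranking:", ranking)
```

Since often only the top few entries matter, the full sort could be replaced by a partial selection of the largest scores.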

Exotic Ranking Calculations
WalkScore [R4]: meet a blend of complex constraints?
[Figure annotations: "My brother gets", "My score", "Worse than a buoy"]


Clustering
Unsupervised?
Spectral Clustering [O1,C4]
  Hard or soft?
  $\mathrm{sign}(v) \mathbin{.*} \left( \log_{10}(|v| + \varepsilon) + \min\{ \log_{10}(|v| + \varepsilon) \} \right)$
Agglomerative [C1]
  Start with n groups
  Make local grouping decisions to maximize modularity:
  $\sum_{\text{comms}} \left[ (\text{Internal edges}) - (\text{Expected internal edges}) \right]$
[Figure: spy plots of the same graph ordered randomly and ordered by the Fiedler vector]
[Figure: recursive bipartite spectral clustering tree over the original vertex set. Legend: reason split (vector splitting, connected components); reason stopped (minimum cluster size, max clusters)]
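One bisection step of spectral clustering, and the modularity score used by agglomerative methods, can be sketched as follows. This is a minimal example under assumptions: the function names, the sign-based (hard) split, and the toy graph of two triangles joined by one edge are all hypothetical. Modularity is written as the sum over communities of internal edges minus expected internal edges, normalized here by the total edge count m.

```python
import numpy as np


def fiedler_bisection(A):
    """Hard split: sign of the Fiedler vector of L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)  # ascending eigenvalues
    fiedler = eigvecs[:, 1]               # second-smallest eigenvalue
    return (fiedler > 0).astype(int)


def modularity(A, labels):
    """Sum over communities of (internal - expected internal edges) / m."""
    m = A.sum() / 2.0                     # total number of edges
    deg = A.sum(axis=1)
    q = 0.0
    for c in np.unique(labels):
        members = labels == c
        internal = A[np.ix_(members, members)].sum() / 2.0
        expected = deg[members].sum() ** 2 / (4.0 * m)
        q += (internal - expected) / m
    return q


# Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

labels = fiedler_bisection(A)
q_split = modularity(A, labels)
q_single = modularity(A, np.zeros(6, dtype=int))
print("labels:", labels, "modularity:", q_split)
```

A recursive variant would apply the same split to each part until a stopping rule such as a minimum cluster size or a maximum number of clusters is met. Keeping the Fiedler vector entries themselves, rather than only their signs, gives a soft assignment.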

Overlapping (Co-)Clustering
Non-Negative MF [C2]
  Factors positive-valued: $L \approx F G^t$
  Interpreted as probabilities
Latent Dirichlet Allocation [C3]
  Model a document as a weighted group of topics
  Each topic has an individual vocabulary
  Document is a random combination of terms from its topics
  More general than term-document data
Coarsening [C5]
  Multilevel
  LinkedIn data

[Excerpt shown on the slide, from Haifeng Xu, Hans De Sterck and Geoffrey Sanders [C5]:]

Figure 1. Feature matrix F and induced weighted bipartite graph. Red dots correspond to row variables of F, and blue dots to column variables. The rows and columns may, e.g., represent LinkedIn users and skills, with the weights indicating how often a user's skill was endorsed by the user's connections.

research groups, etc. These hierarchical structures are overlapping; for example, some professors may be active in multiple departments or faculties, and many skills (often even the more specialized ones) are taught in multiple degree programs. In a similar way, it is to be expected that many of the currently emerging online social networks also contain inherent overlapping hierarchical organization, in particular when they focus on a specific dimension of the human condition, like, e.g., the professional dimension. Consider for example the LinkedIn social network, where users connect to their business relations and acquaintances, and list user-defined skills and expertise on their user profiles that can be endorsed by their connections. Similar to the case of universities, it is clear that in a social network like LinkedIn there must be hierarchical overlapping groups of users with similar skills and professions, and hierarchical overlapping groups of skill keywords that characterize professional groups.

Figure 2. Bipartite graph hierarchy obtained by the FMCC algorithms. The input feature matrix is located at the bottom of the diagram.

However, in contrast to the example of universities, in emerging social networks this hierarchy is not hard-coded into the structure of the network; if it were, it would seriously impede the growth and dynamical evolution of these networks. Since the hierarchy is not explicitly hard-coded into the structure of the network but is nevertheless present, it is at once a very interesting and a challenging problem to try to automatically generate a representation of this hierarchy from the
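To make the non-negative factorization idea concrete, here is a small sketch that factors a users-by-skills feature matrix into non-negative factors and normalizes the rows so they read as soft (overlapping) memberships. It uses Lee-Seung style multiplicative updates, which is a swapped-in technique chosen for brevity and is not claimed to be the method of [C2] or [C5]. The function name `nmf`, the matrix `X`, and all parameter values are hypothetical.

```python
import numpy as np


def nmf(X, r, n_iter=500, seed=0, eps=1e-12):
    """Approximate nonnegative X (m x n) as F @ G.T with F, G >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, r)) + 0.1  # strictly positive start
    G = rng.random((n, r)) + 0.1
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        F *= (X @ G) / (F @ (G.T @ G) + eps)
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)
    return F, G


# Hypothetical feature matrix: 5 users (rows) by 4 skills (columns).
# User 2 has endorsements in both skill groups, so groups overlap.
X = np.array([[3, 2, 0, 0],
              [2, 3, 0, 0],
              [1, 1, 1, 1],
              [0, 0, 2, 3],
              [0, 0, 3, 2]], dtype=float)

F, G = nmf(X, r=2)
# Row-normalize so each user's weights sum to 1 (read as probabilities).
memberships = F / F.sum(axis=1, keepdims=True)
print(np.round(memberships, 2))
```

Because the factors stay non-negative, a user can carry weight in several groups at once, which is what allows overlapping clusters of both rows and columns.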


Approximations to Expensive Calcs
Triangle Counting [O3]
  Trace (sum of the diagonal) of $A^3$ is 6 x (# triangles)
  Estimate diagonal entries of $A^3$
  $\mathrm{Trace}(A^3) = \sum_i \lambda_i(A)^3$
Mincut [O1]
Nearly-planar coloring [O2]
Maxcut [O4]
Near Bi-Partite Structure [O5]
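The triangle-counting identities can be checked on a small case. For a simple undirected graph, each triangle contributes six closed walks of length three, so trace(A^3) = 6 x (# triangles), and each diagonal entry of A^3 is twice the number of triangles through that vertex. The sketch below is illustrative only: the function names and the complete graph K4 are assumptions, and the spectral estimate simply keeps the k eigenvalues of largest magnitude.

```python
import numpy as np


def triangles_exact(A):
    """Total triangles: trace(A^3) / 6 for a simple undirected graph."""
    return np.trace(A @ A @ A) / 6.0


def triangles_per_vertex(A):
    """Triangles through each vertex: diag(A^3) / 2."""
    return np.diag(A @ A @ A) / 2.0


def triangles_spectral(A, k):
    """Estimate from the k eigenvalues of largest magnitude."""
    eigvals = np.linalg.eigvalsh(A)
    top = eigvals[np.argsort(-np.abs(eigvals))[:k]]
    return np.sum(top ** 3) / 6.0


# Hypothetical example: complete graph K4, which contains 4 triangles.
A = np.ones((4, 4)) - np.eye(4)

print("exact:", triangles_exact(A))
print("per vertex:", triangles_per_vertex(A))
print("spectral, k=1:", triangles_spectral(A, 1))
print("spectral, k=4:", triangles_spectral(A, 4))
```

With all eigenvalues the spectral formula is exact. Truncating to a few dominant eigenvalues trades accuracy for cost, since only the extreme part of the spectrum has to be computed.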


Important Extensions
Bipartite Graphs
  Co-cluster!
Tensors
  Time is a tensor dim.
  Causality?
Dynamic Graphs [T2]
  Streaming
  Gather stats
Directed Graphs
  Highly-cyclic structure
  General c-cyclic structure
Labeled Graphs
  Factor in labels
  Label anomalies?
[Figure 12: Matrix sparsity structure for example 3 under the reordering given by the Fiedler vector (left) and associated ordering of row and column variables (right). The purple line in the figure on the right shows the one-dimensional search space for row and column splittings considered by the out-of-box undirected method.]
[Figure 13: Original graph from example 4 (left), randomly reordered (middle), and bipartized (right).]
[Figures: spy plot (nz = 2613); spectrum of $D^{-1} A$ in the complex plane (axes Re, Im); spectral coordinates (axes Re, Im); graphic labels "Author" and "Conference"]
[Draft text shown on the slide: "where M [ADD THE NECESSARY CONSTRAINTS TO M]. Then, the eigenpairs of B ( o, x) and ( o, x) can be used to define a mapping into R^2 such that the nodes are mapped to [DETERMINE PARAMETERS OF REGION] around vectors of length 1 at angles of 3, or 5 3."]

Summary
Linear algebra addresses a diverse set of graph analytics
Linear algebra kernels are somewhat scalable and are implemented in many computing environments
Often requires close interaction with a math or computer scientist to tune to a new type of data/analytic

References I of III
HPC
[H1] Wissink et al. Large Scale Structured AMR Calculations Using the SAMRAI Framework, SC01 Proceedings, 2001. LLNL tech report UCRL-JC-144755.
[H2] Balay et al. Efficient Management of Parallelism in Object Oriented Numerical Software Libraries, Modern Software Tools in Scientific Computing, 1997.
[H3] Falgout et al. Design of the hypre Preconditioner Library, Proc. of the SIAM Workshop on Object Oriented Methods for Inter-operable Scientific and Engineering Computing, 1998.
[H4] Heroux et al. An Overview of Trilinos, SNL Tech Report SAND2003-2927, 2003.
Data
[D1] Meusel et al. Web Data Commons 2014 Hyperlink Graph, http://webdatacommons.org/hyperlinkgraph/2014-04/topology.htm
[D2] Leskovec et al. Stanford Large Network Dataset Collection, https://snap.stanford.edu/data/

References II of III
Ranking
[R1] Page. PageRank: Bringing order to the web. Stanford Digital Libraries Working Paper, 1997.
[R2] Haveliwala. Topic-sensitive PageRank. In WWW, pages 517-526, 2002.
[R3] Faloutsos et al. Fast Discovery of Connection Subgraphs, KDD 2004.
[R4] Walkscore, http://www.walkscore.com/
Clustering
[C1] Blondel et al. Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment (10), P10008, 2008.
[C2] Paatero et al. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 1994.
[C3] Blei et al. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003.
[C4] von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 2007.
[C5] Xu et al. Fast Multilevel Co-Clustering: Unraveling the Multilevel Overlapping Cluster Structure of Social Network Data, submitted to Numerical Linear Algebra with Applications, May 2015.

References III of III
Tensors
[T1] Kolda et al. Tensor Decompositions and Applications, SIAM Review, 2008.
[T2] Dunlavy et al. Clustering network data using graphs, hypergraphs, and tensors, lecture given at University of Montreal, May 2015.
Discrete Optimization
[O1] Fiedler. Algebraic connectivity of graphs, Czechoslovak Mathematical Journal 23 (98), 1973.
[O2] Hu et al. On Maximum Differential Graph Coloring, Lecture Notes in Computer Science, 2011.
[O3] Tsourakakis et al. Spectral Counting of Triangles in Power-Law Networks via Element-Wise Sparsification, Social Network Analysis and Mining, 2009.
[O4] Trevisan. Max Cut and the Smallest Eigenvalue, SIAM J. Comput., 2012. Earlier version in Proc. of 41st ACM STOC, 2009.
[O5] Kirkland et al. Bipartite subgraphs and the signless Laplacian matrix. Applicable Analysis and Discrete Mathematics, 2011.