Network Algorithms for Homeland Security




Mark Goldberg and Malik Magdon-Ismail, Rensselaer Polytechnic Institute. September 27, 2004.
Collaborators: J. Baumes, M. Krishnamoorthy, N. Preston, W. Wallace.
Partially supported by NSF grants 0324947 and 0346341.

Motivation
Social communities communicate in structured ways (spatial correlation).
A group planning an activity communicates in order to coordinate (spatial and temporal correlation).
We should be able to exploit the structure in such communications to discover social communities and groups that may be planning some activity.

Communication Networks
Communication Data: a sequence of actor pairs, with the time and content of each communication.
Communication Cycle: a time period over which communications are aggregated.
Communication Graph for a Communication Cycle: actors are nodes, with an edge between two actors if a communication occurred between them during the cycle.
Many cycles = a time series of communication graphs.
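
The aggregation of a communication log into per-cycle graphs can be sketched as follows. This is a minimal illustration, not the authors' code: the record format `(sender, receiver, timestamp)`, the function name `build_communication_graphs` and the fixed `cycle_length` are assumptions made for the example, and message content is ignored.

```python
from collections import defaultdict

def build_communication_graphs(log, cycle_length):
    """Aggregate a communication log into one edge set per communication cycle.

    log: iterable of (sender, receiver, timestamp) records (assumed format).
    cycle_length: duration of one communication cycle, in timestamp units.
    Returns a list of edge sets; each edge is an undirected frozenset {a, b}.
    """
    cycles = defaultdict(set)
    for a, b, t in log:
        if a != b:  # ignore self-communication
            cycles[int(t // cycle_length)].add(frozenset((a, b)))
    if not cycles:
        return []
    first, last = min(cycles), max(cycles)
    # Keep empty cycles so the result is a regular time series of graphs.
    return [cycles.get(c, set()) for c in range(first, last + 1)]

# Hypothetical log: four communications spread over four unit-length cycles.
log = [("ann", "bob", 0.5), ("bob", "cy", 0.9), ("ann", "bob", 1.2), ("cy", "dee", 3.1)]
graphs = build_communication_graphs(log, cycle_length=1.0)
for t, edges in enumerate(graphs, start=1):
    print(t, sorted(tuple(sorted(e)) for e in edges))
```

Repeated communications between the same pair within a cycle collapse into a single edge, which matches the unweighted graphs used in Part I.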

Non-Semantic Analysis
Semantic information is informative but computationally intensive to analyse for large communication networks.
Semantic information can be misleading if cryptographic protocols are being used.
Semantic information may be unavailable, especially on account of privacy constraints.
What can we accomplish without semantic analysis?

Outline
Part I: Hidden Groups. Example; Model; Algorithms; Experiments; Extensions and ongoing work.
Part II: Overlapping Communities. Example; Model; Algorithms; Experiments.

PART I: Hidden Groups

Example [figures]: Where is the hidden group?

What is a Hidden Group?
A Hidden Group is a group of actors that is planning some activity.
The group may or may not actively try to conceal its planning, e.g.:
  a group planning a Sunday afternoon picnic, or a PTA meeting;
  a group planning a malicious terrorist act.
Their planning-related communications are embedded in the (random) background communications and hence obscured.
Planning-related communications must occur for the activity to succeed; background communications tend to be random.

What is Planning?
All members of the group need to exchange information during each communication cycle.
Implication for the communication graph: connectivity.

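Types of Connectivity
Internal (secretive, paranoid): the group members are connected using only edges between group members.
External (non-secretive, trusting): the group members are connected, possibly through actors outside the group.
Disconnected: not a hidden group.
A malicious group is more likely to be secretive, i.e. internally connected.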

Persistence
[Figure: groups IP, EP and D in the communication graphs at t = 1, 2, 3, 4]
IP is internally persistent (IP): internally connected in every graph.
EP is externally persistent (EP): externally connected in every graph.
D is not persistent.
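
The two notions of persistence can be checked directly for a given candidate group. The sketch below is illustrative only: the function names, the edge-list representation (pairs of vertex labels) and the toy graphs are assumptions, not part of the original slides.

```python
def components(vertices, edges):
    """Connected components of the subgraph induced on `vertices` (list of sets)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        if u in adj and v in adj:  # keep only edges inside `vertices`
            adj[u].add(v)
            adj[v].add(u)
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            x = stack.pop()
            for y in adj[x] - comp:
                comp.add(y)
                stack.append(y)
        seen |= comp
        comps.append(comp)
    return comps

def internally_connected(group, edges):
    # Connected using only edges between group members.
    return len(components(group, edges)) == 1

def externally_connected(group, vertices, edges):
    # All members lie in one connected component of the whole graph.
    return any(set(group) <= c for c in components(vertices, edges))

def is_internally_persistent(group, graphs):
    return all(internally_connected(group, edges) for edges in graphs)

def is_externally_persistent(group, vertices, graphs):
    return all(externally_connected(group, vertices, edges) for edges in graphs)

# Toy example with two communication cycles.
V = {"A", "B", "C", "D"}
G1 = [("A", "B"), ("B", "C")]
G2 = [("A", "C"), ("C", "B")]
print(is_internally_persistent({"A", "B"}, [G1, G2]))     # A and B need C in the second cycle
print(is_externally_persistent({"A", "B"}, V, [G1, G2]))
```

In the toy data the pair {A, B} is externally but not internally persistent, because in the second cycle A and B are linked only through C.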

Hidden Group Model
Internal Persistence. A (paranoid) hidden group is internally persistent.
Static. The hidden group members do not change.
Problem Statement
Input: communication graphs $G_1, G_2, \ldots, G_T$; integer $K$.
Task: Is there a hidden group of size $K$? Find all such hidden groups.
Let $E_t$ be the number of edges in $G_t$, and let $E = \sum_t E_t$ (total number of edges).

Maximal Persistent Components
A persistent component $C$ is maximal if any other persistent component that overlaps $C$ is contained in $C$.
Observation: maximal persistent components are disjoint.
Implication: the vertex set can be (uniquely) partitioned into maximal persistent components.
Let $P = \{A_1, \ldots, A_K\}$ be such a partition.

Algorithm Ext Persistent
Observation: $A_{ext}$ is EP if it is in the same connected component in every $G_t$.
Implication: to find the maximal EP components, find the maximal intersections of the connected components of $\{G_t\}$.
Computational complexity: $O(E + VT)$.
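
One way to compute the maximal intersections of connected components is to give every vertex a signature (its component label in each graph) and group vertices with equal signatures. The following is a sketch under that assumption; the names `component_labels` and `ext_persistent` and the edge-list representation are illustrative choices, not taken from the slides.

```python
def component_labels(vertices, edges):
    """Map each vertex to a representative of its connected component."""
    adj = {v: [] for v in vertices}
    for u, v in edges:
        if u in adj and v in adj:
            adj[u].append(v)
            adj[v].append(u)
    label = {}
    for s in vertices:
        if s in label:
            continue
        label[s] = s
        stack = [s]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in label:
                    label[y] = s
                    stack.append(y)
    return label

def ext_persistent(graphs, vertices):
    """Partition `vertices` into maximal externally persistent components."""
    vertices = list(vertices)
    signature = {v: [] for v in vertices}
    for edges in graphs:  # one traversal per graph: O(E_t + V)
        label = component_labels(vertices, edges)
        for v in vertices:
            signature[v].append(label[v])
    classes = {}
    for v in vertices:  # equal signatures = same component in every graph
        classes.setdefault(tuple(signature[v]), set()).add(v)
    return list(classes.values())

G1 = [(1, 2), (2, 3), (4, 5)]
G2 = [(1, 3), (3, 4), (2, 5)]
parts = ext_persistent([G1, G2], range(1, 6))
print(sorted(sorted(p) for p in parts))
```

Each graph is traversed once and each vertex gains one label per graph, which is in line with the stated $O(E + VT)$ bound.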

Algorithm Int Persistent
Suppose $A_{ext}$ is maximally EP.
Observation: $A_{ext}$ can be partitioned into maximal IP components.
Consider the induced subgraphs $\{G_1(A_{ext}), G_2(A_{ext}), \ldots, G_T(A_{ext})\}$.
Let $P_{ext}$ be a partition of $\{G_t(A_{ext})\}$ into maximal EP components.
Lemma. $A_{ext}$ is maximally IP iff $|P_{ext}| = 1$.
Implication: Int Persistent can be implemented recursively on a partition obtained from Ext Persistent.

Recursive Algorithm Int Persistent
  Int Persistent($\{G_t\}_{t=1}^{T}$, $V$)
  // Input: graphs $\{G_t = (E_t, V)\}_{t=1}^{T}$.
  // Output: a partition $P = \{V_j\}$ of $V$.
  $\{V_i\}_{i=1}^{K}$ = Ext Persistent($\{G_t\}_{t=1}^{T}$, $V$)
  if $K = 1$ then
    $P = \{V_1\}$;
  else
    $P = \bigcup_{k=1}^{K}$ Int Persistent($\{G_t(V_k)\}_{t=1}^{T}$, $V_k$);
  return $P$;
Computational complexity: $O(VE + V^2 T)$.
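
The recursion can be written out as a short runnable sketch. It is self-contained, so it carries its own version of the Ext Persistent step applied to induced subgraphs; the function names and the edge-list representation are assumptions made for the example.

```python
def ext_persistent(graphs, vertices):
    """Maximal EP components of the subgraphs induced on `vertices`."""
    vertices = set(vertices)
    signature = {v: [] for v in vertices}
    for edges in graphs:
        adj = {v: [] for v in vertices}
        for u, v in edges:
            if u in vertices and v in vertices:  # induced subgraph G_t(V)
                adj[u].append(v)
                adj[v].append(u)
        label = {}
        for s in vertices:
            if s in label:
                continue
            label[s] = s
            stack = [s]
            while stack:
                x = stack.pop()
                for y in adj[x]:
                    if y not in label:
                        label[y] = s
                        stack.append(y)
        for v in vertices:
            signature[v].append(label[v])
    classes = {}
    for v in vertices:
        classes.setdefault(tuple(signature[v]), set()).add(v)
    return list(classes.values())

def int_persistent(graphs, vertices):
    """Partition `vertices` into maximal internally persistent components."""
    parts = ext_persistent(graphs, vertices)
    if len(parts) == 1:  # K = 1: the set is already maximally IP
        return parts
    result = []
    for part in parts:  # recurse on the induced subgraphs G_t(V_k)
        result.extend(int_persistent(graphs, part))
    return result

G1 = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
G2 = [(1, 2), (2, 3), (4, 5), (3, 6)]
print(sorted(sorted(p) for p in int_persistent([G1, G2], range(1, 7))))
```

In this example {1, 2, 3, 6} is externally persistent, but vertex 6 reaches the others in the first graph only through vertices 4 and 5, so the recursion splits it off.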

Example: Int Persistent
[Figure panels: G1, G2; CC(G1), CC(G2); EP{G1}, EP{G1,G2}; IP{G1}, IP{G1,G2}]

Status
There is a polynomial-time algorithm to construct a partition into maximal internally persistent components.
That is only half the story.

Example [figures]
What about background groups that appear persistent?
How do we know that a discovered persistent component is a hidden group and not due to the background communications?

Random Background
If the background (non-planning-related) communications are random, then the fortuitous persistence of background groups will be short-lived.

Two Background Models
Uniform Model $G(n, p)$:
  $n$: number of actors.
  $p$: probability of an edge between any two actors.
Group Model $Gr(n, n_g, m, p_g, p_e)$:
  $n_g$: number of groups.
  $m$: group size.
  $p_g$: each group is a $G(m, p_g)$.
  $p_e$: probability of an edge between two actors not sharing a group.
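
Both background models are easy to sample. The sketch below is hedged: the slides do not say how group membership is assigned, so it assumes each group is a uniformly random set of $m$ actors (groups may overlap), and that two actors sharing several groups get one independent chance at an edge per shared group. The function names are illustrative.

```python
import random
from itertools import combinations

def uniform_model(n, p, rng):
    """One sample of G(n, p): each pair is an edge independently with probability p."""
    return {(u, v) for u, v in combinations(range(n), 2) if rng.random() < p}

def group_model(n, n_g, m, p_g, p_e, rng):
    """One sample of Gr(n, n_g, m, p_g, p_e).

    Assumption (not in the slides): each group is a uniformly random
    m-subset of the actors, so groups may overlap.
    """
    groups = [set(rng.sample(range(n), m)) for _ in range(n_g)]
    member_of = {v: set() for v in range(n)}
    for g, members in enumerate(groups):
        for v in members:
            member_of[v].add(g)
    edges = set()
    for u, v in combinations(range(n), 2):
        shared = len(member_of[u] & member_of[v])
        # Each shared group is its own G(m, p_g); otherwise fall back to p_e.
        prob = 1 - (1 - p_g) ** shared if shared else p_e
        if rng.random() < prob:
            edges.add((u, v))
    return edges

rng = random.Random(0)
background = uniform_model(100, 6 / 100, rng)            # average degree about 6
structured = group_model(100, 10, 20, 0.3, 0.01, rng)
print(len(background), len(structured))
```

Passing an explicit `random.Random` instance keeps experiments reproducible.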

Background IP Components
How likely are chance background IP components?
For example: if the graph is connected with probability 1 due to background communications, then $V$ will be IP with probability 1.
Connectivity of the background random graph model will play a role.

The Double Jump
Phase transitions in the connectivity of a $G(n, p)$ random graph:
For $p = c/n$:
  $L(G(n, p)) = O(\ln n)$ for $0 < c < 1$;
  $L(G(n, p)) = O(n^{2/3})$ for $c = 1$;
  $L(G(n, p)) = \beta(c)\,n$ for $c > 1$, with $\beta(c) < 1$.
For $p = \frac{\ln n}{n} + \frac{x}{n}$, $x > 0$:
  $P[L(G(n, p)) = n] \to e^{-e^{-x}}$.
$L(G(n, p))$ is the size of the largest component of a $G(n, p)$ graph.
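
The phase transitions can be observed empirically with a small simulation. This is only an illustrative sketch: the values of $n$ and $c$, the seed and the helper names are arbitrary choices, and a single sample per regime is shown rather than an average.

```python
import math
import random
from itertools import combinations

def sample_gnp(n, p, rng):
    """One sample of G(n, p) as a list of edges."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def largest_component(n, edges):
    """Size of the largest connected component, using union-find."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
    return max(size[find(v)] for v in range(n))

rng = random.Random(1)
n = 400
regimes = [("p = 0.5/n", 0.5 / n), ("p = 1/n", 1 / n),
           ("p = 3/n", 3 / n), ("p = 3 ln(n)/n", 3 * math.log(n) / n)]
for name, p in regimes:
    print(name, largest_component(n, sample_gnp(n, p, rng)))
```

Below the threshold $c = 1$ the largest component stays small, above it a giant component appears, and near $p = \ln n / n$ the whole graph becomes connected.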

Detecting the Hidden Group
Uniform Model:
  $p$                  Detecting the Hidden Group
  $c/n$                Should be easy
  $c \ln n / n$        ?
  $c$                  Don't quit your day job
What about the Group random model?

Random Persistent Components
$X(\tau)$: size of the largest persistent component in $G_1, \ldots, G_\tau$.
Consider $E[X(\tau)]$, the expected size of the largest persistent component.
[Figure: $E[X]$ versus $t$, with the level $h$ and the time $\tau(h)$ marked]
$\tau(h)$: the time to detect a hidden group of size $h$ reliably.

$\tau(h)$
Challenging to compute even for simple random background models.
Simulation:
  Random models: $\tau(h)$ can be computed offline.
  Real data: $\tau(h)$ can be computed from the communication history.
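
An offline estimate of $\tau(h)$ for the uniform model might look like the sketch below. It assumes one particular reading of the definition, namely that $\tau(h)$ is the first cycle at which the empirical mean of $X(\tau)$ has dropped to $h$ or below; the helper names, parameters and the choice to recompute the partition from scratch at every cycle are all simplifications for illustration.

```python
import random
from itertools import combinations

def ep_split(graphs, vertices):
    """Maximal EP components of the subgraphs induced on `vertices`."""
    vertices = set(vertices)
    signature = {v: [] for v in vertices}
    for edges in graphs:
        adj = {v: [] for v in vertices}
        for u, v in edges:
            if u in vertices and v in vertices:
                adj[u].append(v)
                adj[v].append(u)
        label = {}
        for s in vertices:
            if s in label:
                continue
            label[s] = s
            stack = [s]
            while stack:
                x = stack.pop()
                for y in adj[x]:
                    if y not in label:
                        label[y] = s
                        stack.append(y)
        for v in vertices:
            signature[v].append(label[v])
    classes = {}
    for v in vertices:
        classes.setdefault(tuple(signature[v]), set()).add(v)
    return list(classes.values())

def ip_partition(graphs, vertices):
    """Maximal internally persistent components (recursive refinement)."""
    parts = ep_split(graphs, vertices)
    if len(parts) <= 1:
        return parts
    return [c for part in parts for c in ip_partition(graphs, part)]

def sample_gnp(n, p, rng):
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def largest_ip_sizes(n, p, T, rng):
    """X(1), ..., X(T) for one random sequence of G(n, p) graphs."""
    graphs, sizes = [], []
    for _ in range(T):
        graphs.append(sample_gnp(n, p, rng))
        sizes.append(max(len(c) for c in ip_partition(graphs, range(n))))
    return sizes

def estimate_tau(n, p, h, T, trials, rng):
    """First tau <= T whose mean X(tau) is at most h, or None if never reached."""
    runs = [largest_ip_sizes(n, p, T, rng) for _ in range(trials)]
    for tau in range(T):
        if sum(run[tau] for run in runs) / trials <= h:
            return tau + 1
    return None

rng = random.Random(3)
print(largest_ip_sizes(40, 3 / 40, 10, rng))
print(estimate_tau(40, 3 / 40, 2, 10, 5, rng))
```

Because a set that is internally persistent over $\tau + 1$ cycles is also internally persistent over the first $\tau$, each simulated sequence $X(1), X(2), \ldots$ is non-increasing.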

Experiments: Uniform Model
[Figure: Largest persistent internally connected component, $G(n, p)$. Largest component size as a fraction of $n$ (0 to 1) versus time (10 to 100), for $p = 1/n$, $p = \ln(n)/n$ and $p = 1/10$.]

Uniform Model vs. Group Model
[Figure: Largest persistent internally connected component, $deg_{av} = 6$. Largest component size as a fraction of $n$ (0 to 1) versus time (10 to 100), for uniform $G(n, p)$ and for the group model with 50, 100 and 200 groups of size 20.]

                         Group Model ($deg_{av} = 6$)      Random Model
$n_g$ (# of groups)      50         100        200         $G(n, 6/n)$
$\tau(1)$                > 100      63         36          32

IP vs. EP
[Figure: Largest persistent connected component, $deg_{av} = 6$. Largest component size as a fraction of $n$ (0 to 1) versus time (10 to 100), for internally connected and externally connected components.]

$deg_{av}$     $\tau(1)$ for EP     $\tau(1)$ for IP
2              28                   2
6              > 100                32

Summary
It is (statistically) easier to detect the hidden group when:
  the background communication is less dense;
  the background communication is less structured;
  the hidden group is paranoid/secretive.

General Hidden Group Model
The hidden group is not active during every communication cycle.
The hidden group is active for some time period, $[t_1, t_2]$.
The hidden group does not communicate during every cycle in the time period of its activity.
The entire hidden group need not participate in the planning at every cycle in which it communicates.

General Problem is Hard
Problem: Frequently Mostly Connected (FMC)
Input: a sequence of graphs $\{G_i(V, E_i)\}_{i=1}^{T}$ on the same vertex set $V$; positive integers $s \le |V|$ and $k \le T$; and a fraction $\epsilon$, $0 < \epsilon \le 1$.
Question: Is there a subset $S \subseteq V$ with $|S| \ge s$ which induces an $\epsilon$-partially connected subgraph in at least $k$ of the $\{G_i\}$?
Theorem. Problem FMC is NP-complete.
Proof. Reduction from Balanced Complete Bipartite Subgraph [Garey & Johnson, problem GT24].

Conclusions and Ongoing Work
Two facets: algorithmic, statistical.
Streaming data.
More realistic random models; real data.
If you get to choose your battles, detect paranoid hidden groups in unstructured, sparse backgrounds.
Taking communication intensities into account.
Approaching the general problem:
  more structure on the hidden group;
  heuristics.
What about non-maximal IP components?
Evolving hidden groups.

PART II: Overlapping Communities

Example [figure: graphs G and H]
Computing Clusters: How many edges can we add before two clusters become one?

Defining Clusters
Social network context: communities are clusters.
Communities are characterized by more intense communications.
A cluster is a locally maximal subgraph: adding a new vertex (or removing an old one) decreases the density.
How to define density:
  depends on the application;
  may depend on the number of edges inside and outside the set.

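Possible Density Definitions
$\frac{e_{in}(S)}{|S|}$;
$\frac{e_{in}(S)}{e_{in}(S) + \alpha\, e_{out}(S)}$;
$\frac{2\, e_{in}(S)}{|S|\,(|S| - 1)}$;
$\frac{e_{in}(S)}{e_{in}(S) + \alpha \frac{|S| - 1}{|S|}\, e_{out}(S)}$;
$|S| / \mathrm{radius}(G_S)$;
$|S| / \mathrm{diameter}(G_S)$.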
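
Three of the edge-count densities can be computed directly from $e_{in}(S)$ and $e_{out}(S)$. The sketch below is illustrative: the function names are not from the slides, edges are assumed to be pairs of vertex labels, and degenerate sets (empty, or with a zero denominator) are assigned density 0 by convention.

```python
def edge_counts(S, edges):
    """Return (e_in, e_out): edges inside S and edges leaving S."""
    S = set(S)
    e_in = sum(1 for u, v in edges if u in S and v in S)
    e_out = sum(1 for u, v in edges if (u in S) != (v in S))
    return e_in, e_out

def density_per_vertex(S, edges):
    """e_in(S) / |S|"""
    e_in, _ = edge_counts(S, edges)
    return e_in / len(S) if S else 0.0

def density_in_vs_out(S, edges, alpha=1.0):
    """e_in(S) / (e_in(S) + alpha * e_out(S))"""
    e_in, e_out = edge_counts(S, edges)
    total = e_in + alpha * e_out
    return e_in / total if total else 0.0

def density_edge_fraction(S, edges):
    """2 e_in(S) / (|S| (|S| - 1)): fraction of possible internal edges present."""
    e_in, _ = edge_counts(S, edges)
    k = len(S)
    return 2 * e_in / (k * (k - 1)) if k > 1 else 0.0

# A 4-clique {0, 1, 2, 3} with a path 3-4-5 attached.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
S = {0, 1, 2, 3}
print(density_per_vertex(S, edges), density_in_vs_out(S, edges), density_edge_fraction(S, edges))
```

The parameter `alpha` controls how strongly outgoing edges are penalised in the second definition.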

Iterative Scan
  C ← seed; w ← W(C); increased ← TRUE;
  while increased do
    for all v ∈ V(G) do
      if v ∈ C then C' ← C \ {v} else C' ← C ∪ {v};
      if W(C') > W(C) then C ← C';
    if W(C) = w then increased ← FALSE;
    else w ← W(C);
  return C
[Figure label: desired range]
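
A direct transcription of Iterative Scan into runnable form is sketched below. The density `W` is passed in as a function; the example uses $e_{in}(S)/|S|$, which is only one of the possible choices, and the vertex scan order is simply the order of the `vertices` sequence (an assumption, since the slides do not fix it).

```python
def density_per_vertex(S, edges):
    """Example density W(S) = e_in(S) / |S| (0 for the empty set)."""
    if not S:
        return 0.0
    return sum(1 for u, v in edges if u in S and v in S) / len(S)

def iterative_scan(seed, vertices, edges, W):
    """Grow or shrink the seed one vertex at a time while the density improves."""
    C = set(seed)
    w = W(C, edges)
    increased = True
    while increased:
        for v in vertices:
            # Toggle v: remove it if present, add it otherwise.
            C_new = C - {v} if v in C else C | {v}
            if W(C_new, edges) > W(C, edges):
                C = C_new
        if W(C, edges) == w:  # a full pass made no improvement
            increased = False
        else:
            w = W(C, edges)
    return C

# A 4-clique {0, 1, 2, 3} with a path 3-4-5 attached.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
cluster = iterative_scan({0, 1}, range(6), edges, density_per_vertex)
print(sorted(cluster))
```

The loop terminates because every accepted change strictly increases the density and there are finitely many vertex subsets.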

Rank Removal (RaRe)
  Procedure RaRe(G, W);
  Global R ← ∅;
  ComputeRanks(V(G)); /* $\phi_p(v) = c \sum_{u,v} \frac{\phi_p(u)}{\deg(v)} + \frac{1 - c}{n}$ */
  {H_i} ← connected components of G;
  for all H_i,
    if |V(H_i)| ≤ max then H_i is a core cluster
    else ClusterComponent(H_i);
  Clusters ← {C_i}, the core clusters;
  for all v ∈ R do
    for all clusters C_i do
      if W(C_i ∪ {v}) > W(C_i) then add v to C_i;

Rank Removal (continued)
  Procedure ClusterComponent(H);
  if |V(H)| > max then
    T ← {t highest-rank vertices in H};
    R ← R ∪ T; H ← H \ T;
    {F_j} ← connected components of H;
    for all F_j do ClusterComponent(F_j);
  else if min ≤ |V(H)| then
    mark H as a core cluster;
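
The two procedures can be combined into one runnable sketch. Several details are assumptions rather than the authors' implementation: ranks come from a standard PageRank-style power iteration (dividing by the degree of the neighbour $u$) with `c = 0.85` and 50 iterations, the density is $e_{in}(S)/|S|$, and the names `page_rank`, `rare`, `min_size` and `max_size` are illustrative.

```python
def components(vertices, adj):
    """Connected components of the subgraph induced on `vertices`."""
    vertices = set(vertices)
    seen, comps = set(), []
    for s in vertices:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y in vertices and y not in comp:
                    comp.add(y)
                    stack.append(y)
        seen |= comp
        comps.append(comp)
    return comps

def page_rank(adj, c=0.85, iterations=50):
    """PageRank-style ranks by power iteration (assumed ranking function)."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        rank = {v: (1 - c) / n + c * sum(rank[u] / len(adj[u]) for u in adj[v])
                for v in adj}
    return rank

def density_per_vertex(S, edges):
    return sum(1 for u, v in edges if u in S and v in S) / len(S) if S else 0.0

def rare(vertices, edges, W, min_size, max_size, t=1):
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    rank = page_rank(adj)
    removed, cores = [], []  # R and the core clusters

    def cluster_component(H):
        if len(H) > max_size:
            top = sorted(H, key=rank.get, reverse=True)[:t]
            removed.extend(top)  # R <- R u T
            for F in components(H - set(top), adj):
                cluster_component(F)
        elif len(H) >= min_size:
            cores.append(set(H))

    for H in components(vertices, adj):
        if len(H) <= max_size:
            cores.append(set(H))
        else:
            cluster_component(H)
    # Re-attach removed vertices to every cluster whose density they improve.
    for v in removed:
        for C in cores:
            if W(C | {v}, edges) > W(C, edges):
                C.add(v)
    return cores

# Two 4-cliques joined through a hub vertex 4 that is adjacent to all other vertices.
clique_a, clique_b = [0, 1, 2, 3], [5, 6, 7, 8]
edges = [(u, v) for grp in (clique_a, clique_b) for i, u in enumerate(grp) for v in grp[i + 1:]]
edges += [(4, v) for v in clique_a + clique_b]
clusters = rare(range(9), edges, density_per_vertex, min_size=3, max_size=5)
print(sorted(sorted(c) for c in clusters))
```

In the example the hub has the highest rank, so it is removed first; the two cliques become core clusters, and the hub is then added back to both, which produces overlapping clusters.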

Run Time of IS and RaRe
[Figure, two panels comparing RaRe and IS: runtime (s) versus number of nodes (×10^4, from 1 to 4.5); runtime (s) versus number of edges (×10^4, from 1 to 5).]