CS 322: (Social and Information) Network Analysis Jure Leskovec Stanford University


1 CS 322: (Social and Information) Network Analysis Jure Leskovec Stanford University

2 Thursday: guest lecture by Lars Backstrom (Facebook/Cornell) on networks & geography

3 Blogs and information epidemics: which are the influential/infectious blogs? Viral marketing: who are the trendsetters? Influential people? Disease spreading: where to place monitoring stations to detect epidemics?

4 Most influential set of size k: the set S of k nodes producing the largest expected cascade size $f(S)$ if activated [Domingos Richardson 01]. Optimization problem: $\max_{S \text{ of size } k} f(S)$. [Figure: example network on nodes a to i with edge probabilities 0.2, 0.3, 0.4.]

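5 f is submodular: for $S \subseteq T$, $f(S \cup \{u\}) - f(S) \ge f(T \cup \{u\}) - f(T)$, i.e. the gain of adding a node to a small set is at least the gain of adding it to a large set. Natural example: given sets $A_1, \dots, A_n$, let $f(S)$ be the size of the union of the $A_i$ with $i \in S$ (the size of the covered area). If $f_1, \dots, f_K$ are submodular and the weights $p_i$ are nonnegative, then $\sum_i p_i f_i$ is submodular.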
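The coverage example can be checked by brute force on a tiny instance. The sketch below is illustrative only: the four sets in `AREAS` are made-up data, and `is_submodular` simply tests the inequality for every pair $S \subseteq T$.

```python
from itertools import combinations

# Hypothetical sets A_1..A_4 (illustrative data, not from the slides).
AREAS = {
    1: {"a", "b", "c"},
    2: {"c", "d"},
    3: {"d", "e", "f"},
    4: {"a", "f"},
}

def coverage(selected):
    """f(S): size of the union of the chosen sets."""
    covered = set()
    for i in selected:
        covered |= AREAS[i]
    return len(covered)

def gain(selected, u):
    """Marginal gain f(S + {u}) - f(S)."""
    return coverage(set(selected) | {u}) - coverage(selected)

def is_submodular(ground):
    """Check gain(S, u) >= gain(T, u) for all S subset of T and u not in T."""
    ground = list(ground)
    subsets = [set(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for T in subsets:
        for S in subsets:
            if not S <= T:
                continue
            for u in ground:
                if u not in T and gain(S, u) < gain(T, u):
                    return False
    return True
```

On this data, adding set 2 gains 2 elements on top of the empty selection, 1 on top of {1}, and 0 on top of {1, 3}, which is the diminishing-returns pattern the slide describes.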

6 Hill climbing: start with $S_0 = \{\}$. For $i = 1, \dots, k$: choose the node $v$ that maximizes $f(S_{i-1} \cup \{v\})$ and let $S_i = S_{i-1} \cup \{v\}$. Hill climbing produces a solution $S$ where $f(S) \ge (1 - 1/e)$ of the optimal value (~63%) when f is monotone and submodular [Nemhauser, Fisher, Wolsey 78].
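A minimal sketch of the hill-climbing loop, assuming f is given as a Python function on lists of nodes. The names `greedy_max`, `SETS`, and `coverage` are hypothetical, and the coverage data is made up for illustration.

```python
# Hypothetical coverage instance: node i "covers" the elements in SETS[i].
SETS = {
    1: {"a", "b", "c"},
    2: {"c", "d"},
    3: {"d", "e", "f"},
    4: {"a", "f"},
}

def coverage(selected):
    """f(S): number of elements covered by the selected nodes."""
    covered = set()
    for i in selected:
        covered |= SETS[i]
    return len(covered)

def greedy_max(ground, f, k):
    """Pick k nodes, each time adding the node with the largest marginal gain."""
    selected = []
    for _ in range(k):
        base = f(selected)
        best, best_gain = None, float("-inf")
        for v in ground:
            if v in selected:
                continue
            g = f(selected + [v]) - base
            if g > best_gain:  # strict: ties go to the earlier node
                best, best_gain = v, g
        if best is None:  # ground set exhausted
            break
        selected.append(best)
    return selected
```

Each round evaluates f once per remaining node, so k rounds cost roughly k times the size of the ground set in function evaluations.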

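7 [Leskovec et al., KDD 07] Lazy hill climbing: keep an ordered list of marginal benefits $b_i$ from the previous iteration; re-evaluate $b_i$ only for the top node; re-sort and prune. This works because, by submodularity, marginal benefits can only shrink as the solution grows, so stale values are upper bounds. [Figure: marginal gains of nodes a to e.] 11/3/2009 Jure Leskovec, Stanford CS322: Network Analysis Part 2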
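One way to sketch the lazy evaluation is with a priority queue in which each entry remembers the solution size at which its gain was computed. This is an illustrative implementation under unit costs, not the authors' code; `lazy_greedy`, `SETS`, and `coverage` are hypothetical names and the data is made up.

```python
import heapq

# Hypothetical coverage instance (illustrative data).
SETS = {
    1: {"a", "b", "c"},
    2: {"c", "d"},
    3: {"d", "e", "f"},
    4: {"a", "f"},
}

def coverage(selected):
    covered = set()
    for i in selected:
        covered |= SETS[i]
    return len(covered)

def lazy_greedy(ground, f, k):
    """Greedy selection that re-evaluates only the current top candidate."""
    selected = []
    current = f(selected)
    # Max-heap via negated gains. idx breaks ties deterministically;
    # stamp is the solution size at which the gain was computed.
    heap = [(-(f([v]) - current), idx, v, 0) for idx, v in enumerate(ground)]
    heapq.heapify(heap)
    evaluations = len(heap)
    while heap and len(selected) < k:
        neg_gain, idx, v, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # Gain is fresh and still on top: by submodularity no stale
            # entry below it can beat it, so select it.
            selected.append(v)
            current -= neg_gain
        else:
            # Stale gain: recompute and push back with a fresh stamp.
            gain = f(selected + [v]) - current
            evaluations += 1
            heapq.heappush(heap, (-gain, idx, v, len(selected)))
    return selected, evaluations
```

On this toy instance with k = 2 the lazy version needs 5 evaluations of f where plain hill climbing would need 7 (4 in the first round, 3 in the second).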


10 [Leskovec et al., KDD 07] Given a real city water distribution network and data on how contaminants spread in the network. Problem posed by the US Environmental Protection Agency.

11 [Leskovec et al., KDD 07] Given a graph G(V,E), a budget B for sensors, and data on how contaminations spread over the network: for each contamination i we know the time T(i,u) when it contaminated node u. Select a subset of nodes A that maximizes the expected reward subject to cost(A) < B. The expectation is taken over contaminations i of the reward for detecting contamination i.

12 [Leskovec et al., KDD 07] Observation: diminishing returns. Placement A = {S_1, S_2}: adding a new sensor helps a lot. Placement B = {S_1, S_2, S_3, S_4}: adding the same new sensor helps very little.

13 [Leskovec et al., KDD 07] Claim: the reward function is submodular. Consider cascade i: $f_i(u_k)$ = set of nodes saved from $u_k$; $f_i(A)$ = size of the union of $f_i(u_k)$ over $u_k \in A$, so $f_i$ is submodular. Global optimization: $f(A) = \sum_i \mathrm{Prob}(i)\, f_i(A)$, so f is submodular. [Figure: cascade i with the regions $f_i(u_1)$ and $f_i(u_2)$ saved from $u_1$ and $u_2$.]

14 [Leskovec et al., KDD 07] Consider hill climbing that ignores sensor cost: repeatedly select the sensor with the highest marginal gain. It always prefers a more expensive sensor with reward r to a cheaper sensor with reward r − ε. For variable cost it can fail arbitrarily badly. Idea: what if we optimize the benefit-cost ratio?

15 [Leskovec et al., KDD 07] Benefit-cost ratio can also fail arbitrarily badly. Consider budget B and 2 locations s_1 and s_2 with costs c(s_1) = ε, c(s_2) = B. Only 1 cascade: R(s_1) = 2ε, R(s_2) = B. Then the benefit-cost ratios are bc(s_1) = 2 and bc(s_2) = 1, so we first select s_1 and then cannot afford s_2. We get reward 2ε instead of B. Now send ε to 0 and the result is arbitrarily bad.
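The counterexample can be reproduced numerically. The sketch below makes two assumptions not stated on the slide: a placement is affordable when its total cost is at most B (so s_2 alone fits), and rewards are simply added, which is harmless here because only one location ends up selected. `ratio_greedy` is a hypothetical helper name.

```python
def ratio_greedy(costs, rewards, budget):
    """Select locations in decreasing benefit-cost ratio while they fit."""
    chosen, spent, total = [], 0.0, 0.0
    by_ratio = sorted(costs, key=lambda s: rewards[s] / costs[s], reverse=True)
    for s in by_ratio:
        if spent + costs[s] <= budget:
            chosen.append(s)
            spent += costs[s]
            total += rewards[s]
    return chosen, total

def slide_example(eps, budget=1.0):
    """Instance from the slide: c(s1)=eps, c(s2)=B, R(s1)=2*eps, R(s2)=B."""
    costs = {"s1": eps, "s2": budget}
    rewards = {"s1": 2 * eps, "s2": budget}
    return ratio_greedy(costs, rewards, budget)

# The reward collected shrinks with eps while the best achievable stays at B.
for eps in (0.1, 0.01, 0.001):
    chosen, total = slide_example(eps)
    print(eps, chosen, total)
```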

16 [Leskovec et al., KDD 07] CELF (cost-effective lazy forward selection), a two-pass greedy algorithm. Solution A: use benefit-cost greedy. Solution B: use unit-cost greedy. Final solution: whichever of A and B has the larger reward, argmax(R(A), R(B)). Theorem: CELF is near optimal; it achieves a ½(1 − 1/e) factor approximation.
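The two-pass idea can be sketched as follows. This illustrative version omits the lazy evaluation for brevity and assumes a placement is affordable when its cost is at most the budget; `budgeted_greedy`, `celf_two_pass`, and the single-cascade reward function are hypothetical names built around the two-location example with ε = 0.01 and B = 1.

```python
def budgeted_greedy(ground, f, cost, budget, use_ratio):
    """Greedy under a budget, scoring by gain/cost or by gain alone."""
    selected, spent = [], 0.0
    remaining = list(ground)
    while True:
        base = f(selected)
        best, best_score = None, 0.0
        for v in remaining:
            if spent + cost[v] > budget:
                continue  # cannot afford v
            gain = f(selected + [v]) - base
            score = gain / cost[v] if use_ratio else gain
            if score > best_score:
                best, best_score = v, score
        if best is None:  # nothing affordable improves the reward
            return selected
        selected.append(best)
        spent += cost[best]
        remaining.remove(best)

def celf_two_pass(ground, f, cost, budget):
    """Run both greedy variants and keep the solution with larger reward."""
    a = budgeted_greedy(ground, f, cost, budget, use_ratio=True)
    b = budgeted_greedy(ground, f, cost, budget, use_ratio=False)
    return max(a, b, key=f)

# Two-location example: one cascade, so the reward of a placement is the
# best single-location reward it contains (an assumption for illustration).
COST = {"s1": 0.01, "s2": 1.0}
REWARD = {"s1": 0.02, "s2": 1.0}

def single_cascade_reward(selected):
    return max((REWARD[s] for s in selected), default=0.0)
```

On this instance the benefit-cost pass returns only s1 and the unit-cost pass returns s2, so taking the better of the two recovers the reward B that the ratio rule alone misses.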

17 [Leskovec et al., KDD 07] Real metropolitan area water network: V = 21,000 nodes, E = 25,000 pipes. Use a cluster of 50 machines for a month to simulate 3.6 million epidemic scenarios (152 GB of epidemic data). By exploiting sparsity we fit it into main memory (16 GB).

18 [Leskovec et al., KDD 07] [Plot: offline bound from the (1 − 1/e) guarantee, online bound on OPT, and the CELF solution.] The online (data-dependent) bound gives a much better estimate of how far the CELF solution is from the unknown optimal solution.
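The slide does not spell out the formula for the online bound. One standard data-dependent bound for a monotone submodular f under unit costs with k picks is f(A) plus the sum of the k largest marginal gains with respect to A; the sketch below assumes that form. `online_bound`, `offline_bound`, and the coverage data are hypothetical.

```python
import math

# Hypothetical coverage instance (illustrative data).
SETS = {
    1: {"a", "b", "c"},
    2: {"c", "d"},
    3: {"d", "e", "f"},
    4: {"a", "f"},
}

def coverage(selected):
    covered = set()
    for i in selected:
        covered |= SETS[i]
    return len(covered)

def online_bound(ground, f, solution, k):
    """Upper bound on OPT: f(A) plus the k largest marginal gains w.r.t. A."""
    solution = list(solution)
    base = f(solution)
    gains = sorted(
        (f(solution + [v]) - base for v in ground if v not in solution),
        reverse=True,
    )
    return base + sum(gains[:k])

def offline_bound(f, greedy_solution):
    """Upper bound on OPT implied by the (1 - 1/e) guarantee for greedy."""
    return f(greedy_solution) / (1 - 1 / math.e)
```

For the greedy solution [1, 3] on this toy data the online bound equals f(A) = 6, certifying that the solution is optimal, while the offline bound only says the optimum is below roughly 9.5.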

19 [Leskovec et al., KDD 07] Heuristic placements perform much worse. One really needs to consider the spread of epidemics.

20 [Leskovec et al., KDD 07]

21 [Leskovec et al., KDD 07] CELF is an order of magnitude faster than hill climbing.

22 I have 10 minutes. Which blogs should I read to be most up to date? Equivalently: who are the most influential bloggers?

23 Want to read things before others do. One choice detects the blue and yellow stories soon but misses red; another detects all stories but late.

24 [Leskovec et al., KDD 07] The online bound is much tighter: 13% instead of 37%. [Plot: old bound, our bound, CELF.]

25 [Leskovec et al., KDD 07] Unit cost: the algorithm picks large popular blogs such as instapundit.com and michellemalkin.com. Variable cost: proportional to the number of posts. We can do much better when considering costs. [Plot: variable cost vs. unit cost.]

26 [Leskovec et al., KDD 07] But then the algorithm picks lots of small blogs that participate in few cascades. We pick the best solution that interpolates between the costs. We can get good solutions with few blogs and few posts. [Plot: each curve represents solutions with the same final reward.]

27 [Leskovec et al., KDD 07] Heuristics perform much worse. One really needs to perform optimization.

28 [Leskovec et al., KDD 07] CELF runs 700 times faster than the simple hill climbing algorithm.

30 [Onnela et al. 07] Real edge strengths in a mobile call graph.

31 Findings so far suggest that network groups are tightly connected. Network communities: sets of nodes with lots of connections inside and few to the outside (the rest of the network). Also called communities, clusters, groups, modules.

32 How to automatically find such densely connected groups of nodes? Ideally such automatically detected clusters would then correspond to real groups, for example communities, clusters, groups, modules.

33 Zachary's Karate club network: observe social ties and rivalries in a university karate club. During his observation, conflicts led the group to split. The split could be explained by a minimum cut in the network.

34 Find micro-markets by partitioning the query x advertiser graph. [Figure axes: query, advertiser.]


35 Many methods. Linear (low-rank) methods: if Gaussian, then a low-rank space is good. Kernel (non-linear) methods: if low-dimensional manifold, then kernels are good. Hierarchical methods: top-down and bottom-up, common in the social sciences. Graph partitioning methods: define an edge-counting metric (conductance, expansion, modularity, etc.) and optimize!

36 What is a good notion that would extract such clusters?
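As an illustration of an edge-counting metric, the sketch below computes conductance under one common definition: the number of edges leaving the group divided by the smaller of the two total degrees (volumes). The slide does not give a formula, so this definition, the helper name `conductance`, and the two-triangle graph are assumptions for illustration.

```python
# Hypothetical undirected graph: two triangles joined by the edge 2-3.
GRAPH = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4, 5},
    4: {3, 5},
    5: {3, 4},
}

def conductance(adj, group):
    """Cut edges divided by the smaller volume of the two sides."""
    group = set(group)
    cut = sum(1 for u in group for v in adj[u] if v not in group)
    vol_in = sum(len(adj[u]) for u in group)
    vol_out = sum(len(adj[u]) for u in adj if u not in group)
    return cut / min(vol_in, vol_out)
```

A lower value means a better-separated group: the triangle {0, 1, 2} scores 1/7, while the pair {0, 1} scores 1/2.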

37 [Girvan Newman PNAS 02] Divisive hierarchical clustering based on the notion of edge betweenness: the number of shortest paths passing through the edge. Remove edges in decreasing order of betweenness.
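A compact sketch of one divisive step: remove the highest-betweenness edge, recompute, and repeat until the graph splits. Recomputing betweenness after every removal is an assumption here (the slide only says edges are removed in decreasing betweenness), and all function names and the example graph are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness of an undirected graph via BFS from every node."""
    score = {}
    for s in adj:
        dist, sigma, order = {s: 0}, {s: 1}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    sigma[w] = 0
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]  # shortest paths reaching w via u
        credit = {v: 1.0 for v in order}
        for w in reversed(order):  # work up the BFS tree
            for v in adj[w]:
                if dist[v] == dist[w] - 1:  # v is a parent of w
                    share = credit[w] * sigma[v] / sigma[w]
                    edge = frozenset((v, w))
                    # Each pair is seen from both endpoints, hence / 2.
                    score[edge] = score.get(edge, 0.0) + share / 2
                    credit[v] += share
    return score

def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman_split(adj):
    """Remove top-betweenness edges until the component count grows."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        score = edge_betweenness(adj)
        if not score:  # no edges left
            break
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Hypothetical graph: two triangles joined by the bridge 2-3.
TWO_TRIANGLES = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
```

On the two-triangle graph the bridge carries all 9 shortest paths between the two sides, so it is removed first and the triangles come apart.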

38 [Girvan Newman PNAS 02]

39 [Newman Girvan PhysRevE 03] Zachary's Karate club: hierarchical decomposition

40 [Newman Girvan PhysRevE 03] Communities in physics collaborations

41 Breadth-first search starting from A: we want to compute the betweenness of paths starting at node A.

42 Count the number of shortest paths from A to all other nodes of the network.
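The counting step can be sketched with a single breadth-first search: a node's path count is the sum of the counts of its neighbours one level closer to the source. The graph below is made up for illustration and `count_shortest_paths` is a hypothetical name.

```python
from collections import deque

# Hypothetical undirected graph: A reaches D by two routes, then D leads to E.
GRAPH = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}

def count_shortest_paths(adj, source):
    """Return (dist, sigma): BFS depth and number of shortest paths per node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:  # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:  # u is one level above w
                sigma[w] += sigma[u]
    return dist, sigma
```

In this graph D is reached by two shortest paths from A (through B and through C), and E inherits both.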

43 Compute betweenness by working up the tree: if there are multiple paths, count them fractionally. Repeat the BFS procedure for each node of the network and add the edge scores. [Figure labels: 1 path to K, split evenly; 1+1 paths to H, split evenly; paths to J, split 1:2.]
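The working-up-the-tree step can be sketched as below. Each node starts with credit 1 (the path ending at it), and passes its credit to its parents in proportion to their shortest-path counts. The graph is made up, the function names are hypothetical, and halving the summed scores assumes an undirected graph in which every pair is counted from both endpoints.

```python
from collections import deque

# Hypothetical undirected graph: A reaches D by two routes, then D leads to E.
GRAPH = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}

def edge_credits(adj, source):
    """Edge scores contributed by shortest paths starting at source."""
    dist, sigma, order = {source: 0}, {source: 1}, []
    queue = deque([source])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    credit = {v: 1.0 for v in order}
    scores = {}
    for w in reversed(order):  # deepest nodes first
        for v in adj[w]:
            if dist[v] == dist[w] - 1:  # v is a parent of w
                # Split w's credit among parents by their path counts.
                share = credit[w] * sigma[v] / sigma[w]
                scores[frozenset((v, w))] = share
                credit[v] += share
    return scores

def total_edge_betweenness(adj):
    """Repeat the BFS from every node and add up the edge scores."""
    total = {}
    for s in adj:
        for edge, share in edge_credits(adj, s).items():
            total[edge] = total.get(edge, 0.0) + share / 2
    return total
```

From A alone, the edges A-B and A-C each receive 2 and the edges B-D, C-D, and D-E each receive 1; after summing over all sources, D-E ends with betweenness 4 because every shortest path to E uses it.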
