Cost effective Outbreak Detection in Networks

Similar documents

Mining Large Graphs. ECML/PKDD 2007 tutorial. Part 2: Diffusion and cascading behavior

Cost-effective Outbreak Detection in Networks

Efficient Influence Maximization in Social Networks

Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations

Tutorial at the ACM SIGKDD conference, Jure Leskovec Stanford University

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Between Friends Sites like Facebook are proving the value of the "social graph."

Finding Influential Neighbors to Maximize Information Diffusion in Twitter

In Search of Influential Event Organizers in Online Social Networks

Dynamics of Large Networks. Jurij Leskovec

Recent Advances in Information Diffusion and Influence Maximization of Complex Social Networks

The tipping point is often referred to in popular literature. First introduced in Schelling 78 and Granovetter 78

Influence Propagation in Social Networks: a Data Mining Perspective

INTEGRATING EVENT DETECTION SYSTEM OPERATING CHARACTERISTICS INTO SENSOR PLACEMENT OPTIMIZATION

A Connectivity-Based Popularity Prediction Approach for Social Networks

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Dynamics of information spread on networks. Kristina Lerman USC Information Sciences Institute

COS 116 The Computational Universe Laboratory 9: Virus and Worm Propagation in Networks

Measuring Propagation in Online Social Networks: The Case of YouTube

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

Ranking User Influence in Healthcare Social Media

Protein Protein Interaction Networks

IC05 Introduction on Networks &Visualization Nov

Maximizing the Spread of Influence through a Social Network

Querying Past and Future in Web Applications

Graph Mining Techniques for Social Media Analysis

ONLINE SOCIAL NETWORK MINING: CURRENT TRENDS AND RESEARCH ISSUES

News Detection in the Blogosphere: Two Approaches Based on Structure and Content Analysis

SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS

Structural and Relational Properties of Social Contact Networks with Applications to Public Health Informatics

Big Graph Processing: Some Background

Big Data Big Deal? Salford Systems

Algorithmic Models for Social Network Phenomena

Learning Influence Probabilities In Social Networks

Mechanisms for Fair Attribution

Graph models for the Web and the Internet. Elias Koutsoupias University of Athens and UCLA. Crete, July 2003

Web (2.0) Mining: Analyzing Social Media

Viral Marketing and the Diffusion of Trends on Social Networks

Validating Network Value of Influencers by means of Explanations

Computer Algorithms. NP-Complete Problems. CISC 4080 Yanjun Li

Community-Aware Prediction of Virality Timing Using Big Data of Social Cascades

A Sarsa based Autonomous Stock Trading Agent

Competitive Influence Maximization in Social Networks

WikiPrints Rendering Enterprise Wiki Content for Printing

Social Media Mining. Data Mining Essentials

The Dynamics of Viral Marketing

Pricing strategies for viral marketing on Social Networks

Chapter 20: Data Analysis

Applied Algorithm Design Lecture 5

Chapter 6: Episode discovery process

Graph Security Testing

Data Mining Techniques for Online Social Network Analysis

Contextual-Bandit Approach to Recommendation Konstantin Knauf

Recognizing Informed Option Trading

Optimizing Display Advertising in Online Social Networks

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs

Capital and Benefit in Social Networks

Differential privacy in health care analytics and medical research An interactive tutorial

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Classification and Prediction

5.1 Bipartite Matching

School of Computer Science Carnegie Mellon Graph Mining, self-similarity and power laws

Fairness in Routing and Load Balancing

Small Maximal Independent Sets and Faster Exact Graph Coloring

Graph Database Proof of Concept Report

Bargaining Solutions in a Social Network

How To Monitor Target Tweets On Twitter

Outline. What is Big data and where they come from? How we deal with Big data?

CMU SCS Large Graph Mining Patterns, Tools and Cascade analysis

An Analytical Research on Social Media System

Association Technique on Prediction of Chronic Diseases Using Apriori Algorithm

Mining the Temporal Dimension of the Information Propagation

Distributed Computing over Communication Networks: Maximal Independent Set

Energy Efficient Monitoring in Sensor Networks

The Basics of Graphical Models

A Review of Data Mining Techniques

Community Detection Proseminar - Elementary Data Mining Techniques by Simon Grätzer

Big Data Mining Services and Knowledge Discovery Applications on Clouds

An Analysis of Verifications in Microblogging Social Networks - Sina Weibo

Online Traffic Generation

Online Estimating the k Central Nodes of a Network

Exponential time algorithms for graph coloring

Distributed forests for MapReduce-based machine learning

Discovering Local Subgroups, with an Application to Fraud Detection

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Scaling Out With Apache Spark. DTL Meeting Slides based on

Composite Social Network for Predicting Mobile Apps Installation

Cheap, Easy, and Massively Effective Viral Marketing in Social Networks: Truth or Fiction?

Compact Representations and Approximations for Compuation in Games

Information Management course

So, how do you pronounce. Jilles Vreeken. Okay, now we can talk. So, what kind of data? binary. * multi-relational

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations

Why Rumors Spread Fast in Social Networks

Using Peer to Peer Dynamic Querying in Grid Information Services

Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios

Continuous Fastest Path Planning in Road Networks by Mining Real-Time Traffic Event Information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Link Prediction in Social Networks

Transcription:

Cost effective Outbreak Detection in Networks Jure Leskovec Joint work with Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie Glance

Diffusion in Social Networks One of the networks is a spread of a disease, the other one is product recommendations Which is which?

Diffusion in Social Networks A fundamental process in social networks: Behaviors that cascade from node to node like an epidemic News, opinions, rumors, fads, urban legends, Word of mouth effects in marketing: rise of new websites, free web based services Virus, disease propagation Change in social priorities: smoking, recycling Saturation news coverage: topic diffusion among bloggers Internet energized political campaigns Cascading failures in financial markets Localized effects: riots, people walking out of a lecture

Empirical Studies of Diffusion Experimental studies of diffusion have long history: Spread of new agricultural practices [Ryan Gross 1943] Adoption of a new hybrid corn between the 259 farmers in Iowa Classical study of diffusion Interpersonal network plays important role in adoption Diffusion is a social process Spread of new medical practices [Coleman et al 1966] Studied the adoption of a new drug between doctors in Illinois Clinical studies and scientific evaluations were not sufficient to convince the doctors It was the social power of peers that led to adoption

Diffusion in Networks Initially some nodes are active Active nodes spread their influence on the other nodes, and so on a d b f e h c g i

Scenario 1: Water Network Given a real city water distribution network And data on how contaminants spread in the network Problem posed by US Environmental Protection Agency On which nodes should we place sensors to efficiently detect the all possible contaminations? S

Scenario 2: Online media Which news websites should one read to detect new stories as quickly as possible?

Cascade Detection: General Problem Given a dynamic process spreading over the network We want to select a set of nodes to detect the process effectively Many other applications: Epidemics Network security

Two Parts to the Problem Reward, e.g.: 1) Minimize time to detection 2) Maximize number of detected propagations 3) Minimize number of infected people Cost (location dependent): Reading big blogs is more time consuming Placing a sensor in a remote location is expensive

Problem Setting S Given a graph G(V,E) and a budget B for sensors and data on how contaminations spread over the network: for each contamination i we know the time T(i, u) when it contaminated node u Select a subset of nodes A that maximize the expected reward subject to cost(a) < B Reward for detecting contamination i

Structure of the Problem Solving the problem exactly is NP hard Set cover (or vertex cover) Observation: Diminishing returns S 1 New sensor: S S 1 S 3 S 2 Adding S helps a lot S 2 Placement A={S 1, S 2 } Adding S helps very S 4 little Placement B={S 1, S 2, S 3, S 4 }

Analysis Analysis: diminishing returns at individual nodes implies diminishing returns at a global level Covered area grows slower and slower with placement size Reward R (covered area) Δ1 Δ1 Number of sensors

An Approximation Result Diminishing returns: Covered area grows slower and slower with placement size R is submodular: if A B then R(A {x}) R(A) R(B {x}) R(B) Theorem [Nehmhauser et al. 78]: If f is a function that is monotone and submodular, then k step hill climbing produces set S for which f(s) is within (1 1/e) of optimal.

Reward functions: Submodularity We must show that R is submodular: Benefit of adding a sensor to a small placement What do we know about submodular functions? 1) If R 1, R 2,, R k are submodular, and a 1, a 2, a k > 0 then a i R i is also submodular 2) Natural example: Sets A 1, A 2,, A n : R(S) = size of union of A i B A Benefit of adding a sensor to a large placement x

Reward Functions are Submodular Objective functions from Battle of Water Sensor Networks competition [Ostfeld et al]: 1) Time to detection (DT) How long does it take to detect a contamination? 2) Detection likelihood (DL) How many contaminations do we detect? 3) Population affected (PA) How many people drank contaminated water? are all submodular

Background: Submodular functions Hill climbing a b c d e reward b Add sensor with highest marginal gain c d a e What do we know about optimizing submodular functions? A hill climbing (i.e., greedy) is near optimal (1-1/e (~63%) of optimal) But 1) this only works for unit cost case (each sensor/location costs the same) 2) Hill climbing algorithm is slow At each iteration we need to re evaluate marginal gains It scales as O( V B)

Towards a New Algorithm Possible algorithm: hill climbing ignoring the cost Repeatedly select sensor with highest marginal gain Ignore sensor cost It always prefers more expensive sensor with reward r to a cheaper sensor with reward r ε For variable cost it can fail arbitrarily badly Idea What if we optimize benefit cost ratio?

Benefit Cost: More Problems Bad news: Optimizing benefit cost ratio can fail arbitrarily badly Example: Given a budget B, consider: 2 locations s 1 and s 2 : Costs: What c(s 1 )=ε,c(s if we 2 )=B take best Only 1 cascade with reward: R(s 1 )=2ε, R(s 2 )=B of both solutions? Then benefit cost ratio is bc(s 1 )=2 and bc(s 2 )=1 So, we first select s 1 and then can not afford s 2 We get reward 2ε instead of B Now send ε to 0 and we get arbitrarily bad

Solution: CELF Algorithm CELF (cost effective lazy forward selection): A two pass greedy algorithm: Set (solution) A: use benefit cost greedy Set (solution) B: use unit cost greedy Final solution: argmax(r(a), R(B)) How far is CELF from (unknown) optimal solution? Theorem: CELF is near optimal CELF achieves ½(1-1/e) factor approximation CELF is much faster than standard hill climbing

How good is the solution? Traditional bound (1 1/e) tells us: How far from optimal are we even before seeing the data and running the algorithm Can we do better? Yes! We develop a new tighter bound. Intuition: Marginal gains are decreasing with the solution size We use this to get tighter bound on the solution

Scaling up CELF algorithm Observation: Submodularity guarantees that marginal benefits decrease with the solution size d reward Idea: exploit submodularity, doing lazy evaluations! (considered by Robertazzi et al. for unit cost case)

Scaling up CELF CELF algorithm hill climbing: Keep an ordered list of marginal benefits b i from previous iteration a b reward b d a Re evaluate b i only for top sensor c d c e Re sort and prune e

Scaling up CELF CELF algorithm hill climbing: Keep an ordered list of marginal benefits b i from previous iteration a b reward b d a Re evaluate b i only for top sensor c d c e Re sort and prune e

Scaling up CELF CELF algorithm hill climbing: Keep an ordered list of marginal benefits b i from previous iteration a d reward b d a Re evaluate b i only for top sensor b e c e Re sort and prune c

Experiments: 2 Case Studies We have real propagation data Blog network: We crawled blogs for 1 year We identified cascades temporal propagation of information Water distribution network: Real city water distribution networks Realistic simulator of water consumption provided by US Environmental Protection Agency

Case study 1: Cascades in Blogs Blog Time stamp Blog post hyperlink We follow hyperlinks in time to obtain cascades (traces of information propagation)

Diffusion in Blogs Posts Blogs Information cascade Time ordered hyperlinks Data Blogs: We crawled 45,000 blogs for 1 year 10 million posts and 350,000 cascades

Q1: Blogs: Solution Quality Our bound is much tighter 13% instead of 37% Old bound Our bound CELF

Q2: Blogs: Cost of a Blog Unit cost: algorithm picks large popular blogs: instapundit.com, michellemalkin.com Variable cost: proportional to the number of posts We can do much better when considering costs Variable cost Unit cost

Q2: Blogs: Cost of a Blog But then algorithm picks lots of small blogs that participate in few cascades We pick best solution that interpolates between the costs We can get good solutions with few blogs and few posts Each curve represents solutions with same final reward

Q4: Blogs: Heuristic Selection Heuristics perform much worse One really needs to perform optimization

Blogs: Generalization to Future We want to generalize well to future (unknown) cascades Limiting selection to bigger blogs improves generalization

Q5: Blogs: Scalability CELF runs 700 times faster than simple hill climbing algorithm

Case study 2: Water Network Real metropolitan area water network V = 21,000 nodes E = 25,000 pipes Use a cluster of 50 machines for a month Simulate 3.6 million epidemic scenarios (152 GB of epidemic data) By exploiting sparsity we fit it into main memory (16GB)

Water: Solution Quality Old bound Our bound CELF The new bound gives much better estimate of solution quality

Water: Heuristic Placement Heuristics placements perform much worse One really needs to consider the spread of epidemics

Water: Placement Visualization Different reward functions give different sensor placements Population affected Detection likelihood

Water: Algorithm Scalability CELF is an order of magnitude faster than hill climbing

Results of BWSN competition Battle of Water Sensor Networks competition [Ostfeld et al]: count number of non dominated solutions Author #non dominated (out of 30) CELF 26 Berry et. al. 21 Dorini et. al. 20 Wu and Walski 19 Ostfeld et al 14 Propato et. al. 12 Eliades et. al. 11 Huang et. al. 7 Guan et. al. 4 Ghimire et. al. 3 Trachtman 2 Gueli 2 Preis and Ostfeld 1 39

Other results Many more details: Fractional selection of the blogs Generalization to future unseen cascades Multi criterion optimization We show that triggering model of Kempe et al is a special case of out setting 40

Conclusion General methodology for selecting nodes to detect outbreaks Results: Submodularity observation Variable cost algorithm with optimality guarantee Tighter bound Significant speed up (700 times) Evaluation on large real datasets (150GB) CELF won consistently 41

Conclusion and Connections Diffusion of Topics How news cascade through on line networks Do we need new notions of rank? Incentives and Diffusion Using diffusion in the design of on line systems Connections to game theory When will one product overtake the other?

Further Connections Diffusion of topics [Gruhl et al 04, Adar et al 04]: News stories cascade through networks of bloggers How do we track stories and rank news sources? Recommendation incentive networks [Leskovec Adamic Huberman 07]: How much reward is needed to make the product workof mouth success? Query incentive networks [Kleinberg Raghavan 05]: Pose a request to neighbors; offer reward for answer Neighbors can pass on request by offering (smaller) reward How much reward is needed to produce an answer?

Topic Diffusion: what blogs to read? News and discussion spreads via diffusion: Political cascades are different than technological cascades Suggests new ranking measures for blogs

References D. Kempe, J. Kleinberg, E. Tardos. Maximizing the Spread of Influence through a Social Network. ACM KDD, 2003. Jure Leskovec, Lada Adamic, Bernardo Huberman. The Dynamics of Viral Marketing. ACM TWEB, 2007. Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, Matthew Hurst. Cascading Behavior in Large Blog Graphs. SIAM Data Mining, 2007. Jure Leskovec, Ajit Singh, Jon Kleinberg. Patterns of Influence in a Recommendation Network. PAKDD, 2006. Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, Natalie Glance. Cost effective Outbreak Detection in Networks. ACM KDD, 2007.