Real-time Targeted Influence Maximization for Online Advertisements




Yuchen Li, Dongxiang Zhang, Kian-Lee Tan
Department of Computer Science, School of Computing, National University of Singapore
{liyuchen,zhangdo,tankl}@comp.nus.edu.sg

ABSTRACT

Advertising in social networks has become a multi-billion-dollar industry. A main challenge is to identify key influencers who can effectively contribute to the dissemination of information. Although the influence maximization problem, which finds a seed set of k most influential users based on certain propagation models, has been well studied, it is not target-aware and cannot be directly applied to online advertising. In this paper, we propose a new problem, named Keyword-Based Targeted Influence Maximization (KB-TIM), to find a seed set that maximizes the expected influence over users who are relevant to a given advertisement. To solve the problem, we propose a sampling technique based on weighted reverse influence set and achieve an approximation ratio of (1 − 1/e − ε). To meet the instant-speed requirement, we propose two disk-based solutions that improve the query processing time by two orders of magnitude over the state-of-the-art solutions, while keeping the theoretical bound. Experiments conducted on two real social networks confirm our theoretical findings as well as the efficiency. Given an advertisement with 5 keywords, it takes only 2 seconds to find the most influential users in a social network with billions of edges.

1. INTRODUCTION

Many companies have started to use social networks as their main advertising tool to precisely target customers. This trend brought Facebook a total advertising revenue of 4.28 billion dollars in 2012¹. Influence Maximization (IM) is a key algorithmic problem behind online viral marketing. By the word-of-mouth propagation effect among friends, it finds a seed set of k users to maximize the expected influence among all the users in a social network. Since the problem is NP-hard, Kempe et al.
[5] first proposed a greedy algorithm to solve the IM problem, which returns a seed set with a (1 − 1/e − ε) approximation ratio to the optimal solution. However, the greedy solution still takes a prohibitively long time to finish. To address the efficiency issue, a state-of-the-art solution, named Reverse Influence Set (RIS), was proposed to support the (1 − 1/e − ε) approximation ratio [2, 2]. The method uses random sampling and can support the various propagation models that have been proposed. Despite the performance improvement, it still takes nearly an hour to find the most influential users in a social network with millions of nodes.

There have been some efforts to extend the influence maximization problem to topic-aware IM [, 3, 4, 8, 3] so as to support online advertisements. The propagation model is required to take into account the influence probability based on different topics. However, all of the proposed techniques suffer from the efficiency issue. Their models require offline training of the propagation probability w.r.t. different topics, which does not scale with the graph size and the number of topics. The most recent work [3] was reported to handle a graph with 4 million vertices and 10 topics only. In addition, the proposed solutions are all heuristic and none of them provides a theoretical guarantee on the quality of the results.

¹ http://www.engadget.com/2013/01/30/facebook-2012-q4-earnings/

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing info@vldb.org. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 10. Copyright 2015 VLDB Endowment 2150-8097/15/06.
To bridge the gap, we propose a new Keyword-Based Targeted Influence Maximization (KB-TIM) query for online targeted advertising. The query finds a seed set that maximizes the expected influence over users who are relevant to a given advertisement. In other words, the expected influence only incorporates those users interested in the advertisement as targeted customers. To solve the problem, we propose a weighted sampling technique based on RIS and achieve an approximation ratio of (1 − 1/e − ε). However, the method needs to generate hundreds of thousands of random sample sets to guarantee the theoretical bound, which incurs intensive computation overhead. To meet the real-time requirement, we propose two disk-based solutions, RR and IRR, that improve the query processing performance. The idea is to push the sampling procedure from online to offline and build an index on the random sample sets for each keyword. During query processing, RR directly loads all the related random sample sets into memory and uses the greedy algorithm on the maximum coverage problem [22] to find the top-k seed users. IRR further improves over RR by incrementally loading the most promising sets into memory and adopting a top-k aggregation strategy to save computation costs.

Our contributions can be summarized as follows:

1. We propose a KB-TIM query to support scalable social IM in online advertising platforms.

2. We propose a weighted sampling technique, i.e. WRIS, based on RIS and show that it has an approximation ratio of (1 − 1/e − ε) to the optimal solution to a KB-TIM query.

3. To meet the instant-speed requirement, we propose two disk-based solutions that improve the running time by two orders of magnitude over WRIS, while preserving the theoretical bound.

4. We evaluate the performance on real social networks with billions of edges and hundreds of topics. The experimental results confirm our theoretical findings and show that the two disk-based methods significantly outperform the weighted online sampling.

We present the preliminaries in Section 2 and the problem definition of the KB-TIM query in Section 3. We extend the RIS method and propose WRIS in Section 3.2. Then two disk-based methods, RR and IRR, are presented in Sections 4 and 5 respectively. We report the experimental results in Section 6. Subsequently, various IM techniques are reviewed in Section 7. Finally, we conclude the paper in Section 8.

2. PRELIMINARY

In this section, we review the classic influence maximization problem as well as the state-of-the-art solutions to the problem.

2.1 Influence Maximization (IM)

Consider a directed graph G = (V, E) where vertices in V are users and edges in E capture the friendships or follow relationships in a social network. To model the propagation process, a number of methods such as the independent cascade (IC) model [9], the linear threshold (LT) model [] and the general triggering model [5] have been proposed. In this paper, we adopt the IC model because it has been widely used in previous work [2, 5, 6, 9]². Under the IC model, each directed edge e = (u, v) is associated with an influence probability p(e) to measure the social impact from user u to user v. This probability is normally set to $p(e) = 1/N_v$, where $N_v$ is the in-degree of v³. Each user is either in an active state or an inactive state, and an active user can activate his inactive neighbors with probability p(e).
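As a small illustration of the in-degree setting mentioned above, the following sketch assigns p(e) = 1/N_v to every edge. The function name `in_degree_probabilities` and the edge-list input format are assumptions made for this example; they do not come from the paper, and the sketch assumes the edge list contains no duplicate edges.

```python
from collections import defaultdict


def in_degree_probabilities(edges):
    """Return {(u, v): 1 / in-degree(v)} for a list of distinct directed edges.

    Hypothetical helper: the paper only states the rule p(e) = 1/N_v.
    """
    in_degree = defaultdict(int)
    for _, v in edges:
        in_degree[v] += 1  # count incoming edges of v
    return {(u, v): 1.0 / in_degree[v] for u, v in edges}


if __name__ == "__main__":
    toy_edges = [("a", "c"), ("b", "c"), ("a", "b")]
    for edge, p in sorted(in_degree_probabilities(toy_edges).items()):
        print(edge, p)
```

With this rule, the incoming probabilities of every vertex that has at least one in-neighbor sum to one.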
Once a user is activated, his (active) state remains unchanged. Initially, a set of seed users S is selected to influence other people and their states are set to be active at time step 0. Then, each user who becomes active at time step i tries to activate his inactive neighbors at time step i + 1 with probability p(e). Note that each user u has only one chance to activate his neighbors. In other words, we flip a coin with P(head) = p(e) and if head occurs, v is activated by u. Otherwise, v can only be activated by incoming neighbors other than u. This procedure terminates when no more users can be activated. Let I(S) denote the set of nodes activated by the seeds S in an instance of the influence propagation process. Intuitively, the IM problem finds a seed set S with k users to maximize the expected number of users influenced by S over the social network, denoted by E[I(S)], and can be formally defined as follows:

Definition 1 (IM). Let $OPT_k$ denote the maximum expected influence spread of any node set with size k, i.e. $OPT_k = \max_{S \subseteq V, |S| = k} E[I(S)]$. The IM problem finds an optimal seed set $S^*$ with k users such that $E[I(S^*)] = OPT_k$.

Due to the linearity of expectation, E[I(S)] can also be expressed as $E[I(S)] = \sum_{v \in V} p(S \to v)$, where $p(S \to v)$ is the probability with which a user v is activated by seed set S.

² Note that the methods proposed in this paper can support the LT and general triggering models as well.
³ Our proposed methods are also independent of how p(e) is set.

[Figure 1: The social network adopted in the paper. Each node carries a weighted topic profile (over topics such as music, book, sport, car and travel) and each edge carries an influence probability.]

The following is an example of how to compute the expected influence given a seed set and how to retrieve the optimal influential seed set on a social graph.

Example 1. In Figure 1, we have a social network with 7 users {a, b, c, d, e, f, g} and the edge value is the influence probability from one user to his neighbor.
Let us assume that, at time step 0, the seed set S contains two nodes e and f, i.e. S = {e, f}. Suppose after flipping the coin in step 1, nodes (a, c) are activated by e and node d is activated by f, but in step 2, node a fails to activate b. Then, the process terminates because no more nodes can be further activated and I(S) = {a, c, d, e, f}. Although evaluating $p(S \to v)$ has been proven to be #P-hard in [5], the calculation of the probability is feasible in this small example graph. Take $p(\{e, g\} \to b)$ as an example: since each activated node has only one chance to activate its neighbors, the probability of b being activated by {e, g} is $p(\{e, g\} \to b) = 1 - [1 - p(e \to b)] \cdot [1 - p(g \to b)] = 0.75$. We can then find that, among all possible seed sets with two nodes, the optimal initial set is $S^* = \{e, g\}$ for the IM problem and $E[I(S^*)] = \sum_{v \in V} p(S^* \to v) = 1 + 0.75 + 0.6875 + 0.375 + 1 + 0 + 1 = 4.8125$.

2.2 Reverse Influence Set (RIS)

The state-of-the-art solutions to the IM problem [2, 2] are based on a sampling technique called Reverse Influence Set (RIS). To facilitate the understanding of RIS, we first introduce the concepts of the Reverse Reachable (RR) Set and the Random RR Set [2, 2]:

Definition 2. Let G' denote a sub-graph of G generated by removing any edge $e \in E$ with probability 1 − p(e). The Reverse Reachable (RR) Set for vertex v contains all the vertices in G' that can reach v. Then, a random RR set

is generated on an instance of G' sampled from G, where v is randomly picked from V.

Intuitively, the random RR set generated from a random user v contains the users who can influence v. When multiple random RR sets are built on different random users, a user u who has a great impact on other people will have a higher probability to appear in these random RR sets. Similarly, if S covers most of the RR sets, S is likely to maximize E[I(S)]. Based on this idea, the query processing framework of RIS works as follows:

1. Generate θ random RR sets from G.
2. Use the standard greedy algorithm on the maximum coverage problem [22] to select k users to cover the maximum number of RR sets generated above. The result set S is a (1 − 1/e − ε)-approximate solution for the IM problem.

In [2], it was proved that when θ is sufficiently large, RIS returns near-optimal results with high probability:

Theorem 1 ([2]). If $\theta \ge (8 + 2\epsilon) \cdot |V| \cdot \frac{\ln |V| + \ln \binom{|V|}{k} + \ln 2}{OPT_k \cdot \epsilon^2}$, RIS returns a (1 − 1/e − ε)-approximate solution with at least $1 - |V|^{-1}$ probability.

The proof sketch in [2] is summarized as follows:

S1: Let $F_\theta(S)$ denote the number of sampled RR sets covered by a seed set S and prove $E[\frac{F_\theta(S)}{\theta} \cdot |V|] = E[I(S)]$.

S2: Based on the property in S1, show that if $\theta \ge (8 + 2\epsilon) \cdot |V| \cdot \frac{\ln |V| + \ln \binom{|V|}{k} + \ln 2}{OPT_k \cdot \epsilon^2}$, then $\left| \frac{F_\theta(S)}{\theta} \cdot |V| - E[I(S)] \right| < \frac{\epsilon}{2} \cdot OPT_k$ holds with probability $1 - |V|^{-1}$ simultaneously for all S s.t. |S| = k.

S3: Utilize the greedy algorithm for the maximum coverage problem, which produces a (1 − 1/e)-approximate solution.

S4: By combining the two approximation ratios ε/2 and (1 − 1/e) in S2 and S3, show that the final approximation ratio is (1 − 1/e − ε) with at least $1 - |V|^{-1}$ probability.

Example 2. Suppose k = 2, θ = 4 and four random RR sets $G_d = \{b, d, f\}$, $G_e = \{e\}$, $G_d = \{d, f\}$ and $G_b = \{a, b, e\}$ are generated from the social graph in Figure 1 (d is sampled twice). Then {e, f} will be selected as the seed set as {e, f} intersects (or "covers") all the random RR sets generated.

3. TARGETED INFLUENCE MAXIMIZATION

In this section, we define the Keyword-Based Targeted Influence Maximization (KB-TIM) query and propose a baseline solution.

3.1 Problem Definition

To model a social network as an online advertisement platform, we extend each node v in G to be associated with a user profile represented by a weighted term vector. Each advertisement is also modeled as a weighted term vector, and the impact of an advertisement on an end user is calculated as the similarity between the two term vectors. If a user's profile does not contain any keyword in the advertisement, we consider the user as not being impacted. Our goal is to find k seeds in the social network that generate the maximum impact based on the influence propagation model.

Formally, let G = (V, E, T) be a social network for online advertisements, where T is a universal topic space to model user interests. Each user is associated with a weighted term vector to capture the preference in different topics. The term vector can be generated by aggregating all the social activities, such as new posts, likes and shares, into a single document and then using existing topic modeling techniques [2] to map explicit keywords into the latent topic space T. Let $tf_{w,v}$ denote the preference weight between a user v and a topic w, and $idf_w$ denote the inverse document frequency for a topic w. By applying the tf-idf model [23] on the topic space, the impact of an advertisement with a keyword set $Q.T \subseteq T$⁴ on a user v is defined as:

$\phi(v, Q) = \sum_{w \in Q.T} tf_{w,v} \cdot idf_w$    (1)

Then, the impact of k selected seeds w.r.t. an advertisement can be calculated by accumulating the impact on each activated user when the influence propagation terminates. Let $I^Q(S)$ denote the influence score of I(S) w.r.t. the query keywords Q.T, i.e. $I^Q(S) = \sum_{v \in I(S)} \phi(v, Q)$. We have

$E[I^Q(S)] = E[\sum_{v \in I(S)} \phi(v, Q)] = \sum_{v \in V} p(S \to v) \cdot \phi(v, Q)$    (2)

To this end, we define our keyword-based targeted influence maximization problem as follows:

Definition 3 (Keyword-Based Targeted Influence Maximization (KB-TIM)).
A KB-TIM query Q on a social graph G is associated with a tuple (Q.T, Q.k), where $Q.T \subseteq T$ is the advertisement keyword set and Q.k is the number of seed users. Let $OPT^{Q.T}_{Q.k}$ denote the maximum expected influence spread of any size-Q.k node set w.r.t. the weighting function for Q.T, i.e. $OPT^{Q.T}_{Q.k} = \max_{S \subseteq V, |S| = Q.k} E[I^{Q.T}(S)]$. A KB-TIM query finds Q.k seed users to achieve $OPT^{Q.T}_{Q.k}$.

Example 3. In Figure 1, each node is associated with a user profile depicting the user's preferences on different topics. Suppose a KB-TIM query is Q = ({music}, 2). The optimal seed set $S^*$ selected by Q should be {b, e}, and the expected impact is $E[I^{\{music\}}(S^*)] = \sum_{v \in V} p(S^* \to v) \cdot \phi(v, \{music\})$. The result is different from the general IM problem in Example 2 as we are now considering targeted influence maximization.

3.2 Weighted RIS Sampling

To facilitate our presentation, all frequently used notations are listed in Table 1. In the traditional IM problem, all the users are treated as candidates to be influenced. However, in our KB-TIM query, we only target the users associated with the query keywords in the advertisement. The original uniform sampling technique cannot work in the new scenario because condition S1 in the proof sketch of Theorem 1 does not hold, i.e., $E[\frac{F_\theta(S)}{\theta} \cdot |V|] \ne E[I^Q(S)]$. Therefore, we need to find a new unbiased estimator for $E[I^Q(S)]$. In this section, we propose a weighted RIS (WRIS) sampling technique for KB-TIM query processing.

⁴ In this paper, we use topic and keyword interchangeably.
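To make the target quantity concrete, the sketch below estimates $E[I^Q(S)]$ from Eqn. 2 by brute-force Monte Carlo simulation of the IC model, scoring each activated user with the tf-idf relevance of Eqn. 1. It is only an illustrative baseline, not the method proposed in the paper. The function names, the adjacency-list format `out_edges[u] = [(v, p), ...]` and the toy data are assumptions made for this example.

```python
import random


def relevance(profile, idf, keywords):
    """phi(v, Q) as in Eqn. 1: sum of tf * idf over the query keywords."""
    return sum(profile.get(w, 0.0) * idf.get(w, 0.0) for w in keywords)


def simulate_ic(out_edges, seeds, rng):
    """Run one independent cascade and return the activated set I(S)."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly_active = []
        for u in frontier:
            for v, p in out_edges.get(u, []):
                # u gets exactly one chance to activate each inactive neighbor
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active


def expected_targeted_influence(out_edges, profiles, idf, keywords, seeds,
                                runs=2000, seed=0):
    """Monte Carlo estimate of E[I^Q(S)] (Eqn. 2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        activated = simulate_ic(out_edges, seeds, rng)
        total += sum(relevance(profiles.get(v, {}), idf, keywords)
                     for v in activated)
    return total / runs


if __name__ == "__main__":
    # Hypothetical toy data, unrelated to Figure 1.
    out_edges = {"x": [("y", 0.5)], "y": [("z", 0.5)]}
    profiles = {"x": {"music": 0.5}, "y": {"music": 1.0}, "z": {"book": 1.0}}
    idf = {"music": 2.0, "book": 1.0}
    print(expected_targeted_influence(out_edges, profiles, idf, ["music"], ["x"]))
```

Users whose profiles share no keyword with the query contribute zero to the score even when they are activated, which is the difference from the untargeted E[I(S)].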

Table 1: Frequent notations used across the paper

Notation : Meaning
G, V, E, T : the social network, the vertex set, the edge set and the topic space respectively.
Q(Q.T, Q.k) : the KB-TIM query Q. Q.T are the topics queried and Q.k is the size of the seed set. We use Q to denote Q.T if there is no ambiguity in the context.
$OPT^{Q.T}_{Q.k}$ : the maximum expected spread among all seed sets with size Q.k for query keywords Q.T.
$F_\theta(S)$ : the number of RR sets covered by seed set S.
$tf_{w,v}$ : the preference weight between a user v and a topic w.
$idf_w$ : the inverse document frequency for a topic w.
$\phi_w$ : sum of the relevance scores among all users for a given topic w, i.e., $\sum_{v \in V} tf_{w,v} \cdot idf_w$.
$\phi(v, Q)$ : the relevance score of a KB-TIM query Q to a user v, i.e., $\sum_{w \in Q.T} tf_{w,v} \cdot idf_w$.
$\phi_Q$ : sum of the relevance scores among all users to Q, i.e., $\sum_{v \in V} \phi(v, Q)$.
θ : the number of RR sets needed to process Q using WRIS.
$p_w$ : the proportion of RR sets w.r.t. a topic w among all RR sets needed to process Q, i.e. $\phi_w / \phi_Q$.
$p_s(v, w)$ : the probability to sample a user v w.r.t. a topic w in discriminative WRIS sampling.
I(S) : an instance of the influence spread of a seed set S.
$I^Q(S)$ : influence score of I(S) w.r.t. query keywords Q.T, $I^Q(S) = \sum_{v \in I(S)} \phi(v, Q)$.

To differentiate the sampling probability of targeted users from non-targeted users, we define the sampling probability for a user v w.r.t. a KB-TIM query Q as follows:

$p_s(v, Q) = \frac{\phi(v, Q)}{\sum_{v' \in V} \phi(v', Q)} = \frac{\phi(v, Q)}{\phi_Q}$    (3)

where we denote $\phi_Q = \sum_{v \in V} \phi(v, Q)$. Intuitively, the users are sampled based on their relevance to the query. Users whose profiles are more relevant have a higher probability to be sampled. Under the weighted sampling scheme, we propose a new unbiased estimator for computing the expected influence score:

Lemma 1. $E[\frac{F_\theta(S)}{\theta} \cdot \phi_Q] = E[I^Q(S)]$

Proof. Let $R_v$ be the RR set obtained by first sampling v with probability $p_s(v, Q) = \frac{\phi(v, Q)}{\phi_Q}$ and then sampling an RR set for the given v. It follows that:

$E[\frac{F_\theta(S)}{\theta} \cdot \phi_Q] = \sum_{v \in V} \phi(v, Q) \cdot p(R_v \cap S \ne \emptyset)$    (4)

$p(R_v \cap S \ne \emptyset) = p(S \to v)$, where $p(S \to v)$ is the probability that S activates v on G.
This leads to:

$E[\frac{F_\theta(S)}{\theta} \cdot \phi_Q] = \sum_{v \in V} \phi(v, Q) \cdot p(S \to v) = E[I^Q(S)]$    (5)

This means $\frac{F_\theta(S)}{\theta} \cdot \phi_Q$ is an unbiased estimator of the expected influence score of any seed set S for a KB-TIM query.

With the sampling scheme, we propose our WRIS method as follows:

1. Sample θ vertices from G, with a probability of $p_s(v, Q)$ (in Eqn. 3) for any vertex v.
2. For each vertex v sampled, sample an RR set for v.
3. Follow step 2 of RIS.

The steps of WRIS are similar to the RIS algorithm, except that we need to determine a θ value as the minimum number of random RR sets to guarantee that WRIS can also achieve a theoretical bound like Theorem 1 for RIS. Theorem 2 gives a sufficient lower bound for θ such that our WRIS algorithm is a (1 − 1/e − ε)-approximate solution with high probability.

Theorem 2. If θ satisfies:

$\theta \ge (8 + 2\epsilon) \cdot \phi_Q \cdot \frac{\ln |V| + \ln \binom{|V|}{Q.k} + \ln 2}{OPT^{Q.T}_{Q.k} \cdot \epsilon^2}$    (6)

the weighted RIS method returns a (1 − 1/e − ε)-approximate solution with at least $1 - |V|^{-1}$ probability.

With the revised sampling technique in Lemma 1, the proof is similar to that of Theorem 1 (S1-S4). A complete version is presented in [20].

4. DISK-BASED RR INDEX FOR KB-TIM QUERY PROCESSING

Existing methods [2, 2] solve the IM problem by sampling random RR sets online. This takes a prohibitively long time to finish on social networks with millions of users, which hinders interactive market analysis and decision making. To achieve real-time response, we move the sampling procedure offline and propose a disk-based index to store the pre-computed sampling sets. In this section, we first introduce how to build an RR index for each keyword and then present our query processing algorithm based on the RR index.

4.1 Discriminative WRIS Sampling

We note that the sampling probability $p_s(v, Q)$ in Eqn. 3 relies on the query and cannot be determined in advance. Hence, we cannot pre-compute the random RR sets using $p_s(v, Q)$ before the query is known.
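For concreteness, the three online WRIS steps listed earlier in this section (weighted root sampling, RR set generation, greedy coverage) might be sketched as follows. This is a minimal in-memory sketch under stated assumptions, not the authors' implementation: θ is passed in directly rather than derived from Eqn. 6, the graph is given as `in_edges[v] = [(u, p), ...]`, at least one user has a positive relevance score, and all function names are hypothetical.

```python
import random


def sample_rr_set(in_edges, root, rng):
    """Reverse traversal from root; edge (u, v) survives with probability p."""
    rr = {root}
    stack = [root]
    while stack:
        v = stack.pop()
        for u, p in in_edges.get(v, []):
            if u not in rr and rng.random() < p:
                rr.add(u)
                stack.append(u)
    return rr


def greedy_max_coverage(rr_sets, k):
    """Pick up to k users covering the most RR sets; returns (seeds, #covered)."""
    inverted = {}
    for i, rr in enumerate(rr_sets):
        for u in rr:
            inverted.setdefault(u, set()).add(i)
    covered, seeds = set(), []
    for _ in range(k):
        # sorted() only makes tie-breaking deterministic
        best = max(sorted(inverted),
                   key=lambda u: len(inverted[u] - covered), default=None)
        if best is None or not (inverted[best] - covered):
            break
        seeds.append(best)
        covered |= inverted[best]
    return seeds, len(covered)


def wris(in_edges, phi, k, theta, seed=0):
    """phi maps each user v to phi(v, Q); returns (seeds, estimated E[I^Q(S)])."""
    rng = random.Random(seed)
    users = sorted(v for v, score in phi.items() if score > 0)
    weights = [phi[v] for v in users]
    phi_q = sum(weights)
    # Step 1: roots drawn with p_s(v, Q) = phi(v, Q) / phi_Q (Eqn. 3)
    roots = rng.choices(users, weights=weights, k=theta)
    # Step 2: one RR set per sampled root
    rr_sets = [sample_rr_set(in_edges, r, rng) for r in roots]
    # Step 3: greedy maximum coverage, then the estimator of Lemma 1
    seeds, covered = greedy_max_coverage(rr_sets, k)
    return seeds, covered / theta * phi_q


if __name__ == "__main__":
    in_edges = {"b": [("a", 0.5)], "c": [("a", 0.5), ("b", 0.5)]}
    phi = {"a": 0.0, "b": 1.0, "c": 2.0}
    print(wris(in_edges, phi, k=1, theta=1000))
```

Because the roots are drawn from a distribution that depends on `phi`, and `phi` depends on the query keywords, none of this sampling can be reused across queries as written.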
To support offline sampling, we propose discriminative WRIS sampling by rewriting $p_s(v, Q)$ as

$p_s(v, Q) = \frac{\sum_{w \in Q.T} tf_{w,v} \cdot idf_w}{\phi_Q} = \sum_{w \in Q.T} \frac{tf_{w,v}}{\sum_{v' \in V} tf_{w,v'}} \cdot \frac{\sum_{v' \in V} tf_{w,v'} \cdot idf_w}{\phi_Q}$

Let $p_s(v, w) = \frac{tf_{w,v}}{\sum_{v' \in V} tf_{w,v'}}$ and $p_w = \frac{\sum_{v' \in V} tf_{w,v'} \cdot idf_w}{\phi_Q} = \frac{\phi_w}{\phi_Q}$. We have

$p_s(v, Q) = \sum_{w \in Q.T} p_s(v, w) \cdot p_w$    (7)

This implies that we can sample the RR sets for each keyword w and store them on disk as an offline step. When a query Q is issued, the RR sets associated with the relevant keywords are loaded and merged. The random RR sets generated by our discriminative WRIS sampling still guarantee

the same theoretical bound as in Theorem 2 based on the following lemma:

Lemma 2. Given a query Q, let $\theta^Q_w$ be the number of RR sets sampled with a probability of $p_s(v, w)$ w.r.t. each vertex v given a query keyword w. Then $\theta^Q$ is the total number of RR sets and $\theta^Q = \sum_{w \in Q.T} \theta^Q_w$. If $\theta^Q \ge \theta$ and $\theta^Q_w = \theta^Q \cdot p_w$, we can achieve the same theoretical bound as in Theorem 2 for the discriminative WRIS sampling.

Proof. On one hand, as the samples are taken independently, the expected fraction of samples w.r.t. v among all the samples using WRIS should be $p_s(v, Q)$ according to Eqn. 7. On the other hand, the discriminative sampling is expected to draw $\sum_{w \in Q.T} [(\theta^Q \cdot p_w) \cdot p_s(v, w)]$ samples w.r.t. each user v if we sample $\theta^Q$ RR sets. Therefore the expected fraction of RR sets w.r.t. v among the $\theta^Q$ samples is $\sum_{w \in Q.T} [(\theta^Q \cdot p_w) \cdot p_s(v, w)] / \theta^Q = p_s(v, Q)$, which is the same as in the WRIS method. Besides, we take at least as many samples as the threshold specified in Eqn. 6 for WRIS. Thus we conclude that the discriminative sampling achieves the same theoretical accuracy as the WRIS method in Theorem 2.

Algorithm 1: Build(Set T)
1 Index (R, L) ← (∅, ∅)
2 for w ∈ T do
3     Compute $\hat{\theta}_w$
4     $R_w$ ← generate $\hat{\theta}_w$ RR sets with probability $p_s(v, w)$
5     $L_w$ ← inverse mapping of $R_w$
6 Store (R, L)

[Figure 2: Example of basic RR index structures for keywords music and book.]

4.2 RR Index Construction

As we want to pre-compute the RR sets w.r.t. each keyword $w \in Q.T$ in an offline step and merge the relevant RR sets for query processing, we need to determine how many RR sets need to be built for the index w.r.t. each w. We define K to be a system-specified parameter such that Q.k ≤ K.
This is because Q will not be interesting if Q.k is large, since the purpose of IM is to influence a large group of people through a small seed set for budgeted advertisement. Since $\theta \cdot p_w$ is a value dependent on Q, we cannot set $\theta^Q_w = \theta \cdot p_w$ in the offline sampling. Hence, we need to find a $\theta_w$ value that is independent of Q and satisfies $\theta_w \ge \theta \cdot p_w$, but is as small as possible to save the time and space cost of index construction. Following this intuition, we have the following lemma:

Lemma 3. For any keyword w, if we choose

$\hat{\theta}_w = (8 + 2\epsilon) \cdot (\sum_{v \in V} tf_{w,v}) \cdot \frac{\ln |V| + \ln \binom{|V|}{K} + \ln 2}{OPT^{\{w\}}_1 \cdot \epsilon^2}$    (8)

we have $\hat{\theta}_w \ge \theta \cdot p_w$.

Proof. According to Eqn. 6 and the definition of $p_w$, we have:

$\theta \cdot p_w = (8 + 2\epsilon) \cdot (\sum_{v \in V} tf_{w,v}) \cdot \frac{\ln |V| + \ln \binom{|V|}{Q.k} + \ln 2}{(OPT^{Q.T}_{Q.k} / idf_w) \cdot \epsilon^2}$    (9)

Now let us prove $OPT^{Q.T}_{Q.k} / idf_w \ge OPT^{\{w\}}_{Q.k}$. Let $S_w$ be the seed set that achieves the influence of $OPT^{\{w\}}_{Q.k}$ w.r.t. the weight of w. We have $E[I^{Q.T}(S_w)] \le OPT^{Q.T}_{Q.k}$ according to the definition of $OPT^{Q.T}_{Q.k}$. Then:

$\frac{OPT^{Q.T}_{Q.k}}{idf_w} \ge \frac{E[I^{Q.T}(S_w)]}{idf_w} = \frac{1}{idf_w} E[\sum_{v \in I(S_w)} \sum_{w' \in Q.T} tf_{w',v} \cdot idf_{w'}]$
$= \frac{1}{idf_w} E[\sum_{w' \in Q.T \setminus \{w\}} \sum_{v \in I(S_w)} tf_{w',v} \cdot idf_{w'}] + E[\sum_{v \in I(S_w)} tf_{w,v}]$
$\ge 0 + OPT^{\{w\}}_{Q.k}$

Since the influence spread is monotonic w.r.t. the size of the seed set [5], $OPT^{Q.T}_{Q.k} / idf_w \ge OPT^{\{w\}}_{Q.k} \ge OPT^{\{w\}}_1$. In addition, as Q.k ≤ K, $\ln \binom{|V|}{Q.k} \le \ln \binom{|V|}{K}$ when K ≤ |V|/2. Thus we can conclude that $\hat{\theta}_w \ge \theta \cdot p_w$.

The index construction procedure based on the offline sampling is shown in Algorithm 1. For each keyword w, we build an index with two components $(R_w, L_w)$. $R_w$ is the collection of RR sets sampled with probability $p_s(v, w)$ for the vertices in the graph. There are $\hat{\theta}_w$ RR sets for each keyword. For each vertex $v \in R_w$, we maintain an inverted list in $L_w$ to indicate which RR sets contain v. Note that in Line 3 of Algorithm 1, $OPT^{\{w\}}_1$ is unknown and needs to be estimated in advance. We adopt the weighted iterative estimation method in [2] to estimate $OPT^{\{w\}}_1$. After $OPT^{\{w\}}_1$ and $\hat{\theta}_w$ are determined, we construct $R_w$ by sampling $\hat{\theta}_w$ RR sets and then computing its inverse mapping $L_w$.
Finally, we can store the index (R, L) on disk for query processing.

Example 4. Figure 2 shows the RR index built for keywords music and book for the running example in Figure 1. In this example, we estimate $\theta_{music} = 9$ and $\theta_{book} = 6$. Then, we sample 9 random RR sets for keyword music and 6 for book. Based on the RR sets, we construct $L_{music}$ and $L_{book}$ to store the inverse mapping between vertices and RR sets.
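The offline construction in Algorithm 1 can be sketched in a few lines. The sketch below is an in-memory illustration under stated assumptions: the number of RR sets per keyword is passed in as `num_sets` instead of being computed in Line 3, persistence to disk is omitted, and the names `build_index`, `sample_rr_set` and the input formats are hypothetical.

```python
import random


def sample_rr_set(in_edges, root, rng):
    """Reverse traversal from root; edge (u, v) survives with probability p."""
    rr = {root}
    stack = [root]
    while stack:
        v = stack.pop()
        for u, p in in_edges.get(v, []):
            if u not in rr and rng.random() < p:
                rr.add(u)
                stack.append(u)
    return rr


def build_index(in_edges, tf, num_sets, seed=0):
    """Build (R, L) per keyword.

    tf:       {keyword w: {user v: tf_{w,v}}}
    num_sets: {keyword w: number of RR sets to sample for w}
    Returns R[w] (list of RR sets) and L[w] (user -> ids of RR sets containing it).
    """
    rng = random.Random(seed)
    R, L = {}, {}
    for w, tf_w in tf.items():
        users = sorted(u for u, weight in tf_w.items() if weight > 0)
        weights = [tf_w[u] for u in users]
        # roots drawn with p_s(v, w) = tf_{w,v} / sum over users of tf_{w,v'}
        roots = rng.choices(users, weights=weights, k=num_sets[w])
        R[w] = [sample_rr_set(in_edges, r, rng) for r in roots]
        # inverse mapping: which RR sets contain each user
        L[w] = {}
        for rr_id, rr in enumerate(R[w]):
            for u in rr:
                L[w].setdefault(u, []).append(rr_id)
    return R, L


if __name__ == "__main__":
    in_edges = {"b": [("a", 0.5)], "c": [("a", 0.5), ("b", 0.5)]}
    tf = {"music": {"b": 0.6, "c": 0.4}, "book": {"a": 1.0}}
    R, L = build_index(in_edges, tf, {"music": 9, "book": 6})
    print(len(R["music"]), len(R["book"]), sorted(L["music"]))
```

Since the sampling distribution for a keyword depends only on the tf weights of that keyword, the index can be built once and shared by every query that mentions the keyword.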

4.3 Improved Estimation of $\theta_w$

Based on Lemma 3, we know that as long as we set $\theta_w = \hat{\theta}_w$, the number of sampled RR sets is sufficient to guarantee the theoretical bound with high probability. However, $\hat{\theta}_w$ could be such a large value that it takes a huge amount of disk storage to build the index. To reduce the index size, we propose a compact estimation of $\theta_w$ based on the observation that $\frac{\ln |V| + \ln \binom{|V|}{K} + \ln 2}{\ln |V| + \ln \binom{|V|}{Q.k} + \ln 2}$ can be approximated by K/Q.k when |V| is significantly larger than K and Q.k.

Lemma 4. For any keyword w, if we choose

$\theta_w = (8 + 2\epsilon) \cdot (\sum_{v \in V} tf_{w,v}) \cdot \frac{\ln |V| + \ln \binom{|V|}{K} + \ln 2}{OPT^{\{w\}}_K \cdot \epsilon^2}$    (10)

we have $\theta_w \ge \theta \cdot p_w$.

Proof. As $OPT^{Q.T}_{Q.k} / idf_w \ge OPT^{\{w\}}_{Q.k}$ according to Lemma 3, $\theta \cdot p_w \le (8 + 2\epsilon) \cdot (\sum_{v \in V} tf_{w,v}) \cdot \frac{\ln |V| + \ln \binom{|V|}{Q.k} + \ln 2}{OPT^{\{w\}}_{Q.k} \cdot \epsilon^2}$. To prove $\theta_w \ge \theta \cdot p_w$, we need to justify:

$\frac{\ln |V| + \ln \binom{|V|}{K} + \ln 2}{OPT^{\{w\}}_K \cdot \epsilon^2} \ge \frac{\ln |V| + \ln \binom{|V|}{Q.k} + \ln 2}{OPT^{\{w\}}_{Q.k} \cdot \epsilon^2}$

Let $r = \frac{\ln |V| + \ln \binom{|V|}{K} + \ln 2}{\ln |V| + \ln \binom{|V|}{Q.k} + \ln 2}$. Since this ratio can be approximated by K/Q.k, r = K/Q.k. Let S be the seed set that achieves $OPT^{\{w\}}_K$ with |S| = K. According to the submodular property of the influence spread function [5], there exists a set $S' \subseteq S$ with |S'| = Q.k s.t. $E[I^{\{w\}}(S')] \ge \frac{Q.k}{K} \cdot OPT^{\{w\}}_K$. According to the definition of $OPT^{\{w\}}_{Q.k}$, $OPT^{\{w\}}_{Q.k} \ge E[I^{\{w\}}(S')]$. This means $OPT^{\{w\}}_K / OPT^{\{w\}}_{Q.k} \le r$. Lastly, we can conclude that $\theta_w \ge \theta \cdot p_w$ whenever $\theta_w$ satisfies Eqn. 10.

To utilize the improved $\theta_w$ for constructing the index, we simply replace $\hat{\theta}_w$ with $\theta_w$ in Line 3 of Algorithm 1. $OPT^{\{w\}}_K$ is estimated in the same way as $OPT^{\{w\}}_1$.

4.4 KB-TIM Query Processing

The algorithm for KB-TIM query processing based on the RR index is shown in Algorithm 2.

Algorithm 2: Query(KB-TIM Q)
1 $\theta^Q$ ← min{ $\theta_w / p_w$ | w ∈ Q.T }
2 $S^Q$ ← ∅, $(R^Q, L^Q)$ ← (∅, ∅)
3 for w ∈ Q.T do
4     $R^Q_w$ ← $\theta^Q \cdot p_w$ RR sets from $R_w$
5     $L^Q_w$ ← $L_w$
6 for i = 1 to Q.k do
7     $v_i$ ← the user that covers the most RR sets in $R^Q$
8     $S^Q$ ← $S^Q$ ∪ {$v_i$}
9     for w ∈ Q.T do
10        for rr ∈ $L^Q_w[v_i]$ do
11            if rr is not covered then
12                mark rr as covered
13            if rr ∈ $R^Q_w$ then
14                remove rr from $R^Q_w$
15 return $S^Q$
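A compact in-memory reading of Algorithm 2 is sketched below. Several details are assumptions made for illustration rather than statements about the authors' implementation: the index is given as Python dictionaries instead of disk-resident files, the proportions `p[w]` are supplied by the caller, the first $\theta^Q \cdot p_w$ RR sets of each keyword are the ones retrieved, and gains are recomputed from scratch in every round for simplicity. The names `invert` and `query` are hypothetical.

```python
def invert(rr_sets):
    """Inverse mapping: user -> ids of the RR sets that contain the user."""
    inverted = {}
    for rr_id, rr in enumerate(rr_sets):
        for u in rr:
            inverted.setdefault(u, []).append(rr_id)
    return inverted


def query(R, L, p, keywords, k):
    """Greedy seed selection over the per-keyword RR index.

    R[w]: list of RR sets, L[w]: inverted lists, p[w]: proportion p_w.
    """
    # line 1: total number of RR sets that every keyword can support
    theta_q = min(len(R[w]) / p[w] for w in keywords)
    # line 4: number of RR sets used per keyword (small epsilon guards rounding)
    take = {w: min(len(R[w]), int(theta_q * p[w] + 1e-9)) for w in keywords}
    covered = {w: set() for w in keywords}
    users = sorted({u for w in keywords for u in L[w]})
    seeds = []

    def gain(u):
        # uncovered RR sets, among those retrieved, that contain u
        return sum(1 for w in keywords for rr_id in L[w].get(u, ())
                   if rr_id < take[w] and rr_id not in covered[w])

    for _ in range(k):  # lines 6-14
        best = max(users, key=gain, default=None)
        if best is None or gain(best) == 0:
            break
        seeds.append(best)
        for w in keywords:
            covered[w].update(rr_id for rr_id in L[w].get(best, ())
                              if rr_id < take[w])
    return seeds


if __name__ == "__main__":
    R = {"music": [{"b", "c"}, {"b"}], "book": [{"e"}, {"b", "e"}]}
    L = {w: invert(sets) for w, sets in R.items()}
    print(query(R, L, {"music": 0.5, "book": 0.5}, ["music", "book"], 2))
```

Recomputing gains in each round keeps the sketch short; a practical implementation would maintain per-user counters and update them as RR sets become covered.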
To process a query, we need to retrieve $\theta_Q$ RR sets from the index across all the query keywords. To determine $\theta_Q$, we need to ensure the conditions in Lemma 2 so that the algorithm has the result guaranteed as in Theorem 2. Note that we cannot set $\theta_Q$ to $\theta$ via Eqn. 10 because the optimal influence spread $OPT_{Q.T}$ is unknown. To determine a proper $\theta_Q$, we set

$$\theta_Q = \min\{\theta_w / p_w \mid w \in Q.T\} \qquad (11)$$

When $\theta_Q$ is determined (line 1), we retrieve $\theta_Q \cdot p_w$ RR sets from each query keyword $w$ to ensure the sampled RR sets are not biased towards any query keyword according to Lemma 2 (line 4). Since $\theta_Q \cdot p_w \le \theta_w$, we can guarantee that there are at least $\theta_Q \cdot p_w$ RR sets for each keyword $w$ in the index. Finally, the result seed set $S_Q$ is identified by running a greedy algorithm for the maximum coverage problem [22] on $(R_Q, L_Q)$ (lines 6-14).

[Figure 3: Example of incremental index structures for keyword music and book.]

Example 5. Suppose Q.T = {music, book} and k = 2. If the ratio of RR sets for keyword music to book is 9 : 4, we can determine $\theta_Q$ as 13 because $\min\{\frac{13}{9}\theta_{music}, \frac{13}{4}\theta_{book}\} = \min\{\frac{13}{9} \cdot 9, \frac{13}{4} \cdot 6\} = 13$. This means we need 9 RR sets from music and 4 from book. Thus, we load $rr_1$-$rr_9$ in $R_{music}$ and $rr_1$-$rr_4$ in $R_{book}$ into memory. The whole of $L_{music}$ and $L_{book}$ is also loaded. Then, we run the greedy maximum coverage algorithm [22] on the 13 RR sets in memory: in the first iteration, b is selected as the first seed because it appears in the largest number of RR sets. In other words, b has the highest probability to influence other people.
In the next step, we remove all the RR sets containing b and find e to be the most frequent user in the remaining RR sets. Therefore, we return {b, e} as the two most influential users for query keywords {music, book}.

Finally, we can show that our algorithm returns a $(1 - 1/e - \varepsilon)$-approximate solution to any KB-TIM query with a probability of at least $1 - |V|^{-1}$. Due to space limit, the proof is presented in [20].

5. INCREMENTAL RR INDEX (IRR)

Although the RR index significantly improves the KB-TIM query processing compared to the WRIS method, it has to load $\theta^Q_w$ RR sets in $R_w$ and all the inverted lists in $L_w$ for each query keyword, which still incurs high disk I/O. In addition, we observe that a large number of RR sets do not contain the seed users. It is thus a waste of disk I/O to load these RR sets into memory because they are not accessed by the query processing algorithm. These motivate us to

design an index that incrementally loads relevant RR sets into memory for query processing.

Algorithm 3: Build(Set T, BlockSize δ)
 1  Index $(IR, IL, IP) \leftarrow (\emptyset, \emptyset, \emptyset)$
 2  for $w \in T$ do
 3      Compute $\theta_w$ according to Lemma 4
 4      $R_w \leftarrow$ generate $\theta_w$ RR sets with probability $ps(v, w)$
 5      $L_w \leftarrow$ inverse mapping of $R_w$
 6      for each user $v$ s.t. $L_w[v]$ exists do
 7          $IP_w[v] \leftarrow$ the RR set with the smallest ID in $L_w[v]$
 8      Sort rows of $L_w$ by descending order of $L_w(v).size$
 9      $p \leftarrow 1$
10      while $L_w \ne \emptyset$ do
11          $IL^p_w \leftarrow$ first δ rows of $L_w$
12          $IR^p_w \leftarrow \{rr \mid rr \in IL^p_w \wedge rr \notin \bigcup_{j<p} IR^j_w\}$
13          $p \leftarrow p + 1$
14          Remove the first δ rows from $L_w$
15  Store $(IR, IL, IP)$

5.1 Index Construction

We observe that if a user has a very high impact in the social network, e.g. followed by millions of other users in Twitter, he has a good chance to be frequently sampled in the RR sets of different keywords. Hence, we sort the inverted lists in $L_w$ in decreasing order of the list length. In this way, by incrementally loading the inverted lists, the more impactful users will be loaded first.

Our IRR index consists of three components $(IR_w, IL_w, IP_w)$, which can be derived from $(R_w, L_w)$ in the RR index. $IL_w$ sorts $L_w$ in decreasing order of list length. Then, we further split $IL_w$ into $m$ equi-size partitions $IL^1_w, IL^2_w, \ldots, IL^m_w$. $IR_w$ also contains $m$ partitions $IR^1_w, IR^2_w, \ldots, IR^m_w$ generated from the partitions in $IL_w$:

$$IR^i_w = \begin{cases} \{rr \mid rr \in R_w \wedge rr \in IL^i_w\} & \text{if } i = 1 \\ \{rr \mid rr \in R_w \wedge rr \in IL^i_w \wedge rr \notin \bigcup_{j<i} IR^j_w\} & \text{if } i > 1 \end{cases}$$

In other words, $IR^i_w$ picks from the remaining RR sets in $R_w$ those that contain users in partition $IL^i_w$. $IP_w$ preserves the mapping between each vertex in $IL_w$ and its first occurrence in $R_w$ of the RR index. Whenever we process a query $Q$ and $\theta^Q_w$ RR sets need to be retrieved w.r.t. $w \in Q.T$, $IP_w$ determines whether a vertex covers at least one RR set which is among the $\theta^Q_w$ samples.

Example 6. Figure 3 shows the three components of the IRR index for keywords music and book. We can see that the inverted lists in $IL_w$ are now sorted by the list length, compared to the $L_w$ in Figure 2.
In this example, we set the partition size in $IL_w$ to be 2. The first partition of $IL_{music}$ contains users {b, c} and all the RR sets with user b or c are organized in the first partition in $IR_{music}$. The second partition of $IL_{music}$ contains {d, e}, which appear in all the remaining RR sets. Thus, all of them are put into the second partition in $IR_{music}$. Since all the RR sets have been processed, the following partitions in $IL_{music}$ correspond to empty partitions in $IR_{music}$. $IP_w$ preserves the first occurrence in the original $R_w$. For example, user d first appears in $rr_2$ in the $R_{music}$ in Figure 2. Hence, it has an entry in $IP_{music}$ mapping to $rr_2$.

Algorithm 4: Query(KB-TIM Q)
 1  $\theta_Q \leftarrow \min\{\theta_w / p_w \mid w \in Q.T\}$
 2  for each keyword $w \in Q.T$ do
 3      initialize $k[w]$                      // upper bound score for w
 4      $\theta^Q_w \leftarrow \theta_Q \cdot p_w$                 // number of RR sets for w
 5  $pq \leftarrow \emptyset$                          // priority queue sorted by upper bound score
 6  $S_Q \leftarrow \emptyset$                         // the result set
 7  while $|S_Q| < k$ do
 8      $(s, tp) \leftarrow pq.top()$          // s is the score for user tp in pq
 9      $s' \leftarrow$ re-compute upper bound score of tp
10      if $s' < s$ then
11          pq.pop()
12          push $(s', tp)$ to pq
13          continue
14      if tp.status = COMPLETE $\wedge$ $s \ge \sum_{w \in Q.T} k[w]$ then
15          pq.pop()
16          $S_Q \leftarrow S_Q \cup \{tp\}$
17          for each keyword $w \in Q.T$ do
18              for each RR set $rr$ containing tp in $IR_w$ do
19                  if $rr$ is not covered $\wedge$ $rr \le \theta^Q_w$ then
20                      mark $rr$ as covered
21                      for each user $v \in IR_w[rr]$ do
22                          update upper bound for v w.r.t. w
23      else
24          for each keyword $w \in Q.T$ do
25              load the next partition $IR_w$ and $IL_w$ from the IRR index of w
26              for each user $v \in IL_w$ do
27                  update upper bound for v w.r.t. w
28                  push v with new upper bound into pq
29                  set tp.status to COMPLETE if all the partial scores have been determined
30              update $k[w]$
31  return $S_Q$

The construction of the IRR index is presented in Algorithm 3. For each keyword $w$, we build the RR sets $R_w$ and the reverse mapping $L_w$ as the basic RR index. Then IP is computed in Lines 6-7. Lastly, we divide the users into partitions, each of which has δ users, to form $IL^p_w$ where $p$ is the partition ID. The matching RR set partition $IR^p_w$ is built against $IL^p_w$.
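The partitioning step of Algorithm 3 can be sketched for a single keyword as below. This is an illustrative reading under stated assumptions, not the authors' code: `build_irr` and the toy RR sets are hypothetical, everything is kept in memory rather than written to disk, and ties in list length are broken by user name so that the output is deterministic.

```python
def build_irr(rr_sets, delta):
    """Return (IR, IL, IP): RR set partitions, inverted-list partitions, first occurrences."""
    inverted = {}
    for rr_id, members in enumerate(rr_sets, start=1):
        for v in members:
            inverted.setdefault(v, []).append(rr_id)
    # IP: the smallest RR set ID in which each user appears.
    ip = {v: ids[0] for v, ids in inverted.items()}
    # Longest inverted list first, so the most frequently sampled users load first.
    rows = sorted(inverted.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    il, ir, seen = [], [], set()
    for start in range(0, len(rows), delta):
        block = rows[start:start + delta]
        il.append(block)
        # IR^p keeps only RR sets not already placed in an earlier partition.
        fresh = sorted({rr for _, ids in block for rr in ids} - seen)
        ir.append(fresh)
        seen.update(fresh)
    return ir, il, ip

# Hypothetical RR sets for one keyword, partition size delta = 2.
toy_sets = [{"a", "b"}, {"b", "c"}, {"c"}, {"b", "e"}, {"d"}]
IR, IL, IP = build_irr(toy_sets, delta=2)
print(IR)                      # RR set IDs per partition
print([v for v, _ in IL[0]])   # users in the first partition
```

As in Example 6, later partitions of IR can be empty once the users in earlier partitions have already accounted for all the RR sets.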
The construction terminates when all the users are visited.

5.2 Incremental KB-TIM Query Processing

In our IRR index, the users in $IL_w$ are sorted for each keyword. This motivates us to model the KB-TIM query as a top-k aggregation problem and employ an algorithm similar to NRA [8]. The NRA algorithm maintains an accumulation table for candidates with incomplete results and incrementally loads blocks of items in the sorted lists for aggregation. If a candidate contains partial scores from all the keywords, its status is set to COMPLETE and it is pushed into a heap storing the top-k results. The algorithm terminates when the upper bound score for the partial or unvisited results is smaller than the k-th best score ever found.

KB-TIM query processing based on the IRR index brings two new issues when employing the NRA algorithm: 1) how to determine that the status of a candidate user is COMPLETE when there are missing partial scores; 2) how to find and update all the affected users when a new seed user is confirmed and added to the result set.

For the first issue, in NRA a candidate is said to be COMPLETE if the partial scores of all the attributes have been

accessed and aggregated. However, we only access $\theta_Q \cdot p_w$ RR sets for keyword $w$ in query processing. If a user is not contained in the RR sets of a query keyword, there is a missing partial score for the keyword and the complete score can only be determined when all the RR sets have been scanned. To avoid this case, we check whether a user appears in the RR sets of a keyword before the query processing. If $IP_w[v] > \theta_Q \cdot p_w$, where $IP_w[v]$ indicates the first occurrence in the RR sets, we set the partial score of user $v$ on keyword $w$ to be 0. The status of a user becomes COMPLETE when the partial scores for all query keywords are set. We use score(u) to denote the complete score of user u.

The second issue arises because each time a new seed user is confirmed by the greedy maximum coverage algorithm [22], we need to remove the seed user from the RR sets and update the list length for all the affected candidate users. To save computation cost, we propose a lazy evaluation strategy. We mark the RR sets containing previous seed users as covered and refine the score of a candidate only when it is the top element in the priority queue.

We present our incremental KB-TIM query processing solution in Algorithm 4. Given a query, we first calculate $\theta_Q$ to determine the number of RR sets that should be loaded for each keyword. pq is a priority queue in which candidate users are sorted by their upper bound score and $k[w]$ stores the maximum list length in $IL_w$ for the unvisited user candidates. The algorithm iterates until k seed users have been found. In each iteration, we access the user tp with the highest upper bound score. If the user has been loaded into memory for a keyword, we know the exact number of RR sets for that keyword. Otherwise, the upper bound score on that keyword is $k[w]$. The upper bound score is the sum of the scores from all the query keywords. If the score of the user is affected by previous seed users, we further refine the score by removing those RR sets containing previous seed users.
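The lazy evaluation strategy can be illustrated in isolation. The sketch below is a simplification and not Algorithm 4 itself: it assumes all inverted lists of a single merged keyword are already in memory, so it omits partition loading, the COMPLETE status, and the $k[w]$ bound. The function name `lazy_greedy` and the toy inverted lists are hypothetical.

```python
import heapq

def lazy_greedy(inverted, k):
    """Pick up to k users by maximum coverage, refining a score only at the queue top."""
    covered = set()
    # heapq is a min-heap, so scores are negated. The initial score is the list
    # length, which is an upper bound on what the user can still cover.
    heap = [(-len(ids), v) for v, ids in inverted.items()]
    heapq.heapify(heap)
    seeds = []
    while heap and len(seeds) < k:
        neg_bound, v = heapq.heappop(heap)
        fresh = sum(1 for rr in inverted[v] if rr not in covered)
        if fresh < -neg_bound:
            # Stale bound: push the refined score back instead of rescoring everyone.
            heapq.heappush(heap, (-fresh, v))
            continue
        if fresh == 0:
            break  # nothing left to cover
        seeds.append(v)  # exact score >= every other upper bound, so v is the best choice
        covered.update(inverted[v])
    return seeds

# Hypothetical inverted lists: user -> IDs of RR sets containing the user.
toy_inverted = {"a": [1], "b": [1, 2, 4], "c": [2, 3], "d": [5], "e": [4]}
print(lazy_greedy(toy_inverted, k=2))
```

Scores can only shrink as RR sets become covered, which is why a candidate whose refreshed score still equals its stored bound can be accepted without touching the rest of the queue.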
The user with the new candidate upper bound is pushed back into the priority queue. Otherwise, we check whether the user is a new seed user. If his status is COMPLETE, the upper bound score is an exact score. If the score is larger than the upper bound score of unseen users, we can guarantee that there is no other user with a better score than tp. So we insert tp into the result set and the RR sets covered by tp are marked (lines 14-22) such that the candidate users contained in any of these RR sets are affected. If tp cannot be determined as a new seed, we load more partitions from each query keyword's index. Then we update the upper bound scores for all the users in the loaded partition (lines 23-30). The algorithm terminates when k seed users are found. An example is included in the technical report [20] for more details.

Theorem 3. The impact scores of top-k seed users returned by Algorithm 4 and Algorithm 2 are the same.

Proof. Suppose Algorithm 4 iteratively returns k seed users $\{u_1, u_2, \ldots, u_k\}$ and Algorithm 2 returns $\{v_1, v_2, \ldots, v_k\}$. If the top-k impact scores are not the same, since Algorithm 2 picks the user with the maximum score in each iteration, we can find an index $i$ such that $u_i$ is the first user with $score(u_i) < score(v_i)$ and $score(u_j) = score(v_j)$ for $j < i$.

Case 1): $v_i \ne u_i$. In Algorithm 4, $u_i$ is selected as a new seed only when it has the maximum upper bound score in the priority queue and its score is at least $\sum_{w \in Q.T} k[w]$. Therefore, if $v_i$ is in the priority queue, we have $score(v_i) \le score(u_i)$. Otherwise, $v_i$ has not been loaded into memory and its upper bound score is $\sum_{w \in Q.T} k[w] \le score(u_i)$. Both cases contradict $score(u_i) < score(v_i)$.

Case 2): $v_i = u_i$. When $u_i$ is selected as a seed user, we know that its score is complete, i.e., all the related RR sets have been loaded into memory. Since Algorithm 4 finds the same $i - 1$ seed users in previous iterations, after removing all these $i - 1$ seed users from the RR sets of $u_i$, we know that its inverted list length is the same as that of $v_i$ in Algorithm 2. Thus, $score(v_i) = score(u_i)$, which contradicts $score(u_i) < score(v_i)$.

Table 2: All parameter settings used in the experiments. The default values are highlighted.

Datasets     Twitter dataset             News dataset
#Users       10M, 20M, 30M, 40M          0.2M, 0.6M, 1M, 1.4M
#Edges       0.7B, 1.1B, 1.2B, 1.3B      1.0M, 1.9M, 2.6M, 3.1M
AveDegree    76.4, 56.8, 46.1, 38.9      5.2, 3.1, 2.6, 2.2
#QWords      1, 2, ..., 5, 6
k            10, 15, ..., 30, ..., 50

[Figure 4: In-degree distributions for both datasets. (a) News, (b) Twitter. Axes: in-degree versus number of users.]

6. EXPERIMENTAL STUDY

In this section, we study the performance of KB-TIM query processing on two real datasets. We use WRIS (in Section 3.2) as a baseline solution because it uses online sampling and can be considered as a variant of the state-of-the-art RIS methods [2, 21]. We compare the methods based on offline sampling (RR and IRR index in Sections 4 and 5 respectively) with WRIS and evaluate the efficiency by average running time and the effectiveness by expected influence. We did not evaluate the existing topic-aware IM solutions since they cannot handle the social networks used in this paper due to their inability to scale to large graph sizes and numbers of topics (detailed discussion in Section 7). All the methods are implemented in C++ and run on a CentOS server (Intel i7-3820 3.6GHz CPU with 8 cores and 60GB RAM).

6.1 Experimental Setup

Datasets. We use two real world datasets, Twitter and News, from SNAP [footnote 5] for performance evaluation.
The Twitter dataset contains 41.6 million users with 476 million tweets and the news dataset contains 1.42 million media extracted from a collection of 96 million online news corpus. For the news dataset, the vertex in the graph denotes an online media whereas the edge means that there is a link from one online media to another. We extract 200 topics from each dataset and a user profile is represented by a term vector in the topic space. To test the scalability with increasing number of users, we sampled 10M, 20M, 30M, 40M users for the Twitter dataset and 0.2M, 0.6M, 1M, 1.4M for the news dataset. We use t10m, t20m, t30m, t40m to denote the respective Twitter datasets and n0.2m, n0.6m, n1.0m, n1.4m for the news datasets. The statistics of the social networks are shown in Table 2. We also plot the frequency distribution of incoming degrees in Figure 4, which shows Twitter is a much denser graph than the news social network and many nodes are followed by a large number of users.

[Footnote 5] http://snap.stanford.edu/

Table 3: Disk space and running time of using $\hat{\theta}_w$ and $\theta_w$ for constructing indices for news datasets

          Disk Size (GB)                      Time (Secs)
          RR              IRR                 RR              IRR
Data      θ̂_w     θ_w     θ̂_w     θ_w         θ̂_w     θ_w     θ̂_w     θ_w
n0.2m     37      4.1     37      4.1         662     67.3    700     69.9
n0.6m     47      5.7     47      5.8         982     2       994     22
n1.0m     54      6.7     54      6.9         238     57      270     64
n1.4m     62      7.6     63      7.8         487     67      497     88

Queries. We use real keyword queries from the AOL search engine [footnote 6]. Given the 200 topics defined in advance, we first filter the keyword queries and retain those only containing our topic keywords. To evaluate the performance in terms of increasing number of topics (or keywords) in a query, we vary the number of query keywords from 1 to 6 and extract 100 queries for each length.

Parameters. In our algorithms, two parameters, ε and K, need to be determined to evaluate $\theta$ in Eqn. 6, $\hat{\theta}_w$ in Eqn. 8 and $\theta_w$ in Eqn. 10. ε is set to 0.1 for all experiments as it is the most accurate setting adopted in [21] (no experiment is done in [2]). K is set to 100 since the largest k in our experiments is 50. For IRR, the partition size δ is set to 100 for all experiments. In the following experiments, we evaluate the scalability w.r.t. increasing seed users k, query keywords |Q.T| and social network size |V| as shown in Table 2.
6.2 Index Sizes and Construction Time

Both RR and IRR require offline sampling of RR sets to build their respective indices. All indices are constructed by running 8 threads in parallel. We first study the difference between using $\hat{\theta}_w$ (Eqn. 8) and $\theta_w$ (Eqn. 10) for index construction. The disk space and running time of using $\hat{\theta}_w$ and $\theta_w$ for index construction of news datasets are presented in Table 3. It is obvious that using $\hat{\theta}_w$ does not scale to large graphs. Besides, in subsequent experiments, we will see that indices constructed by $\theta_w$ have equivalent approximation power compared to those built using $\hat{\theta}_w$. Therefore we will not present results for indices built by $\hat{\theta}_w$ for Twitter datasets.

Since RR and IRR use inverted lists, we can apply FastPFOR [footnote 7] compression (adopted in Apache Lucene 4.6.x) to reduce disk storage. We report disk space and running time for uncompressed and compressed indices in Table 4.

[Footnote 6] http://www.gregsadetsky.com/aol-data/
[Footnote 7] https://github.com/lemire/fastpfor

Table 4: Disk space and running time for constructing indices for various datasets with $\theta_w$

          Disk Size (GB)                      Time (Secs)
          uncompress      compress            uncompress      compress
Data      RR      IRR     RR      IRR         RR      IRR     RR      IRR
n0.2m     4.1     4.1     2.0     2.0         67.3    69.9    70.1    73.8
n0.6m     5.7     5.8     3.0     3.1         2       22      3       9
n1.0m     6.7     6.9     3.7     3.9         57      64      48      57
n1.4m     7.6     7.8     4.3     4.6         67      88      64      67
t10m      88      94      52      58          5.9h    6.0h    5.8h    6.0h
t20m      84      93      56      65          4.3h    4.5h    4.4h    4.4h
t30m      77      87      54      64          3.7h    3.8h    3.7h    3.7h
t40m      55      63      4       50          2.4h    2.5h    2.4h    2.5h

Table 5: Sum of $\theta_w$ and mean RR set size for increasing graph size

News                                   Twitter
|V|     Sum of θ_w   Mean size         |V|     Sum of θ_w   Mean size
0.2M    385M         2.7               10M     28M          94.9
0.6M    640M         2.3               20M     247M         7.9
1M      798M         2.1               30M     276M         55.0
1.4M    926M         2.0               40M     374M         26.7

For uncompressed indices, we can see that the index size in the news dataset grows with the graph size. However, to our surprise, this pattern does not apply to Twitter datasets. The reason is that the index size depends on two factors: the number of RR sets, i.e. $\theta_w$ in Lemma 4, and the average size of an RR set.
On one hand, $\theta_w$ increases for large graphs since $\theta_w$ depends on |V|. We report the sum of $\theta_w$ among all keywords and the average RR set size for both datasets in Table 5. On the other hand, as an RR set is constructed by first randomly picking a starting vertex and then performing a random breadth first search on the social graph, the size of an RR set will be larger if the graph is denser. As shown in Table 2, the average degree of both the news and the Twitter dataset decreases as the graph becomes larger. This leads to a decrease in the average RR set size for both datasets. Hence, $\theta_w$ and the average RR set size are two conflicting factors as |V| varies. In the news dataset, $\theta_w$ takes the major role in determining the index size while in the Twitter dataset, the average RR set size is more critical.

For compressed indices, there are approximately 50% and 40% space reductions for news and Twitter datasets respectively. Besides, the construction time for the compressed indices does not increase significantly compared to that of the uncompressed indices. Thus, in the following experiments, we will adopt the compressed scheme for both RR and IRR indices. We also observe that Twitter consumes much more disk space than the news dataset. This is because Twitter is a much larger and denser social network than news. Each user in Twitter can be affected by a large number of users. The construction time of the index grows linearly with the index size. It requires hours to build the index for Twitter datasets, even when we used 8 threads in parallel. This is because we need to build the index for 200 keywords and each keyword requires hundreds of thousands of random walks to generate RR sets.

6.3 Vary the Seed Set Size

We first examine the performance with increasing seed users k. The running time and number of RR sets accessed are shown in Figure 5. It is obvious that the methods based on offline sampling are significantly faster than

the online sampling method. In the Twitter dataset, the average response time to a KB-TIM query using RR and IRR indices are 60x and 434x times smaller than WRIS respectively. IRR runs faster than RR for two reasons. First, the RR method needs to load $\theta_w$ RR sets and this number is invariant to k, whereas the IRR method incrementally loads partitions for top-k aggregation. The number of RR sets loaded by the two algorithms is plotted in Figure 5. Second, in the RR method, after a seed is found, we need to eliminate its impact on the remaining candidate users. In other words, we scan the inverted lists $L_w$ and update the length by removing those RR sets containing the seed user. The operation is expensive. In IRR, our proposed lazy update mechanism only requires updating the score of the user at the top of the priority queue in each iteration.

As k increases, it takes a slightly longer time for RR and IRR methods to answer a query. This is because both RR and IRR require more iterations to find the seed users, causing more CPU cost and disk I/O. But WRIS becomes slightly faster as k increases. The reason is that the performance of WRIS is mainly dependent on the number of RR sets generated, i.e. $\theta$ in Theorem 2, which is inversely proportional to the optimal spread. Due to the monotonicity of the IM problem, the optimal spread becomes larger when k increases, resulting in a smaller number of RR sets sampled.

Table 6: Number of I/O for IRR when varying k

k         10     15     20     25     30     35     40     45     50
News      6.0    7.34   8.75   0.2    4.0    9.3    33.6   78.1   70
Twitter   8.00   4.5    23.0   34.0   40.0   5.0    58.5   69.0   8.0

Table 7: Influence spread when varying k

      News dataset                          Twitter dataset
k     WRIS     RR       IRR      (θ̂_w)      WRIS      RR        IRR
10    289.6    289.4    289.4    289.4      8673.9    8672.6    8672.6
15    356.5    356.3    356.4    356.4      10666     10666     10666
20    408.9    409.0    409.0    409.0      2088      2092      2092
25    448.7    448.6    448.6    448.6      300       3099      3099
30    480.3    480.3    480.2    480.2      3929      3925      3925
35    506.7    506.8    506.7    506.7      468       466       466
40    528.6    528.8    528.8    528.8      5209      5209      5209
45    548.6    548.0    548.1    548.1      573       5735      5735
50    564.9    564.9    564.9    564.9      62        624       624
We also note that the performance of IRR degrades to be close to RR in the news dataset. This is because the I/O efficiency of RR is better than IRR. The RR method incurs one sequential disk I/O for each query keyword (the default number of query keywords is 5, which leads to 5 I/Os for each query). As shown in Table 6, the IRR method relies on incremental loading of partitions into memory, which causes increasing disk I/Os for larger k. However, in the Twitter dataset, although RR still has the advantage in I/Os, the number of RR sets loaded for IRR is significantly smaller than that of RR, which explains why IRR does not degrade to RR.

In addition to the result that RR and IRR are much faster than WRIS, the expected influence of the top-k seed users generated by them is also no worse than WRIS. We report the expected influence of the seed users returned by all the methods in Table 7. To justify our improved estimation of $\theta_w$ in Sec. 4.3, we also report the news dataset result on indices constructed by using $\hat{\theta}_w$. As shown in Table 7, there is almost no difference between all the methods. The expected influence spread for all three methods will not be presented in the rest of the experimental study as the results show similar patterns.

[Figure 5: Varying the seed set size k. Panels: execution time (s) and number of RR sets loaded, for News and Twitter.]

6.4 Vary the Number of Query Keywords

When we increase the number of query keywords |Q.T| from 1 to 6, as shown in Figure 6, the results demonstrate similar patterns: RR and IRR are at least two orders of magnitude better than WRIS in a social network with billions of edges. IRR outperforms RR in running time in the Twitter dataset but degrades to be close to RR in the news dataset.
This is because in Twitter, there are a large number of users with a dominating number of followers. After we sort the users based on their overall influence, we only need to access a small portion of RR sets. In the news dataset, the number of accessed RR sets for the two methods grows linearly and IRR degrades because it requires more random disk I/O.

6.5 Vary the Graph Size

We also vary the graph size, i.e. |V|, to test the scalability of our proposed solutions. The results are shown in Figure 7. RR and IRR clearly outperform WRIS by great margins in all scenarios. In the news datasets, RR can outperform IRR in certain cases because its I/O is more efficient when examining a similar number of RR sets. Nevertheless, IRR has dominating performance against RR when the size of Twitter datasets increases. It shows that IRR is more effective for larger graphs without compromising its performance superiority.

6.6 Effectiveness of Propagation Model

In the last experiment, we discuss the effectiveness of the propagation model applied for KB-TIM query processing using the two real datasets. In this section, we present results for both the independent cascade (IC) model [2, 5, 6, 9] and the linear threshold (LT) model [2, 9, 5]. For the IC model, the propagation probability is set to $p(e) = 1/N_v$, which is widely adopted in [2, 5, 9]. For the LT model, following existing work [2, 5, 7], we assign a random value in [0, 1] to each user's incoming neighbours and normalize the values so that the sum of all neighbours' influence probabilities equals 1. In Table 8, we illustrate the top-8 most influential users for

keywords software and journal in both the news dataset and the Twitter dataset for both models. We can see that it is more effective for the news dataset than the Twitter dataset. In the news dataset, we can see that many websites in the top-10 results are highly relevant to the keywords software and journal. It shows that 1) our proposed KB-TIM query can be applied for targeted advertising and disseminate the advertisement in the more relevant communities or clusters in the social network; 2) the propagation model used in this paper is effective for topic-aware advertising in the news dataset. However, in the Twitter dataset, the results for different topics are similar because the reported users have a huge number of followers with diverse backgrounds. In other words, they are very influential in different topics. We also note that the RIS method always returns the same results for different query keywords. There is no connection between its top-10 seed users and the query keywords.

Note that how to determine a proper propagation model is beyond the scope of this paper. In this paper, we focus on the efficiency issue and our method can be easily applied to other propagation models like the general triggering model [5, 9]. This is because our proposed WRIS is an extension of RIS and RIS has been shown to incorporate all propagation models in [2]. The only difference is that RIS uniformly samples vertices to create RR sets whereas WRIS conducts a weighted sampling approach. Since influence propagation models are independent of the vertex sampling methods, WRIS can directly adopt other propagation models supported by RIS.

Table 8: Example KB-TIM query results

News dataset
WRIS(IC), software: kb.vmware.com, en.wikipedia.org, davidstef9.wordpress.com, codeplex.com, suse.ehelp.pl, virtualgeek.typepad.com, softwarelivre.net, linux.pl
WRIS(LT), software: kb.vmware.com, en.wikipedia.org, forums.asp.net, communities.vmware.com, suse.ehelp.pl, virtualgeek.typepad.com, ntwizard.spaces.live.com, linux.pl
WRIS(IC), journal: www.biblegateway.com, bookology.wordpress.com, hugh.journalspace.com, journals.aol.com, journal.peishan.org, earticle.wordpress.com, signefavor7.spaces.live.com, jazzitalia.net
WRIS(LT), journal: journals.aol.com, bookology.wordpress.com, bizjournals.com, www.biblegateway.com, journal.peishan.org, journaldugeek.com, bookalytics.com, jazzitalia.net
RIS, N.A: en.wikipedia.org, blogger.com, youtube.com, myweb.yahoo.com, wordpress.com, wrzuta.pl, match.seesaa.jp, 2.bp.blogspot.com

Twitter dataset
WRIS(IC), software: BarackObama, britneyspears, cnnbrk, aplusk, THE_REAL_SHAQ, kevinrose, jimmyfallon, twitter
WRIS(LT), software: biz, cnnbrk, kevinrose, twitter, BarackObama, mashable, jimmyfallon, britneyspears
WRIS(IC), journal: kevinrose, twitter, BarackObama, TheEllenShow, RyanSeacrest, britneyspears, THE_REAL_SHAQ, taylorswift13
WRIS(LT), journal: ev, cnnbrk, twitter, BarackObama, TheOnion, NotTinaFey, jimmyfallon, britneyspears
RIS, N.A: ev, cnnbrk, kevinrose, BarackObama, mashable, jimmyfallon, TheEllenShow, britneyspears

[Figure 6: Varying query keywords length |Q.T|. Panels: execution time (s) and number of RR sets loaded, for News and Twitter.]

[Figure 7: Varying the graph size |V|. Panels: execution time (s) and number of RR sets loaded, for News and Twitter.]

7. RELATED WORK

The IM problem is an NP-hard problem that has been extensively studied. Since Kempe et al. [15] proposed the

first simple greedy algorithm with an approximation ratio of $(1 - 1/e - \varepsilon)$, there has been a large body of research work devoted to improving the efficiency while keeping the theoretical bound [17, 10, 5, 7, 6, 16, 2, 14, 21]. Although CELF [17] and its variant CELF++ [10] have significantly improved the running time, the methods were only examined in small graphs with thousands of vertices. RIS [2] is the first method scalable enough to handle graphs with millions of vertices. It uses random sampling and can handle various propagation models that have been proposed. The method was further improved in [21] in terms of sampling efficiency. However, RIS still requires nearly one hour to process a graph with millions of vertices.

Another branch of work on IM improved the efficiency by discarding the theoretical bound. In other words, the expected influence returned does not have any approximation ratio to the optimal results. Chen et al. [6] used vertex degree as a quick selection criterion. In [5, 7, 16, 14], the propagation behaviour is simplified by removing social paths that have low propagation probabilities. Although these heuristic solutions are more efficient, none of them have been examined in large social networks. The state-of-the-art solutions (PMIA [5] and IRIE [14]) require more than 10 minutes to run a graph with less than 1 million edges.

The above methods cannot be directly applied in online advertisements because the same seed users are returned for different advertisements. To solve the issue, the topic-aware IM problem was proposed [1, 3, 4, 18, 13]. In [1], the probabilities that a user influences his neighbors are different for different topics. Subsequently, Chen et al. [4] proposed a solution for topic-aware IM by extending PMIA. As mentioned, PMIA does not have any theoretical bound and is not applicable to very large graphs. In [13], Inflex was proposed to achieve real-time performance.
Inflex first precomputes a number of top-k seed sets offline and the online query is processed by finding nearest neighbours among the pre-computed seed sets w.r.t. the query topics. Thus, Inflex does not have a theoretical bound due to the nearest neighbour approximation. Very recently, Shuo et al. [3] proposed another heuristic solution to improve the efficiency of Inflex, but still failed to provide a theoretical bound. In addition, existing works are not scalable to a large number of topics as they take prohibitively long time to train the propagation probabilities and huge storage space for different topics. No study was reported in [1, 13, 4] on handling graphs with more than 1 million vertices and [3] is only capable of handling 10 topics for a graph with 4 million users. Obviously, these solutions are infeasible for real-world social IM applications.

Compared with existing topic-aware IM solutions, our KB-TIM query takes into account the influence on the targeted users relevant to the advertisement instead of assigning different influence probabilities between directly connected users for different advertisements. Our solution not only achieves an approximation ratio of $(1 - 1/e - \varepsilon)$, but also is scalable enough to retrieve the seed users within a few seconds in a graph with billions of edges and hundreds of topics.

8. CONCLUSION

In this work, we studied the Keyword-Based Targeted Influence Maximization (KB-TIM) query on social networks. We first proposed an online sampling RIS method, WRIS, that returns a solution with a $(1 - 1/e - \varepsilon)$ approximation ratio. Then a disk-based index, RR, is developed so that the query processing can be done in real time. Subsequently, an incremental index and query processing technique, i.e., IRR, is presented to further boost the performance of the RR method. Extensive experiments on real world social network data have verified the theoretical findings and efficiency of our solutions.

9. ACKNOWLEDGEMENT

We would like to thank Prof. Xiaokui Xiao (NTU) for his insightful comments.
This work was supported by the NUS-Tsinghua Extreme Search project under grant number R-252-300-00-490.

10. REFERENCES

[1] N. Barbieri, F. Bonchi, and G. Manco. Topic-aware social influence propagation models. In ICDM, pages 81–90, 2012.
[2] C. Borgs, M. Brautbar, J. T. Chayes, and B. Lucier. Influence maximization in social networks: Towards an optimal algorithmic solution. CoRR, abs/1212.0884, 2012.
[3] S. Chen, J. Fan, G. Li, J. Feng, K.-L. Tan, and J. Tang. Online topic-aware influence maximization. Proc. VLDB Endow., 8(6):666–677, 2015.
[4] W. Chen, T. Lin, and C. Yang. Efficient topic-aware influence maximization using preprocessing. CoRR, abs/1403.0057, 2014.
[5] W. Chen, C. Wang, and Y. Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In KDD, pages 1029–1038, 2010.
[6] W. Chen, Y. Wang, and S. Yang. Efficient influence maximization in social networks. In KDD, pages 199–208, 2009.
[7] W. Chen, Y. Yuan, and L. Zhang. Scalable influence maximization in social networks under the linear threshold model. In ICDM, pages 88–97, 2010.
[8] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66(4):614–656, 2003.
[9] J. Goldenberg, B. Libai, and E. Muller. Using complex systems analysis to advance marketing theory development. Academy of Marketing Science Review, 2001.
[10] A. Goyal, W. Lu, and L. V. Lakshmanan. CELF++: Optimizing the greedy algorithm for influence maximization in social networks. In WWW, pages 47–48, 2011.
[11] M. Granovetter. Threshold models of collective behavior. The American Journal of Sociology, 83(6):1420–1443, 1978.
[12] L. Hong and B. D. Davison. Empirical study of topic modeling in Twitter. In SOMA, pages 80–88, 2010.
[13] Ç. Aslay, N. Barbieri, F. Bonchi, and R. A. Baeza-Yates. Online topic-aware influence maximization queries. In EDBT, pages 295–306, 2014.
[14] K. Jung, W. Heo, and W. Chen. IRIE: Scalable and robust influence maximization in social networks. In ICDM, pages 918–923, 2012.
[15] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, pages 137–146, 2003.
[16] M. Kimura and K. Saito. Tractable models for information diffusion in social networks. In PKDD, pages 259–271, 2006.
[17] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. M. VanBriesen, and N. S. Glance. Cost-effective outbreak detection in networks. In KDD, pages 420–429, 2007.
[18] F.-H. Li, C.-T. Li, and M.-K. Shan. Labeled influence maximization in social networks for target marketing. In SocialCom/PASSAT, pages 560–563, 2011.
[19] G. Li, S. Chen, J. Feng, K. Tan, and W. Li. Efficient location-aware influence maximization. In SIGMOD, pages 87–98, 2014.
[20] Y. Li, D. Zhang, and K.-L. Tan. Real-time targeted influence maximization for online advertisements. Technical report, 2015. http://www.comp.nus.edu.sg/~a004794/.
[21] Y. Tang, X. Xiao, and Y. Shi. Influence maximization: Near-optimal time complexity meets practical efficiency. In SIGMOD, pages 75–86, 2014.
[22] V. V. Vazirani. Approximation algorithms. Springer, 2001.
[23] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006.