Who to Follow and Why: Link Prediction with Explanations

Nicola Barbieri, Yahoo Labs, Barcelona, Spain
Francesco Bonchi, Yahoo Labs, Barcelona, Spain
Giuseppe Manco, ICAR-CNR, Rende, Italy

ABSTRACT

User recommender systems are a key component in any online social networking platform: they help users grow their network faster, thus driving engagement and loyalty. In this paper we study link prediction with explanations for user recommendation in social networks. For this problem we propose WTFW ("Who to Follow and Why"), a stochastic topic model for link prediction over directed and nodes-attributed graphs. Our model not only predicts links, but for each predicted link it decides whether it is a "topical" or a "social" link, and depending on this decision it produces a different type of explanation. A topical link is recommended between a user interested in a topic and a user authoritative in that topic: the explanation in this case is a set of binary features describing the topic responsible for the link creation. A social link is recommended between users who share a large social neighborhood: in this case the explanation is the set of neighbors who are most likely to be responsible for the link creation. Our experimental assessment on real-world data confirms the accuracy of WTFW in link prediction and the quality of the associated explanations.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data Mining

Keywords: social networks; link prediction

1. INTRODUCTION

Link prediction is the task of estimating the likelihood of the existence of an unobserved link between two nodes, based on the other observable links around the two nodes and, when available, the attributes of the nodes [8]. It finds application in any context in which the network is only partially observable and we want to guess the unobserved part.
A typical setting is when we consider the network evolving over time, so that the unobservable part of the network is the set of links which are not yet created: given the graph observed at time t, we want to predict the set of links which will be created in the time interval [t, t+1] [17].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'14, August 24-27, 2014, New York, NY, USA. Copyright 2014 ACM.

Link prediction has been applied in a variety of domains, ranging from bioinformatics to web site management, from bibliography to e-commerce [12, 18, 5]. However, the most immediate and prominent application of link prediction is the recommendation of users to other users of a social network. This is one of the most fundamental functionalities common to all on-line social networking platforms (footnote 1): it helps users get a quicker start in building their network, thus driving engagement and loyalty. It is a key component for the growth and sustenance of a social network: for instance, the Wtf ("Who to Follow") service at Twitter is claimed to be responsible for millions of new links daily [11]. Given that growing the user base and maintaining a high level of engagement are key factors for the success (or the death) of these billion-dollar businesses, one can easily figure out the importance of user recommendation systems.

In this paper we study link prediction with explanations for user recommendation systems in on-line social networks.
Enriching recommendations with explanations has the benefit of increasing the user's trust in the recommendation, and thus the likelihood that the recommendation is adopted. While these benefits are well understood in classic collaborative-filtering recommender systems [14, 25, 30], providing explanations in the context of user recommendation systems is still largely underdeveloped: in fact, in most real-world systems, the only explanations given for user recommendations are of the type "you should follow user Z because your contacts X and Y do the same".

Our starting observation is that a link creation is usually explainable by one of two main reasons: interest identity or personal social relations. This observation is rooted in sociology, where it goes under the name of common identity and common bond theory [24, 26]. Identity-based attachment holds when people join a community based on their interest in a well-defined common theme shared by all the members of that community. The goal in this case is information collecting and sharing in the specific theme of interest. People joining a community through identity-based links may not even directly participate, e.g., by producing content or by engaging with other members, and instead only passively consume information. Conversely, bond-based attachment is driven by personal social relations with other specific individuals (e.g., family, friends, colleagues), and thus it does not require a common theme of interest to be justified. Bond-based links are usually reciprocated, while identity-based links are much more directional, where the direction is given by the level of authoritativeness of the user on the theme. The two types of links create two different types of communities, which for simplicity we name "topical" for identity-based and "social" for bond-based [10].

Based on this observation we define a stochastic model, dubbed WTFW ("Who to Follow and Why"), which not only predicts links, but for each predicted link it decides whether it is a "topical" or a "social" link, and depending on this decision it produces a different type of explanation. A topical link $u \to v$ (u should follow v) is usually recommended to u when v is authoritative in a topic in which u has demonstrated interest. In this case the explanation is a set of the top-k binary features (e.g., tags in Flickr or hashtags in Twitter) describing the topic of authoritativeness of v, which makes v a potential source of interesting information for u. A social link $u \to v$ instead is recommended when u and v are already part of the same social community, i.e., they have many contacts in common. In this case the explanation is the set of the top-k common neighbors w.r.t. the likelihood of being responsible for the link creation. As an important by-product, WTFW also implicitly detects communities and their type (social or topical).

In more detail, WTFW is a Bayesian topic model defined over directed and nodes-attributed graphs. In WTFW each link creation and each attribute adoption by a node are explained w.r.t. a finite number of latent factors. These latent factors can be abstractly thought of as topics or communities: in the rest of the paper we will use the three terms (latent factor, topic, and community) interchangeably. Each community is characterized by a level of sociality/topicality: social communities are characterized by high density and reciprocity of links, whereas topical communities are characterized by low entropy in the features and by the presence of authoritative users on the relevant topic.

Footnote 1: E.g., "People You May Know" in Facebook and LinkedIn, "Recommended Blogs" in Tumblr, or "Who to Follow" in Twitter, just to mention a few.
Each user tends to be involved in different communities to a different extent and with different roles. These components are modeled by three different multinomial distributions over the set of users, modeling their sociality, authoritativeness and interest in each topic. Finally, each topic is characterized by a multinomial distribution over the feature set, which provides a semantic interpretation of the topic.

Paper contributions. The contributions of this paper are summarized as follows:

• We study for the first time the problem of link prediction with explanations, which is motivated by the real-world application of user recommender systems in online social networks.

• We introduce WTFW ("Who to Follow and Why"), a stochastic topic model which not only predicts links, but for each predicted link it decides whether it is a "topical" or a "social" link, and depending on this decision it produces a different type of explanation. As a by-product, WTFW also implicitly extracts communities that can be labeled as either topical or social.

• Our experimental assessment on two real-world datasets (Twitter and Flickr) confirms that our model is very accurate in link prediction and in labeling the predicted link as social or topical. The experiments also highlight the high quality of the topics extracted and their coherence with the topical vs. social labeling.

2. RELATED WORK

Link prediction has attracted a great deal of attention in the last decade (the interested reader may refer to [12, 18] for a comprehensive survey): however, to the best of our knowledge, no previous work has studied link prediction with explanations for user recommendation systems.

Our proposal can also be collocated in the literature on relational learning methods that are able to leverage attribute information on nodes [29, 34, 19]. The main drawback of those approaches is scalability, which seriously limits their application to real-world networks. The Supervised Random Walk algorithm for link prediction [1] exploits edge features to learn the edge strength, which is then used as the random walk transition probability.
Alternative random-walk approaches rely on merging the social graph and node attributes in a unique graph with person-nodes and attribute-nodes linked among them [33, 9].

The joint factorization of social links and node attributes is closely related to the task of detecting communities in nodes-attributed graphs. [35] uses node attributes to augment the social graph by generating attribute edges between nodes that are similar on a given attribute, and then identifies communities in the augmented graph. [21] introduces the problem of finding cohesive patterns, defined as connected subgraphs whose density exceeds a given threshold, and with homogeneous values on node attributes. [31] proposes a co-clustering framework based on users and tags. Users are implicitly connected by their common interests, as expressed by the tags they use. [23] studies the problem of finding communities with concise descriptions based on the node attributes.

Several stochastic models for community detection in networks with node attributes have been proposed in the literature. In Link-LDA [6] social connections and user attributes are generated by a mixture of user-specific distributions over topics. In [22, 15, 32] the community-membership vectors are used to factorize both links and the attribute profile of each user. [27] extends the author-topic model to communication networks in which the sender and recipient of each post are known. [2] proposes a generative stochastic model to detect communities from the social graph and a database of information propagations over the social graph.

3. WHO TO FOLLOW AND WHY

In this section we introduce the WTFW model for link prediction with explanations. Our application scenario is that of online social networking platforms, where users build and maintain social connections, share information, and follow updates from other users. We represent this as a directed graph, where each node is a user and has an associated set of binary features, representing the interests of the user.
More formally, let $G = (V, E)$ be the social graph, where $V$ is a set of $n$ users, $E \subseteq V \times V$ is a set of $m$ directed arcs, and $(u, v) \in E$ indicates that $u$ follows $v$ and hence is notified of $v$'s activities. We also denote the neighborhood of a node $u$ as $N(u) = \{v \in V : (u, v) \in E \lor (v, u) \in E\}$. Moreover, let $\mathcal{F}$ denote a set of $h$ binary features. We are given a binary $n \times h$ matrix $F$ such that $F_{u,f} = 1$ when user $u$ is interested in the feature $f$. For simplicity we denote this case also as $(u, f) \in F$. Finally, we denote all the features of the node $u$ as $F(u) = \{f \in \mathcal{F} : (u, f) \in F\}$ and the set of all the nodes having attribute $f$ as $V(f) = \{u \in V : (u, f) \in F\}$.
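As a minimal sketch of the notation above, the neighborhood N(u), the feature set F(u) and the user set V(f) can be indexed from lists of arcs and adoptions. The function name `build_indexes` and the toy users and tags are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict

def build_indexes(edges, feature_pairs):
    """Build N(u), F(u) and V(f) from arcs (u, v) and adoptions (u, f)."""
    neighbors = defaultdict(set)      # N(u): users linked to u in either direction
    user_features = defaultdict(set)  # F(u): features adopted by u
    feature_users = defaultdict(set)  # V(f): users that adopted f
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    for u, f in feature_pairs:
        user_features[u].add(f)
        feature_users[f].add(u)
    return neighbors, user_features, feature_users

# Toy data (hypothetical): "ann" follows "bob", "bob" follows "cat".
edges = [("ann", "bob"), ("bob", "cat")]
feature_pairs = [("ann", "#jazz"), ("bob", "#jazz"), ("cat", "#rock")]
N, F_u, V_f = build_indexes(edges, feature_pairs)
```

Note that N(u) ignores direction, as in the definition, while the arc list itself stays directed.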

1. Sample $\Pi \sim \mathrm{Dir}(\vec{\xi})$
2. For each $k \in \{1, \ldots, K\}$ sample
   $\delta_k \sim \mathrm{Beta}(\delta_0, \delta_1)$, $\tau_k \sim \mathrm{Beta}(\tau_0, \tau_1)$,
   $\theta_k \sim \mathrm{Dir}(\vec{\alpha})$, $A_k \sim \mathrm{Dir}(\vec{\beta})$, $S_k \sim \mathrm{Dir}(\vec{\gamma})$, $\Phi_k \sim \mathrm{Dir}(\vec{\eta})$
3. For each link $l \in \{l_1, \ldots, l_m\}$ to generate:
   (a) Choose $k \sim \mathrm{Discrete}(\Pi)$
   (b) Sample $x_l \sim \mathrm{Bernoulli}(\delta_k)$
   (c) If $x_l = 1$:
         sample source $u \sim \mathrm{Discrete}(\theta_k)$
         sample destination $v \sim \mathrm{Discrete}(\theta_k)$
       else:
         sample source $u \sim \mathrm{Discrete}(S_k)$
         sample destination $v \sim \mathrm{Discrete}(A_k)$
4. For each feature pair $a \in \{a_1, \ldots, a_t\}$ to associate:
   (a) Sample $k \sim \mathrm{Discrete}(\Pi)$
   (b) Sample $y_a \sim \mathrm{Bernoulli}(\tau_k)$: if $y_a = 1$ then $u_a \sim \mathrm{Discrete}(A_k)$, otherwise $u_a \sim \mathrm{Discrete}(S_k)$
   (c) Sample $f_a \sim \mathrm{Discrete}(\Phi_k)$

Figure 1: Generative process for the WTFW model.

Following the common identity and common bond theory discussed in Section 1, we assume two main types of behavior in creating connections in a social network: the topical behavior, in which a user u decides to follow another user v because of u's interest in a topic in which v is authoritative; and the social behavior, in which u follows v because they know each other in the real world, or they have many common contacts in the social network. In the topical behavior case we can further identify two distinct roles for a user, either as authoritative ("influential") for the topic or just interested ("susceptible") in the topic. In the social case instead there are no specific roles, but a generic tendency to connect among the users of a close-knit circle.

Following these considerations, we propose to explain the structure of the network (the links) and the features of the nodes by introducing a set of latent factors representing users' interests, and by labeling the links as either social or topical. This is done by means of a unique stochastic topic model, which is based on the following assumptions:

• Links can be explained by different latent factors (overlapping communities);

• Social links tend to be reciprocal, and communities characterized by a high level of sociality exhibit high density;

• Topical links tend to exhibit a clear directionality, and communities that are highly topical have low entropy on the set of features assigned to nodes.
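The generative process of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the parameters are passed in as already-sampled lists rather than drawn from their Dirichlet and Beta priors, and the names `Pi`, `delta`, `tau`, `theta`, `A`, `S`, `Phi` and both function names are this sketch's own labels for the model components.

```python
import random

def sample_discrete(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_link(Pi, delta, theta, A, S, rng):
    """Step 3 of Figure 1: pick a factor k, the link nature x, then the endpoints."""
    k = sample_discrete(Pi, rng)
    x = 1 if rng.random() < delta[k] else 0   # 1 = social, 0 = topical
    if x == 1:
        u = sample_discrete(theta[k], rng)    # both endpoints from the sociality
        v = sample_discrete(theta[k], rng)    # distribution of community k
    else:
        u = sample_discrete(S[k], rng)        # source: susceptible (interested) user
        v = sample_discrete(A[k], rng)        # destination: authoritative user
    return k, x, u, v

def generate_feature_pair(Pi, tau, A, S, Phi, rng):
    """Step 4 of Figure 1: pick a factor k, the user role y, the user and the feature."""
    k = sample_discrete(Pi, rng)
    y = 1 if rng.random() < tau[k] else 0     # 1 = authoritative, 0 = susceptible
    u = sample_discrete(A[k] if y == 1 else S[k], rng)
    f = sample_discrete(Phi[k], rng)
    return k, y, u, f

# Toy parameters (hypothetical): K = 2 factors, 3 users, 2 features.
rng = random.Random(0)
Pi = [0.5, 0.5]
delta, tau = [0.9, 0.1], [0.5, 0.5]
theta = [[0.4, 0.4, 0.2], [1 / 3, 1 / 3, 1 / 3]]
A = [[1 / 3, 1 / 3, 1 / 3], [0.1, 0.8, 0.1]]
S = [[1 / 3, 1 / 3, 1 / 3], [0.8, 0.1, 0.1]]
Phi = [[0.5, 0.5], [0.9, 0.1]]
links = [generate_link(Pi, delta, theta, A, S, rng) for _ in range(5)]
```

The sketch does not reject self-loops or duplicate links; it only mirrors the sampling order of the figure.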
In more detail, the degree of involvement and the role of user $u$ in the community/topic $k$ is governed by three parameters: (1) $A_{k,u}$, which measures the degree of authoritativeness of $u$ in $k$; (2) $S_{k,u}$, which measures the degree of interest of $u$ in the topic $k$ or, in other terms, the likelihood of following users that are authoritative in $k$ (susceptibility to social influence); and (3) $\theta_{k,u}$, which denotes the social tendency of $u$, i.e., her likelihood to connect to other social peers within community $k$. Moreover, each latent factor $k$ is characterized by a propensity to adopt certain features in $\mathcal{F}$ over others. We can formalize such a propensity by means of a weight $\Phi_{k,f}$, denoting the importance of feature $f$ within $k$.

[Figure 2: The WTFW model in plate notation.]

All these components are accommodated in a mixture membership model expressed in a Bayesian setting [4], to define distributions governing the stochastic process, given some prior hypotheses. Bayesian modeling is better suited when the underlying data is characterized by high sparsity (as in our case), as it allows better control of the priors which govern the model and it prevents overfitting. In particular, we directly model each observed social link $(u, v) \in E$ or adoption of a feature by a node $(u, f) \in F$ and introduce random variables on the source/destination of these observations. That is, for each link $(u, v) \in E$ we model the likelihood that there exists a latent factor $k$ such that $u$ has high probability of being a source, while $v$ has high probability of being a destination. We further introduce a latent variable $x_{u,v}$, which encodes the (social/topical) nature of an existing link. Analogously, the adoption of an observed feature association $(u, f) \in F$ will be explained by a latent factor $k$ and by the status of the latent variable $y_{u,f}$, which represents the role of the user $u$, either as authoritative or just interested, when adopting the feature $f$.
The underlying generative process for social links and adoption of features depends jointly on the components $\theta$, $A$, $S$ and $\Phi$, as described in Figure 1 and depicted in plate notation in Figure 2. The overall generative process is governed by the following components:

• A multinomial distribution $\Pi$ over a fixed number $K$ of latent factors, which generates the latent community assignments $z_l$ and $z_a$ for each link $l \in E$ and for each adoption of a feature $a \in F$;

• The multinomial distributions $\theta_k$, $A_k$ and $S_k$ over the set of users $V$, which specify, respectively, the degree of sociality, authority and susceptibility of each user within $k$;

• The multinomial probability $\Phi_k$ over $\mathcal{F}$, which specifies the likelihood of observing each feature within the community $k$;

• The degree of sociality $\delta_k$ (or topicality, $1 - \delta_k$), which measures the likelihood of observing social/topical connections within each community $k$;

• The authoritative attitude $\tau_k$ of observing the adoption of an attribute by an authoritative subject in $k$ (or, dually, the susceptible attitude, $1 - \tau_k$).

Since the whole model relies on multinomial and Bernoulli distributions, a full Bayesian treatment can be obtained by adopting Dirichlet and Beta conjugate priors.

Let $\Theta = \{\Pi, \theta, \Phi, \delta, \tau, A, S\}$ denote the status of the distributions described above. Both the probability of observing a link $l = (u, v)$ and a feature assignment $a = (u, f)$ can be expressed as mixtures over the latent community assignments $z_l$ and $z_a$:

$$\Pr(l \mid \Theta) = \sum_{k=1}^{K} \Pi_k \Pr(l \mid z_l = k, \Theta) \qquad (1)$$

$$\Pr(a \mid \Theta) = \sum_{k=1}^{K} \Pi_k \Pr(a \mid z_a = k, \Theta) \qquad (2)$$

The generation of a link changes depending on the status of the latent variable $x_l$. A social connection $l = (u, v)$ can only be observed if, by picking a latent community $k$, $u$ and $v$ have high degrees of social attitude $\theta_{k,u}$ and $\theta_{k,v}$, that is, $\Pr(l \mid z_l = k, x_l = 1, \Theta) = \theta_{k,u} \cdot \theta_{k,v}$. Conversely, a topical connection $l = (u, v)$ can only be observed if, by picking a latent community $k$, $u$ has a high degree of passive interest $S_{k,u}$ and $v$ has a high degree of authoritativeness $A_{k,v}$, that is, $\Pr(l \mid z_l = k, x_l = 0, \Theta) = S_{k,u} \cdot A_{k,v}$. Note that the reciprocal link $(v, u)$ is equally likely in the case of a social connection, while its likelihood is different in a topical context; this reflects our design assumption on the directionality of links in social/topical communities. Each link is finally generated by taking into account the social/topical mixture of each community:

$$\Pr(l \mid z_l = k, \Theta) = \delta_k \Pr(l \mid z_l = k, x_l = 1, \Theta) + (1 - \delta_k) \Pr(l \mid z_l = k, x_l = 0, \Theta) = \delta_k \, \theta_{k,u} \, \theta_{k,v} + (1 - \delta_k) \, S_{k,u} \, A_{k,v}$$

Similarly, the probability of observing a node-feature pair $a = (u, f) \in F$ depends on the degree of authoritativeness/susceptibility of the user and on the likelihood of observing the attribute $f$ within each latent factor $k$:

$$\Pr(a \mid z_a = k, \Theta) = \left(\tau_k A_{k,u} + (1 - \tau_k) S_{k,u}\right) \Phi_{k,f}.$$

Here, the term $\tau_k A_k + (1 - \tau_k) S_k$ defines a multinomial distribution over users, which encodes the joint (both susceptible and authoritative) attitude of users within that community.
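The mixture probabilities above can be evaluated directly once the parameters are known. The following is a small sketch, assuming parameters stored as plain nested lists indexed by factor and then by user or feature; the function and argument names are illustrative choices, not the authors' code.

```python
def link_probability(u, v, Pi, delta, theta, A, S):
    """Mixture probability of the directed link (u, v) over all latent factors."""
    total = 0.0
    for k in range(len(Pi)):
        social = theta[k][u] * theta[k][v]   # symmetric in u and v
        topical = S[k][u] * A[k][v]          # u interested, v authoritative
        total += Pi[k] * (delta[k] * social + (1.0 - delta[k]) * topical)
    return total

def feature_probability(u, f, Pi, tau, A, S, Phi):
    """Mixture probability of the adoption (u, f) over all latent factors."""
    total = 0.0
    for k in range(len(Pi)):
        user_term = tau[k] * A[k][u] + (1.0 - tau[k]) * S[k][u]
        total += Pi[k] * user_term * Phi[k][f]
    return total

# Toy model (hypothetical): one factor, two users, one feature.
Pi, delta, tau = [1.0], [0.5], [0.5]
theta = [[0.5, 0.5]]
S = [[1.0, 0.0]]   # user 0 is the interested one
A = [[0.0, 1.0]]   # user 1 is the authoritative one
Phi = [[1.0]]
forward = link_probability(0, 1, Pi, delta, theta, A, S)    # 0.625
backward = link_probability(1, 0, Pi, delta, theta, A, S)   # 0.125
```

In the toy model the social part contributes equally in both directions (0.125), while the topical part only supports the link from the interested user to the authoritative one, which is the directionality discussed in the text.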
3.1 Learning

We have described the intuitions behind our joint modeling of links and feature associations, and now we focus on defining a procedure for inference and parameter estimation under WTFW. Let $\Omega = \{\vec{\xi}, \vec{\alpha}, \vec{\beta}, \vec{\gamma}, \vec{\eta}, \vec{\delta} = \{\delta_0, \delta_1\}, \vec{\tau} = \{\tau_0, \tau_1\}\}$ denote the set of hyperparameters of the Dirichlet/Beta priors. Also, let $Z_e$ represent a binary $m \times K$ matrix where $z_{l,k} = 1$ denotes that link $l$ has been associated with the $k$-th latent factor (i.e., $z_l = k$). Analogously, $Z_f$ denotes the $t \times K$ binary matrix where $z_{a,k} = 1$ denotes that feature assignment $a \in F$ is associated with the $k$-th latent factor ($z_a = k$). Finally, $X$ and $Y$ denote the vectors of assignments $x_l$ and $y_a$. With an abuse of notation, we also introduce the counters described in Tab. 1, relative to these matrices.

The key problem in inference is to compute the posterior distribution of latent variables given the observed data. We start by expressing the joint likelihood as:

$$\Pr(E, F, \Theta, Z_e, Z_f, X, Y \mid \Omega) = \Pr(E \mid \Theta, X, Z_e) \Pr(F \mid \Theta, Y, Z_f) \Pr(Z_e \mid \Theta) \Pr(Z_f \mid \Theta) \Pr(X \mid Z_e, \Theta) \Pr(Y \mid Z_f, \Theta) \Pr(\Theta \mid \Omega) \qquad (3)$$

where

$$\Pr(E \mid \Theta, X, Z_e) = \prod_k \prod_u \theta_{k,u}^{\,c^k_{s,s,u} + c^k_{s,d,u}} \; S_{k,u}^{\,c^k_{t,s,u}} \; A_{k,u}^{\,c^k_{t,d,u}}$$

$$\Pr(F \mid \Theta, Y, Z_f) = \prod_k \prod_u A_{k,u}^{\,d^k_{a,u}} \; S_{k,u}^{\,d^k_{s,u}} \; \prod_k \prod_f \Phi_{k,f}^{\,d^k_f}$$

$$\Pr(Z_e \mid \Theta) = \prod_k \Pi_k^{\,c^k} \qquad \Pr(Z_f \mid \Theta) = \prod_k \Pi_k^{\,d^k}$$

$$\Pr(X \mid Z_e, \Theta) = \prod_k \delta_k^{\,c^k_s} (1 - \delta_k)^{\,c^k_t} \qquad \Pr(Y \mid Z_f, \Theta) = \prod_k \tau_k^{\,d^k_a} (1 - \tau_k)^{\,d^k_s}$$

and $\Pr(\Theta \mid \Omega)$ represents the product of all the Dirichlet and Beta priors. By marginalizing over $\Theta$, we can obtain a closed form for the joint likelihood $\Pr(E, F, Z_e, Z_f, X, Y \mid \Omega)$. The latter is the basis for developing a stochastic EM strategy [3, Section ...], where the E-step consists of a collapsed Gibbs sampling procedure [13, 3] for estimating the matrices $Z_e$, $Z_f$, $X$ and $Y$, and the M-step estimates both the predictive distributions in $\Theta$ and the hyperparameters of interest in $\Omega$. In particular, the sampling step consists of a sequential update, for each arc and feature assignment, of the status of the corresponding latent variables in $Z_e$, $Z_f$, $X$ and $Y$. A possible sampling strategy for each arc $l \in E$ and adoption $a \in F$ is based on the following chain: $\Pr(z_l = k \mid \text{Rest})$, $\Pr(z_a = k \mid \text{Rest})$, $\Pr(x_l = 1 \mid \text{Rest})$ and $\Pr(y_a = 1 \mid \text{Rest})$ (footnote 2).
By algebraic manipulations, we can devise the sampling equations expressed in Tab. 8. The overall learning scheme is shown in Alg. 1. Lines 5-12 of the algorithm represent the Gibbs sampling steps, while line 14 represents the update of the multinomial distributions which are collapsed in the derivation of the sampling equations:

$$A_{k,u} = \frac{c^k_{t,d,u} + d^k_{a,u} + \beta_u}{c^k_t + d^k_a + \sum_u \beta_u} \qquad (4)$$

$$\theta_{k,u} = \frac{c^k_{s,s,u} + c^k_{s,d,u} + \alpha_u}{2 c^k_s + \sum_u \alpha_u} \qquad (5)$$

$$S_{k,u} = \frac{c^k_{t,s,u} + d^k_{s,u} + \gamma_u}{c^k_t + d^k_s + \sum_u \gamma_u} \qquad (6)$$

$$\Phi_{k,f} = \frac{d^k_f + \eta_f}{d^k + \sum_f \eta_f} \qquad (7)$$

$$\Pi_k = \frac{c^k + d^k + \xi_k}{m + t + \sum_k \xi_k} \qquad (8)$$

In line 15 we update the Beta ($\vec{\delta}$, $\vec{\tau}$) and Dirichlet $\vec{\xi}$ hyperparameters, according to the fixed point iterative procedure described in [20].

| Symbol | Description | Expression |
|---|---|---|
| $c^k$ | Number of links associated with community $k$ | $\sum_{l \in E} z_{l,k}$ |
| $c_s$ | Number of social links | $\sum_{l \in E} x_l$ |
| $c_t$ | Number of topical links | $\sum_{l \in E} (1 - x_l)$ |
| $c^k_s$ | Number of social links associated with community $k$ | $\sum_{l \in E} x_l \, z_{l,k}$ |
| $c^k_t$ | Number of topical links associated with community $k$ | $\sum_{l \in E} (1 - x_l) \, z_{l,k}$ |
| $c^k_{s,s,u}$ | Number of social links associated with community $k$ where $u$ is the source | $\sum_{l = (u, \cdot) \in E} x_l \, z_{l,k}$ |
| $c^k_{s,d,u}$ | Number of social links associated with community $k$ where $u$ is the destination | $\sum_{l = (\cdot, u) \in E} x_l \, z_{l,k}$ |
| $c^k_{t,s,u}$ | Number of topical links associated with community $k$ where $u$ is the source | $\sum_{l = (u, \cdot) \in E} (1 - x_l) \, z_{l,k}$ |
| $c^k_{t,d,u}$ | Number of topical links associated with community $k$ where $u$ is the destination | $\sum_{l = (\cdot, u) \in E} (1 - x_l) \, z_{l,k}$ |
| $d^k$ | Number of feature assignments associated with community $k$ | $\sum_{a \in F} z_{a,k}$ |
| $d_a$ | Number of authoritative feature assignments | $\sum_{a \in F} y_a$ |
| $d_s$ | Number of susceptible feature assignments | $\sum_{a \in F} (1 - y_a)$ |
| $d^k_a$ | Number of feature assignments within community $k$ on authoritative users | $\sum_{a \in F} y_a \, z_{a,k}$ |
| $d^k_s$ | Number of feature assignments within community $k$ on susceptible users | $\sum_{a \in F} (1 - y_a) \, z_{a,k}$ |
| $d^k_f$ | Number of recipients associated with community $k$ relative to feature $f$ | $\sum_{a = (\cdot, f) \in F} (1 - y_a) \, z_{a,k}$ |
| $d^k_{a,u}$ | Number of features associated with community $k$ where $u$ is the authoritative source | $\sum_{a = (u, \cdot) \in F} y_a \, z_{a,k}$ |
| $d^k_{s,u}$ | Number of features associated with community $k$ where $u$ is the susceptible source | $\sum_{a = (u, \cdot) \in F} (1 - y_a) \, z_{a,k}$ |

Table 1: Counters adopted in the Gibbs sampling and their meaning.

The final predictive distributions $A$, $S$, $\theta$, $\Phi$ and $\Pi$ are averaged along all the steps of the Gibbs sampling procedure. A single iteration of the sampler performs $O((m + t) K)$ computations and hence it is linear in the size of the observed data.

In Alg. 1 we assume that the number of topics is given as input; typically this value is determined experimentally as the number of topics that maximizes the predictive performance. However, it is possible to automatically determine the number of topics by relying on Bayesian nonparametrics.

Footnote 2: The term Rest denotes the remaining variables in the set $\{E, F, Z_e, Z_f, X, Y, \Theta, \Omega\}$ after the explicit variables in both the conditioning and conditioned part have been removed.
In fact, as shown in [7], it is possible to adapt the sampling equations in order to make explicit the annihilation of some topics as well as the generation of new ones, according to the Chinese Restaurant Process principle.

Algorithm 1 Gibbs sampling with parameter estimation
Require: $G$ and $F$, the number $K$ of latent features, initial hyperparameter set $\Omega$.
 1: Random initialization for the matrices $Z_e$, $Z_f$, $X$ and $Y$;
 2: it ← 0
 3: converged ← false
 4: while it < nMaxIt and not converged do
 5:   for all observed links $l$ do
 6:     Sample $z_l$ according to Eq. 13 and 14
 7:     Sample $x_l$ according to Eq. 15 and 16
 8:   end for
 9:   for all observed attribute assignments $a$ do
10:     Sample $z_a$ according to Eq. 17 and 18
11:     Sample $y_a$ according to Eq. 19 and 20
12:   end for
13:   if (it > burn-in) and (it % sampleLag = 0) then
14:     Sample $A$ (Eq. 4), $\theta$ (Eq. 5), $S$ (Eq. 6), $\Phi$ (Eq. 7), and $\Pi$ (Eq. 8);
15:     Update hyperparameters $\vec{\delta}$, $\vec{\tau}$ and $\vec{\xi}$;
16:   end if
17:   it ← it + 1
18: end while

3.2 Producing explanations

The success of a recommender system does not only depend on its accuracy in inferring and exploiting users' interests, but also on how the deployed recommendations are perceived by the users. Explanations increase the transparency of the recommendation process and may positively contribute to gaining users' trust and satisfaction.

When generating explanations for social recommendations, the first step is to understand whether the proposed connection $l = (u, v)$ is social (i.e., such that $x_l = 1$) or topical (i.e., $x_l = 0$). WTFW provides a natural way to do this:

$$\Pr(x_l = 1 \mid l, \Theta) \propto \sum_k \Pi_k \, \delta_k \, \theta_{k,u} \, \theta_{k,v} \qquad (9)$$

$$\Pr(x_l = 0 \mid l, \Theta) \propto \sum_k \Pi_k \, (1 - \delta_k) \, S_{k,u} \, A_{k,v} \qquad (10)$$

Social connections have a natural explanation in terms of close-knit circles. Thus, for a given link $l = (u, v)$ predicted as social (i.e., such that $x_l = 1$), we can provide an explanation as the set of the most prospective common neighbors, ranked according to the following score:

$$\mathrm{rank}(w; l) = \sum_k \theta_{k,u} \, \theta_{k,v} \, \theta_{k,w}. \qquad (11)$$

This rank promotes common neighbors that have a high degree of involvement in social communities where both $u$ and $v$ are involved as well.
Interestingly, the score finds an explanation in terms of the probability of observing a social triangle among $u$, $v$ and $w$. In fact, the joint probability of observing $(u, v)$, $(u, w)$ and $(v, w)$ within community $k$ is proportional to $\theta_{k,u} \, \theta_{k,v} \, \theta_{k,w}$. And, since by definition both $(u, w)$ and $(v, w)$ hold in the data, the score explains the prospective new link $(u, v)$ in terms of the common neighbors which are more likely to form a triangle in the data.

Conversely, topical links can be explained through a list of attributes which are representative of the topics of interest of the current user and for which the recommended connection has high authority. For each feature common to the two nodes, we define the following score:

$$\mathrm{rank}(f; l) = \sum_k (1 - \delta_k) \, \Phi_{k,f} \, A_{k,v} \, \left(\tau_k A_{k,u} + (1 - \tau_k) S_{k,u}\right). \qquad (12)$$

Here, the latter term represents the topical involvement of the user $u$ within community $k$. Again, the score has an interpretation in terms of the prospective triangle among $(u, v)$, $(u, f)$ and $(v, f)$. Notice, however, that directionality plays a role here, since we are only interested in those features for which $v$ is authoritative.

The procedure for producing explanations for a recommended link is summarized in Alg. 2. In short, the procedure predicts the nature (either social or topical) of the prospective link, and then provides the list of the most prominent common neighbors or common features accordingly.
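The explanation procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the model is a plain dictionary of nested lists, the link is labeled social when its unnormalized social weight is at least the topical one (the tie rule is an arbitrary choice), the social and topical weights include the mixture weight of each factor, and the names `explain_link`, `model`, `N`, `F` and `top` are hypothetical.

```python
def explain_link(u, v, model, N, F, top):
    """Label the link (u, v) as social or topical and return its top explanations."""
    Pi, delta, tau = model["Pi"], model["delta"], model["tau"]
    theta, A, S, Phi = model["theta"], model["A"], model["S"], model["Phi"]
    ks = range(len(Pi))
    # Unnormalized weights of the social and topical readings of the link.
    social = sum(Pi[k] * delta[k] * theta[k][u] * theta[k][v] for k in ks)
    topical = sum(Pi[k] * (1 - delta[k]) * S[k][u] * A[k][v] for k in ks)
    if social >= topical:
        kind = "social"
        candidates = N[u] & N[v]   # common neighbors
        def score(w):
            return sum(theta[k][u] * theta[k][v] * theta[k][w] for k in ks)
    else:
        kind = "topical"
        candidates = F[u] & F[v]   # common features
        def score(f):
            return sum((1 - delta[k]) * Phi[k][f] * A[k][v]
                       * (tau[k] * A[k][u] + (1 - tau[k]) * S[k][u]) for k in ks)
    ranked = sorted(candidates, key=score, reverse=True)
    return kind, ranked[:top]

# Toy model (hypothetical): factor 0 is mostly social, factor 1 mostly topical.
model = {
    "Pi": [0.5, 0.5], "delta": [0.9, 0.1], "tau": [0.5, 0.5],
    "theta": [[0.4, 0.4, 0.2], [1 / 3, 1 / 3, 1 / 3]],
    "A": [[1 / 3, 1 / 3, 1 / 3], [0.1, 0.8, 0.1]],
    "S": [[1 / 3, 1 / 3, 1 / 3], [0.8, 0.1, 0.1]],
    "Phi": [[0.5, 0.5], [0.9, 0.1]],
}
N = {0: {2}, 1: {2}, 2: {0, 1}}
F = {0: {0, 1}, 1: {0, 1}, 2: set()}
kind, explanation = explain_link(0, 1, model, N, F, top=1)
```

In the toy model user 0 is strongly interested and user 1 strongly authoritative in factor 1, so the link from 0 to 1 is labeled topical and explained by feature 0, the feature that dominates that factor.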

Algorithm 2 Producing explanations
Require: The social network $G$, the WTFW model $\Theta$, a recommended link $l = (u, v)$ and the number of explanations $L$;
Ensure: a list $\mathcal{L}$ of either social or topical explanations for the link.
 1: $\mathcal{L} \leftarrow \emptyset$;
 2: Compute $x_l$ according to Equations 9 and 10
 3: if $x_l = 1$ then
 4:   $\mathcal{L}_N \leftarrow \emptyset$;
 5:   for all $w \in N(u) \cap N(v)$ do
 6:     Compute rank($w$, $l$) according to Eq. 11
 7:     $\mathcal{L}_N \leftarrow \mathcal{L}_N \cup \{(w, \mathrm{rank}(w, l))\}$
 8:   end for
 9:   Sort $\mathcal{L}_N$ and compute $\mathcal{L} = \mathrm{top}(\mathcal{L}_N, L)$
10: else
11:   $\mathcal{L}_F \leftarrow \emptyset$;
12:   for all $f \in F(u) \cap F(v)$ do
13:     Compute rank($f$, $l$) according to Eq. 12
14:     $\mathcal{L}_F \leftarrow \mathcal{L}_F \cup \{(f, \mathrm{rank}(f, l))\}$
15:   end for
16:   Sort $\mathcal{L}_F$ and compute $\mathcal{L} = \mathrm{top}(\mathcal{L}_F, L)$
17: end if

4. EXPERIMENTAL EVALUATION

In this section we report the empirical assessment of the proposed WTFW model on real networks. The experimentation is aimed at assessing the following:

• The accuracy of the model for both link prediction and label prediction, where the latter refers to the classification of a link as either social or topical.

• The scalability and stability of the learning procedure, by studying learning time and performance while varying the number of iterations of the Gibbs sampler.

• The quality of the associations between links and features, which we show by means of anecdotal evidence in the reconstruction of the data through the model.

Datasets. For our purposes we need datasets coming from social networking platforms in which link creation can be explained in terms of interest identity and/or personal social relations. This requirement is satisfied, among others, by two popular social networking platforms, namely Twitter and Flickr. On both platforms, the underlying network is inherently directed, to reflect the interest of users towards important and authoritative information sources. Moreover, in these systems the role of users may naturally change with respect to different topics.

The Twitter dataset we use is publicly available and it includes information from 973 ego-networks crawled from the public API. The resulting network contains roughly 80 thousand nodes and 1.7 million directed links.
Attribute information consists of all the hashtags (e.g. #sanfrancisco) and mentions (e.g. a mention of Obama) used by those users. Flickr data has been obtained by querying the Flickr public API in the time window [...] and then by performing forest fire sampling [16] on the resulting network. Features are generated by crawling all the tags used by each user. Flickr also contains a form of ground truth for the label prediction task. Specifically, for each link in the dataset there are two flags, namely friend and family, that a user can specify. We naturally interpret these flags as follows: a link is labeled as social if it is marked as either family or friend. Conversely, a link is topical if neither of the two flags is set. It is important to stress that this ground truth is expected to be very noisy, as is any user-declared information on the internet. As such, it is likely to produce an underestimation of the accuracy in the label prediction task.

In order to keep the experimental setting as close as possible to the original data (high dimensionality and exceptional sparsity), no further pre-processing has been performed. Basic statistics about these two datasets are given in Table 2. These datasets are characterized by different properties. The social graph in Twitter is much more directed and sparse than in Flickr, while the number of attributes per user is much higher in Flickr.

| | Twitter | Flickr |
|---|---|---|
| Number of nodes | 81,[...] | [...],000 |
| Number of links | 1,768,149 | 14,036,407 |
| Number of one-way links | 1,342,311 | 9,604,945 |
| Number of bidirectional links | 425,838 | 4,431,462 |
| Number of social links | - | 6,747,085 |
| Number of topical links | - | 7,289,322 |
| Avg in-degree | [...] | [...] |
| Avg out-degree | [...] | [...] |
| Number of features | 211,[...] | [...],201 |
| Number of feature assignments | 1,102,[...] | [...],316,862 |
| Avg. features per user | [...] | [...] |
| Avg. users per feature | 5 | 45 |

Table 2: Dataset statistics.

Experimental setting. In all the experiments we assume a partial observation of the network and a complete set of user features (footnote 4).
The learning algorithm starts with a random assignment to the latent variables, performs a burn-in phase (burn-in = 500) to stabilize the Markov chain, and then updates the parameters of the model at regular intervals (sampling lag = 20) for the next 2000 iterations. We initialize the hyperparameters with symmetric values [values lost in extraction].

4.1 Model Assessment

Evaluation on link prediction. In a first set of experiments, we measure the accuracy of the model in predicting new links. On Twitter, we perform a Monte Carlo cross-validation in 5 folds, by randomly splitting the network into training and test data. We also measure the accuracy of the learned models for different proportions of training/test data, namely 60/40, 70/30 and 80/20. This allows us to stress the robustness of the link prediction task for different proportions, and to mitigate the effects of the random splits. In Flickr instead the dataset contains the timestamp of creation of each link, allowing us to perform a chronological split, where older links (70% of the data) are used for learning the model, while the most recent 30% are used as the prediction target.

The accuracy of link prediction is measured by computing the area under the ROC curve (AUC) over a set of positive and negative examples drawn from the test set. In principle, we can consider all links in the test set as positive examples, and all non-existing links as negative examples. However, the sparsity of the networks poses two major issues: (i) the number of non-existing links can be enormous, thus making the computation of the AUC infeasible; (ii) missing links do not necessarily represent negative information, but rather unseen information [28]. Following [1], we thus limit the negative examples to all the 2-hop non-existing links.

Footnote 4: The task of predicting/recommending missing features is not investigated here and it is left as future work.
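The evaluation protocol above can be illustrated with a short sketch. Two assumptions are made here that the text does not spell out: a "2-hop non-existing link" is taken to be an ordered pair (u, v) that is not an arc but shares at least one neighbor in the undirected sense, and the AUC is computed by direct pairwise comparison, which is only practical for small samples. The function names are illustrative.

```python
def two_hop_negatives(edges):
    """Ordered pairs (u, v) at two hops (ignoring direction) that are not arcs."""
    edge_set = set(edges)
    nbrs = {}
    for u, v in edge_set:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    negatives = set()
    for u in nbrs:
        for w in nbrs[u]:
            for v in nbrs[w]:
                if v != u and (u, v) not in edge_set:
                    negatives.add((u, v))
    return negatives

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy usage (hypothetical data): a path 0 -> 1 -> 2.
negatives = two_hop_negatives([(0, 1), (1, 2)])   # {(0, 2), (2, 0)}
value = auc([0.9, 0.8], [0.1, 0.8])               # 0.875
```

For large test sets a rank-based AUC computation would be preferable to the quadratic loop used here.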

Table 3: AUC on link prediction - Twitter. Rows: method (WTFW, JSVD, CNF, AA-NF) by split (60/40, 70/30, 80/20); columns: number of latent factors. Recoverable entries: CNF /0.7125/, AA-NF /0.7397/.

Table 4: AUC on link prediction - Flickr. Rows: method (WTFW, JSVD, CNF, AA-NF); columns: number of latent factors. Recoverable entries: CNF 0.53, AA-NF 0.58.

Table 5: AUC on link labeling - Flickr. Rows: method (WTFW, Baseline); columns: number of latent factors.

We compare the performance of the WTFW model with some popular baseline approaches from the literature, which perform well on a range of networks [18, 9]: Common Neighbors/Features (CNF) and Adamic-Adar (AA-NF). CNF is a local similarity index that produces a score for each link (u, v), given by the number of common neighbors/features:

$$\mathrm{score}(u, v) = |N(u) \cap N(v)| + |F(u) \cap F(v)|.$$

AA-NF refines the simple counting of common neighbors/features by assigning more weight to less-connected components:

$$\mathrm{score}(u, v) = \sum_{w \in N(u) \cap N(v)} \frac{1}{\log |N(w)|} + \sum_{f \in F(u) \cap F(v)} \frac{1}{\log |V(f)|}.$$

In addition, we compare WTFW with a matrix factorization approach based on SVD, dubbed Joint SVD (JSVD) [9]. In practice, the approach computes a low-rank factorization of the joint adjacency/feature matrix $X = [E \; F]$ as $X \approx U \, \mathrm{diag}(\sigma_1, \ldots, \sigma_K) \, V^T$, where $K$ is the rank of the decomposition and $\sigma_1, \ldots, \sigma_K$ are the square roots of the $K$ greatest eigenvalues of $X^T X$. The matrices $U$ and $V$ can be interpreted in terms of the connectivity of both $E$ and $F$. The term $U_{u,k}$ can be interpreted as the tendency of $u$ to be either a source in $E$ or an adopter in $F$, relative to factor $k$. Analogously, $V_{u,k}$ represents the tendency of $u$ to appear as a destination in $E$, and $V_{f,k}$ represents the likelihood that item $f$ is adopted in $k$. The link prediction score can hence be computed as:

$$\mathrm{score}(u, v) = \sum_{k=1}^{K} U_{u,k} \, \sigma_k \, V_{v,k}.$$

Tables 3 and 4 summarize the results of the evaluation, for increasing values of the number of latent topics/factors. On Twitter data, both WTFW and JSVD underperform when the number of latent factors is limited, but exhibit a competitive advantage over the baselines for higher values of $K$.
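The two local baselines can be sketched as below. This is an illustrative sketch under stated assumptions rather than the evaluated implementation: neighborhoods N(u), feature sets F(u), and adopter sets V(f) are represented as plain Python sets in dictionaries (hypothetical names `neighbors`, `features`, `adopters`), and AA-NF is assumed to use the standard Adamic-Adar logarithmic weighting, skipping terms whose logarithm would be zero.

```python
import math


def cnf_score(u, v, neighbors, features):
    """Common Neighbors/Features: count shared neighbors plus shared features."""
    common_neighbors = neighbors[u] & neighbors[v]
    common_features = features[u] & features[v]
    return len(common_neighbors) + len(common_features)


def aa_nf_score(u, v, neighbors, features, adopters):
    """Adamic-Adar over neighbors and features.

    Each shared neighbor w contributes 1 / log|N(w)| and each shared
    feature f contributes 1 / log|V(f)|, so rare (less-connected) items
    weigh more. Items of size 1 are skipped to avoid dividing by log(1) = 0
    (an assumption of this sketch).
    """
    score = 0.0
    for w in neighbors[u] & neighbors[v]:
        if len(neighbors[w]) > 1:
            score += 1.0 / math.log(len(neighbors[w]))
    for f in features[u] & features[v]:
        if len(adopters[f]) > 1:
            score += 1.0 / math.log(len(adopters[f]))
    return score
```

Both scores only look at the immediate neighborhood of the candidate pair, which is why they need no training and serve as inexpensive reference points for the latent-factor models.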
WTFW outperforms the other considered approaches, and these results are stable across different training/test proportions. The prediction on Flickr is in general weaker for all methods. However, the results seem more stable, since the difference with respect to JSVD remains constant for increasing values of the number of latent factors $K$. The standard baselines perform poorly on this dataset. Figure 3 shows the ROC curves on both datasets for $K = 256$. On Twitter, there are some limited areas where the JSVD curve is skewed but, in general, WTFW clearly outperforms the other methods. This is even more evident on the Flickr dataset.

Figure 3: Link prediction: Twitter (left) and Flickr (right).

Figure 4: (a) Accuracy of link labeling on Flickr. (b) Learning times (in minutes, by number of topics) on the 70/30 split for both Flickr and Twitter.

Evaluation on link labeling. We next turn our attention to the task of discriminating between social and topical links, thanks to the ground truth that we have in the Flickr dataset. Again, we measure the accuracy by computing the AUC on the prediction, and by comparing the result with a baseline based on common neighbors/features. That is, a link $l = (u, v)$ is deemed social if the weight of the common neighbors is higher than that of the common features, and topical otherwise. Formally:

$$\Pr(x_l = 1) = \frac{|N(u) \cap N(v)|}{|N(u) \cap N(v)| + |F(u) \cap F(v)|}.$$

Table 5 reports the results for increasing values of $K$. The best results are obtained with a lower number of topics, in particular for $K = 32$. This is somewhat surprising if we compare these results with the results on link prediction discussed above. In an attempt to explain such behavior, we analysed the learned parameters of the model, and we noticed that all models exhibit a strongly dominant latent factor. We will discuss this component also in the next subsection: it is worth mentioning, however, that the associated probability leans towards 0.5 (a clear sign that the community tends to mix topical and social contributions).
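The labeling baseline and the Flickr ground-truth rule can be sketched together as follows. The names are hypothetical, the set-based representation of N(u) and F(u) is an assumption, and returning 0.5 when a pair shares neither neighbors nor features is a choice made for this sketch, since the ratio is undefined in that case.

```python
def ground_truth_label(is_friend, is_family):
    """Flickr ground truth: social if either flag is set, topical otherwise."""
    return "social" if (is_friend or is_family) else "topical"


def social_probability(u, v, neighbors, features):
    """Baseline probability that link (u, v) is social.

    Ratio of shared neighbors to shared neighbors plus shared features.
    """
    common_neighbors = len(neighbors[u] & neighbors[v])
    common_features = len(features[u] & features[v])
    total = common_neighbors + common_features
    if total == 0:
        return 0.5  # assumption: no evidence either way
    return common_neighbors / total


def baseline_label(u, v, neighbors, features):
    """Deem the link social when common neighbors outweigh common features."""
    p = social_probability(u, v, neighbors, features)
    return "social" if p > 0.5 else "topical"
```

The probabilities returned by `social_probability` are what an AUC computation would rank against the flags; `baseline_label` is only the thresholded view of the same quantity.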
Clearly, this balanced value does not affect the performance in link prediction (as it only depends on whether any of the social/topical components is strong enough to trigger
