Probabilistic Community and Role Model for Social Networks

Probabilistic Community and Role Model for Social Networks Yu Han and Jie Tang Department of Computer Science and Tecnology, Tsingua Uniersity Tsingua National Laboratory for Information Science and Tecnology (TNList) Jiangsu Collaboratie Innoation Center for Language Ability, Jiangsu Normal Uniersity, Cina yuantu@16.com, jietang@tsingua.edu.cn ABSTRACT Numerous models ae been proposed for modeling social networks to explore teir structure or to address application problems, suc as community detection and beaior prediction. Howeer, te results are still far from satisfactory. One of te biggest callenges is ow to capture all te information of a social network suc as links, communities, user attributes, roles and beaiors, in a unified manner. In tis paper, we propose a unified probabilistic framework, te Community Role Model (), to model a social network. incorporates all te information of nodes and edges tat form a social network. We propose metods based on Gibbs sampling and an EM algoritm to estimate model parameters and fit our model to real social networks. Real data experiments sow tat can be used not only to represent a social network, but also to andle arious application problems wit better performance tan a baseline model, witout any modification to te model. Categories and Subject Descriptors J.4 [Social and Beaioral Sciences]: Sociology; H..8 [Database Applications]: Data Mining Keywords Social Network, Community, Beaior Prediction 1. INTRODUCTION Online social networks e.g., Twitter,, Flickr ae become large complex irtual systems. Visible and inisible elements interact and affect eac oter. We can use a grap to model te structure of a social network, were nodes and edges represent users and interactions, wic are isible elements. Tere are also dynamic isible elements i.e., user actions, suc as retweeting in Twitter and commenting in Flickr. Moreoer, tere are also inisible elements, suc as community [13, 38] and role [46, 50], tat affect te isible elements. Preious researc, suc as [10, 33, Permission to make digital or ard copies of all or part of tis work for personal or classroom use is granted witout fee proided tat copies are not made or distributed for profit or commercial adantage and tat copies bear tis notice and te full citation on te first page. Copyrigts for components of tis work owned by oters tan ACM must be onored. Abstracting wit credit is permitted. To copy oterwise, or republis, to post on serers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. KDD 15, August 10-13, 015, Sydney, NSW, Australia. c 015 ACM. ISBN 978-1-4503-3664-/15/08...$15.00. ttp://dx.doi.org/10.1145/78358.78374. 45] and empirical studies on online social networks, including [44], Twitter [0], Flickr [31], YouTube [3], Yaoo!360 [18], Cyworld, Myspace and Orkut [1], reealed many interesting penomena and basic underlying laws. For example, in a social network, nodes may ae closer relationsips witin a community tan across communities. Nodes may ae different attributes for example, some nodes may be popular, and ae many followers, but oters may be different. Nodes may exibit different beaiors for example, some nodes seem ery actie, and repost messages or comment on pictures, wile oters may not. Howeer, according to [4], people s beaiors not only depend on teir own attributes, but also on te influence of teir neigbors and communities. How sould we model a complex social network so tat te model can capture te intrinsic relations between all tese elements, suc as conformity influence, indiidual attributes, and actions? How do we use a social network model to andle issues suc as community detection and beaior prediction? Social network analysis as been attracting muc interest from researcers. Many models ae been proposed to model te structure of a social network [15, 16,, 35, 36, 37, 48] and to andle issues suc as social influence analysis [, 9, 14, 8, 41, 4, 50], beaior prediction [39, 46], and link prediction [17, 3, 7, 40]. [15] uses latent space to model a social network in wic eery node is associated wit a location in p-dimensional space, and two nodes are more likely to ae links if tey are closer. [48] describes a random grap model for social networks based on te dot product, wic assigns eac node a random ector, and quantifies te probability of a link between two nodes by te dot product of teir ectors. [6] proposes a model tat regards nodes as points in Euclidean space, and generates edges based on a mixture of te distances between nodes and a ranking function. [1] proposes a model to simulate te forming process of a social network wit te Kronecker product of adjacency matrices. [16] introduces a multiplicatie attribute grap model tat uses te affinity of attributes of two nodes to indicate te potential for tem to form a link. [46] takes te roles tat nodes migt play into consideration and proposes a model to predict information diffusion in a social network. [50] proposes a probabilistic model tat combines te nodes attributes and community influence to analyze nodes beaiors. Altoug muc progress as been made, te results of existing work are not satisfactory, due to teir limitations: 1. Most social network models utilize only portions of te aailable social network information. For example, [4]

only takes link information into consideration, ignoring te differences between te nodes temseles, wile [46] assumes tree roles tat nodes could play, ignoring te conformity influence in information diffusion.. Most models only focus on a few aspects of social networks, missing te global iew. For example, some papers only focus on te static structure of social networks, wile oters focus only on user beaiors.?? 3. Many models are based on discriminatie metods and ae not capture te nature of social networks. As a result, tey can only be used to settle specific issues. Suc models may seems reasonable in some specific circumstances but not in oters. 4. Some works use a deterministic metod. Howeer, tis is usually impractical in complex social networks. In tis paper, we mine te intrinsic relationsips between all isible and inisible elements of a social network, including communities, links, node attributes, roles and actions, and incorporate tem into a unified probabilistic generatie framework. Te proposed model can also easily andle many practical issues in social networks, suc as community detection and beaior prediction, witout any modification to te model. To te best of our knowledge, tis is te first model tat captures all te information of a social network and can represent all its facets. Te contributions of tis paper include: 1. We incorporate arious elements of a social network into a unified probabilistic generatie framework, wic can represent a complex social network better tan oter models. We furter design a metod to estimate te parameters of te model.. We use our model to generate a syntetic network wit te learned parameters, and erify te superiority of our model to te baseline metod in terms of six metrics. 3. We apply te model to two problems beaior prediction and community detection erifying its ersatility and effectieness. Tis paper is organized as follows: In Section, we propose te Community Role Model () to model a social network and proide a metod for parameter estimation and inference of. We conduct two sets of experiments and a case study on real data sets in Section 3. Section 4 is a surey of related work. We conclude te paper in Section 5.. MODEL.1 Intuition Te intuition beind our model is tat we can describe a social network as follows: First, a social network is composed of many nodes/users, and eac node is associated wit many edges/links. [1] offers an edge-distribution law, stating tat te distribution of edges is usually locally inomogeneous, and igly concentrated witin special groups of nodes, but sparse between tese groups. In oter words, eac node may belong to seeral communities, and weter it as a link to oter nodes 1 Figure 1: Social network migt depend on te communities to wic it belongs. Tus we can assume tat eac node as a distribution oer te communities i.e., tat different nodes may be located in different communities. A node in a specific community may ae a unique probability to link to anoter node. For example, as Figure 1 sows, in community c 1, as a iger probability to link to u and a lower probability to link to w. Howeer, te situation is reersed wen belongs to community c. Second, eac node as many attributes, suc as in-degree, out-degree, and oter attributes. Based on tese attributes, we can classify te nodes into clusters. Eac cluster can be regarded as a role tat nodes play. For example, some nodes may ae iger in-degree, and play te role of opinion leader [30], wile oters may ae iger out-degree, and tend to transfer messages across communities, playing te role of structural ole spanner [9]. Te attributes of eac role satisfy a specific distribution suc as a Gaussian distribution. Eac node as a distribution oer roles according to its attributes. Last, eac node may take some actions, suc as transferring a message, commenting on oter people s pictures, or following oters. Most nodes tend to take similar actions wit nodes in te same community; in oter words, weter a node takes a specific action partly depends on te community it belongs to. Moreoer, weter a node takes an action may also depend on te role it plays. For example, according to [9], 5% of information diffusion is controlled by 1% of nodes sering te role of structural ole spanners. Tus, wen we predict te action tat a node migt take, we must consider te distributions tat te node as oer bot communities and roles.. Formulation We use G = (V, E, X) to denote te structure of a social network, were V is te set of all users and E is an N N matrix, wit eac element e,u = 0 or 1 indicating weter user as a link to/wit user u. We use te cardinality V = N to denote te number of te users. Te set of edges tat associate wit is denoted as E. Notation X denotes an attributes matrix wit size N H, were H is te number of all attributes. Eac element x () X denotes is te -t attribute of user. Unlike te alue of e, x () continuous. Definition 1. Community. A social network consists of multiple communities, denoted as c = [1,,..., C]. Eac community as a multinomial distribution oer all pairs

Table 1: Notations in te model SYMBOL DESCRIPTION C number of communities R number of roles e,u te edge between and u x () te -t attribute of node y m () te m-t action of node z,i te community tat te i-t edge of node is assigned to d te role tat node is assigned to φ () multinomial distribution oer communities specific to node θ () multinomial distribution oer roles specific to node ζ (c) multinomial distribution oer edges/nodes specific to community c ρ τ,r multinomial distribution oer actions specific to community-role pair (τ, r) µ r, mean of -t attribute specific to role r σ r, standard deiation of -t attribute specific to role r (, u), denoted as ζ. For a directed grap, te edge e,u and e u, sare one item in te parameters i.e., te pair (, u), in community distributions. ζ,u (c) denotes te probability of e,u/e u, in community c, subject to,u ζ(c),u = 1. Note tat, since edges are denoted by nodes, we could easily transform te distribution of communities oer edges into distribution oer nodes, wic conforms to te usual definition of community and become easier to understand in some circumstances, suc as community detection. Definition. Node Distribution oer Communities. Eac node as a multinomial distribution oer communities, wic is denoted as φ. φ c () denotes te probability for to be located in c, and is subject to c φ() c = 1. Definition 3. Role. A node may play multiple different roles, denoted as r = [1,,..., R]. Eac role as a set of parameters for te distribution te attributes conform to. Here we use Gaussian distribution. If a node plays role r, its -t attribute conforms to N(µ r,, σr,). Definition 4. Nodes Distribution oer Roles. Eac node as a multinomial distribution oer roles, wic is denoted as θ. θ r () denotes te probability for to play role r, and is subject to r θ() r = 1. Definition 5. Action. Eac node can take some actions, suc as transferring a message or following oters. For different kinds of social networks, actions take different forms. Take te action of repost in a microblog network, for example. We use y () to denote a repost action of user. We set time t = 0 as te start point. During time period [0, T ], tere are M messages posted by te users tat follows. We use y m () = 0 or 1 (i = 1,,..., M) to denote weter reposts te m-t message during a reasonable time period [0, T ]. Definition 6. Community-Role Pair. Weter a node would take an action depends on te communities it and its λ ζ β Φ z e E N γ ρ y N M Figure : Te model target belong to and te role it plays. We use ρ to denote te distribution of community-role pairs oer actions. According to te aboe definition, action y m () only contains two cases, so we can use a Bernoulli distribution to model te distribution of community-role pairs oer actions. ρ τ,r m denotes te probability for y m = 1, were τ = 1(c c u). 1( ) is an indicator function. If c c u, 1(c c u) = 1, oterwise 0. It is noted tat te community in community-role pair represents weter te node and its target belong to te same community, so τ is binary, wit 0 meaning te same community and 1 meaning different communities..3 Model Description Our goal is to deise a probabilistic generatie model,, to represent a social network by capturing relationsips and interactions between all te elements of suc a network, including links, node attributes, communities, roles, actions, etc. To do tis, assumes tat a social network can be generated troug tree processes, wit eac process based on one of te tree isible elements in a social network edges, node attributes, and actions. An edge is defined to be an item from a set indexed by 1,,, N. (For an undirected grap, it is 1,,, N(N + 1)/.) We represent edges using unit-basis ectors, of wic only one component is 1 and all oter components are 0. Eac node is associated wit a sequence of seeral edges denoted by = (e,i, e j,), were i I in and j I out. I in is te set of tail endpoints adjacent to, wile I out is te set of ead endpoints adjacent to. For an undirected grap, eac edge of can be denoted wit te node wic is te edge s endpoint adjacent to. Eac node belongs to seeral communities. Tus we can regard a node as a random mixture oer communities. Te generatie process of all edges in a social network can be described as follows: For eac node in te grap: 1. Draw ζ from Diriclet(λ);. Draw a φ from Diriclet(β) prior; 3. For eac edge e,i: Draw a community z,i = c from multinomial distribution φ ; Draw an edge e,i from a multinomial ζ (c) specific to community c. α θ d x H N μ σ

Te time complexity of te aboe process is O(nonezeros(E)), were nonezeros(e) denotes te number of nonzero items in E. Te distribution of edges E is as: p(e β, λ) = p(ζ λ) p(φ β) E z,i p(z,i φ )p(e z,i, ζ)dφ dζ. Eac node plays seeral roles and is associated wit a sequence of seeral attributes, denoted by = (x ), were = [1,,..., H]. We define eac role as a distribution oer attributes and eac node is a random mixture oer roles. Te generatie process of all nodes in a social network can be described as follows: For eac node in te grap: 1. Draw a θ from Diriclet(α) prior;. Draw a role d = r from multinomial distribution θ ; 3. For eac attribute of, draw a alue x (r) G(µ r,, σr,). Te time complexity of te aboe process is O(NH). Te joint distribution of attributes X is defined as: p(x α, µ,σ) = p(θ α) p(d θ ) p(x () d, µ r,k, σ r,k )dθ. d Regarding actions, eac node is associated wit a sequence of seeral actions denoted by = (y m), were m = [1,,..., M]. Te generatie process of te actions can be described as follows: For eac action y m: 1. Draw ρ from Diriclet(γ) prior;. Draw a community c for from φ ; (1) () 3. Draw a community c u for u, wic post te message m, from φ u; 4. Draw a role r from θ ; 5. Draw y m Bernoulli(ρ τ,r ). Te time complexity of te aboe process is O(NM). Te joint distribution of actions Y is defined as: p(y γ, φ, θ) = p(ρ τ,r) p(r θ )p(τ φ )p(y m () ρ τ,r)dρ τ,r. τ r.4 Inference and Parameters Estimation It is intractable to directly sole te aboe distribution functions. We use Gibbs sampling to estimate φ and ζ. Te posterior probability of z,i is calculated by (3) p(z,i = c z, i, E) n(), i,c + β E + C β n (e), i,c + λ. (4) n (e), i, + E λ After Gibbs sampling, parameters φ and ζ can be estimated by: φ,c = ζ c,e = n,c + β E + C β, (5) nc,e + λ n c + E λ. (6) We use an EM algoritm to iteratiely maximize te joint likeliood of users attributes X and to estimate parameters θ and η. Te likeliood of X can be written as: L = θ,r e (x, µ r, ) σ r,. (7) πσr, d In te E-step, we estimate te -t item of θ gien te current parameters by: θ,r = (π) 1 σ 1 r, e d (π) 1 σ 1 r, e (x, µ r, ) σ r, (x, µ r, ) σ r,. (8) Ten in te M-step, we update parameters µ and σ by te following equations. (Detailed deriation of θ, µ, and σ is gien in Appendix.) θ,rx, µ r, =, (9) θ, r σ r, = θ,r(x, µ r, ). (10) θ,r Because φ and θ ae been estimated during te aboe processes, we only need to estimate ρ. Again, wit Gibbs sampling, we first calculate te posterior probability of te (a, d ) by te following equation: p(a = τ, d = r a, r, y) (φ φ T n, m,τ,r + γ )θ M + H γ. (11) After sampling, te parameter ρ can be estimated by: ρ = n,m,τ,r + γ M + H γ. (1) Model Applications. Te learned models can be used in arious applications suc as community discoery and beaior prediction. Essentially, parameters ζ represent (oerlapping) communities discoered by, wile parameters ρ can be used to predict users actions. 3. EXPERIMENTS Now we ealuate te effectieness of te proposed model on real-world datasets. We first use a real dataset to learn te parameters of. Ten we use te parameters

Coautor 10 7 10 6 Coautor Coautor 10 5 #Nodes #Pairs of Nodes 10 4 Eigen Value 10 10 10 10 Degree (a) Degree 0 4 6 8 14 Hops (b) Pairs of Nodes Rank (c) Eigenalues 10 Coautor 10 Coautor 10 4 Coautor Eigenector 10 1 10 10 3 Clustering Coefficient #Triangles 10 10 4 10 10 4 Rank 10 1 10 Degree 10 10 4 #Participating Nodes (d) Eigenector (e) Clustering Coefficient (f) Triangle Participation Ratio Figure 3: Metric alues of te Coautor network and te two networks generated by and. outperforms for eery metric to generate a syntetic social network tat, ideally, sould recoer te original appearance. After tat, we ealuate by te following tree tasks: Structure recoery. We compare te difference of structures between te generated syntetic network and te real network by means of six metrics: degree distribution, cluster coefficient, etc. Obiously, te more similar te features of te syntetic network and te real network, te better te model. Beaior prediction. can predict users actions by parameter ρ. We use four metrics, including precision, recall, F1-measure, and AUC, to ealuate te performance of in predicting actions quantitatiely. Community detection. can mine communities by parameter ζ. We use a case study to demonstrate its effectieness in detecting communities qualitatiely. 3.1 Dataset To ealuate, we use tree datasets. Te Coautor 1 dataset is collected from [43], consisting of 1,71,433 computer science autors and,09,356 papers publised by tose autors between 1975 and 01. For ealuation, we use a sub-network from [9], wic contains 1765 autors, 13,415 corresponding collaboration relationsips, and 7,33 papers publised at 8 computer science conferences. Tese conferences can be diided into six fields: Artificial Intelligence(AI); Database(DB); Data Mining(DM); Distributed Parallel Computing(DP); Grapics, Vision and HCI (GV); Networks, Communication and 1 ttps://aminer.org/billboard/aminernetwork Performance(NC). Te conference list for eac field can be found at [9]. We define an action of tis network as publising a paper in one of aboe researc fields. Tus, tere are six kinds of actions. Te dataset is from [5], wic contains information from 4,039 users and 88,34 links. Weibo 3 is a popular microblogging serice in Cina, wic reports aing more tan 5 undred million registered users. We use a sub-network from [49] wit 1,776,950 users, 308,489,739 following relationsips, 300,000 original messages and 3,755,810 repost actions. 4 All te messages were posted between Sep. 8t, 01 and Oct. 9t, 01. We classify all te original messages into ten topics, and define an action as posting or reposting a message in one topic, so te number of kinds of actions is ten. 3. Structure Recoery We use te model described by [16] as te baseline, wic seres as a state-of-te-art metod for modeling te structure of social networks. To demonstrate our model s superiority, we use te following network properties as our metrics to measure te difference of structure between te real network and te generated syntetic network. Part of te metrics are also used in [1] and [16], wic represent te properties of a network from arious aspects. Degree is te degree of nodes ersus te number of corresponding nodes. As we know, it conforms to a power-law distribution in a scale-free network. Pairs of Nodes is te cumulatie number of pairs of nodes tat can be reaced in ops. ttp://www.facebook.com 3 ttp://weibo.com 4 ttps://aminer.org/billboard/influencelocality

10 4 10 8 10 6 10 5 10 7 10 4 #Nodes 10 #Pairs of Nodes 10 6 Eigen Value 10 10 5 10 10 4 Degree 10 4 0 5 5 Hops 10 Rank (a) Degree (b) Pairs of Nodes (c) Eigenalues 10 10 10 5 10 4 Eigenector 10 1 10 Clustering Coefficient #Triangles 10 10 3 10 1 10 4 10 10 4 Rank 10 10 10 4 Degree 10 10 4 #Participating Nodes (d) Eigenector (e) Clustering Coefficient (f) Triangle Participation Ratio Figure 4: Metric alues of te network and te two networks generated by and. outperforms for eery metric Eigenalues are eigenalues of te adjacency matrix representing te gien network ersus teir rank. Eigenector is te components of te leading eigenector ersus te rank. Clustering coefficient [45] is te aerage local clustering coefficient of nodes ersus teir degree. Triangle Participation Ratio is te number of triangles tat a node is adjacent to ersus te number of nodes. We conduct tis experiment on Coautor and. For eac dataset, we compute tese alues separately for te tree networks: te real network, G; te generated network, G wit our model; and te generated network G wit te baseline model. Part of te code to compute te metric alues is from [6]. Ten we plot eac metric of te tree networks in one sub-figure in Figure 3 for Coautor and Figure 4 for. Due to te eay-tailed penomenon of te metrics, we plot tem in terms of cumulatie distribution functions. Take te degree distribution, for example, te corresponding number of nodes for degree x is te number of nodes wose degrees are larger tan x. From Figure 3 and Figure 4 we can see tat bot te networks generated by our model on te two datasets are more similar to te ground trut tan to te baseline in all te aboe metrics, wic signifies tat our model is better tan te baseline in modeling te structure of a social network. 3.3 Beaior Prediction can be also used to predict user beaior by parameter ρ. Gien a social network G and action istory A, we can build a training set {(x i, y i)} i=1,,,n, were x i is te attribute ector of a user and y i = a indicates tat te user takes action a. Regarding baselines, we use existing Table 3: Improement sown by oer SVM, SMO, LR, NB, RBF, and C4.5 in terms of precision, recall, F1-measure, and AUC Data Sets Precision Recall F1-measure AUC Coautor 0.37% 13.76% 7.04% 9.45% Weibo 36.% 40.14% 38.14% 3.08% classification algoritms, suc as Support Vector Macine (SVM), Sequential Minimal Optimization (SMO), Logistic Regression (LR), Naie Bayes (NB), Gaussian Radial Basis Function Neural Network (RBF), and C4.5. We use Precision, Recall, F1-measure, and Area Under Cure (AUC) to ealuate te performance of eac algoritm, and compare wit te proposed model. We conduct tis experiment on Coautor and Weibo. Table lists te results of all comparison metods on te two datasets, and Table 3 gies te aerage improement by compared wit te baseline metods. clearly outperforms oter metods on most metrics in Coautor and Weibo. Take F1-measure, for example, in te Coautor dataset, results in a 7.04% improement, and in Weibo, it aciees a 38.14% improement on aerage. Te improement differences may lie in tat weter a user posts or reposts a message as a stronger relation wit is/er communities and friends, wile weter a researcer publises a paper in a specific area mostly depends on is/er attributes, aing little to do wit te influence of is/er communities and friends. aciees muc better performance in Weibo since it takes bot communities and personal attributes into consideration, wile oter metods only take indiidual attributes into consideration. On te oter and, for researcers in Coautor, s superiority is less significant, because taking communities into consideration

Table : Aerage prediction performance of different metods on te Coautor and Weibo datasets. numbers enclosed in brackets are standard deiations. Date set Metod Precision Recall F1-measure AUC SVM 0.8838(0.175) 0.556(0.3183) 0.687(0.054) 0.7360(0.1111) SMO 0.8647(0.118) 0.814(0.160) 0.8387(0.1138) 0.918(0.0366) LR 0.8668(0.14) 0.89(0.10) 0.8476(0.1016) 0.964(0.0196) Coautor NB 0.8183(0.1830) 0.8115(0.1444) 0.8149(0.1549) 0.9417(0.0335) RBF 0.855(0.1058) 0.8353(0.1165) 0.8451(0.1081) 0.9477(0.071) C4.5 0.838(0.0518) 0.8015(0.186) 0.8169(0.1478) 0.9065(0.1165) 0.856(0.1490) 0.8630(0.0598) 0.8596(0.1013) 0.9800(0.0199) Weibo SVM SMO LR NB RBF C4.5 0.5067(0.1405) 0.5074(0.1464) 0.5199(0.1306) 0.511(0.145) 0.55(0.1361) 0.537(0.1367) 0.7017(0.1300) 0.507(0.1185) 0.509(0.1099) 0.5469(0.1073) 0.569(0.1083) 0.4679(0.1117) 0.53(0.1114) 0.7305(0.1079) 0.5047(0.1150) 0.5141(0.171) 0.5331(0.1157) 0.5386(0.117) 0.4937(0.117) 0.579(0.111) 0.7158(0.1149) 0.6068(0.1113) 0.6145(0.0363) 0.6330(0.0377) 0.6397(0.0394) 0.5945(0.0085) 0.671(0.1083) 0.8174(0.033) Te Log Likeliood of Edges 3.8 x 105 4 4. 4.4 4.6 4.8 5 0 50 100 150 00 50 300 #Communites (a) Sum of log-likeliood of edges canges wit C Log Likeliood of Actions 7.7 x 104 7.8 7.9 8 8.1 8. 8.3 8.4 8.5 0 50 100 150 00 50 300 #Communites (b) Sum of log-likeliood of actions canges wit C Figure 5: Te sums of log-likeliood of edges and actions cange wit C. is less elpful in predicting weter a researcer publises a paper in an area. 3.4 Case Study can be also used to detect communities wit parameter ζ. For a new social network dataset, we must decide te number of communities C before detecting communities wit. To find te best C for te Coautor dataset, we fix R = 6 5 and set C =6, 0, 50, 100, 150, 00, 50, 300 se- 5 We conducted an experiment erifying tat te loglikeliood of actions in on te Coautor dataset is not ery sensitie to te alue of R, and R = 6 is sligtly better tan oter alues. quentially, and compute te sums of log-likeliood for edges and actions wit Eq.(13) and Eq.(14) separately sering as a posteriori measure to ealuate parameter C, and tus obtain Figure (5). Te larger te sum of log-likeliood, te better te parameter. From Figure 5 we can easily beliee tat C = 150 may be te best coice for parameter C. E L(edges) = ln p(e i), (13) i=1 Y L(actions) = ln p(y i). (14) i=1 Troug te training of te model, we obtain te community distribution oer nodes. We fix C = 150 and select tree communities. Table 5 lists te representatie fie researcers wit te igest probabilities in eac community. 4. RELATED WORK Tere are tree types of researc related to tis work: network structure modeling, beaior prediction, and community detection. Network structure modeling. Network structure modeling as a long istory and as become a ot topic, attracting more and more interest from computer-science researcers. Tere is great interest in uncoering underlying principles wit wic networks comply. Early in 1960, [11] proposed a model tat uses a real number p (0, 1) to predict weter two nodes ae a link between tem, were p is determined by te scale of te network. [5] proposes a generatie model, in wic a grap is generated by adding nodes into an existing grap, and te probabilities of new nodes aing links wit existing nodes depend on te degree of existing nodes at tat time. [8] proposes a model tat constructs a sequence of nodes by some alues. In tat model, te probability tat two nodes ae a link is proportional to te product of te alues of te two nodes. [19] proposes a model in wic, wen adding a node to an existing grap, selecting an existing node randomly and adding links wit its neigbors wit certain probabilities yields a model. [4] adopts a mecanism tat not only adds but also deletes nodes wen generating a network. [15] introduces

Table 4: Representatie researcers in tree different communities Comm. Name Affiliation 1 3 Jiawei Han Jian Pei Pilip S. Yu Hong Ceng Wei Wang Tomas S. Huang Yun Raymond Fu Suiceng Yan Mark A. Hasegawa-Jonson Xiaoou Tang Pilip A. Bernstein Natan Andrew Goodman Daid Dewitt Erard Ram Micael Stonebraker UIUC SFU UIC CUHK UNC UIUC UB NUS UIUC CUHK Microsoft UA UW-Madison U. of Leipzig MIT latent space to model a social network, and [37] extends tis concept into dynamic networks. Howeer, all te aboe works ignore an important concept: community. Beaior prediction. Tis work was first conducted by economists in te 1890s. Recently, many computer scientists ae been working on tis topic in te social network context. [7] conducts a famous experiment indicating tat one s oting coices are susceptible to tose of is/er friends. [4] predicts user beaiors wit influence from is/er friends and communities. Howeer, neiter takes personal beaior patterns into consideration. [47] analyzes retweeting beaior in Twitter, and proposes a factor grap model to predict retweeting beaior. [51] leerages knowledge of user beaior in different networks to alleiate te data sparsity problem and enance te predictie performance of user modeling. [3] analyzes click stream data and reeals key features of social network workloads, suc as ow frequently people connect to social networks and for ow long, as well as te types and sequences of actiities tat users conduct on social networks. [46] studies te reposting actions of users in social networks. Tis paper classifies users into tree roles opinion leader, structural ole spanner, and ordinary user, wic ae different beaior patterns. Weter a user reposts a message greatly depends on te role it plays. Community detection. Since te concept of community was raised formally, arious metods to detect community ae been proposed. Due to space limitations, we do not list tem ere. Wit te proliferation of community detection metods, ealuating tem as become a ot topic. Modularity [34] is a kind of measure to ealuate community detection algoritms troug comparing edge densities. [4] regards community quality as a function of its size, and offers a more-refined lens to examine community detection metods. 5. CONCLUSION In tis paper, we study ow to model a social network, capturing all its information, suc as links, communities, user roles, user attributes, and user actions. From te relationsips between tese objects, we deise a probabilistic generatie framework, te Community Role Model, to define a social network model. We apply to real-world datasets, and obtain better performance tan tat of a stateof-te-art baseline metod. can also be used to address arious practical problems witout any cange to te model itself, sowing its superiority. Understanding te nature of social networks is ery important for modeling tem, and for addressing a series of problems attaced to tem. As for future work, it would be intriguing to mine more factors tat affect network structure and user beaiors so as to simulate a dynamic social network. It is also interesting to integrate nonparametric metods into our model to base parameter alue coices on te data itself. Acknowledgements. We tank Myungwan Kim for saring te code. Te work is supported by te National Hig-tec R&D Program (No. 014AA015103), National Basic Researc Program of Cina (No. 014CB340506, No. 01CB316006), NSFC (No. 611), NSFC-ANR (No. 6161130588), National Social Science Foundation of Cina (No.13&ZD190), te Tsingua Uniersity Initiatie Scientific Researc Program (011088096), a researc fund supported by Huawei Inc., and Beijing key lab of networked multimedia. 6. REFERENCES [1] Y.-Y. An, S. Han, H. Kwak, S. Moon, and H. Jeong. Analysis of topological caracteristics of uge online social networking serices. In WWW 07, pages 835 844, 007. [] A. Anagnostopoulos, R. Kumar, and M. Madian. Influence and correlation in social networks. In KDD 08, pages 7 15, 008. [3] F. Beneenuto, T. Rodrigues, M. Ca, and V. Almeida. Caracterizing user beaior in online social networks. In SIGCOMM 09, pages 49 6, 009. [4] B. Bollobás and O. Riordan. Robustness and ulnerability of scale-free random graps. Internet Matematics, 1(1):1 35, 004. [5] B. Bollobás, O. Riordan, J. Spencer, G. Tusnády, et al. Te degree sequence of a scale-free random grap process. Random Structures & Algoritms, 18(3):79 90, 001. [6] A. Bonato, J. Janssen, and P. Pra lat. A geometric model for on-line social networks. In International Worksop on Modeling Social Media, page 4, 010. [7] R. M. Bond, C. J. Fariss, J. J. Jones, A. D. Kramer, C. Marlow, J. E. Settle, and J. H. Fowler. A 61-million-person experiment in social influence and political mobilization. Nature, 489(7415):95 98, 01. [8] F. Cung and L. Lu. Connected components in random graps wit gien expected degree sequences. Annals of combinatorics, 6():15 145, 00. [9] D. Crandall, D. Cosley, D. Huttenlocer, J. Kleinberg, and S. Suri. Feedback effects between similarity and social influence in online communities. In KDD 08, pages 160 168, 008. [10] D. Easley and J. Kleinberg. Networks, crowds, and markets: Reasoning about a igly connected world. Cambridge Uniersity Press, 010. [11] P. Erd6s and A. Rényi. On te eolution of random graps. Publ. Mat. Inst. Hungar. Acad. Sci, 5:17 61, 1960. [1] S. Fortunato. Community detection in graps. Pysics Reports, 486(3):75 174, 010.

[13] M. Giran and M. E. Newman. Community structure in social and biological networks. PNAS, 99(1):781 786, 00. [14] M. Gomez Rodriguez, J. Leskoec, and A. Krause. Inferring networks of diffusion and influence. In KDD 10, pages 1019 108, 010. [15] P. D. Hoff, A. E. Raftery, and M. S. Handcock. Latent space approaces to social network analysis. JASA, 97(460):1090 1098, 00. [16] M. Kim and J. Leskoec. Modeling social networks wit node attributes using te multiplicatie attribute grap model. arxi preprint arxi:1106.5053, 011. [17] M. Kim and J. Leskoec. Te network completion problem: Inferring missing nodes and edges in networks. In SDM, pages 47 58, 011. [18] R. Kumar, J. Noak, and A. Tomkins. Structure and eolution of online social networks. In Link mining: models, algoritms, and applications, pages 337 357. 010. [19] R. Kumar, P. Ragaan, S. Rajagopalan, D. Siakumar, A. Tomkins, and E. Upfal. Stocastic models for te web grap. In FOCS 00, pages 57 65, 000. [0] H. Kwak, C. Lee, H. Park, and S. Moon. Wat is twitter, a social network or a news media? In WWW 10, pages 591 600, 010. [1] J. Leskoec, D. Cakrabarti, J. Kleinberg, C. Faloutsos, and Z. Garamani. Kronecker graps: An approac to modeling networks. JMLR, 11:985 104, 010. [] J. Leskoec and C. Faloutsos. Scalable modeling of real graps using kronecker multiplication. In ICML 07, pages 497 504, 007. [3] J. Leskoec, D. Huttenlocer, and J. Kleinberg. Predicting positie and negatie links in online social networks. In WWW 10, pages 641 650, 010. [4] J. Leskoec, K. J. Lang, and M. Maoney. Empirical comparison of algoritms for network community detection. In WWW 10, pages 631 640, 010. [5] J. Leskoec and J. J. Mcauley. Learning to discoer social circles in ego networks. In NIPS 1, pages 539 547, 01. [6] J. Leskoec and R. Sosič. Snap.py: SNAP for Pyton, a general purpose network analysis and grap mining tool in Pyton. ttp://snap.stanford.edu/snappy, June 014. [7] D. Liben-Nowell and J. Kleinberg. Te link-prediction problem for social networks. JASIST, 58(7):1019 1031, 007. [8] L. Liu, J. Tang, J. Han, M. Jiang, and S. Yang. Mining topic-leel influence in eterogeneous networks. In CIKM 10, pages 199 08, 010. [9] T. Lou and J. Tang. Mining structural ole spanners troug information diffusion in social networks. In WWW 13, pages 85 836, 013. [30] R. K. Merton. Social teory and social structure. Simon and Scuster, 1968. [31] A. Misloe, H. S. Koppula, K. P. Gummadi, P. Druscel, and B. Battacarjee. Growt of te flickr social network. In Te First worksop on Online social networks, pages 5 30, 008. [3] A. Misloe, M. Marcon, K. P. Gummadi, P. Druscel, and B. Battacarjee. Measurement and analysis of online social networks. In SIGCOMM 07, pages 9 4, 007. [33] M. Newman. Networks: an introduction. Oxford Uniersity Press, 010. [34] M. E. Newman and M. Giran. Finding and ealuating community structure in networks. Pysical reiew E, 69():06113, 004. [35] G. Palla, L. Loász, and T. Vicsek. Multifractal network generator. PNAS, 107(17):7640 7645, 010. [36] G. Robins, P. Pattison, Y. Kalis, and D. Luser. An introduction to exponential random grap (p*) models for social networks. Social networks, 9():173 191, 007. [37] P. Sarkar and A. W. Moore. Dynamic social network analysis using latent space models. ACM SIGKDD Explorations Newsletter, 7():31 40, 005. [38] J. Scott. Social network analysis. Sage, 01. [39] C. Tan, J. Tang, J. Sun, Q. Lin, and F. Wang. Social action tracking ia noise tolerant time-arying factor graps. In KDD 10, pages 1049 1058, 010. [40] J. Tang, T. Lou, and J. Kleinberg. Inferring social ties across eterogeneous networks. In WSDM 1, pages 743 75, 01. [41] J. Tang, J. Sun, C. Wang, and Z. Yang. Social influence analysis in large-scale networks. In KDD 09, pages 807 816, 009. [4] J. Tang, S. Wu, and J. Sun. Confluence: Conformity influence in large social networks. In KDD 13, pages 347 355, 013. [43] J. Tang, J. Zang, L. Yao, J. Li, L. Zang, and Z. Su. Arnetminer: extraction and mining of academic social networks. In KDD 08, pages 990 998, 008. [44] J. Ugander, B. Karrer, L. Backstrom, and C. Marlow. Te anatomy of te facebook social grap. arxi preprint arxi:1111.4503, 011. [45] D. J. Watts and S. H. Strogatz. Collectie dynamics of aősmall-world aŕnetworks. Nature, 393(6684):440 44, 1998. [46] Y. Yang, J. Tang, C. W.-k. Leung, Y. Sun, Q. Cen, J. Li, and Q. Yang. Rain: Social role-aware information diffusion. 015. [47] Z. Yang, J. Guo, K. Cai, J. Tang, J. Li, L. Zang, and Z. Su. Understanding retweeting beaiors in social networks. In CIKM 10, pages 1633 1636, 010. [48] S. J. Young and E. R. Sceinerman. Random dot product grap models for social networks. In Algoritms and models for te web-grap, pages 138 149. 007. [49] J. Zang, B. Liu, J. Tang, T. Cen, and J. Li. Social influence locality for modeling retweeting beaiors. In IJCAI 13, pages 761 767, 013. [50] J. Zang, J. Tang, H. Zuang, C. W.-K. Leung, and J. Li. Role-aware conformity influence modeling and analysis in social networks. In AAAI 14, pages 958 965, 014. [51] E. Zong, W. Fan, J. Wang, L. Xiao, and Y. Li. Comsoc: adaptie transfer of user beaiors oer composite social network. In KDD 1, pages 696 704, 01.

APPENDIX A. ESTIMATING θ, µ AND σ From Eq.(7), we get te log-likeliood of attributes of all nodes, as in Eq.(15). L = = ln p(x ; θ, µ, σ) ln d () R =1 p(x d () ; µ, σ)p(d() ; θ) = R ln Q(d () = r) p(x d () ; µ, σ)p(d() ; θ) r=1 Q(d () = r) R Q(d () = r) ln p(x d () ; µ, σ)p(d() ; θ) r=1 Q(d () = r) = R wr ln p(x d () ; µ, σ)p(d() ; θ) w r = r=1 R wr ln r=1 θ,r πσr, e (x, µ r, ) σ r, w r. (15) According to te Jensen Inequation, te inequality sign ;µ,σ)p(d();θ) can be remoed iff p(x d() = c, were c is a wr constant and w is a distribution d oer r, so r w r = 1. We can set wr = p(x, d () ; θ, µ, σ) r p(x, d() ; θ, µ, σ) = (π) 1 σ 1 r, e (x, µ r, ) σ r, (x, µ r, ) r (π) 1 σ 1 r, e σ r,. (16) First we assume tat we ae te alues of µ and σ, ten we maximize te lower bound of L by updating θ. Since r θ,r = 1, we get Eq.(17) troug te Lagrange Multiplier. L θ = R wr ln r=1 θ,r πσr, e (x, µ r, ) σ r, w r ɛ( r θ,r 1). (17) We compute te deriatie of Eq.(17) wit regard to θ,r, and obtain Eq.(18). L θ θ,r = w r θ,r + ɛ. (18) We set Eq.(18) to 0, and get w r θ,r = ɛ = constant. Because r θ,r = r w r = 1, we get H Eq.(19). θ,r = w r. (19) Ten we maximize te lower bound of L by computing its deriatie wit regard to µ. R µr, wr ln = µr, = w r r=1 w r x () σr, µ r,. θ,r πσr, e (x, µ r, ) σ r, (x () µ r,) σ r, We set Eq.(0) to 0, and get Eq.(1). µ r, = w r (0) w r x (). (1) w r Next we maximize te lower bound of L by computing its deriatie wit regard to σ. R σr, wr ln r=1 θ,r πσr, e (x, µ r, ) σ r, w r = σr, wr (ln σ r, (x() µ r,) ) = σ r, wr ( (x() µ r,) σ σ r,). 1 r, () We set Eq.() to 0, and get Eq.(3). σ r, = w r (x () µ r,). (3) w r Combining Eq.(16), Eq.(19), Eq.(1) and Eq.(3), we get Eq.(8), Eq.(9) and Eq.(10).